Re: VDPA Debug/Statistics

2020-08-11 Thread Jason Wang


On 2020/8/11 7:58 PM, Eli Cohen wrote:

On Tue, Aug 11, 2020 at 11:26:20AM +, Eli Cohen wrote:

Hi All

Currently, the only statistics we get for a VDPA instance come from the 
virtio_net device instance. Since VDPA involves hardware acceleration, there 
can be quite a lot of information that can be fetched from the underlying 
device. Currently there is no generic method to fetch this information.

One way of doing this would be to create, on the host, a net device for
each VDPA instance, and use it to get this information or do some
configuration. Ethtool could be used in such a case



The problems are:

- vDPA is not net specific
- vDPA should be transparent to host networking stack




I would like to hear what you think about this or maybe you have some other 
ideas to address this topic.

Thanks,
Eli

Something I'm not sure I understand is how vdpa instances are created on 
Mellanox cards. There's a devlink command for that, is that right?
Can that be extended for stats?

Currently any VF will be probed as a VDPA device. We're adding devlink support, 
but I am not sure if devlink is suitable for displaying statistics. We will 
discuss internally, but I wanted to know what you guys think.



I agree with Michael: if it's possible, integrating stats with devlink 
would be best. Having another interface which is just for stats 
does not look good.


Thanks




--
MST



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: vdpa: handling of VIRTIO_F_ACCESS_PLATFORM/VIRTIO_F_ORDER_PLATFORM

2020-08-11 Thread Jason Wang


On 2020/8/11 5:52 PM, Michael S. Tsirkin wrote:

Hi!
I'd like to raise the question of whether we can drop the requirement
of VIRTIO_F_ACCESS_PLATFORM from vdpa?
As far as I can see, it is merely required for virtio vdpa -
so should we not enforce it there?



If we don't enforce it, virtio will use PA, which breaks the setup when 
the IOMMU is enabled. As discussed in the past, mandating the DMA API for 
virtio can solve this issue.





The point is support for legacy guests - which mostly just works
on x86.



Legacy guests should work even if we mandate ACCESS_PLATFORM.

This is because we don't simply pass through guest features (qemu will 
always set ACCESS_PLATFORM to vhost-vdpa).





Also, what is the plan for VIRTIO_F_ORDER_PLATFORM?



I think we should mandate ORDER_PLATFORM (even for the guest).

Thanks






Re: [PATCH 1/4] vdpa: introduce config op to get valid iova range

2020-08-11 Thread Jason Wang


On 2020/8/11 4:29 PM, Michael S. Tsirkin wrote:

On Tue, Aug 11, 2020 at 10:53:09AM +0800, Jason Wang wrote:

On 2020/8/10 8:05 PM, Michael S. Tsirkin wrote:

On Thu, Aug 06, 2020 at 03:43:54PM +0300, Eli Cohen wrote:

On Thu, Aug 06, 2020 at 08:29:22AM -0400, Michael S. Tsirkin wrote:

On Thu, Aug 06, 2020 at 03:03:55PM +0300, Eli Cohen wrote:

On Wed, Aug 05, 2020 at 08:51:56AM -0400, Michael S. Tsirkin wrote:

On Wed, Jun 17, 2020 at 11:29:44AM +0800, Jason Wang wrote:

This patch introduces a config op to get the valid iova range from the vDPA
device.

Signed-off-by: Jason Wang
---
   include/linux/vdpa.h | 14 ++
   1 file changed, 14 insertions(+)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 239db794357c..b7633ed2500c 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -41,6 +41,16 @@ struct vdpa_device {
unsigned int index;
   };
+/**
+ * vDPA IOVA range - the IOVA range supported by the device
+ * @start: start of the IOVA range
+ * @end: end of the IOVA range
+ */
+struct vdpa_iova_range {
+   u64 start;
+   u64 end;
+};
+

This is ambiguous. Is end in the range or just behind it?
How about first/last?

It is customary in the kernel to use start-end where end corresponds to
the byte following the last one in the range. See the vm_start and vm_end
fields of struct vm_area_struct.

Exactly my point:

include/linux/mm_types.h:   unsigned long vm_end;   /* The first byte after our end address */

in this case Jason wants it to be the last byte, not one behind.



Maybe start, size? Not ambiguous, and you don't need to do annoying
calculations like size = last - start + 1

Size has a bunch of issues: it can overflow, and it cannot cover the entire
64-bit range. The requisite checks are arguably easier to get wrong than
computing the size when you need it.

Yes. So do you still prefer first/last, or just begin/end, which is consistent
with iommu_domain_geometry?

Thanks

I prefer first/last, I think; these are unambiguous.
E.g.

 dma_addr_t aperture_start; /* First address that can be mapped*/
 dma_addr_t aperture_end;   /* Last address that can be mapped */

instead of addressing the ambiguity with a comment, let's just name the fields well.



Ok, will do.

Thanks










Re: [GIT PULL] virtio: features, fixes

2020-08-11 Thread pr-tracker-bot
The pull request you sent on Tue, 11 Aug 2020 04:56:13 -0400:

> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/57b077939287835b9396a1c3b40d35609cf2fcb8

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers

2020-08-11 Thread peterz
On Tue, Aug 11, 2020 at 11:46:51AM +0200, pet...@infradead.org wrote:

> So let me once again see if I can't find a better solution for this all.
> Clearly it needs one :/

So the below boots without triggering the debug code from Marco -- it
should allow nesting local_irq_save/restore under raw_local_irq_*().

I tried unconditional counting, but there's some _really_ wonky /
asymmetric code that wrecks that, and I've not been able to come up with
anything useful.

This one starts counting when local_irq_save() finds it didn't disable
IRQs while lockdep thought it did. At that point, local_irq_restore()
will decrement and re-enable things again when the count reaches 0.

This assumes local_irq_save()/local_irq_restore() are sanely nested, which
is mostly true.

This leaves #PF, which I fixed in these other patches, but I realized it
needs fixing for all architectures :-( No bright ideas there yet.

---
 arch/x86/entry/thunk_32.S   |  5 
 include/linux/irqflags.h| 45 +++-
 init/main.c | 16 
 kernel/locking/lockdep.c| 58 +
 kernel/trace/trace_preemptirq.c | 33 +++
 5 files changed, 134 insertions(+), 23 deletions(-)

diff --git a/arch/x86/entry/thunk_32.S b/arch/x86/entry/thunk_32.S
index 3a07ce3ec70b..f1f96d4d8cd6 100644
--- a/arch/x86/entry/thunk_32.S
+++ b/arch/x86/entry/thunk_32.S
@@ -29,11 +29,6 @@ SYM_CODE_START_NOALIGN(\name)
 SYM_CODE_END(\name)
.endm
 
-#ifdef CONFIG_TRACE_IRQFLAGS
-   THUNK trace_hardirqs_on_thunk,trace_hardirqs_on_caller,1
-   THUNK trace_hardirqs_off_thunk,trace_hardirqs_off_caller,1
-#endif
-
 #ifdef CONFIG_PREEMPTION
THUNK preempt_schedule_thunk, preempt_schedule
THUNK preempt_schedule_notrace_thunk, preempt_schedule_notrace
diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index bd5c55755447..67e2a16d3846 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -23,12 +23,16 @@
   extern void lockdep_hardirqs_on_prepare(unsigned long ip);
   extern void lockdep_hardirqs_on(unsigned long ip);
   extern void lockdep_hardirqs_off(unsigned long ip);
+  extern void lockdep_hardirqs_save(unsigned long ip, unsigned long flags);
+  extern void lockdep_hardirqs_restore(unsigned long ip, unsigned long flags);
 #else
   static inline void lockdep_softirqs_on(unsigned long ip) { }
   static inline void lockdep_softirqs_off(unsigned long ip) { }
   static inline void lockdep_hardirqs_on_prepare(unsigned long ip) { }
   static inline void lockdep_hardirqs_on(unsigned long ip) { }
   static inline void lockdep_hardirqs_off(unsigned long ip) { }
+  static inline void lockdep_hardirqs_save(unsigned long ip, unsigned long 
flags) { }
+  static inline void lockdep_hardirqs_restore(unsigned long ip, unsigned long 
flags) { }
 #endif
 
 #ifdef CONFIG_TRACE_IRQFLAGS
@@ -49,10 +53,13 @@ struct irqtrace_events {
 DECLARE_PER_CPU(int, hardirqs_enabled);
 DECLARE_PER_CPU(int, hardirq_context);
 
-  extern void trace_hardirqs_on_prepare(void);
-  extern void trace_hardirqs_off_finish(void);
-  extern void trace_hardirqs_on(void);
-  extern void trace_hardirqs_off(void);
+extern void trace_hardirqs_on_prepare(void);
+extern void trace_hardirqs_off_finish(void);
+extern void trace_hardirqs_on(void);
+extern void trace_hardirqs_off(void);
+extern void trace_hardirqs_save(unsigned long flags);
+extern void trace_hardirqs_restore(unsigned long flags);
+
 # define lockdep_hardirq_context() (this_cpu_read(hardirq_context))
 # define lockdep_softirq_context(p)((p)->softirq_context)
 # define lockdep_hardirqs_enabled()(this_cpu_read(hardirqs_enabled))
@@ -120,17 +127,19 @@ do {  \
 #else
 # define trace_hardirqs_on_prepare()   do { } while (0)
 # define trace_hardirqs_off_finish()   do { } while (0)
-# define trace_hardirqs_on()   do { } while (0)
-# define trace_hardirqs_off()  do { } while (0)
-# define lockdep_hardirq_context() 0
-# define lockdep_softirq_context(p)0
-# define lockdep_hardirqs_enabled()0
-# define lockdep_softirqs_enabled(p)   0
-# define lockdep_hardirq_enter()   do { } while (0)
-# define lockdep_hardirq_threaded()do { } while (0)
-# define lockdep_hardirq_exit()do { } while (0)
-# define lockdep_softirq_enter()   do { } while (0)
-# define lockdep_softirq_exit()do { } while (0)
+# define trace_hardirqs_on()   do { } while (0)
+# define trace_hardirqs_off()  do { } while (0)
+# define trace_hardirqs_save(flags)do { } while (0)
+# define trace_hardirqs_restore(flags) do { } while (0)
+# define lockdep_hardirq_context() 0
+# define lockdep_softirq_context(p)0
+# define lockdep_hardirqs_enabled()0
+# define lockdep_softirqs_enabled(p)   0
+# define lockdep_hardirq_ente

clarifying the handling of responses for virtio-rpmb

2020-08-11 Thread Alex Bennée

Hi,

The specification lists a number of commands that have responses:

  The operation of a virtio RPMB device is driven by the requests placed
  on the virtqueue. The type of request can be program key
  (VIRTIO_RPMB_REQ_PROGRAM_KEY), get write counter
  (VIRTIO_RPMB_REQ_GET_WRITE_COUNTER), write
  (VIRTIO_RPMB_REQ_DATA_WRITE), and read (VIRTIO_RPMB_REQ_DATA_READ). A
  program key or write request can also combine with a result read
  (VIRTIO_RPMB_REQ_RESULT_READ) for a returned result.

Now that I'm deep in the guts of virtqueues doing the implementation, I'm
trying to clarify exactly what frames should be sent to the device and
whether they should be out_sgs or in_sgs. I suspect there is some difference
between the original prototype interface and what we have now.

Some operations obviously have an implied response
(VIRTIO_RPMB_REQ_GET_WRITE_COUNTER and VIRTIO_RPMB_REQ_DATA_READ). As
far as I could tell the frame should be simple:

  sg[0] (out_sg=1) - frame with command
  sg[1..n] (in_sg=1..n) - space for the reply to be filled in by the device

However, the language for the program key and data write says they can be
combined with a VIRTIO_RPMB_REQ_RESULT_READ to optionally return a
result. My question is: is this result read meant to be in a separate
request frame and response frame, so you get:

 sg[0] - VIRTIO_RPMB_REQ_PROGRAM_KEY/VIRTIO_RPMB_REQ_DATA_WRITE frame
 sg[1] - VIRTIO_RPMB_REQ_RESULT_READ (out_sg=2)
 sg[2] - empty frame for response (in_sg=1)

or:

 sg[0] - VIRTIO_RPMB_REQ_PROGRAM_KEY/VIRTIO_RPMB_REQ_DATA_WRITE frame (out_sg=1)
 sg[1] - VIRTIO_RPMB_REQ_RESULT_READ (in_sg=1)

where the result frame is filled in and sent back?

I must say I'm a little confused by the logic in rpmb_ioctl (in the
userspace tool) which creates both out_frames and resp frames:

  static int rpmb_ioctl(uint8_t frame_type, int fd, uint16_t req,
const void *frames_in, unsigned int cnt_in,
void *frames_out, unsigned int cnt_out)
  {
  int ret;
  struct __packed {
  struct rpmb_ioc_seq_cmd h;
  struct rpmb_ioc_cmd cmd[3];
  } iseq = {};

  void *frame_res = NULL;
  int i;
  uint32_t flags;

  rpmb_dbg("RPMB OP: %s\n", rpmb_op_str(req));
  dbg_dump_frame(frame_type, "In Frame: ", frames_in, cnt_in);

  i = 0;
  flags = RPMB_F_WRITE;
  if (req == RPMB_WRITE_DATA || req == RPMB_PROGRAM_KEY)
  flags |= RPMB_F_REL_WRITE;
  rpmb_ioc_cmd_set(iseq.cmd[i], flags, frames_in, cnt_in);
  i++;

  if (req == RPMB_WRITE_DATA || req == RPMB_PROGRAM_KEY) {
  frame_res = rpmb_frame_alloc(frame_type, 0);
  if (!frame_res)
  return -ENOMEM;
  rpmb_frame_set(frame_type, frame_res,
 RPMB_RESULT_READ, 0, 0, 0);
  rpmb_ioc_cmd_set(iseq.cmd[i], RPMB_F_WRITE, frame_res, 0);
  i++;
  }

  rpmb_ioc_cmd_set(iseq.cmd[i], 0, frames_out, cnt_out);
  i++;

  iseq.h.num_of_cmds = i;
  ret = ioctl(fd, RPMB_IOC_SEQ_CMD, &iseq);
  if (ret < 0)
  rpmb_err("ioctl failure %d: %s.\n", ret, strerror(errno));

  ret = rpmb_check_req_resp(frame_type, req, frames_out);

  dbg_dump_frame(frame_type, "Res Frame: ", frame_res, 1);
  dbg_dump_frame(frame_type, "Out Frame: ", frames_out, cnt_out);
  free(frame_res);
  return ret;
  }

although I'm guessing this might just be an impedance mismatch between the
chardev ioctl interface for rpmb and the virtio FE driver, which is only
one potential consumer of these rpmb ioc commands? 

Can anyone clarify?

-- 
Alex Bennée

Re: [PATCH v2] virtio-rng: return available data with O_NONBLOCK

2020-08-11 Thread Laurent Vivier
On 11/08/2020 16:49, Michael S. Tsirkin wrote:
> On Tue, Aug 11, 2020 at 03:53:54PM +0200, Laurent Vivier wrote:
>> On 11/08/2020 15:14, Michael S. Tsirkin wrote:
>>> On Tue, Aug 11, 2020 at 03:00:14PM +0200, Laurent Vivier wrote:
 No problem. This code is tricky and it took me several months to really
 start to understand it ...
>>>
>>> Oh great, we actually have someone who understands the code!
>>> Maybe you can help me understand: virtio_read
>>> takes the buf pointer and puts it in the vq.
>>> It can then return to caller (e.g. on a signal).
>>> Device can meanwhile write into the buffer.
>>>
>>> It looks like if another call then happens, and that
>>> other call uses a different buffer, virtio rng
>>> will happily return the data written into the
>>> original buf pointer, confusing the caller.
>>>
>>> Is that right?
>>>
>>
>> Yes.
>>
>> hw_random core uses two buffers:
>>
>> - rng_fillbuf, which is used with blocking access and protected by the
>> reading_mutex. I think this cannot be interrupted by a kill because it's
>> in hwrng_fillfn() and that's a kthread.
>>
>> - rng_buffer, which is used in rng_dev_read() and can be interrupted (it
>> is also protected by reading_mutex).
>>
>> But if rng_dev_read() is called with O_NONBLOCK or interrupted, and then
>> rng_fillbuf starts, they can be mixed.
>>
>> We also have the first use of rng_buffer in add_early_randomness(), which
>> uses a different size than rng_dev_read() with the same buffer (and
>> this size is 16, whereas the hwrng read API says it must be at least 32...).
>>
>> The problem here is that the core has been developed with synchronicity in
>> mind, whereas virtio is asynchronous by definition.
>>
>> I think we should add some internal buffers to the virtio-rng backend. This
>> would improve performance (we are at 1 MB/s; I sent a patch to improve
>> that, but it doesn't fix the problems above) and would allow the hw_random
>> core to use memory that doesn't need to be compatible with virt_to_page().
>>
>> Thanks,
>> Laurent
> 
> OK, so just add a bunch of 32-byte buffers and pass them to hardware,
> and as the data gets consumed pass them to hardware again?
> 
> 

For virtio-rng performance we must ask for the biggest block we can (the
size given in rng_dev_read() would be great).

But the problem here is not wasting entropy. We should avoid asking for
entropy we don't need, so we can't really enqueue a buffer before knowing
the size. And if there is not enough entropy to fill the buffer, but enough
for our use, we could be blocked waiting for entropy we don't need.

And the change must be done at the virtio-rng level, not in the core,
because it's useless for other backends.

Moreover, the buffer in the core will be used with another hw_random backend
if the user changes the backend while the buffer is in use by virtio-rng.
So we really need to copy between the virtio-rng buffer and the core buffer.

I've also proposed a change to the virtio entropy device spec to add a
command queue and a command to flush the enqueued buffers. The purpose
was to be able to remove a blocked device, but it can also be useful in
this case, to remove the buffer of an interrupted read.

Thanks,
Laurent



Re: [PATCH v2] virtio-rng: return available data with O_NONBLOCK

2020-08-11 Thread Michael S. Tsirkin
On Tue, Aug 11, 2020 at 03:53:54PM +0200, Laurent Vivier wrote:
> On 11/08/2020 15:14, Michael S. Tsirkin wrote:
> > On Tue, Aug 11, 2020 at 03:00:14PM +0200, Laurent Vivier wrote:
> >> No problem. This code is tricky and it took me several months to really
> >> start to understand it ...
> > 
> > Oh great, we actually have someone who understands the code!
> > Maybe you can help me understand: virtio_read
> > takes the buf pointer and puts it in the vq.
> > It can then return to caller (e.g. on a signal).
> > Device can meanwhile write into the buffer.
> > 
> > It looks like if another call then happens, and that
> > other call uses a different buffer, virtio rng
> > will happily return the data written into the
> > original buf pointer, confusing the caller.
> > 
> > Is that right?
> > 
> 
> Yes.
> 
> hw_random core uses two buffers:
> 
> - rng_fillbuf, which is used with blocking access and protected by the
> reading_mutex. I think this cannot be interrupted by a kill because it's
> in hwrng_fillfn() and that's a kthread.
> 
> - rng_buffer, which is used in rng_dev_read() and can be interrupted (it
> is also protected by reading_mutex).
> 
> But if rng_dev_read() is called with O_NONBLOCK or interrupted, and then
> rng_fillbuf starts, they can be mixed.
> 
> We also have the first use of rng_buffer in add_early_randomness(), which
> uses a different size than rng_dev_read() with the same buffer (and
> this size is 16, whereas the hwrng read API says it must be at least 32...).
> 
> The problem here is that the core has been developed with synchronicity in
> mind, whereas virtio is asynchronous by definition.
> 
> I think we should add some internal buffers to the virtio-rng backend. This
> would improve performance (we are at 1 MB/s; I sent a patch to improve
> that, but it doesn't fix the problems above) and would allow the hw_random
> core to use memory that doesn't need to be compatible with virt_to_page().
> 
> Thanks,
> Laurent

OK, so just add a bunch of 32-byte buffers and pass them to hardware,
and as the data gets consumed pass them to hardware again?


-- 
MST



Re: [PATCH v3] virtio-rng: return available data with O_NONBLOCK

2020-08-11 Thread Laurent Vivier
On 11/08/2020 16:28, mwi...@suse.com wrote:
> From: Martin Wilck 
> 
> If a program opens /dev/hwrng with O_NONBLOCK and uses poll() and
> non-blocking read() to retrieve random data, it ends up in a tight
> loop with poll() always returning POLLIN and read() returning EAGAIN.
> This repeats forever until some process makes a blocking read() call.
> The reason is that virtio_read() always returns 0 in non-blocking mode,
> even if data is available. Worse, it fetches random data from the
> hypervisor after every non-blocking call, without ever using this data.
> 
> The following test program illustrates the behavior and can be used
> for testing and experiments. The problem will only be seen if all
> tasks use non-blocking access; otherwise the blocking reads will
> "recharge" the random pool and cause other, non-blocking reads to
> succeed at least sometimes.
> 
> /* Whether to use non-blocking mode in a task; the problem occurs if CONDITION is 1 */
> //#define CONDITION (getpid() % 2 != 0)
> 
> static volatile sig_atomic_t stop;
> static void handler(int sig __attribute__((unused))) { stop = 1; }
> 
> static void loop(int fd, int sec)
> {
>   struct pollfd pfd = { .fd = fd, .events  = POLLIN, };
>   unsigned long errors = 0, eagains = 0, bytes = 0, succ = 0;
>   int size, rc, rd;
> 
>   srandom(getpid());
>   if (CONDITION && fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK) == 
> -1)
>   perror("fcntl");
>   size = MINBUFSIZ + random() % (MAXBUFSIZ - MINBUFSIZ + 1);
> 
>   for(;;) {
>   char buf[size];
> 
>   if (stop)
>   break;
>   rc = poll(&pfd, 1, sec);
>   if (rc > 0) {
>   rd = read(fd, buf, sizeof(buf));
>   if (rd == -1 && errno == EAGAIN)
>   eagains++;
>   else if (rd == -1)
>   errors++;
>   else {
>   succ++;
>   bytes += rd;
>   write(1, buf, sizeof(buf));
>   }
>   } else if (rc == -1) {
>   if (errno != EINTR)
>   perror("poll");
>   break;
>   } else
>   fprintf(stderr, "poll: timeout\n");
>   }
>   fprintf(stderr,
>   "pid %d %sblocking, bufsize %d, %d seconds, %lu bytes read, %lu 
> success, %lu eagain, %lu errors\n",
>   getpid(), CONDITION ? "non-" : "", size, sec, bytes, succ, 
> eagains, errors);
> }
> 
> int main(void)
> {
>   int fd;
> 
>   fork(); fork();
>   fd = open("/dev/hwrng", O_RDONLY);
>   if (fd == -1) {
>   perror("open");
>   return 1;
>   };
>   signal(SIGALRM, handler);
>   alarm(SECONDS);
>   loop(fd, SECONDS);
>   close(fd);
>   wait(NULL);
>   return 0;
> }
> 
> void loop(int fd)
> {
> struct pollfd pfd0 = { .fd = fd, .events  = POLLIN, };
> int rc;
> unsigned int n;
> 
> for (n = LOOPS; n > 0; n--) {
> struct pollfd pfd = pfd0;
> char buf[SIZE];
> 
> rc = poll(&pfd, 1, 1);
> if (rc > 0) {
> int rd = read(fd, buf, sizeof(buf));
> 
> if (rd == -1)
> perror("read");
> else
> printf("read %d bytes\n", rd);
> } else if (rc == -1)
> perror("poll");
> else
> fprintf(stderr, "timeout\n");
> 
> }
> }
> 
> int main(void)
> {
> int fd;
> 
> fd = open("/dev/hwrng", O_RDONLY|O_NONBLOCK);
> if (fd == -1) {
> perror("open");
> return 1;
> };
> loop(fd);
> close(fd);
> return 0;
> }
> 
> This can be observed in the real world e.g. with nested qemu/KVM virtual
> machines, if both the "outer" and "inner" VMs have a virtio-rng device.
> If the "inner" VM requests random data, qemu running in the "outer" VM
> uses this device in a non-blocking manner like the test program above.
> 
> Fix it by returning available data if a previous hypervisor call has
> completed. I tested this patch with the program above, and with rng-tools.
> 
> v2 -> v3: Simplified the implementation as suggested by Laurent Vivier
> 
> Signed-off-by: Martin Wilck 
> ---
>  drivers/char/hw_random/virtio-rng.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/char/hw_random/virtio-rng.c 
> b/drivers/char/hw_random/virtio-rng.c
> index a90001e02bf7..8eaeceecb41e 100644
> --- a/drivers/char/hw_random/virtio-rng.c
> +++ b/drivers/char/hw_random/virtio-rng.c
> @@ -65,7 +65,7 @@ static int virtio_read(struct hwrng *rng, void *buf, si

Re: [PATCH v2] virtio-rng: return available data with O_NONBLOCK

2020-08-11 Thread Laurent Vivier
On 11/08/2020 15:14, Michael S. Tsirkin wrote:
> On Tue, Aug 11, 2020 at 03:00:14PM +0200, Laurent Vivier wrote:
>> No problem. This code is tricky and it took me several months to really
>> start to understand it ...
> 
> Oh great, we actually have someone who understands the code!
> Maybe you can help me understand: virtio_read
> takes the buf pointer and puts it in the vq.
> It can then return to caller (e.g. on a signal).
> Device can meanwhile write into the buffer.
> 
> It looks like if another call then happens, and that
> other call uses a different buffer, virtio rng
> will happily return the data written into the
> original buf pointer, confusing the caller.
> 
> Is that right?
> 

Yes.

hw_random core uses two buffers:

- rng_fillbuf, which is used with blocking access and protected by the
reading_mutex. I think this cannot be interrupted by a kill because it's
in hwrng_fillfn() and that's a kthread.

- rng_buffer, which is used in rng_dev_read() and can be interrupted (it
is also protected by reading_mutex).

But if rng_dev_read() is called with O_NONBLOCK or interrupted, and then
rng_fillbuf starts, they can be mixed.

We also have the first use of rng_buffer in add_early_randomness(), which
uses a different size than rng_dev_read() with the same buffer (and
this size is 16, whereas the hwrng read API says it must be at least 32...).

The problem here is that the core has been developed with synchronicity in
mind, whereas virtio is asynchronous by definition.

I think we should add some internal buffers to the virtio-rng backend. This
would improve performance (we are at 1 MB/s; I sent a patch to improve
that, but it doesn't fix the problems above) and would allow the hw_random
core to use memory that doesn't need to be compatible with virt_to_page().

Thanks,
Laurent



Re: [PATCH v2] virtio-rng: return available data with O_NONBLOCK

2020-08-11 Thread Michael S. Tsirkin
On Tue, Aug 11, 2020 at 03:00:14PM +0200, Laurent Vivier wrote:
> No problem. This code is tricky and it took me several months to really
> start to understand it ...

Oh great, we actually have someone who understands the code!
Maybe you can help me understand: virtio_read
takes the buf pointer and puts it in the vq.
It can then return to caller (e.g. on a signal).
Device can meanwhile write into the buffer.

It looks like if another call then happens, and that
other call uses a different buffer, virtio rng
will happily return the data written into the
original buf pointer, confusing the caller.

Is that right?

-- 
MST



Re: [PATCH v2] virtio-rng: return available data with O_NONBLOCK

2020-08-11 Thread Laurent Vivier
On 11/08/2020 14:53, Martin Wilck wrote:
> On Tue, 2020-08-11 at 14:39 +0200, Laurent Vivier wrote:
>> On 11/08/2020 14:22, Martin Wilck wrote:
>>> On Tue, 2020-08-11 at 14:02 +0200, Laurent Vivier wrote:
>>  drivers/char/hw_random/virtio-rng.c | 14 ++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/drivers/char/hw_random/virtio-rng.c
>> b/drivers/char/hw_random/virtio-rng.c
>> index 79a6e47b5fbc..984713b35892 100644
>> --- a/drivers/char/hw_random/virtio-rng.c
>> +++ b/drivers/char/hw_random/virtio-rng.c
>> @@ -59,6 +59,20 @@ static int virtio_read(struct hwrng *rng,
>> void
>> *buf, size_t size, bool wait)
>>  if (vi->hwrng_removed)
>>  return -ENODEV;
>>  
>> +/*
>> + * If the previous call was non-blocking, we may have
>> got some
>> + * randomness already.
>> + */
>> +if (vi->busy && completion_done(&vi->have_data)) {
>> +unsigned int len;
>> +
>> +vi->busy = false;
>> +len = vi->data_avail > size ? size : vi->data_avail;
>> +vi->data_avail -= len;

 You don't need to modify data_avail. As busy is set to false, the
 buffer
 will be reused. and it is always overwritten by
 virtqueue_get_buf().
 And moreover, if it was reused it would be always the beginning.
>>>
>>> Ok.
>>>
>> +if (len)
>> +return len;
>> +}
>> +
>>  if (!vi->busy) {
>>  vi->busy = true;
>>  reinit_completion(&vi->have_data);
>>

 Why don't you modify only the wait case?

 Something like:

if (!wait && !completion_done(&vi->have_data)) {
return 0;
 }

 then at the end you can do "return min(size, vi->data_avail);".
>>>
>>> Sorry, I don't understand what you mean. Where would you insert the
>>> above "if" clause? Are you saying I should call
>>> wait_for_completion_killable() also in the (!wait) case?
>>
>> Yes, but only if the completion is done, so it will not wait.
>>
> 
> Slowly getting there, thanks for your patience. Yes, I guess this would
> work, too. I'll test and get back to you.

No problem. This code is tricky and it took me several months to really
start to understand it ...

Thanks,
Laurent



Re: [PATCH v2] virtio-rng: return available data with O_NONBLOCK

2020-08-11 Thread Michael S. Tsirkin
On Tue, Aug 11, 2020 at 02:07:23PM +0200, Martin Wilck wrote:
> On Tue, 2020-08-11 at 07:26 -0400, Michael S. Tsirkin wrote:
> > On Wed, Jul 15, 2020 at 03:32:55PM +0200, mwi...@suse.com wrote:
> > >  drivers/char/hw_random/virtio-rng.c | 14 ++
> > >  1 file changed, 14 insertions(+)
> > > 
> > > diff --git a/drivers/char/hw_random/virtio-rng.c
> > > b/drivers/char/hw_random/virtio-rng.c
> > > index 79a6e47b5fbc..984713b35892 100644
> > > --- a/drivers/char/hw_random/virtio-rng.c
> > > +++ b/drivers/char/hw_random/virtio-rng.c
> > > @@ -59,6 +59,20 @@ static int virtio_read(struct hwrng *rng, void
> > > *buf, size_t size, bool wait)
> > >   if (vi->hwrng_removed)
> > >   return -ENODEV;
> > >  
> > > + /*
> > > +  * If the previous call was non-blocking, we may have got some
> > > +  * randomness already.
> > > +  */
> > > + if (vi->busy && completion_done(&vi->have_data)) {
> > > + unsigned int len;
> > > +
> > > + vi->busy = false;
> > > + len = vi->data_avail > size ? size : vi->data_avail;
> > > + vi->data_avail -= len;
> > 
> > I wonder what purpose this line serves: busy is false,
> > which basically means data_avail is invalid, right?
> > A following non-blocking call will not enter here.
> 
> Well, I thought this is just how reading data normally works. But
> you're right, the remainder will not be used. I can remove the line, or
> reset data_avail to 0 at this point. What do you prefer?
> 
> Regards,
> Martin

Removing seems cleaner.

But looking at it, it is using the API in a very strange way:
a buffer is placed in the ring by one call, and *assumed*
to still be there in the next call, which it might not be
if one call is from userspace and the next one is from the
fill kthread.

I guess this is why it's returning 0: yes, it knows there's
data, but it does not know where it is.

-- 
MST



Re: VDPA Debug/Statistics

2020-08-11 Thread Michael S. Tsirkin
On Tue, Aug 11, 2020 at 11:58:23AM +, Eli Cohen wrote:
> On Tue, Aug 11, 2020 at 11:26:20AM +, Eli Cohen wrote:
> > Hi All
> > 
> > Currently, the only statistics we get for a VDPA instance come from the 
> > virtio_net device instance. Since VDPA involves hardware acceleration, 
> > there can be quite a lot of information that can be fetched from the 
> > underlying device. Currently there is no generic method to fetch this 
> > information.
> > 
> > One way of doing this would be to create, on the host, a net device for 
> > each VDPA instance, and use it to get this information or do some 
> > configuration. Ethtool can be used in such a case
> > 
> > I would like to hear what you think about this or maybe you have some other 
> > ideas to address this topic.
> > 
> > Thanks,
> > Eli
> 
> Something I'm not sure I understand is how vdpa instances are created on 
> Mellanox cards. There's a devlink command for that, is that right?
> Can that be extended for stats?
> 
> Currently any VF will be probed as VDPA device. We're adding devlink support 
> but I am not sure if devlink is suitable for displaying statistics. We will 
> discuss internally but I wanted to know what you guys think.

OK still things like specifying the mac are managed through rtnetlink,
right?

Right now it does not look like you can mix stats and vf, they are
handled separately:

if (rtnl_fill_stats(skb, dev))
goto nla_put_failure;

if (rtnl_fill_vf(skb, dev, ext_filter_mask))
goto nla_put_failure;

but the ability to query vf stats on the host sounds useful generally.

As another option, we could use a vdpa specific way to retrieve stats,
and teach qemu to report them.




> --
> MST



Re: [PATCH v2] virtio-rng: return available data with O_NONBLOCK

2020-08-11 Thread Laurent Vivier
On 11/08/2020 14:22, Martin Wilck wrote:
> On Tue, 2020-08-11 at 14:02 +0200, Laurent Vivier wrote:
>>
  drivers/char/hw_random/virtio-rng.c | 14 ++
  1 file changed, 14 insertions(+)

 diff --git a/drivers/char/hw_random/virtio-rng.c
 b/drivers/char/hw_random/virtio-rng.c
 index 79a6e47b5fbc..984713b35892 100644
 --- a/drivers/char/hw_random/virtio-rng.c
 +++ b/drivers/char/hw_random/virtio-rng.c
 @@ -59,6 +59,20 @@ static int virtio_read(struct hwrng *rng, void
 *buf, size_t size, bool wait)
if (vi->hwrng_removed)
return -ENODEV;
  
 +  /*
 +   * If the previous call was non-blocking, we may have got some
 +   * randomness already.
 +   */
 +  if (vi->busy && completion_done(&vi->have_data)) {
 +  unsigned int len;
 +
 +  vi->busy = false;
 +  len = vi->data_avail > size ? size : vi->data_avail;
 +  vi->data_avail -= len;
>>
>> You don't need to modify data_avail. As busy is set to false, the
>> buffer
>> will be reused, and it is always overwritten by virtqueue_get_buf().
>> Moreover, if it were reused, it would always start from the beginning.
> 
> Ok.
> 
>>
 +  if (len)
 +  return len;
 +  }
 +
if (!vi->busy) {
vi->busy = true;
reinit_completion(&vi->have_data);

>>
>> Why don't you modify only the wait case?
>>
>> Something like:
>>
>>  if (!wait && !completion_done(&vi->have_data)) {
>>  return 0;
>> }
>>
>> then at the end you can do "return min(size, vi->data_avail);".
> 
> Sorry, I don't understand what you mean. Where would you insert the
> above "if" clause? Are you saying I should call
> wait_for_completion_killable() also in the (!wait) case?

Yes, but only if the completion is done, so it will not wait.

> 
> I must call check completion_done() before calling reinit_completion().

Normally, the busy flag is here for that. If busy is true, a buffer is
already registered. reinit_completion() must not be called if busy is
true. busy becomes false when the buffer is ready to be reused.

> OTOH, if completion_done() returns false, I can't simply return 0, I
> must at least start fetching new random data, so that a subsequent
> virtio_read() call has a chance to return something.

If you modify "if (!wait)" to become "if (!wait &&
!completion_done(&vi->have_data))", we either already have a buffer
registered from a previous call, or the one we have just registered if
busy was false. So you can return 0: nothing is ready yet, but a buffer
is registered for the next time.

Thanks,
Laurent



Re: [PATCH v2] virtio-rng: return available data with O_NONBLOCK

2020-08-11 Thread Laurent Vivier
On 11/08/2020 12:37, Philippe Mathieu-Daudé wrote:
> You Cc'ed qemu-devel, so Cc'ing the virtio-rng maintainers.
> 
> On 7/15/20 3:32 PM, mwi...@suse.com wrote:
>> From: Martin Wilck 
>>
>> If a program opens /dev/hwrng with O_NONBLOCK and uses poll() and
>> non-blocking read() to retrieve random data, it ends up in a tight
>> loop with poll() always returning POLLIN and read() returning EAGAIN.
>> This repeats forever until some process makes a blocking read() call.
>> The reason is that virtio_read() always returns 0 in non-blocking mode,
>> even if data is available. Worse, it fetches random data from the
>> hypervisor after every non-blocking call, without ever using this data.
>>
>> The following test program illustrates the behavior and can be used
>> for testing and experiments. The problem will only be seen if all
>> tasks use non-blocking access; otherwise the blocking reads will
>> "recharge" the random pool and cause other, non-blocking reads to
>> succeed at least sometimes.
>>
>> /* Whether to use non-blocking mode in a task, problem occurs if CONDITION 
>> is 1 */
>> //#define CONDITION (getpid() % 2 != 0)
>>
>> static volatile sig_atomic_t stop;
>> static void handler(int sig __attribute__((unused))) { stop = 1; }
>>
>> static void loop(int fd, int sec)
>> {
>>  struct pollfd pfd = { .fd = fd, .events  = POLLIN, };
>>  unsigned long errors = 0, eagains = 0, bytes = 0, succ = 0;
>>  int size, rc, rd;
>>
>>  srandom(getpid());
>>  if (CONDITION && fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK) == 
>> -1)
>>  perror("fcntl");
>>  size = MINBUFSIZ + random() % (MAXBUFSIZ - MINBUFSIZ + 1);
>>
>>  for(;;) {
>>  char buf[size];
>>
>>  if (stop)
>>  break;
>>  rc = poll(&pfd, 1, sec);
>>  if (rc > 0) {
>>  rd = read(fd, buf, sizeof(buf));
>>  if (rd == -1 && errno == EAGAIN)
>>  eagains++;
>>  else if (rd == -1)
>>  errors++;
>>  else {
>>  succ++;
>>  bytes += rd;
>>  write(1, buf, sizeof(buf));
>>  }
>>  } else if (rc == -1) {
>>  if (errno != EINTR)
>>  perror("poll");
>>  break;
>>  } else
>>  fprintf(stderr, "poll: timeout\n");
>>  }
>>  fprintf(stderr,
>>  "pid %d %sblocking, bufsize %d, %d seconds, %lu bytes read, %lu 
>> success, %lu eagain, %lu errors\n",
>>  getpid(), CONDITION ? "non-" : "", size, sec, bytes, succ, 
>> eagains, errors);
>> }
>>
>> int main(void)
>> {
>>  int fd;
>>
>>  fork(); fork();
>>  fd = open("/dev/hwrng", O_RDONLY);
>>  if (fd == -1) {
>>  perror("open");
>>  return 1;
>>  };
>>  signal(SIGALRM, handler);
>>  alarm(SECONDS);
>>  loop(fd, SECONDS);
>>  close(fd);
>>  wait(NULL);
>>  return 0;
>> }
>>
>> void loop(int fd)
>> {
>> struct pollfd pfd0 = { .fd = fd, .events  = POLLIN, };
>> int rc;
>> unsigned int n;
>>
>> for (n = LOOPS; n > 0; n--) {
>> struct pollfd pfd = pfd0;
>> char buf[SIZE];
>>
>> rc = poll(&pfd, 1, 1);
>> if (rc > 0) {
>> int rd = read(fd, buf, sizeof(buf));
>>
>> if (rd == -1)
>> perror("read");
>> else
>> printf("read %d bytes\n", rd);
>> } else if (rc == -1)
>> perror("poll");
>> else
>> fprintf(stderr, "timeout\n");
>>
>> }
>> }
>>
>> int main(void)
>> {
>> int fd;
>>
>> fd = open("/dev/hwrng", O_RDONLY|O_NONBLOCK);
>> if (fd == -1) {
>> perror("open");
>> return 1;
>> };
>> loop(fd);
>> close(fd);
>> return 0;
>> }
>>
>> This can be observed in the real word e.g. with nested qemu/KVM virtual
>> machines, if both the "outer" and "inner" VMs have a virtio-rng device.
>> If the "inner" VM requests random data, qemu running in the "outer" VM
>> uses this device in a non-blocking manner like the test program above.
>>
>> Fix it by returning available data if a previous hypervisor call has
>> completed in the meantime. I tested the patch with the program above,
>> and with rng-tools.
>>
>> Signed-off-by: Martin Wilck 
>> ---
>>  drivers/char/hw_random/virtio-rng.c | 14 ++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/drivers/char/hw_random/virtio-rng.c 
>> b/drivers/char/hw_random/virtio-rng.c
>> index 79a6e47b5fbc..984713b35892 100644
>> --- a/drivers/char/hw_random/vir

Re: VDPA Debug/Statistics

2020-08-11 Thread Michael S. Tsirkin
On Tue, Aug 11, 2020 at 11:26:20AM +, Eli Cohen wrote:
> Hi All
> 
> Currently, the only statistics we get for a VDPA instance comes from the 
> virtio_net device instance. Since VDPA involves hardware acceleration, there 
> can be quite a lot of information that can be fetched from the underlying 
> device. Currently there is no generic method to fetch this information.
> 
> One way of doing this can be to create, on the host, a net device for each VDPA 
> instance, and use it to get this information or do some configuration. 
> Ethtool can be used in such a case
> 
> I would like to hear what you think about this or maybe you have some other 
> ideas to address this topic.
> 
> Thanks,
> Eli

Something I'm not sure I understand is how are vdpa instances created
on mellanox cards? There's a devlink command for that, is that right?
Can that be extended for stats?

-- 
MST



Re: [PATCH v2] virtio-rng: return available data with O_NONBLOCK

2020-08-11 Thread Michael S. Tsirkin
On Wed, Jul 15, 2020 at 03:32:55PM +0200, mwi...@suse.com wrote:
> From: Martin Wilck 
> 
> If a program opens /dev/hwrng with O_NONBLOCK and uses poll() and
> non-blocking read() to retrieve random data, it ends up in a tight
> loop with poll() always returning POLLIN and read() returning EAGAIN.
> This repeats forever until some process makes a blocking read() call.
> The reason is that virtio_read() always returns 0 in non-blocking mode,
> even if data is available. Worse, it fetches random data from the
> hypervisor after every non-blocking call, without ever using this data.
> 
> The following test program illustrates the behavior and can be used
> for testing and experiments. The problem will only be seen if all
> tasks use non-blocking access; otherwise the blocking reads will
> "recharge" the random pool and cause other, non-blocking reads to
> succeed at least sometimes.
> 
> /* Whether to use non-blocking mode in a task, problem occurs if CONDITION is 
> 1 */
> //#define CONDITION (getpid() % 2 != 0)
> 
> static volatile sig_atomic_t stop;
> static void handler(int sig __attribute__((unused))) { stop = 1; }
> 
> static void loop(int fd, int sec)
> {
>   struct pollfd pfd = { .fd = fd, .events  = POLLIN, };
>   unsigned long errors = 0, eagains = 0, bytes = 0, succ = 0;
>   int size, rc, rd;
> 
>   srandom(getpid());
>   if (CONDITION && fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK) == 
> -1)
>   perror("fcntl");
>   size = MINBUFSIZ + random() % (MAXBUFSIZ - MINBUFSIZ + 1);
> 
>   for(;;) {
>   char buf[size];
> 
>   if (stop)
>   break;
>   rc = poll(&pfd, 1, sec);
>   if (rc > 0) {
>   rd = read(fd, buf, sizeof(buf));
>   if (rd == -1 && errno == EAGAIN)
>   eagains++;
>   else if (rd == -1)
>   errors++;
>   else {
>   succ++;
>   bytes += rd;
>   write(1, buf, sizeof(buf));
>   }
>   } else if (rc == -1) {
>   if (errno != EINTR)
>   perror("poll");
>   break;
>   } else
>   fprintf(stderr, "poll: timeout\n");
>   }
>   fprintf(stderr,
>   "pid %d %sblocking, bufsize %d, %d seconds, %lu bytes read, %lu 
> success, %lu eagain, %lu errors\n",
>   getpid(), CONDITION ? "non-" : "", size, sec, bytes, succ, 
> eagains, errors);
> }
> 
> int main(void)
> {
>   int fd;
> 
>   fork(); fork();
>   fd = open("/dev/hwrng", O_RDONLY);
>   if (fd == -1) {
>   perror("open");
>   return 1;
>   };
>   signal(SIGALRM, handler);
>   alarm(SECONDS);
>   loop(fd, SECONDS);
>   close(fd);
>   wait(NULL);
>   return 0;
> }
> 
> void loop(int fd)
> {
> struct pollfd pfd0 = { .fd = fd, .events  = POLLIN, };
> int rc;
> unsigned int n;
> 
> for (n = LOOPS; n > 0; n--) {
> struct pollfd pfd = pfd0;
> char buf[SIZE];
> 
> rc = poll(&pfd, 1, 1);
> if (rc > 0) {
> int rd = read(fd, buf, sizeof(buf));
> 
> if (rd == -1)
> perror("read");
> else
> printf("read %d bytes\n", rd);
> } else if (rc == -1)
> perror("poll");
> else
> fprintf(stderr, "timeout\n");
> 
> }
> }
> 
> int main(void)
> {
> int fd;
> 
> fd = open("/dev/hwrng", O_RDONLY|O_NONBLOCK);
> if (fd == -1) {
> perror("open");
> return 1;
> };
> loop(fd);
> close(fd);
> return 0;
> }
> 
> This can be observed in the real word e.g. with nested qemu/KVM virtual
> machines, if both the "outer" and "inner" VMs have a virtio-rng device.
> If the "inner" VM requests random data, qemu running in the "outer" VM
> uses this device in a non-blocking manner like the test program above.
> 
> Fix it by returning available data if a previous hypervisor call has
> completed in the meantime. I tested the patch with the program above,
> and with rng-tools.
> 
> Signed-off-by: Martin Wilck 
> ---
>  drivers/char/hw_random/virtio-rng.c | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/drivers/char/hw_random/virtio-rng.c 
> b/drivers/char/hw_random/virtio-rng.c
> index 79a6e47b5fbc..984713b35892 100644
> --- a/drivers/char/hw_random/virtio-rng.c
> +++ b/drivers/char/hw_random/virtio-rng.c
> @@ -59,6 +59,20 @@ static int virtio_read(struct hwrng *rng, void *buf, 
> size_t size, bool wait)
>   if (vi->

Re: [PATCH] vhost: vdpa: remove per device feature whitelist

2020-08-11 Thread Michael S. Tsirkin
On Mon, Jul 20, 2020 at 04:50:43PM +0800, Jason Wang wrote:
> We used to have a per device feature whitelist to filter out the
> unsupported virtio features. But this seems unnecessary since:
> 
> - the main idea behind feature whitelist is to block control vq
>   feature until we finalize the control virtqueue API. But the current
>   vhost-vDPA uAPI is sufficient to support control virtqueue. For
>   device that has hardware control virtqueue, the vDPA device driver
>   can just setup the hardware virtqueue and let userspace to use
>   hardware virtqueue directly. For device that doesn't have a control
>   virtqueue, the vDPA device driver need to use e.g vringh to emulate
>   a software control virtqueue.
> - we don't do it in virtio-vDPA driver
> 
> So remove this limitation.
> 
> Signed-off-by: Jason Wang 


Thinking about it, should we block some bits?
E.g. access_platform?
They depend on qemu, not vdpa ...

> ---
>  drivers/vhost/vdpa.c | 37 -
>  1 file changed, 37 deletions(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 77a0c9fb6cc3..f7f6ddd681ce 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -26,35 +26,6 @@
>  
>  #include "vhost.h"
>  
> -enum {
> - VHOST_VDPA_FEATURES =
> - (1ULL << VIRTIO_F_NOTIFY_ON_EMPTY) |
> - (1ULL << VIRTIO_F_ANY_LAYOUT) |
> - (1ULL << VIRTIO_F_VERSION_1) |
> - (1ULL << VIRTIO_F_IOMMU_PLATFORM) |
> - (1ULL << VIRTIO_F_RING_PACKED) |
> - (1ULL << VIRTIO_F_ORDER_PLATFORM) |
> - (1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
> - (1ULL << VIRTIO_RING_F_EVENT_IDX),
> -
> - VHOST_VDPA_NET_FEATURES = VHOST_VDPA_FEATURES |
> - (1ULL << VIRTIO_NET_F_CSUM) |
> - (1ULL << VIRTIO_NET_F_GUEST_CSUM) |
> - (1ULL << VIRTIO_NET_F_MTU) |
> - (1ULL << VIRTIO_NET_F_MAC) |
> - (1ULL << VIRTIO_NET_F_GUEST_TSO4) |
> - (1ULL << VIRTIO_NET_F_GUEST_TSO6) |
> - (1ULL << VIRTIO_NET_F_GUEST_ECN) |
> - (1ULL << VIRTIO_NET_F_GUEST_UFO) |
> - (1ULL << VIRTIO_NET_F_HOST_TSO4) |
> - (1ULL << VIRTIO_NET_F_HOST_TSO6) |
> - (1ULL << VIRTIO_NET_F_HOST_ECN) |
> - (1ULL << VIRTIO_NET_F_HOST_UFO) |
> - (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
> - (1ULL << VIRTIO_NET_F_STATUS) |
> - (1ULL << VIRTIO_NET_F_SPEED_DUPLEX),
> -};
> -
>  /* Currently, only network backend w/o multiqueue is supported. */
>  #define VHOST_VDPA_VQ_MAX2
>  
> @@ -79,10 +50,6 @@ static DEFINE_IDA(vhost_vdpa_ida);
>  
>  static dev_t vhost_vdpa_major;
>  
> -static const u64 vhost_vdpa_features[] = {
> - [VIRTIO_ID_NET] = VHOST_VDPA_NET_FEATURES,
> -};
> -
>  static void handle_vq_kick(struct vhost_work *work)
>  {
>   struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
> @@ -255,7 +222,6 @@ static long vhost_vdpa_get_features(struct vhost_vdpa *v, 
> u64 __user *featurep)
>   u64 features;
>  
>   features = ops->get_features(vdpa);
> - features &= vhost_vdpa_features[v->virtio_id];
>  
>   if (copy_to_user(featurep, &features, sizeof(features)))
>   return -EFAULT;
> @@ -279,9 +245,6 @@ static long vhost_vdpa_set_features(struct vhost_vdpa *v, 
> u64 __user *featurep)
>   if (copy_from_user(&features, featurep, sizeof(features)))
>   return -EFAULT;
>  
> - if (features & ~vhost_vdpa_features[v->virtio_id])
> - return -EINVAL;
> -
>   if (ops->set_features(vdpa, features))
>   return -EINVAL;
>  
> -- 
> 2.20.1



vdpa: handling of VIRTIO_F_ACCESS_PLATFORM/VIRTIO_F_ORDER_PLATFORM

2020-08-11 Thread Michael S. Tsirkin
Hi!
I'd like to raise the question of whether we can drop the requirement
of VIRTIO_F_ACCESS_PLATFORM from vdpa?
As far as I can see, it is merely required for virtio vdpa -
so should we not enforce it there?

The point is support for legacy guests - which mostly just works
on x86.

Also, what is the plan for VIRTIO_F_ORDER_PLATFORM?

-- 
MST



Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers

2020-08-11 Thread peterz
On Tue, Aug 11, 2020 at 11:20:54AM +0200, pet...@infradead.org wrote:
> On Tue, Aug 11, 2020 at 10:38:50AM +0200, Jürgen Groß wrote:
> > In case you don't want to do it I can send the patch for the Xen
> > variants.
> 
> I might've opened a whole new can of worms here. I'm not sure we
> can/want to fix the entire fallout this release :/
> 
> Let me ponder this a little, because the more I look at things, the more
> problems I keep finding... bah bah bah.

That is, most of these irq-tracking problems are new because commit:

  859d069ee1dd ("lockdep: Prepare for NMI IRQ state tracking")

changed irq-tracking to ignore the lockdep recursion count.

This then allows:

lock_acquire()
  raw_local_irq_save();
  current->lockdep_recursion++;
  trace_lock_acquire()
   ... tracing ...
 #PF under raw_local_irq_*()

  __lock_acquire()
arch_spin_lock(&graph_lock)
  pv-spinlock-wait()
local_irq_save() under raw_local_irq_*()

However afaict that just made a bad situation worse. There already were
issues, take for example:

  trace_clock_global()
raw_local_irq_save();
arch_spin_lock()
  pv-spinlock-wait
local_irq_save()

And that has no lockdep_recursion to 'save' the day.

The tracing recursion does however avoid some of the obvious fails
there, like trace_clock calling into paravirt which then calls back into
tracing. But still, that would've caused IRQ tracking problems even with
the old code.

And in that respect, this is all the exact same problem as that other
set of patches has ( 20200807192336.405068...@infradead.org ).

Now, on the flip side, it does find actual problems, the trace_lock_*()
things were using RCU in RCU-disabled code, and here I found that
trace_clock_global() thing (and I suspect there's more of that).

But at this point I'm not entirely sure how best to proceed... tracing
uses arch_spinlock_t, which means all spinlock implementations should be
notrace, but then that drops into paravirt and all hell breaks loose
because Hyper-V then calls into the APIC code etc.. etc..

At that rate we'll have the entire kernel marked notrace, and I'm fairly
sure that's not a solution either.

So let me once again see if I can't find a better solution for this all.
Clearly it needs one :/


Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers

2020-08-11 Thread peterz
On Tue, Aug 11, 2020 at 10:38:50AM +0200, Jürgen Groß wrote:
> In case you don't want to do it I can send the patch for the Xen
> variants.

I might've opened a whole new can of worms here. I'm not sure we
can/want to fix the entire fallout this release :/

Let me ponder this a little, because the more I look at things, the more
problems I keep finding... bah bah bah.


[GIT PULL] virtio: features, fixes

2020-08-11 Thread Michael S. Tsirkin
OK, some patches in the series add buggy code which is then fixed by
follow-up patches, but none of the bugs fixed are severe regressions on
common configs (e.g. compiler warnings, lockdep/rt errors, or bugs in
new drivers). So I thought it's more important to preserve the credit
for the fixes.

I had to pull 5 patches from 
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux mlx5-next
to get the mlx5 things to work, this seems to be how mellanox guys are
always managing things, and they told me they are ok with it.

The following changes since commit bcf876870b95592b52519ed4aafcf9d95999bc9c:

  Linux 5.8 (2020-08-02 14:21:45 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 8a7c3213db068135e816a6a517157de6443290d6:

  vdpa/mlx5: fix up endian-ness for mtu (2020-08-10 10:38:55 -0400)


virtio: fixes, features

IRQ bypass support for vdpa and IFC
MLX5 vdpa driver
Endian-ness fixes for virtio drivers
Misc other fixes

Signed-off-by: Michael S. Tsirkin 


Alex Dewar (1):
  vdpa/mlx5: Fix uninitialised variable in core/mr.c

Colin Ian King (1):
  vdpa/mlx5: fix memory allocation failure checks

Dan Carpenter (2):
  vdpa/mlx5: Fix pointer math in mlx5_vdpa_get_config()
  vdpa: Fix pointer math bug in vdpasim_get_config()

Eli Cohen (9):
  net/mlx5: Support setting access rights of dma addresses
  net/mlx5: Add VDPA interface type to supported enumerations
  net/mlx5: Add interface changes required for VDPA
  net/vdpa: Use struct for set/get vq state
  vdpa: Modify get_vq_state() to return error code
  vdpa/mlx5: Add hardware descriptive header file
  vdpa/mlx5: Add support library for mlx5 VDPA implementation
  vdpa/mlx5: Add shared memory registration code
  vdpa/mlx5: Add VDPA driver for supported mlx5 devices

Gustavo A. R. Silva (1):
  vhost: Use flex_array_size() helper in copy_from_user()

Jason Wang (6):
  vhost: vdpa: remove per device feature whitelist
  vhost-vdpa: refine ioctl pre-processing
  vhost: generialize backend features setting/getting
  vhost-vdpa: support get/set backend features
  vhost-vdpa: support IOTLB batching hints
  vdpasim: support batch updating

Liao Pingfang (1):
  virtio_pci_modern: Fix the comment of virtio_pci_find_capability()

Mao Wenan (1):
  virtio_ring: Avoid loop when vq is broken in virtqueue_poll

Maor Gottlieb (2):
  net/mlx5: Export resource dump interface
  net/mlx5: Add support in query QP, CQ and MKEY segments

Max Gurtovoy (2):
  vdpasim: protect concurrent access to iommu iotlb
  vdpa: remove hard coded virtq num

Meir Lichtinger (1):
  RDMA/mlx5: ConnectX-7 new capabilities to set relaxed ordering by UMR

Michael Guralnik (2):
  net/mlx5: Enable QP number request when creating IPoIB underlay QP
  net/mlx5: Enable count action for rules with allow action

Michael S. Tsirkin (44):
  virtio: VIRTIO_F_IOMMU_PLATFORM -> VIRTIO_F_ACCESS_PLATFORM
  virtio: virtio_has_iommu_quirk -> virtio_has_dma_quirk
  virtio_balloon: fix sparse warning
  virtio_ring: sparse warning fixup
  virtio: allow __virtioXX, __leXX in config space
  virtio_9p: correct tags for config space fields
  virtio_balloon: correct tags for config space fields
  virtio_blk: correct tags for config space fields
  virtio_console: correct tags for config space fields
  virtio_crypto: correct tags for config space fields
  virtio_fs: correct tags for config space fields
  virtio_gpu: correct tags for config space fields
  virtio_input: correct tags for config space fields
  virtio_iommu: correct tags for config space fields
  virtio_mem: correct tags for config space fields
  virtio_net: correct tags for config space fields
  virtio_pmem: correct tags for config space fields
  virtio_scsi: correct tags for config space fields
  virtio_config: disallow native type fields
  mlxbf-tmfifo: sparse tags for config access
  vdpa: make sure set_features is invoked for legacy
  vhost/vdpa: switch to new helpers
  virtio_vdpa: legacy features handling
  vdpa_sim: fix endian-ness of config space
  virtio_config: cread/write cleanup
  virtio_config: rewrite using _Generic
  virtio_config: disallow native type fields (again)
  virtio_config: LE config space accessors
  virtio_caif: correct tags for config space fields
  virtio_config: add virtio_cread_le_feature
  virtio_balloon: use LE config space accesses
  virtio_input: convert to LE accessors
  virtio_fs: convert to LE accessors
  virtio_crypto: convert to LE accessors
  virtio_pmem: convert to LE accessors
  drm/virtio: convert to LE accessors
  virtio_mem: c

Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers

2020-08-11 Thread Jürgen Groß

On 11.08.20 10:12, Peter Zijlstra wrote:
> On Tue, Aug 11, 2020 at 09:57:55AM +0200, Jürgen Groß wrote:
>> On 11.08.20 09:41, Peter Zijlstra wrote:
>>> On Fri, Aug 07, 2020 at 05:19:03PM +0200, Marco Elver wrote:
>>>
>>>> My hypothesis here is simply that kvm_wait() may be called in a place
>>>> where we get the same case I mentioned to Peter,
>>>>
>>>> raw_local_irq_save(); /* or other IRQs off without tracing */
>>>> ...
>>>> kvm_wait() /* IRQ state tracing gets confused */
>>>> ...
>>>> raw_local_irq_restore();
>>>>
>>>> and therefore, using raw variants in kvm_wait() works. It's also safe
>>>> because it doesn't call any other libraries that would result in corrupt
>>>
>>> Yes, this is definitely an issue.
>>>
>>> Tracing, we also mustn't call into tracing when using raw_local_irq_*().
>>> Because then we re-introduce this same issue all over again.
>>>
>>> Both halt() and safe_halt() are more paravirt calls, but given we're in
>>> a KVM paravirt call already, I suppose we can directly use native_*()
>>> here.
>>>
>>> Something like so then... I suppose, but then the Xen variants need TLC
>>> too.
>>
>> Just to be sure I understand you correct:
>>
>> You mean that xen_qlock_kick() and xen_qlock_wait() and all functions
>> called by those should gain the "notrace" attribute, right?
>>
>> I am not sure why the kick variants need it, though. IMO those are
>> called only after the lock has been released, so they should be fine
>> without notrace.
>
> The issue happens when someone uses arch_spinlock_t under
> raw_local_irq_*().
>
>> And again: we shouldn't forget the Hyper-V variants.
>
> Bah, my grep failed :/ Also *groan*, that's calling apic->send_IPI().

In case you don't want to do it I can send the patch for the Xen
variants.


Juergen

Re: [PATCH 1/4] vdpa: introduce config op to get valid iova range

2020-08-11 Thread Michael S. Tsirkin
On Tue, Aug 11, 2020 at 10:53:09AM +0800, Jason Wang wrote:
> 
> On 2020/8/10 下午8:05, Michael S. Tsirkin wrote:
> > On Thu, Aug 06, 2020 at 03:43:54PM +0300, Eli Cohen wrote:
> > > On Thu, Aug 06, 2020 at 08:29:22AM -0400, Michael S. Tsirkin wrote:
> > > > On Thu, Aug 06, 2020 at 03:03:55PM +0300, Eli Cohen wrote:
> > > > > On Wed, Aug 05, 2020 at 08:51:56AM -0400, Michael S. Tsirkin wrote:
> > > > > > On Wed, Jun 17, 2020 at 11:29:44AM +0800, Jason Wang wrote:
> > > > > > > This patch introduce a config op to get valid iova range from the 
> > > > > > > vDPA
> > > > > > > device.
> > > > > > > 
> > > > > > > Signed-off-by: Jason Wang 
> > > > > > > ---
> > > > > > >   include/linux/vdpa.h | 14 ++
> > > > > > >   1 file changed, 14 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> > > > > > > index 239db794357c..b7633ed2500c 100644
> > > > > > > --- a/include/linux/vdpa.h
> > > > > > > +++ b/include/linux/vdpa.h
> > > > > > > @@ -41,6 +41,16 @@ struct vdpa_device {
> > > > > > >   unsigned int index;
> > > > > > >   };
> > > > > > > +/**
> > > > > > > + * vDPA IOVA range - the IOVA range support by the device
> > > > > > > + * @start: start of the IOVA range
> > > > > > > + * @end: end of the IOVA range
> > > > > > > + */
> > > > > > > +struct vdpa_iova_range {
> > > > > > > + u64 start;
> > > > > > > + u64 end;
> > > > > > > +};
> > > > > > > +
> > > > > > 
> > > > > > This is ambiguous. Is end in the range or just behind it?
> > > > > > How about first/last?
> > > > > It is customary in the kernel to use start-end where end corresponds 
> > > > > to
> > > > > the byte following the last in the range. See struct vm_area_struct
> > > > > vm_start and vm_end fields
> > > > Exactly my point:
> > > > 
> > > > include/linux/mm_types.h:   unsigned long vm_end;   /* The 
> > > > first byte after our end address
> > > > 
> > > > in this case Jason wants it to be the last byte, not one behind.
> > > > 
> > > > 
> > > Maybe start, size? Not ambiguous, and you don't need to do annoying
> > > calculations like size = last - start + 1
> > Size has a bunch of issues: can overlap, can not cover the entire 64 bit
> > range. The requisite checks are arguably easier to get wrong than
> > getting the size if you need it.
> 
> 
> Yes, so do you still prefer first/last or just begin/end which is consistent
> with iommu_domain_geometry?
> 
> Thanks

I prefer first/last I think, these are unambiguous.
E.g.

dma_addr_t aperture_start; /* First address that can be mapped*/
dma_addr_t aperture_end;   /* Last address that can be mapped */

instead of addressing ambiguity with a comment, let's just name the field well.





Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers

2020-08-11 Thread Jürgen Groß

On 11.08.20 10:12, Peter Zijlstra wrote:
> On Tue, Aug 11, 2020 at 09:57:55AM +0200, Jürgen Groß wrote:
>> On 11.08.20 09:41, Peter Zijlstra wrote:
>>> On Fri, Aug 07, 2020 at 05:19:03PM +0200, Marco Elver wrote:
>>>
>>>> My hypothesis here is simply that kvm_wait() may be called in a place
>>>> where we get the same case I mentioned to Peter,
>>>>
>>>> raw_local_irq_save(); /* or other IRQs off without tracing */
>>>> ...
>>>> kvm_wait() /* IRQ state tracing gets confused */
>>>> ...
>>>> raw_local_irq_restore();
>>>>
>>>> and therefore, using raw variants in kvm_wait() works. It's also safe
>>>> because it doesn't call any other libraries that would result in corrupt
>>>
>>> Yes, this is definitely an issue.
>>>
>>> Tracing, we also mustn't call into tracing when using raw_local_irq_*().
>>> Because then we re-introduce this same issue all over again.
>>>
>>> Both halt() and safe_halt() are more paravirt calls, but given we're in
>>> a KVM paravirt call already, I suppose we can directly use native_*()
>>> here.
>>>
>>> Something like so then... I suppose, but then the Xen variants need TLC
>>> too.
>>
>> Just to be sure I understand you correct:
>>
>> You mean that xen_qlock_kick() and xen_qlock_wait() and all functions
>> called by those should gain the "notrace" attribute, right?
>>
>> I am not sure why the kick variants need it, though. IMO those are
>> called only after the lock has been released, so they should be fine
>> without notrace.
>
> The issue happens when someone uses arch_spinlock_t under
> raw_local_irq_*().

Ah, okay.

>> And again: we shouldn't forget the Hyper-V variants.
>
> Bah, my grep failed :/ Also *groan*, that's calling apic->send_IPI().

I've seen that, too. :-(


Juergen

Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers

2020-08-11 Thread Peter Zijlstra
On Tue, Aug 11, 2020 at 09:57:55AM +0200, Jürgen Groß wrote:
> On 11.08.20 09:41, Peter Zijlstra wrote:
> > On Fri, Aug 07, 2020 at 05:19:03PM +0200, Marco Elver wrote:
> > 
> > > My hypothesis here is simply that kvm_wait() may be called in a place
> > > where we get the same case I mentioned to Peter,
> > > 
> > >   raw_local_irq_save(); /* or other IRQs off without tracing */
> > >   ...
> > >   kvm_wait() /* IRQ state tracing gets confused */
> > >   ...
> > >   raw_local_irq_restore();
> > > 
> > > and therefore, using raw variants in kvm_wait() works. It's also safe
> > > because it doesn't call any other libraries that would result in corrupt IRQ state AFAIK.
> > 
> > Yes, this is definitely an issue.
> > 
> > Tracing, we also mustn't call into tracing when using raw_local_irq_*().
> > Because then we re-introduce this same issue all over again.
> > 
> > Both halt() and safe_halt() are more paravirt calls, but given we're in
> > a KVM paravirt call already, I suppose we can directly use native_*()
> > here.
> > 
> > Something like so then... I suppose, but then the Xen variants need TLC
> > too.
> 
> Just to be sure I understand you correct:
> 
> You mean that xen_qlock_kick() and xen_qlock_wait() and all functions
> called by those should gain the "notrace" attribute, right?
> 
> I am not sure why the kick variants need it, though. IMO those are
> called only after the lock has been released, so they should be fine
> without notrace.

The issue happens when someone uses arch_spinlock_t under
raw_local_irq_*().

> And again: we shouldn't forget the Hyper-V variants.

Bah, my grep failed :/ Also *groan*, that's calling apic->send_IPI().


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers

2020-08-11 Thread Jürgen Groß

On 11.08.20 09:41, Peter Zijlstra wrote:

On Fri, Aug 07, 2020 at 05:19:03PM +0200, Marco Elver wrote:


My hypothesis here is simply that kvm_wait() may be called in a place
where we get the same case I mentioned to Peter,

raw_local_irq_save(); /* or other IRQs off without tracing */
...
kvm_wait() /* IRQ state tracing gets confused */
...
raw_local_irq_restore();

and therefore, using raw variants in kvm_wait() works. It's also safe
because it doesn't call any other libraries that would result in corrupt IRQ state AFAIK.


Yes, this is definitely an issue.

Tracing, we also mustn't call into tracing when using raw_local_irq_*().
Because then we re-introduce this same issue all over again.

Both halt() and safe_halt() are more paravirt calls, but given we're in
a KVM paravirt call already, I suppose we can directly use native_*()
here.

Something like so then... I suppose, but then the Xen variants need TLC
too.


Just to be sure I understand you correct:

You mean that xen_qlock_kick() and xen_qlock_wait() and all functions
called by those should gain the "notrace" attribute, right?

I am not sure why the kick variants need it, though. IMO those are
called only after the lock has been released, so they should be fine
without notrace.

And again: we shouldn't forget the Hyper-V variants.


Juergen
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers

2020-08-11 Thread Peter Zijlstra
On Fri, Aug 07, 2020 at 05:19:03PM +0200, Marco Elver wrote:

> My hypothesis here is simply that kvm_wait() may be called in a place
> where we get the same case I mentioned to Peter,
> 
>   raw_local_irq_save(); /* or other IRQs off without tracing */
>   ...
>   kvm_wait() /* IRQ state tracing gets confused */
>   ...
>   raw_local_irq_restore();
> 
> and therefore, using raw variants in kvm_wait() works. It's also safe
> because it doesn't call any other libraries that would result in corrupt IRQ state AFAIK.

Yes, this is definitely an issue.

Tracing, we also mustn't call into tracing when using raw_local_irq_*().
Because then we re-introduce this same issue all over again.

Both halt() and safe_halt() are more paravirt calls, but given we're in
a KVM paravirt call already, I suppose we can directly use native_*()
here.

Something like so then... I suppose, but then the Xen variants need TLC
too.

---
 arch/x86/include/asm/irqflags.h |  4 ++--
 arch/x86/include/asm/kvm_para.h | 18 +++++++++---------
 arch/x86/kernel/kvm.c   | 14 +++++++-------
 3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 02a0cf547d7b..7c614db25274 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -54,13 +54,13 @@ static __always_inline void native_irq_enable(void)
asm volatile("sti": : :"memory");
 }

-static inline __cpuidle void native_safe_halt(void)
+static __always_inline __cpuidle void native_safe_halt(void)
 {
mds_idle_clear_cpu_buffers();
asm volatile("sti; hlt": : :"memory");
 }

-static inline __cpuidle void native_halt(void)
+static __always_inline __cpuidle void native_halt(void)
 {
mds_idle_clear_cpu_buffers();
asm volatile("hlt": : :"memory");
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 49d3a9edb06f..90f7ea58ebb0 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -30,7 +30,7 @@ static inline bool kvm_check_and_clear_guest_paused(void)
  * noted by the particular hypercall.
  */

-static inline long kvm_hypercall0(unsigned int nr)
+static __always_inline long kvm_hypercall0(unsigned int nr)
 {
long ret;
asm volatile(KVM_HYPERCALL
@@ -40,7 +40,7 @@ static inline long kvm_hypercall0(unsigned int nr)
return ret;
 }

-static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
+static __always_inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
 {
long ret;
asm volatile(KVM_HYPERCALL
@@ -50,8 +50,8 @@ static inline long kvm_hypercall1(unsigned int nr, unsigned long p1)
return ret;
 }

-static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
- unsigned long p2)
+static __always_inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
+  unsigned long p2)
 {
long ret;
asm volatile(KVM_HYPERCALL
@@ -61,8 +61,8 @@ static inline long kvm_hypercall2(unsigned int nr, unsigned long p1,
return ret;
 }

-static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
- unsigned long p2, unsigned long p3)
+static __always_inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
+  unsigned long p2, unsigned long p3)
 {
long ret;
asm volatile(KVM_HYPERCALL
@@ -72,9 +72,9 @@ static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
return ret;
 }

-static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
- unsigned long p2, unsigned long p3,
- unsigned long p4)
+static __always_inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
+  unsigned long p2, unsigned long p3,
+  unsigned long p4)
 {
long ret;
asm volatile(KVM_HYPERCALL
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 233c77d056c9..15f8dfd8812d 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -779,7 +779,7 @@ arch_initcall(kvm_alloc_cpumask);
 #ifdef CONFIG_PARAVIRT_SPINLOCKS

 /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
-static void kvm_kick_cpu(int cpu)
+static notrace void kvm_kick_cpu(int cpu)
 {
int apicid;
unsigned long flags = 0;
@@ -790,14 +790,14 @@ static void kvm_kick_cpu(int cpu)

 #include <asm/qspinlock.h>

-static void kvm_wait(u8 *ptr, u8 val)
+static notrace void kvm_wait(u8 *ptr, u8 val)
 {
unsigned long flags;

if (in_nmi())
return;

-   local_irq_save(flags);
+   raw_local_irq_save(flags);

if (READ_ONCE(*ptr) != val)
goto out;
@@ -808,16 +808,16 @@ static void kvm_wait(u8 *ptr, u8 val)
 * in irq spinlock slowpath and no spuriou

Re: [PATCH v4 0/6] mm / virtio-mem: support ZONE_MOVABLE

2020-08-11 Thread David Hildenbrand
On 11.08.20 04:21, Andrew Morton wrote:
> On Mon, 10 Aug 2020 09:56:32 +0200 David Hildenbrand  wrote:
> 
>> On 04.08.20 21:41, David Hildenbrand wrote:
>>> @Andrew can we give this a churn and consider it for v5.9 in case there
>>> are no more comments?
>>
>> @Andrew, Ping, so I assume we'll target v5.10?
> 
> Yep, sorry.  Merging a significant patch series during the merge
> window(!) would be quite extraordinary and I don't think that anything
> in this patchset justifies such an action?
> 

Okay, now I know you are aware of this series :)

First, I thought #1 would be worthwhile for v5.9, but it looks like it's not
actually necessary. I'll respin the whole series for v5.10.

-- 
Thanks,

David / dhildenb

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86/paravirt: Add missing noinstr to arch_local*() helpers

2020-08-11 Thread Jürgen Groß

On 11.08.20 09:00, Marco Elver wrote:

On Fri, 7 Aug 2020 at 17:19, Marco Elver  wrote:

On Fri, Aug 07, 2020 at 02:08PM +0200, Marco Elver wrote:

On Fri, 7 Aug 2020 at 14:04, Jürgen Groß  wrote:


On 07.08.20 13:38, Marco Elver wrote:

On Fri, Aug 07, 2020 at 12:35PM +0200, Jürgen Groß wrote:

...

I think CONFIG_PARAVIRT_XXL shouldn't matter, but I'm not completely
sure about that. CONFIG_PARAVIRT_SPINLOCKS would be my primary suspect.


Yes, PARAVIRT_XXL doesn't make a difference. When disabling
PARAVIRT_SPINLOCKS, however, the warnings go away.


Thanks for testing!

I take it you are doing the tests in a KVM guest?


Yes, correct.


If so I have a gut feeling that the use of local_irq_save() and
local_irq_restore() in kvm_wait() might be fishy. I might be completely
wrong here, though.


Happy to help debug more, although I might need patches or pointers
what to play with.


BTW, I think Xen's variant of pv spinlocks is fine (no playing with IRQ
on/off).

Hyper-V seems to do the same as KVM, and kicking another vcpu could be
problematic as well, as it is just using IPI.


I experimented a bit more, and the below patch seems to solve the
warnings. However, that was based on your pointer about kvm_wait(), and
I can't quite tell if it is the right solution.

My hypothesis here is simply that kvm_wait() may be called in a place
where we get the same case I mentioned to Peter,

 raw_local_irq_save(); /* or other IRQs off without tracing */
 ...
 kvm_wait() /* IRQ state tracing gets confused */
 ...
 raw_local_irq_restore();

and therefore, using raw variants in kvm_wait() works. It's also safe
because it doesn't call any other libraries that would result in corrupt
IRQ state AFAIK.


Just to follow-up, it'd still be nice to fix this. Suggestions?

I could send the below as a patch, but can only go off my above
hypothesis and the fact that syzbot is happier, so not entirely
convincing.


Peter has told me via IRC he will look soon further into this.

Your finding suggests that the pv-lock implementation for Hyper-V
needs some tweaking, too. For that purpose I'm adding Wei to Cc.


Juergen



Thanks,
-- Marco


-- >8 --

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 233c77d056c9..1d412d1466f0 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -797,7 +797,7 @@ static void kvm_wait(u8 *ptr, u8 val)
 if (in_nmi())
 return;

-   local_irq_save(flags);
+   raw_local_irq_save(flags);

 if (READ_ONCE(*ptr) != val)
 goto out;
@@ -810,10 +810,10 @@ static void kvm_wait(u8 *ptr, u8 val)
 if (arch_irqs_disabled_flags(flags))
 halt();
 else
-   safe_halt();
+   raw_safe_halt();

  out:
-   local_irq_restore(flags);
+   raw_local_irq_restore(flags);
  }

  #ifdef CONFIG_X86_32




___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization