Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-23 Thread Jason Wang
On Wed, Nov 24, 2021 at 3:22 PM Michael S. Tsirkin  wrote:
>
> On Wed, Nov 24, 2021 at 10:33:28AM +0800, Jason Wang wrote:
> > On Wed, Nov 24, 2021 at 10:26 AM Jason Wang  wrote:
> > >
> > > On Wed, Nov 24, 2021 at 9:30 AM Michael Ellerman  
> > > wrote:
> > > >
> > > > "Michael S. Tsirkin"  writes:
> > > > > On Tue, Nov 23, 2021 at 10:25:20AM +0800, Jason Wang wrote:
> > > > >> On Tue, Nov 23, 2021 at 4:24 AM Halil Pasic  
> > > > >> wrote:
> > > > >> >
> > > > >> > On Mon, 22 Nov 2021 14:25:26 +0800
> > > > >> > Jason Wang  wrote:
> > > > >> >
> > > > >> > > I think the fixes are:
> > > > >> > >
> > > > >> > > 1) fixing the vhost vsock
> > > > >> > > 2) use suppress_used_validation=true to let the vsock driver
> > > > >> > > validate the in-buffer length
> > > > >> > > 3) probably a new feature so the driver can only enable the 
> > > > >> > > validation
> > > > >> > > when the feature is enabled.
> > > > >> >
> > > > >> > I'm not sure, I would consider a F_DEV_Y_FIXED_BUG_X a perfectly 
> > > > >> > good
> > > > >> > feature. Frankly the set of such bugs is device implementation
> > > > >> > specific and it makes little sense to specify a feature bit
> > > > >> > that says the device implementation claims to adhere to some
> > > > >> > aspect of the specification. Also what would be the semantic
> > > > >> > of not negotiating F_DEV_Y_FIXED_BUG_X?
> > > > >>
> > > > >> Yes, I agree. Rethinking the feature bit, it seems unnecessary,
> > > > >> especially considering the driver should not care about the used
> > > > >> length for tx.
> > > > >>
> > > > >> >
> > > > >> > On the other hand I see no other way to keep the validation
> > > > >> > permanently enabled for fixed implementations, and get around the 
> > > > >> > problem
> > > > >> > with broken implementations. So we could have something like
> > > > >> > VHOST_USED_LEN_STRICT.
> > > > >>
> > > > >> It's more about a choice based on the driver's knowledge. For vsock TX it
> > > > >> should be fine. If we introduce a parameter and disable it by 
> > > > >> default,
> > > > >> it won't be very useful.
> > > > >>
> > > > >> >
> > > > >> > Maybe, we can also think of 'warn and don't alter behavior' 
> > > > >> > instead of
> > > > >> > 'warn' and alter behavior. Or maybe even not having such checks on 
> > > > >> > in
> > > > >> > production, but only when testing.
> > > > >>
> > > > >> I think there's an agreement that virtio drivers need more hardening;
> > > > >> that's why a lot of patches were merged, especially considering the
> > > > >> new requirements coming from confidential computing, smart NICs and
> > > > >> VDUSE. For virtio drivers, enabling the validation may help to
> > > > >>
> > > > >> 1) protect the driver from the buggy and malicious device
> > > > >> 2) uncover the bugs of the devices (as vsock did, and probably rpmsg)
> > > > >> 3) force the driver to be smart and do the validation itself, so
> > > > >> that we can finally remove the validation in the core
> > > > >>
> > > > >> So I'd like to keep it enabled.
> > > > >>
> > > > >> Thanks
> > > > >
> > > > > Let's see how far we can get. But yes, maybe we were too aggressive in
> > > > > breaking things by default; a warning might be a better choice for a
> > > > > couple of cycles.
> > >
> > > Ok, considering we saw the issues with balloons I think I can post a
> > > patch to use warn instead. I wonder if we need to taint the kernel in
> > > this case.
> >
> > Rethinking this: considering we still have some time, I'd prefer to
> > convert the drivers to validate the length themselves. Does this make sense?
> >
> > Thanks
>
> That's separate but let's stop crashing guests for people ASAP.

Ok, will post a patch soon.

Thanks

>
>
> > >
> > > >
> > > > This series appears to break the virtio_balloon driver as well.
> > > >
> > > > The symptom is soft lockup warnings, eg:
> > > >
> > > >   INFO: task kworker/1:1:109 blocked for more than 614 seconds.
> > > > Not tainted 5.16.0-rc2-gcc-10.3.0 #21
> > > >   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
> > > > message.
> > > >   task:kworker/1:1 state:D stack:12496 pid:  109 ppid: 2 
> > > > flags:0x0800
> > > >   Workqueue: events_freezable update_balloon_size_func
> > > >   Call Trace:
> > > >   [c3cef7c0] [c3cef820] 0xc3cef820 (unreliable)
> > > >   [c3cef9b0] [c001e238] __switch_to+0x1e8/0x2f0
> > > >   [c3cefa10] [c0f0a00c] __schedule+0x2cc/0xb50
> > > >   [c3cefae0] [c0f0a8fc] schedule+0x6c/0x140
> > > >   [c3cefb10] [c095b6c4] tell_host+0xe4/0x130
> > > >   [c3cefba0] [c095d234] 
> > > > update_balloon_size_func+0x394/0x3f0
> > > >   [c3cefc70] [c0178064] process_one_work+0x2c4/0x5b0
> > > >   [c3cefd10] [c01783f8] worker_thread+0xa8/0x640
> > > >   [c3cefda0] [c0185444] kthread+0x1b4/0x1c0
> > > >   [c3cefe10] [c000cee4] ret_from_kernel_thread+0x5c/0x64

Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-23 Thread Michael S. Tsirkin
On Wed, Nov 24, 2021 at 10:33:28AM +0800, Jason Wang wrote:
> On Wed, Nov 24, 2021 at 10:26 AM Jason Wang  wrote:
> >
> > On Wed, Nov 24, 2021 at 9:30 AM Michael Ellerman  
> > wrote:
> > >
> > > "Michael S. Tsirkin"  writes:
> > > > On Tue, Nov 23, 2021 at 10:25:20AM +0800, Jason Wang wrote:
> > > >> On Tue, Nov 23, 2021 at 4:24 AM Halil Pasic  
> > > >> wrote:
> > > >> >
> > > >> > On Mon, 22 Nov 2021 14:25:26 +0800
> > > >> > Jason Wang  wrote:
> > > >> >
> > > >> > > I think the fixes are:
> > > >> > >
> > > >> > > 1) fixing the vhost vsock
> > > >> > > 2) use suppress_used_validation=true to let the vsock driver
> > > >> > > validate the in-buffer length
> > > >> > > 3) probably a new feature so the driver can only enable the 
> > > >> > > validation
> > > >> > > when the feature is enabled.
> > > >> >
> > > >> > I'm not sure, I would consider a F_DEV_Y_FIXED_BUG_X a perfectly good
> > > >> > feature. Frankly the set of such bugs is device implementation
> > > >> > specific and it makes little sense to specify a feature bit
> > > >> > that says the device implementation claims to adhere to some
> > > >> > aspect of the specification. Also what would be the semantic
> > > >> > of not negotiating F_DEV_Y_FIXED_BUG_X?
> > > >>
> > > >> Yes, I agree. Rethinking the feature bit, it seems unnecessary,
> > > >> especially considering the driver should not care about the used
> > > >> length for tx.
> > > >>
> > > >> >
> > > >> > On the other hand I see no other way to keep the validation
> > > >> > permanently enabled for fixed implementations, and get around the 
> > > >> > problem
> > > >> > with broken implementations. So we could have something like
> > > >> > VHOST_USED_LEN_STRICT.
> > > >>
> > > >> It's more about a choice based on the driver's knowledge. For vsock TX it
> > > >> should be fine. If we introduce a parameter and disable it by default,
> > > >> it won't be very useful.
> > > >>
> > > >> >
> > > >> > Maybe, we can also think of 'warn and don't alter behavior' instead 
> > > >> > of
> > > >> > 'warn' and alter behavior. Or maybe even not having such checks on in
> > > >> > production, but only when testing.
> > > >>
> > > >> I think there's an agreement that virtio drivers need more hardening;
> > > >> that's why a lot of patches were merged, especially considering the
> > > >> new requirements coming from confidential computing, smart NICs and
> > > >> VDUSE. For virtio drivers, enabling the validation may help to
> > > >>
> > > >> 1) protect the driver from the buggy and malicious device
> > > >> 2) uncover the bugs of the devices (as vsock did, and probably rpmsg)
> > > >> 3) force the driver to be smart and do the validation itself, so
> > > >> that we can finally remove the validation in the core
> > > >>
> > > >> So I'd like to keep it enabled.
> > > >>
> > > >> Thanks
> > > >
> > > > Let's see how far we can get. But yes, maybe we were too aggressive in
> > > > breaking things by default; a warning might be a better choice for a
> > > > couple of cycles.
> >
> > Ok, considering we saw the issues with balloons I think I can post a
> > patch to use warn instead. I wonder if we need to taint the kernel in
> > this case.
> 
> Rethinking this: considering we still have some time, I'd prefer to
> convert the drivers to validate the length themselves. Does this make sense?
> 
> Thanks

That's separate but let's stop crashing guests for people ASAP.


> >
> > >
> > > This series appears to break the virtio_balloon driver as well.
> > >
> > > The symptom is soft lockup warnings, eg:
> > >
> > >   INFO: task kworker/1:1:109 blocked for more than 614 seconds.
> > > Not tainted 5.16.0-rc2-gcc-10.3.0 #21
> > >   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
> > > message.
> > >   task:kworker/1:1 state:D stack:12496 pid:  109 ppid: 2 
> > > flags:0x0800
> > >   Workqueue: events_freezable update_balloon_size_func
> > >   Call Trace:
> > >   [c3cef7c0] [c3cef820] 0xc3cef820 (unreliable)
> > >   [c3cef9b0] [c001e238] __switch_to+0x1e8/0x2f0
> > >   [c3cefa10] [c0f0a00c] __schedule+0x2cc/0xb50
> > >   [c3cefae0] [c0f0a8fc] schedule+0x6c/0x140
> > >   [c3cefb10] [c095b6c4] tell_host+0xe4/0x130
> > >   [c3cefba0] [c095d234] 
> > > update_balloon_size_func+0x394/0x3f0
> > >   [c3cefc70] [c0178064] process_one_work+0x2c4/0x5b0
> > >   [c3cefd10] [c01783f8] worker_thread+0xa8/0x640
> > >   [c3cefda0] [c0185444] kthread+0x1b4/0x1c0
> > >   [c3cefe10] [c000cee4] ret_from_kernel_thread+0x5c/0x64
> > >
> > > Similar backtrace reported here by Luis:
> > >
> > >   https://lore.kernel.org/lkml/yy2duti0wayak...@bombadil.infradead.org/
> > >
> > > Bisect points to:
> > >
> > >   # first bad commit: [939779f5152d161b34f612af29e7dc1ac4472fcf] 
> > > virtio_ring: validate used buffer length
> > >
> > > Adding sup

Re: [PATCH 23/29] xfs: use IOMAP_DAX to check for DAX mappings

2021-11-23 Thread Christoph Hellwig
On Tue, Nov 23, 2021 at 03:01:24PM -0800, Darrick J. Wong wrote:
> On Tue, Nov 09, 2021 at 09:33:03AM +0100, Christoph Hellwig wrote:
> > Use the explicit DAX flag instead of checking the inode flag in the
> > iomap code.
> > 
> > Signed-off-by: Christoph Hellwig 
> 
> Any particular reason to pass this in as a flag vs. querying the inode?

Same reason as the addition of IOMAP_DAX.  But I think I'll redo this
a bit to do the XFS parameter passing first and then actually check
IOMAP_DAX together with introducing it to make it all a little more clear.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 22/29] iomap: add a IOMAP_DAX flag

2021-11-23 Thread Christoph Hellwig
On Tue, Nov 23, 2021 at 06:47:10PM -0800, Dan Williams wrote:
> On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
> >
> > Add a flag so that the file system can easily detect DAX operations.
> 
> Looks ok, but I would have preferred a quick note about the rationale
> here before needing to read other patches to figure that out.

The reason is to only apply the DAX partition offsets to actual DAX
operations, and not to e.g. fiemap.  I'll document that more clearly.


Re: [PATCH 25/29] dax: return the partition offset from fs_dax_get_by_bdev

2021-11-23 Thread Christoph Hellwig
On Tue, Nov 23, 2021 at 06:56:29PM -0800, Dan Williams wrote:
> On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
> >
> > Prepare from removing the block_device from the DAX I/O path by returning
> 
> s/from removing/for the removal of/

Fixed.

> > td->dm_dev.bdev = bdev;
> > -   td->dm_dev.dax_dev = fs_dax_get_by_bdev(bdev);
> > +   td->dm_dev.dax_dev = fs_dax_get_by_bdev(bdev, &part_off);
> 
> Perhaps allow NULL as an argument for callers that do not care about
> the start offset?

All callers currently care; dm just has another way to get at the
information.  So for now I'd like to not add the NULL special case,
but we can reconsider that as needed if/when more callers show up.
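Dan's suggestion above, letting callers that don't care pass NULL for the out-parameter, is a common C idiom. A small user-space sketch with hypothetical names (not the real kernel API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the API shape being reviewed: a lookup
 * that reports a start offset through an out-parameter.  The variant
 * below tolerates NULL for callers that do not need the offset,
 * which is the alternative Dan suggests; none of these names are the
 * real kernel API.
 */
struct fake_dax_device {
    uint64_t start_off; /* byte offset of the partition start */
};

static struct fake_dax_device fake_dev = { .start_off = 4096 };

static struct fake_dax_device *fake_get_by_bdev(uint64_t *start_off)
{
    if (start_off) /* NULL means "caller does not need the offset" */
        *start_off = fake_dev.start_off;
    return &fake_dev;
}
```

The kernel patch instead requires a non-NULL pointer, which is reasonable while every caller cares; the NULL check is easy to add later if callers that ignore the offset appear.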


Re: [PATCH 21/29] xfs: move dax device handling into xfs_{alloc,free}_buftarg

2021-11-23 Thread Christoph Hellwig
On Tue, Nov 23, 2021 at 06:40:47PM -0800, Dan Williams wrote:
> On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
> >
> > Hide the DAX device lookup from the xfs_super.c code.
> >
> > Reviewed-by: Christoph Hellwig 
> 
> That's an interesting spelling of "Signed-off-by", but patch looks
> good to me too. I would have expected a robot to complain about
> missing sign-off?

Hah.  I'll fix it up.


Re: [PATCH 20/29] ext4: cleanup the dax handling in ext4_fill_super

2021-11-23 Thread Christoph Hellwig
On Tue, Nov 23, 2021 at 02:54:30PM -0800, Darrick J. Wong wrote:
> Nit: no space before the paren  ^ here.

Fixed.


Re: [PATCH 18/29] fsdax: decouple zeroing from the iomap buffered I/O code

2021-11-23 Thread Christoph Hellwig
On Tue, Nov 23, 2021 at 02:53:15PM -0800, Darrick J. Wong wrote:
> > -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
> > +static loff_t dax_zero_iter(struct iomap_iter *iter, bool *did_zero)
> 
> Shouldn't this return value remain s64 to match iomap_iter.processed?

I'll switch it over.  Given that loff_t is always the same as s64
it shouldn't really matter.

(same for the others)


Re: [PATCH 18/29] fsdax: decouple zeroing from the iomap buffered I/O code

2021-11-23 Thread Christoph Hellwig
On Tue, Nov 23, 2021 at 01:46:35PM -0800, Dan Williams wrote:
> > +   const struct iomap_ops *ops)
> > +{
> > +   unsigned int blocksize = i_blocksize(inode);
> > +   unsigned int off = pos & (blocksize - 1);
> > +
> > +   /* Block boundary? Nothing to do */
> > +   if (!off)
> > +   return 0;
> 
> It took me a moment to figure out why this was correct. I see it was
> also copied from iomap_truncate_page(). It makes sense for DAX where
> blocksize >= PAGE_SIZE so it's always the case that the amount of
> capacity to zero relative to a page is from @pos to the end of the
> block. Is there something else that protects the blocksize < PAGE_SIZE
> case outside of DAX?
> 
> Nothing to change for this patch, just a question I had while reviewing.

This is a helper for truncate ->setattr, where everything outside the
block is deallocated.  So zeroing is only needed inside the block.
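The boundary logic Christoph describes can be sketched in a few lines of user-space C; blocksize is assumed to be a power of two, as in the kernel helper:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the logic discussed above: on truncate, everything
 * outside the block containing @pos is deallocated, so only the tail
 * of that one block needs zeroing.  Assumes a power-of-two blocksize,
 * as in the kernel helper.
 */
static uint64_t bytes_to_zero(uint64_t pos, unsigned int blocksize)
{
    unsigned int off = pos & (blocksize - 1); /* offset within block */

    if (!off)
        return 0;               /* block boundary: nothing to do */
    return blocksize - off;     /* zero from @pos to end of block */
}
```

This also shows why the `!off` early return is correct for DAX, where blocksize >= PAGE_SIZE: the zeroed span always runs from @pos to the end of the block.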


Re: [PATCH 17/29] fsdax: factor out a dax_memzero helper

2021-11-23 Thread Christoph Hellwig
On Tue, Nov 23, 2021 at 01:22:13PM -0800, Dan Williams wrote:
> On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
> >
> > Factor out a helper for the "manual" zeroing of a DAX range to clean
> > up dax_iomap_zero a lot.
> >
> 
> Small / optional fixup below:

Incorporated.


Re: [PATCH 14/29] fsdax: simplify the pgoff calculation

2021-11-23 Thread Christoph Hellwig
On Tue, Nov 23, 2021 at 02:36:42PM -0800, Darrick J. Wong wrote:
> > -   phys_addr_t phys_off = (start_sect + sector) * 512;
> > -
> > -   if (pgoff)
> > -   *pgoff = PHYS_PFN(phys_off);
> > -   if (phys_off % PAGE_SIZE || size % PAGE_SIZE)
> 
> AFAICT, we're relying on fs_dax_get_by_bdev to have validated this
> previously, which is why the error return stuff goes away?

Exactly.


Re: [PATCH 08/29] dax: remove dax_capable

2021-11-23 Thread Christoph Hellwig
On Tue, Nov 23, 2021 at 02:31:23PM -0800, Darrick J. Wong wrote:
> > -   struct super_block  *sb = mp->m_super;
> > -
> > -   if (!xfs_buftarg_is_dax(sb, mp->m_ddev_targp) &&
> > -  (!mp->m_rtdev_targp || !xfs_buftarg_is_dax(sb, mp->m_rtdev_targp))) {
> > +   if (!mp->m_ddev_targp->bt_daxdev &&
> > +  (!mp->m_rtdev_targp || !mp->m_rtdev_targp->bt_daxdev)) {
> 
> Nit: This  ^ paren should be indented one more column because it's a
> sub-clause of the if() test.

Done.

> Nit: xfs_alert() already adds a newline to the end of the format string.

Already done in the current tree.


Re: [PATCH 06/29] dax: move the partition alignment check into fs_dax_get_by_bdev

2021-11-23 Thread Christoph Hellwig
On Tue, Nov 23, 2021 at 02:25:55PM -0800, Darrick J. Wong wrote:
> > +   if ((get_start_sect(bdev) * SECTOR_SIZE) % PAGE_SIZE ||
> > +   (bdev_nr_sectors(bdev) * SECTOR_SIZE) % PAGE_SIZE) {
> 
> Do we have to be careful about 64-bit division here, or do we not
> support DAX on 32-bit?

I can't find anything in the Kconfig limiting DAX to 64-bit.  But
then again the existing code has divisions like this, so the compiler
is probably smart enough to turn them into shifts.

> > +   pr_info("%pg: error: unaligned partition for dax\n", bdev);
> 
> I also wonder if this should be ratelimited...?

This happens once (or maybe three times for XFS with rt and log devices)
at mount time, so I see no need for a ratelimit.
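A user-space sketch of the alignment check under discussion: because both sizes are powers of two, the 64-bit modulo operations reduce to bitwise ANDs, which is why the divisions are cheap even on 32-bit builds. Names and constants below are illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u
#define DAX_PAGE_SIZE 4096u /* illustrative page size */

/*
 * User-space sketch of the partition alignment check quoted above:
 * both the partition start and size must be page aligned for DAX.
 * Because both divisors are powers of two, the compiler turns the
 * 64-bit modulo into a bitwise AND.
 */
static int dax_partition_aligned(uint64_t start_sect, uint64_t nr_sects)
{
    uint64_t start_off = start_sect * SECTOR_SIZE;
    uint64_t part_size = nr_sects * SECTOR_SIZE;

    /* x % DAX_PAGE_SIZE is equivalent to x & (DAX_PAGE_SIZE - 1) */
    return !(start_off % DAX_PAGE_SIZE) && !(part_size % DAX_PAGE_SIZE);
}
```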


Re: [PATCH 29/29] fsdax: don't require CONFIG_BLOCK

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> The file system DAX code no longer requires the block layer.  So allow
> building a kernel with fuse DAX but without the block layer.

Looks good to me.

Reviewed-by: Dan Williams 


Re: [PATCH 28/29] iomap: build the block based code conditionally

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Only build the block based iomap code if CONFIG_BLOCK is set.  Currently
> that is always the case, but it will change soon.

Looks good.

Reviewed-by: Dan Williams 


Re: [PATCH 27/29] dax: fix up some of the block device related ifdefs

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> The DAX device <-> block device association is only enabled if
> CONFIG_BLOCK is enabled.  Update dax.h to account for that and use
> the right conditions for the fs_put_dax stub as well.

Looks good to me.

Reviewed-by: Dan Williams 

>
> Signed-off-by: Christoph Hellwig 


Re: [PATCH 26/29] fsdax: shift partition offset handling into the file systems

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Remove the last user of ->bdev in dax.c by requiring the file system to
> pass in an address that already includes the DAX offset.  As part of
> this, only set ->bdev or ->daxdev when actually required in the
> ->iomap_begin methods.

Changes look good except for what looks like an argument position
fixup needed for an xfs_bmbt_to_iomap() caller below...

>
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/dax.c |  6 +-
>  fs/erofs/data.c  | 11 --
>  fs/erofs/internal.h  |  1 +
>  fs/ext2/inode.c  |  8 +--
>  fs/ext4/inode.c  | 16 +-
>  fs/xfs/libxfs/xfs_bmap.c |  4 ++--
>  fs/xfs/xfs_aops.c|  2 +-
>  fs/xfs/xfs_iomap.c   | 45 +---
>  fs/xfs/xfs_iomap.h   |  5 +++--
>  fs/xfs/xfs_pnfs.c|  2 +-
>  10 files changed, 63 insertions(+), 37 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 0bd6cdcbacfc4..2c13c681edf09 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -711,11 +711,7 @@ int dax_invalidate_mapping_entry_sync(struct 
> address_space *mapping,
>
>  static pgoff_t dax_iomap_pgoff(const struct iomap *iomap, loff_t pos)
>  {
> -   phys_addr_t paddr = iomap->addr + (pos & PAGE_MASK) - iomap->offset;
> -
> -   if (iomap->bdev)
> -   paddr += (get_start_sect(iomap->bdev) << SECTOR_SHIFT);
> -   return PHYS_PFN(paddr);
> +   return PHYS_PFN(iomap->addr + (pos & PAGE_MASK) - iomap->offset);
>  }
>
>  static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter 
> *iter)
> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
> index 0e35ef3f9f3d7..9b1bb177ce303 100644
> --- a/fs/erofs/data.c
> +++ b/fs/erofs/data.c
[..]
   }
> @@ -215,9 +218,13 @@ static int erofs_iomap_begin(struct inode *inode, loff_t 
> offset, loff_t length,
> if (ret)
> return ret;
>
> -   iomap->bdev = mdev.m_bdev;
> -   iomap->dax_dev = mdev.m_daxdev;
> iomap->offset = map.m_la;
> +   if (flags & IOMAP_DAX) {
> +   iomap->dax_dev = mdev.m_daxdev;
> +   iomap->offset += mdev.m_dax_part_off;
> +   } else {
> +   iomap->bdev = mdev.m_bdev;
> +   }

Ah, that's what IOMAP_DAX is for, to stop making iomap carry bdev
details unnecessarily.

[..]
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 704292c6ce0c7..74dbf1fd99d39 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -54,7 +54,8 @@ xfs_bmbt_to_iomap(
> struct xfs_inode*ip,
> struct iomap*iomap,
> struct xfs_bmbt_irec*imap,
> -   u16 flags)
> +   unsigned intflags,
> +   u16 iomap_flags)

It would be nice if the compiler could help with making sure that
right 'flags' values are passed to the right 'flags' parameter, but I
can't think of

[..]
> @@ -1053,23 +1061,24 @@ xfs_buffered_write_iomap_begin(
>  */
> xfs_iunlock(ip, XFS_ILOCK_EXCL);
> trace_xfs_iomap_alloc(ip, offset, count, allocfork, &imap);
> -   return xfs_bmbt_to_iomap(ip, iomap, &imap, IOMAP_F_NEW);
> +   return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, IOMAP_F_NEW);
>
>  found_imap:
> xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -   return xfs_bmbt_to_iomap(ip, iomap, &imap, 0);
> +   return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0);

The iomap flags are supposed to be the last argument, right?
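The effect of the quoted dax_iomap_pgoff() hunk can be sketched in user space: before the patch dax.c added the partition start itself, afterwards the file system folds the partition offset into the mapping address, and both forms compute the same pfn. Constants and names below are illustrative stand-ins:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_ 12
#define PAGE_MASK_  (~((1ull << PAGE_SHIFT_) - 1))
#define PHYS_PFN_(x) ((x) >> PAGE_SHIFT_)

/*
 * pgoff_old models the pre-patch code, which adds the partition
 * start inside dax.c; pgoff_new models the post-patch code, where
 * the file system already folded the partition offset into the
 * mapping address.  User-space stand-ins, not the kernel functions.
 */
static uint64_t pgoff_old(uint64_t addr, uint64_t pos, uint64_t off,
                          uint64_t part_start)
{
    return PHYS_PFN_(addr + (pos & PAGE_MASK_) - off + part_start);
}

static uint64_t pgoff_new(uint64_t addr_with_part, uint64_t pos,
                          uint64_t off)
{
    return PHYS_PFN_(addr_with_part + (pos & PAGE_MASK_) - off);
}
```

Shifting the offset into the file systems is what lets dax.c drop its last use of ->bdev.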


Re: [PATCH 25/29] dax: return the partition offset from fs_dax_get_by_bdev

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Prepare from removing the block_device from the DAX I/O path by returning

s/from removing/for the removal of/

> the partition offset from fs_dax_get_by_bdev so that the file systems
> have it at hand for use during I/O.
>
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/dax/super.c | 9 ++---
>  drivers/md/dm.c | 4 ++--
>  fs/erofs/internal.h | 2 ++
>  fs/erofs/super.c| 4 ++--
>  fs/ext2/ext2.h  | 1 +
>  fs/ext2/super.c | 2 +-
>  fs/ext4/ext4.h  | 1 +
>  fs/ext4/super.c | 2 +-
>  fs/xfs/xfs_buf.c| 2 +-
>  fs/xfs/xfs_buf.h| 1 +
>  include/linux/dax.h | 6 --
>  11 files changed, 22 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index c0910687fbcb2..cc32dcf71c116 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -70,17 +70,20 @@ EXPORT_SYMBOL_GPL(dax_remove_host);
>  /**
>   * dax_get_by_host() - temporary lookup mechanism for filesystem-dax
>   * @bdev: block device to find a dax_device for
> + * @start_off: returns the byte offset into the dax_device that @bdev starts
>   */
> -struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
> +struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev, u64 
> *start_off)
>  {
> struct dax_device *dax_dev;
> +   u64 part_size;
> int id;
>
> if (!blk_queue_dax(bdev->bd_disk->queue))
> return NULL;
>
> -   if ((get_start_sect(bdev) * SECTOR_SIZE) % PAGE_SIZE ||
> -   (bdev_nr_sectors(bdev) * SECTOR_SIZE) % PAGE_SIZE) {
> +   *start_off = get_start_sect(bdev) * SECTOR_SIZE;
> +   part_size = bdev_nr_sectors(bdev) * SECTOR_SIZE;
> +   if (*start_off % PAGE_SIZE || part_size % PAGE_SIZE) {
> pr_info("%pg: error: unaligned partition for dax\n", bdev);
> return NULL;
> }
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 282008afc465f..5ea6115d19bdc 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -637,7 +637,7 @@ static int open_table_device(struct table_device *td, 
> dev_t dev,
>  struct mapped_device *md)
>  {
> struct block_device *bdev;
> -
> +   u64 part_off;
> int r;
>
> BUG_ON(td->dm_dev.bdev);
> @@ -653,7 +653,7 @@ static int open_table_device(struct table_device *td, 
> dev_t dev,
> }
>
> td->dm_dev.bdev = bdev;
> -   td->dm_dev.dax_dev = fs_dax_get_by_bdev(bdev);
> +   td->dm_dev.dax_dev = fs_dax_get_by_bdev(bdev, &part_off);

Perhaps allow NULL as an argument for callers that do not care about
the start offset?


Otherwise, looks good / clever.

Reviewed-by: Dan Williams 


Re: [PATCH 24/29] xfs: use xfs_direct_write_iomap_ops for DAX zeroing

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> While the buffered write iomap ops do work due to the fact that zeroing
> never allocates blocks, the DAX zeroing should use the direct ops just
> like actual DAX I/O.
>

I always wondered about this, change looks good to me.

Reviewed-by: Dan Williams 

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/xfs/xfs_iomap.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 8cef3b68cba78..704292c6ce0c7 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1324,7 +1324,7 @@ xfs_zero_range(
>
> if (IS_DAX(inode))
> return dax_zero_range(inode, pos, len, did_zero,
> - &xfs_buffered_write_iomap_ops);
> + &xfs_direct_write_iomap_ops);
> return iomap_zero_range(inode, pos, len, did_zero,
> &xfs_buffered_write_iomap_ops);
>  }
> @@ -1339,7 +1339,7 @@ xfs_truncate_page(
>
> if (IS_DAX(inode))
> return dax_truncate_page(inode, pos, did_zero,
> -   &xfs_buffered_write_iomap_ops);
> +   &xfs_direct_write_iomap_ops);
> return iomap_truncate_page(inode, pos, did_zero,
>&xfs_buffered_write_iomap_ops);
>  }
> --
> 2.30.2
>


Re: [PATCH 23/29] xfs: use IOMAP_DAX to check for DAX mappings

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Use the explicit DAX flag instead of checking the inode flag in the
> iomap code.

It's not immediately clear to me why this is a net benefit; are you
anticipating inode-less operations? With reflink and multi-inode
operations a single iomap flag seems insufficient, no?


Re: [PATCH 22/29] iomap: add a IOMAP_DAX flag

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Add a flag so that the file system can easily detect DAX operations.

Looks ok, but I would have preferred a quick note about the rationale
here before needing to read other patches to figure that out.

If you add that you can add:

Reviewed-by: Dan Williams 

>
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/dax.c  | 7 ---
>  include/linux/iomap.h | 1 +
>  2 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 5b52b878124ac..0bd6cdcbacfc4 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1180,7 +1180,7 @@ int dax_zero_range(struct inode *inode, loff_t pos, 
> loff_t len, bool *did_zero,
> .inode  = inode,
> .pos= pos,
> .len= len,
> -   .flags  = IOMAP_ZERO,
> +   .flags  = IOMAP_DAX | IOMAP_ZERO,
> };
> int ret;
>
> @@ -1308,6 +1308,7 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
> .inode  = iocb->ki_filp->f_mapping->host,
> .pos= iocb->ki_pos,
> .len= iov_iter_count(iter),
> +   .flags  = IOMAP_DAX,
> };
> loff_t done = 0;
> int ret;
> @@ -1461,7 +1462,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
> .inode  = mapping->host,
> .pos= (loff_t)vmf->pgoff << PAGE_SHIFT,
> .len= PAGE_SIZE,
> -   .flags  = IOMAP_FAULT,
> +   .flags  = IOMAP_DAX | IOMAP_FAULT,
> };
> vm_fault_t ret = 0;
> void *entry;
> @@ -1570,7 +1571,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
> struct iomap_iter iter = {
> .inode  = mapping->host,
> .len= PMD_SIZE,
> -   .flags  = IOMAP_FAULT,
> +   .flags  = IOMAP_DAX | IOMAP_FAULT,
> };
> vm_fault_t ret = VM_FAULT_FALLBACK;
> pgoff_t max_pgoff;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 6d1b08d0ae930..146a7e3e3ea11 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -141,6 +141,7 @@ struct iomap_page_ops {
>  #define IOMAP_NOWAIT   (1 << 5) /* do not block */
>  #define IOMAP_OVERWRITE_ONLY   (1 << 6) /* only pure overwrites allowed */
>  #define IOMAP_UNSHARE  (1 << 7) /* unshare_file_range */
> +#define IOMAP_DAX  (1 << 8) /* DAX mapping */
>
>  struct iomap_ops {
> /*
> --
> 2.30.2
>
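A tiny sketch of how a file system's ->iomap_begin can branch on the new flag; IOMAP_DAX's value is taken from the quoted hunk, while the IOMAP_ZERO and IOMAP_FAULT values here are illustrative stand-ins for the same header:

```c
#include <assert.h>

/*
 * IOMAP_DAX matches the quoted include/linux/iomap.h hunk; the other
 * two values are illustrative stand-ins.  Flags combine with bitwise
 * OR, so a single test picks out DAX operations regardless of the
 * operation type they accompany.
 */
#define IOMAP_ZERO  (1 << 1) /* illustrative */
#define IOMAP_FAULT (1 << 3) /* illustrative */
#define IOMAP_DAX   (1 << 8) /* DAX mapping, from the patch above */

static int is_dax_operation(unsigned int flags)
{
    return (flags & IOMAP_DAX) != 0;
}
```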


Re: [PATCH 21/29] xfs: move dax device handling into xfs_{alloc,free}_buftarg

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Hide the DAX device lookup from the xfs_super.c code.
>
> Reviewed-by: Christoph Hellwig 

That's an interesting spelling of "Signed-off-by", but patch looks
good to me too. I would have expected a robot to complain about
missing sign-off?

Reviewed-by: Dan Williams 


Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-23 Thread Jason Wang
On Wed, Nov 24, 2021 at 10:26 AM Jason Wang  wrote:
>
> On Wed, Nov 24, 2021 at 9:30 AM Michael Ellerman  wrote:
> >
> > "Michael S. Tsirkin"  writes:
> > > On Tue, Nov 23, 2021 at 10:25:20AM +0800, Jason Wang wrote:
> > >> On Tue, Nov 23, 2021 at 4:24 AM Halil Pasic  wrote:
> > >> >
> > >> > On Mon, 22 Nov 2021 14:25:26 +0800
> > >> > Jason Wang  wrote:
> > >> >
> > >> > > I think the fixes are:
> > >> > >
> > >> > > 1) fixing the vhost vsock
> > >> > > 2) use suppress_used_validation=true to let the vsock driver
> > >> > > validate the in-buffer length
> > >> > > 3) probably a new feature so the driver can only enable the 
> > >> > > validation
> > >> > > when the feature is enabled.
> > >> >
> > >> > I'm not sure, I would consider a F_DEV_Y_FIXED_BUG_X a perfectly good
> > >> > feature. Frankly the set of such bugs is device implementation
> > >> > specific and it makes little sense to specify a feature bit
> > >> > that says the device implementation claims to adhere to some
> > >> > aspect of the specification. Also what would be the semantic
> > >> > of not negotiating F_DEV_Y_FIXED_BUG_X?
> > >>
> > >> Yes, I agree. Rethinking the feature bit, it seems unnecessary,
> > >> especially considering the driver should not care about the used
> > >> length for tx.
> > >>
> > >> >
> > >> > On the other hand I see no other way to keep the validation
> > >> > permanently enabled for fixed implementations, and get around the 
> > >> > problem
> > >> > with broken implementations. So we could have something like
> > >> > VHOST_USED_LEN_STRICT.
> > >>
> > >> It's more about a choice of the driver's knowledge. For vsock TX it
> > >> should be fine. If we introduce a parameter and disable it by default,
> > >> it won't be very useful.
> > >>
> > >> >
> > >> > Maybe, we can also think of 'warn and don't alter behavior' instead of
> > >> > 'warn' and alter behavior. Or maybe even not having such checks on in
> > >> > production, but only when testing.
> > >>
> > >> I think there's an agreement that virtio drivers need more hardening,
> > >> that's why a lot of patches were merged. Especially considering the
> > >> new requirements came from confidential computing, smart NIC and
> > >> VDUSE. For virtio drivers, enabling the validation may help to
> > >>
> > >> 1) protect the driver from the buggy and malicious device
> > >> 2) uncover the bugs of the devices (as vsock did, and probably rpmsg)
> > >> 3) force us to have a smart driver that can do the validation itself
> > >> then we can finally remove the validation in the core
> > >>
> > >> So I'd like to keep it enabled.
> > >>
> > >> Thanks
> > >
> > > Let's see how far we can get. But yes, maybe we were too aggressive in
> > > breaking things by default, a warning might be a better choice for a
> > > couple of cycles.
>
> Ok, considering we saw the issues with the balloon driver, I think I can
> post a patch to warn instead. I wonder if we need to taint the kernel in
> this case.

Rethinking this: considering we still have some time, I tend to convert the
drivers to validate the length by themselves. Does this make sense?

Thanks

>
> >
> > This series appears to break the virtio_balloon driver as well.
> >
> > The symptom is soft lockup warnings, eg:
> >
> >   INFO: task kworker/1:1:109 blocked for more than 614 seconds.
> > Not tainted 5.16.0-rc2-gcc-10.3.0 #21
> >   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >   task:kworker/1:1 state:D stack:12496 pid:  109 ppid: 2 
> > flags:0x0800
> >   Workqueue: events_freezable update_balloon_size_func
> >   Call Trace:
> >   [c3cef7c0] [c3cef820] 0xc3cef820 (unreliable)
> >   [c3cef9b0] [c001e238] __switch_to+0x1e8/0x2f0
> >   [c3cefa10] [c0f0a00c] __schedule+0x2cc/0xb50
> >   [c3cefae0] [c0f0a8fc] schedule+0x6c/0x140
> >   [c3cefb10] [c095b6c4] tell_host+0xe4/0x130
> >   [c3cefba0] [c095d234] update_balloon_size_func+0x394/0x3f0
> >   [c3cefc70] [c0178064] process_one_work+0x2c4/0x5b0
> >   [c3cefd10] [c01783f8] worker_thread+0xa8/0x640
> >   [c3cefda0] [c0185444] kthread+0x1b4/0x1c0
> >   [c3cefe10] [c000cee4] ret_from_kernel_thread+0x5c/0x64
> >
> > Similar backtrace reported here by Luis:
> >
> >   https://lore.kernel.org/lkml/yy2duti0wayak...@bombadil.infradead.org/
> >
> > Bisect points to:
> >
> >   # first bad commit: [939779f5152d161b34f612af29e7dc1ac4472fcf] virtio_ring: validate used buffer length
> >
> > Adding suppress used validation to the virtio balloon driver "fixes" it, eg.
> >
> > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > index c22ff0117b46..a14b82ceebb2 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -1150,6 +1150,7 @@ static unsigned int features[] = {
> >  };
> >
> >  static struct virtio_driver vi

Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-23 Thread Jason Wang
On Wed, Nov 24, 2021 at 9:30 AM Michael Ellerman  wrote:
>
> "Michael S. Tsirkin"  writes:
> > On Tue, Nov 23, 2021 at 10:25:20AM +0800, Jason Wang wrote:
> >> On Tue, Nov 23, 2021 at 4:24 AM Halil Pasic  wrote:
> >> >
> >> > On Mon, 22 Nov 2021 14:25:26 +0800
> >> > Jason Wang  wrote:
> >> >
> >> > > I think the fixes are:
> >> > >
> >> > > 1) fixing the vhost vsock
> >> > > 2) use suppress_used_validation=true to let vsock driver to validate
> >> > > the in buffer length
> >> > > 3) probably a new feature so the driver can only enable the validation
> >> > > when the feature is enabled.
> >> >
> >> > I'm not sure, I would consider a F_DEV_Y_FIXED_BUG_X a perfectly good
> >> > feature. Frankly the set of such bugs is device implementation
> >> > specific and it makes little sense to specify a feature bit
> >> > that says the device implementation claims to adhere to some
> >> > aspect of the specification. Also what would be the semantic
> >> > of not negotiating F_DEV_Y_FIXED_BUG_X?
> >>
> >> Yes, I agree. Rethinking the feature bit, it seems unnecessary,
> >> especially considering the driver should not care about the used
> >> length for tx.
> >>
> >> >
> >> > On the other hand I see no other way to keep the validation
> >> > permanently enabled for fixed implementations, and get around the problem
> >> > with broken implementations. So we could have something like
> >> > VHOST_USED_LEN_STRICT.
> >>
> >> It's more about a choice of the driver's knowledge. For vsock TX it
> >> should be fine. If we introduce a parameter and disable it by default,
> >> it won't be very useful.
> >>
> >> >
> >> > Maybe, we can also think of 'warn and don't alter behavior' instead of
> >> > 'warn' and alter behavior. Or maybe even not having such checks on in
> >> > production, but only when testing.
> >>
> >> I think there's an agreement that virtio drivers need more hardening,
> >> that's why a lot of patches were merged. Especially considering the
> >> new requirements came from confidential computing, smart NIC and
> >> VDUSE. For virtio drivers, enabling the validation may help to
> >>
> >> 1) protect the driver from the buggy and malicious device
> >> 2) uncover the bugs of the devices (as vsock did, and probably rpmsg)
> >> 3) force us to have a smart driver that can do the validation itself
> >> then we can finally remove the validation in the core
> >>
> >> So I'd like to keep it enabled.
> >>
> >> Thanks
> >
> > Let's see how far we can get. But yes, maybe we were too aggressive in
> > breaking things by default, a warning might be a better choice for a
> > couple of cycles.

Ok, considering we saw the issues with the balloon driver, I think I can post
a patch to warn instead. I wonder if we need to taint the kernel in this
case.

>
> This series appears to break the virtio_balloon driver as well.
>
> The symptom is soft lockup warnings, eg:
>
>   INFO: task kworker/1:1:109 blocked for more than 614 seconds.
> Not tainted 5.16.0-rc2-gcc-10.3.0 #21
>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   task:kworker/1:1 state:D stack:12496 pid:  109 ppid: 2 
> flags:0x0800
>   Workqueue: events_freezable update_balloon_size_func
>   Call Trace:
>   [c3cef7c0] [c3cef820] 0xc3cef820 (unreliable)
>   [c3cef9b0] [c001e238] __switch_to+0x1e8/0x2f0
>   [c3cefa10] [c0f0a00c] __schedule+0x2cc/0xb50
>   [c3cefae0] [c0f0a8fc] schedule+0x6c/0x140
>   [c3cefb10] [c095b6c4] tell_host+0xe4/0x130
>   [c3cefba0] [c095d234] update_balloon_size_func+0x394/0x3f0
>   [c3cefc70] [c0178064] process_one_work+0x2c4/0x5b0
>   [c3cefd10] [c01783f8] worker_thread+0xa8/0x640
>   [c3cefda0] [c0185444] kthread+0x1b4/0x1c0
>   [c3cefe10] [c000cee4] ret_from_kernel_thread+0x5c/0x64
>
> Similar backtrace reported here by Luis:
>
>   https://lore.kernel.org/lkml/yy2duti0wayak...@bombadil.infradead.org/
>
> Bisect points to:
>
>   # first bad commit: [939779f5152d161b34f612af29e7dc1ac4472fcf] virtio_ring: validate used buffer length
>
> Adding suppress used validation to the virtio balloon driver "fixes" it, eg.
>
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index c22ff0117b46..a14b82ceebb2 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -1150,6 +1150,7 @@ static unsigned int features[] = {
>  };
>
>  static struct virtio_driver virtio_balloon_driver = {
> +   .suppress_used_validation = true,
> .feature_table = features,
> .feature_table_size = ARRAY_SIZE(features),
> .driver.name =  KBUILD_MODNAME,

Looks good, we need a formal patch for this.

And we need to fix QEMU as well, which advertises a non-zero used length for
the inflate/deflate queues:

static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
...
virtqueue_p

Re: update_balloon_size_func blocked for more than 120 seconds

2021-11-23 Thread Michael Ellerman
David Hildenbrand  writes:
> On Thu, Nov 11, 2021 at 11:49 PM Luis Chamberlain  wrote:
>>
>> I get the following splats with a kvm guest in idle, after a few seconds
>> it starts:
>>
>> [  242.412806] INFO: task kworker/6:2:271 blocked for more than 120 seconds.
>> [  242.415790]   Tainted: GE 5.15.0-next-2021 #68
>> [  242.417755] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
>> this message.
>> [  242.418332] task:kworker/6:2 state:D stack:0 pid:  271 ppid: 2 
>> flags:0x4000
>> [  242.418954] Workqueue: events_freezable update_balloon_size_func 
>> [virtio_balloon]
>> [  242.419518] Call Trace:
>> [  242.419709]  
>> [  242.419873]  __schedule+0x2fd/0x990
>> [  242.420142]  schedule+0x4e/0xc0
>> [  242.420382]  tell_host+0xaa/0xf0 [virtio_balloon]
>> [  242.420757]  ? do_wait_intr_irq+0xa0/0xa0
>> [  242.421065]  update_balloon_size_func+0x2c9/0x2e0 [virtio_balloon]
>> [  242.421527]  process_one_work+0x1e5/0x3c0
>> [  242.421833]  worker_thread+0x50/0x3b0
>> [  242.422204]  ? rescuer_thread+0x370/0x370
>> [  242.422507]  kthread+0x169/0x190
>> [  242.422754]  ? set_kthread_struct+0x40/0x40
>> [  242.423073]  ret_from_fork+0x1f/0x30
>> [  242.423347]  
>>
>> And this goes on endlessly. The last one says it blocked for more than 1208
>> seconds. This was not happening until the last few weeks but I see no
>> relevant recent commits for virtio_balloon, so the related change could
>> be elsewhere.
>
> We're stuck somewhere in:
>
> wq: update_balloon_size_func()->fill_balloon()->tell_host()
>
> Most probably in wait_event().
>
>
> I am no waitqueue expert, but my best guess would be that we're waiting
> more than 2 minutes on a host reply with TASK_UNINTERRUPTIBLE. At least
> that's my interpretation,
>
> In case we're really stuck for more than 2 minutes, the hypervisor might
> not be processing our requests anymore -- or it's not getting processed for
> some other reason (or the waitqueue is not getting woken up due to some
> other BUG).
>
> IIUC, we can sleep longer via wait_event_interruptible(); TASK_UNINTERRUPTIBLE
> seems to be the issue that triggers the warning. But changing that might
> just be hiding the fact that we're waiting more than 2 minutes on a reply.
>
>>
>> I could bisect but first I figured I'd check to see if someone already
>> had spotted this.
>
> Bisecting would be awesome, then we might at least know if this is a
> guest or a hypervisor issue.

I see this on ppc64le also.

I bisected it to:

  # first bad commit: [939779f5152d161b34f612af29e7dc1ac4472fcf] virtio_ring: validate used buffer length

I also reported it in the thread hanging off that patch:

  https://lore.kernel.org/lkml/87zgpupcga@mpe.ellerman.id.au/


cheers


Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-23 Thread Michael Ellerman
"Michael S. Tsirkin"  writes:
> On Tue, Nov 23, 2021 at 10:25:20AM +0800, Jason Wang wrote:
>> On Tue, Nov 23, 2021 at 4:24 AM Halil Pasic  wrote:
>> >
>> > On Mon, 22 Nov 2021 14:25:26 +0800
>> > Jason Wang  wrote:
>> >
>> > > I think the fixes are:
>> > >
>> > > 1) fixing the vhost vsock
>> > > 2) use suppress_used_validation=true to let vsock driver to validate
>> > > the in buffer length
>> > > 3) probably a new feature so the driver can only enable the validation
>> > > when the feature is enabled.
>> >
>> > I'm not sure, I would consider a F_DEV_Y_FIXED_BUG_X a perfectly good
>> > feature. Frankly the set of such bugs is device implementation
>> > specific and it makes little sense to specify a feature bit
>> > that says the device implementation claims to adhere to some
>> > aspect of the specification. Also what would be the semantic
>> > of not negotiating F_DEV_Y_FIXED_BUG_X?
>> 
>> Yes, I agree. Rethinking the feature bit, it seems unnecessary,
>> especially considering the driver should not care about the used
>> length for tx.
>> 
>> >
>> > On the other hand I see no other way to keep the validation
>> > permanently enabled for fixed implementations, and get around the problem
>> > with broken implementations. So we could have something like
>> > VHOST_USED_LEN_STRICT.
>> 
>> It's more about a choice of the driver's knowledge. For vsock TX it
>> should be fine. If we introduce a parameter and disable it by default,
>> it won't be very useful.
>> 
>> >
>> > Maybe, we can also think of 'warn and don't alter behavior' instead of
>> > 'warn' and alter behavior. Or maybe even not having such checks on in
>> > production, but only when testing.
>> 
>> I think there's an agreement that virtio drivers need more hardening,
>> that's why a lot of patches were merged. Especially considering the
>> new requirements came from confidential computing, smart NIC and
>> VDUSE. For virtio drivers, enabling the validation may help to
>> 
>> 1) protect the driver from the buggy and malicious device
>> 2) uncover the bugs of the devices (as vsock did, and probably rpmsg)
>> 3) force us to have a smart driver that can do the validation itself
>> then we can finally remove the validation in the core
>> 
>> So I'd like to keep it enabled.
>> 
>> Thanks
>
> Let's see how far we can get. But yes, maybe we were too aggressive in
> breaking things by default, a warning might be a better choice for a
> couple of cycles.

This series appears to break the virtio_balloon driver as well.

The symptom is soft lockup warnings, eg:

  INFO: task kworker/1:1:109 blocked for more than 614 seconds.
Not tainted 5.16.0-rc2-gcc-10.3.0 #21
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/1:1 state:D stack:12496 pid:  109 ppid: 2 
flags:0x0800
  Workqueue: events_freezable update_balloon_size_func
  Call Trace:
  [c3cef7c0] [c3cef820] 0xc3cef820 (unreliable)
  [c3cef9b0] [c001e238] __switch_to+0x1e8/0x2f0
  [c3cefa10] [c0f0a00c] __schedule+0x2cc/0xb50
  [c3cefae0] [c0f0a8fc] schedule+0x6c/0x140
  [c3cefb10] [c095b6c4] tell_host+0xe4/0x130
  [c3cefba0] [c095d234] update_balloon_size_func+0x394/0x3f0
  [c3cefc70] [c0178064] process_one_work+0x2c4/0x5b0
  [c3cefd10] [c01783f8] worker_thread+0xa8/0x640
  [c3cefda0] [c0185444] kthread+0x1b4/0x1c0
  [c3cefe10] [c000cee4] ret_from_kernel_thread+0x5c/0x64

Similar backtrace reported here by Luis:

  https://lore.kernel.org/lkml/yy2duti0wayak...@bombadil.infradead.org/

Bisect points to:

  # first bad commit: [939779f5152d161b34f612af29e7dc1ac4472fcf] virtio_ring: validate used buffer length

Adding suppress used validation to the virtio balloon driver "fixes" it, eg.

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index c22ff0117b46..a14b82ceebb2 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -1150,6 +1150,7 @@ static unsigned int features[] = {
 };
 
 static struct virtio_driver virtio_balloon_driver = {
+   .suppress_used_validation = true,
.feature_table = features,
.feature_table_size = ARRAY_SIZE(features),
.driver.name =  KBUILD_MODNAME,


cheers


Re: [PATCH v2 net-next 21/26] ice: add XDP and XSK generic per-channel statistics

2021-11-23 Thread Daniel Borkmann

Hi Alexander,

On 11/23/21 5:39 PM, Alexander Lobakin wrote:
[...]

Just commenting on ice here as one example (similar applies to other drivers):


diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index 1dd7e84f41f8..7dc287bc3a1a 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -258,6 +258,8 @@ static void ice_clean_xdp_irq(struct ice_tx_ring *xdp_ring)
xdp_ring->next_dd = ICE_TX_THRESH - 1;
xdp_ring->next_to_clean = ntc;
ice_update_tx_ring_stats(xdp_ring, total_pkts, total_bytes);
+   xdp_update_tx_drv_stats(&xdp_ring->xdp_stats->xdp_tx, total_pkts,
+   total_bytes);
  }

  /**
@@ -277,6 +279,7 @@ int ice_xmit_xdp_ring(void *data, u16 size, struct ice_tx_ring *xdp_ring)
ice_clean_xdp_irq(xdp_ring);

if (!unlikely(ICE_DESC_UNUSED(xdp_ring))) {
+   xdp_update_tx_drv_full(&xdp_ring->xdp_stats->xdp_tx);
xdp_ring->tx_stats.tx_busy++;
return ICE_XDP_CONSUMED;
}
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index ff55cb415b11..62ef47a38d93 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -454,42 +454,58 @@ ice_construct_skb_zc(struct ice_rx_ring *rx_ring, struct xdp_buff **xdp_arr)
   * @xdp: xdp_buff used as input to the XDP program
   * @xdp_prog: XDP program to run
   * @xdp_ring: ring to be used for XDP_TX action
+ * @lrstats: onstack Rx XDP stats
   *
   * Returns any of ICE_XDP_{PASS, CONSUMED, TX, REDIR}
   */
  static int
  ice_run_xdp_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
-  struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring)
+  struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring,
+  struct xdp_rx_drv_stats_local *lrstats)
  {
int err, result = ICE_XDP_PASS;
u32 act;

+   lrstats->bytes += xdp->data_end - xdp->data;
+   lrstats->packets++;
+
act = bpf_prog_run_xdp(xdp_prog, xdp);

if (likely(act == XDP_REDIRECT)) {
err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
-   if (err)
+   if (err) {
+   lrstats->redirect_errors++;
goto out_failure;
+   }
+   lrstats->redirect++;
return ICE_XDP_REDIR;
}

switch (act) {
case XDP_PASS:
+   lrstats->pass++;
break;
case XDP_TX:
result = ice_xmit_xdp_buff(xdp, xdp_ring);
-   if (result == ICE_XDP_CONSUMED)
+   if (result == ICE_XDP_CONSUMED) {
+   lrstats->tx_errors++;
goto out_failure;
+   }
+   lrstats->tx++;
break;
default:
bpf_warn_invalid_xdp_action(act);
-   fallthrough;
+   lrstats->invalid++;
+   goto out_failure;
case XDP_ABORTED:
+   lrstats->aborted++;
  out_failure:
trace_xdp_exception(rx_ring->netdev, xdp_prog, act);
-   fallthrough;
+   result = ICE_XDP_CONSUMED;
+   break;
case XDP_DROP:
result = ICE_XDP_CONSUMED;
+   lrstats->drop++;
break;
}


Imho, the overall approach is way too bloated. I can see the packets/bytes,
but now we have 3 counter updates with return codes included, and then the
additional sync of the on-stack counters into the ring counters via
xdp_update_rx_drv_stats(). So we now need ice_update_rx_ring_stats() as well
as xdp_update_rx_drv_stats(), which syncs 10 different stat counters via
u64_stats_add() into the per-ring ones. :/

I'm just taking our XDP L4LB in Cilium as an example: there we already count
errors and export them via a per-cpu map for the cases that eventually lead
to XDP_DROP, including the /reason/ which caused the XDP_DROP (e.g.
Prometheus can then scrape these insights from all the nodes in the cluster).
Given the different action codes are very often application specific, there's
not much debugging that you can do when /only/ looking at `ip link xdpstats`
to gather insight on *why* some of these actions were triggered (e.g. fib
lookup failure, etc). If really of interest, then maybe libxdp could have
such per-action counters as opt-in in its call chain..

In the case of ice_run_xdp() today, we already bump
total_rx_bytes/total_rx_pkts under XDP and update ice_update_rx_ring_stats().
I do see the case for XDP_TX and XDP_REDIRECT where we run into
driver-specific errors that are /outside of the reach/ of the BPF prog. For
example, we've been running into errors from XDP_TX in ice_xmit_xdp_ring()
in the past during testing, and were able to pinpoint th

Re: [PATCH 20/29] ext4: cleanup the dax handling in ext4_fill_super

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Only call fs_dax_get_by_bdev once the sbi has been allocated and remove
> the need for the dax_dev local variable.

Looks good.

Reviewed-by: Dan Williams 


Re: [PATCH 19/29] ext2: cleanup the dax handling in ext2_fill_super

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Only call fs_dax_get_by_bdev once the sbi has been allocated and remove
> the need for the dax_dev local variable.

Looks good.

Reviewed-by: Dan Williams 


Re: [PATCH 18/29] fsdax: decouple zeroing from the iomap buffered I/O code

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Unshare the DAX and iomap buffered I/O page zeroing code.  This code
> previously did a IS_DAX check deep inside the iomap code, which in
> fact was the only DAX check in the code.  Instead move these checks
> into the callers.  Most callers already have DAX special casing anyway
> and XFS will need it for reflink support as well.

Looks ok, a tangential question below about iomap_truncate_page(), but
you can add:

Reviewed-by: Dan Williams 

>
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/dax.c   | 77 ++
>  fs/ext2/inode.c|  6 ++--
>  fs/ext4/inode.c|  4 +--
>  fs/iomap/buffered-io.c | 35 +++
>  fs/xfs/xfs_iomap.c |  6 
>  include/linux/dax.h|  6 +++-
>  6 files changed, 91 insertions(+), 43 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index dc9ebeff850ab..5b52b878124ac 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1135,24 +1135,73 @@ static int dax_memzero(struct dax_device *dax_dev, 
> pgoff_t pgoff,
> return rc;
>  }
>
> -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
> +static loff_t dax_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  {
> -   pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
> -   long rc, id;
> -   unsigned offset = offset_in_page(pos);
> -   unsigned size = min_t(u64, PAGE_SIZE - offset, length);
> +   const struct iomap *iomap = &iter->iomap;
> +   const struct iomap *srcmap = iomap_iter_srcmap(iter);
> +   loff_t pos = iter->pos;
> +   loff_t length = iomap_length(iter);
> +   loff_t written = 0;
> +
> +   /* already zeroed?  we're done. */
> +   if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> +   return length;
> +
> +   do {
> +   unsigned offset = offset_in_page(pos);
> +   unsigned size = min_t(u64, PAGE_SIZE - offset, length);
> +   pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
> +   long rc;
> +   int id;
>
> -   id = dax_read_lock();
> -   if (IS_ALIGNED(pos, PAGE_SIZE) && size == PAGE_SIZE)
> -   rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
> -   else
> -   rc = dax_memzero(iomap->dax_dev, pgoff, offset, size);
> -   dax_read_unlock(id);
> +   id = dax_read_lock();
> +   if (IS_ALIGNED(pos, PAGE_SIZE) && size == PAGE_SIZE)
> +   rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
> +   else
> +   rc = dax_memzero(iomap->dax_dev, pgoff, offset, size);
> +   dax_read_unlock(id);
>
> -   if (rc < 0)
> -   return rc;
> -   return size;
> +   if (rc < 0)
> +   return rc;
> +   pos += size;
> +   length -= size;
> +   written += size;
> +   if (did_zero)
> +   *did_zero = true;
> +   } while (length > 0);
> +
> +   return written;
> +}
> +
> +int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool 
> *did_zero,
> +   const struct iomap_ops *ops)
> +{
> +   struct iomap_iter iter = {
> +   .inode  = inode,
> +   .pos= pos,
> +   .len= len,
> +   .flags  = IOMAP_ZERO,
> +   };
> +   int ret;
> +
> +   while ((ret = iomap_iter(&iter, ops)) > 0)
> +   iter.processed = dax_zero_iter(&iter, did_zero);
> +   return ret;
> +}
> +EXPORT_SYMBOL_GPL(dax_zero_range);
> +
> +int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> +   const struct iomap_ops *ops)
> +{
> +   unsigned int blocksize = i_blocksize(inode);
> +   unsigned int off = pos & (blocksize - 1);
> +
> +   /* Block boundary? Nothing to do */
> +   if (!off)
> +   return 0;

It took me a moment to figure out why this was correct. I see it was
also copied from iomap_truncate_page(). It makes sense for DAX where
blocksize >= PAGE_SIZE so it's always the case that the amount of
capacity to zero relative to a page is from @pos to the end of the
block. Is there something else that protects the blocksize < PAGE_SIZE
case outside of DAX?

Nothing to change for this patch, just a question I had while reviewing.

> +   return dax_zero_range(inode, pos, blocksize - off, did_zero, ops);
>  }
> +EXPORT_SYMBOL_GPL(dax_truncate_page);
>
>  static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
> struct iov_iter *iter)
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 333fa62661d56..ae9993018a015 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -1297,9 +1297,9 @@ static int ext2_setsize(struct inode *inode, loff_t 
> newsize)
> inode_dio_wait(inode);
>
> if (IS_DAX(inode)) {
> -   error

Re: [PATCH 17/29] fsdax: factor out a dax_memzero helper

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Factor out a helper for the "manual" zeroing of a DAX range to clean
> up dax_iomap_zero a lot.
>

Small / optional fixup below:

Reviewed-by: Dan Williams 

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/dax.c | 36 +++-
>  1 file changed, 19 insertions(+), 17 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index d7a923d152240..dc9ebeff850ab 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1121,34 +1121,36 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state 
> *xas, struct vm_fault *vmf,
>  }
>  #endif /* CONFIG_FS_DAX_PMD */
>
> +static int dax_memzero(struct dax_device *dax_dev, pgoff_t pgoff,
> +   unsigned int offset, size_t size)
> +{
> +   void *kaddr;
> +   long rc;
> +
> +   rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
> +   if (rc >= 0) {

Technically this should be "> 0" because dax_direct_access() returns
nr_available_pages @pgoff, but this isn't broken because
dax_direct_access() converts the "zero pages available" case into
-ERANGE.

> +   memset(kaddr + offset, 0, size);
> +   dax_flush(dax_dev, kaddr + offset, size);
> +   }
> +   return rc;
> +}
> +
>  s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>  {
> pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
> long rc, id;
> -   void *kaddr;
> -   bool page_aligned = false;
> unsigned offset = offset_in_page(pos);
> unsigned size = min_t(u64, PAGE_SIZE - offset, length);
>
> -   if (IS_ALIGNED(pos, PAGE_SIZE) && size == PAGE_SIZE)
> -   page_aligned = true;
> -
> id = dax_read_lock();
> -
> -   if (page_aligned)
> +   if (IS_ALIGNED(pos, PAGE_SIZE) && size == PAGE_SIZE)
> rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
> else
> -   rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, 
> NULL);
> -   if (rc < 0) {
> -   dax_read_unlock(id);
> -   return rc;
> -   }
> -
> -   if (!page_aligned) {
> -   memset(kaddr + offset, 0, size);
> -   dax_flush(iomap->dax_dev, kaddr + offset, size);
> -   }
> +   rc = dax_memzero(iomap->dax_dev, pgoff, offset, size);
> dax_read_unlock(id);
> +
> +   if (rc < 0)
> +   return rc;
> return size;
>  }
>
> --
> 2.30.2
>


Re: [PATCH 16/29] fsdax: simplify the offset check in dax_iomap_zero

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> The file relative offset must have the same alignment as the storage
> offset, so use that and get rid of the call to iomap_sector.

Agree.

Reviewed-by: Dan Williams 


Re: [PATCH 15/29] xfs: add xfs_zero_range and xfs_truncate_page helpers

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> From: Shiyang Ruan 
>
> Add helpers to prepare for using different DAX operations.
>
> Signed-off-by: Shiyang Ruan 
> [hch: split from a larger patch + slight cleanups]
> Signed-off-by: Christoph Hellwig 

Looks good to me.

Reviewed-by: Dan Williams 


Re: [PATCH 14/29] fsdax: simplify the pgoff calculation

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Replace the two steps of dax_iomap_sector and bdev_dax_pgoff with a
> single dax_iomap_pgoff helper that avoids lots of cumbersome sector
> conversions.

Looks good,

Reviewed-by: Dan Williams 

>
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/dax/super.c | 14 --
>  fs/dax.c| 35 ++-
>  include/linux/dax.h |  1 -
>  3 files changed, 10 insertions(+), 40 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 803942586d1b6..c0910687fbcb2 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -67,20 +67,6 @@ void dax_remove_host(struct gendisk *disk)
>  }
>  EXPORT_SYMBOL_GPL(dax_remove_host);
>
> -int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
> -   pgoff_t *pgoff)
> -{
> -   sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
> -   phys_addr_t phys_off = (start_sect + sector) * 512;
> -
> -   if (pgoff)
> -   *pgoff = PHYS_PFN(phys_off);
> -   if (phys_off % PAGE_SIZE || size % PAGE_SIZE)
> -   return -EINVAL;
> -   return 0;
> -}
> -EXPORT_SYMBOL(bdev_dax_pgoff);
> -
>  /**
>   * dax_get_by_host() - temporary lookup mechanism for filesystem-dax
>   * @bdev: block device to find a dax_device for
> diff --git a/fs/dax.c b/fs/dax.c
> index e51b4129d1b65..5364549d67a48 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -709,23 +709,22 @@ int dax_invalidate_mapping_entry_sync(struct 
> address_space *mapping,
> return __dax_invalidate_entry(mapping, index, false);
>  }
>
> -static sector_t dax_iomap_sector(const struct iomap *iomap, loff_t pos)
> +static pgoff_t dax_iomap_pgoff(const struct iomap *iomap, loff_t pos)
>  {
> -   return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
> +   phys_addr_t paddr = iomap->addr + (pos & PAGE_MASK) - iomap->offset;
> +
> +   if (iomap->bdev)
> +   paddr += (get_start_sect(iomap->bdev) << SECTOR_SHIFT);
> +   return PHYS_PFN(paddr);
>  }
>
>  static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter 
> *iter)
>  {
> -   sector_t sector = dax_iomap_sector(&iter->iomap, iter->pos);
> +   pgoff_t pgoff = dax_iomap_pgoff(&iter->iomap, iter->pos);
> void *vto, *kaddr;
> -   pgoff_t pgoff;
> long rc;
> int id;
>
> -   rc = bdev_dax_pgoff(iter->iomap.bdev, sector, PAGE_SIZE, &pgoff);
> -   if (rc)
> -   return rc;
> -
> id = dax_read_lock();
> rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, &kaddr, NULL);
> if (rc < 0) {
> @@ -1013,14 +1012,10 @@ EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
>  static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
>  pfn_t *pfnp)
>  {
> -   const sector_t sector = dax_iomap_sector(iomap, pos);
> -   pgoff_t pgoff;
> +   pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
> int id, rc;
> long length;
>
> -   rc = bdev_dax_pgoff(iomap->bdev, sector, size, &pgoff);
> -   if (rc)
> -   return rc;
> id = dax_read_lock();
> length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
>NULL, pfnp);
> @@ -1129,7 +1124,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>  s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>  {
> sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
> -   pgoff_t pgoff;
> +   pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
> long rc, id;
> void *kaddr;
> bool page_aligned = false;
> @@ -1140,10 +1135,6 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
> (size == PAGE_SIZE))
> page_aligned = true;
>
> -   rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, &pgoff);
> -   if (rc)
> -   return rc;
> -
> id = dax_read_lock();
>
> if (page_aligned)
> @@ -1169,7 +1160,6 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
> const struct iomap *iomap = &iomi->iomap;
> loff_t length = iomap_length(iomi);
> loff_t pos = iomi->pos;
> -   struct block_device *bdev = iomap->bdev;
> struct dax_device *dax_dev = iomap->dax_dev;
> loff_t end = pos + length, done = 0;
> ssize_t ret = 0;
> @@ -1203,9 +1193,8 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
> while (pos < end) {
> unsigned offset = pos & (PAGE_SIZE - 1);
> const size_t size = ALIGN(length + offset, PAGE_SIZE);
> -   const sector_t sector = dax_iomap_sector(iomap, pos);
> +   pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
> ssize_t map_len;
> -   pgoff_t pgoff;
> void *kaddr;
>
>  

Re: [PATCH 13/29] fsdax: use a saner calling convention for copy_cow_page_dax

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Just pass the vm_fault and iomap_iter structures, and figure out the rest
> locally.  Note that this requires moving dax_iomap_sector up in the file.

Looks good,

Reviewed-by: Dan Williams 

>
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/dax.c | 29 +
>  1 file changed, 13 insertions(+), 16 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 73bd1439d8089..e51b4129d1b65 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -709,26 +709,31 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
> return __dax_invalidate_entry(mapping, index, false);
>  }
>
> -static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_dev,
> -sector_t sector, struct page *to, unsigned long vaddr)
> +static sector_t dax_iomap_sector(const struct iomap *iomap, loff_t pos)
>  {
> +   return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
> +}
> +
> +static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter *iter)
> +{
> +   sector_t sector = dax_iomap_sector(&iter->iomap, iter->pos);
> void *vto, *kaddr;
> pgoff_t pgoff;
> long rc;
> int id;
>
> -   rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
> +   rc = bdev_dax_pgoff(iter->iomap.bdev, sector, PAGE_SIZE, &pgoff);
> if (rc)
> return rc;
>
> id = dax_read_lock();
> -   rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
> +   rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, &kaddr, NULL);
> if (rc < 0) {
> dax_read_unlock(id);
> return rc;
> }
> -   vto = kmap_atomic(to);
> -   copy_user_page(vto, kaddr, vaddr, to);
> +   vto = kmap_atomic(vmf->cow_page);
> +   copy_user_page(vto, kaddr, vmf->address, vmf->cow_page);
> kunmap_atomic(vto);
> dax_read_unlock(id);
> return 0;
> @@ -1005,11 +1010,6 @@ int dax_writeback_mapping_range(struct address_space *mapping,
>  }
>  EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
>
> -static sector_t dax_iomap_sector(const struct iomap *iomap, loff_t pos)
> -{
> -   return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
> -}
> -
>  static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
>  pfn_t *pfnp)
>  {
> @@ -1332,19 +1332,16 @@ static vm_fault_t dax_fault_synchronous_pfnp(pfn_t *pfnp, pfn_t pfn)
>  static vm_fault_t dax_fault_cow_page(struct vm_fault *vmf,
> const struct iomap_iter *iter)
>  {
> -   sector_t sector = dax_iomap_sector(&iter->iomap, iter->pos);
> -   unsigned long vaddr = vmf->address;
> vm_fault_t ret;
> int error = 0;
>
> switch (iter->iomap.type) {
> case IOMAP_HOLE:
> case IOMAP_UNWRITTEN:
> -   clear_user_highpage(vmf->cow_page, vaddr);
> +   clear_user_highpage(vmf->cow_page, vmf->address);
> break;
> case IOMAP_MAPPED:
> -   error = copy_cow_page_dax(iter->iomap.bdev, iter->iomap.dax_dev,
> - sector, vmf->cow_page, vaddr);
> +   error = copy_cow_page_dax(vmf, iter);
> break;
> default:
> WARN_ON_ONCE(1);
> --
> 2.30.2
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 04/29] dax: simplify the dax_device <-> gendisk association

2021-11-23 Thread Dan Williams
On Mon, Nov 22, 2021 at 9:58 PM Christoph Hellwig  wrote:
>
> On Mon, Nov 22, 2021 at 07:33:06PM -0800, Dan Williams wrote:
> > Is it time to add a "DAX" symbol namespace?
>
> What would be the benefit?

Just the small benefit of identifying DAX core users with a common
grep line, and to indicate that DAX exports are more intertwined than
standalone exports, but yeah those are minor.


Re: [PATCH v2] drm/virtio: Fix a NULL vs IS_ERR() bug in virtio_gpu_object_shmem_init()

2021-11-23 Thread Chia-I Wu
On Thu, Nov 18, 2021 at 3:16 AM Dan Carpenter  wrote:
>
> The drm_gem_shmem_get_sg_table() function never returns NULL.  It returns
> error pointers on error.
>
> Fixes: c66df701e783 ("drm/virtio: switch from ttm to gem shmem helpers")
> Signed-off-by: Dan Carpenter 
> ---
> v2: I originally sent this patch on 19 Jun 2020 but it was somehow
> not applied.  As I review it now, I see that the bug is actually
> older than I originally thought and so I have updated the Fixes
> tag.
Reviewed-by: Chia-I Wu 


Re: [PATCH 02/17] Use memberof(T, m) instead of explicit NULL dereference

2021-11-23 Thread Rafael J. Wysocki
On Fri, Nov 19, 2021 at 12:37 PM Alejandro Colomar  wrote:
>
> Signed-off-by: Alejandro Colomar 
> Cc: Ajit Khaparde 
> Cc: Andrew Morton 
> Cc: Andy Shevchenko 
> Cc: Arnd Bergmann 
> Cc: Bjorn Andersson 
> Cc: Borislav Petkov 
> Cc: Corey Minyard 
> Cc: Chris Mason 
> Cc: Christian Brauner 
> Cc: David Sterba 
> Cc: Jani Nikula 
> Cc: Jason Wang 
> Cc: Jitendra Bhivare 
> Cc: John Hubbard 
> Cc: John S. Gruber 
> Cc: Jonathan Cameron 
> Cc: Joonas Lahtinen 
> Cc: Josef Bacik 
> Cc: Kees Cook 
> Cc: Ketan Mukadam 
> Cc: Len Brown 
> Cc: "Michael S. Tsirkin" 
> Cc: Miguel Ojeda 
> Cc: Mike Rapoport 
> Cc: Nick Desaulniers 
> Cc: "Rafael J. Wysocki" 
> Cc: Rasmus Villemoes 
> Cc: Rodrigo Vivi 
> Cc: Russell King 
> Cc: Somnath Kotur 
> Cc: Sriharsha Basavapatna 
> Cc: Subbu Seetharaman 
> Cc: 
> Cc: 
> Cc: 
> Cc: 
> Cc: 
> Cc: 
> Cc: 
> ---
>  arch/x86/include/asm/bootparam_utils.h  |  3 ++-
>  arch/x86/kernel/signal_compat.c |  5 +++--
>  drivers/gpu/drm/i915/i915_utils.h   |  5 ++---
>  drivers/gpu/drm/i915/intel_runtime_pm.h |  2 +-
>  drivers/net/ethernet/emulex/benet/be.h  |  7 ---
>  drivers/net/ethernet/i825xx/ether1.c|  7 +--
>  drivers/scsi/be2iscsi/be.h  |  7 ---
>  drivers/scsi/be2iscsi/be_cmds.h |  5 -
>  fs/btrfs/ctree.h|  5 +++--
>  include/acpi/actypes.h  |  4 +++-

The change in actypes.h would need to be submitted to the upstream
ACPICA project via https://github.com/acpica/acpica/

Thanks!

>  include/linux/container_of.h|  6 +++---
>  include/linux/virtio_config.h   | 14 +++---
>  12 files changed, 41 insertions(+), 29 deletions(-)


Re: [RFC PATCH 0/3] Use pageblock_order for cma and alloc_contig_range alignment.

2021-11-23 Thread Vlastimil Babka
On 11/23/21 17:35, Zi Yan wrote:
> On 19 Nov 2021, at 10:15, Zi Yan wrote:
>>>> From what my understanding, cma required alignment of
>>>> max(MAX_ORDER - 1, pageblock_order), because when MIGRATE_CMA was introduced,
>>>> __free_one_page() does not prevent merging two different pageblocks, when
>>>> MAX_ORDER - 1 > pageblock_order. But current __free_one_page() implementation
>>>> does prevent that.
>>>
>>> But it does prevent that only for isolated pageblock, not CMA, and your
>>> patchset doesn't seem to expand that to CMA? Or am I missing something.
>>
>> Yeah, you are right. Originally, I thought preventing merging isolated pageblock
>> with other types of pageblocks is sufficient, since MIGRATE_CMA is always
>> converted from MIGRATE_ISOLATE. But that is not true. I will rework the code.
>> Thanks for pointing this out.
>>
> 
> I find that two pageblocks with different migratetypes, like MIGRATE_RECLAIMABLE
> and MIGRATE_MOVABLE can be merged into a single free page after I checked
> __free_one_page() in detail and printed pageblock information during buddy page
> merging.

Yes, that can happen.

> I am not sure what consequence it will cause. Do you have any idea?

For MIGRATE_RECLAIMABLE or MIGRATE_MOVABLE or even MIGRATE_UNMOVABLE it's
absolutely fine. As long as these pageblocks are fully free (and they are if
it's a single free page spanning 2 pageblocks), they can be of any of these
type, as they can be reused as needed without causing fragmentation.

But in case of MIGRATE_CMA and MIGRATE_ISOLATE, uncontrolled merging would
break the specifics of those types. That's why the code is careful for
MIGRATE_ISOLATE, and MIGRATE_CMA was until now done in MAX_ORDER granularity.


Re: [RFC PATCH 0/3] Use pageblock_order for cma and alloc_contig_range alignment.

2021-11-23 Thread David Hildenbrand
On 17.11.21 04:04, Zi Yan wrote:
> On 16 Nov 2021, at 3:58, David Hildenbrand wrote:
> 
>> On 15.11.21 20:37, Zi Yan wrote:
>>> From: Zi Yan 
>>>
>>> Hi David,
>>
>> Hi,
>>
>> thanks for looking into this.
>>

Hi,

sorry for the delay, I wasn't "actually working" last week, so now I'm
back from holiday :)

>>>
>>> You suggested to make alloc_contig_range() deal with pageblock_order instead of
>>> MAX_ORDER - 1 and get rid of MAX_ORDER - 1 dependency in virtio_mem[1]. This
>>> patchset is my attempt to achieve that. Please take a look and let me know if
>>> I am doing it correctly or not.
>>>
>>> From what my understanding, cma required alignment of
>>> max(MAX_ORDER - 1, pageblock_order), because when MIGRATE_CMA was introduced,
>>> __free_one_page() does not prevent merging two different pageblocks, when
>>> MAX_ORDER - 1 > pageblock_order. But current __free_one_page() implementation
>>> does prevent that. It should be OK to just align cma to pageblock_order.
>>> alloc_contig_range() relies on MIGRATE_CMA to get free pages, so it can use
>>> pageblock_order as alignment too.
>>
>> I wonder if that's sufficient. Especially the outer_start logic in
>> alloc_contig_range() might be problematic. There are some ugly corner
>> cases with free pages/allocations spanning multiple pageblocks and we
>> only isolated a single pageblock.
> 
> Thank you a lot for writing the list of these corner cases. They are
> very helpful!
> 
>>
>>
>> Regarding CMA, we have to keep the following cases working:
>>
>> a) Different pageblock types (MIGRATE_CMA and !MIGRATE_CMA) in MAX_ORDER
>>- 1 page:
>>[   MAX_ ORDER - 1 ]
>>[ pageblock 0 | pageblock 1]
>>
>> Assume either pageblock 0 is MIGRATE_CMA or pageblock 1 is MIGRATE_CMA,
>> but not both. We have to make sure alloc_contig_range() keeps working
>> correctly. This should be the case even with your change, as we won't
>> merging pages accross differing migratetypes.
> 
> Yes.
> 
>>
>> b) Migrating/freeing a MAX_ ORDER - 1 page while partially isolated:
>>[   MAX_ ORDER - 1 ]
>>[ pageblock 0 | pageblock 1]
>>
>> Assume both are MIGRATE_CMA. Assume we want to either allocate from
>> pageblock 0 or pageblock 1. Especially, assume we want to allocate from
>> pageblock 1. While we would isolate pageblock 1, we wouldn't isolate
>> pageblock 0.
>>
>> What happens if we either have a free page spanning the MAX_ORDER - 1
>> range already OR if we have to migrate a MAX_ORDER - 1 page, resulting
>> in a free MAX_ORDER - 1 page of which only the second pageblock is
>> isolated? We would end up essentially freeing a page that has mixed
>> pageblocks, essentially placing it in !MIGRATE_ISOLATE free lists ... I
>> might be wrong but I have the feeling that this would be problematic.
>>
> 
> This could happen when start_isolate_page_range() stumbles upon a compound
> page with order >= pageblock_order or a free page with order >=
> pageblock_order, but should not. start_isolate_page_range() should check
> the actual page size, either compound page size or free page size, and set
> the migratetype across pageblocks if the page is bigger than pageblock size.
> More precisely set_migratetype_isolate() should do that.

Right -- but then we have to extend the isolation range from within
set_migratetype_isolate() :/ E.g., how should we know what we have to
unisolate later ..

> 
> 
>> c) Concurrent allocations:
>> [   MAX_ ORDER - 1 ]
>> [ pageblock 0 | pageblock 1]
>>
>> Assume b) but we have two concurrent CMA allocations to pageblock 0 and
>> pageblock 1, which would now be possible as start_isolate_page_range()
>> isolate would succeed on both.
> 
> Two isolations will be serialized by the zone lock taken by
> set_migratetype_isolate(), so the concurrent allocation would not be a problem.
> If it is a MAX_ORDER-1 free page, the first comer should split it and only
> isolate one of the pageblock then second one can isolate the other pageblock.

Right, the issue is essentially b) above.

> If it is a MAX_ORDER-1 compound page, the first comer should isolate both
> pageblocks, then the second one would fail. WDYT?

Possibly we could even have two independent CMA areas "colliding" within
a MAX_ ORDER - 1 page. I guess the surprise would be getting an
"-EAGAIN" from isolation, but the caller should properly handle that.

Maybe it's really easier to do something along the lines I proposed
below and always isolate the complete MAX_ORDER-1 range in
alloc_contig_range(). We just have to "fix" isolation as I drafted.

> 
> 
> In sum, it seems to me that the issue is page isolation code only sees
> pageblock without check the actual page. When there are multiple pageblocks
> belonging to one page, the problem appears. This should be fixed.
> 
>>
>>
>> Regarding virtio-mem, we care about the following cases:
>>
>> a) Allocating parts from completely movable MAX_ ORDER - 1 page:
>>[   MAX_ ORDER - 1 ]
>>[ pageblock 0

[PATCH v2 5/5] iommu/virtio: Support identity-mapped domains

2021-11-23 Thread Jean-Philippe Brucker
Support identity domains for devices that do not offer the
VIRTIO_IOMMU_F_BYPASS_CONFIG feature, by creating 1:1 mappings between
the virtual and physical address space. Identity domains created this
way still perform noticeably better than DMA domains, because they don't
have the overhead of setting up and tearing down mappings at runtime.
The performance difference between this and bypass is minimal in
comparison.

It does not matter that the physical addresses in the identity mappings
do not all correspond to memory. By enabling passthrough we are trusting
the device driver and the device itself to only perform DMA to suitable
locations. In some cases it may even be desirable to perform DMA to MMIO
regions.

Reviewed-by: Kevin Tian 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/virtio-iommu.c | 63 +---
 1 file changed, 58 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index eceb9281c8c1..6a8a52b4297b 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -375,6 +375,55 @@ static size_t viommu_del_mappings(struct viommu_domain *vdomain,
return unmapped;
 }
 
+/*
+ * Fill the domain with identity mappings, skipping the device's reserved
+ * regions.
+ */
+static int viommu_domain_map_identity(struct viommu_endpoint *vdev,
+ struct viommu_domain *vdomain)
+{
+   int ret;
+   struct iommu_resv_region *resv;
+   u64 iova = vdomain->domain.geometry.aperture_start;
+   u64 limit = vdomain->domain.geometry.aperture_end;
+   u32 flags = VIRTIO_IOMMU_MAP_F_READ | VIRTIO_IOMMU_MAP_F_WRITE;
+   unsigned long granule = 1UL << __ffs(vdomain->domain.pgsize_bitmap);
+
+   iova = ALIGN(iova, granule);
+   limit = ALIGN_DOWN(limit + 1, granule) - 1;
+
+   list_for_each_entry(resv, &vdev->resv_regions, list) {
+   u64 resv_start = ALIGN_DOWN(resv->start, granule);
+   u64 resv_end = ALIGN(resv->start + resv->length, granule) - 1;
+
+   if (resv_end < iova || resv_start > limit)
+   /* No overlap */
+   continue;
+
+   if (resv_start > iova) {
+   ret = viommu_add_mapping(vdomain, iova, resv_start - 1,
+(phys_addr_t)iova, flags);
+   if (ret)
+   goto err_unmap;
+   }
+
+   if (resv_end >= limit)
+   return 0;
+
+   iova = resv_end + 1;
+   }
+
+   ret = viommu_add_mapping(vdomain, iova, limit, (phys_addr_t)iova,
+flags);
+   if (ret)
+   goto err_unmap;
+   return 0;
+
+err_unmap:
+   viommu_del_mappings(vdomain, 0, iova);
+   return ret;
+}
+
 /*
  * viommu_replay_mappings - re-send MAP requests
  *
@@ -637,14 +686,18 @@ static int viommu_domain_finalise(struct viommu_endpoint *vdev,
vdomain->viommu = viommu;
 
if (domain->type == IOMMU_DOMAIN_IDENTITY) {
-   if (!virtio_has_feature(viommu->vdev,
-   VIRTIO_IOMMU_F_BYPASS_CONFIG)) {
+   if (virtio_has_feature(viommu->vdev,
+  VIRTIO_IOMMU_F_BYPASS_CONFIG)) {
+   vdomain->bypass = true;
+   return 0;
+   }
+
+   ret = viommu_domain_map_identity(vdev, vdomain);
+   if (ret) {
ida_free(&viommu->domain_ids, vdomain->id);
-   vdomain->viommu = 0;
+   vdomain->viommu = NULL;
return -EOPNOTSUPP;
}
-
-   vdomain->bypass = true;
}
 
return 0;
-- 
2.33.1



[PATCH v2 3/5] iommu/virtio: Sort reserved regions

2021-11-23 Thread Jean-Philippe Brucker
To ease identity mapping support, keep the list of reserved regions
sorted.

Reviewed-by: Kevin Tian 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/virtio-iommu.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index ee8a7afd667b..d63ec4d11b00 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -423,7 +423,7 @@ static int viommu_add_resv_mem(struct viommu_endpoint *vdev,
size_t size;
u64 start64, end64;
phys_addr_t start, end;
-   struct iommu_resv_region *region = NULL;
+   struct iommu_resv_region *region = NULL, *next;
unsigned long prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
 
start = start64 = le64_to_cpu(mem->start);
@@ -454,7 +454,12 @@ static int viommu_add_resv_mem(struct viommu_endpoint *vdev,
if (!region)
return -ENOMEM;
 
-   list_add(&region->list, &vdev->resv_regions);
+   /* Keep the list sorted */
+   list_for_each_entry(next, &vdev->resv_regions, list) {
+   if (next->start > region->start)
+   break;
+   }
+   list_add_tail(&region->list, &next->list);
return 0;
 }
 
-- 
2.33.1



[PATCH v2 4/5] iommu/virtio: Pass end address to viommu_add_mapping()

2021-11-23 Thread Jean-Philippe Brucker
To support identity mappings, the virtio-iommu driver must be able to
represent full 64-bit ranges internally. Pass (start, end) instead of
(start, size) to viommu_add/del_mapping().

Clean up comments. The one about the returned size was never true: when
sweeping the whole address space the returned size will most certainly
be smaller than 2^64.

Reviewed-by: Kevin Tian 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/virtio-iommu.c | 31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index d63ec4d11b00..eceb9281c8c1 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -311,8 +311,8 @@ static int viommu_send_req_sync(struct viommu_dev *viommu, void *buf,
  *
  * On success, return the new mapping. Otherwise return NULL.
  */
-static int viommu_add_mapping(struct viommu_domain *vdomain, unsigned long iova,
- phys_addr_t paddr, size_t size, u32 flags)
+static int viommu_add_mapping(struct viommu_domain *vdomain, u64 iova, u64 end,
+ phys_addr_t paddr, u32 flags)
 {
unsigned long irqflags;
struct viommu_mapping *mapping;
@@ -323,7 +323,7 @@ static int viommu_add_mapping(struct viommu_domain *vdomain, unsigned long iova,
 
mapping->paddr  = paddr;
mapping->iova.start = iova;
-   mapping->iova.last  = iova + size - 1;
+   mapping->iova.last  = end;
mapping->flags  = flags;
 
spin_lock_irqsave(&vdomain->mappings_lock, irqflags);
@@ -338,26 +338,24 @@ static int viommu_add_mapping(struct viommu_domain *vdomain, unsigned long iova,
  *
  * @vdomain: the domain
  * @iova: start of the range
- * @size: size of the range. A size of 0 corresponds to the entire address
- * space.
+ * @end: end of the range
  *
- * On success, returns the number of unmapped bytes (>= size)
+ * On success, returns the number of unmapped bytes
  */
 static size_t viommu_del_mappings(struct viommu_domain *vdomain,
- unsigned long iova, size_t size)
+ u64 iova, u64 end)
 {
size_t unmapped = 0;
unsigned long flags;
-   unsigned long last = iova + size - 1;
struct viommu_mapping *mapping = NULL;
struct interval_tree_node *node, *next;
 
spin_lock_irqsave(&vdomain->mappings_lock, flags);
-   next = interval_tree_iter_first(&vdomain->mappings, iova, last);
+   next = interval_tree_iter_first(&vdomain->mappings, iova, end);
while (next) {
node = next;
mapping = container_of(node, struct viommu_mapping, iova);
-   next = interval_tree_iter_next(node, iova, last);
+   next = interval_tree_iter_next(node, iova, end);
 
/* Trying to split a mapping? */
if (mapping->iova.start < iova)
@@ -656,8 +654,8 @@ static void viommu_domain_free(struct iommu_domain *domain)
 {
struct viommu_domain *vdomain = to_viommu_domain(domain);
 
-   /* Free all remaining mappings (size 2^64) */
-   viommu_del_mappings(vdomain, 0, 0);
+   /* Free all remaining mappings */
+   viommu_del_mappings(vdomain, 0, ULLONG_MAX);
 
if (vdomain->viommu)
ida_free(&vdomain->viommu->domain_ids, vdomain->id);
@@ -742,6 +740,7 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova,
 {
int ret;
u32 flags;
+   u64 end = iova + size - 1;
struct virtio_iommu_req_map map;
struct viommu_domain *vdomain = to_viommu_domain(domain);
 
@@ -752,7 +751,7 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova,
if (flags & ~vdomain->map_flags)
return -EINVAL;
 
-   ret = viommu_add_mapping(vdomain, iova, paddr, size, flags);
+   ret = viommu_add_mapping(vdomain, iova, end, paddr, flags);
if (ret)
return ret;
 
@@ -761,7 +760,7 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova,
.domain = cpu_to_le32(vdomain->id),
.virt_start = cpu_to_le64(iova),
.phys_start = cpu_to_le64(paddr),
-   .virt_end   = cpu_to_le64(iova + size - 1),
+   .virt_end   = cpu_to_le64(end),
.flags  = cpu_to_le32(flags),
};
 
@@ -770,7 +769,7 @@ static int viommu_map(struct iommu_domain *domain, unsigned long iova,
 
ret = viommu_send_req_sync(vdomain->viommu, &map, sizeof(map));
if (ret)
-   viommu_del_mappings(vdomain, iova, size);
+   viommu_del_mappings(vdomain, iova, end);
 
return ret;
 }
@@ -783,7 +782,7 @@ static size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova,
struct virtio_iommu_req_unmap unmap;
struct viommu_domain *vdom

[PATCH v2 2/5] iommu/virtio: Support bypass domains

2021-11-23 Thread Jean-Philippe Brucker
The VIRTIO_IOMMU_F_BYPASS_CONFIG feature adds a new flag to the ATTACH
request, that creates a bypass domain. Use it to enable identity
domains.

When VIRTIO_IOMMU_F_BYPASS_CONFIG is not supported by the device, we
currently fail attaching to an identity domain. Future patches will
instead create identity mappings in this case.

Reviewed-by: Kevin Tian 
Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/virtio-iommu.c | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 80930ce04a16..ee8a7afd667b 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -71,6 +71,7 @@ struct viommu_domain {
struct rb_root_cached   mappings;
 
unsigned long   nr_endpoints;
+   boolbypass;
 };
 
 struct viommu_endpoint {
@@ -587,7 +588,9 @@ static struct iommu_domain *viommu_domain_alloc(unsigned type)
 {
struct viommu_domain *vdomain;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
+   if (type != IOMMU_DOMAIN_UNMANAGED &&
+   type != IOMMU_DOMAIN_DMA &&
+   type != IOMMU_DOMAIN_IDENTITY)
return NULL;
 
vdomain = kzalloc(sizeof(*vdomain), GFP_KERNEL);
@@ -630,6 +633,17 @@ static int viommu_domain_finalise(struct viommu_endpoint *vdev,
vdomain->map_flags  = viommu->map_flags;
vdomain->viommu = viommu;
 
+   if (domain->type == IOMMU_DOMAIN_IDENTITY) {
+   if (!virtio_has_feature(viommu->vdev,
+   VIRTIO_IOMMU_F_BYPASS_CONFIG)) {
+   ida_free(&viommu->domain_ids, vdomain->id);
+   vdomain->viommu = 0;
+   return -EOPNOTSUPP;
+   }
+
+   vdomain->bypass = true;
+   }
+
return 0;
 }
 
@@ -691,6 +705,9 @@ static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev)
.domain = cpu_to_le32(vdomain->id),
};
 
+   if (vdomain->bypass)
+   req.flags |= cpu_to_le32(VIRTIO_IOMMU_ATTACH_F_BYPASS);
+
for (i = 0; i < fwspec->num_ids; i++) {
req.endpoint = cpu_to_le32(fwspec->ids[i]);
 
@@ -1132,6 +1149,7 @@ static unsigned int features[] = {
VIRTIO_IOMMU_F_DOMAIN_RANGE,
VIRTIO_IOMMU_F_PROBE,
VIRTIO_IOMMU_F_MMIO,
+   VIRTIO_IOMMU_F_BYPASS_CONFIG,
 };
 
 static struct virtio_device_id id_table[] = {
-- 
2.33.1



[PATCH v2 1/5] iommu/virtio: Add definitions for VIRTIO_IOMMU_F_BYPASS_CONFIG

2021-11-23 Thread Jean-Philippe Brucker
Add definitions for the VIRTIO_IOMMU_F_BYPASS_CONFIG, which supersedes
VIRTIO_IOMMU_F_BYPASS.

Reviewed-by: Kevin Tian 
Signed-off-by: Jean-Philippe Brucker 
---
 include/uapi/linux/virtio_iommu.h | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/virtio_iommu.h b/include/uapi/linux/virtio_iommu.h
index 237e36a280cb..cafd8cf7febf 100644
--- a/include/uapi/linux/virtio_iommu.h
+++ b/include/uapi/linux/virtio_iommu.h
@@ -16,6 +16,7 @@
 #define VIRTIO_IOMMU_F_BYPASS  3
 #define VIRTIO_IOMMU_F_PROBE   4
 #define VIRTIO_IOMMU_F_MMIO5
+#define VIRTIO_IOMMU_F_BYPASS_CONFIG   6
 
 struct virtio_iommu_range_64 {
__le64  start;
@@ -36,6 +37,8 @@ struct virtio_iommu_config {
struct virtio_iommu_range_32domain_range;
/* Probe buffer size */
__le32  probe_size;
+   __u8bypass;
+   __u8reserved[7];
 };
 
 /* Request types */
@@ -66,11 +69,14 @@ struct virtio_iommu_req_tail {
__u8reserved[3];
 };
 
+#define VIRTIO_IOMMU_ATTACH_F_BYPASS   (1 << 0)
+
 struct virtio_iommu_req_attach {
struct virtio_iommu_req_headhead;
__le32  domain;
__le32  endpoint;
-   __u8reserved[8];
+   __le32  flags;
+   __u8reserved[4];
struct virtio_iommu_req_tailtail;
 };
 
-- 
2.33.1



[PATCH v2 0/5] iommu/virtio: Add identity domains

2021-11-23 Thread Jean-Philippe Brucker
Support identity domains, allowing to only enable IOMMU protection for a
subset of endpoints (those assigned to userspace, for example). Users
may enable identity domains at compile time
(CONFIG_IOMMU_DEFAULT_PASSTHROUGH), boot time (iommu.passthrough=1) or
runtime (/sys/kernel/iommu_groups/*/type = identity).

Since v1 [1] I rebased onto v5.16-rc and added Kevin's review tag.
The specification update for the new feature has now been accepted [2].

Patches 1-2 support identity domains using the optional
VIRTIO_IOMMU_F_BYPASS_CONFIG feature, and patches 3-5 add a fallback to
identity mappings, when the feature is not supported.

QEMU patches are on my virtio-iommu/bypass branch [3], and depend on the
UAPI update.

[1] https://lore.kernel.org/linux-iommu/20211013121052.518113-1-jean-phili...@linaro.org/
[2] https://github.com/oasis-tcs/virtio-spec/issues/119
[3] https://jpbrucker.net/git/qemu/log/?h=virtio-iommu/bypass

Jean-Philippe Brucker (5):
  iommu/virtio: Add definitions for VIRTIO_IOMMU_F_BYPASS_CONFIG
  iommu/virtio: Support bypass domains
  iommu/virtio: Sort reserved regions
  iommu/virtio: Pass end address to viommu_add_mapping()
  iommu/virtio: Support identity-mapped domains

 include/uapi/linux/virtio_iommu.h |   8 ++-
 drivers/iommu/virtio-iommu.c  | 113 +-
 2 files changed, 101 insertions(+), 20 deletions(-)

-- 
2.33.1



Re: [PATCH 1/2] vhost/vsock: fix incorrect used length reported to the guest

2021-11-23 Thread Halil Pasic
On Mon, 22 Nov 2021 17:35:24 +0100
Stefano Garzarella  wrote:

> The "used length" reported by calling vhost_add_used() must be the
> number of bytes written by the device (using "in" buffers).
> 
> In vhost_vsock_handle_tx_kick() the device only reads the guest
> buffers (they are all "out" buffers), without writing anything,
> so we must pass 0 as "used length" to comply with the virtio spec.
> 
> Fixes: 433fc58e6bf2 ("VSOCK: Introduce vhost_vsock.ko")
> Cc: sta...@vger.kernel.org
> Reported-by: Halil Pasic 
> Suggested-by: Jason Wang 
> Signed-off-by: Stefano Garzarella 

Reviewed-by: Halil Pasic 

> ---
>  drivers/vhost/vsock.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index 938aefbc75ec..4e3b95af7ee4 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -554,7 +554,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
>   virtio_transport_free_pkt(pkt);
>  
>   len += sizeof(pkt->hdr);
> - vhost_add_used(vq, head, len);
> + vhost_add_used(vq, head, 0);
>   total_len += len;
>   added = true;
>   } while(likely(!vhost_exceeds_weight(vq, ++pkts, total_len)));



Re: [PATCH 0/2] vhost/vsock: fix used length and cleanup in vhost_vsock_handle_tx_kick()

2021-11-23 Thread Stefan Hajnoczi
On Mon, Nov 22, 2021 at 05:35:23PM +0100, Stefano Garzarella wrote:
> This is a follow-up to Michael's patch [1] and the discussion with Halil and
> Jason [2].
> 
> I made two patches, one to fix the problem and one for cleanup. This should
> simplify the backport of the fix because we've had the problem since
> vhost-vsock was introduced (v4.8) and that part has been touched a bit
> recently.
> 
> Thanks,
> Stefano
> 
> [1] https://lore.kernel.org/virtualization/20211122105822.onarsa4sydzxqynu@steredhat/T/#t
> [2] https://lore.kernel.org/virtualization/20211027022107.14357-1-jasow...@redhat.com/T/#t
> 
> Stefano Garzarella (2):
>   vhost/vsock: fix incorrect used length reported to the guest
>   vhost/vsock: cleanup removing `len` variable
> 
>  drivers/vhost/vsock.c | 8 ++--
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> -- 
> 2.31.1
> 

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH] vsock/virtio: suppress used length validation

2021-11-23 Thread Stefan Hajnoczi
On Mon, Nov 22, 2021 at 04:32:01AM -0500, Michael S. Tsirkin wrote:
> It turns out that vhost vsock violates the virtio spec
> by supplying the out buffer length in the used length
> (should just be the in length).
> As a result, attempts to validate the used length fail with:
> vmw_vsock_virtio_transport virtio1: tx: used len 44 is larger than in buflen 0
> 
> Since the vsock driver does not use the length for tx and
> validates the length before use for rx, it is safe to
> suppress the validation in virtio core for this driver.
> 
> Reported-by: Halil Pasic 
> Fixes: 939779f5152d ("virtio_ring: validate used buffer length")
> Cc: "Jason Wang" 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  net/vmw_vsock/virtio_transport.c | 1 +
>  1 file changed, 1 insertion(+)

Reviewed-by: Stefan Hajnoczi 


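
The failing core-side check that produces the "used len 44 is larger than in buflen 0" message quoted above boils down to comparing the reported used length against the in-buffer capacity. A minimal sketch of that comparison (function name hypothetical):

```c
#include <stdint.h>

/* Returns 1 if the device-reported used length is plausible: it can
 * never exceed the number of device-writable (in) bytes the driver
 * made available in the buffer. */
static int used_len_valid(uint32_t used_len, uint32_t in_buflen)
{
	return used_len <= in_buflen;
}
```

A TX buffer has in_buflen == 0, so the buggy vhost-vsock report of 44 bytes fails the check, while a report of 0 passes.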

Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-23 Thread Halil Pasic
On Tue, 23 Nov 2021 07:17:05 -0500
"Michael S. Tsirkin"  wrote:

> On Mon, Nov 22, 2021 at 02:50:03PM +0100, Halil Pasic wrote:
> > On Mon, 22 Nov 2021 14:25:26 +0800
> > Jason Wang  wrote:
> >   
> > > On Mon, Nov 22, 2021 at 1:49 PM Halil Pasic  wrote:  
> > > >
> > > > On Mon, 22 Nov 2021 06:35:18 +0100
> > > > Halil Pasic  wrote:
> > > >
> > > > > > I think it should be a common issue, looking at
> > > > > > vhost_vsock_handle_tx_kick(), it did:
> > > > > >
> > > > > > len += sizeof(pkt->hdr);
> > > > > > vhost_add_used(vq, head, len);
> > > > > >
> > > > > > which looks like a violation of the spec since it's TX.
> > > > >
> > > > > I'm not sure the lines above look like a violation of the spec. If you
> > > > > examine vhost_vsock_alloc_pkt() I believe that you will agree that:
> > > > > len == pkt->len == pkt->hdr.len
> > > > > which makes sense since according to the spec both tx and rx messages
> > > > > are hdr+payload. And I believe hdr.len is the size of the payload,
> > > > > although that does not seem to be properly documented by the spec.
> > > 
> > > Sorry for being unclear, what I meant is that we probably should use
> > > zero here. TX doesn't use in buffer actually.
> > > 
> > > According to the spec, 0 should be the used length:
> > > 
> > > "and len the total of bytes written into the buffer."  
> > 
> > Right, I was wrong. I somehow assumed this is the total length and not
> > just the number of bytes written.
> >   
> > >   
> > > > >
> > > > > On the other hand tx messages are stated to be device read-only (in
> > > > > the spec) so if the device writes stuff, that is certainly wrong.
> > > > >
> > > 
> > > Yes.
> > >   
> > > > > If that is what happens.
> > > > >
> > > > > Looking at virtqueue_get_buf_ctx_split() I'm not sure that is what
> > > > > happens. My hypothesis is that we just assume the last descriptor is an 'in'
> > > > > type descriptor (i.e. a device writable one). For tx that assumption
> > > > > would be wrong.
> > > > >
> > > > > I will have another look at this today and send a fix patch if my
> > > > > suspicion is confirmed.  
> > 
> > Yeah, I didn't remember the semantic of
> > vq->split.vring.used->ring[last_used].len
> > correctly, and in fact also how exactly the rings work. So your objection
> > is correct. 
> > 
> > Maybe updating some stuff would make it easier to not make this mistake.
> > 
> > For example the spec and also the linux header says:
> > 
> > /* le32 is used here for ids for padding reasons. */ 
> > struct virtq_used_elem { 
> > /* Index of start of used descriptor chain. */ 
> > le32 id; 
> > /* Total length of the descriptor chain which was used (written to) */
> > le32 len; 
> > };
> > 
> > I think that comment isn't as clear as it could be. I would prefer:
> > /* The number of bytes written into the device writable portion of the
> > buffer described by the descriptor chain. */
> > 
> > I believe "the descriptor chain which was used" includes both the
> > descriptors that map the device read only and the device write
> > only portions of the buffer described by the descriptor chain. And the
> > total length of that descriptor chain may be defined either as a number
> > of the descriptors that form the chain, or the length of the buffer.
> > 
> > One has to use the descriptor chain even if the whole buffer is device
> > read only. So "used" == "written to" does not make any sense to me.  
> 
> The virtio spec actually says
> 
> Total length of the descriptor chain which was written to
> 
> without the "used" part.

Yes, that is in the text after the (pseudo-)code listing which contains
the description I was referring to (in 2.6.8 The Virtqueue Used Ring).
> 
> > Also something like
> > int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int bytes_written)
> > instead of
> > int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
> > would make it easier to read the code correctly.  
> 
> I think we agree here. Patches?
> 

Will send some :D

Thanks!

[..]


Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-23 Thread Michael S. Tsirkin
On Mon, Nov 22, 2021 at 02:50:03PM +0100, Halil Pasic wrote:
> On Mon, 22 Nov 2021 14:25:26 +0800
> Jason Wang  wrote:
> 
> > On Mon, Nov 22, 2021 at 1:49 PM Halil Pasic  wrote:
> > >
> > > On Mon, 22 Nov 2021 06:35:18 +0100
> > > Halil Pasic  wrote:
> > >  
> > > > > I think it should be a common issue, looking at
> > > > > vhost_vsock_handle_tx_kick(), it did:
> > > > >
> > > > > len += sizeof(pkt->hdr);
> > > > > vhost_add_used(vq, head, len);
> > > > >
> > > > > which looks like a violation of the spec since it's TX.  
> > > >
> > > > I'm not sure the lines above look like a violation of the spec. If you
> > > > examine vhost_vsock_alloc_pkt() I believe that you will agree that:
> > > > len == pkt->len == pkt->hdr.len
> > > > which makes sense since according to the spec both tx and rx messages
> > > > are hdr+payload. And I believe hdr.len is the size of the payload,
> > > > although that does not seem to be properly documented by the spec.  
> > 
> > Sorry for being unclear, what I meant is that we probably should use
> > zero here. TX doesn't use in buffer actually.
> > 
> > According to the spec, 0 should be the used length:
> > 
> > "and len the total of bytes written into the buffer."
> 
> Right, I was wrong. I somehow assumed this is the total length and not
> just the number of bytes written.
> 
> > 
> > > >
> > > > On the other hand tx messages are stated to be device read-only (in the
> > > > spec) so if the device writes stuff, that is certainly wrong.
> > > >  
> > 
> > Yes.
> > 
> > > > If that is what happens.
> > > >
> > > > Looking at virtqueue_get_buf_ctx_split() I'm not sure that is what
> > > > happens. My hypothesis is that we just assume the last descriptor is an 'in'
> > > > type descriptor (i.e. a device writable one). For tx that assumption
> > > > would be wrong.
> > > >
> > > > I will have another look at this today and send a fix patch if my
> > > > suspicion is confirmed.
> 
> Yeah, I didn't remember the semantic of
> vq->split.vring.used->ring[last_used].len
> correctly, and in fact also how exactly the rings work. So your objection
> is correct. 
> 
> Maybe updating some stuff would make it easier to not make this mistake.
> 
> For example the spec and also the linux header says:
> 
> /* le32 is used here for ids for padding reasons. */ 
> struct virtq_used_elem { 
> /* Index of start of used descriptor chain. */ 
> le32 id; 
> /* Total length of the descriptor chain which was used (written to) */
> le32 len; 
> };
> 
> I think that comment isn't as clear as it could be. I would prefer:
> /* The number of bytes written into the device writable portion of the
> buffer described by the descriptor chain. */
> 
> I believe "the descriptor chain which was used" includes both the
> descriptors that map the device read only and the device write
> only portions of the buffer described by the descriptor chain. And the
> total length of that descriptor chain may be defined either as a number
> of the descriptors that form the chain, or the length of the buffer.
> 
> One has to use the descriptor chain even if the whole buffer is device
> read only. So "used" == "written to" does not make any sense to me.

The virtio spec actually says

Total length of the descriptor chain which was written to

without the "used" part.

> Also something like
> int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int bytes_written)
> instead of
> int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
> would make it easier to read the code correctly.

I think we agree here. Patches?

> > >
> > > If my suspicion is right something like:
> > >
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 00f64f2f8b72..efb57898920b 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -764,6 +764,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
> > > struct vring_virtqueue *vq = to_vvq(_vq);
> > > void *ret;
> > > unsigned int i;
> > > +   bool has_in;
> > > u16 last_used;
> > >
> > > START_USE(vq);
> > > @@ -787,6 +788,9 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
> > > vq->split.vring.used->ring[last_used].id);
> > > *len = virtio32_to_cpu(_vq->vdev,
> > > vq->split.vring.used->ring[last_used].len);
> > > +   has_in = virtio16_to_cpu(_vq->vdev,
> > > +   vq->split.vring.used->ring[last_used].flags)
> > > +   & VRING_DESC_F_WRITE;  
> > 
> > Did you mean vring.desc actually? If yes, it's better not depend on
> > the descriptor ring which can be modified by the device. We've stored
> > the flags in desc_extra[].
> > 
> > >
> > > if (unlikely(i >= vq->split.vring.num)) {
> > > BAD_RING(vq, "id %u out of range\n", i);
> > > @@ -796,7 +800,7 @@ 

Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-23 Thread Michael S. Tsirkin
On Tue, Nov 23, 2021 at 10:25:20AM +0800, Jason Wang wrote:
> On Tue, Nov 23, 2021 at 4:24 AM Halil Pasic  wrote:
> >
> > On Mon, 22 Nov 2021 14:25:26 +0800
> > Jason Wang  wrote:
> >
> > > I think the fixes are:
> > >
> > > 1) fixing the vhost vsock
> > > 2) use suppress_used_validation=true to let the vsock driver validate
> > > the in buffer length
> > > 3) probably a new feature so the driver can only enable the validation
> > > when the feature is enabled.
> >
> > I'm not sure, I would consider a F_DEV_Y_FIXED_BUG_X a perfectly good
> > feature. Frankly the set of such bugs is device implementation
> > specific and it makes little sense to specify a feature bit
> > that says the device implementation claims to adhere to some
> > aspect of the specification. Also what would be the semantic
> > of not negotiating F_DEV_Y_FIXED_BUG_X?
> 
> Yes, I agree. Rethink of the feature bit, it seems unnecessary,
> especially considering the driver should not care about the used
> length for tx.
> 
> >
> > On the other hand I see no other way to keep the validation
> > permanently enabled for fixed implementations, and get around the problem
> > with broken implementations. So we could have something like
> > VHOST_USED_LEN_STRICT.
> 
> It's more about a choice of the driver's knowledge. For vsock TX it
> should be fine. If we introduce a parameter and disable it by default,
> it won't be very useful.
> 
> >
> > Maybe, we can also think of 'warn and don't alter behavior' instead of
> > 'warn' and alter behavior. Or maybe even not having such checks on in
> > production, but only when testing.
> 
> I think there's an agreement that virtio drivers need more hardening,
> that's why a lot of patches were merged. Especially considering the
> new requirements came from confidential computing, smart NIC and
> VDUSE. For virtio drivers, enabling the validation may help to
> 
> 1) protect the driver from the buggy and malicious device
> 2) uncover the bugs of the devices (as vsock did, and probably rpmsg)
> 3) force us to have a smart driver that can do the validation itself
> then we can finally remove the validation in the core
> 
> So I'd like to keep it enabled.
> 
> Thanks

Let's see how far we can get. But yes, maybe we were too aggressive in
breaking things by default; a warning might be a better choice for a
couple of cycles.

> >
> > Regards,
> > Halil
> >



Re: [PATCH v2 1/2] i2c: virtio: disable timeout handling

2021-11-23 Thread Viresh Kumar
On 23-11-21, 10:52, Wolfram Sang wrote:
> On Thu, Nov 11, 2021 at 05:04:11PM +0100, Vincent Whitchurch wrote:
> > If a timeout is hit, it can result in incorrect data on the I2C bus
> > and/or memory corruptions in the guest since the device can still be
> > operating on the buffers it was given while the guest has freed them.
> > 
> > Here is, for example, the start of a slub_debug splat which was
> > triggered on the next transfer after one transfer was forced to timeout
> > by setting a breakpoint in the backend (rust-vmm/vhost-device):
> > 
> >  BUG kmalloc-1k (Not tainted): Poison overwritten
> >  First byte 0x1 instead of 0x6b
> >  Allocated in virtio_i2c_xfer+0x65/0x35c age=350 cpu=0 pid=29
> > __kmalloc+0xc2/0x1c9
> > virtio_i2c_xfer+0x65/0x35c
> > __i2c_transfer+0x429/0x57d
> > i2c_transfer+0x115/0x134
> > i2cdev_ioctl_rdwr+0x16a/0x1de
> > i2cdev_ioctl+0x247/0x2ed
> > vfs_ioctl+0x21/0x30
> > sys_ioctl+0xb18/0xb41
> >  Freed in virtio_i2c_xfer+0x32e/0x35c age=244 cpu=0 pid=29
> > kfree+0x1bd/0x1cc
> > virtio_i2c_xfer+0x32e/0x35c
> > __i2c_transfer+0x429/0x57d
> > i2c_transfer+0x115/0x134
> > i2cdev_ioctl_rdwr+0x16a/0x1de
> > i2cdev_ioctl+0x247/0x2ed
> > vfs_ioctl+0x21/0x30
> > sys_ioctl+0xb18/0xb41
> > 
> > There is no simple fix for this (the driver would have to always create
> > bounce buffers and hold on to them until the device eventually returns
> > the buffers), so just disable the timeout support for now.
> > 
> > Fixes: 3cfc88380413d20f ("i2c: virtio: add a virtio i2c frontend driver")
> > Acked-by: Jie Deng 
> > Signed-off-by: Vincent Whitchurch 
> 
> Applied to for-current, thanks!
> 

Thanks, I completely forgot to reply to the last email from Vincent.

FWIW,

Reviewed-by: Viresh Kumar 

-- 
viresh
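
The use-after-free the slub_debug splat above catches can be stated as an ownership rule: once a buffer is handed to the device, the driver must not free it until the device returns it. A sketch of that rule (names hypothetical, not the driver's actual code):

```c
/* Who currently owns a buffer shared with the virtio device. */
enum buf_owner {
	OWNER_DRIVER,  /* device has returned the buffer (or never had it) */
	OWNER_DEVICE,  /* buffer is still in flight */
};

/* Freeing is only safe once ownership is back with the driver. The
 * removed timeout path freed OWNER_DEVICE buffers, which is the bug;
 * the bounce-buffer scheme mentioned above would keep such buffers
 * alive until the device eventually returns them. */
static int safe_to_free(enum buf_owner owner)
{
	return owner == OWNER_DRIVER;
}
```

Disabling the timeout sidesteps the problem by ensuring the driver always waits for ownership to return.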


Re: [PATCH] x86/paravirt: use %rip-relative addressing in hook calls

2021-11-23 Thread Juergen Gross via Virtualization

On 23.11.21 10:29, Jan Beulich wrote:
> On 05.10.2021 09:43, Juergen Gross wrote:
>> On 30.09.21 14:40, Jan Beulich via Virtualization wrote:
>>> While using a plain (constant) address works, its use needlessly invokes
>>> a SIB addressing mode, making every call site one byte larger than
>>> necessary. Instead of using an "i" constraint with address-of operator
>>> and a 'c' operand modifier, simply use an ordinary "m" constraint, which
>>> the 64-bit compiler will translate to %rip-relative addressing. This way
>>> we also tell the compiler the truth about operand usage - the memory
>>> location gets actually read, after all.
>>>
>>> 32-bit code generation is unaffected by the change.
>>>
>>> Signed-off-by: Jan Beulich 
>>
>> Reviewed-by: Juergen Gross 
>
> Thanks. I notice this wasn't part of your 5.16-rc1 pull request, nor
> did it make it into Linus' tree via any other route. May I ask what
> the plans here are?


I CC-ed you on the related mail I sent to the x86 maintainers:

"Re: Which tree for paravirt related patches?" on Nov 4th, and Thomas
Gleixner promised to look at your patch. Adding him to this response
again in order to remind him.


Juergen



Re: [PATCH] x86/paravirt: use %rip-relative addressing in hook calls

2021-11-23 Thread Jan Beulich via Virtualization
On 05.10.2021 09:43, Juergen Gross wrote:
> On 30.09.21 14:40, Jan Beulich via Virtualization wrote:
>> While using a plain (constant) address works, its use needlessly invokes
>> a SIB addressing mode, making every call site one byte larger than
>> necessary. Instead of using an "i" constraint with address-of operator
>> and a 'c' operand modifier, simply use an ordinary "m" constraint, which
>> the 64-bit compiler will translate to %rip-relative addressing. This way
>> we also tell the compiler the truth about operand usage - the memory
>> location gets actually read, after all.
>>
>> 32-bit code generation is unaffected by the change.
>>
>> Signed-off-by: Jan Beulich 
> 
> Reviewed-by: Juergen Gross 

Thanks. I notice this wasn't part of your 5.16-rc1 pull request, nor
did it make it into Linus' tree via any other route. May I ask what
the plans here are?

Jan



Re: [PATCH 1/2] vdpa: Add support for querying statistics

2021-11-23 Thread Jason Wang
On Tue, Nov 23, 2021 at 2:42 PM Eli Cohen  wrote:
>
> On Tue, Nov 23, 2021 at 10:28:04AM +0800, Jason Wang wrote:
> > On Mon, Nov 22, 2021 at 11:56 PM Parav Pandit  wrote:
> > >
> > >
> > >
> > > > From: Eli Cohen 
> > > > Sent: Monday, November 22, 2021 8:37 PM
> > > >
> > > > On Mon, Nov 22, 2021 at 12:15:36PM +0200, Parav Pandit wrote:
> > > > >
> > > > >
> > > > > > From: Jason Wang 
> > > > > > Sent: Monday, November 22, 2021 3:02 PM
> > > > > >
> > > > > > > If we go with vendor stats, how can we communicate the information
> > > > > > > to userspace? Currenlty we use netlink attributes defined to pass
> > > > > > > this information.
> > > > > >
> > > > > > It can be done exactly as what have been done in the patch, we can
> > > > > > document it as vendor stats.
> > > > > >
> > > > > Yes, the attribute should have a VENDOR_ prefix in it.
> > > > > >
> > > > > > Ok, I think I get you. So I wonder if it's more useful to use device
> > > > > > specific counters. For networking, it could be packets 
> > > > > > sent/received etc.
> > > > >
> > > > > Yes, I equally discussed this previously with Eli as it's more
> > > > > meaningful for end users.
> > > > > We just return the device id of it along with the queue number, which
> > > > > helps to show tx and rx.
> > > > > For ctrl q, it is just ctrl commands and ctrl completions.
> > > >
> > > > I don't think we should mix send/receive packets for descriptor
> > > > statistics. The
> > > > hardware could process a descriptor and still not transmit any packet.
> > > >
> > > > We can add packets send/recv but descriptor statistics have their own 
> > > > value.
> > > >
> > > Oh right. I read Jason's comment of _packets_ too fast. I meant to say
> > > send/receive descriptors.
> > > I guess you already named them as tx and rx. Didn't review the patches in 
> > > this series yet.
> > >
> > > > To summarize, I can add the VENDOR_ prefix to the attributes and re-send,
> > > > or is there anything else you think should change?
> > > A VENDOR_ prefix and "vstats" as the iproute2 command look fine to me.
> >
> > Ack, but we need to figure out:
> >
> > 1) use descriptors or buffers.
>
> Descriptors.
> Currently we don't support indirect buffers but when we do, we will
> preserve the semantics.

Just to confirm, if I understand correctly:

1) with indirect descriptors, only 1 descriptor is counted
2) with N descriptors chained together, it will report N descriptors

Thanks

>
> > 2) if we use descriptors, for indirect descriptors and descriptor
> > chains how are they counted?
> >
>
> We count descriptors, not buffers.
>
> > Thanks
> >
> > >
> >
>

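
The counting convention Jason asks Eli to confirm can be pinned down in a few lines. This is a sketch of the stated semantics only (names hypothetical), not the vDPA driver's code:

```c
/* Number of descriptors a statistics counter advances by for one
 * buffer: an indirect buffer occupies a single descriptor in the
 * ring and is counted as 1, while a direct chain of N descriptors
 * is counted as N. */
static unsigned int descs_used(unsigned int chain_len, int indirect)
{
	return indirect ? 1 : chain_len;
}
```

That is, the counters track ring descriptors consumed, not buffers or packets.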


Re: Kvm virtual machine uses virtio graphics card, the rotating screen is stuck

2021-11-23 Thread Gerd Hoffmann
On Tue, Nov 23, 2021 at 03:49:28PM +0800, Jason Wang wrote:
> On Tue, Nov 23, 2021 at 3:00 PM 苟浩  wrote:
> >
> > Hello,
> >
> > I use `xrandr -o left` to rotate the screen in the kvm virtual machine.
> > When configured as a Virtio graphics card, the screen will freeze after 
> > rotating the screen, and the keyboard and mouse will not respond.
> > When configured as a VGA graphics card, it is normal after rotating the 
> > screen.
> >
> > Does the Virtio graphics card not support rotation?
> 
> Adding list and Gerd for the answer.

Hmm dunno.  Never tried that.  Can't see an obvious reason why virtio
should show different behavior than stdvga, so probably a bug somewhere.

I'm wondering why you want to rotate the screen in the first place though.
You can add any resolution you want using xrandr, including portrait
modes like 768x1024 ...

take care,
  Gerd

- cut here 
#!/bin/sh

width="$1"
height="$2"

if test "$width" = "" -o "$height" = ""; then
echo "usage: $0 width height"
exit 1
fi

output=$(xrandr --query | awk '/ connected/ { print $1; exit }')
mode="${width}x${height}"
echo "# setting mode $mode on output $output"

if xrandr --query | grep -q -e " $mode "; then
true # mode already there
else
modeline=$(cvt $width $height | grep Modeline | cut -d" " -f3-)
(set -x; xrandr --newmode "$mode" $modeline;
 xrandr --addmode "$output" "$mode")
fi
(set -x; xrandr --output "$output" --mode "$mode")


Re: [PATCH] vdpa_sim: avoid putting an uninitialized iova_domain

2021-11-23 Thread Stefano Garzarella

On Mon, Nov 22, 2021 at 08:22:21PM +0800, Longpeng(Mike) wrote:

From: Longpeng 

The system will crash if we put an uninitialized iova_domain; this
could happen when an error occurs before initializing the iova_domain
in vdpasim_create().

BUG: kernel NULL pointer dereference, address: 
...
RIP: 0010:__cpuhp_state_remove_instance+0x96/0x1c0
...
Call Trace:

put_iova_domain+0x29/0x220
vdpasim_free+0xd1/0x120 [vdpa_sim]
vdpa_release_dev+0x21/0x40 [vdpa]
device_release+0x33/0x90
kobject_release+0x63/0x160
vdpasim_create+0x127/0x2a0 [vdpa_sim]
vdpasim_net_dev_add+0x7d/0xfe [vdpa_sim_net]
vdpa_nl_cmd_dev_add_set_doit+0xe1/0x1a0 [vdpa]
genl_family_rcv_msg_doit+0x112/0x140
genl_rcv_msg+0xdf/0x1d0
...

So we must make sure the iova_domain is already initialized before
putting it.

In addition, we may get the following warning in this case:
WARNING: ... drivers/iommu/iova.c:344 iova_cache_put+0x58/0x70

So we must make sure iova_cache_put() is invoked only if
iova_cache_get() was already invoked. Let's fix it together.

Signed-off-by: Longpeng 
---
drivers/vdpa/vdpa_sim/vdpa_sim.c | 7 +--
1 file changed, 5 insertions(+), 2 deletions(-)


Ooops, thanks for fixing this :-)

With the Fixes tag as suggested:

Reviewed-by: Stefano Garzarella 

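
The shape of the fix under review is the usual "only undo what was actually set up" teardown pattern. A sketch with hypothetical flags (the real patch tracks initialization differently; the kernel calls are left as comments):

```c
#include <stdbool.h>

struct sim_state {
	bool iova_inited;     /* init_iova_domain() ran */
	bool iova_cache_got;  /* iova_cache_get() succeeded */
};

/* Teardown must be guarded so a half-constructed device neither puts
 * an uninitialized iova_domain (the NULL dereference above) nor drops
 * an iova_cache reference it never took (the WARNING above). Returns
 * how many teardown steps actually ran. */
static int sim_teardown(struct sim_state *s)
{
	int steps = 0;

	if (s->iova_inited) {
		/* put_iova_domain(&s->iova); */
		s->iova_inited = false;
		steps++;
	}
	if (s->iova_cache_got) {
		/* iova_cache_put(); */
		s->iova_cache_got = false;
		steps++;
	}
	return steps;
}

/* Helper to exercise the guard from a given starting state. */
static int teardown_from(bool inited, bool got)
{
	struct sim_state s = { .iova_inited = inited, .iova_cache_got = got };
	return sim_teardown(&s);
}
```

With the guards in place, an error before iova setup in vdpasim_create() makes the release path a no-op for those resources instead of a crash.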