Re: [RFC v3 2/2] pmem: enable pmem_submit_bio for asynchronous flush

2022-02-16 Thread Dan Williams
On Wed, Feb 16, 2022 at 12:39 AM Pankaj Gupta
 wrote:
>
> > >
> > > Return from "pmem_submit_bio" when asynchronous flush is
> > > still in progress in other context.
> > >
> > > Signed-off-by: Pankaj Gupta 
> > > ---
> > >  drivers/nvdimm/pmem.c| 15 ---
> > >  drivers/nvdimm/region_devs.c |  4 +++-
> > >  2 files changed, 15 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > > index fe7ece1534e1..f20e30277a68 100644
> > > --- a/drivers/nvdimm/pmem.c
> > > +++ b/drivers/nvdimm/pmem.c
> > > @@ -201,8 +201,12 @@ static void pmem_submit_bio(struct bio *bio)
> > > struct pmem_device *pmem = bio->bi_bdev->bd_disk->private_data;
> > > struct nd_region *nd_region = to_region(pmem);
> > >
> > > -   if (bio->bi_opf & REQ_PREFLUSH)
> > > +   if (bio->bi_opf & REQ_PREFLUSH) {
> > > ret = nvdimm_flush(nd_region, bio);
> > > +   /* asynchronous flush completes in other context */
> >
> > I think a negative error code is a confusing way to capture the case
> > of "bio successfully coalesced to previously pending flush request.
> > Perhaps reserve negative codes for failure, 0 for synchronously
> > completed, and > 0 for coalesced flush request.
>
> Yes. I implemented this way previously, will revert it to. Thanks!
>
> >
> > > +   if (ret == -EINPROGRESS)
> > > +   return;
> > > +   }
> > >
> > > do_acct = blk_queue_io_stat(bio->bi_bdev->bd_disk->queue);
> > > if (do_acct)
> > > @@ -222,13 +226,18 @@ static void pmem_submit_bio(struct bio *bio)
> > > if (do_acct)
> > > bio_end_io_acct(bio, start);
> > >
> > > -   if (bio->bi_opf & REQ_FUA)
> > > +   if (bio->bi_opf & REQ_FUA) {
> > > ret = nvdimm_flush(nd_region, bio);
> > > +   /* asynchronous flush completes in other context */
> > > +   if (ret == -EINPROGRESS)
> > > +   return;
> > > +   }
> > >
> > > if (ret)
> > > bio->bi_status = errno_to_blk_status(ret);
> > >
> > > -   bio_endio(bio);
> > > +   if (bio)
> > > +   bio_endio(bio);
> > >  }
> > >
> > >  static int pmem_rw_page(struct block_device *bdev, sector_t sector,
> > > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> > > index 9ccf3d608799..8512d2eaed4e 100644
> > > --- a/drivers/nvdimm/region_devs.c
> > > +++ b/drivers/nvdimm/region_devs.c
> > > @@ -1190,7 +1190,9 @@ int nvdimm_flush(struct nd_region *nd_region, 
> > > struct bio *bio)
> > > if (!nd_region->flush)
> > > rc = generic_nvdimm_flush(nd_region);
> > > else {
> > > -   if (nd_region->flush(nd_region, bio))
> > > +   rc = nd_region->flush(nd_region, bio);
> > > +   /* ongoing flush in other context */
> > > +   if (rc && rc != -EINPROGRESS)
> > > rc = -EIO;
> >
> > Why change this to -EIO vs just let the error code through untranslated?
>
> The reason was to be generic error code instead of returning host side
> return codes to guest?

Ok, maybe a comment to indicate the need to avoid exposing these error
codes toa guest so someone does not ask the same question in the
future?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC v3 1/2] virtio-pmem: Async virtio-pmem flush

2022-02-16 Thread Dan Williams
On Wed, Feb 16, 2022 at 12:47 AM Pankaj Gupta
 wrote:
>
> > >
> > > Enable asynchronous flush for virtio pmem using work queue. Also,
> > > coalesce the flush requests when a flush is already in process.
> > > This functionality is copied from md/RAID code.
> > >
> > > When a flush is already in process, new flush requests wait till
> > > previous flush completes in another context (work queue). For all
> > > the requests come between ongoing flush and new flush start time, only
> > > single flush executes, thus adhers to flush coalscing logic. This is
> >
> > s/adhers/adheres/
> >
> > s/coalscing/coalescing/
> >
> > > important for maintaining the flush request order with request coalscing.
> >
> > s/coalscing/coalescing/
>
> o.k. Sorry for the spelling mistakes.
>
> >
> > >
> > > Signed-off-by: Pankaj Gupta 
> > > ---
> > >  drivers/nvdimm/nd_virtio.c   | 74 +++-
> > >  drivers/nvdimm/virtio_pmem.c | 10 +
> > >  drivers/nvdimm/virtio_pmem.h | 16 
> > >  3 files changed, 83 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
> > > index 10351d5b49fa..179ea7a73338 100644
> > > --- a/drivers/nvdimm/nd_virtio.c
> > > +++ b/drivers/nvdimm/nd_virtio.c
> > > @@ -100,26 +100,66 @@ static int virtio_pmem_flush(struct nd_region 
> > > *nd_region)
> > >  /* The asynchronous flush callback function */
> > >  int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
> > >  {
> > > -   /*
> > > -* Create child bio for asynchronous flush and chain with
> > > -* parent bio. Otherwise directly call nd_region flush.
> > > +   /* queue asynchronous flush and coalesce the flush requests */
> > > +   struct virtio_device *vdev = nd_region->provider_data;
> > > +   struct virtio_pmem *vpmem  = vdev->priv;
> > > +   ktime_t req_start = ktime_get_boottime();
> > > +   int ret = -EINPROGRESS;
> > > +
> > > +   spin_lock_irq(>lock);
> >
> > Why a new lock and not continue to use ->pmem_lock?
>
> This spinlock is to protect entry in 'wait_event_lock_irq'
> and the Other spinlock is to protect virtio queue data.

Understood, but md shares the mddev->lock for both purposes, so I
would ask that you either document what motivates the locking split,
or just reuse the lock until a strong reason to split them arises.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC v3 2/2] pmem: enable pmem_submit_bio for asynchronous flush

2022-02-15 Thread Dan Williams
On Tue, Jan 11, 2022 at 8:21 AM Pankaj Gupta
 wrote:
>
> Return from "pmem_submit_bio" when asynchronous flush is
> still in progress in other context.
>
> Signed-off-by: Pankaj Gupta 
> ---
>  drivers/nvdimm/pmem.c| 15 ---
>  drivers/nvdimm/region_devs.c |  4 +++-
>  2 files changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index fe7ece1534e1..f20e30277a68 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -201,8 +201,12 @@ static void pmem_submit_bio(struct bio *bio)
> struct pmem_device *pmem = bio->bi_bdev->bd_disk->private_data;
> struct nd_region *nd_region = to_region(pmem);
>
> -   if (bio->bi_opf & REQ_PREFLUSH)
> +   if (bio->bi_opf & REQ_PREFLUSH) {
> ret = nvdimm_flush(nd_region, bio);
> +   /* asynchronous flush completes in other context */

I think a negative error code is a confusing way to capture the case
of "bio successfully coalesced to previously pending flush request.
Perhaps reserve negative codes for failure, 0 for synchronously
completed, and > 0 for coalesced flush request.

> +   if (ret == -EINPROGRESS)
> +   return;
> +   }
>
> do_acct = blk_queue_io_stat(bio->bi_bdev->bd_disk->queue);
> if (do_acct)
> @@ -222,13 +226,18 @@ static void pmem_submit_bio(struct bio *bio)
> if (do_acct)
> bio_end_io_acct(bio, start);
>
> -   if (bio->bi_opf & REQ_FUA)
> +   if (bio->bi_opf & REQ_FUA) {
> ret = nvdimm_flush(nd_region, bio);
> +   /* asynchronous flush completes in other context */
> +   if (ret == -EINPROGRESS)
> +   return;
> +   }
>
> if (ret)
> bio->bi_status = errno_to_blk_status(ret);
>
> -   bio_endio(bio);
> +   if (bio)
> +   bio_endio(bio);
>  }
>
>  static int pmem_rw_page(struct block_device *bdev, sector_t sector,
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 9ccf3d608799..8512d2eaed4e 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -1190,7 +1190,9 @@ int nvdimm_flush(struct nd_region *nd_region, struct 
> bio *bio)
> if (!nd_region->flush)
> rc = generic_nvdimm_flush(nd_region);
> else {
> -   if (nd_region->flush(nd_region, bio))
> +   rc = nd_region->flush(nd_region, bio);
> +   /* ongoing flush in other context */
> +   if (rc && rc != -EINPROGRESS)
> rc = -EIO;

Why change this to -EIO vs just let the error code through untranslated?

> }
>
> --
> 2.25.1
>
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC v3 1/2] virtio-pmem: Async virtio-pmem flush

2022-02-15 Thread Dan Williams
On Tue, Jan 11, 2022 at 8:23 AM Pankaj Gupta
 wrote:
>
> Enable asynchronous flush for virtio pmem using work queue. Also,
> coalesce the flush requests when a flush is already in process.
> This functionality is copied from md/RAID code.
>
> When a flush is already in process, new flush requests wait till
> previous flush completes in another context (work queue). For all
> the requests come between ongoing flush and new flush start time, only
> single flush executes, thus adhers to flush coalscing logic. This is

s/adhers/adheres/

s/coalscing/coalescing/

> important for maintaining the flush request order with request coalscing.

s/coalscing/coalescing/

>
> Signed-off-by: Pankaj Gupta 
> ---
>  drivers/nvdimm/nd_virtio.c   | 74 +++-
>  drivers/nvdimm/virtio_pmem.c | 10 +
>  drivers/nvdimm/virtio_pmem.h | 16 
>  3 files changed, 83 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
> index 10351d5b49fa..179ea7a73338 100644
> --- a/drivers/nvdimm/nd_virtio.c
> +++ b/drivers/nvdimm/nd_virtio.c
> @@ -100,26 +100,66 @@ static int virtio_pmem_flush(struct nd_region 
> *nd_region)
>  /* The asynchronous flush callback function */
>  int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
>  {
> -   /*
> -* Create child bio for asynchronous flush and chain with
> -* parent bio. Otherwise directly call nd_region flush.
> +   /* queue asynchronous flush and coalesce the flush requests */
> +   struct virtio_device *vdev = nd_region->provider_data;
> +   struct virtio_pmem *vpmem  = vdev->priv;
> +   ktime_t req_start = ktime_get_boottime();
> +   int ret = -EINPROGRESS;
> +
> +   spin_lock_irq(>lock);

Why a new lock and not continue to use ->pmem_lock?

Have you tested this with CONFIG_PROVE_LOCKING?

Along those lines do you have a selftest that can be added to the
kernel as well so that 0day or other bots could offer early warnings
on regressions?

> +   /* flush requests wait until ongoing flush completes,
> +* hence coalescing all the pending requests.
>  */
> -   if (bio && bio->bi_iter.bi_sector != -1) {
> -   struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> -
> -   if (!child)
> -   return -ENOMEM;
> -   bio_copy_dev(child, bio);
> -   child->bi_opf = REQ_PREFLUSH;
> -   child->bi_iter.bi_sector = -1;
> -   bio_chain(child, bio);
> -   submit_bio(child);
> -   return 0;
> +   wait_event_lock_irq(vpmem->sb_wait,
> +   !vpmem->flush_bio ||
> +   ktime_before(req_start, vpmem->prev_flush_start),
> +   vpmem->lock);
> +   /* new request after previous flush is completed */
> +   if (ktime_after(req_start, vpmem->prev_flush_start)) {
> +   WARN_ON(vpmem->flush_bio);
> +   vpmem->flush_bio = bio;
> +   bio = NULL;
> +   }
> +   spin_unlock_irq(>lock);
> +
> +   if (!bio)
> +   queue_work(vpmem->pmem_wq, >flush_work);
> +   else {
> +   /* flush completed in other context while we waited */
> +   if (bio && (bio->bi_opf & REQ_PREFLUSH))
> +   bio->bi_opf &= ~REQ_PREFLUSH;
> +   else if (bio && (bio->bi_opf & REQ_FUA))
> +   bio->bi_opf &= ~REQ_FUA;
> +
> +   ret = vpmem->prev_flush_err;
> }
> -   if (virtio_pmem_flush(nd_region))
> -   return -EIO;
>
> -   return 0;
> +   return ret;
>  };
>  EXPORT_SYMBOL_GPL(async_pmem_flush);
> +
> +void submit_async_flush(struct work_struct *ws)

This name is too generic to be exported from drivers/nvdimm/nd_virtio.c

...it strikes me that there is little reason for nd_virtio and
virtio_pmem to be separate modules. They are both enabled by the same
Kconfig, so why not combine them into one module and drop the exports?

> +{
> +   struct virtio_pmem *vpmem = container_of(ws, struct virtio_pmem, 
> flush_work);
> +   struct bio *bio = vpmem->flush_bio;
> +
> +   vpmem->start_flush = ktime_get_boottime();
> +   vpmem->prev_flush_err = virtio_pmem_flush(vpmem->nd_region);
> +   vpmem->prev_flush_start = vpmem->start_flush;
> +   vpmem->flush_bio = NULL;
> +   wake_up(>sb_wait);
> +
> +   if (vpmem->prev_flush_err)
> +   bio->bi_status = errno_to_blk_status(-EIO);
> +
> +   /* Submit parent bio only for PREFLUSH */
> +   if (bio && (bio->bi_opf & REQ_PREFLUSH)) {
> +   bio->bi_opf &= ~REQ_PREFLUSH;
> +   submit_bio(bio);
> +   } else if (bio && (bio->bi_opf & REQ_FUA)) {
> +   bio->bi_opf &= ~REQ_FUA;
> +   bio_endio(bio);
> +   }
> +}
> +EXPORT_SYMBOL_GPL(submit_async_flush);
>  MODULE_LICENSE("GPL");
> diff --git 

Re: [PATCH 4/5] dax: remove the copy_from_iter and copy_to_iter methods

2021-12-15 Thread Dan Williams
On Wed, Dec 15, 2021 at 7:53 AM Vivek Goyal  wrote:
>
> On Tue, Dec 14, 2021 at 03:43:38PM -0800, Dan Williams wrote:
> > On Tue, Dec 14, 2021 at 12:33 PM Vivek Goyal  wrote:
> > >
> > > On Tue, Dec 14, 2021 at 08:41:30AM -0800, Dan Williams wrote:
> > > > On Tue, Dec 14, 2021 at 6:23 AM Vivek Goyal  wrote:
> > > > >
> > > > > On Mon, Dec 13, 2021 at 09:23:18AM +0100, Christoph Hellwig wrote:
> > > > > > On Sun, Dec 12, 2021 at 06:44:26AM -0800, Dan Williams wrote:
> > > > > > > On Fri, Dec 10, 2021 at 6:17 AM Vivek Goyal  
> > > > > > > wrote:
> > > > > > > > Going forward, I am wondering should virtiofs use flushcache 
> > > > > > > > version as
> > > > > > > > well. What if host filesystem is using DAX and mapping 
> > > > > > > > persistent memory
> > > > > > > > pfn directly into qemu address space. I have never tested that.
> > > > > > > >
> > > > > > > > Right now we are relying on applications to do fsync/msync on 
> > > > > > > > virtiofs
> > > > > > > > for data persistence.
> > > > > > >
> > > > > > > This sounds like it would need coordination with a paravirtualized
> > > > > > > driver that can indicate whether the host side is pmem or not, 
> > > > > > > like
> > > > > > > the virtio_pmem driver. However, if the guest sends any 
> > > > > > > fsync/msync
> > > > > > > you would still need to go explicitly cache flush any dirty page
> > > > > > > because you can't necessarily trust that the guest did that 
> > > > > > > already.
> > > > > >
> > > > > > Do we?  The application can't really know what backend it is on, so
> > > > > > it sounds like the current virtiofs implementation doesn't really, 
> > > > > > does it?
> > > > >
> > > > > Agreed that application does not know what backend it is on. So 
> > > > > virtiofs
> > > > > just offers regular posix API where applications have to do 
> > > > > fsync/msync
> > > > > for data persistence. No support for mmap(MAP_SYNC). We don't offer 
> > > > > persistent
> > > > > memory programming model on virtiofs. That's not the expectation. DAX
> > > > > is used only to bypass guest page cache.
> > > > >
> > > > > With this assumption, I think we might not have to use flushcache 
> > > > > version
> > > > > at all even if shared filesystem is on persistent memory on host.
> > > > >
> > > > > - We mmap() host files into qemu address space. So any dax store in 
> > > > > virtiofs
> > > > >   should make corresponding pages dirty in page cache on host and when
> > > > >   and fsync()/msync() comes later, it should flush all the data to 
> > > > > PMEM.
> > > > >
> > > > > - In case of file extending writes, virtiofs falls back to regular
> > > > >   FUSE_WRITE path (and not use DAX), and in that case host pmem driver
> > > > >   should make sure writes are flushed to pmem immediately.
> > > > >
> > > > > Are there any other path I am missing. If not, looks like we might not
> > > > > have to use flushcache version in virtiofs at all as long as we are 
> > > > > not
> > > > > offering guest applications user space flushes and MAP_SYNC support.
> > > > >
> > > > > We still might have to use machine check safe variant though as loads
> > > > > might generate synchronous machine check. What's not clear to me is
> > > > > that if this MC safe variant should be used only in case of PMEM or
> > > > > should it be used in case of non-PMEM as well.
> > > >
> > > > It should be used on any memory address that can throw exception on
> > > > load, which is any physical address, in paths that can tolerate
> > > > memcpy() returning an error code, most I/O paths, and can tolerate
> > > > slower copy performance on older platforms that do not support MC
> > > > recovery with fast string operations, to date that's only PMEM users.
> > >
> > > Ok, So basically latest cpus can do fast string 

Re: [PATCH 4/5] dax: remove the copy_from_iter and copy_to_iter methods

2021-12-14 Thread Dan Williams
On Tue, Dec 14, 2021 at 12:33 PM Vivek Goyal  wrote:
>
> On Tue, Dec 14, 2021 at 08:41:30AM -0800, Dan Williams wrote:
> > On Tue, Dec 14, 2021 at 6:23 AM Vivek Goyal  wrote:
> > >
> > > On Mon, Dec 13, 2021 at 09:23:18AM +0100, Christoph Hellwig wrote:
> > > > On Sun, Dec 12, 2021 at 06:44:26AM -0800, Dan Williams wrote:
> > > > > On Fri, Dec 10, 2021 at 6:17 AM Vivek Goyal  wrote:
> > > > > > Going forward, I am wondering should virtiofs use flushcache 
> > > > > > version as
> > > > > > well. What if host filesystem is using DAX and mapping persistent 
> > > > > > memory
> > > > > > pfn directly into qemu address space. I have never tested that.
> > > > > >
> > > > > > Right now we are relying on applications to do fsync/msync on 
> > > > > > virtiofs
> > > > > > for data persistence.
> > > > >
> > > > > This sounds like it would need coordination with a paravirtualized
> > > > > driver that can indicate whether the host side is pmem or not, like
> > > > > the virtio_pmem driver. However, if the guest sends any fsync/msync
> > > > > you would still need to go explicitly cache flush any dirty page
> > > > > because you can't necessarily trust that the guest did that already.
> > > >
> > > > Do we?  The application can't really know what backend it is on, so
> > > > it sounds like the current virtiofs implementation doesn't really, does 
> > > > it?
> > >
> > > Agreed that application does not know what backend it is on. So virtiofs
> > > just offers regular posix API where applications have to do fsync/msync
> > > for data persistence. No support for mmap(MAP_SYNC). We don't offer 
> > > persistent
> > > memory programming model on virtiofs. That's not the expectation. DAX
> > > is used only to bypass guest page cache.
> > >
> > > With this assumption, I think we might not have to use flushcache version
> > > at all even if shared filesystem is on persistent memory on host.
> > >
> > > - We mmap() host files into qemu address space. So any dax store in 
> > > virtiofs
> > >   should make corresponding pages dirty in page cache on host and when
> > >   and fsync()/msync() comes later, it should flush all the data to PMEM.
> > >
> > > - In case of file extending writes, virtiofs falls back to regular
> > >   FUSE_WRITE path (and not use DAX), and in that case host pmem driver
> > >   should make sure writes are flushed to pmem immediately.
> > >
> > > Are there any other path I am missing. If not, looks like we might not
> > > have to use flushcache version in virtiofs at all as long as we are not
> > > offering guest applications user space flushes and MAP_SYNC support.
> > >
> > > We still might have to use machine check safe variant though as loads
> > > might generate synchronous machine check. What's not clear to me is
> > > that if this MC safe variant should be used only in case of PMEM or
> > > should it be used in case of non-PMEM as well.
> >
> > It should be used on any memory address that can throw exception on
> > load, which is any physical address, in paths that can tolerate
> > memcpy() returning an error code, most I/O paths, and can tolerate
> > slower copy performance on older platforms that do not support MC
> > recovery with fast string operations, to date that's only PMEM users.
>
> Ok, So basically latest cpus can do fast string operations with MC
> recovery so that using MC safe variant is not a problem.
>
> Then there is range of cpus which can do MC recovery but do slower
> versions of memcpy and that's where the issue is.
>
> So if we knew that virtiofs dax window is backed by a pmem device
> then we should always use MC safe variant. Even if it means paying
> the price of slow version for the sake of correctness.
>
> But if we are not using pmem on host, then there is no point in
> using MC safe variant.
>
> IOW.
>
> if (virtiofs_backed_by_pmem) {

No, PMEM should not be considered at all relative to whether to use MC
or not, it is 100% a decision of whether you expect virtiofs users
will balk more at unhandled machine checks or performance regressions
on the platforms that set "enable_copy_mc_fragile()". See
quirk_intel_brickland_xeon_ras_cap() and
quirk_intel_purley_xeon_ras_cap() in arch/x86/kernel/quirks.c.

> use_mc_safe_version
> else
&g

Re: [PATCH 4/5] dax: remove the copy_from_iter and copy_to_iter methods

2021-12-14 Thread Dan Williams
On Tue, Dec 14, 2021 at 6:23 AM Vivek Goyal  wrote:
>
> On Mon, Dec 13, 2021 at 09:23:18AM +0100, Christoph Hellwig wrote:
> > On Sun, Dec 12, 2021 at 06:44:26AM -0800, Dan Williams wrote:
> > > On Fri, Dec 10, 2021 at 6:17 AM Vivek Goyal  wrote:
> > > > Going forward, I am wondering should virtiofs use flushcache version as
> > > > well. What if host filesystem is using DAX and mapping persistent memory
> > > > pfn directly into qemu address space. I have never tested that.
> > > >
> > > > Right now we are relying on applications to do fsync/msync on virtiofs
> > > > for data persistence.
> > >
> > > This sounds like it would need coordination with a paravirtualized
> > > driver that can indicate whether the host side is pmem or not, like
> > > the virtio_pmem driver. However, if the guest sends any fsync/msync
> > > you would still need to go explicitly cache flush any dirty page
> > > because you can't necessarily trust that the guest did that already.
> >
> > Do we?  The application can't really know what backend it is on, so
> > it sounds like the current virtiofs implementation doesn't really, does it?
>
> Agreed that application does not know what backend it is on. So virtiofs
> just offers regular posix API where applications have to do fsync/msync
> for data persistence. No support for mmap(MAP_SYNC). We don't offer persistent
> memory programming model on virtiofs. That's not the expectation. DAX
> is used only to bypass guest page cache.
>
> With this assumption, I think we might not have to use flushcache version
> at all even if shared filesystem is on persistent memory on host.
>
> - We mmap() host files into qemu address space. So any dax store in virtiofs
>   should make corresponding pages dirty in page cache on host and when
>   and fsync()/msync() comes later, it should flush all the data to PMEM.
>
> - In case of file extending writes, virtiofs falls back to regular
>   FUSE_WRITE path (and not use DAX), and in that case host pmem driver
>   should make sure writes are flushed to pmem immediately.
>
> Are there any other path I am missing. If not, looks like we might not
> have to use flushcache version in virtiofs at all as long as we are not
> offering guest applications user space flushes and MAP_SYNC support.
>
> We still might have to use machine check safe variant though as loads
> might generate synchronous machine check. What's not clear to me is
> that if this MC safe variant should be used only in case of PMEM or
> should it be used in case of non-PMEM as well.

It should be used on any memory address that can throw exception on
load, which is any physical address, in paths that can tolerate
memcpy() returning an error code, most I/O paths, and can tolerate
slower copy performance on older platforms that do not support MC
recovery with fast string operations, to date that's only PMEM users.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 5/5] dax: always use _copy_mc_to_iter in dax_copy_to_iter

2021-12-13 Thread Dan Williams
On Mon, Dec 13, 2021 at 12:20 AM Christoph Hellwig  wrote:
>
> On Sun, Dec 12, 2021 at 06:48:05AM -0800, Dan Williams wrote:
> > On Fri, Dec 10, 2021 at 6:05 AM Vivek Goyal  wrote:
> > >
> > > On Thu, Dec 09, 2021 at 07:38:28AM +0100, Christoph Hellwig wrote:
> > > > While using the MC-safe copy routines is rather pointless on a virtual 
> > > > device
> > > > like virtiofs,
> > >
> > > I was wondering about that. Is it completely pointless.
> > >
> > > Typically we are just mapping host page cache into qemu address space.
> > > That shows as virtiofs device pfn in guest and that pfn is mapped into
> > > guest application address space in mmap() call.
> > >
> > > Given on host its DRAM, so I would not expect machine check on load side
> > > so there was no need to use machine check safe variant.
> >
> > That's a broken assumption, DRAM experiences multi-bit ECC errors.
> > Machine checks, data aborts, etc existed before PMEM.
>
> So the conclusion here is that we should always use the mc safe variant?

The proposal is one of the following:

1/ any paths not currently using the mc safe variant should continue
not using it to avoid the performance regression on older platforms,
i.e. drop this patch.

2/ add plumbing to switch to mcsafe variant, but only on newer
platforms, incremental new patch

3/ always use the mc safe variant, keep this patch

We could go with 3/ and see who screams, because 3/ is smallest
ongoing maintenance burden. However, I feel like 1/ is the path of
least resistance until the platforms with the need to do 'careful'
copying age out of use.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 5/5] dax: always use _copy_mc_to_iter in dax_copy_to_iter

2021-12-12 Thread Dan Williams
On Wed, Dec 8, 2021 at 10:38 PM Christoph Hellwig  wrote:
>
> While using the MC-safe copy routines is rather pointless on a virtual device
> like virtiofs, it also isn't harmful at all.  So just use _copy_mc_to_iter
> unconditionally to simplify the code.

>From a correctness perspective, yes, but from a performance perspective, see:

enable_copy_mc_fragile()

...on those platforms fast-string copy implementation is replaced with
a manual unrolled copy. So this will cause a performance regression on
those platforms.

How about let's keep this as is / still only use it for PMEM where end
users are already dealing with the performance difference across
platforms? I considered exporting an indicator of which backend
routine has been selected from arch/x86/lib/copy_mc.c, but it got
messy quickly so I fell back to just keeping the status quo.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 5/5] dax: always use _copy_mc_to_iter in dax_copy_to_iter

2021-12-12 Thread Dan Williams
On Fri, Dec 10, 2021 at 6:05 AM Vivek Goyal  wrote:
>
> On Thu, Dec 09, 2021 at 07:38:28AM +0100, Christoph Hellwig wrote:
> > While using the MC-safe copy routines is rather pointless on a virtual 
> > device
> > like virtiofs,
>
> I was wondering about that. Is it completely pointless.
>
> Typically we are just mapping host page cache into qemu address space.
> That shows as virtiofs device pfn in guest and that pfn is mapped into
> guest application address space in mmap() call.
>
> Given on host its DRAM, so I would not expect machine check on load side
> so there was no need to use machine check safe variant.

That's a broken assumption, DRAM experiences multi-bit ECC errors.
Machine checks, data aborts, etc existed before PMEM.

>  But what if host
> filesystem is on persistent memory and using DAX. In that case load in
> guest can trigger a machine check. Not sure if that machine check will
> actually travel into the guest and unblock read() operation or not.
>
> But this sounds like a good change from virtiofs point of view, anyway.
>
> Thanks
> Vivek
>
>
> > it also isn't harmful at all.  So just use _copy_mc_to_iter
> > unconditionally to simplify the code.
> >
> > Signed-off-by: Christoph Hellwig 
> > ---
> >  drivers/dax/super.c | 10 --
> >  fs/fuse/virtio_fs.c |  1 -
> >  include/linux/dax.h |  1 -
> >  3 files changed, 12 deletions(-)
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index ff676a07480c8..fe783234ca669 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -107,8 +107,6 @@ enum dax_device_flags {
> >   DAXDEV_SYNC,
> >   /* do not use uncached operations to write data */
> >   DAXDEV_CACHED,
> > - /* do not use mcsafe operations to read data */
> > - DAXDEV_NOMCSAFE,
> >  };
> >
> >  /**
> > @@ -171,8 +169,6 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, 
> > pgoff_t pgoff, void *addr,
> >* via access_ok() in vfs_red, so use the 'no check' version to bypass
> >* the HARDENED_USERCOPY overhead.
> >*/
> > - if (test_bit(DAXDEV_NOMCSAFE, _dev->flags))
> > - return _copy_to_iter(addr, bytes, i);
> >   return _copy_mc_to_iter(addr, bytes, i);
> >  }
> >
> > @@ -242,12 +238,6 @@ void set_dax_cached(struct dax_device *dax_dev)
> >  }
> >  EXPORT_SYMBOL_GPL(set_dax_cached);
> >
> > -void set_dax_nomcsafe(struct dax_device *dax_dev)
> > -{
> > - set_bit(DAXDEV_NOMCSAFE, _dev->flags);
> > -}
> > -EXPORT_SYMBOL_GPL(set_dax_nomcsafe);
> > -
> >  bool dax_alive(struct dax_device *dax_dev)
> >  {
> >   lockdep_assert_held(_srcu);
> > diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> > index 754319ce2a29b..d9c20b148ac19 100644
> > --- a/fs/fuse/virtio_fs.c
> > +++ b/fs/fuse/virtio_fs.c
> > @@ -838,7 +838,6 @@ static int virtio_fs_setup_dax(struct virtio_device 
> > *vdev, struct virtio_fs *fs)
> >   if (IS_ERR(fs->dax_dev))
> >   return PTR_ERR(fs->dax_dev);
> >   set_dax_cached(fs->dax_dev);
> > - set_dax_nomcsafe(fs->dax_dev);
> >   return devm_add_action_or_reset(>dev, virtio_fs_cleanup_dax,
> >   fs->dax_dev);
> >  }
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index d22cbf03d37d2..d267331bc37e7 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -90,7 +90,6 @@ static inline bool daxdev_mapping_supported(struct 
> > vm_area_struct *vma,
> >  #endif
> >
> >  void set_dax_cached(struct dax_device *dax_dev);
> > -void set_dax_nomcsafe(struct dax_device *dax_dev);
> >
> >  struct writeback_control;
> >  #if defined(CONFIG_BLOCK) && defined(CONFIG_FS_DAX)
> > --
> > 2.30.2
> >
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 4/5] dax: remove the copy_from_iter and copy_to_iter methods

2021-12-12 Thread Dan Williams
On Fri, Dec 10, 2021 at 6:17 AM Vivek Goyal  wrote:
>
> On Thu, Dec 09, 2021 at 07:38:27AM +0100, Christoph Hellwig wrote:
> > These methods indirect the actual DAX read/write path.  In the end pmem
> > uses magic flush and mc safe variants and fuse and dcssblk use plain ones
> > while device mapper picks redirects to the underlying device.
> >
> > Add set_dax_virtual() and set_dax_nomcsafe() APIs for fuse to skip these
> > special variants, then use them everywhere as they fall back to the plain
> > ones on s390 anyway and remove an indirect call from the read/write path
> > as well as a lot of boilerplate code.
> >
> > Signed-off-by: Christoph Hellwig 
> > ---
> >  drivers/dax/super.c   | 36 ++--
> >  drivers/md/dm-linear.c| 20 -
> >  drivers/md/dm-log-writes.c| 80 ---
> >  drivers/md/dm-stripe.c| 20 -
> >  drivers/md/dm.c   | 50 --
> >  drivers/nvdimm/pmem.c | 20 -
> >  drivers/s390/block/dcssblk.c  | 14 --
> >  fs/dax.c  |  5 ---
> >  fs/fuse/virtio_fs.c   | 19 +
> >  include/linux/dax.h   |  9 ++--
> >  include/linux/device-mapper.h |  4 --
> >  11 files changed, 37 insertions(+), 240 deletions(-)
> >
>
> [..]
> > diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> > index 5c03a0364a9bb..754319ce2a29b 100644
> > --- a/fs/fuse/virtio_fs.c
> > +++ b/fs/fuse/virtio_fs.c
> > @@ -753,20 +753,6 @@ static long virtio_fs_direct_access(struct dax_device 
> > *dax_dev, pgoff_t pgoff,
> >   return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
> >  }
> >
> > -static size_t virtio_fs_copy_from_iter(struct dax_device *dax_dev,
> > -pgoff_t pgoff, void *addr,
> > -size_t bytes, struct iov_iter *i)
> > -{
> > - return copy_from_iter(addr, bytes, i);
> > -}
> > -
> > -static size_t virtio_fs_copy_to_iter(struct dax_device *dax_dev,
> > -pgoff_t pgoff, void *addr,
> > -size_t bytes, struct iov_iter *i)
> > -{
> > - return copy_to_iter(addr, bytes, i);
> > -}
> > -
> >  static int virtio_fs_zero_page_range(struct dax_device *dax_dev,
> >pgoff_t pgoff, size_t nr_pages)
> >  {
> > @@ -783,8 +769,6 @@ static int virtio_fs_zero_page_range(struct dax_device 
> > *dax_dev,
> >
> >  static const struct dax_operations virtio_fs_dax_ops = {
> >   .direct_access = virtio_fs_direct_access,
> > - .copy_from_iter = virtio_fs_copy_from_iter,
> > - .copy_to_iter = virtio_fs_copy_to_iter,
> >   .zero_page_range = virtio_fs_zero_page_range,
> >  };
> >
> > @@ -853,7 +837,8 @@ static int virtio_fs_setup_dax(struct virtio_device 
> > *vdev, struct virtio_fs *fs)
> >   fs->dax_dev = alloc_dax(fs, _fs_dax_ops);
> >   if (IS_ERR(fs->dax_dev))
> >   return PTR_ERR(fs->dax_dev);
> > -
> > + set_dax_cached(fs->dax_dev);
>
> Looks good to me from virtiofs point of view.
>
> Reviewed-by: Vivek Goyal 
>
> Going forward, I am wondering should virtiofs use flushcache version as
> well. What if host filesystem is using DAX and mapping persistent memory
> pfn directly into qemu address space. I have never tested that.
>
> Right now we are relying on applications to do fsync/msync on virtiofs
> for data persistence.

This sounds like it would need coordination with a paravirtualized
driver that can indicate whether the host side is pmem or not, like
the virtio_pmem driver. However, if the guest sends any fsync/msync
you would still need to go explicitly cache flush any dirty page
because you can't necessarily trust that the guest did that already.

>
> > + set_dax_nomcsafe(fs->dax_dev);
> >   return devm_add_action_or_reset(>dev, virtio_fs_cleanup_dax,
> >   fs->dax_dev);
> >  }
>
> Thanks
> Vivek
>
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 4/5] dax: remove the copy_from_iter and copy_to_iter methods

2021-12-12 Thread Dan Williams
On Wed, Dec 8, 2021 at 10:38 PM Christoph Hellwig  wrote:
>
> These methods indirect the actual DAX read/write path.  In the end pmem
> uses magic flush and mc safe variants and fuse and dcssblk use plain ones
> while device mapper picks redirects to the underlying device.
>
> Add set_dax_virtual() and set_dax_nomcsafe() APIs for fuse to skip these
> special variants, then use them everywhere as they fall back to the plain
> ones on s390 anyway and remove an indirect call from the read/write path
> as well as a lot of boilerplate code.
>
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/dax/super.c   | 36 ++--
>  drivers/md/dm-linear.c| 20 -
>  drivers/md/dm-log-writes.c| 80 ---
>  drivers/md/dm-stripe.c| 20 -
>  drivers/md/dm.c   | 50 --
>  drivers/nvdimm/pmem.c | 20 -
>  drivers/s390/block/dcssblk.c  | 14 --
>  fs/dax.c  |  5 ---
>  fs/fuse/virtio_fs.c   | 19 +
>  include/linux/dax.h   |  9 ++--
>  include/linux/device-mapper.h |  4 --
>  11 files changed, 37 insertions(+), 240 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index e81d5ee57390f..ff676a07480c8 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -105,6 +105,10 @@ enum dax_device_flags {
> DAXDEV_WRITE_CACHE,
> /* flag to check if device supports synchronous flush */
> DAXDEV_SYNC,
> +   /* do not use uncached operations to write data */
> +   DAXDEV_CACHED,
> +   /* do not use mcsafe operations to read data */
> +   DAXDEV_NOMCSAFE,

Linus did not like the mcsafe name, and this brings it back. Let's
flip the polarity to positively indicate which routine to use, and to
match the 'nofault' style which says "copy and handle faults".

/* do not leave the caches dirty after writes */
DAXDEV_NOCACHE

/* handle CPU fetch exceptions during reads */
DAXDEV_NOMC

...and then flip the use cases around.

Otherwise, nice cleanup. In retrospect I took the feedback to push
this decision to a driver a bit too literally, this is much better.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/5] dax: remove the DAXDEV_F_SYNC flag

2021-12-12 Thread Dan Williams
On Wed, Dec 8, 2021 at 10:38 PM Christoph Hellwig  wrote:
>
> Remove the DAXDEV_F_SYNC flag and thus the flags argument to alloc_dax and
> just let the drivers call set_dax_synchronous directly.
>
> Signed-off-by: Christoph Hellwig 

Sure, looks good to me.

Reviewed-by: Dan Williams 

> ---
>  drivers/dax/bus.c| 3 ++-
>  drivers/dax/super.c  | 6 +-
>  drivers/md/dm.c  | 2 +-
>  drivers/nvdimm/pmem.c| 7 +++
>  drivers/s390/block/dcssblk.c | 4 ++--
>  fs/fuse/virtio_fs.c  | 2 +-
>  include/linux/dax.h  | 8 ++--
>  7 files changed, 12 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 6683d42c32c56..da2a14d096d29 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1324,11 +1324,12 @@ struct dev_dax *devm_create_dev_dax(struct 
> dev_dax_data *data)
>  * No dax_operations since there is no access to this device outside 
> of
>  * mmap of the resulting character device.
>  */
> -   dax_dev = alloc_dax(dev_dax, NULL, DAXDEV_F_SYNC);
> +   dax_dev = alloc_dax(dev_dax, NULL);
> if (IS_ERR(dax_dev)) {
> rc = PTR_ERR(dax_dev);
> goto err_alloc_dax;
> }
> +   set_dax_synchronous(dax_dev);
>
> /* a device_dax instance is dead while the driver is not attached */
> kill_dax(dax_dev);
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index e18155f43a635..e81d5ee57390f 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -345,8 +345,7 @@ static struct dax_device *dax_dev_get(dev_t devt)
> return dax_dev;
>  }
>
> -struct dax_device *alloc_dax(void *private, const struct dax_operations *ops,
> -   unsigned long flags)
> +struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
>  {
> struct dax_device *dax_dev;
> dev_t devt;
> @@ -366,9 +365,6 @@ struct dax_device *alloc_dax(void *private, const struct 
> dax_operations *ops,
>
> dax_dev->ops = ops;
> dax_dev->private = private;
> -   if (flags & DAXDEV_F_SYNC)
> -   set_dax_synchronous(dax_dev);
> -
> return dax_dev;
>
>   err_dev:
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 4e997c02bb0a0..f4b972af10928 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1765,7 +1765,7 @@ static struct mapped_device *alloc_dev(int minor)
> sprintf(md->disk->disk_name, "dm-%d", minor);
>
> if (IS_ENABLED(CONFIG_FS_DAX)) {
> -   md->dax_dev = alloc_dax(md, _dax_ops, 0);
> +   md->dax_dev = alloc_dax(md, _dax_ops);
> if (IS_ERR(md->dax_dev)) {
> md->dax_dev = NULL;
> goto bad;
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 8294f1c701baa..85b3339286bd8 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -402,7 +402,6 @@ static int pmem_attach_disk(struct device *dev,
> struct gendisk *disk;
> void *addr;
> int rc;
> -   unsigned long flags = 0UL;
>
> pmem = devm_kzalloc(dev, sizeof(*pmem), GFP_KERNEL);
> if (!pmem)
> @@ -495,13 +494,13 @@ static int pmem_attach_disk(struct device *dev,
> nvdimm_badblocks_populate(nd_region, >bb, _range);
> disk->bb = >bb;
>
> -   if (is_nvdimm_sync(nd_region))
> -   flags = DAXDEV_F_SYNC;
> -   dax_dev = alloc_dax(pmem, _dax_ops, flags);
> +   dax_dev = alloc_dax(pmem, _dax_ops);
> if (IS_ERR(dax_dev)) {
> rc = PTR_ERR(dax_dev);
> goto out;
> }
> +   if (is_nvdimm_sync(nd_region))
> +   set_dax_synchronous(dax_dev);
> rc = dax_add_host(dax_dev, disk);
> if (rc)
> goto out_cleanup_dax;
> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> index e65e83764d1ce..10823debc09bd 100644
> --- a/drivers/s390/block/dcssblk.c
> +++ b/drivers/s390/block/dcssblk.c
> @@ -686,13 +686,13 @@ dcssblk_add_store(struct device *dev, struct 
> device_attribute *attr, const char
> if (rc)
> goto put_dev;
>
> -   dev_info->dax_dev = alloc_dax(dev_info, _dax_ops,
> -   DAXDEV_F_SYNC);
> +   dev_info->dax_dev = alloc_dax(dev_info, _dax_ops);
> if (IS_ERR(dev_info->dax_dev)) {
> rc = PTR_ERR(dev_info->dax_dev);
> dev_info->dax_dev = NULL;
> goto put_dev;
> 

Re: [PATCH 2/5] dax: simplify dax_synchronous and set_dax_synchronous

2021-12-12 Thread Dan Williams
On Wed, Dec 8, 2021 at 10:38 PM Christoph Hellwig  wrote:
>
> Remove the pointless wrappers.
>
> Signed-off-by: Christoph Hellwig 

Looks good, not sure why those ever existed.

Reviewed-by: Dan Williams 

> ---
>  drivers/dax/super.c |  8 
>  include/linux/dax.h | 12 ++--
>  2 files changed, 6 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index e7152a6c4cc40..e18155f43a635 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -208,17 +208,17 @@ bool dax_write_cache_enabled(struct dax_device *dax_dev)
>  }
>  EXPORT_SYMBOL_GPL(dax_write_cache_enabled);
>
> -bool __dax_synchronous(struct dax_device *dax_dev)
> +bool dax_synchronous(struct dax_device *dax_dev)
>  {
> return test_bit(DAXDEV_SYNC, _dev->flags);
>  }
> -EXPORT_SYMBOL_GPL(__dax_synchronous);
> +EXPORT_SYMBOL_GPL(dax_synchronous);
>
> -void __set_dax_synchronous(struct dax_device *dax_dev)
> +void set_dax_synchronous(struct dax_device *dax_dev)
>  {
> set_bit(DAXDEV_SYNC, _dev->flags);
>  }
> -EXPORT_SYMBOL_GPL(__set_dax_synchronous);
> +EXPORT_SYMBOL_GPL(set_dax_synchronous);
>
>  bool dax_alive(struct dax_device *dax_dev)
>  {
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 87ae4c9b1d65b..3bd1fdb5d5f4b 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -48,16 +48,8 @@ void put_dax(struct dax_device *dax_dev);
>  void kill_dax(struct dax_device *dax_dev);
>  void dax_write_cache(struct dax_device *dax_dev, bool wc);
>  bool dax_write_cache_enabled(struct dax_device *dax_dev);
> -bool __dax_synchronous(struct dax_device *dax_dev);
> -static inline bool dax_synchronous(struct dax_device *dax_dev)
> -{
> -   return  __dax_synchronous(dax_dev);
> -}
> -void __set_dax_synchronous(struct dax_device *dax_dev);
> -static inline void set_dax_synchronous(struct dax_device *dax_dev)
> -{
> -   __set_dax_synchronous(dax_dev);
> -}
> +bool dax_synchronous(struct dax_device *dax_dev);
> +void set_dax_synchronous(struct dax_device *dax_dev);
>  /*
>   * Check if given mapping is supported by the file / underlying device.
>   */
> --
> 2.30.2
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/5] uio: remove copy_from_iter_flushcache() and copy_mc_to_iter()

2021-12-12 Thread Dan Williams
On Wed, Dec 8, 2021 at 10:38 PM Christoph Hellwig  wrote:
>
> These two wrappers are never used.
>
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/nvdimm/pmem.c |  4 ++--
>  include/linux/uio.h   | 20 +---
>  2 files changed, 3 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 4190c8c46ca88..8294f1c701baa 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -302,8 +302,8 @@ static long pmem_dax_direct_access(struct dax_device 
> *dax_dev,
>  }
>
>  /*
> - * Use the 'no check' versions of copy_from_iter_flushcache() and
> - * copy_mc_to_iter() to bypass HARDENED_USERCOPY overhead. Bounds
> + * Use the 'no check' versions of _copy_from_iter_flushcache() and
> + * _copy_mc_to_iter() to bypass HARDENED_USERCOPY overhead. Bounds
>   * checking, both file offset and device offset, is handled by
>   * dax_iomap_actor()
>   */

This comment change does not make sense since it is saying why pmem is
using the "_" versions. However, I assume this whole comment goes away
in a later patch.

> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 6350354f97e90..494d552c1d663 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -196,7 +196,7 @@ bool copy_from_iter_full_nocache(void *addr, size_t 
> bytes, struct iov_iter *i)
>  #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
>  /*
>   * Note, users like pmem that depend on the stricter semantics of
> - * copy_from_iter_flushcache() than copy_from_iter_nocache() must check for
> + * _copy_from_iter_flushcache() than _copy_from_iter_nocache() must check for
>   * IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE) before assuming that the
>   * destination is flushed from the cache on return.
>   */

Same here.

> @@ -211,24 +211,6 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, 
> struct iov_iter *i);
>  #define _copy_mc_to_iter _copy_to_iter
>  #endif
>
> -static __always_inline __must_check
> -size_t copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter 
> *i)
> -{
> -   if (unlikely(!check_copy_size(addr, bytes, false)))
> -   return 0;
> -   else
> -   return _copy_from_iter_flushcache(addr, bytes, i);
> -}
> -
> -static __always_inline __must_check
> -size_t copy_mc_to_iter(void *addr, size_t bytes, struct iov_iter *i)
> -{
> -   if (unlikely(!check_copy_size(addr, bytes, true)))
> -   return 0;
> -   else
> -   return _copy_mc_to_iter(addr, bytes, i);
> -}
> -
>  size_t iov_iter_zero(size_t bytes, struct iov_iter *);
>  unsigned long iov_iter_alignment(const struct iov_iter *i);
>  unsigned long iov_iter_gap_alignment(const struct iov_iter *i);
> --
> 2.30.2
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 29/29] fsdax: don't require CONFIG_BLOCK

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> The file system DAX code now does not require the block code.  So allow
> building a kernel with fuse DAX but not block layer.

Looks good to me.

Reviewed-by: Dan Williams 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 28/29] iomap: build the block based code conditionally

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Only build the block based iomap code if CONFIG_BLOCK is set.  Currently
> that is always the case, but it will change soon.

Looks good.

Reviewed-by: Dan Williams 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 27/29] dax: fix up some of the block device related ifdefs

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> The DAX device <-> block device association is only enabled if
> CONFIG_BLOCK is enabled.  Update dax.h to account for that and use
> the right conditions for the fs_put_dax stub as well.

Looks good to me.

Reviewed-by: Dan Williams 

>
> Signed-off-by: Christoph Hellwig 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 26/29] fsdax: shift partition offset handling into the file systems

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Remove the last user of ->bdev in dax.c by requiring the file system to
> pass in an address that already includes the DAX offset.  As part of the
> only set ->bdev or ->daxdev when actually required in the ->iomap_begin
> methods.

Changes look good except for what looks like an argument position
fixup needed for an xfs_bmbt_to_iomap() caller below...

>
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/dax.c |  6 +-
>  fs/erofs/data.c  | 11 --
>  fs/erofs/internal.h  |  1 +
>  fs/ext2/inode.c  |  8 +--
>  fs/ext4/inode.c  | 16 +-
>  fs/xfs/libxfs/xfs_bmap.c |  4 ++--
>  fs/xfs/xfs_aops.c|  2 +-
>  fs/xfs/xfs_iomap.c   | 45 +---
>  fs/xfs/xfs_iomap.h   |  5 +++--
>  fs/xfs/xfs_pnfs.c|  2 +-
>  10 files changed, 63 insertions(+), 37 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 0bd6cdcbacfc4..2c13c681edf09 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -711,11 +711,7 @@ int dax_invalidate_mapping_entry_sync(struct 
> address_space *mapping,
>
>  static pgoff_t dax_iomap_pgoff(const struct iomap *iomap, loff_t pos)
>  {
> -   phys_addr_t paddr = iomap->addr + (pos & PAGE_MASK) - iomap->offset;
> -
> -   if (iomap->bdev)
> -   paddr += (get_start_sect(iomap->bdev) << SECTOR_SHIFT);
> -   return PHYS_PFN(paddr);
> +   return PHYS_PFN(iomap->addr + (pos & PAGE_MASK) - iomap->offset);
>  }
>
>  static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter 
> *iter)
> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
> index 0e35ef3f9f3d7..9b1bb177ce303 100644
> --- a/fs/erofs/data.c
> +++ b/fs/erofs/data.c
[..]
   }
> @@ -215,9 +218,13 @@ static int erofs_iomap_begin(struct inode *inode, loff_t 
> offset, loff_t length,
> if (ret)
> return ret;
>
> -   iomap->bdev = mdev.m_bdev;
> -   iomap->dax_dev = mdev.m_daxdev;
> iomap->offset = map.m_la;
> +   if (flags & IOMAP_DAX) {
> +   iomap->dax_dev = mdev.m_daxdev;
> +   iomap->offset += mdev.m_dax_part_off;
> +   } else {
> +   iomap->bdev = mdev.m_bdev;
> +   }

Ah, that's what IOMAP_DAX is for, to stop making iomap carry bdev
details unnecessarily.

[..]
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 704292c6ce0c7..74dbf1fd99d39 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -54,7 +54,8 @@ xfs_bmbt_to_iomap(
> struct xfs_inode*ip,
> struct iomap*iomap,
> struct xfs_bmbt_irec*imap,
> -   u16 flags)
> +   unsigned intflags,
> +   u16 iomap_flags)

It would be nice if the compiler could help with making sure that
right 'flags' values are passed to the right 'flags' parameter, but I
can't think of

[..]
> @@ -1053,23 +1061,24 @@ xfs_buffered_write_iomap_begin(
>  */
> xfs_iunlock(ip, XFS_ILOCK_EXCL);
> trace_xfs_iomap_alloc(ip, offset, count, allocfork, );
> -   return xfs_bmbt_to_iomap(ip, iomap, , IOMAP_F_NEW);
> +   return xfs_bmbt_to_iomap(ip, iomap, , flags, IOMAP_F_NEW);
>
>  found_imap:
> xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -   return xfs_bmbt_to_iomap(ip, iomap, , 0);
> +   return xfs_bmbt_to_iomap(ip, iomap, , flags, 0);

The iomap flags are supposed to be the last argument, right?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 25/29] dax: return the partition offset from fs_dax_get_by_bdev

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Prepare from removing the block_device from the DAX I/O path by returning

s/from removing/for the removal of/

> the partition offset from fs_dax_get_by_bdev so that the file systems
> have it at hand for use during I/O.
>
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/dax/super.c | 9 ++---
>  drivers/md/dm.c | 4 ++--
>  fs/erofs/internal.h | 2 ++
>  fs/erofs/super.c| 4 ++--
>  fs/ext2/ext2.h  | 1 +
>  fs/ext2/super.c | 2 +-
>  fs/ext4/ext4.h  | 1 +
>  fs/ext4/super.c | 2 +-
>  fs/xfs/xfs_buf.c| 2 +-
>  fs/xfs/xfs_buf.h| 1 +
>  include/linux/dax.h | 6 --
>  11 files changed, 22 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index c0910687fbcb2..cc32dcf71c116 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -70,17 +70,20 @@ EXPORT_SYMBOL_GPL(dax_remove_host);
>  /**
>   * dax_get_by_host() - temporary lookup mechanism for filesystem-dax
>   * @bdev: block device to find a dax_device for
> + * @start_off: returns the byte offset into the dax_device that @bdev starts
>   */
> -struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
> +struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev, u64 
> *start_off)
>  {
> struct dax_device *dax_dev;
> +   u64 part_size;
> int id;
>
> if (!blk_queue_dax(bdev->bd_disk->queue))
> return NULL;
>
> -   if ((get_start_sect(bdev) * SECTOR_SIZE) % PAGE_SIZE ||
> -   (bdev_nr_sectors(bdev) * SECTOR_SIZE) % PAGE_SIZE) {
> +   *start_off = get_start_sect(bdev) * SECTOR_SIZE;
> +   part_size = bdev_nr_sectors(bdev) * SECTOR_SIZE;
> +   if (*start_off % PAGE_SIZE || part_size % PAGE_SIZE) {
> pr_info("%pg: error: unaligned partition for dax\n", bdev);
> return NULL;
> }
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 282008afc465f..5ea6115d19bdc 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -637,7 +637,7 @@ static int open_table_device(struct table_device *td, 
> dev_t dev,
>  struct mapped_device *md)
>  {
> struct block_device *bdev;
> -
> +   u64 part_off;
> int r;
>
> BUG_ON(td->dm_dev.bdev);
> @@ -653,7 +653,7 @@ static int open_table_device(struct table_device *td, 
> dev_t dev,
> }
>
> td->dm_dev.bdev = bdev;
> -   td->dm_dev.dax_dev = fs_dax_get_by_bdev(bdev);
> +   td->dm_dev.dax_dev = fs_dax_get_by_bdev(bdev, _off);

Perhaps allow NULL as an argument for callers that do not care about
the start offset?


Otherwise, looks good / clever.

Reviewed-by: Dan Williams 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 24/29] xfs: use xfs_direct_write_iomap_ops for DAX zeroing

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> While the buffered write iomap ops do work due to the fact that zeroing
> never allocates blocks, the DAX zeroing should use the direct ops just
> like actual DAX I/O.
>

I always wondered about this, change looks good to me.

Reviewed-by: Dan Williams 

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/xfs/xfs_iomap.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 8cef3b68cba78..704292c6ce0c7 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1324,7 +1324,7 @@ xfs_zero_range(
>
> if (IS_DAX(inode))
> return dax_zero_range(inode, pos, len, did_zero,
> - _buffered_write_iomap_ops);
> + _direct_write_iomap_ops);
> return iomap_zero_range(inode, pos, len, did_zero,
> _buffered_write_iomap_ops);
>  }
> @@ -1339,7 +1339,7 @@ xfs_truncate_page(
>
> if (IS_DAX(inode))
> return dax_truncate_page(inode, pos, did_zero,
> -   _buffered_write_iomap_ops);
> +   _direct_write_iomap_ops);
> return iomap_truncate_page(inode, pos, did_zero,
>_buffered_write_iomap_ops);
>  }
> --
> 2.30.2
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 23/29] xfs: use IOMAP_DAX to check for DAX mappings

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Use the explicit DAX flag instead of checking the inode flag in the
> iomap code.

It's not immediately clear to me why this is a net benefit, are you
anticipating inode-less operations? With reflink and multi-inode
operations a single iomap flag seems insufficient, no?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 22/29] iomap: add a IOMAP_DAX flag

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Add a flag so that the file system can easily detect DAX operations.

Looks ok, but I would have preferred a quick note about the rationale
here before needing to read other patches to figure that out.

If you add that you can add:

Reviewed-by: Dan Williams 

>
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/dax.c  | 7 ---
>  include/linux/iomap.h | 1 +
>  2 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 5b52b878124ac..0bd6cdcbacfc4 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1180,7 +1180,7 @@ int dax_zero_range(struct inode *inode, loff_t pos, 
> loff_t len, bool *did_zero,
> .inode  = inode,
> .pos= pos,
> .len= len,
> -   .flags  = IOMAP_ZERO,
> +   .flags  = IOMAP_DAX | IOMAP_ZERO,
> };
> int ret;
>
> @@ -1308,6 +1308,7 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
> .inode  = iocb->ki_filp->f_mapping->host,
> .pos= iocb->ki_pos,
> .len= iov_iter_count(iter),
> +   .flags  = IOMAP_DAX,
> };
> loff_t done = 0;
> int ret;
> @@ -1461,7 +1462,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
> .inode  = mapping->host,
> .pos= (loff_t)vmf->pgoff << PAGE_SHIFT,
> .len= PAGE_SIZE,
> -   .flags  = IOMAP_FAULT,
> +   .flags  = IOMAP_DAX | IOMAP_FAULT,
> };
> vm_fault_t ret = 0;
> void *entry;
> @@ -1570,7 +1571,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
> struct iomap_iter iter = {
> .inode  = mapping->host,
> .len= PMD_SIZE,
> -   .flags  = IOMAP_FAULT,
> +   .flags  = IOMAP_DAX | IOMAP_FAULT,
> };
> vm_fault_t ret = VM_FAULT_FALLBACK;
> pgoff_t max_pgoff;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 6d1b08d0ae930..146a7e3e3ea11 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -141,6 +141,7 @@ struct iomap_page_ops {
>  #define IOMAP_NOWAIT   (1 << 5) /* do not block */
>  #define IOMAP_OVERWRITE_ONLY   (1 << 6) /* only pure overwrites allowed */
>  #define IOMAP_UNSHARE  (1 << 7) /* unshare_file_range */
> +#define IOMAP_DAX  (1 << 8) /* DAX mapping */
>
>  struct iomap_ops {
> /*
> --
> 2.30.2
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 21/29] xfs: move dax device handling into xfs_{alloc, free}_buftarg

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Hide the DAX device lookup from the xfs_super.c code.
>
> Reviewed-by: Christoph Hellwig 

That's an interesting spelling of "Signed-off-by", but patch looks
good to me too. I would have expected a robot to complain about
missing sign-off?

Reviewed-by: Dan Williams 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 20/29] ext4: cleanup the dax handling in ext4_fill_super

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Only call fs_dax_get_by_bdev once the sbi has been allocated and remove
> the need for the dax_dev local variable.

Looks good.

Reviewed-by: Dan Williams 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 19/29] ext2: cleanup the dax handling in ext2_fill_super

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Only call fs_dax_get_by_bdev once the sbi has been allocated and remove
> the need for the dax_dev local variable.

Looks good.

Reviewed-by: Dan Williams 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 18/29] fsdax: decouple zeroing from the iomap buffered I/O code

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Unshare the DAX and iomap buffered I/O page zeroing code.  This code
> previously did a IS_DAX check deep inside the iomap code, which in
> fact was the only DAX check in the code.  Instead move these checks
> into the callers.  Most callers already have DAX special casing anyway
> and XFS will need it for reflink support as well.

Looks ok, a tangential question below about iomap_truncate_page(), but
you can add:

Reviewed-by: Dan Williams 

>
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/dax.c   | 77 ++
>  fs/ext2/inode.c|  6 ++--
>  fs/ext4/inode.c|  4 +--
>  fs/iomap/buffered-io.c | 35 +++
>  fs/xfs/xfs_iomap.c |  6 
>  include/linux/dax.h|  6 +++-
>  6 files changed, 91 insertions(+), 43 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index dc9ebeff850ab..5b52b878124ac 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1135,24 +1135,73 @@ static int dax_memzero(struct dax_device *dax_dev, 
> pgoff_t pgoff,
> return rc;
>  }
>
> -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
> +static loff_t dax_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  {
> -   pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
> -   long rc, id;
> -   unsigned offset = offset_in_page(pos);
> -   unsigned size = min_t(u64, PAGE_SIZE - offset, length);
> +   const struct iomap *iomap = >iomap;
> +   const struct iomap *srcmap = iomap_iter_srcmap(iter);
> +   loff_t pos = iter->pos;
> +   loff_t length = iomap_length(iter);
> +   loff_t written = 0;
> +
> +   /* already zeroed?  we're done. */
> +   if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> +   return length;
> +
> +   do {
> +   unsigned offset = offset_in_page(pos);
> +   unsigned size = min_t(u64, PAGE_SIZE - offset, length);
> +   pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
> +   long rc;
> +   int id;
>
> -   id = dax_read_lock();
> -   if (IS_ALIGNED(pos, PAGE_SIZE) && size == PAGE_SIZE)
> -   rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
> -   else
> -   rc = dax_memzero(iomap->dax_dev, pgoff, offset, size);
> -   dax_read_unlock(id);
> +   id = dax_read_lock();
> +   if (IS_ALIGNED(pos, PAGE_SIZE) && size == PAGE_SIZE)
> +   rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
> +   else
> +   rc = dax_memzero(iomap->dax_dev, pgoff, offset, size);
> +   dax_read_unlock(id);
>
> -   if (rc < 0)
> -   return rc;
> -   return size;
> +   if (rc < 0)
> +   return rc;
> +   pos += size;
> +   length -= size;
> +   written += size;
> +   if (did_zero)
> +   *did_zero = true;
> +   } while (length > 0);
> +
> +   return written;
> +}
> +
> +int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool 
> *did_zero,
> +   const struct iomap_ops *ops)
> +{
> +   struct iomap_iter iter = {
> +   .inode  = inode,
> +   .pos= pos,
> +   .len= len,
> +   .flags  = IOMAP_ZERO,
> +   };
> +   int ret;
> +
> +   while ((ret = iomap_iter(, ops)) > 0)
> +   iter.processed = dax_zero_iter(, did_zero);
> +   return ret;
> +}
> +EXPORT_SYMBOL_GPL(dax_zero_range);
> +
> +int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> +   const struct iomap_ops *ops)
> +{
> +   unsigned int blocksize = i_blocksize(inode);
> +   unsigned int off = pos & (blocksize - 1);
> +
> +   /* Block boundary? Nothing to do */
> +   if (!off)
> +   return 0;

It took me a moment to figure out why this was correct. I see it was
also copied from iomap_truncate_page(). It makes sense for DAX where
blocksize >= PAGE_SIZE so it's always the case that the amount of
capacity to zero relative to a page is from @pos to the end of the
block. Is there something else that protects the blocksize < PAGE_SIZE
case outside of DAX?

Nothing to change for this patch, just a question I had while reviewing.

> +   return dax_zero_range(inode, pos, blocksize - off, did_zero, ops);
>  }
> +EXPORT_SYMBOL_GPL(dax_truncate_page);
>
>  static loff_t dax_iomap_ite

Re: [PATCH 17/29] fsdax: factor out a dax_memzero helper

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Factor out a helper for the "manual" zeroing of a DAX range to clean
> up dax_iomap_zero a lot.
>

Small / optional fixup below:

Reviewed-by: Dan Williams 

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/dax.c | 36 +++-
>  1 file changed, 19 insertions(+), 17 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index d7a923d152240..dc9ebeff850ab 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1121,34 +1121,36 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state 
> *xas, struct vm_fault *vmf,
>  }
>  #endif /* CONFIG_FS_DAX_PMD */
>
> +static int dax_memzero(struct dax_device *dax_dev, pgoff_t pgoff,
> +   unsigned int offset, size_t size)
> +{
> +   void *kaddr;
> +   long rc;
> +
> +   rc = dax_direct_access(dax_dev, pgoff, 1, , NULL);
> +   if (rc >= 0) {

Technically this should be "> 0" because dax_direct_access() returns
nr_available_pages @pgoff, but this isn't broken because
dax_direct_access() converts the "zero pages available" case into
-ERANGE.

> +   memset(kaddr + offset, 0, size);
> +   dax_flush(dax_dev, kaddr + offset, size);
> +   }
> +   return rc;
> +}
> +
>  s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>  {
> pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
> long rc, id;
> -   void *kaddr;
> -   bool page_aligned = false;
> unsigned offset = offset_in_page(pos);
> unsigned size = min_t(u64, PAGE_SIZE - offset, length);
>
> -   if (IS_ALIGNED(pos, PAGE_SIZE) && size == PAGE_SIZE)
> -   page_aligned = true;
> -
> id = dax_read_lock();
> -
> -   if (page_aligned)
> +   if (IS_ALIGNED(pos, PAGE_SIZE) && size == PAGE_SIZE)
> rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
> else
> -   rc = dax_direct_access(iomap->dax_dev, pgoff, 1, , 
> NULL);
> -   if (rc < 0) {
> -   dax_read_unlock(id);
> -   return rc;
> -   }
> -
> -   if (!page_aligned) {
> -   memset(kaddr + offset, 0, size);
> -   dax_flush(iomap->dax_dev, kaddr + offset, size);
> -   }
> +   rc = dax_memzero(iomap->dax_dev, pgoff, offset, size);
> dax_read_unlock(id);
> +
> +   if (rc < 0)
> +   return rc;
> return size;
>  }
>
> --
> 2.30.2
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 16/29] fsdax: simplify the offset check in dax_iomap_zero

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> The file relative offset must have the same alignment as the storage
> offset, so use that and get rid of the call to iomap_sector.

Agree.

Reviewed-by: Dan Williams 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 15/29] xfs: add xfs_zero_range and xfs_truncate_page helpers

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> From: Shiyang Ruan 
>
> Add helpers to prepare for using different DAX operations.
>
> Signed-off-by: Shiyang Ruan 
> [hch: split from a larger patch + slight cleanups]
> Signed-off-by: Christoph Hellwig 

Looks good to me.

Reviewed-by: Dan Williams 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 14/29] fsdax: simplify the pgoff calculation

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Replace the two steps of dax_iomap_sector and bdev_dax_pgoff with a
> single dax_iomap_pgoff helper that avoids lots of cumbersome sector
> conversions.

Looks good,

Reviewed-by: Dan Williams 

>
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/dax/super.c | 14 --
>  fs/dax.c| 35 ++-
>  include/linux/dax.h |  1 -
>  3 files changed, 10 insertions(+), 40 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 803942586d1b6..c0910687fbcb2 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -67,20 +67,6 @@ void dax_remove_host(struct gendisk *disk)
>  }
>  EXPORT_SYMBOL_GPL(dax_remove_host);
>
> -int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
> -   pgoff_t *pgoff)
> -{
> -   sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
> -   phys_addr_t phys_off = (start_sect + sector) * 512;
> -
> -   if (pgoff)
> -   *pgoff = PHYS_PFN(phys_off);
> -   if (phys_off % PAGE_SIZE || size % PAGE_SIZE)
> -   return -EINVAL;
> -   return 0;
> -}
> -EXPORT_SYMBOL(bdev_dax_pgoff);
> -
>  /**
>   * dax_get_by_host() - temporary lookup mechanism for filesystem-dax
>   * @bdev: block device to find a dax_device for
> diff --git a/fs/dax.c b/fs/dax.c
> index e51b4129d1b65..5364549d67a48 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -709,23 +709,22 @@ int dax_invalidate_mapping_entry_sync(struct 
> address_space *mapping,
> return __dax_invalidate_entry(mapping, index, false);
>  }
>
> -static sector_t dax_iomap_sector(const struct iomap *iomap, loff_t pos)
> +static pgoff_t dax_iomap_pgoff(const struct iomap *iomap, loff_t pos)
>  {
> -   return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
> +   phys_addr_t paddr = iomap->addr + (pos & PAGE_MASK) - iomap->offset;
> +
> +   if (iomap->bdev)
> +   paddr += (get_start_sect(iomap->bdev) << SECTOR_SHIFT);
> +   return PHYS_PFN(paddr);
>  }
>
>  static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter 
> *iter)
>  {
> -   sector_t sector = dax_iomap_sector(>iomap, iter->pos);
> +   pgoff_t pgoff = dax_iomap_pgoff(>iomap, iter->pos);
> void *vto, *kaddr;
> -   pgoff_t pgoff;
> long rc;
> int id;
>
> -   rc = bdev_dax_pgoff(iter->iomap.bdev, sector, PAGE_SIZE, );
> -   if (rc)
> -   return rc;
> -
> id = dax_read_lock();
> rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, , NULL);
> if (rc < 0) {
> @@ -1013,14 +1012,10 @@ EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
>  static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
>  pfn_t *pfnp)
>  {
> -   const sector_t sector = dax_iomap_sector(iomap, pos);
> -   pgoff_t pgoff;
> +   pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
> int id, rc;
> long length;
>
> -   rc = bdev_dax_pgoff(iomap->bdev, sector, size, );
> -   if (rc)
> -   return rc;
> id = dax_read_lock();
> length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
>NULL, pfnp);
> @@ -1129,7 +1124,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state 
> *xas, struct vm_fault *vmf,
>  s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>  {
> sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
> -   pgoff_t pgoff;
> +   pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
> long rc, id;
> void *kaddr;
> bool page_aligned = false;
> @@ -1140,10 +1135,6 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct 
> iomap *iomap)
> (size == PAGE_SIZE))
> page_aligned = true;
>
> -   rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, );
> -   if (rc)
> -   return rc;
> -
> id = dax_read_lock();
>
> if (page_aligned)
> @@ -1169,7 +1160,6 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
> *iomi,
> const struct iomap *iomap = >iomap;
> loff_t length = iomap_length(iomi);
> loff_t pos = iomi->pos;
> -   struct block_device *bdev = iomap->bdev;
> struct dax_device *dax_dev = iomap->dax_dev;
> loff_t end = pos + length, done = 0;
> ssize_t ret = 0;
> @@ -1203,9 +1193,8 @@ static loff_t dax_iomap_iter(const struct iomap_iter 
> *i

Re: [PATCH 13/29] fsdax: use a saner calling convention for copy_cow_page_dax

2021-11-23 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Just pass the vm_fault and iomap_iter structures, and figure out the rest
> locally.  Note that this requires moving dax_iomap_sector up in the file.

Looks good,

Reviewed-by: Dan Williams 

>
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/dax.c | 29 +
>  1 file changed, 13 insertions(+), 16 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 73bd1439d8089..e51b4129d1b65 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -709,26 +709,31 @@ int dax_invalidate_mapping_entry_sync(struct 
> address_space *mapping,
> return __dax_invalidate_entry(mapping, index, false);
>  }
>
> -static int copy_cow_page_dax(struct block_device *bdev, struct dax_device 
> *dax_dev,
> -sector_t sector, struct page *to, unsigned long 
> vaddr)
> +static sector_t dax_iomap_sector(const struct iomap *iomap, loff_t pos)
>  {
> +   return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
> +}
> +
> +static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter 
> *iter)
> +{
> +   sector_t sector = dax_iomap_sector(>iomap, iter->pos);
> void *vto, *kaddr;
> pgoff_t pgoff;
> long rc;
> int id;
>
> -   rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, );
> +   rc = bdev_dax_pgoff(iter->iomap.bdev, sector, PAGE_SIZE, );
> if (rc)
> return rc;
>
> id = dax_read_lock();
> -   rc = dax_direct_access(dax_dev, pgoff, 1, , NULL);
> +   rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, , NULL);
> if (rc < 0) {
> dax_read_unlock(id);
> return rc;
> }
> -   vto = kmap_atomic(to);
> -   copy_user_page(vto, kaddr, vaddr, to);
> +   vto = kmap_atomic(vmf->cow_page);
> +   copy_user_page(vto, kaddr, vmf->address, vmf->cow_page);
> kunmap_atomic(vto);
> dax_read_unlock(id);
> return 0;
> @@ -1005,11 +1010,6 @@ int dax_writeback_mapping_range(struct address_space 
> *mapping,
>  }
>  EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
>
> -static sector_t dax_iomap_sector(const struct iomap *iomap, loff_t pos)
> -{
> -   return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
> -}
> -
>  static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
>  pfn_t *pfnp)
>  {
> @@ -1332,19 +1332,16 @@ static vm_fault_t dax_fault_synchronous_pfnp(pfn_t 
> *pfnp, pfn_t pfn)
>  static vm_fault_t dax_fault_cow_page(struct vm_fault *vmf,
> const struct iomap_iter *iter)
>  {
> -   sector_t sector = dax_iomap_sector(>iomap, iter->pos);
> -   unsigned long vaddr = vmf->address;
> vm_fault_t ret;
> int error = 0;
>
> switch (iter->iomap.type) {
> case IOMAP_HOLE:
> case IOMAP_UNWRITTEN:
> -   clear_user_highpage(vmf->cow_page, vaddr);
> +   clear_user_highpage(vmf->cow_page, vmf->address);
> break;
> case IOMAP_MAPPED:
> -   error = copy_cow_page_dax(iter->iomap.bdev, 
> iter->iomap.dax_dev,
> - sector, vmf->cow_page, vaddr);
> +   error = copy_cow_page_dax(vmf, iter);
> break;
> default:
> WARN_ON_ONCE(1);
> --
> 2.30.2
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 04/29] dax: simplify the dax_device <-> gendisk association

2021-11-23 Thread Dan Williams
On Mon, Nov 22, 2021 at 9:58 PM Christoph Hellwig  wrote:
>
> On Mon, Nov 22, 2021 at 07:33:06PM -0800, Dan Williams wrote:
> > Is it time to add a "DAX" symbol namespace?
>
> What would be the benefit?

Just the small benefit of identifying DAX core users with a common
grep line, and to indicate that DAX exports are more intertwined than
standalone exports, but yeah those are minor.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 12/29] fsdax: remove a pointless __force cast in copy_cow_page_dax

2021-11-22 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Despite its name copy_user_page expected kernel addresses, which is what
> we already have.

Yup,

Reviewed-by: Dan Williams 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 11/29] dm-stripe: add a stripe_dax_pgoff helper

2021-11-22 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Add a helper to perform the entire remapping for DAX accesses.  This
> helper open codes bdev_dax_pgoff given that the alignment checks have
> already been done by the submitting file system and don't need to be
> repeated.
>
> Signed-off-by: Christoph Hellwig 
> Acked-by: Mike Snitzer 

Reviewed-by: Dan Williams 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 10/29] dm-log-writes: add a log_writes_dax_pgoff helper

2021-11-22 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Add a helper to perform the entire remapping for DAX accesses.  This
> helper open codes bdev_dax_pgoff given that the alignment checks have
> already been done by the submitting file system and don't need to be
> repeated.
>
> Signed-off-by: Christoph Hellwig 
> Acked-by: Mike Snitzer 

Reviewed-by: Dan Williams 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 09/29] dm-linear: add a linear_dax_pgoff helper

2021-11-22 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Add a helper to perform the entire remapping for DAX accesses.  This
> helper open codes bdev_dax_pgoff given that the alignment checks have
> already been done by the submitting file system and don't need to be
> repeated.
>
> Signed-off-by: Christoph Hellwig 
> Acked-by: Mike Snitzer 

Reviewed-by: Dan Williams 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 08/29] dax: remove dax_capable

2021-11-22 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Just open code the block size and dax_dev == NULL checks in the callers.
>
> Signed-off-by: Christoph Hellwig 
> Acked-by: Mike Snitzer 

You dropped Gao Xiang's reviewed-by:

https://lore.kernel.org/r/YW1nrxmaw5CfFXBb@B-P7TQMD6M-0146.local

...and Darrick's

https://lore.kernel.org/r/20211019154447.GL24282@magnolia

...which had a few more review comments below, otherwise you can also add:

Reviewed-by: Dan Williams 


> ---
>  drivers/dax/super.c  | 36 
>  drivers/md/dm-table.c| 22 +++---
>  drivers/md/dm.c  | 21 -
>  drivers/md/dm.h  |  4 
>  drivers/nvdimm/pmem.c|  1 -
>  drivers/s390/block/dcssblk.c |  1 -
>  fs/erofs/super.c | 11 +++
>  fs/ext2/super.c  |  6 --
>  fs/ext4/super.c  |  9 ++---
>  fs/xfs/xfs_super.c   | 21 -
>  include/linux/dax.h  | 14 --
>  11 files changed, 36 insertions(+), 110 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 482fe775324a4..803942586d1b6 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -108,42 +108,6 @@ struct dax_device *fs_dax_get_by_bdev(struct 
> block_device *bdev)
> return dax_dev;
>  }
>  EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
> -
> -bool generic_fsdax_supported(struct dax_device *dax_dev,
> -   struct block_device *bdev, int blocksize, sector_t start,
> -   sector_t sectors)
> -{
> -   if (blocksize != PAGE_SIZE) {
> -   pr_info("%pg: error: unsupported blocksize for dax\n", bdev);
> -   return false;
> -   }
> -
> -   if (!dax_dev) {
> -   pr_debug("%pg: error: dax unsupported by block device\n", 
> bdev);
> -   return false;
> -   }
> -
> -   return true;
> -}
> -EXPORT_SYMBOL_GPL(generic_fsdax_supported);
> -
> -bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
> -   int blocksize, sector_t start, sector_t len)
> -{
> -   bool ret = false;
> -   int id;
> -
> -   if (!dax_dev)
> -   return false;
> -
> -   id = dax_read_lock();
> -   if (dax_alive(dax_dev) && dax_dev->ops->dax_supported)
> -   ret = dax_dev->ops->dax_supported(dax_dev, bdev, blocksize,
> - start, len);
> -   dax_read_unlock(id);
> -   return ret;
> -}
> -EXPORT_SYMBOL_GPL(dax_supported);
>  #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
>
>  enum dax_device_flags {
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index bcddc5effd155..f4915a7d5dc84 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -806,12 +806,14 @@ void dm_table_set_type(struct dm_table *t, enum 
> dm_queue_mode type)
>  EXPORT_SYMBOL_GPL(dm_table_set_type);
>
>  /* validate the dax capability of the target device span */
> -int device_not_dax_capable(struct dm_target *ti, struct dm_dev *dev,
> +static int device_not_dax_capable(struct dm_target *ti, struct dm_dev *dev,
> sector_t start, sector_t len, void *data)
>  {
> -   int blocksize = *(int *) data;
> +   if (dev->dax_dev)
> +   return false;
>
> -   return !dax_supported(dev->dax_dev, dev->bdev, blocksize, start, len);
> +   DMDEBUG("%pg: error: dax unsupported by block device", dev->bdev);
> +   return true;
>  }
>
>  /* Check devices support synchronous DAX */
> @@ -821,8 +823,8 @@ static int device_not_dax_synchronous_capable(struct 
> dm_target *ti, struct dm_de
> return !dev->dax_dev || !dax_synchronous(dev->dax_dev);
>  }
>
> -bool dm_table_supports_dax(struct dm_table *t,
> -  iterate_devices_callout_fn iterate_fn, int 
> *blocksize)
> +static bool dm_table_supports_dax(struct dm_table *t,
> +  iterate_devices_callout_fn iterate_fn)
>  {
> struct dm_target *ti;
> unsigned i;
> @@ -835,7 +837,7 @@ bool dm_table_supports_dax(struct dm_table *t,
> return false;
>
> if (!ti->type->iterate_devices ||
> -   ti->type->iterate_devices(ti, iterate_fn, blocksize))
> +   ti->type->iterate_devices(ti, iterate_fn, NULL))
> return false;
> }
>
> @@ -862,7 +864,6 @@ static int dm_table_determine_type(struct dm_table *

Re: [PATCH 07/29] xfs: factor out a xfs_setup_dax_always helper

2021-11-22 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Factor out another DAX setup helper to simplify future changes.  Also
> move the experimental warning after the checks to not clutter the log
> too much if the setup failed.
>
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Dan Williams 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 06/29] dax: move the partition alignment check into fs_dax_get_by_bdev

2021-11-22 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> fs_dax_get_by_bdev is the primary interface to find a dax device for a
> block device, so move the partition alignment check there instead of
> wiring it up through ->dax_supported.
>

Reviewed-by: Dan Williams 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 05/29] dax: remove the pgmap sanity checks in generic_fsdax_supported

2021-11-22 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Drivers that register a dax_dev should make sure it works, no need
> to double check from the file system.

Reviewed-by: Dan Williams 

...with a self-reminder to migrate this validation to a unit test to
backstop any future refactoring of the memmap reservation code.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 04/29] dax: simplify the dax_device <-> gendisk association

2021-11-22 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Replace the dax_host_hash with an xarray indexed by the pointer value
> of the gendisk, and require explicitly calls from the block drivers that
> want to associate their gendisk with a dax_device.
>
> Signed-off-by: Christoph Hellwig 
> Acked-by: Mike Snitzer 
> ---
>  drivers/dax/bus.c|   6 +-
>  drivers/dax/super.c  | 106 +--
>  drivers/md/dm.c  |   6 +-
>  drivers/nvdimm/pmem.c|   8 ++-
>  drivers/s390/block/dcssblk.c |  11 +++-
>  fs/fuse/virtio_fs.c  |   2 +-
>  include/linux/dax.h  |  19 +--
>  7 files changed, 62 insertions(+), 96 deletions(-)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 6cc4da4c713d9..bd7af2f7c5b0a 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1323,10 +1323,10 @@ struct dev_dax *devm_create_dev_dax(struct 
> dev_dax_data *data)
> }
>
> /*
> -* No 'host' or dax_operations since there is no access to this
> -* device outside of mmap of the resulting character device.
> +* No dax_operations since there is no access to this device outside 
> of
> +* mmap of the resulting character device.
>  */
> -   dax_dev = alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC);
> +   dax_dev = alloc_dax(dev_dax, NULL, DAXDEV_F_SYNC);
> if (IS_ERR(dax_dev)) {
> rc = PTR_ERR(dax_dev);
> goto err_alloc_dax;
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index e20d0cef10a18..9383c11b21853 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -7,10 +7,8 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -26,10 +24,8 @@
>   * @flags: state and boolean properties
>   */
>  struct dax_device {
> -   struct hlist_node list;
> struct inode inode;
> struct cdev cdev;
> -   const char *host;
> void *private;
> unsigned long flags;
> const struct dax_operations *ops;
> @@ -42,10 +38,6 @@ static DEFINE_IDA(dax_minor_ida);
>  static struct kmem_cache *dax_cache __read_mostly;
>  static struct super_block *dax_superblock __read_mostly;
>
> -#define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
> -static struct hlist_head dax_host_list[DAX_HASH_SIZE];
> -static DEFINE_SPINLOCK(dax_host_lock);
> -
>  int dax_read_lock(void)
>  {
> return srcu_read_lock(_srcu);
> @@ -58,13 +50,22 @@ void dax_read_unlock(int id)
>  }
>  EXPORT_SYMBOL_GPL(dax_read_unlock);
>
> -static int dax_host_hash(const char *host)
> +#if defined(CONFIG_BLOCK) && defined(CONFIG_FS_DAX)
> +#include 
> +
> +static DEFINE_XARRAY(dax_hosts);
> +
> +int dax_add_host(struct dax_device *dax_dev, struct gendisk *disk)
>  {
> -   return hashlen_hash(hashlen_string("DAX", host)) % DAX_HASH_SIZE;
> +   return xa_insert(_hosts, (unsigned long)disk, dax_dev, 
> GFP_KERNEL);
>  }
> +EXPORT_SYMBOL_GPL(dax_add_host);

Is it time to add a "DAX" symbol namespace?

>
> -#if defined(CONFIG_BLOCK) && defined(CONFIG_FS_DAX)
> -#include 
> +void dax_remove_host(struct gendisk *disk)
> +{
> +   xa_erase(_hosts, (unsigned long)disk);
> +}
> +EXPORT_SYMBOL_GPL(dax_remove_host);
>
>  int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
> pgoff_t *pgoff)
> @@ -82,40 +83,23 @@ EXPORT_SYMBOL(bdev_dax_pgoff);
>
>  /**
>   * dax_get_by_host() - temporary lookup mechanism for filesystem-dax
> - * @host: alternate name for the device registered by a dax driver
> + * @bdev: block device to find a dax_device for
>   */
> -static struct dax_device *dax_get_by_host(const char *host)
> +struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
>  {
> -   struct dax_device *dax_dev, *found = NULL;
> -   int hash, id;
> +   struct dax_device *dax_dev;
> +   int id;
>
> -   if (!host)
> +   if (!blk_queue_dax(bdev->bd_disk->queue))
> return NULL;
>
> -   hash = dax_host_hash(host);
> -
> id = dax_read_lock();
> -   spin_lock(_host_lock);
> -   hlist_for_each_entry(dax_dev, _host_list[hash], list) {
> -   if (!dax_alive(dax_dev)
> -   || strcmp(host, dax_dev->host) != 0)
> -   continue;
> -
> -   if (igrab(_dev->inode))
> -   found = dax_dev;
> -   break;
> -   }
> -   spin_unlock(_host_lock);
> +   dax_dev = xa_load(_hosts, (unsigned long)bdev->bd_disk);
> +   if (!dax_dev || !dax_alive(dax_dev) || !igrab(_dev->inode))
> +   dax_dev = NULL;
> dax_read_unlock(id);
>
> -   return found;
> -}
> -
> -struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
> -{
> -   if (!blk_queue_dax(bdev->bd_disk->queue))
> -   return NULL;
> -   return 

Re: [PATCH 03/29] dax: remove CONFIG_DAX_DRIVER

2021-11-22 Thread Dan Williams
On Wed, Nov 17, 2021 at 9:43 AM Dan Williams  wrote:
>
> On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
> >
> > CONFIG_DAX_DRIVER only selects CONFIG_DAX now, so remove it.
>
> Applied.

Unapplied,

Reviewed-by: Dan Williams 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 02/29] dm: make the DAX support dependend on CONFIG_FS_DAX

2021-11-22 Thread Dan Williams
On Thu, Nov 18, 2021 at 10:55 PM Christoph Hellwig  wrote:
>
> On Wed, Nov 17, 2021 at 09:23:44AM -0800, Dan Williams wrote:
> > Applied, fixed the spelling of 'dependent' in the subject and picked
> > up Mike's Ack from the previous send:
> >
> > https://lore.kernel.org/r/yyasbvuorceds...@redhat.com
> >
> > Christoph, any particular reason you did not pick up the tags from the
> > last posting?
>
> I thought I did, but apparently I've missed some.

I'll reply with the ones I see missing that need carrying over and add
my own reviewed-by then you can send me a pull request when ready,
deal?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 01/29] nvdimm/pmem: move dax_attribute_group from dax to pmem

2021-11-19 Thread Dan Williams
On Thu, Nov 18, 2021 at 10:56 PM Christoph Hellwig  wrote:
>
> On Wed, Nov 17, 2021 at 09:44:25AM -0800, Dan Williams wrote:
> > On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
> > >
> > > dax_attribute_group is only used by the pmem driver, and can avoid the
> > > completely pointless lookup by the disk name if moved there.  This
> > > leaves just a single caller of dax_get_by_host, so move dax_get_by_host
> > > into the same ifdef block as that caller.
> > >
> > > Signed-off-by: Christoph Hellwig 
> > > Reviewed-by: Dan Williams 
> > > Link: https://lore.kernel.org/r/20210922173431.2454024-3-...@lst.de
> > > Signed-off-by: Dan Williams 
> >
> > This one already made v5.16-rc1.
>
> Yes, but 5.16-rc1 did not exist yet when I pointed the series.
>
> Note that the series also has a conflict against 5.16-rc1 in pmem.c,
> and buildbot pointed out the file systems need explicit dax.h
> includes in a few files for some configurations.
>
> The current branch is here, I just did not bother to repost without
> any comments:
>
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-block-cleanup
>
> no functional changes.

Do you just want to send me a pull request after you add all the acks?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 01/29] nvdimm/pmem: move dax_attribute_group from dax to pmem

2021-11-17 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> dax_attribute_group is only used by the pmem driver, and can avoid the
> completely pointless lookup by the disk name if moved there.  This
> leaves just a single caller of dax_get_by_host, so move dax_get_by_host
> into the same ifdef block as that caller.
>
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Dan Williams 
> Link: https://lore.kernel.org/r/20210922173431.2454024-3-...@lst.de
> Signed-off-by: Dan Williams 

This one already made v5.16-rc1.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 03/29] dax: remove CONFIG_DAX_DRIVER

2021-11-17 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> CONFIG_DAX_DRIVER only selects CONFIG_DAX now, so remove it.

Applied.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 02/29] dm: make the DAX support dependend on CONFIG_FS_DAX

2021-11-17 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> The device mapper DAX support is all hanging off a block device and thus
> can't be used with device dax.  Make it depend on CONFIG_FS_DAX instead
> of CONFIG_DAX_DRIVER.  This also means that bdev_dax_pgoff only needs to
> be built under CONFIG_FS_DAX now.
>

Applied, fixed the spelling of 'dependent' in the subject and picked
up Mike's Ack from the previous send:

https://lore.kernel.org/r/yyasbvuorceds...@redhat.com

Christoph, any particular reason you did not pick up the tags from the
last posting?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: futher decouple DAX from block devices

2021-11-04 Thread Dan Williams
On Thu, Nov 4, 2021 at 10:36 AM Christoph Hellwig  wrote:
>
> On Thu, Nov 04, 2021 at 10:34:17AM -0700, Darrick J. Wong wrote:
> > /me wonders, are block devices going away?  Will mkfs.xfs have to learn
> > how to talk to certain chardevs?  I guess jffs2 and others already do
> > that kind of thing... but I suppose I can wait for the real draft to
> > show up to ramble further. ;)
>
> Right now I've mostly been looking into the kernel side.  An no, I
> do not expect /dev/pmem* to go away as you'll still need it for a
> not DAX aware file system and/or application (such as mkfs initially).
>
> But yes, just pointing mkfs to the chardev should be doable with very
> little work.  We can point it to a regular file after all.

Note that I've avoided implementing read/write fops for dax devices
partly out of concern for not wanting to figure out shared-mmap vs
write coherence issues, but also because of a bet with Dave Hansen
that device-dax not grow features like what happened to hugetlbfs. So
it would seem mkfs would need to switch to mmap I/O, or bite the
bullet and implement read/write fops in the driver.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: futher decouple DAX from block devices

2021-10-29 Thread Dan Williams
On Fri, Oct 29, 2021 at 8:55 AM Darrick J. Wong  wrote:
>
> On Fri, Oct 29, 2021 at 08:42:29AM -0700, Dan Williams wrote:
> > On Thu, Oct 28, 2021 at 4:52 PM Stephen Rothwell  
> > wrote:
> > >
> > > Hi Dan,
> > >
> > > On Wed, 27 Oct 2021 13:46:31 -0700 Dan Williams 
> > >  wrote:
> > > >
> > > > My merge resolution is here [1]. Christoph, please have a look. The
> > > > rebase and the merge result are both passing my test and I'm now going
> > > > to review the individual patches. However, while I do that and collect
> > > > acks from DM and EROFS folks, I want to give Stephen a heads up that
> > > > this is coming. Primarily I want to see if someone sees a better
> > > > strategy to merge this, please let me know, but if not I plan to walk
> > > > Stephen and Linus through the resolution.
> > >
> > > It doesn't look to bad to me (however it is a bit late in the cycle :-(
> > > ).  Once you are happy, just put it in your tree (some of the conflicts
> > > are against the current -rc3 based version of your tree anyway) and I
> > > will cope with it on Monday.
> >
> > Christoph, Darrick, Shiyang,
> >
> > I'm losing my nerve to try to jam this into v5.16 this late in the
> > cycle.
>
> Always a solid choice to hold off for a little more testing and a little
> less anxiety. :)
>
> I don't usually accept new code patches for iomap after rc4 anyway.
>
> > I do want to get dax+reflink squared away as soon as possible,
> > but that looks like something that needs to build on top of a
> > v5.16-rc1 at this point. If Linus does a -rc8 then maybe it would have
> > enough soak time, but otherwise I want to take the time to collect the
> > acks and queue up some more follow-on cleanups to prepare for
> > block-less-dax.
>
> I think that hwpoison-calls-xfs-rmap patchset is a prerequisite for
> dax+reflink anyway, right?  /me had concluded both were 5.17 things.

Ok, cool, sounds like a plan.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: futher decouple DAX from block devices

2021-10-29 Thread Dan Williams
On Thu, Oct 28, 2021 at 4:52 PM Stephen Rothwell  wrote:
>
> Hi Dan,
>
> On Wed, 27 Oct 2021 13:46:31 -0700 Dan Williams  
> wrote:
> >
> > My merge resolution is here [1]. Christoph, please have a look. The
> > rebase and the merge result are both passing my test and I'm now going
> > to review the individual patches. However, while I do that and collect
> > acks from DM and EROFS folks, I want to give Stephen a heads up that
> > this is coming. Primarily I want to see if someone sees a better
> > strategy to merge this, please let me know, but if not I plan to walk
> > Stephen and Linus through the resolution.
>
> It doesn't look to bad to me (however it is a bit late in the cycle :-(
> ).  Once you are happy, just put it in your tree (some of the conflicts
> are against the current -rc3 based version of your tree anyway) and I
> will cope with it on Monday.

Christoph, Darrick, Shiyang,

I'm losing my nerve to try to jam this into v5.16 this late in the
cycle. I do want to get dax+reflink squared away as soon as possible,
but that looks like something that needs to build on top of a
v5.16-rc1 at this point. If Linus does a -rc8 then maybe it would have
enough soak time, but otherwise I want to take the time to collect the
acks and queue up some more follow-on cleanups to prepare for
block-less-dax.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 11/11] dax: move bdev_dax_pgoff to fs/dax.c

2021-10-27 Thread Dan Williams
On Sun, Oct 17, 2021 at 9:41 PM Christoph Hellwig  wrote:
>
> No functional changet, but this will allow for a tighter integration

s/changet/changes/

> with the iomap code, including possible passing the partition offset

s/possible/possibly/

> in the iomap in the future.  For now it mostly avoids growing more

s/now/now,/

...all of the above fixed up locally.

Other than that, it looks good to me.

> callers outside of fs/dax.c.
>
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/dax/super.c | 14 --
>  fs/dax.c| 13 +
>  include/linux/dax.h |  1 -
>  3 files changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 803942586d1b6..c0910687fbcb2 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -67,20 +67,6 @@ void dax_remove_host(struct gendisk *disk)
>  }
>  EXPORT_SYMBOL_GPL(dax_remove_host);
>
> -int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
> -   pgoff_t *pgoff)
> -{
> -   sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
> -   phys_addr_t phys_off = (start_sect + sector) * 512;
> -
> -   if (pgoff)
> -   *pgoff = PHYS_PFN(phys_off);
> -   if (phys_off % PAGE_SIZE || size % PAGE_SIZE)
> -   return -EINVAL;
> -   return 0;
> -}
> -EXPORT_SYMBOL(bdev_dax_pgoff);
> -
>  /**
>   * dax_get_by_host() - temporary lookup mechanism for filesystem-dax
>   * @bdev: block device to find a dax_device for
> diff --git a/fs/dax.c b/fs/dax.c
> index 4e3e5a283a916..eb715363fd667 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -709,6 +709,19 @@ int dax_invalidate_mapping_entry_sync(struct 
> address_space *mapping,
> return __dax_invalidate_entry(mapping, index, false);
>  }
>
> +static int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t 
> size,
> +   pgoff_t *pgoff)
> +{
> +   sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
> +   phys_addr_t phys_off = (start_sect + sector) * 512;
> +
> +   if (pgoff)
> +   *pgoff = PHYS_PFN(phys_off);
> +   if (phys_off % PAGE_SIZE || size % PAGE_SIZE)
> +   return -EINVAL;
> +   return 0;
> +}
> +
>  static int copy_cow_page_dax(struct block_device *bdev, struct dax_device 
> *dax_dev,
>  sector_t sector, struct page *to, unsigned long 
> vaddr)
>  {
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 439c3c70e347b..324363b798ecd 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -107,7 +107,6 @@ static inline bool daxdev_mapping_supported(struct 
> vm_area_struct *vma,
>  #endif
>
>  struct writeback_control;
> -int bdev_dax_pgoff(struct block_device *, sector_t, size_t, pgoff_t *pgoff);
>  #if IS_ENABLED(CONFIG_FS_DAX)
>  int dax_add_host(struct dax_device *dax_dev, struct gendisk *disk);
>  void dax_remove_host(struct gendisk *disk);
> --
> 2.30.2
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 10/11] dm-stripe: add a stripe_dax_pgoff helper

2021-10-27 Thread Dan Williams
On Sun, Oct 17, 2021 at 9:41 PM Christoph Hellwig  wrote:
>
> Add a helper to perform the entire remapping for DAX accesses.  This
> helper open codes bdev_dax_pgoff given that the alignment checks have
> already been done by the submitting file system and don't need to be
> repeated.

Again, looks good. Kind of embarrassing when the open-coded version is
less LOC than using the helper.

Mike, ack?

>
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/md/dm-stripe.c | 63 ++
>  1 file changed, 15 insertions(+), 48 deletions(-)
>
> diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
> index f084607220293..50dba3f39274c 100644
> --- a/drivers/md/dm-stripe.c
> +++ b/drivers/md/dm-stripe.c
> @@ -301,83 +301,50 @@ static int stripe_map(struct dm_target *ti, struct bio 
> *bio)
>  }
>
>  #if IS_ENABLED(CONFIG_FS_DAX)
> -static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
> -   long nr_pages, void **kaddr, pfn_t *pfn)
> +static struct dax_device *stripe_dax_pgoff(struct dm_target *ti, pgoff_t 
> *pgoff)
>  {
> -   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
> struct stripe_c *sc = ti->private;
> -   struct dax_device *dax_dev;
> struct block_device *bdev;
> +   sector_t dev_sector;
> uint32_t stripe;
> -   long ret;
>
> -   stripe_map_sector(sc, sector, , _sector);
> +   stripe_map_sector(sc, *pgoff * PAGE_SECTORS, , _sector);
> dev_sector += sc->stripe[stripe].physical_start;
> -   dax_dev = sc->stripe[stripe].dev->dax_dev;
> bdev = sc->stripe[stripe].dev->bdev;
>
> -   ret = bdev_dax_pgoff(bdev, dev_sector, nr_pages * PAGE_SIZE, );
> -   if (ret)
> -   return ret;
> +   *pgoff = (get_start_sect(bdev) + dev_sector) >> PAGE_SECTORS_SHIFT;
> +   return sc->stripe[stripe].dev->dax_dev;
> +}
> +
> +static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
> +   long nr_pages, void **kaddr, pfn_t *pfn)
> +{
> +   struct dax_device *dax_dev = stripe_dax_pgoff(ti, );
> +
> return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
>  }
>
>  static size_t stripe_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
> void *addr, size_t bytes, struct iov_iter *i)
>  {
> -   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
> -   struct stripe_c *sc = ti->private;
> -   struct dax_device *dax_dev;
> -   struct block_device *bdev;
> -   uint32_t stripe;
> -
> -   stripe_map_sector(sc, sector, , _sector);
> -   dev_sector += sc->stripe[stripe].physical_start;
> -   dax_dev = sc->stripe[stripe].dev->dax_dev;
> -   bdev = sc->stripe[stripe].dev->bdev;
> +   struct dax_device *dax_dev = stripe_dax_pgoff(ti, );
>
> -   if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(bytes, PAGE_SIZE), ))
> -   return 0;
> return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
>  }
>
>  static size_t stripe_dax_copy_to_iter(struct dm_target *ti, pgoff_t pgoff,
> void *addr, size_t bytes, struct iov_iter *i)
>  {
> -   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
> -   struct stripe_c *sc = ti->private;
> -   struct dax_device *dax_dev;
> -   struct block_device *bdev;
> -   uint32_t stripe;
> -
> -   stripe_map_sector(sc, sector, , _sector);
> -   dev_sector += sc->stripe[stripe].physical_start;
> -   dax_dev = sc->stripe[stripe].dev->dax_dev;
> -   bdev = sc->stripe[stripe].dev->bdev;
> +   struct dax_device *dax_dev = stripe_dax_pgoff(ti, );
>
> -   if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(bytes, PAGE_SIZE), ))
> -   return 0;
> return dax_copy_to_iter(dax_dev, pgoff, addr, bytes, i);
>  }
>
>  static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
>   size_t nr_pages)
>  {
> -   int ret;
> -   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
> -   struct stripe_c *sc = ti->private;
> -   struct dax_device *dax_dev;
> -   struct block_device *bdev;
> -   uint32_t stripe;
> +   struct dax_device *dax_dev = stripe_dax_pgoff(ti, );
>
> -   stripe_map_sector(sc, sector, , _sector);
> -   dev_sector += sc->stripe[stripe].physical_start;
> -   dax_dev = sc->stripe[stripe].dev->dax_dev;
> -   bdev = sc->stripe[stripe].dev->bdev;
> -
> -   ret = bdev_dax_pgoff(bdev, dev_sector, nr_pages << PAGE_SHIFT, 
> );
> -   if (ret)
> -   return ret;
> return dax_zero_page_range(dax_dev, pgoff, nr_pages);
>  }
>
> --
> 2.30.2
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 09/11] dm-log-writes: add a log_writes_dax_pgoff helper

2021-10-27 Thread Dan Williams
On Sun, Oct 17, 2021 at 9:41 PM Christoph Hellwig  wrote:
>
> Add a helper to perform the entire remapping for DAX accesses.  This
> helper open codes bdev_dax_pgoff given that the alignment checks have
> already been done by the submitting file system and don't need to be
> repeated.

Looks good.

Mike, ack?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 08/11] dm-linear: add a linear_dax_pgoff helper

2021-10-27 Thread Dan Williams
On Sun, Oct 17, 2021 at 9:41 PM Christoph Hellwig  wrote:
>
> Add a helper to perform the entire remapping for DAX accesses.  This
> helper open codes bdev_dax_pgoff given that the alignment checks have
> already been done by the submitting file system and don't need to be
> repeated.

Looks good.

Mike, ack?

>
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/md/dm-linear.c | 49 +-
>  1 file changed, 15 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> index 32fbab11bf90c..bf03f73fd0f36 100644
> --- a/drivers/md/dm-linear.c
> +++ b/drivers/md/dm-linear.c
> @@ -164,63 +164,44 @@ static int linear_iterate_devices(struct dm_target *ti,
>  }
>
>  #if IS_ENABLED(CONFIG_FS_DAX)
> +static struct dax_device *linear_dax_pgoff(struct dm_target *ti, pgoff_t 
> *pgoff)
> +{
> +   struct linear_c *lc = ti->private;
> +   sector_t sector = linear_map_sector(ti, *pgoff << PAGE_SECTORS_SHIFT);
> +
> +   *pgoff = (get_start_sect(lc->dev->bdev) + sector) >> 
> PAGE_SECTORS_SHIFT;
> +   return lc->dev->dax_dev;
> +}
> +
>  static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
> long nr_pages, void **kaddr, pfn_t *pfn)
>  {
> -   long ret;
> -   struct linear_c *lc = ti->private;
> -   struct block_device *bdev = lc->dev->bdev;
> -   struct dax_device *dax_dev = lc->dev->dax_dev;
> -   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
> -
> -   dev_sector = linear_map_sector(ti, sector);
> -   ret = bdev_dax_pgoff(bdev, dev_sector, nr_pages * PAGE_SIZE, );
> -   if (ret)
> -   return ret;
> +   struct dax_device *dax_dev = linear_dax_pgoff(ti, );
> +
> return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
>  }
>
>  static size_t linear_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
> void *addr, size_t bytes, struct iov_iter *i)
>  {
> -   struct linear_c *lc = ti->private;
> -   struct block_device *bdev = lc->dev->bdev;
> -   struct dax_device *dax_dev = lc->dev->dax_dev;
> -   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
> +   struct dax_device *dax_dev = linear_dax_pgoff(ti, );
>
> -   dev_sector = linear_map_sector(ti, sector);
> -   if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(bytes, PAGE_SIZE), ))
> -   return 0;
> return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
>  }
>
>  static size_t linear_dax_copy_to_iter(struct dm_target *ti, pgoff_t pgoff,
> void *addr, size_t bytes, struct iov_iter *i)
>  {
> -   struct linear_c *lc = ti->private;
> -   struct block_device *bdev = lc->dev->bdev;
> -   struct dax_device *dax_dev = lc->dev->dax_dev;
> -   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
> +   struct dax_device *dax_dev = linear_dax_pgoff(ti, );
>
> -   dev_sector = linear_map_sector(ti, sector);
> -   if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(bytes, PAGE_SIZE), ))
> -   return 0;
> return dax_copy_to_iter(dax_dev, pgoff, addr, bytes, i);
>  }
>
>  static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff,
>   size_t nr_pages)
>  {
> -   int ret;
> -   struct linear_c *lc = ti->private;
> -   struct block_device *bdev = lc->dev->bdev;
> -   struct dax_device *dax_dev = lc->dev->dax_dev;
> -   sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
> -
> -   dev_sector = linear_map_sector(ti, sector);
> -   ret = bdev_dax_pgoff(bdev, dev_sector, nr_pages << PAGE_SHIFT, 
> );
> -   if (ret)
> -   return ret;
> +   struct dax_device *dax_dev = linear_dax_pgoff(ti, );
> +
> return dax_zero_page_range(dax_dev, pgoff, nr_pages);
>  }
>
> --
> 2.30.2
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [dm-devel] [PATCH 07/11] dax: remove dax_capable

2021-10-27 Thread Dan Williams
On Tue, Oct 19, 2021 at 8:45 AM Darrick J. Wong  wrote:
>
> On Mon, Oct 18, 2021 at 06:40:50AM +0200, Christoph Hellwig wrote:
> > Just open code the block size and dax_dev == NULL checks in the callers.
> >
> > Signed-off-by: Christoph Hellwig 
> > ---
> >  drivers/dax/super.c  | 36 
> >  drivers/md/dm-table.c| 22 +++---
> >  drivers/md/dm.c  | 21 -
> >  drivers/md/dm.h  |  4 
> >  drivers/nvdimm/pmem.c|  1 -
> >  drivers/s390/block/dcssblk.c |  1 -
> >  fs/erofs/super.c | 11 +++
> >  fs/ext2/super.c  |  6 --
> >  fs/ext4/super.c  |  9 ++---
> >  fs/xfs/xfs_super.c   | 21 -
> >  include/linux/dax.h  | 14 --
> >  11 files changed, 36 insertions(+), 110 deletions(-)
> >
[..]   if (ext4_has_feature_inline_data(sb)) {
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index d07020a8eb9e3..163ceafbd8fd2 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
[..]
> > + if (mp->m_super->s_blocksize != PAGE_SIZE) {
> > + xfs_alert(mp,
> > + "DAX not supported for blocksize. Turning off 
> > DAX.\n");
>
> Newlines aren't needed at the end of extX_msg/xfs_alert format strings.

Thanks Darrick, I fixed those up.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 07/11] dax: remove dax_capable

2021-10-27 Thread Dan Williams
I am going to change the subject of this patch to:

dax: remove ->dax_supported()

On Sun, Oct 17, 2021 at 9:41 PM Christoph Hellwig  wrote:
>

I'll add a bit more background to help others review this.

The ->dax_supported() operation arranges for a stack of devices to
each answer the question "is dax operational". That request routes to
generic_fsdax_supported() at last level device and that attempted an
actual dax_direct_access() call and did some sanity checks. However,
those sanity checks can be validated in other ways and with those
removed the only question to answer is "has each block device driver
in the stack performed dax_add_host()". That can be validated without
a dax_operation. So, just open code the block size and dax_dev == NULL
checks in the callers, and delete ->dax_supported().

Mike, let me know if you have any concerns.

> Just open code the block size and dax_dev == NULL checks in the callers.
>
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/dax/super.c  | 36 
>  drivers/md/dm-table.c| 22 +++---
>  drivers/md/dm.c  | 21 -
>  drivers/md/dm.h  |  4 
>  drivers/nvdimm/pmem.c|  1 -
>  drivers/s390/block/dcssblk.c |  1 -
>  fs/erofs/super.c | 11 +++
>  fs/ext2/super.c  |  6 --
>  fs/ext4/super.c  |  9 ++---
>  fs/xfs/xfs_super.c   | 21 -
>  include/linux/dax.h  | 14 --
>  11 files changed, 36 insertions(+), 110 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 482fe775324a4..803942586d1b6 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -108,42 +108,6 @@ struct dax_device *fs_dax_get_by_bdev(struct 
> block_device *bdev)
> return dax_dev;
>  }
>  EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
> -
> -bool generic_fsdax_supported(struct dax_device *dax_dev,
> -   struct block_device *bdev, int blocksize, sector_t start,
> -   sector_t sectors)
> -{
> -   if (blocksize != PAGE_SIZE) {
> -   pr_info("%pg: error: unsupported blocksize for dax\n", bdev);
> -   return false;
> -   }
> -
> -   if (!dax_dev) {
> -   pr_debug("%pg: error: dax unsupported by block device\n", 
> bdev);
> -   return false;
> -   }
> -
> -   return true;
> -}
> -EXPORT_SYMBOL_GPL(generic_fsdax_supported);
> -
> -bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
> -   int blocksize, sector_t start, sector_t len)
> -{
> -   bool ret = false;
> -   int id;
> -
> -   if (!dax_dev)
> -   return false;
> -
> -   id = dax_read_lock();
> -   if (dax_alive(dax_dev) && dax_dev->ops->dax_supported)
> -   ret = dax_dev->ops->dax_supported(dax_dev, bdev, blocksize,
> - start, len);
> -   dax_read_unlock(id);
> -   return ret;
> -}
> -EXPORT_SYMBOL_GPL(dax_supported);
>  #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
>
>  enum dax_device_flags {
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 1fa4d5582dca5..4ae671c2168ea 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -807,12 +807,14 @@ void dm_table_set_type(struct dm_table *t, enum 
> dm_queue_mode type)
>  EXPORT_SYMBOL_GPL(dm_table_set_type);
>
>  /* validate the dax capability of the target device span */
> -int device_not_dax_capable(struct dm_target *ti, struct dm_dev *dev,
> +static int device_not_dax_capable(struct dm_target *ti, struct dm_dev *dev,
> sector_t start, sector_t len, void *data)
>  {
> -   int blocksize = *(int *) data;
> +   if (dev->dax_dev)
> +   return false;
>
> -   return !dax_supported(dev->dax_dev, dev->bdev, blocksize, start, len);
> +   pr_debug("%pg: error: dax unsupported by block device\n", dev->bdev);
> +   return true;
>  }
>
>  /* Check devices support synchronous DAX */
> @@ -822,8 +824,8 @@ static int device_not_dax_synchronous_capable(struct 
> dm_target *ti, struct dm_de
> return !dev->dax_dev || !dax_synchronous(dev->dax_dev);
>  }
>
> -bool dm_table_supports_dax(struct dm_table *t,
> -  iterate_devices_callout_fn iterate_fn, int 
> *blocksize)
> +static bool dm_table_supports_dax(struct dm_table *t,
> +  iterate_devices_callout_fn iterate_fn)
>  {
> struct dm_target *ti;
> unsigned i;
> @@ -836,7 +838,7 @@ bool dm_table_supports_dax(struct dm_table *t,
> return false;
>
> if (!ti->type->iterate_devices ||
> -   ti->type->iterate_devices(ti, iterate_fn, blocksize))
> +   ti->type->iterate_devices(ti, iterate_fn, NULL))
> return false;
> }
>
> @@ -863,7 +865,6 @@ static int 

Re: [PATCH 06/11] xfs: factor out a xfs_setup_dax helper

2021-10-27 Thread Dan Williams
On Tue, Oct 19, 2021 at 12:24 AM Christoph Hellwig  wrote:
>
> On Mon, Oct 18, 2021 at 09:43:51AM -0700, Darrick J. Wong wrote:
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -339,6 +339,32 @@ xfs_buftarg_is_dax(
> > > bdev_nr_sectors(bt->bt_bdev));
> > >  }
> > >
> > > +static int
> > > +xfs_setup_dax(
> >
> > /me wonders if this should be named xfs_setup_dax_always, since this
> > doesn't handle the dax=inode mode?
>
> Sure, why not.

I went ahead and made that change locally.

>
> > The only reason I bring that up is that Eric reminded me a while ago
> > that we don't actually print any kind of EXPERIMENTAL warning for the
> > auto-detection behavior.
>
> Yes, I actually noticed that as well when preparing this series.

The rest looks good to me.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 05/11] dax: move the partition alignment check into fs_dax_get_by_bdev

2021-10-27 Thread Dan Williams
On Sun, Oct 17, 2021 at 9:41 PM Christoph Hellwig  wrote:
>
> fs_dax_get_by_bdev is the primary interface to find a dax device for a
> block device, so move the partition alignment check there instead of
> wiring it up through ->dax_supported.

Looks good.

>
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/dax/super.c | 23 ++-
>  1 file changed, 6 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 04fc680542e8d..482fe775324a4 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -93,6 +93,12 @@ struct dax_device *fs_dax_get_by_bdev(struct block_device 
> *bdev)
> if (!blk_queue_dax(bdev->bd_disk->queue))
> return NULL;
>
> +   if ((get_start_sect(bdev) * SECTOR_SIZE) % PAGE_SIZE ||
> +   (bdev_nr_sectors(bdev) * SECTOR_SIZE) % PAGE_SIZE) {
> +   pr_info("%pg: error: unaligned partition for dax\n", bdev);
> +   return NULL;
> +   }
> +
> id = dax_read_lock();
> dax_dev = xa_load(_hosts, (unsigned long)bdev->bd_disk);
> if (!dax_dev || !dax_alive(dax_dev) || !igrab(_dev->inode))
> @@ -107,10 +113,6 @@ bool generic_fsdax_supported(struct dax_device *dax_dev,
> struct block_device *bdev, int blocksize, sector_t start,
> sector_t sectors)
>  {
> -   pgoff_t pgoff, pgoff_end;
> -   sector_t last_page;
> -   int err;
> -
> if (blocksize != PAGE_SIZE) {
> pr_info("%pg: error: unsupported blocksize for dax\n", bdev);
> return false;
> @@ -121,19 +123,6 @@ bool generic_fsdax_supported(struct dax_device *dax_dev,
> return false;
> }
>
> -   err = bdev_dax_pgoff(bdev, start, PAGE_SIZE, );
> -   if (err) {
> -   pr_info("%pg: error: unaligned partition for dax\n", bdev);
> -   return false;
> -   }
> -
> -   last_page = PFN_DOWN((start + sectors - 1) * 512) * PAGE_SIZE / 512;
> -   err = bdev_dax_pgoff(bdev, last_page, PAGE_SIZE, _end);
> -   if (err) {
> -   pr_info("%pg: error: unaligned partition for dax\n", bdev);
> -   return false;
> -   }
> -
> return true;
>  }
>  EXPORT_SYMBOL_GPL(generic_fsdax_supported);
> --
> 2.30.2
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 04/11] dax: remove the pgmap sanity checks in generic_fsdax_supported

2021-10-27 Thread Dan Williams
On Sun, Oct 17, 2021 at 9:41 PM Christoph Hellwig  wrote:
>
> Drivers that register a dax_dev should make sure it works, no need
> to double check from the file system.
>
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/dax/super.c | 49 +
>  1 file changed, 1 insertion(+), 48 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 9383c11b21853..04fc680542e8d 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -107,13 +107,9 @@ bool generic_fsdax_supported(struct dax_device *dax_dev,
> struct block_device *bdev, int blocksize, sector_t start,
> sector_t sectors)
>  {
> -   bool dax_enabled = false;
> pgoff_t pgoff, pgoff_end;
> -   void *kaddr, *end_kaddr;
> -   pfn_t pfn, end_pfn;
> sector_t last_page;
> -   long len, len2;
> -   int err, id;
> +   int err;
>
> if (blocksize != PAGE_SIZE) {
> pr_info("%pg: error: unsupported blocksize for dax\n", bdev);
> @@ -138,49 +134,6 @@ bool generic_fsdax_supported(struct dax_device *dax_dev,
> return false;
> }
>
> -   id = dax_read_lock();
> -   len = dax_direct_access(dax_dev, pgoff, 1, , );
> -   len2 = dax_direct_access(dax_dev, pgoff_end, 1, _kaddr, _pfn);
> -
> -   if (len < 1 || len2 < 1) {
> -   pr_info("%pg: error: dax access failed (%ld)\n",
> -   bdev, len < 1 ? len : len2);
> -   dax_read_unlock(id);
> -   return false;
> -   }
> -
> -   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn)) {
> -   /*
> -* An arch that has enabled the pmem api should also
> -* have its drivers support pfn_t_devmap()
> -*
> -* This is a developer warning and should not trigger in
> -* production. dax_flush() will crash since it depends
> -* on being able to do (page_address(pfn_to_page())).
> -*/
> -   WARN_ON(IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API));
> -   dax_enabled = true;
> -   } else if (pfn_t_devmap(pfn) && pfn_t_devmap(end_pfn)) {
> -   struct dev_pagemap *pgmap, *end_pgmap;
> -
> -   pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL);
> -   end_pgmap = get_dev_pagemap(pfn_t_to_pfn(end_pfn), NULL);
> -   if (pgmap && pgmap == end_pgmap && pgmap->type == 
> MEMORY_DEVICE_FS_DAX
> -   && pfn_t_to_page(pfn)->pgmap == pgmap
> -   && pfn_t_to_page(end_pfn)->pgmap == pgmap
> -   && pfn_t_to_pfn(pfn) == PHYS_PFN(__pa(kaddr))
> -   && pfn_t_to_pfn(end_pfn) == 
> PHYS_PFN(__pa(end_kaddr)))

This is effectively a self-test for a regression that was found while
manipulating the 'struct page' memmap metadata reservation for PMEM
namespaces.

I guess it's just serving as a security-blanket now and I should find
a way to move it out to a self-test. I'll take a look.

> -   dax_enabled = true;
> -   put_dev_pagemap(pgmap);
> -   put_dev_pagemap(end_pgmap);
> -
> -   }
> -   dax_read_unlock(id);
> -
> -   if (!dax_enabled) {
> -   pr_info("%pg: error: dax support not enabled\n", bdev);
> -   return false;
> -   }
> return true;
>  }
>  EXPORT_SYMBOL_GPL(generic_fsdax_supported);
> --
> 2.30.2
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 03/11] dax: simplify the dax_device <-> gendisk association

2021-10-27 Thread Dan Williams
On Sun, Oct 17, 2021 at 9:41 PM Christoph Hellwig  wrote:
>
> Replace the dax_host_hash with an xarray indexed by the pointer value
> of the gendisk, and require explicitl calls from the block drivers that

s/explicitl/explicitl/

I've fixed that up locally.

> want to associate their gendisk with a dax_device.

This looks good. 0day-robot is now chewing on it via the test merge
with linux-next (libnvdimm-pending).
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 02/11] dax: remove CONFIG_DAX_DRIVER

2021-10-27 Thread Dan Williams
On Sun, Oct 17, 2021 at 9:41 PM Christoph Hellwig  wrote:
>
> CONFIG_DAX_DRIVER only selects CONFIG_DAX now, so remove it.

Looks good, I don't think an s390 ack is needed for this one.

> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/dax/Kconfig| 4 
>  drivers/nvdimm/Kconfig | 2 +-
>  drivers/s390/block/Kconfig | 2 +-
>  fs/fuse/Kconfig| 2 +-
>  4 files changed, 3 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> index d2834c2cfa10d..954ab14ba7778 100644
> --- a/drivers/dax/Kconfig
> +++ b/drivers/dax/Kconfig
> @@ -1,8 +1,4 @@
>  # SPDX-License-Identifier: GPL-2.0-only
> -config DAX_DRIVER
> -   select DAX
> -   bool
> -
>  menuconfig DAX
> tristate "DAX: direct access to differentiated memory"
> select SRCU
> diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
> index b7d1eb38b27d4..347fe7afa5830 100644
> --- a/drivers/nvdimm/Kconfig
> +++ b/drivers/nvdimm/Kconfig
> @@ -22,7 +22,7 @@ if LIBNVDIMM
>  config BLK_DEV_PMEM
> tristate "PMEM: Persistent memory block device support"
> default LIBNVDIMM
> -   select DAX_DRIVER
> +   select DAX
> select ND_BTT if BTT
> select ND_PFN if NVDIMM_PFN
> help
> diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
> index d0416dbd0cd81..e3710a762abae 100644
> --- a/drivers/s390/block/Kconfig
> +++ b/drivers/s390/block/Kconfig
> @@ -5,7 +5,7 @@ comment "S/390 block device drivers"
>  config DCSSBLK
> def_tristate m
> select FS_DAX_LIMITED
> -   select DAX_DRIVER
> +   select DAX
> prompt "DCSSBLK support"
> depends on S390 && BLOCK
> help
> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> index 40ce9a1c12e5d..038ed0b9aaa5d 100644
> --- a/fs/fuse/Kconfig
> +++ b/fs/fuse/Kconfig
> @@ -45,7 +45,7 @@ config FUSE_DAX
> select INTERVAL_TREE
> depends on VIRTIO_FS
> depends on FS_DAX
> -   depends on DAX_DRIVER
> +   depends on DAX
> help
>   This allows bypassing guest page cache and allows mapping host page
>   cache directly in guest address space.
> --
> 2.30.2
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 01/11] dm: make the DAX support dependend on CONFIG_FS_DAX

2021-10-27 Thread Dan Williams
On Sun, Oct 17, 2021 at 9:41 PM Christoph Hellwig  wrote:
>
> The device mapper DAX support is all hanging off a block device and thus
> can't be used with device dax.  Make it depend on CONFIG_FS_DAX instead
> of CONFIG_DAX_DRIVER.  This also means that bdev_dax_pgoff only needs to
> be built under CONFIG_FS_DAX now.

Looks good.

Mike, can I get an ack to take this through nvdimm.git? (you'll likely
see me repeat this question on subsequent patches in this series).

>
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/dax/super.c| 6 ++
>  drivers/md/dm-linear.c | 2 +-
>  drivers/md/dm-log-writes.c | 2 +-
>  drivers/md/dm-stripe.c | 2 +-
>  drivers/md/dm-writecache.c | 2 +-
>  drivers/md/dm.c| 2 +-
>  6 files changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index b882cf8106ea3..e20d0cef10a18 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -63,7 +63,7 @@ static int dax_host_hash(const char *host)
> return hashlen_hash(hashlen_string("DAX", host)) % DAX_HASH_SIZE;
>  }
>
> -#ifdef CONFIG_BLOCK
> +#if defined(CONFIG_BLOCK) && defined(CONFIG_FS_DAX)
>  #include 
>
>  int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
> @@ -80,7 +80,6 @@ int bdev_dax_pgoff(struct block_device *bdev, sector_t 
> sector, size_t size,
>  }
>  EXPORT_SYMBOL(bdev_dax_pgoff);
>
> -#if IS_ENABLED(CONFIG_FS_DAX)
>  /**
>   * dax_get_by_host() - temporary lookup mechanism for filesystem-dax
>   * @host: alternate name for the device registered by a dax driver
> @@ -219,8 +218,7 @@ bool dax_supported(struct dax_device *dax_dev, struct 
> block_device *bdev,
> return ret;
>  }
>  EXPORT_SYMBOL_GPL(dax_supported);
> -#endif /* CONFIG_FS_DAX */
> -#endif /* CONFIG_BLOCK */
> +#endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
>
>  enum dax_device_flags {
> /* !alive + rcu grace period == no new operations / mappings */
> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> index 679b4c0a2eea1..32fbab11bf90c 100644
> --- a/drivers/md/dm-linear.c
> +++ b/drivers/md/dm-linear.c
> @@ -163,7 +163,7 @@ static int linear_iterate_devices(struct dm_target *ti,
> return fn(ti, lc->dev, lc->start, ti->len, data);
>  }
>
> -#if IS_ENABLED(CONFIG_DAX_DRIVER)
> +#if IS_ENABLED(CONFIG_FS_DAX)
>  static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
> long nr_pages, void **kaddr, pfn_t *pfn)
>  {
> diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
> index d93a4db235124..6d694526881d0 100644
> --- a/drivers/md/dm-log-writes.c
> +++ b/drivers/md/dm-log-writes.c
> @@ -903,7 +903,7 @@ static void log_writes_io_hints(struct dm_target *ti, 
> struct queue_limits *limit
> limits->io_min = limits->physical_block_size;
>  }
>
> -#if IS_ENABLED(CONFIG_DAX_DRIVER)
> +#if IS_ENABLED(CONFIG_FS_DAX)
>  static int log_dax(struct log_writes_c *lc, sector_t sector, size_t bytes,
>struct iov_iter *i)
>  {
> diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
> index 6660b6b53d5bf..f084607220293 100644
> --- a/drivers/md/dm-stripe.c
> +++ b/drivers/md/dm-stripe.c
> @@ -300,7 +300,7 @@ static int stripe_map(struct dm_target *ti, struct bio 
> *bio)
> return DM_MAPIO_REMAPPED;
>  }
>
> -#if IS_ENABLED(CONFIG_DAX_DRIVER)
> +#if IS_ENABLED(CONFIG_FS_DAX)
>  static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
> long nr_pages, void **kaddr, pfn_t *pfn)
>  {
> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> index 18320444fb0a9..4c3a6e33604d3 100644
> --- a/drivers/md/dm-writecache.c
> +++ b/drivers/md/dm-writecache.c
> @@ -38,7 +38,7 @@
>  #define BITMAP_GRANULARITY PAGE_SIZE
>  #endif
>
> -#if IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API) && IS_ENABLED(CONFIG_DAX_DRIVER)
> +#if IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API) && IS_ENABLED(CONFIG_FS_DAX)
>  #define DM_WRITECACHE_HAS_PMEM
>  #endif
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 7870e6460633f..79737aee516b1 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1783,7 +1783,7 @@ static struct mapped_device *alloc_dev(int minor)
> md->disk->private_data = md;
> sprintf(md->disk->disk_name, "dm-%d", minor);
>
> -   if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
> +   if (IS_ENABLED(CONFIG_FS_DAX)) {
> md->dax_dev = alloc_dax(md, md->disk->disk_name,
> _dax_ops, 0);
> if (IS_ERR(md->dax_dev))
> --
> 2.30.2
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: futher decouple DAX from block devices

2021-10-27 Thread Dan Williams
[ add sfr ]

On Sun, Oct 17, 2021 at 9:41 PM Christoph Hellwig  wrote:
>
> Hi Dan,
>
> this series cleans up and simplifies the association between DAX and block
> devices in preparation of allowing to mount file systems directly on DAX
> devices without a detour through block devices.

So I notice that this is based on linux-next while libnvdimm-for-next
is based on v5.15-rc4. Since I'm not Andrew I went ahead and rebased
these onto v5.15-rc4, tested that, and then merged with linux-next to
resolve the conflicts and tested again.

My merge resolution is here [1]. Christoph, please have a look. The
rebase and the merge result are both passing my test and I'm now going
to review the individual patches. However, while I do that and collect
acks from DM and EROFS folks, I want to give Stephen a heads up that
this is coming. Primarily I want to see if someone sees a better
strategy to merge this, please let me know, but if not I plan to walk
Stephen and Linus through the resolution.

[1]: 
https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/commit/?id=c3894cf6eb8f


>
> Diffstat:
>  drivers/dax/Kconfig  |4
>  drivers/dax/bus.c|2
>  drivers/dax/super.c  |  220 
> +--
>  drivers/md/dm-linear.c   |   51 +++--
>  drivers/md/dm-log-writes.c   |   44 +++-
>  drivers/md/dm-stripe.c   |   65 +++-
>  drivers/md/dm-table.c|   22 ++--
>  drivers/md/dm-writecache.c   |2
>  drivers/md/dm.c  |   29 -
>  drivers/md/dm.h  |4
>  drivers/nvdimm/Kconfig   |2
>  drivers/nvdimm/pmem.c|9 -
>  drivers/s390/block/Kconfig   |2
>  drivers/s390/block/dcssblk.c |   12 +-
>  fs/dax.c |   13 ++
>  fs/erofs/super.c |   11 +-
>  fs/ext2/super.c  |6 -
>  fs/ext4/super.c  |9 +
>  fs/fuse/Kconfig  |2
>  fs/fuse/virtio_fs.c  |2
>  fs/xfs/xfs_super.c   |   54 +-
>  include/linux/dax.h  |   30 ++---
>  22 files changed, 185 insertions(+), 410 deletions(-)
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Dan Williams
On Tue, Oct 12, 2021 at 2:28 PM Andi Kleen  wrote:
[..]
> >> But how do you debug the kernel then? Making early undebuggable seems
> >> just bad policy to me.
> > I am not proposing making the early undebuggable.
>
>
> That's the implication of moving the policy into initrd.
>
>
> If only initrd can authorize then it won't be possible to authorize
> before initrd, thus the early console won't work.

Again, the proposal is that the allow-list is limited to just enough
devices to startup and debug the initramfs and no more. Everything
else can be dynamic, and this allows for a powerful custom override
interface without needing to debate additional ABI like command line
overrides, and minimizes future changes to this kernel-internal
allow-list.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Dan Williams
On Tue, Oct 12, 2021 at 11:35 AM Andi Kleen  wrote:
>
>
> > The "better safe-than-sorry" argument is hard to build consensus
> > around. The spectre mitigations ran into similar problems where the
> > community rightly wanted to see the details and instrument the
> > problematic paths rather than blanket sprinkle lfence "just to be
> > safe".
>
> But that was due to performance problems in hot paths. Nothing of this
> applies here.

It applies because a new API that individual driver authors is being
proposed and that's an ongoing maintenance burden that might be
mitigated by hiding that implementation detail from leaf drivers.

>
> > In this case the rules about when a driver is suitably
> > "hardened" are vague and the overlapping policy engines are confusing.
>
> What is confusing exactly?

Multiple places to go to enable functionality. The device-filter
firewall policy can collide with the ioremap access control policy.

> For me it both seems very straight forward and simple (but then I'm biased)

You seem to be having a difficult time iterating this proposal toward
consensus. I don't think the base principles are being contested as
much as the semantics, scope, and need for the proposed API that is in
the purview of all leaf driver developers.

> The policy is:
>
> - Have an allow list at driver registration.
>
> - Have an additional opt-in for MMIO mappings (and potentially config
> space, but that's not currently there) to cover init calls completely.

The proliferation of policy engines and driver special casing is the
issue. Especially in this case where the virtio use case being
opted-in is *already* in a path that has been authorized by the
device-filter policy engine. I.e. why special case the ioremap() in
virtio to be additionally authorized when the device has already been
authorized to probe? Put another way, the easiest driver API change to
merge would be no additional changes in leaf drivers.

>
> >
> > I'd rather see more concerted efforts focused/limited core changes
> > rather than leaf driver changes until there is a clearer definition of
> > hardened.
>
> A hardened driver is a driver that
>
> - Had similar security (not API) oriented review of its IO operations
> (mainly MMIO access, but also PCI config space) as a non privileged user
> interface (like a ioctl). That review should be focused on memory safety.
>
> - Had some fuzzing on these IO interfaces using to be released tools.

What is the intersection of ioremap() users that are outside of the
proposed probe authorization regime AND want confidential computing
support?

> Right now it's only three virtio drivers (console, net, block)
>
> Really it's no different than what we do for every new unprivileged user
> interface.
>
>
> > I.e. instead of jumping to the assertion that fixing up
> > these init-path vulnerabilities are too big to fix, dig to the next
> > level to provide more evidence that per-driver opt-in is the only
> > viable option.
> >
> > For example, how many of these problematic paths are built-in to the
> > average kernel config?
>
> I don't think arguments from "the average kernel config" (if such a
> thing even exists) are useful. That's would be just hand waving.

I'm trying to bridge to your contention that this enabling can not
rely on custom kernel configs and must offer protection on the same
kernel image that might ship in the host, but lets set this aside and
focus on when and where leaf drivers need to adopt a new API.

> > A strawman might be to add a sprinkling error
> > exits in the module_init() of the problematic drivers, and only fail
> > if the module is built-in, and let modprobe policy handle the rest.
>
>
> That would be already hundreds of changes. I have no idea how such a
> thing could be maintained or sustained either.
>
> Really I don't even see how these alternatives can be considered. Tree
> sweeps should always be last resort. They're a pain for everyone. But
> here they're casually thrown around as alternatives to straight forward
> one or two line changes.

If it looked straightforward I'm not sure we would be having this
discussion, I think it's reasonable to ask if this is a per-driver
opt-in responsibility that must be added in addition to probe
authorization.

> >> Default policy in user space just seems to be a bad idea here. Who
> >> should know if a driver is hardened other than the kernel? Maintaining
> >> the list somewhere else just doesn't make sense to me.
> > I do not understand the maintenance burden correlation of where the
> > policy is driven vs where the list is maintained?
>
> All the hardening and auditing happens in the kernel tree. So it seems
> the natural place to store the result is in the kernel tree.
>
> But there's no single package for initrd, so you would need custom
> configurations for all the supported distros.
>
> Also we're really arguing about a list that currently has three entries.
>
>
> >   Even if I agreed
> > with the contention that 

Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Dan Williams
On Tue, Oct 12, 2021 at 11:57 AM Reshetova, Elena
 wrote:
>
>
> > I suspect the true number is even higher because that doesn't include IO
> > inside calls to other modules and indirect pointers, correct?
>
> Actually everything should be included. Smatch has cross-function db and
> I am using it for getting the call chains and it follows function pointers.
> Also since I am starting from a list of individual read IOs, every single
> base read IO in drivers/* should be covered as far as I can see. But if it 
> uses
> some weird IO wrappers then the actual list might be higher.

Why analyze individual IO calls? I thought the goal here was to
disable entire classes of ioremap() users?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-12 Thread Dan Williams
On Sun, Oct 10, 2021 at 3:11 PM Andi Kleen  wrote:
>
>
> On 10/9/2021 1:39 PM, Dan Williams wrote:
> > On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin  wrote:
> >> On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:
> >>> From: Andi Kleen 
> >>>
> >>> For Confidential VM guests like TDX, the host is untrusted and hence
> >>> the devices emulated by the host or any data coming from the host
> >>> cannot be trusted. So the drivers that interact with the outside world
> >>> have to be hardened by sharing memory with host on need basis
> >>> with proper hardening fixes.
> >>>
> >>> For the PCI driver case, to share the memory with the host add
> >>> pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.
> >>>
> >>> Signed-off-by: Andi Kleen 
> >>> Signed-off-by: Kuppuswamy Sathyanarayanan 
> >>> 
> >> So I proposed to make all pci mappings shared, eliminating the need
> >> to patch drivers.
> >>
> >> To which Andi replied
> >>  One problem with removing the ioremap opt-in is that
> >>  it's still possible for drivers to get at devices without going 
> >> through probe.
> >>
> >> To which Greg replied:
> >> https://lore.kernel.org/all/yvxbnj431yiww...@kroah.com/
> >>  If there are in-kernel PCI drivers that do not do this, they need 
> >> to be
> >>  fixed today.
> >>
> >> Can you guys resolve the differences here?
> > I agree with you and Greg here. If a driver is accessing hardware
> > resources outside of the bind lifetime of one of the devices it
> > supports, and in a way that neither modrobe-policy nor
> > device-authorization -policy infrastructure can block, that sounds
> > like a bug report.
>
> The 5.15 tree has something like ~2.4k IO accesses (including MMIO and
> others) in init functions that also register drivers (thanks Elena for
> the number)
>
> Some are probably old drivers that could be fixed, but it's quite a few
> legitimate cases. For example for platform or ISA drivers that's the
> only way they can be implemented because they often have no other
> enumeration mechanism. For PCI drivers it's rarer, but also still can
> happen. One example that comes to mind here is the x86 Intel uncore
> drivers, which support a mix of MSR, ioremap and PCI config space
> accesses all from the same driver. This particular example can (and
> should be) fixed in other ways, but similar things also happen in other
> drivers, and they're not all broken. Even for the broken ones they're
> usually for some crufty old devices that has very few users, so it's
> likely untestable in practice.
>
> My point is just that the ecosystem of devices that Linux supports is
> messy enough that there are legitimate exceptions from the "First IO
> only in probe call only" rule.
>
> And we can't just fix them all. Even if we could it would be hard to
> maintain.
>
> Using a "firewall model" hooking into a few strategic points like we're
> proposing here is much saner for everyone.
>
> Now we can argue about the details. Right now what we're proposing has
> some redundancies: it has both a device model filter and low level
> filter for ioremap (this patch and some others). The low level filter is
> for catching issues that don't clearly fit into the
> "enumeration<->probe" model. You could call that redundant, but I would
> call it defense in depth or better safe than sorry. In theory it would
> be enough to have the low level opt-in only, but that would have the
> drawback that is something gets enumerated after all you would have all
> kind of weird device driver failures and in some cases even killed
> guests. So I think it makes sense to have

The "better safe-than-sorry" argument is hard to build consensus
around. The spectre mitigations ran into similar problems where the
community rightly wanted to see the details and instrument the
problematic paths rather than blanket sprinkle lfence "just to be
safe". In this case the rules about when a driver is suitably
"hardened" are vague and the overlapping policy engines are confusing.

I'd rather see more concerted efforts focused/limited core changes
rather than leaf driver changes until there is a clearer definition of
hardened. I.e. instead of jumping to the assertion that fixing up
these init-path vulnerabilities are too big to fix, dig to the next
level to provide more evidence that per-driver opt-in is the only
viable option.

For example, how many of these problematic paths ar

Re: [PATCH v5 12/16] PCI: Add pci_iomap_host_shared(), pci_iomap_host_shared_range()

2021-10-09 Thread Dan Williams
On Sat, Oct 9, 2021 at 2:53 AM Michael S. Tsirkin  wrote:
>
> On Fri, Oct 08, 2021 at 05:37:07PM -0700, Kuppuswamy Sathyanarayanan wrote:
> > From: Andi Kleen 
> >
> > For Confidential VM guests like TDX, the host is untrusted and hence
> > the devices emulated by the host or any data coming from the host
> > cannot be trusted. So the drivers that interact with the outside world
> > have to be hardened by sharing memory with host on need basis
> > with proper hardening fixes.
> >
> > For the PCI driver case, to share the memory with the host add
> > pci_iomap_host_shared() and pci_iomap_host_shared_range() APIs.
> >
> > Signed-off-by: Andi Kleen 
> > Signed-off-by: Kuppuswamy Sathyanarayanan 
> > 
>
> So I proposed to make all pci mappings shared, eliminating the need
> to patch drivers.
>
> To which Andi replied
> One problem with removing the ioremap opt-in is that
> it's still possible for drivers to get at devices without going 
> through probe.
>
> To which Greg replied:
> https://lore.kernel.org/all/yvxbnj431yiww...@kroah.com/
> If there are in-kernel PCI drivers that do not do this, they need to 
> be
> fixed today.
>
> Can you guys resolve the differences here?

I agree with you and Greg here. If a driver is accessing hardware
resources outside of the bind lifetime of one of the devices it
supports, and in a way that neither modrobe-policy nor
device-authorization -policy infrastructure can block, that sounds
like a bug report. Fix those drivers instead of sprinkling
ioremap_shared in select places and with unclear rules about when a
driver is allowed to do "shared" mappings. Let the new
device-authorization mechanism (with policy in userspace) be the
central place where all of these driver "trust" issues are managed.

> And once they are resolved, mention this in the commit log so
> I don't get to re-read the series just to find out nothing
> changed in this respect?
>
> I frankly do not believe we are anywhere near being able to harden
> an arbitrary kernel config against attack.
> How about creating a defconfig that makes sense for TDX then?
> Anyone deviating from that better know what they are doing,
> this API tweaking is just putting policy into the kernel  ...

Right, userspace authorization policy and select driver fixups seems
to be the answer to the raised concerns.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 4/6] virtio: Initialize authorized attribute for confidential guest

2021-10-05 Thread Dan Williams
On Sun, Oct 3, 2021 at 10:16 PM Mika Westerberg
 wrote:
>
> Hi,
>
> On Fri, Oct 01, 2021 at 12:57:18PM -0700, Dan Williams wrote:
> > > > Ah, so are you saying that it would be sufficient for USB if the
> > > > generic authorized implementation did something like:
> > > >
> > > > dev->authorized = 1;
> > > > device_attach(dev);
> > > >
> > > > ...for the authorize case, and:
> > > >
> > > > dev->authorize = 0;
> > > > device_release_driver(dev);
> > > >
> > > > ...for the deauthorize case?
> > >
> > > Yes, I think so.  But I haven't tried making this change to test and
> > > see what really happens.
> >
> > Sounds like a useful path for this effort to explore. Especially as
> > Greg seems to want the proposed "has_probe_authorization" flag in the
> > bus_type to disappear and make this all generic. It just seems that
> > Thunderbolt would need deeper surgery to move what it does in the
> > authorization toggle path into the probe and remove paths.
> >
> > Mika, do you see a path for Thunderbolt to align its authorization
> > paths behind bus ->probe() ->remove() events similar to what USB might
> > be able to support for a generic authorization path?
>
> In Thunderbolt "authorization" actually means whether there is a PCIe
> tunnel to the device or not. There is no driver bind/unbind happening
> when authorization toggles (well on Thunderbolt bus, there can be on PCI
> bus after the tunnel is established) so I'm not entirely sure how we
> could use the bus ->probe() or ->remove for that to be honest.

Greg, per your comment:

"... which was to move the way that busses are allowed to authorize
the devices they wish to control into a generic way instead of being
bus-specific logic."

We have USB and TB that have already diverged on the ABI here. The USB
behavior is more in line with the "probe authorization" concept, while
TB is about tunnel establishment and not cleanly tied to probe
authorization. So while I see a path to a common authorization
implementation for USB and other buses (per the insight from Alan), TB
needs to retain the ability to record the authorization state as an
enum rather than a bool, and emit a uevent on authorization status
change.

So how about something like the following that moves the attribute
into the core, but still calls back to TB and USB to perform their
legacy authorization work. This new authorized attribute only shows up
when devices default to not authorized, i.e. when userspace owns the
allow list past critical-boot built-in drivers, or if the bus (USB /
TB) implements ->authorize().


diff --git a/drivers/base/core.c b/drivers/base/core.c
index e65dd803a453..8f8fbe2637d1 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2414,6 +2414,58 @@ static ssize_t online_store(struct device *dev,
struct device_attribute *attr,
 }
 static DEVICE_ATTR_RW(online);

+static ssize_t authorized_show(struct device *dev,
+  struct device_attribute *attr, char *buf)
+{
+   return sysfs_emit(buf, "%u\n", dev->authorized);
+}
+
+static ssize_t authorized_store(struct device *dev,
+   struct device_attribute *attr, const char *buf,
+   size_t count)
+{
+   unsigned int val, save;
+   ssize_t rc;
+
+   rc = kstrtouint(buf, 0, );
+   if (rc < 0)
+   return rc;
+
+   /* some buses (Thunderbolt) support authorized values > 1 */
+   if (val > 1 && !dev->bus->authorize)
+   return -EINVAL;
+
+   device_lock(dev);
+   save = dev->authorized;
+   if (save == val) {
+   rc = count;
+   goto err;
+   }
+
+   dev->authorized = val;
+   if (dev->bus->authorize) {
+   /* notify bus about change in authorization state */
+   rc = dev->bus->authorize(dev);
+   if (rc) {
+   dev->authorized = save;
+   goto err;
+   }
+   }
+   device_unlock(dev);
+
+   if (dev->authorized) {
+   if (!device_attach(dev))
+   dev_dbg(dev, "failed to probe after authorize\n");
+   } else
+   device_release_driver(dev);
+   return count;
+
+err:
+   device_unlock(dev);
+   return rc < 0 ? rc : count;
+}
+static DEVICE_ATTR_RW(authorized);
+
 static ssize_t removable_show(struct device *dev, struct
device_attribute *attr,
  char *buf)
 {
@@ -2616,8 +2668,16 @@ static int device_add_attrs(struct device *dev)
goto err_remove_dev_waiting_f

Re: [PATCH v2 4/6] virtio: Initialize authorized attribute for confidential guest

2021-10-04 Thread Dan Williams
On Sat, Oct 2, 2021 at 7:20 AM Andi Kleen  wrote:
>
>
> On 10/2/2021 4:14 AM, Greg Kroah-Hartman wrote:
> > On Sat, Oct 02, 2021 at 07:04:28AM -0400, Michael S. Tsirkin wrote:
> >> On Fri, Oct 01, 2021 at 08:49:28AM -0700, Andi Kleen wrote:
> Do you have a list of specific drivers and kernel options that you
>  feel you now "trust"?
> >>> For TDX it's currently only virtio net/block/console
> >>>
> >>> But we expect this list to grow slightly over time, but not at a high rate
> >>> (so hopefully <10)
> >> Well there are already >10 virtio drivers and I think it's reasonable
> >> that all of these will be used with encrypted guests. The list will
> >> grow.
> > What is keeping "all" drivers from being on this list?
>
> It would be too much work to harden them all, and it would be pointless
> because all these drivers are never legitimately needed in a virtualized
> environment which only virtualize a very small number of devices.
>
> >   How exactly are
> > you determining what should, and should not, be allowed?
>
> Everything that has had reasonable effort at hardening can be added. But
> if someone proposes to add a driver that should trigger additional
> scrutiny in code review. We should also request them to do some fuzzing.
>
> It's a bit similar to someone trying to add a new syscall interface.
> That also triggers much additional scrutiny for good reasons and people
> start fuzzing it.
>
>
> >How can
> > drivers move on, or off, of it over time?
>
> Adding something is submitting a patch to the allow list.
>
> I'm not sure the "off" case would happen, unless the driver is
> completely removed, or maybe it has some unfixable security problem. But
> that is all rather unlikely.
>
>
> >
> > And why not just put all of that into userspace and have it pick and
> > choose?  That should be the end-goal here, you don't want to encode
> > policy like this in the kernel, right?
>
> How would user space know what drivers have been hardened? This is
> really something that the kernel needs to determine. I don't think we
> can outsource it to anyone else.

How it is outsourcing by moving that same allow list over the kernel /
user boundary. It can be maintained by the same engineers and get
deployed by something like:

dracut --authorize-device-list=confidential-computing-default $kernel-version

With that distributions can deploy kernel-specific authorizations and
admins can deploy site-specific authorizations. Then the kernel
implementation is minimized to authorize just enough drivers by
default to let userspace take over the policy.

> Also BTW of course user space can still override it, but really the
> defaults should be a kernel policy.

The default is secure, trust nothing but bootstrap devices.

> There's also the additional problem that one of the goals of
> confidential guest is to just move existing guest virtual images into
> them without much changes. So it's better for such a case if as much as
> possible of the policy is in the kernel. But that's more a secondary
> consideration, the first point is really the important part.

The same image can be used on host and guest in this "do it in
userspace" proposal.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 4/6] virtio: Initialize authorized attribute for confidential guest

2021-10-01 Thread Dan Williams
On Fri, Oct 1, 2021 at 12:02 PM Alan Stern  wrote:
>
> On Fri, Oct 01, 2021 at 11:09:52AM -0700, Dan Williams wrote:
> > On Fri, Oct 1, 2021 at 9:47 AM Alan Stern  wrote:
> > >
> > > On Fri, Oct 01, 2021 at 09:13:54AM -0700, Dan Williams wrote:
> > > > Bear with me, and perhaps it's a lack of imagination on my part, but I
> > > > don't see how to get to a globally generic "authorized" sysfs ABI
> > > > given that USB and Thunderbolt want to do bus specific actions on
> > > > authorization toggle events. Certainly a default generic authorized
> > > > attribute can be defined for all the other buses that don't have
> > > > legacy here, but Thunderbolt will still require support for '2' as an
> > > > authorized value, and USB will still want to base probe decisions on
> > > > the authorization state of both the usb_device and the usb_interface.
> > >
> > > The USB part isn't really accurate (I can't speak for Thunderbolt).
> > > When a usb_device is deauthorized, the device will be unconfigured,
> > > deleting all its interfaces and removing the need for any probe
> > > decisions about them.  In other words, the probe decision for a
> > > usb_device or usb_interface depends only on the device's/interface's
> > > own authorization state.
> > >
> > > True, the interface binding code does contain a test of the device's
> > > authorization setting.  That test is redundant and can be removed.
> > >
> > > The actions that USB wants to take on authorization toggle events for
> > > usb_devices are: for authorize, select and install a configuration;
> > > for deauthorize, unconfigure the device.  Each of these could be
> > > handled simply enough just by binding/unbinding the device.  (There
> > > is some special code for handling wireless USB devices, but wireless
> > > USB is now defunct.)
> >
> > Ah, so are you saying that it would be sufficient for USB if the
> > generic authorized implementation did something like:
> >
> > dev->authorized = 1;
> > device_attach(dev);
> >
> > ...for the authorize case, and:
> >
> > dev->authorize = 0;
> > device_release_driver(dev);
> >
> > ...for the deauthorize case?
>
> Yes, I think so.  But I haven't tried making this change to test and
> see what really happens.

Sounds like a useful path for this effort to explore. Especially as
Greg seems to want the proposed "has_probe_authorization" flag in the
bus_type to disappear and make this all generic. It just seems that
Thunderbolt would need deeper surgery to move what it does in the
authorization toggle path into the probe and remove paths.

Mika, do you see a path for Thunderbolt to align its authorization
paths behind bus ->probe() ->remove() events similar to what USB might
be able to support for a generic authorization path?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 4/6] virtio: Initialize authorized attribute for confidential guest

2021-10-01 Thread Dan Williams
On Fri, Oct 1, 2021 at 9:47 AM Alan Stern  wrote:
>
> On Fri, Oct 01, 2021 at 09:13:54AM -0700, Dan Williams wrote:
> > Bear with me, and perhaps it's a lack of imagination on my part, but I
> > don't see how to get to a globally generic "authorized" sysfs ABI
> > given that USB and Thunderbolt want to do bus specific actions on
> > authorization toggle events. Certainly a default generic authorized
> > attribute can be defined for all the other buses that don't have
> > legacy here, but Thunderbolt will still require support for '2' as an
> > authorized value, and USB will still want to base probe decisions on
> > the authorization state of both the usb_device and the usb_interface.
>
> The USB part isn't really accurate (I can't speak for Thunderbolt).
> When a usb_device is deauthorized, the device will be unconfigured,
> deleting all its interfaces and removing the need for any probe
> decisions about them.  In other words, the probe decision for a
> usb_device or usb_interface depends only on the device's/interface's
> own authorization state.
>
> True, the interface binding code does contain a test of the device's
> authorization setting.  That test is redundant and can be removed.
>
> The actions that USB wants to take on authorization toggle events for
> usb_devices are: for authorize, select and install a configuration;
> for deauthorize, unconfigure the device.  Each of these could be
> handled simply enough just by binding/unbinding the device.  (There
> is some special code for handling wireless USB devices, but wireless
> USB is now defunct.)

Ah, so are you saying that it would be sufficient for USB if the
generic authorized implementation did something like:

dev->authorized = 1;
device_attach(dev);

...for the authorize case, and:

dev->authorize = 0;
device_release_driver(dev);

...for the deauthorize case?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 4/6] virtio: Initialize authorized attribute for confidential guest

2021-10-01 Thread Dan Williams
On Fri, Oct 1, 2021 at 12:03 AM Greg Kroah-Hartman
 wrote:
>
> On Thu, Sep 30, 2021 at 12:04:05PM -0700, Kuppuswamy, Sathyanarayanan wrote:
> >
> >
> > On 9/30/21 8:23 AM, Greg Kroah-Hartman wrote:
> > > On Thu, Sep 30, 2021 at 08:18:18AM -0700, Kuppuswamy, Sathyanarayanan 
> > > wrote:
> > > >
> > > >
> > > > On 9/30/21 6:36 AM, Dan Williams wrote:
> > > > > > And in particular, not all virtio drivers are hardened -
> > > > > > I think at this point blk and scsi drivers have been hardened - so
> > > > > > treating them all the same looks wrong.
> > > > > My understanding was that they have been audited, Sathya?
> > > >
> > > > Yes, AFAIK, it has been audited. Andi also submitted some patches
> > > > related to it. Andi, can you confirm.
> > >
> > > What is the official definition of "audited"?
> >
> >
> > In our case (Confidential Computing platform), the host is an un-trusted
> > entity. So any interaction with host from the drivers will have to be
> > protected against the possible attack from the host. For example, if we
> > are accessing a memory based on index value received from host, we have
> > to make sure it does not lead to out of bound access or when sharing the
> > memory with the host, we need to make sure only the required region is
> > shared with the host and the memory is un-shared after use properly.
>
> You have not defined the term "audited" here at all in any way that can
> be reviewed or verified by anyone from what I can tell.

Agree.

>
> You have only described a new model that you wish the kernel to run in,
> one in which it does not trust the hardware at all.  That is explicitly
> NOT what the kernel has been designed for so far, and if you wish to
> change that, lots of things need to be done outside of simply running
> some fuzzers on a few random drivers.
>
> For one example, how do you ensure that the memory you are reading from
> hasn't been modified by the host between writing to it the last time you
> did?  Do you have a list of specific drivers and kernel options that you
> feel you now "trust"?  If so, how long does that trust last for?  Until
> someonen else modifies that code?  What about modifications to functions
> that your "audited" code touches?  Who is doing this auditing?  How do
> you know the auditing has been done correctly?  Who has reviewed and
> audited the tools that are doing the auditing?  Where is the
> specification that has been agreed on how the auditing must be done?
> And so on...
>
> I feel like there are a lot of different things all being mixed up here
> into one "oh we want this to happen!" type of thread.  Please let's just
> stick to the one request that I had here, which was to move the way that
> busses are allowed to authorize the devices they wish to control into a
> generic way instead of being bus-specific logic.

I want to continue to shave this proposal down, but that last sentence
reads as self-contradictory.

Bear with me, and perhaps it's a lack of imagination on my part, but I
don't see how to get to a globally generic "authorized" sysfs ABI
given that USB and Thunderbolt want to do bus specific actions on
authorization toggle events. Certainly a default generic authorized
attribute can be defined for all the other buses that don't have
legacy here, but Thunderbolt will still require support for '2' as an
authorized value, and USB will still want to base probe decisions on
the authorization state of both the usb_device and the usb_interface.

> Any requests outside of that type of functionality are just that,
> outside the scope of this patchset and should get their own patch series
> and discussion.

A way to shave this further would be to default authorize just enough
devices to do the rest of the authorization from initramfs. That would
seem to cut out 'virtio-net' and 'virtio-storage' from default
authorized list.

The generic authorized attribute would only need to appear when
'dev_default_authorization' is set to false.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 2/6] driver core: Add common support to skip probe for un-authorized devices

2021-09-30 Thread Dan Williams
On Thu, Sep 30, 2021 at 6:41 PM Alan Stern  wrote:
>
> On Thu, Sep 30, 2021 at 01:52:59PM -0700, Dan Williams wrote:
> > On Thu, Sep 30, 2021 at 1:44 PM Alan Stern  
> > wrote:
> > >
> > > On Thu, Sep 30, 2021 at 12:23:36PM -0700, Andi Kleen wrote:
> > > >
> > > > > I don't think the current mitigations under discussion here are about
> > > > > keeping the system working. In fact most encrypted VM configs tend to
> > > > > stop booting as a preferred way to handle security issues.
> > > >
> > > > Maybe we should avoid the "trusted" term here. We're only really using 
> > > > it
> > > > because USB is using it and we're now using a common framework like Greg
> > > > requested. But I don't think it's the right way to think about it.
> > > >
> > > > We usually call the drivers "hardened". The requirement for a hardened
> > > > driver is that all interactions through MMIO/port/config space IO/MSRs 
> > > > are
> > > > sanitized and do not cause memory safety issues or other information 
> > > > leaks.
> > > > Other than that there is no requirement on the functionality. In 
> > > > particular
> > > > DOS is ok since a malicious hypervisor can decide to not run the guest 
> > > > at
> > > > any time anyways.
> > > >
> > > > Someone loading an malicious driver inside the guest would be out of 
> > > > scope.
> > > > If an attacker can do that inside the guest you already violated the
> > > > security mechanisms and there are likely easier ways to take over the 
> > > > guest
> > > > or leak data.
> > > >
> > > > The goal of the device filter mechanism is to prevent loading unhardened
> > > > drivers that could be exploited without them being themselves malicious.
> > >
> > > If all you want to do is prevent someone from loading a bunch of
> > > drivers that you have identified as unhardened, why not just use a
> > > modprobe blacklist?  Am I missing something?
> >
> > modules != drivers (i.e. multi-driver modules are a thing) and builtin
> > modules do not adhere to modprobe policy.
> >
> > There is also a desire to be able to support a single kernel image
> > across hosts and guests. So, if you were going to say, "just compile
> > all unnecessary drivers as modules" that defeats the common kernel
> > image goal. For confidential computing the expectation is that the
> > necessary device set is small. As you can see in the patches in this
> > case it's just a few lines of PCI ids and a hack to the virtio bus to
> > achieve the goal of disabling all extraneous devices by default.
>
>
>
> If your goal is to prevent some unwanted _drivers_ from operating --
> or all but a few desired drivers, as the case may be -- why extend
> the "authorized" API to all _devices_?  Why not instead develop a
> separate API (but of similar form) for drivers?
>
> Wouldn't that make more sense?  It corresponds a lot more directly
> with what you say you want to accomplish.

This was v1. v1 was NAKd [1] [2]:

[1]: https://lore.kernel.org/all/yqwpa+layt7yz...@kroah.com/
[2]: https://lore.kernel.org/all/yqzdqm6foezm6...@kroah.com/

> What would you do in the theoretical case where two separate drivers
> can manage the same device, but one of them is desired (or hardened)
> and the other isn't?

Allow for user override, just like we do today for new_id, remove_id,
bind, and unbind  when default driver policy is insufficient.

echo 1 > /sys/bus/$bus/devices/$device/authorized
echo $device > /sys/bus/$bus/drivers/$desired_driver/bind

The device filter is really only necessary to bootstrap until you can
run override policy scripts. The driver firewall approach was overkill
in that regard.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 2/6] driver core: Add common support to skip probe for un-authorized devices

2021-09-30 Thread Dan Williams
On Thu, Sep 30, 2021 at 1:44 PM Alan Stern  wrote:
>
> On Thu, Sep 30, 2021 at 12:23:36PM -0700, Andi Kleen wrote:
> >
> > > I don't think the current mitigations under discussion here are about
> > > keeping the system working. In fact most encrypted VM configs tend to
> > > stop booting as a preferred way to handle security issues.
> >
> > Maybe we should avoid the "trusted" term here. We're only really using it
> > because USB is using it and we're now using a common framework like Greg
> > requested. But I don't think it's the right way to think about it.
> >
> > We usually call the drivers "hardened". The requirement for a hardened
> > driver is that all interactions through MMIO/port/config space IO/MSRs are
> > sanitized and do not cause memory safety issues or other information leaks.
> > Other than that there is no requirement on the functionality. In particular
> > DOS is ok since a malicious hypervisor can decide to not run the guest at
> > any time anyways.
> >
> > Someone loading an malicious driver inside the guest would be out of scope.
> > If an attacker can do that inside the guest you already violated the
> > security mechanisms and there are likely easier ways to take over the guest
> > or leak data.
> >
> > The goal of the device filter mechanism is to prevent loading unhardened
> > drivers that could be exploited without them being themselves malicious.
>
> If all you want to do is prevent someone from loading a bunch of
> drivers that you have identified as unhardened, why not just use a
> modprobe blacklist?  Am I missing something?

modules != drivers (i.e. multi-driver modules are a thing) and builtin
modules do not adhere to modprobe policy.

There is also a desire to be able to support a single kernel image
across hosts and guests. So, if you were going to say, "just compile
all unnecessary drivers as modules" that defeats the common kernel
image goal. For confidential computing the expectation is that the
necessary device set is small. As you can see in the patches in this
case it's just a few lines of PCI ids and a hack to the virtio bus to
achieve the goal of disabling all extraneous devices by default.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 1/6] driver core: Move the "authorized" attribute from USB/Thunderbolt to core

2021-09-30 Thread Dan Williams
On Thu, Sep 30, 2021 at 12:50 PM Kuppuswamy, Sathyanarayanan
 wrote:
>
>
>
> On 9/30/21 12:04 PM, Dan Williams wrote:
> >>> That's why it was highlighted in the changelog. Hopefully a
> >>> Thunderbolt developer can confirm if it is a non-issue.
> >>> Documentation/ABI/testing/sysfs-bus-thunderbolt does not seem to
> >>> answer this question about whether authorized_show and
> >>> authorized_store need to be symmetric.
> >> Apparently, Bolt does read it [1] and cares about it [2].
> > Ah, thank you!
> >
> > Yeah, looks like the conversion to bool was indeed too hopeful.
> >
>
> IIUC, the end result of value "2" in authorized sysfs is to just
> "authorize" or "de-authorize". In that case, can the user space
> driver adapt to this int->bool change? Just want to know the
> possibility.

ABIs are forever. The kernel has to uphold its contract to bolt that
it will return '2' and not '1' after '2' has been written.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 1/6] driver core: Move the "authorized" attribute from USB/Thunderbolt to core

2021-09-30 Thread Dan Williams
On Thu, Sep 30, 2021 at 11:25 AM Yehezkel Bernat  wrote:
>
> On Thu, Sep 30, 2021 at 6:28 PM Dan Williams  wrote:
> >
> > On Thu, Sep 30, 2021 at 4:20 AM Yehezkel Bernat  
> > wrote:
> > >
> > > On Thu, Sep 30, 2021 at 4:05 AM Kuppuswamy Sathyanarayanan
> > >  wrote:
> > > >
> > > > no functional
> > > > changes other than value 2 being not visible to the user.
> > > >
> > >
> > > Are we sure we don't break any user-facing tool with it? Tools might use 
> > > this to
> > > "remember" how the device was authorized this time.
> >
> > That's why it was highlighted in the changelog. Hopefully a
> > Thunderbolt developer can confirm if it is a non-issue.
> > Documentation/ABI/testing/sysfs-bus-thunderbolt does not seem to
> > answer this question about whether authorized_show and
> > authorized_store need to be symmetric.
>
> Apparently, Bolt does read it [1] and cares about it [2].

Ah, thank you!

Yeah, looks like the conversion to bool was indeed too hopeful.

>
> [1] 
> https://gitlab.freedesktop.org/bolt/bolt/-/blob/130e09d1c7ff02c09e4ad1c9c36e9940b68e58d8/boltd/bolt-sysfs.c#L511
> [2] 
> https://gitlab.freedesktop.org/bolt/bolt/-/blob/130e09d1c7ff02c09e4ad1c9c36e9940b68e58d8/boltd/bolt-device.c#L639
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 1/6] driver core: Move the "authorized" attribute from USB/Thunderbolt to core

2021-09-30 Thread Dan Williams
On Thu, Sep 30, 2021 at 4:20 AM Yehezkel Bernat  wrote:
>
> On Thu, Sep 30, 2021 at 4:05 AM Kuppuswamy Sathyanarayanan
>  wrote:
> >
> > no functional
> > changes other than value 2 being not visible to the user.
> >
>
> Are we sure we don't break any user-facing tool with it? Tools might use this 
> to
> "remember" how the device was authorized this time.

That's why it was highlighted in the changelog. Hopefully a
Thunderbolt developer can confirm if it is a non-issue.
Documentation/ABI/testing/sysfs-bus-thunderbolt does not seem to
answer this question about whether authorized_show and
authorized_store need to be symmetric.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 1/6] driver core: Move the "authorized" attribute from USB/Thunderbolt to core

2021-09-30 Thread Dan Williams
On Thu, Sep 30, 2021 at 8:00 AM Alan Stern  wrote:
>
> On Wed, Sep 29, 2021 at 06:55:12PM -0700, Dan Williams wrote:
> > On Wed, Sep 29, 2021 at 6:43 PM Alan Stern  
> > wrote:
> > >
> > > On Wed, Sep 29, 2021 at 06:05:06PM -0700, Kuppuswamy Sathyanarayanan 
> > > wrote:
> > > > Currently bus drivers like "USB" or "Thunderbolt" implement a custom
> > > > version of device authorization to selectively authorize the driver
> > > > probes. Since there is a common requirement, move the "authorized"
> > > > attribute support to the driver core in order to allow it to be used
> > > > by other subsystems / buses.
> > > >
> > > > Similar requirements have been discussed in the PCI [1] community for
> > > > PCI bus drivers as well.
> > > >
> > > > No functional changes are intended. It just converts authorized
> > > > attribute from int to bool and moves it to the driver core. There
> > > > should be no user-visible change in the location or semantics of
> > > > attributes for USB devices.
> > > >
> > > > Regarding thunderbolt driver, although it declares sw->authorized as
> > > > "int" and allows 0,1,2 as valid values for sw->authorized attribute,
> > > > but within the driver, in all authorized attribute related checks,
> > > > it is treated as bool value. So when converting the authorized
> > > > attribute from int to bool value, there should be no functional
> > > > changes other than value 2 being not visible to the user.
> > > >
> > > > [1]: 
> > > > https://lore.kernel.org/all/CACK8Z6E8pjVeC934oFgr=vb3pulx_gyt2nkzaogdrqj9tks...@mail.gmail.com/
> > > >
> > > > Reviewed-by: Dan Williams 
> > > > Signed-off-by: Kuppuswamy Sathyanarayanan 
> > > > 
> > >
> > > Since you're moving the authorized flag from the USB core to the
> > > driver core, the corresponding sysfs attribute functions should be
> > > moved as well.
> >
> > Unlike when 'removable' moved from USB to the driver core there isn't
> > a common definition for how the 'authorized' sysfs-attribute behaves
> > across buses. The only common piece is where this flag is stored in
> > the data structure, i.e. the 'authorized' sysfs interface is
> > purposefully left bus specific.
>
> How about implementing "library" versions of show_authorized() and
> store_authorized() that the bus-specific attribute routines can call?
> These library routines would handle parsing the input values, storing
> the new flag, and displaying the stored flag value.  That way at
> least the common parts of these APIs would be centralized in the
> driver core, and any additional functionality could easily be added
> by the bus-specific attribute routine.
>

While show_authorized() seems like it could be standardized, have a
look at what the different store_authorized() implementations do.
Thunderbolt wants "switch approval" vs "switch challenge" and USB has
a bunch of bus-specific work to do when the authorization state
changes. I don't see much room for a library to help there as more
buses add authorization support. That said I do think it would be
useful to have a common implementation available for generic probe
authorization to toggle the flag if the bus does not have any
authorization work to do, but that seems a follow-on once this core is
accepted.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 4/6] virtio: Initialize authorized attribute for confidential guest

2021-09-30 Thread Dan Williams
On Thu, Sep 30, 2021 at 4:03 AM Michael S. Tsirkin  wrote:
>
> On Wed, Sep 29, 2021 at 06:05:09PM -0700, Kuppuswamy Sathyanarayanan wrote:
> > Confidential guest platforms like TDX have a requirement to allow
> > only trusted devices. By default the confidential-guest core will
> > arrange for all devices to default to unauthorized (via
> > dev_default_authorization) in device_initialize(). Since virtio
> > driver is already hardened against the attack from the un-trusted host,
> > override the confidential computing default unauthorized state
> >
> > Reviewed-by: Dan Williams 
> > Signed-off-by: Kuppuswamy Sathyanarayanan 
> > 
>
> Architecturally this all looks backwards. IIUC nothing about virtio
> makes it authorized or trusted. The driver is hardened,
> true, but this should be set at the driver not the device level.

That's was my initial reaction to this proposal as well, and I ended
up leading Sathya astray from what Greg wanted. Greg rightly points
out that the "authorized" attribute from USB and Thunderbolt already
exists [1] [2]. So the choice is find an awkward way to mix driver
trust with existing bus-local "authorized" mechanisms, or promote the
authorized capability to the driver-core. This patch set implements
the latter to keep the momentum on the already shipping design scheme
to not add to the driver-core maintenance burden.

[1]: https://lore.kernel.org/all/yquaj78y8j1um...@kroah.com/
[2]: https://lore.kernel.org/all/yqzf%2futgrjfbz...@kroah.com/

> And in particular, not all virtio drivers are hardened -
> I think at this point blk and scsi drivers have been hardened - so
> treating them all the same looks wrong.

My understanding was that they have been audited, Sathya?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 1/6] driver core: Move the "authorized" attribute from USB/Thunderbolt to core

2021-09-29 Thread Dan Williams
On Wed, Sep 29, 2021 at 7:39 PM Kuppuswamy, Sathyanarayanan
 wrote:
>
>
>
> On 9/29/21 6:55 PM, Dan Williams wrote:
> >> Also, you ignored the usb_[de]authorize_interface() functions and
> >> their friends.
> > Ugh, yes.
>
> I did not change it because I am not sure about the interface vs device
> dependency.
>

This is was the rationale for has_probe_authorization flag. USB
performs authorization of child devices based on the authorization
state of the parent interface.

> I think following change should work.
>
> diff --git a/drivers/usb/core/driver.c b/drivers/usb/core/driver.c
> index f57b5a7a90ca..84969732d09c 100644
> --- a/drivers/usb/core/driver.c
> +++ b/drivers/usb/core/driver.c
> @@ -334,7 +334,7 @@ static int usb_probe_interface(struct device *dev)
> if (udev->dev.authorized == false) {
> dev_err(>dev, "Device is not authorized for usage\n");
> return error;
> -   } else if (intf->authorized == 0) {
> +   } else if (intf->dev.authorized == 0) {

== false.

> dev_err(>dev, "Interface %d is not authorized for 
> usage\n",
> intf->altsetting->desc.bInterfaceNumber);
> return error;
> @@ -546,7 +546,7 @@ int usb_driver_claim_interface(struct usb_driver *driver,
> return -EBUSY;
>
> /* reject claim if interface is not authorized */
> -   if (!iface->authorized)
> +   if (!iface->dev.authorized)

I'd do == false to keep it consistent with other conversions.

> return -ENODEV;
>
> dev->driver = >drvwrap.driver;
> diff --git a/drivers/usb/core/message.c b/drivers/usb/core/message.c
> index 47548ce1cfb1..ab3c8d1e4db9 100644
> --- a/drivers/usb/core/message.c
> +++ b/drivers/usb/core/message.c
> @@ -1791,9 +1791,9 @@ void usb_deauthorize_interface(struct usb_interface 
> *intf)
>
> device_lock(dev->parent);
>
> -   if (intf->authorized) {
> +   if (intf->dev.authorized) {
> device_lock(dev);
> -   intf->authorized = 0;
> +   intf->dev.authorized = 0;

= false;

> device_unlock(dev);
>
> usb_forced_unbind_intf(intf);
> @@ -1811,9 +1811,9 @@ void usb_authorize_interface(struct usb_interface *intf)
>   {
> struct device *dev = >dev;
>
> -   if (!intf->authorized) {
> +   if (!intf->dev.authorized) {
> device_lock(dev);
> -   intf->authorized = 1; /* authorize interface */
> +   intf->dev.authorized = 1; /* authorize interface */

= true

...not sure that comment is worth preserving.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 1/6] driver core: Move the "authorized" attribute from USB/Thunderbolt to core

2021-09-29 Thread Dan Williams
On Wed, Sep 29, 2021 at 6:43 PM Alan Stern  wrote:
>
> On Wed, Sep 29, 2021 at 06:05:06PM -0700, Kuppuswamy Sathyanarayanan wrote:
> > Currently bus drivers like "USB" or "Thunderbolt" implement a custom
> > version of device authorization to selectively authorize the driver
> > probes. Since there is a common requirement, move the "authorized"
> > attribute support to the driver core in order to allow it to be used
> > by other subsystems / buses.
> >
> > Similar requirements have been discussed in the PCI [1] community for
> > PCI bus drivers as well.
> >
> > No functional changes are intended. It just converts authorized
> > attribute from int to bool and moves it to the driver core. There
> > should be no user-visible change in the location or semantics of
> > attributes for USB devices.
> >
> > Regarding thunderbolt driver, although it declares sw->authorized as
> > "int" and allows 0,1,2 as valid values for sw->authorized attribute,
> > but within the driver, in all authorized attribute related checks,
> > it is treated as bool value. So when converting the authorized
> > attribute from int to bool value, there should be no functional
> > changes other than value 2 being not visible to the user.
> >
> > [1]: 
> > https://lore.kernel.org/all/CACK8Z6E8pjVeC934oFgr=vb3pulx_gyt2nkzaogdrqj9tks...@mail.gmail.com/
> >
> > Reviewed-by: Dan Williams 
> > Signed-off-by: Kuppuswamy Sathyanarayanan 
> > 
>
> Since you're moving the authorized flag from the USB core to the
> driver core, the corresponding sysfs attribute functions should be
> moved as well.

Unlike when 'removable' moved from USB to the driver core there isn't
a common definition for how the 'authorized' sysfs-attribute behaves
across buses. The only common piece is where this flag is stored in
the data structure, i.e. the 'authorized' sysfs interface is
purposefully left bus specific.

> Also, you ignored the usb_[de]authorize_interface() functions and
> their friends.

Ugh, yes.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 1/3] /dev/mem: disallow access to explicitly excluded system RAM regions

2021-08-25 Thread Dan Williams
On Wed, Aug 25, 2021 at 12:23 AM David Hildenbrand  wrote:
>
> On 25.08.21 02:58, Dan Williams wrote:
> > On Mon, Aug 16, 2021 at 7:25 AM David Hildenbrand  wrote:
> >>
> >> virtio-mem dynamically exposes memory inside a device memory region as
> >> system RAM to Linux, coordinating with the hypervisor which parts are
> >> actually "plugged" and consequently usable/accessible. On the one hand, the
> >> virtio-mem driver adds/removes whole memory blocks, creating/removing busy
> >> IORESOURCE_SYSTEM_RAM resources, on the other hand, it logically (un)plugs
> >> memory inside added memory blocks, dynamically either exposing them to
> >> the buddy or hiding them from the buddy and marking them PG_offline.
> >>
> >> virtio-mem wants to make sure that in a sane environment, nobody
> >> "accidentially" accesses unplugged memory inside the device managed
> >> region. After /proc/kcore has been sanitized and /dev/kmem has been
> >> removed, /dev/mem is the remaining interface that still allows uncontrolled
> >> access to the device-managed region of virtio-mem devices from user
> >> space.
> >>
> >> There is no known sane use case for mapping virtio-mem device memory
> >> via /dev/mem while virtio-mem driver concurrently (un)plugs memory inside
> >> that region. So once the driver was loaded and detected the device
> >> along the device-managed region, we just want to disallow any access via
> >> /dev/mem to it.
> >>
> >> Let's add the basic infrastructure to exclude some physical memory
> >> regions completely from /dev/mem access, on any architecture and under
> >> any system configuration (independent of CONFIG_STRICT_DEVMEM and
> >> independent of "iomem=").
> >
> > I'm certainly on team "/dev/mem considered harmful", but this approach
> > feels awkward. It feels wrong for being non-committal about whether
> > CONFIG_STRICT_DEVMEM is in wide enough use that the safety can be
> > turned on all the time, and the configuration option dropped, or there
> > are users clinging onto /dev/mem where they expect to be able to build
> > a debug kernel to turn all of these restrictions off, even the
> > virtio-mem ones. This splits the difference and says some /dev/mem
> > accesses are always disallowed for "reasons", but I could say the same
> > thing about pmem, there's no sane reason to allow /dev/mem which has
> > no idea about the responsibilities of properly touching pmem to get
> > access to it.
>
> For virtio-mem, there is no use case *and* access could be harmful; I
> don't even want to allow if for debugging purposes. If you want to
> inspect virtio-mem device memory content, use /proc/kcore, which
> performs proper synchronized access checks. Modifying random virtio-mem
> memory via /dev/mem in a debug kernel will not be possible: if you
> really have to play with fire, use kdb or better don't load the
> virtio-mem driver during boot, such that the kernel won't even be making
> use of device memory.
>
> I don't want people disabling CONFIG_STRICT_DEVMEM, or booting with
> "iomem=relaxed", and "accidentally" accessing any of virtio-mem memory
> via /dev/mem, while it gets concurrently plugged/unplugged by the
> virtio-mem driver. Not even for debugging purposes.

That sounds more an argument that all of the existing "kernel is using
this region" cases should become mandatory exclusions. If unloading
the driver removes the exclusion then that's precisely
CONFIG_IO_STRICT_DEVMEM. Why is the virtio-mem driver more special
than any other driver that expects this integrity guarantee?

> We disallow mapping to some other regions independent of
> CONFIG_STRICT_DEVMEM already, so the idea to ignore CONFIG_STRICT_DEVMEM
> is not completely new:
>
> "Note that with PAT support enabled, even in this case there are
> restrictions on /dev/mem use due to the cache aliasing requirements."
>
> Maybe you even want to do something similar with PMEM now that there is
> infrastructure for it and just avoid having to deal with revoking
> /dev/mem mappings later.

That would be like blocking writes to /dev/sda just because a
filesytem might later be mounted on it. If the /dev/mem access is not
actively colliding with other kernel operations what business does the
kernel have saying no?

I'm pushing on this topic because I am also considering an exclusion
on PCI configuration access to the "DOE mailbox" since it can disrupt
the kernel's operation, at the same time, root can go change PCI BARs
to nonsensical values whenever it wants which is also in t

Re: [PATCH v2 1/3] /dev/mem: disallow access to explicitly excluded system RAM regions

2021-08-24 Thread Dan Williams
On Mon, Aug 16, 2021 at 7:25 AM David Hildenbrand  wrote:
>
> virtio-mem dynamically exposes memory inside a device memory region as
> system RAM to Linux, coordinating with the hypervisor which parts are
> actually "plugged" and consequently usable/accessible. On the one hand, the
> virtio-mem driver adds/removes whole memory blocks, creating/removing busy
> IORESOURCE_SYSTEM_RAM resources, on the other hand, it logically (un)plugs
> memory inside added memory blocks, dynamically either exposing them to
> the buddy or hiding them from the buddy and marking them PG_offline.
>
> virtio-mem wants to make sure that in a sane environment, nobody
> "accidentially" accesses unplugged memory inside the device managed
> region. After /proc/kcore has been sanitized and /dev/kmem has been
> removed, /dev/mem is the remaining interface that still allows uncontrolled
> access to the device-managed region of virtio-mem devices from user
> space.
>
> There is no known sane use case for mapping virtio-mem device memory
> via /dev/mem while virtio-mem driver concurrently (un)plugs memory inside
> that region. So once the driver was loaded and detected the device
> along the device-managed region, we just want to disallow any access via
> /dev/mem to it.
>
> Let's add the basic infrastructure to exclude some physical memory
> regions completely from /dev/mem access, on any architecture and under
> any system configuration (independent of CONFIG_STRICT_DEVMEM and
> independent of "iomem=").

I'm certainly on team "/dev/mem considered harmful", but this approach
feels awkward. It feels wrong for being non-committal about whether
CONFIG_STRICT_DEVMEM is in wide enough use that the safety can be
turned on all the time, and the configuration option dropped, or there
are users clinging onto /dev/mem where they expect to be able to build
a debug kernel to turn all of these restrictions off, even the
virtio-mem ones. This splits the difference and says some /dev/mem
accesses are always disallowed for "reasons", but I could say the same
thing about pmem, there's no sane reason to allow /dev/mem which has
no idea about the responsibilities of properly touching pmem to get
access to it.


>
> Any range marked with "IORESOURCE_SYSTEM_RAM | IORESOURCE_EXCLUSIVE"
> will be excluded, even if not busy. For now, there are no applicable
> ranges and we'll modify virtio-mem next to properly set
> IORESOURCE_EXCLUSIVE on the parent resource.
>
> As next_resource() will iterate over children although we might want to
> skip a certain range completely, let's add and use
> next_range_skip_children() and for_each_resource(), to optimize that case,
> avoding having to traverse subtrees that are not of interest.
>
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/char/mem.c | 22 
>  include/linux/ioport.h |  1 +
>  kernel/resource.c  | 47 ++
>  lib/Kconfig.debug  |  4 +++-
>  4 files changed, 60 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/char/mem.c b/drivers/char/mem.c
> index 1c596b5cdb27..1829dc6a1f29 100644
> --- a/drivers/char/mem.c
> +++ b/drivers/char/mem.c
> @@ -60,13 +60,18 @@ static inline int valid_mmap_phys_addr_range(unsigned 
> long pfn, size_t size)
>  }
>  #endif
>
> -#ifdef CONFIG_STRICT_DEVMEM
>  static inline int page_is_allowed(unsigned long pfn)
>  {
> -   return devmem_is_allowed(pfn);
> +#ifdef CONFIG_STRICT_DEVMEM
> +   if (!devmem_is_allowed(pfn))
> +   return 0;
> +#endif /* CONFIG_STRICT_DEVMEM */
> +   return !iomem_range_contains_excluded_devmem(PFN_PHYS(pfn), 
> PAGE_SIZE);
>  }
> +
>  static inline int range_is_allowed(unsigned long pfn, unsigned long size)
>  {
> +#ifdef CONFIG_STRICT_DEVMEM
> u64 from = ((u64)pfn) << PAGE_SHIFT;
> u64 to = from + size;
> u64 cursor = from;
> @@ -77,18 +82,9 @@ static inline int range_is_allowed(unsigned long pfn, 
> unsigned long size)
> cursor += PAGE_SIZE;
> pfn++;
> }
> -   return 1;
> -}
> -#else
> -static inline int page_is_allowed(unsigned long pfn)
> -{
> -   return 1;
> -}
> -static inline int range_is_allowed(unsigned long pfn, unsigned long size)
> -{
> -   return 1;
> +#endif /* CONFIG_STRICT_DEVMEM */
> +   return !iomem_range_contains_excluded_devmem(PFN_PHYS(pfn), size);
>  }
> -#endif
>
>  #ifndef unxlate_dev_mem_ptr
>  #define unxlate_dev_mem_ptr unxlate_dev_mem_ptr
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index 8359c50f9988..d31f83281327 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -308,6 +308,7 @@ extern struct resource * __devm_request_region(struct 
> device *dev,
>  extern void __devm_release_region(struct device *dev, struct resource 
> *parent,
>   resource_size_t start, resource_size_t n);
>  extern int iomem_map_sanity_check(resource_size_t addr, unsigned long size);
> +extern bool 

Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

2021-08-24 Thread Dan Williams
On Tue, Aug 24, 2021 at 2:57 PM Rajat Jain  wrote:
>
> On Mon, Aug 23, 2021 at 6:06 PM Dan Williams  wrote:
> >
> > On Mon, Aug 23, 2021 at 5:31 PM Kuppuswamy, Sathyanarayanan
> >  wrote:
> > >
> > >
> > >
> > > On 8/23/21 4:56 PM, Michael S. Tsirkin wrote:
> > > >> Add a new variant of pci_iomap for mapping all PCI resources
> > > >> of a devices as shared memory with a hypervisor in a confidential
> > > >> guest.
> > > >>
> > > >> Signed-off-by: Andi Kleen
> > > >> Signed-off-by: Kuppuswamy 
> > > >> Sathyanarayanan
> > > > I'm a bit puzzled by this part. So why should the guest*not*  map
> > > > pci memory as shared? And if the answer is never (as it seems to be)
> > > > then why not just make regular pci_iomap DTRT?
> > >
> > > It is in the context of confidential guest (where VMM is un-trusted). So
> > > we don't want to make all PCI resource as shared. It should be allowed
> > > only for hardened drivers/devices.
> >
> > That's confusing, isn't device authorization what keeps unaudited
> > drivers from loading against untrusted devices? I'm feeling like
> > Michael that this should be a detail that drivers need not care about
> > explicitly, in which case it does not need to be exported because the
> > detail can be buried in lower levels.
> >
> > Note, I specifically said "unaudited", not "hardened" because as Greg
> > mentioned the kernel must trust drivers, its devices that may not be
> > trusted.
>
> Can you please point me to the thread where this discussion with Greg
> is ongoing?

It slowed down to implement the "move to the 'authorized' device
model" recommendation. LWN has a good writeup (as usual) and a link to
the thread:

https://lwn.net/Articles/865918/
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

2021-08-24 Thread Dan Williams
On Tue, Aug 24, 2021 at 1:50 PM Andi Kleen  wrote:
>
>
> On 8/24/2021 1:31 PM, Bjorn Helgaas wrote:
> > On Tue, Aug 24, 2021 at 01:14:02PM -0700, Andi Kleen wrote:
> >> On 8/24/2021 11:55 AM, Bjorn Helgaas wrote:
> >>> [+cc Rajat; I still don't know what "shared memory with a hypervisor
> >>> in a confidential guest" means,
> >> A confidential guest is a guest which uses memory encryption to isolate
> >> itself from the host. It doesn't trust the host. But it still needs to
> >> communicate with the host for IO, so it has some special memory areas that
> >> are explicitly marked shared. These are used to do IO with the host. All
> >> their usage needs to be carefully hardened to avoid any security attacks on
> >> the guest, that's why we want to limit this interaction only to a small set
> >> of hardened drivers. For MMIO, the set is currently only virtio and MSI-X.
> > Good material for the commit log next time around.  Thanks!
>
> This is all in the patch intro too, which should make it into the merge
> commits.
>
> I don't think we can reexplain the basic concepts for every individual
> patch in a large patch kit.

Maybe not the whole cover letter, but how about just a line in this
one that says "Recall that 'shared' in this context refers to memory
that lacks confidentiality and integrity protection from the VMM so
that it can communicate with the VM." Although I think
ioremap_noprotect() might be clearer than shared for the protected
guest use case?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v4 11/15] pci: Add pci_iomap_shared{,_range}

2021-08-23 Thread Dan Williams
On Mon, Aug 23, 2021 at 5:31 PM Kuppuswamy, Sathyanarayanan
 wrote:
>
>
>
> On 8/23/21 4:56 PM, Michael S. Tsirkin wrote:
> >> Add a new variant of pci_iomap for mapping all PCI resources
> >> of a devices as shared memory with a hypervisor in a confidential
> >> guest.
> >>
> >> Signed-off-by: Andi Kleen
> >> Signed-off-by: Kuppuswamy 
> >> Sathyanarayanan
> > I'm a bit puzzled by this part. So why should the guest*not*  map
> > pci memory as shared? And if the answer is never (as it seems to be)
> > then why not just make regular pci_iomap DTRT?
>
> It is in the context of confidential guest (where VMM is un-trusted). So
> we don't want to make all PCI resource as shared. It should be allowed
> only for hardened drivers/devices.

That's confusing, isn't device authorization what keeps unaudited
drivers from loading against untrusted devices? I'm feeling like
Michael that this should be a detail that drivers need not care about
explicitly, in which case it does not need to be exported because the
detail can be buried in lower levels.

Note, I specifically said "unaudited", not "hardened" because as Greg
mentioned the kernel must trust drivers, its devices that may not be
trusted.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 4/4] bus: Make remove callback return void

2021-07-06 Thread Dan Williams
On Tue, Jul 6, 2021 at 8:51 AM Uwe Kleine-König
 wrote:
>
> The driver core ignores the return value of this callback because there
> is only little it can do when a device disappears.
>
> This is the final bit of a long lasting cleanup quest where several
> buses were converted to also return void from their remove callback.
> Additionally some resource leaks were fixed that were caused by drivers
> returning an error code in the expectation that the driver won't go
> away.
>
> With struct bus_type::remove returning void it's prevented that newly
> implemented buses return an ignored error code and so don't anticipate
> wrong expectations for driver authors.
>

>  drivers/cxl/core.c| 3 +--
>  drivers/dax/bus.c | 4 +---
>  drivers/nvdimm/bus.c      | 3 +--

For CXL, DAX, and NVDIMM

Acked-by: Dan Williams 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC v2 PATCH 4/4] mm: pre zero out free pages to speed up page allocation for __GFP_ZERO

2021-01-04 Thread Dan Williams
On Mon, Jan 4, 2021 at 12:11 PM David Hildenbrand  wrote:
>
>
> > Am 04.01.2021 um 20:52 schrieb Dave Hansen :
> >
> > On 1/4/21 11:27 AM, Matthew Wilcox wrote:
> >>> On Mon, Jan 04, 2021 at 11:19:13AM -0800, Dave Hansen wrote:
> >>> On 12/21/20 8:30 AM, Liang Li wrote:
>  --- a/include/linux/page-flags.h
>  +++ b/include/linux/page-flags.h
>  @@ -137,6 +137,9 @@ enum pageflags {
>  #endif
>  #ifdef CONFIG_64BIT
> PG_arch_2,
>  +#endif
>  +#ifdef CONFIG_PREZERO_PAGE
>  +PG_zero,
>  #endif
> __NR_PAGEFLAGS,
> >>>
> >>> I don't think this is worth a generic page->flags bit.
> >>>
> >>> There's a ton of space in 'struct page' for pages that are in the
> >>> allocator.  Can't we use some of that space?
> >>
> >> I was going to object to that too, but I think the entire approach is
> >> flawed and needs to be thrown out.  It just nukes the caches in extremely
> >> subtle and hard to measure ways, lowering overall system performance.
> >
> > Yeah, it certainly can't be the default, but it *is* useful for thing
> > where we know that there are no cache benefits to zeroing close to where
> > the memory is allocated.
> >
> > The trick is opting into it somehow, either in a process or a VMA.
> >
>
> The patch set is mostly trying to optimize starting a new process. So 
> process/vma doesn‘t really work.
>
> I still wonder if using tmpfs/shmem cannot somehow be used to cover the 
> original use case of starting a new vm fast (or rebooting an existing one 
> involving restarting the process).

If it's rebooting a VM then file-backed should be able to skip the
zeroing because the stale data exposure is only to itself. If the
memory is being repurposed to a new VM then the file needs to be
reallocated / re-zeroed just like the anonymous case.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC v2 PATCH 4/4] mm: pre zero out free pages to speed up page allocation for __GFP_ZERO

2021-01-04 Thread Dan Williams
On Mon, Jan 4, 2021 at 11:28 AM Matthew Wilcox  wrote:
>
> On Mon, Jan 04, 2021 at 11:19:13AM -0800, Dave Hansen wrote:
> > On 12/21/20 8:30 AM, Liang Li wrote:
> > > --- a/include/linux/page-flags.h
> > > +++ b/include/linux/page-flags.h
> > > @@ -137,6 +137,9 @@ enum pageflags {
> > >  #endif
> > >  #ifdef CONFIG_64BIT
> > > PG_arch_2,
> > > +#endif
> > > +#ifdef CONFIG_PREZERO_PAGE
> > > +   PG_zero,
> > >  #endif
> > > __NR_PAGEFLAGS,
> >
> > I don't think this is worth a generic page->flags bit.
> >
> > There's a ton of space in 'struct page' for pages that are in the
> > allocator.  Can't we use some of that space?
>
> I was going to object to that too, but I think the entire approach is
> flawed and needs to be thrown out.  It just nukes the caches in extremely
> subtle and hard to measure ways, lowering overall system performance.

At a minimum the performance analysis should at least try to quantify
that externalized cost. Certainly that overhead went somewhere. Maybe
if this overhead was limited to run when the CPU would otherwise go
idle, but that might mean it never runs in practice...
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH mlx5-next v1 06/11] vdpa/mlx5: Connect mlx5_vdpa to auxiliary bus

2020-11-04 Thread Dan Williams
On Wed, Nov 4, 2020 at 11:32 PM gregkh  wrote:
>
> On Wed, Nov 04, 2020 at 03:21:23PM -0800, Dan Williams wrote:
> > On Tue, Nov 3, 2020 at 7:45 AM Jason Gunthorpe  wrote:
> > [..]
> > > > +MODULE_DEVICE_TABLE(auxiliary, mlx5v_id_table);
> > > > +
> > > > +static struct auxiliary_driver mlx5v_driver = {
> > > > + .name = "vnet",
> > > > + .probe = mlx5v_probe,
> > > > + .remove = mlx5v_remove,
> > > > + .id_table = mlx5v_id_table,
> > > > +};
> > >
> > > It is hard to see from the diff, but when this patch is applied the
> > > vdpa module looks like I imagined things would look with the auxiliary
> > > bus. It is very similar in structure to a PCI driver with the probe()
> > > function cleanly registering with its subsystem. This is what I'd like
> > > to see from the new Intel RDMA driver.
> > >
> > > Greg, I think this patch is the best clean usage example.
> > >
> > > I've looked over this series and it has the right idea and
> > > parts. There is definitely more that can be done to improve mlx5 in
> > > this area, but this series is well scoped and cleans a good part of
> > > it.
> >
> > Greg?
> >
> > I know you alluded to going your own way if the auxiliary bus patches
> > did not shape up soon, but it seems they have and the stakeholders
> > have reached this consensus point.
> >
> > Were there any additional changes you wanted to see happen? I'll go
> > give the final set another once over, but David has been diligently
> > fixing up all the declared major issues so I expect to find at most
> > minor incremental fixups.
>
> This is in my to-review pile, along with a load of other stuff at the
> moment:
> $ ~/bin/mdfrm -c ~/mail/todo/
> 1709 messages in /home/gregkh/mail/todo/
>
> So give me a chance.  There is no rush on my side for this given the
> huge delays that have happened here on the authorship side many times in
> the past :)

Sure, I was more looking to confirm that it's worth continuing to
polish this set given your mention of possibly going a different
direction.

> If you can review it, or anyone else, that is always most appreciated.

Thanks, will do.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH mlx5-next v1 06/11] vdpa/mlx5: Connect mlx5_vdpa to auxiliary bus

2020-11-04 Thread Dan Williams
On Tue, Nov 3, 2020 at 7:45 AM Jason Gunthorpe  wrote:
[..]
> > +MODULE_DEVICE_TABLE(auxiliary, mlx5v_id_table);
> > +
> > +static struct auxiliary_driver mlx5v_driver = {
> > + .name = "vnet",
> > + .probe = mlx5v_probe,
> > + .remove = mlx5v_remove,
> > + .id_table = mlx5v_id_table,
> > +};
>
> It is hard to see from the diff, but when this patch is applied the
> vdpa module looks like I imagined things would look with the auxiliary
> bus. It is very similar in structure to a PCI driver with the probe()
> function cleanly registering with its subsystem. This is what I'd like
> to see from the new Intel RDMA driver.
>
> Greg, I think this patch is the best clean usage example.
>
> I've looked over this series and it has the right idea and
> parts. There is definitely more that can be done to improve mlx5 in
> this area, but this series is well scoped and cleans a good part of
> it.

Greg?

I know you alluded to going your own way if the auxiliary bus patches
did not shape up soon, but it seems they have and the stakeholders
have reached this consensus point.

Were there any additional changes you wanted to see happen? I'll go
give the final set another once over, but David has been diligently
fixing up all the declared major issues so I expect to find at most
minor incremental fixups.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC] treewide: cleanup unreachable breaks

2020-10-17 Thread Dan Williams
On Sat, Oct 17, 2020 at 9:10 AM  wrote:
>
> From: Tom Rix 
>
> This is a upcoming change to clean up a new warning treewide.
> I am wondering if the change could be one mega patch (see below) or
> normal patch per file about 100 patches or somewhere half way by collecting
> early acks.
>
> clang has a number of useful, new warnings see
> https://clang.llvm.org/docs/DiagnosticsReference.html
>
> This change cleans up -Wunreachable-code-break
> https://clang.llvm.org/docs/DiagnosticsReference.html#wunreachable-code-break
> for 266 of 485 warnings in this week's linux-next, allyesconfig on x86_64.
>
> The method of fixing was to look for warnings where the preceding statement
> was a simple statement and by inspection made the subsequent break unneeded.
> In order of frequency these look like
>
> return and break
>
> switch (c->x86_vendor) {
> case X86_VENDOR_INTEL:
> intel_p5_mcheck_init(c);
> return 1;
> -   break;
>
> goto and break
>
> default:
> operation = 0; /* make gcc happy */
> goto fail_response;
> -   break;
>
> break and break
> case COLOR_SPACE_SRGB:
> /* by pass */
> REG_SET(OUTPUT_CSC_CONTROL, 0,
> OUTPUT_CSC_GRPH_MODE, 0);
> break;
> -   break;
>
> The exception to the simple statement, is a switch case with a block
> and the end of block is a return
>
> struct obj_buffer *buff = r->ptr;
> return scnprintf(str, PRIV_STR_SIZE,
> "size=%u\naddr=0x%X\n", buff->size,
> buff->addr);
> }
> -   break;
>
> Not considered obvious and excluded, breaks after
> multi level switches
> complicated if-else if-else blocks
> panic() or similar calls
>
> And there is an odd addition of a 'fallthrough' in drivers/tty/nozomi.c
[..]
> diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
> index 5a7c80053c62..2f250874b1a4 100644
> --- a/drivers/nvdimm/claim.c
> +++ b/drivers/nvdimm/claim.c
> @@ -200,11 +200,10 @@ ssize_t nd_namespace_store(struct device *dev,
> }
> break;
> default:
> len = -EBUSY;
> goto out_attach;
> -   break;
> }

Acked-by: Dan Williams 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-02 Thread Dan Williams
On Sat, May 2, 2020 at 2:27 AM David Hildenbrand  wrote:
>
> >> Now, let's clarify what I want regarding virtio-mem:
> >>
> >> 1. kexec should not add virtio-mem memory to the initial firmware
> >>memmap. The driver has to be in charge as discussed.
> >> 2. kexec should not place kexec images onto virtio-mem memory. That
> >>would end badly.
> >> 3. kexec should still dump virtio-mem memory via kdump.
> >
> > Ok, but then seems to say to me that dax/kmem is a different type of
> > (driver managed) than virtio-mem and it's confusing to try to apply
> > the same meaning. Why not just call your type for the distinct type it
> > is "System RAM (virtio-mem)" and let any other driver managed memory
> > follow the same "System RAM ($driver)" format if it wants?
>
> I had the same idea but discarded it because it seemed to uglify the
> add_memory() interface (passing yet another parameter only relevant for
> driver managed memory). Maybe we really want a new one, because I like
> that idea:
>
> /*
>  * Add special, driver-managed memory to the system as system ram.
>  * The resource_name is expected to have the name format "System RAM
>  * ($DRIVER)", so user space (esp. kexec-tools)" can special-case it.
>  *
>  * For this memory, no entries in /sys/firmware/memmap are created,
>  * as this memory won't be part of the raw firmware-provided memory map
>  * e.g., after a reboot. Also, the created memory resource is flagged
>  * with IORESOURCE_MEM_DRIVER_MANAGED, so in-kernel users can special-
>  * case this memory (e.g., not place kexec images onto it).
>  */
> int add_memory_driver_managed(int nid, u64 start, u64 size,
>   const char *resource_name);
>
>
> If we'd ever have to special case it even more in the kernel, we could
> allow to specify further resource flags. While passing the driver name
> instead of the resource_name would be an option, this way we don't have
> to hand craft new resource strings for added memory resources.
>
> Thoughts?

Looks useful to me and simplifies walking /proc/iomem. I personally
like the safety of the string just being the $driver component of the
name, but I won't lose sleep if the interface stays freeform like you
propose.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-01 Thread Dan Williams
On Fri, May 1, 2020 at 2:11 PM David Hildenbrand  wrote:
>
> On 01.05.20 22:12, Dan Williams wrote:
[..]
> >>> Consider the case of EFI Special Purpose (SP) Memory that is
> >>> marked EFI Conventional Memory with the SP attribute. In that case the
> >>> firmware memory map marked it as conventional RAM, but the kernel
> >>> optionally marks it as System RAM vs Soft Reserved. The 2008 patch
> >>> simply does not consider that case. I'm not sure strict textualism
> >>> works for coding decisions.
> >>
> >> I am no expert on that matter (esp EFI). But looking at the users of
> >> firmware_map_add_early(), the single user is in arch/x86/kernel/e820.c
> >> . So the single source of /sys/firmware/memmap is (besides hotplug) e820.
> >>
> >> "'e820_table_firmware': the original firmware version passed to us by
> >> the bootloader - not modified by the kernel. ... inform the user about
> >> the firmware's notion of memory layout via /sys/firmware/memmap"
> >> (arch/x86/kernel/e820.c)
> >>
> >> How is the EFI Special Purpose (SP) Memory represented in e820?
> >> /sys/firmware/memmap is really simple: just dump in e820. No policies IIUC.
> >
> > e820 now has a Soft Reserved translation for this which means "try to
> > reserve, but treat as System RAM is ok too". It seems generically
> > useful to me that the toggle for determining whether Soft Reserved or
> > System RAM shows up /sys/firmware/memmap is a determination that
> > policy can make. The kernel need not preemptively block it.
>
> So, I think I have to clarify something here. We do have two ways to kexec
>
> 1. kexec_load(): User space (kexec-tools) crafts the memmap (e.g., using
> /sys/firmware/memmap on x86-64) and selects memory where to place the
> kexec images (e.g., using /proc/iomem)
>
> 2. kexec_file_load(): The kernel reuses the (basically) raw firmware
> memmap and selects memory where to place kexec images.
>
> We are talking about changing 1, to behave like 2 in regards to
> dax/kmem. 2. does currently not add any hotplugged memory to the
> fixed-up e820, and it should be fixed regarding hotplugged DIMMs that
> would appear in e820 after a reboot.
>
> Now, all these policy discussions are nice and fun, but I don't really
> see a good reason to (ab)use /sys/firmware/memmap for that (e.g., parent
> properties). If you want to be able to make this configurable, then
> e.g., add a way to configure this in the kernel (for example along with
> kmem) to make 1. and 2. behave the same way. Otherwise, you really only
> can change 1.

That's clearer.

>
>
> Now, let's clarify what I want regarding virtio-mem:
>
> 1. kexec should not add virtio-mem memory to the initial firmware
>memmap. The driver has to be in charge as discussed.
> 2. kexec should not place kexec images onto virtio-mem memory. That
>would end badly.
> 3. kexec should still dump virtio-mem memory via kdump.

Ok, but then seems to say to me that dax/kmem is a different type of
(driver managed) than virtio-mem and it's confusing to try to apply
the same meaning. Why not just call your type for the distinct type it
is "System RAM (virtio-mem)" and let any other driver managed memory
follow the same "System RAM ($driver)" format if it wants?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-01 Thread Dan Williams
On Fri, May 1, 2020 at 12:18 PM David Hildenbrand  wrote:
>
> On 01.05.20 20:43, Dan Williams wrote:
> > On Fri, May 1, 2020 at 11:14 AM David Hildenbrand  wrote:
> >>
> >> On 01.05.20 20:03, Dan Williams wrote:
> >>> On Fri, May 1, 2020 at 10:51 AM David Hildenbrand  
> >>> wrote:
> >>>>
> >>>> On 01.05.20 19:45, David Hildenbrand wrote:
> >>>>> On 01.05.20 19:39, Dan Williams wrote:
> >>>>>> On Fri, May 1, 2020 at 10:21 AM David Hildenbrand  
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> On 01.05.20 18:56, Dan Williams wrote:
> >>>>>>>> On Fri, May 1, 2020 at 2:34 AM David Hildenbrand  
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> On 01.05.20 00:24, Andrew Morton wrote:
> >>>>>>>>>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand 
> >>>>>>>>>>  wrote:
> >>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Why does the firmware map support hotplug entries?
> >>>>>>>>>>>
> >>>>>>>>>>> I assume:
> >>>>>>>>>>>
> >>>>>>>>>>> The firmware memmap was added primarily for x86-64 kexec (and 
> >>>>>>>>>>> still, is
> >>>>>>>>>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. 
> >>>>>>>>>>> When DIMMs
> >>>>>>>>>>> get hotplugged on real HW, they get added to e820. Same applies to
> >>>>>>>>>>> memory added via HyperV balloon (unless memory is unplugged via
> >>>>>>>>>>> ballooning and you reboot ... the the e820 is changed as well). I 
> >>>>>>>>>>> assume
> >>>>>>>>>>> we wanted to be able to reflect that, to make kexec look like a 
> >>>>>>>>>>> real reboot.
> >>>>>>>>>>>
> >>>>>>>>>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> But I assume only Andrew can enlighten us.
> >>>>>>>>>>>
> >>>>>>>>>>> @Andrew, any guidance here? Should we really add all memory to the
> >>>>>>>>>>> firmware memmap, even if this contradicts with the existing
> >>>>>>>>>>> documentation? (especially, if the actual firmware memmap will 
> >>>>>>>>>>> *not*
> >>>>>>>>>>> contain that memory after a reboot)
> >>>>>>>>>>
> >>>>>>>>>> For some reason that patch is misattributed - it was authored by
> >>>>>>>>>> Shaohui Zheng , who hasn't been heard 
> >>>>>>>>>> from in
> >>>>>>>>>> a decade.  I looked through the email discussion from that time 
> >>>>>>>>>> and I'm
> >>>>>>>>>> not seeing anything useful.  But I wasn't able to locate Dave 
> >>>>>>>>>> Hansen's
> >>>>>>>>>> review comments.
> >>>>>>>>>
> >>>>>>>>> Okay, thanks for checking. I think the documentation from 2008 is 
> >>>>>>>>> pretty
> >>>>>>>>> clear what has to be done here. I will add some of these details to 
> >>>>>>>>> the
> >>>>>>>>> patch description.
> >>>>>>>>>
> >>>>>>>>> Also, now that I know that esp. kexec-tools already don't consider
> >>>>>>>>> dax/kmem memory properly (memory will not get dumped via kdump) and
> >>>>>>>>> won't really suffer from a name change in /proc/iomem, I will go 
> >>>>>>>>> back to
> >>>>>>>>> the MHP_DRIVER_MANAGED approach and
> >>>>>>>>> 1. Don't create firmware memma

Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-01 Thread Dan Williams
On Fri, May 1, 2020 at 11:14 AM David Hildenbrand  wrote:
>
> On 01.05.20 20:03, Dan Williams wrote:
> > On Fri, May 1, 2020 at 10:51 AM David Hildenbrand  wrote:
> >>
> >> On 01.05.20 19:45, David Hildenbrand wrote:
> >>> On 01.05.20 19:39, Dan Williams wrote:
> >>>> On Fri, May 1, 2020 at 10:21 AM David Hildenbrand  
> >>>> wrote:
> >>>>>
> >>>>> On 01.05.20 18:56, Dan Williams wrote:
> >>>>>> On Fri, May 1, 2020 at 2:34 AM David Hildenbrand  
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> On 01.05.20 00:24, Andrew Morton wrote:
> >>>>>>>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand 
> >>>>>>>>  wrote:
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Why does the firmware map support hotplug entries?
> >>>>>>>>>
> >>>>>>>>> I assume:
> >>>>>>>>>
> >>>>>>>>> The firmware memmap was added primarily for x86-64 kexec (and 
> >>>>>>>>> still, is
> >>>>>>>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When 
> >>>>>>>>> DIMMs
> >>>>>>>>> get hotplugged on real HW, they get added to e820. Same applies to
> >>>>>>>>> memory added via HyperV balloon (unless memory is unplugged via
> >>>>>>>>> ballooning and you reboot ... the the e820 is changed as well). I 
> >>>>>>>>> assume
> >>>>>>>>> we wanted to be able to reflect that, to make kexec look like a 
> >>>>>>>>> real reboot.
> >>>>>>>>>
> >>>>>>>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> But I assume only Andrew can enlighten us.
> >>>>>>>>>
> >>>>>>>>> @Andrew, any guidance here? Should we really add all memory to the
> >>>>>>>>> firmware memmap, even if this contradicts with the existing
> >>>>>>>>> documentation? (especially, if the actual firmware memmap will *not*
> >>>>>>>>> contain that memory after a reboot)
> >>>>>>>>
> >>>>>>>> For some reason that patch is misattributed - it was authored by
> >>>>>>>> Shaohui Zheng , who hasn't been heard from 
> >>>>>>>> in
> >>>>>>>> a decade.  I looked through the email discussion from that time and 
> >>>>>>>> I'm
> >>>>>>>> not seeing anything useful.  But I wasn't able to locate Dave 
> >>>>>>>> Hansen's
> >>>>>>>> review comments.
> >>>>>>>
> >>>>>>> Okay, thanks for checking. I think the documentation from 2008 is 
> >>>>>>> pretty
> >>>>>>> clear what has to be done here. I will add some of these details to 
> >>>>>>> the
> >>>>>>> patch description.
> >>>>>>>
> >>>>>>> Also, now that I know that esp. kexec-tools already don't consider
> >>>>>>> dax/kmem memory properly (memory will not get dumped via kdump) and
> >>>>>>> won't really suffer from a name change in /proc/iomem, I will go back 
> >>>>>>> to
> >>>>>>> the MHP_DRIVER_MANAGED approach and
> >>>>>>> 1. Don't create firmware memmap entries
> >>>>>>> 2. Name the resource "System RAM (driver managed)"
> >>>>>>> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
> >>>>>>>
> >>>>>>> This way, kernel users and user space can figure out that this memory
> >>>>>>> has different semantics and handle it accordingly - I think that was
> >>>>>>> what Eric was asking for.
> >>>>>>>
> >>>>>>> Of course, open for suggestions.
> >>>>>>
> >>>>>> I'm still more of a fan of this bei

Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-01 Thread Dan Williams
On Fri, May 1, 2020 at 10:51 AM David Hildenbrand  wrote:
>
> On 01.05.20 19:45, David Hildenbrand wrote:
> > On 01.05.20 19:39, Dan Williams wrote:
> >> On Fri, May 1, 2020 at 10:21 AM David Hildenbrand  wrote:
> >>>
> >>> On 01.05.20 18:56, Dan Williams wrote:
> >>>> On Fri, May 1, 2020 at 2:34 AM David Hildenbrand  
> >>>> wrote:
> >>>>>
> >>>>> On 01.05.20 00:24, Andrew Morton wrote:
> >>>>>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand 
> >>>>>>  wrote:
> >>>>>>
> >>>>>>>>
> >>>>>>>> Why does the firmware map support hotplug entries?
> >>>>>>>
> >>>>>>> I assume:
> >>>>>>>
> >>>>>>> The firmware memmap was added primarily for x86-64 kexec (and still, 
> >>>>>>> is
> >>>>>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When 
> >>>>>>> DIMMs
> >>>>>>> get hotplugged on real HW, they get added to e820. Same applies to
> >>>>>>> memory added via HyperV balloon (unless memory is unplugged via
> >>>>>>> ballooning and you reboot ... the the e820 is changed as well). I 
> >>>>>>> assume
> >>>>>>> we wanted to be able to reflect that, to make kexec look like a real 
> >>>>>>> reboot.
> >>>>>>>
> >>>>>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>>>>>>
> >>>>>>>
> >>>>>>> But I assume only Andrew can enlighten us.
> >>>>>>>
> >>>>>>> @Andrew, any guidance here? Should we really add all memory to the
> >>>>>>> firmware memmap, even if this contradicts with the existing
> >>>>>>> documentation? (especially, if the actual firmware memmap will *not*
> >>>>>>> contain that memory after a reboot)
> >>>>>>
> >>>>>> For some reason that patch is misattributed - it was authored by
> >>>>>> Shaohui Zheng , who hasn't been heard from in
> >>>>>> a decade.  I looked through the email discussion from that time and I'm
> >>>>>> not seeing anything useful.  But I wasn't able to locate Dave Hansen's
> >>>>>> review comments.
> >>>>>
> >>>>> Okay, thanks for checking. I think the documentation from 2008 is pretty
> >>>>> clear what has to be done here. I will add some of these details to the
> >>>>> patch description.
> >>>>>
> >>>>> Also, now that I know that esp. kexec-tools already don't consider
> >>>>> dax/kmem memory properly (memory will not get dumped via kdump) and
> >>>>> won't really suffer from a name change in /proc/iomem, I will go back to
> >>>>> the MHP_DRIVER_MANAGED approach and
> >>>>> 1. Don't create firmware memmap entries
> >>>>> 2. Name the resource "System RAM (driver managed)"
> >>>>> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
> >>>>>
> >>>>> This way, kernel users and user space can figure out that this memory
> >>>>> has different semantics and handle it accordingly - I think that was
> >>>>> what Eric was asking for.
> >>>>>
> >>>>> Of course, open for suggestions.
> >>>>
> >>>> I'm still more of a fan of this being communicated by "System RAM"
> >>>
> >>> I was mentioning somewhere in this thread that "System RAM" inside a
> >>> hierarchy (like dax/kmem) will already be basically ignored by
> >>> kexec-tools. So, placing it inside a hierarchy already makes it look
> >>> special already.
> >>>
> >>> But after all, as we have to change kexec-tools either way, we can
> >>> directly go ahead and flag it properly as special (in case there will
> >>> ever be other cases where we could no longer distinguish it).
> >>>
> >>>> being parented especially because that tells you something about how
> >>>> the memory is driver-managed and which mechanism might be in play.
> >>>
> >>> The 

Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-05-01 Thread Dan Williams
On Fri, May 1, 2020 at 10:21 AM David Hildenbrand  wrote:
>
> On 01.05.20 18:56, Dan Williams wrote:
> > On Fri, May 1, 2020 at 2:34 AM David Hildenbrand  wrote:
> >>
> >> On 01.05.20 00:24, Andrew Morton wrote:
> >>> On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand  
> >>> wrote:
> >>>
> >>>>>
> >>>>> Why does the firmware map support hotplug entries?
> >>>>
> >>>> I assume:
> >>>>
> >>>> The firmware memmap was added primarily for x86-64 kexec (and still, is
> >>>> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
> >>>> get hotplugged on real HW, they get added to e820. Same applies to
> >>>> memory added via HyperV balloon (unless memory is unplugged via
> >>>> ballooning and you reboot ... the the e820 is changed as well). I assume
> >>>> we wanted to be able to reflect that, to make kexec look like a real 
> >>>> reboot.
> >>>>
> >>>> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> >>>>
> >>>>
> >>>> But I assume only Andrew can enlighten us.
> >>>>
> >>>> @Andrew, any guidance here? Should we really add all memory to the
> >>>> firmware memmap, even if this contradicts with the existing
> >>>> documentation? (especially, if the actual firmware memmap will *not*
> >>>> contain that memory after a reboot)
> >>>
> >>> For some reason that patch is misattributed - it was authored by
> >>> Shaohui Zheng , who hasn't been heard from in
> >>> a decade.  I looked through the email discussion from that time and I'm
> >>> not seeing anything useful.  But I wasn't able to locate Dave Hansen's
> >>> review comments.
> >>
> >> Okay, thanks for checking. I think the documentation from 2008 is pretty
> >> clear what has to be done here. I will add some of these details to the
> >> patch description.
> >>
> >> Also, now that I know that esp. kexec-tools already don't consider
> >> dax/kmem memory properly (memory will not get dumped via kdump) and
> >> won't really suffer from a name change in /proc/iomem, I will go back to
> >> the MHP_DRIVER_MANAGED approach and
> >> 1. Don't create firmware memmap entries
> >> 2. Name the resource "System RAM (driver managed)"
> >> 3. Flag the resource via something like IORESOURCE_MEM_DRIVER_MANAGED.
> >>
> >> This way, kernel users and user space can figure out that this memory
> >> has different semantics and handle it accordingly - I think that was
> >> what Eric was asking for.
> >>
> >> Of course, open for suggestions.
> >
> > I'm still more of a fan of this being communicated by "System RAM"
>
> I was mentioning somewhere in this thread that "System RAM" inside a
> hierarchy (like dax/kmem) will already be basically ignored by
> kexec-tools. So, placing it inside a hierarchy already makes it look
> special already.
>
> But after all, as we have to change kexec-tools either way, we can
> directly go ahead and flag it properly as special (in case there will
> ever be other cases where we could no longer distinguish it).
>
> > being parented especially because that tells you something about how
> > the memory is driver-managed and which mechanism might be in play.
>
> The could be communicated to some degree via the resource hierarchy.
>
> E.g.,
>
> [root@localhost ~]# cat /proc/iomem
> ...
> 14000-33fff : Persistent Memory
>   14000-1481f : namespace0.0
>   15000-33fff : dax0.0
> 15000-33fff : System RAM (driver managed)
>
> vs.
>
>:/# cat /proc/iomem
> [...]
> 14000-333ff : virtio-mem (virtio0)
>   14000-147ff : System RAM (driver managed)
>   14800-14fff : System RAM (driver managed)
>   15000-157ff : System RAM (driver managed)
>
> Good enough for my taste.
>
> > What about adding an optional /sys/firmware/memmap/X/parent attribute.
>
> I really don't want any firmware memmap entries for something that is
> not part of the firmware provided memmap. In addition,
> /sys/firmware/memmap/ is still a fairly x86_64 specific thing. Only mips
> and two arm configs enable it at all.
>
> So, IMHO, /sys/firmware/memmap/ is definitely not the way to go.

I think that's a policy decision and policy decisions do not belong in
the kernel. Give the tooling the opportunity to decide whether System
RAM stays that way over a kexec. The parenthetical reference otherwise
looks out of place to me in the /proc/iomem output. What makes it
"driver managed" is how the kernel handles it, not how the kernel
names it.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


  1   2   >