Re: [dm-devel] [PATCH v3 01/11] pagemap: Introduce ->memory_failure()

2021-03-23 Thread Dan Williams
On Thu, Mar 18, 2021 at 7:18 PM ruansy.f...@fujitsu.com wrote:
>
>
>
> > -Original Message-
> > From: ruansy.f...@fujitsu.com 
> > Subject: RE: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()
> > > > > > >
> > > > > > > After the conversation with Dave I don't see the point of this.
> > > > > > > If there is a memory_failure() on a page, why not just call
> > > > > > > memory_failure()? That already knows how to find the inode and
> > > > > > > the filesystem can be notified from there.
> > > > > >
> > > > > > We want memory_failure() to support reflinked files.  In this
> > > > > > case, we are not able to track multiple files from a page (this
> > > > > > broken page) because page->mapping and page->index can only track
> > > > > > one file.  Thus, I introduce this ->memory_failure(), implemented
> > > > > > in the pmem driver, to call ->corrupted_range() level by level
> > > > > > upward, and finally find out the files that are using (mmapping)
> > > > > > this page.
> > > > > >
> > > > >
> > > > > I know the motivation, but this implementation seems backwards.
> > > > > It's already the case that memory_failure() looks up the
> > > > > address_space associated with a mapping. From there I would expect
> > > > > a new 'struct address_space_operations' op to let the fs handle
> > > > > the case when there are multiple address_spaces associated with a
> > > > > given file.
> > > > >
> > > >
> > > > Let me think about it.  In this way, we
> > > > 1. associate file mapping with dax page in dax page fault;
> > >
> > > I think this needs to be a new type of association that proxies the
> > > representation of the reflink across all involved address_spaces.
> > >
> > > > 2. iterate the reflinked files to deliver the `kill processes` signal
> > > >   via the new address_space_operation;
> > > > 3. re-associate to another reflinked file mapping when unmapping
> > > > (rmap query in filesystem to get the other file).
> > >
> > > Perhaps the proxy object is reference counted per-ref-link. It seems
> > > error prone to keep changing the association of the pfn while the
> > > reflink is intact.
> > Hi, Dan
> >
> > I think my early RFC patchset was implemented in this way:
> >  - Create a per-page 'dax-rmap tree' to store each reflinked file's
> > (mapping, offset) at dax page fault time.
> >  - Mount this tree on page->zone_device_data, which is not used in fsdax,
> > so that we can iterate reflinked file mappings in memory_failure() easily.
> > In my understanding, the dax-rmap tree is the proxy object you mentioned.
> > If so, I have to say this method was rejected, because it would cause huge
> > overhead in some cases: every dax page would have its own dax-rmap tree.
> >
>
> Hi, Dan
>
> What do you think about this?  I am still confused.  Could you give me
> some advice?

So I think the primary driver of this functionality is dax-devices and
the architectural model for memory failure where several architectures
and error handlers know how to route pfn failure to the
memory_failure() frontend.

Compare that to block-devices where sector failure has no similar
framework, and despite some initial interest about reusing 'struct
badblocks' for this type of scenario there has been no real uptake to
expand 'struct badblocks' outside of the pmem driver.

I think the work you have done for ->corrupted_range() just needs to
be repurposed away from a block-device operation to dax-device
infrastructure. Christoph's pushback on extending
block_device_operations makes sense to me because there is likely no
other user of this facility than the pmem driver, and the pmem driver
only needs it for the vestigial reason that filesystems mount on
block-devices and not dax-devices.

Recently Dave drove home the point that a filesystem can't do anything
with pfns, it needs LBAs. A dax-device does not have LBA's, but it
does operate on the concept of device-relative offsets. The filesystem
is allowed to assume that dax-device:PFN[device_byte_offset >>
PAGE_SHIFT] aliases the same data as the associated
block-device:LBA[device_byte_offset >> SECTOR_SHIFT]. He also
reiterated that this interface should be range based, which you
already had, but I did not include in my attempt to communicate the
mass failure of an entire surprise-removed device.
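
Spelled out, that aliasing assumption is just shift arithmetic; a tiny
illustrative sketch (not kernel code; 4KiB pages and 512-byte sectors
assumed):

#define PAGE_SHIFT	12	/* assuming 4KiB pages */
#define SECTOR_SHIFT	9	/* 512-byte sectors */

/* The same device-relative byte offset names the same data in both views. */
static unsigned long long dax_pfn_index(unsigned long long device_byte_offset)
{
	return device_byte_offset >> PAGE_SHIFT;	/* dax-device PFN index */
}

static unsigned long long blkdev_lba(unsigned long long device_byte_offset)
{
	return device_byte_offset >> SECTOR_SHIFT;	/* block-device LBA */
}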

So I think the path forward is:

- teach memory_failure() to allow for ranged failures

- let interested drivers register for memory failure events via a
blocking_notifier_head

- teach memory_failure() to optionally let the notifier chain claim
the event vs its current default of walking page->mapping

- teach the pmem driver to register for memory_failure() events and
filter the ones that apply to pfns that the driver owns

- drop the nfit driver's usage of the mce notifier chain since
memory_failure() is a superset of what the mce notifier communicates

- augment the pmem driver's view of badblocks that it 
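
For the notifier piece, the shape would be the usual blocking-notifier
pattern; a rough sketch (the chain name, the event payload, and the pmem
ownership helper are all hypothetical):

#include <linux/types.h>
#include <linux/notifier.h>

/* Hypothetical chain that memory_failure() would publish events on */
extern struct blocking_notifier_head memory_failure_chain;

struct memory_failure_event {		/* hypothetical payload */
	unsigned long pfn;
	unsigned long nr_pages;		/* ranged failures */
};

static bool pmem_owns_pfn(unsigned long pfn)
{
	return false;	/* stub; a real driver would check its own pfn ranges */
}

static int pmem_mf_notify(struct notifier_block *nb,
			  unsigned long action, void *data)
{
	struct memory_failure_event *ev = data;

	if (!pmem_owns_pfn(ev->pfn))
		return NOTIFY_DONE;	/* not ours, let others look */

	/*
	 * Translate the pfn range to a device-relative offset and kick
	 * off ->corrupted_range() style handling here.
	 */
	return NOTIFY_STOP;		/* claim the event */
}

static struct notifier_block pmem_mf_nb = {
	.notifier_call = pmem_mf_notify,
};

/* at probe time:
 *	blocking_notifier_chain_register(&memory_failure_chain, &pmem_mf_nb);
 */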

Re: [dm-devel] [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll

2021-03-23 Thread Ming Lei
On Tue, Mar 23, 2021 at 09:54:36AM -0700, Sagi Grimberg wrote:
> 
> > > > +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> > > > +{
> > > > +   bio->bi_iter.bi_private_data = cookie;
> > > > +}
> > > > +
> > > 
> > > Hey Ming, thinking about nvme-mpath, I'm thinking that this should be
> > > an exported function for failover. nvme-mpath updates bio.bi_dev
> > > when re-submitting I/Os to an alternate path, so I'm thinking
> > > that if this function is exported then nvme-mpath could do as little
> > > as the below to allow polling?
> > > 
> > > --
> > > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> > > index 92adebfaf86f..e562e296153b 100644
> > > --- a/drivers/nvme/host/multipath.c
> > > +++ b/drivers/nvme/host/multipath.c
> > > @@ -345,6 +345,7 @@ static void nvme_requeue_work(struct work_struct 
> > > *work)
> > >  struct nvme_ns_head *head =
> > >  container_of(work, struct nvme_ns_head, requeue_work);
> > >  struct bio *bio, *next;
> > > +   blk_qc_t cookie;
> > > 
> > >  spin_lock_irq(&head->requeue_lock);
> > >  next = bio_list_get(&head->requeue_list);
> > > @@ -359,7 +360,8 @@ static void nvme_requeue_work(struct work_struct 
> > > *work)
> > >   * path.
> > >   */
> > >  bio_set_dev(bio, head->disk->part0);
> > > -   submit_bio_noacct(bio);
> > > +   cookie = submit_bio_noacct(bio);
> > > +   blk_bio_poll_post_submit(bio, cookie);
> > >  }
> > >   }
> > > --
> > > 
> > > I/O failover will create misalignment from the polling context cpu and
> > > the submission cpu (running requeue_work), but I don't see if there is
> > > something that would break here...
> > 
> > I understand requeue shouldn't be a usual event, and I guess it is just
> > fine to fall back to IRQ-based mode?
> 
> Well, when it will failover, it will probably be directed to the poll
> queues. Maybe I'm missing something...

In this patchset, because the bio isn't submitted directly from the FS,
there isn't a polling context associated with it, so its HIPRI flag
will be cleared and it falls back to IRQ mode.

> 
> > This patchset actually doesn't cover such bio submission from kernel 
> > context.
> 
> What is the difference?

So far the upper layer (io_uring, dio, ...) needs to get the returned cookie,
then pass it to blk_poll().

For this case, the cookie can't be passed back to the FS caller of
submit_bio(FS bio), so it can't be polled by the in-tree code.
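
For reference, the in-tree pattern looks roughly like this (a sketch
against the current interface, where submit_bio() returns a blk_qc_t and
blk_poll(q, cookie, spin) consumes it; 'bio_is_done' stands in for
whatever completion state the caller tracks):

/* bio marked REQ_HIPRI before submission */
blk_qc_t cookie = submit_bio(bio);

/* spin on the underlying queue until the caller sees the bio complete */
while (!bio_is_done)
	blk_poll(bdev_get_queue(bio->bi_bdev), cookie, true);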



Thanks,
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] dm-integrity - add the "reset_recalculate" flag

2021-03-23 Thread Mikulas Patocka



On Tue, 23 Mar 2021, Milan Broz wrote:

> > Do you need to bump the number of feature args supported (from 17 to
> > 18)?

Good point. I've sent version 2 of the patch.

> And also update target minor version.
> 
> I was just under the impression that we decided not to support such a 
> flag (because we cannot change tag size, so it is not usable in some 
> situations). But if it is so simple, why not.

I found out that it is not as hard as it seemed, so we can implement it.

> For the reference, it was discovered in this report 
> https://gitlab.com/cryptsetup/cryptsetup/-/issues/631
> 
> Milan

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH v2] dm-integrity - add the "reset_recalculate" flag

2021-03-23 Thread Mikulas Patocka
This patch adds a new flag "reset_recalculate" that will restart
recalculating from the beginning of the device. It can be used if we want
to change the hash function. Example:

#!/bin/sh
dmsetup remove_all
rmmod brd
set -e
modprobe brd rd_size=1048576
dmsetup create in --table '0 200 integrity /dev/ram0 0 16 J 2 
internal_hash:sha256 recalculate'
sleep 10
dmsetup status
dmsetup remove in
dmsetup create in --table '0 200 integrity /dev/ram0 0 16 J 2 
internal_hash:sha3-256 reset_recalculate'

Signed-off-by: Mikulas Patocka 

Index: linux-2.6/drivers/md/dm-integrity.c
===
--- linux-2.6.orig/drivers/md/dm-integrity.c
+++ linux-2.6/drivers/md/dm-integrity.c
@@ -262,6 +262,7 @@ struct dm_integrity_c {
bool journal_uptodate;
bool just_formatted;
bool recalculate_flag;
+   bool reset_recalculate_flag;
bool discard;
bool fix_padding;
bool fix_hmac;
@@ -2526,7 +2527,8 @@ static void do_journal_write(struct dm_i
 #ifndef INTERNAL_VERIFY
unlikely(from_replay) &&
 #endif
-   ic->internal_hash) {
+   ic->internal_hash &&
+   !ic->reset_recalculate_flag) {
char test_tag[max_t(size_t, 
HASH_MAX_DIGESTSIZE, MAX_TAG_SIZE)];
 
integrity_sector_checksum(ic, sec + ((l 
- j) << ic->sb->log2_sectors_per_block),
@@ -3134,7 +3136,8 @@ static void dm_integrity_resume(struct d
rw_journal_sectors(ic, REQ_OP_READ, 0, 0,
   ic->n_bitmap_blocks * (BITMAP_BLOCK_SIZE >> 
SECTOR_SHIFT), NULL);
if (ic->mode == 'B') {
-   if (ic->sb->log2_blocks_per_bitmap_bit == 
ic->log2_blocks_per_bitmap_bit) {
+   if (ic->sb->log2_blocks_per_bitmap_bit == 
ic->log2_blocks_per_bitmap_bit &&
+   !ic->reset_recalculate_flag) {
block_bitmap_copy(ic, ic->recalc_bitmap, 
ic->journal);
block_bitmap_copy(ic, ic->may_write_bitmap, 
ic->journal);
if (!block_bitmap_op(ic, ic->journal, 0, 
ic->provided_data_sectors,
@@ -3156,7 +3159,8 @@ static void dm_integrity_resume(struct d
}
} else {
if (!(ic->sb->log2_blocks_per_bitmap_bit == 
ic->log2_blocks_per_bitmap_bit &&
- block_bitmap_op(ic, ic->journal, 0, 
ic->provided_data_sectors, BITMAP_OP_TEST_ALL_CLEAR))) {
+ block_bitmap_op(ic, ic->journal, 0, 
ic->provided_data_sectors, BITMAP_OP_TEST_ALL_CLEAR)) ||
+   ic->reset_recalculate_flag) {
ic->sb->flags |= 
cpu_to_le32(SB_FLAG_RECALCULATING);
ic->sb->recalc_sector = cpu_to_le64(0);
}
@@ -3169,6 +3173,10 @@ static void dm_integrity_resume(struct d
dm_integrity_io_error(ic, "writing superblock", r);
} else {
replay_journal(ic);
+   if (ic->reset_recalculate_flag) {
+   ic->sb->flags |= cpu_to_le32(SB_FLAG_RECALCULATING);
+   ic->sb->recalc_sector = cpu_to_le64(0);
+   }
if (ic->mode == 'B') {
ic->sb->flags |= cpu_to_le32(SB_FLAG_DIRTY_BITMAP);
ic->sb->log2_blocks_per_bitmap_bit = 
ic->log2_blocks_per_bitmap_bit;
@@ -3242,6 +3250,7 @@ static void dm_integrity_status(struct d
arg_count += !!ic->meta_dev;
arg_count += ic->sectors_per_block != 1;
arg_count += !!(ic->sb->flags & 
cpu_to_le32(SB_FLAG_RECALCULATING));
+   arg_count += ic->reset_recalculate_flag;
arg_count += ic->discard;
arg_count += ic->mode == 'J';
arg_count += ic->mode == 'J';
@@ -3261,6 +3270,8 @@ static void dm_integrity_status(struct d
DMEMIT(" block_size:%u", ic->sectors_per_block << 
SECTOR_SHIFT);
if (ic->sb->flags & cpu_to_le32(SB_FLAG_RECALCULATING))
DMEMIT(" recalculate");
+   if (ic->reset_recalculate_flag)
+   DMEMIT(" reset_recalculate");
if (ic->discard)
DMEMIT(" allow_discards");
DMEMIT(" journal_sectors:%u", ic->initial_sectors - SB_SECTORS);
@@ -3914,7 +3925,7 @@ static int dm_integrity_ctr(struct dm_ta
unsigned extra_args;
struct dm_arg_set as;
static const struct dm_arg _args[] = {
-   {0, 17, "Invalid number of feature args"},
+   {0, 18, "Invalid number of feature args"},
};
unsigned journal_sectors, 

Re: [dm-devel] md/dm-mpath: check whether all pgpaths have same uuid in multipath_ctr()

2021-03-23 Thread Ewan D. Milne
On Tue, 2021-03-23 at 15:47 +0800, lixiaokeng wrote:
> 
> On 2021/3/22 22:22, Mike Snitzer wrote:
> > On Mon, Mar 22 2021 at  4:11am -0400,
> > Christoph Hellwig  wrote:
> > 
> > > On Sat, Mar 20, 2021 at 03:19:23PM +0800, Zhiqiang Liu wrote:
> > > > From: Zhiqiang Liu 
> > > > 
> > > > When we run an IO stress test on a multipath device, there will
> > > > be a metadata error because of a wrong path. In the test, we
> > > > concurrently execute the 'iscsi device login|logout' and
> > > > 'multipath -r' commands with IO stress on the multipath device.
> > > > In some cases, systemd-udevd may not have time to process the
> > > > uevents of the iscsi device logout|login, and then the 'multipath -r'
> > > > command triggers the multipathd daemon to call ioctl to load a table
> > > > with incorrect old device info from systemd-udevd.
> > > > Then, one iscsi path may be incorrectly attached to another
> > > > multipath device which has a different uuid. Finally, the metadata
> > > > error occurs when unmounting the filesystem writes metadata down to
> > > > the iscsi device which is actually not owned by the multipath
> > > > device.
> > > > 
> > > > So we need to check whether all pgpaths of one multipath have
> > > > the same uuid; if not, we should return an error.
> > > > 
> > > > Signed-off-by: Zhiqiang Liu 
> > > > Signed-off-by: lixiaokeng 
> > > > Signed-off-by: linfeilong 
> > > > Signed-off-by: Wubo 
> > > > ---
> > > >  drivers/md/dm-mpath.c   | 52
> > > > +
> > > >  drivers/scsi/scsi_lib.c |  1 +
> > > >  2 files changed, 53 insertions(+)
> > > > 
> > > > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > > > index bced42f082b0..f0b995784b53 100644
> > > > --- a/drivers/md/dm-mpath.c
> > > > +++ b/drivers/md/dm-mpath.c
> > > > @@ -24,6 +24,7 @@
> > > >  #include 
> > > >  #include 
> > > >  #include 
> > > > +#include 
> > > >  #include 
> > > >  #include 
> > > > 
> > > > @@ -1169,6 +1170,45 @@ static int parse_features(struct
> > > > dm_arg_set *as, struct multipath *m)
> > > > return r;
> > > >  }
> > > > 
> > > > +#define SCSI_VPD_LUN_ID_PREFIX_LEN 4
> > > > +#define MPATH_UUID_PREFIX_LEN 7
> > > > +static int check_pg_uuid(struct priority_group *pg, char
> > > > *md_uuid)
> > > > +{
> > > > +   char pgpath_uuid[DM_UUID_LEN] = {0};
> > > > +   struct request_queue *q;
> > > > +   struct pgpath *pgpath;
> > > > +   struct scsi_device *sdev;
> > > > +   ssize_t count;
> > > > +   int r = 0;
> > > > +
> > > > +   list_for_each_entry(pgpath, &pg->pgpaths, list) {
> > > > +   q = bdev_get_queue(pgpath->path.dev->bdev);
> > > > +   sdev = scsi_device_from_queue(q);
> > > 
> > > Common dm-multipath code should never poke into scsi
> > > internals.  This
> > > is something for the device handler to check.  It probably also
> > > won't
> > > work for all older devices.
> > 
> > Definitely.
> > 
> > But that aside, userspace (multipathd) _should_ be able to do extra
> > validation, _before_ pushing down a new table to the kernel, rather
> > than
> > forcing the kernel to do it.
> > 
> 
> Martin (committer of multipath-tools) said that:
> "Don't get me wrong, I don't argue against tough testing. But we
> should
> be aware that there are always time intervals during which
> multipathd's
> picture of the present devices is different from what the kernel
> sees."
> 
> It is difficult to solve this in multipathd.
> 
> Regards,
> Lixiaokeng
> 

I think the patch is no good.  There are plenty of devices that don't
support VPD page 83h:

int scsi_vpd_lun_id(struct scsi_device *sdev, char *id, size_t id_len)
{
u8 cur_id_type = 0xff;
u8 cur_id_size = 0;
unsigned char *d, *cur_id_str;
unsigned char __rcu *vpd_pg83;
int id_size = -EINVAL;

rcu_read_lock();
vpd_pg83 = rcu_dereference(sdev->vpd_pg83);
if (!vpd_pg83) {
rcu_read_unlock();
return -ENXIO;
}

and the DM layer should not be looking at the properties of the
underlying devices in this way anyway.  It should be pushed down
to the table.

-Ewan



--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll

2021-03-23 Thread Sagi Grimberg




+static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
+{
+   bio->bi_iter.bi_private_data = cookie;
+}
+


Hey Ming, thinking about nvme-mpath, I'm thinking that this should be
an exported function for failover. nvme-mpath updates bio.bi_dev
when re-submitting I/Os to an alternate path, so I'm thinking
that if this function is exported then nvme-mpath could do as little
as the below to allow polling?

--
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 92adebfaf86f..e562e296153b 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -345,6 +345,7 @@ static void nvme_requeue_work(struct work_struct *work)
 struct nvme_ns_head *head =
 container_of(work, struct nvme_ns_head, requeue_work);
 struct bio *bio, *next;
+   blk_qc_t cookie;

 spin_lock_irq(&head->requeue_lock);
 next = bio_list_get(&head->requeue_list);
@@ -359,7 +360,8 @@ static void nvme_requeue_work(struct work_struct *work)
  * path.
  */
 bio_set_dev(bio, head->disk->part0);
-   submit_bio_noacct(bio);
+   cookie = submit_bio_noacct(bio);
+   blk_bio_poll_post_submit(bio, cookie);
 }
  }
--

I/O failover will create misalignment from the polling context cpu and
the submission cpu (running requeue_work), but I don't see if there is
something that would break here...


I understand requeue shouldn't be a usual event, and I guess it is just
fine to fall back to IRQ-based mode?


Well, when it will failover, it will probably be directed to the poll
queues. Maybe I'm missing something...


This patchset actually doesn't cover such bio submission from kernel context.


What is the difference?

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] dm-integrity - add the "reset_recalculate" flag

2021-03-23 Thread Milan Broz
On 23/03/2021 16:12, Mike Snitzer wrote:
> On Tue, Mar 23 2021 at 10:59am -0400,
> Mikulas Patocka  wrote:
> 
>> This patch adds a new flag "reset_recalculate" that will restart
>> recalculating from the beginning of the device. It can be used if we want
>> to change the hash function. Example:
>>
>> #!/bin/sh
>> dmsetup remove_all
>> rmmod brd
>> set -e
>> modprobe brd rd_size=1048576
>> dmsetup create in --table '0 200 integrity /dev/ram0 0 16 J 2 
>> internal_hash:sha256 recalculate'
>> sleep 10
>> dmsetup status
>> dmsetup remove in
>> dmsetup create in --table '0 200 integrity /dev/ram0 0 16 J 2 
>> internal_hash:sha3-256 reset_recalculate'
>>
>> Signed-off-by: Mikulas Patocka 
>>
>> Index: linux-2.6/drivers/md/dm-integrity.c
>> ===
>> --- linux-2.6.orig/drivers/md/dm-integrity.c
>> +++ linux-2.6/drivers/md/dm-integrity.c
>> @@ -262,6 +262,7 @@ struct dm_integrity_c {
>>  bool journal_uptodate;
>>  bool just_formatted;
>>  bool recalculate_flag;
>> +bool reset_recalculate_flag;
>>  bool discard;
>>  bool fix_padding;
>>  bool fix_hmac;
>> @@ -3134,7 +3135,8 @@ static void dm_integrity_resume(struct d
>>  rw_journal_sectors(ic, REQ_OP_READ, 0, 0,
>> ic->n_bitmap_blocks * (BITMAP_BLOCK_SIZE >> 
>> SECTOR_SHIFT), NULL);
>>  if (ic->mode == 'B') {
>> -if (ic->sb->log2_blocks_per_bitmap_bit == 
>> ic->log2_blocks_per_bitmap_bit) {
>> +if (ic->sb->log2_blocks_per_bitmap_bit == 
>> ic->log2_blocks_per_bitmap_bit &&
>> +!ic->reset_recalculate_flag) {
>>  block_bitmap_copy(ic, ic->recalc_bitmap, 
>> ic->journal);
>>  block_bitmap_copy(ic, ic->may_write_bitmap, 
>> ic->journal);
>>  if (!block_bitmap_op(ic, ic->journal, 0, 
>> ic->provided_data_sectors,
>> @@ -3156,7 +3158,8 @@ static void dm_integrity_resume(struct d
>>  }
>>  } else {
>>  if (!(ic->sb->log2_blocks_per_bitmap_bit == 
>> ic->log2_blocks_per_bitmap_bit &&
>> -  block_bitmap_op(ic, ic->journal, 0, 
>> ic->provided_data_sectors, BITMAP_OP_TEST_ALL_CLEAR))) {
>> +  block_bitmap_op(ic, ic->journal, 0, 
>> ic->provided_data_sectors, BITMAP_OP_TEST_ALL_CLEAR)) ||
>> +ic->reset_recalculate_flag) {
>>  ic->sb->flags |= 
>> cpu_to_le32(SB_FLAG_RECALCULATING);
>>  ic->sb->recalc_sector = cpu_to_le64(0);
>>  }
>> @@ -3169,6 +3172,10 @@ static void dm_integrity_resume(struct d
>>  dm_integrity_io_error(ic, "writing superblock", r);
>>  } else {
>>  replay_journal(ic);
>> +if (ic->reset_recalculate_flag) {
>> +ic->sb->flags |= cpu_to_le32(SB_FLAG_RECALCULATING);
>> +ic->sb->recalc_sector = cpu_to_le64(0);
>> +}
>>  if (ic->mode == 'B') {
>>  ic->sb->flags |= cpu_to_le32(SB_FLAG_DIRTY_BITMAP);
>>  ic->sb->log2_blocks_per_bitmap_bit = 
>> ic->log2_blocks_per_bitmap_bit;
>> @@ -3242,6 +3249,7 @@ static void dm_integrity_status(struct d
>>  arg_count += !!ic->meta_dev;
>>  arg_count += ic->sectors_per_block != 1;
>>  arg_count += !!(ic->sb->flags & 
>> cpu_to_le32(SB_FLAG_RECALCULATING));
>> +arg_count += ic->reset_recalculate_flag;
>>  arg_count += ic->discard;
>>  arg_count += ic->mode == 'J';
>>  arg_count += ic->mode == 'J';
>> @@ -3261,6 +3269,8 @@ static void dm_integrity_status(struct d
>>  DMEMIT(" block_size:%u", ic->sectors_per_block << 
>> SECTOR_SHIFT);
>>  if (ic->sb->flags & cpu_to_le32(SB_FLAG_RECALCULATING))
>>  DMEMIT(" recalculate");
>> +if (ic->reset_recalculate_flag)
>> +DMEMIT(" reset_recalculate");
>>  if (ic->discard)
>>  DMEMIT(" allow_discards");
>>  DMEMIT(" journal_sectors:%u", ic->initial_sectors - SB_SECTORS);
>> @@ -4058,6 +4068,9 @@ static int dm_integrity_ctr(struct dm_ta
>>  goto bad;
>>  } else if (!strcmp(opt_string, "recalculate")) {
>>  ic->recalculate_flag = true;
>> +} else if (!strcmp(opt_string, "reset_recalculate")) {
>> +ic->recalculate_flag = true;
>> +ic->reset_recalculate_flag = true;
>>  } else if (!strcmp(opt_string, "allow_discards")) {
>>  ic->discard = true;
>>  } else if (!strcmp(opt_string, "fix_padding")) {
> 
> Do you need to bump the number of feature args supported (from 17 to
> 18)?

And also update 

Re: [dm-devel] dm-integrity - add the "reset_recalculate" flag

2021-03-23 Thread Mike Snitzer
On Tue, Mar 23 2021 at 10:59am -0400,
Mikulas Patocka  wrote:

> This patch adds a new flag "reset_recalculate" that will restart
> recalculating from the beginning of the device. It can be used if we want
> to change the hash function. Example:
> 
> #!/bin/sh
> dmsetup remove_all
> rmmod brd
> set -e
> modprobe brd rd_size=1048576
> dmsetup create in --table '0 200 integrity /dev/ram0 0 16 J 2 
> internal_hash:sha256 recalculate'
> sleep 10
> dmsetup status
> dmsetup remove in
> dmsetup create in --table '0 200 integrity /dev/ram0 0 16 J 2 
> internal_hash:sha3-256 reset_recalculate'
> 
> Signed-off-by: Mikulas Patocka 
> 
> Index: linux-2.6/drivers/md/dm-integrity.c
> ===
> --- linux-2.6.orig/drivers/md/dm-integrity.c
> +++ linux-2.6/drivers/md/dm-integrity.c
> @@ -262,6 +262,7 @@ struct dm_integrity_c {
>   bool journal_uptodate;
>   bool just_formatted;
>   bool recalculate_flag;
> + bool reset_recalculate_flag;
>   bool discard;
>   bool fix_padding;
>   bool fix_hmac;
> @@ -3134,7 +3135,8 @@ static void dm_integrity_resume(struct d
>   rw_journal_sectors(ic, REQ_OP_READ, 0, 0,
>  ic->n_bitmap_blocks * (BITMAP_BLOCK_SIZE >> 
> SECTOR_SHIFT), NULL);
>   if (ic->mode == 'B') {
> - if (ic->sb->log2_blocks_per_bitmap_bit == 
> ic->log2_blocks_per_bitmap_bit) {
> + if (ic->sb->log2_blocks_per_bitmap_bit == 
> ic->log2_blocks_per_bitmap_bit &&
> + !ic->reset_recalculate_flag) {
>   block_bitmap_copy(ic, ic->recalc_bitmap, 
> ic->journal);
>   block_bitmap_copy(ic, ic->may_write_bitmap, 
> ic->journal);
>   if (!block_bitmap_op(ic, ic->journal, 0, 
> ic->provided_data_sectors,
> @@ -3156,7 +3158,8 @@ static void dm_integrity_resume(struct d
>   }
>   } else {
>   if (!(ic->sb->log2_blocks_per_bitmap_bit == 
> ic->log2_blocks_per_bitmap_bit &&
> -   block_bitmap_op(ic, ic->journal, 0, 
> ic->provided_data_sectors, BITMAP_OP_TEST_ALL_CLEAR))) {
> +   block_bitmap_op(ic, ic->journal, 0, 
> ic->provided_data_sectors, BITMAP_OP_TEST_ALL_CLEAR)) ||
> + ic->reset_recalculate_flag) {
>   ic->sb->flags |= 
> cpu_to_le32(SB_FLAG_RECALCULATING);
>   ic->sb->recalc_sector = cpu_to_le64(0);
>   }
> @@ -3169,6 +3172,10 @@ static void dm_integrity_resume(struct d
>   dm_integrity_io_error(ic, "writing superblock", r);
>   } else {
>   replay_journal(ic);
> + if (ic->reset_recalculate_flag) {
> + ic->sb->flags |= cpu_to_le32(SB_FLAG_RECALCULATING);
> + ic->sb->recalc_sector = cpu_to_le64(0);
> + }
>   if (ic->mode == 'B') {
>   ic->sb->flags |= cpu_to_le32(SB_FLAG_DIRTY_BITMAP);
>   ic->sb->log2_blocks_per_bitmap_bit = 
> ic->log2_blocks_per_bitmap_bit;
> @@ -3242,6 +3249,7 @@ static void dm_integrity_status(struct d
>   arg_count += !!ic->meta_dev;
>   arg_count += ic->sectors_per_block != 1;
>   arg_count += !!(ic->sb->flags & 
> cpu_to_le32(SB_FLAG_RECALCULATING));
> + arg_count += ic->reset_recalculate_flag;
>   arg_count += ic->discard;
>   arg_count += ic->mode == 'J';
>   arg_count += ic->mode == 'J';
> @@ -3261,6 +3269,8 @@ static void dm_integrity_status(struct d
>   DMEMIT(" block_size:%u", ic->sectors_per_block << 
> SECTOR_SHIFT);
>   if (ic->sb->flags & cpu_to_le32(SB_FLAG_RECALCULATING))
>   DMEMIT(" recalculate");
> + if (ic->reset_recalculate_flag)
> + DMEMIT(" reset_recalculate");
>   if (ic->discard)
>   DMEMIT(" allow_discards");
>   DMEMIT(" journal_sectors:%u", ic->initial_sectors - SB_SECTORS);
> @@ -4058,6 +4068,9 @@ static int dm_integrity_ctr(struct dm_ta
>   goto bad;
>   } else if (!strcmp(opt_string, "recalculate")) {
>   ic->recalculate_flag = true;
> + } else if (!strcmp(opt_string, "reset_recalculate")) {
> + ic->recalculate_flag = true;
> + ic->reset_recalculate_flag = true;
>   } else if (!strcmp(opt_string, "allow_discards")) {
>   ic->discard = true;
>   } else if (!strcmp(opt_string, "fix_padding")) {

Do you need to bump the number of feature args supported (from 17 to
18)?

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH] dm-integrity - add the "reset_recalculate" flag

2021-03-23 Thread Mikulas Patocka
This patch adds a new flag "reset_recalculate" that will restart
recalculating from the beginning of the device. It can be used if we want
to change the hash function. Example:

#!/bin/sh
dmsetup remove_all
rmmod brd
set -e
modprobe brd rd_size=1048576
dmsetup create in --table '0 200 integrity /dev/ram0 0 16 J 2 
internal_hash:sha256 recalculate'
sleep 10
dmsetup status
dmsetup remove in
dmsetup create in --table '0 200 integrity /dev/ram0 0 16 J 2 
internal_hash:sha3-256 reset_recalculate'

Signed-off-by: Mikulas Patocka 

Index: linux-2.6/drivers/md/dm-integrity.c
===
--- linux-2.6.orig/drivers/md/dm-integrity.c
+++ linux-2.6/drivers/md/dm-integrity.c
@@ -262,6 +262,7 @@ struct dm_integrity_c {
bool journal_uptodate;
bool just_formatted;
bool recalculate_flag;
+   bool reset_recalculate_flag;
bool discard;
bool fix_padding;
bool fix_hmac;
@@ -3134,7 +3135,8 @@ static void dm_integrity_resume(struct d
rw_journal_sectors(ic, REQ_OP_READ, 0, 0,
   ic->n_bitmap_blocks * (BITMAP_BLOCK_SIZE >> 
SECTOR_SHIFT), NULL);
if (ic->mode == 'B') {
-   if (ic->sb->log2_blocks_per_bitmap_bit == 
ic->log2_blocks_per_bitmap_bit) {
+   if (ic->sb->log2_blocks_per_bitmap_bit == 
ic->log2_blocks_per_bitmap_bit &&
+   !ic->reset_recalculate_flag) {
block_bitmap_copy(ic, ic->recalc_bitmap, 
ic->journal);
block_bitmap_copy(ic, ic->may_write_bitmap, 
ic->journal);
if (!block_bitmap_op(ic, ic->journal, 0, 
ic->provided_data_sectors,
@@ -3156,7 +3158,8 @@ static void dm_integrity_resume(struct d
}
} else {
if (!(ic->sb->log2_blocks_per_bitmap_bit == 
ic->log2_blocks_per_bitmap_bit &&
- block_bitmap_op(ic, ic->journal, 0, 
ic->provided_data_sectors, BITMAP_OP_TEST_ALL_CLEAR))) {
+ block_bitmap_op(ic, ic->journal, 0, 
ic->provided_data_sectors, BITMAP_OP_TEST_ALL_CLEAR)) ||
+   ic->reset_recalculate_flag) {
ic->sb->flags |= 
cpu_to_le32(SB_FLAG_RECALCULATING);
ic->sb->recalc_sector = cpu_to_le64(0);
}
@@ -3169,6 +3172,10 @@ static void dm_integrity_resume(struct d
dm_integrity_io_error(ic, "writing superblock", r);
} else {
replay_journal(ic);
+   if (ic->reset_recalculate_flag) {
+   ic->sb->flags |= cpu_to_le32(SB_FLAG_RECALCULATING);
+   ic->sb->recalc_sector = cpu_to_le64(0);
+   }
if (ic->mode == 'B') {
ic->sb->flags |= cpu_to_le32(SB_FLAG_DIRTY_BITMAP);
ic->sb->log2_blocks_per_bitmap_bit = 
ic->log2_blocks_per_bitmap_bit;
@@ -3242,6 +3249,7 @@ static void dm_integrity_status(struct d
arg_count += !!ic->meta_dev;
arg_count += ic->sectors_per_block != 1;
arg_count += !!(ic->sb->flags & 
cpu_to_le32(SB_FLAG_RECALCULATING));
+   arg_count += ic->reset_recalculate_flag;
arg_count += ic->discard;
arg_count += ic->mode == 'J';
arg_count += ic->mode == 'J';
@@ -3261,6 +3269,8 @@ static void dm_integrity_status(struct d
DMEMIT(" block_size:%u", ic->sectors_per_block << 
SECTOR_SHIFT);
if (ic->sb->flags & cpu_to_le32(SB_FLAG_RECALCULATING))
DMEMIT(" recalculate");
+   if (ic->reset_recalculate_flag)
+   DMEMIT(" reset_recalculate");
if (ic->discard)
DMEMIT(" allow_discards");
DMEMIT(" journal_sectors:%u", ic->initial_sectors - SB_SECTORS);
@@ -4058,6 +4068,9 @@ static int dm_integrity_ctr(struct dm_ta
goto bad;
} else if (!strcmp(opt_string, "recalculate")) {
ic->recalculate_flag = true;
+   } else if (!strcmp(opt_string, "reset_recalculate")) {
+   ic->recalculate_flag = true;
+   ic->reset_recalculate_flag = true;
} else if (!strcmp(opt_string, "allow_discards")) {
ic->discard = true;
} else if (!strcmp(opt_string, "fix_padding")) {

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH] multipathd fix NULL dereference in check_path

2021-03-23 Thread lixiaokeng
When iscsi login/logout and the multipath command are executed
concurrently, there is a coredump.

The reason is:
check_path
 -> update_multipath_strings
 -> sync_paths
 -> orphan_path              // pp->mpp is set to NULL
 -> update_multipath_status
 -> dm_get_status            // returns DMP_NOT_FOUND
 -> condlog                  // pp->mpp->alias, NULL dereference

Here we don't dereference pp->mpp if it is NULL.

Signed-off-by: Lixiaokeng
---
 multipathd/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/multipathd/main.c b/multipathd/main.c
index 28cb236..5ed2267 100644
--- a/multipathd/main.c
+++ b/multipathd/main.c
@@ -2010,7 +2010,7 @@ check_path (struct vectors * vecs, struct path * pp, int 
ticks)
if (ret == DMP_NOT_FOUND) {
/* multipath device missing. Likely removed */
condlog(1, "%s: multipath device '%s' not found",
-   pp->dev, pp->mpp->alias);
+   pp->dev, pp->mpp ? pp->mpp->alias : "");
return 0;
} else
condlog(1, "%s: Couldn't synchronize with kernel state",

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll

2021-03-23 Thread Ming Lei
On Mon, Mar 22, 2021 at 08:46:04PM -0700, Sagi Grimberg wrote:
> 
> > +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> > +{
> > +   bio->bi_iter.bi_private_data = cookie;
> > +}
> > +
> 
> Hey Ming, thinking about nvme-mpath, I'm thinking that this should be
> an exported function for failover. nvme-mpath updates bio.bi_dev
> when re-submitting I/Os to an alternate path, so I'm thinking
> that if this function is exported then nvme-mpath could do as little
> as the below to allow polling?
> 
> --
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 92adebfaf86f..e562e296153b 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -345,6 +345,7 @@ static void nvme_requeue_work(struct work_struct *work)
> struct nvme_ns_head *head =
> container_of(work, struct nvme_ns_head, requeue_work);
> struct bio *bio, *next;
> +   blk_qc_t cookie;
> 
> spin_lock_irq(&head->requeue_lock);
> next = bio_list_get(&head->requeue_list);
> @@ -359,7 +360,8 @@ static void nvme_requeue_work(struct work_struct *work)
>  * path.
>  */
> bio_set_dev(bio, head->disk->part0);
> -   submit_bio_noacct(bio);
> +   cookie = submit_bio_noacct(bio);
> +   blk_bio_poll_post_submit(bio, cookie);
> }
>  }
> --
> 
> I/O failover will create misalignment from the polling context cpu and
> the submission cpu (running requeue_work), but I don't see if there is
> something that would break here...

I understand requeue shouldn't be a usual event, and I guess it is just
fine to fall back to IRQ-based mode?

This patchset actually doesn't cover such bio submission from kernel context.

-- 
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll

2021-03-23 Thread Ming Lei
On Fri, Mar 19, 2021 at 02:38:11PM -0400, Mike Snitzer wrote:
> On Thu, Mar 18 2021 at 12:48pm -0400,
> Ming Lei  wrote:
> 
> > Currently bio based IO poll needs to poll all hw queue blindly, this way
> > is very inefficient, and the big reason is that we can't pass bio
> > submission result to io poll task.
> 
> This is awkward because bio-based IO polling doesn't exist upstream yet,
> so this header should be covering your approach as a clean slate, e.g.:
> 
> The complexity associated with frequent bio splitting with bio-based
> devices makes it difficult to implement IO polling efficiently because
> the fan-out of underlying hw queues that need to be polled (as a
> side-effect of bios being split) creates a need for more easily mapping
> a group of bios to the hw queues that need to be polled.
> 
> > In IO submission context, track associated underlying bios by per-task
> > submission queue and save 'cookie' poll data in 
> > bio->bi_iter.bi_private_data,
> > and return current->pid to caller of submit_bio() for any bio based
> > driver's IO, which is submitted from FS.
> > 
> > In IO poll context, the passed cookie tells us the PID of submission
> > context, and we can find the bio from that submission context. Moving
> 
> Maybe be more precise by covering how all bios from that task's
> submission context will be moved to poll queue of poll context?

Strictly speaking, adding a poll context isn't a must; it is just more
efficient for polling locally after switching to the bio grouping list in V2.

> 
> > bio from submission queue to poll queue of the poll context, and keep
> > polling until these bios are ended. Remove bio from poll queue if the
> > bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> > 
> > In previous version, kfifo is used to implement submission queue, and
> > Jeffle Xu found that kfifo can't scale well in case of high queue depth.
> 
> Awkward to reference "previous version", maybe instead say:
> 
> It was found that kfifo doesn't scale well for a submission queue as
> queue depth is increased, so a new mechanism for tracking bios is
> needed.

OK.

> 
> > So far bio's size is close to 2 cacheline size, and it may not be
> > accepted to add new field into bio for solving the scalability issue by
> > tracking bios via linked list, switch to bio group list for tracking bio,
> > the idea is to reuse .bi_end_io for linking bios into a linked list for
> > all sharing same .bi_end_io(call it bio group), which is recovered before
> > really end bio, since BIO_END_BY_POLL is added for enhancing this point.
> > Usually .bi_end_io is the same for all bios in the same layer, so it is enough to
> > provide very limited groups, such as 32 for fixing the scalability issue.
> > 
> > Usually submission shares context with io poll. The per-task poll context
> > is just like stack variable, and it is cheap to move data between the two
> > per-task queues.
> > 
> > Signed-off-by: Ming Lei 
> > ---
> >  block/bio.c   |   5 ++
> >  block/blk-core.c  | 149 +++-
> >  block/blk-mq.c| 173 +-
> >  block/blk.h   |   9 ++
> >  include/linux/blk_types.h |  16 +++-
> >  5 files changed, 348 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index 26b7f721cda8..04c043dc60fc 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio 
> > *bio)
> >   **/
> >  void bio_endio(struct bio *bio)
> >  {
> > +   /* BIO_END_BY_POLL has to be set before calling submit_bio */
> > +   if (bio_flagged(bio, BIO_END_BY_POLL)) {
> > +   bio_set_flag(bio, BIO_DONE);
> > +   return;
> > +   }
> >  again:
> > if (!bio_remaining_done(bio))
> > return;
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index efc7a61a84b4..778d25a7e76c 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -805,6 +805,77 @@ static inline unsigned int bio_grp_list_size(unsigned 
> > int nr_grps)
> > sizeof(struct bio_grp_list_data);
> >  }
> >  
> > +static inline void *bio_grp_data(struct bio *bio)
> > +{
> > +   return bio->bi_poll;
> > +}
> > +
> > +/* add bio into bio group list, return true if it is added */
> > +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> > +{
> > +   int i;
> > +   struct bio_grp_list_data *grp;
> > +
> > +   for (i = 0; i < list->nr_grps; i++) {
> > > +   grp = &list->head[i];
> > +   if (grp->grp_data == bio_grp_data(bio)) {
> > > +   __bio_grp_list_add(&grp->list, bio);
> > +   return true;
> > +   }
> > +   }
> > +
> > +   if (i == list->max_nr_grps)
> > +   return false;
> > +
> > +   /* create a new group */
> > > +   grp = &list->head[i];
> > +   bio_list_init(>list);
> > +   grp->grp_data = bio_grp_data(bio);
> > > +   __bio_grp_list_add(&grp->list, bio);
> > +   list->nr_grps++;
> > +
> 

Re: [dm-devel] [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll

2021-03-23 Thread Ming Lei
On Sat, Mar 20, 2021 at 01:56:13PM +0800, JeffleXu wrote:
> 
> 
> On 3/19/21 9:46 PM, Ming Lei wrote:
> > On Fri, Mar 19, 2021 at 05:38:38PM +0800, JeffleXu wrote:
> >> I'm thinking about how this mechanism could work with *original* bio-based
> >> devices that aren't built upon mq devices, such as nvdimm. This
> > 
> > A non-mq device needs its driver to implement io polling by itself; the
> > block layer can't help it, and that can't be this patchset's job.
> > 
> >> mechanism (also including my original design) mainly focuses on virtual
> >> devices that built upon mq devices, i.e., md/dm.
> >>
> >> As the original bio-based devices want to support IO polling in the
> >> future, they should somehow be distinguished from md/dm.
> >>
> >>
> >> On 3/19/21 12:48 AM, Ming Lei wrote:
> >>> Currently bio based IO poll needs to poll all hw queue blindly, this way
> >>> is very inefficient, and the big reason is that we can't pass bio
> >>> submission result to io poll task.
> >>>
> >>> In IO submission context, track associated underlying bios by per-task
> >>> submission queue and save 'cookie' poll data in 
> >>> bio->bi_iter.bi_private_data,
> >>> and return current->pid to caller of submit_bio() for any bio based
> >>> driver's IO, which is submitted from FS.
> >>>
> >>> In IO poll context, the passed cookie tells us the PID of submission
> >>> context, and we can find the bio from that submission context. Moving
> >>> bio from submission queue to poll queue of the poll context, and keep
> >>> polling until these bios are ended. Remove bio from poll queue if the
> >>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> >>>
> >>> In previous version, kfifo is used to implement submission queue, and
> >>> Jeffle Xu found that kfifo can't scale well in case of high queue depth.
> >>> So far bio's size is close to 2 cacheline size, and it may not be
> >>> accepted to add new field into bio for solving the scalability issue by
> >>> tracking bios via linked list, switch to bio group list for tracking bio,
> >>> the idea is to reuse .bi_end_io for linking bios into a linked list for
> >>> all sharing same .bi_end_io(call it bio group), which is recovered before
> >>> really end bio, since BIO_END_BY_POLL is added for enhancing this point.
> >>> Usually .bi_end_io is the same for all bios in the same layer, so it is enough to
> >>> provide very limited groups, such as 32 for fixing the scalability issue.
> >>>
> >>> Usually submission shares context with io poll. The per-task poll context
> >>> is just like stack variable, and it is cheap to move data between the two
> >>> per-task queues.
> >>>
> >>> Signed-off-by: Ming Lei 
> >>> ---
> >>>  block/bio.c   |   5 ++
> >>>  block/blk-core.c  | 149 +++-
> >>>  block/blk-mq.c| 173 +-
> >>>  block/blk.h   |   9 ++
> >>>  include/linux/blk_types.h |  16 +++-
> >>>  5 files changed, 348 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/block/bio.c b/block/bio.c
> >>> index 26b7f721cda8..04c043dc60fc 100644
> >>> --- a/block/bio.c
> >>> +++ b/block/bio.c
> >>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio 
> >>> *bio)
> >>>   **/
> >>>  void bio_endio(struct bio *bio)
> >>>  {
> >>> + /* BIO_END_BY_POLL has to be set before calling submit_bio */
> >>> + if (bio_flagged(bio, BIO_END_BY_POLL)) {
> >>> + bio_set_flag(bio, BIO_DONE);
> >>> + return;
> >>> + }
> >>>  again:
> >>>   if (!bio_remaining_done(bio))
> >>>   return;
> >>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>> index efc7a61a84b4..778d25a7e76c 100644
> >>> --- a/block/blk-core.c
> >>> +++ b/block/blk-core.c
> >>> @@ -805,6 +805,77 @@ static inline unsigned int 
> >>> bio_grp_list_size(unsigned int nr_grps)
> >>>   sizeof(struct bio_grp_list_data);
> >>>  }
> >>>  
> >>> +static inline void *bio_grp_data(struct bio *bio)
> >>> +{
> >>> + return bio->bi_poll;
> >>> +}
> >>> +
> >>> +/* add bio into bio group list, return true if it is added */
> >>> +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> >>> +{
> >>> + int i;
> >>> + struct bio_grp_list_data *grp;
> >>> +
> >>> + for (i = 0; i < list->nr_grps; i++) {
> >>> + grp = &list->head[i];
> >>> + if (grp->grp_data == bio_grp_data(bio)) {
> >>> + __bio_grp_list_add(&grp->list, bio);
> >>> + return true;
> >>> + }
> >>> + }
> >>> +
> >>> + if (i == list->max_nr_grps)
> >>> + return false;
> >>> +
> >>> + /* create a new group */
> >>> + grp = &list->head[i];
> >>> + bio_list_init(>list);
> >>> + grp->grp_data = bio_grp_data(bio);
> >>> + __bio_grp_list_add(&grp->list, bio);
> >>> + list->nr_grps++;
> >>> +
> >>> + return true;
> >>> +}
> >>> +
> >>> +static int bio_grp_list_find_grp(struct bio_grp_list *list, void 
> >>> *grp_data)
> >>> +{
> >>> + int i;
> >>> + struct bio_grp_list_data *grp;
> >>> +
> >>> + for (i = 

Re: [dm-devel] [RFC PATCH V2 06/13] block: add new field into 'struct bvec_iter'

2021-03-23 Thread Ming Lei
On Fri, Mar 19, 2021 at 01:44:22PM -0400, Mike Snitzer wrote:
> On Thu, Mar 18 2021 at 12:48pm -0400,
> Ming Lei  wrote:
> 
> > There is a hole at the end of 'struct bvec_iter', so put a new field
> > here and we can save cookie returned from submit_bio() here for
> > supporting bio based polling.
> > 
> > This way can avoid to extend bio unnecessarily.
> > 
> > Signed-off-by: Ming Lei 
> > ---
> >  include/linux/bvec.h | 9 +
> >  1 file changed, 9 insertions(+)
> > 
> > diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> > index ff832e698efb..61c0f55f7165 100644
> > --- a/include/linux/bvec.h
> > +++ b/include/linux/bvec.h
> > @@ -43,6 +43,15 @@ struct bvec_iter {
> >  
> > unsigned intbi_bvec_done;   /* number of bytes completed in
> >current bvec */
> > +
> > +   /*
> > +* There is a hole at the end of bvec_iter, define one filed to
> 
> s/filed/field/
> 
> > +* hold something which isn't relate with 'bvec_iter', so that we can
> 
> s/relate/related/
> or
> s/isn't relate with/doesn't relate to/
> 
> > +* avoid to extend bio. So far this new field is used for bio based
> 
> s/to extend/extending/
> 
> > +* pooling, we will store returning value of underlying queue's
> 
> s/pooling/polling/
> 

Good catch, will fix all in V3.

-- 
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH V2 05/13] block: add req flag of REQ_TAG

2021-03-23 Thread Ming Lei
On Fri, Mar 19, 2021 at 01:38:13PM -0400, Mike Snitzer wrote:
> On Thu, Mar 18 2021 at 12:48pm -0400,
> Ming Lei  wrote:
> 
> > Add one req flag REQ_TAG which will be used in the following patch for
> > supporting bio based IO polling.
> 
> "REQ_TAG" is so generic yet is used in such a specific way (to mark an
> FS bio as having polling context)
> 
> I don't have a great suggestion for a better name, just seems "REQ_TAG"
> is lacking... (especially given the potential for confusion due to
> blk-mq's notion of "tag").
> 
> REQ_FS? REQ_FS_CTX? REQ_POLL? REQ_POLL_CTX? REQ_NAMING_IS_HARD :)
> 

Maybe REQ_POLL_CTX is better, it is just for marking bios:

1) which need to be polled in this context

2) which can be polled in this context

-- 
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH V2 04/13] block: create io poll context for submission and poll task

2021-03-23 Thread Ming Lei
On Fri, Mar 19, 2021 at 01:05:09PM -0400, Mike Snitzer wrote:
> On Thu, Mar 18 2021 at 12:48pm -0400,
> Ming Lei  wrote:
> 
> > Create per-task io poll context for both IO submission and poll task
> > if the queue is bio based and supports polling.
> > 
> > This io polling context includes two queues:
> 1) submission queue(sq) for storing HIPRI bio submission result(cookie)
>and the bio, written by submission task and read by poll task.

BTW, V2 has switched to storing the bio only; the cookie is actually stored
inside the bio.

> 2) polling queue(pq) for holding data moved from sq, only used in poll
>context for running bio polling.
>  
> (nit, but it just reads a bit clearer to enumerate the 2 queues)

OK.
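
For reference, the per-task context then looks roughly like this (an
illustrative sketch only; field names are approximate, see the actual
V2 patch for the real layout):

struct blk_bio_poll_ctx {
	spinlock_t		sq_lock;	/* protects sq */
	struct bio_grp_list	*sq;		/* filled at submission time */
	struct bio_grp_list	*pq;		/* drained by the polling task */
};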

-- 
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH V2 01/13] block: add helper of blk_queue_poll

2021-03-23 Thread Ming Lei
On Fri, Mar 19, 2021 at 12:52:42PM -0400, Mike Snitzer wrote:
> On Thu, Mar 18 2021 at 12:48pm -0400,
> Ming Lei  wrote:
> 
> > There have been 3 users, and there will be more, so add one such helper.
> > 
> > Signed-off-by: Ming Lei 
> 
> Not sure if you're collecting Reviewed-by or Acked-by at this point?
> Seems you dropped Chaitanya's Reviewed-by to v1:
> https://listman.redhat.com/archives/dm-devel/2021-March/msg00166.html

Sorry, that should be an accident.

> 
> Do you plan to iterate a lot more before you put out a non-RFC?  For
> this RFC v2, I'll withhold adding any of my Reviewed-by tags and just
> reply where I see things that might need folding into the next
> iteration.

If no one objects to the basic approach taken in V2, I will remove RFC in
V3.

-- 
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] dm-multipath - IO queue dispatch based on FPIN Congestion/Latency notifications.

2021-03-23 Thread Erwin van Londen
Hello All,

Just wondering if there were any plans to incorporate FPIN
congestion/latency notifications in dm-multipath to disperse IO over
non-affected paths.

Regards,
Erwin van Londen
-- 
Kind regards,

Erwin van Londen
http://erwinvanlonden.net
PGP key: http://erwinvanlonden.net/pgp-key-id/



--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Re: [dm-devel] md/dm-mpath: check whether all pgpaths have same uuid in multipath_ctr()

2021-03-23 Thread lixiaokeng



On 2021/3/22 22:22, Mike Snitzer wrote:
> On Mon, Mar 22 2021 at  4:11am -0400,
> Christoph Hellwig  wrote:
> 
>> On Sat, Mar 20, 2021 at 03:19:23PM +0800, Zhiqiang Liu wrote:
>>> From: Zhiqiang Liu 
>>>
>>> When we run an IO stress test on a multipath device, there will
>>> be a metadata error because of a wrong path. In the test, we
>>> concurrently execute the 'iscsi device login|logout' and
>>> 'multipath -r' commands with IO stress on the multipath device.
>>> In some cases, systemd-udevd may not have time to process the
>>> uevents of the iscsi device logout|login, and then the 'multipath -r'
>>> command triggers the multipathd daemon to call ioctl to load a table
>>> with incorrect old device info from systemd-udevd.
>>> Then, one iscsi path may be incorrectly attached to another
>>> multipath device which has a different uuid. Finally, the metadata
>>> error occurs when unmounting the filesystem writes metadata down to
>>> the iscsi device which is actually not owned by the multipath
>>> device.
>>>
>>> So we need to check whether all pgpaths of one multipath have
>>> the same uuid; if not, we should return an error.
>>>
>>> Signed-off-by: Zhiqiang Liu 
>>> Signed-off-by: lixiaokeng 
>>> Signed-off-by: linfeilong 
>>> Signed-off-by: Wubo 
>>> ---
>>>  drivers/md/dm-mpath.c   | 52 +
>>>  drivers/scsi/scsi_lib.c |  1 +
>>>  2 files changed, 53 insertions(+)
>>>
>>> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
>>> index bced42f082b0..f0b995784b53 100644
>>> --- a/drivers/md/dm-mpath.c
>>> +++ b/drivers/md/dm-mpath.c
>>> @@ -24,6 +24,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  #include 
>>>  #include 
>>>
>>> @@ -1169,6 +1170,45 @@ static int parse_features(struct dm_arg_set *as, 
>>> struct multipath *m)
>>> return r;
>>>  }
>>>
>>> +#define SCSI_VPD_LUN_ID_PREFIX_LEN 4
>>> +#define MPATH_UUID_PREFIX_LEN 7
>>> +static int check_pg_uuid(struct priority_group *pg, char *md_uuid)
>>> +{
>>> +   char pgpath_uuid[DM_UUID_LEN] = {0};
>>> +   struct request_queue *q;
>>> +   struct pgpath *pgpath;
>>> +   struct scsi_device *sdev;
>>> +   ssize_t count;
>>> +   int r = 0;
>>> +
>>> +   list_for_each_entry(pgpath, &pg->pgpaths, list) {
>>> +   q = bdev_get_queue(pgpath->path.dev->bdev);
>>> +   sdev = scsi_device_from_queue(q);
>>
>> Common dm-multipath code should never poke into scsi internals.  This
>> is something for the device handler to check.  It probably also won't
>> work for all older devices.
> 
> Definitely.
> 
> But that aside, userspace (multipathd) _should_ be able to do extra
> validation, _before_ pushing down a new table to the kernel, rather than
> forcing the kernel to do it.
> 

Martin (committer of multipath-tools) said that:
"Don't get me wrong, I don't argue against tough testing. But we should
be aware that there are always time intervals during which multipathd's
picture of the present devices is different from what the kernel sees."

It is difficult to solve this in multipathd.

Regards,
Lixiaokeng

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel