Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS
On 2020/09/07 20:54, Kanchan Joshi wrote:
> On Mon, Sep 7, 2020 at 5:07 PM Damien Le Moal wrote:
>> On 2020/09/07 20:24, Kanchan Joshi wrote:
>>> On Mon, Sep 7, 2020 at 1:52 PM Damien Le Moal wrote:
>>>> On 2020/09/07 16:01, Kanchan Joshi wrote:
>>>>>> Even for SMR, the user is free to set the elevator to none, which
>>>>>> disables zone write locking. Issuing writes correctly then becomes
>>>>>> the responsibility of the application. This can be useful for
>>>>>> settings that for instance use NCQ I/O priorities, which give
>>>>>> better results when "none" is used.
>>>>>
>>>>> Was it not a problem that even if the application is sending writes
>>>>> correctly, scheduler may not preserve the order.
>>>>> And even when none is being used, re-queue can happen which may
>>>>> lead to different ordering.
>>>>
>>>> "Issuing writes correctly" means doing small writes, one per zone at
>>>> most. In that case, it does not matter if the block layer reorders
>>>> writes. Per zone, they will still be sequential.
>>>>
>>>>>> As far as I know, zoned drives are always used in tightly
>>>>>> controlled environments. Problems like "does not know what other
>>>>>> applications would be doing" are non-existent. Setting up the
>>>>>> drive correctly for the use case at hand is a sysadmin/server
>>>>>> setup problem, based on *the* application (singular) requirements.
>>>>>
>>>>> Fine.
>>>>> But what about the null-block-zone which sets MQ-deadline but does
>>>>> not actually use write-lock to avoid race among multiple appends
>>>>> on a zone.
>>>>> Does that deserve a fix?
>>>>
>>>> In nullblk, commands are executed under a spinlock. So there is no
>>>> concurrency problem. The spinlock serializes the execution of all
>>>> commands. null_blk zone append emulation thus does not need to take
>>>> the scheduler level zone write lock like scsi does.
>>>
>>> I do not see spinlock for that. There is one "nullb->lock", but its
>>> scope is limited to memory-backed handling.
>>> For concurrent zone-appends on a zone, multiple threads may set the
>>> "same" write-pointer into incoming request(s).
>>> Are you referring to any other spinlock that can avoid "same wp being
>>> returned to multiple threads".
>>
>> Checking again, it looks like you are correct. nullb->lock is indeed
>> only used for processing read/write with memory backing turned on.
>> We either need to extend that spinlock use, or add one to protect the
>> zone array when doing zoned commands and checks of read/write against
>> a zone wp.
>> Care to send a patch ? I can send one too.
>
> Sure, I can send.
> Do you think it is not OK to use zone write-lock (same like SCSI
> emulation) instead of introducing a new spinlock?

zone write lock will not protect against read or zone management commands
executed concurrently with writes. Only concurrent writes to the same zone
will be serialized with the scheduler zone write locking, which may not be
used at all also if the user set the scheduler to none. A lock for
exclusive access and changes to the zone array is needed.

--
Damien Le Moal
Western Digital Research
Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS
On Mon, Sep 7, 2020 at 5:07 PM Damien Le Moal wrote:
> On 2020/09/07 20:24, Kanchan Joshi wrote:
>> On Mon, Sep 7, 2020 at 1:52 PM Damien Le Moal wrote:
>>> On 2020/09/07 16:01, Kanchan Joshi wrote:
>>>>> Even for SMR, the user is free to set the elevator to none, which
>>>>> disables zone write locking. Issuing writes correctly then becomes
>>>>> the responsibility of the application. This can be useful for
>>>>> settings that for instance use NCQ I/O priorities, which give
>>>>> better results when "none" is used.
>>>>
>>>> Was it not a problem that even if the application is sending writes
>>>> correctly, scheduler may not preserve the order.
>>>> And even when none is being used, re-queue can happen which may lead
>>>> to different ordering.
>>>
>>> "Issuing writes correctly" means doing small writes, one per zone at
>>> most. In that case, it does not matter if the block layer reorders
>>> writes. Per zone, they will still be sequential.
>>>
>>>>> As far as I know, zoned drives are always used in tightly
>>>>> controlled environments. Problems like "does not know what other
>>>>> applications would be doing" are non-existent. Setting up the drive
>>>>> correctly for the use case at hand is a sysadmin/server setup
>>>>> problem, based on *the* application (singular) requirements.
>>>>
>>>> Fine.
>>>> But what about the null-block-zone which sets MQ-deadline but does
>>>> not actually use write-lock to avoid race among multiple appends on
>>>> a zone.
>>>> Does that deserve a fix?
>>>
>>> In nullblk, commands are executed under a spinlock. So there is no
>>> concurrency problem. The spinlock serializes the execution of all
>>> commands. null_blk zone append emulation thus does not need to take
>>> the scheduler level zone write lock like scsi does.
>>
>> I do not see spinlock for that. There is one "nullb->lock", but its
>> scope is limited to memory-backed handling.
>> For concurrent zone-appends on a zone, multiple threads may set the
>> "same" write-pointer into incoming request(s).
>> Are you referring to any other spinlock that can avoid "same wp being
>> returned to multiple threads".
>
> Checking again, it looks like you are correct. nullb->lock is indeed
> only used for processing read/write with memory backing turned on.
> We either need to extend that spinlock use, or add one to protect the
> zone array when doing zoned commands and checks of read/write against
> a zone wp.
> Care to send a patch ? I can send one too.

Sure, I can send.
Do you think it is not OK to use zone write-lock (same like SCSI
emulation) instead of introducing a new spinlock?

--
Kanchan
Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS
On 2020/09/07 20:24, Kanchan Joshi wrote:
> On Mon, Sep 7, 2020 at 1:52 PM Damien Le Moal wrote:
>> On 2020/09/07 16:01, Kanchan Joshi wrote:
>>>> Even for SMR, the user is free to set the elevator to none, which
>>>> disables zone write locking. Issuing writes correctly then becomes
>>>> the responsibility of the application. This can be useful for
>>>> settings that for instance use NCQ I/O priorities, which give better
>>>> results when "none" is used.
>>>
>>> Was it not a problem that even if the application is sending writes
>>> correctly, scheduler may not preserve the order.
>>> And even when none is being used, re-queue can happen which may lead
>>> to different ordering.
>>
>> "Issuing writes correctly" means doing small writes, one per zone at
>> most. In that case, it does not matter if the block layer reorders
>> writes. Per zone, they will still be sequential.
>>
>>>> As far as I know, zoned drives are always used in tightly controlled
>>>> environments. Problems like "does not know what other applications
>>>> would be doing" are non-existent. Setting up the drive correctly for
>>>> the use case at hand is a sysadmin/server setup problem, based on
>>>> *the* application (singular) requirements.
>>>
>>> Fine.
>>> But what about the null-block-zone which sets MQ-deadline but does
>>> not actually use write-lock to avoid race among multiple appends on
>>> a zone.
>>> Does that deserve a fix?
>>
>> In nullblk, commands are executed under a spinlock. So there is no
>> concurrency problem. The spinlock serializes the execution of all
>> commands. null_blk zone append emulation thus does not need to take
>> the scheduler level zone write lock like scsi does.
>
> I do not see spinlock for that. There is one "nullb->lock", but its
> scope is limited to memory-backed handling.
> For concurrent zone-appends on a zone, multiple threads may set the
> "same" write-pointer into incoming request(s).
> Are you referring to any other spinlock that can avoid "same wp being
> returned to multiple threads".

Checking again, it looks like you are correct. nullb->lock is indeed only
used for processing read/write with memory backing turned on.
We either need to extend that spinlock use, or add one to protect the
zone array when doing zoned commands and checks of read/write against a
zone wp.
Care to send a patch ? I can send one too.

--
Damien Le Moal
Western Digital Research
Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS
On Mon, Sep 7, 2020 at 1:52 PM Damien Le Moal wrote:
> On 2020/09/07 16:01, Kanchan Joshi wrote:
>>> Even for SMR, the user is free to set the elevator to none, which
>>> disables zone write locking. Issuing writes correctly then becomes
>>> the responsibility of the application. This can be useful for
>>> settings that for instance use NCQ I/O priorities, which give better
>>> results when "none" is used.
>>
>> Was it not a problem that even if the application is sending writes
>> correctly, scheduler may not preserve the order.
>> And even when none is being used, re-queue can happen which may lead
>> to different ordering.
>
> "Issuing writes correctly" means doing small writes, one per zone at
> most. In that case, it does not matter if the block layer reorders
> writes. Per zone, they will still be sequential.
>
>>> As far as I know, zoned drives are always used in tightly controlled
>>> environments. Problems like "does not know what other applications
>>> would be doing" are non-existent. Setting up the drive correctly for
>>> the use case at hand is a sysadmin/server setup problem, based on
>>> *the* application (singular) requirements.
>>
>> Fine.
>> But what about the null-block-zone which sets MQ-deadline but does not
>> actually use write-lock to avoid race among multiple appends on a
>> zone.
>> Does that deserve a fix?
>
> In nullblk, commands are executed under a spinlock. So there is no
> concurrency problem. The spinlock serializes the execution of all
> commands. null_blk zone append emulation thus does not need to take the
> scheduler level zone write lock like scsi does.

I do not see spinlock for that. There is one "nullb->lock", but its
scope is limited to memory-backed handling.
For concurrent zone-appends on a zone, multiple threads may set the
"same" write-pointer into incoming request(s).
Are you referring to any other spinlock that can avoid "same wp being
returned to multiple threads".

--
Kanchan
Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS
On 2020/09/07 16:01, Kanchan Joshi wrote:
>> Even for SMR, the user is free to set the elevator to none, which
>> disables zone write locking. Issuing writes correctly then becomes the
>> responsibility of the application. This can be useful for settings
>> that for instance use NCQ I/O priorities, which give better results
>> when "none" is used.
>
> Was it not a problem that even if the application is sending writes
> correctly, scheduler may not preserve the order.
> And even when none is being used, re-queue can happen which may lead
> to different ordering.

"Issuing writes correctly" means doing small writes, one per zone at
most. In that case, it does not matter if the block layer reorders
writes. Per zone, they will still be sequential.

>> As far as I know, zoned drives are always used in tightly controlled
>> environments. Problems like "does not know what other applications
>> would be doing" are non-existent. Setting up the drive correctly for
>> the use case at hand is a sysadmin/server setup problem, based on
>> *the* application (singular) requirements.
>
> Fine.
> But what about the null-block-zone which sets MQ-deadline but does not
> actually use write-lock to avoid race among multiple appends on a
> zone.
> Does that deserve a fix?

In nullblk, commands are executed under a spinlock. So there is no
concurrency problem. The spinlock serializes the execution of all
commands. null_blk zone append emulation thus does not need to take the
scheduler level zone write lock like scsi does.

--
Damien Le Moal
Western Digital Research
Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS
On Wed, Aug 19, 2020 at 4:47 PM Damien Le Moal wrote:
> On 2020/08/19 19:32, Kanchan Joshi wrote:
>> On Wed, Aug 19, 2020 at 3:08 PM Damien Le Moal wrote:
>>> On 2020/08/19 18:27, Kanchan Joshi wrote:
>>>> On Tue, Aug 18, 2020 at 12:46 PM Christoph Hellwig wrote:
>>>>> On Tue, Aug 18, 2020 at 10:59:35AM +0530, Kanchan Joshi wrote:
>>>>>> Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.
>>>>>
>>>>> No, it is not.
>>>>
>>>> Are you saying MQ-Deadline (write-lock) is not needed for writes on
>>>> ZNS?
>>>> I see that null-block zoned and SCSI-ZBC both set this requirement.
>>>> I wonder how it became different for NVMe.
>>>
>>> It is not required for an NVMe ZNS drive that has zone append native
>>> support. zonefs and upcoming btrfs do not use regular writes,
>>> removing the requirement for zone write locking.
>>
>> I understand that if a particular user (zonefs, btrfs etc) is not
>> sending regular-write and sending append instead, write-lock is not
>> required.
>> But if that particular user or some other user (say F2FS) sends
>> regular write(s), write-lock is needed.
>
> And that can be trivially enabled by setting the drive elevator to
> mq-deadline.
>
>> Above block-layer, both the opcodes REQ_OP_WRITE and
>> REQ_OP_ZONE_APPEND are available to be used by users. And I thought
>> write-lock is taken or not is a per-opcode thing and not per-user (FS,
>> MD/DM, user-space etc.), is not that correct? And MQ-deadline can
>> cater to both the opcodes, while other schedulers cannot serve
>> REQ_OP_WRITE well for zoned-device.
>
> mq-deadline ignores zone append commands. No zone lock is taken for
> these. In scsi, the emulation takes the zone lock before transforming
> the zone append into a regular write. That locking is consistent with
> the mq-scheduler level locking since the same lock bitmap is used.
> So if the user only issues zone append writes, mq-deadline is not
> needed and there is no reasons to force its use by setting
> ELEVATOR_F_ZBD_SEQ_WRITE. E.g. the user may want to use kyber...

Right, got your point.

>>> In the context of your patch series, ELEVATOR_F_ZBD_SEQ_WRITE should
>>> be set only and only if the drive does not have native zone append
>>> support.
>>
>> Sure I can keep it that way, once I get it right. If it is really not
>> required for native-append drive, it should not be here at the place
>> where I added.
>>
>>> And even in that case, since for an emulated zone append the zone
>>> write lock is taken and released by the emulation driver itself,
>>> ELEVATOR_F_ZBD_SEQ_WRITE is required only if the user will also be
>>> issuing regular writes at high QD. And that is trivially controllable
>>> by the user by simply setting the drive elevator to mq-deadline.
>>> Conclusion: setting ELEVATOR_F_ZBD_SEQ_WRITE is not needed.
>>
>> Are we saying applications should switch schedulers based on the write
>> QD (use any-scheduler for QD1 and mq-deadline for QD-N).
>> Even if it does that, it does not know what other applications would
>> be doing. That seems hard-to-get-right and possible only in a
>> tightly-controlled environment.
>
> Even for SMR, the user is free to set the elevator to none, which
> disables zone write locking. Issuing writes correctly then becomes the
> responsibility of the application. This can be useful for settings that
> for instance use NCQ I/O priorities, which give better results when
> "none" is used.

Was it not a problem that even if the application is sending writes
correctly, scheduler may not preserve the order.
And even when none is being used, re-queue can happen which may lead
to different ordering.

> As far as I know, zoned drives are always used in tightly controlled
> environments. Problems like "does not know what other applications
> would be doing" are non-existent. Setting up the drive correctly for
> the use case at hand is a sysadmin/server setup problem, based on *the*
> application (singular) requirements.

Fine.
But what about the null-block-zone which sets MQ-deadline but does not
actually use write-lock to avoid race among multiple appends on a
zone.
Does that deserve a fix?

--
Kanchan
Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS
On 2020/08/19 19:32, Kanchan Joshi wrote:
> On Wed, Aug 19, 2020 at 3:08 PM Damien Le Moal wrote:
>> On 2020/08/19 18:27, Kanchan Joshi wrote:
>>> On Tue, Aug 18, 2020 at 12:46 PM Christoph Hellwig wrote:
>>>> On Tue, Aug 18, 2020 at 10:59:35AM +0530, Kanchan Joshi wrote:
>>>>> Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.
>>>>
>>>> No, it is not.
>>>
>>> Are you saying MQ-Deadline (write-lock) is not needed for writes on
>>> ZNS?
>>> I see that null-block zoned and SCSI-ZBC both set this requirement. I
>>> wonder how it became different for NVMe.
>>
>> It is not required for an NVMe ZNS drive that has zone append native
>> support. zonefs and upcoming btrfs do not use regular writes, removing
>> the requirement for zone write locking.
>
> I understand that if a particular user (zonefs, btrfs etc) is not
> sending regular-write and sending append instead, write-lock is not
> required.
> But if that particular user or some other user (say F2FS) sends
> regular write(s), write-lock is needed.

And that can be trivially enabled by setting the drive elevator to
mq-deadline.

> Above block-layer, both the opcodes REQ_OP_WRITE and
> REQ_OP_ZONE_APPEND are available to be used by users. And I thought
> write-lock is taken or not is a per-opcode thing and not per-user (FS,
> MD/DM, user-space etc.), is not that correct? And MQ-deadline can
> cater to both the opcodes, while other schedulers cannot serve
> REQ_OP_WRITE well for zoned-device.

mq-deadline ignores zone append commands. No zone lock is taken for
these. In scsi, the emulation takes the zone lock before transforming
the zone append into a regular write. That locking is consistent with
the mq-scheduler level locking since the same lock bitmap is used. So if
the user only issues zone append writes, mq-deadline is not needed and
there is no reasons to force its use by setting
ELEVATOR_F_ZBD_SEQ_WRITE. E.g. the user may want to use kyber...

>> In the context of your patch series, ELEVATOR_F_ZBD_SEQ_WRITE should
>> be set only and only if the drive does not have native zone append
>> support.
>
> Sure I can keep it that way, once I get it right. If it is really not
> required for native-append drive, it should not be here at the place
> where I added.
>
>> And even in that case, since for an emulated zone append the zone
>> write lock is taken and released by the emulation driver itself,
>> ELEVATOR_F_ZBD_SEQ_WRITE is required only if the user will also be
>> issuing regular writes at high QD. And that is trivially controllable
>> by the user by simply setting the drive elevator to mq-deadline.
>> Conclusion: setting ELEVATOR_F_ZBD_SEQ_WRITE is not needed.
>
> Are we saying applications should switch schedulers based on the write
> QD (use any-scheduler for QD1 and mq-deadline for QD-N).
> Even if it does that, it does not know what other applications would
> be doing. That seems hard-to-get-right and possible only in a
> tightly-controlled environment.

Even for SMR, the user is free to set the elevator to none, which
disables zone write locking. Issuing writes correctly then becomes the
responsibility of the application. This can be useful for settings that
for instance use NCQ I/O priorities, which give better results when
"none" is used.

As far as I know, zoned drives are always used in tightly controlled
environments. Problems like "does not know what other applications would
be doing" are non-existent. Setting up the drive correctly for the use
case at hand is a sysadmin/server setup problem, based on *the*
application (singular) requirements.

--
Damien Le Moal
Western Digital Research
Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS
On Wed, Aug 19, 2020 at 3:08 PM Damien Le Moal wrote:
> On 2020/08/19 18:27, Kanchan Joshi wrote:
>> On Tue, Aug 18, 2020 at 12:46 PM Christoph Hellwig wrote:
>>> On Tue, Aug 18, 2020 at 10:59:35AM +0530, Kanchan Joshi wrote:
>>>> Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.
>>>
>>> No, it is not.
>>
>> Are you saying MQ-Deadline (write-lock) is not needed for writes on
>> ZNS?
>> I see that null-block zoned and SCSI-ZBC both set this requirement. I
>> wonder how it became different for NVMe.
>
> It is not required for an NVMe ZNS drive that has zone append native
> support. zonefs and upcoming btrfs do not use regular writes, removing
> the requirement for zone write locking.

I understand that if a particular user (zonefs, btrfs etc) is not
sending regular-write and sending append instead, write-lock is not
required.
But if that particular user or some other user (say F2FS) sends
regular write(s), write-lock is needed.
Above block-layer, both the opcodes REQ_OP_WRITE and
REQ_OP_ZONE_APPEND are available to be used by users. And I thought
write-lock is taken or not is a per-opcode thing and not per-user (FS,
MD/DM, user-space etc.), is not that correct? And MQ-deadline can
cater to both the opcodes, while other schedulers cannot serve
REQ_OP_WRITE well for zoned-device.

> In the context of your patch series, ELEVATOR_F_ZBD_SEQ_WRITE should be
> set only and only if the drive does not have native zone append
> support.

Sure I can keep it that way, once I get it right. If it is really not
required for native-append drive, it should not be here at the place
where I added.

> And even in that case, since for an emulated zone append the zone write
> lock is taken and released by the emulation driver itself,
> ELEVATOR_F_ZBD_SEQ_WRITE is required only if the user will also be
> issuing regular writes at high QD. And that is trivially controllable
> by the user by simply setting the drive elevator to mq-deadline.
> Conclusion: setting ELEVATOR_F_ZBD_SEQ_WRITE is not needed.

Are we saying applications should switch schedulers based on the write
QD (use any-scheduler for QD1 and mq-deadline for QD-N).
Even if it does that, it does not know what other applications would
be doing. That seems hard-to-get-right and possible only in a
tightly-controlled environment.

--
Joshi
Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS
On 2020/08/19 18:27, Kanchan Joshi wrote:
> On Tue, Aug 18, 2020 at 12:46 PM Christoph Hellwig wrote:
>> On Tue, Aug 18, 2020 at 10:59:35AM +0530, Kanchan Joshi wrote:
>>> Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.
>>
>> No, it is not.
>
> Are you saying MQ-Deadline (write-lock) is not needed for writes on
> ZNS?
> I see that null-block zoned and SCSI-ZBC both set this requirement. I
> wonder how it became different for NVMe.

It is not required for an NVMe ZNS drive that has zone append native
support. zonefs and upcoming btrfs do not use regular writes, removing
the requirement for zone write locking.

In the context of your patch series, ELEVATOR_F_ZBD_SEQ_WRITE should be
set only and only if the drive does not have native zone append support.
And even in that case, since for an emulated zone append the zone write
lock is taken and released by the emulation driver itself,
ELEVATOR_F_ZBD_SEQ_WRITE is required only if the user will also be
issuing regular writes at high QD. And that is trivially controllable by
the user by simply setting the drive elevator to mq-deadline.
Conclusion: setting ELEVATOR_F_ZBD_SEQ_WRITE is not needed.

--
Damien Le Moal
Western Digital Research
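Following the conclusion above, the patch's one-line hunk would become conditional on the drive lacking native Zone Append. A sketch only, against nvme_update_zone_info() as shown in the patch: the `append_emulated` flag is hypothetical, standing in for however the driver would record that Zone Append must be emulated.

```c
	/*
	 * Sketch: require mq-deadline zone write locking only when zone
	 * append is emulated. "append_emulated" is an illustrative name,
	 * not an existing nvme_ns field.
	 */
	if (ns->append_emulated)
		blk_queue_required_elevator_features(q,
				ELEVATOR_F_ZBD_SEQ_WRITE);
```

For drives with native Zone Append, no elevator feature is required and the user remains free to pick kyber, none, or mq-deadline as the workload dictates.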
Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS
On Tue, Aug 18, 2020 at 12:46 PM Christoph Hellwig wrote:
> On Tue, Aug 18, 2020 at 10:59:35AM +0530, Kanchan Joshi wrote:
>> Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.
>
> No, it is not.

Are you saying MQ-Deadline (write-lock) is not needed for writes on ZNS?
I see that null-block zoned and SCSI-ZBC both set this requirement. I
wonder how it became different for NVMe.

--
Kanchan
Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS
On Tue, Aug 18, 2020 at 10:59:35AM +0530, Kanchan Joshi wrote:
> Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.

No, it is not.
[PATCH 1/2] nvme: set io-scheduler requirement for ZNS
Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.

Signed-off-by: Kanchan Joshi
---
 drivers/nvme/host/zns.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 57cfd78731fb..cabd870fb64e 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -96,6 +96,7 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
 
 	q->limits.zoned = BLK_ZONED_HM;
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
+	blk_queue_required_elevator_features(q, ELEVATOR_F_ZBD_SEQ_WRITE);
 	blk_queue_max_open_zones(q, le32_to_cpu(id->mor) + 1);
 	blk_queue_max_active_zones(q, le32_to_cpu(id->mar) + 1);
 free_data:
--
2.17.1