Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-16 Thread Shaun Tancheff
On Tue, Aug 9, 2016 at 11:38 PM, Damien Le Moal  wrote:
> Shaun,
>
> On 8/10/16 12:58, Shaun Tancheff wrote:
>>
>> On Tue, Aug 9, 2016 at 3:09 AM, Damien Le Moal 
>> wrote:

 On Aug 9, 2016, at 15:47, Hannes Reinecke  wrote:
>>
>>
>> [trim]
>>
> Since disk type == 0 for everything that isn't HM so I would prefer the
> sysfs 'zoned' file just report if the drive is HA or HM.
>
 Okay. So let's put in the 'zoned' attribute the device type:
 'host-managed', 'host-aware', or 'device managed'.
>>>
>>>
>>> I hacked your patches and simply put a "0" or "1" in the sysfs zoned
>>> file.
>>> Any drive that has ZBC/ZAC command support gets a "1", "0" for everything
>>> else. This means that drive managed models are not exposed as zoned block
>>> devices. For HM vs HA differentiation, an application can look at the
>>> device type file since it is already present.
>>>
>>> We could indeed set the "zoned" file to the device type, but HM drives
>>> and
>>> regular drives will both have "0" in it, so no differentiation possible.
>>> The other choice could be the "zoned" bits defined by ZBC, but these
>>> do not define a value for host managed drives, and the drive managed
>>> value
>>> being not "0" could be confusing too. So I settled for a simple 0/1
>>> boolean.
>>
>>
>> This seems good to me.
>
>
> Another option I forgot is for the "zoned" file to indicate the total number
> of zones of the device, and 0 for a non zoned regular block device. That
> would work as well.

Clearly either is sufficient.

> [...]
>>>
>>> Done: I hacked Shaun's ioctl code and added finish zone too. The
>>> difference with Shaun's initial code is that the ioctls are propagated
>>> down to the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is
>>> no need for a BIO request definition for the zone operations. So a lot
>>> less code added.
>>
>>
>> The purpose of the BIO flags is not to enable the ioctls so much as
>> the other way round. Creating BIO op's is to enable issuing ZBC
>> commands from device mapper targets and file systems without some
>> heinous ioctl hacks.
>> Making the resulting block layer interfaces available via ioctls is just a
>> reasonable way to exercise the code ... or that was my intent.
>
>
> Yes, I understood your code. However, since (or if) we keep the zone
> information in the RB-tree cache, there is no need for the report zone
> operation BIO interface. Same for reset write pointer by keeping the mapping
> to discard. blk_lookup_zone can be used in kernel as a report zone BIO
> replacement and works as well for the report zone ioctl implementation. For
> reset, there is blkdev_issue_discard in kernel, and the reset zone ioctl
> becomes equivalent to BLKDISCARD ioctl. These are simple. Open, close and
> finish zone remains. For these, adding the BIO interface seemed an overkill.
> Hence my choice of propagating the ioctl to the driver.
> This is debatable of course, and adding an in-kernel interface is not hard:
> we can implement blk_open_zone, blk_close_zone and blk_finish_zone using
> __blkdev_driver_ioctl. That looks clean to me.

Uh. I would call that a "heinous" ioctl hack myself. Kernel -> User API
-> Kernel is not really a good design IMO.

> Overall, my concern with the BIO based interface for the ZBC commands is
> that it adds one flag for each command, which is not really the philosophy
> of the interface and potentially opens the door for more such
> implementations in the future with new standards and new commands coming up.
> Clearly that is not a sustainable path. So I think that a more specific
> interface for these zone operations is a better choice. That is consistent
> with what happens with the tons of ATA and SCSI commands not actually doing
> data I/Os (mode sense, log pages, SMART, etc). All these do not use BIOs and
> are processed as request REQ_TYPE_BLOCK_PC.

Part of the reason for following on Mike Christie's bio op/flags cleanup was to
make these ops. The advantage of being added as ops is that only
1 extra bit is needed (not 4 or 5 bits for flags). The other reason for
promoting these into the block layer as commands is that it seems to me
to make sense that these abstractions could be allowed to be passed through
a DM layer and be handled by a file system.
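
To make that concrete, here is a rough sketch (not the posted patch itself)
of how an in-kernel user such as a DM target could reset a zone once a
REQ_OP_ZONE_RESET op exists; the op name follows this patch set and the
zero-payload bio convention is an assumption:

/* Sketch only: issue a zone reset as a bio op from kernel code. */
static int example_reset_zone(struct block_device *bdev, sector_t zone_start)
{
	struct bio *bio;
	int ret;

	bio = bio_alloc(GFP_KERNEL, 0);	/* no data payload for a zone op */
	if (!bio)
		return -ENOMEM;

	bio->bi_bdev = bdev;
	bio->bi_iter.bi_sector = zone_start;
	bio_set_op_attrs(bio, REQ_OP_ZONE_RESET, 0);

	ret = submit_bio_wait(bio);
	bio_put(bio);
	return ret;
}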

>>> The ioctls do not mimic exactly the ZBC standard. For instance, there is
>>> no
>>> reporting options for report zones, nor is the "all" bit supported for
>>> open,
>>> close or finish zone commands. But the information provided on zones is
>>> complete
>>> and maps to the standard definitions.
>>
>>
>> For the reporting options I have planned to reuse the stream_id in
>> struct bio when that is formalized. There are certainly other places in
>> struct bio to stuff a few extra bits ...

Sorry I was confused here. I was under the impression you were talking
about one of my patches when you seem to have been talking about
your hacking thereof.

> We could add reporting options to 

Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-15 Thread Shaun Tancheff
On Mon, Aug 15, 2016 at 11:00 PM, Damien Le Moal  wrote:
>
> Shaun,
>
>> On Aug 14, 2016, at 09:09, Shaun Tancheff  wrote:
> […]

>>> No, surely not.
>>> But one of the _big_ advantages for the RB tree is blkdev_discard().
>>> Without the RB tree any mkfs program will issue a 'discard' for every
>>> sector. We will be able to coalesce those into one discard per zone, but
>>> we still need to issue one for _every_ zone.
>>
>> How can you make coalesce work transparently in the
>> sd layer _without_ keeping some sort of a discard cache along
>> with the zone cache?
>>
>> Currently the block layer's blkdev_issue_discard() is breaking
>> large discards into nice granular and aligned chunks but it is
>> not preventing small discards nor coalescing them.
>>
>> In the sd layer would there be way to persist or purge an
>> overly large discard cache? What about honoring
>> discard_zeroes_data? Once the discard is completed with
>> discard_zeroes_data you have to return zeroes whenever
>> a discarded sector is read. Isn't that a lot more than just
>> tracking a write pointer? Couldn't a zone have dozens of holes?
>
> My understanding of the standards regarding discard is that it is not
> mandatory and that it is a hint to the drive. The drive can completely
> ignore it if it thinks that is a better choice. I may be wrong on this
> though. Need to check again.

But you are currently setting discard_zeroes_data=1 in your
current patches. I believe that setting discard_zeroes_data=1
effectively promotes discards to being mandatory.

I have a follow on patch to my SCT Write Same series that
handles the CMR zone case in the sd_zbc_setup_discard() handler.

> For reset write pointer, the mapping to discard requires that the calls
> to blkdev_issue_discard be zone aligned for anything to happen. Specify
> less than a zone and nothing will be done. This I think preserves the
> discard semantic.

Oh. If that is the intent then there is just a bug in the handler.
I have pointed out where I believe it to be in my response to
the zone cache patch being posted.

> As for the “discard_zeroes_data” thing, I also think that is a drive
> feature not mandatory. Drives may have it or not, which is consistent
> with the ZBC/ZAC standards regarding reading after write pointer (nothing
> says that zeros have to be returned). In any case, discard of CMR zones
> will be a nop, so for SMR drives, discard_zeroes_data=0 may be a better
> choice.

However I am still curious about discards being coalesced.

>>> Which is (as indicated) really slow, and easily takes several minutes.
>>> With the RB tree we can short-circuit discards to empty zones, and speed
>>> up processing time dramatically.
>>> Sure we could be moving the logic into mkfs and friends, but that would
>>> require us to change the programs and agree on a library (libzbc?) which
>>> should be handling that.
>>
>> F2FS's mkfs.f2fs is already reading the zone topology via SG_IO ...
>> so I'm not sure your argument is valid here.
>
> This initial SMR support patch is just that: a first try. Jaegeuk
> used SG_IO (in fact copy-paste of parts of libzbc) because the current
> ZBC patch-set has no ioctl API for zone information manipulation. We
> will fix this mkfs.f2fs once we agree on an ioctl interface.

Which again is my point. If mkfs.f2fs wants to speed up its
discard pass by _not_ sending unnecessary
Reset WP commands for zones that are already empty, it has all the
information it needs to do so.

Here it seems to me that the zone cache is _at_best_
doing double work. At worst the zone cache could be
doing the wrong thing _if_ the zone cache got out of sync.
It is certainly possible (however unlikely) that someone was
doing some raw sg activity that is not seen by the sd path.

All I am trying to do is have a discussion about the reasons for
and against having a zone cache. Where it works and where it breaks
should be entirely technical, but I understand that we have all
spent a lot of time _not_ discussing this for various non-technical
reasons.

So far the only reason I've been able to ascertain is that
Host Managed drives really don't like being stuck with the
URSWRZ and would like to have a software hack to return
MUD rather than ship drives with some weird out-of-the-box
config where the last zone is marked as FINISH'd, thereby
returning MUD on reads as per spec.

I understand that it would be a strange state to see on first
boot and likely people would just do a ResetWP and have
weird boot errors, which would probably just make matters
worse.

I would just rather the workaround be a bit cleaner and/or
use less memory. I would also like a path available that
does not require SD_ZBC or BLK_ZONED for Host Aware
drives to work, hence this set of patches and me begging
for a single bit in struct bio.

>>
>> [..]
>>
> 3) Try to condense the blkzone data structure to save memory:
> I think that we can at the very least remove the zone length, and also
> may be 

Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-15 Thread Damien Le Moal

Shaun,

> On Aug 14, 2016, at 09:09, Shaun Tancheff  wrote:
[…]
>>> 
>> No, surely not.
>> But one of the _big_ advantages for the RB tree is blkdev_discard().
>> Without the RB tree any mkfs program will issue a 'discard' for every
>> sector. We will be able to coalesce those into one discard per zone, but
>> we still need to issue one for _every_ zone.
> 
> How can you make coalesce work transparently in the
> sd layer _without_ keeping some sort of a discard cache along
> with the zone cache?
> 
> Currently the block layer's blkdev_issue_discard() is breaking
> large discards into nice granular and aligned chunks but it is
> not preventing small discards nor coalescing them.
> 
> In the sd layer would there be way to persist or purge an
> overly large discard cache? What about honoring
> discard_zeroes_data? Once the discard is completed with
> discard_zeroes_data you have to return zeroes whenever
> a discarded sector is read. Isn't that a lot more than just
> tracking a write pointer? Couldn't a zone have dozens of holes?

My understanding of the standards regarding discard is that it is not
mandatory and that it is a hint to the drive. The drive can completely
ignore it if it thinks that is a better choice. I may be wrong on this
though. Need to check again.
For reset write pointer, the mapping to discard requires that the calls
to blkdev_issue_discard be zone aligned for anything to happen. Specify
less than a zone and nothing will be done. This I think preserves the
discard semantic.

As for the “discard_zeroes_data” thing, I also think that is a drive
feature not mandatory. Drives may have it or not, which is consistent
with the ZBC/ZAC standards regarding reading after write pointer (nothing
says that zeros have to be returned). In any case, discard of CMR zones
will be a nop, so for SMR drives, discard_zeroes_data=0 may be a better
choice.

> 
>> Which is (as indicated) really slow, and easily takes several minutes.
>> With the RB tree we can short-circuit discards to empty zones, and speed
>> up processing time dramatically.
>> Sure we could be moving the logic into mkfs and friends, but that would
>> require us to change the programs and agree on a library (libzbc?) which
>> should be handling that.
> 
> F2FS's mkfs.f2fs is already reading the zone topology via SG_IO ...
> so I'm not sure your argument is valid here.

This initial SMR support patch is just that: a first try. Jaegeuk
used SG_IO (in fact copy-paste of parts of libzbc) because the current
ZBC patch-set has no ioctl API for zone information manipulation. We
will fix this mkfs.f2fs once we agree on an ioctl interface.

> 
> [..]
> 
 3) Try to condense the blkzone data structure to save memory:
 I think that we can at the very least remove the zone length, and also
 may be the per zone spinlock too (a single spinlock and proper state flags 
 can
 be used).
>>> 
>>> I have a variant that is an array of descriptors that roughly mimics the
>>> api from blk-zoned.c that I did a few months ago as an example.
>>> I should be able to update that to the current kernel + patches.
>>> 
>> Okay. If we restrict the in-kernel SMR drive handling to devices with
>> identical zone sizes of course we can remove the zone length.
>> And we can do away with the per-zone spinlock and use a global one instead.
> 
> I don't think dropping the zone length is a reasonable thing to do.
> 
> What I propose is an array of _descriptors_ it doesn't drop _any_
> of the zone information that you are holding in an RB tree, it is
> just a condensed format that _mostly_ plugs into your existing
> API.

I do not agree. The Seagate drive already has one zone (the last one)
that is not the same length as the other zones. Sure, since it is the
last one, we can add “if (last zone)” all over the place and make it
work. But that is really ugly. Keeping the length field makes the code
generic and following the standard, which has no restriction on the
zone sizes. We could do some memory optimisation using different types
of blk_zone structs, the types mapping to the SAME value: drives with
constant zone size can use a blk_zone type without the length field,
others use a different type that includes the field. Accessor functions
can hide the different types in the zone manipulation code.
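
As a minimal sketch of that accessor idea (names and the chunk_sectors
test are illustrative only, not from any posted patch), the length lookup
could hide whether the per-zone field exists:

/* Illustration only: hide the presence/absence of a per-zone length. */
struct blk_zone_fixed {		/* constant zone size (SAME != 0) */
	u64 wp;
	u8 type;
	u8 cond;
};

struct blk_zone_variable {	/* non-constant zone size (SAME == 0) */
	u64 wp;
	u64 len;
	u8 type;
	u8 cond;
};

static inline sector_t example_zone_len(struct request_queue *q,
					const void *zone)
{
	/* Constant-size layout: length is a queue-wide property. */
	if (q->limits.chunk_sectors)
		return q->limits.chunk_sectors;
	/* Otherwise fall back to the per-zone field. */
	return ((const struct blk_zone_variable *)zone)->len;
}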

Best regards.


-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
damien.lem...@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com


Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-14 Thread Shaun Tancheff
On Tue, Aug 9, 2016 at 1:47 AM, Hannes Reinecke  wrote:
> On 08/05/2016 10:35 PM, Shaun Tancheff wrote:
>> On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal  
>> wrote:
>>> Hannes, Shaun,
>>>
>>> Let me add some more comments.
>>>
 On Aug 2, 2016, at 23:35, Hannes Reinecke  wrote:

 On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig  wrote:
>>
>> Can you please integrate this with Hannes series so that it uses
>> his cache of the zone information?
>
> Adding Hannes and Damien to Cc.
>
> Christoph,
>
> I can make a patch the marshal Hannes' RB-Tree into to a block report, 
> that is
> quite simple. I can even have the open/close/reset zone commands update 
> the
> RB-Tree .. the non-private parts anyway. I would prefer to do this around 
> the
> CONFIG_SD_ZBC support, offering the existing type of patch for setups 
> that do
> not need the RB-Tree to function with zoned media.
>>
>> I have posted patches to integrate with the zone cache, hopefully they
>> make sense.
>>
> [ .. ]
 I have thought about condensing the RB tree information, but then I
 figured that for 'real' SMR handling we cannot assume all zones are of
 fixed size, and hence we need all the information there.
 Any condensing method would assume a given structure of the zones, which
 the standard just doesn't provide.
 Or am I missing something here?

Of course you can condense the zone cache without losing any
information. Here is the layout I used ... I haven't updated the patch
to the latest posted patches but this is the basic idea.

[It was originally done as a follow on of making your zone cache work
 with Seagate's HA drive. I did not include the wp-in-arrays patch
 along with the HA drive support that I sent you in May as you were
 quite terse about RB trees when I tried to discuss this approach with
 you at Vault]

struct blk_zone {
	unsigned type:4;
	unsigned state:5;
	unsigned extra:7;
	unsigned wp:40;
	void *private_data;
};

struct contiguous_wps {
	u64 start_lba;
	u64 last_lba; /* or # of blocks */
	u64 zone_size; /* size in blocks */
	unsigned is_zoned:1;
	u32 zone_count;
	spinlock_t lock;
	struct blk_zone zones[0];
};

struct zone_wps {
	u32 wps_count;
	struct contiguous_wps **wps;
};

Then in struct request_queue
-	struct rb_root zones;
+	struct zone_wps *zones;

For each contiguous chunk of zones you need a descriptor. In the current
drives you need 1 or 2 descriptors.

Here a conventional drive is encapsulated as zoned media with one
drive sized conventional zone.
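
For illustration (this helper is not part of the patch; the div64_u64 use
and the zeroed initial state are assumptions), building one descriptor for
a run of equal-sized zones could look like:

/* Sketch: allocate one descriptor covering [start, end) with equal zones. */
static struct contiguous_wps *example_alloc_wps(u64 start, u64 end,
						u64 zone_size)
{
	u32 count = div64_u64(end - start, zone_size);
	struct contiguous_wps *cwps;

	cwps = kzalloc(sizeof(*cwps) + count * sizeof(struct blk_zone),
		       GFP_KERNEL);
	if (!cwps)
		return NULL;

	cwps->start_lba = start;
	cwps->last_lba = end;
	cwps->zone_size = zone_size;
	cwps->zone_count = count;
	cwps->is_zoned = 1;
	spin_lock_init(&cwps->lock);
	/* zones[] starts zeroed: wp == 0, i.e. empty, for every zone. */
	return cwps;
}

A conventional drive then becomes a single descriptor whose zone_size
equals the capacity and whose is_zoned bit is clear.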

I have not spent time building an ad-hoc LVM comprised of zoned and
conventional media so it's not all ironed out yet.
I think you can see the advantage of being able to put conventional space
anywhere you would like to work around zoned media not being laid out
in the best manner for your setup.

Yes things start to break down if every other zone is a different size ..

The point being that even with zone descriptors on the order of 48 bytes
in size this saves a lot of space with no loss of information.
I still kind of prefer pushing blk_zone down to a u32 by reducing
the max zone size and dropping the private_data ... but that may
be going a bit too far.
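
For example (the field widths here are just one possible split, not a
proposal from the patches), squeezing the descriptor into a u32 would look
something like:

/*
 * Sketch: a 32-bit zone descriptor; wp is zone-relative in 512B sectors,
 * so 23 bits caps the zone size at 4 GiB. No private_data pointer.
 */
struct blk_zone_packed {
	unsigned type:4;
	unsigned state:5;
	unsigned wp:23;
};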

blk_lookup_zone then has an [unfortunate] signature change:


/**
 * blk_lookup_zone() - Lookup zones
 * @q: Request Queue
 * @sector: Location to lookup
 * @start: Starting location zone (OUT: Required)
 * @len: Length of zone (OUT: Required)
 * @lock: Spinlock of zones (OUT: Required)
 */
struct blk_zone *blk_lookup_zone(struct request_queue *q, sector_t sector,
				 sector_t *start, sector_t *len,
				 spinlock_t **lock)
{
	int iter;
	struct blk_zone *bzone = NULL;
	struct zone_wps *zi = q->zones;

	*start = 0;
	*len = 0;
	*lock = NULL;

	if (!q->zones)
		goto out;

	for (iter = 0; iter < zi->wps_count; iter++) {
		if (sector >= zi->wps[iter]->start_lba &&
		    sector <  zi->wps[iter]->last_lba) {
			struct contiguous_wps *wp = zi->wps[iter];
			u64 index = (sector - wp->start_lba) / wp->zone_size;

			BUG_ON(index >= wp->zone_count);

			bzone = &wp->zones[index];
			*len = wp->zone_size;
			*start = wp->start_lba + (index * wp->zone_size);
			*lock = &wp->lock;
		}
	}

out:
	return bzone;
}
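
A hypothetical caller (illustration only; it assumes wp is stored
zone-relative) would use the OUT parameters like this:

/* Sketch: check whether a write at 'sector' lands on the zone's WP. */
static bool example_write_at_wp(struct request_queue *q, sector_t sector)
{
	sector_t start, len;
	spinlock_t *lock;
	struct blk_zone *zone;
	bool at_wp;

	zone = blk_lookup_zone(q, sector, &start, &len, &lock);
	if (!zone)
		return false;

	spin_lock(lock);
	at_wp = (start + zone->wp == sector);	/* assumes zone-relative wp */
	spin_unlock(lock);

	return at_wp;
}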

>>>
>>> Indeed, the standards do not mandate any particular zone configuration,
>>> constant zone size, etc. So writing code so that can be handled is certainly
>>> the right way of doing things. However, if we decide to go forward with
>>> mapping 

Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-09 Thread Damien Le Moal

Shaun,

On 8/10/16 12:58, Shaun Tancheff wrote:

On Tue, Aug 9, 2016 at 3:09 AM, Damien Le Moal  wrote:

On Aug 9, 2016, at 15:47, Hannes Reinecke  wrote:


[trim]


Since disk type == 0 for everything that isn't HM so I would prefer the
sysfs 'zoned' file just report if the drive is HA or HM.


Okay. So let's put in the 'zoned' attribute the device type:
'host-managed', 'host-aware', or 'device managed'.


I hacked your patches and simply put a "0" or "1" in the sysfs zoned file.
Any drive that has ZBC/ZAC command support gets a "1", "0" for everything
else. This means that drive managed models are not exposed as zoned block
devices. For HM vs HA differentiation, an application can look at the
device type file since it is already present.

We could indeed set the "zoned" file to the device type, but HM drives and
regular drives will both have "0" in it, so no differentiation possible.
The other choice could be the "zoned" bits defined by ZBC, but these
do not define a value for host managed drives, and the drive managed value
being not "0" could be confusing too. So I settled for a simple 0/1 boolean.


This seems good to me.


Another option I forgot is for the "zoned" file to indicate the total 
number of zones of the device, and 0 for a non zoned regular block 
device. That would work as well.


[...]

Done: I hacked Shaun's ioctl code and added finish zone too. The
difference with Shaun's initial code is that the ioctls are propagated down to
the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need for
a BIO request definition for the zone operations. So a lot less code added.


The purpose of the BIO flags is not to enable the ioctls so much as
the other way round. Creating BIO op's is to enable issuing ZBC
commands from device mapper targets and file systems without some
heinous ioctl hacks.
Making the resulting block layer interfaces available via ioctls is just a
reasonable way to exercise the code ... or that was my intent.


Yes, I understood your code. However, since (or if) we keep the zone 
information in the RB-tree cache, there is no need for the report zone 
operation BIO interface. Same for reset write pointer by keeping the 
mapping to discard. blk_lookup_zone can be used in kernel as a report 
zone BIO replacement and works as well for the report zone ioctl 
implementation. For reset, there is blkdev_issue_discard in kernel, and 
the reset zone ioctl becomes equivalent to BLKDISCARD ioctl. These are 
simple. Open, close and finish zone remains. For these, adding the BIO 
interface seemed an overkill. Hence my choice of propagating the ioctl 
to the driver.
This is debatable of course, and adding an in-kernel interface is not 
hard: we can implement blk_open_zone, blk_close_zone and blk_finish_zone 
using __blkdev_driver_ioctl. That looks clean to me.
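
Something along these lines, as a rough sketch only (the BLKOPENZONE /
BLKFINISHZONE numbers and the convention of passing the zone start sector
as the argument are assumptions, not a defined interface):

/* Sketch: in-kernel zone op wrappers forwarding to the driver ioctl path. */
static inline int blk_open_zone(struct block_device *bdev, sector_t sector)
{
	return __blkdev_driver_ioctl(bdev, FMODE_WRITE, BLKOPENZONE,
				     (unsigned long)sector);
}

static inline int blk_finish_zone(struct block_device *bdev, sector_t sector)
{
	return __blkdev_driver_ioctl(bdev, FMODE_WRITE, BLKFINISHZONE,
				     (unsigned long)sector);
}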


Overall, my concern with the BIO based interface for the ZBC commands is 
that it adds one flag for each command, which is not really the 
philosophy of the interface and potentially opens the door for more such 
implementations in the future with new standards and new commands coming 
up. Clearly that is not a sustainable path. So I think that a more 
specific interface for these zone operations is a better choice. That is 
consistent with what happens with the tons of ATA and SCSI commands not 
actually doing data I/Os (mode sense, log pages, SMART, etc). All these 
do not use BIOs and are processed as request REQ_TYPE_BLOCK_PC.



The ioctls do not mimic exactly the ZBC standard. For instance, there is no
reporting options for report zones, nor is the "all" bit supported for open,
close or finish zone commands. But the information provided on zones is complete
and maps to the standard definitions.


For the reporting options I have planned to reuse the stream_id in
struct bio when that is formalized. There are certainly other places in
struct bio to stuff a few extra bits ...


We could add reporting options to blk_lookup_zones to filter the result 
and use that in the ioctl implementation as well. This can be added 
without any problem.



As far as the all bit ... this is being handled by all the zone action
commands. Just pass a sector of ~0ul and it's handled in sd.c by
sd_setup_zone_action_cmnd().

Apologies as apparently my documentation here is lacking :-(


Yes, I got it (libzbc does the same actually). I did not add it for 
simplicity. But indeed may be it should be. The same trick can be used 
with the ioctl to driver interface. No problems.



I also added a reset_wp ioctl for completeness, but this one simply calls
blkdev_issue_discard internally, so it is in fact equivalent to the BLKDISCARD
ioctl call with a range exactly aligned on a zone.


I'm confused as my patch set includes a Reset WP (BLKRESETZONE) that
creates a REQ_OP_ZONE_RESET .. same as open and close. My
expectation being that BLKDISCARD doesn't really need yet another alias.


Yes, we could remove the BLKRESETZONE ioctl and have applications use 

Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-09 Thread Shaun Tancheff
On Tue, Aug 9, 2016 at 3:09 AM, Damien Le Moal  wrote:
>> On Aug 9, 2016, at 15:47, Hannes Reinecke  wrote:

[trim]

>>> Since disk type == 0 for everything that isn't HM so I would prefer the
>>> sysfs 'zoned' file just report if the drive is HA or HM.
>>>
>> Okay. So let's put in the 'zoned' attribute the device type:
>> 'host-managed', 'host-aware', or 'device managed'.
>
> I hacked your patches and simply put a "0" or "1" in the sysfs zoned file.
> Any drive that has ZBC/ZAC command support gets a "1", "0" for everything
> else. This means that drive managed models are not exposed as zoned block
> devices. For HM vs HA differentiation, an application can look at the
> device type file since it is already present.
>
> We could indeed set the "zoned" file to the device type, but HM drives and
> regular drives will both have "0" in it, so no differentiation possible.
> The other choice could be the "zoned" bits defined by ZBC, but these
> do not define a value for host managed drives, and the drive managed value
> being not "0" could be confusing too. So I settled for a simple 0/1 boolean.

This seems good to me.

 2) Add ioctls for zone management:
 Report zones (get information from RB tree), reset zone (simple wrapper
 to ioctl for block discard), open zone, close zone and finish zone. That
 will allow mkfs like tools to get zone information without having to parse
 thousands of sysfs files (and can also be integrated in libzbc block 
 backend
 driver for a unified interface with the direct SG_IO path for kernels 
 without
 the ZBC code enabled).
>>>
>>> I can add finish zone ... but I really can't think of a use for it, myself.
>>>
>> Which is not the point. The standard defines this, so clearly someone
>> found it a reasonable addendum. So let's add this for completeness.

Agreed.

> Done: I hacked Shaun's ioctl code and added finish zone too. The
> difference with Shaun's initial code is that the ioctls are propagated down to
> the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need for
> a BIO request definition for the zone operations. So a lot less code added.

The purpose of the BIO flags is not to enable the ioctls so much as
the other way round. Creating BIO op's is to enable issuing ZBC
commands from device mapper targets and file systems without some
heinous ioctl hacks.
Making the resulting block layer interfaces available via ioctls is just a
reasonable way to exercise the code ... or that was my intent.

> The ioctls do not mimic the ZBC standard exactly. For instance, there are no
> reporting options for report zones, nor is the "all" bit supported for open,
> close or finish zone commands. But the information provided on zones is
> complete and maps to the standard definitions.

For the reporting options I have planned to reuse the stream_id in
struct bio when that is formalized. There are certainly other places in
struct bio to stuff a few extra bits ...

As far as the all bit ... this is being handled by all the zone action
commands. Just pass a sector of ~0ul and it's handled in sd.c by
sd_setup_zone_action_cmnd().

Apologies as apparently my documentation here is lacking :-(

> I also added a reset_wp ioctl for completeness, but this one simply calls
> blkdev_issue_discard internally, so it is in fact equivalent to the BLKDISCARD
> ioctl call with a range exactly aligned on a zone.

I'm confused as my patch set includes a Reset WP (BLKRESETZONE) that
creates a REQ_OP_ZONE_RESET .. same as open and close. My
expectation being that BLKDISCARD doesn't really need yet another alias.
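
For reference, resetting a single zone from userspace today is just a
BLKDISCARD over the zone -- a minimal sketch (the zone start/length values
are assumed to come from a zone report; the 256 MiB size is only an example):

#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>          /* BLKDISCARD */

int main(int argc, char **argv)
{
        uint64_t zone_start = 0;                /* byte offset of the zone */
        uint64_t zone_len   = 256ULL << 20;     /* example: 256 MiB zone   */
        uint64_t range[2]   = { zone_start, zone_len };
        int fd;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_RDWR);
        if (fd < 0)
                return 1;
        /* A discard range exactly aligned on a zone is the "reset write
         * pointer" equivalent on drives where discard maps to that command. */
        if (ioctl(fd, BLKDISCARD, &range))
                perror("BLKDISCARD");
        close(fd);
        return 0;
}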

[trim]

> Did that too. The blk_zone struct is now exactly 64B. I removed the per zone

Thanks .. being a cache line is harder to whinge about...

> spinlock and replaced it with a flag so that zones can still be locked
> independently using wait_on_bit_lock & friends (I added the functions
> blk_zone_lock, blk_zone_trylock and blk_zone_unlock to do that). This per zone
> locking comes in handy to implement the ioctls code without stalling the
> command queue for read, writes and discard on different zones.
>
> I however kept the zone length in the structure. The reason for doing so is
> that this allows supporting drives with non-constant zone sizes, albeit with
> a more limited interface since in such case the device chunk_sectors is not
> set (that is how a user or application can detect that the zones are not
> constant size). For these drives, the block layer may happily merge BIOs
> across zone boundaries and the discard code will not split and align calls
> on the zones. But upper layers (an FS or a device mapper) can still do all
> this by themselves if they want/can support non-constant zone sizes.
>
> The only exception is drives like the Seagate one with only the last zone of a
> different size. This case is handled exactly as if all zones are the same size
> simply because any operation on the last smaller zone will naturally align 

Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-09 Thread Shaun Tancheff
On Tue, Aug 9, 2016 at 1:47 AM, Hannes Reinecke  wrote:
> On 08/05/2016 10:35 PM, Shaun Tancheff wrote:
>> On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal wrote:
 On Aug 2, 2016, at 23:35, Hannes Reinecke  wrote:
 On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig  wrote:

[trim]
>> Also the zone report is 'slow' in that there is an overhead for the report
>> itself, but the number of zones per query can be quite large, so 4 or 5 I/Os
>> that run into the hundreds of milliseconds to cache the entire drive isn't
>> really unworkable for something that is used infrequently.
>>
> No, surely not.
> But one of the _big_ advantages for the RB tree is blkdev_discard().
> Without the RB tree any mkfs program will issue a 'discard' for every
> sector. We will be able to coalesce those into one discard per zone, but
> we still need to issue one for _every_ zone.
> Which is (as indicated) really slow, and easily takes several minutes.
> With the RB tree we can short-circuit discards to empty zones, and speed
> up processing time dramatically.
> Sure we could be moving the logic into mkfs and friends, but that would
> require us to change the programs and agree on a library (libzbc?) which
> should be handling that.

Adding an additional library dependency seems overkill for a program
that is already doing ioctls and raw block I/O ... but I would leave that
up to each file system. As it sits, issuing the ioctl and walking the array
of data returned [see blkreport.c] is already quite trivial.
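
For the curious, the walk really is that small -- a sketch in which struct
bzone stands in for whatever descriptor the report ioctl ends up returning
(the field names below are illustrative, not the actual ABI):

#include <stdint.h>
#include <stdio.h>

struct bzone {                  /* hypothetical zone descriptor */
        uint64_t start;         /* zone start, in 512-byte sectors  */
        uint64_t len;           /* zone length, in 512-byte sectors */
        uint64_t wp;            /* write pointer, in 512-byte sectors */
        uint8_t  type;          /* conventional / seq-required / seq-preferred */
        uint8_t  cond;          /* empty, open, full, ... */
};

/* Walk a report array (as filled in by the report-zones ioctl) and dump it. */
static void dump_zones(const struct bzone *z, unsigned int nr)
{
        for (unsigned int i = 0; i < nr; i++)
                printf("zone %u: start %llu len %llu wp %llu type %u cond %u\n",
                       i, (unsigned long long)z[i].start,
                       (unsigned long long)z[i].len,
                       (unsigned long long)z[i].wp,
                       z[i].type, z[i].cond);
}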

I believe the goal here is for F2FS, and perhaps NILFS, to "just
work" with the DISCARD-to-Reset-WP mapping and zone cache in place.

I am still quite skeptical about other common file systems
"just working" without their respective mkfs et al. being
zone aware and handling the topology of the media at mkfs time.
Perhaps there is something I am unaware of?

[trim]

>> I can add finish zone ... but I really can't think of a use for it, myself.
>>
> Which is not the point. The standard defines this, so clearly someone
> found it a reasonable addendum. So let's add this for completeness.

Agreed and queued for the next version.

Regards,
Shaun


Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-09 Thread Damien Le Moal
Hannes,

> On Aug 9, 2016, at 15:47, Hannes Reinecke  wrote:
[...]
>>> 
>>> Can we agree on an interface ?
>>> Summarizing all the discussions we had, I am all in favor of the following:
>>> 
>>> 1) A "zone_size" sysfs attribute to indicate that a drive is zoned:
>>> The already existing device type file can tell us if this is an host
>>> managed disk (type==20) or a host aware disk (type==0). Drive managed
>>> disks could also be detected, but since no information on their zone
>>> configuration can be obtained, lets treat those as regular drives.
>> 
>> Since disk type == 0 for everything that isn't HM so I would prefer the
>> sysfs 'zoned' file just report if the drive is HA or HM.
>> 
> Okay. So let's put in the 'zoned' attribute the device type:
> 'host-managed', 'host-aware', or 'device managed'.

I hacked your patches and simply put a "0" or "1" in the sysfs zoned file.
Any drive that has ZBC/ZAC command support gets a "1", "0" for everything
else. This means that drive managed models are not exposed as zoned block
devices. For HM vs HA differentiation, an application can look at the
device type file since it is already present.

We could indeed set the "zoned" file to the device type, but HM drives and
regular drives will both have "0" in it, so no differentiation possible.
The other choice could be the "zoned" bits defined by ZBC, but these
do not define a value for host managed drives, and the drive managed value
being not "0" could be confusing too. So I settled for a simple 0/1 boolean.

> 
>>> 2) Add ioctls for zone management:
>>> Report zones (get information from RB tree), reset zone (simple wrapper
>>> to ioctl for block discard), open zone, close zone and finish zone. That
>>> will allow mkfs like tools to get zone information without having to parse
>>> thousands of sysfs files (and can also be integrated in libzbc block backend
>>> driver for a unified interface with the direct SG_IO path for kernels
>>> without the ZBC code enabled).
>> 
>> I can add finish zone ... but I really can't think of a use for it, myself.
>> 
> Which is not the point. The standard defines this, so clearly someone
> found it a reasonable addendum. So let's add this for completeness.

Done: I hacked Shaun's ioctl code and added finish zone too. The
difference with Shaun's initial code is that the ioctls are propagated down to
the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need for a
BIO request definition for the zone operations. So a lot less code added.
The ioctls do not mimic the ZBC standard exactly. For instance, there are no
reporting options for report zones, nor is the "all" bit supported for open,
close or finish zone commands. But the information provided on zones is
complete and maps to the standard definitions.
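
For in-kernel users the same propagation can be wrapped trivially -- a sketch
only, with a placeholder ioctl number (BLKOPENZONE below is not a defined
constant):

#include <linux/blkdev.h>
#include <linux/fs.h>

#define BLKOPENZONE  _IO(0x12, 200)     /* placeholder number, illustration only */

/*
 * Sketch of an in-kernel wrapper built directly on __blkdev_driver_ioctl():
 * the request is handed straight down to the driver's ioctl method
 * (sd_ioctl for SCSI disks), no BIO op involved.
 */
static inline int blk_open_zone(struct block_device *bdev, sector_t sector)
{
        return __blkdev_driver_ioctl(bdev, FMODE_WRITE, BLKOPENZONE,
                                     (unsigned long)sector);
}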

I also added a reset_wp ioctl for completeness, but this one simply calls
blkdev_issue_discard internally, so it is in fact equivalent to the BLKDISCARD
ioctl call with a range exactly aligned on a zone.

> 
>>> 3) Try to condense the blkzone data structure to save memory:
>>> I think that we can at the very least remove the zone length, and also
>>> may be the per zone spinlock too (a single spinlock and proper state flags
>>> can be used).
>> 
>> I have a variant that is an array of descriptors that roughly mimics the
>> api from blk-zoned.c that I did a few months ago as an example.
>> I should be able to update that to the current kernel + patches.
>> 
> Okay. If we restrict the in-kernel SMR drive handling to devices with
> identical zone sizes of course we can remove the zone length.
> And we can do away with the per-zone spinlock and use a global one instead.

Did that too. The blk_zone struct is now exactly 64B. I removed the per zone
spinlock and replaced it with a flag so that zones can still be locked
independently using wait_on_bit_lock & friends (I added the functions
blk_zone_lock, blk_zone_trylock and blk_zone_unlock to do that). This per zone
locking comes in handy to implement the ioctls code without stalling the command
queue for read, writes and discard on different zones.
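
For reference, a rough sketch of what a 64-byte zone entry plus the
wait_on_bit_lock-based helpers could look like. The field layout is a guess
for illustration only, not the actual structure of the series:

#include <linux/types.h>
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/bitops.h>

enum { BLK_ZONE_LOCKED = 0 };           /* bit number inside zone->flags */

struct blk_zone {                       /* illustrative 64-byte layout */
        sector_t        start;          /*  8 bytes */
        sector_t        len;            /*  8 bytes */
        sector_t        wp;             /*  8 bytes */
        unsigned int    type;           /*  4 bytes */
        unsigned int    cond;           /*  4 bytes */
        unsigned long   flags;          /*  8 bytes, bit 0 = zone locked */
        void            *private_data;  /*  8 bytes */
        unsigned long   pad[2];         /* pad out to 64 bytes */
};

static inline void blk_zone_lock(struct blk_zone *zone)
{
        wait_on_bit_lock(&zone->flags, BLK_ZONE_LOCKED, TASK_UNINTERRUPTIBLE);
}

static inline int blk_zone_trylock(struct blk_zone *zone)
{
        return !test_and_set_bit(BLK_ZONE_LOCKED, &zone->flags);
}

static inline void blk_zone_unlock(struct blk_zone *zone)
{
        clear_bit_unlock(BLK_ZONE_LOCKED, &zone->flags);
        smp_mb__after_atomic();
        wake_up_bit(&zone->flags, BLK_ZONE_LOCKED);
}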

I however kept the zone length in the structure. The reason for doing so is that
this allows supporting drives with non-constant zone sizes, albeit with a more
limited interface since in such case the device chunk_sectors is not set (that
is how a user or application can detect that the zones are not constant size).
For these drives, the block layer may happily merge BIOs across zone boundaries
and the discard code will not split and align calls on the zones. But upper
layers (an FS or a device mapper) can still do all this by themselves if they want/can
support non-constant zone sizes.
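
A minimal sketch of that check on the kernel side, assuming the convention
described above (chunk_sectors left at zero when zones are not all the same
size); the helper name is illustrative:

#include <linux/blkdev.h>

static inline bool blk_queue_uniform_zones(struct request_queue *q)
{
        /* Zero chunk_sectors on a zoned device: zone-aligned splitting and
         * discard alignment are the caller's problem. */
        return q->limits.chunk_sectors != 0;
}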

The only exception is drives like the Seagate one with only the last zone of a
different size. This case is handled exactly as if all zones are the same size
simply because any operation on the last smaller zone will naturally align as
the checks of operation size against the drive capacity will 

Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-09 Thread Hannes Reinecke
On 08/05/2016 10:35 PM, Shaun Tancheff wrote:
> On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal  wrote:
>> Hannes, Shaun,
>>
>> Let me add some more comments.
>>
>>> On Aug 2, 2016, at 23:35, Hannes Reinecke  wrote:
>>>
>>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
 On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig  wrote:
>
> Can you please integrate this with Hannes series so that it uses
> his cache of the zone information?

 Adding Hannes and Damien to Cc.

 Christoph,

 I can make a patch that marshals Hannes' RB-Tree into a block report, that is
 quite simple. I can even have the open/close/reset zone commands update the
 RB-Tree .. the non-private parts anyway. I would prefer to do this around the
 CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
 not need the RB-Tree to function with zoned media.
> 
> I have posted patches to integrate with the zone cache, hopefully they
> make sense.
> 
[ .. ]
>>> I have thought about condensing the RB tree information, but then I
>>> figured that for 'real' SMR handling we cannot assume all zones are of
>>> fixed size, and hence we need all the information there.
>>> Any condensing method would assume a given structure of the zones, which
>>> the standard just doesn't provide.
>>> Or am I missing something here?
>>
>> Indeed, the standards do not mandate any particular zone configuration,
>> constant zone size, etc. So writing code so that can be handled is certainly
>> the right way of doing things. However, if we decide to go forward with
>> mapping RESET WRITE POINTER command to DISCARD, then at least a constant
>> zone size (minus the last zone as you said) must be assumed, and that
>> information can be removed from the entries in the RB tree (as it will be
>> saved for the sysfs "zone_size" file anyway. Adding a little code to handle
>> that eventual last runt zone with a different size is not a big problem.
> 
>>> As for write pointer handling: yes, the write pointer on the zones is
>>> not really useful for upper-level usage.
>>> Where we do need it is to detect I/O which is crossing the write pointer
>>> (eg when doing reads over the entire zone).
>>> As per spec you will be getting an I/O error here, so we need to split
>>> the I/O on the write pointer to get valid results back.
>>
>> To be precise here, the I/O splitting will be handled by the block layer
>> thanks to the "chunk_sectors" setting. But that relies on a constant zone
>> size assumption too.
>>
>> The RB-tree here is most useful for reads over or after the write pointer as
>> this can have different behavior on different drives (URSWRZ bit). The RB-tree
>> allows us to hide these differences to upper layers and simplify support at
>> those levels.
> 
> It is unfortunate that path was chosen rather than returning Made Up Data.
But how would you return made up data without knowing that you _have_ to
generate some?
Of course it's trivial to generate made up data once an I/O error
occurred, but that's too late as the I/O error already happened.

The general idea behind this is that I _really_ do want to avoid
triggering an I/O error on initial access. This kind of thing really
tends to freak out users (connect a new drive and getting I/O errors;
not the best user experience ...)

> However I don't think this is a file system or device mapper problem as
> neither of those layers attempt to read blocks they have not written
> (or discarded).
> All I can think of is something that probes the drive media for a partition table
> or other identification...isn't the conventional space sufficient to cover
> those cases?
Upon device scan the kernel will attempt to read the partition tables,
which consists of reading the first few megabytes and the last sector
(for GPT shadow partition tables).
And depending on the drive manufacturer you might or might not have
enough conventional space at the right locations.
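
For illustration, the kind of short-circuit the zone cache makes possible --
a sketch only, not the actual series: a read that lies entirely at or above
the cached write pointer can be completed with zeroes instead of ever hitting
the media, so a fresh drive never surfaces an I/O error for such reads.

#include <linux/bio.h>
#include <linux/blkdev.h>

static bool complete_read_above_wp(struct bio *bio, sector_t zone_wp)
{
        if (bio_data_dir(bio) != READ)
                return false;
        if (bio->bi_iter.bi_sector < zone_wp)
                return false;           /* overlaps written data: do real I/O */

        zero_fill_bio(bio);             /* "made up data": all zeroes */
        bio_endio(bio);
        return true;
}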

> Anyway you could just handle the error and sense codes [CHECK CONDITION /
> ILLEGAL REQUEST / ATTEMPT TO READ INVALID DATA] and return MUD
> when URSWRZ is 0. It would have the same effect as using the zone cache
> without the possibility of the zone cache being out-of-sync.
> 
Yes, but then we would already have registered an I/O error.
And as indicated above, I really do _not_ want to handle all the
customer calls for supposed faulty new SMR drives.

>> I too agree that the write pointer value is not very useful to upper layers
>> (e.g. FS). What matters most of the times for these layers is an "allocation
>> pointer" or "submission pointer" which indicates the LBA to use for the next
>> BIO submission. That LBA will most of the time be in advance of the zone WP
>> because of request queing in the block scheduler.
> 
> Which is kind of my point.
> 
>> Note that from what we have done already, in many cases, upper layers do not
>> even need this (e.g. F2FS, btrfs) and work fine without 

Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-05 Thread Shaun Tancheff
On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal  wrote:
> Hannes, Shaun,
>
> Let me add some more comments.
>
>> On Aug 2, 2016, at 23:35, Hannes Reinecke  wrote:
>>
>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig  wrote:

 Can you please integrate this with Hannes series so that it uses
 his cache of the zone information?
>>>
>>> Adding Hannes and Damien to Cc.
>>>
>>> Christoph,
>>>
>>> I can make a patch that marshals Hannes' RB-Tree into a block report, that is
>>> quite simple. I can even have the open/close/reset zone commands update the
>>> RB-Tree .. the non-private parts anyway. I would prefer to do this around the
>>> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
>>> not need the RB-Tree to function with zoned media.

I have posted patches to integrate with the zone cache, hopefully they
make sense.

>>>
>>> I do still have concerns with the approach which I have shared in smaller
>>> forums but perhaps I have to bring them to this group.
>>>
>>> First is the memory consumption. This isn't really much of a concern for
>>> large servers with few drives but I think the embedded NAS market will grumble as
>>> well as the large data pods trying to stuff 300+ drives in a chassis.
>>>
>>> As of now the RB-Tree needs to hold ~30,000 zones.
>>> sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields
>>> around 3.5 MB per zoned drive attached.
>>> Which is fine if it is really needed, but most of it is fixed information
>>> and it can be significantly condensed (I have proposed 8 bytes per zone held
>>> in an array as more than adequate). Worse is that the crucial piece of
>>> information, the current wp needed for scheduling the next write, is mostly
>>> out of date because it is updated only after the write completes and zones
>>> being actively written to must work off of the last location / size that was
>>> submitted, not completed. The work around is for that tracking to be handled
>>> in the private_data member. I am not saying that updating the wp on
>>> completing a write isn’t important, I am saying that the bi_end_io hook is
>>> the existing hook that works just fine.
>>>
>> Which _actually_ is not true; with my patches I'll update the write
>> pointer prior to submit the I/O (on the reasoning that most of the time
>> I/O will succeed) and re-read the zone information if an I/O failed.
>> (Which I'll have to do anyway as after an I/O failure the write pointer
>> status is not clearly defined.)

Apologies for my mis-characterization.

>> I have thought about condensing the RB tree information, but then I
>> figured that for 'real' SMR handling we cannot assume all zones are of
>> fixed size, and hence we need all the information there.
>> Any condensing method would assume a given structure of the zones, which
>> the standard just doesn't provide.
>> Or am I missing something here?
>
> Indeed, the standards do not mandate any particular zone configuration,
> constant zone size, etc. So writing code so that can be handled is certainly
> the right way of doing things. However, if we decide to go forward with
> mapping RESET WRITE POINTER command to DISCARD, then at least a constant
> zone size (minus the last zone as you said) must be assumed, and that
> information can be removed from the entries in the RB tree (as it will be
> saved for the sysfs "zone_size" file anyway. Adding a little code to handle
> that eventual last runt zone with a different size is not a big problem.

>> As for write pointer handling: yes, the write pointer on the zones is
>> not really useful for upper-level usage.
>> Where we do need it is to detect I/O which is crossing the write pointer
>> (eg when doing reads over the entire zone).
>> As per spec you will be getting an I/O error here, so we need to split
>> the I/O on the write pointer to get valid results back.
>
> To be precise here, the I/O splitting will be handled by the block layer
> thanks to the "chunk_sectors" setting. But that relies on a constant zone
> size assumption too.
>
> The RB-tree here is most useful for reads over or after the write pointer as
> this can have different behavior on different drives (URSWRZ bit). The RB-tree
> allows us to hide these differences to upper layers and simplify support at
> those levels.

It is unfortunate that path was chosen rather than returning Made Up Data.
However I don't think this is a file system or device mapper problem as
neither of those layers attempt to read blocks they have not written
(or discarded).
All I can think of is something that probes the drive media for a partition table
or other identification ... isn't the conventional space sufficient to cover
those cases?
Anyway you could just handle the error and sense codes [CHECK CONDITION /
ILLEGAL REQUEST / ATTEMPT TO READ INVALID DATA] and return MUD
when URSWRZ is 0. It would have the same effect as using the zone cache

Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-02 Thread Damien Le Moal
Hannes, Shaun,

Let me add some more comments.

> On Aug 2, 2016, at 23:35, Hannes Reinecke  wrote:
> 
> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig  wrote:
>>> 
>>> Can you please integrate this with Hannes series so that it uses
>>> his cache of the zone information?
>> 
>> Adding Hannes and Damien to Cc.
>> 
>> Christoph,
>> 
>> I can make a patch that marshals Hannes' RB-Tree into a block report, that is
>> quite simple. I can even have the open/close/reset zone commands update the
>> RB-Tree .. the non-private parts anyway. I would prefer to do this around the
>> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
>> not need the RB-Tree to function with zoned media.
>> 
>> I do still have concerns with the approach which I have shared in smaller
>> forums but perhaps I have to bring them to this group.
>> 
>> First is the memory consumption. This isn't really much of a concern for
>> large servers with few drives but I think the embedded NAS market will grumble as
>> well as the large data pods trying to stuff 300+ drives in a chassis.
>> 
>> As of now the RB-Tree needs to hold ~30,000 zones.
>> sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields
>> around 3.5 MB per zoned drive attached.
>> Which is fine if it is really needed, but most of it is fixed information
>> and it can be significantly condensed (I have proposed 8 bytes per zone held
>> in an array as more than adequate). Worse is that the crucial piece of
>> information, the current wp needed for scheduling the next write, is mostly
>> out of date because it is updated only after the write completes and zones
>> being actively written to must work off of the last location / size that was
>> submitted, not completed. The work around is for that tracking to be handled
>> in the private_data member. I am not saying that updating the wp on
>> completing a write isn’t important, I am saying that the bi_end_io hook is
>> the existing hook that works just fine.
>> 
> Which _actually_ is not true; with my patches I'll update the write
> pointer prior to submit the I/O (on the reasoning that most of the time
> I/O will succeed) and re-read the zone information if an I/O failed.
> (Which I'll have to do anyway as after an I/O failure the write pointer
> status is not clearly defined.)
> 
> I have thought about condensing the RB tree information, but then I
> figured that for 'real' SMR handling we cannot assume all zones are of
> fixed size, and hence we need all the information there.
> Any condensing method would assume a given structure of the zones, which
> the standard just doesn't provide.
> Or am I missing something here?

Indeed, the standards do not mandate any particular zone configuration,
constant zone size, etc. So writing code so that can be handled is certainly
the right way of doing things. However, if we decide to go forward with
mapping RESET WRITE POINTER command to DISCARD, then at least a constant
zone size (minus the last zone as you said) must be assumed, and that
information can be removed from the entries in the RB tree (as it will be
saved for the sysfs "zone_size" file anyway. Adding a little code to handle
that eventual last runt zone with a different size is not a big problem.

> 
> As for write pointer handling: yes, the write pointer on the zones is
> not really useful for upper-level usage.
> Where we do need it is to detect I/O which is crossing the write pointer
> (eg when doing reads over the entire zone).
> As per spec you will be getting an I/O error here, so we need to split
> the I/O on the write pointer to get valid results back.

To be precise here, the I/O splitting will be handled by the block layer
thanks to the "chunk_sectors" setting. But that relies on a constant zone
size assumption too.
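
For reference, the chunk_sectors splitting boils down to simple arithmetic --
a sketch of the boundary calculation, assuming chunk_sectors is a power of
two equal to the zone size (this is the idea behind the block layer's split
logic, not a copy of it):

#include <linux/blkdev.h>

/*
 * Number of sectors that can be issued at 'offset' without crossing a
 * zone (chunk) boundary.
 */
static inline unsigned int sectors_to_zone_boundary(struct request_queue *q,
                                                    sector_t offset)
{
        unsigned int chunk = q->limits.chunk_sectors;

        return chunk - (offset & (chunk - 1));
}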

The RB-tree here is most useful for reads over or after the write pointer as
this can have different behavior on different drives (URSWRZ bit). The RB-tree
allows us to hide these differences to upper layers and simplify support at
those levels.

I too agree that the write pointer value is not very useful to upper layers
(e.g. FS). What matters most of the times for these layers is an "allocation
pointer" or "submission pointer" which indicates the LBA to use for the next
BIO submission. That LBA will most of the time be in advance of the zone WP
because of request queing in the block scheduler.
Note that from what we have done already, in many cases, upper layers do not
even need this (e.g. F2FS, btrfs) and work fine without needing *any* 
zone->private_data because that "allocation pointer" information generally
already exists in a different form (e.g. a bit in a bitmap).
For these cases, not having the RB-tree of zones would force a lot
more code to be added to file systems for SMR support.

> 
>> This all tails into domain responsibility. With the RB-Tree doing half of the
>> work and the ‘responsible’ 

Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-02 Thread Hannes Reinecke
On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig  wrote:
>>
>> Can you please integrate this with Hannes series so that it uses
>> his cache of the zone information?
> 
> Adding Hannes and Damien to Cc.
> 
> Christoph,
> 
> I can make a patch that marshals Hannes' RB-Tree into a block report, that is
> quite simple. I can even have the open/close/reset zone commands update the
> RB-Tree .. the non-private parts anyway. I would prefer to do this around the
> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
> not need the RB-Tree to function with zoned media.
> 
> I do still have concerns with the approach which I have shared in smaller
> forums but perhaps I have to bring them to this group.
> 
> First is the memory consumption. This isn't really much of a concern for large
> servers with few drives but I think the embedded NAS market will grumble as
> well as the large data pods trying to stuff 300+ drives in a chassis.
> 
> As of now the RB-Tree needs to hold ~30,000 zones.
> sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields
> around 3.5 MB per zoned drive attached.
> Which is fine if it is really needed, but most of it is fixed information
> and it can be significantly condensed (I have proposed 8 bytes per zone held
> in an array as more than adequate). Worse is that the crucial piece of
> information, the current wp needed for scheduling the next write, is mostly
> out of date because it is updated only after the write completes and zones
> being actively written to must work off of the last location / size that was
> submitted, not completed. The work around is for that tracking to be handled
> in the private_data member. I am not saying that updating the wp on
> completing a write isn’t important, I am saying that the bi_end_io hook is
> the existing hook that works just fine.
> 
Which _actually_ is not true; with my patches I'll update the write
pointer prior to submit the I/O (on the reasoning that most of the time
I/O will succeed) and re-read the zone information if an I/O failed.
(Which I'll have to do anyway as after an I/O failure the write pointer
status is not clearly defined.)

I have thought about condensing the RB tree information, but then I
figured that for 'real' SMR handling we cannot assume all zones are of
fixed size, and hence we need all the information there.
Any condensing method would assume a given structure of the zones, which
the standard just doesn't provide.
Or am I missing something here?

As for write pointer handling: yes, the write pointer on the zones is
not really useful for upper-level usage.
Where we do need it is to detect I/O which is crossing the write pointer
(eg when doing reads over the entire zone).
As per spec you will be getting an I/O error here, so we need to split
the I/O on the write pointer to get valid results back.

> This all tails into domain responsibility. With the RB-Tree doing half of the
> work and the ‘responsible’ domain handling the active path via private_data
> why have the split at all? It seems to be a double work to have second object
> tracking the first so that I/O scheduling can function.
> 
Tracking the zone states via RB trees is mainly to handle host-managed
drives seamlessly. Without an RB tree we will be subjected to I/O errors
during boot, as eg with new drives we inevitably will try to read beyond
the write pointer, which in turn will generate I/O errors on certain drives.
I do agree that there is no strict need to setup an RB tree for
host-aware drives; I can easily add an attribute/flag to disable RB
trees for those.
However, tracking zone information in the RB tree _dramatically_ speeds
up device initialisation; issuing a blkdev_discard() for the entire
drive will take _ages_ without it.
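
To illustrate the point, a sketch of a whole-device reset that consults the
cache and skips zones that are already empty. The zones[] array and its
fields stand in for the cached information; they are not the actual data
structures of the series:

#include <linux/blkdev.h>

static int discard_all_written_zones(struct block_device *bdev,
                                     struct blk_zone *zones,
                                     unsigned int nr_zones)
{
        unsigned int i;
        int ret;

        for (i = 0; i < nr_zones; i++) {
                if (zones[i].wp == zones[i].start)
                        continue;       /* already empty, nothing to reset */
                ret = blkdev_issue_discard(bdev, zones[i].start, zones[i].len,
                                           GFP_KERNEL, 0);
                if (ret)
                        return ret;
        }
        return 0;
}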

> Finally, there is the error handling path: when the RB-Tree encounters an error it
> attempts to requery the drive topology, virtually guaranteeing that the
> private_data is now out-of-sync with the RB-Tree. Again this is something
> that can be better encapsulated in the bi_end_io to be informed of the
> failed I/O and schedule the appropriate recovery (including re-querying the
> zone information of the affected zone(s)).
> 
Well, if we have an RB tree with write pointers of course we need to
re-read the zone information on failure (and I thought I did that?).
Plus the error information will be propagated via the usual mechanism,
so the upper layers need to take care of updating any information in ->private_data.

I'm perfectly fine with integrating your patches for the various
blkdev_XX callouts and associated ioctls; Jens favours that approach,
too, so we should be combining those efforts.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke        Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton

Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-01 Thread Shaun Tancheff
On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig  wrote:
>
> Can you please integrate this with Hannes series so that it uses
> his cache of the zone information?

Adding Hannes and Damien to Cc.

Christoph,

I can make a patch to marshal Hannes' RB-Tree into a block report; that is
quite simple. I can even have the open/close/reset zone commands update the
RB-Tree .. the non-private parts anyway. I would prefer to do this around the
CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
not need the RB-Tree to function with zoned media.
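
Roughly, the marshalling would be a walk of the cached tree into report
entries, something like the sketch below (both structure layouts are
stand-ins, not the ones from either patch set):

/* Stand-in for whatever the zone cache keeps per zone. */
struct cached_zone {
        struct rb_node node;
        sector_t start, len, wp;
        u8 type;
};

/* Stand-in for one entry of the block-layer zone report. */
struct zone_report_entry {
        u64 start, len, wp;
        u8 type;
};

static unsigned int marshal_zone_report(struct rb_root *cache,
                                        struct zone_report_entry *out,
                                        unsigned int max_entries)
{
        struct rb_node *n;
        unsigned int i = 0;

        for (n = rb_first(cache); n && i < max_entries; n = rb_next(n)) {
                struct cached_zone *z = rb_entry(n, struct cached_zone, node);

                out[i].start = z->start;
                out[i].len   = z->len;
                out[i].wp    = z->wp;
                out[i].type  = z->type;
                i++;
        }
        return i;
}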

I do still have concerns with the approach which I have shared in smaller
forums but perhaps I have to bring them to this group.

First is the memory consumption. This isn't really much of a concern for large
servers with few drives but I think the embedded NAS market will grumble as
well as the large data pods trying to stuff 300+ drives in a chassis.

As of now the RB-Tree needs to hold ~30,000 zones.
sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields
around 3.5 MB per zoned drive attached.
Which is fine if it is really needed, but most of it is fixed information
and it can be significantly condensed (I have proposed 8 bytes per zone held
in an array as more than adequate). Worse is that the crucial piece of
information, the current wp needed for scheduling the next write, is mostly
out of date because it is updated only after the write completes and zones
being actively written to must work off of the last location / size that was
submitted, not completed. The workaround is for that tracking to be handled
in the private_data member. I am not saying that updating the wp on
completing a write isn’t important, I am saying that the bi_end_io hook is
the existing hook that works just fine.
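
To illustrate the condensed-array idea, here is one possible 8-byte packing
(an illustration only, not the encoding actually proposed): with a fixed zone
size the start LBA is implied by the array index, so a slot only needs the wp
offset within the zone plus a few state bits.

struct packed_zone {
        u64 wp_offset : 56;     /* write pointer, relative to the zone start */
        u64 type      :  4;     /* conventional / seq-required / seq-preferred */
        u64 cond      :  4;     /* empty, open, closed, full, ... */
};

static inline sector_t zone_wp(const struct packed_zone *tbl,
                               unsigned int zone_no, sector_t zone_len)
{
        return (sector_t)zone_no * zone_len + tbl[zone_no].wp_offset;
}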

This all tails into domain responsibility. With the RB-Tree doing half of the
work and the ‘responsible’ domain handling the active path via private_data,
why have the split at all? It seems to be double work to have a second object
tracking the first so that I/O scheduling can function.

Finally, there is the error handling path: when the RB-Tree encounters an error
it attempts to requery the drive topology, virtually guaranteeing that the
private_data is now out-of-sync with the RB-Tree. Again this is something
that can be better encapsulated in the bi_end_io to be informed of the
failed I/O and schedule the appropriate recovery (including re-querying the
zone information of the affected zone(s)).
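
A sketch of the kind of encapsulation meant here (all names are placeholders):
the completion hook only records the outcome or kicks off recovery, and the
actual re-query of the affected zone runs from process context, since a
REPORT ZONES cannot be issued from the completion path itself.

struct zone_state {
        sector_t start;
        sector_t wp;                    /* owner's view of the write pointer */
        struct work_struct recovery;    /* re-reads just this one zone */
};

struct zone_io {
        struct zone_state *zs;
        sector_t new_wp;                /* wp after this write, if it succeeds */
};

/* INIT_WORK(&zs->recovery, zone_recovery_work) is done at zone setup. */
static void zone_recovery_work(struct work_struct *work)
{
        struct zone_state *zs = container_of(work, struct zone_state, recovery);

        requery_zone(zs);       /* placeholder: REPORT ZONES for this zone only */
}

static void zone_bio_endio(struct bio *bio)
{
        struct zone_io *zio = bio->bi_private;

        if (bio->bi_error)
                schedule_work(&zio->zs->recovery);
        else
                zio->zs->wp = zio->new_wp;      /* confirm the completed wp */
        kfree(zio);
        bio_put(bio);
}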

Anyway, those are my concerns and why I am still reluctant to drop this line of
support. I have incorporated Hannes' changes at various points, hence the
SCT Write Same to attempt to work around some of the flaws in mapping
discard to reset write pointer.

Thanks and Regards,
Shaun

> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Shaun Tancheff


Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-01 Thread Christoph Hellwig

Can you please integrate this with Hannes series so that it uses
his cache of the zone information?



[PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-07-29 Thread Shaun Tancheff
Hi Jens,

This series is based on Linus' current tip after the merge of 'for-4.8/core'.

As Host Aware drives are becoming available we would like to be able
to make use of such drives. This series is also intended to be suitable
for use by Host Managed drives.

ZAC/ZBC drives add new commands for discovering and working with Zones.

This extends the ZAC/ZBC support up to the block layer, allowing direct control
by file systems or device mapper targets. Also, by deferring the zone handling
to the authoritative subsystem there is overall lower memory usage for holding
the active zone information, and it becomes clear which party is responsible
for maintaining the write pointer for each active zone.
By way of example, a DM target may have several writes in progress. The sector
(or lba) for each of those writes will depend on the previous write. While the
drive's write pointer will be updated as writes are completed, the DM target
will be maintaining both where the next write should be scheduled from and
where the write pointer is, based on writes completed w/o errors.
Knowing the drive's zone topology enables DM targets and file systems to
extend their block allocation schemes and issue write pointer resets (or 
discards) that are zone aligned.
A perhaps non-obvious consequence of this approach is that a conventional drive
will return a zone report descriptor with a single large conventional zone.
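
A sketch of the bookkeeping described above (not code from this series): the
target allocates write locations from a "next write" pointer as bios are
issued, and separately tracks the write pointer confirmed by error-free
completions.

struct dm_zone {
        sector_t start;         /* zone start LBA */
        sector_t len;           /* zone length */
        sector_t wp_alloc;      /* where the next write will be placed */
        sector_t wp_done;       /* highest LBA confirmed written w/o error */
        spinlock_t lock;
};

/* Reserve space for an in-flight write; several may be outstanding. */
static sector_t dm_zone_alloc(struct dm_zone *z, unsigned int nr_sects)
{
        sector_t lba;

        spin_lock(&z->lock);
        if (z->wp_alloc + nr_sects > z->start + z->len) {
                spin_unlock(&z->lock);
                return (sector_t)-1;    /* zone full: caller picks another zone */
        }
        lba = z->wp_alloc;
        z->wp_alloc += nr_sects;
        spin_unlock(&z->lock);
        return lba;
}

/*
 * Called from the bio completion path on success. Simplified: assumes
 * completions arrive in order; a real target must cope with gaps.
 */
static void dm_zone_complete(struct dm_zone *z, sector_t lba,
                             unsigned int nr_sects)
{
        spin_lock(&z->lock);
        if (lba + nr_sects > z->wp_done)
                z->wp_done = lba + nr_sects;
        spin_unlock(&z->lock);
}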

Patches for util-linux can be found here:
https://github.com/Seagate/ZDM-Device-Mapper/tree/master/patches/util-linux

This patch is available here:
https://github.com/stancheff/linux/tree/zbc.bio.v6

g...@github.com:stancheff/linux.git zbc.bio.v6

v6:
 - Fix page alloc to include DMA flag for ioctl.
v5:
 - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent
 - In blk-lib fix documentation
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio op's
 - Dropped ata16 hackery
V3:
 - Rebase on Mike Christie's separate bio operations
 - Update blkzoned_api.h to include report zones PARTIAL bit.
 - Use zoned report reserved bit for ata-passthrough flag.

V2:
 - Changed bi_rw to op_flags to clarify separation of bio op from flags.
 - Fixed memory leak in blkdev_issue_zone_report failing to put_bio().
 - Documented opt in blkdev_issue_zone_report.
 - Moved include/uapi/linux/fs.h changes to patch 3
 - Fixed commit message for first patch in series.

Shaun Tancheff (2):
  Add bio/request flags to issue ZBC/ZAC commands
  Add ioctl to issue ZBC/ZAC commands via block layer

 MAINTAINERS   |   9 ++
 block/blk-lib.c   |  95 
 block/ioctl.c | 110 +++
 drivers/scsi/sd.c | 118 
 drivers/scsi/sd.h |   1 +
 include/linux/bio.h   |   7 +-
 include/linux/blk_types.h |   6 +-
 include/linux/blkzoned_api.h  |  25 +
 include/uapi/linux/Kbuild |   1 +
 include/uapi/linux/blkzoned_api.h | 220 ++
 include/uapi/linux/fs.h   |   1 +
 11 files changed, 591 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/blkzoned_api.h
 create mode 100644 include/uapi/linux/blkzoned_api.h

-- 
2.8.1


