Re: [PATCH 0/3] iopmem : A block device for PCIe memory

2016-10-18 Thread Dan Williams
[ adding Ashok and David for potential iommu comments ]

On Tue, Oct 18, 2016 at 2:42 PM, Stephen Bates  wrote:
> This patch follows from an RFC we did earlier this year [1]. This
> patchset applies cleanly to v4.9-rc1.
>
> Updates since RFC
> -
>   Rebased.
>   Included the iopmem driver in the submission.
>
> History
> ---
>
> There have been several attempts to upstream patchsets that enable
> DMAs between PCIe peers. These include Peer-Direct [2] and DMA-Buf
> style patches [3]. None have been successful to date. Haggai Eran
> gives a nice overview of the prior art in this space in his cover
> letter [3].
>
> Motivation and Use Cases
> 
>
> PCIe IO devices are getting faster. It is not uncommon now to find PCIe
> network and storage devices that can generate and consume several GB/s.
> Almost always these devices have either a high performance DMA engine, a
> number of exposed PCIe BARs or both.
>
> Until this patch, any high-performance transfer of information between
> two PCIe devices has required the use of a staging buffer in system
> memory. With this patch the bandwidth to system memory is not compromised
> when high-throughput transfers occur between PCIe devices. This means
> that more system memory bandwidth is available to the CPU cores for data
> processing and manipulation. In addition, in systems where the two PCIe
> devices reside behind a PCIe switch the datapath avoids the CPU
> entirely.

I agree with the motivation and the need for a solution, but I have
some questions about this implementation.

>
> Consumers
> -
>
> We provide a PCIe device driver in an accompanying patch that can be
> used to map any PCIe BAR into a DAX capable block device. For
> non-persistent BARs this simply serves as an alternative to using
> system memory bounce buffers. For persistent BARs this can serve as an
> additional storage device in the system.

Why block devices?  I wonder if iopmem was initially designed back
when we were considering enabling DAX for raw block devices.  However,
that support has since been ripped out / abandoned.  You currently
need a filesystem on top of a block-device to get DAX operation.
Putting xfs or ext4 on top of a PCI-E memory-mapped range seems awkward
if all you want is a way to map the bar for another PCI-E device in
the topology.

If you're only using the block-device as an entry-point to create
dax-mappings then a device-dax (drivers/dax/) character-device might
be a better fit.
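
For comparison, consuming a device-dax character device from userspace
needs no filesystem at all; it is just an open() plus mmap(). A minimal
sketch (the /dev/dax0.0 path and the 2MB length are illustrative, and the
mapping must respect the dax region's alignment):

/* Minimal sketch: map a device-dax character device directly from
 * userspace.  The device path and mapping length are illustrative;
 * real names depend on how the dax region is configured.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL << 20;            /* 2MB, aligned for device-dax */
	int fd = open("/dev/dax0.0", O_RDWR);

	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}

	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return EXIT_FAILURE;
	}

	/* addr now points directly at the mapped memory; no filesystem,
	 * no page cache, no block layer in the data path. */
	munmap(addr, len);
	close(fd);
	return 0;
}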

>
> Testing and Performance
> ---
>
> We have done a moderate amount of testing of this patch in a QEMU
> environment and on real hardware. On real hardware we have observed
> peer-to-peer writes of up to 4GB/s and reads of up to 1.2 GB/s. In
> both cases these numbers are limitations of our consumer hardware. In
> addition, we have observed that the CPU DRAM bandwidth is not impacted
> when using IOPMEM which is not the case when a traditional path
> through system memory is taken.
>
> For more information on the testing and performance results see the
> GitHub site [4].
>
> Known Issues
> 
>
> 1. Address Translation. Suggestions have been made that in certain
> architectures and topologies the dma_addr_t passed to the DMA master
> in a peer-2-peer transfer will not correctly route to the IO memory
> intended. However in our testing to date we have not seen this to be
> an issue, even in systems with IOMMUs and PCIe switches. It is our
> understanding that an IOMMU only maps system memory and would not
> interfere with device memory regions. (It certainly has no opportunity
> to do so if the transfer gets routed through a switch).
>

There may still be platforms where peer-to-peer cycles are routed up
through the root bridge and then back down to the target device, but we
can address that when / if it happens.  I wonder if we could (ab)use a
software-defined 'pasid' as the requester id for a peer-to-peer
mapping that needs address translation.

> 2. Memory Segment Spacing. This patch has the same limitations that
> ZONE_DEVICE does in that memory regions must be spaced at least
> SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
> BARs can be placed closer together than this. Thus ZONE_DEVICE would not
> be usable on neighboring BARs. For our purposes, this is not an issue as
> we'd only be looking at enabling a single BAR in a given PCIe device.
> More exotic use cases may have problems with this.

I'm working on patches for 4.10 to allow mixing multiple
devm_memremap_pages() allocations within the same physical section.
Hopefully this won't be a problem going forward.

> 3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
> peer there is potential for coherency issues and for writes to occur out
> of order. This is something that users of this feature need to be
> cognizant of. Though really, this isn't much different than the
> existing situation with things like RDMA: if userspace sets up an MR
> for remote use, they need to be careful about using that memory region
> themselves.

Re: [PATCH] badblocks: fix overlapping check for clearing

2016-10-18 Thread NeilBrown
On Wed, Oct 12 2016, Tomasz Majchrzak wrote:

> On Mon, Oct 10, 2016 at 03:32:58PM -0700, Dan Williams wrote:
>> > On Tue, Sep 06 2016, Tomasz Majchrzak wrote:
>> >> ---
>> >>  block/badblocks.c | 6 --
>> >>  1 file changed, 4 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/block/badblocks.c b/block/badblocks.c
>> >> index 7be53cb..b2ffcc7 100644
>> >> --- a/block/badblocks.c
>> >> +++ b/block/badblocks.c
>> >> @@ -354,7 +354,8 @@ int badblocks_clear(struct badblocks *bb, sector_t s, 
>> >> int sectors)
>> >>* current range.  Earlier ranges could also overlap,
>> >>* but only this one can overlap the end of the range.
>> >>*/
>> >> - if (BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > target) {
>> >> + if ((BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > target) &&
>> >> + (BB_OFFSET(p[lo]) <= target)) {
>> >
>> > hmmm..
>> > 'target' is the sector just beyond the set of sectors to remove from the
>> > list.
>> > BB_OFFSET(p[lo]) is the first sector in a range that was found in the
>> > list.
>> > If these are equal, then we aren't clearing anything in this range.
>> > So I would have '<', not '<='.
>> >
>> > I don't think this makes the code wrong as we end up assigning to p[lo]
>> > the value that is already there.  But it might be confusing.
>> >
>> >
>> >>   /* Partial overlap, leave the tail of this range */
>> >>   int ack = BB_ACK(p[lo]);
>> >>   sector_t a = BB_OFFSET(p[lo]);
>> >> @@ -377,7 +378,8 @@ int badblocks_clear(struct badblocks *bb, sector_t s, 
>> >> int sectors)
>> >>   lo--;
>> >>   }
>> >>   while (lo >= 0 &&
>> >> -BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
>> >> +(BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) &&
>> >> +(BB_OFFSET(p[lo]) <= target)) {
>> >
>> > Ditto.
>> >
>> > But the code is, I think, correct. Just not how I would have written it.
>> > So
>> >
>> >  Acked-by: NeilBrown 
>> 
>> I agree with the comments to change "<=" to "<".  Tomasz, care to
>> re-send with those changes?
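
To make the boundary case concrete, here is a toy illustration with plain
integers rather than the kernel's packed BB_OFFSET()/BB_LEN() entries:
with s = 100 and sectors = 10, target is 110, and a stored bad range that
starts at sector 110 does not overlap the cleared range at all.

/* Toy illustration of the boundary case (plain integers, not the
 * kernel's packed BB_OFFSET()/BB_LEN() entries). */
#include <stdio.h>

int main(void)
{
	unsigned long long s = 100, sectors = 10;
	unsigned long long target = s + sectors;        /* 110: first sector NOT being cleared */
	unsigned long long bb_offset = 110, bb_len = 5; /* stored bad range [110, 115) */

	/* The range ends past 'target' but starts exactly at 'target', so
	 * nothing in it is being cleared.  '<' excludes it; '<=' would
	 * "process" it but only rewrite the values already stored. */
	if (bb_offset + bb_len > target && bb_offset < target)
		printf("overlap: trim or split the stored range\n");
	else
		printf("no overlap: leave [%llu, %llu) untouched\n",
		       bb_offset, bb_offset + bb_len);
	return 0;
}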
>
> I have just resent the patch with your suggestions included.
>
>> > In the original md context, it would only ever be called on a block that
>> > was already in the list.
>
> Actually MD RAID10 calls it this way. See handle_write_completed, it iterates
> over all copies and clears the bad block if error has not been returned. I 
> have
> a test case which fails for that reason - existing bad block is modified by
> clear block. It is very unlikely to happen in real life as it depends on
> specific layout of bad blocks and their discovery order, however it's a gap 
> that
> needs to be closed.

Ahh, I didn't realize that.  I see that you are correct though.

>
> I had put some effort into seeing if clearing a non-existing bad block in
> RAID10 can lead to some incorrect behaviour but I haven't found any. It
> seems that my patch is sufficient to fix the problem.

Yes.  Thanks a lot for sorting this out :-)

NeilBrown




Re: [PATCH v8 0/7] ZBC / Zoned block device support

2016-10-18 Thread Martin K. Petersen
> "Jens" == Jens Axboe  writes:

Jens> I already queued up the other bits, if it's fine with you I'll add
Jens> 6/7 as well.

Sure. Feel free to add my Acked-by:.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v8 0/7] ZBC / Zoned block device support

2016-10-18 Thread Jens Axboe

On 10/18/2016 06:46 PM, Martin K. Petersen wrote:

"Jens" == Jens Axboe  writes:


Jens> This is starting to look mergeable to me.

Yup.

Jens> Any objections in getting this applied for 4.10? Looks like 6/7
Jens> should go through the SCSI tree, but I can queue up the rest.

I'm OK with it at this point. Probably easier if either you or I take
the whole lot.


I already queued up the other bits, if it's fine with you I'll add 6/7
as well.

--
Jens Axboe



Re: [PATCH v8 6/7] sd: Implement support for ZBC devices

2016-10-18 Thread Martin K. Petersen
> "Damien" == Damien Le Moal  writes:

Damien> Implement ZBC support functions to setup zoned disks, both
Damien> host-managed and host-aware models. Only zoned disks that
Damien> satisfy the following conditions are supported:
Damien> 1) All zones are the same size, with the exception of an
Damien>eventual last smaller runt zone.
Damien> 2) For host-managed disks, reads are unrestricted (reads are not
Damien>failed due to zone or write pointer alignment constraints).
Damien> Zoned disks that do not satisfy these 2 conditions are setup
Damien> with a capacity of 0 to prevent their use.

Damien> The function sd_zbc_read_zones, called from sd_revalidate_disk,
Damien> checks that the device satisfies the above two constraints. This
Damien> function may also change the disk capacity previously set by
Damien> sd_read_capacity for devices reporting only the capacity of
Damien> conventional zones at the beginning of the LBA range
Damien> (i.e. devices reporting rc_basis set to 0).

Damien> The capacity message output was moved out of sd_read_capacity
Damien> into a new function sd_print_capacity to include this eventual
Damien> capacity change by sd_zbc_read_zones. This new function also
Damien> includes a call to sd_zbc_print_zones to display the number of
Damien> zones and zone size of the device.

Acked-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v8 6/7] sd: Implement support for ZBC devices

2016-10-18 Thread Martin K. Petersen
> "Jeff" == Jeff Moyer  writes:

Jeff,

Jeff> Are power of 2 zone sizes required by the standard?  I see why
Jeff> you've done this, but I wonder if we're artificially limiting the
Jeff> implementation, and whether there will be valid devices on the
Jeff> market that simply won't work with Linux because of this.

Standards are deliberately written to be permissive. But Linux doesn't
support arbitrary sector sizes either even though the spec allows it. We
always pick a reasonably sane subset of features to implement and this
case is no different.

After some discussion we decided to rip out all the complexity that was
required to facilitate crazy drive layouts. As a result, the code is now
in a state where we can actually merge it. The hope is that by picking a
specific configuration subset and widely advertising it we can influence
the market.

Also, I am not aware of anybody actually asking the drive vendors to
support crazy zone configurations.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v8 2/7] blk-sysfs: Add 'chunk_sectors' to sysfs attributes

2016-10-18 Thread Damien Le Moal

Jeff,

On 10/19/16 01:43, Jeff Moyer wrote:
> Damien Le Moal  writes:
> 
>> diff --git a/Documentation/ABI/testing/sysfs-block 
>> b/Documentation/ABI/testing/sysfs-block
>> index 75a5055..ee2d5cd 100644
>> --- a/Documentation/ABI/testing/sysfs-block
>> +++ b/Documentation/ABI/testing/sysfs-block
>> @@ -251,3 +251,16 @@ Description:
>>  since drive-managed zoned block devices do not support
>>  zone commands, they will be treated as regular block
>>  devices and zoned will report "none".
>> +
>> +What:   /sys/block//queue/chunk_sectors
>> +Date:   September 2016
>> +Contact:Hannes Reinecke 
>> +Description:
>> +chunk_sectors has different meaning depending on the type
>> +of the disk. For a RAID device (dm-raid), chunk_sectors
>> +indicates the size in 512B sectors of the RAID volume
>> +stripe segment. For a zoned block device, either
>> +host-aware or host-managed, chunk_sectors indicates the
>> +size of 512B sectors of the zones of the device, with
>  ^^
>  in

Good catch. Thank you. Will fix this.

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
damien.lem...@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com


Re: [regression, 4.9-rc1] blk-mq: list corruption in request queue

2016-10-18 Thread Jens Axboe

On 10/18/2016 05:07 PM, Dave Chinner wrote:

Hi Jens,

One of my test VMs (4p, 4GB RAM) tripped over this last night
running xfs/297 over a pair of 20GB iscsi luns:

[ 8341.363558] [ cut here ]
[ 8341.364360] WARNING: CPU: 0 PID: 10929 at lib/list_debug.c:33 
__list_add+0x89/0xb0
[ 8341.365439] list_add corruption. prev->next should be next 
(e8c02808), but was c90005f6bda8. (prev=88013363bb80).
[ 8341.366900] Modules linked in:
[ 8341.367305] CPU: 0 PID: 10929 Comm: fsstress Tainted: GW   
4.9.0-rc1-dgc+ #1001
[ 8341.368323] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Debian-1.8.2-1 04/01/2014
[ 8341.369431]  c90009d1b860 81821c60 c90009d1b8b0 

[ 8341.370423]  c90009d1b8a0 810b69fb 002181808107 
880133713840
[ 8341.371415]  88013363bb80 e8c02808 e8c02800 
0008
[ 8341.372442] Call Trace:
[ 8341.372759]  [] dump_stack+0x63/0x83
[ 8341.373411]  [] __warn+0xcb/0xf0
[ 8341.374017]  [] warn_slowpath_fmt+0x4f/0x60
[ 8341.374741]  [] ? part_round_stats+0x4f/0x60
[ 8341.375466]  [] __list_add+0x89/0xb0
[ 8341.376125]  [] blk_sq_make_request+0x3ec/0x520
[ 8341.376881]  [] generic_make_request+0xd0/0x1c0
[ 8341.377637]  [] submit_bio+0x58/0x100
[ 8341.378315]  [] xfs_submit_ioend+0x82/0xd0
[ 8341.379039]  [] ? xfs_start_page_writeback+0x99/0xa0
[ 8341.379845]  [] xfs_do_writepage+0x59a/0x730
[ 8341.380601]  [] write_cache_pages+0x1f6/0x550
[ 8341.381357]  [] ? xfs_aops_discard_page+0x140/0x140
[ 8341.382158]  [] xfs_vm_writepages+0xa0/0xd0
[ 8341.382887]  [] do_writepages+0x1e/0x30
[ 8341.383603]  [] __filemap_fdatawrite_range+0x71/0x90
[ 8341.384423]  [] filemap_write_and_wait_range+0x41/0x90
[ 8341.385255]  [] xfs_free_file_space+0xb4/0x460
[ 8341.386021]  [] ? avc_has_perm+0xad/0x1b0
[ 8341.386715]  [] ? __might_sleep+0x4a/0x80
[ 8341.387422]  [] xfs_zero_file_space+0x39/0xd0
[ 8341.388164]  [] xfs_file_fallocate+0x2fc/0x340
[ 8341.388917]  [] ? selinux_file_permission+0xd7/0x110
[ 8341.389738]  [] ? __might_sleep+0x4a/0x80
[ 8341.390439]  [] vfs_fallocate+0x157/0x220
[ 8341.391156]  [] SyS_fallocate+0x48/0x80
[ 8341.391834]  [] do_syscall_64+0x67/0x180
[ 8341.392517]  [] entry_SYSCALL64_slow_path+0x25/0x25
[ 8341.393343] ---[ end trace 477b0f6e35ebd064 ]---
[ 8341.502708] [ cut here ]
[ 8341.503479] WARNING: CPU: 1 PID: 27731 at lib/list_debug.c:29 
__list_add+0x62/0xb0
[ 8341.505131] list_add corruption. next->prev should be prev 
(e8c02808), but was 880133795dc0. (next=e8c02808).
[ 8341.506657] Modules linked in:
[ 8341.507092] CPU: 1 PID: 27731 Comm: kworker/1:0H Tainted: GW   
4.9.0-rc1-dgc+ #1001
[ 8341.508137] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Debian-1.8.2-1 04/01/2014
[ 8341.509228] Workqueue: kblockd blk_mq_requeue_work
[ 8341.509819]  c900038efcb8 81821c60 c900038efd08 

[ 8341.510760]  c900038efcf8 810b69fb 001d3371cf98 
88013363e100
[ 8341.511729]  e8c02808 e8c02808 c900038efde0 
88013363e100
[ 8341.512708] Call Trace:
[ 8341.513026]  [] dump_stack+0x63/0x83
[ 8341.513669]  [] __warn+0xcb/0xf0
[ 8341.514283]  [] warn_slowpath_fmt+0x4f/0x60
[ 8341.515003]  [] ? set_next_entity+0xb6/0x970
[ 8341.515733]  [] ? account_entity_dequeue+0x70/0x90
[ 8341.516521]  [] __list_add+0x62/0xb0
[ 8341.517162]  [] blk_mq_insert_request+0x11e/0x130
[ 8341.517951]  [] blk_mq_requeue_work+0xbc/0x130
[ 8341.518701]  [] process_one_work+0x180/0x440
[ 8341.519430]  [] worker_thread+0x4e/0x490
[ 8341.520119]  [] ? process_one_work+0x440/0x440
[ 8341.520865]  [] ? process_one_work+0x440/0x440
[ 8341.521612]  [] kthread+0xd5/0xf0
[ 8341.56]  [] ? kthread_park+0x60/0x60
[ 8341.522912]  [] ret_from_fork+0x25/0x30
[ 8341.523620] ---[ end trace 477b0f6e35ebd065 ]---

I haven't seen it before, hence it's probably a regression. I
haven't tried to reproduce it yet, so I don't know if it's easy
to trip over.


Dave Jones just reported the same thing, and also as a regression from
4.8. I'll look into this, nothing sticks out at me immediately. It's
hitting both sq/mq cases.

--
Jens Axboe



[regression, 4.9-rc1] blk-mq: list corruption in request queue

2016-10-18 Thread Dave Chinner
Hi Jens,

One of my test VMs (4p, 4GB RAM) tripped over this last night
running xfs/297 over a pair of 20GB iscsi luns:

[ 8341.363558] [ cut here ]
[ 8341.364360] WARNING: CPU: 0 PID: 10929 at lib/list_debug.c:33 
__list_add+0x89/0xb0
[ 8341.365439] list_add corruption. prev->next should be next 
(e8c02808), but was c90005f6bda8. (prev=88013363bb80).
[ 8341.366900] Modules linked in:
[ 8341.367305] CPU: 0 PID: 10929 Comm: fsstress Tainted: GW   
4.9.0-rc1-dgc+ #1001
[ 8341.368323] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Debian-1.8.2-1 04/01/2014
[ 8341.369431]  c90009d1b860 81821c60 c90009d1b8b0 

[ 8341.370423]  c90009d1b8a0 810b69fb 002181808107 
880133713840
[ 8341.371415]  88013363bb80 e8c02808 e8c02800 
0008
[ 8341.372442] Call Trace:
[ 8341.372759]  [] dump_stack+0x63/0x83
[ 8341.373411]  [] __warn+0xcb/0xf0
[ 8341.374017]  [] warn_slowpath_fmt+0x4f/0x60
[ 8341.374741]  [] ? part_round_stats+0x4f/0x60
[ 8341.375466]  [] __list_add+0x89/0xb0
[ 8341.376125]  [] blk_sq_make_request+0x3ec/0x520
[ 8341.376881]  [] generic_make_request+0xd0/0x1c0
[ 8341.377637]  [] submit_bio+0x58/0x100
[ 8341.378315]  [] xfs_submit_ioend+0x82/0xd0
[ 8341.379039]  [] ? xfs_start_page_writeback+0x99/0xa0
[ 8341.379845]  [] xfs_do_writepage+0x59a/0x730
[ 8341.380601]  [] write_cache_pages+0x1f6/0x550
[ 8341.381357]  [] ? xfs_aops_discard_page+0x140/0x140
[ 8341.382158]  [] xfs_vm_writepages+0xa0/0xd0
[ 8341.382887]  [] do_writepages+0x1e/0x30
[ 8341.383603]  [] __filemap_fdatawrite_range+0x71/0x90
[ 8341.384423]  [] filemap_write_and_wait_range+0x41/0x90
[ 8341.385255]  [] xfs_free_file_space+0xb4/0x460
[ 8341.386021]  [] ? avc_has_perm+0xad/0x1b0
[ 8341.386715]  [] ? __might_sleep+0x4a/0x80
[ 8341.387422]  [] xfs_zero_file_space+0x39/0xd0
[ 8341.388164]  [] xfs_file_fallocate+0x2fc/0x340
[ 8341.388917]  [] ? selinux_file_permission+0xd7/0x110
[ 8341.389738]  [] ? __might_sleep+0x4a/0x80
[ 8341.390439]  [] vfs_fallocate+0x157/0x220
[ 8341.391156]  [] SyS_fallocate+0x48/0x80
[ 8341.391834]  [] do_syscall_64+0x67/0x180
[ 8341.392517]  [] entry_SYSCALL64_slow_path+0x25/0x25
[ 8341.393343] ---[ end trace 477b0f6e35ebd064 ]---
[ 8341.502708] [ cut here ]
[ 8341.503479] WARNING: CPU: 1 PID: 27731 at lib/list_debug.c:29 
__list_add+0x62/0xb0
[ 8341.505131] list_add corruption. next->prev should be prev 
(e8c02808), but was 880133795dc0. (next=e8c02808).
[ 8341.506657] Modules linked in:
[ 8341.507092] CPU: 1 PID: 27731 Comm: kworker/1:0H Tainted: GW   
4.9.0-rc1-dgc+ #1001
[ 8341.508137] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Debian-1.8.2-1 04/01/2014
[ 8341.509228] Workqueue: kblockd blk_mq_requeue_work
[ 8341.509819]  c900038efcb8 81821c60 c900038efd08 

[ 8341.510760]  c900038efcf8 810b69fb 001d3371cf98 
88013363e100
[ 8341.511729]  e8c02808 e8c02808 c900038efde0 
88013363e100
[ 8341.512708] Call Trace:
[ 8341.513026]  [] dump_stack+0x63/0x83
[ 8341.513669]  [] __warn+0xcb/0xf0
[ 8341.514283]  [] warn_slowpath_fmt+0x4f/0x60
[ 8341.515003]  [] ? set_next_entity+0xb6/0x970
[ 8341.515733]  [] ? account_entity_dequeue+0x70/0x90
[ 8341.516521]  [] __list_add+0x62/0xb0
[ 8341.517162]  [] blk_mq_insert_request+0x11e/0x130
[ 8341.517951]  [] blk_mq_requeue_work+0xbc/0x130
[ 8341.518701]  [] process_one_work+0x180/0x440
[ 8341.519430]  [] worker_thread+0x4e/0x490
[ 8341.520119]  [] ? process_one_work+0x440/0x440
[ 8341.520865]  [] ? process_one_work+0x440/0x440
[ 8341.521612]  [] kthread+0xd5/0xf0
[ 8341.56]  [] ? kthread_park+0x60/0x60
[ 8341.522912]  [] ret_from_fork+0x25/0x30
[ 8341.523620] ---[ end trace 477b0f6e35ebd065 ]---

I haven't seen it before, hence it's probably a regression. I
haven't tried to reproduce it yet, so I don't know if it's easy
to trip over.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[PATCH 0/3] iopmem : A block device for PCIe memory

2016-10-18 Thread Stephen Bates
This patch follows from an RFC we did earlier this year [1]. This
patchset applies cleanly to v4.9-rc1.

Updates since RFC
-
  Rebased.
  Included the iopmem driver in the submission.

History
---

There have been several attempts to upstream patchsets that enable
DMAs between PCIe peers. These include Peer-Direct [2] and DMA-Buf
style patches [3]. None have been successful to date. Haggai Eran
gives a nice overview of the prior art in this space in his cover
letter [3].

Motivation and Use Cases


PCIe IO devices are getting faster. It is not uncommon now to find PCIe
network and storage devices that can generate and consume several GB/s.
Almost always these devices have either a high performance DMA engine, a
number of exposed PCIe BARs or both.

Until this patch, any high-performance transfer of information between
two PCIe devices has required the use of a staging buffer in system
memory. With this patch the bandwidth to system memory is not compromised
when high-throughput transfers occur between PCIe devices. This means
that more system memory bandwidth is available to the CPU cores for data
processing and manipulation. In addition, in systems where the two PCIe
devices reside behind a PCIe switch the datapath avoids the CPU
entirely.

Consumers
-

We provide a PCIe device driver in an accompanying patch that can be
used to map any PCIe BAR into a DAX capable block device. For
non-persistent BARs this simply serves as an alternative to using
system memory bounce buffers. For persistent BARs this can serve as an
additional storage device in the system.

Testing and Performance
---

We have done a moderate amount of testing of this patch in a QEMU
environment and on real hardware. On real hardware we have observed
peer-to-peer writes of up to 4GB/s and reads of up to 1.2 GB/s. In
both cases these numbers are limitations of our consumer hardware. In
addition, we have observed that the CPU DRAM bandwidth is not impacted
when using IOPMEM which is not the case when a traditional path
through system memory is taken.

For more information on the testing and performance results see the
GitHub site [4].

Known Issues


1. Address Translation. Suggestions have been made that in certain
architectures and topologies the dma_addr_t passed to the DMA master
in a peer-2-peer transfer will not correctly route to the IO memory
intended. However in our testing to date we have not seen this to be
an issue, even in systems with IOMMUs and PCIe switches. It is our
understanding that an IOMMU only maps system memory and would not
interfere with device memory regions. (It certainly has no opportunity
to do so if the transfer gets routed through a switch).

2. Memory Segment Spacing. This patch has the same limitations that
ZONE_DEVICE does in that memory regions must be spaced at least
SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
BARs can be placed closer together than this. Thus ZONE_DEVICE would not
be usable on neighboring BARs. For our purposes, this is not an issue as
we'd only be looking at enabling a single BAR in a given PCIe device.
More exotic use cases may have problems with this.

3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
peer there is potential for coherency issues and for writes to occur out
of order. This is something that users of this feature need to be
cognizant of. Though really, this isn't much different than the
existing situation with things like RDMA: if userspace sets up an MR
for remote use, they need to be careful about using that memory region
themselves.

4. Architecture. Currently this patch is applicable only to x86_64
architectures. The same is true for much of the code pertaining to
PMEM and ZONE_DEVICE. It is hoped that the work will be extended to other
ARCH over time.

References
--
[1] https://patchwork.kernel.org/patch/8583221/
[2] http://comments.gmane.org/gmane.linux.drivers.rdma/21849
[3] http://www.spinics.net/lists/linux-rdma/msg38748.html
[4] https://github.com/sbates130272/zone-device

Logan Gunthorpe (1):
  memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.

Stephen Bates (2):
  iopmem : Add a block device driver for PCIe attached IO memory.
  iopmem : Add documentation for iopmem driver

 Documentation/blockdev/00-INDEX   |   2 +
 Documentation/blockdev/iopmem.txt |  62 +++
 MAINTAINERS   |   7 +
 drivers/block/Kconfig |  27 
 drivers/block/Makefile|   1 +
 drivers/block/iopmem.c| 333 ++
 drivers/dax/pmem.c|   4 +-
 drivers/nvdimm/pmem.c |   4 +-
 include/linux/memremap.h  |   5 +-
 kernel/memremap.c |  80 -
 tools/testing/nvdimm/test/iomap.c |   3 +-
 11 files changed, 518 insertions(+), 10 deletions(-)
 create mode 100644 

[PATCH 3/3] iopmem : Add documentation for iopmem driver

2016-10-18 Thread Stephen Bates
Add documentation for the iopmem PCIe device driver.

Signed-off-by: Stephen Bates 
Signed-off-by: Logan Gunthorpe 
---
 Documentation/blockdev/00-INDEX   |  2 ++
 Documentation/blockdev/iopmem.txt | 62 +++
 2 files changed, 64 insertions(+)
 create mode 100644 Documentation/blockdev/iopmem.txt

diff --git a/Documentation/blockdev/00-INDEX b/Documentation/blockdev/00-INDEX
index c08df56..913e500 100644
--- a/Documentation/blockdev/00-INDEX
+++ b/Documentation/blockdev/00-INDEX
@@ -8,6 +8,8 @@ cpqarray.txt
- info on using Compaq's SMART2 Intelligent Disk Array Controllers.
 floppy.txt
- notes and driver options for the floppy disk driver.
+iopmem.txt
+   - info on the iopmem block driver.
 mflash.txt
- info on mGine m(g)flash driver for linux.
 nbd.txt
diff --git a/Documentation/blockdev/iopmem.txt 
b/Documentation/blockdev/iopmem.txt
new file mode 100644
index 000..ba805b8
--- /dev/null
+++ b/Documentation/blockdev/iopmem.txt
@@ -0,0 +1,62 @@
+IOPMEM Block Driver
+===
+
+Logan Gunthorpe and Stephen Bates - October 2016
+
+Introduction
+
+
+The iopmem module creates a DAX capable block device from a BAR on a PCIe
+device. iopmem borrows heavily from the pmem driver although it utilizes IO
+memory rather than system memory as its backing store.
+
+Usage
+-
+
+To include the iopmem module in your kernel please set CONFIG_BLK_DEV_IOPMEM
+to either y or m. A block device will be created for each PCIe attached device
+that matches the vendor and device ID as specified in the module. Currently an
+unallocated PMC PCIe ID is used as the default. Alternatively this driver can
+be bound to any arbitrary PCIe function using the sysfs bind entry.
+
+The main purpose for an iopmem block device is expected to be for peer-2-peer
+PCIe transfers. We DO NOT RECOMMEND accessing an iopmem device using the local
+CPU unless you are doing one of the three following things:
+
+1. Creating a DAX capable filesystem on the iopmem device.
+2. Creating some files on the DAX capable filesystem.
+3. Interrogating the files on said filesystem to obtain pointers that can be
+   passed to other PCIe devices for p2p DMA operations.
+
+Issues
+--
+
+1. Address Translation. Suggestions have been made that in certain
+architectures and topologies the dma_addr_t passed to the DMA master
+in a peer-2-peer transfer will not correctly route to the IO memory
+intended. However in our testing to date we have not seen this to be
+an issue, even in systems with IOMMUs and PCIe switches. It is our
+understanding that an IOMMU only maps system memory and would not
+interfere with device memory regions. (It certainly has no opportunity
+to do so if the transfer gets routed through a switch).
+
+2. Memory Segment Spacing. This patch has the same limitations that
+ZONE_DEVICE does in that memory regions must be spaced at least
+SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
+BARs can be placed closer together than this. Thus ZONE_DEVICE would not
+be usable on neighboring BARs. For our purposes, this is not an issue as
+we'd only be looking at enabling a single BAR in a given PCIe device.
+More exotic use cases may have problems with this.
+
+3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
+peer there is potential for coherency issues and for writes to occur out
+of order. This is something that users of this feature need to be
+cognizant of and may necessitate the use of CONFIG_EXPERT. Though really,
+this isn't much different than the existing situation with RDMA: if
+userspace sets up an MR for remote use, they need to be careful about
+using that memory region themselves.
+
+4. Architecture. Currently this patch is applicable only to x86
+architectures. The same is true for much of the code pertaining to
+PMEM and ZONE_DEVICE. It is hoped that the work will be extended to other
+ARCH over time.
--
2.1.4
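
The usage flow described in iopmem.txt above might look roughly like the
following from userspace. This is only a sketch: the /dev/iopmem0 device
name, the /mnt/iopmem mount point and the file name are illustrative, and
the filesystem must first be created and mounted with -o dax (steps 1
and 2).

/* Sketch of the flow above.  Steps 1 and 2 happen outside this program,
 * e.g. (illustrative device and mount point names):
 *
 *     mkfs.ext4 /dev/iopmem0
 *     mount -o dax /dev/iopmem0 /mnt/iopmem
 *
 * Step 3: map a file on the DAX filesystem to obtain an address that can
 * be handed to another PCIe device (for example by registering it as an
 * RDMA memory region) for peer-to-peer DMA.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1UL << 20;
	int fd = open("/mnt/iopmem/p2p-buffer", O_RDWR | O_CREAT, 0600);

	if (fd < 0 || ftruncate(fd, len) < 0) {
		perror("open/ftruncate");
		return 1;
	}

	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* buf is backed by the PCIe BAR.  Per the recommendation above,
	 * hand it to a DMA-capable consumer rather than accessing it
	 * heavily from the CPU. */
	munmap(buf, len);
	close(fd);
	return 0;
}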


[PATCH 1/3] memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.

2016-10-18 Thread Stephen Bates
From: Logan Gunthorpe 

We build on recent work that adds memory regions owned by a device
driver (ZONE_DEVICE) [1] and that adds struct page support for these new
regions of memory [2].

1. Add an extra flags argument into devm_memremap_pages to take in a
MEMREMAP_XX argument. We update the existing calls to this function to
reflect the change.

2. For completeness, we add MEMREMAP_WT support to the memremap;
however we have no actual need for this functionality.

3. We add the static functions, add_zone_device_pages and
remove_zone_device_pages. These are similar to arch_add_memory except
they don't create the memory mapping. We don't believe these need to be
made arch specific, but are open to other opinions.

4. devm_memremap_pages and devm_memremap_pages_release are updated to
treat IO memory slightly differently. For IO memory we use a combination
of the appropriate io_remap function and the zone_device pages functions
created above. A flags variable and kaddr pointer are added to struct
page_mem to facilitate this for the release function. We also set up
the page attribute tables for the mapped region correctly based on the
desired mapping.

[1] https://lists.01.org/pipermail/linux-nvdimm/2015-August/001810.html
[2] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002387.html

Signed-off-by: Stephen Bates 
Signed-off-by: Logan Gunthorpe 
---
 drivers/dax/pmem.c|  4 +-
 drivers/nvdimm/pmem.c |  4 +-
 include/linux/memremap.h  |  5 ++-
 kernel/memremap.c | 80 +--
 tools/testing/nvdimm/test/iomap.c |  3 +-
 5 files changed, 86 insertions(+), 10 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 9630d88..58ac456 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "../nvdimm/pfn.h"
 #include "../nvdimm/nd.h"
 #include "dax.h"
@@ -108,7 +109,8 @@ static int dax_pmem_probe(struct device *dev)
if (rc)
return rc;

-   addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap);
+   addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap,
+   ARCH_MEMREMAP_PMEM);
if (IS_ERR(addr))
return PTR_ERR(addr);

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 42b3a82..97032a1 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -278,7 +278,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->pfn_flags = PFN_DEV;
if (is_nd_pfn(dev)) {
addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
-   altmap);
+   altmap, ARCH_MEMREMAP_PMEM);
pfn_sb = nd_pfn->pfn_sb;
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
pmem->pfn_pad = resource_size(res) - resource_size(&pfn_res);
@@ -287,7 +287,7 @@ static int pmem_attach_disk(struct device *dev,
res->start += pmem->data_offset;
} else if (pmem_should_map_pages(dev)) {
addr = devm_memremap_pages(dev, &pmem->res,
-   &q->q_usage_counter, NULL);
+   &q->q_usage_counter, NULL, ARCH_MEMREMAP_PMEM);
pmem->pfn_flags |= PFN_MAP;
} else
addr = devm_memremap(dev, pmem->phys_addr,
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9341619..fc99283 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -51,12 +51,13 @@ struct dev_pagemap {

 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-   struct percpu_ref *ref, struct vmem_altmap *altmap);
+   struct percpu_ref *ref, struct vmem_altmap *altmap,
+   unsigned long flags);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
struct resource *res, struct percpu_ref *ref,
-   struct vmem_altmap *altmap)
+   struct vmem_altmap *altmap, unsigned long flags)
 {
/*
 * Fail attempts to call devm_memremap_pages() without
diff --git a/kernel/memremap.c b/kernel/memremap.c
index b501e39..d5f462c 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -175,13 +175,41 @@ static RADIX_TREE(pgmap_radix, GFP_KERNEL);
 #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
 #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)

+enum {
+   PAGEMAP_IO_MEM = 1 << 0,
+};
+
 struct page_map {
struct resource res;
struct percpu_ref *ref;
struct dev_pagemap pgmap;
struct vmem_altmap altmap;
+   void *kaddr;
+   int flags;
 };

+static int add_zone_device_pages(int nid, u64 start, u64 size)
+{
+   struct pglist_data *pgdat = NODE_DATA(nid);

[PATCH 2/3] iopmem : Add a block device driver for PCIe attached IO memory.

2016-10-18 Thread Stephen Bates
Add a new block device driver that binds to PCIe devices and turns
PCIe BARs into DAX capable block devices.

Signed-off-by: Stephen Bates 
Signed-off-by: Logan Gunthorpe 
---
 MAINTAINERS|   7 ++
 drivers/block/Kconfig  |  27 
 drivers/block/Makefile |   1 +
 drivers/block/iopmem.c | 333 +
 4 files changed, 368 insertions(+)
 create mode 100644 drivers/block/iopmem.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 1cd38a7..c379f9d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6510,6 +6510,13 @@ S:   Maintained
 F: Documentation/devicetree/bindings/iommu/
 F: drivers/iommu/

+IOPMEM BLOCK DEVICE DRIVER
+M: Stephen Bates 
+L: linux-block@vger.kernel.org
+S: Maintained
+F: drivers/block/iopmem.c
+F: Documentation/blockdev/iopmem.txt
+
 IP MASQUERADING
 M: Juanjo Ciarlante 
 S: Maintained
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 39dd30b..13ae1e7 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -537,4 +537,31 @@ config BLK_DEV_RSXX
  To compile this driver as a module, choose M here: the
  module will be called rsxx.

+config BLK_DEV_IOPMEM
+   tristate "Persistent block device backed by PCIe Memory"
+   depends on ZONE_DEVICE
+   default n
+   help
+ Say Y here if you want to include a generic device driver
+ that can create a block device from persistent PCIe attached
+ IO memory.
+
+ To compile this driver as a module, choose M here: The
+ module will be called iopmem. A block device will be created
+ for each PCIe attached device that matches the vendor and
+ device ID as specified in the module. Alternatively this
+ driver can be bound to any arbitrary PCIe function using the
+ sysfs bind entry.
+
+ This block device supports direct access (DAX) file systems
+ and supports struct page backing for the IO Memory. This
+ makes the underlying memory suitable for things like RDMA
+ Memory Regions and Direct IO which is useful for PCIe
+ peer-to-peer DMA operations.
+
+ Note that persistence is only assured if the memory on the
+ PCIe card has some form of power loss protection. This could
+ be provided via some form of battery, a supercap/NAND combo
+ or some exciting new persistent memory technology.
+
 endif # BLK_DEV
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 1e9661e..1f4f69b 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -41,6 +41,7 @@ obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX)+= mtip32xx/
 obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
 obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk.o
 obj-$(CONFIG_ZRAM) += zram/
+obj-$(CONFIG_BLK_DEV_IOPMEM)   += iopmem.o

 skd-y  := skd_main.o
 swim_mod-y := swim.o swim_asm.o
diff --git a/drivers/block/iopmem.c b/drivers/block/iopmem.c
new file mode 100644
index 000..4a1e693
--- /dev/null
+++ b/drivers/block/iopmem.c
@@ -0,0 +1,333 @@
+/*
+ * IOPMEM Block Device Driver
+ * Copyright (c) 2016, Microsemi Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * This driver is heavily based on drivers/block/pmem.c.
+ * Copyright (c) 2014, Intel Corporation.
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static const int BAR_ID = 4;
+
+static struct pci_device_id iopmem_id_table[] = {
+   { PCI_DEVICE(0x11f8, 0xf115) },
+   { 0, }
+};
+MODULE_DEVICE_TABLE(pci, iopmem_id_table);
+
+struct iopmem_device {
+   struct request_queue *queue;
+   struct gendisk *disk;
+   struct device *dev;
+
+   int instance;
+
+   /* One contiguous memory region per device */
+   phys_addr_t phys_addr;
void *virt_addr;
+   size_t  size;
+};
+
+  /*
+   * We can only access the iopmem device with full 32-bit word
+   * accesses which cannot be guaranteed by the regular memcpy
+   */
+
+static void memcpy_from_iopmem(void *dst, const void *src, size_t sz)
+{
+   u64 *wdst = dst;
+   const u64 *wsrc = src;
+   u64 tmp;
+
+   while (sz >= sizeof(*wdst)) {
+   *wdst++ = *wsrc++;
+   sz -= sizeof(*wdst);
+   }
+
+   if (!sz)
+   return;
+
+   tmp = *wsrc;
+   memcpy(wdst, &tmp, sz);
+}
+
+static 

[PATCH v3 02/11] blk-mq: Introduce blk_mq_hctx_stopped()

2016-10-18 Thread Bart Van Assche
Multiple functions test the BLK_MQ_S_STOPPED bit so introduce
a helper function that performs this test.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Sagi Grimberg 
Cc: Johannes Thumshirn 
---
 block/blk-mq.c | 12 ++--
 drivers/md/dm-rq.c |  2 +-
 include/linux/blk-mq.h |  5 +
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b5dcafb..b52b3a6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -787,7 +787,7 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx 
*hctx)
struct list_head *dptr;
int queued;
 
-   if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
+   if (unlikely(blk_mq_hctx_stopped(hctx)))
return;
 
WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
@@ -912,8 +912,8 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 {
-   if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state) ||
-   !blk_mq_hw_queue_mapped(hctx)))
+   if (unlikely(blk_mq_hctx_stopped(hctx) ||
+!blk_mq_hw_queue_mapped(hctx)))
return;
 
if (!async && !(hctx->flags & BLK_MQ_F_BLOCKING)) {
@@ -938,7 +938,7 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool 
async)
queue_for_each_hw_ctx(q, hctx, i) {
if ((!blk_mq_hctx_has_pending(hctx) &&
list_empty_careful(>dispatch)) ||
-   test_bit(BLK_MQ_S_STOPPED, >state))
+   blk_mq_hctx_stopped(hctx))
continue;
 
blk_mq_run_hw_queue(hctx, async);
@@ -988,7 +988,7 @@ void blk_mq_start_stopped_hw_queues(struct request_queue 
*q, bool async)
int i;
 
queue_for_each_hw_ctx(q, hctx, i) {
-   if (!test_bit(BLK_MQ_S_STOPPED, &hctx->state))
+   if (!blk_mq_hctx_stopped(hctx))
continue;
 
clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
@@ -1332,7 +1332,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue 
*q, struct bio *bio)
blk_mq_put_ctx(data.ctx);
if (!old_rq)
goto done;
-   if (test_bit(BLK_MQ_S_STOPPED, &data.hctx->state) ||
+   if (blk_mq_hctx_stopped(data.hctx) ||
blk_mq_direct_issue_request(old_rq, &cookie) != 0)
blk_mq_insert_request(old_rq, false, true, true);
goto done;
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index dc75bea..76d1666 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -909,7 +909,7 @@ static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 * hctx that it really shouldn't.  The following check guards
 * against this rarity (albeit _not_ race-free).
 */
-   if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
+   if (unlikely(blk_mq_hctx_stopped(hctx)))
return BLK_MQ_RQ_QUEUE_BUSY;
 
if (ti->type->busy && ti->type->busy(ti))
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 535ab2e..bb000c3 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -239,6 +239,11 @@ int blk_mq_reinit_tagset(struct blk_mq_tag_set *set);
 
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
+static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
+{
+   return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
+}
+
 /*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
-- 
2.10.1



Re: [PATCH v3 0/11] Fix race conditions related to stopping block layer queues

2016-10-18 Thread Bart Van Assche

On 10/18/2016 02:48 PM, Bart Van Assche wrote:

- blk_mq_quiesce_queue() has been reworked (thanks to Ming Lin and Sagi
   for their feedback).


(replying to my own e-mail)

A correction: Ming Lei provided feedback on v2 of this patch series 
instead of Ming Lin.


Bart.


[PATCH v3 10/11] nvme: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code

2016-10-18 Thread Bart Van Assche
Make nvme_requeue_req() check BLK_MQ_S_STOPPED instead of
QUEUE_FLAG_STOPPED. Remove the QUEUE_FLAG_STOPPED manipulations
that became superfluous because of this change. This patch fixes
a race condition: using queue_flag_clear_unlocked() is not safe
if any other function that manipulates the queue flags can be
called concurrently, e.g. blk_cleanup_queue().

Signed-off-by: Bart Van Assche 
Cc: Keith Busch 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
---
 drivers/nvme/host/core.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e4a6f2d..18a265d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -205,7 +205,7 @@ void nvme_requeue_req(struct request *req)
 
blk_mq_requeue_request(req, false);
spin_lock_irqsave(req->q->queue_lock, flags);
-   if (!blk_queue_stopped(req->q))
+   if (!blk_mq_queue_stopped(req->q))
blk_mq_kick_requeue_list(req->q);
spin_unlock_irqrestore(req->q->queue_lock, flags);
 }
@@ -2077,10 +2077,6 @@ void nvme_stop_queues(struct nvme_ctrl *ctrl)
 
mutex_lock(&ctrl->namespaces_mutex);
list_for_each_entry(ns, &ctrl->namespaces, list) {
-   spin_lock_irq(ns->queue->queue_lock);
-   queue_flag_set(QUEUE_FLAG_STOPPED, ns->queue);
-   spin_unlock_irq(ns->queue->queue_lock);
-
blk_mq_cancel_requeue_work(ns->queue);
blk_mq_stop_hw_queues(ns->queue);
}
@@ -2094,7 +2090,6 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
 
mutex_lock(&ctrl->namespaces_mutex);
list_for_each_entry(ns, &ctrl->namespaces, list) {
-   queue_flag_clear_unlocked(QUEUE_FLAG_STOPPED, ns->queue);
blk_mq_start_stopped_hw_queues(ns->queue, true);
blk_mq_kick_requeue_list(ns->queue);
}
-- 
2.10.1



[PATCH v3 09/11] SRP transport, scsi-mq: Wait for .queue_rq() if necessary

2016-10-18 Thread Bart Van Assche
Rename srp_wait_for_queuecommand() into scsi_wait_for_queuecommand().
Ensure that, if scsi-mq is enabled, scsi_wait_for_queuecommand()
waits until ongoing shost->hostt->queuecommand() calls have finished.

Signed-off-by: Bart Van Assche 
Cc: James Bottomley 
Cc: Martin K. Petersen 
Cc: Doug Ledford 
---
 drivers/scsi/scsi_lib.c | 20 +++-
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index a5a1b5d..b7e9662 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2724,8 +2724,6 @@ EXPORT_SYMBOL_GPL(sdev_evt_send_simple);
 /**
  * scsi_request_fn_active() - number of kernel threads inside scsi_request_fn()
  * @shost: SCSI host for which to count the number of scsi_request_fn() 
callers.
- *
- * To do: add support for scsi-mq in this function.
  */
 static int scsi_request_fn_active(struct Scsi_Host *shost)
 {
@@ -2744,11 +2742,19 @@ static int scsi_request_fn_active(struct Scsi_Host 
*shost)
return request_fn_active;
 }
 
+static void scsi_mq_wait_for_queuecommand(struct Scsi_Host *shost)
+{
+   struct scsi_device *sdev;
+
+   shost_for_each_device(sdev, shost)
+   blk_mq_quiesce_queue(sdev->request_queue);
+}
+
 /**
  * scsi_wait_for_queuecommand() - wait for ongoing queuecommand() calls
  *
  * Wait until the ongoing shost->hostt->queuecommand() calls that are
- * invoked from scsi_request_fn() have finished.
+ * invoked from either scsi_request_fn() or scsi_queue_rq() have finished.
  *
  * To do: avoid that scsi_send_eh_cmnd() calls queuecommand() after
 * scsi_internal_device_block() has blocked a SCSI device and also remove
 * the rport mutex lock and unlock calls from srp_queuecommand().
*shost)
  */
 void scsi_wait_for_queuecommand(struct Scsi_Host *shost)
 {
-   while (scsi_request_fn_active(shost))
-   msleep(20);
+   if (shost->use_blk_mq) {
+   scsi_mq_wait_for_queuecommand(shost);
+   } else {
+   while (scsi_request_fn_active(shost))
+   msleep(20);
+   }
 }
 EXPORT_SYMBOL(scsi_wait_for_queuecommand);
 
-- 
2.10.1



[PATCH v3 08/11] SRP transport: Move queuecommand() wait code to SCSI core

2016-10-18 Thread Bart Van Assche
Additionally, add a comment about the queuecommand() call from
scsi_send_eh_cmnd().

Signed-off-by: Bart Van Assche 
Cc: James Bottomley 
Cc: Martin K. Petersen 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Doug Ledford 
---
 drivers/scsi/scsi_lib.c   | 40 +++
 drivers/scsi/scsi_transport_srp.c | 35 ++
 include/scsi/scsi_host.h  |  1 +
 3 files changed, 43 insertions(+), 33 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index ab5b06f..a5a1b5d 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2722,6 +2722,46 @@ void sdev_evt_send_simple(struct scsi_device *sdev,
 EXPORT_SYMBOL_GPL(sdev_evt_send_simple);
 
 /**
+ * scsi_request_fn_active() - number of kernel threads inside scsi_request_fn()
+ * @shost: SCSI host for which to count the number of scsi_request_fn() 
callers.
+ *
+ * To do: add support for scsi-mq in this function.
+ */
+static int scsi_request_fn_active(struct Scsi_Host *shost)
+{
+   struct scsi_device *sdev;
+   struct request_queue *q;
+   int request_fn_active = 0;
+
+   shost_for_each_device(sdev, shost) {
+   q = sdev->request_queue;
+
+   spin_lock_irq(q->queue_lock);
+   request_fn_active += q->request_fn_active;
+   spin_unlock_irq(q->queue_lock);
+   }
+
+   return request_fn_active;
+}
+
+/**
+ * scsi_wait_for_queuecommand() - wait for ongoing queuecommand() calls
+ *
+ * Wait until the ongoing shost->hostt->queuecommand() calls that are
+ * invoked from scsi_request_fn() have finished.
+ *
+ * To do: avoid that scsi_send_eh_cmnd() calls queuecommand() after
+ * scsi_internal_device_block() has blocked a SCSI device and also remove
+ * the rport mutex lock and unlock calls from srp_queuecommand().
+ */
+void scsi_wait_for_queuecommand(struct Scsi_Host *shost)
+{
+   while (scsi_request_fn_active(shost))
+   msleep(20);
+}
+EXPORT_SYMBOL(scsi_wait_for_queuecommand);
+
+/**
  * scsi_device_quiesce - Block user issued commands.
  * @sdev:  scsi device to quiesce.
  *
diff --git a/drivers/scsi/scsi_transport_srp.c 
b/drivers/scsi/scsi_transport_srp.c
index e3cd3ec..8b190dc 100644
--- a/drivers/scsi/scsi_transport_srp.c
+++ b/drivers/scsi/scsi_transport_srp.c
@@ -24,7 +24,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
@@ -402,36 +401,6 @@ static void srp_reconnect_work(struct work_struct *work)
}
 }
 
-/**
- * scsi_request_fn_active() - number of kernel threads inside scsi_request_fn()
- * @shost: SCSI host for which to count the number of scsi_request_fn() 
callers.
- *
- * To do: add support for scsi-mq in this function.
- */
-static int scsi_request_fn_active(struct Scsi_Host *shost)
-{
-   struct scsi_device *sdev;
-   struct request_queue *q;
-   int request_fn_active = 0;
-
-   shost_for_each_device(sdev, shost) {
-   q = sdev->request_queue;
-
-   spin_lock_irq(q->queue_lock);
-   request_fn_active += q->request_fn_active;
-   spin_unlock_irq(q->queue_lock);
-   }
-
-   return request_fn_active;
-}
-
-/* Wait until ongoing shost->hostt->queuecommand() calls have finished. */
-static void srp_wait_for_queuecommand(struct Scsi_Host *shost)
-{
-   while (scsi_request_fn_active(shost))
-   msleep(20);
-}
-
 static void __rport_fail_io_fast(struct srp_rport *rport)
 {
struct Scsi_Host *shost = rport_to_shost(rport);
@@ -446,7 +415,7 @@ static void __rport_fail_io_fast(struct srp_rport *rport)
/* Involve the LLD if possible to terminate all I/O on the rport. */
i = to_srp_internal(shost->transportt);
if (i->f->terminate_rport_io) {
-   srp_wait_for_queuecommand(shost);
+   scsi_wait_for_queuecommand(shost);
i->f->terminate_rport_io(rport);
}
 }
@@ -576,7 +545,7 @@ int srp_reconnect_rport(struct srp_rport *rport)
if (res)
goto out;
scsi_target_block(&shost->shost_gendev);
-   srp_wait_for_queuecommand(shost);
+   scsi_wait_for_queuecommand(shost);
res = rport->state != SRP_RPORT_LOST ? i->f->reconnect(rport) : -ENODEV;
pr_debug("%s (state %d): transport.reconnect() returned %d\n",
dev_name(&shost->shost_gendev), rport->state, res);
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 7e4cd53..0e2c361 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -789,6 +789,7 @@ extern void scsi_remove_host(struct Scsi_Host *);
 extern struct Scsi_Host *scsi_host_get(struct Scsi_Host *);
 extern void scsi_host_put(struct Scsi_Host *t);
 extern struct Scsi_Host *scsi_host_lookup(unsigned short);
+extern void 

[PATCH v3 07/11] dm: Fix a race condition related to stopping and starting queues

2016-10-18 Thread Bart Van Assche
Ensure that all ongoing dm_mq_queue_rq() and dm_mq_requeue_request()
calls have stopped before setting the "queue stopped" flag. This
allows the "queue stopped" test to be removed from dm_mq_queue_rq() and
dm_mq_requeue_request(). This patch fixes a race condition because
dm_mq_queue_rq() is called without holding the queue lock and hence
BLK_MQ_S_STOPPED can be set at any time while dm_mq_queue_rq() is
in progress. It also prevents the following hang, which occurs
sporadically when using dm-mq:

INFO: task systemd-udevd:10111 blocked for more than 480 seconds.
Call Trace:
 [] schedule+0x37/0x90
 [] schedule_timeout+0x27f/0x470
 [] io_schedule_timeout+0x9f/0x110
 [] bit_wait_io+0x16/0x60
 [] __wait_on_bit_lock+0x49/0xa0
 [] __lock_page+0xb9/0xc0
 [] truncate_inode_pages_range+0x3e0/0x760
 [] truncate_inode_pages+0x10/0x20
 [] kill_bdev+0x30/0x40
 [] __blkdev_put+0x71/0x360
 [] blkdev_put+0x49/0x170
 [] blkdev_close+0x20/0x30
 [] __fput+0xe8/0x1f0
 [] fput+0x9/0x10
 [] task_work_run+0x83/0xb0
 [] do_exit+0x3ee/0xc40
 [] do_group_exit+0x4b/0xc0
 [] get_signal+0x2ca/0x940
 [] do_signal+0x23/0x660
 [] exit_to_usermode_loop+0x73/0xb0
 [] syscall_return_slowpath+0xb0/0xc0
 [] entry_SYSCALL_64_fastpath+0xa6/0xa8

Signed-off-by: Bart Van Assche 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Johannes Thumshirn 
Cc: Mike Snitzer 
---
 drivers/md/dm-rq.c | 13 ++---
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 9c34606..107ed19 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -105,6 +105,8 @@ static void dm_mq_stop_queue(struct request_queue *q)
/* Avoid that requeuing could restart the queue. */
blk_mq_cancel_requeue_work(q);
blk_mq_stop_hw_queues(q);
+   /* Wait until dm_mq_queue_rq() has finished. */
+   blk_mq_quiesce_queue(q);
 }
 
 void dm_stop_queue(struct request_queue *q)
@@ -887,17 +889,6 @@ static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
dm_put_live_table(md, srcu_idx);
}
 
-   /*
-* On suspend dm_stop_queue() handles stopping the blk-mq
-* request_queue BUT: even though the hw_queues are marked
-* BLK_MQ_S_STOPPED at that point there is still a race that
-* is allowing block/blk-mq.c to call ->queue_rq against a
-* hctx that it really shouldn't.  The following check guards
-* against this rarity (albeit _not_ race-free).
-*/
-   if (unlikely(blk_mq_hctx_stopped(hctx)))
-   return BLK_MQ_RQ_QUEUE_BUSY;
-
if (ti->type->busy && ti->type->busy(ti))
return BLK_MQ_RQ_QUEUE_BUSY;
 
-- 
2.10.1



[PATCH v3 06/11] dm: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code

2016-10-18 Thread Bart Van Assche
Instead of manipulating both QUEUE_FLAG_STOPPED and BLK_MQ_S_STOPPED
in the dm start and stop queue functions, only manipulate the latter
flag.

Signed-off-by: Bart Van Assche 
Cc: Mike Snitzer 
---
 drivers/md/dm-rq.c | 18 ++
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index d5cec26..9c34606 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -75,12 +75,6 @@ static void dm_old_start_queue(struct request_queue *q)
 
 static void dm_mq_start_queue(struct request_queue *q)
 {
-   unsigned long flags;
-
-   spin_lock_irqsave(q->queue_lock, flags);
-   queue_flag_clear(QUEUE_FLAG_STOPPED, q);
-   spin_unlock_irqrestore(q->queue_lock, flags);
-
blk_mq_start_stopped_hw_queues(q, true);
blk_mq_kick_requeue_list(q);
 }
@@ -105,16 +99,8 @@ static void dm_old_stop_queue(struct request_queue *q)
 
 static void dm_mq_stop_queue(struct request_queue *q)
 {
-   unsigned long flags;
-
-   spin_lock_irqsave(q->queue_lock, flags);
-   if (blk_queue_stopped(q)) {
-   spin_unlock_irqrestore(q->queue_lock, flags);
+   if (blk_mq_queue_stopped(q))
return;
-   }
-
-   queue_flag_set(QUEUE_FLAG_STOPPED, q);
-   spin_unlock_irqrestore(q->queue_lock, flags);
 
/* Avoid that requeuing could restart the queue. */
blk_mq_cancel_requeue_work(q);
@@ -341,7 +327,7 @@ static void __dm_mq_kick_requeue_list(struct request_queue 
*q, unsigned long mse
unsigned long flags;
 
spin_lock_irqsave(q->queue_lock, flags);
-   if (!blk_queue_stopped(q))
+   if (!blk_mq_queue_stopped(q))
blk_mq_delay_kick_requeue_list(q, msecs);
spin_unlock_irqrestore(q->queue_lock, flags);
 }
-- 
2.10.1



[PATCH v3 01/11] blk-mq: Do not invoke .queue_rq() for a stopped queue

2016-10-18 Thread Bart Van Assche
The meaning of the BLK_MQ_S_STOPPED flag is "do not call
.queue_rq()". Hence modify blk_mq_make_request() such that requests
are queued instead of issued if a queue has been stopped.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Sagi Grimberg 
Cc: Johannes Thumshirn 
Cc: 
---
 block/blk-mq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ddc2eed..b5dcafb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1332,9 +1332,9 @@ static blk_qc_t blk_mq_make_request(struct request_queue 
*q, struct bio *bio)
blk_mq_put_ctx(data.ctx);
if (!old_rq)
goto done;
-   if (!blk_mq_direct_issue_request(old_rq, &cookie))
-   goto done;
-   blk_mq_insert_request(old_rq, false, true, true);
+   if (test_bit(BLK_MQ_S_STOPPED, &data.hctx->state) ||
+   blk_mq_direct_issue_request(old_rq, &cookie) != 0)
+   blk_mq_insert_request(old_rq, false, true, true);
goto done;
}
 
-- 
2.10.1



Re: [PATCH v8 6/7] sd: Implement support for ZBC devices

2016-10-18 Thread Shaun Tancheff
On Tue, Oct 18, 2016 at 11:58 AM, Jeff Moyer  wrote:
> Damien Le Moal  writes:
>
>> + if (!is_power_of_2(zone_blocks)) {
>> + if (sdkp->first_scan)
>> + sd_printk(KERN_NOTICE, sdkp,
>> +   "Devices with non power of 2 zone "
>> +   "size are not supported\n");
>> + return -ENODEV;
>> + }
>
> Are power of 2 zone sizes required by the standard?  I see why you've
> done this, but I wonder if we're artificially limiting the
> implementation, and whether there will be valid devices on the market
> that simply won't work with Linux because of this.

The standard does not require power-of-2 zone sizes.
That said, I am not aware of any current (or planned) devices that use
anything other than a power of 2.
Common zone sizes I am aware of are 256 MiB, 128 MiB and 1 GiB.

Also note that the runt zone is excluded from the power-of-2 expectation.

So conforming devices should (excluding a runt zone), as the sketch below
illustrates:
  - Have zones of the same size.
  - Use a zone size that is a power of 2.
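
A minimal sketch of that check, assuming a zone size expressed in logical
blocks (the function name and types are illustrative, not the driver's
actual code):

#include <linux/log2.h>
#include <linux/types.h>

/*
 * Illustrative only: accept a layout where all zones share one
 * power-of-2 size and only a smaller runt zone at the end of the
 * device is allowed to differ.
 */
static bool example_zone_layout_ok(u64 zone_blocks, u64 runt_zone_blocks)
{
	if (!zone_blocks || !is_power_of_2(zone_blocks))
		return false;

	/* A runt zone, if present, must simply be smaller than a full zone. */
	return runt_zone_blocks < zone_blocks;
}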

--Shaun

> -Jeff


Re: [PATCH v8 6/7] sd: Implement support for ZBC devices

2016-10-18 Thread Jeff Moyer
Damien Le Moal  writes:

> + if (!is_power_of_2(zone_blocks)) {
> + if (sdkp->first_scan)
> + sd_printk(KERN_NOTICE, sdkp,
> +   "Devices with non power of 2 zone "
> +   "size are not supported\n");
> + return -ENODEV;
> + }

Are power of 2 zone sizes required by the standard?  I see why you've
done this, but I wonder if we're artificially limiting the
implementation, and whether there will be valid devices on the market
that simply won't work with Linux because of this.

-Jeff


Re: [PATCH v8 2/7] blk-sysfs: Add 'chunk_sectors' to sysfs attributes

2016-10-18 Thread Jeff Moyer
Damien Le Moal  writes:

> diff --git a/Documentation/ABI/testing/sysfs-block 
> b/Documentation/ABI/testing/sysfs-block
> index 75a5055..ee2d5cd 100644
> --- a/Documentation/ABI/testing/sysfs-block
> +++ b/Documentation/ABI/testing/sysfs-block
> @@ -251,3 +251,16 @@ Description:
>   since drive-managed zoned block devices do not support
>   zone commands, they will be treated as regular block
>   devices and zoned will report "none".
> +
> +What:/sys/block/<disk>/queue/chunk_sectors
> +Date:September 2016
> +Contact: Hannes Reinecke 
> +Description:
> + chunk_sectors has different meaning depending on the type
> + of the disk. For a RAID device (dm-raid), chunk_sectors
> + indicates the size in 512B sectors of the RAID volume
> + stripe segment. For a zoned block device, either
> + host-aware or host-managed, chunk_sectors indicates the
> + size of 512B sectors of the zones of the device, with
 ^^
 in


Re: [PATCH v8 0/7] ZBC / Zoned block device support

2016-10-18 Thread Jens Axboe

On 10/18/2016 12:40 AM, Damien Le Moal wrote:

This series introduces support for zoned block devices. It integrates
earlier submissions by Hannes Reinecke and Shaun Tancheff. Compared to the
previous series version, the code was significantly simplified by limiting
support to zoned devices satisfying the following conditions:
1) All zones of the device are the same size, with the possible exception
   of a smaller runt zone at the end.
2) For host-managed disks, reads must be unrestricted (read commands do not
   fail due to zone or write pointer alignment constraints).
Zoned disks that do not satisfy these 2 conditions are ignored.

These 2 conditions allowed dropping the zone information cache implemented
in the previous version. This simplifies the code and also reduces the memory
consumption at run time. Support for zoned devices now only requires one bit
per zone (less than 8KB in total). This bit field is used to write-lock
zones and prevent the concurrent execution of multiple write commands in
the same zone. This avoids write ordering problems at dispatch time, for
both the simple queue and scsi-mq settings.

The new operations introduced to support zone manipulation were reduced to
the two main ZBC/ZAC defined commands: REPORT ZONES (REQ_OP_ZONE_REPORT)
and RESET WRITE POINTER (REQ_OP_ZONE_RESET). This brings the total number of
operations defined to 8, which fits in the 3 bits (REQ_OP_BITS) reserved for
operation code in bio->bi_opf and req->cmd_flags.

Most of the ZBC specific code is kept out of sd.c and implemented in the
new file sd_zbc.c. Similarly, at the block layer, most of the zoned block
device code is implemented in the new blk-zoned.c.

For host-managed zoned block devices, the sequential write constraint of
write pointer zones is exposed to the user. Users of the disk (applications,
file systems or device mappers) must sequentially write to zones. This means
that for raw block device accesses from applications, buffered writes are
unreliable and direct I/Os must be used (or buffered writes with O_SYNC).

Access to zone manipulation operations is also provided to applications
through a set of new ioctls. This allows applications operating on raw
block devices (e.g. mkfs.xxx) to discover a device zone layout and
manipulate zone state.


This is starting to look mergeable to me. Any objections to getting this
applied for 4.10? Looks like 6/7 should go through the SCSI tree, but I
can queue up the rest.

--
Jens Axboe



[PATCH 3/3 v2] md: unblock array if bad blocks have been acknowledged

2016-10-18 Thread Tomasz Majchrzak
Once the external metadata handler acknowledges all bad blocks (by writing
to the rdev 'bad_blocks' sysfs file), it requests that the array be
unblocked. Check whether all bad blocks have actually been acknowledged,
as there is a race if new bad blocks are reported at the same time. If all
bad blocks are acknowledged, unblock the array and continue. If not, ignore
the unblock request (do not fail the array). The external metadata handler
is expected to either process the remaining bad blocks and try to unblock
again, or remove bad block support for the disk (which will cause the disk
to fail, as in the no-support case).
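
A minimal sketch of the handler side of this handshake, assuming the usual
md rdev sysfs layout under /sys/block/mdX/md/dev-YYY/ (the paths, the
example bad-block range and the retry policy are illustrative):

#include <stdio.h>

/* Write a string to a sysfs attribute; returns 0 on success. */
static int write_sysfs(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	const char *dir = "/sys/block/md0/md/dev-sda"; /* example member */
	char path[256];

	/* Acknowledge one bad-block range: "<first_sector> <length>". */
	snprintf(path, sizeof(path), "%s/bad_blocks", dir);
	write_sysfs(path, "2048 8");

	/*
	 * Request unblock; with this patch md only honours the request
	 * once every known bad block has been acknowledged, so a handler
	 * may need to re-check and retry.
	 */
	snprintf(path, sizeof(path), "%s/state", dir);
	return write_sysfs(path, "-blocked") ? 1 : 0;
}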

Signed-off-by: Tomasz Majchrzak 
Reviewed-by: Artur Paszkiewicz 
---
 drivers/md/md.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index cc05236..ce585b7 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2612,19 +2612,29 @@ state_store(struct md_rdev *rdev, const char *buf, 
size_t len)
set_bit(Blocked, &rdev->flags);
err = 0;
} else if (cmd_match(buf, "-blocked")) {
-   if (!test_bit(Faulty, &rdev->flags) &&
+   int unblock = 1;
+   int acked = !rdev->badblocks.unacked_exist;
+
+   if ((test_bit(ExternalBbl, &rdev->flags) &&
+rdev->badblocks.changed))
+   acked = check_if_badblocks_acked(&rdev->badblocks);
+
+   if (test_bit(ExternalBbl, &rdev->flags) && !acked) {
+   unblock = 0;
+   } else if (!test_bit(Faulty, &rdev->flags) &&
rdev->badblocks.unacked_exist) {
/* metadata handler doesn't understand badblocks,
 * so we need to fail the device
 */
md_error(rdev->mddev, rdev);
}
-   clear_bit(Blocked, &rdev->flags);
-   clear_bit(BlockedBadBlocks, &rdev->flags);
-   wake_up(&rdev->blocked_wait);
-   set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
-   md_wakeup_thread(rdev->mddev->thread);
-
+   if (unblock) {
+   clear_bit(Blocked, &rdev->flags);
+   clear_bit(BlockedBadBlocks, &rdev->flags);
+   wake_up(&rdev->blocked_wait);
+   set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
+   md_wakeup_thread(rdev->mddev->thread);
+   }
err = 0;
} else if (cmd_match(buf, "insync") && rdev->raid_disk == -1) {
set_bit(In_sync, &rdev->flags);
-- 
1.8.3.1



Re: [PATCH v8 1/7] block: Add 'zoned' queue limit

2016-10-18 Thread Hannes Reinecke
On 10/18/2016 08:40 AM, Damien Le Moal wrote:
> From: Damien Le Moal 
> 
> Add the zoned queue limit to indicate the zoning model of a block device.
> Defined values are 0 (BLK_ZONED_NONE) for regular block devices,
> 1 (BLK_ZONED_HA) for host-aware zone block devices and 2 (BLK_ZONED_HM)
> for host-managed zone block devices. The drive-managed model defined by
> the standards is not represented here since such block devices do not
> provide any command for accessing zone information. Drive-managed devices will
> be reported as BLK_ZONED_NONE.
> 
> The helper functions blk_queue_zoned_model and bdev_zoned_model return
> the zoned limit and the functions blk_queue_is_zoned and bdev_is_zoned
> return a boolean for callers to test if a block device is zoned.
> 
> The zoned attribute is also exported as a string to applications via
> sysfs. BLK_ZONED_NONE shows as "none", BLK_ZONED_HA as "host-aware" and
> BLK_ZONED_HM as "host-managed".
> 
> Signed-off-by: Damien Le Moal 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Martin K. Petersen 
> Reviewed-by: Shaun Tancheff 
> Tested-by: Shaun Tancheff 
> ---
>  Documentation/ABI/testing/sysfs-block | 16 
>  block/blk-settings.c  |  1 +
>  block/blk-sysfs.c | 18 ++
>  include/linux/blkdev.h| 47 
> +++
>  4 files changed, 82 insertions(+)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


[PATCH v8 4/7] block: Define zoned block device operations

2016-10-18 Thread Damien Le Moal
From: Shaun Tancheff 

Define REQ_OP_ZONE_REPORT and REQ_OP_ZONE_RESET for handling zones of
host-managed and host-aware zoned block devices. With these two
new operations, the total number of operations defined reaches 8 and
still fits within the 3-bit definition of REQ_OP_BITS.

Signed-off-by: Shaun Tancheff 
Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 block/blk-core.c  | 4 
 include/linux/blk_types.h | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 14d7c07..e4eda5d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1941,6 +1941,10 @@ generic_make_request_checks(struct bio *bio)
case REQ_OP_WRITE_SAME:
if (!bdev_write_same(bio->bi_bdev))
goto not_supported;
+   case REQ_OP_ZONE_REPORT:
+   case REQ_OP_ZONE_RESET:
+   if (!bdev_is_zoned(bio->bi_bdev))
+   goto not_supported;
break;
default:
break;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cd395ec..dd50dce 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -243,6 +243,8 @@ enum req_op {
REQ_OP_SECURE_ERASE,/* request to securely erase sectors */
REQ_OP_WRITE_SAME,  /* write same block many times */
REQ_OP_FLUSH,   /* request for cache flush */
+   REQ_OP_ZONE_REPORT, /* Get zone information */
+   REQ_OP_ZONE_RESET,  /* Reset a zone write pointer */
 };
 
 #define REQ_OP_BITS 3
-- 
2.7.4



[PATCH v8 5/7] block: Implement support for zoned block devices

2016-10-18 Thread Damien Le Moal
From: Hannes Reinecke 

Implement zoned block device zone information reporting and reset.
Zone information is reported as struct blk_zone. This implementation
does not differentiate between host-aware and host-managed device
models and is valid for both. Two functions are provided:
blkdev_report_zones for discovering the zone configuration of a
zoned block device, and blkdev_reset_zones for resetting the write
pointer of sequential zones. The helper functions blk_queue_zone_size
and bdev_zone_size are also provided to obtain, as their names suggest,
the zone size (in 512B sectors) of the zones of the device.
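
For reference, a minimal sketch of how a kernel-side consumer might use
these two exports (the function is illustrative; eight on-stack zone
descriptors is an arbitrary choice for the example):

#include <linux/blkdev.h>
#include <linux/kernel.h>

/*
 * Illustrative only: report the first few zones of a device and reset
 * the write pointer of the first sequential zone found.
 */
static int example_reset_first_seq_zone(struct block_device *bdev)
{
	struct blk_zone zones[8];
	unsigned int i, nr_zones = ARRAY_SIZE(zones);
	int ret;

	ret = blkdev_report_zones(bdev, 0, zones, &nr_zones, GFP_KERNEL);
	if (ret)
		return ret;

	for (i = 0; i < nr_zones; i++) {
		if (zones[i].type == BLK_ZONE_TYPE_CONVENTIONAL)
			continue;
		/* Reset the whole zone: start sector, length in sectors. */
		return blkdev_reset_zones(bdev, zones[i].start,
					  zones[i].len, GFP_KERNEL);
	}

	return 0;
}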

Signed-off-by: Hannes Reinecke 

[Damien: * Removed the zone cache
 * Implement report zones operation based on earlier proposal
   by Shaun Tancheff ]
Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Shaun Tancheff 
Tested-by: Shaun Tancheff 
---
 block/Kconfig |   8 ++
 block/Makefile|   1 +
 block/blk-zoned.c | 257 ++
 include/linux/blkdev.h|  31 +
 include/uapi/linux/Kbuild |   1 +
 include/uapi/linux/blkzoned.h | 103 +
 6 files changed, 401 insertions(+)
 create mode 100644 block/blk-zoned.c
 create mode 100644 include/uapi/linux/blkzoned.h

diff --git a/block/Kconfig b/block/Kconfig
index 5136ad4..7bb9bf8 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -89,6 +89,14 @@ config BLK_DEV_INTEGRITY
T10/SCSI Data Integrity Field or the T13/ATA External Path
Protection.  If in doubt, say N.
 
+config BLK_DEV_ZONED
+   bool "Zoned block device support"
+   ---help---
+   Block layer zoned block device support. This option enables
+   support for ZAC/ZBC host-managed and host-aware zoned block devices.
+
+   Say yes here if you have a ZAC or ZBC storage device.
+
 config BLK_DEV_THROTTLING
bool "Block layer bio throttling support"
depends on BLK_CGROUP=y
diff --git a/block/Makefile b/block/Makefile
index 9eda232..aee67fa 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -22,4 +22,5 @@ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
 obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
 obj-$(CONFIG_BLK_CMDLINE_PARSER)   += cmdline-parser.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o
+obj-$(CONFIG_BLK_DEV_ZONED)+= blk-zoned.o
 
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
new file mode 100644
index 000..1603573
--- /dev/null
+++ b/block/blk-zoned.c
@@ -0,0 +1,257 @@
+/*
+ * Zoned block device handling
+ *
+ * Copyright (c) 2015, Hannes Reinecke
+ * Copyright (c) 2015, SUSE Linux GmbH
+ *
+ * Copyright (c) 2016, Damien Le Moal
+ * Copyright (c) 2016, Western Digital
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+static inline sector_t blk_zone_start(struct request_queue *q,
+ sector_t sector)
+{
+   sector_t zone_mask = blk_queue_zone_size(q) - 1;
+
+   return sector & ~zone_mask;
+}
+
+/*
+ * Check that a zone report belongs to the partition.
+ * If yes, fix its start sector and write pointer, copy it in the
+ * zone information array and return true. Return false otherwise.
+ */
+static bool blkdev_report_zone(struct block_device *bdev,
+  struct blk_zone *rep,
+  struct blk_zone *zone)
+{
+   sector_t offset = get_start_sect(bdev);
+
+   if (rep->start < offset)
+   return false;
+
+   rep->start -= offset;
+   if (rep->start + rep->len > bdev->bd_part->nr_sects)
+   return false;
+
+   if (rep->type == BLK_ZONE_TYPE_CONVENTIONAL)
+   rep->wp = rep->start + rep->len;
+   else
+   rep->wp -= offset;
+   memcpy(zone, rep, sizeof(struct blk_zone));
+
+   return true;
+}
+
+/**
+ * blkdev_report_zones - Get zones information
+ * @bdev:  Target block device
+ * @sector:Sector from which to report zones
+ * @zones: Array of zone structures where to return the zones information
+ * @nr_zones:  Number of zone structures in the zone array
+ * @gfp_mask:  Memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *Get zone information starting from the zone containing @sector.
+ *The number of zone information reported may be less than the number
+ *requested by @nr_zones. The number of zones actually reported is
+ *returned in @nr_zones.
+ */
+int blkdev_report_zones(struct block_device *bdev,
+   sector_t sector,
+   struct blk_zone *zones,
+   unsigned int *nr_zones,
+   gfp_t gfp_mask)
+{
+   struct request_queue *q = bdev_get_queue(bdev);
+   

[PATCH v8 1/7] block: Add 'zoned' queue limit

2016-10-18 Thread Damien Le Moal
From: Damien Le Moal 

Add the zoned queue limit to indicate the zoning model of a block device.
Defined values are 0 (BLK_ZONED_NONE) for regular block devices,
1 (BLK_ZONED_HA) for host-aware zone block devices and 2 (BLK_ZONED_HM)
for host-managed zone block devices. The drive-managed model defined by
the standards is not represented here since such block devices do not
provide any command for accessing zone information. Drive-managed devices will
be reported as BLK_ZONED_NONE.

The helper functions blk_queue_zoned_model and bdev_zoned_model return
the zoned limit and the functions blk_queue_is_zoned and bdev_is_zoned
return a boolean for callers to test if a block device is zoned.

The zoned attribute is also exported as a string to applications via
sysfs. BLK_ZONED_NONE shows as "none", BLK_ZONED_HA as "host-aware" and
BLK_ZONED_HM as "host-managed".
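
For example, an application could check the model with something like the
following sketch (the device name and error handling are illustrative):

#include <stdio.h>

int main(int argc, char **argv)
{
	char path[256], model[32];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/zoned",
		 argc > 1 ? argv[1] : "sda");
	f = fopen(path, "r");
	if (!f || !fgets(model, sizeof(model), f)) {
		perror(path);
		return 1;
	}
	fclose(f);

	/* Expected values: "none", "host-aware" or "host-managed". */
	printf("%s: %s", path, model);
	return 0;
}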

Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Shaun Tancheff 
Tested-by: Shaun Tancheff 
---
 Documentation/ABI/testing/sysfs-block | 16 
 block/blk-settings.c  |  1 +
 block/blk-sysfs.c | 18 ++
 include/linux/blkdev.h| 47 +++
 4 files changed, 82 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-block 
b/Documentation/ABI/testing/sysfs-block
index 71d184d..75a5055 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -235,3 +235,19 @@ Description:
write_same_max_bytes is 0, write same is not supported
by the device.
 
+What:  /sys/block/<disk>/queue/zoned
+Date:  September 2016
+Contact:   Damien Le Moal 
+Description:
+   zoned indicates if the device is a zoned block device
+   and the zone model of the device if it is indeed zoned.
+   The possible values indicated by zoned are "none" for
+   regular block devices and "host-aware" or "host-managed"
+   for zoned block devices. The characteristics of
+   host-aware and host-managed zoned block devices are
+   described in the ZBC (Zoned Block Commands) and ZAC
+   (Zoned Device ATA Command Set) standards. These standards
+   also define the "drive-managed" zone model. However,
+   since drive-managed zoned block devices do not support
+   zone commands, they will be treated as regular block
+   devices and zoned will report "none".
diff --git a/block/blk-settings.c b/block/blk-settings.c
index f679ae1..b1d5b7f 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -107,6 +107,7 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->io_opt = 0;
lim->misaligned = 0;
lim->cluster = 1;
+   lim->zoned = BLK_ZONED_NONE;
 }
 EXPORT_SYMBOL(blk_set_default_limits);
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 9cc8d7c..ff9cd9c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -257,6 +257,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
 QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
 #undef QUEUE_SYSFS_BIT_FNS
 
+static ssize_t queue_zoned_show(struct request_queue *q, char *page)
+{
+   switch (blk_queue_zoned_model(q)) {
+   case BLK_ZONED_HA:
+   return sprintf(page, "host-aware\n");
+   case BLK_ZONED_HM:
+   return sprintf(page, "host-managed\n");
+   default:
+   return sprintf(page, "none\n");
+   }
+}
+
 static ssize_t queue_nomerges_show(struct request_queue *q, char *page)
 {
return queue_var_show((blk_queue_nomerges(q) << 1) |
@@ -485,6 +497,11 @@ static struct queue_sysfs_entry queue_nonrot_entry = {
.store = queue_store_nonrot,
 };
 
+static struct queue_sysfs_entry queue_zoned_entry = {
+   .attr = {.name = "zoned", .mode = S_IRUGO },
+   .show = queue_zoned_show,
+};
+
 static struct queue_sysfs_entry queue_nomerges_entry = {
.attr = {.name = "nomerges", .mode = S_IRUGO | S_IWUSR },
.show = queue_nomerges_show,
@@ -546,6 +563,7 @@ static struct attribute *default_attrs[] = {
&queue_discard_zeroes_data_entry.attr,
&queue_write_same_max_entry.attr,
&queue_nonrot_entry.attr,
+   &queue_zoned_entry.attr,
&queue_nomerges_entry.attr,
&queue_rq_affinity_entry.attr,
&queue_iostats_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c47c358..f19e16b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -261,6 +261,15 @@ struct blk_queue_tag {
 #define BLK_SCSI_MAX_CMDS  (256)
 #define BLK_SCSI_CMD_PER_LONG  (BLK_SCSI_MAX_CMDS / (sizeof(long) * 8))
 
+/*
+ * Zoned block device models (zoned limit).
+ */
+enum blk_zoned_model {

[PATCH v8 0/7] ZBC / Zoned block device support

2016-10-18 Thread Damien Le Moal
This series introduces support for zoned block devices. It integrates
earlier submissions by Hannes Reinecke and Shaun Tancheff. Compared to the
previous series version, the code was significantly simplified by limiting
support to zoned devices satisfying the following conditions:
1) All zones of the device are the same size, with the possible exception
   of a smaller runt zone at the end.
2) For host-managed disks, reads must be unrestricted (read commands do not
   fail due to zone or write pointer alignment constraints).
Zoned disks that do not satisfy these 2 conditions are ignored.

These 2 conditions allowed dropping the zone information cache implemented
in the previous version. This simplifies the code and also reduces the memory
consumption at run time. Support for zoned devices now only requires one bit
per zone (less than 8KB in total). This bit field is used to write-lock
zones and prevent the concurrent execution of multiple write commands in
the same zone. This avoids write ordering problems at dispatch time, for
both the simple queue and scsi-mq settings.
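
As a rough illustration of this bit-field scheme (the names below are made
up for the example; the actual implementation lives in sd_zbc.c):

#include <linux/bitops.h>
#include <linux/types.h>

/*
 * Illustrative only: one lock bit per zone, taken before a write to a
 * sequential zone is dispatched and cleared when the command completes,
 * so at most one write per zone is in flight.  zone_shift is the log2
 * of the zone size in sectors.
 */
static bool example_zone_trylock(unsigned long *zone_wlock,
				 sector_t sector, unsigned int zone_shift)
{
	unsigned int zno = sector >> zone_shift;	/* zone number */

	return !test_and_set_bit(zno, zone_wlock);
}

static void example_zone_unlock(unsigned long *zone_wlock,
				sector_t sector, unsigned int zone_shift)
{
	clear_bit(sector >> zone_shift, zone_wlock);
}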

The new operations introduced to support zone manipulation were reduced to
the two main ZBC/ZAC defined commands: REPORT ZONES (REQ_OP_ZONE_REPORT)
and RESET WRITE POINTER (REQ_OP_ZONE_RESET). This brings the total number of
operations defined to 8, which fits in the 3 bits (REQ_OP_BITS) reserved for
operation code in bio->bi_opf and req->cmd_flags.

Most of the ZBC specific code is kept out of sd.c and implemented in the
new file sd_zbc.c. Similarly, at the block layer, most of the zoned block
device code is implemented in the new blk-zoned.c.

For host-managed zoned block devices, the sequential write constraint of
write pointer zones is exposed to the user. Users of the disk (applications,
file systems or device mappers) must sequentially write to zones. This means
that for raw block device accesses from applications, buffered writes are
unreliable and direct I/Os must be used (or buffered writes with O_SYNC).
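
A minimal sketch of what this means for an application writing to the raw
device (the device name, zone offset and write size are placeholders; the
device's own alignment rules still apply):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const off_t zone_start = 0;	/* byte offset of the target zone */
	const size_t chunk = 1 << 20;	/* 1 MiB, block-size aligned writes */
	void *buf;
	int fd, i;

	/* Buffered writes may be written back out of order; use O_DIRECT. */
	fd = open("/dev/sdz", O_WRONLY | O_DIRECT);	/* example device */
	if (fd < 0 || posix_memalign(&buf, 4096, chunk))
		return 1;
	memset(buf, 0, chunk);

	/* Writes must land in sequence, at the zone's write pointer. */
	for (i = 0; i < 8; i++)
		if (pwrite(fd, buf, chunk, zone_start + (off_t)i * chunk) < 0)
			return 1;

	close(fd);
	free(buf);
	return 0;
}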

Access to zone manipulation operations is also provided to applications
through a set of new ioctls. This allows applications operating on raw
block devices (e.g. mkfs.xxx) to discover a device zone layout and
manipulate zone state.

v8:
* Fixed compile time warnings (unused variable and sd_printk format)
* For unsupported host-aware drives, the zone write lock bitmap is not
  allocated, so check it before trying to use it

v7:
* Fixed problems with zone write locking:
  - Wrong sdkp->zone_wlock bitmap allocation size
  - Incorrect (reversed condition) test of lock state with test_and_set_bit
  - Potential error in sd_setup_read_write_cmnd could leave a zone locked
without the locking write command being executed

v6:
* Rebased on Jens' for-4.9/block branch (v5 is based on next-20160928)

v5:
* Changed interface of sd_zbc_setup_read_write

v4:
* Fixed several typos and tabs/spaces
* Added description of zoned and chunk_sectors queue attributes in
  Documentation/ABI/testing/sysfs-block
* Fixed sd_read_capacity call in sd.c and to avoid missing information on
  the first pass of a disk scan
* Fixed scsi_disk zone related field to use logical block size unit instead
  of 512B sector unit.

v3:
* Use kcalloc to allocate zone information array for ioctl
* Export GPL the functions blkdev_report_zones and blkdev_reset_zones
* Shuffled uapi definitions from patch 7 into patch 5

v2
* Removed zone information cache
* Limit support to drives that have unrestricted reads and a constant zone
  size that is a power of two number of LBAs
* Introduce per zone write locking to avoid write reordering for both
  blk-mq and simple queue cases

Damien Le Moal (1):
  block: Add 'zoned' queue limit

Hannes Reinecke (4):
  blk-sysfs: Add 'chunk_sectors' to sysfs attributes
  block: update chunk_sectors in blk_stack_limits()
  block: Implement support for zoned block devices
  sd: Implement support for ZBC devices

Shaun Tancheff (2):
  block: Define zoned block device operations
  blk-zoned: implement ioctls

 Documentation/ABI/testing/sysfs-block |  29 ++
 block/Kconfig |   8 +
 block/Makefile|   1 +
 block/blk-core.c  |   4 +
 block/blk-settings.c  |   5 +
 block/blk-sysfs.c |  29 ++
 block/blk-zoned.c | 350 ++
 block/ioctl.c |   4 +
 drivers/scsi/Makefile |   1 +
 drivers/scsi/sd.c | 148 ++--
 drivers/scsi/sd.h |  70 
 drivers/scsi/sd_zbc.c | 642 ++
 include/linux/blk_types.h |   2 +
 include/linux/blkdev.h|  99 ++
 include/scsi/scsi_proto.h |  17 +
 include/uapi/linux/Kbuild |   1 +
 include/uapi/linux/blkzoned.h | 143 
 include/uapi/linux/fs.h 

[PATCH v8 2/7] blk-sysfs: Add 'chunk_sectors' to sysfs attributes

2016-10-18 Thread Damien Le Moal
From: Hannes Reinecke 

The queue limits already have a 'chunk_sectors' setting, so
we should be presenting it via sysfs.

Signed-off-by: Hannes Reinecke 

[Damien: Updated Documentation/ABI/testing/sysfs-block]

Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Shaun Tancheff 
Tested-by: Shaun Tancheff 
---
 Documentation/ABI/testing/sysfs-block | 13 +
 block/blk-sysfs.c | 11 +++
 2 files changed, 24 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-block 
b/Documentation/ABI/testing/sysfs-block
index 75a5055..ee2d5cd 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -251,3 +251,16 @@ Description:
since drive-managed zoned block devices do not support
zone commands, they will be treated as regular block
devices and zoned will report "none".
+
+What:  /sys/block/<disk>/queue/chunk_sectors
+Date:  September 2016
+Contact:   Hannes Reinecke 
+Description:
+   chunk_sectors has different meaning depending on the type
+   of the disk. For a RAID device (dm-raid), chunk_sectors
+   indicates the size in 512B sectors of the RAID volume
+   stripe segment. For a zoned block device, either
+   host-aware or host-managed, chunk_sectors indicates the
+   size of 512B sectors of the zones of the device, with
+   the eventual exception of the last zone of the device
+   which may be smaller.
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index ff9cd9c..488c2e2 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -130,6 +130,11 @@ static ssize_t queue_physical_block_size_show(struct 
request_queue *q, char *pag
return queue_var_show(queue_physical_block_size(q), page);
 }
 
+static ssize_t queue_chunk_sectors_show(struct request_queue *q, char *page)
+{
+   return queue_var_show(q->limits.chunk_sectors, page);
+}
+
 static ssize_t queue_io_min_show(struct request_queue *q, char *page)
 {
return queue_var_show(queue_io_min(q), page);
@@ -455,6 +460,11 @@ static struct queue_sysfs_entry 
queue_physical_block_size_entry = {
.show = queue_physical_block_size_show,
 };
 
+static struct queue_sysfs_entry queue_chunk_sectors_entry = {
+   .attr = {.name = "chunk_sectors", .mode = S_IRUGO },
+   .show = queue_chunk_sectors_show,
+};
+
 static struct queue_sysfs_entry queue_io_min_entry = {
.attr = {.name = "minimum_io_size", .mode = S_IRUGO },
.show = queue_io_min_show,
@@ -555,6 +565,7 @@ static struct attribute *default_attrs[] = {
&queue_hw_sector_size_entry.attr,
&queue_logical_block_size_entry.attr,
&queue_physical_block_size_entry.attr,
+   &queue_chunk_sectors_entry.attr,
&queue_io_min_entry.attr,
&queue_io_opt_entry.attr,
&queue_discard_granularity_entry.attr,
-- 
2.7.4



[PATCH v8 3/7] block: update chunk_sectors in blk_stack_limits()

2016-10-18 Thread Damien Le Moal
From: Hannes Reinecke 

Signed-off-by: Hannes Reinecke 
Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Shaun Tancheff 
Tested-by: Shaun Tancheff 
---
 block/blk-settings.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index b1d5b7f..55369a6 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -631,6 +631,10 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
t->discard_granularity;
}
 
+   if (b->chunk_sectors)
+   t->chunk_sectors = min_not_zero(t->chunk_sectors,
+   b->chunk_sectors);
+
return ret;
 }
 EXPORT_SYMBOL(blk_stack_limits);
-- 
2.7.4



[PATCH v8 6/7] sd: Implement support for ZBC devices

2016-10-18 Thread Damien Le Moal
From: Hannes Reinecke 

Implement ZBC support functions to setup zoned disks, both
host-managed and host-aware models. Only zoned disks that satisfy
the following conditions are supported:
1) All zones are the same size, with the possible exception of a
   smaller runt zone at the end.
2) For host-managed disks, reads are unrestricted (reads do not
   fail due to zone or write pointer alignment constraints).
Zoned disks that do not satisfy these 2 conditions are setup with
a capacity of 0 to prevent their use.

The function sd_zbc_read_zones, called from sd_revalidate_disk,
checks that the device satisfies the above two constraints. This
function may also change the disk capacity previously set by
sd_read_capacity for devices reporting only the capacity of
conventional zones at the beginning of the LBA range (i.e. devices
reporting rc_basis set to 0).

The capacity message output was moved out of sd_read_capacity into
a new function, sd_print_capacity, to account for a possible capacity
change by sd_zbc_read_zones. This new function also includes a call
to sd_zbc_print_zones to display the number of zones and zone size
of the device.

Signed-off-by: Hannes Reinecke 

[Damien: * Removed zone cache support
 * Removed mapping of discard to reset write pointer command
 * Modified sd_zbc_read_zones to include checks that the
   device satisfies the kernel constraints
 * Implemented REPORT ZONES setup and post-processing based
   on code from Shaun Tancheff 
 * Removed confusing use of 512B sector units in functions
   interface]
Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Shaun Tancheff 
Tested-by: Shaun Tancheff 
---
 drivers/scsi/Makefile |   1 +
 drivers/scsi/sd.c | 148 ---
 drivers/scsi/sd.h |  70 +
 drivers/scsi/sd_zbc.c | 642 ++
 include/scsi/scsi_proto.h |  17 ++
 5 files changed, 843 insertions(+), 35 deletions(-)
 create mode 100644 drivers/scsi/sd_zbc.c

diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index d539798..fabcb6d 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -179,6 +179,7 @@ hv_storvsc-y:= storvsc_drv.o
 
 sd_mod-objs:= sd.o
 sd_mod-$(CONFIG_BLK_DEV_INTEGRITY) += sd_dif.o
+sd_mod-$(CONFIG_BLK_DEV_ZONED) += sd_zbc.o
 
 sr_mod-objs:= sr.o sr_ioctl.o sr_vendor.o
 ncr53c8xx-flags-$(CONFIG_SCSI_ZALON) \
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index d3e852a..e53d958 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -92,6 +92,7 @@ MODULE_ALIAS_BLOCKDEV_MAJOR(SCSI_DISK15_MAJOR);
 MODULE_ALIAS_SCSI_DEVICE(TYPE_DISK);
 MODULE_ALIAS_SCSI_DEVICE(TYPE_MOD);
 MODULE_ALIAS_SCSI_DEVICE(TYPE_RBC);
+MODULE_ALIAS_SCSI_DEVICE(TYPE_ZBC);
 
 #if !defined(CONFIG_DEBUG_BLOCK_EXT_DEVT)
 #define SD_MINORS  16
@@ -162,7 +163,7 @@ cache_type_store(struct device *dev, struct 
device_attribute *attr,
static const char temp[] = "temporary ";
int len;
 
-   if (sdp->type != TYPE_DISK)
+   if (sdp->type != TYPE_DISK && sdp->type != TYPE_ZBC)
/* no cache control on RBC devices; theoretically they
 * can do it, but there's probably so many exceptions
 * it's not worth the risk */
@@ -261,7 +262,7 @@ allow_restart_store(struct device *dev, struct 
device_attribute *attr,
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
 
-   if (sdp->type != TYPE_DISK)
+   if (sdp->type != TYPE_DISK && sdp->type != TYPE_ZBC)
return -EINVAL;
 
sdp->allow_restart = simple_strtoul(buf, NULL, 10);
@@ -391,6 +392,11 @@ provisioning_mode_store(struct device *dev, struct 
device_attribute *attr,
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
 
+   if (sd_is_zoned(sdkp)) {
+   sd_config_discard(sdkp, SD_LBP_DISABLE);
+   return count;
+   }
+
if (sdp->type != TYPE_DISK)
return -EINVAL;
 
@@ -458,7 +464,7 @@ max_write_same_blocks_store(struct device *dev, struct 
device_attribute *attr,
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
 
-   if (sdp->type != TYPE_DISK)
+   if (sdp->type != TYPE_DISK && sdp->type != TYPE_ZBC)
return -EINVAL;
 
err = kstrtoul(buf, 10, &max);
@@ -843,6 +849,12 @@ static int sd_setup_write_same_cmnd(struct scsi_cmnd *cmd)
 
BUG_ON(bio_offset(bio) || bio_iovec(bio).bv_len != sdp->sector_size);
 
+   if (sd_is_zoned(sdkp)) {
+   ret = sd_zbc_setup_write_cmnd(cmd);
+   if (ret != BLKPREP_OK)
+   return ret;
+   }
+
sector >>= ilog2(sdp->sector_size) - 9;
nr_sectors >>= ilog2(sdp->sector_size) - 9;
 
@@ -900,19 

[PATCH v8 7/7] blk-zoned: implement ioctls

2016-10-18 Thread Damien Le Moal
From: Shaun Tancheff 

Adds the new BLKREPORTZONE and BLKRESETZONE ioctls for respectively
obtaining the zone configuration of a zoned block device and resetting
the write pointer of sequential zones of a zoned block device.

The BLKREPORTZONE ioctl maps directly to a single call of the function
blkdev_report_zones. The zone information result is passed as an array
of struct blk_zone identical to the structure used internally for
processing the REQ_OP_ZONE_REPORT operation.  The BLKRESETZONE ioctl
maps to a single call of the blkdev_reset_zones function.
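
For illustration, a userspace caller might drive BLKREPORTZONE roughly as
follows (a sketch assuming the uapi header added by this patch; the device
name and error handling are simplified):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/blkzoned.h>

int main(void)
{
	unsigned int i, nr = 32;		/* ask for up to 32 zones */
	struct blk_zone_report *rep;
	int fd;

	fd = open("/dev/sdz", O_RDONLY);	/* example zoned device */
	if (fd < 0)
		return 1;

	/* The zone array directly follows the report header. */
	rep = calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
	if (!rep)
		return 1;
	rep->sector = 0;			/* report from the first zone */
	rep->nr_zones = nr;

	if (ioctl(fd, BLKREPORTZONE, rep) < 0)
		return 1;

	/* The kernel updates nr_zones to the number actually reported. */
	for (i = 0; i < rep->nr_zones; i++)
		printf("zone %u: start %llu len %llu wp %llu\n", i,
		       (unsigned long long)rep->zones[i].start,
		       (unsigned long long)rep->zones[i].len,
		       (unsigned long long)rep->zones[i].wp);

	free(rep);
	close(fd);
	return 0;
}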

Signed-off-by: Shaun Tancheff 
Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 block/blk-zoned.c | 93 +++
 block/ioctl.c |  4 ++
 include/linux/blkdev.h| 21 ++
 include/uapi/linux/blkzoned.h | 40 +++
 include/uapi/linux/fs.h   |  4 ++
 5 files changed, 162 insertions(+)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 1603573..667f95d 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -255,3 +255,96 @@ int blkdev_reset_zones(struct block_device *bdev,
return 0;
 }
 EXPORT_SYMBOL_GPL(blkdev_reset_zones);
+
+/**
+ * BLKREPORTZONE ioctl processing.
+ * Called from blkdev_ioctl.
+ */
+int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
+ unsigned int cmd, unsigned long arg)
+{
+   void __user *argp = (void __user *)arg;
+   struct request_queue *q;
+   struct blk_zone_report rep;
+   struct blk_zone *zones;
+   int ret;
+
+   if (!argp)
+   return -EINVAL;
+
+   q = bdev_get_queue(bdev);
+   if (!q)
+   return -ENXIO;
+
+   if (!blk_queue_is_zoned(q))
+   return -ENOTTY;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EACCES;
+
+   if (copy_from_user(&rep, argp, sizeof(struct blk_zone_report)))
+   return -EFAULT;
+
+   if (!rep.nr_zones)
+   return -EINVAL;
+
+   zones = kcalloc(rep.nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
+   if (!zones)
+   return -ENOMEM;
+
+   ret = blkdev_report_zones(bdev, rep.sector,
+ zones, &rep.nr_zones,
+ GFP_KERNEL);
+   if (ret)
+   goto out;
+
+   if (copy_to_user(argp, &rep, sizeof(struct blk_zone_report))) {
+   ret = -EFAULT;
+   goto out;
+   }
+
+   if (rep.nr_zones) {
+   if (copy_to_user(argp + sizeof(struct blk_zone_report), zones,
+sizeof(struct blk_zone) * rep.nr_zones))
+   ret = -EFAULT;
+   }
+
+ out:
+   kfree(zones);
+
+   return ret;
+}
+
+/**
+ * BLKRESETZONE ioctl processing.
+ * Called from blkdev_ioctl.
+ */
+int blkdev_reset_zones_ioctl(struct block_device *bdev, fmode_t mode,
+unsigned int cmd, unsigned long arg)
+{
+   void __user *argp = (void __user *)arg;
+   struct request_queue *q;
+   struct blk_zone_range zrange;
+
+   if (!argp)
+   return -EINVAL;
+
+   q = bdev_get_queue(bdev);
+   if (!q)
+   return -ENXIO;
+
+   if (!blk_queue_is_zoned(q))
+   return -ENOTTY;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EACCES;
+
+   if (!(mode & FMODE_WRITE))
+   return -EBADF;
+
+   if (copy_from_user(&zrange, argp, sizeof(struct blk_zone_range)))
+   return -EFAULT;
+
+   return blkdev_reset_zones(bdev, zrange.sector, zrange.nr_sectors,
+ GFP_KERNEL);
+}
diff --git a/block/ioctl.c b/block/ioctl.c
index ed2397f..448f78a 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -513,6 +513,10 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, 
unsigned cmd,
BLKDEV_DISCARD_SECURE);
case BLKZEROOUT:
return blk_ioctl_zeroout(bdev, mode, arg);
+   case BLKREPORTZONE:
+   return blkdev_report_zones_ioctl(bdev, mode, cmd, arg);
+   case BLKRESETZONE:
+   return blkdev_reset_zones_ioctl(bdev, mode, cmd, arg);
case HDIO_GETGEO:
return blkdev_getgeo(bdev, argp);
case BLKRAGET:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 252043f..90097dd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -316,6 +316,27 @@ extern int blkdev_report_zones(struct block_device *bdev,
 extern int blkdev_reset_zones(struct block_device *bdev, sector_t sectors,
  sector_t nr_sectors, gfp_t gfp_mask);
 
+extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
+