date:20170405

Re: [RFC 3/8] nvmet: Use p2pmem in nvme target

2017-04-05 Thread Sagi Grimberg




I hadn't done this yet but I think a simple closest device in the tree
would solve the issue sufficiently. However, I originally had it so the
user has to pick the device and I prefer that approach. But if the user
picks the device, then why bother restricting what he picks?


Because the user can get it wrong, and its our job to do what we can in
order to prevent the user from screwing itself.


Per the
thread with Sinan, I'd prefer to use what the user picks. You were one
of the biggest opponents to that so I'd like to hear your opinion on
removing the restrictions.


I wasn't against it that much, I'm all for making things "just work"
with minimal configuration steps, but I'm not sure we can get it
right without it.


Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVME card could just take the data
out of it's own memory. However, at this time, cards with CMB buffers
don't seem to be available.


Even if it was available, it would be hard to make real use of this
given that we wouldn't know how to pre-post recv buffers (for in-capsule
data). But let's leave this out of the scope entirely...


I don't understand what you're referring to. We'd simply use the CMB
buffer as a p2pmem device, why does that change anything?


I'm referring to the in-capsule data buffers pre-posts that we do.
Because we prepare a buffer that would contain in-capsule data, we have
no knowledge to which device the incoming I/O is directed to, which
means we can (and will) have I/O where the data lies in CMB of device
A but it's really targeted to device B - which sorta defeats the purpose
of what we're trying to optimize here...


Why do you need this? you have a reference to the
queue itself.


This keeps track of whether the response was actually allocated with
p2pmem or not. It's needed for when we free the SGL because the queue
may have a p2pmem device assigned to it but, if the alloc failed and it
fell back on system memory then we need to know how to free it. I'm
currently looking at having SGLs having an iomem flag. In which case,
this would no longer be needed as the flag in the SGL could be used.


That would be better, maybe...

[...]


This is a problem. namespaces can be added at any point in time. No one
guarantee that dma_devs are all the namepaces we'll ever see.


Yeah, well restricting p2pmem based on all the devices in use is hard.
So we'd need a call into the transport every time an ns is added and
we'd have to drop the p2pmem if they add one that isn't supported. This
complexity is just one of the reasons I prefer just letting the user chose.


Still the user can get it wrong. Not sure we can get a way without
keeping track of this as new devices join the subsystem.


+
+if (queue->p2pmem)
+pr_debug("using %s for rdma nvme target queue",
+ dev_name(>p2pmem->dev));
+
+kfree(dma_devs);
+}
+
 static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
 struct rdma_cm_event *event)
 {
@@ -1199,6 +1271,8 @@ static int nvmet_rdma_queue_connect(struct
rdma_cm_id *cm_id,
 }
 queue->port = cm_id->context;

+nvmet_rdma_queue_setup_p2pmem(queue);
+


Why is all this done for each queue? looks completely redundant to me.


A little bit. Where would you put it?


I think we'll need a representation of a controller in nvmet-rdma for
that. we sort of got a way without it so far, but I don't think we can
anymore with this.


 ret = nvmet_rdma_cm_accept(cm_id, queue, >param.conn);
 if (ret)
 goto release_queue;


You seemed to skip the in-capsule buffers for p2pmem (inline_page), I'm
curious why?


Yes, the thinking was that these transfers were small anyway so there
would not be significant benefit to pushing them through p2pmem. There's
really no reason why we couldn't do that if it made sense to though.


I don't see an urgent reason for it too. I was just curious...

Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem

2017-04-05 Thread Sagi Grimberg




Note that the nvme completion queues are still on the host memory, so
this means we have lost the ordering between data and completions as
they go to different pcie targets.


Hmm, in this simple up/down case with a switch, I think it might
actually be OK.

Transactions might not complete at the NVMe device before the CPU
processes the RDMA completion, however due to the PCI-E ordering rules
new TLPs directed to the NVMe will complete after the RMDA TLPs and
thus observe the new data. (eg order preserving)

It would be very hard to use P2P if fabric ordering is not preserved..


I think it still can race if the p2p device is connected with more than
a single port to the switch.

Say it's connected via 2 legs, the bar is accessed from leg A and the
data from the disk comes via leg B. In this case, the data is heading
towards the p2p device via leg B (might be congested), the completion
goes directly to the RC, and then the host issues a read from the
bar via leg A. I don't understand what can guarantee ordering here.

Stephen told me that this still guarantees ordering, but I honestly
can't understand how, perhaps someone can explain to me in a simple
way that I can understand.

Re: 4.11.0-rc5-00011-g08e4e0d oops in mpt3sas driver

2017-04-05 Thread Brad Campbell


On 06/04/17 08:30, Brad Campbell wrote:

G'day All,

This is a vaguely current git head kernel compiled yesterday.

Oopsed and rebooted itself, and then oopsed and rebooted again. There
was no sign of a raid rebuild in the kernel logs, and it's a staging
machine so there is nothing running after a reboot that goes near these
disks. They should have been completely idle the second time around.

This box suffered from bad rcu stalls on 4.10.x stable kernels, so I
upgraded to git head. It's all new hardware (the CPU, Chipset and
board), so I expected some issues with the board, but the LSI cards have
been around for a while now.


Further investigation indicates it might be a deeper problem. This is 
the first oops captured and it has nothing to do with the mpt3 driver.


[49580.533852] BUG: unable to handle kernel paging request at 
817cddfe

[49580.533875] IP: queued_spin_lock_slowpath+0xe7/0x170
[49580.533879] PGD 180a067
[49580.533879] PUD 180b063
[49580.533882] PMD 816001e1
[49580.533885]
[49580.533890] Oops: 0003 [#1] SMP
[49580.533894] Modules linked in: it87(O) deflate zlib_deflate ctr 
des_generic cbc cmac sha1_generic md5 hmac af_key xfrm_algo nfsd 
auth_rpcgss oid_registry nfs_acl nfs lockd grace sunrpc bonding 
sha256_generic dm_crypt aesni_intel aes_x86_64 crypto_simd cryptd 
glue_helper hwmon_vid netconsole configfs vhost_net vhost kvm_amd kvm 
irqbypass usbhid usb_storage nouveau video drm_kms_helper cfbfillrect 
syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea ttm 
drm mxm_wmi xhci_pci i2c_piix4 xhci_hcd usbcore usb_common wmi 
acpi_cpufreq mpt3sas igb i2c_algo_bit raid_class scsi_transport_sas ahci 
libahci
[49580.533929] CPU: 6 PID: 114 Comm: kswapd0 Tainted: G   O 
4.11.0-rc5-00011-g08e4e0d-dirty #39
[49580.533933] Hardware name: System manufacturer System Product 
Name/PRIME X370-PRO, BIOS 0515 03/30/2017

[49580.534045] task: 8807f9ad task.stack: c943
[49580.534049] RIP: 0010:queued_spin_lock_slowpath+0xe7/0x170
[49580.534052] RSP: 0018:c9433a50 EFLAGS: 00010082
[49580.534056] RAX: 34e1 RBX: 0292 RCX: 
001c
[49580.534059] RDX: 817cddfe RSI: 88081ed99900 RDI: 
8806ddb860e0
[49580.534063] RBP: 8806ddb860e0 R08: 0101 R09: 
dead0200
[49580.534119] R10: ea001c000700 R11: 880006b457b9 R12: 
8806ddb860c8
[49580.534122] R13: 0001 R14: c9433b40 R15: 
8806ddb860c8
[49580.534179] FS:  () GS:88081ed8() 
knlGS:

[49580.534183] CS:  0010 DS:  ES:  CR0: 80050033
[49580.534186] CR2: 817cddfe CR3: 01809000 CR4: 
003406e0

[49580.534190] Call Trace:
[49580.534247]  ? _raw_spin_lock_irqsave+0x1f/0x30
[49580.534253]  ? __remove_mapping+0x65/0x1b0
[49580.534258]  ? page_mkclean_one+0x100/0x100
[49580.534313]  ? page_get_anon_vma+0xa0/0xa0
[49580.534317]  ? shrink_page_list+0x6aa/0xda0
[49580.534321]  ? shrink_inactive_list+0x1f6/0x4b0
[49580.534325]  ? es_reclaim_extents+0x55/0xe0
[49580.534328]  ? inactive_list_is_low.isra.70+0x10e/0x1c0
[49580.534332]  ? shrink_node_memcg.isra.75+0x58c/0x6b0
[49580.534531]  ? shrink_node+0x4a/0x190
[49580.534705]  ? kswapd+0x2b7/0x5d0
[49580.535076]  ? kthread+0xf1/0x130
[49580.535477]  ? shrink_node+0x190/0x190
[49580.535869]  ? __kthread_init_worker+0xa0/0xa0
[49580.536257]  ? ret_from_fork+0x23/0x30
[49580.53] Code: 47 02 c1 e0 10 0f 84 93 00 00 00 48 89 c2 c1 e8 12 
48 c1 ea 0c ff c8 83 e2 30 48 98 48 81 c2 00 99 01 00 48 03 14 c5 20 54 
77 81 <48> 89 32 8b 46 08 85 c0 75 09 f3 90 8b 46 08 85 c0 74 f7 4c 8b
[49580.537489] RIP: queued_spin_lock_slowpath+0xe7/0x170 RSP: 
c9433a50

[49580.537904] CR2: 817cddfe
[49580.540107] ---[ end trace f58d3bdd0830f2bf ]---
[49580.540642] Kernel panic - not syncing: Fatal exception
[49580.541212] Kernel Offset: disabled
[49580.541493] Rebooting in 10 seconds..
[49590.501026] ACPI MEMORY or I/O RESET_REG.


This box survives days of memtest, but I'm not above suspecting the 
underlying hardware if it points to that.

4.11.0-rc5-00011-g08e4e0d oops in mpt3sas driver

2017-04-05 Thread Brad Campbell


G'day All,

This is a vaguely current git head kernel compiled yesterday.

Oopsed and rebooted itself, and then oopsed and rebooted again. There 
was no sign of a raid rebuild in the kernel logs, and it's a staging 
machine so there is nothing running after a reboot that goes near these 
disks. They should have been completely idle the second time around.


This box suffered from bad rcu stalls on 4.10.x stable kernels, so I 
upgraded to git head. It's all new hardware (the CPU, Chipset and 
board), so I expected some issues with the board, but the LSI cards have 
been around for a while now.


I'm pretty sure this is the second time it has done this, but the first 
where I had netconsole set up to capture it.


It is hard to hit and I've had 2 instances in about 5 days of very heavy 
I/O loads. Having said that, the second oops last night happened when 
idle, so I'm now out of ideas.


The only thing that might have been hitting these disks is smartmontools 
routine polling, but inspecting the logs it appears the reboot happened 
well between polls. No sign of anything in the logs, but got it on 
netconsole.


System is a new AMD Ryzen.
Controllers are :
27:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic 
SAS2116 PCI-Express Fusion-MPT SAS-2 [Meteor] (rev 02)
28:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic 
SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)


Only the 2116 has drives on it, and it has 8 6TB WD Red disks on SATA. 
Nothing else. The 2008 is just sitting taking up a slot.


Kernel is tainted by an out of tree build of the it87 driver, however 
nothing polls or uses that driver unless I'm at the console manually 
checking.


This machine originally oopsed after some heavy load and rebooted :
[49580.533852] BUG: unable to handle kernel paging request at 
817cddfe

[49580.533875] IP: queued_spin_lock_slowpath+0xe7/0x170

Tree is dirty due to this md patch :

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c523fd69a7bc..2eb45d57226c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3618,8 +3618,9 @@ static int fetch_block(struct stripe_head *sh, 
struct stripe_head_state *s,

BUG_ON(test_bit(R5_Wantread, >flags));
BUG_ON(sh->batch_head);
if ((s->uptodate == disks - 1) &&
+   ((sh->qd_idx >= 0 && sh->pd_idx == disk_idx) ||
(s->failed && (disk_idx == s->failed_num[0] ||
-  disk_idx == s->failed_num[1]))) {
+  disk_idx == s->failed_num[1] {
/* have disk failed, and we're requested to fetch it;
 * do compute it
 */


[ 4433.138683] BUG: unable to handle kernel paging request at 
9807cd621d60

[ 4433.138802] IP: _scsih_set_satl_pending+0x0/0x50 [mpt3sas]
[ 4433.138875] PGD 0
[ 4433.138875]
[ 4433.138928] Oops:  [#1] SMP
[ 4433.138976] Modules linked in: deflate zlib_deflate ctr des_generic 
cbc cmac sha1_generic md5 hmac af_key xfrm_algo nfsd auth_rpcgss 
oid_registry nfs_acl nfs lockd grace sunrpc bonding sha256_generic 
dm_crypt aesni_intel aes_x86_64 crypto_simd cryptd glue_helper it87(O) 
hwmon_vid netconsole configfs vhost_net vhost kvm_amd kvm irqbypass 
usbhid usb_storage nouveau video drm_kms_helper cfbfillrect syscopyarea 
cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea ttm drm mxm_wmi 
xhci_pci i2c_piix4 xhci_hcd usbcore usb_common wmi acpi_cpufreq mpt3sas 
raid_class scsi_transport_sas igb i2c_algo_bit ahci libahci
[ 4433.139831] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G   O 
4.11.0-rc5-00011-g08e4e0d-dirty #39
[ 4433.140032] Hardware name: System manufacturer System Product 
Name/PRIME X370-PRO, BIOS 0515 03/30/2017

[ 4433.140164] task: 8180e4c0 task.stack: 8180
[ 4433.140257] RIP: 0010:_scsih_set_satl_pending+0x0/0x50 [mpt3sas]
[ 4433.140349] RSP: 0018:88081ec03da0 EFLAGS: 00010046
[ 4433.140429] RAX: 0001 RBX: 8807f66585b0 RCX: 

[ 4433.140540] RDX:  RSI:  RDI: 
9807cd621d30
[ 4433.140651] RBP: 9807cd621d30 R08: 0001 R09: 
1594
[ 4433.140762] R10:  R11: 0018 R12: 

[ 4433.140872] R13: 8807f5c10bd0 R14: 0048 R15: 
0048
[ 4433.140984] FS:  () GS:88081ec0() 
knlGS:

[ 4433.141109] CS:  0010 DS:  ES:  CR0: 80050033
[ 4433.141198] CR2: 9807cd621d60 CR3: 0007da667000 CR4: 
003406f0

[ 4433.141309] Call Trace:
[ 4433.141399]  
[ 4433.141416]  ? _scsih_io_done+0xa2/0xa30 [mpt3sas]
[ 4433.141494]  ? load_balance+0x152/0x870
[ 4433.141549]  ? _base_interrupt+0x116/0xa30 [mpt3sas]
[ 4433.141627]  ? del_timer+0x60/0x60
[ 4433.141680]  ? __handle_irq_event_percpu+0x51/0x1c0
[ 4433.141752]  ?

Re: [PATCH] Make checking the scsi_device_get() return value mandatory

2017-04-05 Thread Bart Van Assche

On Thu, 2017-04-06 at 08:27 +0800, kbuild test robot wrote:
> All warnings (new ones prefixed by >>):
> 
>drivers//scsi/osd/osd_uld.c: In function 'osd_probe':
> > > drivers//scsi/osd/osd_uld.c:467:2: warning: ignoring return value of 
> > > 'scsi_device_get', declared with attribute warn_unused_result 
> > > [-Wunused-result]
> 
>  scsi_device_get(scsi_device);
>  ^~~~

Please ignore this warning. It is triggered because patch "osd_uld: Check
scsi_device_get() return value" is not yet upstream.

Bart.

Re: [PATCH] Make checking the scsi_device_get() return value mandatory

2017-04-05 Thread kbuild test robot

Hi Bart,

[auto build test WARNING on scsi/for-next]
[also build test WARNING on v4.11-rc5 next-20170405]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Bart-Van-Assche/Make-checking-the-scsi_device_get-return-value-mandatory/20170406-072137
base:   https://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git for-next
config: x86_64-allyesdebian (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   drivers//scsi/osd/osd_uld.c: In function 'osd_probe':
>> drivers//scsi/osd/osd_uld.c:467:2: warning: ignoring return value of 
>> 'scsi_device_get', declared with attribute warn_unused_result 
>> [-Wunused-result]
 scsi_device_get(scsi_device);
 ^~~~

vim +/scsi_device_get +467 drivers//scsi/osd/osd_uld.c

95b05a7db Boaz Harrosh 2009-01-25  451  
95b05a7db Boaz Harrosh 2009-01-25  452  /* allocate a disk and set it 
up */
95b05a7db Boaz Harrosh 2009-01-25  453  /* FIXME: do we need this since 
sg has already done that */
95b05a7db Boaz Harrosh 2009-01-25  454  disk = alloc_disk(1);
95b05a7db Boaz Harrosh 2009-01-25  455  if (!disk) {
95b05a7db Boaz Harrosh 2009-01-25  456  OSD_ERR("alloc_disk 
failed\n");
95b05a7db Boaz Harrosh 2009-01-25  457  goto err_free_osd;
95b05a7db Boaz Harrosh 2009-01-25  458  }
95b05a7db Boaz Harrosh 2009-01-25  459  disk->major = SCSI_OSD_MAJOR;
95b05a7db Boaz Harrosh 2009-01-25  460  disk->first_minor = oud->minor;
95b05a7db Boaz Harrosh 2009-01-25  461  sprintf(disk->disk_name, 
"osd%d", oud->minor);
95b05a7db Boaz Harrosh 2009-01-25  462  oud->disk = disk;
95b05a7db Boaz Harrosh 2009-01-25  463  
95b05a7db Boaz Harrosh 2009-01-25  464  /* hold one more reference to 
the scsi_device that will get released
95b05a7db Boaz Harrosh 2009-01-25  465   * in __release, in case a 
logout is happening while fs is mounted
95b05a7db Boaz Harrosh 2009-01-25  466   */
95b05a7db Boaz Harrosh 2009-01-25 @467  scsi_device_get(scsi_device);
95b05a7db Boaz Harrosh 2009-01-25  468  osd_dev_init(>od, 
scsi_device);
95b05a7db Boaz Harrosh 2009-01-25  469  
95b05a7db Boaz Harrosh 2009-01-25  470  /* Detect the OSD Version */
95b05a7db Boaz Harrosh 2009-01-25  471  error = __detect_osd(oud);
95b05a7db Boaz Harrosh 2009-01-25  472  if (error) {
95b05a7db Boaz Harrosh 2009-01-25  473  OSD_ERR("osd detection 
failed, non-compatible OSD device\n");
95b05a7db Boaz Harrosh 2009-01-25  474  goto err_put_disk;
95b05a7db Boaz Harrosh 2009-01-25  475  }

:: The code at line 467 was first introduced by commit
:: 95b05a7db5865855c32e0bb8b244c3a7aac1cfeb [SCSI] osd_uld: OSD scsi ULD

:: TO: Boaz Harrosh <bharr...@panasas.com>
:: CC: James Bottomley <james.bottom...@hansenpartnership.com>

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip

[PATCH 21/24] scsi: Lock down the eata driver

2017-04-05 Thread David Howells

When the kernel is running in secure boot mode, we lock down the kernel to
prevent userspace from modifying the running kernel image.  Whilst this
includes prohibiting access to things like /dev/mem, it must also prevent
access by means of configuring driver modules in such a way as to cause a
device to access or modify the kernel image.

The eata driver takes a single string parameter that contains a slew of
settings, including hardware resource configuration.  Prohibit use of the
parameter if the kernel is locked down.

Suggested-by: Alan Cox 
Signed-off-by: David Howells 
cc: Dario Ballabio 
cc: "James E.J. Bottomley" 
cc: "Martin K. Petersen" 
cc: linux-scsi@vger.kernel.org
---

 drivers/scsi/eata.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/eata.c b/drivers/scsi/eata.c
index 227dd2c2ec2f..5c036d10c18b 100644
--- a/drivers/scsi/eata.c
+++ b/drivers/scsi/eata.c
@@ -1552,8 +1552,13 @@ static int eata2x_detect(struct scsi_host_template *tpnt)
 
tpnt->proc_name = "eata2x";
 
-   if (strlen(boot_options))
+   if (strlen(boot_options)) {
+   if (kernel_is_locked_down()) {
+   pr_err("Command line-specified device addresses, irqs 
and dma channels are not permitted when the kernel is locked down\n");
+   return -EPERM;
+   }
option_setup(boot_options);
+   }
 
 #if defined(MODULE)
/* io_port could have been modified when loading as a module */

[PATCH v3 2/2] scsi: storvsc: Add support for FC rport.

2017-04-05 Thread Cathy Avery

Included in the current storvsc driver for Hyper-V is the ability
to access luns on an FC fabric via a virtualized fiber channel
adapter exposed by the Hyper-V host. The driver also attaches to
the FC transport to allow host and port names to be published under
/sys/class/fc_host/hostX. Current customer tools running on the VM
require that these names be available in the well known standard
location under fc_host/hostX.

This patch stubs in an rport per fc_host and sets its rport role
as FC_PORT_ROLE_FCP_DUMMY_INITIATOR to indicate to the fc_transport
that it is a pseudo rport in order to scan the scsi stack via
echo "- - -" > /sys/class/scsi_host/hostX/scan.

Signed-off-by: Cathy Avery 
---
 drivers/scsi/storvsc_drv.c | 23 ++-
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 016639d..1ec8579 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -476,6 +476,9 @@ struct storvsc_device {
 */
u64 node_name;
u64 port_name;
+#if IS_ENABLED(CONFIG_SCSI_FC_ATTRS)
+   struct fc_rport *rport;
+#endif
 };
 
 struct hv_host_device {
@@ -1823,19 +1826,27 @@ static int storvsc_probe(struct hv_device *device,
target = (device->dev_instance.b[5] << 8 |
 device->dev_instance.b[4]);
ret = scsi_add_device(host, 0, target, 0);
-   if (ret) {
-   scsi_remove_host(host);
-   goto err_out2;
-   }
+   if (ret)
+   goto err_out3;
}
 #if IS_ENABLED(CONFIG_SCSI_FC_ATTRS)
if (host->transportt == fc_transport_template) {
+   struct fc_rport_identifiers ids = {
+   .roles = FC_PORT_ROLE_FCP_DUMMY_INITIATOR,
+   };
+
fc_host_node_name(host) = stor_device->node_name;
fc_host_port_name(host) = stor_device->port_name;
+   stor_device->rport = fc_remote_port_add(host, 0, );
+   if (!stor_device->rport)
+   goto err_out3;
}
 #endif
return 0;
 
+err_out3:
+   scsi_remove_host(host);
+
 err_out2:
/*
 * Once we have connected with the host, we would need to
@@ -1861,8 +1872,10 @@ static int storvsc_remove(struct hv_device *dev)
struct Scsi_Host *host = stor_device->host;
 
 #if IS_ENABLED(CONFIG_SCSI_FC_ATTRS)
-   if (host->transportt == fc_transport_template)
+   if (host->transportt == fc_transport_template) {
+   fc_remote_port_delete(stor_device->rport);
fc_remove_host(host);
+   }
 #endif
scsi_remove_host(host);
storvsc_dev_remove(dev);
-- 
2.5.0

[PATCH v3 1/2] scsi: scsi_transport_fc: Add dummy initiator role to rport

2017-04-05 Thread Cathy Avery

This patch allows scsi drivers that expose virturalized fibre channel
devices but that do not expose rports to successfully rescan the scsi
bus via echo "- - -" > /sys/class/scsi_host/hostX/scan.
Drivers can create a pseudo rport and indicate
FC_PORT_ROLE_FCP_DUMMY_INITIATOR as the rport's role in
fc_rport_identifiers. This insures that a valid scsi_target_id
is assigned to the newly created rport and it can meet the
requirements of fc_user_scan_tgt calling scsi_scan_target.

Signed-off-by: Cathy Avery 
---
 drivers/scsi/scsi_transport_fc.c | 10 ++
 include/scsi/scsi_transport_fc.h |  1 +
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 2d753c9..de85602 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -289,9 +289,10 @@ static const struct {
u32 value;
char*name;
 } fc_port_role_names[] = {
-   { FC_PORT_ROLE_FCP_TARGET,  "FCP Target" },
-   { FC_PORT_ROLE_FCP_INITIATOR,   "FCP Initiator" },
-   { FC_PORT_ROLE_IP_PORT, "IP Port" },
+   { FC_PORT_ROLE_FCP_TARGET,  "FCP Target" },
+   { FC_PORT_ROLE_FCP_INITIATOR,   "FCP Initiator" },
+   { FC_PORT_ROLE_IP_PORT, "IP Port" },
+   { FC_PORT_ROLE_FCP_DUMMY_INITIATOR, "FCP Dummy Initiator" },
 };
 fc_bitfield_name_search(port_roles, fc_port_role_names)
 
@@ -2628,7 +2629,8 @@ fc_remote_port_create(struct Scsi_Host *shost, int 
channel,
spin_lock_irqsave(shost->host_lock, flags);
 
rport->number = fc_host->next_rport_number++;
-   if (rport->roles & FC_PORT_ROLE_FCP_TARGET)
+   if ((rport->roles & FC_PORT_ROLE_FCP_TARGET) ||
+   (rport->roles & FC_PORT_ROLE_FCP_DUMMY_INITIATOR))
rport->scsi_target_id = fc_host->next_target_id++;
else
rport->scsi_target_id = -1;
diff --git a/include/scsi/scsi_transport_fc.h b/include/scsi/scsi_transport_fc.h
index b21b8aa5..6e208bb 100644
--- a/include/scsi/scsi_transport_fc.h
+++ b/include/scsi/scsi_transport_fc.h
@@ -162,6 +162,7 @@ enum fc_tgtid_binding_type  {
 #define FC_PORT_ROLE_FCP_TARGET0x01
 #define FC_PORT_ROLE_FCP_INITIATOR 0x02
 #define FC_PORT_ROLE_IP_PORT   0x04
+#define FC_PORT_ROLE_FCP_DUMMY_INITIATOR   0x08
 
 /* The following are for compatibility */
 #define FC_RPORT_ROLE_UNKNOWN  FC_PORT_ROLE_UNKNOWN
-- 
2.5.0

[PATCH v3 0/2] scsi: storvsc: Add support for FC rport

2017-04-05 Thread Cathy Avery

The updated patch set provides a way for drivers ( specifically
storvsc in this case ) that expose virturalized fc devices
but that do not expose rports to be manually scanned. This is
done via creating a pseudo rport in storvsc and a
corresponding dummy initiator rport role in the fc transport.

Changes since v2:
- Additional patch adding FC_PORT_ROLE_FCP_DUMMY_INITIATOR role
  to fc_transport
- Changed storvsc rport role to FC_PORT_ROLE_FCP_DUMMY_INITIATOR

Changes since v1:
- Fix fc_rport_identifiers init [Stephen Hemminger]
- Better error checking


Cathy Avery (2):
  scsi: scsi_transport_fc: Add dummy initiator role to rport
  scsi: storvsc: Add support for FC rport.

 drivers/scsi/scsi_transport_fc.c | 10 ++
 drivers/scsi/storvsc_drv.c   | 23 ++-
 include/scsi/scsi_transport_fc.h |  1 +
 3 files changed, 25 insertions(+), 9 deletions(-)

-- 
2.5.0

Re: ->retries fixups V2

2017-04-05 Thread Jens Axboe

On 04/05/2017 12:16 PM, Christoph Hellwig wrote:
> On Wed, Apr 05, 2017 at 12:06:53PM -0600, Jens Axboe wrote:
>> On Wed, Apr 05 2017, Christoph Hellwig wrote:
>>> This series fixes a few lose bits in terms of how nvme uses ->retries,
>>> including fixing it for non-PCIe transports.  While at it I noticed that
>>> nvme and scsi use the field in entirely different ways, and no other
>>> driver uses it at all.  So I decided to move it into the nvme_request and
>>> scsi_request structures instead.
>>>
>>> Changes since V1:
>>>  - better changelog for one patch
>>>  - move the new retries field to the end of struct nvme_request
>>
>> Applied for 4.12. If we do the below on my box, we remove the (now) 2
>> holes from struct request and shrink it 8 bytes.
> 
> Looks good:
> 
> Reviewed-by: Christoph Hellwig 

Thanks, added that too.

-- 
Jens Axboe

Re: ->retries fixups V2

2017-04-05 Thread Christoph Hellwig

On Wed, Apr 05, 2017 at 12:06:53PM -0600, Jens Axboe wrote:
> On Wed, Apr 05 2017, Christoph Hellwig wrote:
> > This series fixes a few lose bits in terms of how nvme uses ->retries,
> > including fixing it for non-PCIe transports.  While at it I noticed that
> > nvme and scsi use the field in entirely different ways, and no other
> > driver uses it at all.  So I decided to move it into the nvme_request and
> > scsi_request structures instead.
> > 
> > Changes since V1:
> >  - better changelog for one patch
> >  - move the new retries field to the end of struct nvme_request
> 
> Applied for 4.12. If we do the below on my box, we remove the (now) 2
> holes from struct request and shrink it 8 bytes.

Looks good:

Reviewed-by: Christoph Hellwig

Re: [PATCH v2] scsi: sd: Fix capacity calculation with 32-bit sector_t

2017-04-05 Thread Bart Van Assche

On Wed, 2017-04-05 at 06:15 -0400, Martin K. Petersen wrote:
> We previously made sure that the reported disk capacity was less than
> 0x blocks when the kernel was not compiled with large sector_t
> support (CONFIG_LBDAF). However, this check assumed that the capacity
> was reported in units of 512 bytes.
> 
> Add a sanity check function to ensure that we only enable disks if the
> entire reported capacity can be expressed in terms of sector_t.

Reviewed-by: Bart Van Assche

Re: ->retries fixups V2

2017-04-05 Thread Jens Axboe

On Wed, Apr 05 2017, Christoph Hellwig wrote:
> This series fixes a few lose bits in terms of how nvme uses ->retries,
> including fixing it for non-PCIe transports.  While at it I noticed that
> nvme and scsi use the field in entirely different ways, and no other
> driver uses it at all.  So I decided to move it into the nvme_request and
> scsi_request structures instead.
> 
> Changes since V1:
>  - better changelog for one patch
>  - move the new retries field to the end of struct nvme_request

Applied for 4.12. If we do the below on my box, we remove the (now) 2
holes from struct request and shrink it 8 bytes.

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ce6f9a6534c9..3cf241b0814d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -215,6 +215,8 @@ struct request {
 
unsigned short ioprio;
 
+   unsigned int timeout;
+
void *special;  /* opaque pointer available for LLD use */
 
int errors;
@@ -223,7 +225,6 @@ struct request {
 
unsigned long deadline;
struct list_head timeout_list;
-   unsigned int timeout;
 
/*
 * completion callback.

-- 
Jens Axboe

Re: [PATCH 2/5] nvme: cleanup nvme_req_needs_retry

2017-04-05 Thread Sagi Grimberg


Reviewed-by: Sagi Grimberg

Re: [PATCH 4/5] nvme: move the retries count to struct nvme_request

2017-04-05 Thread Sagi Grimberg


Reviewed-by: Sagi Grimberg

Re: [PATCH 3/5] nvme: mark nvme_max_retries static

2017-04-05 Thread Sagi Grimberg


Reviewed-by: Sagi Grimberg

Re: [PATCH 1/5] nvme: move ->retries setup to nvme_setup_cmd

2017-04-05 Thread Sagi Grimberg


Reviewed-by: Sagi Grimberg

Re: [PATCH 5/5] block, scsi: move the retries field to struct scsi_request

2017-04-05 Thread Sagi Grimberg


Reviewed-by: Sagi Grimberg

[PATCH 02/27] block: renumber REQ_OP_WRITE_ZEROES

2017-04-05 Thread Christoph Hellwig

Make life easy for implementations that needs to send a data buffer
to the device (e.g. SCSI) by numbering it as a data out command.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 include/linux/blk_types.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 67bcf8a5326e..4eae30bfbfca 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -168,7 +168,7 @@ enum req_opf {
/* write the same sector many times */
REQ_OP_WRITE_SAME   = 7,
/* write the zero filled sector many times */
-   REQ_OP_WRITE_ZEROES = 8,
+   REQ_OP_WRITE_ZEROES = 9,
 
/* SCSI passthrough using struct scsi_request */
REQ_OP_SCSI_IN  = 32,
-- 
2.11.0

[PATCH 03/27] block: implement splitting of REQ_OP_WRITE_ZEROES bios

2017-04-05 Thread Christoph Hellwig

Copy and past the REQ_OP_WRITE_SAME code to prepare to implementations
that limit the write zeroes size.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 block/blk-merge.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2afa262425d1..3990ae406341 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -54,6 +54,20 @@ static struct bio *blk_bio_discard_split(struct 
request_queue *q,
return bio_split(bio, split_sectors, GFP_NOIO, bs);
 }
 
+static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
+   struct bio *bio, struct bio_set *bs, unsigned *nsegs)
+{
+   *nsegs = 1;
+
+   if (!q->limits.max_write_zeroes_sectors)
+   return NULL;
+
+   if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors)
+   return NULL;
+
+   return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
+}
+
 static struct bio *blk_bio_write_same_split(struct request_queue *q,
struct bio *bio,
struct bio_set *bs,
@@ -200,8 +214,7 @@ void blk_queue_split(struct request_queue *q, struct bio 
**bio,
split = blk_bio_discard_split(q, *bio, bs, );
break;
case REQ_OP_WRITE_ZEROES:
-   split = NULL;
-   nsegs = (*bio)->bi_phys_segments;
+   split = blk_bio_write_zeroes_split(q, *bio, bs, );
break;
case REQ_OP_WRITE_SAME:
split = blk_bio_write_same_split(q, *bio, bs, );
-- 
2.11.0

always use REQ_OP_WRITE_ZEROES for zeroing offload V2

2017-04-05 Thread Christoph Hellwig

This series makes REQ_OP_WRITE_ZEROES the only zeroing offload
supported by the block layer, and switches existing implementations
of REQ_OP_DISCARD that correctly set discard_zeroes_data to it,
removes incorrect discard_zeroes_data, and also switches WRITE SAME
based zeroing in SCSI to this new method.

The series is against the block for-next tree.

A git tree is also avaiable at:

git://git.infradead.org/users/hch/block.git discard-rework.2

Gitweb:


http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/discard-rework.2

Changes since V2:
 - various spelling fixes
 - various reviews captured
 - two new patches from Martin at the end

[PATCH 01/27] sd: split sd_setup_discard_cmnd

2017-04-05 Thread Christoph Hellwig

Split sd_setup_discard_cmnd into one function per provisioning type.  While
this creates some very slight duplication of boilerplate code it keeps the
code modular for additions of new provisioning types, and for reusing the
write same functions for the upcoming scsi implementation of the Write Zeroes
operation.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 drivers/scsi/sd.c | 153 ++
 1 file changed, 84 insertions(+), 69 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index fcfeddc79331..b853f91fb3da 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -701,93 +701,97 @@ static void sd_config_discard(struct scsi_disk *sdkp, 
unsigned int mode)
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
 }
 
-/**
- * sd_setup_discard_cmnd - unmap blocks on thinly provisioned device
- * @sdp: scsi device to operate on
- * @rq: Request to prepare
- *
- * Will issue either UNMAP or WRITE SAME(16) depending on preference
- * indicated by target device.
- **/
-static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
+static int sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
 {
-   struct request *rq = cmd->request;
struct scsi_device *sdp = cmd->device;
-   struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
-   sector_t sector = blk_rq_pos(rq);
-   unsigned int nr_sectors = blk_rq_sectors(rq);
-   unsigned int len;
-   int ret;
+   struct request *rq = cmd->request;
+   u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
+   u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
+   unsigned int data_len = 24;
char *buf;
-   struct page *page;
 
-   sector >>= ilog2(sdp->sector_size) - 9;
-   nr_sectors >>= ilog2(sdp->sector_size) - 9;
-
-   page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
-   if (!page)
+   rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+   if (!rq->special_vec.bv_page)
return BLKPREP_DEFER;
+   rq->special_vec.bv_offset = 0;
+   rq->special_vec.bv_len = data_len;
+   rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
 
-   switch (sdkp->provisioning_mode) {
-   case SD_LBP_UNMAP:
-   buf = page_address(page);
-
-   cmd->cmd_len = 10;
-   cmd->cmnd[0] = UNMAP;
-   cmd->cmnd[8] = 24;
-
-   put_unaligned_be16(6 + 16, [0]);
-   put_unaligned_be16(16, [2]);
-   put_unaligned_be64(sector, [8]);
-   put_unaligned_be32(nr_sectors, [16]);
+   cmd->cmd_len = 10;
+   cmd->cmnd[0] = UNMAP;
+   cmd->cmnd[8] = 24;
 
-   len = 24;
-   break;
+   buf = page_address(rq->special_vec.bv_page);
+   put_unaligned_be16(6 + 16, [0]);
+   put_unaligned_be16(16, [2]);
+   put_unaligned_be64(sector, [8]);
+   put_unaligned_be32(nr_sectors, [16]);
 
-   case SD_LBP_WS16:
-   cmd->cmd_len = 16;
-   cmd->cmnd[0] = WRITE_SAME_16;
-   cmd->cmnd[1] = 0x8; /* UNMAP */
-   put_unaligned_be64(sector, >cmnd[2]);
-   put_unaligned_be32(nr_sectors, >cmnd[10]);
+   cmd->allowed = SD_MAX_RETRIES;
+   cmd->transfersize = data_len;
+   rq->timeout = SD_TIMEOUT;
+   scsi_req(rq)->resid_len = data_len;
 
-   len = sdkp->device->sector_size;
-   break;
+   return scsi_init_io(cmd);
+}
 
-   case SD_LBP_WS10:
-   case SD_LBP_ZERO:
-   cmd->cmd_len = 10;
-   cmd->cmnd[0] = WRITE_SAME;
-   if (sdkp->provisioning_mode == SD_LBP_WS10)
-   cmd->cmnd[1] = 0x8; /* UNMAP */
-   put_unaligned_be32(sector, >cmnd[2]);
-   put_unaligned_be16(nr_sectors, >cmnd[7]);
+static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd)
+{
+   struct scsi_device *sdp = cmd->device;
+   struct request *rq = cmd->request;
+   u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
+   u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
+   u32 data_len = sdp->sector_size;
 
-   len = sdkp->device->sector_size;
-   break;
+   rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+   if (!rq->special_vec.bv_page)
+   return BLKPREP_DEFER;
+   rq->special_vec.bv_offset = 0;
+   rq->special_vec.bv_len = data_len;
+   rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
 
-   default:
-   ret = BLKPREP_INVALID;
-   goto out;
-   }
+   cmd->cmd_len = 16;
+   cmd->cmnd[0] = WRITE_SAME_16;
+   cmd->cmnd[1] = 0x8; /* UNMAP */
+   put_unaligned_be64(sector, >cmnd[2]);
+   put_unaligned_be32(nr_sectors, >cmnd[10]);
 
+   cmd->allowed = SD_MAX_RETRIES;
+

[PATCH 07/27] dm: support REQ_OP_WRITE_ZEROES

2017-04-05 Thread Christoph Hellwig

Copy & paste from the REQ_OP_WRITE_SAME code.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/dm-core.h  |  1 +
 drivers/md/dm-io.c|  8 ++--
 drivers/md/dm-linear.c|  1 +
 drivers/md/dm-mpath.c |  1 +
 drivers/md/dm-rq.c| 11 ---
 drivers/md/dm-stripe.c|  2 ++
 drivers/md/dm-table.c | 30 ++
 drivers/md/dm.c   | 31 ---
 include/linux/device-mapper.h |  6 ++
 9 files changed, 83 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 136fda3ff9e5..fea5bd52ada8 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -132,6 +132,7 @@ void dm_init_md_queue(struct mapped_device *md);
 void dm_init_normal_md_queue(struct mapped_device *md);
 int md_in_flight(struct mapped_device *md);
 void disable_write_same(struct mapped_device *md);
+void disable_write_zeroes(struct mapped_device *md);
 
 static inline struct completion *dm_get_completion_from_kobject(struct kobject 
*kobj)
 {
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index b808cbe22678..3702e502466d 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -312,9 +312,12 @@ static void do_region(int op, int op_flags, unsigned 
region,
 */
if (op == REQ_OP_DISCARD)
special_cmd_max_sectors = q->limits.max_discard_sectors;
+   else if (op == REQ_OP_WRITE_ZEROES)
+   special_cmd_max_sectors = q->limits.max_write_zeroes_sectors;
else if (op == REQ_OP_WRITE_SAME)
special_cmd_max_sectors = q->limits.max_write_same_sectors;
-   if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_SAME) &&
+   if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES ||
+op == REQ_OP_WRITE_SAME)  &&
special_cmd_max_sectors == 0) {
dec_count(io, region, -EOPNOTSUPP);
return;
@@ -330,6 +333,7 @@ static void do_region(int op, int op_flags, unsigned region,
 */
switch (op) {
case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
num_bvecs = 0;
break;
case REQ_OP_WRITE_SAME:
@@ -347,7 +351,7 @@ static void do_region(int op, int op_flags, unsigned region,
bio_set_op_attrs(bio, op, op_flags);
store_io_and_region_in_bio(bio, io, region);
 
-   if (op == REQ_OP_DISCARD) {
+   if (op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES) {
num_sectors = min_t(sector_t, special_cmd_max_sectors, 
remaining);
bio->bi_iter.bi_size = num_sectors << SECTOR_SHIFT;
remaining -= num_sectors;
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 4788b0b989a9..e17fd44ceef5 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -59,6 +59,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int 
argc, char **argv)
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
ti->num_write_same_bios = 1;
+   ti->num_write_zeroes_bios = 1;
ti->private = lc;
return 0;
 
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 7f223dbed49f..ab55955ed704 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1103,6 +1103,7 @@ static int multipath_ctr(struct dm_target *ti, unsigned 
argc, char **argv)
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
ti->num_write_same_bios = 1;
+   ti->num_write_zeroes_bios = 1;
if (m->queue_mode == DM_TYPE_BIO_BASED)
ti->per_io_data_size = multipath_per_bio_data_size();
else
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 6886bf160fb2..a789bf035621 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -298,9 +298,14 @@ static void dm_done(struct request *clone, int error, bool 
mapped)
r = rq_end_io(tio->ti, clone, error, >info);
}
 
-   if (unlikely(r == -EREMOTEIO && (req_op(clone) == REQ_OP_WRITE_SAME) &&
-!clone->q->limits.max_write_same_sectors))
-   disable_write_same(tio->md);
+   if (unlikely(r == -EREMOTEIO)) {
+   if (req_op(clone) == REQ_OP_WRITE_SAME &&
+   !clone->q->limits.max_write_same_sectors)
+   disable_write_same(tio->md);
+   if (req_op(clone) == REQ_OP_WRITE_ZEROES &&
+   !clone->q->limits.max_write_zeroes_sectors)
+   disable_write_zeroes(tio->md);
+   }
 
if (r <= 0)
/* The target wants to complete the I/O */
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 28193a57bf47..5ef49c121d99 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -169,6

[PATCH 04/27] sd: implement REQ_OP_WRITE_ZEROES

2017-04-05 Thread Christoph Hellwig

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 drivers/scsi/sd.c | 31 ++-
 drivers/scsi/sd_zbc.c |  1 +
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index b853f91fb3da..d8d9c0bdd93c 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -735,7 +735,7 @@ static int sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
return scsi_init_io(cmd);
 }
 
-static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd)
+static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd, bool unmap)
 {
struct scsi_device *sdp = cmd->device;
struct request *rq = cmd->request;
@@ -752,13 +752,14 @@ static int sd_setup_write_same16_cmnd(struct scsi_cmnd 
*cmd)
 
cmd->cmd_len = 16;
cmd->cmnd[0] = WRITE_SAME_16;
-   cmd->cmnd[1] = 0x8; /* UNMAP */
+   if (unmap)
+   cmd->cmnd[1] = 0x8; /* UNMAP */
put_unaligned_be64(sector, >cmnd[2]);
put_unaligned_be32(nr_sectors, >cmnd[10]);
 
cmd->allowed = SD_MAX_RETRIES;
cmd->transfersize = data_len;
-   rq->timeout = SD_TIMEOUT;
+   rq->timeout = unmap ? SD_TIMEOUT : SD_WRITE_SAME_TIMEOUT;
scsi_req(rq)->resid_len = data_len;
 
return scsi_init_io(cmd);
@@ -788,12 +789,27 @@ static int sd_setup_write_same10_cmnd(struct scsi_cmnd 
*cmd, bool unmap)
 
cmd->allowed = SD_MAX_RETRIES;
cmd->transfersize = data_len;
-   rq->timeout = SD_TIMEOUT;
+   rq->timeout = unmap ? SD_TIMEOUT : SD_WRITE_SAME_TIMEOUT;
scsi_req(rq)->resid_len = data_len;
 
return scsi_init_io(cmd);
 }
 
+static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd *cmd)
+{
+   struct request *rq = cmd->request;
+   struct scsi_device *sdp = cmd->device;
+   struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+   u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
+   u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
+
+   if (sdp->no_write_same)
+   return BLKPREP_INVALID;
+   if (sdkp->ws16 || sector > 0x || nr_sectors > 0x)
+   return sd_setup_write_same16_cmnd(cmd, false);
+   return sd_setup_write_same10_cmnd(cmd, false);
+}
+
 static void sd_config_write_same(struct scsi_disk *sdkp)
 {
struct request_queue *q = sdkp->disk->queue;
@@ -823,6 +839,8 @@ static void sd_config_write_same(struct scsi_disk *sdkp)
 out:
blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks *
 (logical_block_size >> 9));
+   blk_queue_max_write_zeroes_sectors(q, sdkp->max_ws_blocks *
+(logical_block_size >> 9));
 }
 
 /**
@@ -1163,7 +1181,7 @@ static int sd_init_command(struct scsi_cmnd *cmd)
case SD_LBP_UNMAP:
return sd_setup_unmap_cmnd(cmd);
case SD_LBP_WS16:
-   return sd_setup_write_same16_cmnd(cmd);
+   return sd_setup_write_same16_cmnd(cmd, true);
case SD_LBP_WS10:
return sd_setup_write_same10_cmnd(cmd, true);
case SD_LBP_ZERO:
@@ -1171,6 +1189,8 @@ static int sd_init_command(struct scsi_cmnd *cmd)
default:
return BLKPREP_INVALID;
}
+   case REQ_OP_WRITE_ZEROES:
+   return sd_setup_write_zeroes_cmnd(cmd);
case REQ_OP_WRITE_SAME:
return sd_setup_write_same_cmnd(cmd);
case REQ_OP_FLUSH:
@@ -1810,6 +1830,7 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 
switch (req_op(req)) {
case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
case REQ_OP_WRITE_SAME:
case REQ_OP_ZONE_RESET:
if (!result) {
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 92620c8ea8ad..1994f7799fce 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -329,6 +329,7 @@ void sd_zbc_complete(struct scsi_cmnd *cmd,
 
switch (req_op(rq)) {
case REQ_OP_WRITE:
+   case REQ_OP_WRITE_ZEROES:
case REQ_OP_WRITE_SAME:
case REQ_OP_ZONE_RESET:
 
-- 
2.11.0

[PATCH 08/27] dm kcopyd: switch to use REQ_OP_WRITE_ZEROES

2017-04-05 Thread Christoph Hellwig

It seems like the code currently passes whatever it was using for writes
to WRITE SAME.  Just switch it to WRITE ZEROES, although that doesn't
need any payload.

Untested, and confused by the code, maybe someone who understands it
better than me can help..

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/dm-kcopyd.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 9e9d04cb7d51..f85846741d50 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -733,11 +733,11 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct 
dm_io_region *from,
job->pages = _page_list;
 
/*
-* Use WRITE SAME to optimize zeroing if all dests support it.
+* Use WRITE ZEROES to optimize zeroing if all dests support it.
 */
-   job->rw = REQ_OP_WRITE_SAME;
+   job->rw = REQ_OP_WRITE_ZEROES;
for (i = 0; i < job->num_dests; i++)
-   if (!bdev_write_same(job->dests[i].bdev)) {
+   if (!bdev_write_zeroes_sectors(job->dests[i].bdev)) {
job->rw = WRITE;
break;
}
-- 
2.11.0

Re: [PATCH v2] scsi: ses: don't get power status of SES device slot on probe

2017-04-05 Thread Mauricio Faria de Oliveira


On 04/05/2017 01:23 PM, Song Liu wrote:

Reviewed-by: Song Liu 


Thanks for reviewing, Song Liu.

It's good to know this patch doesn't break anything for you.

cheers,


--
Mauricio Faria de Oliveira
IBM Linux Technology Center

[PATCH 24/27] drbd: implement REQ_OP_WRITE_ZEROES

2017-04-05 Thread Christoph Hellwig

It seems like DRBD assumes its on the wire TRIM request always zeroes data.
Use that fact to implement REQ_OP_WRITE_ZEROES.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/block/drbd/drbd_main.c | 3 ++-
 drivers/block/drbd/drbd_nl.c   | 2 ++
 drivers/block/drbd/drbd_receiver.c | 6 +++---
 drivers/block/drbd/drbd_req.c  | 7 +--
 drivers/block/drbd/drbd_worker.c   | 4 +++-
 5 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 92c60cbd04ee..8e62d9f65510 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1668,7 +1668,8 @@ static u32 bio_flags_to_wire(struct drbd_connection 
*connection,
(bio->bi_opf & REQ_FUA ? DP_FUA : 0) |
(bio->bi_opf & REQ_PREFLUSH ? DP_FLUSH : 0) |
(bio_op(bio) == REQ_OP_WRITE_SAME ? DP_WSAME : 0) |
-   (bio_op(bio) == REQ_OP_DISCARD ? DP_DISCARD : 0);
+   (bio_op(bio) == REQ_OP_DISCARD ? DP_DISCARD : 0) |
+   (bio_op(bio) == REQ_OP_WRITE_ZEROES ? DP_DISCARD : 0);
else
return bio->bi_opf & REQ_SYNC ? DP_RW_SYNC : 0;
 }
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 908c704e20aa..e4516d3b971d 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1217,10 +1217,12 @@ static void decide_on_discard_support(struct 
drbd_device *device,
blk_queue_discard_granularity(q, 512);
q->limits.max_discard_sectors = 
drbd_max_discard_sectors(connection);
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
+   q->limits.max_write_zeroes_sectors = 
drbd_max_discard_sectors(connection);
} else {
queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q);
blk_queue_discard_granularity(q, 0);
q->limits.max_discard_sectors = 0;
+   q->limits.max_write_zeroes_sectors = 0;
}
 }
 
diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index bc1d296581f9..1b0a2be24f39 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -2285,7 +2285,7 @@ static unsigned long wire_flags_to_bio_flags(u32 dpf)
 static unsigned long wire_flags_to_bio_op(u32 dpf)
 {
if (dpf & DP_DISCARD)
-   return REQ_OP_DISCARD;
+   return REQ_OP_WRITE_ZEROES;
else
return REQ_OP_WRITE;
 }
@@ -2476,7 +2476,7 @@ static int receive_Data(struct drbd_connection 
*connection, struct packet_info *
op_flags = wire_flags_to_bio_flags(dp_flags);
if (pi->cmd == P_TRIM) {
D_ASSERT(peer_device, peer_req->i.size > 0);
-   D_ASSERT(peer_device, op == REQ_OP_DISCARD);
+   D_ASSERT(peer_device, op == REQ_OP_WRITE_ZEROES);
D_ASSERT(peer_device, peer_req->pages == NULL);
} else if (peer_req->pages == NULL) {
D_ASSERT(device, peer_req->i.size == 0);
@@ -4789,7 +4789,7 @@ static int receive_rs_deallocated(struct drbd_connection 
*connection, struct pac
 
if (get_ldev(device)) {
struct drbd_peer_request *peer_req;
-   const int op = REQ_OP_DISCARD;
+   const int op = REQ_OP_WRITE_ZEROES;
 
peer_req = drbd_alloc_peer_req(peer_device, ID_SYNCER, sector,
   size, 0, GFP_NOIO);
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 6da9ea8c48b6..b5730e17b455 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -59,6 +59,7 @@ static struct drbd_request *drbd_req_new(struct drbd_device 
*device, struct bio
drbd_req_make_private_bio(req, bio_src);
req->rq_state = (bio_data_dir(bio_src) == WRITE ? RQ_WRITE : 0)
  | (bio_op(bio_src) == REQ_OP_WRITE_SAME ? RQ_WSAME : 0)
+ | (bio_op(bio_src) == REQ_OP_WRITE_ZEROES ? RQ_UNMAP : 0)
  | (bio_op(bio_src) == REQ_OP_DISCARD ? RQ_UNMAP : 0);
req->device = device;
req->master_bio = bio_src;
@@ -1180,7 +1181,8 @@ drbd_submit_req_private_bio(struct drbd_request *req)
if (get_ldev(device)) {
if (drbd_insert_fault(device, type))
bio_io_error(bio);
-   else if (bio_op(bio) == REQ_OP_DISCARD)
+   else if (bio_op(bio) == REQ_OP_WRITE_ZEROES ||
+bio_op(bio) == REQ_OP_DISCARD)
drbd_process_discard_req(req);
else
generic_make_request(bio);
@@ -1234,7 +1236,8 @@ drbd_request_prepare(struct drbd_device *device, struct 
bio *bio, unsigned long
_drbd_start_io_acct(device, req);
 
/* process

[PATCH 16/27] zram: implement REQ_OP_WRITE_ZEROES

2017-04-05 Thread Christoph Hellwig

Just the same as discard if the block size equals the system page size.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/block/zram/zram_drv.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index dceb5edd1e54..1710b06f04a7 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -829,10 +829,14 @@ static void __zram_make_request(struct zram *zram, struct 
bio *bio)
offset = (bio->bi_iter.bi_sector &
  (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;
 
-   if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
+   switch (bio_op(bio)) {
+   case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
zram_bio_discard(zram, index, offset, bio);
bio_endio(bio);
return;
+   default:
+   break;
}
 
bio_for_each_segment(bvec, bio, iter) {
@@ -1192,6 +1196,8 @@ static int zram_add(void)
zram->disk->queue->limits.max_sectors = SECTORS_PER_PAGE;
zram->disk->queue->limits.chunk_sectors = 0;
blk_queue_max_discard_sectors(zram->disk->queue, UINT_MAX);
+   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue);
+
/*
 * zram_bio_discard() will clear all logical blocks if logical block
 * size is identical with physical block size(PAGE_SIZE). But if it is
@@ -1201,10 +1207,7 @@ static int zram_add(void)
 * zeroed.
 */
if (ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE)
-   zram->disk->queue->limits.discard_zeroes_data = 1;
-   else
-   zram->disk->queue->limits.discard_zeroes_data = 0;
-   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue);
+   blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
 
add_disk(zram->disk);
 
-- 
2.11.0

[PATCH 13/27] block_dev: use blkdev_issue_zerout for hole punches

2017-04-05 Thread Christoph Hellwig

This gets us support for non-discard efficient write of zeroes (e.g. NVMe)
and prepares for removing the discard_zeroes_data flag.

Also remove a pointless discard support check, which is done in
blkdev_issue_discard already.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 

Reviewed-by: Hannes Reinecke 
---
 fs/block_dev.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2f704c3a816f..e405d8e58e31 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2069,7 +2069,6 @@ static long blkdev_fallocate(struct file *file, int mode, 
loff_t start,
 loff_t len)
 {
struct block_device *bdev = I_BDEV(bdev_file_inode(file));
-   struct request_queue *q = bdev_get_queue(bdev);
struct address_space *mapping;
loff_t end = start + len - 1;
loff_t isize;
@@ -2108,15 +2107,10 @@ static long blkdev_fallocate(struct file *file, int 
mode, loff_t start,
GFP_KERNEL, BLKDEV_ZERO_NOUNMAP);
break;
case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
-   /* Only punch if the device can do zeroing discard. */
-   if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data)
-   return -EOPNOTSUPP;
-   error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
-GFP_KERNEL, 0);
+   error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
+GFP_KERNEL, 
BLKDEV_ZERO_NOFALLBACK);
break;
case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | 
FALLOC_FL_NO_HIDE_STALE:
-   if (!blk_queue_discard(q))
-   return -EOPNOTSUPP;
error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
 GFP_KERNEL, 0);
break;
-- 
2.11.0

[PATCH 05/27] md: support REQ_OP_WRITE_ZEROES

2017-04-05 Thread Christoph Hellwig

Copy & paste from the REQ_OP_WRITE_SAME code.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/linear.c| 1 +
 drivers/md/md.h| 7 +++
 drivers/md/multipath.c | 1 +
 drivers/md/raid0.c | 2 ++
 drivers/md/raid1.c | 4 +++-
 drivers/md/raid10.c| 1 +
 drivers/md/raid5.c | 1 +
 7 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 3e38e0207a3e..377a8a3672e3 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -293,6 +293,7 @@ static void linear_make_request(struct mddev *mddev, struct 
bio *bio)
  split, 
disk_devt(mddev->gendisk),
  bio_sector);
mddev_check_writesame(mddev, split);
+   mddev_check_write_zeroes(mddev, split);
generic_make_request(split);
}
} while (split != bio);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index dde8ecb760c8..1e76d64ce180 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -709,4 +709,11 @@ static inline void mddev_check_writesame(struct mddev 
*mddev, struct bio *bio)
!bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
mddev->queue->limits.max_write_same_sectors = 0;
 }
+
+static inline void mddev_check_write_zeroes(struct mddev *mddev, struct bio 
*bio)
+{
+   if (bio_op(bio) == REQ_OP_WRITE_ZEROES &&
+   !bdev_get_queue(bio->bi_bdev)->limits.max_write_zeroes_sectors)
+   mddev->queue->limits.max_write_zeroes_sectors = 0;
+}
 #endif /* _MD_MD_H */
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 79a12b59250b..e95d521d93e9 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -139,6 +139,7 @@ static void multipath_make_request(struct mddev *mddev, 
struct bio * bio)
mp_bh->bio.bi_end_io = multipath_end_request;
mp_bh->bio.bi_private = mp_bh;
mddev_check_writesame(mddev, _bh->bio);
+   mddev_check_write_zeroes(mddev, _bh->bio);
generic_make_request(_bh->bio);
return;
 }
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 93347ca7c7a6..ce7a6a56cf73 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -383,6 +383,7 @@ static int raid0_run(struct mddev *mddev)
 
blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
blk_queue_max_write_same_sectors(mddev->queue, 
mddev->chunk_sectors);
+   blk_queue_max_write_zeroes_sectors(mddev->queue, 
mddev->chunk_sectors);
blk_queue_max_discard_sectors(mddev->queue, 
mddev->chunk_sectors);
 
blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
@@ -504,6 +505,7 @@ static void raid0_make_request(struct mddev *mddev, struct 
bio *bio)
  split, 
disk_devt(mddev->gendisk),
  bio_sector);
mddev_check_writesame(mddev, split);
+   mddev_check_write_zeroes(mddev, split);
generic_make_request(split);
}
} while (split != bio);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a34f58772022..b59cc100320a 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -3177,8 +3177,10 @@ static int raid1_run(struct mddev *mddev)
if (IS_ERR(conf))
return PTR_ERR(conf);
 
-   if (mddev->queue)
+   if (mddev->queue) {
blk_queue_max_write_same_sectors(mddev->queue, 0);
+   blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
+   }
 
rdev_for_each(rdev, mddev) {
if (!mddev->gendisk)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e89a8d78a9ed..28ec3a93acee 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3749,6 +3749,7 @@ static int raid10_run(struct mddev *mddev)
blk_queue_max_discard_sectors(mddev->queue,
  mddev->chunk_sectors);
blk_queue_max_write_same_sectors(mddev->queue, 0);
+   blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
blk_queue_io_min(mddev->queue, chunk_size);
if (conf->geo.raid_disks % conf->geo.near_copies)
blk_queue_io_opt(mddev->queue, chunk_size * 
conf->geo.raid_disks);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ed5cd705b985..8cf1f86dcd05 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7272,6 +7272,7 @@ static int raid5_run(struct mddev *mddev)
mddev->queue->limits.discard_zeroes_data = 0;
 
blk_queue_max_write_same_sectors(mddev->queue, 0);
+   blk_queue_max_write_zeroes_sectors(mddev->queue, 0);

[PATCH 06/27] dm io: discards don't take a payload

2017-04-05 Thread Christoph Hellwig

Fix up do_region to not allocate a bio_vec for discards.  We've
got rid of the discard payload allocated by the caller years ago.

Obviously this wasn't actually harmful given how long it's been
there, but it's still good to avoid the pointless allocation.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/dm-io.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 03940bf36f6c..b808cbe22678 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -328,11 +328,17 @@ static void do_region(int op, int op_flags, unsigned 
region,
/*
 * Allocate a suitably sized-bio.
 */
-   if ((op == REQ_OP_DISCARD) || (op == REQ_OP_WRITE_SAME))
+   switch (op) {
+   case REQ_OP_DISCARD:
+   num_bvecs = 0;
+   break;
+   case REQ_OP_WRITE_SAME:
num_bvecs = 1;
-   else
+   break;
+   default:
num_bvecs = min_t(int, BIO_MAX_PAGES,
  dm_sector_div_up(remaining, 
(PAGE_SIZE >> SECTOR_SHIFT)));
+   }
 
bio = bio_alloc_bioset(GFP_NOIO, num_bvecs, io->client->bios);
bio->bi_iter.bi_sector = where->sector + (where->count - 
remaining);
-- 
2.11.0

[PATCH 26/27] scsi: sd: Separate zeroout and discard command choices

2017-04-05 Thread Christoph Hellwig

From: "Martin K. Petersen" 

Now that zeroout and discards are distinct operations we need to
separate the policy of choosing the appropriate command. Create a
zeroing_mode which can be one of:

write:  Zeroout assist not present, use regular WRITE
writesame:  Allow WRITE SAME(10/16) with a zeroed payload
writesame_16_unmap: Allow WRITE SAME(16) with UNMAP
writesame_10_unmap: Allow WRITE SAME(10) with UNMAP

The last two are conditional on the device being thin provisioned with
LBPRZ=1 and LBPWS=1 or LBPWS10=1 respectively.

Whether to set the UNMAP bit or not depends on the REQ_NOUNMAP flag. And
if none of the _unmap variants are supported, regular WRITE SAME will be
used if the device supports it.

The zeroout_mode is exported in sysfs and the detected mode for a given
device can be overridden using the string constants above.

With this change in place we can now issue WRITE SAME(16) with UNMAP set
for block zeroing applications that require hard guarantees and
logical_block_size granularity. And at the same time use the UNMAP
command with the device's preferred granulary and alignment for discard
operations.

Signed-off-by: Martin K. Petersen 
Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/sd.c | 56 ---
 drivers/scsi/sd.h |  8 
 2 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index bcb0cb020fd2..acf9d17b05d8 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -418,6 +418,46 @@ provisioning_mode_store(struct device *dev, struct 
device_attribute *attr,
 }
 static DEVICE_ATTR_RW(provisioning_mode);
 
+static const char *zeroing_mode[] = {
+   [SD_ZERO_WRITE] = "write",
+   [SD_ZERO_WS]= "writesame",
+   [SD_ZERO_WS16_UNMAP]= "writesame_16_unmap",
+   [SD_ZERO_WS10_UNMAP]= "writesame_10_unmap",
+};
+
+static ssize_t
+zeroing_mode_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+   struct scsi_disk *sdkp = to_scsi_disk(dev);
+
+   return snprintf(buf, 20, "%s\n", zeroing_mode[sdkp->zeroing_mode]);
+}
+
+static ssize_t
+zeroing_mode_store(struct device *dev, struct device_attribute *attr,
+  const char *buf, size_t count)
+{
+   struct scsi_disk *sdkp = to_scsi_disk(dev);
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EACCES;
+
+   if (!strncmp(buf, zeroing_mode[SD_ZERO_WRITE], 20))
+   sdkp->zeroing_mode = SD_ZERO_WRITE;
+   else if (!strncmp(buf, zeroing_mode[SD_ZERO_WS], 20))
+   sdkp->zeroing_mode = SD_ZERO_WS;
+   else if (!strncmp(buf, zeroing_mode[SD_ZERO_WS16_UNMAP], 20))
+   sdkp->zeroing_mode = SD_ZERO_WS16_UNMAP;
+   else if (!strncmp(buf, zeroing_mode[SD_ZERO_WS10_UNMAP], 20))
+   sdkp->zeroing_mode = SD_ZERO_WS10_UNMAP;
+   else
+   return -EINVAL;
+
+   return count;
+}
+static DEVICE_ATTR_RW(zeroing_mode);
+
 static ssize_t
 max_medium_access_timeouts_show(struct device *dev,
struct device_attribute *attr, char *buf)
@@ -496,6 +536,7 @@ static struct attribute *sd_disk_attrs[] = {
_attr_app_tag_own.attr,
_attr_thin_provisioning.attr,
_attr_provisioning_mode.attr,
+   _attr_zeroing_mode.attr,
_attr_max_write_same_blocks.attr,
_attr_max_medium_access_timeouts.attr,
NULL,
@@ -799,10 +840,10 @@ static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd 
*cmd)
u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
 
if (!(rq->cmd_flags & REQ_NOUNMAP)) {
-   switch (sdkp->provisioning_mode) {
-   case SD_LBP_WS16:
+   switch (sdkp->zeroing_mode) {
+   case SD_ZERO_WS16_UNMAP:
return sd_setup_write_same16_cmnd(cmd, true);
-   case SD_LBP_WS10:
+   case SD_ZERO_WS10_UNMAP:
return sd_setup_write_same10_cmnd(cmd, true);
}
}
@@ -840,6 +881,15 @@ static void sd_config_write_same(struct scsi_disk *sdkp)
sdkp->max_ws_blocks = 0;
}
 
+   if (sdkp->lbprz && sdkp->lbpws)
+   sdkp->zeroing_mode = SD_ZERO_WS16_UNMAP;
+   else if (sdkp->lbprz && sdkp->lbpws10)
+   sdkp->zeroing_mode = SD_ZERO_WS10_UNMAP;
+   else if (sdkp->max_ws_blocks)
+   sdkp->zeroing_mode = SD_ZERO_WS;
+   else
+   sdkp->zeroing_mode = SD_ZERO_WRITE;
+
 out:
blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks *
 (logical_block_size >> 9));
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 4dac35e96a75..a2c4b5c35379 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -59,6 +59,13 @@ enum {

[PATCH 25/27] block: remove the discard_zeroes_data flag

2017-04-05 Thread Christoph Hellwig

Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
kill this hack.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 Documentation/ABI/testing/sysfs-block | 10 ++-
 Documentation/block/queue-sysfs.txt   |  5 
 block/blk-lib.c   |  7 +
 block/blk-settings.c  |  3 ---
 block/blk-sysfs.c |  2 +-
 block/compat_ioctl.c  |  2 +-
 block/ioctl.c |  2 +-
 drivers/block/drbd/drbd_main.c|  2 --
 drivers/block/drbd/drbd_nl.c  |  7 +
 drivers/block/loop.c  |  2 --
 drivers/block/mtip32xx/mtip32xx.c |  1 -
 drivers/block/nbd.c   |  1 -
 drivers/md/dm-cache-target.c  |  1 -
 drivers/md/dm-crypt.c |  1 -
 drivers/md/dm-raid.c  |  6 ++---
 drivers/md/dm-raid1.c |  1 -
 drivers/md/dm-table.c | 19 -
 drivers/md/dm-thin.c  |  2 --
 drivers/md/raid5.c| 50 +++
 drivers/scsi/sd.c |  5 
 drivers/target/target_core_device.c   |  2 +-
 include/linux/blkdev.h| 15 ---
 include/linux/device-mapper.h |  5 
 23 files changed, 27 insertions(+), 124 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block 
b/Documentation/ABI/testing/sysfs-block
index 2da04ce6aeef..dea212db9df3 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -213,14 +213,8 @@ What:  
/sys/block//queue/discard_zeroes_data
 Date:  May 2011
 Contact:   Martin K. Petersen 
 Description:
-   Devices that support discard functionality may return
-   stale or random data when a previously discarded block
-   is read back. This can cause problems if the filesystem
-   expects discarded blocks to be explicitly cleared. If a
-   device reports that it deterministically returns zeroes
-   when a discarded area is read the discard_zeroes_data
-   parameter will be set to one. Otherwise it will be 0 and
-   the result of reading a discarded area is undefined.
+   Will always return 0.  Don't rely on any specific behavior
+   for discards, and don't read this file.
 
 What:  /sys/block//queue/write_same_max_bytes
 Date:  January 2012
diff --git a/Documentation/block/queue-sysfs.txt 
b/Documentation/block/queue-sysfs.txt
index b7f6bdc96d73..2c1e67058fd3 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -43,11 +43,6 @@ large discards are issued, setting this value lower will 
make Linux issue
 smaller discards and potentially help reduce latencies induced by large
 discard operations.
 
-discard_zeroes_data (RO)
-
-When read, this file will show if the discarded block are zeroed by the
-device or not. If its value is '1' the blocks are zeroed otherwise not.
-
 hw_sector_size (RO)
 ---
 This is the hardware sector size of the device, in bytes.
diff --git a/block/blk-lib.c b/block/blk-lib.c
index b0c6c4bcf441..e8caecd71688 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -37,17 +37,12 @@ int __blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
return -ENXIO;
 
if (flags & BLKDEV_DISCARD_SECURE) {
-   if (flags & BLKDEV_DISCARD_ZERO)
-   return -EOPNOTSUPP;
if (!blk_queue_secure_erase(q))
return -EOPNOTSUPP;
op = REQ_OP_SECURE_ERASE;
} else {
if (!blk_queue_discard(q))
return -EOPNOTSUPP;
-   if ((flags & BLKDEV_DISCARD_ZERO) &&
-   !q->limits.discard_zeroes_data)
-   return -EOPNOTSUPP;
op = REQ_OP_DISCARD;
}
 
@@ -126,7 +121,7 @@ int blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
);
if (!ret && bio) {
ret = submit_bio_wait(bio);
-   if (ret == -EOPNOTSUPP && !(flags & BLKDEV_DISCARD_ZERO))
+   if (ret == -EOPNOTSUPP)
ret = 0;
bio_put(bio);
}
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 1e7174ffc9d4..4fa81ed383ca 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -103,7 +103,6 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->discard_granularity = 0;
lim->discard_alignment = 0;
lim->discard_misaligned = 0;
-   lim->discard_zeroes_data = 0;
lim->logical_block_size = lim->physical_block_size = lim->io_min =

[PATCH 21/27] mmc: remove the discard_zeroes_data flag

2017-04-05 Thread Christoph Hellwig

mmc only supports discarding on large alignments, so the zeroing code
would always fall back to explicit writings of zeroes.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/mmc/core/queue.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 493eb10ce580..4c54ad34e17a 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -167,8 +167,6 @@ static void mmc_queue_setup_discard(struct request_queue *q,
 
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
blk_queue_max_discard_sectors(q, max_discard);
-   if (card->erased_byte == 0 && !mmc_can_discard(card))
-   q->limits.discard_zeroes_data = 1;
q->limits.discard_granularity = card->pref_erase << 9;
/* granularity must not be greater than max. discard */
if (card->pref_erase > max_discard)
-- 
2.11.0

[PATCH 22/27] block: stop using discards for zeroing

2017-04-05 Thread Christoph Hellwig

Now that we have REQ_OP_WRITE_ZEROES implemented for all devices that
support efficient zeroing, we can remove the call to blkdev_issue_discard.
This means we only have two ways of zeroing left and can simplify the
code.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 block/blk-lib.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 2f882e22890b..b0c6c4bcf441 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -279,6 +279,12 @@ static int __blkdev_issue_write_zeroes(struct block_device 
*bdev,
  *  Zero-fill a block range, either using hardware offload or by explicitly
  *  writing zeroes to the device.
  *
+ *  Note that this function may fail with -EOPNOTSUPP if the driver signals
+ *  zeroing offload support, but the device fails to process the command (for
+ *  some devices there is no non-destructive way to verify whether this
+ *  operation is actually supported).  In this case the caller should call
+ *  retry the call to blkdev_issue_zeroout() and the fallback path will be 
used.
+ *
  *  If a device is using logical block provisioning, the underlying space will
  *  not be released if %flags contains BLKDEV_ZERO_NOUNMAP.
  *
@@ -349,12 +355,6 @@ int blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
struct bio *bio = NULL;
struct blk_plug plug;
 
-   if (!(flags & BLKDEV_ZERO_NOUNMAP)) {
-   if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask,
-   BLKDEV_DISCARD_ZERO))
-   return 0;
-   }
-
blk_start_plug();
ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
, flags);
-- 
2.11.0

[PATCH 12/27] block: add a new BLKDEV_ZERO_NOFALLBACK flag

2017-04-05 Thread Christoph Hellwig

This avoids fallbacks to explicit zeroing in (__)blkdev_issue_zeroout if
the caller doesn't want them.

Also clean up the convoluted check for the return condition that this
new flag is added to.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 block/blk-lib.c| 5 -
 include/linux/blkdev.h | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 2f6d2cb2e1a2..2f882e22890b 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -281,6 +281,9 @@ static int __blkdev_issue_write_zeroes(struct block_device 
*bdev,
  *
  *  If a device is using logical block provisioning, the underlying space will
  *  not be released if %flags contains BLKDEV_ZERO_NOUNMAP.
+ *
+ *  If %flags contains BLKDEV_ZERO_NOFALLBACK, the function will return
+ *  -EOPNOTSUPP if no explicit hardware offload for zeroing is provided.
  */
 int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
@@ -298,7 +301,7 @@ int __blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
 
ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask,
biop, flags);
-   if (ret == 0 || (ret && ret != -EOPNOTSUPP))
+   if (ret != -EOPNOTSUPP || (flags & BLKDEV_ZERO_NOFALLBACK))
goto out;
 
ret = 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e7513ce3dbde..a5055d760661 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1351,6 +1351,7 @@ extern int __blkdev_issue_discard(struct block_device 
*bdev, sector_t sector,
struct bio **biop);
 
 #define BLKDEV_ZERO_NOUNMAP(1 << 0)  /* do not free blocks */
+#define BLKDEV_ZERO_NOFALLBACK (1 << 1)  /* don't write explicit zeroes */
 
 extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
-- 
2.11.0

[PATCH 20/27] rsxx: remove the discard_zeroes_data flag

2017-04-05 Thread Christoph Hellwig

rsxx only supports discarding on large alignments, so the zeroing code
would always fall back to explicit writings of zeroes.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/block/rsxx/dev.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index f81d70b39d10..9c566364ac9c 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -300,7 +300,6 @@ int rsxx_setup_dev(struct rsxx_cardinfo *card)
RSXX_HW_BLK_SIZE >> 9);
card->queue->limits.discard_granularity = RSXX_HW_BLK_SIZE;
card->queue->limits.discard_alignment   = RSXX_HW_BLK_SIZE;
-   card->queue->limits.discard_zeroes_data = 1;
}
 
card->queue->queuedata = card;
-- 
2.11.0

[PATCH 09/27] block: stop using blkdev_issue_write_same for zeroing

2017-04-05 Thread Christoph Hellwig

We'll always use the WRITE ZEROES code for zeroing now.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 block/blk-lib.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e5b853f2b8a2..2a8d638544a7 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -364,10 +364,6 @@ int blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
return 0;
}
 
-   if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
-   ZERO_PAGE(0)))
-   return 0;
-
blk_start_plug();
ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
, discard);
-- 
2.11.0

[PATCH 10/27] block: add a flags argument to (__)blkdev_issue_zeroout

2017-04-05 Thread Christoph Hellwig

Turn the existing discard flag into a new BLKDEV_ZERO_UNMAP flag with
similar semantics, but without referring to diѕcard.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 block/blk-lib.c| 31 ++-
 block/ioctl.c  |  2 +-
 drivers/block/drbd/drbd_receiver.c |  9 ++---
 drivers/nvme/target/io-cmd.c   |  2 +-
 fs/block_dev.c |  2 +-
 fs/dax.c   |  2 +-
 fs/xfs/xfs_bmap_util.c |  2 +-
 include/linux/blkdev.h | 16 ++--
 8 files changed, 35 insertions(+), 31 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 2a8d638544a7..f9f24ec69c27 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -282,14 +282,18 @@ static int __blkdev_issue_write_zeroes(struct 
block_device *bdev,
  * @nr_sects:  number of sectors to write
  * @gfp_mask:  memory allocation flags (for bio_alloc)
  * @biop:  pointer to anchor bio
- * @discard:   discard flag
+ * @flags: controls detailed behavior
  *
  * Description:
- *  Generate and issue number of bios with zerofiled pages.
+ *  Zero-fill a block range, either using hardware offload or by explicitly
+ *  writing zeroes to the device.
+ *
+ *  If a device is using logical block provisioning, the underlying space will
+ *  not be released if %flags contains BLKDEV_ZERO_NOUNMAP.
  */
 int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
-   bool discard)
+   unsigned flags)
 {
int ret;
int bi_size = 0;
@@ -337,28 +341,21 @@ EXPORT_SYMBOL(__blkdev_issue_zeroout);
  * @sector:start sector
  * @nr_sects:  number of sectors to write
  * @gfp_mask:  memory allocation flags (for bio_alloc)
- * @discard:   whether to discard the block range
+ * @flags: controls detailed behavior
  *
  * Description:
- *  Zero-fill a block range.  If the discard flag is set and the block
- *  device guarantees that subsequent READ operations to the block range
- *  in question will return zeroes, the blocks will be discarded. Should
- *  the discard request fail, if the discard flag is not set, or if
- *  discard_zeroes_data is not supported, this function will resort to
- *  zeroing the blocks manually, thus provisioning (allocating,
- *  anchoring) them. If the block device supports WRITE ZEROES or WRITE SAME
- *  command(s), blkdev_issue_zeroout() will use it to optimize the process of
- *  clearing the block range. Otherwise the zeroing will be performed
- *  using regular WRITE calls.
+ *  Zero-fill a block range, either using hardware offload or by explicitly
+ *  writing zeroes to the device.  See __blkdev_issue_zeroout() for the
+ *  valid values for %flags.
  */
 int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
-sector_t nr_sects, gfp_t gfp_mask, bool discard)
+   sector_t nr_sects, gfp_t gfp_mask, unsigned flags)
 {
int ret;
struct bio *bio = NULL;
struct blk_plug plug;
 
-   if (discard) {
+   if (!(flags & BLKDEV_ZERO_NOUNMAP)) {
if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask,
BLKDEV_DISCARD_ZERO))
return 0;
@@ -366,7 +363,7 @@ int blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
 
blk_start_plug();
ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
-   , discard);
+   , flags);
if (ret == 0 && bio) {
ret = submit_bio_wait(bio);
bio_put(bio);
diff --git a/block/ioctl.c b/block/ioctl.c
index 7b88820b93d9..8ea00a41be01 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -255,7 +255,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, 
fmode_t mode,
truncate_inode_pages_range(mapping, start, end);
 
return blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
-   false);
+   BLKDEV_ZERO_NOUNMAP);
 }
 
 static int put_ushort(unsigned long arg, unsigned short val)
diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index aa6bf9692eff..dc9a6dcd431c 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1499,19 +1499,22 @@ int drbd_issue_discard_or_zero_out(struct drbd_device 
*device, sector_t start, u
tmp = start + granularity - sector_div(tmp, granularity);
 
nr = tmp - start;
-   err |= blkdev_issue_zeroout(bdev, start, nr, GFP_NOIO, 0);
+   err |= blkdev_issue_zeroout(bdev, start, nr, GFP_NOIO,
+   BLKDEV_ZERO_NOUNMAP);
nr_sectors -= nr;

[PATCH 14/27] sd: implement unmapping Write Zeroes

2017-04-05 Thread Christoph Hellwig

Try to use a write same with unmap bit variant if the device supports it
and the caller allows for it.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 drivers/scsi/sd.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index d8d9c0bdd93c..001593ed0444 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -803,6 +803,15 @@ static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd 
*cmd)
u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
 
+   if (!(rq->cmd_flags & REQ_NOUNMAP)) {
+   switch (sdkp->provisioning_mode) {
+   case SD_LBP_WS16:
+   return sd_setup_write_same16_cmnd(cmd, true);
+   case SD_LBP_WS10:
+   return sd_setup_write_same10_cmnd(cmd, true);
+   }
+   }
+
if (sdp->no_write_same)
return BLKPREP_INVALID;
if (sdkp->ws16 || sector > 0x || nr_sectors > 0x)
-- 
2.11.0

[PATCH 11/27] block: add a REQ_NOUNMAP flag for REQ_OP_WRITE_ZEROES

2017-04-05 Thread Christoph Hellwig

If this flag is set logical provisioning capable device should
release space for the zeroed blocks if possible, if it is not set
devices should keep the blocks anchored.

Also remove an out of sync kerneldoc comment for a static function
that would have become even more out of data with this change.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 block/blk-lib.c   | 19 +--
 include/linux/blk_types.h |  6 ++
 2 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index f9f24ec69c27..2f6d2cb2e1a2 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -226,20 +226,9 @@ int blkdev_issue_write_same(struct block_device *bdev, 
sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_write_same);
 
-/**
- * __blkdev_issue_write_zeroes - generate number of bios with WRITE ZEROES
- * @bdev:  blockdev to issue
- * @sector:start sector
- * @nr_sects:  number of sectors to write
- * @gfp_mask:  memory allocation flags (for bio_alloc)
- * @biop:  pointer to anchor bio
- *
- * Description:
- *  Generate and issue number of bios(REQ_OP_WRITE_ZEROES) with zerofiled 
pages.
- */
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
-   struct bio **biop)
+   struct bio **biop, unsigned flags)
 {
struct bio *bio = *biop;
unsigned int max_write_zeroes_sectors;
@@ -258,7 +247,9 @@ static int __blkdev_issue_write_zeroes(struct block_device 
*bdev,
bio = next_bio(bio, 0, gfp_mask);
bio->bi_iter.bi_sector = sector;
bio->bi_bdev = bdev;
-   bio_set_op_attrs(bio, REQ_OP_WRITE_ZEROES, 0);
+   bio->bi_opf = REQ_OP_WRITE_ZEROES;
+   if (flags & BLKDEV_ZERO_NOUNMAP)
+   bio->bi_opf |= REQ_NOUNMAP;
 
if (nr_sects > max_write_zeroes_sectors) {
bio->bi_iter.bi_size = max_write_zeroes_sectors << 9;
@@ -306,7 +297,7 @@ int __blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
return -EINVAL;
 
ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask,
-   biop);
+   biop, flags);
if (ret == 0 || (ret && ret != -EOPNOTSUPP))
goto out;
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 4eae30bfbfca..8eaa7dca7057 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -195,6 +195,10 @@ enum req_flag_bits {
__REQ_PREFLUSH, /* request for cache flush */
__REQ_RAHEAD,   /* read ahead, can fail anytime */
__REQ_BACKGROUND,   /* background IO */
+
+   /* command specific flags for REQ_OP_WRITE_ZEROES: */
+   __REQ_NOUNMAP,  /* do not free blocks when zeroing */
+
__REQ_NR_BITS,  /* stops here */
 };
 
@@ -212,6 +216,8 @@ enum req_flag_bits {
 #define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
 
+#define REQ_NOUNMAP(1ULL << __REQ_NOUNMAP)
+
 #define REQ_FAILFAST_MASK \
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
 
-- 
2.11.0

[PATCH 19/27] rbd: remove the discard_zeroes_data flag

2017-04-05 Thread Christoph Hellwig

rbd only supports discarding on large alignments, so the zeroing code
would always fall back to explicit writings of zeroes.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/block/rbd.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index f24ade333e0c..089ac4179919 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -4380,7 +4380,6 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
q->limits.discard_granularity = segment_size;
q->limits.discard_alignment = segment_size;
blk_queue_max_discard_sectors(q, segment_size / SECTOR_SIZE);
-   q->limits.discard_zeroes_data = 1;
 
if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
q->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES;
-- 
2.11.0

[PATCH 18/27] brd: remove discard support

2017-04-05 Thread Christoph Hellwig

It's just a in-driver reimplementation of writing zeroes to the pages,
which fails if the discards aren't page aligned.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/block/brd.c | 54 -
 1 file changed, 54 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3adc32a3153b..4ec84d504780 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -134,28 +134,6 @@ static struct page *brd_insert_page(struct brd_device 
*brd, sector_t sector)
return page;
 }
 
-static void brd_free_page(struct brd_device *brd, sector_t sector)
-{
-   struct page *page;
-   pgoff_t idx;
-
-   spin_lock(>brd_lock);
-   idx = sector >> PAGE_SECTORS_SHIFT;
-   page = radix_tree_delete(>brd_pages, idx);
-   spin_unlock(>brd_lock);
-   if (page)
-   __free_page(page);
-}
-
-static void brd_zero_page(struct brd_device *brd, sector_t sector)
-{
-   struct page *page;
-
-   page = brd_lookup_page(brd, sector);
-   if (page)
-   clear_highpage(page);
-}
-
 /*
  * Free all backing store pages and radix tree. This must only be called when
  * there are no other users of the device.
@@ -212,24 +190,6 @@ static int copy_to_brd_setup(struct brd_device *brd, 
sector_t sector, size_t n)
return 0;
 }
 
-static void discard_from_brd(struct brd_device *brd,
-   sector_t sector, size_t n)
-{
-   while (n >= PAGE_SIZE) {
-   /*
-* Don't want to actually discard pages here because
-* re-allocating the pages can result in writeback
-* deadlocks under heavy load.
-*/
-   if (0)
-   brd_free_page(brd, sector);
-   else
-   brd_zero_page(brd, sector);
-   sector += PAGE_SIZE >> SECTOR_SHIFT;
-   n -= PAGE_SIZE;
-   }
-}
-
 /*
  * Copy n bytes from src to the brd starting at sector. Does not sleep.
  */
@@ -338,14 +298,6 @@ static blk_qc_t brd_make_request(struct request_queue *q, 
struct bio *bio)
if (bio_end_sector(bio) > get_capacity(bdev->bd_disk))
goto io_error;
 
-   if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
-   if (sector & ((PAGE_SIZE >> SECTOR_SHIFT) - 1) ||
-   bio->bi_iter.bi_size & ~PAGE_MASK)
-   goto io_error;
-   discard_from_brd(brd, sector, bio->bi_iter.bi_size);
-   goto out;
-   }
-
bio_for_each_segment(bvec, bio, iter) {
unsigned int len = bvec.bv_len;
int err;
@@ -357,7 +309,6 @@ static blk_qc_t brd_make_request(struct request_queue *q, 
struct bio *bio)
sector += len >> SECTOR_SHIFT;
}
 
-out:
bio_endio(bio);
return BLK_QC_T_NONE;
 io_error:
@@ -464,11 +415,6 @@ static struct brd_device *brd_alloc(int i)
 *  is harmless)
 */
blk_queue_physical_block_size(brd->brd_queue, PAGE_SIZE);
-
-   brd->brd_queue->limits.discard_granularity = PAGE_SIZE;
-   blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX);
-   brd->brd_queue->limits.discard_zeroes_data = 1;
-   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue);
 #ifdef CONFIG_BLK_DEV_RAM_DAX
queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
 #endif
-- 
2.11.0

[PATCH 17/27] loop: implement REQ_OP_WRITE_ZEROES

2017-04-05 Thread Christoph Hellwig

It's identical to discard as hole punches will always leave us with
zeroes on reads.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/block/loop.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cc981f34e017..3bb04c1a4ba1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -528,6 +528,7 @@ static int do_req_filebacked(struct loop_device *lo, struct 
request *rq)
case REQ_OP_FLUSH:
return lo_req_flush(lo, rq);
case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
return lo_discard(lo, rq, pos);
case REQ_OP_WRITE:
if (lo->transfer)
@@ -826,6 +827,7 @@ static void loop_config_discard(struct loop_device *lo)
q->limits.discard_granularity = 0;
q->limits.discard_alignment = 0;
blk_queue_max_discard_sectors(q, 0);
+   blk_queue_max_write_zeroes_sectors(q, 0);
q->limits.discard_zeroes_data = 0;
queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q);
return;
@@ -834,6 +836,7 @@ static void loop_config_discard(struct loop_device *lo)
q->limits.discard_granularity = inode->i_sb->s_blocksize;
q->limits.discard_alignment = 0;
blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
+   blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
q->limits.discard_zeroes_data = 1;
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
 }
@@ -1660,6 +1663,7 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
switch (req_op(cmd->rq)) {
case REQ_OP_FLUSH:
case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
cmd->use_aio = false;
break;
default:
-- 
2.11.0

[PATCH 15/27] nvme: implement REQ_OP_WRITE_ZEROES

2017-04-05 Thread Christoph Hellwig

But now for the real NVMe Write Zeroes yet, just to get rid of the
discard abuse for zeroing.  Also rename the quirk flag to be a bit
more self-explanatory.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 drivers/nvme/host/core.c | 10 +-
 drivers/nvme/host/nvme.h |  6 +++---
 drivers/nvme/host/pci.c  |  6 +++---
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3c908e1bc903..26d5129a640a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -358,6 +358,8 @@ int nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
case REQ_OP_FLUSH:
nvme_setup_flush(ns, cmd);
break;
+   case REQ_OP_WRITE_ZEROES:
+   /* currently only aliased to deallocate for a few ctrls: */
case REQ_OP_DISCARD:
ret = nvme_setup_discard(ns, req, cmd);
break;
@@ -923,16 +925,14 @@ static void nvme_config_discard(struct nvme_ns *ns)
BUILD_BUG_ON(PAGE_SIZE / sizeof(struct nvme_dsm_range) <
NVME_DSM_MAX_RANGES);
 
-   if (ctrl->quirks & NVME_QUIRK_DISCARD_ZEROES)
-   ns->queue->limits.discard_zeroes_data = 1;
-   else
-   ns->queue->limits.discard_zeroes_data = 0;
-
ns->queue->limits.discard_alignment = logical_block_size;
ns->queue->limits.discard_granularity = logical_block_size;
blk_queue_max_discard_sectors(ns->queue, UINT_MAX);
blk_queue_max_discard_segments(ns->queue, NVME_DSM_MAX_RANGES);
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
+
+   if (ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES)
+   blk_queue_max_write_zeroes_sectors(ns->queue, UINT_MAX);
 }
 
 static int nvme_revalidate_ns(struct nvme_ns *ns, struct nvme_id_ns **id)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 227f281482db..f903726eeb68 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -68,10 +68,10 @@ enum nvme_quirks {
NVME_QUIRK_IDENTIFY_CNS = (1 << 1),
 
/*
-* The controller deterministically returns O's on reads to discarded
-* logical blocks.
+* The controller deterministically returns O's on reads to
+* logical blocks that deallocate was called on.
 */
-   NVME_QUIRK_DISCARD_ZEROES   = (1 << 2),
+   NVME_QUIRK_DEALLOCATE_ZEROES= (1 << 2),
 
/*
 * The controller needs a delay before starts checking the device
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 9e686a67d93b..cb530a6bef3f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2113,13 +2113,13 @@ static const struct pci_error_handlers nvme_err_handler 
= {
 static const struct pci_device_id nvme_id_table[] = {
{ PCI_VDEVICE(INTEL, 0x0953),
.driver_data = NVME_QUIRK_STRIPE_SIZE |
-   NVME_QUIRK_DISCARD_ZEROES, },
+   NVME_QUIRK_DEALLOCATE_ZEROES, },
{ PCI_VDEVICE(INTEL, 0x0a53),
.driver_data = NVME_QUIRK_STRIPE_SIZE |
-   NVME_QUIRK_DISCARD_ZEROES, },
+   NVME_QUIRK_DEALLOCATE_ZEROES, },
{ PCI_VDEVICE(INTEL, 0x0a54),
.driver_data = NVME_QUIRK_STRIPE_SIZE |
-   NVME_QUIRK_DISCARD_ZEROES, },
+   NVME_QUIRK_DEALLOCATE_ZEROES, },
{ PCI_VDEVICE(INTEL, 0x5845),   /* Qemu emulated controller */
.driver_data = NVME_QUIRK_IDENTIFY_CNS, },
{ PCI_DEVICE(0x1c58, 0x0003),   /* HGST adapter */
-- 
2.11.0

[PATCH 23/27] drbd: make intelligent use of blkdev_issue_zeroout

2017-04-05 Thread Christoph Hellwig

drbd always wants its discard wire operations to zero the blocks, so
use blkdev_issue_zeroout with the BLKDEV_ZERO_UNMAP flag instead of
reinventing it poorly.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/block/drbd/drbd_debugfs.c  |   3 --
 drivers/block/drbd/drbd_int.h  |   6 ---
 drivers/block/drbd/drbd_receiver.c | 102 ++---
 drivers/block/drbd/drbd_req.c  |   6 +--
 4 files changed, 7 insertions(+), 110 deletions(-)

diff --git a/drivers/block/drbd/drbd_debugfs.c 
b/drivers/block/drbd/drbd_debugfs.c
index de5c3ee8a790..494837e59f23 100644
--- a/drivers/block/drbd/drbd_debugfs.c
+++ b/drivers/block/drbd/drbd_debugfs.c
@@ -236,9 +236,6 @@ static void seq_print_peer_request_flags(struct seq_file 
*m, struct drbd_peer_re
seq_print_rq_state_bit(m, f & EE_CALL_AL_COMPLETE_IO, , "in-AL");
seq_print_rq_state_bit(m, f & EE_SEND_WRITE_ACK, , "C");
seq_print_rq_state_bit(m, f & EE_MAY_SET_IN_SYNC, , "set-in-sync");
-
-   if (f & EE_IS_TRIM)
-   __seq_print_rq_state_bit(m, f & EE_IS_TRIM_USE_ZEROOUT, , 
"zero-out", "trim");
seq_print_rq_state_bit(m, f & EE_WRITE_SAME, , "write-same");
seq_putc(m, '\n');
 }
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 724d1c50fc52..d5da45bb03a6 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -437,9 +437,6 @@ enum {
 
/* is this a TRIM aka REQ_DISCARD? */
__EE_IS_TRIM,
-   /* our lower level cannot handle trim,
-* and we want to fall back to zeroout instead */
-   __EE_IS_TRIM_USE_ZEROOUT,
 
/* In case a barrier failed,
 * we need to resubmit without the barrier flag. */
@@ -482,7 +479,6 @@ enum {
 #define EE_CALL_AL_COMPLETE_IO (1<<__EE_CALL_AL_COMPLETE_IO)
 #define EE_MAY_SET_IN_SYNC (1<<__EE_MAY_SET_IN_SYNC)
 #define EE_IS_TRIM (1<<__EE_IS_TRIM)
-#define EE_IS_TRIM_USE_ZEROOUT (1<<__EE_IS_TRIM_USE_ZEROOUT)
 #define EE_RESUBMITTED (1<<__EE_RESUBMITTED)
 #define EE_WAS_ERROR   (1<<__EE_WAS_ERROR)
 #define EE_HAS_DIGEST  (1<<__EE_HAS_DIGEST)
@@ -1561,8 +1557,6 @@ extern void start_resync_timer_fn(unsigned long data);
 extern void drbd_endio_write_sec_final(struct drbd_peer_request *peer_req);
 
 /* drbd_receiver.c */
-extern int drbd_issue_discard_or_zero_out(struct drbd_device *device,
-   sector_t start, unsigned int nr_sectors, bool discard);
 extern int drbd_receiver(struct drbd_thread *thi);
 extern int drbd_ack_receiver(struct drbd_thread *thi);
 extern void drbd_send_ping_wf(struct work_struct *ws);
diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index dc9a6dcd431c..bc1d296581f9 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1448,108 +1448,14 @@ void drbd_bump_write_ordering(struct drbd_resource 
*resource, struct drbd_backin
drbd_info(resource, "Method to ensure write ordering: %s\n", 
write_ordering_str[resource->write_ordering]);
 }
 
-/*
- * We *may* ignore the discard-zeroes-data setting, if so configured.
- *
- * Assumption is that it "discard_zeroes_data=0" is only because the backend
- * may ignore partial unaligned discards.
- *
- * LVM/DM thin as of at least
- *   LVM version: 2.02.115(2)-RHEL7 (2015-01-28)
- *   Library version: 1.02.93-RHEL7 (2015-01-28)
- *   Driver version:  4.29.0
- * still behaves this way.
- *
- * For unaligned (wrt. alignment and granularity) or too small discards,
- * we zero-out the initial (and/or) trailing unaligned partial chunks,
- * but discard all the aligned full chunks.
- *
- * At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".
- */
-int drbd_issue_discard_or_zero_out(struct drbd_device *device, sector_t start, 
unsigned int nr_sectors, bool discard)
-{
-   struct block_device *bdev = device->ldev->backing_bdev;
-   struct request_queue *q = bdev_get_queue(bdev);
-   sector_t tmp, nr;
-   unsigned int max_discard_sectors, granularity;
-   int alignment;
-   int err = 0;
-
-   if (!discard)
-   goto zero_out;
-
-   /* Zero-sector (unknown) and one-sector granularities are the same.  */
-   granularity = max(q->limits.discard_granularity >> 9, 1U);
-   alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;
-
-   max_discard_sectors = min(q->limits.max_discard_sectors, (1U << 22));
-   max_discard_sectors -= max_discard_sectors % granularity;
-   if (unlikely(!max_discard_sectors))
-   goto zero_out;
-
-   if (nr_sectors < granularity)
-   goto zero_out;
-
-   tmp = start;
-   if (sector_div(tmp, granularity) != alignment) {
-   if (nr_sectors < 2*granularity)
-   goto zero_out;
-   /* start + gran - (start + gran - align) % gran */
-

[PATCH 27/27] scsi: sd: Remove LBPRZ dependency for discards

2017-04-05 Thread Christoph Hellwig

From: "Martin K. Petersen" 

Separating discards and zeroout operations allows us to remove the LBPRZ
block zeroing constraints from discards and honor the device preferences
for UNMAP commands.

If supported by the device, we'll also choose UNMAP over one of the
WRITE SAME variants for discards.

Signed-off-by: Martin K. Petersen 
Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/sd.c | 25 ++---
 1 file changed, 6 insertions(+), 19 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index acf9d17b05d8..8cf34a8e3eea 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -685,24 +685,11 @@ static void sd_config_discard(struct scsi_disk *sdkp, 
unsigned int mode)
unsigned int logical_block_size = sdkp->device->sector_size;
unsigned int max_blocks = 0;
 
-   /*
-* When LBPRZ is reported, discard alignment and granularity
-* must be fixed to the logical block size. Otherwise the block
-* layer will drop misaligned portions of the request which can
-* lead to data corruption. If LBPRZ is not set, we honor the
-* device preference.
-*/
-   if (sdkp->lbprz) {
-   q->limits.discard_alignment = 0;
-   q->limits.discard_granularity = logical_block_size;
-   } else {
-   q->limits.discard_alignment = sdkp->unmap_alignment *
-   logical_block_size;
-   q->limits.discard_granularity =
-   max(sdkp->physical_block_size,
-   sdkp->unmap_granularity * logical_block_size);
-   }
-
+   q->limits.discard_alignment =
+   sdkp->unmap_alignment * logical_block_size;
+   q->limits.discard_granularity =
+   max(sdkp->physical_block_size,
+   sdkp->unmap_granularity * logical_block_size);
sdkp->provisioning_mode = mode;
 
switch (mode) {
@@ -2842,7 +2829,7 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
sd_config_discard(sdkp, SD_LBP_WS16);
 
} else {/* LBP VPD page tells us what to use */
-   if (sdkp->lbpu && sdkp->max_unmap_blocks && 
!sdkp->lbprz)
+   if (sdkp->lbpu && sdkp->max_unmap_blocks)
sd_config_discard(sdkp, SD_LBP_UNMAP);
else if (sdkp->lbpws)
sd_config_discard(sdkp, SD_LBP_WS16);
-- 
2.11.0

[PATCH 2/5] nvme: cleanup nvme_req_needs_retry

2017-04-05 Thread Christoph Hellwig

Don't pass the status explicitly but derive it from the requeust,
and unwind the complex condition to be more readable.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---
 drivers/nvme/host/core.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0437f44d00f9..b225aacf4b89 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -67,11 +67,17 @@ static DEFINE_SPINLOCK(dev_list_lock);
 
 static struct class *nvme_class;
 
-static inline bool nvme_req_needs_retry(struct request *req, u16 status)
+static inline bool nvme_req_needs_retry(struct request *req)
 {
-   return !(status & NVME_SC_DNR || blk_noretry_request(req)) &&
-   (jiffies - req->start_time) < req->timeout &&
-   req->retries < nvme_max_retries;
+   if (blk_noretry_request(req))
+   return false;
+   if (req->errors & NVME_SC_DNR)
+   return false;
+   if (jiffies - req->start_time >= req->timeout)
+   return false;
+   if (req->retries >= nvme_max_retries)
+   return false;
+   return true;
 }
 
 void nvme_complete_rq(struct request *req)
@@ -79,7 +85,7 @@ void nvme_complete_rq(struct request *req)
int error = 0;
 
if (unlikely(req->errors)) {
-   if (nvme_req_needs_retry(req, req->errors)) {
+   if (nvme_req_needs_retry(req)) {
req->retries++;
blk_mq_requeue_request(req,
!blk_mq_queue_stopped(req->q));
-- 
2.11.0

[PATCH 4/5] nvme: move the retries count to struct nvme_request

2017-04-05 Thread Christoph Hellwig

The way NVMe uses this field is entirely different from the older
SCSI/BLOCK_PC usage, so move it into struct nvme_request.

Also reduce the size of the file to a unsigned char so that we leave
space for additional smaller fields that will appear soon.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---
 drivers/nvme/host/core.c | 10 +-
 drivers/nvme/host/nvme.h |  1 +
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 933e67c60e33..dc05f41c3992 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -49,8 +49,8 @@ unsigned char shutdown_timeout = 5;
 module_param(shutdown_timeout, byte, 0644);
 MODULE_PARM_DESC(shutdown_timeout, "timeout in seconds for controller 
shutdown");
 
-static unsigned int nvme_max_retries = 5;
-module_param_named(max_retries, nvme_max_retries, uint, 0644);
+static u8 nvme_max_retries = 5;
+module_param_named(max_retries, nvme_max_retries, byte, 0644);
 MODULE_PARM_DESC(max_retries, "max number of retries a command may have");
 
 static int nvme_char_major;
@@ -74,7 +74,7 @@ static inline bool nvme_req_needs_retry(struct request *req)
return false;
if (jiffies - req->start_time >= req->timeout)
return false;
-   if (req->retries >= nvme_max_retries)
+   if (nvme_req(req)->retries >= nvme_max_retries)
return false;
return true;
 }
@@ -85,7 +85,7 @@ void nvme_complete_rq(struct request *req)
 
if (unlikely(req->errors)) {
if (nvme_req_needs_retry(req)) {
-   req->retries++;
+   nvme_req(req)->retries++;
blk_mq_requeue_request(req,
!blk_mq_queue_stopped(req->q));
return;
@@ -356,7 +356,7 @@ int nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
int ret = BLK_MQ_RQ_QUEUE_OK;
 
if (!(req->rq_flags & RQF_DONTPREP)) {
-   req->retries = 0;
+   nvme_req(req)->retries = 0;
req->rq_flags |= RQF_DONTPREP;
}
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 82ba9a305301..9eecb67177df 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -90,6 +90,7 @@ enum nvme_quirks {
 struct nvme_request {
struct nvme_command *cmd;
union nvme_result   result;
+   u8  retries;
 };
 
 static inline struct nvme_request *nvme_req(struct request *req)
-- 
2.11.0

[PATCH 3/5] nvme: mark nvme_max_retries static

2017-04-05 Thread Christoph Hellwig

Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---
 drivers/nvme/host/core.c | 3 +--
 drivers/nvme/host/nvme.h | 2 --
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b225aacf4b89..933e67c60e33 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -49,10 +49,9 @@ unsigned char shutdown_timeout = 5;
 module_param(shutdown_timeout, byte, 0644);
 MODULE_PARM_DESC(shutdown_timeout, "timeout in seconds for controller 
shutdown");
 
-unsigned int nvme_max_retries = 5;
+static unsigned int nvme_max_retries = 5;
 module_param_named(max_retries, nvme_max_retries, uint, 0644);
 MODULE_PARM_DESC(max_retries, "max number of retries a command may have");
-EXPORT_SYMBOL_GPL(nvme_max_retries);
 
 static int nvme_char_major;
 module_param(nvme_char_major, int, 0);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 227f281482db..82ba9a305301 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -43,8 +43,6 @@ extern unsigned char shutdown_timeout;
 #define NVME_DEFAULT_KATO  5
 #define NVME_KATO_GRACE10
 
-extern unsigned int nvme_max_retries;
-
 enum {
NVME_NS_LBA = 0,
NVME_NS_LIGHTNVM= 1,
-- 
2.11.0

[PATCH 5/5] block, scsi: move the retries field to struct scsi_request

2017-04-05 Thread Christoph Hellwig

Instead of bloating the generic struct request with it.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---
 block/scsi_ioctl.c | 8 
 drivers/scsi/osd/osd_initiator.c   | 2 +-
 drivers/scsi/osst.c| 2 +-
 drivers/scsi/scsi_error.c  | 2 +-
 drivers/scsi/scsi_lib.c| 4 ++--
 drivers/scsi/sg.c  | 2 +-
 drivers/scsi/st.c  | 2 +-
 drivers/target/target_core_pscsi.c | 2 +-
 include/linux/blkdev.h | 1 -
 include/scsi/scsi_request.h| 1 +
 10 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 2a2fc768b27a..82a43bb19967 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -362,7 +362,7 @@ static int sg_io(struct request_queue *q, struct gendisk 
*bd_disk,
goto out_free_cdb;
 
bio = rq->bio;
-   rq->retries = 0;
+   req->retries = 0;
 
start_time = jiffies;
 
@@ -476,13 +476,13 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk 
*disk, fmode_t mode,
goto error;
 
/* default.  possible overriden later */
-   rq->retries = 5;
+   req->retries = 5;
 
switch (opcode) {
case SEND_DIAGNOSTIC:
case FORMAT_UNIT:
rq->timeout = FORMAT_UNIT_TIMEOUT;
-   rq->retries = 1;
+   req->retries = 1;
break;
case START_STOP:
rq->timeout = START_STOP_TIMEOUT;
@@ -495,7 +495,7 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk 
*disk, fmode_t mode,
break;
case READ_DEFECT_DATA:
rq->timeout = READ_DEFECT_DATA_TIMEOUT;
-   rq->retries = 1;
+   req->retries = 1;
break;
default:
rq->timeout = BLK_DEFAULT_SG_TIMEOUT;
diff --git a/drivers/scsi/osd/osd_initiator.c b/drivers/scsi/osd/osd_initiator.c
index 6903f03c88af..9d0727b2bdec 100644
--- a/drivers/scsi/osd/osd_initiator.c
+++ b/drivers/scsi/osd/osd_initiator.c
@@ -1602,7 +1602,7 @@ static int _init_blk_request(struct osd_request *or,
req->rq_flags |= RQF_QUIET;
 
req->timeout = or->timeout;
-   req->retries = or->retries;
+   scsi_req(req)->retries = or->retries;
 
if (has_out) {
or->out.req = req;
diff --git a/drivers/scsi/osst.c b/drivers/scsi/osst.c
index c47f4b349bac..41bc1d64bf86 100644
--- a/drivers/scsi/osst.c
+++ b/drivers/scsi/osst.c
@@ -414,7 +414,7 @@ static int osst_execute(struct osst_request *SRpnt, const 
unsigned char *cmd,
memset(rq->cmd, 0, BLK_MAX_CDB); /* ATAPI hates garbage after CDB */
memcpy(rq->cmd, cmd, rq->cmd_len);
req->timeout = timeout;
-   req->retries = retries;
+   rq->retries = retries;
req->end_io_data = SRpnt;
 
blk_execute_rq_nowait(req->q, NULL, req, 1, osst_end_async);
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index f2cafae150bc..2db412dd4b44 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1988,7 +1988,7 @@ static void scsi_eh_lock_door(struct scsi_device *sdev)
 
req->rq_flags |= RQF_QUIET;
req->timeout = 10 * HZ;
-   req->retries = 5;
+   rq->retries = 5;
 
blk_execute_rq_nowait(req->q, NULL, req, 1, eh_lock_door_done);
 }
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c1519660824b..11972d1075f1 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -256,7 +256,7 @@ int scsi_execute(struct scsi_device *sdev, const unsigned 
char *cmd,
 
rq->cmd_len = COMMAND_SIZE(cmd[0]);
memcpy(rq->cmd, cmd, rq->cmd_len);
-   req->retries = retries;
+   rq->retries = retries;
req->timeout = timeout;
req->cmd_flags |= flags;
req->rq_flags |= rq_flags | RQF_QUIET | RQF_PREEMPT;
@@ -1177,7 +1177,7 @@ static int scsi_setup_scsi_cmnd(struct scsi_device *sdev, 
struct request *req)
cmd->cmd_len = scsi_req(req)->cmd_len;
cmd->cmnd = scsi_req(req)->cmd;
cmd->transfersize = blk_rq_bytes(req);
-   cmd->allowed = req->retries;
+   cmd->allowed = scsi_req(req)->retries;
return BLKPREP_OK;
 }
 
diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index 225abaad4d1c..c5aaceea8d77 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -1718,7 +1718,7 @@ sg_start_req(Sg_request *srp, unsigned char *cmd)
 
srp->rq = rq;
rq->end_io_data = srp;
-   rq->retries = SG_DEFAULT_RETRIES;
+   req->retries = SG_DEFAULT_RETRIES;
 
if ((dxfer_len <= 0) || (dxfer_dir == SG_DXFER_NONE))
return 0;
diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
index e5ef78a6848e..5408643431bb 100644
--- a/drivers/scsi/st.c
+++ b/drivers/scsi/st.c
@@ -579,7 +579,7 @@ static int st_scsi_execute(struct st_request *SRpnt, const 
unsigned char *cmd,

[PATCH 1/5] nvme: move ->retries setup to nvme_setup_cmd

2017-04-05 Thread Christoph Hellwig

->retries is counting the number of times a command is resubmitted, and
be cleared on the first time we see the command.  We currently don't do
that for non-PCIe command, which is easily fixed by moving the setup
to common code.

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/core.c | 5 +
 drivers/nvme/host/pci.c  | 4 
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3c908e1bc903..0437f44d00f9 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -350,6 +350,11 @@ int nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
 {
int ret = BLK_MQ_RQ_QUEUE_OK;
 
+   if (!(req->rq_flags & RQF_DONTPREP)) {
+   req->retries = 0;
+   req->rq_flags |= RQF_DONTPREP;
+   }
+
switch (req_op(req)) {
case REQ_OP_DRV_IN:
case REQ_OP_DRV_OUT:
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 9e686a67d93b..3818ab15a26e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -326,10 +326,6 @@ static int nvme_init_iod(struct request *rq, struct 
nvme_dev *dev)
iod->nents = 0;
iod->length = size;
 
-   if (!(rq->rq_flags & RQF_DONTPREP)) {
-   rq->retries = 0;
-   rq->rq_flags |= RQF_DONTPREP;
-   }
return BLK_MQ_RQ_QUEUE_OK;
 }
 
-- 
2.11.0

->retries fixups V2

2017-04-05 Thread Christoph Hellwig

This series fixes a few lose bits in terms of how nvme uses ->retries,
including fixing it for non-PCIe transports.  While at it I noticed that
nvme and scsi use the field in entirely different ways, and no other
driver uses it at all.  So I decided to move it into the nvme_request and
scsi_request structures instead.

Changes since V1:
 - better changelog for one patch
 - move the new retries field to the end of struct nvme_request

[PATCH 27/38] Annotate hardware config module parameters in drivers/scsi/

2017-04-05 Thread David Howells

When the kernel is running in secure boot mode, we lock down the kernel to
prevent userspace from modifying the running kernel image.  Whilst this
includes prohibiting access to things like /dev/mem, it must also prevent
access by means of configuring driver modules in such a way as to cause a
device to access or modify the kernel image.

To this end, annotate module_param* statements that refer to hardware
configuration and indicate for future reference what type of parameter they
specify.  The parameter parser in the core sees this information and can
skip such parameters with an error message if the kernel is locked down.
The module initialisation then runs as normal, but just sees whatever the
default values for those parameters is.

Note that we do still need to do the module initialisation because some
drivers have viable defaults set in case parameters aren't specified and
some drivers support automatic configuration (e.g. PNP or PCI) in addition
to manually coded parameters.

This patch annotates drivers in drivers/scsi/.

Suggested-by: Alan Cox 
Signed-off-by: David Howells 
cc: "Juergen E. Fischer" 
cc: "James E.J. Bottomley" 
cc: "Martin K. Petersen" 
cc: Dario Ballabio 
cc: Finn Thain 
cc: Michael Schmitz 
cc: Achim Leubner 
cc: linux-scsi@vger.kernel.org
---

 drivers/scsi/aha152x.c   |4 ++--
 drivers/scsi/aha1542.c   |2 +-
 drivers/scsi/g_NCR5380.c |8 
 drivers/scsi/gdth.c  |2 +-
 drivers/scsi/qlogicfas.c |4 ++--
 5 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/aha152x.c b/drivers/scsi/aha152x.c
index f44d0487236e..ce5dc73d85bb 100644
--- a/drivers/scsi/aha152x.c
+++ b/drivers/scsi/aha152x.c
@@ -331,11 +331,11 @@ MODULE_LICENSE("GPL");
 #if !defined(PCMCIA)
 #if defined(MODULE)
 static int io[] = {0, 0};
-module_param_array(io, int, NULL, 0);
+module_param_hw_array(io, int, ioport, NULL, 0);
 MODULE_PARM_DESC(io,"base io address of controller");
 
 static int irq[] = {0, 0};
-module_param_array(irq, int, NULL, 0);
+module_param_hw_array(irq, int, irq, NULL, 0);
 MODULE_PARM_DESC(irq,"interrupt for controller");
 
 static int scsiid[] = {7, 7};
diff --git a/drivers/scsi/aha1542.c b/drivers/scsi/aha1542.c
index 7db448ec8beb..a23cc9ac5acd 100644
--- a/drivers/scsi/aha1542.c
+++ b/drivers/scsi/aha1542.c
@@ -31,7 +31,7 @@ module_param(isapnp, bool, 0);
 MODULE_PARM_DESC(isapnp, "enable PnP support (default=1)");
 
 static int io[MAXBOARDS] = { 0x330, 0x334, 0, 0 };
-module_param_array(io, int, NULL, 0);
+module_param_hw_array(io, int, ioport, NULL, 0);
 MODULE_PARM_DESC(io, "base IO address of controller 
(0x130,0x134,0x230,0x234,0x330,0x334, default=0x330,0x334)");
 
 /* time AHA spends on the AT-bus during data transfer */
diff --git a/drivers/scsi/g_NCR5380.c b/drivers/scsi/g_NCR5380.c
index 67c8dac321ad..c34fc91ba486 100644
--- a/drivers/scsi/g_NCR5380.c
+++ b/drivers/scsi/g_NCR5380.c
@@ -85,8 +85,8 @@ static int ncr_53c400;
 static int ncr_53c400a;
 static int dtc_3181e;
 static int hp_c2502;
-module_param(ncr_irq, int, 0);
-module_param(ncr_addr, int, 0);
+module_param_hw(ncr_irq, int, irq, 0);
+module_param_hw(ncr_addr, int, ioport, 0);
 module_param(ncr_5380, int, 0);
 module_param(ncr_53c400, int, 0);
 module_param(ncr_53c400a, int, 0);
@@ -94,11 +94,11 @@ module_param(dtc_3181e, int, 0);
 module_param(hp_c2502, int, 0);
 
 static int irq[] = { -1, -1, -1, -1, -1, -1, -1, -1 };
-module_param_array(irq, int, NULL, 0);
+module_param_hw_array(irq, int, irq, NULL, 0);
 MODULE_PARM_DESC(irq, "IRQ number(s) (0=none, 254=auto [default])");
 
 static int base[] = { 0, 0, 0, 0, 0, 0, 0, 0 };
-module_param_array(base, int, NULL, 0);
+module_param_hw_array(base, int, ioport, NULL, 0);
 MODULE_PARM_DESC(base, "base address(es)");
 
 static int card[] = { -1, -1, -1, -1, -1, -1, -1, -1 };
diff --git a/drivers/scsi/gdth.c b/drivers/scsi/gdth.c
index d020a13646ae..facc7271f932 100644
--- a/drivers/scsi/gdth.c
+++ b/drivers/scsi/gdth.c
@@ -353,7 +353,7 @@ static int probe_eisa_isa = 0;
 static int force_dma32 = 0;
 
 /* parameters for modprobe/insmod */
-module_param_array(irq, int, NULL, 0);
+module_param_hw_array(irq, int, irq, NULL, 0);
 module_param(disable, int, 0);
 module_param(reserve_mode, int, 0);
 module_param_array(reserve_list, int, NULL, 0);
diff --git a/drivers/scsi/qlogicfas.c b/drivers/scsi/qlogicfas.c
index 61cac87fb86f..840823b99e51 100644
--- a/drivers/scsi/qlogicfas.c
+++ b/drivers/scsi/qlogicfas.c
@@ -137,8 +137,8 @@ static struct Scsi_Host *__qlogicfas_detect(struct 
scsi_host_template *host,
 static struct qlogicfas408_priv *cards;
 static int iobase[MAX_QLOGICFAS];
 static int irq[MAX_QLOGICFAS] = { [0 ... MAX_QLOGICFAS-1] = -1 };
-module_param_array(iobase, int, NULL, 0);
-module_param_array(irq,

[PATCH] Make checking the scsi_device_get() return value mandatory

2017-04-05 Thread Bart Van Assche

Now that all scsi_device_get() callers check the return value of this
function, make checking that return value mandatory.

Signed-off-by: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 include/scsi/scsi_device.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index cdff28519393..7a154a944c6c 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -316,7 +316,7 @@ extern int scsi_unregister_device_handler(struct 
scsi_device_handler *scsi_dh);
 void scsi_attach_vpd(struct scsi_device *sdev);
 
 extern struct scsi_device *scsi_device_from_queue(struct request_queue *q);
-extern int scsi_device_get(struct scsi_device *);
+extern int __must_check scsi_device_get(struct scsi_device *);
 extern void scsi_device_put(struct scsi_device *);
 extern struct scsi_device *scsi_device_lookup(struct Scsi_Host *,
  uint, uint, u64);
-- 
2.12.0

Re: [PATCH] osd_uld: Check scsi_device_get() return value

2017-04-05 Thread Boaz Harrosh

On 03/30/2017 08:17 PM, Bart Van Assche wrote:
> scsi_device_get() can fail. Hence check its return value.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Boaz Harrosh 

Cool thanks
ACK-by: Boaz Harrosh 

> ---
>  drivers/scsi/osd/osd_uld.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/scsi/osd/osd_uld.c b/drivers/scsi/osd/osd_uld.c
> index e0ce5d2fd14d..1dea4244dd0c 100644
> --- a/drivers/scsi/osd/osd_uld.c
> +++ b/drivers/scsi/osd/osd_uld.c
> @@ -464,14 +464,15 @@ static int osd_probe(struct device *dev)
>   /* hold one more reference to the scsi_device that will get released
>* in __release, in case a logout is happening while fs is mounted
>*/
> - scsi_device_get(scsi_device);
> + if (scsi_device_get(scsi_device))
> + goto err_put_disk;
>   osd_dev_init(>od, scsi_device);
>  
>   /* Detect the OSD Version */
>   error = __detect_osd(oud);
>   if (error) {
>   OSD_ERR("osd detection failed, non-compatible OSD device\n");
> - goto err_put_disk;
> + goto err_put_sdev;
>   }
>  
>   /* init the char-device for communication with user-mode */
> @@ -508,8 +509,9 @@ static int osd_probe(struct device *dev)
>  
>  err_put_cdev:
>   cdev_del(>cdev);
> -err_put_disk:
> +err_put_sdev:
>   scsi_device_put(scsi_device);
> +err_put_disk:
>   put_disk(disk);
>  err_free_osd:
>   dev_set_drvdata(dev, NULL);
>

Re: [PATCH v2] scsi: ses: don't get power status of SES device slot on probe

2017-04-05 Thread Song Liu


> On Apr 5, 2017, at 8:18 AM, Mauricio Faria de Oliveira 
>  wrote:
> 
> The commit 08024885a2a3 ("ses: Add power_status to SES device slot")
> introduced the 'power_status' attribute to enclosure components and
> the associated callbacks.
> 
> There are 2 callbacks available to get the power status of a device:
> 1) ses_get_power_status() for 'struct enclosure_component_callbacks'
> 2) get_component_power_status() for the sysfs device attribute
> (these are available for kernel-space and user-space, respectively.)
> 
> However, despite both methods being available to get power status
> on demand, that commit also introduced a call to get power status
> in ses_enclosure_data_process().
> 
> This dramatically increased the total probe time for SCSI devices
> on larger configurations, because ses_enclosure_data_process() is
> called several times during the SCSI devices probe and loops over
> the component devices (but that is another problem, another patch).
> 
> That results in a tremendous continuous hammering of SCSI Receive
> Diagnostics commands to the enclosure-services device, which does
> delay the total probe time for the SCSI devices __significantly__:
> 
>  Originally, ~34 minutes on a system attached to ~170 disks:
> 
>[ 9214.490703] mpt3sas version 13.100.00.00 loaded
>...
>[11256.580231] scsi 17:0:177:0: qdepth(16), tagged(1), simple(0),
>   ordered(0), scsi_level(6), cmd_que(1)
> 
>  With this patch, it decreased to ~2.5 minutes -- a 13.6x faster
> 
>[ 1002.992533] mpt3sas version 13.100.00.00 loaded
>...
>[ 1151.978831] scsi 11:0:177:0: qdepth(16), tagged(1), simple(0),
>   ordered(0), scsi_level(6), cmd_que(1)
> 
> Back to the commit discussion.. on the ses_get_power_status() call
> introduced in ses_enclosure_data_process(): impact of removing it.
> 
> That may possibly be in place to initialize the power status value
> on device probe.  However, those 2 functions available to retrieve
> that value _do_ automatically refresh/update it.  So the potential
> benefit would be a direct access of the 'power_status' field which
> does not use the callbacks...
> 
> But the only reader of 'struct enclosure_component::power_status'
> is the get_component_power_status() callback for sysfs attribute,
> and it _does_ check for and call the .get_power_status callback,
> (which indeed is defined and implemented by that commit), so the
> power status value is, again, automatically updated.
> 
> So, the remaining potential for a direct/non-callback access to
> the power_status attribute would be out-of-tree modules -- well,
> for those, if they are for whatever reason interested in values
> that are set during device probe and not up-to-date by the time
> they need it.. well, that would be curious.
> 
> Well, to handle that more properly, set the initial power state
> value to '-1' (i.e., uninitialized) instead of '1' (power 'on'),
> and check for it in that callback which may do an direct access
> to the field value _if_ a callback function is not defined.
> 
> Signed-off-by: Mauricio Faria de Oliveira 
> Fixes: 08024885a2a3 ("ses: Add power_status to SES device slot")
> ---
> 
> v2:
> - remove module parameter.
> - better return values for -1/uninitalized power_status.
>   (thanks: Dan Williams )
> 
> drivers/misc/enclosure.c | 7 ++-
> drivers/scsi/ses.c   | 1 -
> 2 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
> index 65fed7146e9b..d3fe3ea902d4 100644
> --- a/drivers/misc/enclosure.c
> +++ b/drivers/misc/enclosure.c
> @@ -148,7 +148,7 @@ struct enclosure_device *
>   for (i = 0; i < components; i++) {
>   edev->component[i].number = -1;
>   edev->component[i].slot = -1;
> - edev->component[i].power_status = 1;
> + edev->component[i].power_status = -1;
>   }
> 
>   mutex_lock(_list_lock);
> @@ -594,6 +594,11 @@ static ssize_t get_component_power_status(struct device 
> *cdev,
> 
>   if (edev->cb->get_power_status)
>   edev->cb->get_power_status(edev, ecomp);
> +
> + /* If still uninitialized, the callback failed or does not exist. */
> + if (ecomp->power_status == -1)
> + return (edev->cb->get_power_status) ? -EIO : -ENOTTY;
> +
>   return snprintf(buf, 40, "%s\n", ecomp->power_status ? "on" : "off");
> }
> 
> diff --git a/drivers/scsi/ses.c b/drivers/scsi/ses.c
> index 50adabbb5808..f1cdf32d7514 100644
> --- a/drivers/scsi/ses.c
> +++ b/drivers/scsi/ses.c
> @@ -548,7 +548,6 @@ static void ses_enclosure_data_process(struct 
> enclosure_device *edev,
>   ecomp = >component[components++];
> 
>   if (!IS_ERR(ecomp)) {
> - ses_get_power_status(edev, ecomp);
>

Re: problem with discard granularity in sd

2017-04-05 Thread David Buckley

Hi Martin,

I really appreciate the response. Here is the VPD page data you asked for:

Logical block provisioning VPD page (SBC):
  Unmap command supported (LBPU): 1
  Write same (16) with unmap bit supported (LBWS): 1
  Write same (10) with unmap bit supported (LBWS10): 1
  Logical block provisioning read zeros (LBPRZ): 1
  Anchored LBAs supported (ANC_SUP): 0
  Threshold exponent: 0
  Descriptor present (DP): 0
  Provisioning type: 2

Block limits VPD page (SBC):
  Write same no zero (WSNZ): 1
  Maximum compare and write length: 1 blocks
  Optimal transfer length granularity: 8 blocks
  Maximum transfer length: 8388607 blocks
  Optimal transfer length: 128 blocks
  Maximum prefetch length: 0 blocks
  Maximum unmap LBA count: 8192
  Maximum unmap block descriptor count: 64
  Optimal unmap granularity: 8
  Unmap granularity alignment valid: 1
  Unmap granularity alignment: 0
  Maximum write same length: 0x4000 blocks

As I mentioned previously, I'm fairly certain that the issue I'm
seeing is due to the fact that while NetApp LUNs are presented as 512B
logical/4K physical disks for compatibility, they actually don't
support requests smaller than 4K (which makes sense as NetApp LUNs are
actually just files allocated on the 4K-block WAFL filesystem).

If the optimal granularity values from VPD are respected (as is the
case with 3.10 kernels), the result is:

minimum_io_size = 512 *8 = 4096  (logical_block_size * VPD optimal
transfer length granularity)
discard_granularity = 512 * 8 = 4096  (logical_block_size * VPD
optimal unmap granularity)

With recent kernels the value of minimum_io_size is unchanged, but
discard_granularity is explicitly set to logical_block_size which
results in unmap requests either being dropped or ignored.

I understand that the VPD values from LUNs may be atypical compared to
physical disks, and I expect there are some major differences in how
unmap (and zeroing) is handled between physical disks and LUNs.  But
it is unfortunate that a solution for misaligned I/O would break
fully-aligned requests which worked flawlessly in previous kernels.

Let me know if there's any additional information I can provide. This
has resulted in a 2-3x increase in raw disk requirements for some
workloads (unfortunately on SSD too), and I'd love to find a solution
that doesn't require rolling back to a 3.10 kernel.

Thanks again!
-David

On Tue, Apr 4, 2017 at 5:12 PM, Martin K. Petersen
 wrote:
> David Buckley  writes:
>
> David,
>
>> They result in discard granularity being forced to logical block size
>> if the disk reports LBPRZ is enabled (which the netapp luns do).
>
> Block zeroing and unmapping are currently sharing some plumbing and that
> has lead to some compromises. In this case a bias towards ensuring data
> integrity for zeroing at the expense of not aligning unmap requests.
>
> Christoph has worked on separating those two functions. His code is
> currently under review.
>
>> I'm not sure of the implications of either the netapp changes, though
>> reporting 4k logical blocks seems potential as this is supported in
>> newer OS at least.
>
> Yes, but it may break legacy applications that assume a 512-byte logical
> block size.
>
>> The sd change potentially would at least partially undo the patches
>> referenced above.  But it would seem that (assuming an aligned
>> filesystem with 4k blocks and minimum_io_size=4096) there is no
>> possibility of a partial block discard or advantage to sending the
>> discard requests in 512 blocks?
>
> The unmap granularity inside a device is often much, much bigger than
> 4K. So aligning to that probably won't make a difference. And it's
> imperative to filesystems that zeroing works at logical block size
> granularity.
>
> The expected behavior for a device is that it unmaps whichever full
> unmap granularity chunks are described by a received request. And then
> explicitly zeroes any partial chunks at the head and tail. So I am
> surprised you see no reclamation whatsoever.
>
> With the impending zero/unmap separation things might fare better. But
> I'd still like to understand the behavior you observe. Please provide
> the output of:
>
> sg_vpd -p lbpv /dev/sdN
> sg_vpd -p bl /dev/sdN
>
> for one of the LUNs and I'll take a look.
>
> Thanks!
>
> --
> Martin K. Petersen  Oracle Linux Engineering

RE: [RFC 4/8] p2pmem: Add debugfs "stats" file

2017-04-05 Thread Steve Wise

> 
> > +   p2pmem_debugfs_root = debugfs_create_dir("p2pmem", NULL);
> > +   if (!p2pmem_debugfs_root)
> > +   pr_info("could not create debugfs entry, continuing\n");
> > +
> 
> Why continue? I think it'd be better to just fail it.
> 

Because not having debugfs support isn't fatal to using p2pmem.  So I
believe it is better to continue.  But this is trivial, IMO, so either was
is ok with me.

> Besides, this can be safely squashed into patch 1.

Yes.

RE: [RFC 2/8] cxgb4: setup pcie memory window 4 and create p2pmem region

2017-04-05 Thread Steve Wise

> 
> 
> > +static void setup_memwin_p2pmem(struct adapter *adap)
> > +{
> > +   unsigned int mem_base = t4_read_reg(adap,
> CIM_EXTMEM2_BASE_ADDR_A);
> > +   unsigned int mem_size = t4_read_reg(adap,
> CIM_EXTMEM2_ADDR_SIZE_A);
> > +
> > +   if (!use_p2pmem)
> > +   return;
> 
> This is weird, why even call this if !use_p2pmem?
> 

The use_p2pmem was added after the original change.  I'll update as you
suggest.

> > +static int init_p2pmem(struct adapter *adapter)
> > +{
> > +   unsigned int mem_size = t4_read_reg(adapter,
> CIM_EXTMEM2_ADDR_SIZE_A);
> > +   struct p2pmem_dev *p;
> > +   int rc;
> > +   struct resource res;
> > +
> > +   if (!mem_size || !use_p2pmem)
> > +   return 0;
> 
> Again, weird...

Yup.

Re: [PATCH] csiostor: switch to pci_alloc_irq_vectors

2017-04-05 Thread Varun Prakash

On Wed, Apr 05, 2017 at 08:26:57AM +0200, Christoph Hellwig wrote:
> On Tue, Apr 04, 2017 at 01:49:44PM +0530, Varun Prakash wrote:
> > On Tue, Apr 04, 2017 at 08:46:14AM +0200, Christoph Hellwig wrote:
> > > Does this one work better?
> > >
> > 
> > csiostor driver is triggering following warning during module unload.
> 
> Looks like we need to explicitly ignore the CSIO_IM_NONE case in
> csio_free_irqs.  New patch below:
>

Yes, we have to ignore CSIO_IM_NONE case in csio_free_irqs(), but I don't
see any CSIO_IM_NONE specific code in csio_free_irqs().

There is one more issue - this patch changes the order in which
csio_hw_intr_disable() and free_irq() are called.
In the current code first csio_hw_intr_disable() is called then free_irq()
is called, with this patch free_irq() is called without disabling hw
interrupts, this can cause regressions.

> ---
> From e5a4178cb810be581b6d9b8f48f13b12e88eae74 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig 
> Date: Thu, 12 Jan 2017 11:17:29 +0100
> Subject: csiostor: switch to pci_alloc_irq_vectors
> 
> And get automatic MSI-X affinity for free.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/scsi/csiostor/csio_hw.h   |   4 +-
>  drivers/scsi/csiostor/csio_init.c |   9 +--
>  drivers/scsi/csiostor/csio_isr.c  | 127 
> +-
>  3 files changed, 51 insertions(+), 89 deletions(-)
> 
> diff --git a/drivers/scsi/csiostor/csio_hw.h b/drivers/scsi/csiostor/csio_hw.h
> index 029bef82c057..a084d83ce70f 100644
> --- a/drivers/scsi/csiostor/csio_hw.h
> +++ b/drivers/scsi/csiostor/csio_hw.h
> @@ -95,7 +95,6 @@ enum {
>  };
>  
>  struct csio_msix_entries {
> - unsigned short  vector; /* Assigned MSI-X vector */
>   void*dev_id;/* Priv object associated w/ this msix*/
>   chardesc[24];   /* Description of this vector */
>  };
> @@ -591,8 +590,9 @@ int csio_enqueue_evt(struct csio_hw *, enum csio_evt, 
> void *, uint16_t);
>  void csio_evtq_flush(struct csio_hw *hw);
>  
>  int csio_request_irqs(struct csio_hw *);
> +void csio_free_irqs(struct csio_hw *);
>  void csio_intr_enable(struct csio_hw *);
> -void csio_intr_disable(struct csio_hw *, bool);
> +void csio_intr_disable(struct csio_hw *);
>  void csio_hw_fatal_err(struct csio_hw *);
>  
>  struct csio_lnode *csio_lnode_alloc(struct csio_hw *);
> diff --git a/drivers/scsi/csiostor/csio_init.c 
> b/drivers/scsi/csiostor/csio_init.c
> index dbe416ff46c2..7e60699c8b55 100644
> --- a/drivers/scsi/csiostor/csio_init.c
> +++ b/drivers/scsi/csiostor/csio_init.c
> @@ -461,8 +461,7 @@ csio_config_queues(struct csio_hw *hw)
>   return 0;
>  
>  intr_disable:
> - csio_intr_disable(hw, false);
> -
> + csio_intr_disable(hw);
>   return -EINVAL;
>  }
>  
> @@ -575,7 +574,8 @@ static struct csio_hw *csio_hw_alloc(struct pci_dev *pdev)
>  static void
>  csio_hw_free(struct csio_hw *hw)
>  {
> - csio_intr_disable(hw, true);
> + csio_free_irqs(hw);
> + csio_intr_disable(hw);

This changes the order, free_irq() is called without disabling hw interrupts. 

>   csio_hw_exit_workers(hw);
>   csio_hw_exit(hw);
>   iounmap(hw->regstart);
> @@ -1068,7 +1068,8 @@ csio_pci_error_detected(struct pci_dev *pdev, 
> pci_channel_state_t state)
>   spin_unlock_irq(>lock);
>   csio_lnodes_unblock_request(hw);
>   csio_lnodes_exit(hw, 0);
> - csio_intr_disable(hw, true);
> + csio_free_irqs(hw);
> + csio_intr_disable(hw);

Same here.

>   pci_disable_device(pdev);
>   return state == pci_channel_io_perm_failure ?
>   PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
> diff --git a/drivers/scsi/csiostor/csio_isr.c 
> b/drivers/scsi/csiostor/csio_isr.c
> index 2fb71c6c3b37..a4ad847e9c53 100644
> --- a/drivers/scsi/csiostor/csio_isr.c
> +++ b/drivers/scsi/csiostor/csio_isr.c
> @@ -383,17 +383,15 @@ csio_request_irqs(struct csio_hw *hw)
>   int rv, i, j, k = 0;
>   struct csio_msix_entries *entryp = >msix_entries[0];
>   struct csio_scsi_cpu_info *info;
> + struct pci_dev *pdev = hw->pdev;
>  
>   if (hw->intr_mode != CSIO_IM_MSIX) {
> - rv = request_irq(hw->pdev->irq, csio_fcoe_isr,
> - (hw->intr_mode == CSIO_IM_MSI) ?
> - 0 : IRQF_SHARED,
> - KBUILD_MODNAME, hw);
> + rv = request_irq(pci_irq_vector(pdev, 0), csio_fcoe_isr,
> + hw->intr_mode == CSIO_IM_MSI ? 0 : IRQF_SHARED,
> + KBUILD_MODNAME, hw);
>   if (rv) {
> - if (hw->intr_mode == CSIO_IM_MSI)
> - pci_disable_msi(hw->pdev);
>   csio_err(hw, "Failed to allocate interrupt line.\n");
> - return -EINVAL;
> + goto out_free_irqs;
>   }
>  
>

Re: [PATCH v2] scsi: ses: don't get power status of SES device slot on probe

2017-04-05 Thread Dan Williams

On Wed, Apr 5, 2017 at 8:18 AM, Mauricio Faria de Oliveira
 wrote:
> The commit 08024885a2a3 ("ses: Add power_status to SES device slot")
> introduced the 'power_status' attribute to enclosure components and
> the associated callbacks.
>
> There are 2 callbacks available to get the power status of a device:
> 1) ses_get_power_status() for 'struct enclosure_component_callbacks'
> 2) get_component_power_status() for the sysfs device attribute
> (these are available for kernel-space and user-space, respectively.)
>
> However, despite both methods being available to get power status
> on demand, that commit also introduced a call to get power status
> in ses_enclosure_data_process().
>
> This dramatically increased the total probe time for SCSI devices
> on larger configurations, because ses_enclosure_data_process() is
> called several times during the SCSI devices probe and loops over
> the component devices (but that is another problem, another patch).
>
> That results in a tremendous continuous hammering of SCSI Receive
> Diagnostics commands to the enclosure-services device, which does
> delay the total probe time for the SCSI devices __significantly__:
>
>   Originally, ~34 minutes on a system attached to ~170 disks:
>
> [ 9214.490703] mpt3sas version 13.100.00.00 loaded
> ...
> [11256.580231] scsi 17:0:177:0: qdepth(16), tagged(1), simple(0),
>ordered(0), scsi_level(6), cmd_que(1)
>
>   With this patch, it decreased to ~2.5 minutes -- a 13.6x faster
>
> [ 1002.992533] mpt3sas version 13.100.00.00 loaded
> ...
> [ 1151.978831] scsi 11:0:177:0: qdepth(16), tagged(1), simple(0),
>ordered(0), scsi_level(6), cmd_que(1)
>
> Back to the commit discussion.. on the ses_get_power_status() call
> introduced in ses_enclosure_data_process(): impact of removing it.
>
> That may possibly be in place to initialize the power status value
> on device probe.  However, those 2 functions available to retrieve
> that value _do_ automatically refresh/update it.  So the potential
> benefit would be a direct access of the 'power_status' field which
> does not use the callbacks...
>
> But the only reader of 'struct enclosure_component::power_status'
> is the get_component_power_status() callback for sysfs attribute,
> and it _does_ check for and call the .get_power_status callback,
> (which indeed is defined and implemented by that commit), so the
> power status value is, again, automatically updated.
>
> So, the remaining potential for a direct/non-callback access to
> the power_status attribute would be out-of-tree modules -- well,
> for those, if they are for whatever reason interested in values
> that are set during device probe and not up-to-date by the time
> they need it.. well, that would be curious.
>
> Well, to handle that more properly, set the initial power state
> value to '-1' (i.e., uninitialized) instead of '1' (power 'on'),
> and check for it in that callback which may do an direct access
> to the field value _if_ a callback function is not defined.
>
> Signed-off-by: Mauricio Faria de Oliveira 
> Fixes: 08024885a2a3 ("ses: Add power_status to SES device slot")

Reviewed-by: Dan Williams

Re: [PATCH 4/5] nvme: move the retries count to struct nvme_request

2017-04-05 Thread Christoph Hellwig

On Wed, Apr 05, 2017 at 11:14:30AM -0400, Keith Busch wrote:
> On Wed, Apr 05, 2017 at 04:18:55PM +0200, Christoph Hellwig wrote:
> > The way NVMe uses this field is entirely different from the older
> > SCSI/BLOCK_PC usage, so move it into struct nvme_request.
> > 
> > Also reduce the size of the file to a unsigned char so that we leave space
> > for additional smaller fields that will appear soon.
> 
> What new fields can we expect? Why temporarily pad the middle of the
> struct instead of appending to the end? The "result" packed nicely
> on 64-bit, at least.

I've got another u8 and a u16 in pending patches.  That being said
moving it to the end might make sense.

[PATCH v2] scsi: ses: don't get power status of SES device slot on probe

2017-04-05 Thread Mauricio Faria de Oliveira

The commit 08024885a2a3 ("ses: Add power_status to SES device slot")
introduced the 'power_status' attribute to enclosure components and
the associated callbacks.

There are 2 callbacks available to get the power status of a device:
1) ses_get_power_status() for 'struct enclosure_component_callbacks'
2) get_component_power_status() for the sysfs device attribute
(these are available for kernel-space and user-space, respectively.)

However, despite both methods being available to get power status
on demand, that commit also introduced a call to get power status
in ses_enclosure_data_process().

This dramatically increased the total probe time for SCSI devices
on larger configurations, because ses_enclosure_data_process() is
called several times during the SCSI devices probe and loops over
the component devices (but that is another problem, another patch).

That results in a tremendous continuous hammering of SCSI Receive
Diagnostics commands to the enclosure-services device, which does
delay the total probe time for the SCSI devices __significantly__:

  Originally, ~34 minutes on a system attached to ~170 disks:

[ 9214.490703] mpt3sas version 13.100.00.00 loaded
...
[11256.580231] scsi 17:0:177:0: qdepth(16), tagged(1), simple(0),
   ordered(0), scsi_level(6), cmd_que(1)

  With this patch, it decreased to ~2.5 minutes -- a 13.6x faster

[ 1002.992533] mpt3sas version 13.100.00.00 loaded
...
[ 1151.978831] scsi 11:0:177:0: qdepth(16), tagged(1), simple(0),
   ordered(0), scsi_level(6), cmd_que(1)

Back to the commit discussion.. on the ses_get_power_status() call
introduced in ses_enclosure_data_process(): impact of removing it.

That may possibly be in place to initialize the power status value
on device probe.  However, those 2 functions available to retrieve
that value _do_ automatically refresh/update it.  So the potential
benefit would be a direct access of the 'power_status' field which
does not use the callbacks...

But the only reader of 'struct enclosure_component::power_status'
is the get_component_power_status() callback for sysfs attribute,
and it _does_ check for and call the .get_power_status callback,
(which indeed is defined and implemented by that commit), so the
power status value is, again, automatically updated.

So, the remaining potential for a direct/non-callback access to
the power_status attribute would be out-of-tree modules -- well,
for those, if they are for whatever reason interested in values
that are set during device probe and not up-to-date by the time
they need it.. well, that would be curious.

Well, to handle that more properly, set the initial power state
value to '-1' (i.e., uninitialized) instead of '1' (power 'on'),
and check for it in that callback which may do an direct access
to the field value _if_ a callback function is not defined.

Signed-off-by: Mauricio Faria de Oliveira 
Fixes: 08024885a2a3 ("ses: Add power_status to SES device slot")
---

v2:
 - remove module parameter.
 - better return values for -1/uninitalized power_status.
   (thanks: Dan Williams )

 drivers/misc/enclosure.c | 7 ++-
 drivers/scsi/ses.c   | 1 -
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index 65fed7146e9b..d3fe3ea902d4 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -148,7 +148,7 @@ struct enclosure_device *
for (i = 0; i < components; i++) {
edev->component[i].number = -1;
edev->component[i].slot = -1;
-   edev->component[i].power_status = 1;
+   edev->component[i].power_status = -1;
}
 
mutex_lock(_list_lock);
@@ -594,6 +594,11 @@ static ssize_t get_component_power_status(struct device 
*cdev,
 
if (edev->cb->get_power_status)
edev->cb->get_power_status(edev, ecomp);
+
+   /* If still uninitialized, the callback failed or does not exist. */
+   if (ecomp->power_status == -1)
+   return (edev->cb->get_power_status) ? -EIO : -ENOTTY;
+
return snprintf(buf, 40, "%s\n", ecomp->power_status ? "on" : "off");
 }
 
diff --git a/drivers/scsi/ses.c b/drivers/scsi/ses.c
index 50adabbb5808..f1cdf32d7514 100644
--- a/drivers/scsi/ses.c
+++ b/drivers/scsi/ses.c
@@ -548,7 +548,6 @@ static void ses_enclosure_data_process(struct 
enclosure_device *edev,
ecomp = >component[components++];
 
if (!IS_ERR(ecomp)) {
-   ses_get_power_status(edev, ecomp);
if (addl_desc_ptr)
ses_process_descriptor(
ecomp,
-- 
1.8.3.1

Re: [PATCHv5 8/8] scsi: inline command aborts

2017-04-05 Thread Christoph Hellwig

I think this also need to remove the abort_work field from
struct scsi_cmnd.

>  }
>  
>  #ifdef CONFIG_SCSI_LOGGING
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 53e3343..2355100 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -115,11 +115,9 @@ static int scsi_host_eh_past_deadline(struct Scsi_Host 
> *shost)
>   * scmd_eh_abort_handler - Handle command aborts
>   * @work:command to be aborted.
>   */
> -void
> -scmd_eh_abort_handler(struct work_struct *work)
> +int
> +scmd_eh_abort_handler(struct scsi_cmnd *scmd)
>  {
> - struct scsi_cmnd *scmd =
> - container_of(work, struct scsi_cmnd, abort_work.work);
>   struct scsi_device *sdev = scmd->device;
>   int rtn;
>  
> @@ -127,42 +125,40 @@ static int scsi_host_eh_past_deadline(struct Scsi_Host 
> *shost)
>   SCSI_LOG_ERROR_RECOVERY(3,
>   scmd_printk(KERN_INFO, scmd,
>   "eh timeout, not aborting\n"));
> - } else {
> - SCSI_LOG_ERROR_RECOVERY(3,
> - scmd_printk(KERN_INFO, scmd,
> - "aborting command\n"));
> - rtn = scsi_try_to_abort_cmd(sdev->host->hostt, scmd);
> - if (rtn == SUCCESS) {
> - set_host_byte(scmd, DID_TIME_OUT);
> - if (scsi_host_eh_past_deadline(sdev->host)) {
> - SCSI_LOG_ERROR_RECOVERY(3,
> - scmd_printk(KERN_INFO, scmd,
> - "eh timeout, not retrying "
> - "aborted command\n"));
> - } else if (!scsi_noretry_cmd(scmd) &&
> - (++scmd->retries <= scmd->allowed)) {
> - SCSI_LOG_ERROR_RECOVERY(3,
> - scmd_printk(KERN_WARNING, scmd,
> - "retry aborted command\n"));
> - scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY);
> - return;
> - } else {
> - SCSI_LOG_ERROR_RECOVERY(3,
> - scmd_printk(KERN_WARNING, scmd,
> - "finish aborted 
> command\n"));
> - scsi_finish_command(scmd);
> - return;
> - }
> - } else {
> + return FAILED;
> + }
> + SCSI_LOG_ERROR_RECOVERY(3,
> + scmd_printk(KERN_INFO, scmd,
> + "aborting command\n"));
> + rtn = scsi_try_to_abort_cmd(sdev->host->hostt, scmd);
> + if (rtn == SUCCESS) {

If you restructure the function anywat this could become an
early return / goto out.

Re: [PATCH 4/5] nvme: move the retries count to struct nvme_request

2017-04-05 Thread Keith Busch

On Wed, Apr 05, 2017 at 04:18:55PM +0200, Christoph Hellwig wrote:
> The way NVMe uses this field is entirely different from the older
> SCSI/BLOCK_PC usage, so move it into struct nvme_request.
> 
> Also reduce the size of the file to a unsigned char so that we leave space
> for additional smaller fields that will appear soon.

What new fields can we expect? Why temporarily pad the middle of the
struct instead of appending to the end? The "result" packed nicely
on 64-bit, at least.

>  struct nvme_request {
>   struct nvme_command *cmd;
> + u8  retries;
>   union nvme_result   result;
>  };

Re: [PATCHv5 7/8] scsi: make asynchronous aborts mandatory

2017-04-05 Thread Christoph Hellwig

Looks good,

Reviewed-by: Christoph Hellwig

Re: [PATCHv5 6/8] scsi: make scsi_eh_scmd_add() always succeed

2017-04-05 Thread Christoph Hellwig

Looks good,

Reviewed-by: Christoph Hellwig

Re: [PATCHv5 5/8] scsi: make eh_eflags persistent

2017-04-05 Thread Christoph Hellwig

Looks good,

Reviewed-by: Christoph Hellwig

Re: [PATCHv5 3/8] scsi: always send command aborts

2017-04-05 Thread Christoph Hellwig

Looks good,

Reviewed-by: Christoph Hellwig

Re: [PATCHv5 1/8] scsi_error: count medium access timeout only once per EH run

2017-04-05 Thread Christoph Hellwig

Looks good,

Reviewed-by: Christoph Hellwig

Re: [PATCHv5 2/8] sd: Return SUCCESS in sd_eh_action() after device offline

2017-04-05 Thread Christoph Hellwig

Looks good,

Reviewed-by: Christoph Hellwig

Re: [PATCH] scsi: ses: don't get power status of SES device slot on probe

2017-04-05 Thread Mauricio Faria de Oliveira


On 04/05/2017 11:41 AM, Dan Williams wrote:

On Wed, Apr 5, 2017 at 6:13 AM, Mauricio Faria de Oliveira
 wrote:



1) imagine .get_power_status couldn't update the 'power_status' field
   (it's a bit unlikely with the in-tree ses driver, but in the case
that ses_get_page2_descriptor() returns NULL, ses_get_power_status()
doesn't update 'power_status').

Ok, in that case we should probably return -EIO, if get_power_status
is non-NULL, but the power_status is still -1.


2) an out-of-tree module employs another enclosure_component_callbacks
   structure, which doesn't implement a .get_power_status hook.

If get_power_status is null and power_status is -1 I think we should
return -ENOTTY to indicate the failure.


3) or it does, but that failed, and didn't update 'power_status' (e.g.,
   like case 1)

I think we're back -EIO for this.


Absolutely. I see those are better values to convey the failure/reason.



However, for an in-tree usage, it's clear that just dropping that line
seemed to be sufficient -- the rest in the patch is just consideration
for whoever might be using it in out-of-tree ways.

(and if such consideration is deemed not to be required in this case,
 I'd be fine with dropping the related changes and making this patch
 a one-line. :- )

Let's not carry module parameters that only theoretically help an out
of tree module. If such a driver exists its owner can just holler if
this policy change breaks their usage, and then we can have a
discussion about why their driver isn't upstream already.


Okay, got it.
For the record, the patch author has been in To:/Cc:, for such cases.

I'll submit a v2.

Thanks,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center

Re: [PATCH 1/5] nvme: move ->retries setup to nvme_setup_cmd

2017-04-05 Thread Christoph Hellwig

On Wed, Apr 05, 2017 at 04:43:34PM +0200, Johannes Thumshirn wrote:
> On Wed, Apr 05, 2017 at 04:18:52PM +0200, Christoph Hellwig wrote:
> > This way we get the behavior right for the non-PCIe transports.
> 
> Could you please share a bit of your minds inner workings for us mere mortals?

It's initialized to zero the first time the command is submitted
and will then be incremented when queueing a retry.

Re: always use REQ_OP_WRITE_ZEROES for zeroing offload V2

2017-04-05 Thread Christoph Hellwig

Looks like the mail server crapped out under the load..

I'll resend tomorrow morning, in the meantime the git url below
works.

On Wed, Apr 05, 2017 at 04:21:38PM +0200, Christoph Hellwig wrote:
> This series makes REQ_OP_WRITE_ZEROES the only zeroing offload
> supported by the block layer, and switches existing implementations
> of REQ_OP_DISCARD that correctly set discard_zeroes_data to it,
> removes incorrect discard_zeroes_data, and also switches WRITE SAME
> based zeroing in SCSI to this new method.
> 
> The series is against the block for-next tree.
> 
> A git tree is also avaiable at:
> 
> git://git.infradead.org/users/hch/block.git discard-rework.2
> 
> Gitweb:
> 
> 
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/discard-rework.2
> 
> Changes since V2:
>  - various spelling fixes
>  - various reviews captured
>  - two new patches from Martin at the end
---end quoted text---

Re: [PATCH 5/5] block, scsi: move the retries field to struct scsi_request

2017-04-05 Thread Johannes Thumshirn

On Wed, Apr 05, 2017 at 04:18:56PM +0200, Christoph Hellwig wrote:
> Instead of bloating the generic struct request with it.
> 
> Signed-off-by: Christoph Hellwig 
> ---

Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Re: [PATCH 2/5] nvme: cleanup nvme_req_needs_retry

2017-04-05 Thread Johannes Thumshirn

On Wed, Apr 05, 2017 at 04:18:53PM +0200, Christoph Hellwig wrote:
> Don't pass the status explicitly but derive it from the requeust,
> and unwind the complex condition to be more readable.
> 
> Signed-off-by: Christoph Hellwig 
> ---

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Re: [PATCH 4/5] nvme: move the retries count to struct nvme_request

2017-04-05 Thread Johannes Thumshirn

On Wed, Apr 05, 2017 at 04:18:55PM +0200, Christoph Hellwig wrote:
> The way NVMe uses this field is entirely different from the older
> SCSI/BLOCK_PC usage, so move it into struct nvme_request.
> 
> Also reduce the size of the file to a unsigned char so that we leave space
> for additional smaller fields that will appear soon.
> 
> Signed-off-by: Christoph Hellwig 
> ---

Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Re: [PATCH 3/5] nvme: mark nvme_max_retries static

2017-04-05 Thread Johannes Thumshirn

On Wed, Apr 05, 2017 at 04:18:54PM +0200, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 
> ---

Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Re: [PATCH 1/5] nvme: move ->retries setup to nvme_setup_cmd

2017-04-05 Thread Johannes Thumshirn

On Wed, Apr 05, 2017 at 04:18:52PM +0200, Christoph Hellwig wrote:
> This way we get the behavior right for the non-PCIe transports.

Could you please share a bit of your minds inner workings for us mere mortals?

Thanks,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Re: [PATCH] scsi: ses: don't get power status of SES device slot on probe

2017-04-05 Thread Dan Williams

On Wed, Apr 5, 2017 at 6:13 AM, Mauricio Faria de Oliveira
 wrote:
> Hi Dan,
>
> Thanks for reviewing.
>
> On 04/04/2017 06:07 PM, Dan Williams wrote:
>>>
>>> @@ -594,6 +594,10 @@ static ssize_t get_component_power_status(struct
>>> device *cdev,
>>>
>>> if (edev->cb->get_power_status)
>>> edev->cb->get_power_status(edev, ecomp);
>>> +
>>> +   if (ecomp->power_status == -1)
>>> +   return -EINVAL;
>>> +
>>
>> Can we ever hit this if get_power_status is non-null?
>
>
> I admit that is cautionary, and more for consistency with the chance
> of an uninitialized state being still in place, so not to return any
> (incorrect) value in that state.
>
> But, yes, in a few cases -- although unlikely.
>
> 1) imagine .get_power_status couldn't update the 'power_status' field
>(it's a bit unlikely with the in-tree ses driver, but in the case
> that ses_get_page2_descriptor() returns NULL, ses_get_power_status()
> doesn't update 'power_status').

Ok, in that case we should probably return -EIO, if get_power_status
is non-NULL, but the power_status is still -1.

> 2) an out-of-tree module employs another enclosure_component_callbacks
>structure, which doesn't implement a .get_power_status hook.

If get_power_status is null and power_status is -1 I think we should
return -ENOTTY to indicate the failure.

> 3) or it does, but that failed, and didn't update 'power_status' (e.g.,
>like case 1)

I think we're back -EIO for this.

>
>
>>> +static bool power_status_on_probe; /* default: false, 0. */
>>> +module_param(power_status_on_probe, bool, 0644);
>>> +MODULE_PARM_DESC(power_status_on_probe, "get power status of SES device
>>> slots "
>>> +   "(enclosure components) at probe
>>> time "
>>> +   "(warning: may delay total probe
>>> time)");
>>> +
>>
>> I don't think we need this module parameter as long as the only way to
>> read the power status is through sysfs. We can always just leave it to
>> be read on-demand when needed.
>
>
> I agree for the in-tree module.  My only concern, as described in the
> commit message, is the case someone is using an out-of-tree module /
> has an implementation that depends on that (e.g., reading the 'power_
> status' field directly, instead of using .get_power_status.
>
> I've been a bit overzealous when considering why that call was inserted
> in ses_enclosure_data_process(), and didn't want to break them.
>
> However, for an in-tree usage, it's clear that just dropping that line
> seemed to be sufficient -- the rest in the patch is just consideration
> for whoever might be using it in out-of-tree ways.
>
> (and if such consideration is deemed not to be required in this case,
>  I'd be fine with dropping the related changes and making this patch
>  a one-line. :- )

Let's not carry module parameters that only theoretically help an out
of tree module. If such a driver exists its owner can just holler if
this policy change breaks their usage, and then we can have a
discussion about why their driver isn't upstream already.

Re: [PATCH 27/39] Annotate hardware config module parameters in drivers/scsi/

2017-04-05 Thread David Howells

Finn Thain  wrote:

> I can see how base addresses and IO ports are relevant, but the irq 
> parameter changes below don't protect the kernel image AFAICT. What's the 
> rationale for those changes? I think it should be stated here.

Easier grepping for one.  But I'm also currently preventing the changing of
any hardware parameters.  If a driver refuses to share an irq, you could use
it to deprive another driver of the ability to get an irq at all.

David

[PATCH 16/27] zram: implement REQ_OP_WRITE_ZEROES

2017-04-05 Thread Christoph Hellwig

Just the same as discard if the block size equals the system page size.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/block/zram/zram_drv.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index dceb5edd1e54..1710b06f04a7 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -829,10 +829,14 @@ static void __zram_make_request(struct zram *zram, struct 
bio *bio)
offset = (bio->bi_iter.bi_sector &
  (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;
 
-   if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
+   switch (bio_op(bio)) {
+   case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
zram_bio_discard(zram, index, offset, bio);
bio_endio(bio);
return;
+   default:
+   break;
}
 
bio_for_each_segment(bvec, bio, iter) {
@@ -1192,6 +1196,8 @@ static int zram_add(void)
zram->disk->queue->limits.max_sectors = SECTORS_PER_PAGE;
zram->disk->queue->limits.chunk_sectors = 0;
blk_queue_max_discard_sectors(zram->disk->queue, UINT_MAX);
+   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue);
+
/*
 * zram_bio_discard() will clear all logical blocks if logical block
 * size is identical with physical block size(PAGE_SIZE). But if it is
@@ -1201,10 +1207,7 @@ static int zram_add(void)
 * zeroed.
 */
if (ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE)
-   zram->disk->queue->limits.discard_zeroes_data = 1;
-   else
-   zram->disk->queue->limits.discard_zeroes_data = 0;
-   queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue);
+   blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
 
add_disk(zram->disk);
 
-- 
2.11.0

[PATCH 14/27] sd: implement unmapping Write Zeroes

2017-04-05 Thread Christoph Hellwig

Try to use a write same with unmap bit variant if the device supports it
and the caller allows for it.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 drivers/scsi/sd.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index d8d9c0bdd93c..001593ed0444 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -803,6 +803,15 @@ static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd 
*cmd)
u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
 
+   if (!(rq->cmd_flags & REQ_NOUNMAP)) {
+   switch (sdkp->provisioning_mode) {
+   case SD_LBP_WS16:
+   return sd_setup_write_same16_cmnd(cmd, true);
+   case SD_LBP_WS10:
+   return sd_setup_write_same10_cmnd(cmd, true);
+   }
+   }
+
if (sdp->no_write_same)
return BLKPREP_INVALID;
if (sdkp->ws16 || sector > 0x || nr_sectors > 0x)
-- 
2.11.0

[PATCH 13/27] block_dev: use blkdev_issue_zerout for hole punches

2017-04-05 Thread Christoph Hellwig

This gets us support for non-discard efficient write of zeroes (e.g. NVMe)
and prepares for removing the discard_zeroes_data flag.

Also remove a pointless discard support check, which is done in
blkdev_issue_discard already.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 

Reviewed-by: Hannes Reinecke 
---
 fs/block_dev.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2f704c3a816f..e405d8e58e31 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2069,7 +2069,6 @@ static long blkdev_fallocate(struct file *file, int mode, 
loff_t start,
 loff_t len)
 {
struct block_device *bdev = I_BDEV(bdev_file_inode(file));
-   struct request_queue *q = bdev_get_queue(bdev);
struct address_space *mapping;
loff_t end = start + len - 1;
loff_t isize;
@@ -2108,15 +2107,10 @@ static long blkdev_fallocate(struct file *file, int 
mode, loff_t start,
GFP_KERNEL, BLKDEV_ZERO_NOUNMAP);
break;
case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
-   /* Only punch if the device can do zeroing discard. */
-   if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data)
-   return -EOPNOTSUPP;
-   error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
-GFP_KERNEL, 0);
+   error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
+GFP_KERNEL, 
BLKDEV_ZERO_NOFALLBACK);
break;
case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | 
FALLOC_FL_NO_HIDE_STALE:
-   if (!blk_queue_discard(q))
-   return -EOPNOTSUPP;
error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
 GFP_KERNEL, 0);
break;
-- 
2.11.0

[PATCH 12/27] block: add a new BLKDEV_ZERO_NOFALLBACK flag

2017-04-05 Thread Christoph Hellwig

This avoids fallbacks to explicit zeroing in (__)blkdev_issue_zeroout if
the caller doesn't want them.

Also clean up the convoluted check for the return condition that this
new flag is added to.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 block/blk-lib.c| 5 -
 include/linux/blkdev.h | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 2f6d2cb2e1a2..2f882e22890b 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -281,6 +281,9 @@ static int __blkdev_issue_write_zeroes(struct block_device 
*bdev,
  *
  *  If a device is using logical block provisioning, the underlying space will
  *  not be released if %flags contains BLKDEV_ZERO_NOUNMAP.
+ *
+ *  If %flags contains BLKDEV_ZERO_NOFALLBACK, the function will return
+ *  -EOPNOTSUPP if no explicit hardware offload for zeroing is provided.
  */
 int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
@@ -298,7 +301,7 @@ int __blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
 
ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask,
biop, flags);
-   if (ret == 0 || (ret && ret != -EOPNOTSUPP))
+   if (ret != -EOPNOTSUPP || (flags & BLKDEV_ZERO_NOFALLBACK))
goto out;
 
ret = 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e7513ce3dbde..a5055d760661 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1351,6 +1351,7 @@ extern int __blkdev_issue_discard(struct block_device 
*bdev, sector_t sector,
struct bio **biop);
 
 #define BLKDEV_ZERO_NOUNMAP(1 << 0)  /* do not free blocks */
+#define BLKDEV_ZERO_NOFALLBACK (1 << 1)  /* don't write explicit zeroes */
 
 extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
-- 
2.11.0

[PATCH 11/27] block: add a REQ_NOUNMAP flag for REQ_OP_WRITE_ZEROES

2017-04-05 Thread Christoph Hellwig

If this flag is set logical provisioning capable device should
release space for the zeroed blocks if possible, if it is not set
devices should keep the blocks anchored.

Also remove an out of sync kerneldoc comment for a static function
that would have become even more out of data with this change.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 block/blk-lib.c   | 19 +--
 include/linux/blk_types.h |  6 ++
 2 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index f9f24ec69c27..2f6d2cb2e1a2 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -226,20 +226,9 @@ int blkdev_issue_write_same(struct block_device *bdev, 
sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_write_same);
 
-/**
- * __blkdev_issue_write_zeroes - generate number of bios with WRITE ZEROES
- * @bdev:  blockdev to issue
- * @sector:start sector
- * @nr_sects:  number of sectors to write
- * @gfp_mask:  memory allocation flags (for bio_alloc)
- * @biop:  pointer to anchor bio
- *
- * Description:
- *  Generate and issue number of bios(REQ_OP_WRITE_ZEROES) with zerofiled 
pages.
- */
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
-   struct bio **biop)
+   struct bio **biop, unsigned flags)
 {
struct bio *bio = *biop;
unsigned int max_write_zeroes_sectors;
@@ -258,7 +247,9 @@ static int __blkdev_issue_write_zeroes(struct block_device 
*bdev,
bio = next_bio(bio, 0, gfp_mask);
bio->bi_iter.bi_sector = sector;
bio->bi_bdev = bdev;
-   bio_set_op_attrs(bio, REQ_OP_WRITE_ZEROES, 0);
+   bio->bi_opf = REQ_OP_WRITE_ZEROES;
+   if (flags & BLKDEV_ZERO_NOUNMAP)
+   bio->bi_opf |= REQ_NOUNMAP;
 
if (nr_sects > max_write_zeroes_sectors) {
bio->bi_iter.bi_size = max_write_zeroes_sectors << 9;
@@ -306,7 +297,7 @@ int __blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
return -EINVAL;
 
ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask,
-   biop);
+   biop, flags);
if (ret == 0 || (ret && ret != -EOPNOTSUPP))
goto out;
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 4eae30bfbfca..8eaa7dca7057 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -195,6 +195,10 @@ enum req_flag_bits {
__REQ_PREFLUSH, /* request for cache flush */
__REQ_RAHEAD,   /* read ahead, can fail anytime */
__REQ_BACKGROUND,   /* background IO */
+
+   /* command specific flags for REQ_OP_WRITE_ZEROES: */
+   __REQ_NOUNMAP,  /* do not free blocks when zeroing */
+
__REQ_NR_BITS,  /* stops here */
 };
 
@@ -212,6 +216,8 @@ enum req_flag_bits {
 #define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
 
+#define REQ_NOUNMAP(1ULL << __REQ_NOUNMAP)
+
 #define REQ_FAILFAST_MASK \
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
 
-- 
2.11.0

[PATCH 10/27] block: add a flags argument to (__)blkdev_issue_zeroout

2017-04-05 Thread Christoph Hellwig

Turn the existing discard flag into a new BLKDEV_ZERO_UNMAP flag with
similar semantics, but without referring to diѕcard.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 block/blk-lib.c| 31 ++-
 block/ioctl.c  |  2 +-
 drivers/block/drbd/drbd_receiver.c |  9 ++---
 drivers/nvme/target/io-cmd.c   |  2 +-
 fs/block_dev.c |  2 +-
 fs/dax.c   |  2 +-
 fs/xfs/xfs_bmap_util.c |  2 +-
 include/linux/blkdev.h | 16 ++--
 8 files changed, 35 insertions(+), 31 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 2a8d638544a7..f9f24ec69c27 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -282,14 +282,18 @@ static int __blkdev_issue_write_zeroes(struct 
block_device *bdev,
  * @nr_sects:  number of sectors to write
  * @gfp_mask:  memory allocation flags (for bio_alloc)
  * @biop:  pointer to anchor bio
- * @discard:   discard flag
+ * @flags: controls detailed behavior
  *
  * Description:
- *  Generate and issue number of bios with zerofiled pages.
+ *  Zero-fill a block range, either using hardware offload or by explicitly
+ *  writing zeroes to the device.
+ *
+ *  If a device is using logical block provisioning, the underlying space will
+ *  not be released if %flags contains BLKDEV_ZERO_NOUNMAP.
  */
 int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
-   bool discard)
+   unsigned flags)
 {
int ret;
int bi_size = 0;
@@ -337,28 +341,21 @@ EXPORT_SYMBOL(__blkdev_issue_zeroout);
  * @sector:start sector
  * @nr_sects:  number of sectors to write
  * @gfp_mask:  memory allocation flags (for bio_alloc)
- * @discard:   whether to discard the block range
+ * @flags: controls detailed behavior
  *
  * Description:
- *  Zero-fill a block range.  If the discard flag is set and the block
- *  device guarantees that subsequent READ operations to the block range
- *  in question will return zeroes, the blocks will be discarded. Should
- *  the discard request fail, if the discard flag is not set, or if
- *  discard_zeroes_data is not supported, this function will resort to
- *  zeroing the blocks manually, thus provisioning (allocating,
- *  anchoring) them. If the block device supports WRITE ZEROES or WRITE SAME
- *  command(s), blkdev_issue_zeroout() will use it to optimize the process of
- *  clearing the block range. Otherwise the zeroing will be performed
- *  using regular WRITE calls.
+ *  Zero-fill a block range, either using hardware offload or by explicitly
+ *  writing zeroes to the device.  See __blkdev_issue_zeroout() for the
+ *  valid values for %flags.
  */
 int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
-sector_t nr_sects, gfp_t gfp_mask, bool discard)
+   sector_t nr_sects, gfp_t gfp_mask, unsigned flags)
 {
int ret;
struct bio *bio = NULL;
struct blk_plug plug;
 
-   if (discard) {
+   if (!(flags & BLKDEV_ZERO_NOUNMAP)) {
if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask,
BLKDEV_DISCARD_ZERO))
return 0;
@@ -366,7 +363,7 @@ int blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
 
blk_start_plug();
ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
-   , discard);
+   , flags);
if (ret == 0 && bio) {
ret = submit_bio_wait(bio);
bio_put(bio);
diff --git a/block/ioctl.c b/block/ioctl.c
index 7b88820b93d9..8ea00a41be01 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -255,7 +255,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, 
fmode_t mode,
truncate_inode_pages_range(mapping, start, end);
 
return blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
-   false);
+   BLKDEV_ZERO_NOUNMAP);
 }
 
 static int put_ushort(unsigned long arg, unsigned short val)
diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index aa6bf9692eff..dc9a6dcd431c 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1499,19 +1499,22 @@ int drbd_issue_discard_or_zero_out(struct drbd_device 
*device, sector_t start, u
tmp = start + granularity - sector_div(tmp, granularity);
 
nr = tmp - start;
-   err |= blkdev_issue_zeroout(bdev, start, nr, GFP_NOIO, 0);
+   err |= blkdev_issue_zeroout(bdev, start, nr, GFP_NOIO,
+   BLKDEV_ZERO_NOUNMAP);
nr_sectors -= nr;

[PATCH 09/27] block: stop using blkdev_issue_write_same for zeroing

2017-04-05 Thread Christoph Hellwig

We'll always use the WRITE ZEROES code for zeroing now.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 block/blk-lib.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e5b853f2b8a2..2a8d638544a7 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -364,10 +364,6 @@ int blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
return 0;
}
 
-   if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
-   ZERO_PAGE(0)))
-   return 0;
-
blk_start_plug();
ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
, discard);
-- 
2.11.0

[PATCH 01/27] sd: split sd_setup_discard_cmnd

2017-04-05 Thread Christoph Hellwig

Split sd_setup_discard_cmnd into one function per provisioning type.  While
this creates some very slight duplication of boilerplate code it keeps the
code modular for additions of new provisioning types, and for reusing the
write same functions for the upcoming scsi implementation of the Write Zeroes
operation.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 drivers/scsi/sd.c | 153 ++
 1 file changed, 84 insertions(+), 69 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index fcfeddc79331..b853f91fb3da 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -701,93 +701,97 @@ static void sd_config_discard(struct scsi_disk *sdkp, 
unsigned int mode)
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
 }
 
-/**
- * sd_setup_discard_cmnd - unmap blocks on thinly provisioned device
- * @sdp: scsi device to operate on
- * @rq: Request to prepare
- *
- * Will issue either UNMAP or WRITE SAME(16) depending on preference
- * indicated by target device.
- **/
-static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
+static int sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
 {
-   struct request *rq = cmd->request;
struct scsi_device *sdp = cmd->device;
-   struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
-   sector_t sector = blk_rq_pos(rq);
-   unsigned int nr_sectors = blk_rq_sectors(rq);
-   unsigned int len;
-   int ret;
+   struct request *rq = cmd->request;
+   u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
+   u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
+   unsigned int data_len = 24;
char *buf;
-   struct page *page;
 
-   sector >>= ilog2(sdp->sector_size) - 9;
-   nr_sectors >>= ilog2(sdp->sector_size) - 9;
-
-   page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
-   if (!page)
+   rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+   if (!rq->special_vec.bv_page)
return BLKPREP_DEFER;
+   rq->special_vec.bv_offset = 0;
+   rq->special_vec.bv_len = data_len;
+   rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
 
-   switch (sdkp->provisioning_mode) {
-   case SD_LBP_UNMAP:
-   buf = page_address(page);
-
-   cmd->cmd_len = 10;
-   cmd->cmnd[0] = UNMAP;
-   cmd->cmnd[8] = 24;
-
-   put_unaligned_be16(6 + 16, [0]);
-   put_unaligned_be16(16, [2]);
-   put_unaligned_be64(sector, [8]);
-   put_unaligned_be32(nr_sectors, [16]);
+   cmd->cmd_len = 10;
+   cmd->cmnd[0] = UNMAP;
+   cmd->cmnd[8] = 24;
 
-   len = 24;
-   break;
+   buf = page_address(rq->special_vec.bv_page);
+   put_unaligned_be16(6 + 16, [0]);
+   put_unaligned_be16(16, [2]);
+   put_unaligned_be64(sector, [8]);
+   put_unaligned_be32(nr_sectors, [16]);
 
-   case SD_LBP_WS16:
-   cmd->cmd_len = 16;
-   cmd->cmnd[0] = WRITE_SAME_16;
-   cmd->cmnd[1] = 0x8; /* UNMAP */
-   put_unaligned_be64(sector, >cmnd[2]);
-   put_unaligned_be32(nr_sectors, >cmnd[10]);
+   cmd->allowed = SD_MAX_RETRIES;
+   cmd->transfersize = data_len;
+   rq->timeout = SD_TIMEOUT;
+   scsi_req(rq)->resid_len = data_len;
 
-   len = sdkp->device->sector_size;
-   break;
+   return scsi_init_io(cmd);
+}
 
-   case SD_LBP_WS10:
-   case SD_LBP_ZERO:
-   cmd->cmd_len = 10;
-   cmd->cmnd[0] = WRITE_SAME;
-   if (sdkp->provisioning_mode == SD_LBP_WS10)
-   cmd->cmnd[1] = 0x8; /* UNMAP */
-   put_unaligned_be32(sector, >cmnd[2]);
-   put_unaligned_be16(nr_sectors, >cmnd[7]);
+static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd)
+{
+   struct scsi_device *sdp = cmd->device;
+   struct request *rq = cmd->request;
+   u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
+   u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
+   u32 data_len = sdp->sector_size;
 
-   len = sdkp->device->sector_size;
-   break;
+   rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+   if (!rq->special_vec.bv_page)
+   return BLKPREP_DEFER;
+   rq->special_vec.bv_offset = 0;
+   rq->special_vec.bv_len = data_len;
+   rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
 
-   default:
-   ret = BLKPREP_INVALID;
-   goto out;
-   }
+   cmd->cmd_len = 16;
+   cmd->cmnd[0] = WRITE_SAME_16;
+   cmd->cmnd[1] = 0x8; /* UNMAP */
+   put_unaligned_be64(sector, >cmnd[2]);
+   put_unaligned_be32(nr_sectors, >cmnd[10]);
 
+   cmd->allowed = SD_MAX_RETRIES;
+

always use REQ_OP_WRITE_ZEROES for zeroing offload V2

2017-04-05 Thread Christoph Hellwig

This series makes REQ_OP_WRITE_ZEROES the only zeroing offload
supported by the block layer, and switches existing implementations
of REQ_OP_DISCARD that correctly set discard_zeroes_data to it,
removes incorrect discard_zeroes_data, and also switches WRITE SAME
based zeroing in SCSI to this new method.

The series is against the block for-next tree.

A git tree is also avaiable at:

git://git.infradead.org/users/hch/block.git discard-rework.2

Gitweb:


http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/discard-rework.2

Changes since V2:
 - various spelling fixes
 - various reviews captured
 - two new patches from Martin at the end

[PATCH 02/27] block: renumber REQ_OP_WRITE_ZEROES

2017-04-05 Thread Christoph Hellwig

Make life easy for implementations that needs to send a data buffer
to the device (e.g. SCSI) by numbering it as a data out command.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 include/linux/blk_types.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 67bcf8a5326e..4eae30bfbfca 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -168,7 +168,7 @@ enum req_opf {
/* write the same sector many times */
REQ_OP_WRITE_SAME   = 7,
/* write the zero filled sector many times */
-   REQ_OP_WRITE_ZEROES = 8,
+   REQ_OP_WRITE_ZEROES = 9,
 
/* SCSI passthrough using struct scsi_request */
REQ_OP_SCSI_IN  = 32,
-- 
2.11.0

[PATCH 04/27] sd: implement REQ_OP_WRITE_ZEROES

2017-04-05 Thread Christoph Hellwig

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 drivers/scsi/sd.c | 31 ++-
 drivers/scsi/sd_zbc.c |  1 +
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index b853f91fb3da..d8d9c0bdd93c 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -735,7 +735,7 @@ static int sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
return scsi_init_io(cmd);
 }
 
-static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd)
+static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd, bool unmap)
 {
struct scsi_device *sdp = cmd->device;
struct request *rq = cmd->request;
@@ -752,13 +752,14 @@ static int sd_setup_write_same16_cmnd(struct scsi_cmnd 
*cmd)
 
cmd->cmd_len = 16;
cmd->cmnd[0] = WRITE_SAME_16;
-   cmd->cmnd[1] = 0x8; /* UNMAP */
+   if (unmap)
+   cmd->cmnd[1] = 0x8; /* UNMAP */
put_unaligned_be64(sector, >cmnd[2]);
put_unaligned_be32(nr_sectors, >cmnd[10]);
 
cmd->allowed = SD_MAX_RETRIES;
cmd->transfersize = data_len;
-   rq->timeout = SD_TIMEOUT;
+   rq->timeout = unmap ? SD_TIMEOUT : SD_WRITE_SAME_TIMEOUT;
scsi_req(rq)->resid_len = data_len;
 
return scsi_init_io(cmd);
@@ -788,12 +789,27 @@ static int sd_setup_write_same10_cmnd(struct scsi_cmnd 
*cmd, bool unmap)
 
cmd->allowed = SD_MAX_RETRIES;
cmd->transfersize = data_len;
-   rq->timeout = SD_TIMEOUT;
+   rq->timeout = unmap ? SD_TIMEOUT : SD_WRITE_SAME_TIMEOUT;
scsi_req(rq)->resid_len = data_len;
 
return scsi_init_io(cmd);
 }
 
+static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd *cmd)
+{
+   struct request *rq = cmd->request;
+   struct scsi_device *sdp = cmd->device;
+   struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+   u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
+   u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
+
+   if (sdp->no_write_same)
+   return BLKPREP_INVALID;
+   if (sdkp->ws16 || sector > 0x || nr_sectors > 0x)
+   return sd_setup_write_same16_cmnd(cmd, false);
+   return sd_setup_write_same10_cmnd(cmd, false);
+}
+
 static void sd_config_write_same(struct scsi_disk *sdkp)
 {
struct request_queue *q = sdkp->disk->queue;
@@ -823,6 +839,8 @@ static void sd_config_write_same(struct scsi_disk *sdkp)
 out:
blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks *
 (logical_block_size >> 9));
+   blk_queue_max_write_zeroes_sectors(q, sdkp->max_ws_blocks *
+(logical_block_size >> 9));
 }
 
 /**
@@ -1163,7 +1181,7 @@ static int sd_init_command(struct scsi_cmnd *cmd)
case SD_LBP_UNMAP:
return sd_setup_unmap_cmnd(cmd);
case SD_LBP_WS16:
-   return sd_setup_write_same16_cmnd(cmd);
+   return sd_setup_write_same16_cmnd(cmd, true);
case SD_LBP_WS10:
return sd_setup_write_same10_cmnd(cmd, true);
case SD_LBP_ZERO:
@@ -1171,6 +1189,8 @@ static int sd_init_command(struct scsi_cmnd *cmd)
default:
return BLKPREP_INVALID;
}
+   case REQ_OP_WRITE_ZEROES:
+   return sd_setup_write_zeroes_cmnd(cmd);
case REQ_OP_WRITE_SAME:
return sd_setup_write_same_cmnd(cmd);
case REQ_OP_FLUSH:
@@ -1810,6 +1830,7 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 
switch (req_op(req)) {
case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
case REQ_OP_WRITE_SAME:
case REQ_OP_ZONE_RESET:
if (!result) {
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 92620c8ea8ad..1994f7799fce 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -329,6 +329,7 @@ void sd_zbc_complete(struct scsi_cmnd *cmd,
 
switch (req_op(rq)) {
case REQ_OP_WRITE:
+   case REQ_OP_WRITE_ZEROES:
case REQ_OP_WRITE_SAME:
case REQ_OP_ZONE_RESET:
 
-- 
2.11.0

[PATCH 05/27] md: support REQ_OP_WRITE_ZEROES

2017-04-05 Thread Christoph Hellwig

Copy & paste from the REQ_OP_WRITE_SAME code.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/linear.c| 1 +
 drivers/md/md.h| 7 +++
 drivers/md/multipath.c | 1 +
 drivers/md/raid0.c | 2 ++
 drivers/md/raid1.c | 4 +++-
 drivers/md/raid10.c| 1 +
 drivers/md/raid5.c | 1 +
 7 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 3e38e0207a3e..377a8a3672e3 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -293,6 +293,7 @@ static void linear_make_request(struct mddev *mddev, struct 
bio *bio)
  split, 
disk_devt(mddev->gendisk),
  bio_sector);
mddev_check_writesame(mddev, split);
+   mddev_check_write_zeroes(mddev, split);
generic_make_request(split);
}
} while (split != bio);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index dde8ecb760c8..1e76d64ce180 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -709,4 +709,11 @@ static inline void mddev_check_writesame(struct mddev 
*mddev, struct bio *bio)
!bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
mddev->queue->limits.max_write_same_sectors = 0;
 }
+
+static inline void mddev_check_write_zeroes(struct mddev *mddev, struct bio 
*bio)
+{
+   if (bio_op(bio) == REQ_OP_WRITE_ZEROES &&
+   !bdev_get_queue(bio->bi_bdev)->limits.max_write_zeroes_sectors)
+   mddev->queue->limits.max_write_zeroes_sectors = 0;
+}
 #endif /* _MD_MD_H */
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 79a12b59250b..e95d521d93e9 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -139,6 +139,7 @@ static void multipath_make_request(struct mddev *mddev, 
struct bio * bio)
mp_bh->bio.bi_end_io = multipath_end_request;
mp_bh->bio.bi_private = mp_bh;
mddev_check_writesame(mddev, _bh->bio);
+   mddev_check_write_zeroes(mddev, _bh->bio);
generic_make_request(_bh->bio);
return;
 }
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 93347ca7c7a6..ce7a6a56cf73 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -383,6 +383,7 @@ static int raid0_run(struct mddev *mddev)
 
blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
blk_queue_max_write_same_sectors(mddev->queue, 
mddev->chunk_sectors);
+   blk_queue_max_write_zeroes_sectors(mddev->queue, 
mddev->chunk_sectors);
blk_queue_max_discard_sectors(mddev->queue, 
mddev->chunk_sectors);
 
blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
@@ -504,6 +505,7 @@ static void raid0_make_request(struct mddev *mddev, struct 
bio *bio)
  split, 
disk_devt(mddev->gendisk),
  bio_sector);
mddev_check_writesame(mddev, split);
+   mddev_check_write_zeroes(mddev, split);
generic_make_request(split);
}
} while (split != bio);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a34f58772022..b59cc100320a 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -3177,8 +3177,10 @@ static int raid1_run(struct mddev *mddev)
if (IS_ERR(conf))
return PTR_ERR(conf);
 
-   if (mddev->queue)
+   if (mddev->queue) {
blk_queue_max_write_same_sectors(mddev->queue, 0);
+   blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
+   }
 
rdev_for_each(rdev, mddev) {
if (!mddev->gendisk)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e89a8d78a9ed..28ec3a93acee 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3749,6 +3749,7 @@ static int raid10_run(struct mddev *mddev)
blk_queue_max_discard_sectors(mddev->queue,
  mddev->chunk_sectors);
blk_queue_max_write_same_sectors(mddev->queue, 0);
+   blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
blk_queue_io_min(mddev->queue, chunk_size);
if (conf->geo.raid_disks % conf->geo.near_copies)
blk_queue_io_opt(mddev->queue, chunk_size * 
conf->geo.raid_disks);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ed5cd705b985..8cf1f86dcd05 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7272,6 +7272,7 @@ static int raid5_run(struct mddev *mddev)
mddev->queue->limits.discard_zeroes_data = 0;
 
blk_queue_max_write_same_sectors(mddev->queue, 0);
+   blk_queue_max_write_zeroes_sectors(mddev->queue, 0);

[PATCH 03/27] block: implement splitting of REQ_OP_WRITE_ZEROES bios

2017-04-05 Thread Christoph Hellwig

Copy and past the REQ_OP_WRITE_SAME code to prepare to implementations
that limit the write zeroes size.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
---
 block/blk-merge.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2afa262425d1..3990ae406341 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -54,6 +54,20 @@ static struct bio *blk_bio_discard_split(struct 
request_queue *q,
return bio_split(bio, split_sectors, GFP_NOIO, bs);
 }
 
+static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
+   struct bio *bio, struct bio_set *bs, unsigned *nsegs)
+{
+   *nsegs = 1;
+
+   if (!q->limits.max_write_zeroes_sectors)
+   return NULL;
+
+   if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors)
+   return NULL;
+
+   return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
+}
+
 static struct bio *blk_bio_write_same_split(struct request_queue *q,
struct bio *bio,
struct bio_set *bs,
@@ -200,8 +214,7 @@ void blk_queue_split(struct request_queue *q, struct bio 
**bio,
split = blk_bio_discard_split(q, *bio, bs, );
break;
case REQ_OP_WRITE_ZEROES:
-   split = NULL;
-   nsegs = (*bio)->bi_phys_segments;
+   split = blk_bio_write_zeroes_split(q, *bio, bs, );
break;
case REQ_OP_WRITE_SAME:
split = blk_bio_write_same_split(q, *bio, bs, );
-- 
2.11.0

[PATCH 06/27] dm io: discards don't take a payload

2017-04-05 Thread Christoph Hellwig

Fix up do_region to not allocate a bio_vec for discards.  We've
got rid of the discard payload allocated by the caller years ago.

Obviously this wasn't actually harmful given how long it's been
there, but it's still good to avoid the pointless allocation.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/dm-io.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 03940bf36f6c..b808cbe22678 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -328,11 +328,17 @@ static void do_region(int op, int op_flags, unsigned 
region,
/*
 * Allocate a suitably sized-bio.
 */
-   if ((op == REQ_OP_DISCARD) || (op == REQ_OP_WRITE_SAME))
+   switch (op) {
+   case REQ_OP_DISCARD:
+   num_bvecs = 0;
+   break;
+   case REQ_OP_WRITE_SAME:
num_bvecs = 1;
-   else
+   break;
+   default:
num_bvecs = min_t(int, BIO_MAX_PAGES,
  dm_sector_div_up(remaining, 
(PAGE_SIZE >> SECTOR_SHIFT)));
+   }
 
bio = bio_alloc_bioset(GFP_NOIO, num_bvecs, io->client->bios);
bio->bi_iter.bi_sector = where->sector + (where->count - 
remaining);
-- 
2.11.0

[PATCH 07/27] dm: support REQ_OP_WRITE_ZEROES

2017-04-05 Thread Christoph Hellwig

Copy & paste from the REQ_OP_WRITE_SAME code.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/dm-core.h  |  1 +
 drivers/md/dm-io.c|  8 ++--
 drivers/md/dm-linear.c|  1 +
 drivers/md/dm-mpath.c |  1 +
 drivers/md/dm-rq.c| 11 ---
 drivers/md/dm-stripe.c|  2 ++
 drivers/md/dm-table.c | 30 ++
 drivers/md/dm.c   | 31 ---
 include/linux/device-mapper.h |  6 ++
 9 files changed, 83 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 136fda3ff9e5..fea5bd52ada8 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -132,6 +132,7 @@ void dm_init_md_queue(struct mapped_device *md);
 void dm_init_normal_md_queue(struct mapped_device *md);
 int md_in_flight(struct mapped_device *md);
 void disable_write_same(struct mapped_device *md);
+void disable_write_zeroes(struct mapped_device *md);
 
 static inline struct completion *dm_get_completion_from_kobject(struct kobject 
*kobj)
 {
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index b808cbe22678..3702e502466d 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -312,9 +312,12 @@ static void do_region(int op, int op_flags, unsigned 
region,
 */
if (op == REQ_OP_DISCARD)
special_cmd_max_sectors = q->limits.max_discard_sectors;
+   else if (op == REQ_OP_WRITE_ZEROES)
+   special_cmd_max_sectors = q->limits.max_write_zeroes_sectors;
else if (op == REQ_OP_WRITE_SAME)
special_cmd_max_sectors = q->limits.max_write_same_sectors;
-   if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_SAME) &&
+   if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES ||
+op == REQ_OP_WRITE_SAME)  &&
special_cmd_max_sectors == 0) {
dec_count(io, region, -EOPNOTSUPP);
return;
@@ -330,6 +333,7 @@ static void do_region(int op, int op_flags, unsigned region,
 */
switch (op) {
case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
num_bvecs = 0;
break;
case REQ_OP_WRITE_SAME:
@@ -347,7 +351,7 @@ static void do_region(int op, int op_flags, unsigned region,
bio_set_op_attrs(bio, op, op_flags);
store_io_and_region_in_bio(bio, io, region);
 
-   if (op == REQ_OP_DISCARD) {
+   if (op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES) {
num_sectors = min_t(sector_t, special_cmd_max_sectors, 
remaining);
bio->bi_iter.bi_size = num_sectors << SECTOR_SHIFT;
remaining -= num_sectors;
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 4788b0b989a9..e17fd44ceef5 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -59,6 +59,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int 
argc, char **argv)
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
ti->num_write_same_bios = 1;
+   ti->num_write_zeroes_bios = 1;
ti->private = lc;
return 0;
 
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 7f223dbed49f..ab55955ed704 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1103,6 +1103,7 @@ static int multipath_ctr(struct dm_target *ti, unsigned 
argc, char **argv)
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
ti->num_write_same_bios = 1;
+   ti->num_write_zeroes_bios = 1;
if (m->queue_mode == DM_TYPE_BIO_BASED)
ti->per_io_data_size = multipath_per_bio_data_size();
else
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 6886bf160fb2..a789bf035621 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -298,9 +298,14 @@ static void dm_done(struct request *clone, int error, bool 
mapped)
r = rq_end_io(tio->ti, clone, error, >info);
}
 
-   if (unlikely(r == -EREMOTEIO && (req_op(clone) == REQ_OP_WRITE_SAME) &&
-!clone->q->limits.max_write_same_sectors))
-   disable_write_same(tio->md);
+   if (unlikely(r == -EREMOTEIO)) {
+   if (req_op(clone) == REQ_OP_WRITE_SAME &&
+   !clone->q->limits.max_write_same_sectors)
+   disable_write_same(tio->md);
+   if (req_op(clone) == REQ_OP_WRITE_ZEROES &&
+   !clone->q->limits.max_write_zeroes_sectors)
+   disable_write_zeroes(tio->md);
+   }
 
if (r <= 0)
/* The target wants to complete the I/O */
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 28193a57bf47..5ef49c121d99 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -169,6

[PATCH 08/27] dm kcopyd: switch to use REQ_OP_WRITE_ZEROES

2017-04-05 Thread Christoph Hellwig

It seems like the code currently passes whatever it was using for writes
to WRITE SAME.  Just switch it to WRITE ZEROES, although that doesn't
need any payload.

Untested, and confused by the code, maybe someone who understands it
better than me can help..

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/dm-kcopyd.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 9e9d04cb7d51..f85846741d50 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -733,11 +733,11 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct 
dm_io_region *from,
job->pages = _page_list;
 
/*
-* Use WRITE SAME to optimize zeroing if all dests support it.
+* Use WRITE ZEROES to optimize zeroing if all dests support it.
 */
-   job->rw = REQ_OP_WRITE_SAME;
+   job->rw = REQ_OP_WRITE_ZEROES;
for (i = 0; i < job->num_dests; i++)
-   if (!bdev_write_same(job->dests[i].bdev)) {
+   if (!bdev_write_zeroes_sectors(job->dests[i].bdev)) {
job->rw = WRITE;
break;
}
-- 
2.11.0

[PATCH 17/27] loop: implement REQ_OP_WRITE_ZEROES

2017-04-05 Thread Christoph Hellwig

It's identical to discard as hole punches will always leave us with
zeroes on reads.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
---
 drivers/block/loop.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cc981f34e017..3bb04c1a4ba1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -528,6 +528,7 @@ static int do_req_filebacked(struct loop_device *lo, struct 
request *rq)
case REQ_OP_FLUSH:
return lo_req_flush(lo, rq);
case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
return lo_discard(lo, rq, pos);
case REQ_OP_WRITE:
if (lo->transfer)
@@ -826,6 +827,7 @@ static void loop_config_discard(struct loop_device *lo)
q->limits.discard_granularity = 0;
q->limits.discard_alignment = 0;
blk_queue_max_discard_sectors(q, 0);
+   blk_queue_max_write_zeroes_sectors(q, 0);
q->limits.discard_zeroes_data = 0;
queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q);
return;
@@ -834,6 +836,7 @@ static void loop_config_discard(struct loop_device *lo)
q->limits.discard_granularity = inode->i_sb->s_blocksize;
q->limits.discard_alignment = 0;
blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
+   blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
q->limits.discard_zeroes_data = 1;
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
 }
@@ -1660,6 +1663,7 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
switch (req_op(cmd->rq)) {
case REQ_OP_FLUSH:
case REQ_OP_DISCARD:
+   case REQ_OP_WRITE_ZEROES:
cmd->use_aio = false;
break;
default:
-- 
2.11.0

1 2 >

1 - 100 of 151 matches

Mail list logo