Re: [RFC 3/8] nvmet: Use p2pmem in nvme target
I hadn't done this yet but I think a simple closest device in the tree would solve the issue sufficiently. However, I originally had it so the user has to pick the device and I prefer that approach. But if the user picks the device, then why bother restricting what he picks? Because the user can get it wrong, and its our job to do what we can in order to prevent the user from screwing itself. Per the thread with Sinan, I'd prefer to use what the user picks. You were one of the biggest opponents to that so I'd like to hear your opinion on removing the restrictions. I wasn't against it that much, I'm all for making things "just work" with minimal configuration steps, but I'm not sure we can get it right without it. Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would save an extra PCI transfer as the NVME card could just take the data out of it's own memory. However, at this time, cards with CMB buffers don't seem to be available. Even if it was available, it would be hard to make real use of this given that we wouldn't know how to pre-post recv buffers (for in-capsule data). But let's leave this out of the scope entirely... I don't understand what you're referring to. We'd simply use the CMB buffer as a p2pmem device, why does that change anything? I'm referring to the in-capsule data buffers pre-posts that we do. Because we prepare a buffer that would contain in-capsule data, we have no knowledge to which device the incoming I/O is directed to, which means we can (and will) have I/O where the data lies in CMB of device A but it's really targeted to device B - which sorta defeats the purpose of what we're trying to optimize here... Why do you need this? you have a reference to the queue itself. This keeps track of whether the response was actually allocated with p2pmem or not. It's needed for when we free the SGL because the queue may have a p2pmem device assigned to it but, if the alloc failed and it fell back on system memory then we need to know how to free it. I'm currently looking at having SGLs having an iomem flag. In which case, this would no longer be needed as the flag in the SGL could be used. That would be better, maybe... [...] This is a problem. namespaces can be added at any point in time. No one guarantee that dma_devs are all the namepaces we'll ever see. Yeah, well restricting p2pmem based on all the devices in use is hard. So we'd need a call into the transport every time an ns is added and we'd have to drop the p2pmem if they add one that isn't supported. This complexity is just one of the reasons I prefer just letting the user chose. Still the user can get it wrong. Not sure we can get a way without keeping track of this as new devices join the subsystem. + +if (queue->p2pmem) +pr_debug("using %s for rdma nvme target queue", + dev_name(>p2pmem->dev)); + +kfree(dma_devs); +} + static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id, struct rdma_cm_event *event) { @@ -1199,6 +1271,8 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id, } queue->port = cm_id->context; +nvmet_rdma_queue_setup_p2pmem(queue); + Why is all this done for each queue? looks completely redundant to me. A little bit. Where would you put it? I think we'll need a representation of a controller in nvmet-rdma for that. we sort of got a way without it so far, but I don't think we can anymore with this. ret = nvmet_rdma_cm_accept(cm_id, queue, >param.conn); if (ret) goto release_queue; You seemed to skip the in-capsule buffers for p2pmem (inline_page), I'm curious why? Yes, the thinking was that these transfers were small anyway so there would not be significant benefit to pushing them through p2pmem. There's really no reason why we couldn't do that if it made sense to though. I don't see an urgent reason for it too. I was just curious...
Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem
Note that the nvme completion queues are still on the host memory, so this means we have lost the ordering between data and completions as they go to different pcie targets. Hmm, in this simple up/down case with a switch, I think it might actually be OK. Transactions might not complete at the NVMe device before the CPU processes the RDMA completion, however due to the PCI-E ordering rules new TLPs directed to the NVMe will complete after the RMDA TLPs and thus observe the new data. (eg order preserving) It would be very hard to use P2P if fabric ordering is not preserved.. I think it still can race if the p2p device is connected with more than a single port to the switch. Say it's connected via 2 legs, the bar is accessed from leg A and the data from the disk comes via leg B. In this case, the data is heading towards the p2p device via leg B (might be congested), the completion goes directly to the RC, and then the host issues a read from the bar via leg A. I don't understand what can guarantee ordering here. Stephen told me that this still guarantees ordering, but I honestly can't understand how, perhaps someone can explain to me in a simple way that I can understand.
Re: 4.11.0-rc5-00011-g08e4e0d oops in mpt3sas driver
On 06/04/17 08:30, Brad Campbell wrote: G'day All, This is a vaguely current git head kernel compiled yesterday. Oopsed and rebooted itself, and then oopsed and rebooted again. There was no sign of a raid rebuild in the kernel logs, and it's a staging machine so there is nothing running after a reboot that goes near these disks. They should have been completely idle the second time around. This box suffered from bad rcu stalls on 4.10.x stable kernels, so I upgraded to git head. It's all new hardware (the CPU, Chipset and board), so I expected some issues with the board, but the LSI cards have been around for a while now. Further investigation indicates it might be a deeper problem. This is the first oops captured and it has nothing to do with the mpt3 driver. [49580.533852] BUG: unable to handle kernel paging request at 817cddfe [49580.533875] IP: queued_spin_lock_slowpath+0xe7/0x170 [49580.533879] PGD 180a067 [49580.533879] PUD 180b063 [49580.533882] PMD 816001e1 [49580.533885] [49580.533890] Oops: 0003 [#1] SMP [49580.533894] Modules linked in: it87(O) deflate zlib_deflate ctr des_generic cbc cmac sha1_generic md5 hmac af_key xfrm_algo nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace sunrpc bonding sha256_generic dm_crypt aesni_intel aes_x86_64 crypto_simd cryptd glue_helper hwmon_vid netconsole configfs vhost_net vhost kvm_amd kvm irqbypass usbhid usb_storage nouveau video drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea ttm drm mxm_wmi xhci_pci i2c_piix4 xhci_hcd usbcore usb_common wmi acpi_cpufreq mpt3sas igb i2c_algo_bit raid_class scsi_transport_sas ahci libahci [49580.533929] CPU: 6 PID: 114 Comm: kswapd0 Tainted: G O 4.11.0-rc5-00011-g08e4e0d-dirty #39 [49580.533933] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 0515 03/30/2017 [49580.534045] task: 8807f9ad task.stack: c943 [49580.534049] RIP: 0010:queued_spin_lock_slowpath+0xe7/0x170 [49580.534052] RSP: 0018:c9433a50 EFLAGS: 00010082 [49580.534056] RAX: 34e1 RBX: 0292 RCX: 001c [49580.534059] RDX: 817cddfe RSI: 88081ed99900 RDI: 8806ddb860e0 [49580.534063] RBP: 8806ddb860e0 R08: 0101 R09: dead0200 [49580.534119] R10: ea001c000700 R11: 880006b457b9 R12: 8806ddb860c8 [49580.534122] R13: 0001 R14: c9433b40 R15: 8806ddb860c8 [49580.534179] FS: () GS:88081ed8() knlGS: [49580.534183] CS: 0010 DS: ES: CR0: 80050033 [49580.534186] CR2: 817cddfe CR3: 01809000 CR4: 003406e0 [49580.534190] Call Trace: [49580.534247] ? _raw_spin_lock_irqsave+0x1f/0x30 [49580.534253] ? __remove_mapping+0x65/0x1b0 [49580.534258] ? page_mkclean_one+0x100/0x100 [49580.534313] ? page_get_anon_vma+0xa0/0xa0 [49580.534317] ? shrink_page_list+0x6aa/0xda0 [49580.534321] ? shrink_inactive_list+0x1f6/0x4b0 [49580.534325] ? es_reclaim_extents+0x55/0xe0 [49580.534328] ? inactive_list_is_low.isra.70+0x10e/0x1c0 [49580.534332] ? shrink_node_memcg.isra.75+0x58c/0x6b0 [49580.534531] ? shrink_node+0x4a/0x190 [49580.534705] ? kswapd+0x2b7/0x5d0 [49580.535076] ? kthread+0xf1/0x130 [49580.535477] ? shrink_node+0x190/0x190 [49580.535869] ? __kthread_init_worker+0xa0/0xa0 [49580.536257] ? ret_from_fork+0x23/0x30 [49580.53] Code: 47 02 c1 e0 10 0f 84 93 00 00 00 48 89 c2 c1 e8 12 48 c1 ea 0c ff c8 83 e2 30 48 98 48 81 c2 00 99 01 00 48 03 14 c5 20 54 77 81 <48> 89 32 8b 46 08 85 c0 75 09 f3 90 8b 46 08 85 c0 74 f7 4c 8b [49580.537489] RIP: queued_spin_lock_slowpath+0xe7/0x170 RSP: c9433a50 [49580.537904] CR2: 817cddfe [49580.540107] ---[ end trace f58d3bdd0830f2bf ]--- [49580.540642] Kernel panic - not syncing: Fatal exception [49580.541212] Kernel Offset: disabled [49580.541493] Rebooting in 10 seconds.. [49590.501026] ACPI MEMORY or I/O RESET_REG. This box survives days of memtest, but I'm not above suspecting the underlying hardware if it points to that.
4.11.0-rc5-00011-g08e4e0d oops in mpt3sas driver
G'day All, This is a vaguely current git head kernel compiled yesterday. Oopsed and rebooted itself, and then oopsed and rebooted again. There was no sign of a raid rebuild in the kernel logs, and it's a staging machine so there is nothing running after a reboot that goes near these disks. They should have been completely idle the second time around. This box suffered from bad rcu stalls on 4.10.x stable kernels, so I upgraded to git head. It's all new hardware (the CPU, Chipset and board), so I expected some issues with the board, but the LSI cards have been around for a while now. I'm pretty sure this is the second time it has done this, but the first where I had netconsole set up to capture it. It is hard to hit and I've had 2 instances in about 5 days of very heavy I/O loads. Having said that, the second oops last night happened when idle, so I'm now out of ideas. The only thing that might have been hitting these disks is smartmontools routine polling, but inspecting the logs it appears the reboot happened well between polls. No sign of anything in the logs, but got it on netconsole. System is a new AMD Ryzen. Controllers are : 27:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2116 PCI-Express Fusion-MPT SAS-2 [Meteor] (rev 02) 28:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03) Only the 2116 has drives on it, and it has 8 6TB WD Red disks on SATA. Nothing else. The 2008 is just sitting taking up a slot. Kernel is tainted by an out of tree build of the it87 driver, however nothing polls or uses that driver unless I'm at the console manually checking. This machine originally oopsed after some heavy load and rebooted : [49580.533852] BUG: unable to handle kernel paging request at 817cddfe [49580.533875] IP: queued_spin_lock_slowpath+0xe7/0x170 Tree is dirty due to this md patch : diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index c523fd69a7bc..2eb45d57226c 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -3618,8 +3618,9 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s, BUG_ON(test_bit(R5_Wantread, >flags)); BUG_ON(sh->batch_head); if ((s->uptodate == disks - 1) && + ((sh->qd_idx >= 0 && sh->pd_idx == disk_idx) || (s->failed && (disk_idx == s->failed_num[0] || - disk_idx == s->failed_num[1]))) { + disk_idx == s->failed_num[1] { /* have disk failed, and we're requested to fetch it; * do compute it */ [ 4433.138683] BUG: unable to handle kernel paging request at 9807cd621d60 [ 4433.138802] IP: _scsih_set_satl_pending+0x0/0x50 [mpt3sas] [ 4433.138875] PGD 0 [ 4433.138875] [ 4433.138928] Oops: [#1] SMP [ 4433.138976] Modules linked in: deflate zlib_deflate ctr des_generic cbc cmac sha1_generic md5 hmac af_key xfrm_algo nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace sunrpc bonding sha256_generic dm_crypt aesni_intel aes_x86_64 crypto_simd cryptd glue_helper it87(O) hwmon_vid netconsole configfs vhost_net vhost kvm_amd kvm irqbypass usbhid usb_storage nouveau video drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea ttm drm mxm_wmi xhci_pci i2c_piix4 xhci_hcd usbcore usb_common wmi acpi_cpufreq mpt3sas raid_class scsi_transport_sas igb i2c_algo_bit ahci libahci [ 4433.139831] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 4.11.0-rc5-00011-g08e4e0d-dirty #39 [ 4433.140032] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 0515 03/30/2017 [ 4433.140164] task: 8180e4c0 task.stack: 8180 [ 4433.140257] RIP: 0010:_scsih_set_satl_pending+0x0/0x50 [mpt3sas] [ 4433.140349] RSP: 0018:88081ec03da0 EFLAGS: 00010046 [ 4433.140429] RAX: 0001 RBX: 8807f66585b0 RCX: [ 4433.140540] RDX: RSI: RDI: 9807cd621d30 [ 4433.140651] RBP: 9807cd621d30 R08: 0001 R09: 1594 [ 4433.140762] R10: R11: 0018 R12: [ 4433.140872] R13: 8807f5c10bd0 R14: 0048 R15: 0048 [ 4433.140984] FS: () GS:88081ec0() knlGS: [ 4433.141109] CS: 0010 DS: ES: CR0: 80050033 [ 4433.141198] CR2: 9807cd621d60 CR3: 0007da667000 CR4: 003406f0 [ 4433.141309] Call Trace: [ 4433.141399] [ 4433.141416] ? _scsih_io_done+0xa2/0xa30 [mpt3sas] [ 4433.141494] ? load_balance+0x152/0x870 [ 4433.141549] ? _base_interrupt+0x116/0xa30 [mpt3sas] [ 4433.141627] ? del_timer+0x60/0x60 [ 4433.141680] ? __handle_irq_event_percpu+0x51/0x1c0 [ 4433.141752] ?
Re: [PATCH] Make checking the scsi_device_get() return value mandatory
On Thu, 2017-04-06 at 08:27 +0800, kbuild test robot wrote: > All warnings (new ones prefixed by >>): > >drivers//scsi/osd/osd_uld.c: In function 'osd_probe': > > > drivers//scsi/osd/osd_uld.c:467:2: warning: ignoring return value of > > > 'scsi_device_get', declared with attribute warn_unused_result > > > [-Wunused-result] > > scsi_device_get(scsi_device); > ^~~~ Please ignore this warning. It is triggered because patch "osd_uld: Check scsi_device_get() return value" is not yet upstream. Bart.
Re: [PATCH] Make checking the scsi_device_get() return value mandatory
Hi Bart, [auto build test WARNING on scsi/for-next] [also build test WARNING on v4.11-rc5 next-20170405] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Bart-Van-Assche/Make-checking-the-scsi_device_get-return-value-mandatory/20170406-072137 base: https://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git for-next config: x86_64-allyesdebian (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 All warnings (new ones prefixed by >>): drivers//scsi/osd/osd_uld.c: In function 'osd_probe': >> drivers//scsi/osd/osd_uld.c:467:2: warning: ignoring return value of >> 'scsi_device_get', declared with attribute warn_unused_result >> [-Wunused-result] scsi_device_get(scsi_device); ^~~~ vim +/scsi_device_get +467 drivers//scsi/osd/osd_uld.c 95b05a7db Boaz Harrosh 2009-01-25 451 95b05a7db Boaz Harrosh 2009-01-25 452 /* allocate a disk and set it up */ 95b05a7db Boaz Harrosh 2009-01-25 453 /* FIXME: do we need this since sg has already done that */ 95b05a7db Boaz Harrosh 2009-01-25 454 disk = alloc_disk(1); 95b05a7db Boaz Harrosh 2009-01-25 455 if (!disk) { 95b05a7db Boaz Harrosh 2009-01-25 456 OSD_ERR("alloc_disk failed\n"); 95b05a7db Boaz Harrosh 2009-01-25 457 goto err_free_osd; 95b05a7db Boaz Harrosh 2009-01-25 458 } 95b05a7db Boaz Harrosh 2009-01-25 459 disk->major = SCSI_OSD_MAJOR; 95b05a7db Boaz Harrosh 2009-01-25 460 disk->first_minor = oud->minor; 95b05a7db Boaz Harrosh 2009-01-25 461 sprintf(disk->disk_name, "osd%d", oud->minor); 95b05a7db Boaz Harrosh 2009-01-25 462 oud->disk = disk; 95b05a7db Boaz Harrosh 2009-01-25 463 95b05a7db Boaz Harrosh 2009-01-25 464 /* hold one more reference to the scsi_device that will get released 95b05a7db Boaz Harrosh 2009-01-25 465 * in __release, in case a logout is happening while fs is mounted 95b05a7db Boaz Harrosh 2009-01-25 466 */ 95b05a7db Boaz Harrosh 2009-01-25 @467 scsi_device_get(scsi_device); 95b05a7db Boaz Harrosh 2009-01-25 468 osd_dev_init(>od, scsi_device); 95b05a7db Boaz Harrosh 2009-01-25 469 95b05a7db Boaz Harrosh 2009-01-25 470 /* Detect the OSD Version */ 95b05a7db Boaz Harrosh 2009-01-25 471 error = __detect_osd(oud); 95b05a7db Boaz Harrosh 2009-01-25 472 if (error) { 95b05a7db Boaz Harrosh 2009-01-25 473 OSD_ERR("osd detection failed, non-compatible OSD device\n"); 95b05a7db Boaz Harrosh 2009-01-25 474 goto err_put_disk; 95b05a7db Boaz Harrosh 2009-01-25 475 } :: The code at line 467 was first introduced by commit :: 95b05a7db5865855c32e0bb8b244c3a7aac1cfeb [SCSI] osd_uld: OSD scsi ULD :: TO: Boaz Harrosh <bharr...@panasas.com> :: CC: James Bottomley <james.bottom...@hansenpartnership.com> --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: application/gzip
[PATCH 21/24] scsi: Lock down the eata driver
When the kernel is running in secure boot mode, we lock down the kernel to prevent userspace from modifying the running kernel image. Whilst this includes prohibiting access to things like /dev/mem, it must also prevent access by means of configuring driver modules in such a way as to cause a device to access or modify the kernel image. The eata driver takes a single string parameter that contains a slew of settings, including hardware resource configuration. Prohibit use of the parameter if the kernel is locked down. Suggested-by: Alan CoxSigned-off-by: David Howells cc: Dario Ballabio cc: "James E.J. Bottomley" cc: "Martin K. Petersen" cc: linux-scsi@vger.kernel.org --- drivers/scsi/eata.c |7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/eata.c b/drivers/scsi/eata.c index 227dd2c2ec2f..5c036d10c18b 100644 --- a/drivers/scsi/eata.c +++ b/drivers/scsi/eata.c @@ -1552,8 +1552,13 @@ static int eata2x_detect(struct scsi_host_template *tpnt) tpnt->proc_name = "eata2x"; - if (strlen(boot_options)) + if (strlen(boot_options)) { + if (kernel_is_locked_down()) { + pr_err("Command line-specified device addresses, irqs and dma channels are not permitted when the kernel is locked down\n"); + return -EPERM; + } option_setup(boot_options); + } #if defined(MODULE) /* io_port could have been modified when loading as a module */
[PATCH v3 2/2] scsi: storvsc: Add support for FC rport.
Included in the current storvsc driver for Hyper-V is the ability to access luns on an FC fabric via a virtualized fiber channel adapter exposed by the Hyper-V host. The driver also attaches to the FC transport to allow host and port names to be published under /sys/class/fc_host/hostX. Current customer tools running on the VM require that these names be available in the well known standard location under fc_host/hostX. This patch stubs in an rport per fc_host and sets its rport role as FC_PORT_ROLE_FCP_DUMMY_INITIATOR to indicate to the fc_transport that it is a pseudo rport in order to scan the scsi stack via echo "- - -" > /sys/class/scsi_host/hostX/scan. Signed-off-by: Cathy Avery--- drivers/scsi/storvsc_drv.c | 23 ++- 1 file changed, 18 insertions(+), 5 deletions(-) diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c index 016639d..1ec8579 100644 --- a/drivers/scsi/storvsc_drv.c +++ b/drivers/scsi/storvsc_drv.c @@ -476,6 +476,9 @@ struct storvsc_device { */ u64 node_name; u64 port_name; +#if IS_ENABLED(CONFIG_SCSI_FC_ATTRS) + struct fc_rport *rport; +#endif }; struct hv_host_device { @@ -1823,19 +1826,27 @@ static int storvsc_probe(struct hv_device *device, target = (device->dev_instance.b[5] << 8 | device->dev_instance.b[4]); ret = scsi_add_device(host, 0, target, 0); - if (ret) { - scsi_remove_host(host); - goto err_out2; - } + if (ret) + goto err_out3; } #if IS_ENABLED(CONFIG_SCSI_FC_ATTRS) if (host->transportt == fc_transport_template) { + struct fc_rport_identifiers ids = { + .roles = FC_PORT_ROLE_FCP_DUMMY_INITIATOR, + }; + fc_host_node_name(host) = stor_device->node_name; fc_host_port_name(host) = stor_device->port_name; + stor_device->rport = fc_remote_port_add(host, 0, ); + if (!stor_device->rport) + goto err_out3; } #endif return 0; +err_out3: + scsi_remove_host(host); + err_out2: /* * Once we have connected with the host, we would need to @@ -1861,8 +1872,10 @@ static int storvsc_remove(struct hv_device *dev) struct Scsi_Host *host = stor_device->host; #if IS_ENABLED(CONFIG_SCSI_FC_ATTRS) - if (host->transportt == fc_transport_template) + if (host->transportt == fc_transport_template) { + fc_remote_port_delete(stor_device->rport); fc_remove_host(host); + } #endif scsi_remove_host(host); storvsc_dev_remove(dev); -- 2.5.0
[PATCH v3 1/2] scsi: scsi_transport_fc: Add dummy initiator role to rport
This patch allows scsi drivers that expose virturalized fibre channel devices but that do not expose rports to successfully rescan the scsi bus via echo "- - -" > /sys/class/scsi_host/hostX/scan. Drivers can create a pseudo rport and indicate FC_PORT_ROLE_FCP_DUMMY_INITIATOR as the rport's role in fc_rport_identifiers. This insures that a valid scsi_target_id is assigned to the newly created rport and it can meet the requirements of fc_user_scan_tgt calling scsi_scan_target. Signed-off-by: Cathy Avery--- drivers/scsi/scsi_transport_fc.c | 10 ++ include/scsi/scsi_transport_fc.h | 1 + 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c index 2d753c9..de85602 100644 --- a/drivers/scsi/scsi_transport_fc.c +++ b/drivers/scsi/scsi_transport_fc.c @@ -289,9 +289,10 @@ static const struct { u32 value; char*name; } fc_port_role_names[] = { - { FC_PORT_ROLE_FCP_TARGET, "FCP Target" }, - { FC_PORT_ROLE_FCP_INITIATOR, "FCP Initiator" }, - { FC_PORT_ROLE_IP_PORT, "IP Port" }, + { FC_PORT_ROLE_FCP_TARGET, "FCP Target" }, + { FC_PORT_ROLE_FCP_INITIATOR, "FCP Initiator" }, + { FC_PORT_ROLE_IP_PORT, "IP Port" }, + { FC_PORT_ROLE_FCP_DUMMY_INITIATOR, "FCP Dummy Initiator" }, }; fc_bitfield_name_search(port_roles, fc_port_role_names) @@ -2628,7 +2629,8 @@ fc_remote_port_create(struct Scsi_Host *shost, int channel, spin_lock_irqsave(shost->host_lock, flags); rport->number = fc_host->next_rport_number++; - if (rport->roles & FC_PORT_ROLE_FCP_TARGET) + if ((rport->roles & FC_PORT_ROLE_FCP_TARGET) || + (rport->roles & FC_PORT_ROLE_FCP_DUMMY_INITIATOR)) rport->scsi_target_id = fc_host->next_target_id++; else rport->scsi_target_id = -1; diff --git a/include/scsi/scsi_transport_fc.h b/include/scsi/scsi_transport_fc.h index b21b8aa5..6e208bb 100644 --- a/include/scsi/scsi_transport_fc.h +++ b/include/scsi/scsi_transport_fc.h @@ -162,6 +162,7 @@ enum fc_tgtid_binding_type { #define FC_PORT_ROLE_FCP_TARGET0x01 #define FC_PORT_ROLE_FCP_INITIATOR 0x02 #define FC_PORT_ROLE_IP_PORT 0x04 +#define FC_PORT_ROLE_FCP_DUMMY_INITIATOR 0x08 /* The following are for compatibility */ #define FC_RPORT_ROLE_UNKNOWN FC_PORT_ROLE_UNKNOWN -- 2.5.0
[PATCH v3 0/2] scsi: storvsc: Add support for FC rport
The updated patch set provides a way for drivers ( specifically storvsc in this case ) that expose virturalized fc devices but that do not expose rports to be manually scanned. This is done via creating a pseudo rport in storvsc and a corresponding dummy initiator rport role in the fc transport. Changes since v2: - Additional patch adding FC_PORT_ROLE_FCP_DUMMY_INITIATOR role to fc_transport - Changed storvsc rport role to FC_PORT_ROLE_FCP_DUMMY_INITIATOR Changes since v1: - Fix fc_rport_identifiers init [Stephen Hemminger] - Better error checking Cathy Avery (2): scsi: scsi_transport_fc: Add dummy initiator role to rport scsi: storvsc: Add support for FC rport. drivers/scsi/scsi_transport_fc.c | 10 ++ drivers/scsi/storvsc_drv.c | 23 ++- include/scsi/scsi_transport_fc.h | 1 + 3 files changed, 25 insertions(+), 9 deletions(-) -- 2.5.0
Re: ->retries fixups V2
On 04/05/2017 12:16 PM, Christoph Hellwig wrote: > On Wed, Apr 05, 2017 at 12:06:53PM -0600, Jens Axboe wrote: >> On Wed, Apr 05 2017, Christoph Hellwig wrote: >>> This series fixes a few lose bits in terms of how nvme uses ->retries, >>> including fixing it for non-PCIe transports. While at it I noticed that >>> nvme and scsi use the field in entirely different ways, and no other >>> driver uses it at all. So I decided to move it into the nvme_request and >>> scsi_request structures instead. >>> >>> Changes since V1: >>> - better changelog for one patch >>> - move the new retries field to the end of struct nvme_request >> >> Applied for 4.12. If we do the below on my box, we remove the (now) 2 >> holes from struct request and shrink it 8 bytes. > > Looks good: > > Reviewed-by: Christoph HellwigThanks, added that too. -- Jens Axboe
Re: ->retries fixups V2
On Wed, Apr 05, 2017 at 12:06:53PM -0600, Jens Axboe wrote: > On Wed, Apr 05 2017, Christoph Hellwig wrote: > > This series fixes a few lose bits in terms of how nvme uses ->retries, > > including fixing it for non-PCIe transports. While at it I noticed that > > nvme and scsi use the field in entirely different ways, and no other > > driver uses it at all. So I decided to move it into the nvme_request and > > scsi_request structures instead. > > > > Changes since V1: > > - better changelog for one patch > > - move the new retries field to the end of struct nvme_request > > Applied for 4.12. If we do the below on my box, we remove the (now) 2 > holes from struct request and shrink it 8 bytes. Looks good: Reviewed-by: Christoph Hellwig
Re: [PATCH v2] scsi: sd: Fix capacity calculation with 32-bit sector_t
On Wed, 2017-04-05 at 06:15 -0400, Martin K. Petersen wrote: > We previously made sure that the reported disk capacity was less than > 0x blocks when the kernel was not compiled with large sector_t > support (CONFIG_LBDAF). However, this check assumed that the capacity > was reported in units of 512 bytes. > > Add a sanity check function to ensure that we only enable disks if the > entire reported capacity can be expressed in terms of sector_t. Reviewed-by: Bart Van Assche
Re: ->retries fixups V2
On Wed, Apr 05 2017, Christoph Hellwig wrote: > This series fixes a few lose bits in terms of how nvme uses ->retries, > including fixing it for non-PCIe transports. While at it I noticed that > nvme and scsi use the field in entirely different ways, and no other > driver uses it at all. So I decided to move it into the nvme_request and > scsi_request structures instead. > > Changes since V1: > - better changelog for one patch > - move the new retries field to the end of struct nvme_request Applied for 4.12. If we do the below on my box, we remove the (now) 2 holes from struct request and shrink it 8 bytes. diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index ce6f9a6534c9..3cf241b0814d 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -215,6 +215,8 @@ struct request { unsigned short ioprio; + unsigned int timeout; + void *special; /* opaque pointer available for LLD use */ int errors; @@ -223,7 +225,6 @@ struct request { unsigned long deadline; struct list_head timeout_list; - unsigned int timeout; /* * completion callback. -- Jens Axboe
Re: [PATCH 2/5] nvme: cleanup nvme_req_needs_retry
Reviewed-by: Sagi Grimberg
Re: [PATCH 4/5] nvme: move the retries count to struct nvme_request
Reviewed-by: Sagi Grimberg
Re: [PATCH 3/5] nvme: mark nvme_max_retries static
Reviewed-by: Sagi Grimberg
Re: [PATCH 1/5] nvme: move ->retries setup to nvme_setup_cmd
Reviewed-by: Sagi Grimberg
Re: [PATCH 5/5] block, scsi: move the retries field to struct scsi_request
Reviewed-by: Sagi Grimberg
[PATCH 02/27] block: renumber REQ_OP_WRITE_ZEROES
Make life easy for implementations that needs to send a data buffer to the device (e.g. SCSI) by numbering it as a data out command. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- include/linux/blk_types.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 67bcf8a5326e..4eae30bfbfca 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -168,7 +168,7 @@ enum req_opf { /* write the same sector many times */ REQ_OP_WRITE_SAME = 7, /* write the zero filled sector many times */ - REQ_OP_WRITE_ZEROES = 8, + REQ_OP_WRITE_ZEROES = 9, /* SCSI passthrough using struct scsi_request */ REQ_OP_SCSI_IN = 32, -- 2.11.0
[PATCH 03/27] block: implement splitting of REQ_OP_WRITE_ZEROES bios
Copy and past the REQ_OP_WRITE_SAME code to prepare to implementations that limit the write zeroes size. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- block/blk-merge.c | 17 +++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/block/blk-merge.c b/block/blk-merge.c index 2afa262425d1..3990ae406341 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -54,6 +54,20 @@ static struct bio *blk_bio_discard_split(struct request_queue *q, return bio_split(bio, split_sectors, GFP_NOIO, bs); } +static struct bio *blk_bio_write_zeroes_split(struct request_queue *q, + struct bio *bio, struct bio_set *bs, unsigned *nsegs) +{ + *nsegs = 1; + + if (!q->limits.max_write_zeroes_sectors) + return NULL; + + if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors) + return NULL; + + return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs); +} + static struct bio *blk_bio_write_same_split(struct request_queue *q, struct bio *bio, struct bio_set *bs, @@ -200,8 +214,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio, split = blk_bio_discard_split(q, *bio, bs, ); break; case REQ_OP_WRITE_ZEROES: - split = NULL; - nsegs = (*bio)->bi_phys_segments; + split = blk_bio_write_zeroes_split(q, *bio, bs, ); break; case REQ_OP_WRITE_SAME: split = blk_bio_write_same_split(q, *bio, bs, ); -- 2.11.0
always use REQ_OP_WRITE_ZEROES for zeroing offload V2
This series makes REQ_OP_WRITE_ZEROES the only zeroing offload supported by the block layer, and switches existing implementations of REQ_OP_DISCARD that correctly set discard_zeroes_data to it, removes incorrect discard_zeroes_data, and also switches WRITE SAME based zeroing in SCSI to this new method. The series is against the block for-next tree. A git tree is also avaiable at: git://git.infradead.org/users/hch/block.git discard-rework.2 Gitweb: http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/discard-rework.2 Changes since V2: - various spelling fixes - various reviews captured - two new patches from Martin at the end
[PATCH 01/27] sd: split sd_setup_discard_cmnd
Split sd_setup_discard_cmnd into one function per provisioning type. While this creates some very slight duplication of boilerplate code it keeps the code modular for additions of new provisioning types, and for reusing the write same functions for the upcoming scsi implementation of the Write Zeroes operation. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- drivers/scsi/sd.c | 153 ++ 1 file changed, 84 insertions(+), 69 deletions(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index fcfeddc79331..b853f91fb3da 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -701,93 +701,97 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode) queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q); } -/** - * sd_setup_discard_cmnd - unmap blocks on thinly provisioned device - * @sdp: scsi device to operate on - * @rq: Request to prepare - * - * Will issue either UNMAP or WRITE SAME(16) depending on preference - * indicated by target device. - **/ -static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd) +static int sd_setup_unmap_cmnd(struct scsi_cmnd *cmd) { - struct request *rq = cmd->request; struct scsi_device *sdp = cmd->device; - struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); - sector_t sector = blk_rq_pos(rq); - unsigned int nr_sectors = blk_rq_sectors(rq); - unsigned int len; - int ret; + struct request *rq = cmd->request; + u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9); + u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9); + unsigned int data_len = 24; char *buf; - struct page *page; - sector >>= ilog2(sdp->sector_size) - 9; - nr_sectors >>= ilog2(sdp->sector_size) - 9; - - page = alloc_page(GFP_ATOMIC | __GFP_ZERO); - if (!page) + rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO); + if (!rq->special_vec.bv_page) return BLKPREP_DEFER; + rq->special_vec.bv_offset = 0; + rq->special_vec.bv_len = data_len; + rq->rq_flags |= RQF_SPECIAL_PAYLOAD; - switch (sdkp->provisioning_mode) { - case SD_LBP_UNMAP: - buf = page_address(page); - - cmd->cmd_len = 10; - cmd->cmnd[0] = UNMAP; - cmd->cmnd[8] = 24; - - put_unaligned_be16(6 + 16, [0]); - put_unaligned_be16(16, [2]); - put_unaligned_be64(sector, [8]); - put_unaligned_be32(nr_sectors, [16]); + cmd->cmd_len = 10; + cmd->cmnd[0] = UNMAP; + cmd->cmnd[8] = 24; - len = 24; - break; + buf = page_address(rq->special_vec.bv_page); + put_unaligned_be16(6 + 16, [0]); + put_unaligned_be16(16, [2]); + put_unaligned_be64(sector, [8]); + put_unaligned_be32(nr_sectors, [16]); - case SD_LBP_WS16: - cmd->cmd_len = 16; - cmd->cmnd[0] = WRITE_SAME_16; - cmd->cmnd[1] = 0x8; /* UNMAP */ - put_unaligned_be64(sector, >cmnd[2]); - put_unaligned_be32(nr_sectors, >cmnd[10]); + cmd->allowed = SD_MAX_RETRIES; + cmd->transfersize = data_len; + rq->timeout = SD_TIMEOUT; + scsi_req(rq)->resid_len = data_len; - len = sdkp->device->sector_size; - break; + return scsi_init_io(cmd); +} - case SD_LBP_WS10: - case SD_LBP_ZERO: - cmd->cmd_len = 10; - cmd->cmnd[0] = WRITE_SAME; - if (sdkp->provisioning_mode == SD_LBP_WS10) - cmd->cmnd[1] = 0x8; /* UNMAP */ - put_unaligned_be32(sector, >cmnd[2]); - put_unaligned_be16(nr_sectors, >cmnd[7]); +static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd) +{ + struct scsi_device *sdp = cmd->device; + struct request *rq = cmd->request; + u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9); + u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9); + u32 data_len = sdp->sector_size; - len = sdkp->device->sector_size; - break; + rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO); + if (!rq->special_vec.bv_page) + return BLKPREP_DEFER; + rq->special_vec.bv_offset = 0; + rq->special_vec.bv_len = data_len; + rq->rq_flags |= RQF_SPECIAL_PAYLOAD; - default: - ret = BLKPREP_INVALID; - goto out; - } + cmd->cmd_len = 16; + cmd->cmnd[0] = WRITE_SAME_16; + cmd->cmnd[1] = 0x8; /* UNMAP */ + put_unaligned_be64(sector, >cmnd[2]); + put_unaligned_be32(nr_sectors, >cmnd[10]); + cmd->allowed = SD_MAX_RETRIES; +
[PATCH 07/27] dm: support REQ_OP_WRITE_ZEROES
Copy & paste from the REQ_OP_WRITE_SAME code. Signed-off-by: Christoph HellwigReviewed-by: Hannes Reinecke --- drivers/md/dm-core.h | 1 + drivers/md/dm-io.c| 8 ++-- drivers/md/dm-linear.c| 1 + drivers/md/dm-mpath.c | 1 + drivers/md/dm-rq.c| 11 --- drivers/md/dm-stripe.c| 2 ++ drivers/md/dm-table.c | 30 ++ drivers/md/dm.c | 31 --- include/linux/device-mapper.h | 6 ++ 9 files changed, 83 insertions(+), 8 deletions(-) diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h index 136fda3ff9e5..fea5bd52ada8 100644 --- a/drivers/md/dm-core.h +++ b/drivers/md/dm-core.h @@ -132,6 +132,7 @@ void dm_init_md_queue(struct mapped_device *md); void dm_init_normal_md_queue(struct mapped_device *md); int md_in_flight(struct mapped_device *md); void disable_write_same(struct mapped_device *md); +void disable_write_zeroes(struct mapped_device *md); static inline struct completion *dm_get_completion_from_kobject(struct kobject *kobj) { diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c index b808cbe22678..3702e502466d 100644 --- a/drivers/md/dm-io.c +++ b/drivers/md/dm-io.c @@ -312,9 +312,12 @@ static void do_region(int op, int op_flags, unsigned region, */ if (op == REQ_OP_DISCARD) special_cmd_max_sectors = q->limits.max_discard_sectors; + else if (op == REQ_OP_WRITE_ZEROES) + special_cmd_max_sectors = q->limits.max_write_zeroes_sectors; else if (op == REQ_OP_WRITE_SAME) special_cmd_max_sectors = q->limits.max_write_same_sectors; - if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_SAME) && + if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES || +op == REQ_OP_WRITE_SAME) && special_cmd_max_sectors == 0) { dec_count(io, region, -EOPNOTSUPP); return; @@ -330,6 +333,7 @@ static void do_region(int op, int op_flags, unsigned region, */ switch (op) { case REQ_OP_DISCARD: + case REQ_OP_WRITE_ZEROES: num_bvecs = 0; break; case REQ_OP_WRITE_SAME: @@ -347,7 +351,7 @@ static void do_region(int op, int op_flags, unsigned region, bio_set_op_attrs(bio, op, op_flags); store_io_and_region_in_bio(bio, io, region); - if (op == REQ_OP_DISCARD) { + if (op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES) { num_sectors = min_t(sector_t, special_cmd_max_sectors, remaining); bio->bi_iter.bi_size = num_sectors << SECTOR_SHIFT; remaining -= num_sectors; diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c index 4788b0b989a9..e17fd44ceef5 100644 --- a/drivers/md/dm-linear.c +++ b/drivers/md/dm-linear.c @@ -59,6 +59,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv) ti->num_flush_bios = 1; ti->num_discard_bios = 1; ti->num_write_same_bios = 1; + ti->num_write_zeroes_bios = 1; ti->private = lc; return 0; diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c index 7f223dbed49f..ab55955ed704 100644 --- a/drivers/md/dm-mpath.c +++ b/drivers/md/dm-mpath.c @@ -1103,6 +1103,7 @@ static int multipath_ctr(struct dm_target *ti, unsigned argc, char **argv) ti->num_flush_bios = 1; ti->num_discard_bios = 1; ti->num_write_same_bios = 1; + ti->num_write_zeroes_bios = 1; if (m->queue_mode == DM_TYPE_BIO_BASED) ti->per_io_data_size = multipath_per_bio_data_size(); else diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c index 6886bf160fb2..a789bf035621 100644 --- a/drivers/md/dm-rq.c +++ b/drivers/md/dm-rq.c @@ -298,9 +298,14 @@ static void dm_done(struct request *clone, int error, bool mapped) r = rq_end_io(tio->ti, clone, error, >info); } - if (unlikely(r == -EREMOTEIO && (req_op(clone) == REQ_OP_WRITE_SAME) && -!clone->q->limits.max_write_same_sectors)) - disable_write_same(tio->md); + if (unlikely(r == -EREMOTEIO)) { + if (req_op(clone) == REQ_OP_WRITE_SAME && + !clone->q->limits.max_write_same_sectors) + disable_write_same(tio->md); + if (req_op(clone) == REQ_OP_WRITE_ZEROES && + !clone->q->limits.max_write_zeroes_sectors) + disable_write_zeroes(tio->md); + } if (r <= 0) /* The target wants to complete the I/O */ diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c index 28193a57bf47..5ef49c121d99 100644 --- a/drivers/md/dm-stripe.c +++ b/drivers/md/dm-stripe.c @@ -169,6
[PATCH 04/27] sd: implement REQ_OP_WRITE_ZEROES
Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- drivers/scsi/sd.c | 31 ++- drivers/scsi/sd_zbc.c | 1 + 2 files changed, 27 insertions(+), 5 deletions(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index b853f91fb3da..d8d9c0bdd93c 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -735,7 +735,7 @@ static int sd_setup_unmap_cmnd(struct scsi_cmnd *cmd) return scsi_init_io(cmd); } -static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd) +static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd, bool unmap) { struct scsi_device *sdp = cmd->device; struct request *rq = cmd->request; @@ -752,13 +752,14 @@ static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd) cmd->cmd_len = 16; cmd->cmnd[0] = WRITE_SAME_16; - cmd->cmnd[1] = 0x8; /* UNMAP */ + if (unmap) + cmd->cmnd[1] = 0x8; /* UNMAP */ put_unaligned_be64(sector, >cmnd[2]); put_unaligned_be32(nr_sectors, >cmnd[10]); cmd->allowed = SD_MAX_RETRIES; cmd->transfersize = data_len; - rq->timeout = SD_TIMEOUT; + rq->timeout = unmap ? SD_TIMEOUT : SD_WRITE_SAME_TIMEOUT; scsi_req(rq)->resid_len = data_len; return scsi_init_io(cmd); @@ -788,12 +789,27 @@ static int sd_setup_write_same10_cmnd(struct scsi_cmnd *cmd, bool unmap) cmd->allowed = SD_MAX_RETRIES; cmd->transfersize = data_len; - rq->timeout = SD_TIMEOUT; + rq->timeout = unmap ? SD_TIMEOUT : SD_WRITE_SAME_TIMEOUT; scsi_req(rq)->resid_len = data_len; return scsi_init_io(cmd); } +static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd *cmd) +{ + struct request *rq = cmd->request; + struct scsi_device *sdp = cmd->device; + struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); + u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9); + u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9); + + if (sdp->no_write_same) + return BLKPREP_INVALID; + if (sdkp->ws16 || sector > 0x || nr_sectors > 0x) + return sd_setup_write_same16_cmnd(cmd, false); + return sd_setup_write_same10_cmnd(cmd, false); +} + static void sd_config_write_same(struct scsi_disk *sdkp) { struct request_queue *q = sdkp->disk->queue; @@ -823,6 +839,8 @@ static void sd_config_write_same(struct scsi_disk *sdkp) out: blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks * (logical_block_size >> 9)); + blk_queue_max_write_zeroes_sectors(q, sdkp->max_ws_blocks * +(logical_block_size >> 9)); } /** @@ -1163,7 +1181,7 @@ static int sd_init_command(struct scsi_cmnd *cmd) case SD_LBP_UNMAP: return sd_setup_unmap_cmnd(cmd); case SD_LBP_WS16: - return sd_setup_write_same16_cmnd(cmd); + return sd_setup_write_same16_cmnd(cmd, true); case SD_LBP_WS10: return sd_setup_write_same10_cmnd(cmd, true); case SD_LBP_ZERO: @@ -1171,6 +1189,8 @@ static int sd_init_command(struct scsi_cmnd *cmd) default: return BLKPREP_INVALID; } + case REQ_OP_WRITE_ZEROES: + return sd_setup_write_zeroes_cmnd(cmd); case REQ_OP_WRITE_SAME: return sd_setup_write_same_cmnd(cmd); case REQ_OP_FLUSH: @@ -1810,6 +1830,7 @@ static int sd_done(struct scsi_cmnd *SCpnt) switch (req_op(req)) { case REQ_OP_DISCARD: + case REQ_OP_WRITE_ZEROES: case REQ_OP_WRITE_SAME: case REQ_OP_ZONE_RESET: if (!result) { diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c index 92620c8ea8ad..1994f7799fce 100644 --- a/drivers/scsi/sd_zbc.c +++ b/drivers/scsi/sd_zbc.c @@ -329,6 +329,7 @@ void sd_zbc_complete(struct scsi_cmnd *cmd, switch (req_op(rq)) { case REQ_OP_WRITE: + case REQ_OP_WRITE_ZEROES: case REQ_OP_WRITE_SAME: case REQ_OP_ZONE_RESET: -- 2.11.0
[PATCH 08/27] dm kcopyd: switch to use REQ_OP_WRITE_ZEROES
It seems like the code currently passes whatever it was using for writes to WRITE SAME. Just switch it to WRITE ZEROES, although that doesn't need any payload. Untested, and confused by the code, maybe someone who understands it better than me can help.. Signed-off-by: Christoph HellwigReviewed-by: Hannes Reinecke --- drivers/md/dm-kcopyd.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c index 9e9d04cb7d51..f85846741d50 100644 --- a/drivers/md/dm-kcopyd.c +++ b/drivers/md/dm-kcopyd.c @@ -733,11 +733,11 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from, job->pages = _page_list; /* -* Use WRITE SAME to optimize zeroing if all dests support it. +* Use WRITE ZEROES to optimize zeroing if all dests support it. */ - job->rw = REQ_OP_WRITE_SAME; + job->rw = REQ_OP_WRITE_ZEROES; for (i = 0; i < job->num_dests; i++) - if (!bdev_write_same(job->dests[i].bdev)) { + if (!bdev_write_zeroes_sectors(job->dests[i].bdev)) { job->rw = WRITE; break; } -- 2.11.0
Re: [PATCH v2] scsi: ses: don't get power status of SES device slot on probe
On 04/05/2017 01:23 PM, Song Liu wrote: Reviewed-by: Song LiuThanks for reviewing, Song Liu. It's good to know this patch doesn't break anything for you. cheers, -- Mauricio Faria de Oliveira IBM Linux Technology Center
[PATCH 24/27] drbd: implement REQ_OP_WRITE_ZEROES
It seems like DRBD assumes its on the wire TRIM request always zeroes data. Use that fact to implement REQ_OP_WRITE_ZEROES. Signed-off-by: Christoph HellwigReviewed-by: Hannes Reinecke --- drivers/block/drbd/drbd_main.c | 3 ++- drivers/block/drbd/drbd_nl.c | 2 ++ drivers/block/drbd/drbd_receiver.c | 6 +++--- drivers/block/drbd/drbd_req.c | 7 +-- drivers/block/drbd/drbd_worker.c | 4 +++- 5 files changed, 15 insertions(+), 7 deletions(-) diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c index 92c60cbd04ee..8e62d9f65510 100644 --- a/drivers/block/drbd/drbd_main.c +++ b/drivers/block/drbd/drbd_main.c @@ -1668,7 +1668,8 @@ static u32 bio_flags_to_wire(struct drbd_connection *connection, (bio->bi_opf & REQ_FUA ? DP_FUA : 0) | (bio->bi_opf & REQ_PREFLUSH ? DP_FLUSH : 0) | (bio_op(bio) == REQ_OP_WRITE_SAME ? DP_WSAME : 0) | - (bio_op(bio) == REQ_OP_DISCARD ? DP_DISCARD : 0); + (bio_op(bio) == REQ_OP_DISCARD ? DP_DISCARD : 0) | + (bio_op(bio) == REQ_OP_WRITE_ZEROES ? DP_DISCARD : 0); else return bio->bi_opf & REQ_SYNC ? DP_RW_SYNC : 0; } diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c index 908c704e20aa..e4516d3b971d 100644 --- a/drivers/block/drbd/drbd_nl.c +++ b/drivers/block/drbd/drbd_nl.c @@ -1217,10 +1217,12 @@ static void decide_on_discard_support(struct drbd_device *device, blk_queue_discard_granularity(q, 512); q->limits.max_discard_sectors = drbd_max_discard_sectors(connection); queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q); + q->limits.max_write_zeroes_sectors = drbd_max_discard_sectors(connection); } else { queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q); blk_queue_discard_granularity(q, 0); q->limits.max_discard_sectors = 0; + q->limits.max_write_zeroes_sectors = 0; } } diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c index bc1d296581f9..1b0a2be24f39 100644 --- a/drivers/block/drbd/drbd_receiver.c +++ b/drivers/block/drbd/drbd_receiver.c @@ -2285,7 +2285,7 @@ static unsigned long wire_flags_to_bio_flags(u32 dpf) static unsigned long wire_flags_to_bio_op(u32 dpf) { if (dpf & DP_DISCARD) - return REQ_OP_DISCARD; + return REQ_OP_WRITE_ZEROES; else return REQ_OP_WRITE; } @@ -2476,7 +2476,7 @@ static int receive_Data(struct drbd_connection *connection, struct packet_info * op_flags = wire_flags_to_bio_flags(dp_flags); if (pi->cmd == P_TRIM) { D_ASSERT(peer_device, peer_req->i.size > 0); - D_ASSERT(peer_device, op == REQ_OP_DISCARD); + D_ASSERT(peer_device, op == REQ_OP_WRITE_ZEROES); D_ASSERT(peer_device, peer_req->pages == NULL); } else if (peer_req->pages == NULL) { D_ASSERT(device, peer_req->i.size == 0); @@ -4789,7 +4789,7 @@ static int receive_rs_deallocated(struct drbd_connection *connection, struct pac if (get_ldev(device)) { struct drbd_peer_request *peer_req; - const int op = REQ_OP_DISCARD; + const int op = REQ_OP_WRITE_ZEROES; peer_req = drbd_alloc_peer_req(peer_device, ID_SYNCER, sector, size, 0, GFP_NOIO); diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c index 6da9ea8c48b6..b5730e17b455 100644 --- a/drivers/block/drbd/drbd_req.c +++ b/drivers/block/drbd/drbd_req.c @@ -59,6 +59,7 @@ static struct drbd_request *drbd_req_new(struct drbd_device *device, struct bio drbd_req_make_private_bio(req, bio_src); req->rq_state = (bio_data_dir(bio_src) == WRITE ? RQ_WRITE : 0) | (bio_op(bio_src) == REQ_OP_WRITE_SAME ? RQ_WSAME : 0) + | (bio_op(bio_src) == REQ_OP_WRITE_ZEROES ? RQ_UNMAP : 0) | (bio_op(bio_src) == REQ_OP_DISCARD ? RQ_UNMAP : 0); req->device = device; req->master_bio = bio_src; @@ -1180,7 +1181,8 @@ drbd_submit_req_private_bio(struct drbd_request *req) if (get_ldev(device)) { if (drbd_insert_fault(device, type)) bio_io_error(bio); - else if (bio_op(bio) == REQ_OP_DISCARD) + else if (bio_op(bio) == REQ_OP_WRITE_ZEROES || +bio_op(bio) == REQ_OP_DISCARD) drbd_process_discard_req(req); else generic_make_request(bio); @@ -1234,7 +1236,8 @@ drbd_request_prepare(struct drbd_device *device, struct bio *bio, unsigned long _drbd_start_io_acct(device, req); /* process
[PATCH 16/27] zram: implement REQ_OP_WRITE_ZEROES
Just the same as discard if the block size equals the system page size. Signed-off-by: Christoph HellwigReviewed-by: Hannes Reinecke --- drivers/block/zram/zram_drv.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index dceb5edd1e54..1710b06f04a7 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -829,10 +829,14 @@ static void __zram_make_request(struct zram *zram, struct bio *bio) offset = (bio->bi_iter.bi_sector & (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT; - if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) { + switch (bio_op(bio)) { + case REQ_OP_DISCARD: + case REQ_OP_WRITE_ZEROES: zram_bio_discard(zram, index, offset, bio); bio_endio(bio); return; + default: + break; } bio_for_each_segment(bvec, bio, iter) { @@ -1192,6 +1196,8 @@ static int zram_add(void) zram->disk->queue->limits.max_sectors = SECTORS_PER_PAGE; zram->disk->queue->limits.chunk_sectors = 0; blk_queue_max_discard_sectors(zram->disk->queue, UINT_MAX); + queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue); + /* * zram_bio_discard() will clear all logical blocks if logical block * size is identical with physical block size(PAGE_SIZE). But if it is @@ -1201,10 +1207,7 @@ static int zram_add(void) * zeroed. */ if (ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE) - zram->disk->queue->limits.discard_zeroes_data = 1; - else - zram->disk->queue->limits.discard_zeroes_data = 0; - queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue); + blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX); add_disk(zram->disk); -- 2.11.0
[PATCH 13/27] block_dev: use blkdev_issue_zerout for hole punches
This gets us support for non-discard efficient write of zeroes (e.g. NVMe) and prepares for removing the discard_zeroes_data flag. Also remove a pointless discard support check, which is done in blkdev_issue_discard already. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- fs/block_dev.c | 10 ++ 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 2f704c3a816f..e405d8e58e31 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -2069,7 +2069,6 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len) { struct block_device *bdev = I_BDEV(bdev_file_inode(file)); - struct request_queue *q = bdev_get_queue(bdev); struct address_space *mapping; loff_t end = start + len - 1; loff_t isize; @@ -2108,15 +2107,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start, GFP_KERNEL, BLKDEV_ZERO_NOUNMAP); break; case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE: - /* Only punch if the device can do zeroing discard. */ - if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data) - return -EOPNOTSUPP; - error = blkdev_issue_discard(bdev, start >> 9, len >> 9, -GFP_KERNEL, 0); + error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, +GFP_KERNEL, BLKDEV_ZERO_NOFALLBACK); break; case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE: - if (!blk_queue_discard(q)) - return -EOPNOTSUPP; error = blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL, 0); break; -- 2.11.0
[PATCH 05/27] md: support REQ_OP_WRITE_ZEROES
Copy & paste from the REQ_OP_WRITE_SAME code. Signed-off-by: Christoph HellwigReviewed-by: Hannes Reinecke --- drivers/md/linear.c| 1 + drivers/md/md.h| 7 +++ drivers/md/multipath.c | 1 + drivers/md/raid0.c | 2 ++ drivers/md/raid1.c | 4 +++- drivers/md/raid10.c| 1 + drivers/md/raid5.c | 1 + 7 files changed, 16 insertions(+), 1 deletion(-) diff --git a/drivers/md/linear.c b/drivers/md/linear.c index 3e38e0207a3e..377a8a3672e3 100644 --- a/drivers/md/linear.c +++ b/drivers/md/linear.c @@ -293,6 +293,7 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio) split, disk_devt(mddev->gendisk), bio_sector); mddev_check_writesame(mddev, split); + mddev_check_write_zeroes(mddev, split); generic_make_request(split); } } while (split != bio); diff --git a/drivers/md/md.h b/drivers/md/md.h index dde8ecb760c8..1e76d64ce180 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -709,4 +709,11 @@ static inline void mddev_check_writesame(struct mddev *mddev, struct bio *bio) !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors) mddev->queue->limits.max_write_same_sectors = 0; } + +static inline void mddev_check_write_zeroes(struct mddev *mddev, struct bio *bio) +{ + if (bio_op(bio) == REQ_OP_WRITE_ZEROES && + !bdev_get_queue(bio->bi_bdev)->limits.max_write_zeroes_sectors) + mddev->queue->limits.max_write_zeroes_sectors = 0; +} #endif /* _MD_MD_H */ diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c index 79a12b59250b..e95d521d93e9 100644 --- a/drivers/md/multipath.c +++ b/drivers/md/multipath.c @@ -139,6 +139,7 @@ static void multipath_make_request(struct mddev *mddev, struct bio * bio) mp_bh->bio.bi_end_io = multipath_end_request; mp_bh->bio.bi_private = mp_bh; mddev_check_writesame(mddev, _bh->bio); + mddev_check_write_zeroes(mddev, _bh->bio); generic_make_request(_bh->bio); return; } diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index 93347ca7c7a6..ce7a6a56cf73 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -383,6 +383,7 @@ static int raid0_run(struct mddev *mddev) blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors); blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors); + blk_queue_max_write_zeroes_sectors(mddev->queue, mddev->chunk_sectors); blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors); blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9); @@ -504,6 +505,7 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio) split, disk_devt(mddev->gendisk), bio_sector); mddev_check_writesame(mddev, split); + mddev_check_write_zeroes(mddev, split); generic_make_request(split); } } while (split != bio); diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index a34f58772022..b59cc100320a 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -3177,8 +3177,10 @@ static int raid1_run(struct mddev *mddev) if (IS_ERR(conf)) return PTR_ERR(conf); - if (mddev->queue) + if (mddev->queue) { blk_queue_max_write_same_sectors(mddev->queue, 0); + blk_queue_max_write_zeroes_sectors(mddev->queue, 0); + } rdev_for_each(rdev, mddev) { if (!mddev->gendisk) diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index e89a8d78a9ed..28ec3a93acee 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -3749,6 +3749,7 @@ static int raid10_run(struct mddev *mddev) blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors); blk_queue_max_write_same_sectors(mddev->queue, 0); + blk_queue_max_write_zeroes_sectors(mddev->queue, 0); blk_queue_io_min(mddev->queue, chunk_size); if (conf->geo.raid_disks % conf->geo.near_copies) blk_queue_io_opt(mddev->queue, chunk_size * conf->geo.raid_disks); diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index ed5cd705b985..8cf1f86dcd05 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -7272,6 +7272,7 @@ static int raid5_run(struct mddev *mddev) mddev->queue->limits.discard_zeroes_data = 0; blk_queue_max_write_same_sectors(mddev->queue, 0); + blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
[PATCH 06/27] dm io: discards don't take a payload
Fix up do_region to not allocate a bio_vec for discards. We've got rid of the discard payload allocated by the caller years ago. Obviously this wasn't actually harmful given how long it's been there, but it's still good to avoid the pointless allocation. Signed-off-by: Christoph HellwigReviewed-by: Hannes Reinecke --- drivers/md/dm-io.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c index 03940bf36f6c..b808cbe22678 100644 --- a/drivers/md/dm-io.c +++ b/drivers/md/dm-io.c @@ -328,11 +328,17 @@ static void do_region(int op, int op_flags, unsigned region, /* * Allocate a suitably sized-bio. */ - if ((op == REQ_OP_DISCARD) || (op == REQ_OP_WRITE_SAME)) + switch (op) { + case REQ_OP_DISCARD: + num_bvecs = 0; + break; + case REQ_OP_WRITE_SAME: num_bvecs = 1; - else + break; + default: num_bvecs = min_t(int, BIO_MAX_PAGES, dm_sector_div_up(remaining, (PAGE_SIZE >> SECTOR_SHIFT))); + } bio = bio_alloc_bioset(GFP_NOIO, num_bvecs, io->client->bios); bio->bi_iter.bi_sector = where->sector + (where->count - remaining); -- 2.11.0
[PATCH 26/27] scsi: sd: Separate zeroout and discard command choices
From: "Martin K. Petersen"Now that zeroout and discards are distinct operations we need to separate the policy of choosing the appropriate command. Create a zeroing_mode which can be one of: write: Zeroout assist not present, use regular WRITE writesame: Allow WRITE SAME(10/16) with a zeroed payload writesame_16_unmap: Allow WRITE SAME(16) with UNMAP writesame_10_unmap: Allow WRITE SAME(10) with UNMAP The last two are conditional on the device being thin provisioned with LBPRZ=1 and LBPWS=1 or LBPWS10=1 respectively. Whether to set the UNMAP bit or not depends on the REQ_NOUNMAP flag. And if none of the _unmap variants are supported, regular WRITE SAME will be used if the device supports it. The zeroout_mode is exported in sysfs and the detected mode for a given device can be overridden using the string constants above. With this change in place we can now issue WRITE SAME(16) with UNMAP set for block zeroing applications that require hard guarantees and logical_block_size granularity. And at the same time use the UNMAP command with the device's preferred granulary and alignment for discard operations. Signed-off-by: Martin K. Petersen Signed-off-by: Christoph Hellwig --- drivers/scsi/sd.c | 56 --- drivers/scsi/sd.h | 8 2 files changed, 61 insertions(+), 3 deletions(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index bcb0cb020fd2..acf9d17b05d8 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -418,6 +418,46 @@ provisioning_mode_store(struct device *dev, struct device_attribute *attr, } static DEVICE_ATTR_RW(provisioning_mode); +static const char *zeroing_mode[] = { + [SD_ZERO_WRITE] = "write", + [SD_ZERO_WS]= "writesame", + [SD_ZERO_WS16_UNMAP]= "writesame_16_unmap", + [SD_ZERO_WS10_UNMAP]= "writesame_10_unmap", +}; + +static ssize_t +zeroing_mode_show(struct device *dev, struct device_attribute *attr, + char *buf) +{ + struct scsi_disk *sdkp = to_scsi_disk(dev); + + return snprintf(buf, 20, "%s\n", zeroing_mode[sdkp->zeroing_mode]); +} + +static ssize_t +zeroing_mode_store(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) +{ + struct scsi_disk *sdkp = to_scsi_disk(dev); + + if (!capable(CAP_SYS_ADMIN)) + return -EACCES; + + if (!strncmp(buf, zeroing_mode[SD_ZERO_WRITE], 20)) + sdkp->zeroing_mode = SD_ZERO_WRITE; + else if (!strncmp(buf, zeroing_mode[SD_ZERO_WS], 20)) + sdkp->zeroing_mode = SD_ZERO_WS; + else if (!strncmp(buf, zeroing_mode[SD_ZERO_WS16_UNMAP], 20)) + sdkp->zeroing_mode = SD_ZERO_WS16_UNMAP; + else if (!strncmp(buf, zeroing_mode[SD_ZERO_WS10_UNMAP], 20)) + sdkp->zeroing_mode = SD_ZERO_WS10_UNMAP; + else + return -EINVAL; + + return count; +} +static DEVICE_ATTR_RW(zeroing_mode); + static ssize_t max_medium_access_timeouts_show(struct device *dev, struct device_attribute *attr, char *buf) @@ -496,6 +536,7 @@ static struct attribute *sd_disk_attrs[] = { _attr_app_tag_own.attr, _attr_thin_provisioning.attr, _attr_provisioning_mode.attr, + _attr_zeroing_mode.attr, _attr_max_write_same_blocks.attr, _attr_max_medium_access_timeouts.attr, NULL, @@ -799,10 +840,10 @@ static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd *cmd) u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9); if (!(rq->cmd_flags & REQ_NOUNMAP)) { - switch (sdkp->provisioning_mode) { - case SD_LBP_WS16: + switch (sdkp->zeroing_mode) { + case SD_ZERO_WS16_UNMAP: return sd_setup_write_same16_cmnd(cmd, true); - case SD_LBP_WS10: + case SD_ZERO_WS10_UNMAP: return sd_setup_write_same10_cmnd(cmd, true); } } @@ -840,6 +881,15 @@ static void sd_config_write_same(struct scsi_disk *sdkp) sdkp->max_ws_blocks = 0; } + if (sdkp->lbprz && sdkp->lbpws) + sdkp->zeroing_mode = SD_ZERO_WS16_UNMAP; + else if (sdkp->lbprz && sdkp->lbpws10) + sdkp->zeroing_mode = SD_ZERO_WS10_UNMAP; + else if (sdkp->max_ws_blocks) + sdkp->zeroing_mode = SD_ZERO_WS; + else + sdkp->zeroing_mode = SD_ZERO_WRITE; + out: blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks * (logical_block_size >> 9)); diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index 4dac35e96a75..a2c4b5c35379 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -59,6 +59,13 @@ enum {
[PATCH 25/27] block: remove the discard_zeroes_data flag
Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can kill this hack. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- Documentation/ABI/testing/sysfs-block | 10 ++- Documentation/block/queue-sysfs.txt | 5 block/blk-lib.c | 7 + block/blk-settings.c | 3 --- block/blk-sysfs.c | 2 +- block/compat_ioctl.c | 2 +- block/ioctl.c | 2 +- drivers/block/drbd/drbd_main.c| 2 -- drivers/block/drbd/drbd_nl.c | 7 + drivers/block/loop.c | 2 -- drivers/block/mtip32xx/mtip32xx.c | 1 - drivers/block/nbd.c | 1 - drivers/md/dm-cache-target.c | 1 - drivers/md/dm-crypt.c | 1 - drivers/md/dm-raid.c | 6 ++--- drivers/md/dm-raid1.c | 1 - drivers/md/dm-table.c | 19 - drivers/md/dm-thin.c | 2 -- drivers/md/raid5.c| 50 +++ drivers/scsi/sd.c | 5 drivers/target/target_core_device.c | 2 +- include/linux/blkdev.h| 15 --- include/linux/device-mapper.h | 5 23 files changed, 27 insertions(+), 124 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block index 2da04ce6aeef..dea212db9df3 100644 --- a/Documentation/ABI/testing/sysfs-block +++ b/Documentation/ABI/testing/sysfs-block @@ -213,14 +213,8 @@ What: /sys/block//queue/discard_zeroes_data Date: May 2011 Contact: Martin K. Petersen Description: - Devices that support discard functionality may return - stale or random data when a previously discarded block - is read back. This can cause problems if the filesystem - expects discarded blocks to be explicitly cleared. If a - device reports that it deterministically returns zeroes - when a discarded area is read the discard_zeroes_data - parameter will be set to one. Otherwise it will be 0 and - the result of reading a discarded area is undefined. + Will always return 0. Don't rely on any specific behavior + for discards, and don't read this file. What: /sys/block//queue/write_same_max_bytes Date: January 2012 diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt index b7f6bdc96d73..2c1e67058fd3 100644 --- a/Documentation/block/queue-sysfs.txt +++ b/Documentation/block/queue-sysfs.txt @@ -43,11 +43,6 @@ large discards are issued, setting this value lower will make Linux issue smaller discards and potentially help reduce latencies induced by large discard operations. -discard_zeroes_data (RO) - -When read, this file will show if the discarded block are zeroed by the -device or not. If its value is '1' the blocks are zeroed otherwise not. - hw_sector_size (RO) --- This is the hardware sector size of the device, in bytes. diff --git a/block/blk-lib.c b/block/blk-lib.c index b0c6c4bcf441..e8caecd71688 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -37,17 +37,12 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, return -ENXIO; if (flags & BLKDEV_DISCARD_SECURE) { - if (flags & BLKDEV_DISCARD_ZERO) - return -EOPNOTSUPP; if (!blk_queue_secure_erase(q)) return -EOPNOTSUPP; op = REQ_OP_SECURE_ERASE; } else { if (!blk_queue_discard(q)) return -EOPNOTSUPP; - if ((flags & BLKDEV_DISCARD_ZERO) && - !q->limits.discard_zeroes_data) - return -EOPNOTSUPP; op = REQ_OP_DISCARD; } @@ -126,7 +121,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector, ); if (!ret && bio) { ret = submit_bio_wait(bio); - if (ret == -EOPNOTSUPP && !(flags & BLKDEV_DISCARD_ZERO)) + if (ret == -EOPNOTSUPP) ret = 0; bio_put(bio); } diff --git a/block/blk-settings.c b/block/blk-settings.c index 1e7174ffc9d4..4fa81ed383ca 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -103,7 +103,6 @@ void blk_set_default_limits(struct queue_limits *lim) lim->discard_granularity = 0; lim->discard_alignment = 0; lim->discard_misaligned = 0; - lim->discard_zeroes_data = 0; lim->logical_block_size = lim->physical_block_size = lim->io_min =
[PATCH 21/27] mmc: remove the discard_zeroes_data flag
mmc only supports discarding on large alignments, so the zeroing code would always fall back to explicit writings of zeroes. Signed-off-by: Christoph HellwigReviewed-by: Hannes Reinecke --- drivers/mmc/core/queue.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c index 493eb10ce580..4c54ad34e17a 100644 --- a/drivers/mmc/core/queue.c +++ b/drivers/mmc/core/queue.c @@ -167,8 +167,6 @@ static void mmc_queue_setup_discard(struct request_queue *q, queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q); blk_queue_max_discard_sectors(q, max_discard); - if (card->erased_byte == 0 && !mmc_can_discard(card)) - q->limits.discard_zeroes_data = 1; q->limits.discard_granularity = card->pref_erase << 9; /* granularity must not be greater than max. discard */ if (card->pref_erase > max_discard) -- 2.11.0
[PATCH 22/27] block: stop using discards for zeroing
Now that we have REQ_OP_WRITE_ZEROES implemented for all devices that support efficient zeroing, we can remove the call to blkdev_issue_discard. This means we only have two ways of zeroing left and can simplify the code. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- block/blk-lib.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 2f882e22890b..b0c6c4bcf441 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -279,6 +279,12 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, * Zero-fill a block range, either using hardware offload or by explicitly * writing zeroes to the device. * + * Note that this function may fail with -EOPNOTSUPP if the driver signals + * zeroing offload support, but the device fails to process the command (for + * some devices there is no non-destructive way to verify whether this + * operation is actually supported). In this case the caller should call + * retry the call to blkdev_issue_zeroout() and the fallback path will be used. + * * If a device is using logical block provisioning, the underlying space will * not be released if %flags contains BLKDEV_ZERO_NOUNMAP. * @@ -349,12 +355,6 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, struct bio *bio = NULL; struct blk_plug plug; - if (!(flags & BLKDEV_ZERO_NOUNMAP)) { - if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, - BLKDEV_DISCARD_ZERO)) - return 0; - } - blk_start_plug(); ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, , flags); -- 2.11.0
[PATCH 12/27] block: add a new BLKDEV_ZERO_NOFALLBACK flag
This avoids fallbacks to explicit zeroing in (__)blkdev_issue_zeroout if the caller doesn't want them. Also clean up the convoluted check for the return condition that this new flag is added to. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- block/blk-lib.c| 5 - include/linux/blkdev.h | 1 + 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 2f6d2cb2e1a2..2f882e22890b 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -281,6 +281,9 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, * * If a device is using logical block provisioning, the underlying space will * not be released if %flags contains BLKDEV_ZERO_NOUNMAP. + * + * If %flags contains BLKDEV_ZERO_NOFALLBACK, the function will return + * -EOPNOTSUPP if no explicit hardware offload for zeroing is provided. */ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, struct bio **biop, @@ -298,7 +301,7 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask, biop, flags); - if (ret == 0 || (ret && ret != -EOPNOTSUPP)) + if (ret != -EOPNOTSUPP || (flags & BLKDEV_ZERO_NOFALLBACK)) goto out; ret = 0; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index e7513ce3dbde..a5055d760661 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1351,6 +1351,7 @@ extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, struct bio **biop); #define BLKDEV_ZERO_NOUNMAP(1 << 0) /* do not free blocks */ +#define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */ extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, struct bio **biop, -- 2.11.0
[PATCH 20/27] rsxx: remove the discard_zeroes_data flag
rsxx only supports discarding on large alignments, so the zeroing code would always fall back to explicit writings of zeroes. Signed-off-by: Christoph HellwigReviewed-by: Hannes Reinecke --- drivers/block/rsxx/dev.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c index f81d70b39d10..9c566364ac9c 100644 --- a/drivers/block/rsxx/dev.c +++ b/drivers/block/rsxx/dev.c @@ -300,7 +300,6 @@ int rsxx_setup_dev(struct rsxx_cardinfo *card) RSXX_HW_BLK_SIZE >> 9); card->queue->limits.discard_granularity = RSXX_HW_BLK_SIZE; card->queue->limits.discard_alignment = RSXX_HW_BLK_SIZE; - card->queue->limits.discard_zeroes_data = 1; } card->queue->queuedata = card; -- 2.11.0
[PATCH 09/27] block: stop using blkdev_issue_write_same for zeroing
We'll always use the WRITE ZEROES code for zeroing now. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- block/blk-lib.c | 4 1 file changed, 4 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index e5b853f2b8a2..2a8d638544a7 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -364,10 +364,6 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, return 0; } - if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask, - ZERO_PAGE(0))) - return 0; - blk_start_plug(); ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, , discard); -- 2.11.0
[PATCH 10/27] block: add a flags argument to (__)blkdev_issue_zeroout
Turn the existing discard flag into a new BLKDEV_ZERO_UNMAP flag with similar semantics, but without referring to diѕcard. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- block/blk-lib.c| 31 ++- block/ioctl.c | 2 +- drivers/block/drbd/drbd_receiver.c | 9 ++--- drivers/nvme/target/io-cmd.c | 2 +- fs/block_dev.c | 2 +- fs/dax.c | 2 +- fs/xfs/xfs_bmap_util.c | 2 +- include/linux/blkdev.h | 16 ++-- 8 files changed, 35 insertions(+), 31 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 2a8d638544a7..f9f24ec69c27 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -282,14 +282,18 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, * @nr_sects: number of sectors to write * @gfp_mask: memory allocation flags (for bio_alloc) * @biop: pointer to anchor bio - * @discard: discard flag + * @flags: controls detailed behavior * * Description: - * Generate and issue number of bios with zerofiled pages. + * Zero-fill a block range, either using hardware offload or by explicitly + * writing zeroes to the device. + * + * If a device is using logical block provisioning, the underlying space will + * not be released if %flags contains BLKDEV_ZERO_NOUNMAP. */ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, struct bio **biop, - bool discard) + unsigned flags) { int ret; int bi_size = 0; @@ -337,28 +341,21 @@ EXPORT_SYMBOL(__blkdev_issue_zeroout); * @sector:start sector * @nr_sects: number of sectors to write * @gfp_mask: memory allocation flags (for bio_alloc) - * @discard: whether to discard the block range + * @flags: controls detailed behavior * * Description: - * Zero-fill a block range. If the discard flag is set and the block - * device guarantees that subsequent READ operations to the block range - * in question will return zeroes, the blocks will be discarded. Should - * the discard request fail, if the discard flag is not set, or if - * discard_zeroes_data is not supported, this function will resort to - * zeroing the blocks manually, thus provisioning (allocating, - * anchoring) them. If the block device supports WRITE ZEROES or WRITE SAME - * command(s), blkdev_issue_zeroout() will use it to optimize the process of - * clearing the block range. Otherwise the zeroing will be performed - * using regular WRITE calls. + * Zero-fill a block range, either using hardware offload or by explicitly + * writing zeroes to the device. See __blkdev_issue_zeroout() for the + * valid values for %flags. */ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, -sector_t nr_sects, gfp_t gfp_mask, bool discard) + sector_t nr_sects, gfp_t gfp_mask, unsigned flags) { int ret; struct bio *bio = NULL; struct blk_plug plug; - if (discard) { + if (!(flags & BLKDEV_ZERO_NOUNMAP)) { if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, BLKDEV_DISCARD_ZERO)) return 0; @@ -366,7 +363,7 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, blk_start_plug(); ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, - , discard); + , flags); if (ret == 0 && bio) { ret = submit_bio_wait(bio); bio_put(bio); diff --git a/block/ioctl.c b/block/ioctl.c index 7b88820b93d9..8ea00a41be01 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -255,7 +255,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode, truncate_inode_pages_range(mapping, start, end); return blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL, - false); + BLKDEV_ZERO_NOUNMAP); } static int put_ushort(unsigned long arg, unsigned short val) diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c index aa6bf9692eff..dc9a6dcd431c 100644 --- a/drivers/block/drbd/drbd_receiver.c +++ b/drivers/block/drbd/drbd_receiver.c @@ -1499,19 +1499,22 @@ int drbd_issue_discard_or_zero_out(struct drbd_device *device, sector_t start, u tmp = start + granularity - sector_div(tmp, granularity); nr = tmp - start; - err |= blkdev_issue_zeroout(bdev, start, nr, GFP_NOIO, 0); + err |= blkdev_issue_zeroout(bdev, start, nr, GFP_NOIO, + BLKDEV_ZERO_NOUNMAP); nr_sectors -= nr;
[PATCH 14/27] sd: implement unmapping Write Zeroes
Try to use a write same with unmap bit variant if the device supports it and the caller allows for it. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- drivers/scsi/sd.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index d8d9c0bdd93c..001593ed0444 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -803,6 +803,15 @@ static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd *cmd) u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9); u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9); + if (!(rq->cmd_flags & REQ_NOUNMAP)) { + switch (sdkp->provisioning_mode) { + case SD_LBP_WS16: + return sd_setup_write_same16_cmnd(cmd, true); + case SD_LBP_WS10: + return sd_setup_write_same10_cmnd(cmd, true); + } + } + if (sdp->no_write_same) return BLKPREP_INVALID; if (sdkp->ws16 || sector > 0x || nr_sectors > 0x) -- 2.11.0
[PATCH 11/27] block: add a REQ_NOUNMAP flag for REQ_OP_WRITE_ZEROES
If this flag is set logical provisioning capable device should release space for the zeroed blocks if possible, if it is not set devices should keep the blocks anchored. Also remove an out of sync kerneldoc comment for a static function that would have become even more out of data with this change. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- block/blk-lib.c | 19 +-- include/linux/blk_types.h | 6 ++ 2 files changed, 11 insertions(+), 14 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index f9f24ec69c27..2f6d2cb2e1a2 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -226,20 +226,9 @@ int blkdev_issue_write_same(struct block_device *bdev, sector_t sector, } EXPORT_SYMBOL(blkdev_issue_write_same); -/** - * __blkdev_issue_write_zeroes - generate number of bios with WRITE ZEROES - * @bdev: blockdev to issue - * @sector:start sector - * @nr_sects: number of sectors to write - * @gfp_mask: memory allocation flags (for bio_alloc) - * @biop: pointer to anchor bio - * - * Description: - * Generate and issue number of bios(REQ_OP_WRITE_ZEROES) with zerofiled pages. - */ static int __blkdev_issue_write_zeroes(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, - struct bio **biop) + struct bio **biop, unsigned flags) { struct bio *bio = *biop; unsigned int max_write_zeroes_sectors; @@ -258,7 +247,9 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, bio = next_bio(bio, 0, gfp_mask); bio->bi_iter.bi_sector = sector; bio->bi_bdev = bdev; - bio_set_op_attrs(bio, REQ_OP_WRITE_ZEROES, 0); + bio->bi_opf = REQ_OP_WRITE_ZEROES; + if (flags & BLKDEV_ZERO_NOUNMAP) + bio->bi_opf |= REQ_NOUNMAP; if (nr_sects > max_write_zeroes_sectors) { bio->bi_iter.bi_size = max_write_zeroes_sectors << 9; @@ -306,7 +297,7 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, return -EINVAL; ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask, - biop); + biop, flags); if (ret == 0 || (ret && ret != -EOPNOTSUPP)) goto out; diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 4eae30bfbfca..8eaa7dca7057 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -195,6 +195,10 @@ enum req_flag_bits { __REQ_PREFLUSH, /* request for cache flush */ __REQ_RAHEAD, /* read ahead, can fail anytime */ __REQ_BACKGROUND, /* background IO */ + + /* command specific flags for REQ_OP_WRITE_ZEROES: */ + __REQ_NOUNMAP, /* do not free blocks when zeroing */ + __REQ_NR_BITS, /* stops here */ }; @@ -212,6 +216,8 @@ enum req_flag_bits { #define REQ_RAHEAD (1ULL << __REQ_RAHEAD) #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND) +#define REQ_NOUNMAP(1ULL << __REQ_NOUNMAP) + #define REQ_FAILFAST_MASK \ (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER) -- 2.11.0
[PATCH 19/27] rbd: remove the discard_zeroes_data flag
rbd only supports discarding on large alignments, so the zeroing code would always fall back to explicit writings of zeroes. Signed-off-by: Christoph HellwigReviewed-by: Hannes Reinecke --- drivers/block/rbd.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index f24ade333e0c..089ac4179919 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -4380,7 +4380,6 @@ static int rbd_init_disk(struct rbd_device *rbd_dev) q->limits.discard_granularity = segment_size; q->limits.discard_alignment = segment_size; blk_queue_max_discard_sectors(q, segment_size / SECTOR_SIZE); - q->limits.discard_zeroes_data = 1; if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC)) q->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES; -- 2.11.0
[PATCH 18/27] brd: remove discard support
It's just a in-driver reimplementation of writing zeroes to the pages, which fails if the discards aren't page aligned. Signed-off-by: Christoph HellwigReviewed-by: Hannes Reinecke --- drivers/block/brd.c | 54 - 1 file changed, 54 deletions(-) diff --git a/drivers/block/brd.c b/drivers/block/brd.c index 3adc32a3153b..4ec84d504780 100644 --- a/drivers/block/brd.c +++ b/drivers/block/brd.c @@ -134,28 +134,6 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector) return page; } -static void brd_free_page(struct brd_device *brd, sector_t sector) -{ - struct page *page; - pgoff_t idx; - - spin_lock(>brd_lock); - idx = sector >> PAGE_SECTORS_SHIFT; - page = radix_tree_delete(>brd_pages, idx); - spin_unlock(>brd_lock); - if (page) - __free_page(page); -} - -static void brd_zero_page(struct brd_device *brd, sector_t sector) -{ - struct page *page; - - page = brd_lookup_page(brd, sector); - if (page) - clear_highpage(page); -} - /* * Free all backing store pages and radix tree. This must only be called when * there are no other users of the device. @@ -212,24 +190,6 @@ static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n) return 0; } -static void discard_from_brd(struct brd_device *brd, - sector_t sector, size_t n) -{ - while (n >= PAGE_SIZE) { - /* -* Don't want to actually discard pages here because -* re-allocating the pages can result in writeback -* deadlocks under heavy load. -*/ - if (0) - brd_free_page(brd, sector); - else - brd_zero_page(brd, sector); - sector += PAGE_SIZE >> SECTOR_SHIFT; - n -= PAGE_SIZE; - } -} - /* * Copy n bytes from src to the brd starting at sector. Does not sleep. */ @@ -338,14 +298,6 @@ static blk_qc_t brd_make_request(struct request_queue *q, struct bio *bio) if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) goto io_error; - if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) { - if (sector & ((PAGE_SIZE >> SECTOR_SHIFT) - 1) || - bio->bi_iter.bi_size & ~PAGE_MASK) - goto io_error; - discard_from_brd(brd, sector, bio->bi_iter.bi_size); - goto out; - } - bio_for_each_segment(bvec, bio, iter) { unsigned int len = bvec.bv_len; int err; @@ -357,7 +309,6 @@ static blk_qc_t brd_make_request(struct request_queue *q, struct bio *bio) sector += len >> SECTOR_SHIFT; } -out: bio_endio(bio); return BLK_QC_T_NONE; io_error: @@ -464,11 +415,6 @@ static struct brd_device *brd_alloc(int i) * is harmless) */ blk_queue_physical_block_size(brd->brd_queue, PAGE_SIZE); - - brd->brd_queue->limits.discard_granularity = PAGE_SIZE; - blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX); - brd->brd_queue->limits.discard_zeroes_data = 1; - queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue); #ifdef CONFIG_BLK_DEV_RAM_DAX queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue); #endif -- 2.11.0
[PATCH 17/27] loop: implement REQ_OP_WRITE_ZEROES
It's identical to discard as hole punches will always leave us with zeroes on reads. Signed-off-by: Christoph HellwigReviewed-by: Hannes Reinecke --- drivers/block/loop.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index cc981f34e017..3bb04c1a4ba1 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -528,6 +528,7 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq) case REQ_OP_FLUSH: return lo_req_flush(lo, rq); case REQ_OP_DISCARD: + case REQ_OP_WRITE_ZEROES: return lo_discard(lo, rq, pos); case REQ_OP_WRITE: if (lo->transfer) @@ -826,6 +827,7 @@ static void loop_config_discard(struct loop_device *lo) q->limits.discard_granularity = 0; q->limits.discard_alignment = 0; blk_queue_max_discard_sectors(q, 0); + blk_queue_max_write_zeroes_sectors(q, 0); q->limits.discard_zeroes_data = 0; queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q); return; @@ -834,6 +836,7 @@ static void loop_config_discard(struct loop_device *lo) q->limits.discard_granularity = inode->i_sb->s_blocksize; q->limits.discard_alignment = 0; blk_queue_max_discard_sectors(q, UINT_MAX >> 9); + blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9); q->limits.discard_zeroes_data = 1; queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q); } @@ -1660,6 +1663,7 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx, switch (req_op(cmd->rq)) { case REQ_OP_FLUSH: case REQ_OP_DISCARD: + case REQ_OP_WRITE_ZEROES: cmd->use_aio = false; break; default: -- 2.11.0
[PATCH 15/27] nvme: implement REQ_OP_WRITE_ZEROES
But now for the real NVMe Write Zeroes yet, just to get rid of the discard abuse for zeroing. Also rename the quirk flag to be a bit more self-explanatory. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- drivers/nvme/host/core.c | 10 +- drivers/nvme/host/nvme.h | 6 +++--- drivers/nvme/host/pci.c | 6 +++--- 3 files changed, 11 insertions(+), 11 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 3c908e1bc903..26d5129a640a 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -358,6 +358,8 @@ int nvme_setup_cmd(struct nvme_ns *ns, struct request *req, case REQ_OP_FLUSH: nvme_setup_flush(ns, cmd); break; + case REQ_OP_WRITE_ZEROES: + /* currently only aliased to deallocate for a few ctrls: */ case REQ_OP_DISCARD: ret = nvme_setup_discard(ns, req, cmd); break; @@ -923,16 +925,14 @@ static void nvme_config_discard(struct nvme_ns *ns) BUILD_BUG_ON(PAGE_SIZE / sizeof(struct nvme_dsm_range) < NVME_DSM_MAX_RANGES); - if (ctrl->quirks & NVME_QUIRK_DISCARD_ZEROES) - ns->queue->limits.discard_zeroes_data = 1; - else - ns->queue->limits.discard_zeroes_data = 0; - ns->queue->limits.discard_alignment = logical_block_size; ns->queue->limits.discard_granularity = logical_block_size; blk_queue_max_discard_sectors(ns->queue, UINT_MAX); blk_queue_max_discard_segments(ns->queue, NVME_DSM_MAX_RANGES); queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue); + + if (ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES) + blk_queue_max_write_zeroes_sectors(ns->queue, UINT_MAX); } static int nvme_revalidate_ns(struct nvme_ns *ns, struct nvme_id_ns **id) diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 227f281482db..f903726eeb68 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -68,10 +68,10 @@ enum nvme_quirks { NVME_QUIRK_IDENTIFY_CNS = (1 << 1), /* -* The controller deterministically returns O's on reads to discarded -* logical blocks. +* The controller deterministically returns O's on reads to +* logical blocks that deallocate was called on. */ - NVME_QUIRK_DISCARD_ZEROES = (1 << 2), + NVME_QUIRK_DEALLOCATE_ZEROES= (1 << 2), /* * The controller needs a delay before starts checking the device diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 9e686a67d93b..cb530a6bef3f 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -2113,13 +2113,13 @@ static const struct pci_error_handlers nvme_err_handler = { static const struct pci_device_id nvme_id_table[] = { { PCI_VDEVICE(INTEL, 0x0953), .driver_data = NVME_QUIRK_STRIPE_SIZE | - NVME_QUIRK_DISCARD_ZEROES, }, + NVME_QUIRK_DEALLOCATE_ZEROES, }, { PCI_VDEVICE(INTEL, 0x0a53), .driver_data = NVME_QUIRK_STRIPE_SIZE | - NVME_QUIRK_DISCARD_ZEROES, }, + NVME_QUIRK_DEALLOCATE_ZEROES, }, { PCI_VDEVICE(INTEL, 0x0a54), .driver_data = NVME_QUIRK_STRIPE_SIZE | - NVME_QUIRK_DISCARD_ZEROES, }, + NVME_QUIRK_DEALLOCATE_ZEROES, }, { PCI_VDEVICE(INTEL, 0x5845), /* Qemu emulated controller */ .driver_data = NVME_QUIRK_IDENTIFY_CNS, }, { PCI_DEVICE(0x1c58, 0x0003), /* HGST adapter */ -- 2.11.0
[PATCH 23/27] drbd: make intelligent use of blkdev_issue_zeroout
drbd always wants its discard wire operations to zero the blocks, so use blkdev_issue_zeroout with the BLKDEV_ZERO_UNMAP flag instead of reinventing it poorly. Signed-off-by: Christoph HellwigReviewed-by: Hannes Reinecke --- drivers/block/drbd/drbd_debugfs.c | 3 -- drivers/block/drbd/drbd_int.h | 6 --- drivers/block/drbd/drbd_receiver.c | 102 ++--- drivers/block/drbd/drbd_req.c | 6 +-- 4 files changed, 7 insertions(+), 110 deletions(-) diff --git a/drivers/block/drbd/drbd_debugfs.c b/drivers/block/drbd/drbd_debugfs.c index de5c3ee8a790..494837e59f23 100644 --- a/drivers/block/drbd/drbd_debugfs.c +++ b/drivers/block/drbd/drbd_debugfs.c @@ -236,9 +236,6 @@ static void seq_print_peer_request_flags(struct seq_file *m, struct drbd_peer_re seq_print_rq_state_bit(m, f & EE_CALL_AL_COMPLETE_IO, , "in-AL"); seq_print_rq_state_bit(m, f & EE_SEND_WRITE_ACK, , "C"); seq_print_rq_state_bit(m, f & EE_MAY_SET_IN_SYNC, , "set-in-sync"); - - if (f & EE_IS_TRIM) - __seq_print_rq_state_bit(m, f & EE_IS_TRIM_USE_ZEROOUT, , "zero-out", "trim"); seq_print_rq_state_bit(m, f & EE_WRITE_SAME, , "write-same"); seq_putc(m, '\n'); } diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h index 724d1c50fc52..d5da45bb03a6 100644 --- a/drivers/block/drbd/drbd_int.h +++ b/drivers/block/drbd/drbd_int.h @@ -437,9 +437,6 @@ enum { /* is this a TRIM aka REQ_DISCARD? */ __EE_IS_TRIM, - /* our lower level cannot handle trim, -* and we want to fall back to zeroout instead */ - __EE_IS_TRIM_USE_ZEROOUT, /* In case a barrier failed, * we need to resubmit without the barrier flag. */ @@ -482,7 +479,6 @@ enum { #define EE_CALL_AL_COMPLETE_IO (1<<__EE_CALL_AL_COMPLETE_IO) #define EE_MAY_SET_IN_SYNC (1<<__EE_MAY_SET_IN_SYNC) #define EE_IS_TRIM (1<<__EE_IS_TRIM) -#define EE_IS_TRIM_USE_ZEROOUT (1<<__EE_IS_TRIM_USE_ZEROOUT) #define EE_RESUBMITTED (1<<__EE_RESUBMITTED) #define EE_WAS_ERROR (1<<__EE_WAS_ERROR) #define EE_HAS_DIGEST (1<<__EE_HAS_DIGEST) @@ -1561,8 +1557,6 @@ extern void start_resync_timer_fn(unsigned long data); extern void drbd_endio_write_sec_final(struct drbd_peer_request *peer_req); /* drbd_receiver.c */ -extern int drbd_issue_discard_or_zero_out(struct drbd_device *device, - sector_t start, unsigned int nr_sectors, bool discard); extern int drbd_receiver(struct drbd_thread *thi); extern int drbd_ack_receiver(struct drbd_thread *thi); extern void drbd_send_ping_wf(struct work_struct *ws); diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c index dc9a6dcd431c..bc1d296581f9 100644 --- a/drivers/block/drbd/drbd_receiver.c +++ b/drivers/block/drbd/drbd_receiver.c @@ -1448,108 +1448,14 @@ void drbd_bump_write_ordering(struct drbd_resource *resource, struct drbd_backin drbd_info(resource, "Method to ensure write ordering: %s\n", write_ordering_str[resource->write_ordering]); } -/* - * We *may* ignore the discard-zeroes-data setting, if so configured. - * - * Assumption is that it "discard_zeroes_data=0" is only because the backend - * may ignore partial unaligned discards. - * - * LVM/DM thin as of at least - * LVM version: 2.02.115(2)-RHEL7 (2015-01-28) - * Library version: 1.02.93-RHEL7 (2015-01-28) - * Driver version: 4.29.0 - * still behaves this way. - * - * For unaligned (wrt. alignment and granularity) or too small discards, - * we zero-out the initial (and/or) trailing unaligned partial chunks, - * but discard all the aligned full chunks. - * - * At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1". - */ -int drbd_issue_discard_or_zero_out(struct drbd_device *device, sector_t start, unsigned int nr_sectors, bool discard) -{ - struct block_device *bdev = device->ldev->backing_bdev; - struct request_queue *q = bdev_get_queue(bdev); - sector_t tmp, nr; - unsigned int max_discard_sectors, granularity; - int alignment; - int err = 0; - - if (!discard) - goto zero_out; - - /* Zero-sector (unknown) and one-sector granularities are the same. */ - granularity = max(q->limits.discard_granularity >> 9, 1U); - alignment = (bdev_discard_alignment(bdev) >> 9) % granularity; - - max_discard_sectors = min(q->limits.max_discard_sectors, (1U << 22)); - max_discard_sectors -= max_discard_sectors % granularity; - if (unlikely(!max_discard_sectors)) - goto zero_out; - - if (nr_sectors < granularity) - goto zero_out; - - tmp = start; - if (sector_div(tmp, granularity) != alignment) { - if (nr_sectors < 2*granularity) - goto zero_out; - /* start + gran - (start + gran - align) % gran */ -
[PATCH 27/27] scsi: sd: Remove LBPRZ dependency for discards
From: "Martin K. Petersen"Separating discards and zeroout operations allows us to remove the LBPRZ block zeroing constraints from discards and honor the device preferences for UNMAP commands. If supported by the device, we'll also choose UNMAP over one of the WRITE SAME variants for discards. Signed-off-by: Martin K. Petersen Signed-off-by: Christoph Hellwig --- drivers/scsi/sd.c | 25 ++--- 1 file changed, 6 insertions(+), 19 deletions(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index acf9d17b05d8..8cf34a8e3eea 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -685,24 +685,11 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode) unsigned int logical_block_size = sdkp->device->sector_size; unsigned int max_blocks = 0; - /* -* When LBPRZ is reported, discard alignment and granularity -* must be fixed to the logical block size. Otherwise the block -* layer will drop misaligned portions of the request which can -* lead to data corruption. If LBPRZ is not set, we honor the -* device preference. -*/ - if (sdkp->lbprz) { - q->limits.discard_alignment = 0; - q->limits.discard_granularity = logical_block_size; - } else { - q->limits.discard_alignment = sdkp->unmap_alignment * - logical_block_size; - q->limits.discard_granularity = - max(sdkp->physical_block_size, - sdkp->unmap_granularity * logical_block_size); - } - + q->limits.discard_alignment = + sdkp->unmap_alignment * logical_block_size; + q->limits.discard_granularity = + max(sdkp->physical_block_size, + sdkp->unmap_granularity * logical_block_size); sdkp->provisioning_mode = mode; switch (mode) { @@ -2842,7 +2829,7 @@ static void sd_read_block_limits(struct scsi_disk *sdkp) sd_config_discard(sdkp, SD_LBP_WS16); } else {/* LBP VPD page tells us what to use */ - if (sdkp->lbpu && sdkp->max_unmap_blocks && !sdkp->lbprz) + if (sdkp->lbpu && sdkp->max_unmap_blocks) sd_config_discard(sdkp, SD_LBP_UNMAP); else if (sdkp->lbpws) sd_config_discard(sdkp, SD_LBP_WS16); -- 2.11.0
[PATCH 2/5] nvme: cleanup nvme_req_needs_retry
Don't pass the status explicitly but derive it from the requeust, and unwind the complex condition to be more readable. Signed-off-by: Christoph HellwigReviewed-by: Johannes Thumshirn --- drivers/nvme/host/core.c | 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 0437f44d00f9..b225aacf4b89 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -67,11 +67,17 @@ static DEFINE_SPINLOCK(dev_list_lock); static struct class *nvme_class; -static inline bool nvme_req_needs_retry(struct request *req, u16 status) +static inline bool nvme_req_needs_retry(struct request *req) { - return !(status & NVME_SC_DNR || blk_noretry_request(req)) && - (jiffies - req->start_time) < req->timeout && - req->retries < nvme_max_retries; + if (blk_noretry_request(req)) + return false; + if (req->errors & NVME_SC_DNR) + return false; + if (jiffies - req->start_time >= req->timeout) + return false; + if (req->retries >= nvme_max_retries) + return false; + return true; } void nvme_complete_rq(struct request *req) @@ -79,7 +85,7 @@ void nvme_complete_rq(struct request *req) int error = 0; if (unlikely(req->errors)) { - if (nvme_req_needs_retry(req, req->errors)) { + if (nvme_req_needs_retry(req)) { req->retries++; blk_mq_requeue_request(req, !blk_mq_queue_stopped(req->q)); -- 2.11.0
[PATCH 4/5] nvme: move the retries count to struct nvme_request
The way NVMe uses this field is entirely different from the older SCSI/BLOCK_PC usage, so move it into struct nvme_request. Also reduce the size of the file to a unsigned char so that we leave space for additional smaller fields that will appear soon. Signed-off-by: Christoph HellwigReviewed-by: Johannes Thumshirn --- drivers/nvme/host/core.c | 10 +- drivers/nvme/host/nvme.h | 1 + 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 933e67c60e33..dc05f41c3992 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -49,8 +49,8 @@ unsigned char shutdown_timeout = 5; module_param(shutdown_timeout, byte, 0644); MODULE_PARM_DESC(shutdown_timeout, "timeout in seconds for controller shutdown"); -static unsigned int nvme_max_retries = 5; -module_param_named(max_retries, nvme_max_retries, uint, 0644); +static u8 nvme_max_retries = 5; +module_param_named(max_retries, nvme_max_retries, byte, 0644); MODULE_PARM_DESC(max_retries, "max number of retries a command may have"); static int nvme_char_major; @@ -74,7 +74,7 @@ static inline bool nvme_req_needs_retry(struct request *req) return false; if (jiffies - req->start_time >= req->timeout) return false; - if (req->retries >= nvme_max_retries) + if (nvme_req(req)->retries >= nvme_max_retries) return false; return true; } @@ -85,7 +85,7 @@ void nvme_complete_rq(struct request *req) if (unlikely(req->errors)) { if (nvme_req_needs_retry(req)) { - req->retries++; + nvme_req(req)->retries++; blk_mq_requeue_request(req, !blk_mq_queue_stopped(req->q)); return; @@ -356,7 +356,7 @@ int nvme_setup_cmd(struct nvme_ns *ns, struct request *req, int ret = BLK_MQ_RQ_QUEUE_OK; if (!(req->rq_flags & RQF_DONTPREP)) { - req->retries = 0; + nvme_req(req)->retries = 0; req->rq_flags |= RQF_DONTPREP; } diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 82ba9a305301..9eecb67177df 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -90,6 +90,7 @@ enum nvme_quirks { struct nvme_request { struct nvme_command *cmd; union nvme_result result; + u8 retries; }; static inline struct nvme_request *nvme_req(struct request *req) -- 2.11.0
[PATCH 3/5] nvme: mark nvme_max_retries static
Signed-off-by: Christoph HellwigReviewed-by: Johannes Thumshirn --- drivers/nvme/host/core.c | 3 +-- drivers/nvme/host/nvme.h | 2 -- 2 files changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index b225aacf4b89..933e67c60e33 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -49,10 +49,9 @@ unsigned char shutdown_timeout = 5; module_param(shutdown_timeout, byte, 0644); MODULE_PARM_DESC(shutdown_timeout, "timeout in seconds for controller shutdown"); -unsigned int nvme_max_retries = 5; +static unsigned int nvme_max_retries = 5; module_param_named(max_retries, nvme_max_retries, uint, 0644); MODULE_PARM_DESC(max_retries, "max number of retries a command may have"); -EXPORT_SYMBOL_GPL(nvme_max_retries); static int nvme_char_major; module_param(nvme_char_major, int, 0); diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 227f281482db..82ba9a305301 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -43,8 +43,6 @@ extern unsigned char shutdown_timeout; #define NVME_DEFAULT_KATO 5 #define NVME_KATO_GRACE10 -extern unsigned int nvme_max_retries; - enum { NVME_NS_LBA = 0, NVME_NS_LIGHTNVM= 1, -- 2.11.0
[PATCH 5/5] block, scsi: move the retries field to struct scsi_request
Instead of bloating the generic struct request with it. Signed-off-by: Christoph HellwigReviewed-by: Johannes Thumshirn --- block/scsi_ioctl.c | 8 drivers/scsi/osd/osd_initiator.c | 2 +- drivers/scsi/osst.c| 2 +- drivers/scsi/scsi_error.c | 2 +- drivers/scsi/scsi_lib.c| 4 ++-- drivers/scsi/sg.c | 2 +- drivers/scsi/st.c | 2 +- drivers/target/target_core_pscsi.c | 2 +- include/linux/blkdev.h | 1 - include/scsi/scsi_request.h| 1 + 10 files changed, 13 insertions(+), 13 deletions(-) diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index 2a2fc768b27a..82a43bb19967 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -362,7 +362,7 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk, goto out_free_cdb; bio = rq->bio; - rq->retries = 0; + req->retries = 0; start_time = jiffies; @@ -476,13 +476,13 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode, goto error; /* default. possible overriden later */ - rq->retries = 5; + req->retries = 5; switch (opcode) { case SEND_DIAGNOSTIC: case FORMAT_UNIT: rq->timeout = FORMAT_UNIT_TIMEOUT; - rq->retries = 1; + req->retries = 1; break; case START_STOP: rq->timeout = START_STOP_TIMEOUT; @@ -495,7 +495,7 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode, break; case READ_DEFECT_DATA: rq->timeout = READ_DEFECT_DATA_TIMEOUT; - rq->retries = 1; + req->retries = 1; break; default: rq->timeout = BLK_DEFAULT_SG_TIMEOUT; diff --git a/drivers/scsi/osd/osd_initiator.c b/drivers/scsi/osd/osd_initiator.c index 6903f03c88af..9d0727b2bdec 100644 --- a/drivers/scsi/osd/osd_initiator.c +++ b/drivers/scsi/osd/osd_initiator.c @@ -1602,7 +1602,7 @@ static int _init_blk_request(struct osd_request *or, req->rq_flags |= RQF_QUIET; req->timeout = or->timeout; - req->retries = or->retries; + scsi_req(req)->retries = or->retries; if (has_out) { or->out.req = req; diff --git a/drivers/scsi/osst.c b/drivers/scsi/osst.c index c47f4b349bac..41bc1d64bf86 100644 --- a/drivers/scsi/osst.c +++ b/drivers/scsi/osst.c @@ -414,7 +414,7 @@ static int osst_execute(struct osst_request *SRpnt, const unsigned char *cmd, memset(rq->cmd, 0, BLK_MAX_CDB); /* ATAPI hates garbage after CDB */ memcpy(rq->cmd, cmd, rq->cmd_len); req->timeout = timeout; - req->retries = retries; + rq->retries = retries; req->end_io_data = SRpnt; blk_execute_rq_nowait(req->q, NULL, req, 1, osst_end_async); diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index f2cafae150bc..2db412dd4b44 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -1988,7 +1988,7 @@ static void scsi_eh_lock_door(struct scsi_device *sdev) req->rq_flags |= RQF_QUIET; req->timeout = 10 * HZ; - req->retries = 5; + rq->retries = 5; blk_execute_rq_nowait(req->q, NULL, req, 1, eh_lock_door_done); } diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index c1519660824b..11972d1075f1 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -256,7 +256,7 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd, rq->cmd_len = COMMAND_SIZE(cmd[0]); memcpy(rq->cmd, cmd, rq->cmd_len); - req->retries = retries; + rq->retries = retries; req->timeout = timeout; req->cmd_flags |= flags; req->rq_flags |= rq_flags | RQF_QUIET | RQF_PREEMPT; @@ -1177,7 +1177,7 @@ static int scsi_setup_scsi_cmnd(struct scsi_device *sdev, struct request *req) cmd->cmd_len = scsi_req(req)->cmd_len; cmd->cmnd = scsi_req(req)->cmd; cmd->transfersize = blk_rq_bytes(req); - cmd->allowed = req->retries; + cmd->allowed = scsi_req(req)->retries; return BLKPREP_OK; } diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index 225abaad4d1c..c5aaceea8d77 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -1718,7 +1718,7 @@ sg_start_req(Sg_request *srp, unsigned char *cmd) srp->rq = rq; rq->end_io_data = srp; - rq->retries = SG_DEFAULT_RETRIES; + req->retries = SG_DEFAULT_RETRIES; if ((dxfer_len <= 0) || (dxfer_dir == SG_DXFER_NONE)) return 0; diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c index e5ef78a6848e..5408643431bb 100644 --- a/drivers/scsi/st.c +++ b/drivers/scsi/st.c @@ -579,7 +579,7 @@ static int st_scsi_execute(struct st_request *SRpnt, const unsigned char *cmd,
[PATCH 1/5] nvme: move ->retries setup to nvme_setup_cmd
->retries is counting the number of times a command is resubmitted, and be cleared on the first time we see the command. We currently don't do that for non-PCIe command, which is easily fixed by moving the setup to common code. Signed-off-by: Christoph Hellwig--- drivers/nvme/host/core.c | 5 + drivers/nvme/host/pci.c | 4 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 3c908e1bc903..0437f44d00f9 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -350,6 +350,11 @@ int nvme_setup_cmd(struct nvme_ns *ns, struct request *req, { int ret = BLK_MQ_RQ_QUEUE_OK; + if (!(req->rq_flags & RQF_DONTPREP)) { + req->retries = 0; + req->rq_flags |= RQF_DONTPREP; + } + switch (req_op(req)) { case REQ_OP_DRV_IN: case REQ_OP_DRV_OUT: diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 9e686a67d93b..3818ab15a26e 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -326,10 +326,6 @@ static int nvme_init_iod(struct request *rq, struct nvme_dev *dev) iod->nents = 0; iod->length = size; - if (!(rq->rq_flags & RQF_DONTPREP)) { - rq->retries = 0; - rq->rq_flags |= RQF_DONTPREP; - } return BLK_MQ_RQ_QUEUE_OK; } -- 2.11.0
->retries fixups V2
This series fixes a few lose bits in terms of how nvme uses ->retries, including fixing it for non-PCIe transports. While at it I noticed that nvme and scsi use the field in entirely different ways, and no other driver uses it at all. So I decided to move it into the nvme_request and scsi_request structures instead. Changes since V1: - better changelog for one patch - move the new retries field to the end of struct nvme_request
[PATCH 27/38] Annotate hardware config module parameters in drivers/scsi/
When the kernel is running in secure boot mode, we lock down the kernel to prevent userspace from modifying the running kernel image. Whilst this includes prohibiting access to things like /dev/mem, it must also prevent access by means of configuring driver modules in such a way as to cause a device to access or modify the kernel image. To this end, annotate module_param* statements that refer to hardware configuration and indicate for future reference what type of parameter they specify. The parameter parser in the core sees this information and can skip such parameters with an error message if the kernel is locked down. The module initialisation then runs as normal, but just sees whatever the default values for those parameters is. Note that we do still need to do the module initialisation because some drivers have viable defaults set in case parameters aren't specified and some drivers support automatic configuration (e.g. PNP or PCI) in addition to manually coded parameters. This patch annotates drivers in drivers/scsi/. Suggested-by: Alan CoxSigned-off-by: David Howells cc: "Juergen E. Fischer" cc: "James E.J. Bottomley" cc: "Martin K. Petersen" cc: Dario Ballabio cc: Finn Thain cc: Michael Schmitz cc: Achim Leubner cc: linux-scsi@vger.kernel.org --- drivers/scsi/aha152x.c |4 ++-- drivers/scsi/aha1542.c |2 +- drivers/scsi/g_NCR5380.c |8 drivers/scsi/gdth.c |2 +- drivers/scsi/qlogicfas.c |4 ++-- 5 files changed, 10 insertions(+), 10 deletions(-) diff --git a/drivers/scsi/aha152x.c b/drivers/scsi/aha152x.c index f44d0487236e..ce5dc73d85bb 100644 --- a/drivers/scsi/aha152x.c +++ b/drivers/scsi/aha152x.c @@ -331,11 +331,11 @@ MODULE_LICENSE("GPL"); #if !defined(PCMCIA) #if defined(MODULE) static int io[] = {0, 0}; -module_param_array(io, int, NULL, 0); +module_param_hw_array(io, int, ioport, NULL, 0); MODULE_PARM_DESC(io,"base io address of controller"); static int irq[] = {0, 0}; -module_param_array(irq, int, NULL, 0); +module_param_hw_array(irq, int, irq, NULL, 0); MODULE_PARM_DESC(irq,"interrupt for controller"); static int scsiid[] = {7, 7}; diff --git a/drivers/scsi/aha1542.c b/drivers/scsi/aha1542.c index 7db448ec8beb..a23cc9ac5acd 100644 --- a/drivers/scsi/aha1542.c +++ b/drivers/scsi/aha1542.c @@ -31,7 +31,7 @@ module_param(isapnp, bool, 0); MODULE_PARM_DESC(isapnp, "enable PnP support (default=1)"); static int io[MAXBOARDS] = { 0x330, 0x334, 0, 0 }; -module_param_array(io, int, NULL, 0); +module_param_hw_array(io, int, ioport, NULL, 0); MODULE_PARM_DESC(io, "base IO address of controller (0x130,0x134,0x230,0x234,0x330,0x334, default=0x330,0x334)"); /* time AHA spends on the AT-bus during data transfer */ diff --git a/drivers/scsi/g_NCR5380.c b/drivers/scsi/g_NCR5380.c index 67c8dac321ad..c34fc91ba486 100644 --- a/drivers/scsi/g_NCR5380.c +++ b/drivers/scsi/g_NCR5380.c @@ -85,8 +85,8 @@ static int ncr_53c400; static int ncr_53c400a; static int dtc_3181e; static int hp_c2502; -module_param(ncr_irq, int, 0); -module_param(ncr_addr, int, 0); +module_param_hw(ncr_irq, int, irq, 0); +module_param_hw(ncr_addr, int, ioport, 0); module_param(ncr_5380, int, 0); module_param(ncr_53c400, int, 0); module_param(ncr_53c400a, int, 0); @@ -94,11 +94,11 @@ module_param(dtc_3181e, int, 0); module_param(hp_c2502, int, 0); static int irq[] = { -1, -1, -1, -1, -1, -1, -1, -1 }; -module_param_array(irq, int, NULL, 0); +module_param_hw_array(irq, int, irq, NULL, 0); MODULE_PARM_DESC(irq, "IRQ number(s) (0=none, 254=auto [default])"); static int base[] = { 0, 0, 0, 0, 0, 0, 0, 0 }; -module_param_array(base, int, NULL, 0); +module_param_hw_array(base, int, ioport, NULL, 0); MODULE_PARM_DESC(base, "base address(es)"); static int card[] = { -1, -1, -1, -1, -1, -1, -1, -1 }; diff --git a/drivers/scsi/gdth.c b/drivers/scsi/gdth.c index d020a13646ae..facc7271f932 100644 --- a/drivers/scsi/gdth.c +++ b/drivers/scsi/gdth.c @@ -353,7 +353,7 @@ static int probe_eisa_isa = 0; static int force_dma32 = 0; /* parameters for modprobe/insmod */ -module_param_array(irq, int, NULL, 0); +module_param_hw_array(irq, int, irq, NULL, 0); module_param(disable, int, 0); module_param(reserve_mode, int, 0); module_param_array(reserve_list, int, NULL, 0); diff --git a/drivers/scsi/qlogicfas.c b/drivers/scsi/qlogicfas.c index 61cac87fb86f..840823b99e51 100644 --- a/drivers/scsi/qlogicfas.c +++ b/drivers/scsi/qlogicfas.c @@ -137,8 +137,8 @@ static struct Scsi_Host *__qlogicfas_detect(struct scsi_host_template *host, static struct qlogicfas408_priv *cards; static int iobase[MAX_QLOGICFAS]; static int irq[MAX_QLOGICFAS] = { [0 ... MAX_QLOGICFAS-1] = -1 }; -module_param_array(iobase, int, NULL, 0); -module_param_array(irq,
[PATCH] Make checking the scsi_device_get() return value mandatory
Now that all scsi_device_get() callers check the return value of this function, make checking that return value mandatory. Signed-off-by: Bart Van AsscheCc: Hannes Reinecke Cc: Johannes Thumshirn --- include/scsi/scsi_device.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h index cdff28519393..7a154a944c6c 100644 --- a/include/scsi/scsi_device.h +++ b/include/scsi/scsi_device.h @@ -316,7 +316,7 @@ extern int scsi_unregister_device_handler(struct scsi_device_handler *scsi_dh); void scsi_attach_vpd(struct scsi_device *sdev); extern struct scsi_device *scsi_device_from_queue(struct request_queue *q); -extern int scsi_device_get(struct scsi_device *); +extern int __must_check scsi_device_get(struct scsi_device *); extern void scsi_device_put(struct scsi_device *); extern struct scsi_device *scsi_device_lookup(struct Scsi_Host *, uint, uint, u64); -- 2.12.0
Re: [PATCH] osd_uld: Check scsi_device_get() return value
On 03/30/2017 08:17 PM, Bart Van Assche wrote: > scsi_device_get() can fail. Hence check its return value. > > Signed-off-by: Bart Van Assche> Cc: Boaz Harrosh Cool thanks ACK-by: Boaz Harrosh > --- > drivers/scsi/osd/osd_uld.c | 8 +--- > 1 file changed, 5 insertions(+), 3 deletions(-) > > diff --git a/drivers/scsi/osd/osd_uld.c b/drivers/scsi/osd/osd_uld.c > index e0ce5d2fd14d..1dea4244dd0c 100644 > --- a/drivers/scsi/osd/osd_uld.c > +++ b/drivers/scsi/osd/osd_uld.c > @@ -464,14 +464,15 @@ static int osd_probe(struct device *dev) > /* hold one more reference to the scsi_device that will get released >* in __release, in case a logout is happening while fs is mounted >*/ > - scsi_device_get(scsi_device); > + if (scsi_device_get(scsi_device)) > + goto err_put_disk; > osd_dev_init(>od, scsi_device); > > /* Detect the OSD Version */ > error = __detect_osd(oud); > if (error) { > OSD_ERR("osd detection failed, non-compatible OSD device\n"); > - goto err_put_disk; > + goto err_put_sdev; > } > > /* init the char-device for communication with user-mode */ > @@ -508,8 +509,9 @@ static int osd_probe(struct device *dev) > > err_put_cdev: > cdev_del(>cdev); > -err_put_disk: > +err_put_sdev: > scsi_device_put(scsi_device); > +err_put_disk: > put_disk(disk); > err_free_osd: > dev_set_drvdata(dev, NULL); >
Re: [PATCH v2] scsi: ses: don't get power status of SES device slot on probe
> On Apr 5, 2017, at 8:18 AM, Mauricio Faria de Oliveira >wrote: > > The commit 08024885a2a3 ("ses: Add power_status to SES device slot") > introduced the 'power_status' attribute to enclosure components and > the associated callbacks. > > There are 2 callbacks available to get the power status of a device: > 1) ses_get_power_status() for 'struct enclosure_component_callbacks' > 2) get_component_power_status() for the sysfs device attribute > (these are available for kernel-space and user-space, respectively.) > > However, despite both methods being available to get power status > on demand, that commit also introduced a call to get power status > in ses_enclosure_data_process(). > > This dramatically increased the total probe time for SCSI devices > on larger configurations, because ses_enclosure_data_process() is > called several times during the SCSI devices probe and loops over > the component devices (but that is another problem, another patch). > > That results in a tremendous continuous hammering of SCSI Receive > Diagnostics commands to the enclosure-services device, which does > delay the total probe time for the SCSI devices __significantly__: > > Originally, ~34 minutes on a system attached to ~170 disks: > >[ 9214.490703] mpt3sas version 13.100.00.00 loaded >... >[11256.580231] scsi 17:0:177:0: qdepth(16), tagged(1), simple(0), > ordered(0), scsi_level(6), cmd_que(1) > > With this patch, it decreased to ~2.5 minutes -- a 13.6x faster > >[ 1002.992533] mpt3sas version 13.100.00.00 loaded >... >[ 1151.978831] scsi 11:0:177:0: qdepth(16), tagged(1), simple(0), > ordered(0), scsi_level(6), cmd_que(1) > > Back to the commit discussion.. on the ses_get_power_status() call > introduced in ses_enclosure_data_process(): impact of removing it. > > That may possibly be in place to initialize the power status value > on device probe. However, those 2 functions available to retrieve > that value _do_ automatically refresh/update it. So the potential > benefit would be a direct access of the 'power_status' field which > does not use the callbacks... > > But the only reader of 'struct enclosure_component::power_status' > is the get_component_power_status() callback for sysfs attribute, > and it _does_ check for and call the .get_power_status callback, > (which indeed is defined and implemented by that commit), so the > power status value is, again, automatically updated. > > So, the remaining potential for a direct/non-callback access to > the power_status attribute would be out-of-tree modules -- well, > for those, if they are for whatever reason interested in values > that are set during device probe and not up-to-date by the time > they need it.. well, that would be curious. > > Well, to handle that more properly, set the initial power state > value to '-1' (i.e., uninitialized) instead of '1' (power 'on'), > and check for it in that callback which may do an direct access > to the field value _if_ a callback function is not defined. > > Signed-off-by: Mauricio Faria de Oliveira > Fixes: 08024885a2a3 ("ses: Add power_status to SES device slot") > --- > > v2: > - remove module parameter. > - better return values for -1/uninitalized power_status. > (thanks: Dan Williams ) > > drivers/misc/enclosure.c | 7 ++- > drivers/scsi/ses.c | 1 - > 2 files changed, 6 insertions(+), 2 deletions(-) > > diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c > index 65fed7146e9b..d3fe3ea902d4 100644 > --- a/drivers/misc/enclosure.c > +++ b/drivers/misc/enclosure.c > @@ -148,7 +148,7 @@ struct enclosure_device * > for (i = 0; i < components; i++) { > edev->component[i].number = -1; > edev->component[i].slot = -1; > - edev->component[i].power_status = 1; > + edev->component[i].power_status = -1; > } > > mutex_lock(_list_lock); > @@ -594,6 +594,11 @@ static ssize_t get_component_power_status(struct device > *cdev, > > if (edev->cb->get_power_status) > edev->cb->get_power_status(edev, ecomp); > + > + /* If still uninitialized, the callback failed or does not exist. */ > + if (ecomp->power_status == -1) > + return (edev->cb->get_power_status) ? -EIO : -ENOTTY; > + > return snprintf(buf, 40, "%s\n", ecomp->power_status ? "on" : "off"); > } > > diff --git a/drivers/scsi/ses.c b/drivers/scsi/ses.c > index 50adabbb5808..f1cdf32d7514 100644 > --- a/drivers/scsi/ses.c > +++ b/drivers/scsi/ses.c > @@ -548,7 +548,6 @@ static void ses_enclosure_data_process(struct > enclosure_device *edev, > ecomp = >component[components++]; > > if (!IS_ERR(ecomp)) { > - ses_get_power_status(edev, ecomp); >
Re: problem with discard granularity in sd
Hi Martin, I really appreciate the response. Here is the VPD page data you asked for: Logical block provisioning VPD page (SBC): Unmap command supported (LBPU): 1 Write same (16) with unmap bit supported (LBWS): 1 Write same (10) with unmap bit supported (LBWS10): 1 Logical block provisioning read zeros (LBPRZ): 1 Anchored LBAs supported (ANC_SUP): 0 Threshold exponent: 0 Descriptor present (DP): 0 Provisioning type: 2 Block limits VPD page (SBC): Write same no zero (WSNZ): 1 Maximum compare and write length: 1 blocks Optimal transfer length granularity: 8 blocks Maximum transfer length: 8388607 blocks Optimal transfer length: 128 blocks Maximum prefetch length: 0 blocks Maximum unmap LBA count: 8192 Maximum unmap block descriptor count: 64 Optimal unmap granularity: 8 Unmap granularity alignment valid: 1 Unmap granularity alignment: 0 Maximum write same length: 0x4000 blocks As I mentioned previously, I'm fairly certain that the issue I'm seeing is due to the fact that while NetApp LUNs are presented as 512B logical/4K physical disks for compatibility, they actually don't support requests smaller than 4K (which makes sense as NetApp LUNs are actually just files allocated on the 4K-block WAFL filesystem). If the optimal granularity values from VPD are respected (as is the case with 3.10 kernels), the result is: minimum_io_size = 512 *8 = 4096 (logical_block_size * VPD optimal transfer length granularity) discard_granularity = 512 * 8 = 4096 (logical_block_size * VPD optimal unmap granularity) With recent kernels the value of minimum_io_size is unchanged, but discard_granularity is explicitly set to logical_block_size which results in unmap requests either being dropped or ignored. I understand that the VPD values from LUNs may be atypical compared to physical disks, and I expect there are some major differences in how unmap (and zeroing) is handled between physical disks and LUNs. But it is unfortunate that a solution for misaligned I/O would break fully-aligned requests which worked flawlessly in previous kernels. Let me know if there's any additional information I can provide. This has resulted in a 2-3x increase in raw disk requirements for some workloads (unfortunately on SSD too), and I'd love to find a solution that doesn't require rolling back to a 3.10 kernel. Thanks again! -David On Tue, Apr 4, 2017 at 5:12 PM, Martin K. Petersenwrote: > David Buckley writes: > > David, > >> They result in discard granularity being forced to logical block size >> if the disk reports LBPRZ is enabled (which the netapp luns do). > > Block zeroing and unmapping are currently sharing some plumbing and that > has lead to some compromises. In this case a bias towards ensuring data > integrity for zeroing at the expense of not aligning unmap requests. > > Christoph has worked on separating those two functions. His code is > currently under review. > >> I'm not sure of the implications of either the netapp changes, though >> reporting 4k logical blocks seems potential as this is supported in >> newer OS at least. > > Yes, but it may break legacy applications that assume a 512-byte logical > block size. > >> The sd change potentially would at least partially undo the patches >> referenced above. But it would seem that (assuming an aligned >> filesystem with 4k blocks and minimum_io_size=4096) there is no >> possibility of a partial block discard or advantage to sending the >> discard requests in 512 blocks? > > The unmap granularity inside a device is often much, much bigger than > 4K. So aligning to that probably won't make a difference. And it's > imperative to filesystems that zeroing works at logical block size > granularity. > > The expected behavior for a device is that it unmaps whichever full > unmap granularity chunks are described by a received request. And then > explicitly zeroes any partial chunks at the head and tail. So I am > surprised you see no reclamation whatsoever. > > With the impending zero/unmap separation things might fare better. But > I'd still like to understand the behavior you observe. Please provide > the output of: > > sg_vpd -p lbpv /dev/sdN > sg_vpd -p bl /dev/sdN > > for one of the LUNs and I'll take a look. > > Thanks! > > -- > Martin K. Petersen Oracle Linux Engineering
RE: [RFC 4/8] p2pmem: Add debugfs "stats" file
> > > + p2pmem_debugfs_root = debugfs_create_dir("p2pmem", NULL); > > + if (!p2pmem_debugfs_root) > > + pr_info("could not create debugfs entry, continuing\n"); > > + > > Why continue? I think it'd be better to just fail it. > Because not having debugfs support isn't fatal to using p2pmem. So I believe it is better to continue. But this is trivial, IMO, so either was is ok with me. > Besides, this can be safely squashed into patch 1. Yes.
RE: [RFC 2/8] cxgb4: setup pcie memory window 4 and create p2pmem region
> > > > +static void setup_memwin_p2pmem(struct adapter *adap) > > +{ > > + unsigned int mem_base = t4_read_reg(adap, > CIM_EXTMEM2_BASE_ADDR_A); > > + unsigned int mem_size = t4_read_reg(adap, > CIM_EXTMEM2_ADDR_SIZE_A); > > + > > + if (!use_p2pmem) > > + return; > > This is weird, why even call this if !use_p2pmem? > The use_p2pmem was added after the original change. I'll update as you suggest. > > +static int init_p2pmem(struct adapter *adapter) > > +{ > > + unsigned int mem_size = t4_read_reg(adapter, > CIM_EXTMEM2_ADDR_SIZE_A); > > + struct p2pmem_dev *p; > > + int rc; > > + struct resource res; > > + > > + if (!mem_size || !use_p2pmem) > > + return 0; > > Again, weird... Yup.
Re: [PATCH] csiostor: switch to pci_alloc_irq_vectors
On Wed, Apr 05, 2017 at 08:26:57AM +0200, Christoph Hellwig wrote: > On Tue, Apr 04, 2017 at 01:49:44PM +0530, Varun Prakash wrote: > > On Tue, Apr 04, 2017 at 08:46:14AM +0200, Christoph Hellwig wrote: > > > Does this one work better? > > > > > > > csiostor driver is triggering following warning during module unload. > > Looks like we need to explicitly ignore the CSIO_IM_NONE case in > csio_free_irqs. New patch below: > Yes, we have to ignore CSIO_IM_NONE case in csio_free_irqs(), but I don't see any CSIO_IM_NONE specific code in csio_free_irqs(). There is one more issue - this patch changes the order in which csio_hw_intr_disable() and free_irq() are called. In the current code first csio_hw_intr_disable() is called then free_irq() is called, with this patch free_irq() is called without disabling hw interrupts, this can cause regressions. > --- > From e5a4178cb810be581b6d9b8f48f13b12e88eae74 Mon Sep 17 00:00:00 2001 > From: Christoph Hellwig> Date: Thu, 12 Jan 2017 11:17:29 +0100 > Subject: csiostor: switch to pci_alloc_irq_vectors > > And get automatic MSI-X affinity for free. > > Signed-off-by: Christoph Hellwig > --- > drivers/scsi/csiostor/csio_hw.h | 4 +- > drivers/scsi/csiostor/csio_init.c | 9 +-- > drivers/scsi/csiostor/csio_isr.c | 127 > +- > 3 files changed, 51 insertions(+), 89 deletions(-) > > diff --git a/drivers/scsi/csiostor/csio_hw.h b/drivers/scsi/csiostor/csio_hw.h > index 029bef82c057..a084d83ce70f 100644 > --- a/drivers/scsi/csiostor/csio_hw.h > +++ b/drivers/scsi/csiostor/csio_hw.h > @@ -95,7 +95,6 @@ enum { > }; > > struct csio_msix_entries { > - unsigned short vector; /* Assigned MSI-X vector */ > void*dev_id;/* Priv object associated w/ this msix*/ > chardesc[24]; /* Description of this vector */ > }; > @@ -591,8 +590,9 @@ int csio_enqueue_evt(struct csio_hw *, enum csio_evt, > void *, uint16_t); > void csio_evtq_flush(struct csio_hw *hw); > > int csio_request_irqs(struct csio_hw *); > +void csio_free_irqs(struct csio_hw *); > void csio_intr_enable(struct csio_hw *); > -void csio_intr_disable(struct csio_hw *, bool); > +void csio_intr_disable(struct csio_hw *); > void csio_hw_fatal_err(struct csio_hw *); > > struct csio_lnode *csio_lnode_alloc(struct csio_hw *); > diff --git a/drivers/scsi/csiostor/csio_init.c > b/drivers/scsi/csiostor/csio_init.c > index dbe416ff46c2..7e60699c8b55 100644 > --- a/drivers/scsi/csiostor/csio_init.c > +++ b/drivers/scsi/csiostor/csio_init.c > @@ -461,8 +461,7 @@ csio_config_queues(struct csio_hw *hw) > return 0; > > intr_disable: > - csio_intr_disable(hw, false); > - > + csio_intr_disable(hw); > return -EINVAL; > } > > @@ -575,7 +574,8 @@ static struct csio_hw *csio_hw_alloc(struct pci_dev *pdev) > static void > csio_hw_free(struct csio_hw *hw) > { > - csio_intr_disable(hw, true); > + csio_free_irqs(hw); > + csio_intr_disable(hw); This changes the order, free_irq() is called without disabling hw interrupts. > csio_hw_exit_workers(hw); > csio_hw_exit(hw); > iounmap(hw->regstart); > @@ -1068,7 +1068,8 @@ csio_pci_error_detected(struct pci_dev *pdev, > pci_channel_state_t state) > spin_unlock_irq(>lock); > csio_lnodes_unblock_request(hw); > csio_lnodes_exit(hw, 0); > - csio_intr_disable(hw, true); > + csio_free_irqs(hw); > + csio_intr_disable(hw); Same here. > pci_disable_device(pdev); > return state == pci_channel_io_perm_failure ? > PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET; > diff --git a/drivers/scsi/csiostor/csio_isr.c > b/drivers/scsi/csiostor/csio_isr.c > index 2fb71c6c3b37..a4ad847e9c53 100644 > --- a/drivers/scsi/csiostor/csio_isr.c > +++ b/drivers/scsi/csiostor/csio_isr.c > @@ -383,17 +383,15 @@ csio_request_irqs(struct csio_hw *hw) > int rv, i, j, k = 0; > struct csio_msix_entries *entryp = >msix_entries[0]; > struct csio_scsi_cpu_info *info; > + struct pci_dev *pdev = hw->pdev; > > if (hw->intr_mode != CSIO_IM_MSIX) { > - rv = request_irq(hw->pdev->irq, csio_fcoe_isr, > - (hw->intr_mode == CSIO_IM_MSI) ? > - 0 : IRQF_SHARED, > - KBUILD_MODNAME, hw); > + rv = request_irq(pci_irq_vector(pdev, 0), csio_fcoe_isr, > + hw->intr_mode == CSIO_IM_MSI ? 0 : IRQF_SHARED, > + KBUILD_MODNAME, hw); > if (rv) { > - if (hw->intr_mode == CSIO_IM_MSI) > - pci_disable_msi(hw->pdev); > csio_err(hw, "Failed to allocate interrupt line.\n"); > - return -EINVAL; > + goto out_free_irqs; > } > >
Re: [PATCH v2] scsi: ses: don't get power status of SES device slot on probe
On Wed, Apr 5, 2017 at 8:18 AM, Mauricio Faria de Oliveirawrote: > The commit 08024885a2a3 ("ses: Add power_status to SES device slot") > introduced the 'power_status' attribute to enclosure components and > the associated callbacks. > > There are 2 callbacks available to get the power status of a device: > 1) ses_get_power_status() for 'struct enclosure_component_callbacks' > 2) get_component_power_status() for the sysfs device attribute > (these are available for kernel-space and user-space, respectively.) > > However, despite both methods being available to get power status > on demand, that commit also introduced a call to get power status > in ses_enclosure_data_process(). > > This dramatically increased the total probe time for SCSI devices > on larger configurations, because ses_enclosure_data_process() is > called several times during the SCSI devices probe and loops over > the component devices (but that is another problem, another patch). > > That results in a tremendous continuous hammering of SCSI Receive > Diagnostics commands to the enclosure-services device, which does > delay the total probe time for the SCSI devices __significantly__: > > Originally, ~34 minutes on a system attached to ~170 disks: > > [ 9214.490703] mpt3sas version 13.100.00.00 loaded > ... > [11256.580231] scsi 17:0:177:0: qdepth(16), tagged(1), simple(0), >ordered(0), scsi_level(6), cmd_que(1) > > With this patch, it decreased to ~2.5 minutes -- a 13.6x faster > > [ 1002.992533] mpt3sas version 13.100.00.00 loaded > ... > [ 1151.978831] scsi 11:0:177:0: qdepth(16), tagged(1), simple(0), >ordered(0), scsi_level(6), cmd_que(1) > > Back to the commit discussion.. on the ses_get_power_status() call > introduced in ses_enclosure_data_process(): impact of removing it. > > That may possibly be in place to initialize the power status value > on device probe. However, those 2 functions available to retrieve > that value _do_ automatically refresh/update it. So the potential > benefit would be a direct access of the 'power_status' field which > does not use the callbacks... > > But the only reader of 'struct enclosure_component::power_status' > is the get_component_power_status() callback for sysfs attribute, > and it _does_ check for and call the .get_power_status callback, > (which indeed is defined and implemented by that commit), so the > power status value is, again, automatically updated. > > So, the remaining potential for a direct/non-callback access to > the power_status attribute would be out-of-tree modules -- well, > for those, if they are for whatever reason interested in values > that are set during device probe and not up-to-date by the time > they need it.. well, that would be curious. > > Well, to handle that more properly, set the initial power state > value to '-1' (i.e., uninitialized) instead of '1' (power 'on'), > and check for it in that callback which may do an direct access > to the field value _if_ a callback function is not defined. > > Signed-off-by: Mauricio Faria de Oliveira > Fixes: 08024885a2a3 ("ses: Add power_status to SES device slot") Reviewed-by: Dan Williams
Re: [PATCH 4/5] nvme: move the retries count to struct nvme_request
On Wed, Apr 05, 2017 at 11:14:30AM -0400, Keith Busch wrote: > On Wed, Apr 05, 2017 at 04:18:55PM +0200, Christoph Hellwig wrote: > > The way NVMe uses this field is entirely different from the older > > SCSI/BLOCK_PC usage, so move it into struct nvme_request. > > > > Also reduce the size of the file to a unsigned char so that we leave space > > for additional smaller fields that will appear soon. > > What new fields can we expect? Why temporarily pad the middle of the > struct instead of appending to the end? The "result" packed nicely > on 64-bit, at least. I've got another u8 and a u16 in pending patches. That being said moving it to the end might make sense.
[PATCH v2] scsi: ses: don't get power status of SES device slot on probe
The commit 08024885a2a3 ("ses: Add power_status to SES device slot") introduced the 'power_status' attribute to enclosure components and the associated callbacks. There are 2 callbacks available to get the power status of a device: 1) ses_get_power_status() for 'struct enclosure_component_callbacks' 2) get_component_power_status() for the sysfs device attribute (these are available for kernel-space and user-space, respectively.) However, despite both methods being available to get power status on demand, that commit also introduced a call to get power status in ses_enclosure_data_process(). This dramatically increased the total probe time for SCSI devices on larger configurations, because ses_enclosure_data_process() is called several times during the SCSI devices probe and loops over the component devices (but that is another problem, another patch). That results in a tremendous continuous hammering of SCSI Receive Diagnostics commands to the enclosure-services device, which does delay the total probe time for the SCSI devices __significantly__: Originally, ~34 minutes on a system attached to ~170 disks: [ 9214.490703] mpt3sas version 13.100.00.00 loaded ... [11256.580231] scsi 17:0:177:0: qdepth(16), tagged(1), simple(0), ordered(0), scsi_level(6), cmd_que(1) With this patch, it decreased to ~2.5 minutes -- a 13.6x faster [ 1002.992533] mpt3sas version 13.100.00.00 loaded ... [ 1151.978831] scsi 11:0:177:0: qdepth(16), tagged(1), simple(0), ordered(0), scsi_level(6), cmd_que(1) Back to the commit discussion.. on the ses_get_power_status() call introduced in ses_enclosure_data_process(): impact of removing it. That may possibly be in place to initialize the power status value on device probe. However, those 2 functions available to retrieve that value _do_ automatically refresh/update it. So the potential benefit would be a direct access of the 'power_status' field which does not use the callbacks... But the only reader of 'struct enclosure_component::power_status' is the get_component_power_status() callback for sysfs attribute, and it _does_ check for and call the .get_power_status callback, (which indeed is defined and implemented by that commit), so the power status value is, again, automatically updated. So, the remaining potential for a direct/non-callback access to the power_status attribute would be out-of-tree modules -- well, for those, if they are for whatever reason interested in values that are set during device probe and not up-to-date by the time they need it.. well, that would be curious. Well, to handle that more properly, set the initial power state value to '-1' (i.e., uninitialized) instead of '1' (power 'on'), and check for it in that callback which may do an direct access to the field value _if_ a callback function is not defined. Signed-off-by: Mauricio Faria de OliveiraFixes: 08024885a2a3 ("ses: Add power_status to SES device slot") --- v2: - remove module parameter. - better return values for -1/uninitalized power_status. (thanks: Dan Williams ) drivers/misc/enclosure.c | 7 ++- drivers/scsi/ses.c | 1 - 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c index 65fed7146e9b..d3fe3ea902d4 100644 --- a/drivers/misc/enclosure.c +++ b/drivers/misc/enclosure.c @@ -148,7 +148,7 @@ struct enclosure_device * for (i = 0; i < components; i++) { edev->component[i].number = -1; edev->component[i].slot = -1; - edev->component[i].power_status = 1; + edev->component[i].power_status = -1; } mutex_lock(_list_lock); @@ -594,6 +594,11 @@ static ssize_t get_component_power_status(struct device *cdev, if (edev->cb->get_power_status) edev->cb->get_power_status(edev, ecomp); + + /* If still uninitialized, the callback failed or does not exist. */ + if (ecomp->power_status == -1) + return (edev->cb->get_power_status) ? -EIO : -ENOTTY; + return snprintf(buf, 40, "%s\n", ecomp->power_status ? "on" : "off"); } diff --git a/drivers/scsi/ses.c b/drivers/scsi/ses.c index 50adabbb5808..f1cdf32d7514 100644 --- a/drivers/scsi/ses.c +++ b/drivers/scsi/ses.c @@ -548,7 +548,6 @@ static void ses_enclosure_data_process(struct enclosure_device *edev, ecomp = >component[components++]; if (!IS_ERR(ecomp)) { - ses_get_power_status(edev, ecomp); if (addl_desc_ptr) ses_process_descriptor( ecomp, -- 1.8.3.1
Re: [PATCHv5 8/8] scsi: inline command aborts
I think this also need to remove the abort_work field from struct scsi_cmnd. > } > > #ifdef CONFIG_SCSI_LOGGING > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c > index 53e3343..2355100 100644 > --- a/drivers/scsi/scsi_error.c > +++ b/drivers/scsi/scsi_error.c > @@ -115,11 +115,9 @@ static int scsi_host_eh_past_deadline(struct Scsi_Host > *shost) > * scmd_eh_abort_handler - Handle command aborts > * @work:command to be aborted. > */ > -void > -scmd_eh_abort_handler(struct work_struct *work) > +int > +scmd_eh_abort_handler(struct scsi_cmnd *scmd) > { > - struct scsi_cmnd *scmd = > - container_of(work, struct scsi_cmnd, abort_work.work); > struct scsi_device *sdev = scmd->device; > int rtn; > > @@ -127,42 +125,40 @@ static int scsi_host_eh_past_deadline(struct Scsi_Host > *shost) > SCSI_LOG_ERROR_RECOVERY(3, > scmd_printk(KERN_INFO, scmd, > "eh timeout, not aborting\n")); > - } else { > - SCSI_LOG_ERROR_RECOVERY(3, > - scmd_printk(KERN_INFO, scmd, > - "aborting command\n")); > - rtn = scsi_try_to_abort_cmd(sdev->host->hostt, scmd); > - if (rtn == SUCCESS) { > - set_host_byte(scmd, DID_TIME_OUT); > - if (scsi_host_eh_past_deadline(sdev->host)) { > - SCSI_LOG_ERROR_RECOVERY(3, > - scmd_printk(KERN_INFO, scmd, > - "eh timeout, not retrying " > - "aborted command\n")); > - } else if (!scsi_noretry_cmd(scmd) && > - (++scmd->retries <= scmd->allowed)) { > - SCSI_LOG_ERROR_RECOVERY(3, > - scmd_printk(KERN_WARNING, scmd, > - "retry aborted command\n")); > - scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY); > - return; > - } else { > - SCSI_LOG_ERROR_RECOVERY(3, > - scmd_printk(KERN_WARNING, scmd, > - "finish aborted > command\n")); > - scsi_finish_command(scmd); > - return; > - } > - } else { > + return FAILED; > + } > + SCSI_LOG_ERROR_RECOVERY(3, > + scmd_printk(KERN_INFO, scmd, > + "aborting command\n")); > + rtn = scsi_try_to_abort_cmd(sdev->host->hostt, scmd); > + if (rtn == SUCCESS) { If you restructure the function anywat this could become an early return / goto out.
Re: [PATCH 4/5] nvme: move the retries count to struct nvme_request
On Wed, Apr 05, 2017 at 04:18:55PM +0200, Christoph Hellwig wrote: > The way NVMe uses this field is entirely different from the older > SCSI/BLOCK_PC usage, so move it into struct nvme_request. > > Also reduce the size of the file to a unsigned char so that we leave space > for additional smaller fields that will appear soon. What new fields can we expect? Why temporarily pad the middle of the struct instead of appending to the end? The "result" packed nicely on 64-bit, at least. > struct nvme_request { > struct nvme_command *cmd; > + u8 retries; > union nvme_result result; > };
Re: [PATCHv5 7/8] scsi: make asynchronous aborts mandatory
Looks good, Reviewed-by: Christoph Hellwig
Re: [PATCHv5 6/8] scsi: make scsi_eh_scmd_add() always succeed
Looks good, Reviewed-by: Christoph Hellwig
Re: [PATCHv5 5/8] scsi: make eh_eflags persistent
Looks good, Reviewed-by: Christoph Hellwig
Re: [PATCHv5 3/8] scsi: always send command aborts
Looks good, Reviewed-by: Christoph Hellwig
Re: [PATCHv5 1/8] scsi_error: count medium access timeout only once per EH run
Looks good, Reviewed-by: Christoph Hellwig
Re: [PATCHv5 2/8] sd: Return SUCCESS in sd_eh_action() after device offline
Looks good, Reviewed-by: Christoph Hellwig
Re: [PATCH] scsi: ses: don't get power status of SES device slot on probe
On 04/05/2017 11:41 AM, Dan Williams wrote: On Wed, Apr 5, 2017 at 6:13 AM, Mauricio Faria de Oliveirawrote: 1) imagine .get_power_status couldn't update the 'power_status' field (it's a bit unlikely with the in-tree ses driver, but in the case that ses_get_page2_descriptor() returns NULL, ses_get_power_status() doesn't update 'power_status'). Ok, in that case we should probably return -EIO, if get_power_status is non-NULL, but the power_status is still -1. 2) an out-of-tree module employs another enclosure_component_callbacks structure, which doesn't implement a .get_power_status hook. If get_power_status is null and power_status is -1 I think we should return -ENOTTY to indicate the failure. 3) or it does, but that failed, and didn't update 'power_status' (e.g., like case 1) I think we're back -EIO for this. Absolutely. I see those are better values to convey the failure/reason. However, for an in-tree usage, it's clear that just dropping that line seemed to be sufficient -- the rest in the patch is just consideration for whoever might be using it in out-of-tree ways. (and if such consideration is deemed not to be required in this case, I'd be fine with dropping the related changes and making this patch a one-line. :- ) Let's not carry module parameters that only theoretically help an out of tree module. If such a driver exists its owner can just holler if this policy change breaks their usage, and then we can have a discussion about why their driver isn't upstream already. Okay, got it. For the record, the patch author has been in To:/Cc:, for such cases. I'll submit a v2. Thanks, -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH 1/5] nvme: move ->retries setup to nvme_setup_cmd
On Wed, Apr 05, 2017 at 04:43:34PM +0200, Johannes Thumshirn wrote: > On Wed, Apr 05, 2017 at 04:18:52PM +0200, Christoph Hellwig wrote: > > This way we get the behavior right for the non-PCIe transports. > > Could you please share a bit of your minds inner workings for us mere mortals? It's initialized to zero the first time the command is submitted and will then be incremented when queueing a retry.
Re: always use REQ_OP_WRITE_ZEROES for zeroing offload V2
Looks like the mail server crapped out under the load.. I'll resend tomorrow morning, in the meantime the git url below works. On Wed, Apr 05, 2017 at 04:21:38PM +0200, Christoph Hellwig wrote: > This series makes REQ_OP_WRITE_ZEROES the only zeroing offload > supported by the block layer, and switches existing implementations > of REQ_OP_DISCARD that correctly set discard_zeroes_data to it, > removes incorrect discard_zeroes_data, and also switches WRITE SAME > based zeroing in SCSI to this new method. > > The series is against the block for-next tree. > > A git tree is also avaiable at: > > git://git.infradead.org/users/hch/block.git discard-rework.2 > > Gitweb: > > > http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/discard-rework.2 > > Changes since V2: > - various spelling fixes > - various reviews captured > - two new patches from Martin at the end ---end quoted text---
Re: [PATCH 5/5] block, scsi: move the retries field to struct scsi_request
On Wed, Apr 05, 2017 at 04:18:56PM +0200, Christoph Hellwig wrote: > Instead of bloating the generic struct request with it. > > Signed-off-by: Christoph Hellwig> --- Reviewed-by: Johannes Thumshirn -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 2/5] nvme: cleanup nvme_req_needs_retry
On Wed, Apr 05, 2017 at 04:18:53PM +0200, Christoph Hellwig wrote: > Don't pass the status explicitly but derive it from the requeust, > and unwind the complex condition to be more readable. > > Signed-off-by: Christoph Hellwig> --- Looks good, Reviewed-by: Johannes Thumshirn -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 4/5] nvme: move the retries count to struct nvme_request
On Wed, Apr 05, 2017 at 04:18:55PM +0200, Christoph Hellwig wrote: > The way NVMe uses this field is entirely different from the older > SCSI/BLOCK_PC usage, so move it into struct nvme_request. > > Also reduce the size of the file to a unsigned char so that we leave space > for additional smaller fields that will appear soon. > > Signed-off-by: Christoph Hellwig> --- Reviewed-by: Johannes Thumshirn -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 3/5] nvme: mark nvme_max_retries static
On Wed, Apr 05, 2017 at 04:18:54PM +0200, Christoph Hellwig wrote: > Signed-off-by: Christoph Hellwig> --- Reviewed-by: Johannes Thumshirn -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 1/5] nvme: move ->retries setup to nvme_setup_cmd
On Wed, Apr 05, 2017 at 04:18:52PM +0200, Christoph Hellwig wrote: > This way we get the behavior right for the non-PCIe transports. Could you please share a bit of your minds inner workings for us mere mortals? Thanks, Johannes -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH] scsi: ses: don't get power status of SES device slot on probe
On Wed, Apr 5, 2017 at 6:13 AM, Mauricio Faria de Oliveirawrote: > Hi Dan, > > Thanks for reviewing. > > On 04/04/2017 06:07 PM, Dan Williams wrote: >>> >>> @@ -594,6 +594,10 @@ static ssize_t get_component_power_status(struct >>> device *cdev, >>> >>> if (edev->cb->get_power_status) >>> edev->cb->get_power_status(edev, ecomp); >>> + >>> + if (ecomp->power_status == -1) >>> + return -EINVAL; >>> + >> >> Can we ever hit this if get_power_status is non-null? > > > I admit that is cautionary, and more for consistency with the chance > of an uninitialized state being still in place, so not to return any > (incorrect) value in that state. > > But, yes, in a few cases -- although unlikely. > > 1) imagine .get_power_status couldn't update the 'power_status' field >(it's a bit unlikely with the in-tree ses driver, but in the case > that ses_get_page2_descriptor() returns NULL, ses_get_power_status() > doesn't update 'power_status'). Ok, in that case we should probably return -EIO, if get_power_status is non-NULL, but the power_status is still -1. > 2) an out-of-tree module employs another enclosure_component_callbacks >structure, which doesn't implement a .get_power_status hook. If get_power_status is null and power_status is -1 I think we should return -ENOTTY to indicate the failure. > 3) or it does, but that failed, and didn't update 'power_status' (e.g., >like case 1) I think we're back -EIO for this. > > >>> +static bool power_status_on_probe; /* default: false, 0. */ >>> +module_param(power_status_on_probe, bool, 0644); >>> +MODULE_PARM_DESC(power_status_on_probe, "get power status of SES device >>> slots " >>> + "(enclosure components) at probe >>> time " >>> + "(warning: may delay total probe >>> time)"); >>> + >> >> I don't think we need this module parameter as long as the only way to >> read the power status is through sysfs. We can always just leave it to >> be read on-demand when needed. > > > I agree for the in-tree module. My only concern, as described in the > commit message, is the case someone is using an out-of-tree module / > has an implementation that depends on that (e.g., reading the 'power_ > status' field directly, instead of using .get_power_status. > > I've been a bit overzealous when considering why that call was inserted > in ses_enclosure_data_process(), and didn't want to break them. > > However, for an in-tree usage, it's clear that just dropping that line > seemed to be sufficient -- the rest in the patch is just consideration > for whoever might be using it in out-of-tree ways. > > (and if such consideration is deemed not to be required in this case, > I'd be fine with dropping the related changes and making this patch > a one-line. :- ) Let's not carry module parameters that only theoretically help an out of tree module. If such a driver exists its owner can just holler if this policy change breaks their usage, and then we can have a discussion about why their driver isn't upstream already.
Re: [PATCH 27/39] Annotate hardware config module parameters in drivers/scsi/
Finn Thainwrote: > I can see how base addresses and IO ports are relevant, but the irq > parameter changes below don't protect the kernel image AFAICT. What's the > rationale for those changes? I think it should be stated here. Easier grepping for one. But I'm also currently preventing the changing of any hardware parameters. If a driver refuses to share an irq, you could use it to deprive another driver of the ability to get an irq at all. David
[PATCH 16/27] zram: implement REQ_OP_WRITE_ZEROES
Just the same as discard if the block size equals the system page size. Signed-off-by: Christoph HellwigReviewed-by: Hannes Reinecke --- drivers/block/zram/zram_drv.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index dceb5edd1e54..1710b06f04a7 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -829,10 +829,14 @@ static void __zram_make_request(struct zram *zram, struct bio *bio) offset = (bio->bi_iter.bi_sector & (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT; - if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) { + switch (bio_op(bio)) { + case REQ_OP_DISCARD: + case REQ_OP_WRITE_ZEROES: zram_bio_discard(zram, index, offset, bio); bio_endio(bio); return; + default: + break; } bio_for_each_segment(bvec, bio, iter) { @@ -1192,6 +1196,8 @@ static int zram_add(void) zram->disk->queue->limits.max_sectors = SECTORS_PER_PAGE; zram->disk->queue->limits.chunk_sectors = 0; blk_queue_max_discard_sectors(zram->disk->queue, UINT_MAX); + queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue); + /* * zram_bio_discard() will clear all logical blocks if logical block * size is identical with physical block size(PAGE_SIZE). But if it is @@ -1201,10 +1207,7 @@ static int zram_add(void) * zeroed. */ if (ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE) - zram->disk->queue->limits.discard_zeroes_data = 1; - else - zram->disk->queue->limits.discard_zeroes_data = 0; - queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue); + blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX); add_disk(zram->disk); -- 2.11.0
[PATCH 14/27] sd: implement unmapping Write Zeroes
Try to use a write same with unmap bit variant if the device supports it and the caller allows for it. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- drivers/scsi/sd.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index d8d9c0bdd93c..001593ed0444 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -803,6 +803,15 @@ static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd *cmd) u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9); u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9); + if (!(rq->cmd_flags & REQ_NOUNMAP)) { + switch (sdkp->provisioning_mode) { + case SD_LBP_WS16: + return sd_setup_write_same16_cmnd(cmd, true); + case SD_LBP_WS10: + return sd_setup_write_same10_cmnd(cmd, true); + } + } + if (sdp->no_write_same) return BLKPREP_INVALID; if (sdkp->ws16 || sector > 0x || nr_sectors > 0x) -- 2.11.0
[PATCH 13/27] block_dev: use blkdev_issue_zerout for hole punches
This gets us support for non-discard efficient write of zeroes (e.g. NVMe) and prepares for removing the discard_zeroes_data flag. Also remove a pointless discard support check, which is done in blkdev_issue_discard already. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- fs/block_dev.c | 10 ++ 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 2f704c3a816f..e405d8e58e31 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -2069,7 +2069,6 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start, loff_t len) { struct block_device *bdev = I_BDEV(bdev_file_inode(file)); - struct request_queue *q = bdev_get_queue(bdev); struct address_space *mapping; loff_t end = start + len - 1; loff_t isize; @@ -2108,15 +2107,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start, GFP_KERNEL, BLKDEV_ZERO_NOUNMAP); break; case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE: - /* Only punch if the device can do zeroing discard. */ - if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data) - return -EOPNOTSUPP; - error = blkdev_issue_discard(bdev, start >> 9, len >> 9, -GFP_KERNEL, 0); + error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, +GFP_KERNEL, BLKDEV_ZERO_NOFALLBACK); break; case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE: - if (!blk_queue_discard(q)) - return -EOPNOTSUPP; error = blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL, 0); break; -- 2.11.0
[PATCH 12/27] block: add a new BLKDEV_ZERO_NOFALLBACK flag
This avoids fallbacks to explicit zeroing in (__)blkdev_issue_zeroout if the caller doesn't want them. Also clean up the convoluted check for the return condition that this new flag is added to. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- block/blk-lib.c| 5 - include/linux/blkdev.h | 1 + 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 2f6d2cb2e1a2..2f882e22890b 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -281,6 +281,9 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, * * If a device is using logical block provisioning, the underlying space will * not be released if %flags contains BLKDEV_ZERO_NOUNMAP. + * + * If %flags contains BLKDEV_ZERO_NOFALLBACK, the function will return + * -EOPNOTSUPP if no explicit hardware offload for zeroing is provided. */ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, struct bio **biop, @@ -298,7 +301,7 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask, biop, flags); - if (ret == 0 || (ret && ret != -EOPNOTSUPP)) + if (ret != -EOPNOTSUPP || (flags & BLKDEV_ZERO_NOFALLBACK)) goto out; ret = 0; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index e7513ce3dbde..a5055d760661 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1351,6 +1351,7 @@ extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, struct bio **biop); #define BLKDEV_ZERO_NOUNMAP(1 << 0) /* do not free blocks */ +#define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */ extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, struct bio **biop, -- 2.11.0
[PATCH 11/27] block: add a REQ_NOUNMAP flag for REQ_OP_WRITE_ZEROES
If this flag is set logical provisioning capable device should release space for the zeroed blocks if possible, if it is not set devices should keep the blocks anchored. Also remove an out of sync kerneldoc comment for a static function that would have become even more out of data with this change. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- block/blk-lib.c | 19 +-- include/linux/blk_types.h | 6 ++ 2 files changed, 11 insertions(+), 14 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index f9f24ec69c27..2f6d2cb2e1a2 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -226,20 +226,9 @@ int blkdev_issue_write_same(struct block_device *bdev, sector_t sector, } EXPORT_SYMBOL(blkdev_issue_write_same); -/** - * __blkdev_issue_write_zeroes - generate number of bios with WRITE ZEROES - * @bdev: blockdev to issue - * @sector:start sector - * @nr_sects: number of sectors to write - * @gfp_mask: memory allocation flags (for bio_alloc) - * @biop: pointer to anchor bio - * - * Description: - * Generate and issue number of bios(REQ_OP_WRITE_ZEROES) with zerofiled pages. - */ static int __blkdev_issue_write_zeroes(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, - struct bio **biop) + struct bio **biop, unsigned flags) { struct bio *bio = *biop; unsigned int max_write_zeroes_sectors; @@ -258,7 +247,9 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, bio = next_bio(bio, 0, gfp_mask); bio->bi_iter.bi_sector = sector; bio->bi_bdev = bdev; - bio_set_op_attrs(bio, REQ_OP_WRITE_ZEROES, 0); + bio->bi_opf = REQ_OP_WRITE_ZEROES; + if (flags & BLKDEV_ZERO_NOUNMAP) + bio->bi_opf |= REQ_NOUNMAP; if (nr_sects > max_write_zeroes_sectors) { bio->bi_iter.bi_size = max_write_zeroes_sectors << 9; @@ -306,7 +297,7 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, return -EINVAL; ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask, - biop); + biop, flags); if (ret == 0 || (ret && ret != -EOPNOTSUPP)) goto out; diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 4eae30bfbfca..8eaa7dca7057 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -195,6 +195,10 @@ enum req_flag_bits { __REQ_PREFLUSH, /* request for cache flush */ __REQ_RAHEAD, /* read ahead, can fail anytime */ __REQ_BACKGROUND, /* background IO */ + + /* command specific flags for REQ_OP_WRITE_ZEROES: */ + __REQ_NOUNMAP, /* do not free blocks when zeroing */ + __REQ_NR_BITS, /* stops here */ }; @@ -212,6 +216,8 @@ enum req_flag_bits { #define REQ_RAHEAD (1ULL << __REQ_RAHEAD) #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND) +#define REQ_NOUNMAP(1ULL << __REQ_NOUNMAP) + #define REQ_FAILFAST_MASK \ (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER) -- 2.11.0
[PATCH 10/27] block: add a flags argument to (__)blkdev_issue_zeroout
Turn the existing discard flag into a new BLKDEV_ZERO_UNMAP flag with similar semantics, but without referring to diѕcard. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- block/blk-lib.c| 31 ++- block/ioctl.c | 2 +- drivers/block/drbd/drbd_receiver.c | 9 ++--- drivers/nvme/target/io-cmd.c | 2 +- fs/block_dev.c | 2 +- fs/dax.c | 2 +- fs/xfs/xfs_bmap_util.c | 2 +- include/linux/blkdev.h | 16 ++-- 8 files changed, 35 insertions(+), 31 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 2a8d638544a7..f9f24ec69c27 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -282,14 +282,18 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, * @nr_sects: number of sectors to write * @gfp_mask: memory allocation flags (for bio_alloc) * @biop: pointer to anchor bio - * @discard: discard flag + * @flags: controls detailed behavior * * Description: - * Generate and issue number of bios with zerofiled pages. + * Zero-fill a block range, either using hardware offload or by explicitly + * writing zeroes to the device. + * + * If a device is using logical block provisioning, the underlying space will + * not be released if %flags contains BLKDEV_ZERO_NOUNMAP. */ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, struct bio **biop, - bool discard) + unsigned flags) { int ret; int bi_size = 0; @@ -337,28 +341,21 @@ EXPORT_SYMBOL(__blkdev_issue_zeroout); * @sector:start sector * @nr_sects: number of sectors to write * @gfp_mask: memory allocation flags (for bio_alloc) - * @discard: whether to discard the block range + * @flags: controls detailed behavior * * Description: - * Zero-fill a block range. If the discard flag is set and the block - * device guarantees that subsequent READ operations to the block range - * in question will return zeroes, the blocks will be discarded. Should - * the discard request fail, if the discard flag is not set, or if - * discard_zeroes_data is not supported, this function will resort to - * zeroing the blocks manually, thus provisioning (allocating, - * anchoring) them. If the block device supports WRITE ZEROES or WRITE SAME - * command(s), blkdev_issue_zeroout() will use it to optimize the process of - * clearing the block range. Otherwise the zeroing will be performed - * using regular WRITE calls. + * Zero-fill a block range, either using hardware offload or by explicitly + * writing zeroes to the device. See __blkdev_issue_zeroout() for the + * valid values for %flags. */ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, -sector_t nr_sects, gfp_t gfp_mask, bool discard) + sector_t nr_sects, gfp_t gfp_mask, unsigned flags) { int ret; struct bio *bio = NULL; struct blk_plug plug; - if (discard) { + if (!(flags & BLKDEV_ZERO_NOUNMAP)) { if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, BLKDEV_DISCARD_ZERO)) return 0; @@ -366,7 +363,7 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, blk_start_plug(); ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, - , discard); + , flags); if (ret == 0 && bio) { ret = submit_bio_wait(bio); bio_put(bio); diff --git a/block/ioctl.c b/block/ioctl.c index 7b88820b93d9..8ea00a41be01 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -255,7 +255,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode, truncate_inode_pages_range(mapping, start, end); return blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL, - false); + BLKDEV_ZERO_NOUNMAP); } static int put_ushort(unsigned long arg, unsigned short val) diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c index aa6bf9692eff..dc9a6dcd431c 100644 --- a/drivers/block/drbd/drbd_receiver.c +++ b/drivers/block/drbd/drbd_receiver.c @@ -1499,19 +1499,22 @@ int drbd_issue_discard_or_zero_out(struct drbd_device *device, sector_t start, u tmp = start + granularity - sector_div(tmp, granularity); nr = tmp - start; - err |= blkdev_issue_zeroout(bdev, start, nr, GFP_NOIO, 0); + err |= blkdev_issue_zeroout(bdev, start, nr, GFP_NOIO, + BLKDEV_ZERO_NOUNMAP); nr_sectors -= nr;
[PATCH 09/27] block: stop using blkdev_issue_write_same for zeroing
We'll always use the WRITE ZEROES code for zeroing now. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- block/blk-lib.c | 4 1 file changed, 4 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index e5b853f2b8a2..2a8d638544a7 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -364,10 +364,6 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, return 0; } - if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask, - ZERO_PAGE(0))) - return 0; - blk_start_plug(); ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, , discard); -- 2.11.0
[PATCH 01/27] sd: split sd_setup_discard_cmnd
Split sd_setup_discard_cmnd into one function per provisioning type. While this creates some very slight duplication of boilerplate code it keeps the code modular for additions of new provisioning types, and for reusing the write same functions for the upcoming scsi implementation of the Write Zeroes operation. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- drivers/scsi/sd.c | 153 ++ 1 file changed, 84 insertions(+), 69 deletions(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index fcfeddc79331..b853f91fb3da 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -701,93 +701,97 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode) queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q); } -/** - * sd_setup_discard_cmnd - unmap blocks on thinly provisioned device - * @sdp: scsi device to operate on - * @rq: Request to prepare - * - * Will issue either UNMAP or WRITE SAME(16) depending on preference - * indicated by target device. - **/ -static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd) +static int sd_setup_unmap_cmnd(struct scsi_cmnd *cmd) { - struct request *rq = cmd->request; struct scsi_device *sdp = cmd->device; - struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); - sector_t sector = blk_rq_pos(rq); - unsigned int nr_sectors = blk_rq_sectors(rq); - unsigned int len; - int ret; + struct request *rq = cmd->request; + u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9); + u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9); + unsigned int data_len = 24; char *buf; - struct page *page; - sector >>= ilog2(sdp->sector_size) - 9; - nr_sectors >>= ilog2(sdp->sector_size) - 9; - - page = alloc_page(GFP_ATOMIC | __GFP_ZERO); - if (!page) + rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO); + if (!rq->special_vec.bv_page) return BLKPREP_DEFER; + rq->special_vec.bv_offset = 0; + rq->special_vec.bv_len = data_len; + rq->rq_flags |= RQF_SPECIAL_PAYLOAD; - switch (sdkp->provisioning_mode) { - case SD_LBP_UNMAP: - buf = page_address(page); - - cmd->cmd_len = 10; - cmd->cmnd[0] = UNMAP; - cmd->cmnd[8] = 24; - - put_unaligned_be16(6 + 16, [0]); - put_unaligned_be16(16, [2]); - put_unaligned_be64(sector, [8]); - put_unaligned_be32(nr_sectors, [16]); + cmd->cmd_len = 10; + cmd->cmnd[0] = UNMAP; + cmd->cmnd[8] = 24; - len = 24; - break; + buf = page_address(rq->special_vec.bv_page); + put_unaligned_be16(6 + 16, [0]); + put_unaligned_be16(16, [2]); + put_unaligned_be64(sector, [8]); + put_unaligned_be32(nr_sectors, [16]); - case SD_LBP_WS16: - cmd->cmd_len = 16; - cmd->cmnd[0] = WRITE_SAME_16; - cmd->cmnd[1] = 0x8; /* UNMAP */ - put_unaligned_be64(sector, >cmnd[2]); - put_unaligned_be32(nr_sectors, >cmnd[10]); + cmd->allowed = SD_MAX_RETRIES; + cmd->transfersize = data_len; + rq->timeout = SD_TIMEOUT; + scsi_req(rq)->resid_len = data_len; - len = sdkp->device->sector_size; - break; + return scsi_init_io(cmd); +} - case SD_LBP_WS10: - case SD_LBP_ZERO: - cmd->cmd_len = 10; - cmd->cmnd[0] = WRITE_SAME; - if (sdkp->provisioning_mode == SD_LBP_WS10) - cmd->cmnd[1] = 0x8; /* UNMAP */ - put_unaligned_be32(sector, >cmnd[2]); - put_unaligned_be16(nr_sectors, >cmnd[7]); +static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd) +{ + struct scsi_device *sdp = cmd->device; + struct request *rq = cmd->request; + u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9); + u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9); + u32 data_len = sdp->sector_size; - len = sdkp->device->sector_size; - break; + rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO); + if (!rq->special_vec.bv_page) + return BLKPREP_DEFER; + rq->special_vec.bv_offset = 0; + rq->special_vec.bv_len = data_len; + rq->rq_flags |= RQF_SPECIAL_PAYLOAD; - default: - ret = BLKPREP_INVALID; - goto out; - } + cmd->cmd_len = 16; + cmd->cmnd[0] = WRITE_SAME_16; + cmd->cmnd[1] = 0x8; /* UNMAP */ + put_unaligned_be64(sector, >cmnd[2]); + put_unaligned_be32(nr_sectors, >cmnd[10]); + cmd->allowed = SD_MAX_RETRIES; +
always use REQ_OP_WRITE_ZEROES for zeroing offload V2
This series makes REQ_OP_WRITE_ZEROES the only zeroing offload supported by the block layer, and switches existing implementations of REQ_OP_DISCARD that correctly set discard_zeroes_data to it, removes incorrect discard_zeroes_data, and also switches WRITE SAME based zeroing in SCSI to this new method. The series is against the block for-next tree. A git tree is also avaiable at: git://git.infradead.org/users/hch/block.git discard-rework.2 Gitweb: http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/discard-rework.2 Changes since V2: - various spelling fixes - various reviews captured - two new patches from Martin at the end
[PATCH 02/27] block: renumber REQ_OP_WRITE_ZEROES
Make life easy for implementations that needs to send a data buffer to the device (e.g. SCSI) by numbering it as a data out command. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- include/linux/blk_types.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 67bcf8a5326e..4eae30bfbfca 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -168,7 +168,7 @@ enum req_opf { /* write the same sector many times */ REQ_OP_WRITE_SAME = 7, /* write the zero filled sector many times */ - REQ_OP_WRITE_ZEROES = 8, + REQ_OP_WRITE_ZEROES = 9, /* SCSI passthrough using struct scsi_request */ REQ_OP_SCSI_IN = 32, -- 2.11.0
[PATCH 04/27] sd: implement REQ_OP_WRITE_ZEROES
Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- drivers/scsi/sd.c | 31 ++- drivers/scsi/sd_zbc.c | 1 + 2 files changed, 27 insertions(+), 5 deletions(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index b853f91fb3da..d8d9c0bdd93c 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -735,7 +735,7 @@ static int sd_setup_unmap_cmnd(struct scsi_cmnd *cmd) return scsi_init_io(cmd); } -static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd) +static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd, bool unmap) { struct scsi_device *sdp = cmd->device; struct request *rq = cmd->request; @@ -752,13 +752,14 @@ static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd) cmd->cmd_len = 16; cmd->cmnd[0] = WRITE_SAME_16; - cmd->cmnd[1] = 0x8; /* UNMAP */ + if (unmap) + cmd->cmnd[1] = 0x8; /* UNMAP */ put_unaligned_be64(sector, >cmnd[2]); put_unaligned_be32(nr_sectors, >cmnd[10]); cmd->allowed = SD_MAX_RETRIES; cmd->transfersize = data_len; - rq->timeout = SD_TIMEOUT; + rq->timeout = unmap ? SD_TIMEOUT : SD_WRITE_SAME_TIMEOUT; scsi_req(rq)->resid_len = data_len; return scsi_init_io(cmd); @@ -788,12 +789,27 @@ static int sd_setup_write_same10_cmnd(struct scsi_cmnd *cmd, bool unmap) cmd->allowed = SD_MAX_RETRIES; cmd->transfersize = data_len; - rq->timeout = SD_TIMEOUT; + rq->timeout = unmap ? SD_TIMEOUT : SD_WRITE_SAME_TIMEOUT; scsi_req(rq)->resid_len = data_len; return scsi_init_io(cmd); } +static int sd_setup_write_zeroes_cmnd(struct scsi_cmnd *cmd) +{ + struct request *rq = cmd->request; + struct scsi_device *sdp = cmd->device; + struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); + u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9); + u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9); + + if (sdp->no_write_same) + return BLKPREP_INVALID; + if (sdkp->ws16 || sector > 0x || nr_sectors > 0x) + return sd_setup_write_same16_cmnd(cmd, false); + return sd_setup_write_same10_cmnd(cmd, false); +} + static void sd_config_write_same(struct scsi_disk *sdkp) { struct request_queue *q = sdkp->disk->queue; @@ -823,6 +839,8 @@ static void sd_config_write_same(struct scsi_disk *sdkp) out: blk_queue_max_write_same_sectors(q, sdkp->max_ws_blocks * (logical_block_size >> 9)); + blk_queue_max_write_zeroes_sectors(q, sdkp->max_ws_blocks * +(logical_block_size >> 9)); } /** @@ -1163,7 +1181,7 @@ static int sd_init_command(struct scsi_cmnd *cmd) case SD_LBP_UNMAP: return sd_setup_unmap_cmnd(cmd); case SD_LBP_WS16: - return sd_setup_write_same16_cmnd(cmd); + return sd_setup_write_same16_cmnd(cmd, true); case SD_LBP_WS10: return sd_setup_write_same10_cmnd(cmd, true); case SD_LBP_ZERO: @@ -1171,6 +1189,8 @@ static int sd_init_command(struct scsi_cmnd *cmd) default: return BLKPREP_INVALID; } + case REQ_OP_WRITE_ZEROES: + return sd_setup_write_zeroes_cmnd(cmd); case REQ_OP_WRITE_SAME: return sd_setup_write_same_cmnd(cmd); case REQ_OP_FLUSH: @@ -1810,6 +1830,7 @@ static int sd_done(struct scsi_cmnd *SCpnt) switch (req_op(req)) { case REQ_OP_DISCARD: + case REQ_OP_WRITE_ZEROES: case REQ_OP_WRITE_SAME: case REQ_OP_ZONE_RESET: if (!result) { diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c index 92620c8ea8ad..1994f7799fce 100644 --- a/drivers/scsi/sd_zbc.c +++ b/drivers/scsi/sd_zbc.c @@ -329,6 +329,7 @@ void sd_zbc_complete(struct scsi_cmnd *cmd, switch (req_op(rq)) { case REQ_OP_WRITE: + case REQ_OP_WRITE_ZEROES: case REQ_OP_WRITE_SAME: case REQ_OP_ZONE_RESET: -- 2.11.0
[PATCH 05/27] md: support REQ_OP_WRITE_ZEROES
Copy & paste from the REQ_OP_WRITE_SAME code. Signed-off-by: Christoph HellwigReviewed-by: Hannes Reinecke --- drivers/md/linear.c| 1 + drivers/md/md.h| 7 +++ drivers/md/multipath.c | 1 + drivers/md/raid0.c | 2 ++ drivers/md/raid1.c | 4 +++- drivers/md/raid10.c| 1 + drivers/md/raid5.c | 1 + 7 files changed, 16 insertions(+), 1 deletion(-) diff --git a/drivers/md/linear.c b/drivers/md/linear.c index 3e38e0207a3e..377a8a3672e3 100644 --- a/drivers/md/linear.c +++ b/drivers/md/linear.c @@ -293,6 +293,7 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio) split, disk_devt(mddev->gendisk), bio_sector); mddev_check_writesame(mddev, split); + mddev_check_write_zeroes(mddev, split); generic_make_request(split); } } while (split != bio); diff --git a/drivers/md/md.h b/drivers/md/md.h index dde8ecb760c8..1e76d64ce180 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -709,4 +709,11 @@ static inline void mddev_check_writesame(struct mddev *mddev, struct bio *bio) !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors) mddev->queue->limits.max_write_same_sectors = 0; } + +static inline void mddev_check_write_zeroes(struct mddev *mddev, struct bio *bio) +{ + if (bio_op(bio) == REQ_OP_WRITE_ZEROES && + !bdev_get_queue(bio->bi_bdev)->limits.max_write_zeroes_sectors) + mddev->queue->limits.max_write_zeroes_sectors = 0; +} #endif /* _MD_MD_H */ diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c index 79a12b59250b..e95d521d93e9 100644 --- a/drivers/md/multipath.c +++ b/drivers/md/multipath.c @@ -139,6 +139,7 @@ static void multipath_make_request(struct mddev *mddev, struct bio * bio) mp_bh->bio.bi_end_io = multipath_end_request; mp_bh->bio.bi_private = mp_bh; mddev_check_writesame(mddev, _bh->bio); + mddev_check_write_zeroes(mddev, _bh->bio); generic_make_request(_bh->bio); return; } diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index 93347ca7c7a6..ce7a6a56cf73 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -383,6 +383,7 @@ static int raid0_run(struct mddev *mddev) blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors); blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors); + blk_queue_max_write_zeroes_sectors(mddev->queue, mddev->chunk_sectors); blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors); blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9); @@ -504,6 +505,7 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio) split, disk_devt(mddev->gendisk), bio_sector); mddev_check_writesame(mddev, split); + mddev_check_write_zeroes(mddev, split); generic_make_request(split); } } while (split != bio); diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index a34f58772022..b59cc100320a 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -3177,8 +3177,10 @@ static int raid1_run(struct mddev *mddev) if (IS_ERR(conf)) return PTR_ERR(conf); - if (mddev->queue) + if (mddev->queue) { blk_queue_max_write_same_sectors(mddev->queue, 0); + blk_queue_max_write_zeroes_sectors(mddev->queue, 0); + } rdev_for_each(rdev, mddev) { if (!mddev->gendisk) diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index e89a8d78a9ed..28ec3a93acee 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -3749,6 +3749,7 @@ static int raid10_run(struct mddev *mddev) blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors); blk_queue_max_write_same_sectors(mddev->queue, 0); + blk_queue_max_write_zeroes_sectors(mddev->queue, 0); blk_queue_io_min(mddev->queue, chunk_size); if (conf->geo.raid_disks % conf->geo.near_copies) blk_queue_io_opt(mddev->queue, chunk_size * conf->geo.raid_disks); diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index ed5cd705b985..8cf1f86dcd05 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -7272,6 +7272,7 @@ static int raid5_run(struct mddev *mddev) mddev->queue->limits.discard_zeroes_data = 0; blk_queue_max_write_same_sectors(mddev->queue, 0); + blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
[PATCH 03/27] block: implement splitting of REQ_OP_WRITE_ZEROES bios
Copy and past the REQ_OP_WRITE_SAME code to prepare to implementations that limit the write zeroes size. Signed-off-by: Christoph HellwigReviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- block/blk-merge.c | 17 +++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/block/blk-merge.c b/block/blk-merge.c index 2afa262425d1..3990ae406341 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -54,6 +54,20 @@ static struct bio *blk_bio_discard_split(struct request_queue *q, return bio_split(bio, split_sectors, GFP_NOIO, bs); } +static struct bio *blk_bio_write_zeroes_split(struct request_queue *q, + struct bio *bio, struct bio_set *bs, unsigned *nsegs) +{ + *nsegs = 1; + + if (!q->limits.max_write_zeroes_sectors) + return NULL; + + if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors) + return NULL; + + return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs); +} + static struct bio *blk_bio_write_same_split(struct request_queue *q, struct bio *bio, struct bio_set *bs, @@ -200,8 +214,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio, split = blk_bio_discard_split(q, *bio, bs, ); break; case REQ_OP_WRITE_ZEROES: - split = NULL; - nsegs = (*bio)->bi_phys_segments; + split = blk_bio_write_zeroes_split(q, *bio, bs, ); break; case REQ_OP_WRITE_SAME: split = blk_bio_write_same_split(q, *bio, bs, ); -- 2.11.0
[PATCH 06/27] dm io: discards don't take a payload
Fix up do_region to not allocate a bio_vec for discards. We've got rid of the discard payload allocated by the caller years ago. Obviously this wasn't actually harmful given how long it's been there, but it's still good to avoid the pointless allocation. Signed-off-by: Christoph HellwigReviewed-by: Hannes Reinecke --- drivers/md/dm-io.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c index 03940bf36f6c..b808cbe22678 100644 --- a/drivers/md/dm-io.c +++ b/drivers/md/dm-io.c @@ -328,11 +328,17 @@ static void do_region(int op, int op_flags, unsigned region, /* * Allocate a suitably sized-bio. */ - if ((op == REQ_OP_DISCARD) || (op == REQ_OP_WRITE_SAME)) + switch (op) { + case REQ_OP_DISCARD: + num_bvecs = 0; + break; + case REQ_OP_WRITE_SAME: num_bvecs = 1; - else + break; + default: num_bvecs = min_t(int, BIO_MAX_PAGES, dm_sector_div_up(remaining, (PAGE_SIZE >> SECTOR_SHIFT))); + } bio = bio_alloc_bioset(GFP_NOIO, num_bvecs, io->client->bios); bio->bi_iter.bi_sector = where->sector + (where->count - remaining); -- 2.11.0
[PATCH 07/27] dm: support REQ_OP_WRITE_ZEROES
Copy & paste from the REQ_OP_WRITE_SAME code. Signed-off-by: Christoph HellwigReviewed-by: Hannes Reinecke --- drivers/md/dm-core.h | 1 + drivers/md/dm-io.c| 8 ++-- drivers/md/dm-linear.c| 1 + drivers/md/dm-mpath.c | 1 + drivers/md/dm-rq.c| 11 --- drivers/md/dm-stripe.c| 2 ++ drivers/md/dm-table.c | 30 ++ drivers/md/dm.c | 31 --- include/linux/device-mapper.h | 6 ++ 9 files changed, 83 insertions(+), 8 deletions(-) diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h index 136fda3ff9e5..fea5bd52ada8 100644 --- a/drivers/md/dm-core.h +++ b/drivers/md/dm-core.h @@ -132,6 +132,7 @@ void dm_init_md_queue(struct mapped_device *md); void dm_init_normal_md_queue(struct mapped_device *md); int md_in_flight(struct mapped_device *md); void disable_write_same(struct mapped_device *md); +void disable_write_zeroes(struct mapped_device *md); static inline struct completion *dm_get_completion_from_kobject(struct kobject *kobj) { diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c index b808cbe22678..3702e502466d 100644 --- a/drivers/md/dm-io.c +++ b/drivers/md/dm-io.c @@ -312,9 +312,12 @@ static void do_region(int op, int op_flags, unsigned region, */ if (op == REQ_OP_DISCARD) special_cmd_max_sectors = q->limits.max_discard_sectors; + else if (op == REQ_OP_WRITE_ZEROES) + special_cmd_max_sectors = q->limits.max_write_zeroes_sectors; else if (op == REQ_OP_WRITE_SAME) special_cmd_max_sectors = q->limits.max_write_same_sectors; - if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_SAME) && + if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES || +op == REQ_OP_WRITE_SAME) && special_cmd_max_sectors == 0) { dec_count(io, region, -EOPNOTSUPP); return; @@ -330,6 +333,7 @@ static void do_region(int op, int op_flags, unsigned region, */ switch (op) { case REQ_OP_DISCARD: + case REQ_OP_WRITE_ZEROES: num_bvecs = 0; break; case REQ_OP_WRITE_SAME: @@ -347,7 +351,7 @@ static void do_region(int op, int op_flags, unsigned region, bio_set_op_attrs(bio, op, op_flags); store_io_and_region_in_bio(bio, io, region); - if (op == REQ_OP_DISCARD) { + if (op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES) { num_sectors = min_t(sector_t, special_cmd_max_sectors, remaining); bio->bi_iter.bi_size = num_sectors << SECTOR_SHIFT; remaining -= num_sectors; diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c index 4788b0b989a9..e17fd44ceef5 100644 --- a/drivers/md/dm-linear.c +++ b/drivers/md/dm-linear.c @@ -59,6 +59,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv) ti->num_flush_bios = 1; ti->num_discard_bios = 1; ti->num_write_same_bios = 1; + ti->num_write_zeroes_bios = 1; ti->private = lc; return 0; diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c index 7f223dbed49f..ab55955ed704 100644 --- a/drivers/md/dm-mpath.c +++ b/drivers/md/dm-mpath.c @@ -1103,6 +1103,7 @@ static int multipath_ctr(struct dm_target *ti, unsigned argc, char **argv) ti->num_flush_bios = 1; ti->num_discard_bios = 1; ti->num_write_same_bios = 1; + ti->num_write_zeroes_bios = 1; if (m->queue_mode == DM_TYPE_BIO_BASED) ti->per_io_data_size = multipath_per_bio_data_size(); else diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c index 6886bf160fb2..a789bf035621 100644 --- a/drivers/md/dm-rq.c +++ b/drivers/md/dm-rq.c @@ -298,9 +298,14 @@ static void dm_done(struct request *clone, int error, bool mapped) r = rq_end_io(tio->ti, clone, error, >info); } - if (unlikely(r == -EREMOTEIO && (req_op(clone) == REQ_OP_WRITE_SAME) && -!clone->q->limits.max_write_same_sectors)) - disable_write_same(tio->md); + if (unlikely(r == -EREMOTEIO)) { + if (req_op(clone) == REQ_OP_WRITE_SAME && + !clone->q->limits.max_write_same_sectors) + disable_write_same(tio->md); + if (req_op(clone) == REQ_OP_WRITE_ZEROES && + !clone->q->limits.max_write_zeroes_sectors) + disable_write_zeroes(tio->md); + } if (r <= 0) /* The target wants to complete the I/O */ diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c index 28193a57bf47..5ef49c121d99 100644 --- a/drivers/md/dm-stripe.c +++ b/drivers/md/dm-stripe.c @@ -169,6
[PATCH 08/27] dm kcopyd: switch to use REQ_OP_WRITE_ZEROES
It seems like the code currently passes whatever it was using for writes to WRITE SAME. Just switch it to WRITE ZEROES, although that doesn't need any payload. Untested, and confused by the code, maybe someone who understands it better than me can help.. Signed-off-by: Christoph HellwigReviewed-by: Hannes Reinecke --- drivers/md/dm-kcopyd.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c index 9e9d04cb7d51..f85846741d50 100644 --- a/drivers/md/dm-kcopyd.c +++ b/drivers/md/dm-kcopyd.c @@ -733,11 +733,11 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from, job->pages = _page_list; /* -* Use WRITE SAME to optimize zeroing if all dests support it. +* Use WRITE ZEROES to optimize zeroing if all dests support it. */ - job->rw = REQ_OP_WRITE_SAME; + job->rw = REQ_OP_WRITE_ZEROES; for (i = 0; i < job->num_dests; i++) - if (!bdev_write_same(job->dests[i].bdev)) { + if (!bdev_write_zeroes_sectors(job->dests[i].bdev)) { job->rw = WRITE; break; } -- 2.11.0
[PATCH 17/27] loop: implement REQ_OP_WRITE_ZEROES
It's identical to discard as hole punches will always leave us with zeroes on reads. Signed-off-by: Christoph HellwigReviewed-by: Hannes Reinecke --- drivers/block/loop.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index cc981f34e017..3bb04c1a4ba1 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -528,6 +528,7 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq) case REQ_OP_FLUSH: return lo_req_flush(lo, rq); case REQ_OP_DISCARD: + case REQ_OP_WRITE_ZEROES: return lo_discard(lo, rq, pos); case REQ_OP_WRITE: if (lo->transfer) @@ -826,6 +827,7 @@ static void loop_config_discard(struct loop_device *lo) q->limits.discard_granularity = 0; q->limits.discard_alignment = 0; blk_queue_max_discard_sectors(q, 0); + blk_queue_max_write_zeroes_sectors(q, 0); q->limits.discard_zeroes_data = 0; queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q); return; @@ -834,6 +836,7 @@ static void loop_config_discard(struct loop_device *lo) q->limits.discard_granularity = inode->i_sb->s_blocksize; q->limits.discard_alignment = 0; blk_queue_max_discard_sectors(q, UINT_MAX >> 9); + blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9); q->limits.discard_zeroes_data = 1; queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q); } @@ -1660,6 +1663,7 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx, switch (req_op(cmd->rq)) { case REQ_OP_FLUSH: case REQ_OP_DISCARD: + case REQ_OP_WRITE_ZEROES: cmd->use_aio = false; break; default: -- 2.11.0