Re: [PATCH v9 09/10] block, scsi: Make SCSI quiesce and resume work reliably
On 10/17/2017 10:18 PM, Bart Van Assche wrote:
> On Tue, 2017-10-17 at 08:33 +0200, Hannes Reinecke wrote:
>> How do you ensure that PREEMPT requests are not stuck in the queue
>> _behind_ non-PREEMPT requests? Once they are in the queue the requests
>> are already allocated, so your deferred allocation won't work.
>> _And_ deferred requests will be re-inserted at the head of the queue.
>> Consequently the PREEMPT request will never be scheduled during quiesce.
>> How do you avoid such a scenario?
>
> Hello Hannes,
>
> The blk_mq_freeze_queue() / blk_mq_unfreeze_queue() calls that have been
> added through patch 9/10 to scsi_device_quiesce() will only succeed after
> all outstanding requests have finished. That includes non-preempt requests
> and requests that are about to be requeued. The changes in
> scsi_device_quiesce() are such that blk_queue_enter() will fail for
> non-preempt requests after the queue has been unfrozen. In other words, if
> a scsi_device_quiesce() call succeeds, it is guaranteed that there are no
> non-preempt requests in the queue and also that no new non-preempt
> requests will enter the queue until scsi_device_resume() is called.

Ah. Right. That was the explanation I was after.

Reviewed-by: Hannes Reinecke

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Teamlead Storage & Networking
h...@suse.de                       +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH] sd: always retry READ CAPACITY for ALUA state transition
On 10/17/2017 03:57 PM, James Bottomley wrote:
> On Tue, 2017-10-17 at 09:11 +0200, Hannes Reinecke wrote:
>> During ALUA state transitions the device might return a sense code
>> 02/04/0a (Logical unit not accessible, asymmetric access state
>> transition). As this is a transient error we should just retry the
>> READ CAPACITY call until the state transition finishes and the
>> correct capacity can be returned.
>
> This will lock up the system if some ALUA target gets into a state
> where it always returns transitioning and never completes, which
> doesn't look like the best way to handle problem devices.
>
> I thought after the ALUA transition the LUN gives a unit attention ...
> can't you use that in some way to trigger the capacity re-read, i.e. do
> asynchronous event notification instead of polling forever?

Hmm. Will give it a try.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Teamlead Storage & Networking
h...@suse.de                       +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH v10 00/10] block, scsi, md: Improve suspend and resume
I've indeed tested some previous versions of this patchset, but still use
Ming's solution. Can it be clarified which one (Bart's or Ming's) is the
correct approach?

On Wednesday, 18 October 2017 1:28:14 CEST Jens Axboe wrote:
> On 10/17/2017 05:26 PM, Bart Van Assche wrote:
> > Hello Jens,
> >
> > It is known that during the resume following a hibernate, especially
> > when using an md RAID1 array created on top of SCSI devices, sometimes
> > the system hangs instead of coming up properly. This patch series fixes
> > that problem. These patches have been tested on top of the block layer
> > for-next branch. Please consider these changes for kernel v4.15.
>
> The folks that had trouble with this issue, have they tested the
> patchset? I'd like to get a solution queued up for this sooner rather
> than later, but I'd greatly prefer if we had some folks that have
> tested and verified it.
Re: [PATCH 1/1] qla2xxx: Initialize Work element before requesting IRQs
On Mon, 2017-10-16 at 11:26 -0700, Madhani, Himanshu wrote:
> diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
> index 937209805baf..3bd956d3bc5d 100644
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -3061,6 +3061,8 @@ qla2x00_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
> 	    host->max_cmd_len, host->max_channel, host->max_lun,
> 	    host->transportt, sht->vendor_id);
>
> +	INIT_WORK(&base_vha->iocb_work, qla2x00_iocb_work_fn);
> +
> 	/* Set up the irqs */
> 	ret = qla2x00_request_irqs(ha, rsp);
> 	if (ret)
> @@ -3175,8 +3177,6 @@ qla2x00_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
> 	    host->can_queue, base_vha->req,
> 	    base_vha->mgmt_svr_loop_id, host->sg_tablesize);
>
> -	INIT_WORK(&base_vha->iocb_work, qla2x00_iocb_work_fn);
> -
> 	if (ha->mqenable) {
> 		bool mq = false;
> 		bool startit = false;

Hello Himanshu,

That patch indeed fixes the bug described in the patch description when
applied on top of kernel v4.13.7. However, with that patch applied I ran
into another bug. Can you have a look?

BUG: unable to handle kernel NULL pointer dereference at 0190
IP: qlt_free_session_done+0x172/0x570 [qla2xxx]
PGD 0 P4D 0
Oops: [#1] SMP
CPU: 0 PID: 47 Comm: kworker/0:1 Not tainted 4.13.7+ #1
Hardware name: Gigabyte Technology Co., Ltd. Z97X-UD5H/Z97X-UD5H, BIOS F10 08/03/2015
Workqueue: events qlt_free_session_done [qla2xxx]
task: 9c4bcee94300 task.stack: ba99c01d4000
RIP: 0010:qlt_free_session_done+0x172/0x570 [qla2xxx]
RSP: 0018:ba99c01d7dc8 EFLAGS: 00010046
RAX: RBX: 9c4bb5c95720 RCX: c09174c8
RDX: 0001 RSI: 9c4bb5c95720 RDI: 9c4baf7e5ce4
RBP: ba99c01d7e50 R08: c0903620 R09: 9c4bc840e400
R10: ba99c01d7db0 R11: R12: 9c4bc840e400
R13: R14: 9c4baf7e5000 R15: 9c4bc840e4c0
FS: () GS:9c4bdfa0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 0190 CR3: 00040c624000 CR4: 001426f0
Call Trace:
 ? qlt_unreg_sess+0xfe/0x110 [qla2xxx]
 ? qla24xx_delete_sess_fn+0x69/0x80 [qla2xxx]
 process_one_work+0x1d6/0x3d0
 worker_thread+0x42/0x3e0
 kthread+0x11f/0x140
 ? trace_event_raw_event_workqueue_execute_start+0x90/0x90
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x22/0x30
Code: 00 00 00 00 41 c7 87 ac 00 00 00 07 00 00 00 83 e0 f9 83 c8 04 41 f6 87 71 ff ff ff 02 41 88 87 70 ff ff ff 48 8b 83 40 04 00 00 <8b> 80 90 01 00 00 41 89 47 4c 74 22 41 8b 87 68 ff ff ff 25 00
RIP: qlt_free_session_done+0x172/0x570 [qla2xxx] RSP: ba99c01d7dc8
CR2: 0190
---[ end trace 89dee74f51a05258 ]---

(gdb) list *(qlt_free_session_done+0x172)
0x661c2 is in qlt_free_session_done (drivers/scsi/qla2xxx/qla_target.c:1027).
1022		}
1023
1024		sess->disc_state = DSC_DELETED;
1025		sess->fw_login_state = DSC_LS_PORT_UNAVAIL;
1026		sess->deleted = QLA_SESS_DELETED;
1027		sess->login_retry = vha->hw->login_retry_count;
1028
1029		if (sess->login_succ && !IS_SW_RESV_ADDR(sess->d_id)) {
1030			vha->fcport_count--;
1031			sess->login_succ = 0;
[Bug 195285] qla2xxx FW immediately crashing after target start
https://bugzilla.kernel.org/show_bug.cgi?id=195285

--- Comment #14 from himanshu.madh...@cavium.com ---
I am OOO. I will respond to your message when I am back at work.

Thanks,
Himanshu

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
[Bug 195285] qla2xxx FW immediately crashing after target start
https://bugzilla.kernel.org/show_bug.cgi?id=195285

Anthony (anthony.blood...@gmail.com) changed:

           What       |Removed      |Added
----------------------------------------------
         Status       |NEW          |RESOLVED
     Resolution       |---          |CODE_FIX

--- Comment #13 from Anthony (anthony.blood...@gmail.com) ---
Patch in upstream now. Please close the bug.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
Re: [PATCH] scsi: gdth: Convert timers to use timer_setup()
On Tue, Oct 17, 2017 at 8:06 PM, Martin K. Petersen wrote:
>
> Kees,
>
>> Thanks for the reviews! Do you want the timer tree to carry these
>> patches, or can you pick them up in the scsi tree?
>
> Up to you. I'm not going to rebase my 4.15 queue this late in the cycle
> so the patches would end up going in at the end of the merge window. If
> you prefer to have them hit the first pull you should queue them up in
> the timer tree.

Okay, thanks, I'll take them via the timer tree.

-Kees

-- 
Kees Cook
Pixel Security
Re: [PATCH] sd: Micro-optimize READ / WRITE CDB encoding
Bart,

> Only compute 'threshold' if .last_sector_bug has been set. Reduce the
> number of branches that have to be taken to check the starting LBA and
> transfer length alignment. Optimize the encoding of the READ(10),
> WRITE(10), READ(16), WRITE(16), READ(32) and WRITE(32) CDBs.

Please take a look at this. It missed 4.15 because I've been fighting a
hardware bug the last several weeks. However, I'd still like to get it
in shape for 4.16:

https://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux.git/log/?h=scsi-work

Feel free to beat me to it.

-- 
Martin K. Petersen	Oracle Linux Engineering
Re: [PATCH] scsi: gdth: Convert timers to use timer_setup()
Kees,

> Thanks for the reviews! Do you want the timer tree to carry these
> patches, or can you pick them up in the scsi tree?

Up to you. I'm not going to rebase my 4.15 queue this late in the cycle
so the patches would end up going in at the end of the merge window. If
you prefer to have them hit the first pull you should queue them up in
the timer tree.

-- 
Martin K. Petersen	Oracle Linux Engineering
Re: [PATCH] scsi: Add REPORTLUN2 to EMC SYMMETRIX blacklist entry
Hannes,

> All EMC SYMMETRIX support REPORT_LUNS, even if configured to report
> SCSI-2 for whatever reason.

Applied to 4.15/scsi-queue.

-- 
Martin K. Petersen	Oracle Linux Engineering
Re: [PATCH] scsi: Always retry internal target error
Hannes,

> +	if (!strncmp(scmd->device->vendor, "EMC", 3) &&
> +	    !strncmp(scmd->device->model, "SYMMETRIX", 9) &&
> +	    (sshdr.asc == 0x44) && (sshdr.ascq == 0x0)) {
> +		/*
> +		 * EMC Symmetrix returns 'Internal target failure'
> +		 * for a variety of internal issues, all of which
> +		 * can be recovered by retry.
> +		 */
> +		return ADD_TO_MLQUEUE;
> +	}

It's decidedly awful to have vendor/model-specific triggers in
scsi_error. What are the drawbacks of just always retrying on an
internal target failure (ASC 0x44/ASCQ 0x00)?

-- 
Martin K. Petersen	Oracle Linux Engineering
Re: [PATCH] scsi: Do not retry invalid function error
Hannes,

> Hitachi USP-V returns 'Invalid function' when the internal staging
> mechanism encountered an error. These errors should not be retried on
> another path.

s/invalid/illegal/

Applied to 4.15/scsi-queue.

-- 
Martin K. Petersen	Oracle Linux Engineering
Re: [PATCH] scsi: Add TRY_VPD_PAGES to HITACHI OPEN-V blacklist entry
Hannes,

> HITACHI arrays always support VPD pages, even though they claim to
> support SCSI Revision 3 only.

Applied to 4.15/scsi-queue.

-- 
Martin K. Petersen	Oracle Linux Engineering
Re: [PATCH] scsi: Add 'AIX VDASD' to blacklist
Hannes,

> The AIX VDASD devices do support VPD pages, but implement only SPC. So
> set BLIST_TRY_VPD_PAGES to correctly display the VPD information in
> sysfs.

Applied to 4.15/scsi-queue.

-- 
Martin K. Petersen	Oracle Linux Engineering
Re: [PATCH] scsi: sg: Re-fix off by one in sg_fill_request_table()
Ben,

> Commit 109bade9c625 ("scsi: sg: use standard lists for sg_requests")
> introduced an off-by-one error in sg_ioctl(), which was fixed by
> commit bd46fc406b30 ("scsi: sg: off by one in sg_ioctl()").
>
> Unfortunately commit 4759df905a47 ("scsi: sg: factor out
> sg_fill_request_table()") moved that code, and reintroduced the
> bug (perhaps due to a botched rebase). Fix it again.

Applied to 4.14/scsi-fixes. Thank you!

-- 
Martin K. Petersen	Oracle Linux Engineering
Re: [PATCH v10 00/10] block, scsi, md: Improve suspend and resume
On 10/17/2017 05:40 PM, Bart Van Assche wrote:
> On Tue, 2017-10-17 at 17:28 -0600, Jens Axboe wrote:
>> On 10/17/2017 05:26 PM, Bart Van Assche wrote:
>>> It is known that during the resume following a hibernate, especially
>>> when using an md RAID1 array created on top of SCSI devices, sometimes
>>> the system hangs instead of coming up properly. This patch series
>>> fixes that problem. These patches have been tested on top of the block
>>> layer for-next branch. Please consider these changes for kernel v4.15.
>>
>> The folks that had trouble with this issue, have they tested the
>> patchset? I'd like to get a solution queued up for this sooner rather
>> than later, but I'd greatly prefer if we had some folks that have
>> tested and verified it.
>
> Hello Jens,
>
> Previous versions of this series have been tested by the two reporters
> of this issue (Oleksandr Natalenko and Martin Steigerwald). I think this
> version is close enough to what has been tested by Oleksandr and Martin
> for their test results still to be valid. See also:
> * https://www.mail-archive.com/linux-block@vger.kernel.org/msg13930.html
> * https://www.mail-archive.com/linux-block@vger.kernel.org/msg14389.html

OK, good. Can we add their Tested-by tags then?

-- 
Jens Axboe
Re: [PATCH 0/2] block/SCSI MQ: two RESTART related patches
On Tue, Oct 17, 2017 at 04:47:23PM +0100, John Garry wrote:
> On 17/10/2017 06:12, Ming Lei wrote:
> > On Tue, Oct 17, 2017 at 01:04:16PM +0800, Ming Lei wrote:
> > > Hi Jens,
> > >
> > > The 1st patch runs the idle hctx after a delay in
> > > scsi_mq_get_budget(), so that we keep the same behaviour as before,
> > > and it can be thought of as a fix.
> > >
> > > The 2nd patch cleans up RESTART, and removes handling for TAG_SHARED
> > > from the current blk-mq RESTART mechanism because SCSI_MQ can cover
> > > its restart by itself, so there is no need to handle TAG_SHARED in
> > > the blk-mq RESTART path. A >20% IOPS boost is observed in my random
> > > read test over scsi_debug.
> > >
> > > John, please test these two patches and see if they improve your SAS
> > > IO performance. You can find the two patches in the following branch:
> > >
> > > https://github.com/ming1/linux/commits/blk_mq_improve_restart_V1
>
> Hi Ming,
>
> As requested, here are my figures for blk_mq_improve_restart_V1:
>
> without default SCSI_MQ, deadline scheduler, CONFIG_PREEMPT off
> read, rw, write IOPS
> 989K, 113K/113K, 835K
>
> with default SCSI_MQ, mq-deadline scheduler, CONFIG_PREEMPT off
> read, rw, write IOPS
> 738K, 130K/130K, 686K
>
> For the axboe for-next tip (21ed538):
>
> without default SCSI_MQ, deadline scheduler, CONFIG_PREEMPT off
> read, rw, write IOPS
> 977K, 117K/117K, 821K
>
> with default SCSI_MQ, mq-deadline scheduler, CONFIG_PREEMPT off
> read, rw, write IOPS
> 733K, 128K/128K, 676K
>
> All cases do not have LLDD mq exposed/enabled. So unfortunately not much
> difference with your branch.

That looks a bit strange: this patchset should cut out half of the
blk_mq_run_hw_queues() calls, since SCSI-MQ runs the queues itself in
scsi_end_request(). So it looks like your issue is related to neither IO
merging nor RESTART; I'm just wondering what the possible reason is.

But this patchset is still the correct thing to do, even though it can't
address your issue.

Thanks,
Ming
Re: [PATCH V8 4/5] libsas: Align SMP req/resp to dma_get_cache_alignment()
Hi, Marek,

Yes, I know that in include/linux/slab.h there is

#define KMALLOC_MIN_SIZE ARCH_DMA_MINALIGN

But I observed that the req/resp isn't aligned to ARCH_DMA_MINALIGN and
I don't know why. The problem I experienced is: in non-coherent mode,
mvsas with an expander cannot detect any disks.

Huacai

-- Original --
From: "Marek Szyprowski"
Date: Tue, Oct 17, 2017 07:55 PM
To: "Huacai Chen"; "Christoph Hellwig";
Cc: "Robin Murphy"; "Andrew Morton"; "Fuxin Zhang"; "linux-kernel";
"Ralf Baechle"; "James Hogan"; "linux-mips"; "James E . J . Bottomley";
"Martin K . Petersen"; "linux-scsi"; "Tejun Heo"; "linux-ide"; "stable";
Subject: Re: [PATCH V8 4/5] libsas: Align SMP req/resp to dma_get_cache_alignment()

Hi Huacai,

On 2017-10-17 10:05, Huacai Chen wrote:
> In non-coherent DMA mode, the kernel uses cache flushing operations to
> maintain I/O coherency, so libsas's SMP request/response should be
> aligned to ARCH_DMA_MINALIGN. Otherwise, if a DMA buffer and a kernel
> structure share the same cache line, and if the kernel structure has
> dirty data, cache_invalidate (no writeback) will cause data corruption.
> Cc: sta...@vger.kernel.org
> Signed-off-by: Huacai Chen
> ---
>  drivers/scsi/libsas/sas_expander.c | 93 +++---
>  1 file changed, 57 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/scsi/libsas/sas_expander.c b/drivers/scsi/libsas/sas_expander.c
> index 6b4fd23..124a44b 100644
> --- a/drivers/scsi/libsas/sas_expander.c
> +++ b/drivers/scsi/libsas/sas_expander.c
> @@ -164,17 +164,17 @@ static int smp_execute_task(struct domain_device *dev, void *req, int req_size,
>
>  /* -- Allocations -- */
>
> -static inline void *alloc_smp_req(int size)
> +static inline void *alloc_smp_req(int size, int align)
>  {
> -	u8 *p = kzalloc(size, GFP_KERNEL);
> +	u8 *p = kzalloc(ALIGN(size, align), GFP_KERNEL);

If I remember correctly, the kernel guarantees that each kmalloced
buffer is always at least aligned to the CPU cache line, so the CPU
cache can be invalidated on the allocated buffer without corrupting
anything else. Taking this into account, I wonder if the above change
makes sense. Have you experienced any problems without this change?

>  	if (p)
>  		p[0] = SMP_REQUEST;
>  	return p;
>  }
>
> -static inline void *alloc_smp_resp(int size)
> +static inline void *alloc_smp_resp(int size, int align)
>  {
> -	return kzalloc(size, GFP_KERNEL);
> +	return kzalloc(ALIGN(size, align), GFP_KERNEL);

Same as above.

>  }
>
>  static char sas_route_char(struct domain_device *dev, struct ex_phy *phy)
> @@ -403,15 +403,17 @@ static int sas_ex_phy_discover_helper(struct domain_device *dev, u8 *disc_req,
>  int sas_ex_phy_discover(struct domain_device *dev, int single)
>  {
>  	struct expander_device *ex = &dev->ex_dev;
> -	int res = 0;
> +	int res = 0, align;
>  	u8 *disc_req;
>  	u8 *disc_resp;
>
> -	disc_req = alloc_smp_req(DISCOVER_REQ_SIZE);
> +	align = dma_get_cache_alignment(&dev->phy->dev);
> +
> +	disc_req = alloc_smp_req(DISCOVER_REQ_SIZE, align);
>  	if (!disc_req)
>  		return -ENOMEM;
>
> -	disc_resp = alloc_smp_resp(DISCOVER_RESP_SIZE);
> +	disc_resp = alloc_smp_resp(DISCOVER_RESP_SIZE, align);
>  	if (!disc_resp) {
>  		kfree(disc_req);
>  		return -ENOMEM;
> @@ -480,14 +482,15 @@ static int sas_ex_general(struct domain_device *dev)
>  {
>  	u8 *rg_req;
>  	struct smp_resp *rg_resp;
> -	int res;
> -	int i;
> +	int i, res, align;
>
> -	rg_req = alloc_smp_req(RG_REQ_SIZE);
> +	align = dma_get_cache_alignment(&dev->phy->dev);
> +
> +	rg_req = alloc_smp_req(RG_REQ_SIZE, align);
>  	if (!rg_req)
>  		return -ENOMEM;
>
> -	rg_resp = alloc_smp_resp(RG_RESP_SIZE);
> +	rg_resp = alloc_smp_resp(RG_RESP_SIZE, align);
>  	if (!rg_resp) {
>  		kfree(rg_req);
>  		return -ENOMEM;
>  	}
> @@ -552,13 +555,15 @@ static int sas_ex_manuf_info(struct domain_device *dev)
>  {
>  	u8 *mi_req;
>  	u8 *mi_resp;
> -	int res;
> +	int res, align;
>
> -	mi_req = alloc_smp_req(MI_REQ_SIZE);
> +	align = dma_get_cache_alignment(&dev->phy->dev);
> +
> +	mi_req = alloc_smp_req(MI_REQ_SIZE, align);
>  	if (!mi_req)
>  		return -ENOMEM;
>
> -	mi_resp = alloc_smp_resp(MI_RESP_SIZE);
> +	mi_resp = alloc_smp_resp(MI_RESP_SIZE, align);
>  	if (!mi_resp) {
>  		kfree(mi_req);
Re: [PATCH V8 5/5] libata: Align DMA buffer to dma_get_cache_alignment()
Hi, Sergei,

Without this patch, in non-coherent mode, ata_do_set_mode() returns with
"no PIO support" and disk detection fails.

Huacai

-- Original --
From: "Sergei Shtylyov"
Date: Tue, Oct 17, 2017 05:43 PM
To: "Huacai Chen"; "Christoph Hellwig";
Cc: "Marek Szyprowski"; "Robin Murphy"; "Andrew Morton"; "Fuxin Zhang";
"linux-kernel"; "Ralf Baechle"; "James Hogan"; "linux-mips";
"James E . J . Bottomley"; "Martin K . Petersen"; "linux-scsi";
"Tejun Heo"; "linux-ide"; "stable";
Subject: Re: [PATCH V8 5/5] libata: Align DMA buffer to dma_get_cache_alignment()

On 10/17/2017 11:05 AM, Huacai Chen wrote:
> In non-coherent DMA mode, the kernel uses cache flushing operations to
> maintain I/O coherency, so in ata_do_dev_read_id() the DMA buffer
> should be aligned to ARCH_DMA_MINALIGN. Otherwise, if a DMA buffer
> and a kernel structure share the same cache line, and if the kernel
> structure has dirty data, cache_invalidate (no writeback) will cause
> data corruption.
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Huacai Chen
> ---
>  drivers/ata/libata-core.c | 15 +--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index ee4c1ec..e134955 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -1833,8 +1833,19 @@ static u32 ata_pio_mask_no_iordy(const struct ata_device *adev)
>  unsigned int ata_do_dev_read_id(struct ata_device *dev,
>  			struct ata_taskfile *tf, u16 *id)
>  {
> -	return ata_exec_internal(dev, tf, NULL, DMA_FROM_DEVICE,
> -			id, sizeof(id[0]) * ATA_ID_WORDS, 0);
> +	u16 *devid;
> +	int res, size = sizeof(u16) * ATA_ID_WORDS;
> +
> +	if (IS_ALIGNED((unsigned long)id, dma_get_cache_alignment(&dev->tdev)))
> +		res = ata_exec_internal(dev, tf, NULL, DMA_FROM_DEVICE, id, size, 0);
> +	else {
> +		devid = kmalloc(size, GFP_KERNEL);
> +		res = ata_exec_internal(dev, tf, NULL, DMA_FROM_DEVICE, devid, size, 0);
> +		memcpy(id, devid, size);
> +		kfree(devid);
> +	}
> +
> +	return res;

This function is called only for the PIO mode commands, so I doubt this
is necessary...

MBR, Sergei
Re: [PATCH v10 00/10] block, scsi, md: Improve suspend and resume
On Tue, 2017-10-17 at 17:28 -0600, Jens Axboe wrote:
> On 10/17/2017 05:26 PM, Bart Van Assche wrote:
> > It is known that during the resume following a hibernate, especially
> > when using an md RAID1 array created on top of SCSI devices, sometimes
> > the system hangs instead of coming up properly. This patch series
> > fixes that problem. These patches have been tested on top of the block
> > layer for-next branch. Please consider these changes for kernel v4.15.
>
> The folks that had trouble with this issue, have they tested the
> patchset? I'd like to get a solution queued up for this sooner rather
> than later, but I'd greatly prefer if we had some folks that have
> tested and verified it.

Hello Jens,

Previous versions of this series have been tested by the two reporters of
this issue (Oleksandr Natalenko and Martin Steigerwald). I think this
version is close enough to what has been tested by Oleksandr and Martin
for their test results still to be valid. See also:
* https://www.mail-archive.com/linux-block@vger.kernel.org/msg13930.html
* https://www.mail-archive.com/linux-block@vger.kernel.org/msg14389.html

Thanks,

Bart.
Re: [PATCH v10 00/10] block, scsi, md: Improve suspend and resume
On 10/17/2017 05:26 PM, Bart Van Assche wrote:
> Hello Jens,
>
> It is known that during the resume following a hibernate, especially
> when using an md RAID1 array created on top of SCSI devices, sometimes
> the system hangs instead of coming up properly. This patch series fixes
> that problem. These patches have been tested on top of the block layer
> for-next branch. Please consider these changes for kernel v4.15.

The folks that had trouble with this issue, have they tested the
patchset? I'd like to get a solution queued up for this sooner rather
than later, but I'd greatly prefer if we had some folks that have
tested and verified it.

-- 
Jens Axboe
[PATCH v10 02/10] md: Introduce md_stop_all_writes()
Introduce md_stop_all_writes() because the next patch will add a second
caller for this function. This patch does not change any functionality.

Signed-off-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Reviewed-by: Shaohua Li
Reviewed-by: Hannes Reinecke
Tested-by: Martin Steigerwald
Cc: linux-r...@vger.kernel.org
Cc: Ming Lei
Cc: Christoph Hellwig
---
 drivers/md/md.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8933cafc212d..b99584e5d6b1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8937,8 +8937,7 @@ int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 }
 EXPORT_SYMBOL_GPL(rdev_clear_badblocks);
 
-static int md_notify_reboot(struct notifier_block *this,
-			    unsigned long code, void *x)
+static void md_stop_all_writes(void)
 {
 	struct list_head *tmp;
 	struct mddev *mddev;
@@ -8962,6 +8961,12 @@ static int md_notify_reboot(struct notifier_block *this,
 	 */
 	if (need_delay)
 		mdelay(1000*1);
+}
+
+static int md_notify_reboot(struct notifier_block *this,
+			    unsigned long code, void *x)
+{
+	md_stop_all_writes();
 
 	return NOTIFY_DONE;
 }
-- 
2.14.2
[PATCH v10 10/10] block, nvme: Introduce blk_mq_req_flags_t
Several block layer and NVMe core functions accept a combination of BLK_MQ_REQ_* flags through the 'flags' argument but there is no verification at compile time whether the right type of block layer flags is passed. Make it possible for sparse to verify this. This patch does not change any functionality. Signed-off-by: Bart Van AsscheReviewed-by: Hannes Reinecke Cc: linux-n...@lists.infradead.org Cc: Christoph Hellwig Cc: Johannes Thumshirn Cc: Ming Lei --- block/blk-core.c | 12 ++-- block/blk-mq.c| 4 ++-- block/blk-mq.h| 2 +- drivers/nvme/host/core.c | 5 +++-- drivers/nvme/host/nvme.h | 5 +++-- include/linux/blk-mq.h| 17 +++-- include/linux/blk_types.h | 2 ++ include/linux/blkdev.h| 4 ++-- 8 files changed, 30 insertions(+), 21 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 63e88e78c1fe..4fc9d6d44edd 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -799,7 +799,7 @@ EXPORT_SYMBOL(blk_alloc_queue); * @q: request queue pointer * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT */ -int blk_queue_enter(struct request_queue *q, unsigned int flags) +int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) { const bool preempt = flags & BLK_MQ_REQ_PREEMPT; @@ -1225,7 +1225,7 @@ int blk_update_nr_requests(struct request_queue *q, unsigned int nr) * Returns request pointer on success, with @q->queue_lock *not held*. */ static struct request *__get_request(struct request_list *rl, unsigned int op, -struct bio *bio, unsigned int flags) +struct bio *bio, blk_mq_req_flags_t flags) { struct request_queue *q = rl->q; struct request *rq; @@ -1408,7 +1408,7 @@ static struct request *__get_request(struct request_list *rl, unsigned int op, * Returns request pointer on success, with @q->queue_lock *not held*. 
*/ static struct request *get_request(struct request_queue *q, unsigned int op, - struct bio *bio, unsigned int flags) + struct bio *bio, blk_mq_req_flags_t flags) { const bool is_sync = op_is_sync(op); DEFINE_WAIT(wait); @@ -1458,7 +1458,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op, /* flags: BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_NOWAIT. */ static struct request *blk_old_get_request(struct request_queue *q, - unsigned int op, unsigned int flags) + unsigned int op, blk_mq_req_flags_t flags) { struct request *rq; gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC : @@ -1495,7 +1495,7 @@ static struct request *blk_old_get_request(struct request_queue *q, * @flags: BLK_MQ_REQ_* flags, e.g. BLK_MQ_REQ_NOWAIT. */ struct request *blk_get_request_flags(struct request_queue *q, unsigned int op, - unsigned int flags) + blk_mq_req_flags_t flags) { struct request *req; @@ -2291,7 +2291,7 @@ blk_qc_t generic_make_request(struct bio *bio) current->bio_list = bio_list_on_stack; do { struct request_queue *q = bio->bi_disk->queue; - unsigned int flags = bio->bi_opf & REQ_NOWAIT ? + blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ? 
BLK_MQ_REQ_NOWAIT : 0; if (likely(blk_queue_enter(q, flags) == 0)) { diff --git a/block/blk-mq.c b/block/blk-mq.c index c6bff60e6b8b..c037b1ad64a7 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -380,7 +380,7 @@ static struct request *blk_mq_get_request(struct request_queue *q, } struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op, - unsigned int flags) + blk_mq_req_flags_t flags) { struct blk_mq_alloc_data alloc_data = { .flags = flags }; struct request *rq; @@ -406,7 +406,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op, EXPORT_SYMBOL(blk_mq_alloc_request); struct request *blk_mq_alloc_request_hctx(struct request_queue *q, - unsigned int op, unsigned int flags, unsigned int hctx_idx) + unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx) { struct blk_mq_alloc_data alloc_data = { .flags = flags }; struct request *rq; diff --git a/block/blk-mq.h b/block/blk-mq.h index 522b420dedc0..5dcfe4fa5e0d 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -110,7 +110,7 @@ static inline void blk_mq_put_ctx(struct blk_mq_ctx *ctx) struct blk_mq_alloc_data { /* input parameter */ struct request_queue *q; - unsigned int flags; + blk_mq_req_flags_t flags; unsigned int shallow_depth;
[PATCH v10 05/10] block: Introduce blk_get_request_flags()
A side effect of this patch is that the GFP mask that is passed to several allocation functions in the legacy block layer is changed from GFP_KERNEL into __GFP_DIRECT_RECLAIM. Signed-off-by: Bart Van AsscheReviewed-by: Hannes Reinecke Tested-by: Martin Steigerwald Cc: Christoph Hellwig Cc: Ming Lei Cc: Johannes Thumshirn --- block/blk-core.c | 50 +++--- include/linux/blkdev.h | 3 +++ 2 files changed, 38 insertions(+), 15 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 23a0110fc62b..559818be8e02 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1157,7 +1157,7 @@ int blk_update_nr_requests(struct request_queue *q, unsigned int nr) * @rl: request list to allocate from * @op: operation and flags * @bio: bio to allocate request for (can be %NULL) - * @gfp_mask: allocation mask + * @flags: BLQ_MQ_REQ_* flags * * Get a free request from @q. This function may fail under memory * pressure or if @q is dead. @@ -1167,7 +1167,7 @@ int blk_update_nr_requests(struct request_queue *q, unsigned int nr) * Returns request pointer on success, with @q->queue_lock *not held*. */ static struct request *__get_request(struct request_list *rl, unsigned int op, - struct bio *bio, gfp_t gfp_mask) +struct bio *bio, unsigned int flags) { struct request_queue *q = rl->q; struct request *rq; @@ -1176,6 +1176,8 @@ static struct request *__get_request(struct request_list *rl, unsigned int op, struct io_cq *icq = NULL; const bool is_sync = op_is_sync(op); int may_queue; + gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC : +__GFP_DIRECT_RECLAIM; req_flags_t rq_flags = RQF_ALLOCED; lockdep_assert_held(q->queue_lock); @@ -1336,7 +1338,7 @@ static struct request *__get_request(struct request_list *rl, unsigned int op, * @q: request_queue to allocate request from * @op: operation and flags * @bio: bio to allocate request for (can be %NULL) - * @gfp_mask: allocation mask + * @flags: BLK_MQ_REQ_* flags. * * Get a free request from @q. 
If %__GFP_DIRECT_RECLAIM is set in @gfp_mask, * this function keeps retrying under memory pressure and fails iff @q is dead. @@ -1346,7 +1348,7 @@ static struct request *__get_request(struct request_list *rl, unsigned int op, * Returns request pointer on success, with @q->queue_lock *not held*. */ static struct request *get_request(struct request_queue *q, unsigned int op, - struct bio *bio, gfp_t gfp_mask) + struct bio *bio, unsigned int flags) { const bool is_sync = op_is_sync(op); DEFINE_WAIT(wait); @@ -1358,7 +1360,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op, rl = blk_get_rl(q, bio);/* transferred to @rq on success */ retry: - rq = __get_request(rl, op, bio, gfp_mask); + rq = __get_request(rl, op, bio, flags); if (!IS_ERR(rq)) return rq; @@ -1367,7 +1369,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op, return ERR_PTR(-EAGAIN); } - if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) { + if ((flags & BLK_MQ_REQ_NOWAIT) || unlikely(blk_queue_dying(q))) { blk_put_rl(rl); return rq; } @@ -1394,10 +1396,13 @@ static struct request *get_request(struct request_queue *q, unsigned int op, goto retry; } +/* flags: BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_NOWAIT. */ static struct request *blk_old_get_request(struct request_queue *q, - unsigned int op, gfp_t gfp_mask) + unsigned int op, unsigned int flags) { struct request *rq; + gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? 
GFP_ATOMIC : +__GFP_DIRECT_RECLAIM; int ret = 0; WARN_ON_ONCE(q->mq_ops); @@ -1410,7 +1415,7 @@ static struct request *blk_old_get_request(struct request_queue *q, if (ret) return ERR_PTR(ret); spin_lock_irq(q->queue_lock); - rq = get_request(q, op, NULL, gfp_mask); + rq = get_request(q, op, NULL, flags); if (IS_ERR(rq)) { spin_unlock_irq(q->queue_lock); blk_queue_exit(q); @@ -1424,25 +1429,40 @@ static struct request *blk_old_get_request(struct request_queue *q, return rq; } -struct request *blk_get_request(struct request_queue *q, unsigned int op, - gfp_t gfp_mask) +/** + * blk_get_request_flags - allocate a request + * @q: request queue to allocate a request for + * @op: operation (REQ_OP_*) and REQ_* flags,
[PATCH v10 06/10] block: Introduce BLK_MQ_REQ_PREEMPT
Set RQF_PREEMPT if BLK_MQ_REQ_PREEMPT is passed to blk_get_request_flags(). Signed-off-by: Bart Van AsscheReviewed-by: Hannes Reinecke Tested-by: Martin Steigerwald Cc: Christoph Hellwig Cc: Ming Lei Cc: Johannes Thumshirn --- block/blk-core.c | 4 +++- block/blk-mq.c | 2 ++ include/linux/blk-mq.h | 1 + 3 files changed, 6 insertions(+), 1 deletion(-) diff --git a/block/blk-core.c b/block/blk-core.c index 559818be8e02..ab835d3c2f39 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1260,6 +1260,8 @@ static struct request *__get_request(struct request_list *rl, unsigned int op, blk_rq_set_rl(rq, rl); rq->cmd_flags = op; rq->rq_flags = rq_flags; + if (flags & BLK_MQ_REQ_PREEMPT) + rq->rq_flags |= RQF_PREEMPT; /* init elvpriv */ if (rq_flags & RQF_ELVPRIV) { @@ -1441,7 +1443,7 @@ struct request *blk_get_request_flags(struct request_queue *q, unsigned int op, struct request *req; WARN_ON_ONCE(op & REQ_NOWAIT); - WARN_ON_ONCE(flags & ~BLK_MQ_REQ_NOWAIT); + WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT)); if (q->mq_ops) { req = blk_mq_alloc_request(q, op, flags); diff --git a/block/blk-mq.c b/block/blk-mq.c index 59b7de6b616b..6a025b17caac 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -290,6 +290,8 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, rq->q = data->q; rq->mq_ctx = data->ctx; rq->cmd_flags = op; + if (data->flags & BLK_MQ_REQ_PREEMPT) + rq->rq_flags |= RQF_PREEMPT; if (blk_queue_io_stat(data->q)) rq->rq_flags |= RQF_IO_STAT; /* do not touch atomic flags, it needs atomic ops against the timer */ diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index e5e6becd57d3..22c7f36745fc 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -213,6 +213,7 @@ enum { BLK_MQ_REQ_NOWAIT = (1 << 0), /* return when out of requests */ BLK_MQ_REQ_RESERVED = (1 << 1), /* allocate from reserved pool */ BLK_MQ_REQ_INTERNAL = (1 << 2), /* allocate internal/sched tag */ + BLK_MQ_REQ_PREEMPT = (1 << 3), /* 
set RQF_PREEMPT */ }; struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op, -- 2.14.2

[PATCH v10 07/10] ide, scsi: Tell the block layer at request allocation time about preempt requests
Convert blk_get_request(q, op, __GFP_RECLAIM) into blk_get_request_flags(q, op, BLK_MQ_REQ_PREEMPT). This patch does not change any functionality. Signed-off-by: Bart Van Assche Tested-by: Martin Steigerwald Acked-by: David S. Miller [ for IDE ] Acked-by: Martin K. Petersen Reviewed-by: Hannes Reinecke Cc: Ming Lei Cc: Christoph Hellwig Cc: Johannes Thumshirn --- drivers/ide/ide-pm.c | 4 ++-- drivers/scsi/scsi_lib.c | 6 +++--- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c index 544f02d673ca..f56d742908df 100644 --- a/drivers/ide/ide-pm.c +++ b/drivers/ide/ide-pm.c @@ -89,9 +89,9 @@ int generic_ide_resume(struct device *dev) } memset(&rqpm, 0, sizeof(rqpm)); - rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, __GFP_RECLAIM); + rq = blk_get_request_flags(drive->queue, REQ_OP_DRV_IN, + BLK_MQ_REQ_PREEMPT); ide_req(rq)->type = ATA_PRIV_PM_RESUME; - rq->rq_flags |= RQF_PREEMPT; rq->special = &rqpm; rqpm.pm_step = IDE_PM_START_RESUME; rqpm.pm_state = PM_EVENT_ON; diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 6f10afaca25b..7c119696402c 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -252,9 +252,9 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd, struct scsi_request *rq; int ret = DRIVER_ERROR << 24; - req = blk_get_request(sdev->request_queue, + req = blk_get_request_flags(sdev->request_queue, data_direction == DMA_TO_DEVICE ? - REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM); + REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, BLK_MQ_REQ_PREEMPT); if (IS_ERR(req)) return ret; rq = scsi_req(req); @@ -268,7 +268,7 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd, rq->retries = retries; req->timeout = timeout; req->cmd_flags |= flags; - req->rq_flags |= rq_flags | RQF_QUIET | RQF_PREEMPT; + req->rq_flags |= rq_flags | RQF_QUIET; /* * head injection *required* here otherwise quiesce won't work -- 2.14.2
[PATCH v10 01/10] md: Rename md_notifier into md_reboot_notifier
This avoids confusion with the pm notifier that will be added through a later patch. Signed-off-by: Bart Van AsscheReviewed-by: Johannes Thumshirn Reviewed-by: Shaohua Li Reviewed-by: Hannes Reinecke Tested-by: Martin Steigerwald Cc: linux-r...@vger.kernel.org Cc: Ming Lei Cc: Christoph Hellwig --- drivers/md/md.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 5d61049e7417..8933cafc212d 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -8966,7 +8966,7 @@ static int md_notify_reboot(struct notifier_block *this, return NOTIFY_DONE; } -static struct notifier_block md_notifier = { +static struct notifier_block md_reboot_notifier = { .notifier_call = md_notify_reboot, .next = NULL, .priority = INT_MAX, /* before any real devices */ @@ -9003,7 +9003,7 @@ static int __init md_init(void) blk_register_region(MKDEV(mdp_major, 0), 1UL<
[PATCH v10 08/10] block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
This flag will be used in the next patch to let the block layer core know whether or not a SCSI request queue has been quiesced. A quiesced SCSI queue namely only processes RQF_PREEMPT requests. Signed-off-by: Bart Van AsscheReviewed-by: Hannes Reinecke Tested-by: Martin Steigerwald Cc: Ming Lei Cc: Christoph Hellwig Cc: Johannes Thumshirn --- block/blk-core.c | 30 ++ block/blk-mq-debugfs.c | 1 + include/linux/blkdev.h | 6 ++ 3 files changed, 37 insertions(+) diff --git a/block/blk-core.c b/block/blk-core.c index ab835d3c2f39..b73203286bf5 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -346,6 +346,36 @@ void blk_sync_queue(struct request_queue *q) } EXPORT_SYMBOL(blk_sync_queue); +/** + * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY + * @q: request queue pointer + * + * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not + * set and 1 if the flag was already set. + */ +int blk_set_preempt_only(struct request_queue *q) +{ + unsigned long flags; + int res; + + spin_lock_irqsave(q->queue_lock, flags); + res = queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q); + spin_unlock_irqrestore(q->queue_lock, flags); + + return res; +} +EXPORT_SYMBOL_GPL(blk_set_preempt_only); + +void blk_clear_preempt_only(struct request_queue *q) +{ + unsigned long flags; + + spin_lock_irqsave(q->queue_lock, flags); + queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q); + spin_unlock_irqrestore(q->queue_lock, flags); +} +EXPORT_SYMBOL_GPL(blk_clear_preempt_only); + /** * __blk_run_queue_uncond - run a queue whether or not it has been stopped * @q: The queue to run diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index 7f4a1ba532af..75f31535f280 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -74,6 +74,7 @@ static const char *const blk_queue_flag_name[] = { QUEUE_FLAG_NAME(REGISTERED), QUEUE_FLAG_NAME(SCSI_PASSTHROUGH), QUEUE_FLAG_NAME(QUIESCED), + QUEUE_FLAG_NAME(PREEMPT_ONLY), }; #undef QUEUE_FLAG_NAME diff --git 
a/include/linux/blkdev.h b/include/linux/blkdev.h index 05203175eb9c..864ad2e4a58c 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -630,6 +630,7 @@ struct request_queue { #define QUEUE_FLAG_REGISTERED 26 /* queue has been registered to a disk */ #define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */ #define QUEUE_FLAG_QUIESCED28 /* queue has been quiesced */ +#define QUEUE_FLAG_PREEMPT_ONLY29 /* only process REQ_PREEMPT requests */ #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) |\ (1 << QUEUE_FLAG_SAME_COMP)| \ @@ -730,6 +731,11 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q) ((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \ REQ_FAILFAST_DRIVER)) #define blk_queue_quiesced(q) test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags) +#define blk_queue_preempt_only(q) \ + test_bit(QUEUE_FLAG_PREEMPT_ONLY, &(q)->queue_flags) + +extern int blk_set_preempt_only(struct request_queue *q); +extern void blk_clear_preempt_only(struct request_queue *q); static inline bool blk_account_rq(struct request *rq) { -- 2.14.2
[PATCH v10 04/10] block: Make q_usage_counter also track legacy requests
From: Ming Lei This patch makes it possible to pause request allocation for the legacy block layer by calling blk_mq_freeze_queue() and blk_mq_unfreeze_queue(). Signed-off-by: Ming Lei [ bvanassche: Combined two patches into one, edited a comment and made sure REQ_NOWAIT is handled properly in blk_old_get_request() ] Signed-off-by: Bart Van Assche Reviewed-by: Christoph Hellwig Reviewed-by: Johannes Thumshirn Reviewed-by: Hannes Reinecke Tested-by: Martin Steigerwald Cc: Ming Lei --- block/blk-core.c | 12 block/blk-mq.c | 10 ++ 2 files changed, 14 insertions(+), 8 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index e8e149ccc86b..23a0110fc62b 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -610,6 +610,9 @@ void blk_set_queue_dying(struct request_queue *q) } spin_unlock_irq(q->queue_lock); } + + /* Make blk_queue_enter() reexamine the DYING flag. */ + wake_up_all(&q->mq_freeze_wq); } EXPORT_SYMBOL_GPL(blk_set_queue_dying); @@ -1395,16 +1398,22 @@ static struct request *blk_old_get_request(struct request_queue *q, unsigned int op, gfp_t gfp_mask) { struct request *rq; + int ret = 0; WARN_ON_ONCE(q->mq_ops); /* create ioc upfront */ create_io_context(gfp_mask, q->node); + ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) || + (op & REQ_NOWAIT)); + if (ret) + return ERR_PTR(ret); spin_lock_irq(q->queue_lock); rq = get_request(q, op, NULL, gfp_mask); if (IS_ERR(rq)) { spin_unlock_irq(q->queue_lock); + blk_queue_exit(q); return rq; } @@ -1576,6 +1585,7 @@ void __blk_put_request(struct request_queue *q, struct request *req) blk_free_request(rl, req); freed_request(rl, sync, rq_flags); blk_put_rl(rl); + blk_queue_exit(q); } } EXPORT_SYMBOL_GPL(__blk_put_request); @@ -1857,8 +1867,10 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) * Grab a free request. This is might sleep but can not fail. * Returns with the queue unlocked.
*/ + blk_queue_enter_live(q); req = get_request(q, bio->bi_opf, bio, GFP_NOIO); if (IS_ERR(req)) { + blk_queue_exit(q); __wbt_done(q->rq_wb, wb_acct); if (PTR_ERR(req) == -ENOMEM) bio->bi_status = BLK_STS_RESOURCE; diff --git a/block/blk-mq.c b/block/blk-mq.c index 097ca3ece716..59b7de6b616b 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -125,7 +125,8 @@ void blk_freeze_queue_start(struct request_queue *q) freeze_depth = atomic_inc_return(&q->mq_freeze_depth); if (freeze_depth == 1) { percpu_ref_kill(&q->q_usage_counter); - blk_mq_run_hw_queues(q, false); + if (q->mq_ops) + blk_mq_run_hw_queues(q, false); } } EXPORT_SYMBOL_GPL(blk_freeze_queue_start); @@ -255,13 +256,6 @@ void blk_mq_wake_waiters(struct request_queue *q) queue_for_each_hw_ctx(q, hctx, i) if (blk_mq_hw_queue_mapped(hctx)) blk_mq_tag_wakeup_all(hctx->tags, true); - - /* -* If we are called because the queue has now been marked as -* dying, we need to ensure that processes currently waiting on -* the queue are notified as well. -*/ - wake_up_all(&q->mq_freeze_wq); } bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx) -- 2.14.2
[PATCH v10 09/10] block, scsi: Make SCSI quiesce and resume work reliably
The contexts from which a SCSI device can be quiesced or resumed are: * Writing into /sys/class/scsi_device/*/device/state. * SCSI parallel (SPI) domain validation. * The SCSI device power management methods. See also scsi_bus_pm_ops. It is essential during suspend and resume that neither the filesystem state nor the filesystem metadata in RAM changes. This is why SCSI devices are quiesced while the hibernation image is being written or restored. The SCSI core quiesces devices through scsi_device_quiesce() and scsi_device_resume(). In the SDEV_QUIESCE state execution of non-preempt requests is deferred. This is realized by returning BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI devices. Avoid that a full queue prevents power management requests from being submitted by deferring allocation of non-preempt requests for devices in the quiesced state. This patch has been tested by running the following commands and by verifying that after resume the fio job is still running: for d in /sys/class/block/sd*[a-z]; do hcil=$(readlink "$d/device") hcil=${hcil#../../../} echo 4 > "$d/queue/nr_requests" echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth" done bdev=$(readlink /dev/disk/by-uuid/5217d83f-213e-4b42-b86e-20013325ba6c) bdev=${bdev#../../} hcil=$(readlink "/sys/block/$bdev/device") hcil=${hcil#../../../} fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 --rw=randread \ --ioengine=libaio --numjobs=4 --iodepth=16 --iodepth_batch=1 --thread \ --loops=$((2**31)) & pid=$! sleep 1 systemctl hibernate sleep 10 kill $pid Reported-by: Oleksandr Natalenko References: "I/O hangs after resuming from suspend-to-ram" (https://marc.info/?l=linux-block&m=150340235201348). Signed-off-by: Bart Van Assche Tested-by: Martin Steigerwald Cc: Martin K.
Petersen Cc: Ming Lei Cc: Christoph Hellwig Cc: Hannes Reinecke Cc: Johannes Thumshirn --- block/blk-core.c | 43 --- block/blk-mq.c | 4 ++-- block/blk-timeout.c | 2 +- drivers/scsi/scsi_lib.c | 42 ++ fs/block_dev.c | 4 ++-- include/linux/blkdev.h | 2 +- include/scsi/scsi_device.h | 1 + 7 files changed, 73 insertions(+), 25 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index b73203286bf5..63e88e78c1fe 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -372,6 +372,7 @@ void blk_clear_preempt_only(struct request_queue *q) spin_lock_irqsave(q->queue_lock, flags); queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q); + wake_up_all(&q->mq_freeze_wq); spin_unlock_irqrestore(q->queue_lock, flags); } EXPORT_SYMBOL_GPL(blk_clear_preempt_only); @@ -793,15 +794,41 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask) } EXPORT_SYMBOL(blk_alloc_queue); -int blk_queue_enter(struct request_queue *q, bool nowait) +/** + * blk_queue_enter() - try to increase q->q_usage_counter + * @q: request queue pointer + * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT + */ +int blk_queue_enter(struct request_queue *q, unsigned int flags) { + const bool preempt = flags & BLK_MQ_REQ_PREEMPT; + while (true) { + bool success = false; int ret; - if (percpu_ref_tryget_live(&q->q_usage_counter)) + rcu_read_lock_sched(); + if (percpu_ref_tryget_live(&q->q_usage_counter)) { + /* +* The code that sets the PREEMPT_ONLY flag is +* responsible for ensuring that that flag is globally +* visible before the queue is unfrozen. 
+*/ + if (preempt || !blk_queue_preempt_only(q)) { + success = true; + } else { + percpu_ref_put(&q->q_usage_counter); + WARN_ONCE(true, + "%s: Attempt to allocate non-preempt request in preempt-only mode.\n", + kobject_name(q->kobj.parent)); + } + } + rcu_read_unlock_sched(); + + if (success) return 0; - if (nowait) + if (flags & BLK_MQ_REQ_NOWAIT) return -EBUSY; /* @@ -814,7 +841,8 @@ int blk_queue_enter(struct request_queue *q, bool nowait) smp_rmb(); ret = wait_event_interruptible(q->mq_freeze_wq, - !atomic_read(&q->mq_freeze_depth) || + (atomic_read(&q->mq_freeze_depth) == 0 && +(preempt ||
[PATCH v10 03/10] md: Neither resync nor reshape while the system is frozen
Some people use the md driver on laptops and use the suspend and resume functionality. Since it is essential that submitting of new I/O requests stops before a hibernation image is created, interrupt the md resync and reshape actions if the system is being frozen. Note: the resync and reshape will restart after the system is resumed and a message similar to the following will appear in the system log: md: md0: data-check interrupted. Signed-off-by: Bart Van AsscheReviewed-by: Shaohua Li Reviewed-by: Hannes Reinecke Tested-by: Martin Steigerwald Cc: linux-r...@vger.kernel.org Cc: Ming Lei Cc: Christoph Hellwig Cc: Johannes Thumshirn --- drivers/md/md.c | 50 +- drivers/md/md.h | 8 2 files changed, 57 insertions(+), 1 deletion(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index b99584e5d6b1..8b2fc750f97f 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -66,6 +66,8 @@ #include #include #include +#include +#include #include #include "md.h" @@ -7439,6 +7441,7 @@ static int md_thread(void *arg) */ allow_signal(SIGKILL); + set_freezable(); while (!kthread_should_stop()) { /* We need to wait INTERRUPTIBLE so that @@ -7449,7 +7452,7 @@ static int md_thread(void *arg) if (signal_pending(current)) flush_signals(current); - wait_event_interruptible_timeout + wait_event_freezable_timeout (thread->wqueue, test_bit(THREAD_WAKEUP, >flags) || kthread_should_stop() || kthread_should_park(), @@ -8944,6 +8947,8 @@ static void md_stop_all_writes(void) int need_delay = 0; for_each_mddev(mddev, tmp) { + mddev->frozen_before_suspend = + test_bit(MD_RECOVERY_FROZEN, >recovery); if (mddev_trylock(mddev)) { if (mddev->pers) __md_stop_writes(mddev); @@ -8963,6 +8968,47 @@ static void md_stop_all_writes(void) mdelay(1000*1); } +static void md_restore_frozen_flag(void) +{ + struct list_head *tmp; + struct mddev *mddev; + + for_each_mddev(mddev, tmp) { + if (mddev->frozen_before_suspend) + set_bit(MD_RECOVERY_FROZEN, >recovery); + else + clear_bit(MD_RECOVERY_FROZEN, >recovery); + } +} + 
+/* + * Ensure that neither resyncing nor reshaping occurs while the system is + * frozen. + */ +static int md_notify_pm(struct notifier_block *bl, unsigned long state, + void *unused) +{ + pr_debug("%s: state = %ld\n", __func__, state); + + switch (state) { + case PM_HIBERNATION_PREPARE: + case PM_SUSPEND_PREPARE: + case PM_RESTORE_PREPARE: + md_stop_all_writes(); + break; + case PM_POST_HIBERNATION: + case PM_POST_SUSPEND: + case PM_POST_RESTORE: + md_restore_frozen_flag(); + break; + } + return NOTIFY_DONE; +} + +static struct notifier_block md_pm_notifier = { + .notifier_call = md_notify_pm, +}; + static int md_notify_reboot(struct notifier_block *this, unsigned long code, void *x) { @@ -9009,6 +9055,7 @@ static int __init md_init(void) md_probe, NULL, NULL); register_reboot_notifier(&md_reboot_notifier); + register_pm_notifier(&md_pm_notifier); raid_table_header = register_sysctl_table(raid_root_table); md_geninit(); @@ -9248,6 +9295,7 @@ static __exit void md_exit(void) unregister_blkdev(MD_MAJOR, "md"); unregister_blkdev(mdp_major, "mdp"); + unregister_pm_notifier(&md_pm_notifier); unregister_reboot_notifier(&md_reboot_notifier); unregister_sysctl_table(raid_table_header); diff --git a/drivers/md/md.h b/drivers/md/md.h index d8287d3cd1bf..727b3aef4d26 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -356,6 +356,14 @@ struct mddev { */ int recovery_disabled; + /* Writes are stopped before hibernation, suspend and restore by +* calling md_stop_all_writes(). That function sets the +* MD_RECOVERY_FROZEN flag. The value of that flag is saved before +* writes are stopped such that it can be restored after hibernation, +* suspend or restore have finished. +*/ + bool frozen_before_suspend; + int in_sync; /* know to not need resync */ /* 'open_mutex' avoids races between 'md_open' and 'do_md_stop', so * that we are never stopping an array while it is open.
[PATCH v10 00/10] block, scsi, md: Improve suspend and resume
Hello Jens, It is known that during the resume following a hibernate, especially when using an md RAID1 array created on top of SCSI devices, sometimes the system hangs instead of coming up properly. This patch series fixes that problem. These patches have been tested on top of the block layer for-next branch. Please consider these changes for kernel v4.15. Thanks, Bart. Changes between v9 and v10: - Made sure that scsi_device_resume() handles SCSI devices that are in an error state correctly. Unlike other SCSI transport protocols, during an ATA device resume the ATA controller is reset. This should fix the blk_queue_enter WARNING reported by the kernel test robot. Changes between v8 and v9: - Modified the third patch such that the MD_RECOVERY_FROZEN flag is restored properly after a resume. - Modified the ninth patch such that a kernel warning is reported if it is attempted to call scsi_device_quiesce() from multiple contexts concurrently. - Added Reviewed-by / Tested-by tags as appropriate. Changes between v7 and v8: - Fixed a (theoretical?) race identified by Ming Lei. - Added a tenth patch that checks whether the proper type of flags has been passed to a range of block layer functions. Changes between v6 and v7: - Added support for the PREEMPT_ONLY flag in blk-mq-debugfs.c. - Fixed kerneldoc header of blk_queue_enter(). - Added a rcu_read_lock_sched() / rcu_read_unlock_sched() pair in blk_set_preempt_only(). - Removed a synchronize_rcu() call from scsi_device_quiesce(). - Modified the description of patch 9/9 in this series. - Removed scsi_run_queue() call from scsi_device_resume(). Changes between v5 and v6: - Split an md patch into two patches to make it easier to review the changes. - For the md patch that suspends I/O while the system is frozen, switched back to the freezer mechanism because this makes the code shorter and easier to review. - Changed blk_set/clear_preempt_only() from EXPORT_SYMBOL() into EXPORT_SYMBOL_GPL(). 
- Made blk_set_preempt_only() behave as a test-and-set operation. - Introduced blk_get_request_flags() and BLK_MQ_REQ_PREEMPT as requested by Christoph and reduced the number of arguments of blk_queue_enter() back from three to two. - In scsi_device_quiesce(), moved the blk_mq_freeze_queue() call out of a critical section. Made the explanation of why the synchronize_rcu() call is necessary more detailed. Changes between v4 and v5: - Split blk_set_preempt_only() into two functions as requested by Christoph. - Made blk_get_request() trigger WARN_ONCE() if it is attempted to allocate a request while the system is frozen. This is a big help to identify drivers that submit I/O while the system is frozen. - Since kernel thread freezing is on its way out, reworked the approach for avoiding that the md driver submits I/O while the system is frozen such that the approach no longer depends on the kernel thread freeze mechanism. - Fixed the (theoretical) deadlock in scsi_device_quiesce() that was identified by Ming. Changes between v3 and v4: - Made sure that this patch series not only works for scsi-mq but also for the legacy SCSI stack. - Removed an smp_rmb()/smp_wmb() pair from the hot path and added a comment that explains why that is safe. - Reordered the patches in this patch series. Changes between v2 and v3: - Made md kernel threads freezable. - Changed the approach for quiescing SCSI devices again. - Addressed Ming's review comments. Changes compared to v1 of this patch series: - Changed the approach and rewrote the patch series. 
Bart Van Assche (9): md: Rename md_notifier into md_reboot_notifier md: Introduce md_stop_all_writes() md: Neither resync nor reshape while the system is frozen block: Introduce blk_get_request_flags() block: Introduce BLK_MQ_REQ_PREEMPT ide, scsi: Tell the block layer at request allocation time about preempt requests block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag block, scsi: Make SCSI quiesce and resume work reliably block, nvme: Introduce blk_mq_req_flags_t Ming Lei (1): block: Make q_usage_counter also track legacy requests block/blk-core.c | 133 ++--- block/blk-mq-debugfs.c | 1 + block/blk-mq.c | 20 +++ block/blk-mq.h | 2 +- block/blk-timeout.c| 2 +- drivers/ide/ide-pm.c | 4 +- drivers/md/md.c| 65 -- drivers/md/md.h| 8 +++ drivers/nvme/host/core.c | 5 +- drivers/nvme/host/nvme.h | 5 +- drivers/scsi/scsi_lib.c| 48 +++- fs/block_dev.c | 4 +- include/linux/blk-mq.h | 16 -- include/linux/blk_types.h | 2 + include/linux/blkdev.h | 11 +++- include/scsi/scsi_device.h | 1 + 16 files changed, 258 insertions(+), 69 deletions(-) -- 2.14.2
Re: [PATCH] sd: Micro-optimize READ / WRITE CDB encoding
On Tue, 2017-10-17 at 18:17 -0400, Douglas Gilbert wrote: > On 2017-10-17 05:03 PM, Bart Van Assche wrote: > > @@ -1025,7 +1025,6 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd > > *SCpnt) > > struct gendisk *disk = rq->rq_disk; > > struct scsi_disk *sdkp = scsi_disk(disk); > > sector_t block = blk_rq_pos(rq); > > s/block/lba/ # use the well understood SCSI abbreviation Since blk_rq_pos() returns the offset in units of 512 bytes, how about renaming 'block' into 'sector' and using the name 'lba' for the number obtained after the shift operation? > > > - sector_t threshold; > > unsigned int this_count = blk_rq_sectors(rq); > > unsigned int dif, dix; > > bool zoned_write = sd_is_zoned(sdkp) && rq_data_dir(rq) == WRITE; > > @@ -1071,20 +1070,22 @@ static int sd_setup_read_write_cmnd(struct > > scsi_cmnd *SCpnt) > > goto out; > > } > > > > - /* > > -* Some SD card readers can't handle multi-sector accesses which touch > > -* the last one or two hardware sectors. Split accesses as needed. > > -*/ > > - threshold = get_capacity(disk) - SD_LAST_BUGGY_SECTORS * > > - (sdp->sector_size / 512); > > + if (unlikely(sdp->last_sector_bug)) { > > + sector_t threshold; > > s/threshold/threshold_lba/# a bit long but more precise A similar comment applies here - shouldn't this be called 'threshold_sector'? > > } > > } > > if (rq_data_dir(rq) == WRITE) { > > @@ -1173,56 +1157,26 @@ static int sd_setup_read_write_cmnd(struct > > scsi_cmnd *SCpnt) > > SCpnt->cmnd[7] = 0x18; > > SCpnt->cmnd[9] = (rq_data_dir(rq) == READ) ? READ_32 : WRITE_32; > > Perhaps rq_data_dir(rq) could be placed in a local variable I will keep that for a separate patch. Bart.
Re: [PATCH] sd: Micro-optimize READ / WRITE CDB encoding
On 2017-10-17 05:03 PM, Bart Van Assche wrote: Only compute 'threshold' if .last_sector_bug has been set. Reduce the number of branches that has to be taken to check the starting LBA and transfer length alignment. Optimize the encoding of the READ(10), WRITE(10), READ(16), WRITE(16), READ(32) and WRITE(32) CDB encoding. Great! Some suggestions, see inline below. And showing us the improved version of sd_setup_read_write_cmnd() as well as the diff could help us (or at least me) see the bigger picture. Doug Gilbert Signed-off-by: Bart Van AsscheCc: Christoph Hellwig Cc: Hannes Reinecke Cc: Johannes Thumshirn --- drivers/scsi/sd.c | 102 +++--- 1 file changed, 28 insertions(+), 74 deletions(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 67b50baa943c..83284a8fa2cd 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -1025,7 +1025,6 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd *SCpnt) struct gendisk *disk = rq->rq_disk; struct scsi_disk *sdkp = scsi_disk(disk); sector_t block = blk_rq_pos(rq); s/block/lba/# use the well understood SCSI abbreviation - sector_t threshold; unsigned int this_count = blk_rq_sectors(rq); unsigned int dif, dix; bool zoned_write = sd_is_zoned(sdkp) && rq_data_dir(rq) == WRITE; @@ -1071,20 +1070,22 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd *SCpnt) goto out; } - /* -* Some SD card readers can't handle multi-sector accesses which touch -* the last one or two hardware sectors. Split accesses as needed. 
-*/ - threshold = get_capacity(disk) - SD_LAST_BUGGY_SECTORS * - (sdp->sector_size / 512); + if (unlikely(sdp->last_sector_bug)) { + sector_t threshold; s/threshold/threshold_lba/ # a bit long but more precise - if (unlikely(sdp->last_sector_bug && block + this_count > threshold)) { - if (block < threshold) { - /* Access up to the threshold but not beyond */ - this_count = threshold - block; - } else { + /* +* Some SD card readers can't handle multi-sector accesses +* which touch the last one or two hardware sectors. Split +* accesses as needed. +*/ + threshold = get_capacity(disk) - SD_LAST_BUGGY_SECTORS * + (sdp->sector_size / 512); + if (block >= threshold) { /* Access only a single hardware sector */ this_count = sdp->sector_size / 512; + } else if (block + this_count > threshold) { + /* Access up to the threshold but not beyond */ + this_count = threshold - block; } } @@ -1102,34 +1103,17 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd *SCpnt) * and not force the scsi disk driver to use bounce buffers * for this. 
*/ - if (sdp->sector_size == 1024) { - if ((block & 1) || (blk_rq_sectors(rq) & 1)) { + if (sdp->sector_size != 512) { + if ((block | this_count) & (sdp->sector_size / 512 - 1)) { scmd_printk(KERN_ERR, SCpnt, - "Bad block number requested\n"); + "Bad block number requested: block number = %llu, sector count = %u, sector size = %u bytes\n", + block + 0ULL, this_count, sdp->sector_size); goto out; } else { - block = block >> 1; - this_count = this_count >> 1; - } - } - if (sdp->sector_size == 2048) { - if ((block & 3) || (blk_rq_sectors(rq) & 3)) { - scmd_printk(KERN_ERR, SCpnt, - "Bad block number requested\n"); - goto out; - } else { - block = block >> 2; - this_count = this_count >> 2; - } - } - if (sdp->sector_size == 4096) { - if ((block & 7) || (blk_rq_sectors(rq) & 7)) { - scmd_printk(KERN_ERR, SCpnt, - "Bad block number requested\n"); - goto out; - } else { - block = block >> 3; - this_count = this_count >> 3; + int shift = ilog2(sdp->sector_size / 512); + + block = block >> shift; lba >>= shift;/* scale down LBA */ + this_count = this_count >> shift; likewise
Re: integer overflow in aic7xxx (was [bug report] Linux-2.6.12-rc2)
Argh... My script chose a stupid subject. Sorry for not catching that. regards, dan carpenter On Wed, Oct 18, 2017 at 12:52:49AM +0300, Dan Carpenter wrote: > Hey, > > This code is older than git is so it probably doesn't matter. But just > for laughs does anyone know what this should be? > > drivers/scsi/aic7xxx/aic7xxx_core.c:4807 ahc_init_scbdata() > warn: integer overflow (literal): u32max + 1 > > drivers/scsi/aic7xxx/aic7xxx_core.c > 4794 > 4795 /* > 4796 * Create our DMA tags. These tags define the kinds of device > 4797 * accessible memory allocations and memory mappings we will > 4798 * need to perform during normal operation. > 4799 * > 4800 * Unless we need to further restrict the allocation, we rely > 4801 * on the restrictions of the parent dmat, hence the common > 4802 * use of MAXADDR and MAXSIZE. > 4803 */ > 4804 > 4805 /* DMA tag for our hardware scb structures */ > 4806 if (ahc_dma_tag_create(ahc, ahc->parent_dmat, /*alignment*/1, > 4807 /*boundary*/BUS_SPACE_MAXADDR_32BIT + > 1, >^^^ > This is "0xffffffff + 1" which has an integer overflow so it's a > complicated way to say zero. > > 4808 /*lowaddr*/BUS_SPACE_MAXADDR_32BIT, > 4809 /*highaddr*/BUS_SPACE_MAXADDR, > 4810 /*filter*/NULL, /*filterarg*/NULL, > 4811 AHC_SCB_MAX_ALLOC * sizeof(struct > hardware_scb), > 4812 /*nsegments*/1, > 4813 /*maxsegsz*/BUS_SPACE_MAXSIZE_32BIT, > 4814 /*flags*/0, &scb_data->hscb_dmat) != > 0) { > 4815 goto error_exit; > 4816 } > 4817 > 4818 scb_data->init_level++; > 4819 > > regards, > dan carpenter
[bug report] Linux-2.6.12-rc2
Hey, This code is older than git is so it probably doesn't matter. But just for laughs does anyone know what this should be? drivers/scsi/aic7xxx/aic7xxx_core.c:4807 ahc_init_scbdata() warn: integer overflow (literal): u32max + 1 drivers/scsi/aic7xxx/aic7xxx_core.c 4794 4795 /* 4796 * Create our DMA tags. These tags define the kinds of device 4797 * accessible memory allocations and memory mappings we will 4798 * need to perform during normal operation. 4799 * 4800 * Unless we need to further restrict the allocation, we rely 4801 * on the restrictions of the parent dmat, hence the common 4802 * use of MAXADDR and MAXSIZE. 4803 */ 4804 4805 /* DMA tag for our hardware scb structures */ 4806 if (ahc_dma_tag_create(ahc, ahc->parent_dmat, /*alignment*/1, 4807 /*boundary*/BUS_SPACE_MAXADDR_32BIT + 1, ^^^ This is "0xffffffff + 1" which has an integer overflow so it's a complicated way to say zero. 4808 /*lowaddr*/BUS_SPACE_MAXADDR_32BIT, 4809 /*highaddr*/BUS_SPACE_MAXADDR, 4810 /*filter*/NULL, /*filterarg*/NULL, 4811 AHC_SCB_MAX_ALLOC * sizeof(struct hardware_scb), 4812 /*nsegments*/1, 4813 /*maxsegsz*/BUS_SPACE_MAXSIZE_32BIT, 4814 /*flags*/0, &scb_data->hscb_dmat) != 0) { 4815 goto error_exit; 4816 } 4817 4818 scb_data->init_level++; 4819 regards, dan carpenter
Re: [PATCH v4 2/5] dt-bindings: scsi: ufs: add document for hisi-ufs
On Fri, Oct 13, 2017 at 03:49:33PM +0800, Li Wei wrote: > add ufs node document for Hisilicon. > > Signed-off-by: Li Wei> --- > Documentation/devicetree/bindings/ufs/ufs-hisi.txt | 47 > ++ > 1 file changed, 47 insertions(+) > create mode 100644 Documentation/devicetree/bindings/ufs/ufs-hisi.txt > > diff --git a/Documentation/devicetree/bindings/ufs/ufs-hisi.txt > b/Documentation/devicetree/bindings/ufs/ufs-hisi.txt > new file mode 100644 > index ..ee114a65143d > --- /dev/null > +++ b/Documentation/devicetree/bindings/ufs/ufs-hisi.txt > @@ -0,0 +1,47 @@ > +* Hisilicon Universal Flash Storage (UFS) Host Controller > + > +UFS nodes are defined to describe on-chip UFS hardware macro. > +Each UFS Host Controller should have its own node. > + > +Required properties: > +- compatible: compatible list, contains one of the following - > + "hisilicon,hi3660-ufs" for hisi ufs host controller > + present on Hi3660 chipset. > +- reg : should contain UFS register address space & UFS SYS > CTRL register address, > +- interrupt-parent : interrupt device > +- interrupts: interrupt number > +- clocks : List of phandle and clock specifier pairs > +- clock-names : List of clock input name strings sorted in the same > + order as the clocks property. "clk_ref", "clk_phy" is > optional > +- resets: reset node register, one reset the clk and the other > reset the controller > +- reset-names : describe reset node register > + > +Optional properties for board device: > +- reset-gpio : specifies to reset devices Should be "reset-gpios". 
> + > +Example: > + > + ufs: ufs@ff3b { > + compatible = "hisilicon,hi3660-ufs", "jedec,ufs-1.1"; > + /* 0: HCI standard */ > + /* 1: UFS SYS CTRL */ > + reg = <0x0 0xff3b 0x0 0x1000>, > + <0x0 0xff3b1000 0x0 0x1000>; > + interrupt-parent = <>; > + interrupts = ; > + clocks = <_ctrl HI3660_CLK_GATE_UFSIO_REF>, > + <_ctrl HI3660_CLK_GATE_UFSPHY_CFG>; > + clock-names = "clk_ref", "clk_phy"; > + freq-table-hz = <0 0>, <0 0>; > + /* offset: 0x84; bit: 12 */ > + /* offset: 0x84; bit: 7 */ > + resets = <_rst 0x84 12>, > + <_rst 0x84 7>; > + reset-names = "rst", "assert"; > + }; > + > + { > + reset-gpio = < 1 0>; > + status = "okay"; Don't show status or the chip/board split in examples. > + }; > + > -- > 2.11.0 >
[PATCH] sd: Micro-optimize READ / WRITE CDB encoding
Only compute 'threshold' if .last_sector_bug has been set. Reduce the number of branches that has to be taken to check the starting LBA and transfer length alignment. Optimize the encoding of the READ(10), WRITE(10), READ(16), WRITE(16), READ(32) and WRITE(32) CDB encoding. Signed-off-by: Bart Van AsscheCc: Christoph Hellwig Cc: Hannes Reinecke Cc: Johannes Thumshirn --- drivers/scsi/sd.c | 102 +++--- 1 file changed, 28 insertions(+), 74 deletions(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 67b50baa943c..83284a8fa2cd 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -1025,7 +1025,6 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd *SCpnt) struct gendisk *disk = rq->rq_disk; struct scsi_disk *sdkp = scsi_disk(disk); sector_t block = blk_rq_pos(rq); - sector_t threshold; unsigned int this_count = blk_rq_sectors(rq); unsigned int dif, dix; bool zoned_write = sd_is_zoned(sdkp) && rq_data_dir(rq) == WRITE; @@ -1071,20 +1070,22 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd *SCpnt) goto out; } - /* -* Some SD card readers can't handle multi-sector accesses which touch -* the last one or two hardware sectors. Split accesses as needed. -*/ - threshold = get_capacity(disk) - SD_LAST_BUGGY_SECTORS * - (sdp->sector_size / 512); + if (unlikely(sdp->last_sector_bug)) { + sector_t threshold; - if (unlikely(sdp->last_sector_bug && block + this_count > threshold)) { - if (block < threshold) { - /* Access up to the threshold but not beyond */ - this_count = threshold - block; - } else { + /* +* Some SD card readers can't handle multi-sector accesses +* which touch the last one or two hardware sectors. Split +* accesses as needed. 
+*/ + threshold = get_capacity(disk) - SD_LAST_BUGGY_SECTORS * + (sdp->sector_size / 512); + if (block >= threshold) { /* Access only a single hardware sector */ this_count = sdp->sector_size / 512; + } else if (block + this_count > threshold) { + /* Access up to the threshold but not beyond */ + this_count = threshold - block; } } @@ -1102,34 +1103,17 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd *SCpnt) * and not force the scsi disk driver to use bounce buffers * for this. */ - if (sdp->sector_size == 1024) { - if ((block & 1) || (blk_rq_sectors(rq) & 1)) { + if (sdp->sector_size != 512) { + if ((block | this_count) & (sdp->sector_size / 512 - 1)) { scmd_printk(KERN_ERR, SCpnt, - "Bad block number requested\n"); + "Bad block number requested: block number = %llu, sector count = %u, sector size = %u bytes\n", + block + 0ULL, this_count, sdp->sector_size); goto out; } else { - block = block >> 1; - this_count = this_count >> 1; - } - } - if (sdp->sector_size == 2048) { - if ((block & 3) || (blk_rq_sectors(rq) & 3)) { - scmd_printk(KERN_ERR, SCpnt, - "Bad block number requested\n"); - goto out; - } else { - block = block >> 2; - this_count = this_count >> 2; - } - } - if (sdp->sector_size == 4096) { - if ((block & 7) || (blk_rq_sectors(rq) & 7)) { - scmd_printk(KERN_ERR, SCpnt, - "Bad block number requested\n"); - goto out; - } else { - block = block >> 3; - this_count = this_count >> 3; + int shift = ilog2(sdp->sector_size / 512); + + block = block >> shift; + this_count = this_count >> shift; } } if (rq_data_dir(rq) == WRITE) { @@ -1173,56 +1157,26 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd *SCpnt) SCpnt->cmnd[7] = 0x18; SCpnt->cmnd[9] = (rq_data_dir(rq) == READ) ? READ_32 : WRITE_32; SCpnt->cmnd[10] = protect | ((rq->cmd_flags & REQ_FUA) ? 0x8 : 0); - /* LBA */ - SCpnt->cmnd[12] = sizeof(block) > 4 ? (unsigned char) (block >>
Re: [PATCH v9 09/10] block, scsi: Make SCSI quiesce and resume work reliably
On Tue, 2017-10-17 at 08:33 +0200, Hannes Reinecke wrote: > How do you ensure that PREEMPT requests are not stuck in the queue > _behind_ non-PREEMPT requests? > Once they are in the queue the request are already allocated, so your > deferred allocation won't work. > _And_ deferred requests will be re-inserted at the head of the queue. > Consequently the PREEMPT request will never ever scheduled during quiesce. > How do you avoid such a scenario? Hello Hannes, The blk_mq_freeze_queue() / blk_mq_unfreeze_queue() calls that have been added through patch 9/10 to scsi_device_quiesce() will only succeed after all outstanding requests have finished. That includes non-preempt requests and requests that are about to be requeued. The changes in scsi_device_quiesce() are such that blk_queue_enter() will fail for non-preempt requests after the queue has been unfrozen. In other words, if a scsi_device_quiesce() call succeeds, it is guaranteed that there are no non-preempt requests in the queue and also that no new non-preempt requests will enter the queue until scsi_device_resume() is called. Bart.
Re: [PATCH v9 09/10] block, scsi: Make SCSI quiesce and resume work reliably
On Tue, 2017-10-17 at 08:43 +0800, Ming Lei wrote: > On Mon, Oct 16, 2017 at 04:29:04PM -0700, Bart Van Assche wrote: > > [ ... ] > > int > > scsi_device_quiesce(struct scsi_device *sdev) > > { > > + struct request_queue *q = sdev->request_queue; > > int err; > > > > + /* > > +* It is allowed to call scsi_device_quiesce() multiple times from > > +* the same context but concurrent scsi_device_quiesce() calls are > > +* not allowed. > > +*/ > > + WARN_ON_ONCE(sdev->quiesced_by && sdev->quiesced_by != current); > > If the above line and comment is removed, everything is still fine, > either nested calling or concurrent calling, isn't it? If scsi_device_quiesce() is called concurrently from two different threads then the first scsi_device_resume() call will resume I/O for *both* contexts. That's not what the callers expect. If scsi_device_quiesce() and scsi_device_resume() are called concurrently that would be even worse. I think we *really* should know whether callers serialize scsi_device_quiesce() and scsi_device_resume() calls properly. Hence the WARN_ON_ONCE() statement. Bart.
Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
On 10/16/17 23:38, Hannes Reinecke wrote:
> Again, these are just some pre-defined values to avoid I/O starvation
> when having several LUNs. _if_ we can guarantee I/O fairness between
> several (hundreds!) devices all sharing the same tagspace we wouldn't
> need these variables.

I think we already have two mechanisms for guaranteeing I/O fairness:
* Scsi_Host.starved_list for both the legacy SCSI core and scsi-mq.
* For scsi-mq, the active_queues counter in the blk-mq core. See also
  hctx_may_queue().

Bart.
[PATCH V2] scsi: storvsc: Allow only one remove lun work item to be issued per lun
When running multipath on a VM if all available paths go down the driver can schedule large amounts of storvsc_remove_lun work items to the same lun. In response to the failing paths typically storvsc responds by taking host->scan_mutex and issuing a TUR per lun. If there has been heavy IO to the failed device all the failed IOs are returned from the host. A remove lun work item is issued per failed IO. If the outstanding TURs have not been completed in a timely manner the scan_mutex is never released or released too late. Consequently the many remove lun work items are not completed as scsi_remove_device also tries to take host->scan_mutex. This results in dragging the VM down and sometimes completely. This patch only allows one remove lun to be issued to a particular lun while it is an instantiated member of the scsi stack. Changes since v1: Use single threaded workqueue to serialize work in storvsc_handle_error [Christoph Hellwig] Signed-off-by: Cathy Avery--- drivers/scsi/storvsc_drv.c | 27 ++- 1 file changed, 22 insertions(+), 5 deletions(-) diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c index 5e7200f..6febcdb 100644 --- a/drivers/scsi/storvsc_drv.c +++ b/drivers/scsi/storvsc_drv.c @@ -486,6 +486,8 @@ struct hv_host_device { unsigned int port; unsigned char path; unsigned char target; + struct workqueue_struct *handle_error_wq; + char work_q_name[20]; }; struct storvsc_scan_work { @@ -922,6 +924,7 @@ static void storvsc_handle_error(struct vmscsi_request *vm_srb, { struct storvsc_scan_work *wrk; void (*process_err_fn)(struct work_struct *work); + struct hv_host_device *host_dev = shost_priv(host); bool do_work = false; switch (SRB_STATUS(vm_srb->srb_status)) { @@ -988,7 +991,7 @@ static void storvsc_handle_error(struct vmscsi_request *vm_srb, wrk->lun = vm_srb->lun; wrk->tgt_id = vm_srb->target_id; INIT_WORK(>work, process_err_fn); - schedule_work(>work); + queue_work(host_dev->handle_error_wq, >work); } @@ -1803,10 +1806,19 @@ static int 
storvsc_probe(struct hv_device *device, if (stor_device->num_sc != 0) host->nr_hw_queues = stor_device->num_sc + 1; + /* +* Set the error handler work queue. +*/ + snprintf(host_dev->work_q_name, sizeof(host_dev->work_q_name), +"storvsc_error_wq_%d", host->host_no); + host_dev->handle_error_wq = + create_singlethread_workqueue(host_dev->work_q_name); + if (!host_dev->handle_error_wq) + goto err_out2; /* Register the HBA and start the scsi bus scan */ ret = scsi_add_host(host, >device); if (ret != 0) - goto err_out2; + goto err_out3; if (!dev_is_ide) { scsi_scan_host(host); @@ -1815,7 +1827,7 @@ static int storvsc_probe(struct hv_device *device, device->dev_instance.b[4]); ret = scsi_add_device(host, 0, target, 0); if (ret) - goto err_out3; + goto err_out4; } #if IS_ENABLED(CONFIG_SCSI_FC_ATTRS) if (host->transportt == fc_transport_template) { @@ -1827,14 +1839,17 @@ static int storvsc_probe(struct hv_device *device, fc_host_port_name(host) = stor_device->port_name; stor_device->rport = fc_remote_port_add(host, 0, ); if (!stor_device->rport) - goto err_out3; + goto err_out4; } #endif return 0; -err_out3: +err_out4: scsi_remove_host(host); +err_out3: + destroy_workqueue(host_dev->handle_error_wq); + err_out2: /* * Once we have connected with the host, we would need to @@ -1858,6 +1873,7 @@ static int storvsc_remove(struct hv_device *dev) { struct storvsc_device *stor_device = hv_get_drvdata(dev); struct Scsi_Host *host = stor_device->host; + struct hv_host_device *host_dev = shost_priv(host); #if IS_ENABLED(CONFIG_SCSI_FC_ATTRS) if (host->transportt == fc_transport_template) { @@ -1865,6 +1881,7 @@ static int storvsc_remove(struct hv_device *dev) fc_remove_host(host); } #endif + destroy_workqueue(host_dev->handle_error_wq); scsi_remove_host(host); storvsc_dev_remove(dev); scsi_host_put(host); -- 2.5.0
Re: [PATCH] scsi: libiscsi: Convert timers to use timer_setup()
Reviewed-by: Chris Leech- Original Message - > In preparation for unconditionally passing the struct timer_list pointer to > all timer callbacks, switch to using the new timer_setup() and from_timer() > to pass the timer pointer explicitly. > > Cc: Lee Duncan > Cc: Chris Leech > Cc: "James E.J. Bottomley" > Cc: "Martin K. Petersen" > Cc: open-is...@googlegroups.com > Cc: linux-scsi@vger.kernel.org > Signed-off-by: Kees Cook > --- > drivers/scsi/libiscsi.c | 16 ++-- > 1 file changed, 6 insertions(+), 10 deletions(-) > > diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c > index f8dc1601efd5..9c50d2d9f27c 100644 > --- a/drivers/scsi/libiscsi.c > +++ b/drivers/scsi/libiscsi.c > @@ -1805,9 +1805,9 @@ int iscsi_target_alloc(struct scsi_target *starget) > } > EXPORT_SYMBOL_GPL(iscsi_target_alloc); > > -static void iscsi_tmf_timedout(unsigned long data) > +static void iscsi_tmf_timedout(struct timer_list *t) > { > - struct iscsi_conn *conn = (struct iscsi_conn *)data; > + struct iscsi_conn *conn = from_timer(conn, t, tmf_timer); > struct iscsi_session *session = conn->session; > > spin_lock(>frwd_lock); > @@ -1838,8 +1838,6 @@ static int iscsi_exec_task_mgmt_fn(struct iscsi_conn > *conn, > } > conn->tmfcmd_pdus_cnt++; > conn->tmf_timer.expires = timeout * HZ + jiffies; > - conn->tmf_timer.function = iscsi_tmf_timedout; > - conn->tmf_timer.data = (unsigned long)conn; > add_timer(>tmf_timer); > ISCSI_DBG_EH(session, "tmf set timeout\n"); > > @@ -2089,9 +2087,9 @@ enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct > scsi_cmnd *sc) > } > EXPORT_SYMBOL_GPL(iscsi_eh_cmd_timed_out); > > -static void iscsi_check_transport_timeouts(unsigned long data) > +static void iscsi_check_transport_timeouts(struct timer_list *t) > { > - struct iscsi_conn *conn = (struct iscsi_conn *)data; > + struct iscsi_conn *conn = from_timer(conn, t, transport_timer); > struct iscsi_session *session = conn->session; > unsigned long recv_timeout, next_timeout = 0, last_recv; > > @@ -2913,9 
+2911,7 @@ iscsi_conn_setup(struct iscsi_cls_session *cls_session, > int dd_size, > conn->exp_statsn = 0; > conn->tmf_state = TMF_INITIAL; > > - init_timer(>transport_timer); > - conn->transport_timer.data = (unsigned long)conn; > - conn->transport_timer.function = iscsi_check_transport_timeouts; > + timer_setup(>transport_timer, iscsi_check_transport_timeouts, 0); > > INIT_LIST_HEAD(>mgmtqueue); > INIT_LIST_HEAD(>cmdqueue); > @@ -2939,7 +2935,7 @@ iscsi_conn_setup(struct iscsi_cls_session *cls_session, > int dd_size, > goto login_task_data_alloc_fail; > conn->login_task->data = conn->data = data; > > - init_timer(>tmf_timer); > + timer_setup(>tmf_timer, iscsi_tmf_timedout, 0); > init_waitqueue_head(>ehwait); > > return cls_conn; > -- > 2.7.4 > > > -- > Kees Cook > Pixel Security >
Re: [PATCH 0/2] block/SCSI MQ: two RESTART related patches
On 17/10/2017 06:12, Ming Lei wrote: On Tue, Oct 17, 2017 at 01:04:16PM +0800, Ming Lei wrote: Hi Jens, The 1st patch runs idle hctx after dealy in scsi_mq_get_budget(), so that we can keep same behaviour with before, and it can be thought as a fix. The 2nd patch cleans up RESTART, and removes handling for TAG_SHARED from current blk-mq's RESTART mechanism because SCSI_MQ can covers its restart by itself, so that no need to handle TAG_SHARED in blk-mq RESTART. And >20% IOPS boost is observed in my rand read test over scsi_debug. John, please test this two patches and see if it may improve your SAS IO performance, and you can find the two patches in the following branch: https://github.com/ming1/linux/commits/blk_mq_improve_restart_V1 Hi Ming, As requested, here's my figures for blk_mq_improve_restart_V1: without default SCSI_MQ, deadline scheduler, CONFIG PREEMPT off read, rw, write IOPS 989K, 113K/113K, 835K with default SCSI_MQ, mq-deadline scheduler, CONFIG PREEMPT off read, rw, write IOPS 738K, 130K/130K, 686K For axboe for-next tip (21ed538): without default SCSI_MQ, deadline scheduler, CONFIG PREEMPT off read, rw, write IOPS 977K, 117K/117K, 821K with default SCSI_MQ, mq-deadline scheduler, CONFIG PREEMPT off read, rw, write IOPS 733K, 128K/128K, 676K All cases do not have LLDD mq exposed/enabled. So unfortunately not much difference with your branch. cheers, John Forget to mention, you need to either pull the above branch directly or apply the two patches against for-next branch of Jens' block tree: git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git #for-next
[GIT PULL] SCSI fixes for 4.14-rc5
Four mostly error leg fixes and one more important regression in a prior commit (the qla2xxx one). The patch is available here: git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git scsi-fixes The short changelog is: Hannes Reinecke (1): scsi: fixup kernel warning during rmmod() Johannes Thumshirn (2): scsi: fc: check for rport presence in fc_block_scsi_eh scsi: libiscsi: fix shifting of DID_REQUEUE host byte Quinn Tran (1): scsi: qla2xxx: Fix uninitialized work element Satish Kharat (1): scsi: libfc: fix a deadlock in fc_rport_work And the diffstat: drivers/scsi/libfc/fc_rport.c| 2 +- drivers/scsi/libiscsi.c | 2 +- drivers/scsi/qla2xxx/qla_os.c| 3 ++- drivers/scsi/scsi_sysfs.c| 10 -- drivers/scsi/scsi_transport_fc.c | 3 +++ 5 files changed, 15 insertions(+), 5 deletions(-) With full diff below. James --- diff --git a/drivers/scsi/libfc/fc_rport.c b/drivers/scsi/libfc/fc_rport.c index 520325867e2b..31d31aad3de1 100644 --- a/drivers/scsi/libfc/fc_rport.c +++ b/drivers/scsi/libfc/fc_rport.c @@ -383,11 +383,11 @@ static void fc_rport_work(struct work_struct *work) fc_rport_enter_flogi(rdata); mutex_unlock(>rp_mutex); } else { + mutex_unlock(>rp_mutex); FC_RPORT_DBG(rdata, "work delete\n"); mutex_lock(>disc.disc_mutex); list_del_rcu(>peers); mutex_unlock(>disc.disc_mutex); - mutex_unlock(>rp_mutex); kref_put(>kref, fc_rport_destroy); } } else { diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c index c62e8d111fd9..f8dc1601efd5 100644 --- a/drivers/scsi/libiscsi.c +++ b/drivers/scsi/libiscsi.c @@ -1728,7 +1728,7 @@ int iscsi_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *sc) if (test_bit(ISCSI_SUSPEND_BIT, >suspend_tx)) { reason = FAILURE_SESSION_IN_RECOVERY; - sc->result = DID_REQUEUE; + sc->result = DID_REQUEUE << 16; goto fault; } diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c index 5b2437a5ea44..937209805baf 100644 --- a/drivers/scsi/qla2xxx/qla_os.c +++ b/drivers/scsi/qla2xxx/qla_os.c @@ -3175,6 +3175,8 @@ 
qla2x00_probe_one(struct pci_dev *pdev, const struct pci_device_id *id) host->can_queue, base_vha->req, base_vha->mgmt_svr_loop_id, host->sg_tablesize); + INIT_WORK(_vha->iocb_work, qla2x00_iocb_work_fn); + if (ha->mqenable) { bool mq = false; bool startit = false; @@ -3223,7 +3225,6 @@ qla2x00_probe_one(struct pci_dev *pdev, const struct pci_device_id *id) */ qla2xxx_wake_dpc(base_vha); - INIT_WORK(_vha->iocb_work, qla2x00_iocb_work_fn); INIT_WORK(>board_disable, qla2x00_disable_board_on_pci_error); if (IS_QLA8031(ha) || IS_MCTP_CAPABLE(ha)) { diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c index bf53356f41f0..f796bd61f3f0 100644 --- a/drivers/scsi/scsi_sysfs.c +++ b/drivers/scsi/scsi_sysfs.c @@ -1376,13 +1376,19 @@ static void __scsi_remove_target(struct scsi_target *starget) spin_lock_irqsave(shost->host_lock, flags); restart: list_for_each_entry(sdev, >__devices, siblings) { + /* +* We cannot call scsi_device_get() here, as +* we might've been called from rmmod() causing +* scsi_device_get() to fail the module_is_live() +* check. +*/ if (sdev->channel != starget->channel || sdev->id != starget->id || - scsi_device_get(sdev)) + !get_device(>sdev_gendev)) continue; spin_unlock_irqrestore(shost->host_lock, flags); scsi_remove_device(sdev); - scsi_device_put(sdev); + put_device(>sdev_gendev); spin_lock_irqsave(shost->host_lock, flags); goto restart; } diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c index cbd4495d0ff9..8c46a6d536af 100644 --- a/drivers/scsi/scsi_transport_fc.c +++ b/drivers/scsi/scsi_transport_fc.c @@ -3320,6 +3320,9 @@ int fc_block_scsi_eh(struct scsi_cmnd *cmnd) { struct fc_rport *rport = starget_to_rport(scsi_target(cmnd->device)); + if (WARN_ON_ONCE(!rport)) + return FAST_IO_FAIL; + return fc_block_rport(rport); } EXPORT_SYMBOL(fc_block_scsi_eh);
Re: scsi: Add TRY_VPD_PAGES to HITACHI OPEN-V blacklist entry
On 10/17/2017 09:10 AM, Hannes Reinecke wrote:
> HITACHI is always supporting VPD pages, even though it's claiming
> to support SCSI Revision 3 only.

"HP"/"OPEN-" is also affected. HP XP is a rebrand of HITACHI VSP.

	{"HP", "OPEN-", "*", BLIST_REPORTLUN2},		/* HP XP Arrays */

> Signed-off-by: Hannes Reinecke
> ---
>  drivers/scsi/scsi_devinfo.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c
> index ae2ad84..a998585 100644
> --- a/drivers/scsi/scsi_devinfo.c
> +++ b/drivers/scsi/scsi_devinfo.c
> @@ -174,7 +174,7 @@ struct scsi_dev_info_list_table {
> 	{"HITACHI", "DF500", "*", BLIST_REPORTLUN2},
> 	{"HITACHI", "DISK-SUBSYSTEM", "*", BLIST_REPORTLUN2},
> 	{"HITACHI", "HUS1530", "*", BLIST_NO_DIF},
> -	{"HITACHI", "OPEN-", "*", BLIST_REPORTLUN2},
> +	{"HITACHI", "OPEN-", "*", BLIST_REPORTLUN2 | BLIST_TRY_VPD_PAGES},
> 	{"HITACHI", "OP-C-", "*", BLIST_SPARSELUN | BLIST_LARGELUN},
> 	{"HITACHI", "3380-", "*", BLIST_SPARSELUN | BLIST_LARGELUN},
> 	{"HITACHI", "3390-", "*", BLIST_SPARSELUN | BLIST_LARGELUN},
Re: [PATCH] scsi: gdth: Convert timers to use timer_setup()
On Mon, Oct 16, 2017 at 9:09 PM, Martin K. Petersen wrote:
>
> Kees,
>
>> In preparation for unconditionally passing the struct timer_list
>> pointer to all timer callbacks, switch to using the new timer_setup()
>> and from_timer() to pass the timer pointer explicitly.
>
> Reviewed-by: Martin K. Petersen

Thanks for the reviews! Do you want the timer tree to carry these
patches, or can you pick them up in the scsi tree?

-Kees

--
Kees Cook
Pixel Security
Re: [PATCH] scsi: sd: Implement blacklist option for WRITE SAME w/ UNMAP
On Tue, 2017-10-17 at 10:26 -0400, Laurence Oberman wrote: > On Fri, 2017-09-29 at 10:01 -0400, Laurence Oberman wrote: > > On Fri, 2017-09-29 at 09:21 -0400, Martin K. Petersen wrote: > > > Laurence, > > > > > > > I am testing this but its not being picked up so I want to know > > > > if > > > > I > > > > have the kernel command line wrong here. > > > > > > > > scsi_dev_flags=LIO-ORG:thin2:0x8000 > > > > > > > > What am I doing wrong to pass the BLIST flags. > > > > > > This worked for me: > > > > > > [root@kvm ~]# echo "Linux:scsi_debug:0x8000" > > > > /proc/scsi/device_info > > > [root@kvm ~]# grep Linux /proc/scsi/device_info > > > 'Linux ' 'scsi_debug ' 0x8000 > > > [root@kvm ~]# modprobe scsi_debug unmap_max_blocks=10 > > > unmap_max_desc=1 write_same_length=20 lbpws=1 > > > [root@kvm ~]# lsblk -D > > > NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO > > > sda 0 512B 5K 0 > > > > > > (With the caveat that I tweaked scsi_debug to report the UNMAP > > > parameters despite lbpu being 0). > > > > > > > OK, Thanks, that is working now and I pick up the correct size now. > > Its going to be very useful for these corner case array > > inconsistencies. 
> > > > Tested-by: Laurence Oberman> > > > Sep 29 09:56:11 localhost kernel: scsi 1:0:0:50: Direct- > > Access LIO- > > ORG thin24.0 PQ: 0 ANSI: 5 > > Sep 29 09:56:11 localhost kernel: scsi 1:0:0:50: alua: supports > > implicit and explicit TPGS > > Sep 29 09:56:11 localhost kernel: scsi 1:0:0:50: alua: device > > naa.6001405f7aa27ca453f4381a00f22ea6 port group 0 rel port 2 > > Sep 29 09:56:11 localhost kernel: sd 1:0:0:50: Attached scsi > > generic > > sg64 type 0 > > Sep 29 09:56:11 localhost kernel: RHDEBUG: unmap_limit_for_ws set > > by > > kernel flag for case SD_LBP_WS16 > > Sep 29 09:56:11 localhost kernel: sd 1:0:0:50: [sdbl] 8192 512- > > byte > > logical blocks: (41.9 GB/39.1 GiB) > > Sep 29 09:56:11 localhost kernel: sd 1:0:0:50: [sdbl] Write Protect > > is > > off > > Sep 29 09:56:11 localhost kernel: sd 1:0:0:50: [sdbl] Write cache: > > enabled, read cache: enabled, supports DPO and FUA > > Sep 29 09:56:11 localhost kernel: sd 1:0:0:50: alua: transition > > timeout > > set to 60 seconds > > > Hi Martin > > We have the code accepted for the patch above and its working fine > with > echo after boot as you already know. > > However I am still fighting with trying to pass this on the kernel > command line. We need to be able to do this as a dynamic method for > adding devices to the list so that the list is populated prior to the > device scan. > > Using scsi_dev_flags=LIO-ORG:thin2:0x8000 on the kernel line is > ignored and not filled in. > Its apparent to me that we have no longer have code to actually copy > the string from the kernel line after boot. > > I ran some tests and added a couple of printk;s to see if we have any > capture and its NULL. 
> > So when did this stop working, is what I am now chasing > > [1.524595] RHDEBUG:In scsi_init_devinfo scsi_dev_flags= > [1.524705] RHDEBUG: In scsi_init_devinfo error=0 > > We have this in drivers/scsi/scsi_devinfo.c > > module_param_string(dev_flags, scsi_dev_flags, > sizeof(scsi_dev_flags), > 0); > MODULE_PARM_DESC(dev_flags, > > "Given scsi_dev_flags=vendor:model:flags[,v:m:f] add > black/white" > " list entries for vendor and model with an integer value of > flags" > " to the scsi device info list"); > > and we have: > > /** > * scsi_init_devinfo - set up the dynamic device list. > * > * Description: > * Add command line entries from scsi_dev_flags, then add > * scsi_static_device_list entries to the scsi device info list. > */ > int __init scsi_init_devinfo(void) > { > #ifdef CONFIG_SCSI_PROC_FS > struct proc_dir_entry *p; > #endif > int error, i; > > printk("RHDEBUG:In scsi_init_devinfo > scsi_dev_flags=%s\n",scsi_dev_flags); > > error = scsi_dev_info_add_list(SCSI_DEVINFO_GLOBAL, NULL); > printk("RHDEBUG: In scsi_init_devinfo error=%d\n",error); > if (error) { > printk("RHDEBUG: In scsi_init_devinfo, calling > scsi_dev_info_add_list returning with error=%d\n",error); > return error; > } > > error = scsi_dev_info_list_add_str(scsi_dev_flags); > if (error) { > printk("RHDEBUG: In scsi_init_devinfo, calling > scsi_info_list_add returning with error=%d\n",error); > goto out; > } > > for (i = 0; scsi_static_device_list[i].vendor; i++) { > error = scsi_dev_info_list_add(1 /* compatibile */, > scsi_static_device_list[i].vendor, > scsi_static_device_list[i].model, > NULL, > scsi_static_device_list[i].flags); > if (error) >
Re: [PATCH v2 5/8] target: Use sgl_alloc_order() and sgl_free()
On Tue, 2017-10-17 at 08:14 +0200, Hannes Reinecke wrote: > On 10/17/2017 12:49 AM, Bart Van Assche wrote: > > [ ... ] > > void target_free_sgl(struct scatterlist *sgl, int nents) > > { > > - struct scatterlist *sg; > > - int count; > > - > > - for_each_sg(sgl, sg, nents, count) > > - __free_page(sg_page(sg)); > > - > > - kfree(sgl); > > + sgl_free(sgl); > > } > > EXPORT_SYMBOL(target_free_sgl); > > > > @@ -2423,42 +2417,10 @@ int > > target_alloc_sgl(struct scatterlist **sgl, unsigned int *nents, u32 length, > > bool zero_page, bool chainable) > > { > > - [ ... ] > > + *sgl = sgl_alloc_order(length, 0, chainable, gfp, nents); > > + return *sgl ? 0 : -ENOMEM; > > } > > EXPORT_SYMBOL(target_alloc_sgl); > > > > The calling convention from target_alloc_sgl() is decidedly dodgy. > > Can't we convert it into returning the sgl, and remove the first parameter? Hello Hannes, Another option is to remove the target_alloc_sgl() and target_free_sgl() functions and to make LIO target drivers call sgl_alloc_order() and sgl_free_order() directly. Do you perhaps have a preference for one of these two approaches? Thanks, Bart.
Re: [PATCH v2 7/8] scsi/pmcraid: Remove an unused structure member
On Tue, 2017-10-17 at 08:21 +0200, Hannes Reinecke wrote: > On 10/17/2017 12:49 AM, Bart Van Assche wrote: > > Signed-off-by: Bart Van Assche> > Reviewed-by: Johannes Thumshirn > > Cc: linux-scsi@vger.kernel.org > > Cc: Martin K. Petersen > > Cc: Anil Ravindranath > > --- > > drivers/scsi/pmcraid.h | 1 - > > 1 file changed, 1 deletion(-) > > > > diff --git a/drivers/scsi/pmcraid.h b/drivers/scsi/pmcraid.h > > index 8bfac72a242b..44da91712115 100644 > > --- a/drivers/scsi/pmcraid.h > > +++ b/drivers/scsi/pmcraid.h > > @@ -542,7 +542,6 @@ struct pmcraid_sglist { > > u32 order; > > u32 num_sg; > > u32 num_dma_sg; > > - u32 buffer_len; > > struct scatterlist scatterlist[1]; > > }; > > > > > > This actually is the same story that we've had with ipr (and, looking at > the code, those two drivers look awfully similar ...). > pmcraid_sglist looks as if it's a hardware-dependent structure, so just > removing one entry from the middle of a structure might not be a good idea. > But this is something for the pmcraid folks to clarify. Hello Hannes, Sorry but I don't see how a structure that contains a struct scatterlist could be hardware-dependent? Thanks, Bart.
Re: [PATCH] scsi: sd: Implement blacklist option for WRITE SAME w/ UNMAP
On Fri, 2017-09-29 at 10:01 -0400, Laurence Oberman wrote: > On Fri, 2017-09-29 at 09:21 -0400, Martin K. Petersen wrote: > > Laurence, > > > > > I am testing this but its not being picked up so I want to know > > > if > > > I > > > have the kernel command line wrong here. > > > > > > scsi_dev_flags=LIO-ORG:thin2:0x8000 > > > > > > What am I doing wrong to pass the BLIST flags. > > > > This worked for me: > > > > [root@kvm ~]# echo "Linux:scsi_debug:0x8000" > > > /proc/scsi/device_info > > [root@kvm ~]# grep Linux /proc/scsi/device_info > > 'Linux ' 'scsi_debug ' 0x8000 > > [root@kvm ~]# modprobe scsi_debug unmap_max_blocks=10 > > unmap_max_desc=1 write_same_length=20 lbpws=1 > > [root@kvm ~]# lsblk -D > > NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO > > sda 0 512B 5K 0 > > > > (With the caveat that I tweaked scsi_debug to report the UNMAP > > parameters despite lbpu being 0). > > > > OK, Thanks, that is working now and I pick up the correct size now. > Its going to be very useful for these corner case array > inconsistencies. 
> > Tested-by: Laurence Oberman
> 
> Sep 29 09:56:11 localhost kernel: scsi 1:0:0:50: Direct-Access LIO-ORG thin2 4.0 PQ: 0 ANSI: 5
> Sep 29 09:56:11 localhost kernel: scsi 1:0:0:50: alua: supports implicit and explicit TPGS
> Sep 29 09:56:11 localhost kernel: scsi 1:0:0:50: alua: device naa.6001405f7aa27ca453f4381a00f22ea6 port group 0 rel port 2
> Sep 29 09:56:11 localhost kernel: sd 1:0:0:50: Attached scsi generic sg64 type 0
> Sep 29 09:56:11 localhost kernel: RHDEBUG: unmap_limit_for_ws set by kernel flag for case SD_LBP_WS16
> Sep 29 09:56:11 localhost kernel: sd 1:0:0:50: [sdbl] 8192 512-byte logical blocks: (41.9 GB/39.1 GiB)
> Sep 29 09:56:11 localhost kernel: sd 1:0:0:50: [sdbl] Write Protect is off
> Sep 29 09:56:11 localhost kernel: sd 1:0:0:50: [sdbl] Write cache: enabled, read cache: enabled, supports DPO and FUA
> Sep 29 09:56:11 localhost kernel: sd 1:0:0:50: alua: transition timeout set to 60 seconds

Hi Martin,

We have the code accepted for the patch above and it's working fine with echo after boot, as you already know. However, I am still fighting with trying to pass this on the kernel command line. We need to be able to do this as a dynamic method for adding devices to the list so that the list is populated prior to the device scan. Using scsi_dev_flags=LIO-ORG:thin2:0x8000 on the kernel line is ignored and not filled in. It's apparent to me that we no longer have code to actually copy the string from the kernel line after boot. I ran some tests and added a couple of printk's to see if we have any capture, and it's NULL.
So when did this stop working, is what I am now chasing [1.524595] RHDEBUG:In scsi_init_devinfo scsi_dev_flags= [1.524705] RHDEBUG: In scsi_init_devinfo error=0 We have this in drivers/scsi/scsi_devinfo.c module_param_string(dev_flags, scsi_dev_flags, sizeof(scsi_dev_flags), 0); MODULE_PARM_DESC(dev_flags, "Given scsi_dev_flags=vendor:model:flags[,v:m:f] add black/white" " list entries for vendor and model with an integer value of flags" " to the scsi device info list"); and we have: /** * scsi_init_devinfo - set up the dynamic device list. * * Description: * Add command line entries from scsi_dev_flags, then add * scsi_static_device_list entries to the scsi device info list. */ int __init scsi_init_devinfo(void) { #ifdef CONFIG_SCSI_PROC_FS struct proc_dir_entry *p; #endif int error, i; printk("RHDEBUG:In scsi_init_devinfo scsi_dev_flags=%s\n",scsi_dev_flags); error = scsi_dev_info_add_list(SCSI_DEVINFO_GLOBAL, NULL); printk("RHDEBUG: In scsi_init_devinfo error=%d\n",error); if (error) { printk("RHDEBUG: In scsi_init_devinfo, calling scsi_dev_info_add_list returning with error=%d\n",error); return error; } error = scsi_dev_info_list_add_str(scsi_dev_flags); if (error) { printk("RHDEBUG: In scsi_init_devinfo, calling scsi_info_list_add returning with error=%d\n",error); goto out; } for (i = 0; scsi_static_device_list[i].vendor; i++) { error = scsi_dev_info_list_add(1 /* compatibile */, scsi_static_device_list[i].vendor, scsi_static_device_list[i].model, NULL, scsi_static_device_list[i].flags); if (error) goto out; } #ifdef CONFIG_SCSI_PROC_FS p = proc_create("scsi/device_info", 0, NULL, _devinfo_proc_fops); if (!p) { error = -ENOMEM; goto out; } #endif /* CONFIG_SCSI_PROC_FS */ out: if (error) scsi_exit_devinfo(); return
Re: [PATCH] sd: always retry READ CAPACITY for ALUA state transition
On Tue, 2017-10-17 at 09:11 +0200, Hannes Reinecke wrote: > During ALUA state transitions the device might return > a sense code 02/04/0a (Logical unit not accessible, asymmetric > access state transition). As this is a transient error > we should just retry the READ CAPACITY call until > the state transition finishes and the correct > capacity can be returned. This will lock up the system if some ALUA initiator gets into a state where it always returns transitioning and never completes, which doesn't look like the best way to handle problem devices. I thought after the ALUA transition the LUN gives a unit attention ... can't you use that some way to trigger the capacity re-read, so do asynchronous event notification instead of polling forever. James
Re: [PATCH] scsi: sg: Re-fix off by one in sg_fill_request_table()
On 2017-10-15 01:16 PM, Ben Hutchings wrote:
> Commit 109bade9c625 ("scsi: sg: use standard lists for sg_requests") introduced an off-by-one error in sg_ioctl(), which was fixed by commit bd46fc406b30 ("scsi: sg: off by one in sg_ioctl()"). Unfortunately commit 4759df905a47 ("scsi: sg: factor out sg_fill_request_table()") moved that code, and reintroduced the bug (perhaps due to a botched rebase). Fix it again.
>
> Fixes: 4759df905a47 ("scsi: sg: factor out sg_fill_request_table()")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Ben Hutchings

Acked-by: Douglas Gilbert

Thanks.

> ---
>  drivers/scsi/sg.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
> index 0419c2298eab..aa28874e8fb9 100644
> --- a/drivers/scsi/sg.c
> +++ b/drivers/scsi/sg.c
> @@ -837,7 +837,7 @@ sg_fill_request_table(Sg_fd *sfp, sg_req_info_t *rinfo)
> 	val = 0;
> 	list_for_each_entry(srp, &sfp->rq_list, entry) {
> -		if (val > SG_MAX_QUEUE)
> +		if (val >= SG_MAX_QUEUE)
> 			break;
> 		rinfo[val].req_state = srp->done + 1;
> 		rinfo[val].problem =
Re: [PATCH V8 4/5] libsas: Align SMP req/resp to dma_get_cache_alignment()
Hi Huacai,

On 2017-10-17 10:05, Huacai Chen wrote:
> In non-coherent DMA mode, kernel uses cache flushing operations to maintain I/O coherency, so libsas's SMP request/response should be aligned to ARCH_DMA_MINALIGN. Otherwise, if a DMA buffer and a kernel structure share the same cache line, and if the kernel structure has dirty data, cache_invalidate (no writeback) will cause data corruption.
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Huacai Chen
> ---
>  drivers/scsi/libsas/sas_expander.c | 93 +++---
>  1 file changed, 57 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/scsi/libsas/sas_expander.c b/drivers/scsi/libsas/sas_expander.c
> index 6b4fd23..124a44b 100644
> --- a/drivers/scsi/libsas/sas_expander.c
> +++ b/drivers/scsi/libsas/sas_expander.c
> @@ -164,17 +164,17 @@ static int smp_execute_task(struct domain_device *dev, void *req, int req_size,
>  /* -- Allocations -- */
> -static inline void *alloc_smp_req(int size)
> +static inline void *alloc_smp_req(int size, int align)
>  {
> -	u8 *p = kzalloc(size, GFP_KERNEL);
> +	u8 *p = kzalloc(ALIGN(size, align), GFP_KERNEL);

If I remember correctly, the kernel guarantees that each kmalloced buffer is always at least aligned to the CPU cache line, so the CPU cache can be invalidated on the allocated buffer without corrupting anything else. Taking this into account, I wonder if the above change makes sense. Have you experienced any problems without this change?

> 	if (p)
> 		p[0] = SMP_REQUEST;
> 	return p;
>  }
> -static inline void *alloc_smp_resp(int size)
> +static inline void *alloc_smp_resp(int size, int align)
>  {
> -	return kzalloc(size, GFP_KERNEL);
> +	return kzalloc(ALIGN(size, align), GFP_KERNEL);

Same as above.
} static char sas_route_char(struct domain_device *dev, struct ex_phy *phy) @@ -403,15 +403,17 @@ static int sas_ex_phy_discover_helper(struct domain_device *dev, u8 *disc_req, int sas_ex_phy_discover(struct domain_device *dev, int single) { struct expander_device *ex = >ex_dev; - int res = 0; + int res = 0, align; u8 *disc_req; u8 *disc_resp; - disc_req = alloc_smp_req(DISCOVER_REQ_SIZE); + align = dma_get_cache_alignment(>phy->dev); + + disc_req = alloc_smp_req(DISCOVER_REQ_SIZE, align); if (!disc_req) return -ENOMEM; - disc_resp = alloc_smp_resp(DISCOVER_RESP_SIZE); + disc_resp = alloc_smp_resp(DISCOVER_RESP_SIZE, align); if (!disc_resp) { kfree(disc_req); return -ENOMEM; @@ -480,14 +482,15 @@ static int sas_ex_general(struct domain_device *dev) { u8 *rg_req; struct smp_resp *rg_resp; - int res; - int i; + int i, res, align; - rg_req = alloc_smp_req(RG_REQ_SIZE); + align = dma_get_cache_alignment(>phy->dev); + + rg_req = alloc_smp_req(RG_REQ_SIZE, align); if (!rg_req) return -ENOMEM; - rg_resp = alloc_smp_resp(RG_RESP_SIZE); + rg_resp = alloc_smp_resp(RG_RESP_SIZE, align); if (!rg_resp) { kfree(rg_req); return -ENOMEM; @@ -552,13 +555,15 @@ static int sas_ex_manuf_info(struct domain_device *dev) { u8 *mi_req; u8 *mi_resp; - int res; + int res, align; - mi_req = alloc_smp_req(MI_REQ_SIZE); + align = dma_get_cache_alignment(>phy->dev); + + mi_req = alloc_smp_req(MI_REQ_SIZE, align); if (!mi_req) return -ENOMEM; - mi_resp = alloc_smp_resp(MI_RESP_SIZE); + mi_resp = alloc_smp_resp(MI_RESP_SIZE, align); if (!mi_resp) { kfree(mi_req); return -ENOMEM; @@ -593,13 +598,15 @@ int sas_smp_phy_control(struct domain_device *dev, int phy_id, { u8 *pc_req; u8 *pc_resp; - int res; + int res, align; + + align = dma_get_cache_alignment(>phy->dev); - pc_req = alloc_smp_req(PC_REQ_SIZE); + pc_req = alloc_smp_req(PC_REQ_SIZE, align); if (!pc_req) return -ENOMEM; - pc_resp = alloc_smp_resp(PC_RESP_SIZE); + pc_resp = alloc_smp_resp(PC_RESP_SIZE, align); if (!pc_resp) { kfree(pc_req); 
return -ENOMEM; @@ -664,17 +671,19 @@ static int sas_dev_present_in_domain(struct asd_sas_port *port, #define RPEL_RESP_SIZE32 int sas_smp_get_phy_events(struct sas_phy *phy) { - int res; + int res, align; u8 *req; u8 *resp; struct sas_rphy *rphy = dev_to_rphy(phy->dev.parent); struct domain_device *dev = sas_find_dev_by_rphy(rphy); - req = alloc_smp_req(RPEL_REQ_SIZE); + align = dma_get_cache_alignment(>dev); + + req = alloc_smp_req(RPEL_REQ_SIZE, align); if (!req) return -ENOMEM; - resp = alloc_smp_resp(RPEL_RESP_SIZE); + resp = alloc_smp_resp(RPEL_RESP_SIZE, align);
Re: [PATCH v2 12/15] RDMA/cma: make config_item_type const
On Mon, Oct 16, 2017 at 05:18:51PM +0200, Bhumika Goyal wrote:
> Make these structures const as they are either passed to the functions having the argument as const or stored as a reference in the "ci_type" const field of a config_item structure.
>
> Signed-off-by: Bhumika Goyal
> ---
> * Changes in v2- Combine all the followup patches and the constification patches into a series.
>
> drivers/infiniband/core/cma_configfs.c | 8
> 1 file changed, 4 insertions(+), 4 deletions(-)

Thanks,
Reviewed-by: Leon Romanovsky
Re: [PATCH v9 09/10] block, scsi: Make SCSI quiesce and resume work reliably
On Tue, Oct 17, 2017 at 08:33:36AM +0200, Hannes Reinecke wrote:
> On 10/17/2017 01:29 AM, Bart Van Assche wrote:
> > The contexts from which a SCSI device can be quiesced or resumed are:
> > * Writing into /sys/class/scsi_device/*/device/state.
> > * SCSI parallel (SPI) domain validation.
> > * The SCSI device power management methods. See also scsi_bus_pm_ops.
> >
> > It is essential during suspend and resume that neither the filesystem state nor the filesystem metadata in RAM changes. This is why while the hibernation image is being written or restored that SCSI devices are quiesced. The SCSI core quiesces devices through scsi_device_quiesce() and scsi_device_resume(). In the SDEV_QUIESCE state execution of non-preempt requests is deferred. This is realized by returning BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI devices. Avoid that a full queue prevents power management requests to be submitted by deferring allocation of non-preempt requests for devices in the quiesced state. This patch has been tested by running the following commands and by verifying that after resume the fio job is still running:
>
> (We've discussed this at ALPSS already:)
>
> How do you ensure that PREEMPT requests are not stuck in the queue _behind_ non-PREEMPT requests?

Once the PREEMPT_ONLY flag is set, blk_mq_freeze_queue() is called to drain all requests in the queue first, and new requests are prevented from being allocated in the meantime. So there should be no such issue?

> Once they are in the queue the request are already allocated, so your deferred allocation won't work.
> _And_ deferred requests will be re-inserted at the head of the queue.
> Consequently the PREEMPT request will never ever scheduled during quiesce.
> How do you avoid such a scenario?

From the implementation, no non-PREEMPT request can be allocated once scsi_device_quiesce() returns.
Also, I don't see deferred requests in this patch; could you explain it a bit? -- Ming
Re: [PATCH v2 07/15] usb: gadget: configfs: make config_item_type const
Bhumika Goyal writes:
> Make config_item_type structures const as they are either passed to a function having the argument as const or stored in the const "ci_type" field of a config_item structure.
>
> Done using Coccinelle.
>
> Signed-off-by: Bhumika Goyal

Acked-by: Felipe Balbi

-- 
balbi
Re: [PATCH v2 11/15] stm class: make config_item_type const
On Mon, Oct 16, 2017 at 05:18:50PM +0200, Bhumika Goyal wrote: > Make config_item_type structures const as they are either passed to a > function having the argument as const or used inside a if statement or > stored in the const "ci_type" field of a config_item structure. > > Done using Coccinelle. > > Signed-off-by: Bhumika Goyal> --- > * Changes in v2- Combine all the followup patches and the constification > patches into a series. Acked-by: Greg Kroah-Hartman
Re: [PATCH v2 01/15] configfs: make ci_type field, some pointers and function arguments const
On Mon, Oct 16, 2017 at 05:18:40PM +0200, Bhumika Goyal wrote: > The ci_type field of the config_item structure do not modify the fields > of the config_item_type structure it points to. And the other pointers > initialized with ci_type do not modify the fields as well. > So, make the ci_type field and the pointers initialized with ci_type > as const. > > Make the struct config_item_type *type function argument of functions > config_{item/group}_init_type_name const as the argument in both the > functions is only stored in the ci_type field of a config_item structure > which is now made const. > Make the argument of configfs_register_default_group const as it is > only passed to the argument of the function config_group_init_type_name > which is now const. > > Signed-off-by: Bhumika Goyal> --- > * Changes in v2- Combine all the followup patches and the constification > patches into a series. > > fs/configfs/dir.c| 10 +- > fs/configfs/item.c | 6 +++--- > fs/configfs/symlink.c| 4 ++-- > include/linux/configfs.h | 8 > 4 files changed, 14 insertions(+), 14 deletions(-) Acked-by: Greg Kroah-Hartman
Re: [PATCH v2 00/15] make structure field, function arguments and structures const
On Tue, Oct 17, 2017 at 12:16:18PM +0200, Julia Lawall wrote: > > > On Tue, 17 Oct 2017, Greg KH wrote: > > > On Mon, Oct 16, 2017 at 05:18:39PM +0200, Bhumika Goyal wrote: > > > Make the ci_type field and some function arguments as const. After this > > > change, make config_item_type structures as const. > > > > > > * Changes in v2- Combine all the followup patches and the constification > > > patches into a series. > > > > Who do you want to take these patches? If you want, I can take them > > through my driver-core tree, which has done other configfs stuff like > > this in the past. > > Christoph Hellwig proposed to take care of it. Great! I'll go ack the individual ones that I might need to... thanks, greg k-h
Re: [PATCH v2 08/15] nvmet: make config_item_type const
Acked-by: Sagi Grimberg
Re: [PATCH v2 00/15] make structure field, function arguments and structures const
On Tue, 17 Oct 2017, Greg KH wrote: > On Mon, Oct 16, 2017 at 05:18:39PM +0200, Bhumika Goyal wrote: > > Make the ci_type field and some function arguments as const. After this > > change, make config_item_type structures as const. > > > > * Changes in v2- Combine all the followup patches and the constification > > patches into a series. > > Who do you want to take these patches? If you want, I can take them > through my driver-core tree, which has done other configfs stuff like > this in the past. Christoph Hellwig proposed to take care of it. julia > > thanks, > > greg k-h >
Re: [PATCH v2 00/15] make structure field, function arguments and structures const
On Mon, Oct 16, 2017 at 05:18:39PM +0200, Bhumika Goyal wrote: > Make the ci_type field and some function arguments as const. After this > change, make config_item_type structures as const. > > * Changes in v2- Combine all the followup patches and the constification > patches into a series. Who do you want to take these patches? If you want, I can take them through my driver-core tree, which has done other configfs stuff like this in the past. thanks, greg k-h
Re: [PATCH V8 5/5] libata: Align DMA buffer to dma_get_cache_alignment()
On 10/17/2017 11:05 AM, Huacai Chen wrote:
> In non-coherent DMA mode, kernel uses cache flushing operations to maintain I/O coherency, so in ata_do_dev_read_id() the DMA buffer should be aligned to ARCH_DMA_MINALIGN. Otherwise, if a DMA buffer and a kernel structure share the same cache line, and if the kernel structure has dirty data, cache_invalidate (no writeback) will cause data corruption.
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Huacai Chen
> ---
>  drivers/ata/libata-core.c | 15 +--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index ee4c1ec..e134955 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -1833,8 +1833,19 @@ static u32 ata_pio_mask_no_iordy(const struct ata_device *adev)
>  unsigned int ata_do_dev_read_id(struct ata_device *dev, struct ata_taskfile *tf, u16 *id)
>  {
> -	return ata_exec_internal(dev, tf, NULL, DMA_FROM_DEVICE,
> -				 id, sizeof(id[0]) * ATA_ID_WORDS, 0);
> +	u16 *devid;
> +	int res, size = sizeof(u16) * ATA_ID_WORDS;
> +
> +	if (IS_ALIGNED((unsigned long)id, dma_get_cache_alignment(&dev->tdev)))
> +		res = ata_exec_internal(dev, tf, NULL, DMA_FROM_DEVICE, id, size, 0);
> +	else {
> +		devid = kmalloc(size, GFP_KERNEL);
> +		res = ata_exec_internal(dev, tf, NULL, DMA_FROM_DEVICE, devid, size, 0);
> +		memcpy(id, devid, size);
> +		kfree(devid);
> +	}
> +
> +	return res;

This function is called only for the PIO mode commands, so I doubt this is necessary...

MBR, Sergei
Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
On Tue, Oct 17, 2017 at 08:38:01AM +0200, Hannes Reinecke wrote: > On 10/17/2017 03:29 AM, Ming Lei wrote: > > On Mon, Oct 16, 2017 at 01:30:09PM +0200, Hannes Reinecke wrote: > >> On 10/13/2017 07:29 PM, Ming Lei wrote: > >>> On Fri, Oct 13, 2017 at 05:08:52PM +, Bart Van Assche wrote: > On Sat, 2017-10-14 at 00:45 +0800, Ming Lei wrote: > > On Fri, Oct 13, 2017 at 04:31:04PM +, Bart Van Assche wrote: > >> On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote: > >>> Actually it is in hot path, for example, lpfc and qla2xx's queue > >>> depth is 3, > >> > >> Sorry but I doubt whether that is correct. More in general, I don't > >> know any modern > >> storage HBA for which the default queue depth is so low. > > > > You can grep: > > > > [ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E > > "qla2xxx|lpfc" > > Such a low queue depth will result in suboptimal performance for adapters > that communicate over a storage network. I think that's a bug and that > both > adapters support much higher cmd_per_lun values. > > (+James Smart) > > James, can you explain us why commit 445cf4f4d2aa decreased > LPFC_CMD_PER_LUN > from 30 to 3? Was that perhaps a workaround for a bug in a specific > target > implementation? > > (+Himanshu Madhani) > > Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun > for > the qla2xxx initiator driver to the scsi_host->can_queue value? > >>> > >>> ->can_queue is size of the whole tag space shared by all LUNs, looks it > >>> isn't > >>> reasonable to increase cmd_per_lun to .can_queue. > >>> > >> '3' is just a starting point; later on it'll be adjusted via > >> scsi_change_depth(). > >> Looks like it's not working correctly with blk-mq, though. > > > > At default, in scsi_alloc_sdev(), q->queue_depth is set as > > host->cmd_per_lun. You are right, q->queue_depth can be adjusted > > later too. > > > > q->queue_depth is respected in scsi_dev_queue_ready(). 
> > .cmd_per_lun defines the max outstanding cmds for each lun, I > > guess it is respected by some hardware inside. > > > No, this is purely a linux abstraction. Nothing to do with the hardware. That is also my initial understanding. But my test showed that actually the max outstanding cmds per LUN is really 3 even though q->queue_depth is 30, that is why I guess the hardware may put a hard limit inside: https://marc.info/?l=linux-block=150549401611868=2 Also if they were same thing, why does lpfc define different default value for q->queue_depth and .cmd_per_lun? drivers/scsi/lpfc/lpfc_attr.c: 3411 /* # lun_queue_depth: This parameter is used to limit the number of outstanding # commands per FCP LUN. Value range is [1,512]. Default value is 30. # If this parameter value is greater than 1/8th the maximum number of exchanges # supported by the HBA port, then the lun queue depth will be reduced to # 1/8th the maximum number of exchanges. */ LPFC_VPORT_ATTR_R(lun_queue_depth, 30, 1, 512, "Max number of FCP commands we can queue to a specific LUN"); drivers/scsi/lpfc/lpfc.h: 47 #define LPFC_CMD_PER_LUN3 /* max outstanding cmds per lun */ -- Ming
Re: [PATCH v2 1/8] lib/scatterlist: Introduce sgl_alloc() and sgl_free()
Thanks Bart,

Reviewed-by: Johannes Thumshirn

-- 
Johannes Thumshirn    Storage
jthumsh...@suse.de    +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH v2 02/15] usb: gadget: make config_item_type structures const
Hi,

Bhumika Goyal writes:
> Make these structures const as they are only passed to the const argument of the functions config_{group/item}_init_type_name.
>
> Signed-off-by: Bhumika Goyal
> ---
> * Changes in v2- Combine all the followup patches and the constification patches into a series.

I'm assuming this depends on patch 1 of the series. In that case:

Acked-by: Felipe Balbi

-- 
balbi
[PATCH V8 2/5] MIPS: Implement dma_map_ops::get_cache_alignment()
Currently, MIPS is an architecture which support coherent & noncoherent devices co-exist. So implement get_cache_alignment() function pointer in 'struct dma_map_ops' to return different dma alignments. Cc: sta...@vger.kernel.org Signed-off-by: Huacai Chen--- arch/mips/cavium-octeon/dma-octeon.c | 3 ++- arch/mips/include/asm/dma-coherence.h | 2 ++ arch/mips/loongson64/common/dma-swiotlb.c | 1 + arch/mips/mm/dma-default.c| 11 ++- arch/mips/netlogic/common/nlm-dma.c | 3 ++- 5 files changed, 17 insertions(+), 3 deletions(-) diff --git a/arch/mips/cavium-octeon/dma-octeon.c b/arch/mips/cavium-octeon/dma-octeon.c index c64bd87..41c29a85 100644 --- a/arch/mips/cavium-octeon/dma-octeon.c +++ b/arch/mips/cavium-octeon/dma-octeon.c @@ -324,7 +324,8 @@ static struct octeon_dma_map_ops _octeon_pci_dma_map_ops = { .sync_sg_for_cpu = swiotlb_sync_sg_for_cpu, .sync_sg_for_device = octeon_dma_sync_sg_for_device, .mapping_error = swiotlb_dma_mapping_error, - .dma_supported = swiotlb_dma_supported + .dma_supported = swiotlb_dma_supported, + .get_cache_alignment = mips_dma_get_cache_alignment }, }; diff --git a/arch/mips/include/asm/dma-coherence.h b/arch/mips/include/asm/dma-coherence.h index 72d0eab..5f7a9fc 100644 --- a/arch/mips/include/asm/dma-coherence.h +++ b/arch/mips/include/asm/dma-coherence.h @@ -29,4 +29,6 @@ extern int hw_coherentio; #define hw_coherentio 0 #endif /* CONFIG_DMA_MAYBE_COHERENT */ +int mips_dma_get_cache_alignment(struct device *dev); + #endif diff --git a/arch/mips/loongson64/common/dma-swiotlb.c b/arch/mips/loongson64/common/dma-swiotlb.c index 34486c1..17b9897 100644 --- a/arch/mips/loongson64/common/dma-swiotlb.c +++ b/arch/mips/loongson64/common/dma-swiotlb.c @@ -119,6 +119,7 @@ static const struct dma_map_ops loongson_dma_map_ops = { .sync_sg_for_device = loongson_dma_sync_sg_for_device, .mapping_error = swiotlb_dma_mapping_error, .dma_supported = loongson_dma_supported, + .get_cache_alignment = mips_dma_get_cache_alignment }; void __init 
plat_swiotlb_setup(void) diff --git a/arch/mips/mm/dma-default.c b/arch/mips/mm/dma-default.c index c01bd20..e8f0659 100644 --- a/arch/mips/mm/dma-default.c +++ b/arch/mips/mm/dma-default.c @@ -394,6 +394,14 @@ void dma_cache_sync(struct device *dev, void *vaddr, size_t size, EXPORT_SYMBOL(dma_cache_sync); +int mips_dma_get_cache_alignment(struct device *dev) +{ + if (plat_device_is_coherent(dev)) + return 1; + else + return ARCH_DMA_MINALIGN; +} + static const struct dma_map_ops mips_default_dma_map_ops = { .alloc = mips_dma_alloc_coherent, .free = mips_dma_free_coherent, @@ -407,7 +415,8 @@ static const struct dma_map_ops mips_default_dma_map_ops = { .sync_sg_for_cpu = mips_dma_sync_sg_for_cpu, .sync_sg_for_device = mips_dma_sync_sg_for_device, .mapping_error = mips_dma_mapping_error, - .dma_supported = mips_dma_supported + .dma_supported = mips_dma_supported, + .get_cache_alignment = mips_dma_get_cache_alignment }; const struct dma_map_ops *mips_dma_map_ops = _default_dma_map_ops; diff --git a/arch/mips/netlogic/common/nlm-dma.c b/arch/mips/netlogic/common/nlm-dma.c index 0ec9d9d..e9a9ddc 100644 --- a/arch/mips/netlogic/common/nlm-dma.c +++ b/arch/mips/netlogic/common/nlm-dma.c @@ -79,7 +79,8 @@ const struct dma_map_ops nlm_swiotlb_dma_ops = { .sync_sg_for_cpu = swiotlb_sync_sg_for_cpu, .sync_sg_for_device = swiotlb_sync_sg_for_device, .mapping_error = swiotlb_dma_mapping_error, - .dma_supported = swiotlb_dma_supported + .dma_supported = swiotlb_dma_supported, + .get_cache_alignment = mips_dma_get_cache_alignment }; void __init plat_swiotlb_setup(void) -- 2.7.0
[PATCH V8 4/5] libsas: Align SMP req/resp to dma_get_cache_alignment()
In non-coherent DMA mode, kernel uses cache flushing operations to maintain I/O coherency, so libsas's SMP request/response should be aligned to ARCH_DMA_MINALIGN. Otherwise, If a DMA buffer and a kernel structure share a same cache line, and if the kernel structure has dirty data, cache_invalidate (no writeback) will cause data corruption. Cc: sta...@vger.kernel.org Signed-off-by: Huacai Chen--- drivers/scsi/libsas/sas_expander.c | 93 +++--- 1 file changed, 57 insertions(+), 36 deletions(-) diff --git a/drivers/scsi/libsas/sas_expander.c b/drivers/scsi/libsas/sas_expander.c index 6b4fd23..124a44b 100644 --- a/drivers/scsi/libsas/sas_expander.c +++ b/drivers/scsi/libsas/sas_expander.c @@ -164,17 +164,17 @@ static int smp_execute_task(struct domain_device *dev, void *req, int req_size, /* -- Allocations -- */ -static inline void *alloc_smp_req(int size) +static inline void *alloc_smp_req(int size, int align) { - u8 *p = kzalloc(size, GFP_KERNEL); + u8 *p = kzalloc(ALIGN(size, align), GFP_KERNEL); if (p) p[0] = SMP_REQUEST; return p; } -static inline void *alloc_smp_resp(int size) +static inline void *alloc_smp_resp(int size, int align) { - return kzalloc(size, GFP_KERNEL); + return kzalloc(ALIGN(size, align), GFP_KERNEL); } static char sas_route_char(struct domain_device *dev, struct ex_phy *phy) @@ -403,15 +403,17 @@ static int sas_ex_phy_discover_helper(struct domain_device *dev, u8 *disc_req, int sas_ex_phy_discover(struct domain_device *dev, int single) { struct expander_device *ex = >ex_dev; - int res = 0; + int res = 0, align; u8 *disc_req; u8 *disc_resp; - disc_req = alloc_smp_req(DISCOVER_REQ_SIZE); + align = dma_get_cache_alignment(>phy->dev); + + disc_req = alloc_smp_req(DISCOVER_REQ_SIZE, align); if (!disc_req) return -ENOMEM; - disc_resp = alloc_smp_resp(DISCOVER_RESP_SIZE); + disc_resp = alloc_smp_resp(DISCOVER_RESP_SIZE, align); if (!disc_resp) { kfree(disc_req); return -ENOMEM; @@ -480,14 +482,15 @@ static int sas_ex_general(struct domain_device *dev) 
{ u8 *rg_req; struct smp_resp *rg_resp; - int res; - int i; + int i, res, align; - rg_req = alloc_smp_req(RG_REQ_SIZE); + align = dma_get_cache_alignment(>phy->dev); + + rg_req = alloc_smp_req(RG_REQ_SIZE, align); if (!rg_req) return -ENOMEM; - rg_resp = alloc_smp_resp(RG_RESP_SIZE); + rg_resp = alloc_smp_resp(RG_RESP_SIZE, align); if (!rg_resp) { kfree(rg_req); return -ENOMEM; @@ -552,13 +555,15 @@ static int sas_ex_manuf_info(struct domain_device *dev) { u8 *mi_req; u8 *mi_resp; - int res; + int res, align; - mi_req = alloc_smp_req(MI_REQ_SIZE); + align = dma_get_cache_alignment(>phy->dev); + + mi_req = alloc_smp_req(MI_REQ_SIZE, align); if (!mi_req) return -ENOMEM; - mi_resp = alloc_smp_resp(MI_RESP_SIZE); + mi_resp = alloc_smp_resp(MI_RESP_SIZE, align); if (!mi_resp) { kfree(mi_req); return -ENOMEM; @@ -593,13 +598,15 @@ int sas_smp_phy_control(struct domain_device *dev, int phy_id, { u8 *pc_req; u8 *pc_resp; - int res; + int res, align; + + align = dma_get_cache_alignment(>phy->dev); - pc_req = alloc_smp_req(PC_REQ_SIZE); + pc_req = alloc_smp_req(PC_REQ_SIZE, align); if (!pc_req) return -ENOMEM; - pc_resp = alloc_smp_resp(PC_RESP_SIZE); + pc_resp = alloc_smp_resp(PC_RESP_SIZE, align); if (!pc_resp) { kfree(pc_req); return -ENOMEM; @@ -664,17 +671,19 @@ static int sas_dev_present_in_domain(struct asd_sas_port *port, #define RPEL_RESP_SIZE 32 int sas_smp_get_phy_events(struct sas_phy *phy) { - int res; + int res, align; u8 *req; u8 *resp; struct sas_rphy *rphy = dev_to_rphy(phy->dev.parent); struct domain_device *dev = sas_find_dev_by_rphy(rphy); - req = alloc_smp_req(RPEL_REQ_SIZE); + align = dma_get_cache_alignment(>dev); + + req = alloc_smp_req(RPEL_REQ_SIZE, align); if (!req) return -ENOMEM; - resp = alloc_smp_resp(RPEL_RESP_SIZE); + resp = alloc_smp_resp(RPEL_RESP_SIZE, align); if (!resp) { kfree(req); return -ENOMEM; @@ -709,7 +718,8 @@ int sas_get_report_phy_sata(struct domain_device *dev, int phy_id, struct smp_resp *rps_resp) { int res; - u8 *rps_req = 
alloc_smp_req(RPS_REQ_SIZE); + int align = dma_get_cache_alignment(&dev->phy->dev); + u8 *rps_req =
[PATCH V8 3/5] scsi: Align block queue to dma_get_cache_alignment()
In non-coherent DMA mode, the kernel uses cache flushing operations to maintain I/O coherency, so SCSI's block queue should be aligned to ARCH_DMA_MINALIGN. Otherwise, if a DMA buffer and a kernel structure share the same cache line, and the kernel structure has dirty data, a cache invalidate (without writeback) will cause data corruption. Cc: sta...@vger.kernel.org Signed-off-by: Huacai Chen --- drivers/scsi/scsi_lib.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 9cf6a80..19abc2e 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -2132,11 +2132,11 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q) q->limits.cluster = 0; /* -* set a reasonable default alignment on word boundaries: the -* host and device may alter it using +* set a reasonable default alignment on word/cacheline boundaries: +* the host and device may alter it using * blk_queue_update_dma_alignment() later. */ - blk_queue_dma_alignment(q, 0x03); + blk_queue_dma_alignment(q, max(4, dma_get_cache_alignment(dev)) - 1); } EXPORT_SYMBOL_GPL(__scsi_init_queue); -- 2.7.0
[PATCH V8 5/5] libata: Align DMA buffer to dma_get_cache_alignment()
In non-coherent DMA mode, the kernel uses cache flushing operations to maintain I/O coherency, so in ata_do_dev_read_id() the DMA buffer should be aligned to ARCH_DMA_MINALIGN. Otherwise, if a DMA buffer and a kernel structure share the same cache line, and the kernel structure has dirty data, a cache invalidate (without writeback) will cause data corruption. Cc: sta...@vger.kernel.org Signed-off-by: Huacai Chen --- drivers/ata/libata-core.c | 15 +-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c index ee4c1ec..e134955 100644 --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -1833,8 +1833,19 @@ static u32 ata_pio_mask_no_iordy(const struct ata_device *adev) unsigned int ata_do_dev_read_id(struct ata_device *dev, struct ata_taskfile *tf, u16 *id) { - return ata_exec_internal(dev, tf, NULL, DMA_FROM_DEVICE, -id, sizeof(id[0]) * ATA_ID_WORDS, 0); + u16 *devid; + int res, size = sizeof(u16) * ATA_ID_WORDS; + + if (IS_ALIGNED((unsigned long)id, dma_get_cache_alignment(&dev->tdev))) + res = ata_exec_internal(dev, tf, NULL, DMA_FROM_DEVICE, id, size, 0); + else { + devid = kmalloc(size, GFP_KERNEL); + res = ata_exec_internal(dev, tf, NULL, DMA_FROM_DEVICE, devid, size, 0); + memcpy(id, devid, size); + kfree(devid); + } + + return res; } /** -- 2.7.0
[PATCH V8 1/5] dma-mapping: Rework dma_get_cache_alignment()
Make dma_get_cache_alignment() accept a 'dev' argument. As a result, it can return different alignments for devices with different I/O cache coherency. Currently ARM/ARM64 and MIPS allow coherent and noncoherent devices to co-exist. This may be extended in the future, so add a new function pointer (i.e. get_cache_alignment) to 'struct dma_map_ops' as a generic solution. Cc: sta...@vger.kernel.org Cc: Michael S. Tsirkin Cc: Pawel Osciak Cc: Marek Szyprowski Cc: Kyungmin Park Cc: Michael Chan Cc: Benjamin Herrenschmidt Cc: Ivan Mikhaylov Cc: Tariq Toukan Cc: Andy Gross Cc: Mark A. Greer Cc: Robert Baldyga Cc: Marek Szyprowski Signed-off-by: Huacai Chen --- drivers/infiniband/hw/mthca/mthca_main.c | 2 +- drivers/media/v4l2-core/videobuf2-dma-contig.c | 2 +- drivers/net/ethernet/broadcom/b44.c | 8 +++ drivers/net/ethernet/ibm/emac/core.c | 22 ++--- drivers/net/ethernet/ibm/emac/core.h | 6 ++--- drivers/net/ethernet/mellanox/mlx4/main.c | 2 +- drivers/spi/spi-qup.c | 4 ++-- drivers/tty/serial/mpsc.c | 33 ++ drivers/tty/serial/samsung.c | 14 +-- include/linux/dma-mapping.h | 17 + 10 files changed, 68 insertions(+), 42 deletions(-) diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index e36a9bc..078fe8d 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -416,7 +416,7 @@ static int mthca_init_icm(struct mthca_dev *mdev, /* CPU writes to non-reserved MTTs, while HCA might DMA to reserved mtts */ mdev->limits.reserved_mtts = ALIGN(mdev->limits.reserved_mtts * mdev->limits.mtt_seg_size, - dma_get_cache_alignment()) / mdev->limits.mtt_seg_size; + dma_get_cache_alignment(&mdev->pdev->dev)) / mdev->limits.mtt_seg_size; mdev->mr_table.mtt_table = mthca_alloc_icm_table(mdev, init_hca->mtt_base, mdev->limits.mtt_seg_size, diff --git a/drivers/media/v4l2-core/videobuf2-dma-contig.c b/drivers/media/v4l2-core/videobuf2-dma-contig.c index 9f389f3..1f6a9b7 100644 ---
a/drivers/media/v4l2-core/videobuf2-dma-contig.c +++ b/drivers/media/v4l2-core/videobuf2-dma-contig.c @@ -484,7 +484,7 @@ static void *vb2_dc_get_userptr(struct device *dev, unsigned long vaddr, int ret = 0; struct sg_table *sgt; unsigned long contig_size; - unsigned long dma_align = dma_get_cache_alignment(); + unsigned long dma_align = dma_get_cache_alignment(dev); /* Only cache aligned DMA transfers are reliable */ if (!IS_ALIGNED(vaddr | size, dma_align)) { diff --git a/drivers/net/ethernet/broadcom/b44.c b/drivers/net/ethernet/broadcom/b44.c index a1125d1..2f6ffe5 100644 --- a/drivers/net/ethernet/broadcom/b44.c +++ b/drivers/net/ethernet/broadcom/b44.c @@ -2344,6 +2344,10 @@ static int b44_init_one(struct ssb_device *sdev, struct net_device *dev; struct b44 *bp; int err; + unsigned int dma_desc_align_size = dma_get_cache_alignment(sdev->dma_dev); + + /* Setup paramaters for syncing RX/TX DMA descriptors */ + dma_desc_sync_size = max_t(unsigned int, dma_desc_align_size, sizeof(struct dma_desc)); instance++; @@ -2587,12 +2591,8 @@ static inline void b44_pci_exit(void) static int __init b44_init(void) { - unsigned int dma_desc_align_size = dma_get_cache_alignment(); int err; - /* Setup paramaters for syncing RX/TX DMA descriptors */ - dma_desc_sync_size = max_t(unsigned int, dma_desc_align_size, sizeof(struct dma_desc)); - err = b44_pci_init(); if (err) return err; diff --git a/drivers/net/ethernet/ibm/emac/core.c b/drivers/net/ethernet/ibm/emac/core.c index 7feff24..b0c5319 100644 --- a/drivers/net/ethernet/ibm/emac/core.c +++ b/drivers/net/ethernet/ibm/emac/core.c @@ -1030,8 +1030,9 @@ static int emac_set_mac_address(struct net_device *ndev, void *sa) static int emac_resize_rx_ring(struct emac_instance *dev, int new_mtu) { - int rx_sync_size = emac_rx_sync_size(new_mtu); - int rx_skb_size = emac_rx_skb_size(new_mtu); + struct device *dma_dev = &dev->ofdev->dev; + int rx_skb_size = emac_rx_skb_size(dma_dev, new_mtu); + int rx_sync_size = emac_rx_sync_size(dma_dev,
new_mtu); int i, ret = 0; int mr1_jumbo_bit_change = 0; @@ -1115,20 +1116,21 @@ static int
Re: [PATCH] scsi: libfc: Convert timers to use timer_setup()
Looks good, Acked-by: Johannes Thumshirn -- Johannes Thumshirn Storage jthumsh...@suse.de +49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
[PATCH] scsi: Add 'AIX VDASD' to blacklist
The AIX VDASD devices do support VPD pages, but implement only SPC. So set BLIST_TRY_VPD_PAGES to correctly display the VPD information in sysfs. Signed-off-by: Hannes Reinecke --- drivers/scsi/scsi_devinfo.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c index d39b27c..ae2ad84 100644 --- a/drivers/scsi/scsi_devinfo.c +++ b/drivers/scsi/scsi_devinfo.c @@ -134,6 +134,7 @@ struct scsi_dev_info_list_table { {"3PARdata", "VV", NULL, BLIST_REPORTLUN2}, {"ADAPTEC", "AACRAID", NULL, BLIST_FORCELUN}, {"ADAPTEC", "Adaptec 5400S", NULL, BLIST_FORCELUN}, + {"AIX", "VDASD", NULL, BLIST_TRY_VPD_PAGES}, {"AFT PRO", "-IX CF", "0.0>", BLIST_FORCELUN}, {"BELKIN", "USB 2 HS-CF", "1.95", BLIST_FORCELUN | BLIST_INQUIRY_36}, {"BROWNIE", "1200U3P", NULL, BLIST_NOREPORTLUN}, -- 1.8.5.6
[PATCH] scsi: Add REPORTLUN2 to EMC SYMMETRIX blacklist entry
From: Kurt Garloff All EMC SYMMETRIX arrays support REPORT_LUNS, even if configured to report SCSI-2 for whatever reason. Signed-off-by: Kurt Garloff Signed-off-by: Hannes Reinecke --- drivers/scsi/scsi_devinfo.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c index a998585..555269b 100644 --- a/drivers/scsi/scsi_devinfo.c +++ b/drivers/scsi/scsi_devinfo.c @@ -161,7 +161,7 @@ struct scsi_dev_info_list_table { {"DGC", "RAID", NULL, BLIST_SPARSELUN}, /* Dell PV 650F, storage on LUN 0 */ {"DGC", "DISK", NULL, BLIST_SPARSELUN}, /* Dell PV 650F, no storage on LUN 0 */ {"EMC", "Invista", "*", BLIST_SPARSELUN | BLIST_LARGELUN}, - {"EMC", "SYMMETRIX", NULL, BLIST_SPARSELUN | BLIST_LARGELUN | BLIST_FORCELUN}, + {"EMC", "SYMMETRIX", NULL, BLIST_SPARSELUN | BLIST_LARGELUN | BLIST_REPORTLUN2}, {"EMULEX", "MD21/S2 ESDI", NULL, BLIST_SINGLELUN}, {"easyRAID", "16P", NULL, BLIST_NOREPORTLUN}, {"easyRAID", "X6P", NULL, BLIST_NOREPORTLUN}, -- 1.8.5.6
[PATCH] scsi: Add TRY_VPD_PAGES to HITACHI OPEN-V blacklist entry
HITACHI arrays always support VPD pages, even though they claim to support only SCSI revision 3. Signed-off-by: Hannes Reinecke --- drivers/scsi/scsi_devinfo.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c index ae2ad84..a998585 100644 --- a/drivers/scsi/scsi_devinfo.c +++ b/drivers/scsi/scsi_devinfo.c @@ -174,7 +174,7 @@ struct scsi_dev_info_list_table { {"HITACHI", "DF500", "*", BLIST_REPORTLUN2}, {"HITACHI", "DISK-SUBSYSTEM", "*", BLIST_REPORTLUN2}, {"HITACHI", "HUS1530", "*", BLIST_NO_DIF}, - {"HITACHI", "OPEN-", "*", BLIST_REPORTLUN2}, + {"HITACHI", "OPEN-", "*", BLIST_REPORTLUN2 | BLIST_TRY_VPD_PAGES}, {"HITACHI", "OP-C-", "*", BLIST_SPARSELUN | BLIST_LARGELUN}, {"HITACHI", "3380-", "*", BLIST_SPARSELUN | BLIST_LARGELUN}, {"HITACHI", "3390-", "*", BLIST_SPARSELUN | BLIST_LARGELUN}, -- 1.8.5.6
[PATCH] scsi: Always retry internal target error
EMC Symmetrix returns 'internal target error' for a variety of conditions, most of which are transient. So we should always retry it, even with failfast set; otherwise we would get spurious path flaps with multipath. Signed-off-by: Hannes Reinecke --- drivers/scsi/scsi_error.c | 13 + 1 file changed, 13 insertions(+) diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 5086489..5722c2e 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -497,6 +497,16 @@ int scsi_check_sense(struct scsi_cmnd *scmd) if (sshdr.asc == 0x10) /* DIF */ return SUCCESS; + if (!strncmp(scmd->device->vendor, "EMC", 3) && + !strncmp(scmd->device->model, "SYMMETRIX", 9) && + (sshdr.asc == 0x44) && (sshdr.ascq == 0x0)) { + /* +* EMC Symmetrix returns 'Internal target failure' +* for a variety of internal issues, all of which +* can be recovered by retry. +*/ + return ADD_TO_MLQUEUE; + } return NEEDS_RETRY; case NOT_READY: case UNIT_ATTENTION: @@ -1846,6 +1856,9 @@ int scsi_decide_disposition(struct scsi_cmnd *scmd) rtn = scsi_check_sense(scmd); if (rtn == NEEDS_RETRY) goto maybe_retry; + else if (rtn == ADD_TO_MLQUEUE) + /* Always enforce a retry for ADD_TO_MLQUEUE */ + rtn = NEEDS_RETRY; /* if rtn == FAILED, we have no sense information; * returning FAILED will wake the error handler thread * to collect the sense and redo the decide -- 1.8.5.6
[PATCH] sd: always retry READ CAPACITY for ALUA state transition
During ALUA state transitions the device might return a sense code 02/04/0a (Logical unit not accessible, asymmetric access state transition). As this is a transient error we should just retry the READ CAPACITY call until the state transition finishes and the correct capacity can be returned. Signed-off-by: Hannes Reinecke --- drivers/scsi/sd.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 37daf9a..b4647f5 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -2333,6 +2333,11 @@ static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp, * give it one more chance */ if (--reset_retries > 0) continue; + if (sense_valid && + sshdr.sense_key == NOT_READY && + sshdr.asc == 0x04 && sshdr.ascq == 0x0A) + /* ALUA state transition; always retry */ + continue; } retries--; @@ -2418,6 +2423,11 @@ static int read_capacity_10(struct scsi_disk *sdkp, struct scsi_device *sdp, * give it one more chance */ if (--reset_retries > 0) continue; + if (sense_valid && + sshdr.sense_key == NOT_READY && + sshdr.asc == 0x04 && sshdr.ascq == 0x0A) + /* ALUA state transition; always retry */ + continue; } retries--; -- 1.8.5.6
[PATCH] scsi: Handle power-on reset unit attention
As per SAM there is a precedence for unit attention conditions, with any sense code 29/XX ranking second, just after an ACA ACTIVE status. Additionally, a target might prefer not to queue any unit attention conditions but to report just one, which due to the above will be the one with the highest precedence. As a result the sense code 29/XX effectively overwrites any other unit attention. Hence we should report the power-on reset to userland so that it can take appropriate action. Signed-off-by: Hannes Reinecke --- drivers/scsi/scsi_error.c | 6 ++ drivers/scsi/scsi_lib.c | 4 include/scsi/scsi_device.h | 3 ++- 3 files changed, 12 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 5722c2e..f49710a 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -403,6 +403,12 @@ static void scsi_report_sense(struct scsi_device *sdev, "threshold.\n"); } + if (sshdr->asc == 0x29) { + evt_type = SDEV_EVT_POWER_ON_RESET_OCCURRED; + sdev_printk(KERN_WARNING, sdev, + "Power-on or device reset occurred\n"); + } + if (sshdr->asc == 0x2a && sshdr->ascq == 0x01) { evt_type = SDEV_EVT_MODE_PARAMETER_CHANGE_REPORTED; sdev_printk(KERN_WARNING, sdev, diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index c72b97a..e5fcfa8 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -2743,6 +2743,9 @@ static void scsi_evt_emit(struct scsi_device *sdev, struct scsi_event *evt) case SDEV_EVT_ALUA_STATE_CHANGE_REPORTED: envp[idx++] = "SDEV_UA=ASYMMETRIC_ACCESS_STATE_CHANGED"; break; + case SDEV_EVT_POWER_ON_RESET_OCCURRED: + envp[idx++] = "SDEV_UA=POWER_ON_RESET_OCCURRED"; + break; default: /* do nothing */ break; @@ -2847,6 +2850,7 @@ struct scsi_event *sdev_evt_alloc(enum scsi_device_event evt_type, case SDEV_EVT_MODE_PARAMETER_CHANGE_REPORTED: case SDEV_EVT_LUN_CHANGE_REPORTED: case SDEV_EVT_ALUA_STATE_CHANGE_REPORTED: + case SDEV_EVT_POWER_ON_RESET_OCCURRED: default: /* do nothing */ break; diff --git
a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h index 82e93ee..d68985c 100644 --- a/include/scsi/scsi_device.h +++ b/include/scsi/scsi_device.h @@ -64,9 +64,10 @@ enum scsi_device_event { SDEV_EVT_MODE_PARAMETER_CHANGE_REPORTED,/* 2A 01 UA reported */ SDEV_EVT_LUN_CHANGE_REPORTED, /* 3F 0E UA reported */ SDEV_EVT_ALUA_STATE_CHANGE_REPORTED,/* 2A 06 UA reported */ + SDEV_EVT_POWER_ON_RESET_OCCURRED, /* 29 00 UA reported */ SDEV_EVT_FIRST = SDEV_EVT_MEDIA_CHANGE, - SDEV_EVT_LAST = SDEV_EVT_ALUA_STATE_CHANGE_REPORTED, + SDEV_EVT_LAST = SDEV_EVT_POWER_ON_RESET_OCCURRED, SDEV_EVT_MAXBITS= SDEV_EVT_LAST + 1 }; -- 1.8.5.6
[PATCH] scsi: Do not retry invalid function error
Hitachi USP-V returns 'Invalid function' when its internal staging mechanism encounters an error. These errors should not be retried on another path. Signed-off-by: Hannes Reinecke --- drivers/scsi/scsi_error.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 3894205..5086489 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -579,6 +579,7 @@ int scsi_check_sense(struct scsi_cmnd *scmd) case ILLEGAL_REQUEST: if (sshdr.asc == 0x20 || /* Invalid command operation code */ sshdr.asc == 0x21 || /* Logical block address out of range */ + sshdr.asc == 0x22 || /* Invalid function */ sshdr.asc == 0x24 || /* Invalid field in cdb */ sshdr.asc == 0x26) { /* Parameter value invalid */ set_host_byte(scmd, DID_TARGET_FAILURE); -- 1.8.5.6
Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
On 10/17/2017 03:29 AM, Ming Lei wrote: > On Mon, Oct 16, 2017 at 01:30:09PM +0200, Hannes Reinecke wrote: >> On 10/13/2017 07:29 PM, Ming Lei wrote: >>> On Fri, Oct 13, 2017 at 05:08:52PM +, Bart Van Assche wrote: On Sat, 2017-10-14 at 00:45 +0800, Ming Lei wrote: > On Fri, Oct 13, 2017 at 04:31:04PM +, Bart Van Assche wrote: >> On Sat, 2017-10-14 at 00:07 +0800, Ming Lei wrote: >>> Actually it is in hot path, for example, lpfc and qla2xx's queue depth >>> is 3, >> >> Sorry but I doubt whether that is correct. More in general, I don't know >> any modern >> storage HBA for which the default queue depth is so low. > > You can grep: > > [ming@ming linux]$ git grep -n cmd_per_lun ./drivers/scsi/ | grep -E > "qla2xxx|lpfc" Such a low queue depth will result in suboptimal performance for adapters that communicate over a storage network. I think that's a bug and that both adapters support much higher cmd_per_lun values. (+James Smart) James, can you explain us why commit 445cf4f4d2aa decreased LPFC_CMD_PER_LUN from 30 to 3? Was that perhaps a workaround for a bug in a specific target implementation? (+Himanshu Madhani) Himanshu, do you perhaps know whether it is safe to increase cmd_per_lun for the qla2xxx initiator driver to the scsi_host->can_queue value? >>> >>> ->can_queue is size of the whole tag space shared by all LUNs, looks it >>> isn't >>> reasonable to increase cmd_per_lun to .can_queue. >>> >> '3' is just a starting point; later on it'll be adjusted via >> scsi_change_depth(). >> Looks like it's not working correctly with blk-mq, though. > > At default, in scsi_alloc_sdev(), q->queue_depth is set as > host->cmd_per_lun. You are right, q->queue_depth can be adjusted > later too. > > q->queue_depth is respected in scsi_dev_queue_ready(). > .cmd_per_lun defines the max outstanding cmds for each lun, I > guess it is respected by some hardware inside. > No, this is purely a linux abstraction. Nothing to do with the hardware. 
> For example, I remembered that on lpfc q->queue_depth is 30 because > the default 'lpfc_lun_queue_depth' is 30. And its .cmd_per_lun is 3. > Per my observation, this .cmd_per_lun limit is still workable. > Again, these are just some pre-defined values to avoid I/O starvation when having several LUNs. _if_ we can guarantee I/O fairness between several (hundreds!) devices all sharing the same tagspace we wouldn't need these variables. Cheers, Hannes
Re: [PATCH v9 10/10] block, nvme: Introduce blk_mq_req_flags_t
On 10/17/2017 01:29 AM, Bart Van Assche wrote: > Several block layer and NVMe core functions accept a combination > of BLK_MQ_REQ_* flags through the 'flags' argument but there is > no verification at compile time whether the right type of block > layer flags is passed. Make it possible for sparse to verify this. > This patch does not change any functionality. > > Signed-off-by: Bart Van Assche > Cc: linux-n...@lists.infradead.org > Cc: Christoph Hellwig > Cc: Hannes Reinecke > Cc: Johannes Thumshirn > Cc: Ming Lei > --- > block/blk-core.c | 12 ++-- > block/blk-mq.c| 4 ++-- > block/blk-mq.h| 2 +- > drivers/nvme/host/core.c | 5 +++-- > drivers/nvme/host/nvme.h | 5 +++-- > include/linux/blk-mq.h| 17 +++-- > include/linux/blk_types.h | 2 ++ > include/linux/blkdev.h| 4 ++-- > 8 files changed, 30 insertions(+), 21 deletions(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes
Re: [PATCH v9 09/10] block, scsi: Make SCSI quiesce and resume work reliably
On 10/17/2017 01:29 AM, Bart Van Assche wrote: > The contexts from which a SCSI device can be quiesced or resumed are: > * Writing into /sys/class/scsi_device/*/device/state. > * SCSI parallel (SPI) domain validation. > * The SCSI device power management methods. See also scsi_bus_pm_ops. > > It is essential during suspend and resume that neither the filesystem > state nor the filesystem metadata in RAM changes. This is why while > the hibernation image is being written or restored that SCSI devices > are quiesced. The SCSI core quiesces devices through scsi_device_quiesce() > and scsi_device_resume(). In the SDEV_QUIESCE state execution of > non-preempt requests is deferred. This is realized by returning > BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI > devices. Avoid that a full queue prevents power management requests > to be submitted by deferring allocation of non-preempt requests for > devices in the quiesced state. This patch has been tested by running > the following commands and by verifying that after resume the fio job > is still running: > (We've discussed this at ALPSS already:) How do you ensure that PREEMPT requests are not stuck in the queue _behind_ non-PREEMPT requests? Once they are in the queue the request are already allocated, so your deferred allocation won't work. _And_ deferred requests will be re-inserted at the head of the queue. Consequently the PREEMPT request will never ever scheduled during quiesce. How do you avoid such a scenario? Cheers, Hannes
Re: [PATCH v9 08/10] block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
On 10/17/2017 01:29 AM, Bart Van Assche wrote: > This flag will be used in the next patch to let the block layer > core know whether or not a SCSI request queue has been quiesced. > A quiesced SCSI queue namely only processes RQF_PREEMPT requests. > > Signed-off-by: Bart Van Assche > Tested-by: Martin Steigerwald > Cc: Ming Lei > Cc: Christoph Hellwig > Cc: Hannes Reinecke > Cc: Johannes Thumshirn > --- > block/blk-core.c | 30 ++ > block/blk-mq-debugfs.c | 1 + > include/linux/blkdev.h | 6 ++ > 3 files changed, 37 insertions(+) > Reviewed-by: Hannes Reinecke Cheers, Hannes
Re: [PATCH v9 07/10] ide, scsi: Tell the block layer at request allocation time about preempt requests
On 10/17/2017 01:29 AM, Bart Van Assche wrote: > Convert blk_get_request(q, op, __GFP_RECLAIM) into > blk_get_request_flags(q, op, BLK_MQ_PREEMPT). This patch does not > change any functionality. > > Signed-off-by: Bart Van Assche > Tested-by: Martin Steigerwald > Acked-by: David S. Miller [ for IDE ] > Cc: Martin K. Petersen > Cc: Ming Lei > Cc: Christoph Hellwig > Cc: Hannes Reinecke > Cc: Johannes Thumshirn > --- > drivers/ide/ide-pm.c| 4 ++-- > drivers/scsi/scsi_lib.c | 6 +++--- > 2 files changed, 5 insertions(+), 5 deletions(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes
Re: [PATCH v9 05/10] block: Introduce blk_get_request_flags()
On 10/17/2017 01:29 AM, Bart Van Assche wrote: > A side effect of this patch is that the GFP mask that is passed to > several allocation functions in the legacy block layer is changed > from GFP_KERNEL into __GFP_DIRECT_RECLAIM. > > Signed-off-by: Bart Van Assche > Tested-by: Martin Steigerwald > Cc: Christoph Hellwig > Cc: Ming Lei > Cc: Hannes Reinecke > Cc: Johannes Thumshirn > --- > block/blk-core.c | 50 +++--- > include/linux/blkdev.h | 3 +++ > 2 files changed, 38 insertions(+), 15 deletions(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes
Re: [PATCH v9 06/10] block: Introduce BLK_MQ_REQ_PREEMPT
On 10/17/2017 01:29 AM, Bart Van Assche wrote: > Set RQF_PREEMPT if BLK_MQ_REQ_PREEMPT is passed to > blk_get_request_flags(). > > Signed-off-by: Bart Van Assche > Tested-by: Martin Steigerwald > Cc: Christoph Hellwig > Cc: Ming Lei > Cc: Hannes Reinecke > Cc: Johannes Thumshirn > --- > block/blk-core.c | 4 +++- > block/blk-mq.c | 2 ++ > include/linux/blk-mq.h | 1 + > 3 files changed, 6 insertions(+), 1 deletion(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes
Re: [PATCH v9 04/10] block: Make q_usage_counter also track legacy requests
On 10/17/2017 01:28 AM, Bart Van Assche wrote: > From: Ming Lei > > This patch makes it possible to pause request allocation for > the legacy block layer by calling blk_mq_freeze_queue() and > blk_mq_unfreeze_queue(). > > Signed-off-by: Ming Lei > [ bvanassche: Combined two patches into one, edited a comment and made sure > REQ_NOWAIT is handled properly in blk_old_get_request() ] > Signed-off-by: Bart Van Assche > Reviewed-by: Christoph Hellwig > Reviewed-by: Johannes Thumshirn > Tested-by: Martin Steigerwald > Cc: Ming Lei > Cc: Hannes Reinecke > --- > block/blk-core.c | 12 > block/blk-mq.c | 10 ++ > 2 files changed, 14 insertions(+), 8 deletions(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes
Re: [PATCH v9 03/10] md: Neither resync nor reshape while the system is frozen
On 10/17/2017 01:28 AM, Bart Van Assche wrote: > Some people use the md driver on laptops and use the suspend and > resume functionality. Since it is essential that submitting of > new I/O requests stops before a hibernation image is created, > interrupt the md resync and reshape actions if the system is > being frozen. Note: the resync and reshape will restart after > the system is resumed and a message similar to the following > will appear in the system log: > > md: md0: data-check interrupted. > > Signed-off-by: Bart Van Assche > Reviewed-by: Shaohua Li > Tested-by: Martin Steigerwald > Cc: linux-r...@vger.kernel.org > Cc: Ming Lei > Cc: Christoph Hellwig > Cc: Hannes Reinecke > Cc: Johannes Thumshirn > --- > drivers/md/md.c | 50 +- > drivers/md/md.h | 8 > 2 files changed, 57 insertions(+), 1 deletion(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes
Re: [PATCH v9 02/10] md: Introduce md_stop_all_writes()
On 10/17/2017 01:28 AM, Bart Van Assche wrote: > Introduce md_stop_all_writes() because the next patch will add > a second caller for this function. This patch does not change > any functionality. > > Signed-off-by: Bart Van Assche > Reviewed-by: Johannes Thumshirn > Reviewed-by: Shaohua Li > Tested-by: Martin Steigerwald > Cc: linux-r...@vger.kernel.org > Cc: Ming Lei > Cc: Christoph Hellwig > Cc: Hannes Reinecke > --- > drivers/md/md.c | 9 +++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/drivers/md/md.c b/drivers/md/md.c > index 8933cafc212d..b99584e5d6b1 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -8937,8 +8937,7 @@ int rdev_clear_badblocks(struct md_rdev *rdev, sector_t > s, int sectors, > } > EXPORT_SYMBOL_GPL(rdev_clear_badblocks); > > -static int md_notify_reboot(struct notifier_block *this, > - unsigned long code, void *x) > +static void md_stop_all_writes(void) > { > struct list_head *tmp; > struct mddev *mddev; > @@ -8962,6 +8961,12 @@ static int md_notify_reboot(struct notifier_block > *this, >*/ > if (need_delay) > mdelay(1000*1); > +} > + > +static int md_notify_reboot(struct notifier_block *this, > + unsigned long code, void *x) > +{ > + md_stop_all_writes(); > > return NOTIFY_DONE; > } > Reviewed-by: Hannes Reinecke Cheers, Hannes
Re: [PATCH v9 01/10] md: Rename md_notifier into md_reboot_notifier
On 10/17/2017 01:28 AM, Bart Van Assche wrote: > This avoids confusion with the pm notifier that will be added > through a later patch. > > Signed-off-by: Bart Van Assche > Reviewed-by: Johannes Thumshirn > Reviewed-by: Shaohua Li > Tested-by: Martin Steigerwald > Cc: linux-r...@vger.kernel.org > Cc: Ming Lei > Cc: Christoph Hellwig > Cc: Hannes Reinecke > --- > drivers/md/md.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/md/md.c b/drivers/md/md.c > index 5d61049e7417..8933cafc212d 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -8966,7 +8966,7 @@ static int md_notify_reboot(struct notifier_block *this, > return NOTIFY_DONE; > } > > -static struct notifier_block md_notifier = { > +static struct notifier_block md_reboot_notifier = { > .notifier_call = md_notify_reboot, > .next = NULL, > .priority = INT_MAX, /* before any real devices */ > @@ -9003,7 +9003,7 @@ static int __init md_init(void) > blk_register_region(MKDEV(mdp_major, 0), 1UL<<MINORBITS, THIS_MODULE, > md_probe, NULL, NULL); > > - register_reboot_notifier(&md_notifier); > + register_reboot_notifier(&md_reboot_notifier); > raid_table_header = register_sysctl_table(raid_root_table); > > md_geninit(); > @@ -9243,7 +9243,7 @@ static __exit void md_exit(void) > > unregister_blkdev(MD_MAJOR,"md"); > unregister_blkdev(mdp_major, "mdp"); > - unregister_reboot_notifier(&md_notifier); > + unregister_reboot_notifier(&md_reboot_notifier); > unregister_sysctl_table(raid_table_header); > > /* We cannot unload the modules while some process is > Reviewed-by: Hannes Reinecke Cheers, Hannes
Re: [PATCH v2 8/8] scsi/pmcraid: Use sgl_alloc_order() and sgl_free_order()
On 10/17/2017 12:49 AM, Bart Van Assche wrote:
> Use the sgl_alloc_order() and sgl_free_order() functions instead
> of open coding these functions.
>
> Signed-off-by: Bart Van Assche
> Reviewed-by: Johannes Thumshirn
> Cc: linux-scsi@vger.kernel.org
> Cc: Martin K. Petersen
> Cc: Anil Ravindranath
> ---
>  drivers/scsi/Kconfig   |  1 +
>  drivers/scsi/pmcraid.c | 43 +++----------------------------------------
>  drivers/scsi/pmcraid.h |  2 +-
>  3 files changed, 6 insertions(+), 40 deletions(-)
>
> diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
> index d11e75e76195..632200ec36a6 100644
> --- a/drivers/scsi/Kconfig
> +++ b/drivers/scsi/Kconfig
> @@ -1576,6 +1576,7 @@ config ZFCP
>  config SCSI_PMCRAID
>  	tristate "PMC SIERRA Linux MaxRAID adapter support"
>  	depends on PCI && SCSI && NET
> +	select SGL_ALLOC
>  	---help---
>  	  This driver supports the PMC SIERRA MaxRAID adapters.
>
> diff --git a/drivers/scsi/pmcraid.c b/drivers/scsi/pmcraid.c
> index b4d6cd8cd1ad..95fc0352f9bb 100644
> --- a/drivers/scsi/pmcraid.c
> +++ b/drivers/scsi/pmcraid.c
> @@ -3232,12 +3232,7 @@ static int pmcraid_build_ioadl(
>   */
>  static void pmcraid_free_sglist(struct pmcraid_sglist *sglist)
>  {
> -	int i;
> -
> -	for (i = 0; i < sglist->num_sg; i++)
> -		__free_pages(sg_page(&(sglist->scatterlist[i])),
> -			     sglist->order);
> -
> +	sgl_free_order(sglist->scatterlist, sglist->order);
>  	kfree(sglist);
>  }
>
> @@ -3254,50 +3249,20 @@ static void pmcraid_free_sglist(struct pmcraid_sglist *sglist)
>  static struct pmcraid_sglist *pmcraid_alloc_sglist(int buflen)
>  {
>  	struct pmcraid_sglist *sglist;
> -	struct scatterlist *scatterlist;
> -	struct page *page;
> -	int num_elem, i, j;
>  	int sg_size;
>  	int order;
> -	int bsize_elem;
>
>  	sg_size = buflen / (PMCRAID_MAX_IOADLS - 1);
>  	order = (sg_size > 0) ? get_order(sg_size) : 0;
> -	bsize_elem = PAGE_SIZE * (1 << order);
> -
> -	/* Determine the actual number of sg entries needed */
> -	if (buflen % bsize_elem)
> -		num_elem = (buflen / bsize_elem) + 1;
> -	else
> -		num_elem = buflen / bsize_elem;
>
>  	/* Allocate a scatter/gather list for the DMA */
> -	sglist = kzalloc(sizeof(struct pmcraid_sglist) +
> -			 (sizeof(struct scatterlist) * (num_elem - 1)),
> -			 GFP_KERNEL);
> -
> +	sglist = kzalloc(sizeof(struct pmcraid_sglist), GFP_KERNEL);
>  	if (sglist == NULL)
>  		return NULL;
>
> -	scatterlist = sglist->scatterlist;
> -	sg_init_table(scatterlist, num_elem);
>  	sglist->order = order;
> -	sglist->num_sg = num_elem;
> -	sg_size = buflen;
> -
> -	for (i = 0; i < num_elem; i++) {
> -		page = alloc_pages(GFP_KERNEL|GFP_DMA|__GFP_ZERO, order);
> -		if (!page) {
> -			for (j = i - 1; j >= 0; j--)
> -				__free_pages(sg_page(&scatterlist[j]), order);
> -			kfree(sglist);
> -			return NULL;
> -		}
> -
> -		sg_set_page(&scatterlist[i], page,
> -			    sg_size < bsize_elem ? sg_size : bsize_elem, 0);
> -		sg_size -= bsize_elem;
> -	}
> +	sgl_alloc_order(buflen, order, false,
> +			GFP_KERNEL | GFP_DMA | __GFP_ZERO, &sglist->num_sg);
>
>  	return sglist;
>  }
>
> diff --git a/drivers/scsi/pmcraid.h b/drivers/scsi/pmcraid.h
> index 44da91712115..754ef30927e2 100644
> --- a/drivers/scsi/pmcraid.h
> +++ b/drivers/scsi/pmcraid.h
> @@ -542,7 +542,7 @@ struct pmcraid_sglist {
>  	u32 order;
>  	u32 num_sg;
>  	u32 num_dma_sg;
> -	struct scatterlist scatterlist[1];
> +	struct scatterlist *scatterlist;
>  };
>
>  /* page D0 inquiry data of focal point resource */
>
The comment on the ipr patch applies here, too. We need input from the pmcraid folks on whether this is a valid conversion.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH v2 7/8] scsi/pmcraid: Remove an unused structure member
On 10/17/2017 12:49 AM, Bart Van Assche wrote:
> Signed-off-by: Bart Van Assche
> Reviewed-by: Johannes Thumshirn
> Cc: linux-scsi@vger.kernel.org
> Cc: Martin K. Petersen
> Cc: Anil Ravindranath
> ---
>  drivers/scsi/pmcraid.h | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/drivers/scsi/pmcraid.h b/drivers/scsi/pmcraid.h
> index 8bfac72a242b..44da91712115 100644
> --- a/drivers/scsi/pmcraid.h
> +++ b/drivers/scsi/pmcraid.h
> @@ -542,7 +542,6 @@ struct pmcraid_sglist {
>  	u32 order;
>  	u32 num_sg;
>  	u32 num_dma_sg;
> -	u32 buffer_len;
>  	struct scatterlist scatterlist[1];
>  };
>
This actually is the same story that we've had with ipr (and, looking at the code, those two drivers look awfully similar ...). pmcraid_sglist looks as if it's a hardware-dependent structure, so just removing one entry from the middle of a structure might not be a good idea. But this is something for the pmcraid folks to clarify.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH v2 6/8] scsi/ipr: Use sgl_alloc_order() and sgl_free_order()
On 10/17/2017 12:49 AM, Bart Van Assche wrote:
> Use the sgl_alloc_order() and sgl_free_order() functions instead
> of open coding these functions.
>
> Signed-off-by: Bart Van Assche
> Reviewed-by: Johannes Thumshirn
> Cc: linux-scsi@vger.kernel.org
> Cc: Martin K. Petersen
> Cc: Brian King
> ---
>  drivers/scsi/Kconfig |  1 +
>  drivers/scsi/ipr.c   | 49 ++++++-------------------------------------------
>  drivers/scsi/ipr.h   |  2 +-
>  3 files changed, 10 insertions(+), 42 deletions(-)
>
> diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
> index 41366339b950..d11e75e76195 100644
> --- a/drivers/scsi/Kconfig
> +++ b/drivers/scsi/Kconfig
> @@ -1058,6 +1058,7 @@ config SCSI_IPR
>  	depends on PCI && SCSI && ATA
>  	select FW_LOADER
>  	select IRQ_POLL
> +	select SGL_ALLOC
>  	---help---
>  	  This driver supports the IBM Power Linux family RAID adapters.
>  	  This includes IBM pSeries 5712, 5703, 5709, and 570A, as well
> diff --git a/drivers/scsi/ipr.c b/drivers/scsi/ipr.c
> index f838bd73befa..6fed5db6255e 100644
> --- a/drivers/scsi/ipr.c
> +++ b/drivers/scsi/ipr.c
> @@ -3815,10 +3815,8 @@ static struct device_attribute ipr_iopoll_weight_attr = {
>   **/
>  static struct ipr_sglist *ipr_alloc_ucode_buffer(int buf_len)
>  {
> -	int sg_size, order, bsize_elem, num_elem, i, j;
> +	int sg_size, order;
>  	struct ipr_sglist *sglist;
> -	struct scatterlist *scatterlist;
> -	struct page *page;
>
>  	/* Get the minimum size per scatter/gather element */
>  	sg_size = buf_len / (IPR_MAX_SGLIST - 1);
> @@ -3826,45 +3824,18 @@ static struct ipr_sglist *ipr_alloc_ucode_buffer(int buf_len)
>  	/* Get the actual size per element */
>  	order = get_order(sg_size);
>
> -	/* Determine the actual number of bytes per element */
> -	bsize_elem = PAGE_SIZE * (1 << order);
> -
> -	/* Determine the actual number of sg entries needed */
> -	if (buf_len % bsize_elem)
> -		num_elem = (buf_len / bsize_elem) + 1;
> -	else
> -		num_elem = buf_len / bsize_elem;
> -
>  	/* Allocate a scatter/gather list for the DMA */
> -	sglist = kzalloc(sizeof(struct ipr_sglist) +
> -			 (sizeof(struct scatterlist) * (num_elem - 1)),
> -			 GFP_KERNEL);
> -
> +	sglist = kzalloc(sizeof(struct ipr_sglist), GFP_KERNEL);
>  	if (sglist == NULL) {
>  		ipr_trace;
>  		return NULL;
>  	}
> -
> -	scatterlist = sglist->scatterlist;
> -	sg_init_table(scatterlist, num_elem);
> -
>  	sglist->order = order;
> -	sglist->num_sg = num_elem;
> -
> -	/* Allocate a bunch of sg elements */
> -	for (i = 0; i < num_elem; i++) {
> -		page = alloc_pages(GFP_KERNEL, order);
> -		if (!page) {
> -			ipr_trace;
> -
> -			/* Free up what we already allocated */
> -			for (j = i - 1; j >= 0; j--)
> -				__free_pages(sg_page(&scatterlist[j]), order);
> -			kfree(sglist);
> -			return NULL;
> -		}
> -
> -		sg_set_page(&scatterlist[i], page, 0, 0);
> +	sglist->scatterlist = sgl_alloc_order(buf_len, order, false, GFP_KERNEL,
> +					      &sglist->num_sg);
> +	if (!sglist->scatterlist) {
> +		kfree(sglist);
> +		return NULL;
>  	}
>
>  	return sglist;
> @@ -3882,11 +3853,7 @@ static struct ipr_sglist *ipr_alloc_ucode_buffer(int buf_len)
>   **/
>  static void ipr_free_ucode_buffer(struct ipr_sglist *sglist)
>  {
> -	int i;
> -
> -	for (i = 0; i < sglist->num_sg; i++)
> -		__free_pages(sg_page(&sglist->scatterlist[i]), sglist->order);
> -
> +	sgl_free_order(sglist->scatterlist, sglist->order);
>  	kfree(sglist);
>  }
>
> diff --git a/drivers/scsi/ipr.h b/drivers/scsi/ipr.h
> index c7f0e9e3cd7d..93570734cbfb 100644
> --- a/drivers/scsi/ipr.h
> +++ b/drivers/scsi/ipr.h
> @@ -1454,7 +1454,7 @@ struct ipr_sglist {
>  	u32 num_sg;
>  	u32 num_dma_sg;
>  	u32 buffer_len;
> -	struct scatterlist scatterlist[1];
> +	struct scatterlist *scatterlist;
>  };
>
>  enum ipr_sdt_state {
>
Not sure if this is a valid conversion. Originally the driver would allocate a single buffer; with this patch we end up with two distinct allocations. Given that this is used to download the microcode I'm not sure whether this isn't a hardware-dependent structure which requires a single buffer including the sglist. Brian, can you shed some light here?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH v2 5/8] target: Use sgl_alloc_order() and sgl_free()
On 10/17/2017 12:49 AM, Bart Van Assche wrote:
> Use the sgl_alloc_order() and sgl_free() functions instead of open
> coding these functions.
>
> Signed-off-by: Bart Van Assche
> Cc: Nicholas A. Bellinger
> Cc: Christoph Hellwig
> Cc: Hannes Reinecke
> Cc: Sagi Grimberg
> ---
>  drivers/target/Kconfig                 |  1 +
>  drivers/target/target_core_transport.c | 46 +++-----------------------------
>  2 files changed, 5 insertions(+), 42 deletions(-)
>
> diff --git a/drivers/target/Kconfig b/drivers/target/Kconfig
> index e2bc99980f75..4c44d7bed01a 100644
> --- a/drivers/target/Kconfig
> +++ b/drivers/target/Kconfig
> @@ -5,6 +5,7 @@ menuconfig TARGET_CORE
>  	select CONFIGFS_FS
>  	select CRC_T10DIF
>  	select BLK_SCSI_REQUEST # only for scsi_command_size_tbl..
> +	select SGL_ALLOC
>  	default n
>  	help
>  	Say Y or M here to enable the TCM Storage Engine and ConfigFS enabled
> diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
> index 836d552b0385..9bbd08be9d60 100644
> --- a/drivers/target/target_core_transport.c
> +++ b/drivers/target/target_core_transport.c
> @@ -2293,13 +2293,7 @@ static void target_complete_ok_work(struct work_struct *work)
>
>  void target_free_sgl(struct scatterlist *sgl, int nents)
>  {
> -	struct scatterlist *sg;
> -	int count;
> -
> -	for_each_sg(sgl, sg, nents, count)
> -		__free_page(sg_page(sg));
> -
> -	kfree(sgl);
> +	sgl_free(sgl);
>  }
>  EXPORT_SYMBOL(target_free_sgl);
>
> @@ -2423,42 +2417,10 @@ int
>  target_alloc_sgl(struct scatterlist **sgl, unsigned int *nents, u32 length,
>  		 bool zero_page, bool chainable)
>  {
> -	struct scatterlist *sg;
> -	struct page *page;
> -	gfp_t zero_flag = (zero_page) ? __GFP_ZERO : 0;
> -	unsigned int nalloc, nent;
> -	int i = 0;
> -
> -	nalloc = nent = DIV_ROUND_UP(length, PAGE_SIZE);
> -	if (chainable)
> -		nalloc++;
> -	sg = kmalloc_array(nalloc, sizeof(struct scatterlist), GFP_KERNEL);
> -	if (!sg)
> -		return -ENOMEM;
> +	gfp_t gfp = GFP_KERNEL | (zero_page ? __GFP_ZERO : 0);
>
> -	sg_init_table(sg, nalloc);
> -
> -	while (length) {
> -		u32 page_len = min_t(u32, length, PAGE_SIZE);
> -		page = alloc_page(GFP_KERNEL | zero_flag);
> -		if (!page)
> -			goto out;
> -
> -		sg_set_page(&sg[i], page, page_len, 0);
> -		length -= page_len;
> -		i++;
> -	}
> -	*sgl = sg;
> -	*nents = nent;
> -	return 0;
> -
> -out:
> -	while (i > 0) {
> -		i--;
> -		__free_page(sg_page(&sg[i]));
> -	}
> -	kfree(sg);
> -	return -ENOMEM;
> +	*sgl = sgl_alloc_order(length, 0, chainable, gfp, nents);
> +	return *sgl ? 0 : -ENOMEM;
>  }
>  EXPORT_SYMBOL(target_alloc_sgl);
>
The calling convention of target_alloc_sgl() is decidedly dodgy. Can't we convert it into returning the sgl, and remove the first parameter?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH v2 4/8] nvmet/rdma: Use sgl_alloc() and sgl_free()
On 10/17/2017 12:49 AM, Bart Van Assche wrote:
> Use the sgl_alloc() and sgl_free() functions instead of open coding
> these functions.
>
> Signed-off-by: Bart Van Assche
> Reviewed-by: Johannes Thumshirn
> Cc: Keith Busch
> Cc: Christoph Hellwig
> Cc: Sagi Grimberg
> ---
>  drivers/nvme/target/Kconfig |  1 +
>  drivers/nvme/target/rdma.c  | 63 +++--
>  2 files changed, 5 insertions(+), 59 deletions(-)
>
Reviewed-by: Hannes Reinecke

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH v2 3/8] nvmet/fc: Use sgl_alloc() and sgl_free()
On 10/17/2017 12:49 AM, Bart Van Assche wrote:
> Use the sgl_alloc() and sgl_free() functions instead of open coding
> these functions.
>
> Signed-off-by: Bart Van Assche
> Reviewed-by: Johannes Thumshirn
> Cc: Keith Busch
> Cc: Christoph Hellwig
> Cc: James Smart
> Cc: Sagi Grimberg
> ---
>  drivers/nvme/target/Kconfig |  1 +
>  drivers/nvme/target/fc.c    | 36 ++--
>  2 files changed, 3 insertions(+), 34 deletions(-)
>
Reviewed-by: Hannes Reinecke

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH v2 2/8] crypto: scompress - use sgl_alloc() and sgl_free()
On 10/17/2017 12:49 AM, Bart Van Assche wrote:
> Use the sgl_alloc() and sgl_free() functions instead of open coding
> these functions.
>
> Signed-off-by: Bart Van Assche
> Cc: Ard Biesheuvel
> Cc: Herbert Xu
> ---
>  crypto/Kconfig     |  1 +
>  crypto/scompress.c | 51 ++-
>  2 files changed, 3 insertions(+), 49 deletions(-)
>
Reviewed-by: Hannes Reinecke

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH v2 1/8] lib/scatterlist: Introduce sgl_alloc() and sgl_free()
On 10/17/2017 12:49 AM, Bart Van Assche wrote:
> Many kernel drivers contain code that allocates and frees both a
> scatterlist and the pages that populate that scatterlist.
> Introduce functions in lib/scatterlist.c that perform these tasks
> instead of duplicating this functionality in multiple drivers.
> Only include these functions in the build if CONFIG_SGL_ALLOC=y
> to avoid that the kernel size increases if this functionality is
> not used.
>
> Signed-off-by: Bart Van Assche
> ---
>  include/linux/scatterlist.h |  10 +
>  lib/Kconfig                 |   4 ++
>  lib/scatterlist.c           | 105 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 119 insertions(+)
>
Reviewed-by: Hannes Reinecke

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)