Re: [PATCH 0/3] Improve block device testing coverage
On Thu, Mar 30, 2017 at 05:19:01PM +0400, Dmitry Monakhov wrote: > During LSFMM we have discussed how to test lower-backend of linux IO-stack. > Common opinion was that xfstests is the most obvious solution which cover > most of use cases filesystem care about. > > I'm working on integration T10-DIF/DIF data integrity features to ext4, > for that reason we need to be shure that linux integrity framework is > in working state, which is currently broken in several places. > > In fact, it is relatively simple to add basic coverage tests for basic > IO operations over virtual device with integrity support. All we need > is to add lio target support. First: Thanks for adding block layer testing! Second: even more so than Darrick's blockdev fallocate test this is the wrong place. If I run xfstests I want to test my file system, not random block device features. Please start a proper block device testsuite instead, possibly by copy and pasting code from xfstests. That's how I started the test suite for qemu's block layer for example.
Re: [PATCH] csiostor: switch to pci_alloc_irq_vectors
On Fri, Jan 20, 2017 at 07:27:02PM -0500, Martin K. Petersen wrote: > > "Christoph" == Christoph Hellwig writes: > > Christoph> And get automatic MSI-X affinity for free. > > Chelsio folks: Please review and test! ping!
Re: [PATCH v2] sd: Consider max_xfer_blocks if opt_xfer_blocks is unusable
On Thu, 03/30 11:30, Martin K. Petersen wrote: > Fam Zheng writes: > > >>rw_max = min_not_zero(logical_to_sectors(sdp, dev_max), > >> BLK_DEF_MAX_SECTORS); > > > > Yes, it is better. Is it okay to make the change when you apply? > > Sure. Applied to 4.11/scsi-fixes. Cool. Thanks! Fam
Re: [PATCH 1/6] virtio: wrap find_vqs
On 2017年03月30日 22:32, Michael S. Tsirkin wrote: On Thu, Mar 30, 2017 at 02:00:08PM +0800, Jason Wang wrote: On 2017年03月30日 04:48, Michael S. Tsirkin wrote: We are going to add more parameters to find_vqs, let's wrap the call so we don't need to tweak all drivers every time. Signed-off-by: Michael S. Tsirkin --- A quick glance and it looks ok, but what the benefit of this series, is it required by other changes? Thanks Yes - to avoid touching all devices when doing the rest of the patchset. Maybe I'm not clear. I mean the benefit of this series not this single patch. I guess it may be used by you proposal that avoid reset when set XDP? If yes, do we really want to drop some packets after XDP is set? Thanks
[PATCH] tcmu: Skip Data-Out blocks before gathering Data-In buffer for BIDI case
From: Xiubo Li For the bidirectional case, the Data-Out buffer blocks will always at the head of the tcmu_cmd's bitmap, and before gathering the Data-In buffer, first of all it should skip the Data-Out ones, or the device supporting BIDI commands won't work. Fixed: 26418649eead ("target/user: Introduce data_bitmap, replace data_length/data_head/data_tail") Reported-by: Ilias Tsitsimpis Signed-off-by: Xiubo Li --- drivers/target/target_core_user.c | 48 +++ 1 file changed, 33 insertions(+), 15 deletions(-) diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c index ede815c..63e3a1b 100644 --- a/drivers/target/target_core_user.c +++ b/drivers/target/target_core_user.c @@ -311,24 +311,50 @@ static void free_data_area(struct tcmu_dev *udev, struct tcmu_cmd *cmd) DATA_BLOCK_BITS); } -static void gather_data_area(struct tcmu_dev *udev, unsigned long *cmd_bitmap, - struct scatterlist *data_sg, unsigned int data_nents) +static void gather_data_area(struct tcmu_dev *udev, struct tcmu_cmd *cmd, +bool bidi) { + struct se_cmd *se_cmd = cmd->se_cmd; int i, block; int block_remaining = 0; void *from, *to; size_t copy_bytes, from_offset; - struct scatterlist *sg; + struct scatterlist *sg, *data_sg; + unsigned int data_nents; + DECLARE_BITMAP(bitmap, DATA_BLOCK_BITS); + + bitmap_copy(bitmap, cmd->data_bitmap, DATA_BLOCK_BITS); + + if (!bidi) { + data_sg = se_cmd->t_data_sg; + data_nents = se_cmd->t_data_nents; + } else { + uint32_t count; + + /* +* For bidi case, the first count blocks are for Data-Out +* buffer blocks, and before gathering the Data-In buffer +* the Data-Out buffer blocks should be discarded. +*/ + count = DIV_ROUND_UP(se_cmd->data_length, DATA_BLOCK_SIZE); + while (count--) { + block = find_first_bit(bitmap, DATA_BLOCK_BITS); + clear_bit(block, bitmap); + } + + data_sg = se_cmd->t_bidi_data_sg; + data_nents = se_cmd->t_bidi_data_nents; + } for_each_sg(data_sg, sg, data_nents, i) { int sg_remaining = sg->length; to = kmap_atomic(sg_page(sg)) + sg->offset; while (sg_remaining > 0) { if (block_remaining == 0) { - block = find_first_bit(cmd_bitmap, + block = find_first_bit(bitmap, DATA_BLOCK_BITS); block_remaining = DATA_BLOCK_SIZE; - clear_bit(block, cmd_bitmap); + clear_bit(block, bitmap); } copy_bytes = min_t(size_t, sg_remaining, block_remaining); @@ -610,19 +636,11 @@ static void tcmu_handle_completion(struct tcmu_cmd *cmd, struct tcmu_cmd_entry * se_cmd->scsi_sense_length); free_data_area(udev, cmd); } else if (se_cmd->se_cmd_flags & SCF_BIDI) { - DECLARE_BITMAP(bitmap, DATA_BLOCK_BITS); - /* Get Data-In buffer before clean up */ - bitmap_copy(bitmap, cmd->data_bitmap, DATA_BLOCK_BITS); - gather_data_area(udev, bitmap, - se_cmd->t_bidi_data_sg, se_cmd->t_bidi_data_nents); + gather_data_area(udev, cmd, true); free_data_area(udev, cmd); } else if (se_cmd->data_direction == DMA_FROM_DEVICE) { - DECLARE_BITMAP(bitmap, DATA_BLOCK_BITS); - - bitmap_copy(bitmap, cmd->data_bitmap, DATA_BLOCK_BITS); - gather_data_area(udev, bitmap, - se_cmd->t_data_sg, se_cmd->t_data_nents); + gather_data_area(udev, cmd, false); free_data_area(udev, cmd); } else if (se_cmd->data_direction == DMA_TO_DEVICE) { free_data_area(udev, cmd); -- 1.8.3.1
Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
Mike Snitzer writes: Mike, > But while discussing this effort with Jeff Moyer he asked: shouldn't the > zeroed blocks be provisioned? This is a fairly embarassing question not > to be able to answer in the moment. So I clearly need to learn what the > overall intent of WRITE_ZEROES actually is. The intent is to guarantee all future reads to these blocks will return zeroes. Whether to allocate or deallocate or do a combination to achieve that goal is up to the device. > If it is meant as a replacement for WRITE_SAME (as hch switched dm-io > over from WRITE_SAME with a payload of 0 to WRITE_ZEROES) and for the > backing mechanism for blkdev_issue_zeroout() then I guess I have my > answer. Yes. I was hoping MD would use WRITE SAME to initialize data and parity drives. That's why I went with SAME nomenclature rather than ZEROES (which had just appeared in NVMe at that time). Christoph's renaming is mostly to emphasize the purpose and the semantics. People keep getting confused because both REQ_DISCARD and REQ_WRITE_SAME use the WRITE SAME SCSI command. But the two uses are completely orthogonal and governed by different heuristics in sd.c. > Unless DM thinp can guarantee that the discarded blocks will > always return zeroes (by zeroing before all partial block writes) my > discard based dm-thinp implementation of WRITE_ZEROES is a complete > throw-away (unless block zeroing is enabled.. which it never is because > performance sucks with it). So if an upper-level of the IO stack > (e.g. ext4) were to assume that a block will _definitely_ have zeroes > then DM thinp would fall short. You don't have a way to mark those blocks as being full of zeroes without actually writing them? Note that the fallback to a zeroout command is to do a regular write. So if DM doesn't zero the blocks, the block layer is going to it. -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH 12/23] sd: handle REQ_UNMAP
"h...@lst.de" writes: > If you manually change the provisioning mode to WS10 on a device that > must use WRITE SAME (16) to be able to address all blocks you're already > screwed right now, and with this patch you can screw yourself through > the WRITE_ZEROES path in addition to the DISCARD path. Oh, I see. We only had the LBA sanity check in place for write same, not for discard. -- Martin K. Petersen Oracle Linux Engineering
commit 1b6ac5e3f "fnic: Using rport->dd_data to check rport online instead of rport_lookup."
Hi Satish, My customer hit below error when issue LIP to fnic controller: [94702.898408] sd 2:0:4:1: [sdx] tag#1 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK [94702.898416] sd 2:0:4:1: [sdx] tag#1 CDB: Write(10) 2a 00 04 56 c0 08 00 00 08 00 [94702.898420] blk_update_request: I/O error, dev sdx, sector 72794120 [94702.898455] sd 2:0:4:10: [sdy] tag#2 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK [94702.898459] sd 2:0:4:10: [sdy] tag#2 CDB: Read(10) 28 00 00 00 08 a8 00 00 02 00 [94702.898462] blk_update_request: I/O error, dev sdy, sector 2216 Looked at all changes of fnic I found it caused by commit 1b6ac5e3f "fnic: Using rport->dd_data to check rport online instead of rport_lookup.", a question is why changed state check from RPORT_ST_DELETE to RPORT_ST_READY here? - if (!rdata || (rdata->rp_state == RPORT_ST_DELETE)) { - FNIC_SCSI_DBG(KERN_DEBUG, fnic->lport->host, - "returning IO as rport is removed\n"); - atomic64_inc(&fnic_stats->misc_stats.rport_not_ready); - sc->result = DID_NO_CONNECT; - done(sc); - return 0; + if (rport) { + struct fc_rport_libfc_priv *rp = rport->dd_data; + + if (!rp || rp->rp_state != RPORT_ST_READY) { + FNIC_SCSI_DBG(KERN_DEBUG, fnic->lport->host, + "returning DID_NO_CONNECT for IO as rport is removed\n"); Thanks, Joe
Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
On Thu, Mar 30 2017 at 11:20am -0400, Martin K. Petersen wrote: > Mike Snitzer writes: > > > I can work on this now. Only question I have is: should DM thinp take > > care to zero any misaligned head and tail? (I assume so but with all > > the back and forth between Bart, Paolo and Martin I figured I'd ask > > explicitly). > > Yep, let's make sure our semantics match the hardware ditto. > > - So write zeroes should behave deterministically and explicitly handle >any blocks that can't be cleared via deprovisioning. > > - And discard can work at the discard granularity in a >non-deterministic fashion. I got pretty far along with implementing the DM thinp support for WRITE_ZEROES in terms of thinp's DISCARD support (more of an implementation detail.. or so I thought). But while discussing this effort with Jeff Moyer he asked: shouldn't the zeroed blocks be provisioned? This is a fairly embarassing question not to be able to answer in the moment. So I clearly need to learn what the overall intent of WRITE_ZEROES actually is. If it is meant as a replacement for WRITE_SAME (as hch switched dm-io over from WRITE_SAME with a payload of 0 to WRITE_ZEROES) and for the backing mechanism for blkdev_issue_zeroout() then I guess I have my answer. Unless DM thinp can guarantee that the discarded blocks will always return zeroes (by zeroing before all partial block writes) my discard based dm-thinp implementation of WRITE_ZEROES is a complete throw-away (unless block zeroing is enabled.. which it never is because performance sucks with it). So if an upper-level of the IO stack (e.g. ext4) were to assume that a block will _definitely_ have zeroes then DM thinp would fall short. This is all to say: I don't see a quick way forward on implementing performant WRITE_ZEROES support for DM thinp.
[RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
Now that we are using p2pmem SG buffers we occasionally have to copy to and from this memory. For this, we add an iomem flag to sg_copy_buffer for copying with iomemcpy. We also add the sg_iocopy_ variants to use this more easily. Signed-off-by: Logan Gunthorpe Signed-off-by: Stephen Bates Signed-off-by: Steve Wise --- drivers/scsi/scsi_debug.c | 7 ++--- include/linux/scatterlist.h | 7 - lib/scatterlist.c | 64 ++--- 3 files changed, 65 insertions(+), 13 deletions(-) diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c index 17249c3..70c0d9f 100644 --- a/drivers/scsi/scsi_debug.c +++ b/drivers/scsi/scsi_debug.c @@ -1309,7 +1309,7 @@ static int resp_inquiry(struct scsi_cmnd *scp, struct sdebug_dev_info *devip) int lu_id_num, port_group_id, target_dev_id, len; char lu_id_str[6]; int host_no = devip->sdbg_host->shost->host_no; - + port_group_id = (((host_no + 1) & 0x7f) << 8) + (devip->channel & 0x7f); if (sdebug_vpd_use_hostno == 0) @@ -2381,14 +2381,15 @@ static int do_device_access(struct scsi_cmnd *scmd, u64 lba, u32 num, ret = sg_copy_buffer(sdb->table.sgl, sdb->table.nents, fake_storep + (block * sdebug_sector_size), - (num - rest) * sdebug_sector_size, 0, do_write); + (num - rest) * sdebug_sector_size, 0, do_write, false); if (ret != (num - rest) * sdebug_sector_size) return ret; if (rest) { ret += sg_copy_buffer(sdb->table.sgl, sdb->table.nents, fake_storep, rest * sdebug_sector_size, - (num - rest) * sdebug_sector_size, do_write); + (num - rest) * sdebug_sector_size, do_write, + false); } return ret; diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index cb3c8fe..030b92b 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -267,7 +267,7 @@ int sg_alloc_table_from_pages(struct sg_table *sgt, gfp_t gfp_mask); size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf, - size_t buflen, off_t skip, bool to_buffer); + size_t buflen, off_t skip, bool to_buffer, bool iomem); size_t sg_copy_from_buffer(struct scatterlist *sgl, unsigned int nents, const void *buf, size_t buflen); @@ -279,6 +279,11 @@ size_t sg_pcopy_from_buffer(struct scatterlist *sgl, unsigned int nents, size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents, void *buf, size_t buflen, off_t skip); +size_t sg_iocopy_from_buffer(struct scatterlist *sgl, unsigned int nents, +const void *buf, size_t buflen); +size_t sg_iocopy_to_buffer(struct scatterlist *sgl, unsigned int nents, + void *buf, size_t buflen); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. diff --git a/lib/scatterlist.c b/lib/scatterlist.c index c6cf822..22abd94 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -647,7 +647,7 @@ EXPORT_SYMBOL(sg_miter_stop); * **/ size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf, - size_t buflen, off_t skip, bool to_buffer) + size_t buflen, off_t skip, bool to_buffer, bool iomem) { unsigned int offset = 0; struct sg_mapping_iter miter; @@ -668,10 +668,17 @@ size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf, len = min(miter.length, buflen - offset); - if (to_buffer) - memcpy(buf + offset, miter.addr, len); - else - memcpy(miter.addr, buf + offset, len); + if (iomem) { + if (to_buffer) + memcpy_fromio(buf + offset, miter.addr, len); + else + memcpy_toio(miter.addr, buf + offset, len); + } else { + if (to_buffer) + memcpy(buf + offset, miter.addr, len); + else + memcpy(miter.addr, buf + offset, len); + } offset += len; } @@ -695,7 +702,8 @@ EXPORT_SYMBOL(sg_copy_buffer); size_t sg_copy_from_buffer(struct scatterlist *sgl, unsigned int nents, const void *buf, size_t buflen) { - return sg_copy_buffer(sgl, nents, (void *)buf, buflen, 0, false); + return sg_copy_buffer(sgl, nents, (void *)buf, buflen, 0, false, + false); } EXP
[RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem
p2pmem will always be iomem so if we ever access it, we should be using the correct methods to read and write to it. Signed-off-by: Logan Gunthorpe Signed-off-by: Stephen Bates Signed-off-by: Steve Wise --- drivers/nvme/target/core.c| 18 -- drivers/nvme/target/fabrics-cmd.c | 28 +++- drivers/nvme/target/nvmet.h | 1 + drivers/nvme/target/rdma.c| 13 ++--- 4 files changed, 38 insertions(+), 22 deletions(-) diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 798653b..a1524d5 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -45,15 +45,29 @@ static struct nvmet_subsys *nvmet_find_get_subsys(struct nvmet_port *port, u16 nvmet_copy_to_sgl(struct nvmet_req *req, off_t off, const void *buf, size_t len) { - if (sg_pcopy_from_buffer(req->sg, req->sg_cnt, buf, len, off) != len) + bool iomem = req->p2pmem; + size_t ret; + + ret = sg_copy_buffer(req->sg, req->sg_cnt, (void *)buf, len, off, +false, iomem); + + if (ret != len) return NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR; + return 0; } u16 nvmet_copy_from_sgl(struct nvmet_req *req, off_t off, void *buf, size_t len) { - if (sg_pcopy_to_buffer(req->sg, req->sg_cnt, buf, len, off) != len) + bool iomem = req->p2pmem; + size_t ret; + + ret = sg_copy_buffer(req->sg, req->sg_cnt, buf, len, off, true, +iomem); + + if (ret != len) return NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR; + return 0; } diff --git a/drivers/nvme/target/fabrics-cmd.c b/drivers/nvme/target/fabrics-cmd.c index 8bd022af..9d966f0 100644 --- a/drivers/nvme/target/fabrics-cmd.c +++ b/drivers/nvme/target/fabrics-cmd.c @@ -118,11 +118,13 @@ static u16 nvmet_install_queue(struct nvmet_ctrl *ctrl, struct nvmet_req *req) static void nvmet_execute_admin_connect(struct nvmet_req *req) { struct nvmf_connect_command *c = &req->cmd->connect; - struct nvmf_connect_data *d; + struct nvmf_connect_data d; struct nvmet_ctrl *ctrl = NULL; u16 status = 0; - d = kmap(sg_page(req->sg)) + req->sg->offset; + status = nvmet_copy_from_sgl(req, 0, &d, sizeof(d)); + if (status) + goto out; /* zero out initial completion result, assign values as needed */ req->rsp->result.u32 = 0; @@ -134,16 +136,16 @@ static void nvmet_execute_admin_connect(struct nvmet_req *req) goto out; } - if (unlikely(d->cntlid != cpu_to_le16(0x))) { + if (unlikely(d.cntlid != cpu_to_le16(0x))) { pr_warn("connect attempt for invalid controller ID %#x\n", - d->cntlid); + d.cntlid); status = NVME_SC_CONNECT_INVALID_PARAM | NVME_SC_DNR; req->rsp->result.u32 = IPO_IATTR_CONNECT_DATA(cntlid); goto out; } - status = nvmet_alloc_ctrl(d->subsysnqn, d->hostnqn, req, - le32_to_cpu(c->kato), &ctrl); + status = nvmet_alloc_ctrl(d.subsysnqn, d.hostnqn, req, + le32_to_cpu(c->kato), &ctrl); if (status) goto out; @@ -158,19 +160,20 @@ static void nvmet_execute_admin_connect(struct nvmet_req *req) req->rsp->result.u16 = cpu_to_le16(ctrl->cntlid); out: - kunmap(sg_page(req->sg)); nvmet_req_complete(req, status); } static void nvmet_execute_io_connect(struct nvmet_req *req) { struct nvmf_connect_command *c = &req->cmd->connect; - struct nvmf_connect_data *d; + struct nvmf_connect_data d; struct nvmet_ctrl *ctrl = NULL; u16 qid = le16_to_cpu(c->qid); u16 status = 0; - d = kmap(sg_page(req->sg)) + req->sg->offset; + status = nvmet_copy_from_sgl(req, 0, &d, sizeof(d)); + if (status) + goto out; /* zero out initial completion result, assign values as needed */ req->rsp->result.u32 = 0; @@ -182,9 +185,9 @@ static void nvmet_execute_io_connect(struct nvmet_req *req) goto out; } - status = nvmet_ctrl_find_get(d->subsysnqn, d->hostnqn, - le16_to_cpu(d->cntlid), - req, &ctrl); + status = nvmet_ctrl_find_get(d.subsysnqn, d.hostnqn, +le16_to_cpu(d.cntlid), +req, &ctrl); if (status) goto out; @@ -205,7 +208,6 @@ static void nvmet_execute_io_connect(struct nvmet_req *req) pr_info("adding queue %d to ctrl %d.\n", qid, ctrl->cntlid); out: - kunmap(sg_page(req->sg)); nvmet_req_complete(req, status); return; diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index ab67175..ccd79ed 100644 ---
[RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device
A p2pmem device is simply a PCI card with a BAR space that points to regular memory. This may be an independent PCI card or part of another completely unrelated device (like an IB card or a NVMe card). The p2pmem device is designed such that other drivers may register p2pmem memory for use by the system. p2pmem devices then provide a kernel interface so that other subsystems can allocate chunks of this memory as necessary to facilitate transfers between two PCI peers. Depending on hardware, this may reduce the bandwidth of the transfer but could significantly reduce presure on system memory. This may be desirable in many cases: for example a system could be designed with a small CPU connected to a PCI switch by a small number of lanes which would maximize the number of lanes available to connect to NVME devices. Seeing using p2p memory can often have negative effects, especially with older PCI root complexes. The code is designed to only utilize the p2pmem device if all the devices involved in a transfer are behind the same PCI switch. Other cases may still work or be desirable for some end users but it was decided this would be the best course of action to prevent users enabling it and wondering why their performance dropped. Signed-off-by: Logan Gunthorpe Signed-off-by: Stephen Bates Signed-off-by: Steve Wise --- drivers/memory/Kconfig | 5 + drivers/memory/Makefile | 2 + drivers/memory/p2pmem.c | 403 include/linux/p2pmem.h | 103 + 4 files changed, 513 insertions(+) create mode 100644 drivers/memory/p2pmem.c create mode 100644 include/linux/p2pmem.h diff --git a/drivers/memory/Kconfig b/drivers/memory/Kconfig index ec80e35..4a02cd3 100644 --- a/drivers/memory/Kconfig +++ b/drivers/memory/Kconfig @@ -146,3 +146,8 @@ source "drivers/memory/samsung/Kconfig" source "drivers/memory/tegra/Kconfig" endif + +config P2PMEM + bool "Peer 2 Peer Memory Device Support" + help + This driver is for peer 2 peer memory device managers. diff --git a/drivers/memory/Makefile b/drivers/memory/Makefile index e88097fb..260bfe9 100644 --- a/drivers/memory/Makefile +++ b/drivers/memory/Makefile @@ -21,3 +21,5 @@ obj-$(CONFIG_DA8XX_DDRCTL)+= da8xx-ddrctl.o obj-$(CONFIG_SAMSUNG_MC) += samsung/ obj-$(CONFIG_TEGRA_MC) += tegra/ + +obj-$(CONFIG_P2PMEM)+= p2pmem.o diff --git a/drivers/memory/p2pmem.c b/drivers/memory/p2pmem.c new file mode 100644 index 000..c4ea311 --- /dev/null +++ b/drivers/memory/p2pmem.c @@ -0,0 +1,403 @@ +/* + * Peer 2 Peer Memory Device + * Copyright (c) 2016, Microsemi Corporation + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + */ + +#include +#include +#include +#include +#include + +MODULE_DESCRIPTION("Peer 2 Peer Memory Device"); +MODULE_VERSION("0.1"); +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Microsemi Corporation"); + +static struct class *p2pmem_class; +static DEFINE_IDA(p2pmem_ida); + +static struct p2pmem_dev *to_p2pmem(struct device *dev) +{ + return container_of(dev, struct p2pmem_dev, dev); +} + +static void p2pmem_percpu_release(struct percpu_ref *ref) +{ + struct p2pmem_dev *p = container_of(ref, struct p2pmem_dev, ref); + + complete_all(&p->cmp); +} + +static void p2pmem_percpu_exit(void *data) +{ + struct percpu_ref *ref = data; + + percpu_ref_exit(ref); +} + +static void p2pmem_percpu_kill(void *data) +{ + struct percpu_ref *ref = data; + struct p2pmem_dev *p = container_of(ref, struct p2pmem_dev, ref); + + if (percpu_ref_is_dying(ref)) + return; + + percpu_ref_kill(ref); + wait_for_completion(&p->cmp); +} + +static void p2pmem_release(struct device *dev) +{ + struct p2pmem_dev *p = to_p2pmem(dev); + + if (p->pool) + gen_pool_destroy(p->pool); + + kfree(p); +} + +/** + * p2pmem_create() - create a new p2pmem device + * @parent: the parent device to create it under + * + * Return value is a pointer to the new device or an ERR_PTR + * on failure. + */ +struct p2pmem_dev *p2pmem_create(struct device *parent) +{ + struct p2pmem_dev *p; + int nid = dev_to_node(parent); + int rc; + + p = kzalloc_node(sizeof(*p), GFP_KERNEL, nid); + if (!p) + return ERR_PTR(-ENOMEM); + + init_completion(&p->cmp); + device_initialize(&p->dev); + p->dev.class = p2pmem_class; + p->dev.parent = parent; + p->dev.release = p2pmem_release; + + p->id = ida_simple_get(&p2pmem_ida, 0, 0, GFP_KERNE
[RFC 8/8] p2pmem: Added char device user interface
This creates a userspace interface to use p2pmemory. A user can use mmap on the p2pmem char device to get buffers from the corresponding device. This allows a user to use p2p memory with existing interfaces like RDMA and O_DIRECT. This patch is a bit more controversial because people don't want to expose these interfaces to userspace without more consideration. However, this patch is _very_ useful for expirementing with p2p memory. For example, with this patch, you can test with commands like: ib_write_bw -R --mmap=/dev/p2pmem0 -D 30 or use an fio script like: [rdma-server] rw=read mem=mmapshared:/dev/p2pmem0 ioengine=rdma port=14242 bs=64k size=10G iodepth=2 which would test the bandwidth of RDMA to/from the specified p2p memory. Signed-off-by: Logan Gunthorpe Signed-off-by: Stephen Bates Signed-off-by: Steve Wise --- drivers/memory/p2pmem.c | 184 +++- include/linux/p2pmem.h | 4 ++ 2 files changed, 186 insertions(+), 2 deletions(-) diff --git a/drivers/memory/p2pmem.c b/drivers/memory/p2pmem.c index 499d42c..129c49c 100644 --- a/drivers/memory/p2pmem.c +++ b/drivers/memory/p2pmem.c @@ -19,14 +19,20 @@ #include #include #include +#include MODULE_DESCRIPTION("Peer 2 Peer Memory Device"); MODULE_VERSION("0.1"); MODULE_LICENSE("GPL"); MODULE_AUTHOR("Microsemi Corporation"); +static int max_devices = 16; +module_param(max_devices, int, 0444); +MODULE_PARM_DESC(max_devices, "Maximum number of char devices"); + static struct class *p2pmem_class; static DEFINE_IDA(p2pmem_ida); +static dev_t p2pmem_devt; static struct dentry *p2pmem_debugfs_root; @@ -67,6 +73,144 @@ static struct p2pmem_dev *to_p2pmem(struct device *dev) return container_of(dev, struct p2pmem_dev, dev); } +struct p2pmem_vma { + struct p2pmem_dev *p2pmem_dev; + atomic_t mmap_count; + size_t nr_pages; + + /* Protects the used_pages array */ + struct mutex mutex; + struct page *used_pages[]; +}; + +static void p2pmem_vma_open(struct vm_area_struct *vma) +{ + struct p2pmem_vma *pv = vma->vm_private_data; + + atomic_inc(&pv->mmap_count); +} + +static void p2pmem_vma_free_pages(struct vm_area_struct *vma) +{ + int i; + struct p2pmem_vma *pv = vma->vm_private_data; + + mutex_lock(&pv->mutex); + + for (i = 0; i < pv->nr_pages; i++) { + if (pv->used_pages[i]) { + p2pmem_free_page(pv->p2pmem_dev, pv->used_pages[i]); + pv->used_pages[i] = NULL; + } + } + + mutex_unlock(&pv->mutex); +} + +static void p2pmem_vma_close(struct vm_area_struct *vma) +{ + struct p2pmem_vma *pv = vma->vm_private_data; + + if (!atomic_dec_and_test(&pv->mmap_count)) + return; + + p2pmem_vma_free_pages(vma); + + dev_dbg(&pv->p2pmem_dev->dev, "vma close"); + kfree(pv); +} + +static int p2pmem_vma_fault(struct vm_fault *vmf) +{ + struct p2pmem_vma *pv = vmf->vma->vm_private_data; + unsigned int pg_idx; + struct page *pg; + pfn_t pfn; + int rc; + + if (!pv->p2pmem_dev->alive) + return VM_FAULT_SIGBUS; + + pg_idx = (vmf->address - vmf->vma->vm_start) / PAGE_SIZE; + + mutex_lock(&pv->mutex); + + if (pv->used_pages[pg_idx]) + pg = pv->used_pages[pg_idx]; + else + pg = p2pmem_alloc_page(pv->p2pmem_dev); + + if (!pg) + return VM_FAULT_OOM; + + pv->used_pages[pg_idx] = pg; + + pfn = phys_to_pfn_t(page_to_phys(pg), PFN_DEV | PFN_MAP); + rc = vm_insert_mixed(vmf->vma, vmf->address, pfn); + + mutex_unlock(&pv->mutex); + + if (rc == -ENOMEM) + return VM_FAULT_OOM; + if (rc < 0 && rc != -EBUSY) + return VM_FAULT_SIGBUS; + + return VM_FAULT_NOPAGE; +} + +const struct vm_operations_struct p2pmem_vmops = { + .open = p2pmem_vma_open, + .close = p2pmem_vma_close, + .fault = p2pmem_vma_fault, +}; + +static int p2pmem_open(struct inode *inode, struct file *filp) +{ + struct p2pmem_dev *p; + + p = container_of(inode->i_cdev, struct p2pmem_dev, cdev); + filp->private_data = p; + p->inode = inode; + + return 0; +} + +static int p2pmem_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct p2pmem_dev *p = filp->private_data; + struct p2pmem_vma *pv; + size_t nr_pages = (vma->vm_end - vma->vm_start) / PAGE_SIZE; + + if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) { + dev_warn(&p->dev, "mmap failed: can't create private mapping\n"); + return -EINVAL; + } + + dev_dbg(&p->dev, "Allocating mmap with %zd pages.\n", nr_pages); + + pv = kzalloc(sizeof(*pv) + sizeof(pv->used_pages[0]) * nr_pages, +GFP_KERNEL); + if (!pv) + return -ENOMEM; + + mutex_init(&pv->mutex); +
[RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
Hello, As discussed at LSF/MM we'd like to present our work to enable copy offload support in NVMe fabrics RDMA targets. We'd appreciate some review and feedback from the community on our direction. This series is not intended to go upstream at this point. The concept here is to use memory that's exposed on a PCI BAR as data buffers in the NVME target code such that data can be transferred from an RDMA NIC to the special memory and then directly to an NVMe device avoiding system memory entirely. The upside of this is better QoS for applications running on the CPU utilizing memory and lower PCI bandwidth required to the CPU (such that systems could be designed with fewer lanes connected to the CPU). However, presently, the trade-off is currently a reduction in overall throughput. (Largely due to hardware issues that would certainly improve in the future). Due to these trade-offs we've designed the system to only enable using the PCI memory in cases where the NIC, NVMe devices and memory are all behind the same PCI switch. This will mean many setups that could likely work well will not be supported so that we can be more confident it will work and not place any responsibility on the user to understand their topology. (We've chosen to go this route based on feedback we received at LSF). In order to enable this functionality we introduce a new p2pmem device which can be instantiated by PCI drivers. The device will register some PCI memory as ZONE_DEVICE and provide an genalloc based allocator for users of these devices to get buffers. We give an example of enabling p2p memory with the cxgb4 driver, however currently these devices have some hardware issues that prevent their use so we will likely be dropping this patch in the future. Ideally, we'd want to enable this functionality with NVME CMB buffers, however we don't have any hardware with this feature at this time. In nvmet-rdma, we attempt to get an appropriate p2pmem device at queue creation time and if a suitable one is found we will use it for all the (non-inlined) memory in the queue. An 'allow_p2pmem' configfs attribute is also created which is required to be set before any p2pmem is attempted. This patchset also includes a more controversial patch which provides an interface for userspace to obtain p2pmem buffers through an mmap call on a cdev. This enables userspace to fairly easily use p2pmem with RDMA and O_DIRECT interfaces. However, the user would be entirely responsible for knowing what their doing and inspecting sysfs to understand the pci topology and only using it in sane situations. Thanks, Logan Logan Gunthorpe (6): Introduce Peer-to-Peer memory (p2pmem) device nvmet: Use p2pmem in nvme target scatterlist: Modify SG copy functions to support io memory. nvmet: Be careful about using iomem accesses when dealing with p2pmem p2pmem: Support device removal p2pmem: Added char device user interface Steve Wise (2): cxgb4: setup pcie memory window 4 and create p2pmem region p2pmem: Add debugfs "stats" file drivers/memory/Kconfig | 5 + drivers/memory/Makefile | 2 + drivers/memory/p2pmem.c | 697 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h | 3 + drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 97 +++- drivers/net/ethernet/chelsio/cxgb4/t4_regs.h| 5 + drivers/nvme/target/configfs.c | 31 ++ drivers/nvme/target/core.c | 18 +- drivers/nvme/target/fabrics-cmd.c | 28 +- drivers/nvme/target/nvmet.h | 2 + drivers/nvme/target/rdma.c | 183 +-- drivers/scsi/scsi_debug.c | 7 +- include/linux/p2pmem.h | 120 include/linux/scatterlist.h | 7 +- lib/scatterlist.c | 64 ++- 15 files changed, 1189 insertions(+), 80 deletions(-) create mode 100644 drivers/memory/p2pmem.c create mode 100644 include/linux/p2pmem.h -- 2.1.4
[RFC 4/8] p2pmem: Add debugfs "stats" file
From: Steve Wise For each p2pmem instance, add a "stats" file to show the gen_pool statistics. Signed-off-by: Steve Wise Signed-off-by: Logan Gunthorpe Signed-off-by: Stephen Bates --- drivers/memory/p2pmem.c | 49 + include/linux/p2pmem.h | 2 ++ 2 files changed, 51 insertions(+) diff --git a/drivers/memory/p2pmem.c b/drivers/memory/p2pmem.c index c4ea311..71741c2 100644 --- a/drivers/memory/p2pmem.c +++ b/drivers/memory/p2pmem.c @@ -18,6 +18,7 @@ #include #include #include +#include MODULE_DESCRIPTION("Peer 2 Peer Memory Device"); MODULE_VERSION("0.1"); @@ -27,6 +28,40 @@ MODULE_AUTHOR("Microsemi Corporation"); static struct class *p2pmem_class; static DEFINE_IDA(p2pmem_ida); +static struct dentry *p2pmem_debugfs_root; + +static int stats_show(struct seq_file *seq, void *v) +{ + struct p2pmem_dev *p = seq->private; + + if (p->pool) { + seq_printf(seq, "total size: %lu\n", gen_pool_size(p->pool)); + seq_printf(seq, "available: %lu\n", gen_pool_avail(p->pool)); + } + return 0; +} + +static int stats_open(struct inode *inode, struct file *file) +{ + return single_open(file, stats_show, inode->i_private); +} + +static const struct file_operations stats_debugfs_fops = { + .owner = THIS_MODULE, + .open= stats_open, + .release = single_release, + .read= seq_read, + .llseek = seq_lseek, +}; + +static void setup_debugfs(struct p2pmem_dev *p) +{ + struct dentry *de; + + de = debugfs_create_file("stats", 0400, p->debugfs_root, +(void *)p, &stats_debugfs_fops); +} + static struct p2pmem_dev *to_p2pmem(struct device *dev) { return container_of(dev, struct p2pmem_dev, dev); @@ -62,6 +97,8 @@ static void p2pmem_release(struct device *dev) { struct p2pmem_dev *p = to_p2pmem(dev); + debugfs_remove_recursive(p->debugfs_root); + if (p->pool) gen_pool_destroy(p->pool); @@ -114,6 +151,13 @@ struct p2pmem_dev *p2pmem_create(struct device *parent) if (rc) goto err_id; + if (p2pmem_debugfs_root) { + p->debugfs_root = debugfs_create_dir(dev_name(&p->dev), +p2pmem_debugfs_root); + if (p->debugfs_root) + setup_debugfs(p); + } + rc = device_add(&p->dev); if (rc) goto err_id; @@ -390,12 +434,17 @@ static int __init p2pmem_init(void) if (IS_ERR(p2pmem_class)) return PTR_ERR(p2pmem_class); + p2pmem_debugfs_root = debugfs_create_dir("p2pmem", NULL); + if (!p2pmem_debugfs_root) + pr_info("could not create debugfs entry, continuing\n"); + return 0; } module_init(p2pmem_init); static void __exit p2pmem_exit(void) { + debugfs_remove_recursive(p2pmem_debugfs_root); class_destroy(p2pmem_class); pr_info(KBUILD_MODNAME ": unloaded.\n"); diff --git a/include/linux/p2pmem.h b/include/linux/p2pmem.h index 71dc1e1..4cd6f35 100644 --- a/include/linux/p2pmem.h +++ b/include/linux/p2pmem.h @@ -26,6 +26,8 @@ struct p2pmem_dev { struct percpu_ref ref; struct completion cmp; struct gen_pool *pool; + + struct dentry *debugfs_root; }; #ifdef CONFIG_P2PMEM -- 2.1.4
[RFC 2/8] cxgb4: setup pcie memory window 4 and create p2pmem region
From: Steve Wise Some cxgb4 cards expose memory as part of BAR4. This patch registers this memory as a p2pmem device. Signed-off-by: Steve Wise Signed-off-by: Logan Gunthorpe Signed-off-by: Stephen Bates --- drivers/net/ethernet/chelsio/cxgb4/cxgb4.h | 3 + drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 97 - drivers/net/ethernet/chelsio/cxgb4/t4_regs.h| 5 ++ 3 files changed, 102 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h index 163543b..e92443b 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h @@ -48,6 +48,7 @@ #include #include #include +#include #include #include "t4_chip_type.h" #include "cxgb4_uld.h" @@ -859,6 +860,8 @@ struct adapter { /* TC u32 offload */ struct cxgb4_tc_u32_table *tc_u32; + + struct p2pmem_dev *p2pmem; }; /* Support for "sched-class" command to allow a TX Scheduling Class to be diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c index afb0967..a33bcd1 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c @@ -172,6 +172,11 @@ module_param(select_queue, int, 0644); MODULE_PARM_DESC(select_queue, "Select between kernel provided method of selecting or driver method of selecting TX queue. Default is kernel method."); +static bool use_p2pmem; +module_param(use_p2pmem, bool, 0644); +MODULE_PARM_DESC(use_p2pmem, +"Enable registering a p2pmem device with bar space (if available)"); + static struct dentry *cxgb4_debugfs_root; LIST_HEAD(adapter_list); @@ -2835,6 +2840,54 @@ static void setup_memwin_rdma(struct adapter *adap) } } +static void setup_memwin_p2pmem(struct adapter *adap) +{ + unsigned int mem_base = t4_read_reg(adap, CIM_EXTMEM2_BASE_ADDR_A); + unsigned int mem_size = t4_read_reg(adap, CIM_EXTMEM2_ADDR_SIZE_A); + + if (!use_p2pmem) + return; + + if (mem_base != 0 && mem_size != 0) { + unsigned int sz_kb, pcieofst; + + sz_kb = roundup_pow_of_two(mem_size) >> 10; + + /* +* The start offset must be aligned to the window size. +* Also, BAR4 has MSIX vectors using the first 8KB. +* Further, the min allowed p2pmem region size is 1MB, +* so set the start offset to the memory size and we're aligned +* as well as past the 8KB vector table. +*/ + pcieofst = sz_kb << 10; + + dev_info(adap->pdev_dev, +"p2pmem base 0x%x, size %uB, ilog2(sk_kb) 0x%x, " +"pcieofst 0x%X\n", mem_base, mem_size, ilog2(sz_kb), +pcieofst); + + /* Write the window offset and size */ + t4_write_reg(adap, + PCIE_MEM_ACCESS_REG(PCIE_MEM_ACCESS_BASE_WIN_A, + MEMWIN_RSVD4), + pcieofst | BIR_V(2) | WINDOW_V(ilog2(sz_kb))); + + /* Write the adapter memory base/start */ + t4_write_reg(adap, + PCIE_MEM_ACCESS_REG(PCIE_MEM_ACCESS_OFFSET_A, + MEMWIN_RSVD4), + MEMOFST_V((mem_base >> MEMOFST_S)) | PFNUM_V(adap->pf)); + + /* Read it back to flush it */ + t4_read_reg(adap, + PCIE_MEM_ACCESS_REG(PCIE_MEM_ACCESS_OFFSET_A, + MEMWIN_RSVD4)); + } else + dev_info(adap->pdev_dev, "p2pmem memory not reserved, " +"base 0x%x size %uB\n", mem_base, mem_size); +} + static int adap_init1(struct adapter *adap, struct fw_caps_config_cmd *c) { u32 v; @@ -4622,6 +4675,42 @@ static int cxgb4_iov_configure(struct pci_dev *pdev, int num_vfs) } #endif +static int init_p2pmem(struct adapter *adapter) +{ + unsigned int mem_size = t4_read_reg(adapter, CIM_EXTMEM2_ADDR_SIZE_A); + struct p2pmem_dev *p; + int rc; + struct resource res; + + if (!mem_size || !use_p2pmem) + return 0; + + mem_size = roundup_pow_of_two(mem_size); + + /* +* Create a subset of BAR4 for the p2pmem region based on the +* exported memory size. +*/ + memcpy(&res, &adapter->pdev->resource[4], sizeof(res)); + res.start += mem_size; + res.end = res.start + mem_size - 1; + dev_info(adapter->pdev_dev, "p2pmem resource start 0x%llx end 0x%llx size %lluB\n", +res.start, res.end, resource_size(&res)); + + p = p2pmem_create(&adapter->pdev->dev); + if (IS_ERR(p)) + return PTR_ERR(p); + + rc = p2pmem_add_resou
[RFC 3/8] nvmet: Use p2pmem in nvme target
We create a configfs attribute in each nvme-fabrics target port to enable p2p memory use. When enabled, the port will only then use the p2p memory if a p2p memory device can be found which is behind the same switch as the RDMA port and all the block devices in use. If the user enabled it an no devices are found, then the system will silently fall back on using regular memory. If appropriate, that port will allocate memory for the RDMA buffers for queues from the p2pmem device falling back to system memory should anything fail. Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would save an extra PCI transfer as the NVME card could just take the data out of it's own memory. However, at this time, cards with CMB buffers don't seem to be available. Signed-off-by: Logan Gunthorpe Signed-off-by: Stephen Bates Signed-off-by: Steve Wise --- drivers/nvme/target/configfs.c | 31 +++ drivers/nvme/target/nvmet.h| 1 + drivers/nvme/target/rdma.c | 90 ++ 3 files changed, 114 insertions(+), 8 deletions(-) diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c index be8c800..e61a7f4 100644 --- a/drivers/nvme/target/configfs.c +++ b/drivers/nvme/target/configfs.c @@ -777,12 +777,43 @@ static void nvmet_port_release(struct config_item *item) kfree(port); } +#ifdef CONFIG_P2PMEM +static ssize_t nvmet_allow_p2pmem_show(struct config_item *item, char *page) +{ + return sprintf(page, "%d\n", to_nvmet_port(item)->allow_p2pmem); +} + +static ssize_t nvmet_allow_p2pmem_store(struct config_item *item, + const char *page, size_t count) +{ + struct nvmet_port *port = to_nvmet_port(item); + bool allow; + int ret; + + ret = strtobool(page, &allow); + if (ret) + return ret; + + down_write(&nvmet_config_sem); + port->allow_p2pmem = allow; + up_write(&nvmet_config_sem); + + return count; +} +CONFIGFS_ATTR(nvmet_, allow_p2pmem); +#endif + static struct configfs_attribute *nvmet_port_attrs[] = { &nvmet_attr_addr_adrfam, &nvmet_attr_addr_treq, &nvmet_attr_addr_traddr, &nvmet_attr_addr_trsvcid, &nvmet_attr_addr_trtype, + + #ifdef CONFIG_P2PMEM + &nvmet_attr_allow_p2pmem, + #endif + NULL, }; diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index f7ff15f..ab67175 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -95,6 +95,7 @@ struct nvmet_port { struct list_headreferrals; void*priv; boolenabled; + boolallow_p2pmem; }; static inline struct nvmet_port *to_nvmet_port(struct config_item *item) diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c index ecc4fe8..7fd4840 100644 --- a/drivers/nvme/target/rdma.c +++ b/drivers/nvme/target/rdma.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include @@ -64,6 +65,7 @@ struct nvmet_rdma_rsp { struct rdma_rw_ctx rw; struct nvmet_reqreq; + struct p2pmem_dev *p2pmem; u8 n_rdma; u32 flags; @@ -107,6 +109,8 @@ struct nvmet_rdma_queue { int send_queue_size; struct list_headqueue_list; + + struct p2pmem_dev *p2pmem; }; struct nvmet_rdma_device { @@ -185,7 +189,8 @@ nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp) spin_unlock_irqrestore(&rsp->queue->rsps_lock, flags); } -static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents) +static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents, + struct p2pmem_dev *p2pmem) { struct scatterlist *sg; int count; @@ -193,13 +198,17 @@ static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents) if (!sgl || !nents) return; - for_each_sg(sgl, sg, nents, count) - __free_page(sg_page(sg)); + for_each_sg(sgl, sg, nents, count) { + if (p2pmem) + p2pmem_free_page(p2pmem, sg_page(sg)); + else + __free_page(sg_page(sg)); + } kfree(sgl); } static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents, - u32 length) + u32 length, struct p2pmem_dev *p2pmem) { struct scatterlist *sg; struct page *page; @@ -216,7 +225,11 @@ static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents, while (length) { u32 page_len = min_t(u32, length, PAGE_SIZE); - page = alloc_page(GFP_KERNEL); + if (p2pmem) + page = p2pme
[RFC 7/8] p2pmem: Support device removal
This patch creates a list of callbacks to notify users of this memory that the p2pmem device is going away or gone. In nvmet-rdma, we disconnect any queue using p2p memory. The remote side will then automatically reconnect in a couple seconds and regular system memory (or a different p2pmem device) will be used. Signed-off-by: Logan Gunthorpe Signed-off-by: Stephen Bates Signed-off-by: Steve Wise --- drivers/memory/p2pmem.c| 75 --- drivers/nvme/target/rdma.c | 98 ++ include/linux/p2pmem.h | 19 +++-- 3 files changed, 140 insertions(+), 52 deletions(-) diff --git a/drivers/memory/p2pmem.c b/drivers/memory/p2pmem.c index 71741c2..499d42c 100644 --- a/drivers/memory/p2pmem.c +++ b/drivers/memory/p2pmem.c @@ -105,6 +105,21 @@ static void p2pmem_release(struct device *dev) kfree(p); } +struct remove_callback { + struct list_head list; + void (*callback)(void *context); + void *context; +}; + +static void p2pmem_remove(struct p2pmem_dev *p) +{ + struct remove_callback *remove_call, *tmp; + + p->alive = false; + list_for_each_entry_safe(remove_call, tmp, &p->remove_list, list) + remove_call->callback(remove_call->context); +} + /** * p2pmem_create() - create a new p2pmem device * @parent: the parent device to create it under @@ -123,6 +138,10 @@ struct p2pmem_dev *p2pmem_create(struct device *parent) return ERR_PTR(-ENOMEM); init_completion(&p->cmp); + mutex_init(&p->remove_mutex); + INIT_LIST_HEAD(&p->remove_list); + p->alive = true; + device_initialize(&p->dev); p->dev.class = p2pmem_class; p->dev.parent = parent; @@ -187,6 +206,7 @@ void p2pmem_unregister(struct p2pmem_dev *p) dev_info(&p->dev, "unregistered"); device_del(&p->dev); + p2pmem_remove(p); ida_simple_remove(&p2pmem_ida, p->id); put_device(&p->dev); } @@ -291,6 +311,9 @@ EXPORT_SYMBOL(p2pmem_add_pci_region); */ void *p2pmem_alloc(struct p2pmem_dev *p, size_t size) { + if (!p->alive) + return NULL; + return (void *)gen_pool_alloc(p->pool, size); } EXPORT_SYMBOL(p2pmem_alloc); @@ -349,6 +372,9 @@ static int upstream_bridges_match(struct device *p2pmem, struct pci_dev *p2p_up; struct pci_dev *dma_up; + if (!to_p2pmem(p2pmem)->alive) + return false; + p2p_up = get_upstream_switch_port(p2pmem); if (!p2p_up) { dev_warn(p2pmem, "p2pmem is not behind a pci switch"); @@ -383,6 +409,8 @@ static int upstream_bridges_match(struct device *p2pmem, * specified devices * @dma_devices: a null terminated array of device pointers which * all must be compatible with the returned p2pmem device + * @remove_callback: this callback will be called if the p2pmem + * device is removed. * * For now, we only support cases where all the devices that * will transfer to the p2pmem device are on the same switch. @@ -400,9 +428,13 @@ static int upstream_bridges_match(struct device *p2pmem, * (use p2pmem_put to return the reference) or NULL if no compatible * p2pmem device is found. */ -struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices) +struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices, + void (*remove_callback)(void *context), + void *context) { struct device *dev; + struct p2pmem_dev *p; + struct remove_callback *remove_call; dev = class_find_device(p2pmem_class, NULL, dma_devices, upstream_bridges_match); @@ -410,21 +442,54 @@ struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices) if (!dev) return NULL; - return to_p2pmem(dev); + p = to_p2pmem(dev); + mutex_lock(&p->remove_mutex); + + if (!p->alive) { + p = NULL; + goto out; + } + + remove_call = kzalloc(sizeof(*remove_call), GFP_KERNEL); + remove_call->callback = remove_callback; + remove_call->context = context; + INIT_LIST_HEAD(&remove_call->list); + list_add(&remove_call->list, &p->remove_list); + +out: + mutex_unlock(&p->remove_mutex); + return p; } EXPORT_SYMBOL(p2pmem_find_compat); /** * p2pmem_put() - decrement a p2pmem device reference * @p: p2pmem device to return + * @data: data pointer that was passed to p2pmem_find_compat * * Dereference and free (if last) the device's reference counter. * It's safe to pass a NULL pointer to this function. */ -void p2pmem_put(struct p2pmem_dev *p) +void p2pmem_put(struct p2pmem_dev *p, void *context) { - if (p) - put_device(&p->dev); + struct remove_callback *remove_call; + + if (!p) + return; + + mutex_lock(&p->remo
Re: [PATCH 12/23] sd: handle REQ_UNMAP
On Thu, Mar 30, 2017 at 11:28:32AM -0400, Martin K. Petersen wrote: > "h...@lst.de" writes: > > Christoph, > > > On Tue, Mar 28, 2017 at 04:48:55PM +, Bart Van Assche wrote: > >> > if (sdp->no_write_same) > >> > return BLKPREP_INVALID; > >> > if (sdkp->ws16 || sector > 0x || nr_sectors > 0x) > >> > >> Users can change the provisioning mode from user space from SD_LBP_WS16 > >> into > >> SD_LBP_WS10 so I'm not sure it's safe to skip the (sdkp->ws16 || sector > > >> 0x || nr_sectors > 0x) check if REQ_UNMAP is set. > > > > They can, and if the device has too many sectors that will already cause > > discard to fail, > > I'm not sure I understand what you mean by that? If you manually change the provisioning mode to WS10 on a device that must use WRITE SAME (16) to be able to address all blocks you're already screwed right now, and with this patch you can screw yourself through the WRITE_ZEROES path in addition to the DISCARD path.
Re: [PATCH 23/23] block: remove the discard_zeroes_data flag
On Thu, Mar 30, 2017 at 11:29:29AM -0400, Martin K. Petersen wrote: > "h...@lst.de" writes: > > > Jens, any opinion? I'd like to remove it too, but I fear it might > > break things. We could deprecate it first with a warning when read > > and then remove it a few releases down the road. > > I know of several apps that check this variable (as opposed to the > ioctl). The above was in reference to both methods..
[PATCH] osd_uld: Check scsi_device_get() return value
scsi_device_get() can fail. Hence check its return value. Signed-off-by: Bart Van Assche Cc: Boaz Harrosh --- drivers/scsi/osd/osd_uld.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/scsi/osd/osd_uld.c b/drivers/scsi/osd/osd_uld.c index e0ce5d2fd14d..1dea4244dd0c 100644 --- a/drivers/scsi/osd/osd_uld.c +++ b/drivers/scsi/osd/osd_uld.c @@ -464,14 +464,15 @@ static int osd_probe(struct device *dev) /* hold one more reference to the scsi_device that will get released * in __release, in case a logout is happening while fs is mounted */ - scsi_device_get(scsi_device); + if (scsi_device_get(scsi_device)) + goto err_put_disk; osd_dev_init(&oud->od, scsi_device); /* Detect the OSD Version */ error = __detect_osd(oud); if (error) { OSD_ERR("osd detection failed, non-compatible OSD device\n"); - goto err_put_disk; + goto err_put_sdev; } /* init the char-device for communication with user-mode */ @@ -508,8 +509,9 @@ static int osd_probe(struct device *dev) err_put_cdev: cdev_del(&oud->cdev); -err_put_disk: +err_put_sdev: scsi_device_put(scsi_device); +err_put_disk: put_disk(disk); err_free_osd: dev_set_drvdata(dev, NULL); -- 2.12.0
Re: [PATCH 1/2] iscsi-target: Fix TMR reference leak during session shutdown
On Thu, 2017-03-30 at 08:29 +, Nicholas A. Bellinger wrote: > diff --git a/drivers/target/iscsi/iscsi_target_util.c > b/drivers/target/iscsi/iscsi_target_util.c > index 5041a9c..b464033 100644 > --- a/drivers/target/iscsi/iscsi_target_util.c > +++ b/drivers/target/iscsi/iscsi_target_util.c > @@ -737,21 +737,23 @@ void iscsit_free_cmd(struct iscsi_cmd *cmd, bool > shutdown) > { > struct se_cmd *se_cmd = NULL; > int rc; > + bool op_scsi = false; > /* >* Determine if a struct se_cmd is associated with >* this struct iscsi_cmd. >*/ > switch (cmd->iscsi_opcode) { > case ISCSI_OP_SCSI_CMD: > - se_cmd = &cmd->se_cmd; > - __iscsit_free_cmd(cmd, true, shutdown); > + op_scsi = true; > /* >* Fallthrough >*/ > case ISCSI_OP_SCSI_TMFUNC: > - rc = transport_generic_free_cmd(&cmd->se_cmd, shutdown); > - if (!rc && shutdown && se_cmd && se_cmd->se_sess) { > - __iscsit_free_cmd(cmd, true, shutdown); > + se_cmd = &cmd->se_cmd; > + __iscsit_free_cmd(cmd, op_scsi, shutdown); > + rc = transport_generic_free_cmd(se_cmd, shutdown); > + if (!rc && shutdown && se_cmd->se_sess) { > + __iscsit_free_cmd(cmd, op_scsi, shutdown); > target_put_sess_cmd(se_cmd); > } > break; Hello Nic, I agree that this patch fixes a leak. However, an existing bug in iscsit_free_cmd() is not addressed by this patch. Before the TMF code started using kref_get() / kref_put() it was possible for transport_generic_free_cmd() to determine whether or not iscsit_free_cmd() should call __iscsit_free_cmd() by checking the command reference count. I think that since the TMF code manipulates the command reference count it is no longer possible for transport_generic_free_cmd() to determine this. If iscsit_free_cmd() is called while a LUN RESET is in progress then the return value of transport_generic_free_cmd() will be wrong. I will post a few patches that not only address what I just described but also the leak fixed by this patch. Bart.
Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload
On Thu, Mar 30 2017 at 11:22am -0400, Martin K. Petersen wrote: > Mike Snitzer writes: > > > Would be very useful, particularly for testing, if > > drivers/scsi/scsi_debug.c were updated to support WRITE ZEROES. > > There is no WRITE ZEROES in SCSI. You should be able to get the right > behavior with lbpws=1 lbprz=1. OK, thanks.
Re: [PATCH] be2iscsi: switch to pci_alloc_irq_vectors
Christoph Hellwig writes: > Any progress on that? I'd like to get be2iscsi off the old MSI-X API > for 4.12. Applied to 4.12/scsi-queue. -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH v2] sd: Consider max_xfer_blocks if opt_xfer_blocks is unusable
Fam Zheng writes: >> rw_max = min_not_zero(logical_to_sectors(sdp, dev_max), >> BLK_DEF_MAX_SECTORS); > > Yes, it is better. Is it okay to make the change when you apply? Sure. Applied to 4.11/scsi-fixes. -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH 23/23] block: remove the discard_zeroes_data flag
"h...@lst.de" writes: > Jens, any opinion? I'd like to remove it too, but I fear it might > break things. We could deprecate it first with a warning when read > and then remove it a few releases down the road. I know of several apps that check this variable (as opposed to the ioctl). -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH 12/23] sd: handle REQ_UNMAP
"h...@lst.de" writes: Christoph, > On Tue, Mar 28, 2017 at 04:48:55PM +, Bart Van Assche wrote: >> >if (sdp->no_write_same) >> >return BLKPREP_INVALID; >> >if (sdkp->ws16 || sector > 0x || nr_sectors > 0x) >> >> Users can change the provisioning mode from user space from SD_LBP_WS16 into >> SD_LBP_WS10 so I'm not sure it's safe to skip the (sdkp->ws16 || sector > >> 0x || nr_sectors > 0x) check if REQ_UNMAP is set. > > They can, and if the device has too many sectors that will already cause > discard to fail, I'm not sure I understand what you mean by that? -- Martin K. Petersen Oracle Linux Engineering
Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload
Mike Snitzer writes: > Would be very useful, particularly for testing, if > drivers/scsi/scsi_debug.c were updated to support WRITE ZEROES. There is no WRITE ZEROES in SCSI. You should be able to get the right behavior with lbpws=1 lbprz=1. -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
Mike Snitzer writes: > I can work on this now. Only question I have is: should DM thinp take > care to zero any misaligned head and tail? (I assume so but with all > the back and forth between Bart, Paolo and Martin I figured I'd ask > explicitly). Yep, let's make sure our semantics match the hardware ditto. - So write zeroes should behave deterministically and explicitly handle any blocks that can't be cleared via deprovisioning. - And discard can work at the discard granularity in a non-deterministic fashion. -- Martin K. Petersen Oracle Linux Engineering
Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload
Would be very useful, particularly for testing, if drivers/scsi/scsi_debug.c were updated to support WRITE ZEROES.
Re: [PATCHv5 0/2] tcmu: For bugs fix only
On Thu, 2017-03-30 at 01:48 -0700, Nicholas A. Bellinger wrote: > Just for future reference, the flow of these tags should reflect the > history of the patch. Eg: > > Reviewed-by: First reviewer > Tested-by: First tester > Reviewed-by: Second reviewer > Signed-off-by: Patch Author > > and then once the subsystem maintainer merges it into his tree, they add > their own: > > Signed-off-by: Subsystem Maintainer Hi Nic, I agree that these tags should reflect the history of the patch. I think that means that the patch author should be mentioned first, Reviewed-by / Tested-by tags next and the subsystem maintainer sign-off last. At least, that's how most other maintainers do it. Bart.
[PATCH] target/user: PGR Support
This adds PGR support for just TCMU, since tcmu doesn't have the necessary IT_NEXUS info to process PGR in userspace, so have those commands be processed in kernel. Signed-off-by: Bryant G. Ly --- drivers/target/target_core_configfs.c | 10 +- drivers/target/target_core_device.c | 27 +++ drivers/target/target_core_pr.c | 2 +- drivers/target/target_core_pscsi.c| 3 ++- include/target/target_core_backend.h | 1 + 5 files changed, 36 insertions(+), 7 deletions(-) diff --git a/drivers/target/target_core_configfs.c b/drivers/target/target_core_configfs.c index 38b5025..edfb098 100644 --- a/drivers/target/target_core_configfs.c +++ b/drivers/target/target_core_configfs.c @@ -1366,7 +1366,7 @@ static ssize_t target_pr_res_holder_show(struct config_item *item, char *page) struct se_device *dev = pr_to_dev(item); int ret; - if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH) + if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR) return sprintf(page, "Passthrough\n"); spin_lock(&dev->dev_reservation_lock); @@ -1506,7 +1506,7 @@ static ssize_t target_pr_res_type_show(struct config_item *item, char *page) { struct se_device *dev = pr_to_dev(item); - if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH) + if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR) return sprintf(page, "SPC_PASSTHROUGH\n"); else if (dev->dev_reservation_flags & DRF_SPC2_RESERVATIONS) return sprintf(page, "SPC2_RESERVATIONS\n"); @@ -1519,7 +1519,7 @@ static ssize_t target_pr_res_aptpl_active_show(struct config_item *item, { struct se_device *dev = pr_to_dev(item); - if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH) + if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR) return 0; return sprintf(page, "APTPL Bit Status: %s\n", @@ -1531,7 +1531,7 @@ static ssize_t target_pr_res_aptpl_metadata_show(struct config_item *item, { struct se_device *dev = pr_to_dev(item); - if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH) + if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR) return 0; return sprintf(page, "Ready to process PR APTPL metadata..\n"); @@ -1577,7 +1577,7 @@ static ssize_t target_pr_res_aptpl_metadata_store(struct config_item *item, u16 tpgt = 0; u8 type = 0; - if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH) + if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR) return count; if (dev->dev_reservation_flags & DRF_SPC2_RESERVATIONS) return count; diff --git a/drivers/target/target_core_device.c b/drivers/target/target_core_device.c index c754ae3..83c0d77 100644 --- a/drivers/target/target_core_device.c +++ b/drivers/target/target_core_device.c @@ -1045,6 +1045,7 @@ passthrough_parse_cdb(struct se_cmd *cmd, sense_reason_t (*exec_cmd)(struct se_cmd *cmd)) { unsigned char *cdb = cmd->t_task_cdb; + struct se_device *dev = cmd->se_dev; /* * Clear a lun set in the cdb if the initiator talking to use spoke @@ -1076,6 +1077,32 @@ passthrough_parse_cdb(struct se_cmd *cmd, return TCM_NO_SENSE; } + /* +* For PERSISTENT RESERVE IN/OUT, RELEASE, and RESERVE we need to +* emulate the response, since tcmu does not have the information +* required to process these commands. +*/ + if (!(dev->transport->transport_flags & + TRANSPORT_FLAG_PASSTHROUGH_PGR)) { + if (cdb[0] == PERSISTENT_RESERVE_IN) { + cmd->execute_cmd = target_scsi3_emulate_pr_in; + return TCM_NO_SENSE; + } + if (cdb[0] == PERSISTENT_RESERVE_OUT) { + cmd->execute_cmd = target_scsi3_emulate_pr_out; + return TCM_NO_SENSE; + } + + if (cdb[0] == RELEASE || cdb[0] == RELEASE_10) { + cmd->execute_cmd = target_scsi2_reservation_release; + return TCM_NO_SENSE; + } + if (cdb[0] == RESERVE || cdb[0] == RESERVE_10) { + cmd->execute_cmd = target_scsi2_reservation_reserve; + return TCM_NO_SENSE; + } + } + /* Set DATA_CDB flag for ops that should have it */ switch (cdb[0]) { case READ_6: diff --git a/drivers/target/target_core_pr.c b/drivers/target/target_core_pr.c index e180511..129ca57 100644 --- a/drivers/target/target_core_pr.c +++ b/drivers/target/target_core_pr.c @@ -4147,7 +4147,7 @@ target_check_reservation(struct se_cmd *cmd) return 0; if (dev->
Re: [PATCH 1/6] virtio: wrap find_vqs
On Thu, Mar 30, 2017 at 02:00:08PM +0800, Jason Wang wrote: > > > On 2017年03月30日 04:48, Michael S. Tsirkin wrote: > > We are going to add more parameters to find_vqs, let's wrap the call so > > we don't need to tweak all drivers every time. > > > > Signed-off-by: Michael S. Tsirkin > > --- > > A quick glance and it looks ok, but what the benefit of this series, is it > required by other changes? > > Thanks Yes - to avoid touching all devices when doing the rest of the patchset.
Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
On Thu, Mar 30 2017 at 7:44am -0400, Christoph Hellwig wrote: > On Thu, Mar 30, 2017 at 12:06:41PM +0200, Lars Ellenberg wrote: > > > Will it make an fstrim cause thinly provisioned > > devices to suddenly be fully allocated? > > Not for SCSI devices. Yes for dm-thinp until it implements > REQ_OP_WRITE_ZEROES, which will hopefully be soon. I can work on this now. Only question I have is: should DM thinp take care to zero any misaligned head and tail? (I assume so but with all the back and forth between Bart, Paolo and Martin I figured I'd ask explicitly).
[PATCH 2/3] add test: generic/420 check information lead for lio-fileio
Signed-off-by: Dmitry Monakhov --- tests/generic/420 | 77 +++ tests/generic/420.out | 2 ++ tests/generic/group | 1 + 3 files changed, 80 insertions(+) create mode 100755 tests/generic/420 create mode 100644 tests/generic/420.out diff --git a/tests/generic/420 b/tests/generic/420 new file mode 100755 index 000..592703d --- /dev/null +++ b/tests/generic/420 @@ -0,0 +1,77 @@ +#! /bin/bash +# FS QA Test 420 +# +# Regression test for information leak for lio-fileio target +# BUG: If image file is less than virtual blockdev then read() may return +# unitilized data. +# +#--- +# Copyright (c) 2017 Dmitry Monakhov +# All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + + +_cleanup() +{ + cd / + rm -f $tmp.* + _liotgt_cleanup + rm -rf $TEST_DIR/$$ +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter +. ./common/liotarget + +# remove previous $seqres.full before test +rm -f $seqres.full + +# real QA test starts here + +# Modify as appropriate. +_supported_fs generic +_supported_os Linux +_require_test +_require_liotarget + +mkdir -p $TEST_DIR/$$ || _fail "Can not make test dir" + +cfg_path=$(_liotgt_create_fileio 420-test.img $TEST_DIR/$$/420-test.img 128M) +dev=$(_liotgt_attach_target /backstores/fileio/420-test.img) + +$XFS_IO_PROG -f -c "truncate 1M" $TEST_DIR/$$/420-test.img >> $seqres.full + +$XFS_IO_PROG -c "pwrite -S 0xa0 -b 512 512 1k" -d $dev >>$seqres.full 2>&1 || \ +_fail "pwrite failed" + +$XFS_IO_PROG -c "pwrite -S 0xab -b 512 2M 1k" -d $dev >>$seqres.full 2>&1 || \ +_fail "pwrite failed" + +md5sum $dev | sed -e "s,$dev,loopdev,g" +# success, all done +status=0 +exit diff --git a/tests/generic/420.out b/tests/generic/420.out new file mode 100644 index 000..deed5da --- /dev/null +++ b/tests/generic/420.out @@ -0,0 +1,2 @@ +QA output created by 420 +f5cfb0d8d7bfadbc2127f4f6ca4a0eba loopdev diff --git a/tests/generic/group b/tests/generic/group index 0781f35..1261547 100644 --- a/tests/generic/group +++ b/tests/generic/group @@ -422,3 +422,4 @@ 417 auto quick shutdown log 418 auto rw 419 auto quick encrypt +420 auto blockdev liotarget -- 2.9.3
[PATCH 1/3] add lio-target helpers
Linux-IO Target is very good framework for testing block backend. It is more flexible than scsi_debug. http://linux-iscsi.org/wiki/LIO Signed-off-by: Dmitry Monakhov --- common/config| 2 + common/liotarget | 111 +++ 2 files changed, 113 insertions(+) create mode 100644 common/liotarget diff --git a/common/config b/common/config index 59041a3..cfe7913 100644 --- a/common/config +++ b/common/config @@ -212,6 +212,8 @@ export XZ_PROG="`set_prog_path xz`" export FLOCK_PROG="`set_prog_path flock`" export LDD_PROG="`set_prog_path ldd`" export TIMEOUT_PROG="`set_prog_path timeout`" +export TARGETCLI_PROG="`set_prog_path targetcli`" +export TARGETCTL_PROG="`set_prog_path targetctl`" # use 'udevadm settle' or 'udevsettle' to wait for lv to be settled. # newer systems have udevadm command but older systems like RHEL5 don't. diff --git a/common/liotarget b/common/liotarget new file mode 100644 index 000..f821692 --- /dev/null +++ b/common/liotarget @@ -0,0 +1,111 @@ +#!/bin/bash +# +# Copyright (c) 2017 Virtuozzo Inc +# All Rights Reserved. +# +# Written by Dmitry Monakhov +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +# +# +# Functions for Linux-IO Target manipulation +# + +_require_liotarget() +{ + which $TARGETCLI_PROG >>$seqres.full 2>&1 || \ + _notrun "this test requires 'targetcli' tool" + which $TARGETCLI_PROG >>$seqres.full 2>&1 || \ + _notrun "this test requires 'targetcli' tool" + + $TARGETCLI_PROG ls /backstores/ramdisk >>$seqres.full 2>&1 ||\ + _notrun "kernel compiled w/o CONFIG_TARGET_CORE" + $TARGETCLI_PROG ls /backstores/fileio >>$seqres.full 2>&1 ||\ + _notrun "kernel compiled w/o CONFIG_TCM_FILEIO" + $TARGETCLI_PROG ls /loopback >>$seqres.full 2>&1 ||\ + _notrun "kernel compiled w/o CONFIG_LOOPBACK_TARGET" +} + +_liotgt_create_fileio() +{ + local name=$1 + local img=$2 + local sz=$3 + + $TARGETCLI_PROG /backstores/fileio create \ + name=$name file_or_dev=$img size=$sz >>$seqres.full ||\ + _fail "Can not create /backstores/fileio/$name" + + local cfg_path=`ls -d /sys/kernel/config/target/core/fileio_*/$name` + [ -d "$cfg_path" ] || _fail "Bad config path" + echo $cfg_path +} + +_liotgt_set_attribute() +{ + local path=$1 + local attr_name=$2 + local attr_val=$3 + + [ -f $path/attrib/$attr_name ] || _fail "Bad attribute $attr_name" + echo "echo $attr_val > $path/attrib/$attr_name " >>$seqres.full + echo $attr_val > $path/attrib/$attr_name +} + +_liotgt_create_loopback() +{ + local out=`$TARGETCLI_PROG /loopback/ create 2>&1` + [ $? -eq 0 ] || _fail "Can not create loopback target" + echo $out >>$seqres.full + + local naa=`echo $out | gawk '{ print $3 }'` + echo ${naa:0:20} >>$seqres.full + + echo ${naa:0:20} +} + +_liotgt_find_dev() +{ + local found="" + local name=$1 + local drives=`find /sys/devices -type f -name model` + for d in $drives;do + local dir=`dirname $d` + local vendor=`cat $dir/vendor` + local model=`cat $dir/model` + if [ "${vendor//[[:space:]]/}" == "LIO-ORG" ] && \ + [ "${model//[[:space:]]/}" == "$name" ]; then + found=/dev/$(ls $dir/block) + break + fi + done + [ -z "$found" ] && _fail "Can not find device with backend $name" + echo "$found" +} + +_liotgt_attach_target() +{ + local path=$1 + local name=`basename $path` + local naa=$(_liotgt_create_loopback) + + $TARGETCLI_PROG /loopback/$naa/luns create $path >>$seqres.full 2>&1 ||\ + _fail "Can not attach $path to tcm_loop" + _liotgt_find_dev $name +} + +_liotgt_cleanup() +{ + $TARGETCTL_PROG clear +} -- 2.9.3
[PATCH 3/3] add test: generic/421 basic blockdev T10-DIF-TYPE1 integrity checks
Test create virtual block device via lio-targed infastructure and perform basic IO operations with data corruption detection. Temprorally mark is as dangerous, because currently it trigger BUG_ON inside blkdev_issue_flush BTW: I use 'dd' to test read from corrupted image instead of xfs_io because even if pread failed, xfs_io still exit with success, BUG? Signed-off-by: Dmitry Monakhov --- tests/generic/421 | 136 ++ tests/generic/421.out | 16 ++ tests/generic/group | 1 + 3 files changed, 153 insertions(+) create mode 100644 tests/generic/421 create mode 100644 tests/generic/421.out diff --git a/tests/generic/421 b/tests/generic/421 new file mode 100644 index 000..de1ba73 --- /dev/null +++ b/tests/generic/421 @@ -0,0 +1,136 @@ +#! /bin/bash +# FS QA Test 421 +# +# Check basic T10-DIF integrity features for a block device +# +# DIF/DIX TYPE: T10-DIF-TYPE1-CRC +# Kernel docs: Documentation/block/data-integrity.txt +#--- +# Copyright (c) 2017 Dmitry Monakhov +# All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + + +_cleanup() +{ + cd / + rm -f $tmp.* + _liotgt_cleanup + rm -rf $TEST_DIR/$$ +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter +. ./common/liotarget + +# remove previous $seqres.full before test +rm -f $seqres.full + +# real QA test starts here + +# Modify as appropriate. +_supported_fs generic +_supported_os Linux +_require_test +_require_liotarget + +mkdir -p $TEST_DIR/$$ || _fail "Can not make test dir" + +# Create virtual block device +cfg_path=$(_liotgt_create_fileio t10-dif-t1 $TEST_DIR/$$/t10-dif-t1 32M) + +_liotgt_set_attribute $cfg_path pi_prot_type 1 +_liotgt_set_attribute $cfg_path pi_prot_format 1 + +dev=$(_liotgt_attach_target /backstores/fileio/t10-dif-t1) + +echo "T0: Test basic IO" +$XFS_IO_PROG -c "pwrite -S 0xa0 -b 4M 0 16M" -d $dev >>$seqres.full 2>&1 || \ +_fail "pwrite failed" + +$XFS_IO_PROG -c "pwrite -S 0xa1 -b 1k -V8 20M 32k" -d $dev >>$seqres.full 2>&1 || \ +_fail "pwrite failed" + +$XFS_IO_PROG -c "pwrite -S 0xa2 -b 1k -V8 2M 32k" -f $dev >>$seqres.full 2>&1 || \ +_fail "pwrite failed" + +$XFS_IO_PROG -c "pwrite -S 0xa3 -b 4k 1536000 8k" -f $dev >>$seqres.full 2>&1 || \ +_fail "pwrite failed" + +$XFS_IO_PROG -c "fsync" -d $dev >>$seqres.full 2>&1 || _fail "fsync failed" + +echo "Check that buffered and direct read works" +dd if=$dev bs=4k 2>$seqres.full | md5sum +dd if=$dev bs=4M iflag=direct 2>$seqres.full | md5sum + +echo "Check csum corruption detection" +# LIO-fileio store t10 DIF data in separate file ${IMG}.protection +# struct t10_pi_tuple { +# __be16 guard_tag; /* Checksum */ +# __be16 app_tag; /* Opaque storage */ +# __be32 ref_tag; /* Target LBA or indirect LBA */ +#} +# Play with 3000'th sector -> t10_pi_tuple offset == 3000 * 8 == 24000 +# +echo "T1: Corrupt guard_tag, next read should fail" +$XFS_IO_PROG -c "pwrite -S 0xde -b2 24000 2 -w" \ +-f $TEST_DIR/$$/t10-dif-t1.protection >>$seqres.full 2>&1 +dd if=$dev bs=1M count=2 iflag=direct >>$seqres.full 2>&1 && +_fail "read should fail on 3000'th sector" + +echo "T2: Check that unaffected blocks are still readable" +dd if=$dev bs=1M count=1 iflag=direct >>$seqres.full 2>&1 || _fail "read failed" + +echo "T3: Rewrite corrupted sector and check that read works" +$XFS_IO_PROG -c "pwrite -S 0xa3 -b 4k 1536000 4k" -d $dev >>$seqres.full 2>&1 || \ +_fail "pwrite failed" +dd if=$dev bs=2M count=1 iflag=direct >>$seqres.full 2>&1 || _fail "read failed" + +echo "T4: Corrupt app_tag, should not affect read" +$XFS_IO_PROG -c "pwrite -S 0xde -b2 24002 2 -w" \ +-f $TEST_DIR/$$/t10-dif-t1.protection >>$seqres.full 2>&1 +dd if=$dev bs=2M count=1 iflag=direct >>$seqres.full 2>&1 || _fail "read failed" + +echo "T5: Corrupt ref_tag, next read should fail" +$XFS_IO_PROG -c "pwrite -S 0xde -b4 24004 4 -w" \ +$TEST_DIR/$$/t10-dif-t1.protection >>$seqres.fu
[PATCH 0/3] Improve block device testing coverage
During LSFMM we have discussed how to test lower-backend of linux IO-stack. Common opinion was that xfstests is the most obvious solution which cover most of use cases filesystem care about. I'm working on integration T10-DIF/DIF data integrity features to ext4, for that reason we need to be shure that linux integrity framework is in working state, which is currently broken in several places. In fact, it is relatively simple to add basic coverage tests for basic IO operations over virtual device with integrity support. All we need is to add lio target support. TOC: add lio target helpers add test: generic/420 check information lead for lio-fileio add test: generic/421 basic blockdev T10-DIF-TYPE1 IO
Re: [Drbd-dev] [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
On Thu, Mar 30, 2017 at 01:44:09PM +0200, Christoph Hellwig wrote: > On Thu, Mar 30, 2017 at 12:06:41PM +0200, Lars Ellenberg wrote: > > On Thu, Mar 23, 2017 at 10:33:40AM -0400, Christoph Hellwig wrote: > > > It seems like DRBD assumes its on the wire TRIM request always zeroes > > > data. > > > Use that fact to implement REQ_OP_WRITE_ZEROES. > > > > > > XXX: will need a careful audit from the drbd team! > > > > Thanks, this one looks ok to me. > > So the DRBD protocol requires the TRIM operation to always zero? "users" (both as in submitting entities, and people using DRBD) expect that DRBD guarantees replicas to be identical after whatever operations have been completed by all replicas. Which means that for trim/discard/unmap, we can only expose that to upper layers (or use it for internal purposes) if the operation has a well defined, and on all backends identical, result. Short answer: Yes. > > The real question for me is, will the previous one (21/23) > > return != 0 (some EOPNOTSUPP or else) to DRBD in more situations than > > what we have now? > > No, blkdev_issue_zeroout should never return -EOPNOTSUPP. > > > Will it make an fstrim cause thinly provisioned > > devices to suddenly be fully allocated? > > Not for SCSI devices. Yes for dm-thinp until it implements > REQ_OP_WRITE_ZEROES, which will hopefully be soon. "good enough for me" ;-) Thanks, Lars
Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
On Thu, Mar 30, 2017 at 12:06:41PM +0200, Lars Ellenberg wrote: > On Thu, Mar 23, 2017 at 10:33:40AM -0400, Christoph Hellwig wrote: > > It seems like DRBD assumes its on the wire TRIM request always zeroes data. > > Use that fact to implement REQ_OP_WRITE_ZEROES. > > > > XXX: will need a careful audit from the drbd team! > > Thanks, this one looks ok to me. So the DRBD protocol requires the TRIM operation to always zero? > The real question for me is, will the previous one (21/23) > return != 0 (some EOPNOTSUPP or else) to DRBD in more situations than > what we have now? No, blkdev_issue_zeroout should never return -EOPNOTSUPP. > Will it make an fstrim cause thinly provisioned > devices to suddenly be fully allocated? Not for SCSI devices. Yes for dm-thinp until it implements REQ_OP_WRITE_ZEROES, which will hopefully be soon.
Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
On Thu, Mar 23, 2017 at 10:33:40AM -0400, Christoph Hellwig wrote: > It seems like DRBD assumes its on the wire TRIM request always zeroes data. > Use that fact to implement REQ_OP_WRITE_ZEROES. > > XXX: will need a careful audit from the drbd team! Thanks, this one looks ok to me. The real question for me is, will the previous one (21/23) return != 0 (some EOPNOTSUPP or else) to DRBD in more situations than what we have now? Will it make an fstrim cause thinly provisioned devices to suddenly be fully allocated? Or does it unmap "the same" as what we have now? Especially on top of dm-thin, but also on top of any other device. That's something that is not really "obvious" to me yet. Cheers, Lars
Re: [PATCHv5 0/2] tcmu: For bugs fix only
On 2017年03月30日 16:48, Nicholas A. Bellinger wrote: Hi Xiubo & Co, On Mon, 2017-03-27 at 17:07 +0800, lixi...@cmss.chinamobile.com wrote: From: Xiubo Li Changed for V5: - This only includes #1 and #2. And for old #3, #4 are still reviewing. - #1, since the issue reported by Ilias is a separate new one, and will create a new patch later. - #2, address the issue pointed out by Mike, thanks. - #1 and #2 have been tested by Ilias in BIDI case, thanks. Changed for V4: - re-order the #3, #4 at the head. - merge most of the #5 to others. Xiubo Li (2): tcmu: Fix possible overwrite of t_data_sg's last iov[] tcmu: Fix wrongly calculating of the base_command_size drivers/target/target_core_user.c | 44 +++ 1 file changed, 31 insertions(+), 13 deletions(-) Applied to target-pending/master, along with CC's for 3.19.y stable. Thanks Xiubo ! Btw, I fixed-up the ordering of the Reviewed-by + Tested-by + Signed-off-by tags on the patches here: https://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git/commit/?id=ab22d2604c86ceb01bb2725c9860b88a7dd383bb https://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git/commit/?id=abe342a5b4b5aa579f6bf40ba73447c699e6b579 Just for future reference, the flow of these tags should reflect the history of the patch. Eg: Reviewed-by: First reviewer Tested-by: First tester Reviewed-by: Second reviewer Signed-off-by: Patch Author and then once the subsystem maintainer merges it into his tree, they add their own: Signed-off-by: Subsystem Maintainer Sure, thanks very much. Beyond that, what is the status of the other two feature improvement patches..? Do you intend to post those (again) as v4.12-rc1 material..? Since the #2 has changed in V5, and the old #3 will have a conflict with it. The #3 and #4 are still in reviewing process, so I will post it again if they are all okay later. Thanks, BRs Xiubo
Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload
Lars, can you please take a look a patch 22 and check if it's safe? That's the big thing I want to know before posting the next version of the series. If it's not safe I'd like to drop that patch.
RE: [PATCH] be2iscsi: switch to pci_alloc_irq_vectors
> -Original Message- > From: Christoph Hellwig [mailto:h...@lst.de] > Sent: Friday, January 13, 2017 10:00 PM > To: subbu.seethara...@broadcom.com; ketan.muka...@broadcom.com; > jitendra.bhiv...@broadcom.com; linux-scsi@vger.kernel.org > Subject: [PATCH] be2iscsi: switch to pci_alloc_irq_vectors > > And get automatic MSI-X affinity for free. > > Signed-off-by: Christoph Hellwig > --- > drivers/scsi/be2iscsi/be_main.c | 127 +--- > drivers/scsi/be2iscsi/be_main.h | 2 - > 2 files changed, 42 insertions(+), 87 deletions(-) > > diff --git a/drivers/scsi/be2iscsi/be_main.c b/drivers/scsi/be2iscsi/be_main.c > index 6372613..03faca8 100644 > --- a/drivers/scsi/be2iscsi/be_main.c > +++ b/drivers/scsi/be2iscsi/be_main.c > @@ -801,12 +801,12 @@ static int beiscsi_init_irqs(struct beiscsi_hba > *phba) > struct pci_dev *pcidev = phba->pcidev; > struct hwi_controller *phwi_ctrlr; > struct hwi_context_memory *phwi_context; > - int ret, msix_vec, i, j; > + int ret, i, j; > > phwi_ctrlr = phba->phwi_ctrlr; > phwi_context = phwi_ctrlr->phwi_ctxt; > > - if (phba->msix_enabled) { > + if (pcidev->msix_enabled) { > for (i = 0; i < phba->num_cpus; i++) { > phba->msi_name[i] = kzalloc(BEISCSI_MSI_NAME, > GFP_KERNEL); > @@ -817,9 +817,8 @@ static int beiscsi_init_irqs(struct beiscsi_hba *phba) > > sprintf(phba->msi_name[i], "beiscsi_%02x_%02x", > phba->shost->host_no, i); > - msix_vec = phba->msix_entries[i].vector; > - ret = request_irq(msix_vec, be_isr_msix, 0, > - phba->msi_name[i], > + ret = request_irq(pci_irq_vector(pcidev, i), > + be_isr_msix, 0, phba->msi_name[i], > &phwi_context->be_eq[i]); > if (ret) { > beiscsi_log(phba, KERN_ERR, > BEISCSI_LOG_INIT, @@ -837,9 +836,8 @@ static int beiscsi_init_irqs(struct > beiscsi_hba *phba) > } > sprintf(phba->msi_name[i], "beiscsi_mcc_%02x", > phba->shost->host_no); > - msix_vec = phba->msix_entries[i].vector; > - ret = request_irq(msix_vec, be_isr_mcc, 0, phba- > >msi_name[i], > - &phwi_context->be_eq[i]); > + ret = request_irq(pci_irq_vector(pcidev, i), be_isr_mcc, 0, > + phba->msi_name[i], &phwi_context- > >be_eq[i]); > if (ret) { > beiscsi_log(phba, KERN_ERR, BEISCSI_LOG_INIT , > "BM_%d : beiscsi_init_irqs-" > @@ -861,9 +859,8 @@ static int beiscsi_init_irqs(struct beiscsi_hba *phba) > return 0; > free_msix_irqs: > for (j = i - 1; j >= 0; j--) { > + free_irq(pci_irq_vector(pcidev, i), &phwi_context->be_eq[j]); > kfree(phba->msi_name[j]); > - msix_vec = phba->msix_entries[j].vector; > - free_irq(msix_vec, &phwi_context->be_eq[j]); > } > return ret; > } > @@ -3039,7 +3036,7 @@ static int beiscsi_create_eqs(struct beiscsi_hba > *phba, > num_eq_pages = PAGES_REQUIRED(phba->params.num_eq_entries * > \ > sizeof(struct be_eq_entry)); > > - if (phba->msix_enabled) > + if (phba->pcidev->msix_enabled) > eq_for_mcc = 1; > else > eq_for_mcc = 0; > @@ -3549,7 +3546,7 @@ static int be_mcc_queues_create(struct > beiscsi_hba *phba, > sizeof(struct be_mcc_compl))) > goto err; > /* Ask BE to create MCC compl queue; */ > - if (phba->msix_enabled) { > + if (phba->pcidev->msix_enabled) { > if (beiscsi_cmd_cq_create(ctrl, cq, &phwi_context->be_eq >[phba->num_cpus].q, false, true, 0)) > goto mcc_cq_free; > @@ -3580,42 +3577,35 @@ static int be_mcc_queues_create(struct > beiscsi_hba *phba, > return -ENOMEM; > } > > -/** > - * find_num_cpus()- Get the CPU online count > - * @phba: ptr to priv structure > - * > - * CPU count is used for creating EQ. > - **/ > -static void find_num_cpus(struct beiscsi_hba *phba) > +static void be2iscsi_enable_msix(struct beiscsi_hba *phba) > { > - int num_cpus = 0; > - > - num_cpus = num_online_cpus(); > + int nvec = 1; > > switch (phba->generation) { > case BE_GEN2: > case BE_GEN3: > - phba->num_cpus = (num_cpus > BEISCSI_MAX_NUM_CPUS) ? > - BEISCSI_MAX_NUM_CPUS : num_cpus; > + nvec = BEISCSI_MAX_NUM_CPUS + 1; > break; > case BE_GEN4: > - /* > - * If eqid_count == 1 fall back to > - * I
Re: [PATCH 23/23] block: remove the discard_zeroes_data flag
On Tue, Mar 28, 2017 at 05:00:48PM +, Bart Van Assche wrote: > > It seems to me like the documentation in Documentation/ABI/testing/sysfs-block > and the above code are not in sync. I think the above code will cause reading > from the discard_zeroes_data attribute to return an empty string ("") instead > of "0\n". Thanks, fine with me. > > BTW, my personal preference is to remove this attribute entirely because > keeping > it will cause confusion, no matter how well we document the behavior of this > attribute. Jens, any opinion? I'd like to remove it too, but I fear it might break things. We could deprecate it first with a warning when read and then remove it a few releases down the road.
Re: [PATCH 11/23] block_dev: use blkdev_issue_zerout for hole punches
On Tue, Mar 28, 2017 at 04:50:47PM +, Bart Van Assche wrote: > On Thu, 2017-03-23 at 10:33 -0400, Christoph Hellwig wrote: > > This gets us support for non-discard efficient write of zeroes (e.g. NVMe) > > and preparse for removing the discard_zeroes_data flag. > > Hello Christoph, > > "preparse" probably should have been "prepare"? Yes, fixed.
Re: [PATCH 12/23] sd: handle REQ_UNMAP
On Tue, Mar 28, 2017 at 04:48:55PM +, Bart Van Assche wrote: > > if (sdp->no_write_same) > > return BLKPREP_INVALID; > > if (sdkp->ws16 || sector > 0x || nr_sectors > 0x) > > Users can change the provisioning mode from user space from SD_LBP_WS16 into > SD_LBP_WS10 so I'm not sure it's safe to skip the (sdkp->ws16 || sector > > 0x || nr_sectors > 0x) check if REQ_UNMAP is set. They can, and if the device has too many sectors that will already cause discard to fail, and in this case it will cause write zeroes to fail as well. The intent behind this patch is to keep the behavior the same as the old path that uses discards for zeroing. The logic looks a bit clumsy, but I'd rather keep it as-is.
Re: [PATCH 01/23] block: renumber REQ_OP_WRITE_ZEROES
On Tue, Mar 28, 2017 at 04:12:46PM +, Bart Van Assche wrote: > Since REQ_OP_WRITE_ZEROES was introduced in kernel v4.10, do we need > "Cc: stable" and "Fixes: a6f0788ec2881" tags for this patch? No. This just works around the way scsi_setup_cmnd sets up the data direction. Before this series it's not an issue because no one used the req_op data direction for setting up the dma direction.
Re: [PATCHv5 0/2] tcmu: For bugs fix only
Hi Xiubo & Co, On Mon, 2017-03-27 at 17:07 +0800, lixi...@cmss.chinamobile.com wrote: > From: Xiubo Li > > Changed for V5: > - This only includes #1 and #2. And for old #3, #4 are still reviewing. > - #1, since the issue reported by Ilias is a separate new one, and will > create a new patch later. > - #2, address the issue pointed out by Mike, thanks. > - #1 and #2 have been tested by Ilias in BIDI case, thanks. > > Changed for V4: > - re-order the #3, #4 at the head. > - merge most of the #5 to others. > > Xiubo Li (2): > tcmu: Fix possible overwrite of t_data_sg's last iov[] > tcmu: Fix wrongly calculating of the base_command_size > > drivers/target/target_core_user.c | 44 > +++ > 1 file changed, 31 insertions(+), 13 deletions(-) > Applied to target-pending/master, along with CC's for 3.19.y stable. Thanks Xiubo ! Btw, I fixed-up the ordering of the Reviewed-by + Tested-by + Signed-off-by tags on the patches here: https://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git/commit/?id=ab22d2604c86ceb01bb2725c9860b88a7dd383bb https://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git/commit/?id=abe342a5b4b5aa579f6bf40ba73447c699e6b579 Just for future reference, the flow of these tags should reflect the history of the patch. Eg: Reviewed-by: First reviewer Tested-by: First tester Reviewed-by: Second reviewer Signed-off-by: Patch Author and then once the subsystem maintainer merges it into his tree, they add their own: Signed-off-by: Subsystem Maintainer Beyond that, what is the status of the other two feature improvement patches..? Do you intend to post those (again) as v4.12-rc1 material..?
Re: [PATCH 1/7] ѕd: split sd_setup_discard_cmnd
On Tue, Mar 28, 2017 at 08:05:09AM -0600, ax...@kernel.dk wrote: > > Although I know this is an issue in the existing code and not something > > introduced by you: please consider using logical_to_sectors() instead of > > open-coding this function. Otherwise this patch looks fine to me. > > The downside of doing that is that we still call ilog2() twice, which > sucks. Would be faster to cache ilog2(sector_size) and use that in the > shift calculation. I suspect that gcc is smart enough to optimize it away. That beeing said while this looks like a nice cleanup this patch is just supposed to move code, so I'd rather not add the change here and leave it for a separate submission.
Re: [PATCH] be2iscsi: switch to pci_alloc_irq_vectors
On Mon, Jan 23, 2017 at 09:41:45AM +0530, Jitendra Bhivare wrote: > We will be taking this up along with some other changes in the same area. Any progress on that? I'd like to get be2iscsi off the old MSI-X API for 4.12.
[PATCH 2/2] target: Avoid mappedlun symlink creation during lun shutdown
From: Nicholas Bellinger This patch closes a race between se_lun deletion during configfs unlink in target_fabric_port_unlink() -> core_dev_del_lun() -> core_tpg_remove_lun(), when transport_clear_lun_ref() blocks waiting for percpu_ref RCU grace period to finish, but a new NodeACL mappedlun is added before the RCU grace period has completed. This can happen in target_fabric_mappedlun_link() because it only checks for se_lun->lun_se_dev, which is not cleared until after transport_clear_lun_ref() percpu_ref RCU grace period finishes. This bug originally manifested as NULL pointer dereference OOPsen in target_stat_scsi_att_intr_port_show_attr_dev() on v4.1.y code, because it dereferences lun->lun_se_dev without a explicit NULL pointer check. In post v4.1 code with target-core RCU conversion, the code in target_stat_scsi_att_intr_port_show_attr_dev() no longer uses se_lun->lun_se_dev, but the same race still exists. To address the bug, go ahead and set se_lun>lun_shutdown as early as possible in core_tpg_remove_lun(), and ensure new NodeACL mappedlun creation in target_fabric_mappedlun_link() fails during se_lun shutdown. Reported-by: James Shen Cc: James Shen Tested-by: James Shen Signed-off-by: Nicholas Bellinger --- drivers/target/target_core_fabric_configfs.c | 5 + drivers/target/target_core_tpg.c | 4 include/target/target_core_base.h| 1 + 3 files changed, 10 insertions(+) diff --git a/drivers/target/target_core_fabric_configfs.c b/drivers/target/target_core_fabric_configfs.c index d8a16ca..d1e6cab 100644 --- a/drivers/target/target_core_fabric_configfs.c +++ b/drivers/target/target_core_fabric_configfs.c @@ -92,6 +92,11 @@ static int target_fabric_mappedlun_link( pr_err("Source se_lun->lun_se_dev does not exist\n"); return -EINVAL; } + if (lun->lun_shutdown) { + pr_err("Unable to create mappedlun symlink because" + " lun->lun_shutdown=true\n"); + return -EINVAL; + } se_tpg = lun->lun_tpg; nacl_ci = &lun_acl_ci->ci_parent->ci_group->cg_item; diff --git a/drivers/target/target_core_tpg.c b/drivers/target/target_core_tpg.c index 6fb1919..dfaef4d 100644 --- a/drivers/target/target_core_tpg.c +++ b/drivers/target/target_core_tpg.c @@ -642,6 +642,8 @@ void core_tpg_remove_lun( */ struct se_device *dev = rcu_dereference_raw(lun->lun_se_dev); + lun->lun_shutdown = true; + core_clear_lun_from_tpg(lun, tpg); /* * Wait for any active I/O references to percpu se_lun->lun_ref to @@ -663,6 +665,8 @@ void core_tpg_remove_lun( } if (!(dev->se_hba->hba_flags & HBA_FLAGS_INTERNAL_USE)) hlist_del_rcu(&lun->link); + + lun->lun_shutdown = false; mutex_unlock(&tpg->tpg_lun_mutex); percpu_ref_exit(&lun->lun_ref); diff --git a/include/target/target_core_base.h b/include/target/target_core_base.h index 4b784b6..2e28246 100644 --- a/include/target/target_core_base.h +++ b/include/target/target_core_base.h @@ -705,6 +705,7 @@ struct se_lun { u64 unpacked_lun; #define SE_LUN_LINK_MAGIC 0x7771 u32 lun_link_magic; + boollun_shutdown; boollun_access_ro; u32 lun_index; -- 1.9.1
[PATCH 1/2] iscsi-target: Fix TMR reference leak during session shutdown
From: Nicholas Bellinger This patch fixes a iscsi-target specific TMR reference leak during session shutdown, that could occur when a TMR was quiesced before the hand-off back to iscsi-target code via transport_cmd_check_stop_to_fabric(). The reference leak happens because iscsit_free_cmd() was incorrectly skipping the final target_put_sess_cmd() for TMRs when transport_generic_free_cmd() returned zero because the se_cmd->cmd_kref did not reach zero, due to the missing se_cmd assignment in original code. The result was iscsi_cmd and it's associated se_cmd memory would be freed once se_sess->sess_cmd_map where released, but the associated se_tmr_req was leaked and remained part of se_device->dev_tmr_list. This bug would manfiest itself as kernel paging request OOPsen in core_tmr_lun_reset(), when a left-over se_tmr_req attempted to dereference it's se_cmd pointer that had already been released during normal session shutdown. To address this bug, go ahead and treat ISCSI_OP_SCSI_CMD and ISCSI_OP_SCSI_TMFUNC the same when there is an extra se_cmd->cmd_kref to drop in iscsit_free_cmd(), and use op_scsi to signal __iscsit_free_cmd() when the former needs to clear any further iscsi related I/O state. Reported-by: Rob Millner Cc: Rob Millner Reported-by: Chu Yuan Lin Cc: Chu Yuan Lin Tested-by: Chu Yuan Lin Signed-off-by: Nicholas Bellinger --- drivers/target/iscsi/iscsi_target_util.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c index 5041a9c..b464033 100644 --- a/drivers/target/iscsi/iscsi_target_util.c +++ b/drivers/target/iscsi/iscsi_target_util.c @@ -737,21 +737,23 @@ void iscsit_free_cmd(struct iscsi_cmd *cmd, bool shutdown) { struct se_cmd *se_cmd = NULL; int rc; + bool op_scsi = false; /* * Determine if a struct se_cmd is associated with * this struct iscsi_cmd. */ switch (cmd->iscsi_opcode) { case ISCSI_OP_SCSI_CMD: - se_cmd = &cmd->se_cmd; - __iscsit_free_cmd(cmd, true, shutdown); + op_scsi = true; /* * Fallthrough */ case ISCSI_OP_SCSI_TMFUNC: - rc = transport_generic_free_cmd(&cmd->se_cmd, shutdown); - if (!rc && shutdown && se_cmd && se_cmd->se_sess) { - __iscsit_free_cmd(cmd, true, shutdown); + se_cmd = &cmd->se_cmd; + __iscsit_free_cmd(cmd, op_scsi, shutdown); + rc = transport_generic_free_cmd(se_cmd, shutdown); + if (!rc && shutdown && se_cmd->se_sess) { + __iscsit_free_cmd(cmd, op_scsi, shutdown); target_put_sess_cmd(se_cmd); } break; -- 1.9.1
[PATCH 0/2] target: Bug-fixes for v4.11-rc
From: Nicholas Bellinger Hi all, Here are two additional target bug-fixes that have been found by the DATERA Q/A + automation team during extended longevity and stress testing atop v4.1.y stable code. The first is a iscsi-target specific TMR reference leak during session shutdown when se_cmd quiesce occured before hand-off back to iscsi-target, that results in se_tmr_req descriptors being left after se_cmd was freed. This would manifest as kernel paging OOPsen during LUN_RESET after the leak occured, because the associated se_tmr_req had not been released even though the pre-allocated parent se_cmd descriptor had already been freed during session shutdown. The second is a race between se_lun configfs unlink shutdown when se_lun->lun_ref percpu_ref RCU grace period is happening, and user-space attempts to add a new NodeACL mappedlun configfs symlink while se_lun shutdown is occuring. This would manifest as NULL pointer dereference OOPsen within target_stat_scsi_att_intr_port_show_attr_dev() after a new NodeACL mappedlun was added to a se_lun that had already been shutdown. Both have been verified on v4.1.y stable code, and forward ported to v4.11-rc. Please review. --nab Nicholas Bellinger (2): iscsi-target: Fix TMR reference leak during session shutdown target: Avoid mappedlun symlink creation during lun shutdown drivers/target/iscsi/iscsi_target_util.c | 12 +++- drivers/target/target_core_fabric_configfs.c | 5 + drivers/target/target_core_tpg.c | 4 include/target/target_core_base.h| 1 + 4 files changed, 17 insertions(+), 5 deletions(-) -- 1.9.1
Re: [PATCH v2] scsi/bfa: use designated initializers
On Wed, Mar 29, 2017 at 01:55:09PM -0700, Kees Cook wrote: > Prepare to mark sensitive kernel structures for randomization by making > sure they're using designated initializers. These were identified during > allyesconfig builds of x86, arm, and arm64, with most initializer fixes > extracted from grsecurity. > > Signed-off-by: Kees Cook > --- > This has been updated now that struct bfa_fcs_mod_s was dropped. Can you also add designated array initializers for __port_action while you're at it?
Re: [REGRESSION][Stable][v3.12.y][v4.4.y][v4.9.y][v4.10.y][v4.11-rc1] scsi: storvsc: properly set residual data length on errors
On Tue, Mar 28, 2017 at 04:14:09PM +, Stephen Hemminger wrote: > I decided not to send it to stable since problem was only observed on > 4.11 but it is probably endemic to all GEN2 VM's So, what does this mean? What should stable@ do? Nothing? Ok, now dropped this from my patch queue :) thanks, greg k-h
Re: [PATCH 1/6] virtio: wrap find_vqs
On Wed, 29 Mar 2017 23:48:44 +0300 "Michael S. Tsirkin" wrote: > We are going to add more parameters to find_vqs, let's wrap the call so > we don't need to tweak all drivers every time. > > Signed-off-by: Michael S. Tsirkin > --- > drivers/block/virtio_blk.c | 3 +-- > drivers/char/virtio_console.c | 6 +++--- > drivers/crypto/virtio/virtio_crypto_core.c | 3 +-- > drivers/gpu/drm/virtio/virtgpu_kms.c | 3 +-- > drivers/net/caif/caif_virtio.c | 3 +-- > drivers/net/virtio_net.c | 3 +-- > drivers/rpmsg/virtio_rpmsg_bus.c | 2 +- > drivers/scsi/virtio_scsi.c | 3 +-- > drivers/virtio/virtio_balloon.c| 3 +-- > drivers/virtio/virtio_input.c | 3 +-- > include/linux/virtio_config.h | 9 + > net/vmw_vsock/virtio_transport.c | 6 +++--- > 12 files changed, 24 insertions(+), 23 deletions(-) Regardless whether that context thing is the right thing to do, this looks like a sensible cleanup.