Re: [PATCH] sd: fix uninitialized variable access in error handling
On Fri, Oct 21, 2016 at 10:32 AM, Arnd Bergmann wrote:
> If sd_zbc_report_zones fails, the check for 'zone_blocks == 0'
> later in the function accesses uninitialized data:
>
> drivers/scsi/sd_zbc.c: In function ‘sd_zbc_read_zones’:
> drivers/scsi/sd_zbc.c:520:7: error: ‘zone_blocks’ may be used uninitialized
> in this function [-Werror=maybe-uninitialized]
>
> This sets it to zero, which has the desired effect of leaving
> sd_zbc_read_zones successfully with sdkp->zone_blocks = 0.
>
> Fixes: 89d947561077 ("sd: Implement support for ZBC devices")
> Signed-off-by: Arnd Bergmann
> ---
>  drivers/scsi/sd_zbc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
> index 16d3fa62d8ac..d5b3bd915d9e 100644
> --- a/drivers/scsi/sd_zbc.c
> +++ b/drivers/scsi/sd_zbc.c
> @@ -455,8 +455,10 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
>
>  	/* Do a report zone to get the same field */
>  	ret = sd_zbc_report_zones(sdkp, buf, SD_ZBC_BUF_SIZE, 0);
> -	if (ret)
> +	if (ret) {
> +		zone_blocks = 0;
>  		goto out;
> +	}
>
>  	same = buf[4] & 0x0f;
>  	if (same > 0) {
> --
> 2.9.0

Reviewed-by: Shaun Tancheff

--
Shaun Tancheff
Re: [PATCH] block: zoned: fix harmless maybe-uninitialized warning
On Fri, Oct 21, 2016 at 10:42 AM, Arnd Bergmann wrote:
> The blkdev_report_zones produces a harmless warning when
> -Wmaybe-uninitialized is set, after gcc gets a little confused
> about the multiple 'goto' here:
>
> block/blk-zoned.c: In function 'blkdev_report_zones':
> block/blk-zoned.c:188:13: error: 'nz' may be used uninitialized in this
> function [-Werror=maybe-uninitialized]
>
> Moving the assignment to nr_zones makes this a little simpler
> while also avoiding the warning reliably. I'm removing the
> extraneous initialization of 'int ret' in the same patch, as
> that is semi-related and could cause an uninitialized use of
> that variable to not produce a warning.
>
> Fixes: 6a0cb1bc106f ("block: Implement support for zoned block devices")
> Signed-off-by: Arnd Bergmann
> ---
>  block/blk-zoned.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 667f95d86695..472211fa183a 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -80,7 +80,7 @@ int blkdev_report_zones(struct block_device *bdev,
>  	unsigned int i, n, nz;
>  	unsigned int ofst;
>  	void *addr;
> -	int ret = 0;
> +	int ret;
>
>  	if (!q)
>  		return -ENXIO;
> @@ -179,14 +179,12 @@ int blkdev_report_zones(struct block_device *bdev,
>
>  	}
>
> +	*nr_zones = nz;
>  out:
>  	bio_for_each_segment_all(bv, bio, i)
>  		__free_page(bv->bv_page);
>  	bio_put(bio);
>
> -	if (ret == 0)
> -		*nr_zones = nz;
> -
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(blkdev_report_zones);
> --
> 2.9.0

Reviewed-by: Shaun Tancheff

--
Shaun Tancheff
Re: [PATCH v5 2/4] fusion: remove iopriority handling
On Thu, Oct 13, 2016 at 6:00 PM, Adam Manzanares wrote:
> The request priority is now by default coming from the ioc. It was not
> clear what this code was trying to do based upon the iopriority class or
> data. The driver should check that a device supports priorities and use
> them according to the specifications of ioprio.
>
> Signed-off-by: Adam Manzanares
> ---
>  drivers/message/fusion/mptscsih.c | 5 -----
>  1 file changed, 5 deletions(-)
>
> diff --git a/drivers/message/fusion/mptscsih.c b/drivers/message/fusion/mptscsih.c
> index 6c9fc11..4740bb6 100644
> --- a/drivers/message/fusion/mptscsih.c
> +++ b/drivers/message/fusion/mptscsih.c
> @@ -1369,11 +1369,6 @@ mptscsih_qcmd(struct scsi_cmnd *SCpnt)
>  	if ((vdevice->vtarget->tflags & MPT_TARGET_FLAGS_Q_YES)
>  	 && (SCpnt->device->tagged_supported)) {
>  		scsictl = scsidir | MPI_SCSIIO_CONTROL_SIMPLEQ;
> -		if (SCpnt->request && SCpnt->request->ioprio) {
> -			if (((SCpnt->request->ioprio & 0x7) == 1) ||
> -			    !(SCpnt->request->ioprio & 0x7))
> -				scsictl |= MPI_SCSIIO_CONTROL_HEADOFQ;
> -		}
>  	} else
>  		scsictl = scsidir | MPI_SCSIIO_CONTROL_UNTAGGED;

Style-wise you can further remove the extra parens around
SCpnt->device->tagged_supported as well as the now redundant braces.

Regards,
Shaun
[PATCH v6 7/7] blk-zoned: implement ioctls
Adds the new BLKREPORTZONE and BLKRESETZONE ioctls for respectively obtaining the zone configuration of a zoned block device and resetting the write pointer of sequential zones of a zoned block device. The BLKREPORTZONE ioctl maps directly to a single call of the function blkdev_report_zones. The zone information result is passed as an array of struct blk_zone identical to the structure used internally for processing the REQ_OP_ZONE_REPORT operation. The BLKRESETZONE ioctl maps to a single call of the blkdev_reset_zones function. Signed-off-by: Shaun Tancheff Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Martin K. Petersen --- block/blk-zoned.c | 93 +++ block/ioctl.c | 4 ++ include/linux/blkdev.h| 21 ++ include/uapi/linux/blkzoned.h | 40 +++ include/uapi/linux/fs.h | 4 ++ 5 files changed, 162 insertions(+) diff --git a/block/blk-zoned.c b/block/blk-zoned.c index 1603573..667f95d 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -255,3 +255,96 @@ int blkdev_reset_zones(struct block_device *bdev, return 0; } EXPORT_SYMBOL_GPL(blkdev_reset_zones); + +/** + * BLKREPORTZONE ioctl processing. + * Called from blkdev_ioctl. 
+ */ +int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode, + unsigned int cmd, unsigned long arg) +{ + void __user *argp = (void __user *)arg; + struct request_queue *q; + struct blk_zone_report rep; + struct blk_zone *zones; + int ret; + + if (!argp) + return -EINVAL; + + q = bdev_get_queue(bdev); + if (!q) + return -ENXIO; + + if (!blk_queue_is_zoned(q)) + return -ENOTTY; + + if (!capable(CAP_SYS_ADMIN)) + return -EACCES; + + if (copy_from_user(&rep, argp, sizeof(struct blk_zone_report))) + return -EFAULT; + + if (!rep.nr_zones) + return -EINVAL; + + zones = kcalloc(rep.nr_zones, sizeof(struct blk_zone), GFP_KERNEL); + if (!zones) + return -ENOMEM; + + ret = blkdev_report_zones(bdev, rep.sector, + zones, &rep.nr_zones, + GFP_KERNEL); + if (ret) + goto out; + + if (copy_to_user(argp, &rep, sizeof(struct blk_zone_report))) { + ret = -EFAULT; + goto out; + } + + if (rep.nr_zones) { + if (copy_to_user(argp + sizeof(struct blk_zone_report), zones, +sizeof(struct blk_zone) * rep.nr_zones)) + ret = -EFAULT; + } + + out: + kfree(zones); + + return ret; +} + +/** + * BLKRESETZONE ioctl processing. + * Called from blkdev_ioctl. 
+ */ +int blkdev_reset_zones_ioctl(struct block_device *bdev, fmode_t mode, +unsigned int cmd, unsigned long arg) +{ + void __user *argp = (void __user *)arg; + struct request_queue *q; + struct blk_zone_range zrange; + + if (!argp) + return -EINVAL; + + q = bdev_get_queue(bdev); + if (!q) + return -ENXIO; + + if (!blk_queue_is_zoned(q)) + return -ENOTTY; + + if (!capable(CAP_SYS_ADMIN)) + return -EACCES; + + if (!(mode & FMODE_WRITE)) + return -EBADF; + + if (copy_from_user(&zrange, argp, sizeof(struct blk_zone_range))) + return -EFAULT; + + return blkdev_reset_zones(bdev, zrange.sector, zrange.nr_sectors, + GFP_KERNEL); +} diff --git a/block/ioctl.c b/block/ioctl.c index ed2397f..448f78a 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -513,6 +513,10 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd, BLKDEV_DISCARD_SECURE); case BLKZEROOUT: return blk_ioctl_zeroout(bdev, mode, arg); + case BLKREPORTZONE: + return blkdev_report_zones_ioctl(bdev, mode, cmd, arg); + case BLKRESETZONE: + return blkdev_reset_zones_ioctl(bdev, mode, cmd, arg); case HDIO_GETGEO: return blkdev_getgeo(bdev, argp); case BLKRAGET: diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 252043f..90097dd 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -316,6 +316,27 @@ extern int blkdev_report_zones(struct block_device *bdev, extern int blkdev_reset_zones(struct block_device *bdev, sector_t sectors, sector_t nr_sectors, gfp_t gfp_mask); +extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode, +unsigned int cmd, unsigned long arg); +extern int blkdev_reset_zones_ioctl(struct block_device
[PATCH v6 5/7] block: Implement support for zoned block devices
From: Hannes Reinecke Implement zoned block device zone information reporting and reset. Zone information are reported as struct blk_zone. This implementation does not differentiate between host-aware and host-managed device models and is valid for both. Two functions are provided: blkdev_report_zones for discovering the zone configuration of a zoned block device, and blkdev_reset_zones for resetting the write pointer of sequential zones. The helper function blk_queue_zone_size and bdev_zone_size are also provided for, as the name suggest, obtaining the zone size (in 512B sectors) of the zones of the device. Signed-off-by: Hannes Reinecke [Damien: * Removed the zone cache * Implement report zones operation based on earlier proposal by Shaun Tancheff ] Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Martin K. Petersen Reviewed-by: Shaun Tancheff Tested-by: Shaun Tancheff --- Changes from v5: * Rebased on Jens' for-4.9/block branch (v5 is based on next-20160928). block/Kconfig | 8 ++ block/Makefile| 2 +- block/blk-zoned.c | 257 ++ include/linux/blkdev.h| 31 + include/uapi/linux/Kbuild | 1 + include/uapi/linux/blkzoned.h | 103 + 6 files changed, 401 insertions(+), 1 deletion(-) create mode 100644 block/blk-zoned.c create mode 100644 include/uapi/linux/blkzoned.h diff --git a/block/Kconfig b/block/Kconfig index 5136ad4..7bb9bf8 100644 --- a/block/Kconfig +++ b/block/Kconfig @@ -89,6 +89,14 @@ config BLK_DEV_INTEGRITY T10/SCSI Data Integrity Field or the T13/ATA External Path Protection. If in doubt, say N. +config BLK_DEV_ZONED + bool "Zoned block device support" + ---help--- + Block layer zoned block device support. This option enables + support for ZAC/ZBC host-managed and host-aware zoned block devices. + + Say yes here if you have a ZAC or ZBC storage device. 
+ config BLK_DEV_THROTTLING bool "Block layer bio throttling support" depends on BLK_CGROUP=y diff --git a/block/Makefile b/block/Makefile index 9eda232..4676969 100644 --- a/block/Makefile +++ b/block/Makefile @@ -22,4 +22,4 @@ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o obj-$(CONFIG_BLK_CMDLINE_PARSER) += cmdline-parser.o obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o - +obj-$(CONFIG_BLK_DEV_ZONED)+= blk-zoned.o diff --git a/block/blk-zoned.c b/block/blk-zoned.c new file mode 100644 index 000..1603573 --- /dev/null +++ b/block/blk-zoned.c @@ -0,0 +1,257 @@ +/* + * Zoned block device handling + * + * Copyright (c) 2015, Hannes Reinecke + * Copyright (c) 2015, SUSE Linux GmbH + * + * Copyright (c) 2016, Damien Le Moal + * Copyright (c) 2016, Western Digital + */ + +#include +#include +#include +#include + +static inline sector_t blk_zone_start(struct request_queue *q, + sector_t sector) +{ + sector_t zone_mask = blk_queue_zone_size(q) - 1; + + return sector & ~zone_mask; +} + +/* + * Check that a zone report belongs to the partition. + * If yes, fix its start sector and write pointer, copy it in the + * zone information array and return true. Return false otherwise. 
+ */ +static bool blkdev_report_zone(struct block_device *bdev, + struct blk_zone *rep, + struct blk_zone *zone) +{ + sector_t offset = get_start_sect(bdev); + + if (rep->start < offset) + return false; + + rep->start -= offset; + if (rep->start + rep->len > bdev->bd_part->nr_sects) + return false; + + if (rep->type == BLK_ZONE_TYPE_CONVENTIONAL) + rep->wp = rep->start + rep->len; + else + rep->wp -= offset; + memcpy(zone, rep, sizeof(struct blk_zone)); + + return true; +} + +/** + * blkdev_report_zones - Get zones information + * @bdev: Target block device + * @sector:Sector from which to report zones + * @zones: Array of zone structures where to return the zones information + * @nr_zones: Number of zone structures in the zone array + * @gfp_mask: Memory allocation flags (for bio_alloc) + * + * Description: + *Get zone information starting from the zone containing @sector. + *The number of zone information reported may be less than the number + *requested by @nr_zones. The number of zones actually reported is + *returned in @nr_zones. + */ +int blkdev_report_zones(struct block_device *bdev, + sector_t sector, + struct blk_zone *zones, + unsigned int *nr_zones, + gfp_t gfp_mask) +{ + struct request_queue *q = bdev_get_queue(bdev); + s
[PATCH v6 6/7] sd: Implement support for ZBC devices
From: Hannes Reinecke

Implement ZBC support functions to setup zoned disks, both host-managed
and host-aware models. Only zoned disks that satisfy the following
conditions are supported:
1) All zones are the same size, with the exception of an eventual last
   smaller runt zone.
2) For host-managed disks, reads are unrestricted (reads are not failed
   due to zone or write pointer alignment constraints).
Zoned disks that do not satisfy these 2 conditions are setup with a
capacity of 0 to prevent their use.

The function sd_zbc_read_zones, called from sd_revalidate_disk, checks
that the device satisfies the above two constraints. This function may
also change the disk capacity previously set by sd_read_capacity for
devices reporting only the capacity of conventional zones at the
beginning of the LBA range (i.e. devices reporting rc_basis set to 0).

The capacity message output was moved out of sd_read_capacity into a
new function sd_print_capacity to include this eventual capacity change
by sd_zbc_read_zones. This new function also includes a call to
sd_zbc_print_zones to display the number of zones and zone size of the
device.

Signed-off-by: Hannes Reinecke

[Damien:
 * Removed zone cache support
 * Removed mapping of discard to reset write pointer command
 * Modified sd_zbc_read_zones to include checks that the device
   satisfies the kernel constraints
 * Implemented REPORT ZONES setup and post-processing based on code
   from Shaun Tancheff
 * Removed confusing use of 512B sector units in functions interface]

Signed-off-by: Damien Le Moal
Reviewed-by: Christoph Hellwig
Reviewed-by: Shaun Tancheff
Tested-by: Shaun Tancheff
---
Changes from v5:
* Rebased on Jens' for-4.9/block branch (v5 is based on next-20160928).
drivers/scsi/Makefile | 1 + drivers/scsi/sd.c | 141 --- drivers/scsi/sd.h | 67 + drivers/scsi/sd_zbc.c | 627 ++ include/scsi/scsi_proto.h | 17 ++ 5 files changed, 820 insertions(+), 33 deletions(-) create mode 100644 drivers/scsi/sd_zbc.c diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile index d539798..fabcb6d 100644 --- a/drivers/scsi/Makefile +++ b/drivers/scsi/Makefile @@ -179,6 +179,7 @@ hv_storvsc-y:= storvsc_drv.o sd_mod-objs:= sd.o sd_mod-$(CONFIG_BLK_DEV_INTEGRITY) += sd_dif.o +sd_mod-$(CONFIG_BLK_DEV_ZONED) += sd_zbc.o sr_mod-objs:= sr.o sr_ioctl.o sr_vendor.o ncr53c8xx-flags-$(CONFIG_SCSI_ZALON) \ diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index d3e852a..fb324ac 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -92,6 +92,7 @@ MODULE_ALIAS_BLOCKDEV_MAJOR(SCSI_DISK15_MAJOR); MODULE_ALIAS_SCSI_DEVICE(TYPE_DISK); MODULE_ALIAS_SCSI_DEVICE(TYPE_MOD); MODULE_ALIAS_SCSI_DEVICE(TYPE_RBC); +MODULE_ALIAS_SCSI_DEVICE(TYPE_ZBC); #if !defined(CONFIG_DEBUG_BLOCK_EXT_DEVT) #define SD_MINORS 16 @@ -162,7 +163,7 @@ cache_type_store(struct device *dev, struct device_attribute *attr, static const char temp[] = "temporary "; int len; - if (sdp->type != TYPE_DISK) + if (sdp->type != TYPE_DISK && sdp->type != TYPE_ZBC) /* no cache control on RBC devices; theoretically they * can do it, but there's probably so many exceptions * it's not worth the risk */ @@ -261,7 +262,7 @@ allow_restart_store(struct device *dev, struct device_attribute *attr, if (!capable(CAP_SYS_ADMIN)) return -EACCES; - if (sdp->type != TYPE_DISK) + if (sdp->type != TYPE_DISK && sdp->type != TYPE_ZBC) return -EINVAL; sdp->allow_restart = simple_strtoul(buf, NULL, 10); @@ -391,6 +392,11 @@ provisioning_mode_store(struct device *dev, struct device_attribute *attr, if (!capable(CAP_SYS_ADMIN)) return -EACCES; + if (sd_is_zoned(sdkp)) { + sd_config_discard(sdkp, SD_LBP_DISABLE); + return count; + } + if (sdp->type != TYPE_DISK) return -EINVAL; @@ -458,7 +464,7 @@ 
max_write_same_blocks_store(struct device *dev, struct device_attribute *attr, if (!capable(CAP_SYS_ADMIN)) return -EACCES; - if (sdp->type != TYPE_DISK) + if (sdp->type != TYPE_DISK && sdp->type != TYPE_ZBC) return -EINVAL; err = kstrtoul(buf, 10, &max); @@ -843,6 +849,12 @@ static int sd_setup_write_same_cmnd(struct scsi_cmnd *cmd) BUG_ON(bio_offset(bio) || bio_iovec(bio).bv_len != sdp->sector_size); + if (sd_is_zoned(sdkp)) { + ret = sd_zbc_setup_read_write(cmd); + if (ret != BLKPREP_OK) + return ret; + } + sector >>= ilog2(sdp->sector_size) - 9; nr_sectors
[PATCH v6 4/7] block: Define zoned block device operations
From: Shaun Tancheff

Define REQ_OP_ZONE_REPORT and REQ_OP_ZONE_RESET for handling zones of
host-managed and host-aware zoned block devices. With these two new
operations, the total number of operations defined reaches 8 and still
fits within the 3-bit definition of REQ_OP_BITS.

Signed-off-by: Shaun Tancheff
Signed-off-by: Damien Le Moal
Reviewed-by: Christoph Hellwig
Reviewed-by: Martin K. Petersen
---
 block/blk-core.c          | 4 ++++
 include/linux/blk_types.h | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 14d7c07..e4eda5d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1941,6 +1941,10 @@ generic_make_request_checks(struct bio *bio)
 	case REQ_OP_WRITE_SAME:
 		if (!bdev_write_same(bio->bi_bdev))
 			goto not_supported;
+	case REQ_OP_ZONE_REPORT:
+	case REQ_OP_ZONE_RESET:
+		if (!bdev_is_zoned(bio->bi_bdev))
+			goto not_supported;
 		break;
 	default:
 		break;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cd395ec..dd50dce 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -243,6 +243,8 @@ enum req_op {
 	REQ_OP_SECURE_ERASE,	/* request to securely erase sectors */
 	REQ_OP_WRITE_SAME,	/* write same block many times */
 	REQ_OP_FLUSH,		/* request for cache flush */
+	REQ_OP_ZONE_REPORT,	/* Get zone information */
+	REQ_OP_ZONE_RESET,	/* Reset a zone write pointer */
 };

 #define REQ_OP_BITS 3
--
2.9.3
[PATCH v6 3/7] block: update chunk_sectors in blk_stack_limits()
From: Hannes Reinecke Signed-off-by: Hannes Reinecke Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Martin K. Petersen Reviewed-by: Shaun Tancheff Tested-by: Shaun Tancheff --- block/blk-settings.c | 4 1 file changed, 4 insertions(+) diff --git a/block/blk-settings.c b/block/blk-settings.c index b1d5b7f..55369a6 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -631,6 +631,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, t->discard_granularity; } + if (b->chunk_sectors) + t->chunk_sectors = min_not_zero(t->chunk_sectors, + b->chunk_sectors); + return ret; } EXPORT_SYMBOL(blk_stack_limits); -- 2.9.3
[PATCH v6 2/7] blk-sysfs: Add 'chunk_sectors' to sysfs attributes
From: Hannes Reinecke The queue limits already have a 'chunk_sectors' setting, so we should be presenting it via sysfs. Signed-off-by: Hannes Reinecke [Damien: Updated Documentation/ABI/testing/sysfs-block] Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Martin K. Petersen Reviewed-by: Shaun Tancheff Tested-by: Shaun Tancheff --- Documentation/ABI/testing/sysfs-block | 13 + block/blk-sysfs.c | 11 +++ 2 files changed, 24 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block index 75a5055..ee2d5cd 100644 --- a/Documentation/ABI/testing/sysfs-block +++ b/Documentation/ABI/testing/sysfs-block @@ -251,3 +251,16 @@ Description: since drive-managed zoned block devices do not support zone commands, they will be treated as regular block devices and zoned will report "none". + +What: /sys/block//queue/chunk_sectors +Date: September 2016 +Contact: Hannes Reinecke +Description: + chunk_sectors has different meaning depending on the type + of the disk. For a RAID device (dm-raid), chunk_sectors + indicates the size in 512B sectors of the RAID volume + stripe segment. For a zoned block device, either + host-aware or host-managed, chunk_sectors indicates the + size of 512B sectors of the zones of the device, with + the eventual exception of the last zone of the device + which may be smaller. 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index ff9cd9c..488c2e2 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -130,6 +130,11 @@ static ssize_t queue_physical_block_size_show(struct request_queue *q, char *pag return queue_var_show(queue_physical_block_size(q), page); } +static ssize_t queue_chunk_sectors_show(struct request_queue *q, char *page) +{ + return queue_var_show(q->limits.chunk_sectors, page); +} + static ssize_t queue_io_min_show(struct request_queue *q, char *page) { return queue_var_show(queue_io_min(q), page); @@ -455,6 +460,11 @@ static struct queue_sysfs_entry queue_physical_block_size_entry = { .show = queue_physical_block_size_show, }; +static struct queue_sysfs_entry queue_chunk_sectors_entry = { + .attr = {.name = "chunk_sectors", .mode = S_IRUGO }, + .show = queue_chunk_sectors_show, +}; + static struct queue_sysfs_entry queue_io_min_entry = { .attr = {.name = "minimum_io_size", .mode = S_IRUGO }, .show = queue_io_min_show, @@ -555,6 +565,7 @@ static struct attribute *default_attrs[] = { &queue_hw_sector_size_entry.attr, &queue_logical_block_size_entry.attr, &queue_physical_block_size_entry.attr, + &queue_chunk_sectors_entry.attr, &queue_io_min_entry.attr, &queue_io_opt_entry.attr, &queue_discard_granularity_entry.attr, -- 2.9.3
[PATCH v6 1/7] block: Add 'zoned' queue limit
From: Damien Le Moal Add the zoned queue limit to indicate the zoning model of a block device. Defined values are 0 (BLK_ZONED_NONE) for regular block devices, 1 (BLK_ZONED_HA) for host-aware zone block devices and 2 (BLK_ZONED_HM) for host-managed zone block devices. The standards defined drive managed model is not defined here since these block devices do not provide any command for accessing zone information. Drive managed model devices will be reported as BLK_ZONED_NONE. The helper functions blk_queue_zoned_model and bdev_zoned_model return the zoned limit and the functions blk_queue_is_zoned and bdev_is_zoned return a boolean for callers to test if a block device is zoned. The zoned attribute is also exported as a string to applications via sysfs. BLK_ZONED_NONE shows as "none", BLK_ZONED_HA as "host-aware" and BLK_ZONED_HM as "host-managed". Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Martin K. Petersen Reviewed-by: Shaun Tancheff Tested-by: Shaun Tancheff --- Documentation/ABI/testing/sysfs-block | 16 block/blk-settings.c | 1 + block/blk-sysfs.c | 18 ++ include/linux/blkdev.h| 47 +++ 4 files changed, 82 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block index 71d184d..75a5055 100644 --- a/Documentation/ABI/testing/sysfs-block +++ b/Documentation/ABI/testing/sysfs-block @@ -235,3 +235,19 @@ Description: write_same_max_bytes is 0, write same is not supported by the device. +What: /sys/block//queue/zoned +Date: September 2016 +Contact: Damien Le Moal +Description: + zoned indicates if the device is a zoned block device + and the zone model of the device if it is indeed zoned. + The possible values indicated by zoned are "none" for + regular block devices and "host-aware" or "host-managed" + for zoned block devices. 
The characteristics of + host-aware and host-managed zoned block devices are + described in the ZBC (Zoned Block Commands) and ZAC + (Zoned Device ATA Command Set) standards. These standards + also define the "drive-managed" zone model. However, + since drive-managed zoned block devices do not support + zone commands, they will be treated as regular block + devices and zoned will report "none". diff --git a/block/blk-settings.c b/block/blk-settings.c index f679ae1..b1d5b7f 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -107,6 +107,7 @@ void blk_set_default_limits(struct queue_limits *lim) lim->io_opt = 0; lim->misaligned = 0; lim->cluster = 1; + lim->zoned = BLK_ZONED_NONE; } EXPORT_SYMBOL(blk_set_default_limits); diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 9cc8d7c..ff9cd9c 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -257,6 +257,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0); QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0); #undef QUEUE_SYSFS_BIT_FNS +static ssize_t queue_zoned_show(struct request_queue *q, char *page) +{ + switch (blk_queue_zoned_model(q)) { + case BLK_ZONED_HA: + return sprintf(page, "host-aware\n"); + case BLK_ZONED_HM: + return sprintf(page, "host-managed\n"); + default: + return sprintf(page, "none\n"); + } +} + static ssize_t queue_nomerges_show(struct request_queue *q, char *page) { return queue_var_show((blk_queue_nomerges(q) << 1) | @@ -485,6 +497,11 @@ static struct queue_sysfs_entry queue_nonrot_entry = { .store = queue_store_nonrot, }; +static struct queue_sysfs_entry queue_zoned_entry = { + .attr = {.name = "zoned", .mode = S_IRUGO }, + .show = queue_zoned_show, +}; + static struct queue_sysfs_entry queue_nomerges_entry = { .attr = {.name = "nomerges", .mode = S_IRUGO | S_IWUSR }, .show = queue_nomerges_show, @@ -546,6 +563,7 @@ static struct attribute *default_attrs[] = { &queue_discard_zeroes_data_entry.attr, &queue_write_same_max_entry.attr, &queue_nonrot_entry.attr, + &queue_zoned_entry.attr, 
&queue_nomerges_entry.attr, &queue_rq_affinity_entry.attr, &queue_iostats_entry.attr, diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index c47c358..f19e16b 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -261,6 +261,15 @@ struct blk_queue_tag { #define BLK_SCSI_MAX_CMDS (256) #define BLK_SCSI_CMD_PER_LONG (BLK_SCSI_MAX_CMDS / (sizeof(long) * 8)) +/* + * Z
[PATCH v6 0/7] ZBC / Zoned block device support
This series introduces support for zoned block devices. It integrates
earlier submissions by Hannes Reinecke, Damien Le Moal and Shaun Tancheff.

Compared to the previous series version, the code was significantly
simplified by limiting support to zoned devices satisfying the following
conditions:
1) All zones of the device are the same size, with the exception of an
   eventual last smaller runt zone.
2) For host-managed disks, reads must be unrestricted (read commands do
   not fail due to zone or write pointer alignment constraints).
Zoned disks that do not satisfy these 2 conditions are ignored.

These 2 conditions allowed dropping the zone information cache
implemented in the previous version. This simplifies the code and also
reduces the memory consumption at run time. Support for zoned devices
now only requires one bit per zone (less than 8KB in total). This bit
field is used to write-lock zones and prevent the concurrent execution
of multiple write commands in the same zone. This avoids write ordering
problems at dispatch time, for both the simple queue and scsi-mq
settings.

The new operations introduced to support zone manipulation were reduced
to only the two main ZBC/ZAC defined commands: REPORT ZONES
(REQ_OP_ZONE_REPORT) and RESET WRITE POINTER (REQ_OP_ZONE_RESET). This
brings the total number of operations defined to 8, which fits in the
3 bits (REQ_OP_BITS) reserved for the operation code in bio->bi_opf and
req->cmd_flags.

Most of the ZBC specific code is kept out of sd.c and implemented in the
new file sd_zbc.c. Similarly, at the block layer, most of the zoned
block device code is implemented in the new blk-zoned.c.

For host-managed zoned block devices, the sequential write constraint of
write pointer zones is exposed to the user. Users of the disk
(applications, file systems or device mappers) must sequentially write
to zones.
This means that for raw block device accesses from applications, buffered writes are unreliable and direct I/Os must be used (or buffered writes with O_SYNC). Access to zone manipulation operations is also provided to applications through a set of new ioctls. This allows applications operating on raw block devices (e.g. mkfs.xxx) to discover a device zone layout and manipulate zone state. Changes from v5: * Rebased on Jens' for-4.9/block branch (v5 is based on next-20160928). Changes from v4: * Changed interface of sd_zbc_setup_read_write Changes from v3: * Fixed several typos and tabs/spaces * Added description of zoned and chunk_sectors queue attributes in Documentation/ABI/testing/sysfs-block * Fixed sd_read_capacity call in sd.c and to avoid missing information on the first pass of a disk scan * Fixed scsi_disk zone related field to use logical block size unit instead of 512B sector unit. Changes from v2: * Use kcalloc to allocate zone information array for ioctl * Use kcalloc to allocate zone information array for ioctl * Export GPL the functions blkdev_report_zones and blkdev_reset_zones * Shuffled uapi definitions from patch 7 into patch 5 Damien Le Moal (1): block: Add 'zoned' queue limit Hannes Reinecke (4): blk-sysfs: Add 'chunk_sectors' to sysfs attributes block: update chunk_sectors in blk_stack_limits() block: Implement support for zoned block devices sd: Implement support for ZBC devices Shaun Tancheff (2): block: Define zoned block device operations blk-zoned: implement ioctls Documentation/ABI/testing/sysfs-block | 29 ++ block/Kconfig | 8 + block/Makefile| 2 +- block/blk-core.c | 4 + block/blk-settings.c | 5 + block/blk-sysfs.c | 29 ++ block/blk-zoned.c | 350 +++ block/ioctl.c | 4 + drivers/scsi/Makefile | 1 + drivers/scsi/sd.c | 141 ++-- drivers/scsi/sd.h | 67 drivers/scsi/sd_zbc.c | 627 ++ include/linux/blk_types.h | 2 + include/linux/blkdev.h| 99 ++ include/scsi/scsi_proto.h | 17 + include/uapi/linux/Kbuild | 1 + include/uapi/linux/blkzoned.h | 
143 include/uapi/linux/fs.h | 4 + 18 files changed, 1499 insertions(+), 34 deletions(-) create mode 100644 block/blk-zoned.c create mode 100644 drivers/scsi/sd_zbc.c create mode 100644 include/uapi/linux/blkzoned.h -- 2.9.3
Re: BUG Re: mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk
Confirmed:
- Removing DEBUG_VM_RB fixes the hang.

Also confirmed:
- Above patch fixes the hang when DEBUG_VM_RB is re-enabled.

Thanks!

On Tue, Sep 27, 2016 at 11:05 AM, Andrea Arcangeli wrote:
> Hello,
>
> On Tue, Sep 27, 2016 at 05:16:15AM -0500, Shaun Tancheff wrote:
>> git bisect points at commit c9634dcf00c9c93b ("mm: vma_merge: fix
>> vm_page_prot SMP race condition against rmap_walk")
>
> I assume linux-next? But I can't find the commit, but I should know
> what this is.
>
>> Last lines to console are [transcribed]:
>>
>> vma 8c3d989a7c78 start 7fe02ed4c000 end 7fe02ed52000
>> next 8c3d96de0c38 prev 8c3d989a6e40 mm 8c3d071cbac0
>> prot 8025 anon_vma 8c3d96fc9b28 vm_ops (null)
>> pgoff 7fe02ed4c file (null) private_data (null)
>> flags: 0x8100073(read|write|mayread|maywrite|mayexec|account|softdirty)
>
> It's a false positive, you have DEBUG_VM_RB=y, you can disable it or
> cherry-pick the fix:
>
> https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=74d8b44224f31153e23ca8a7f7f0700091f5a9b2
>
> The assumption validate_mm_rb did isn't valid anymore on the new code
> during __vma_unlink, the validation code must be updated to skip the
> next vma instead of the current one after this change. It's a bug in
> DEBUG_VM_RB=y, if you keep DEBUG_VM_RB=n there's no bug.
>
>> Reproducer is an Ubuntu 16.04.1 LTS x86_64 running on a VM (VirtualBox).
>> Symptom is a solid hang after boot and switch to starting gnome session.
>>
>> Hang at about 35s.
>>
>> kdbg traceback is all null entries.
>>
>> Let me know what additional information I can provide.
> > I already submitted the fix to Andrew last week: > > https://marc.info/?l=linux-mm&m=147449253801920&w=2 > > I assume it's pending for merging in -mm. > > If you can test this patch and confirm the problem goes away with > DEBUG_VM_RB=y it'd be great. > > Thanks, > Andrea -- Shaun Tancheff
BUG Re: mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk
git bisect points at commit c9634dcf00c9c93b ("mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk") Last lines to console are [transcribed]: vma 8c3d989a7c78 start 7fe02ed4c000 end 7fe02ed52000 next 8c3d96de0c38 prev 8c3d989a6e40 mm 8c3d071cbac0 prot 8025 anon_vma 8c3d96fc9b28 vm_ops (null) pgoff 7fe02ed4c file (null) private_data (null) flags: 0x8100073(read|write|mayread|maywrite|mayexec|account|softdirty) Reproducer is an Ubuntu 16.04.1 LTS x86_64 running on a VM (VirtualBox). Symptom is a solid hang after boot and switch to starting gnome session. Hang at about 35s. kdbg traceback is all null entries. Let me know what additional information I can provide. Regards, Shaun Tancheff
Re: UFS API in the kernel
On Thu, Sep 22, 2016 at 10:21 AM, Joao Pinto wrote: > Hi! > > I am designing an application that has the goal to be a utility for Unipro > and > UFS testing purposes. This application is going to run on top of a recent > Linux > Kernel containing the new UFS stack (including the new DWC drivers). > > I am considering doing the following: > a) Create a new config item called CONFIG_UFS_CHARDEV which is going to > create a > char device responsible to make some IOCTL available for user-space > applications > b) Create a linux/ufs.h header file that contains data structure declarations > that will be needed in user-space applications I am not very familiar with UFS devices; that said, you should have an sgX chardev being created already so you can handle SG_IO requests. There also appear to be some sysfs entries being created. So between sg and sysfs you should be able to handle any user-space out-of-band requests without resorting to making a new chardev. Adding more sysfs entries, if you need them, should be fine. You may find it easier to expand on the existing interfaces than to get consensus on a new driver and ioctls. Hope this helps, Shaun > Could you please advise me about what the correct approach should be to make > it > as standard as possible and usable in the future? > > Thank you very much for your help! > > regards, > Joao > -- > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Shaun Tancheff
Re: [PATCH v8 1/2 RESEND] Add bio/request flags to issue ZBC/ZAC commands
On Thu, Aug 25, 2016 at 9:31 PM, Damien Le Moal wrote: > > Shaun, > > On 8/25/16 05:24, Shaun Tancheff wrote: >> >> (RESENDING to include f2fs, fs-devel and dm-devel) >> >> Add op flags to access to zone information as well as open, close >> and reset zones: >> - REQ_OP_ZONE_REPORT - Query zone information (Report zones) >> - REQ_OP_ZONE_OPEN - Explicitly open a zone for writing >> - REQ_OP_ZONE_CLOSE - Explicitly close a zone >> - REQ_OP_ZONE_FINISH - Explicitly finish a zone >> - REQ_OP_ZONE_RESET - Reset Write Pointer to start of zone >> >> These op flags can be used to create bio's to control zoned devices >> through the block layer. > > I still have a hard time seeing the need for the REQ_OP_ZONE_REPORT > operation assuming that the device queue will hold a zone information cache, > Hannes' RB-tree or your array type, whichever. > > Let's try to think simply here: if the disk user (an FS, a device mapper or > an application doing raw disk accesses) wants to access the disk zone > information, why would it need to issue a BIO when calling > blkdev_lookup_zone would exactly give that information straight out of > memory (so much faster)? I thought hard about this, but cannot think of any > value for the BIO-to-disk option. It seems to me to be equivalent to > systematically doing a page cache read even if the page cache tells us that > the page is up-to-date... Firstly, the BIO abstraction here gives a common interface for getting the zone information and works even for embedded systems that are not willing / convinced to enable SCSI_ZBC + BLK_ZONED. Secondly, when SCSI_ZBC + BLK_ZONED are enabled it just returns from the zone cache [as you can hopefully find in the second half of this series]. I did add a 'force' option but it's not intended to be used lightly. Thirdly, it is my belief that the BIO abstraction is more easily adapted to working with [and through] the device mapper layer(s).
Today we both have the issue where if a file system supports working with a ZBC device there can be no device mapper stacked between the file system and the actual zoned device. This is also true of our respective device mapper targets. It is my current belief that teaching the device mapper layer to include REQ_OP_ZONE* operations is relatively straightforward and can be done without affecting existing targets that don't specifically need to operate on zones. Something similar to the way flush is handled currently: if the target doesn't ask to see zone operations, the default mapping rules apply. Examples of why I would like to add REQ_OP_ZONE* support to the device mapper: I think it would be really neat if I could just do a quick dm-linear and put a big chunk of SSD in front of dm-zoned or dm-zdm as it would be a nice way to boost performance. Similarly it enables using dm-linear to stitch together enough conventional space with a ZBC drive to see if Dave Chinner's XFS proposal from a couple of years ago could work. > Moreover, issuing a report zone to the disk may return information that is > in fact incorrect, as that would not take into account the eventual set of > write requests that was dispatched but not yet processed by the disk (some > zone write pointer may be reported with a value lower than what the zone > cache maintains). Yes, but issuing a zone report to media is not the expected path when the zone cache is available. It is there to 'force' a re-sync and it is intended that the user of the call knows that the force is being applied and wants it to happen. Perhaps I should make it two flags? One to force a reply from the device and a second flag to re-sync the zone cache with the result? There is one piece of information that can only be retrieved by going to the device and that is the 'non-seq resources' flag, and it is only used by Host Aware devices ... as far as I understand.
> Dealing (and fixing) these inconsistencies would force an update of the > report zone result using the information of the zone cache, which in itself > sounds like a good justification of not doing a report zones in the first > place. When report zones is just pulling from the zone cache it should not be a problem. So the normal activity [when SCSI_ZBC + BLK_ZONED are enabled] should not be introducing any inconsistencies. > I am fine with the other operations, and in fact having a BIO interface for > them to send down to the SCSI layer is better than any other method. It will > cause them to be seen in sd_init_command, which is the path taken for read > and write commands too. So all zone cache information checking and updating > can be done in that single place and serialized with a spinlock. Maintenance > of the zone cache information becomes very easy.
Re: [PATCH v6 3/4 RESEND] SCT Write Same / DSM Trim
On Thu, Aug 25, 2016 at 2:01 AM, Tom Yan wrote: > Really please just drop this patch. There is no rational reason for > you to associate the maximum payload size to the logical sector size. Been over this many, many times now. It has to do with the size of the buffer setup through WRITE SAME in drivers/scsi/sd.c > And please stop using the ATA SCSI Response Buffer (ata_scsi_rbuf) > that is used for response to the SCSI layer for SCSI commands that > won't really interact with the ATA device (i.e. triggers an ATA > command), while ata_format_sct_write_same() and > ata_scsi_write_same_xlat() are used for constructing payload that is > going to be sent to the ATA device. Can't you even see that these are > of a different direction to a different layer? Adding a new global buffer where there is already one is kind of silly. The buffer already has a perfectly acceptable spinlock and the time spent copying data around is trivially small in comparison to the I/O operation, so there is not likely to be any contention over the buffer. It is memory. Why do you think ata_scsi_rbuf is so special?
Re: [PATCH v6 2/4] Add support for SCT Write Same
On Thu, Aug 25, 2016 at 1:23 AM, Tom Yan wrote: > You only fill the bytes that you want to to set explicitly: > > + put_unaligned_le16(0x0002, &sctpg[0]); /* SCT_ACT_WRITE_SAME */ > + put_unaligned_le16(0x0101, &sctpg[1]); /* WRITE PTRN FG */ > + put_unaligned_le64(lba, &sctpg[2]); > + put_unaligned_le64(num, &sctpg[6]); > + put_unaligned_le32(0u, &sctpg[10]); > > What I doubted is, if you don't memset (zero-fill) the buffer first, > will other bytes have indeterministic value that causes random > unexpected behavior? No. If there is random or unexpected behaviour the device is broken and some other remedy, such as blacklisting, is required. --- Shaun
Re: [PATCH v6 1/4] libata: Safely overwrite attached page in WRITE SAME xlat
On Thu, Aug 25, 2016 at 1:31 AM, Tom Yan wrote: > On 25 August 2016 at 05:28, Shaun Tancheff wrote: >> On Wed, Aug 24, 2016 at 12:31 AM, Tom Yan wrote: >>> On 24 August 2016 at 11:33, Martin K. Petersen >>> wrote: >>>>>>>>> "Tom" == Tom Yan writes: >>>> >>>> Tom> Nope, SCSI Write Same commands does not have payload (or in SCSI >>>> Tom> terms, parameter list / data-out buffer). >>>> >>>> WRITE SAME has a a payload of 1 logical block (unless NDOB is set but we >>>> have had no good reason to support that yet). >>> >>> Interesting, I wasn't aware of the bit. I just didn't see any >>> parameter list defined for any of the Write Same commands. Ah wait, it >>> carries the pattern (the "same") and so. >>> >>> Hmm, it doesn't seem like the translation implemented in this patch >>> series cares about the payload though? >> >> As repeated here and elsewhere the payload is: >> scsi_sglist(cmd) >> and it was put there by scsi_init_io() when it called scsi_init_sgtable() > > What I mean is: > > put_unaligned_le32(0u, &sctpg[10]); > > So even if the payload of the SCSI Write Same commands instruct a > non-zero pattern writing, SCT Write Same will conveniently ignore that > do zero-filling anyway. That's what I mean by "doesn't care about the > payload". If you would like to add support for that it would be nice. I am not planning to do so here. > Though that would only be case with SG_IO though. SCSI Write Same > issued from block layer (BLKZEROOUT) will be instructing zero-filling > anyway. >>>> UNMAP has a payload that varies based on the number of range >>>> descriptors. The SCSI disk driver only ever issues a single descriptor >>>> but since libata doesn't support UNMAP this doesn't really come into >>>> play. >>>> >>>> Ideally there would be a way to distinguish between device limits for >>>> WRITE SAME with the UNMAP bit and for "regular" WRITE SAME. 
One way to >>>> do that would be to transition the libata discard implementation over to >>>> single-range UNMAP, fill out the relevant VPD page B0 fields and leave >>>> the WRITE SAME bits for writing zeroes. >>>> >>>> One reason that has not been particularly compelling is that the WRITE >>>> SAME payload buffer does double duty to hold the ATA DSM TRIM range >>> >>> Huh? Why would the SATL care about its payload buffer for TRIM (i.e. >>> when the UNMAP bit is set)? Doesn't it just read the LBA and NUMBER OF >>> BLOCKS field and pack TRIM ranges/payload according to that? >>> >>>> descriptors and matches the required ATA payload size. Whereas the UNMAP >>> >>> Why would it need to "matches the required ATA payload size"? >>> >>>> command would only provide 24 bytes of TRIM range space. >>> >>> I don't really follow. The UNMAP descriptor has LBA (8 bytes / 64-bit) >>> and NUMBER OF BLOCKS (4 bytes / 32-bit) field of the same length as >>> Write Same (16). Even if the SCSI disk driver will only send one >>> descriptor, it should work as good as Write Same (16). >> >> The "payload" is the data block transferred with the command. >> The "descriptor" is, in this context, the contents of the payload as >> it "describes" what the action the command is supposed to perform. >> > > I know right. > >> The "payload" contains the "descriptor" that "describes" how >> DSM TRIM should determine which logical blocks it should UNMAP. > > This should only be the case of UNMAP command, but not SCSI Write Same > with UNMAP bit set. And either way it should not affect how large the > ATA TRIM payload can be. The current SATL does not report support for UNMAP. If you think it should be added please submit a patch. If you would like to extend the current translate to support sending multiple blocks of trim data please submit a patch. >>>> Also, please be careful with transfer lengths, __data_len, etc. 
As >>>> mentioned, the transfer length WRITE SAME is typically 512 bytes and >>>> that's the number of bytes that need to be DMA'ed and transferred over >>>> the wire by the controller. But from a command completion perspective we >>>> need to complete however many bytes the command acted upon. Unlike reads >>>> and writes there is not a 1:1 mapping between the transfer length and >>>> the affected area. So we do a bit of magic after the buffer has been >>>> mapped to ensure that the completion byte count matches the number of >>>> blocks that were affected by the command rather than the size of the >>>> data buffer in bytes. >>>> >>>> -- >>>> Martin K. Petersen Oracle Linux Engineering >> -- >> Shaun Tancheff -- Shaun Tancheff
Re: [PATCH v6 2/4] Add support for SCT Write Same
On Wed, Aug 24, 2016 at 1:10 AM, Tom Yan wrote: > Btw, I wonder if you need to memset your buffer with 0 first, like > what is done in ata_scsi_rbuf_get. It is not necessary as the defined buffer is completely filled out here. Are you thinking of a sort of future proofing? Ex: In the unlikely event that the SCT Write Same command descriptor is expanded in a future ACS? It is more likely to see the command deprecated and replaced with a new SCT feature. Regardless of how unlikely that is, I would consider a memset here to clear the remainder of the payload.
Re: [PATCH v6 1/4] libata: Safely overwrite attached page in WRITE SAME xlat
On Wed, Aug 24, 2016 at 12:31 AM, Tom Yan wrote: > On 24 August 2016 at 11:33, Martin K. Petersen > wrote: >>>>>>> "Tom" == Tom Yan writes: >> >> Tom> Nope, SCSI Write Same commands does not have payload (or in SCSI >> Tom> terms, parameter list / data-out buffer). >> >> WRITE SAME has a a payload of 1 logical block (unless NDOB is set but we >> have had no good reason to support that yet). > > Interesting, I wasn't aware of the bit. I just didn't see any > parameter list defined for any of the Write Same commands. Ah wait, it > carries the pattern (the "same") and so. > > Hmm, it doesn't seem like the translation implemented in this patch > series cares about the payload though? As repeated here and elsewhere the payload is: scsi_sglist(cmd) and it was put there by scsi_init_io() when it called scsi_init_sgtable() >> UNMAP has a payload that varies based on the number of range >> descriptors. The SCSI disk driver only ever issues a single descriptor >> but since libata doesn't support UNMAP this doesn't really come into >> play. >> >> Ideally there would be a way to distinguish between device limits for >> WRITE SAME with the UNMAP bit and for "regular" WRITE SAME. One way to >> do that would be to transition the libata discard implementation over to >> single-range UNMAP, fill out the relevant VPD page B0 fields and leave >> the WRITE SAME bits for writing zeroes. >> >> One reason that has not been particularly compelling is that the WRITE >> SAME payload buffer does double duty to hold the ATA DSM TRIM range > > Huh? Why would the SATL care about its payload buffer for TRIM (i.e. > when the UNMAP bit is set)? Doesn't it just read the LBA and NUMBER OF > BLOCKS field and pack TRIM ranges/payload according to that? > >> descriptors and matches the required ATA payload size. Whereas the UNMAP > > Why would it need to "matches the required ATA payload size"? > >> command would only provide 24 bytes of TRIM range space. > > I don't really follow. 
The UNMAP descriptor has LBA (8 bytes / 64-bit) > and NUMBER OF BLOCKS (4 bytes / 32-bit) field of the same length as > Write Same (16). Even if the SCSI disk driver will only send one > descriptor, it should work as good as Write Same (16). The "payload" is the data block transferred with the command. The "descriptor" is, in this context, the contents of the payload as it "describes" what the action the command is supposed to perform. The "payload" contains the "descriptor" that "describes" how DSM TRIM should determine which logical blocks it should UNMAP. >> Also, please be careful with transfer lengths, __data_len, etc. As >> mentioned, the transfer length WRITE SAME is typically 512 bytes and >> that's the number of bytes that need to be DMA'ed and transferred over >> the wire by the controller. But from a command completion perspective we >> need to complete however many bytes the command acted upon. Unlike reads >> and writes there is not a 1:1 mapping between the transfer length and >> the affected area. So we do a bit of magic after the buffer has been >> mapped to ensure that the completion byte count matches the number of >> blocks that were affected by the command rather than the size of the >> data buffer in bytes. >> >> -- >> Martin K. Petersen Oracle Linux Engineering -- Shaun Tancheff
Re: [PATCH 1/2] f2fs: detect host-managed SMR by feature flag
...correctly for HASMR. > IIRC, we discussed HASMR with Marc Lehmann last year, and he gave the > below link which sums it up. > > http://comments.gmane.org/gmane.linux.file-systems.f2fs/2854 > http://blog.schmorp.de/2015-10-08-smr-archive-drives-fast-now.html I think the above applies to 'drive managed' SMR where the upper level has no insight into the underlying zone layout. >> So how about also introducing a F2FS_FEATURE_HASMR feature flag to >> handle these different cases ? > > Yes, so once I can retrieve the zone information from the device, surely I'll > add that. I have posted some patches that will allow you to retrieve the zone information for both HA and HM drives. Can you see if this would meet your needs? The data returned in the BIO is in the same (drive) format as SG_IO. >> Also, I think that the DISCARD option must be enabled by default for >> HMSMR disks. Otherwise, zones write pointer will never get reset. >> The same applies to HASMR devices mounted with the LFS mode. >> (In any case, the discard handling does not look like it will always >> align to the device zone size, which will fail on a zoned disk (discard >> granularity is the zone size). I may be missing something though. Still >> checking.) > > Yup, will do. BTW, I remember I fixed a zone-aligned discard issue before. > Could you please check the latest dev-test branch in f2fs.git? > I'm used to rebase that branch occasionally when I changed the previous > patches. I think you will need to have access to the zone information.
Regards, Shaun Tancheff
[PATCH v8 2/2 RESEND] Add ioctl to issue ZBC/ZAC commands via block layer
(RESENDING to include f2fs, fs-devel and dm-devel) Add support for ZBC ioctl's BLKREPORT - Issue Report Zones to device. BLKZONEACTION - Issue a Zone Action (Close, Finish, Open, or Reset) Signed-off-by: Shaun Tancheff --- v8: - Changed ioctl for zone actions to a single ioctl that takes a structure including the zone, zone action, all flag, and force option - Mapped REQ_META flag to 'force unit access' for zone operations v6: - Added GFP_DMA to gfp mask. v4: - Rebase on linux-next tag next-20160617. - Change bio flags to bio op's block/ioctl.c | 149 ++ include/uapi/linux/blkzoned_api.h | 30 +++- include/uapi/linux/fs.h | 1 + 3 files changed, 179 insertions(+), 1 deletion(-) diff --git a/block/ioctl.c b/block/ioctl.c index ed2397f..d760523 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -194,6 +194,151 @@ int blkdev_reread_part(struct block_device *bdev) } EXPORT_SYMBOL(blkdev_reread_part); +static int blk_zoned_report_ioctl(struct block_device *bdev, fmode_t mode, + void __user *parg) +{ + int error = -EFAULT; + gfp_t gfp = GFP_KERNEL | GFP_DMA; + void *iopg = NULL; + struct bdev_zone_report_io *bzrpt = NULL; + int order = 0; + struct page *pgs = NULL; + u32 alloc_size = PAGE_SIZE; + unsigned int op_flags = 0; + u8 opt = 0; + + if (!(mode & FMODE_READ)) + return -EBADF; + + iopg = (void *)get_zeroed_page(gfp); + if (!iopg) { + error = -ENOMEM; + goto report_zones_out; + } + bzrpt = iopg; + if (copy_from_user(bzrpt, parg, sizeof(*bzrpt))) { + error = -EFAULT; + goto report_zones_out; + } + if (bzrpt->data.in.return_page_count > alloc_size) { + int npages; + + alloc_size = bzrpt->data.in.return_page_count; + npages = (alloc_size + PAGE_SIZE - 1) >> PAGE_SHIFT; + pgs = alloc_pages(gfp, ilog2(npages)); + if (pgs) { + void *mem = page_address(pgs); + + if (!mem) { + error = -ENOMEM; + goto report_zones_out; + } + order = ilog2(npages); + memset(mem, 0, alloc_size); + memcpy(mem, bzrpt, sizeof(*bzrpt)); + bzrpt = mem; + } else { + /* Result requires DMA capable memory 
*/ + pr_err("Not enough memory available for request.\n"); + error = -ENOMEM; + goto report_zones_out; + } + } else { + alloc_size = bzrpt->data.in.return_page_count; + } + if (bzrpt->data.in.force_unit_access) + op_flags |= REQ_META; + opt = bzrpt->data.in.report_option; + error = blkdev_issue_zone_report(bdev, op_flags, + bzrpt->data.in.zone_locator_lba, opt, + pgs ? pgs : virt_to_page(iopg), + alloc_size, GFP_KERNEL); + if (error) + goto report_zones_out; + + if (pgs) { + void *src = bzrpt; + u32 off = 0; + + /* +* When moving a multi-order page with GFP_DMA +* the copy to user can trap "" +* so instead we copy out 1 page at a time. +*/ + while (off < alloc_size && !error) { + u32 len = min_t(u32, PAGE_SIZE, alloc_size - off); + + memcpy(iopg, src + off, len); + if (copy_to_user(parg + off, iopg, len)) + error = -EFAULT; + off += len; + } + } else { + if (copy_to_user(parg, iopg, alloc_size)) + error = -EFAULT; + } + +report_zones_out: + if (pgs) + __free_pages(pgs, order); + if (iopg) + free_page((unsigned long)iopg); + return error; +} + +static int blk_zoned_action_ioctl(struct block_device *bdev, fmode_t mode, + void __user *parg) +{ + unsigned int op = 0; + unsigned int op_flags = 0; + sector_t lba; + struct bdev_zone_action za; + + if (!(mode & FMODE_WRITE)) + return -EBADF; + + /* When acting on zones we explicitly disallow using a partition. */ + if (bdev != bdev->bd_contains) { + pr_err("%s: All zone operations disallowed on this device\n", + __func__); + return -EFAULT; + } + + if (copy_from_user(
[PATCH v8 1/2 RESEND] Add bio/request flags to issue ZBC/ZAC commands
(RESENDING to include f2fs, fs-devel and dm-devel) Add op flags to access zone information as well as open, close and reset zones: - REQ_OP_ZONE_REPORT - Query zone information (Report zones) - REQ_OP_ZONE_OPEN - Explicitly open a zone for writing - REQ_OP_ZONE_CLOSE - Explicitly close a zone - REQ_OP_ZONE_FINISH - Explicitly finish a zone - REQ_OP_ZONE_RESET - Reset Write Pointer to start of zone These op flags can be used to create bio's to control zoned devices through the block layer. This is useful for file systems and device mappers that need explicit control of zoned devices such as Host Managed and Host Aware SMR drives. Report zones is a device read that requires a buffer. Open, Close, Finish and Reset are device commands that have no associated data transfer. Open - Open a zone for writing. Close - Disallow writing to a zone. Finish - Disallow writing to a zone and set the WP to the end of the zone. Reset - Discard data in a zone and reset the WP to the start of the zone. Sending an LBA of ~0 will attempt to operate on all zones. This is typically used with Reset to wipe a drive, as a Reset behaves similarly to TRIM in that all data in the zone(s) is deleted. Report zones currently defaults to reporting on all zones. It is expected that support for the zone option flag will piggyback on streamid support. The report option flag is useful as it can reduce the number of zones in each report, but it is not critical. Signed-off-by: Shaun Tancheff --- v8: - Added Finish Zone op - Fixed report zones copy to user to work when HARDENED_USERCOPY is enabled v6: - Added GFP_DMA to gfp mask. v5: - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent - In blk-lib fix documentation v4: - Rebase on linux-next tag next-20160617. - Change bio flags to bio op's V3: - Rebase on Mike Christie's separate bio operations - Update blkzoned_api.h to include report zones PARTIAL bit. V2: - Changed bi_rw to op_flags to clarify separation of bio op from flags.
- Fixed memory leak in blkdev_issue_zone_report failing to put_bio(). - Documented opt in blkdev_issue_zone_report. - Removed include/uapi/linux/fs.h from this patch. MAINTAINERS | 9 ++ block/blk-lib.c | 94 drivers/scsi/sd.c | 121 + drivers/scsi/sd.h | 1 + include/linux/bio.h | 8 +- include/linux/blk_types.h | 7 +- include/linux/blkdev.h| 1 + include/linux/blkzoned_api.h | 25 ++ include/uapi/linux/Kbuild | 1 + include/uapi/linux/blkzoned_api.h | 182 ++ 10 files changed, 447 insertions(+), 2 deletions(-) create mode 100644 include/linux/blkzoned_api.h create mode 100644 include/uapi/linux/blkzoned_api.h diff --git a/MAINTAINERS b/MAINTAINERS index a306795..aedf311 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -12984,6 +12984,15 @@ F: Documentation/networking/z8530drv.txt F: drivers/net/hamradio/*scc.c F: drivers/net/hamradio/z8530.h +ZBC AND ZBC BLOCK DEVICES +M: Shaun Tancheff +W: http://seagate.com +W: https://github.com/Seagate/ZDM-Device-Mapper +L: linux-bl...@vger.kernel.org +S: Maintained +F: include/linux/blkzoned_api.h +F: include/uapi/linux/blkzoned_api.h + ZBUD COMPRESSED PAGE ALLOCATOR M: Seth Jennings L: linux...@kvack.org diff --git a/block/blk-lib.c b/block/blk-lib.c index 083e56f..e92bd56 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -266,3 +266,97 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask); } EXPORT_SYMBOL(blkdev_issue_zeroout); + +/** + * blkdev_issue_zone_report - queue a report zones operation + * @bdev: target blockdev + * @op_flags: extra bio rw flags. If unsure, use 0. + * @sector:starting sector (report will include this sector). + * @opt: See: zone_report_option, default is 0 (all zones). + * @page: one or more contiguous pages. + * @pgsz: up to size of page in bytes, size of report. + * @gfp_mask: memory allocation flags (for bio_alloc) + * + * Description: + *Issue a zone report request for the sectors in question. 
+ */ +int blkdev_issue_zone_report(struct block_device *bdev, unsigned int op_flags, +sector_t sector, u8 opt, struct page *page, +size_t pgsz, gfp_t gfp_mask) +{ + struct bdev_zone_report *conv = page_address(page); + struct bio *bio; + unsigned int nr_iovecs = 1; + int ret = 0; + + if (pgsz < (sizeof(struct bdev_zone_report) + + sizeof(struct bdev_zone_descriptor))) + return -EINVAL; + + bio = bio_alloc(gfp_mask, nr_iovecs); + if (!bio) + return -ENOMEM; + + conv->
[PATCH v8 0/2 RESEND] Block layer support ZAC/ZBC commands
(RESENDING to include f2fs, fs-devel and dm-devel) Hi Jens, This series is based on Linus' v4.8-rc2 branch. As Host Aware drives are becoming available we would like to be able to make use of such drives. This series is also intended to be suitable for use by Host Managed drives. ZBC [and ZAC] drives add new commands for discovering and working with Zones. Part one of this series expands the bio/request reserved op size from 3 to 4 bits and then adds op codes for each of the ZBC commands: Report zones, close zone, finish zone, open zone and reset zone. Part two of this series deals with integrating these new bio/request op's with Hannes' zone cache. This extends the ZBC support up to the block layer allowing direct control by file systems or device mapper targets. Also, by deferring the zone handling to the authoritative subsystem there is an overall lower memory usage for holding the active zone information, as well as clarifying the responsible party for maintaining the write pointer for each active zone. By way of example, a DM target may have several writes in progress. The sector (or lba) for those writes will each depend on the previous write. While the drive's write pointer will be updated as writes are completed, the DM target will be maintaining both where the next write should be scheduled from and where the write pointer is based on writes completed w/o errors. Knowing the drive zone topology enables DM targets and file systems to extend their block allocation schemes and issue write pointer resets (or discards) that are zone aligned. A perhaps non-obvious approach is that a conventional drive will return a zone report descriptor with a single large conventional zone. This is intended to allow a collection of zoned and non-zoned media to be stitched together to provide a file system with a zoned device with conventional space mapped to where it is useful.
Patches for util-linux can be found here: g...@github.com:stancheff/util-linux.git v2.28.1+biof https://github.com/stancheff/util-linux/tree/v2.28.1%2Bbiof This patch is available here: https://github.com/stancheff/linux/tree/v4.8-rc2%2Bbiof.v8 g...@github.com:stancheff/linux.git v4.8-rc2+biof.v8 v8: - Changed zone report to default to reading from zone cache. - Changed ioctl for zone commands to support forcing a query or command to be sent to media. - Fixed report zones copy to user to work when HARDENED_USERCOPY is enabled v7: - Initial support for Hannes' zone cache. v6: - Fix page alloc to include DMA flag for ioctl. v5: - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent - In blk-lib fix documentation v4: - Rebase on linux-next tag next-20160617. - Change bio flags to bio op's - Dropped ata16 hackery V3: - Rebase on Mike Cristie's separate bio operations - Update blkzoned_api.h to include report zones PARTIAL bit. - Use zoned report reserved bit for ata-passthrough flag. V2: - Changed bi_rw to op_flags clarify sepeartion of bio op from flags. - Fixed memory leak in blkdev_issue_zone_report failing to put_bio(). - Documented opt in blkdev_issue_zone_report. - Moved include/uapi/linux/fs.h changes to patch 3 - Fixed commit message for first patch in series. Shaun Tancheff (2): Add bio/request flags to issue ZBC/ZAC commands Add ioctl to issue ZBC/ZAC commands via block layer MAINTAINERS | 9 ++ block/blk-lib.c | 94 + block/ioctl.c | 149 +++ drivers/scsi/sd.c | 121 ++ drivers/scsi/sd.h | 1 + include/linux/bio.h | 8 +- include/linux/blk_types.h | 7 +- include/linux/blkdev.h| 1 + include/linux/blkzoned_api.h | 25 + include/uapi/linux/Kbuild | 1 + include/uapi/linux/blkzoned_api.h | 210 ++ include/uapi/linux/fs.h | 1 + 12 files changed, 625 insertions(+), 2 deletions(-) create mode 100644 include/linux/blkzoned_api.h create mode 100644 include/uapi/linux/blkzoned_api.h -- 2.9.3
[PATCH v6 3/4 RESEND] SCT Write Same / DSM Trim
Correct handling of devices with a sector_size other than 512 bytes.

In the case of a 4Kn device sector_size it is possible to describe a
much larger DSM Trim than the current fixed default of 512 bytes. This
patch assumes the minimum descriptor is sector_size and fills out the
descriptor accordingly.

The ACS-2 specification is quite clear that the DSM command payload is
sized as a number of 512 byte transfers, so a 4Kn device will operate
correctly without this patch.

Signed-off-by: Shaun Tancheff
---
v5:
 - Reshuffled description.
 - Added support for a sector_size descriptor other than 512 bytes.

 drivers/ata/libata-scsi.c | 85 +++
 1 file changed, 57 insertions(+), 28 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index ebf1a04..37f456e 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3283,7 +3283,7 @@ static unsigned int ata_scsi_pass_thru(struct ata_queued_cmd *qc)
 /**
  * ata_format_dsm_trim_descr() - SATL Write Same to DSM Trim
  * @cmd: SCSI command being translated
- * @num: Maximum number of entries (nominally 64).
+ * @trmax: Maximum number of entries that will fit in sector_size bytes.
  * @sector: Starting sector
  * @count: Total Range of request in logical sectors
  *
@@ -3298,63 +3298,80 @@ static unsigned int ata_scsi_pass_thru(struct ata_queued_cmd *qc)
  * LBA's should be sorted order and not overlap.
  *
  * NOTE: this is the same format as ADD LBA(S) TO NV CACHE PINNED SET
+ *
+ * Return: Number of bytes copied into sglist.
 */
-static unsigned int ata_format_dsm_trim_descr(struct scsi_cmnd *cmd, u32 num,
-					      u64 sector, u32 count)
+static size_t ata_format_dsm_trim_descr(struct scsi_cmnd *cmd, u32 trmax,
+					u64 sector, u32 count)
 {
-	__le64 *buffer;
-	u32 i = 0, used_bytes;
+	struct scsi_device *sdp = cmd->device;
+	size_t len = sdp->sector_size;
+	size_t r;
+	__le64 *buf;
+	u32 i = 0;
 	unsigned long flags;
 
-	BUILD_BUG_ON(512 > ATA_SCSI_RBUF_SIZE);
+	WARN_ON(len > ATA_SCSI_RBUF_SIZE);
+
+	if (len > ATA_SCSI_RBUF_SIZE)
+		len = ATA_SCSI_RBUF_SIZE;
 
 	spin_lock_irqsave(&ata_scsi_rbuf_lock, flags);
-	buffer = ((void *)ata_scsi_rbuf);
-	while (i < num) {
+	buf = ((void *)ata_scsi_rbuf);
+	memset(buf, 0, len);
+	while (i < trmax) {
 		u64 entry = sector |
 			((u64)(count > 0x ? 0x : count) << 48);
-		buffer[i++] = __cpu_to_le64(entry);
+		buf[i++] = __cpu_to_le64(entry);
 		if (count <= 0x)
 			break;
 		count -= 0x;
 		sector += 0x;
 	}
-
-	used_bytes = ALIGN(i * 8, 512);
-	memset(buffer + i, 0, used_bytes - i * 8);
-	sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), buffer, 512);
+	r = sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), buf, len);
 	spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags);
 
-	return used_bytes;
+	return r;
 }
 
 /**
  * ata_format_dsm_trim_descr() - SATL Write Same to ATA SCT Write Same
  * @cmd: SCSI command being translated
  * @lba: Starting sector
- * @num: Number of logical sectors to be zero'd.
+ * @num: Number of sectors to be zero'd.
  *
- * Rewrite the WRITE SAME descriptor to be an SCT Write Same formatted
+ * Rewrite the WRITE SAME payload to be an SCT Write Same formatted
  * descriptor.
  * NOTE: Writes a pattern (0's) in the foreground.
- * Large write-same requents can timeout.
+ *
+ * Return: Number of bytes copied into sglist.
 */
-static void ata_format_sct_write_same(struct scsi_cmnd *cmd, u64 lba, u64 num)
+static size_t ata_format_sct_write_same(struct scsi_cmnd *cmd, u64 lba, u64 num)
 {
-	u16 *sctpg;
+	struct scsi_device *sdp = cmd->device;
+	size_t len = sdp->sector_size;
+	size_t r;
+	u16 *buf;
 	unsigned long flags;
 
 	spin_lock_irqsave(&ata_scsi_rbuf_lock, flags);
-	sctpg = ((void *)ata_scsi_rbuf);
+	buf = ((void *)ata_scsi_rbuf);
+
+	put_unaligned_le16(0x0002, &buf[0]);  /* SCT_ACT_WRITE_SAME */
+	put_unaligned_le16(0x0101, &buf[1]);  /* WRITE PTRN FG */
+	put_unaligned_le64(lba, &buf[2]);
+	put_unaligned_le64(num, &buf[6]);
+	put_unaligned_le32(0u, &buf[10]);     /* pattern */
+
+	WARN_ON(len > ATA_SCSI_RBUF_SIZE);
 
-	put_unaligned_le16(0x0002, &sctpg[0]); /* SCT_ACT_WRITE_SAME */
-	put_unaligned_le16(0x0101, &sctpg[1]); /* WRITE PTRN FG */
-	put_unaligned_le64(lba, &sctpg[2]);
-	put_unaligned_le64(num, &sctpg[6]);
-	put_unaligned_le32(0u, &sctpg[10]);
+	if
Re: [PATCH v2 2/4] On Discard either do Reset WP or Write Same
On Mon, Aug 22, 2016 at 8:25 PM, Damien Le Moal wrote:
>
> Shaun,
>
> On 8/23/16 09:22, Shaun Tancheff wrote:
>> On Mon, Aug 22, 2016 at 6:57 PM, Damien Le Moal
>> wrote:
>> Also you may note that in my patch to get Host Aware working
>> with the zone cache I do not include the runt zone in the cache.
>
> Why not ? The RB-tree will handle it just fine (the insert and lookup
> code as Hannes had them was not relying on a constant zone size).

A good point. I didn't pay too much attention while bringing this
forward. I think a few of my hacks may be pointless now. I'll try to
rework it and get rid of the runt check.

>> So as it sits I need this fallback otherwise doing blkdiscard over
>> the whole device ends in an error, as well as mkfs.f2fs et al.
>
> Got it, but I do not see a problem with including it. I have not checked
> the code, but the split of a big discard call into "chunks" should be
> already handling the last chunk and make sure that the operation does
> not exceed the device capacity (in any case, that's easy to fix in the
> sd_zbc_setup_discard code).

Yes, I agree the split of big discards does handle the last chunk
correctly.

>>> Some 10TB host managed disks out there have 1% conventional zone space,
>>> that is 100GB of capacity. When issuing a "reset all", doing a write
>>> same in these zones will take forever... If the user really wants zeroes
>>> in those zones, let it issue a zeroout.
>>>
>>> I think that it would a better choice to simply not report
>>> discard_zeroes_data as true and do nothing for conventional zones reset.
>>
>> I think that would be unfortunate for Host Managed but I think it's
>> the right choice for Host Aware at this time. So either we base
>> it on disk type or we have some other config flag added to sysfs.
>
> I do not see any difference between host managed and host aware. Both
> define the same behavior for reset, and both end up in a NOP for
> conventional zone reset (no data "erasure" required by the standard).
> For write pointer zones, reading unwritten LBAs returns the
> initialization pattern, with the exception of host-managed disks with
> the URSWRZ bit set to 0. But that case is covered in sd.c, so the
> behavior is consistent across all models. So why forcing data zeroing
> when the standards do not mandate it ?

Well, you do have a point. It appears to be only mkfs and similar tools
that are really utilizing discard zeroes data at the moment.

I did a quick test:

mkfs -t ext4 -b 4096 -g 32768 -G 32 \
 -E lazy_itable_init=0,lazy_journal_init=0,offset=0,num_backup_sb=0,packed_meta_blocks=1,discard \
 -O flex_bg,extent,sparse_super2

 - discard zeroes data true  - 3 minutes
 - discard zeroes data false - 6 minutes

So for the smaller conventional space on the current HA drive there is
some advantage to enabling discard zeroes data. However for a larger
conventional space you are correct, the overall impact is worse
performance.

For some reason I had been assuming that some file systems used or
relied on discard zeroes data during normal operation. Now that I am
looking for that I don't seem to be finding any evidence of it, so
aside from mkfs I don't have as good an argument for discard zeroes
data as I thought I did.

Regards,
Shaun
Re: [PATCH v6 2/4] Add support for SCT Write Same
On Tue, Aug 23, 2016 at 5:37 AM, Tom Yan wrote: > On 22 August 2016 at 04:23, Shaun Tancheff wrote: >> +/** >> + * ata_format_dsm_trim_descr() - SATL Write Same to ATA SCT Write Same >> + * @cmd: SCSI command being translated >> + * @lba: Starting sector >> + * @num: Number of logical sectors to be zero'd. >> + * >> + * Rewrite the WRITE SAME descriptor to be an SCT Write Same formatted >> + * descriptor. >> + * NOTE: Writes a pattern (0's) in the foreground. >> + * Large write-same requents can timeout. >> + */ >> +static void ata_format_sct_write_same(struct scsi_cmnd *cmd, u64 lba, u64 >> num) >> +{ >> + u16 *sctpg; >> + unsigned long flags; >> + >> + spin_lock_irqsave(&ata_scsi_rbuf_lock, flags); >> + sctpg = ((void *)ata_scsi_rbuf); > > Because ata_scsi_rbuf is of a fixed size of ATA_SCSI_RBUF_SIZE. > > #define ATA_SCSI_RBUF_SIZE 4096 > ... > static u8 ata_scsi_rbuf[ATA_SCSI_RBUF_SIZE]; > >> + >> + put_unaligned_le16(0x0002, &sctpg[0]); /* SCT_ACT_WRITE_SAME */ >> + put_unaligned_le16(0x0101, &sctpg[1]); /* WRITE PTRN FG */ >> + put_unaligned_le64(lba, &sctpg[2]); >> + put_unaligned_le64(num, &sctpg[6]); >> + put_unaligned_le32(0u, &sctpg[10]); >> + >> + sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), sctpg, >> 512); > > You have no reason to use 512 here instead of ATA_SCSI_RBUF_SIZE this time. Ah .. because SCT Write Same is a fixed 512 byte transfer? Ah .. because I only have 512 bytes to copy? >> + spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags); >> +}
Re: [PATCH v6 1/4] libata: Safely overwrite attached page in WRITE SAME xlat
On Tue, Aug 23, 2016 at 2:53 AM, Tom Yan wrote:
> On 23 August 2016 at 06:20, Shaun Tancheff wrote:
>> On Tue, Aug 23, 2016 at 12:26 AM, Tom Yan wrote:
>>
>>> It is always 1 merely because we prefer sticking to that. Say we want
>>> to enable multi-block payload now, it won't be 1 anymore.
>>
>> Sorry, I though that DSM TRIM is using 512 bytes here because
>> WRITE_SAME_16 has a payload of a single logical sector.
>
> Nope, SCSI Write Same commands does not have payload (or in SCSI
> terms, parameter list / data-out buffer).

Hmm. Not sure about T10, but that's not how I read
sd_setup_write_same_cmnd(); it always sets up a transfer of sector_size
for scsi_init_io(). As I understand things, this is how the cmd's
sglist is populated and why this should be using sg_copy_from_buffer()
rather than open coding to the page.

-- 
Shaun Tancheff
Re: [PATCH v6 1/4] libata: Safely overwrite attached page in WRITE SAME xlat
On Tue, Aug 23, 2016 at 12:26 AM, Tom Yan wrote:
> It is always 1 merely because we prefer sticking to that. Say we want
> to enable multi-block payload now, it won't be 1 anymore.

Sorry, I thought that DSM TRIM is using 512 bytes here because
WRITE_SAME_16 has a payload of a single logical sector.

> Also note that the payload is NOT always fully-filled.

Oh, I was considering filling the remainder of the buffer with 0's
using memset(), as you described, to be "fully-filled" in this case.
Sorry for the confusion.

-- 
Shaun Tancheff
Re: [PATCH v6 2/4] Add support for SCT Write Same
On Tue, Aug 23, 2016 at 12:55 AM, Tom Yan wrote:
> I am really not going to tell the whole story here again. We have a
> really long discussion on whether we should advertise Maximum Write
> Same Length for SCT Write Same, and the value we should advertise.
> (Didn't we come to an conclusion on that as well?)

I had thought the conclusion was to leave b0 as is and let the WS10
limit take effect, per Martin.

-- 
Shaun Tancheff
Re: [PATCH v6 1/4] libata: Safely overwrite attached page in WRITE SAME xlat
On Mon, Aug 22, 2016 at 6:49 PM, Tom Yan wrote:
> The only 512 I can see in the old code is the one in:
>
>> - used_bytes = ALIGN(i * 8, 512);
>
> where the alignment is necessary because 'used_bytes' will be returned
> as 'size', which will be used for setting the number of 512-byte block
> payload when we construct the ATA taskfile / command. It may NOT be a
> good idea to replace it with ATA_SECT_SIZE. Some comment could be
> nice.

Not sure I agree with that analysis. Could just as well assign it
directly, as ALIGN() is just superfluous here. Later the count is
always 512/512 -> 1. Always.

"i" is used here to limit the number of bytes that need to be memset()
after filling the payload. Personally I think memset is fast enough
that it is better to do it before rather than after, but I figure if
the code works let it be.

> So I don't think it makes any sense to check ATA_SCSI_RBUF_SIZE
> against 512 here.

Again, not sure I agree, but I don't really care on that point. Just
many years of defensive coding.

> On 22 August 2016 at 22:00, Shaun Tancheff wrote:
>> Because this is re-using the response buffer in a new way.
>> Such reuse could be a surprise to someone refactoring that
>> code, although it's pretty unlikely. The build bug on
>> gives some level of confidence that it won't go unnoticed.
>> It also codifies the assumption that we can write 512 bytes
>> to the global ata_scsi_rbuf buffer.
>>
>> As to using a literal 512, I was just following what was here
>> before.
>>
>> It looks like there is a ATA_SECT_SIZE that can replace most
>> of the literal 512's in here though. That might be a nice cleanup
>> overall. Not sure it belongs here though.
>>
>
>>>> + used_bytes = ALIGN(i * 8, 512);
>>>> + memset(buffer + i, 0, used_bytes - i * 8);
>>>> + sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), buffer, 512);
>
> And I don't think you have a reason to use 512 here either. It appears
> to me that you should use ATA_SCSI_RBUF_SIZE instead (see
> ata_scsi_rbuf_put in libata-scsi). If not, it should probably be a
> _derived_ value (e.g. from `i`) that tells the _actual_ size of
> `buffer`.

Nope. We *must* copy the whole 512 byte payload into the sglist.
Otherwise what was the point of the memset to clear out any cruft?
Failing to move the whole payload into place could leave whatever
garbage is in the buffer to be interpreted as an actual trim and do
real damage.

I certainly can't use ATA_SCSI_RBUF_SIZE because the payload attached
to the cmd need only be 512 bytes. Trying to copy in 4k is going to
cause bad things when you check the return from sg_copy_from_buffer()
and notice you failed to copy in your payload.

> Again, note that even when we prefer to stick with _one_ 512-byte
> block TRIM payload, we should probably NEVER _hard code_ such limit
> (for it's really ugly and unnecessary) in functions like this. All we

The interface requires well formed 512 byte chunks, so we at least have
to do enough to ensure that we work in multiples of 512. Since 512 is
all over the spec for this type of thing I think it would be reasonable
to have a define or enum; if you don't think reusing ATA_SECT_SIZE is
good, maybe something like ATA_CMD_XFER_SIZE?

> need to do is to advertise the limit (or lower) as Maximum Write
> Length, and make sure our SATL works as per SBC that it will reject
> SCSI Write Same commands that is "larger" than that.

>>>> + spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags);
>>>> +
>>>> + return used_bytes;
>>>> +}

-- 
Shaun Tancheff
Re: [PATCH v6 2/4] Add support for SCT Write Same
On Mon, Aug 22, 2016 at 6:09 PM, Tom Yan wrote:
> I am not sure about what you mean here. Rejecting SCSI Write Same
> commands that has its "number of blocks" field set to a value higher
> than the device's reported Maximum Write Same Length is only natural
> and mandated by SBC. We have no reason (even if it is practically not
> a must) not to do it while we are implementing a SCSI-ATA Translation
> Layer here as long as we advertise Maximum Write Same Length. It does
> not matter here whether the command ends up being translated to SCT
> Write Same or TRIM.
>
> How high or how lower the limit should be advertised has nothing to do
> with the checking.
>
> FWIW, letting the SCSI/block layer fall back with SD_MAX_WS10_BLOCKS
> does NOT count as advertising Maximum Write Same Length, that's why we
> may or may not (in terms of SBC) check n_block against it if we are
> really gonna leave ata_scsiop_inq_b0 in libata-scsi untouched.

Sorry, I'm still a bit confused.

SCT Write Same does not have a limit; it takes a u64 count of logical
sectors. Any limit specified is smaller, based on other parts of the
stack.

The SATL code is being used to emulate SCSI Write Same, which does have
a limited number of sectors, so falling back to the SCSI limit seems
reasonable.

So the limit that is being applied is either the current TRIM limit, or
the SCSI Write Same limit.

-- 
Shaun Tancheff
Re: [PATCH v2 2/4] On Discard either do Reset WP or Write Same
On Mon, Aug 22, 2016 at 6:57 PM, Damien Le Moal wrote: > > Shaun, > > On 8/22/16 13:31, Shaun Tancheff wrote: > [...] >> -int sd_zbc_setup_discard(struct scsi_disk *sdkp, struct request *rq, >> - sector_t sector, unsigned int num_sectors) >> +int sd_zbc_setup_discard(struct scsi_cmnd *cmd) >> { >> - struct blk_zone *zone; >> + struct request *rq = cmd->request; >> + struct scsi_device *sdp = cmd->device; >> + struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); >> + sector_t sector = blk_rq_pos(rq); >> + unsigned int nr_sectors = blk_rq_sectors(rq); >> int ret = BLKPREP_OK; >> + struct blk_zone *zone; >> unsigned long flags; >> + u32 wp_offset; >> + bool use_write_same = false; >> >> zone = blk_lookup_zone(rq->q, sector); >> - if (!zone) >> + if (!zone) { >> + /* Test for a runt zone before giving up */ >> + if (sdp->type != TYPE_ZBC) { >> + struct request_queue *q = rq->q; >> + struct rb_node *node; >> + >> + node = rb_last(&q->zones); >> + if (node) >> + zone = rb_entry(node, struct blk_zone, node); >> + if (zone) { >> + spin_lock_irqsave(&zone->lock, flags); >> + if ((zone->start + zone->len) <= sector) >> + goto out; >> + spin_unlock_irqrestore(&zone->lock, flags); >> + zone = NULL; >> + } >> + } >> return BLKPREP_KILL; >> + } > > I do not understand the point of this code here to test for the runt > zone. As long as sector is within the device maximum capacity (in 512B > unit), blk_lookup_zone will return the pointer to the zone structure > containing that sector (the RB-tree does not have any constraint > regarding zone size). The only case where NULL would be returned is if > discard is issued super early right after the disk is probed and before > the zone refresh work has completed. We can certainly protect against > that by delaying the discard. As you can see I am not including Host Managed in the runt check. Also you may note that in my patch to get Host Aware working with the zone cache I do not include the runt zone in the cache. 
So as it sits I need this fallback otherwise doing blkdiscard over the whole device ends in a error, as well as mkfs.f2fs et. al. >> spin_lock_irqsave(&zone->lock, flags); >> - >> if (zone->state == BLK_ZONE_UNKNOWN || >> zone->state == BLK_ZONE_BUSY) { >> sd_zbc_debug_ratelimit(sdkp, >> -"Discarding zone %zu state %x, >> deferring\n", >> +"Discarding zone %zx state %x, >> deferring\n", > > Sector values are usually displayed in decimal. Why use Hex here ? At > least "0x" would be needed to avoid confusion I think. Yeah, my brain is lazy about converting very large numbers to powers of 2. So it's much easier to spot zone alignment here. >> zone->start, zone->state); >> ret = BLKPREP_DEFER; >> goto out; >> @@ -406,46 +428,80 @@ int sd_zbc_setup_discard(struct scsi_disk *sdkp, >> struct request *rq, >> if (zone->state == BLK_ZONE_OFFLINE) { >> /* let the drive fail the command */ >> sd_zbc_debug_ratelimit(sdkp, >> -"Discarding offline zone %zu\n", >> +"Discarding offline zone %zx\n", >> zone->start); >> goto out; >> } >> - >> - if (!blk_zone_is_smr(zone)) { >> + if (blk_zone_is_cmr(zone)) { >> + use_write_same = true; >> sd_zbc_debug_ratelimit(sdkp, >> -"Discarding %s zone %zu\n", >> -blk_zone_is_cmr(zone) ? "CMR" : >> "unknown", >> +"Discarding CMR zone %zx\n", >> zone->start); >> - ret = BLKPREP_DONE; >> goto out; >> } > > Some 10TB host managed
Re: [PATCH v6 2/4] Add support for SCT Write Same
On Mon, Aug 22, 2016 at 3:14 PM, Tom Yan wrote: > On 23 August 2016 at 03:43, Shaun Tancheff wrote: >> >> Why would we enforce upper level limits on something that doesn't >> have any? > > If we advertise a limit in our SATL, it makes sense that we should > make sure the behaviour is consistent when we issue a write same > through the block layer / ioctl and when we issue a SCSI Write Same > command directly (e.g. with sg_write_same). IMHO that's pretty much > why SBC would mandate such behaviour as well. Breaking would be advertising a limit that is too high and failing. Advertising a lower limit and succeeding may not be ideal for all possible use cases, but it's not breaking behaviour. >> >> If the upper level, or SG_IO, chooses to set a timeout of 10 hours and >> wipe a whole disk it should be free to do so. >> > > That's why I said, "if you are going to advertise an Maximum Write Same > Length". -- Shaun Tancheff
Re: [PATCH v6 1/4] libata: Safely overwrite attached page in WRITE SAME xlat
On Mon, Aug 22, 2016 at 4:20 PM, Tom Yan wrote: > On 22 August 2016 at 04:23, Shaun Tancheff wrote: >> Safely overwriting the attached page to ATA format from the SCSI formatted >> variant. >> >> Signed-off-by: Shaun Tancheff >> --- >> v6: >> - Fix bisect bug reported by Tom Yan >> - Change to use sg_copy_from_buffer as per Christoph Hellwig >> v5: >> - Added prep patch to work with non-page aligned scatterlist pages >>and use kmap_atomic() to lock page during modification. >> >> drivers/ata/libata-scsi.c | 56 >> ++- >> include/linux/ata.h | 26 -- >> 2 files changed, 51 insertions(+), 31 deletions(-) >> >> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c >> index e207b33..7990cb2 100644 >> --- a/drivers/ata/libata-scsi.c >> +++ b/drivers/ata/libata-scsi.c >> @@ -3282,6 +3282,54 @@ static unsigned int ata_scsi_pass_thru(struct >> ata_queued_cmd *qc) >> return 1; >> } >> >> +/** >> + * ata_format_dsm_trim_descr() - SATL Write Same to DSM Trim >> + * @cmd: SCSI command being translated >> + * @num: Maximum number of entries (nominally 64). >> + * @sector: Starting sector >> + * @count: Total Range of request >> + * >> + * Rewrite the WRITE SAME descriptor to be a DSM TRIM little-endian >> formatted >> + * descriptor. >> + * >> + * Upto 64 entries of the format: >> + * 63:48 Range Length >> + * 47:0 LBA >> + * >> + * Range Length of 0 is ignored. >> + * LBA's should be sorted order and not overlap. >> + * >> + * NOTE: this is the same format as ADD LBA(S) TO NV CACHE PINNED SET >> + */ >> +static unsigned int ata_format_dsm_trim_descr(struct scsi_cmnd *cmd, u32 >> num, >> + u64 sector, u32 count) >> +{ >> + __le64 *buffer; >> + u32 i = 0, used_bytes; >> + unsigned long flags; >> + >> + BUILD_BUG_ON(512 > ATA_SCSI_RBUF_SIZE); > > Why do we want to check "512" (a literal number) against > ATA_SCSI_RBUF_SIZE here? Because this is re-using the response buffer in a new way. 
Such reuse could be a surprise to someone refactoring that code, although it's pretty unlikely. The build bug on gives some level of confidence that it won't go unnoticed. It also codifies the assumption that we can write 512 bytes to the global ata_scsi_rbuf buffer. As to using a literal 512, I was just following what was here before. It looks like there is a ATA_SECT_SIZE that can replace most of the literal 512's in here though. That might be a nice cleanup overall. Not sure it belongs here though. >> + >> + spin_lock_irqsave(&ata_scsi_rbuf_lock, flags); >> + buffer = ((void *)ata_scsi_rbuf); >> + while (i < num) { >> + u64 entry = sector | >> + ((u64)(count > 0x ? 0x : count) << 48); >> + buffer[i++] = __cpu_to_le64(entry); >> + if (count <= 0x) >> + break; >> + count -= 0x; >> + sector += 0x; >> + } >> + >> + used_bytes = ALIGN(i * 8, 512); >> + memset(buffer + i, 0, used_bytes - i * 8); >> + sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), buffer, >> 512); >> + spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags); >> + >> + return used_bytes; >> +} >> + >> static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) >> { >> struct ata_taskfile *tf = &qc->tf; >> @@ -3290,8 +3338,8 @@ static unsigned int ata_scsi_write_same_xlat(struct >> ata_queued_cmd *qc) >> const u8 *cdb = scmd->cmnd; >> u64 block; >> u32 n_block; >> + const u32 trmax = ATA_MAX_TRIM_RNUM; >> u32 size; >> - void *buf; >> u16 fp; >> u8 bp = 0xff; >> >> @@ -3319,10 +3367,8 @@ static unsigned int ata_scsi_write_same_xlat(struct >> ata_queued_cmd *qc) >> if (!scsi_sg_count(scmd)) >> goto invalid_param_len; >> >> - buf = page_address(sg_page(scsi_sglist(scmd))); >> - >> - if (n_block <= 65535 * ATA_MAX_TRIM_RNUM) { >> - size = ata_set_lba_range_entries(buf, ATA_MAX_TRIM_RNUM, >> block, n_block); >> + if (n_block <= 0x * trmax) { >> + size = ata_f
Re: [PATCH v6 1/4] libata: Safely overwrite attached page in WRITE SAME xlat
On Mon, Aug 22, 2016 at 2:27 PM, Tom Yan wrote: > On 22 August 2016 at 12:23, Shaun Tancheff wrote: >> Safely overwriting the attached page to ATA format from the SCSI formatted >> variant. >> >> Signed-off-by: Shaun Tancheff >> --- >> v6: >> - Fix bisect bug reported by Tom Yan >> - Change to use sg_copy_from_buffer as per Christoph Hellwig >> v5: >> - Added prep patch to work with non-page aligned scatterlist pages >>and use kmap_atomic() to lock page during modification. >> >> drivers/ata/libata-scsi.c | 56 >> ++- >> include/linux/ata.h | 26 -- >> 2 files changed, 51 insertions(+), 31 deletions(-) >> >> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c >> index e207b33..7990cb2 100644 >> --- a/drivers/ata/libata-scsi.c >> +++ b/drivers/ata/libata-scsi.c >> @@ -3282,6 +3282,54 @@ static unsigned int ata_scsi_pass_thru(struct >> ata_queued_cmd *qc) >> return 1; >> } >> >> +/** >> + * ata_format_dsm_trim_descr() - SATL Write Same to DSM Trim >> + * @cmd: SCSI command being translated >> + * @num: Maximum number of entries (nominally 64). >> + * @sector: Starting sector >> + * @count: Total Range of request >> + * >> + * Rewrite the WRITE SAME descriptor to be a DSM TRIM little-endian >> formatted >> + * descriptor. >> + * >> + * Upto 64 entries of the format: >> + * 63:48 Range Length >> + * 47:0 LBA >> + * >> + * Range Length of 0 is ignored. >> + * LBA's should be sorted order and not overlap. >> + * >> + * NOTE: this is the same format as ADD LBA(S) TO NV CACHE PINNED SET >> + */ >> +static unsigned int ata_format_dsm_trim_descr(struct scsi_cmnd *cmd, u32 >> num, >> + u64 sector, u32 count) >> +{ >> + __le64 *buffer; >> + u32 i = 0, used_bytes; >> + unsigned long flags; >> + >> + BUILD_BUG_ON(512 > ATA_SCSI_RBUF_SIZE); >> + >> + spin_lock_irqsave(&ata_scsi_rbuf_lock, flags); >> + buffer = ((void *)ata_scsi_rbuf); >> + while (i < num) { >> + u64 entry = sector | >> + ((u64)(count > 0x ? 
0x : count) << 48); >> + buffer[i++] = __cpu_to_le64(entry); >> + if (count <= 0x) >> + break; >> + count -= 0x; >> + sector += 0x; >> + } >> + >> + used_bytes = ALIGN(i * 8, 512); >> + memset(buffer + i, 0, used_bytes - i * 8); >> + sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), buffer, >> 512); >> + spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags); >> + >> + return used_bytes; >> +} >> + >> static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) >> { >> struct ata_taskfile *tf = &qc->tf; >> @@ -3290,8 +3338,8 @@ static unsigned int ata_scsi_write_same_xlat(struct >> ata_queued_cmd *qc) >> const u8 *cdb = scmd->cmnd; >> u64 block; >> u32 n_block; >> + const u32 trmax = ATA_MAX_TRIM_RNUM; > > This does not seem like a good idea. > >> u32 size; >> - void *buf; >> u16 fp; >> u8 bp = 0xff; >> >> @@ -3319,10 +3367,8 @@ static unsigned int ata_scsi_write_same_xlat(struct >> ata_queued_cmd *qc) >> if (!scsi_sg_count(scmd)) >> goto invalid_param_len; >> >> - buf = page_address(sg_page(scsi_sglist(scmd))); >> - >> - if (n_block <= 65535 * ATA_MAX_TRIM_RNUM) { >> - size = ata_set_lba_range_entries(buf, ATA_MAX_TRIM_RNUM, >> block, n_block); >> + if (n_block <= 0x * trmax) { > > Note that the limit here would always be the advertised Maximum Write > Same Length (ata_scsiop_inq_b0). It would be best for them to look the > same. Besides, it doesn't seem necessary to create this trmax anyway. It is entirely a style thing. I tend to prefer hex when describing an interface that specifies number of bits. The trmax is just to keep the following line under 80 chars. I am not tied to either. I can change it if people really find it confusing. >> + size = ata_format_dsm_trim_descr(scmd, trmax, block, >> n_block); >> } else { >> fp = 2; >>
Re: [PATCH v6 2/4] Add support for SCT Write Same
On Mon, Aug 22, 2016 at 2:20 PM, Tom Yan wrote: > On 22 August 2016 at 12:23, Shaun Tancheff wrote: >> SATA drives may support write same via SCT. This is useful >> for setting the drive contents to a specific pattern (0's). >> >> Translate a SCSI WRITE SAME 16 command to be either a DSM TRIM >> command or an SCT Write Same command. >> >> Based on the UNMAP flag: >> - When set translate to DSM TRIM >> - When not set translate to SCT Write Same >> >> Signed-off-by: Shaun Tancheff >> --- >> v6: >> - Change to use sg_copy_from_buffer as per Christoph Hellwig >> v5: >> - Addressed review comments >> - Report support for ZBC only for zoned devices. >> - kmap page during rewrite >> - Fix unmap set to require trim or error, if not unmap then sct write >>same or error. >> v4: >> - Added partial MAINTENANCE_IN opcode simulation >> - Dropped all changes in drivers/scsi/* >> - Changed to honor the UNMAP flag -> TRIM, no UNMAP -> SCT. >> v3: >> - Demux UNMAP/TRIM from WRITE SAME >> v2: >> - Remove fugly ata hacking from sd.c >> >> drivers/ata/libata-scsi.c | 199 >> +++--- >> include/linux/ata.h | 43 ++ >> 2 files changed, 213 insertions(+), 29 deletions(-) >> >> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c >> index 7990cb2..ebf1a04 100644 >> --- a/drivers/ata/libata-scsi.c >> +++ b/drivers/ata/libata-scsi.c >> @@ -1159,8 +1159,6 @@ static void ata_scsi_sdev_config(struct scsi_device >> *sdev) >> { >> sdev->use_10_for_rw = 1; >> sdev->use_10_for_ms = 1; >> - sdev->no_report_opcodes = 1; >> - sdev->no_write_same = 1; >> >> /* Schedule policy is determined by ->qc_defer() callback and >> * it needs to see every deferred qc. Set dev_blocked to 1 to >> @@ -3287,7 +3285,7 @@ static unsigned int ata_scsi_pass_thru(struct >> ata_queued_cmd *qc) >> * @cmd: SCSI command being translated >> * @num: Maximum number of entries (nominally 64). 
>> * @sector: Starting sector >> - * @count: Total Range of request >> + * @count: Total Range of request in logical sectors >> * >> * Rewrite the WRITE SAME descriptor to be a DSM TRIM little-endian >> formatted >> * descriptor. >> @@ -3330,6 +3328,45 @@ static unsigned int ata_format_dsm_trim_descr(struct >> scsi_cmnd *cmd, u32 num, >> return used_bytes; >> } >> >> +/** >> + * ata_format_dsm_trim_descr() - SATL Write Same to ATA SCT Write Same >> + * @cmd: SCSI command being translated >> + * @lba: Starting sector >> + * @num: Number of logical sectors to be zero'd. >> + * >> + * Rewrite the WRITE SAME descriptor to be an SCT Write Same formatted >> + * descriptor. >> + * NOTE: Writes a pattern (0's) in the foreground. >> + * Large write-same requents can timeout. >> + */ >> +static void ata_format_sct_write_same(struct scsi_cmnd *cmd, u64 lba, u64 >> num) >> +{ >> + u16 *sctpg; >> + unsigned long flags; >> + >> + spin_lock_irqsave(&ata_scsi_rbuf_lock, flags); >> + sctpg = ((void *)ata_scsi_rbuf); >> + >> + put_unaligned_le16(0x0002, &sctpg[0]); /* SCT_ACT_WRITE_SAME */ >> + put_unaligned_le16(0x0101, &sctpg[1]); /* WRITE PTRN FG */ >> + put_unaligned_le64(lba, &sctpg[2]); >> + put_unaligned_le64(num, &sctpg[6]); >> + put_unaligned_le32(0u, &sctpg[10]); >> + >> + sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), sctpg, >> 512); >> + spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags); >> +} >> + >> +/** >> + * ata_scsi_write_same_xlat() - SATL Write Same to ATA SCT Write Same >> + * @qc: Command to be translated >> + * >> + * Translate a SCSI WRITE SAME command to be either a DSM TRIM command or >> + * an SCT Write Same command. 
>> + * Based on WRITE SAME has the UNMAP flag >> + * When set translate to DSM TRIM >> + * When clear translate to SCT Write Same >> + */ >> static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) >> { >> struct ata_taskfile *tf = &qc->tf; >> @@ -3342,6 +3379,7 @@ static unsigned int ata_scsi_write_same_xlat(struct >> ata_queued_cmd *qc) >> u32 size; >> u16 fp; >> u8 bp = 0xff; >> +
Re: [PATCH v6 3/4] SCT Write Same / DSM Trim
On Mon, Aug 22, 2016 at 12:02 PM, Tom Yan wrote: > On 22 August 2016 at 15:04, Shaun Tancheff wrote: >> On Mon, Aug 22, 2016 at 3:33 AM, Tom Yan wrote: >>> On 22 August 2016 at 08:31, Tom Yan wrote: > Since I have no experience with SCT Write Same at all, and neither do > I own any spinning HDD at all, I cannot firmly suggest what to do. All > I can suggest is: should we decrease it per sector size? Or would 2G > per command still be too large to avoid timeout? Timeout for WS is 120 seconds so we should be fine there. The number to look for is the: Max. Sustained Transfer Rate OD (MB/s): 190 8TB (180 5TB) Which means the above drives should complete a 2G write in about 10 to 11 seconds. If these were 4Kn drives and we allowed a 16G max then it would be 80-90 seconds, assuming the write speed didn't get any better. So holding the maximum to around 2G is probably the best overall, in my opinion. -- Shaun Tancheff On Mon, Aug 22, 2016 at 12:02 PM, Tom Yan wrote: > On 22 August 2016 at 15:04, Shaun Tancheff wrote: >> On Mon, Aug 22, 2016 at 3:33 AM, Tom Yan wrote: >>> On 22 August 2016 at 08:31, Tom Yan wrote: >>>> As mentioned before, as of the latest draft of ACS-4, nothing about a >>>> larger payload size is mentioned. Conservatively speaking, it sort of >>> >>> *payload block size >>> >>>> means that we are allowing four 512-byte block payload on 4Kn device >>> >>> *eight 512-byte-block payload >>> >>>> regardless of the reported limit in the IDENTIFY DEVICE data. I am >>>> really not sure if it's a good thing to do. Doesn't seem necessary >>>> anyway, especially when our block layer does not support such a large >>>> bio size (well, yet), so each request will end up using a payload of >>>> two 512-byte blocks at max anyway. >>>> >>>> Also, it's IMHO better to do it in a seperate patch (series) after the >>>> SCT Write Same support has entered libata's repo too, because this has >>>> nothing to with it but TRIM translation. 
In case the future ACS >>>> standards has clearer/better instruction on this, it will be easier >>>> for us to revert/follow up too. >> >> I am certainly fine with dropping this patch as it is not critical to >> the reset of the series. >> >> Nothing will break if we stick with the 512 byte fixed limit. This >> is at most a prep patch for handling increased limits should >> they be reported. >> >> All it really is doing is acknowledging that any write same >> must have a payload of sector_size which can be something >> larger than 512 bytes. > > Actually I am not sure if we should hard code the limit > ata_format_dsm_trim_descr() / ata_set_lba_range_entries() at all. The > current implementation (with or without your patch) seems redundant > and unnecessary to me. > > All we need to do should be: making sure that the block limits VPD > advertises a safe Maximum Write Same Length, and reject Write Same > (16) commands that have "number of blocks" that exceeds the limit > (which is what I introduced in commit 5c79097a28c2, "libata-scsi: > reject WRITE SAME (16) with n_block that exceeds limit"). > > In that case, we don't need to hard code the limit in the > while-condition again; instead we should just make it end with the > request size, since the accepted request could never be larger than > the limit we advertise. > >> >>>> And you'll need to fix the Block Limits VPD simulation >>>> (ata_scsiop_inq_b0) too, so that it will advertise the Maximum Write >>>> Same Length dynamically as per the logical sector size, otherwise your >>>> effort will be completely in vain, even if our block layer is >>>> overhauled in the future. >> >> Martin had earlier suggested that I leave the write same defaults >> as is due to concerns with misbehaving hardware. > > It doesn't really apply in libata's anyway. 
SD_MAX_WS10_BLOCKS means > nothing to ATA drives, except for coincidentally being the same value > as ATA_MAX_SECTORS_LBA48 (which technically should have been 65536 > instead). > >> >> I think your patch adjusting the reported limits is reasonable >> enough. It seems to me we should have the hardware >> report its actual limits, for example, report what the spec >> allows. > > As you mentioned yourself before, technically SCT Write Same does not > have a limit. The only practical limit is the timeout in the SCSI > layer, so the actual bytes being (over)written is probably our only
Re: [PATCH 2/2] Migrate zone cache from RB-Tree to arrays of descriptors
On Mon, Aug 22, 2016 at 2:11 AM, Hannes Reinecke wrote: > On 08/22/2016 06:34 AM, Shaun Tancheff wrote: >> Currently the RB-Tree zone cache is fast and flexible. It does >> use a rather largish amount of ram. This model reduces the ram >> required from 120 bytes per zone to 16 bytes per zone with a >> moderate transformation of the blk_zone_lookup() api. >> >> This model is predicated on the belief that most variations >> on zoned media will follow a pattern of using collections of same >> sized zones on a single device. Similar to the pattern of erase >> blocks on flash devices being progressively larger 16K, 64K, ... >> >> The goal is to be able to build a descriptor which is memory >> efficient, performant, and flexible. >> >> Signed-off-by: Shaun Tancheff >> --- >> block/blk-core.c |2 +- >> block/blk-sysfs.c | 31 +- >> block/blk-zoned.c | 103 +++-- >> drivers/scsi/sd.c |5 +- >> drivers/scsi/sd.h |4 +- >> drivers/scsi/sd_zbc.c | 1025 >> +++- >> include/linux/blkdev.h | 82 +++- >> 7 files changed, 716 insertions(+), 536 deletions(-) > Have you measured the performance impact here? As far as actual hardware (Host Aware) goes, I am seeing the same I/O performance. I suspect it's just that below 100k iops the zone cache just isn't a bottleneck. > The main idea behind using an RB-tree is that each single element will > fit in the CPU cache; using an array will prevent that. > So we will increase the number of cache flushes, and most likely incur a > performance penalty, too. > Hence I'd rather like to see a performance measurement here before going > down that road. I think it will have to be a simulated benchmark, if that's okay. Of course I'm open to suggestions if there is something you have in mind. -- Regards, Shaun Tancheff
Re: [PATCH v6 3/4] SCT Write Same / DSM Trim
On Mon, Aug 22, 2016 at 3:33 AM, Tom Yan wrote: > On 22 August 2016 at 08:31, Tom Yan wrote: >> As mentioned before, as of the latest draft of ACS-4, nothing about a >> larger payload size is mentioned. Conservatively speaking, it sort of > > *payload block size > >> means that we are allowing four 512-byte block payload on 4Kn device > > *eight 512-byte-block payload > >> regardless of the reported limit in the IDENTIFY DEVICE data. I am >> really not sure if it's a good thing to do. Doesn't seem necessary >> anyway, especially when our block layer does not support such a large >> bio size (well, yet), so each request will end up using a payload of >> two 512-byte blocks at max anyway. >> >> Also, it's IMHO better to do it in a separate patch (series) after the >> SCT Write Same support has entered libata's repo too, because this has >> nothing to do with it but TRIM translation. In case the future ACS >> standards have clearer/better instructions on this, it will be easier >> for us to revert/follow up too. I am certainly fine with dropping this patch as it is not critical to the rest of the series. Nothing will break if we stick with the 512 byte fixed limit. This is at most a prep patch for handling increased limits should they be reported. All it really is doing is acknowledging that any write same must have a payload of sector_size which can be something larger than 512 bytes. >> And you'll need to fix the Block Limits VPD simulation >> (ata_scsiop_inq_b0) too, so that it will advertise the Maximum Write >> Same Length dynamically as per the logical sector size, otherwise your >> effort will be completely in vain, even if our block layer is >> overhauled in the future. Martin had earlier suggested that I leave the write same defaults as is due to concerns with misbehaving hardware. I think your patch adjusting the reported limits is reasonable enough. It seems to me we should have the hardware report its actual limits, for example, report what the spec allows. 
Of course there are lots of reasons to limit the absolute maximums. So in this case we are just enabling the limit to be increased but not changing the current black-listing that distrusts DSM Trim. Once we have 4Kn devices to test then we can start white-listing and see if there is an overall increase in performance. >> Please note that, since you haven't touched ata_scsiop_inq_b0 at >> all, the reported Maximum Write Same Length will be: >> >> On device with TRIM support: >> - 4194240 LOGICAL sectors per request split / command >> -- ~=2G on non-4Kn drives >> -- ~=16G on 4Kn drives >> >> On device without TRIM support: >> - 0 --> SD_MAX_WS10_BLOCKS (65535) per request split / command >> -- ~= 32M on non-4Kn drives >> -- ~=256M on 4Kn drives >> >> Even if we ignore the upper limit(s) of the block layer, do we want >> such inconsistencies? Hmm. Overall I think it is still okay, if a bit confusing. It is possible that devices which support both SCT Write Same and DSM TRIM will still TRIM faster than they can Write Same. However, the internal implementation is opaque so I can't say if Write Same is often implemented in terms of TRIM or not. I mean that's how _I_ do it [Write 1 block and map N blocks to it], but not every FTL will have come to the same conclusion. I also suspect that given the choice for most use cases that TRIM is preferred over WS when TRIM supports returning zeroed blocks. -- Shaun Tancheff
Re: [PATCH 2/2] Migrate zone cache from RB-Tree to arrays of descriptors
On Sun, Aug 21, 2016 at 11:34 PM, Shaun Tancheff wrote: > Currently the RB-Tree zone cache is fast and flexible. It does > use a rather largish amount of ram. This model reduces the ram > required from 120 bytes per zone to 16 bytes per zone with a > moderate transformation of the blk_zone_lookup() api. > > This model is predicated on the belief that most variations > on zoned media will follow a pattern of using collections of same > sized zones on a single device. Similar to the pattern of erase > blocks on flash devices being progressivly larger 16K, 64K, ... > > The goal is to be able to build a descriptor which is both memory > efficient, performant, and flexible. > > Signed-off-by: Shaun Tancheff > --- > block/blk-core.c |2 +- > block/blk-sysfs.c | 31 +- > block/blk-zoned.c | 103 +++-- > drivers/scsi/sd.c |5 +- > drivers/scsi/sd.h |4 +- > drivers/scsi/sd_zbc.c | 1025 > +++- > include/linux/blkdev.h | 82 +++- > 7 files changed, 716 insertions(+), 536 deletions(-) > > diff --git a/block/blk-core.c b/block/blk-core.c > index 3a9caf7..3b084a8 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -727,7 +727,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t > gfp_mask, int node_id) > INIT_LIST_HEAD(&q->blkg_list); > #endif > #ifdef CONFIG_BLK_DEV_ZONED > - q->zones = RB_ROOT; > + q->zones = NULL; > #endif > INIT_DELAYED_WORK(&q->delay_work, blk_delay_work); > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > index 43f441f..ecbd434 100644 > --- a/block/blk-sysfs.c > +++ b/block/blk-sysfs.c > @@ -232,36 +232,7 @@ static ssize_t queue_max_hw_sectors_show(struct > request_queue *q, char *page) > #ifdef CONFIG_BLK_DEV_ZONED > static ssize_t queue_zoned_show(struct request_queue *q, char *page) > { > - struct rb_node *node; > - struct blk_zone *zone; > - ssize_t offset = 0, end = 0; > - size_t size = 0, num = 0; > - enum blk_zone_type type = BLK_ZONE_TYPE_UNKNOWN; > - > - for (node = rb_first(&q->zones); node; node = rb_next(node)) { > - zone = 
rb_entry(node, struct blk_zone, node); > - if (zone->type != type || > - zone->len != size || > - end != zone->start) { > - if (size != 0) > - offset += sprintf(page + offset, "%zu\n", > num); > - /* We can only store one page ... */ > - if (offset + 42 > PAGE_SIZE) { > - offset += sprintf(page + offset, "...\n"); > - return offset; > - } > - size = zone->len; > - type = zone->type; > - offset += sprintf(page + offset, "%zu %zu %d ", > - zone->start, size, type); > - num = 0; > - end = zone->start + size; > - } else > - end += zone->len; > - num++; > - } > - offset += sprintf(page + offset, "%zu\n", num); > - return offset; > + return sprintf(page, "%u\n", q->zones ? 1 : 0); > } > #endif > > diff --git a/block/blk-zoned.c b/block/blk-zoned.c > index 975e863..338a1af 100644 > --- a/block/blk-zoned.c > +++ b/block/blk-zoned.c > @@ -8,63 +8,84 @@ > #include > #include > #include > -#include > +#include > > -struct blk_zone *blk_lookup_zone(struct request_queue *q, sector_t lba) > +/** > + * blk_lookup_zone() - Lookup zones > + * @q: Request Queue > + * @sector: Location to lookup > + * @start: Pointer to starting location zone (OUT) > + * @len: Pointer to length of zone (OUT) > + * @lock: Pointer to spinlock of zones in owning descriptor (OUT) > + */ > +struct blk_zone *blk_lookup_zone(struct request_queue *q, sector_t sector, > +sector_t *start, sector_t *len, > +spinlock_t **lock) > { > - struct rb_root *root = &q->zones; > - struct rb_node *node = root->rb_node; > + int iter; > + struct blk_zone *bzone = NULL; > + struct zone_wps *zi = q->zones; > + > + *start = 0; > + *len = 0; > + *lock = NULL; > + > + if (!q->zones) > + goto out; > > - while (node) { > - struct blk_zone *zone = contain
[PATCH 2/2] Migrate zone cache from RB-Tree to arrays of descriptors
Currently the RB-Tree zone cache is fast and flexible. It does use a rather largish amount of ram. This model reduces the ram required from 120 bytes per zone to 16 bytes per zone with a moderate transformation of the blk_zone_lookup() api. This model is predicated on the belief that most variations on zoned media will follow a pattern of using collections of same sized zones on a single device. Similar to the pattern of erase blocks on flash devices being progressivly larger 16K, 64K, ... The goal is to be able to build a descriptor which is both memory efficient, performant, and flexible. Signed-off-by: Shaun Tancheff --- block/blk-core.c |2 +- block/blk-sysfs.c | 31 +- block/blk-zoned.c | 103 +++-- drivers/scsi/sd.c |5 +- drivers/scsi/sd.h |4 +- drivers/scsi/sd_zbc.c | 1025 +++- include/linux/blkdev.h | 82 +++- 7 files changed, 716 insertions(+), 536 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 3a9caf7..3b084a8 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -727,7 +727,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) INIT_LIST_HEAD(&q->blkg_list); #endif #ifdef CONFIG_BLK_DEV_ZONED - q->zones = RB_ROOT; + q->zones = NULL; #endif INIT_DELAYED_WORK(&q->delay_work, blk_delay_work); diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 43f441f..ecbd434 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -232,36 +232,7 @@ static ssize_t queue_max_hw_sectors_show(struct request_queue *q, char *page) #ifdef CONFIG_BLK_DEV_ZONED static ssize_t queue_zoned_show(struct request_queue *q, char *page) { - struct rb_node *node; - struct blk_zone *zone; - ssize_t offset = 0, end = 0; - size_t size = 0, num = 0; - enum blk_zone_type type = BLK_ZONE_TYPE_UNKNOWN; - - for (node = rb_first(&q->zones); node; node = rb_next(node)) { - zone = rb_entry(node, struct blk_zone, node); - if (zone->type != type || - zone->len != size || - end != zone->start) { - if (size != 0) - offset += sprintf(page + offset, "%zu\n", 
num); - /* We can only store one page ... */ - if (offset + 42 > PAGE_SIZE) { - offset += sprintf(page + offset, "...\n"); - return offset; - } - size = zone->len; - type = zone->type; - offset += sprintf(page + offset, "%zu %zu %d ", - zone->start, size, type); - num = 0; - end = zone->start + size; - } else - end += zone->len; - num++; - } - offset += sprintf(page + offset, "%zu\n", num); - return offset; + return sprintf(page, "%u\n", q->zones ? 1 : 0); } #endif diff --git a/block/blk-zoned.c b/block/blk-zoned.c index 975e863..338a1af 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -8,63 +8,84 @@ #include #include #include -#include +#include -struct blk_zone *blk_lookup_zone(struct request_queue *q, sector_t lba) +/** + * blk_lookup_zone() - Lookup zones + * @q: Request Queue + * @sector: Location to lookup + * @start: Pointer to starting location zone (OUT) + * @len: Pointer to length of zone (OUT) + * @lock: Pointer to spinlock of zones in owning descriptor (OUT) + */ +struct blk_zone *blk_lookup_zone(struct request_queue *q, sector_t sector, +sector_t *start, sector_t *len, +spinlock_t **lock) { - struct rb_root *root = &q->zones; - struct rb_node *node = root->rb_node; + int iter; + struct blk_zone *bzone = NULL; + struct zone_wps *zi = q->zones; + + *start = 0; + *len = 0; + *lock = NULL; + + if (!q->zones) + goto out; - while (node) { - struct blk_zone *zone = container_of(node, struct blk_zone, -node); + for (iter = 0; iter < zi->wps_count; iter++) { + if (sector >= zi->wps[iter]->start_lba && + sector < zi->wps[iter]->last_lba) { + struct contiguous_wps *wp = zi->wps[iter]; + u64 index = (sector - wp->start_lba) / wp->zone_size; - if (lba < zone->start) - node = node->rb_left; - else if (lba >= zone->start + zone->len) - node = node->rb_rig
[PATCH 1/2] Move ZBC core setup to sd_zbc
Move the remaining ZBC specific code to sd_zbc.c Signed-off-by: Shaun Tancheff --- drivers/scsi/sd.c | 65 +-- drivers/scsi/sd.h | 20 ++ drivers/scsi/sd_zbc.c | 170 +++--- 3 files changed, 126 insertions(+), 129 deletions(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 9a649fa..f144df4 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -2244,68 +2244,6 @@ static int sd_read_protection_type(struct scsi_disk *sdkp, unsigned char *buffer return ret; } -static void sd_read_zones(struct scsi_disk *sdkp, unsigned char *buffer) -{ - int retval; - unsigned char *desc; - u32 rep_len; - u8 same; - u64 zone_len, lba; - - if (sdkp->zoned != 1 && sdkp->device->type != TYPE_ZBC) - /* -* Device managed or normal SCSI disk, -* no special handling required -*/ - return; - - retval = sd_zbc_report_zones(sdkp, buffer, SD_BUF_SIZE, -0, ZBC_ZONE_REPORTING_OPTION_ALL, false); - if (retval < 0) - return; - - rep_len = get_unaligned_be32(&buffer[0]); - if (rep_len < 64) { - sd_printk(KERN_WARNING, sdkp, - "REPORT ZONES report invalid length %u\n", - rep_len); - return; - } - - if (sdkp->rc_basis == 0) { - /* The max_lba field is the capacity of a zoned device */ - lba = get_unaligned_be64(&buffer[8]); - if (lba + 1 > sdkp->capacity) { - if (sdkp->first_scan) - sd_printk(KERN_WARNING, sdkp, - "Changing capacity from %zu to Max LBA+1 %zu\n", - sdkp->capacity, (sector_t) lba + 1); - sdkp->capacity = lba + 1; - } - } - - /* -* Adjust 'chunk_sectors' to the zone length if the device -* supports equal zone sizes. 
-*/ - same = buffer[4] & 0xf; - if (same > 3) { - sd_printk(KERN_WARNING, sdkp, - "REPORT ZONES SAME type %d not supported\n", same); - return; - } - /* Read the zone length from the first zone descriptor */ - desc = &buffer[64]; - zone_len = get_unaligned_be64(&desc[8]); - sdkp->unmap_alignment = zone_len; - sdkp->unmap_granularity = zone_len; - blk_queue_chunk_sectors(sdkp->disk->queue, - logical_to_sectors(sdkp->device, zone_len)); - - sd_zbc_setup(sdkp, zone_len, buffer, SD_BUF_SIZE); - sd_config_discard(sdkp, SD_ZBC_RESET_WP); -} - static void read_capacity_error(struct scsi_disk *sdkp, struct scsi_device *sdp, struct scsi_sense_hdr *sshdr, int sense_valid, int the_result) @@ -2611,7 +2549,8 @@ got_data: sdkp->physical_block_size); sdkp->device->sector_size = sector_size; - sd_read_zones(sdkp, buffer); + if (sd_zbc_config(sdkp, buffer, SD_BUF_SIZE)) + sd_config_discard(sdkp, SD_ZBC_RESET_WP); { char cap_str_2[10], cap_str_10[10]; diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index adbf3e0..fc766db 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -289,10 +289,6 @@ static inline void sd_dif_complete(struct scsi_cmnd *cmd, unsigned int a) #define SD_ZBC_WRITE_ERR 2 #ifdef CONFIG_SCSI_ZBC - -extern int sd_zbc_report_zones(struct scsi_disk *, unsigned char *, int, - sector_t, enum zbc_zone_reporting_options, bool); -extern int sd_zbc_setup(struct scsi_disk *, u64 zlen, char *buf, int buf_len); extern int sd_zbc_setup_zone_report_cmnd(struct scsi_cmnd *cmd, u8 rpt_opt); extern int sd_zbc_setup_zone_action(struct scsi_cmnd *cmd); extern int sd_zbc_setup_discard(struct scsi_cmnd *cmd); @@ -303,23 +299,15 @@ extern void sd_zbc_uninit_command(struct scsi_cmnd *cmd); extern void sd_zbc_remove(struct scsi_disk *); extern void sd_zbc_reset_zones(struct scsi_disk *); extern void sd_zbc_update_zones(struct scsi_disk *, sector_t, int, int reason); +extern bool sd_zbc_config(struct scsi_disk *, void *, size_t); + extern unsigned int 
sd_zbc_discard_granularity(struct scsi_disk *sdkp); #else /* CONFIG_SCSI_ZBC */ -static inline int sd_zbc_report_zones(struct scsi_disk *sdkp, - unsigned char *buf, int buf_len, - sector_t start_sector, - enum zbc_zone_reporting_options option, -
[PATCH 0/2] Change zone cache format to use less memory
Currently the RB-Tree zone cache is fast and flexible. It does use a rather largish amount of ram. This model reduces the ram required from 120 bytes per zone to 16 bytes per zone with a moderate transformation of the blk_zone_lookup() api. This model is predicated on the belief that most variations on zoned media will follow a pattern of using collections of same sized zones on a single device. Similar to the pattern of erase blocks on flash devices being progressively larger 16K, 64K, ... The goal is to be able to build a descriptor which is memory efficient, performant, and flexible. Shaun Tancheff (2): Move ZBC core setup to sd_zbc Migrate zone cache from RB-Tree to arrays of descriptors block/blk-core.c |2 +- block/blk-sysfs.c | 31 +- block/blk-zoned.c | 103 +++-- drivers/scsi/sd.c | 66 +-- drivers/scsi/sd.h | 20 +- drivers/scsi/sd_zbc.c | 1037 +--- include/linux/blkdev.h | 82 +++- 7 files changed, 759 insertions(+), 582 deletions(-) -- 2.9.3
[PATCH v2 4/4] Integrate ZBC command requests with zone cache.
Block layer (bio/request) commands can use or update the sd_zbc zone cache as appropriate for each command. Report Zones [REQ_OP_ZONE_REPORT] by default uses the current zone cache data to generate a device (ZBC spec) formatted response. REQ_META can also be specified to force the command to the device and the result will be used to refresh the zone cache. Reset WP [REQ_OP_ZONE_RESET] by default will attempt to translate the request into a discard following the SD_ZBC_RESET_WP provisioning mode. REQ_META can also be specified to force the command to be sent to the device. Open, Close and Finish zones having no other analog are sent directly to the device. On successful completion each zone action will update the zone cache as appropriate. Signed-off-by: Shaun Tancheff --- block/blk-lib.c | 16 -- drivers/scsi/sd.c | 42 +++- drivers/scsi/sd.h | 22 +- drivers/scsi/sd_zbc.c | 672 +++--- 4 files changed, 698 insertions(+), 54 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 67b9258..8cc5893 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -307,22 +307,6 @@ int blkdev_issue_zone_report(struct block_device *bdev, unsigned int op_flags, bio_set_op_attrs(bio, REQ_OP_ZONE_REPORT, op_flags); ret = submit_bio_wait(bio); - /* -* When our request it nak'd the underlying device maybe conventional -* so ... report a single conventional zone the size of the device. -*/ - if (ret == -EIO && conv->descriptor_count) { - /* Adjust the conventional to the size of the partition ... 
*/ - __be64 blksz = cpu_to_be64(bdev->bd_part->nr_sects); - - conv->maximum_lba = blksz; - conv->descriptors[0].type = BLK_ZONE_TYPE_CONVENTIONAL; - conv->descriptors[0].flags = BLK_ZONE_NO_WP << 4; - conv->descriptors[0].length = blksz; - conv->descriptors[0].lba_start = 0; - conv->descriptors[0].lba_wptr = blksz; - ret = 0; - } bio_put(bio); return ret; } diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index b76ffbb..9a649fa 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -1181,9 +1181,10 @@ static int sd_setup_zone_report_cmnd(struct scsi_cmnd *cmd) struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); struct bio *bio = rq->bio; sector_t sector = blk_rq_pos(rq); - struct gendisk *disk = rq->rq_disk; unsigned int nr_bytes = blk_rq_bytes(rq); int ret = BLKPREP_KILL; + bool is_fua = (rq->cmd_flags & REQ_META) ? true : false; + u8 rpt_opt = ZBC_ZONE_REPORTING_OPTION_ALL; WARN_ON(nr_bytes == 0); @@ -1194,18 +1195,35 @@ static int sd_setup_zone_report_cmnd(struct scsi_cmnd *cmd) if (sdkp->zoned != 1 && sdkp->device->type != TYPE_ZBC) { void *src; struct bdev_zone_report *conv; + __be64 blksz = cpu_to_be64(sdkp->capacity); - if (nr_bytes < sizeof(struct bdev_zone_report)) + if (nr_bytes < 512) goto out; src = kmap_atomic(bio->bi_io_vec->bv_page); conv = src + bio->bi_io_vec->bv_offset; conv->descriptor_count = cpu_to_be32(1); conv->same_field = BLK_ZONE_SAME_ALL; - conv->maximum_lba = cpu_to_be64(disk->part0.nr_sects); + conv->maximum_lba = blksz; + conv->descriptors[0].type = BLK_ZONE_TYPE_CONVENTIONAL; + conv->descriptors[0].flags = BLK_ZONE_NO_WP << 4; + conv->descriptors[0].length = blksz; + conv->descriptors[0].lba_start = 0; + conv->descriptors[0].lba_wptr = blksz; kunmap_atomic(src); + ret = BLKPREP_DONE; goto out; } + /* FUTURE ... 
when streamid is available */ + /* rpt_opt = bio_get_streamid(bio); */ + + if (!is_fua) { + ret = sd_zbc_setup_zone_report_cmnd(cmd, rpt_opt); + if (ret == BLKPREP_DONE || ret == BLKPREP_DEFER) + goto out; + if (ret == BLKPREP_KILL) + pr_err("No Zone Cache, query media.\n"); + } ret = scsi_init_io(cmd); if (ret != BLKPREP_OK) @@ -1224,8 +1242,7 @@ static int sd_setup_zone_report_cmnd(struct scsi_cmnd *cmd) cmd->cmnd[1] = ZI_REPORT_ZONES; put_unaligned_be64(sector, &cmd->cmnd[2]); put_unaligned_be32(nr_bytes, &cmd->cmnd[10]); - /* FUTURE ... when streamid is available */ - /* cmd->cmnd[14] = bio_get_streamid(bio); */ + cmd->cmnd[14] = rpt_opt; cmd->sc_data_direction = DMA_FROM_DE
[PATCH v2 2/4] On Discard either do Reset WP or Write Same
Based on the type of zone either perform a Reset WP for Sequential zones or a Write Same for Conventional zones. Also detect and handle the runt zone, if there is one. One additional check is added to error on discard requests that do not include all the active data in zone. By way of example when the WP indicates that 2000 blocks in the zone are in use and the discard indicated 1000 blocks can be unmapped the discard should fail as a Reset WP will unmap all the 2000 blocks in the zone. Signed-off-by: Shaun Tancheff --- drivers/scsi/sd.c | 45 ++--- drivers/scsi/sd.h | 9 ++-- drivers/scsi/sd_zbc.c | 135 +++--- 3 files changed, 114 insertions(+), 75 deletions(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 7903e21..d5ef6d8 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -729,21 +729,19 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd) sector_t sector = blk_rq_pos(rq); unsigned int nr_sectors = blk_rq_sectors(rq); unsigned int nr_bytes = blk_rq_bytes(rq); - unsigned int len; - int ret = 0; + int ret; char *buf; - struct page *page = NULL; + struct page *page; sector >>= ilog2(sdp->sector_size) - 9; nr_sectors >>= ilog2(sdp->sector_size) - 9; - if (sdkp->provisioning_mode != SD_ZBC_RESET_WP) { - page = alloc_page(GFP_ATOMIC | __GFP_ZERO); - if (!page) - return BLKPREP_DEFER; - } + page = alloc_page(GFP_ATOMIC | __GFP_ZERO); + if (!page) + return BLKPREP_DEFER; rq->completion_data = page; + rq->timeout = SD_TIMEOUT; switch (sdkp->provisioning_mode) { case SD_LBP_UNMAP: @@ -758,7 +756,7 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd) put_unaligned_be64(sector, &buf[8]); put_unaligned_be32(nr_sectors, &buf[16]); - len = 24; + cmd->transfersize = 24; break; case SD_LBP_WS16: @@ -768,7 +766,7 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd) put_unaligned_be64(sector, &cmd->cmnd[2]); put_unaligned_be32(nr_sectors, &cmd->cmnd[10]); - len = sdkp->device->sector_size; + cmd->transfersize = sdp->sector_size; break; case 
SD_LBP_WS10: @@ -777,35 +775,24 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd) cmd->cmnd[0] = WRITE_SAME; if (sdkp->provisioning_mode == SD_LBP_WS10) cmd->cmnd[1] = 0x8; /* UNMAP */ + else + rq->timeout = SD_WRITE_SAME_TIMEOUT; put_unaligned_be32(sector, &cmd->cmnd[2]); put_unaligned_be16(nr_sectors, &cmd->cmnd[7]); - len = sdkp->device->sector_size; + cmd->transfersize = sdp->sector_size; break; case SD_ZBC_RESET_WP: - /* sd_zbc_setup_discard uses block layer sector units */ - ret = sd_zbc_setup_discard(sdkp, rq, blk_rq_pos(rq), - blk_rq_sectors(rq)); + ret = sd_zbc_setup_discard(cmd); if (ret != BLKPREP_OK) goto out; - cmd->cmd_len = 16; - cmd->cmnd[0] = ZBC_OUT; - cmd->cmnd[1] = ZO_RESET_WRITE_POINTER; - put_unaligned_be64(sector, &cmd->cmnd[2]); - /* Reset Write Pointer doesn't have a payload */ - len = 0; - cmd->sc_data_direction = DMA_NONE; break; - default: ret = BLKPREP_INVALID; goto out; } - rq->timeout = SD_TIMEOUT; - - cmd->transfersize = len; cmd->allowed = SD_MAX_RETRIES; /* @@ -816,17 +803,15 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd) * discarded on disk. This allows us to report completion on the full * amount of blocks described by the request. */ - if (len) { - blk_add_request_payload(rq, page, 0, len); + if (cmd->transfersize) { + blk_add_request_payload(rq, page, 0, cmd->transfersize); ret = scsi_init_io(cmd); } rq->__data_len = nr_bytes; out: - if (page && ret != BLKPREP_OK) { - rq->completion_data = NULL; + if (ret != BLKPREP_OK) __free_page(page); - } return ret; } diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index ef6c132..2792c10 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -295,8 +295,7 @@ extern int sd_zbc_report_zones(struct s
[PATCH v2 3/4] Merge ZBC constants
Dedupe ZBC/ZAC constants used for reporting options, same code, zone condition and zone type. These are all useful to programs consuming zone information from user space as well so include them in a uapi header. Signed-off-by: Shaun Tancheff --- block/blk-lib.c | 4 +- drivers/scsi/sd.c | 2 +- include/linux/blkdev.h| 20 - include/scsi/scsi_proto.h | 17 include/uapi/linux/blkzoned_api.h | 167 +++--- 5 files changed, 103 insertions(+), 107 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index e92bd56..67b9258 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -316,8 +316,8 @@ int blkdev_issue_zone_report(struct block_device *bdev, unsigned int op_flags, __be64 blksz = cpu_to_be64(bdev->bd_part->nr_sects); conv->maximum_lba = blksz; - conv->descriptors[0].type = ZTYP_CONVENTIONAL; - conv->descriptors[0].flags = ZCOND_CONVENTIONAL << 4; + conv->descriptors[0].type = BLK_ZONE_TYPE_CONVENTIONAL; + conv->descriptors[0].flags = BLK_ZONE_NO_WP << 4; conv->descriptors[0].length = blksz; conv->descriptors[0].lba_start = 0; conv->descriptors[0].lba_wptr = blksz; diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index d5ef6d8..b76ffbb 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -1201,7 +1201,7 @@ static int sd_setup_zone_report_cmnd(struct scsi_cmnd *cmd) src = kmap_atomic(bio->bi_io_vec->bv_page); conv = src + bio->bi_io_vec->bv_offset; conv->descriptor_count = cpu_to_be32(1); - conv->same_field = ZS_ALL_SAME; + conv->same_field = BLK_ZONE_SAME_ALL; conv->maximum_lba = cpu_to_be64(disk->part0.nr_sects); kunmap_atomic(src); goto out; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 68198eb..d5cdb5d 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -263,26 +263,6 @@ struct blk_queue_tag { #define BLK_SCSI_CMD_PER_LONG (BLK_SCSI_MAX_CMDS / (sizeof(long) * 8)) #ifdef CONFIG_BLK_DEV_ZONED -enum blk_zone_type { - BLK_ZONE_TYPE_UNKNOWN, - BLK_ZONE_TYPE_CONVENTIONAL, - BLK_ZONE_TYPE_SEQWRITE_REQ, - 
BLK_ZONE_TYPE_SEQWRITE_PREF, - BLK_ZONE_TYPE_RESERVED, -}; - -enum blk_zone_state { - BLK_ZONE_NO_WP, - BLK_ZONE_EMPTY, - BLK_ZONE_OPEN, - BLK_ZONE_OPEN_EXPLICIT, - BLK_ZONE_CLOSED, - BLK_ZONE_UNKNOWN = 5, - BLK_ZONE_READONLY = 0xd, - BLK_ZONE_FULL, - BLK_ZONE_OFFLINE, - BLK_ZONE_BUSY = 0x20, -}; struct blk_zone { struct rb_node node; diff --git a/include/scsi/scsi_proto.h b/include/scsi/scsi_proto.h index 6ba66e0..d1defd1 100644 --- a/include/scsi/scsi_proto.h +++ b/include/scsi/scsi_proto.h @@ -299,21 +299,4 @@ struct scsi_lun { #define SCSI_ACCESS_STATE_MASK0x0f #define SCSI_ACCESS_STATE_PREFERRED 0x80 -/* Reporting options for REPORT ZONES */ -enum zbc_zone_reporting_options { - ZBC_ZONE_REPORTING_OPTION_ALL = 0, - ZBC_ZONE_REPORTING_OPTION_EMPTY, - ZBC_ZONE_REPORTING_OPTION_IMPLICIT_OPEN, - ZBC_ZONE_REPORTING_OPTION_EXPLICIT_OPEN, - ZBC_ZONE_REPORTING_OPTION_CLOSED, - ZBC_ZONE_REPORTING_OPTION_FULL, - ZBC_ZONE_REPORTING_OPTION_READONLY, - ZBC_ZONE_REPORTING_OPTION_OFFLINE, - ZBC_ZONE_REPORTING_OPTION_NEED_RESET_WP = 0x10, - ZBC_ZONE_REPORTING_OPTION_NON_SEQWRITE, - ZBC_ZONE_REPORTING_OPTION_NON_WP = 0x3f, -}; - -#define ZBC_REPORT_ZONE_PARTIAL 0x80 - #endif /* _SCSI_PROTO_H_ */ diff --git a/include/uapi/linux/blkzoned_api.h b/include/uapi/linux/blkzoned_api.h index cd81a9f..fa12976 100644 --- a/include/uapi/linux/blkzoned_api.h +++ b/include/uapi/linux/blkzoned_api.h @@ -16,97 +16,123 @@ #include +#define ZBC_REPORT_OPTION_MASK 0x3f +#define ZBC_REPORT_ZONE_PARTIAL 0x80 + /** * enum zone_report_option - Report Zones types to be included. * - * @ZOPT_NON_SEQ_AND_RESET: Default (all zones). - * @ZOPT_ZC1_EMPTY: Zones which are empty. - * @ZOPT_ZC2_OPEN_IMPLICIT: Zones open but not explicitly opened - * @ZOPT_ZC3_OPEN_EXPLICIT: Zones opened explicitly - * @ZOPT_ZC4_CLOSED: Zones closed for writing. - * @ZOPT_ZC5_FULL: Zones that are full. 
- * @ZOPT_ZC6_READ_ONLY: Zones that are read-only - * @ZOPT_ZC7_OFFLINE: Zones that are offline - * @ZOPT_RESET: Zones that are empty - * @ZOPT_NON_SEQ: Zones that have HA media-cache writes pending - * @ZOPT_NON_WP_ZONES: Zones that do not have Write Pointers (conventional) - * @ZOPT_PARTIAL_FLAG: Modifies the definition of the Zone List Length field. + * @ZBC_ZONE_REPORTING_OPTION_ALL: Default (all zones). + * @ZBC_ZONE_REPORTING_OPTION_EMPTY: Zones which are empty. + * @ZBC_ZONE_REPORTI
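The merged constants above lend themselves to a small user-space decode helper. A minimal sketch follows; the enum values are taken from the quoted removals in blkdev.h, and the nibble packing follows the `conv->descriptors[0].flags = BLK_ZONE_NO_WP << 4` usage in the blk-lib.c hunk — treat it as an illustration, not the uapi header itself:

```c
#include <assert.h>
#include <stdint.h>

/* Values as in the enums this patch merges into blkzoned_api.h. */
enum {
	BLK_ZONE_TYPE_CONVENTIONAL = 1,	/* no write pointer */
	BLK_ZONE_TYPE_SEQWRITE_REQ = 2,	/* host managed */
	BLK_ZONE_TYPE_SEQWRITE_PREF = 3,	/* host aware */
};

enum {
	BLK_ZONE_NO_WP = 0,
	BLK_ZONE_EMPTY = 1,
	BLK_ZONE_READONLY = 0xd,
	BLK_ZONE_FULL = 0xe,
	BLK_ZONE_OFFLINE = 0xf,
};

/* Zone condition rides in the upper nibble of the descriptor flags byte,
 * matching "conv->descriptors[0].flags = BLK_ZONE_NO_WP << 4" above. */
static uint8_t zone_condition(uint8_t flags)
{
	return flags >> 4;
}

/* Zone type is the low nibble of the descriptor type byte. */
static uint8_t zone_type(uint8_t type)
{
	return type & 0x0f;
}
```

A consumer of the report buffer would call these on each descriptor's type/flags bytes.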
[PATCH v2 0/4] Integrate bio/request ZBC ops with zone cache
Hi, As per Christoph's request this patch incorporates Hannes' cache of zone information. The approach is to have REQ_OP_ZONE_REPORT return data in the same format regardless of the availability of the zone cache. So whether the kernel is built with or without BLK_DEV_ZONED [and SCSI_ZBC], users of blkdev_issue_zone_report() and/or REQ_OP_ZONE_REPORT bio's will have a consistent data format to digest. Additionally it seems reasonable to allow the REQ_OP_ZONE_* ops to indicate whether the command *must* be delivered to the device [and update the zone cache] accordingly. Here REQ_META is used because REQ_FUA can be dropped, causing sd_done to be skipped. Rather than special-case the current code, I chose to pick an otherwise non-applicable flag. This series is based on Linus's v4.8-rc2 and builds on top of the previous series of block layer support: Add ioctl to issue ZBC/ZAC commands via block layer Add bio/request flags to issue ZBC/ZAC commands as well as the series posted by Hannes: sd_zbc: Fix handling of ZBC read after write pointer sd: Limit messages for ZBC disks capacity change sd: Implement support for ZBC devices sd: Implement new RESET_WP provisioning mode sd: configure ZBC devices ... Patches for util-linux can be found here: g...@github.com:stancheff/util-linux.git v2.28.1+biof https://github.com/stancheff/util-linux/tree/v2.28.1%2Bbiof This patch is available here: https://github.com/stancheff/linux/tree/v4.8-rc2%2Bbiof.v9 g...@github.com:stancheff/linux.git v4.8-rc2+biof.v9 v2: - Fully integrated bio <-> zone cache [<-> device] - Added discard -> write same for conventional zones. - Merged disparate constants into a canonical set. Shaun Tancheff (4): Enable support for Seagate HostAware drives (testing). On Discard either do Reset WP or Write Same Merge ZBC constants Integrate ZBC command requests with zone cache. 
block/blk-lib.c | 16 - drivers/scsi/sd.c | 111 +++-- drivers/scsi/sd.h | 49 ++- drivers/scsi/sd_zbc.c | 904 ++ include/linux/blkdev.h| 22 +- include/scsi/scsi_proto.h | 17 - include/uapi/linux/blkzoned_api.h | 167 --- 7 files changed, 1032 insertions(+), 254 deletions(-) -- 2.9.3
[PATCH v2 1/4] Enable support for Seagate HostAware drives
Seagate drives report a SAME code of 0 due to having: - Zones of different types (CMR zones at the low LBA space). - Zones of different size (A terminating 'runt' zone in the high lba space). Support loading the zone topology into the zone cache. Signed-off-by: Shaun Tancheff --- drivers/scsi/sd.c | 22 +++--- drivers/scsi/sd.h | 20 -- drivers/scsi/sd_zbc.c | 183 +++-- include/linux/blkdev.h | 16 +++-- 4 files changed, 170 insertions(+), 71 deletions(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 059a57f..7903e21 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -693,8 +693,13 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode) break; case SD_ZBC_RESET_WP: - max_blocks = sdkp->unmap_granularity; q->limits.discard_zeroes_data = 1; + q->limits.discard_granularity = + sd_zbc_discard_granularity(sdkp); + + max_blocks = min_not_zero(sdkp->unmap_granularity, + q->limits.discard_granularity >> + ilog2(logical_block_size)); break; case SD_LBP_ZERO: @@ -1955,13 +1960,12 @@ static int sd_done(struct scsi_cmnd *SCpnt) good_bytes = blk_rq_bytes(req); scsi_set_resid(SCpnt, 0); } else { -#ifdef CONFIG_SCSI_ZBC if (op == ZBC_OUT) /* RESET WRITE POINTER failed */ sd_zbc_update_zones(sdkp, blk_rq_pos(req), - 512, true); -#endif + 512, SD_ZBC_RESET_WP_ERR); + good_bytes = 0; scsi_set_resid(SCpnt, blk_rq_bytes(req)); } @@ -2034,7 +2038,6 @@ static int sd_done(struct scsi_cmnd *SCpnt) good_bytes = blk_rq_bytes(req); scsi_set_resid(SCpnt, 0); } -#ifdef CONFIG_SCSI_ZBC /* * ZBC: Unaligned write command. * Write did not start a write pointer position. @@ -2042,8 +2045,7 @@ static int sd_done(struct scsi_cmnd *SCpnt) if (sshdr.ascq == 0x04) sd_zbc_update_zones(sdkp, blk_rq_pos(req), - 512, true); -#endif + 512, SD_ZBC_WRITE_ERR); } break; default: @@ -2270,7 +2272,7 @@ static void sd_read_zones(struct scsi_disk *sdkp, unsigned char *buffer) * supports equal zone sizes. 
*/ same = buffer[4] & 0xf; - if (same == 0 || same > 3) { + if (same > 3) { sd_printk(KERN_WARNING, sdkp, "REPORT ZONES SAME type %d not supported\n", same); return; @@ -2282,9 +2284,9 @@ static void sd_read_zones(struct scsi_disk *sdkp, unsigned char *buffer) sdkp->unmap_granularity = zone_len; blk_queue_chunk_sectors(sdkp->disk->queue, logical_to_sectors(sdkp->device, zone_len)); - sd_config_discard(sdkp, SD_ZBC_RESET_WP); - sd_zbc_setup(sdkp, buffer, SD_BUF_SIZE); + sd_zbc_setup(sdkp, zone_len, buffer, SD_BUF_SIZE); + sd_config_discard(sdkp, SD_ZBC_RESET_WP); } static void read_capacity_error(struct scsi_disk *sdkp, struct scsi_device *sdp, diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index 6ae4505..ef6c132 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -283,19 +283,24 @@ static inline void sd_dif_complete(struct scsi_cmnd *cmd, unsigned int a) #endif /* CONFIG_BLK_DEV_INTEGRITY */ + +#define SD_ZBC_INIT0 +#define SD_ZBC_RESET_WP_ERR1 +#define SD_ZBC_WRITE_ERR 2 + #ifdef CONFIG_SCSI_ZBC extern int sd_zbc_report_zones(struct scsi_disk *, unsigned char *, int, sector_t, enum zbc_zone_reporting_options, bool); -extern int sd_zbc_setup(struct scsi_disk *, char *, int); +extern int sd_zbc_setup(struct scsi_disk *, u64 zlen, char *buf, int buf_len); extern void sd_zbc_remove(struct scsi_disk *); extern void sd_zbc_reset_zones(struct scsi_disk *); extern int sd_zbc_setup_discard(struct scsi_disk *, struct request *, sector_t, unsigned int); extern int sd_zbc_setup_read_write(struct scsi_disk *, struct request *, sector_t, unsigned int *); -extern void sd_zbc_upda
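The RESET_WP discard sizing introduced in the sd_config_discard() hunk above can be sketched in isolation. This is an assumption-laden user-space model — min_not_zero() and ilog2() are reimplemented here and the numeric inputs are hypothetical — not the sd driver itself:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the kernel's min_not_zero(): the smaller of two values, with
 * zero treated as "unset" rather than as a minimum. */
static uint64_t min_not_zero(uint64_t a, uint64_t b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/* Integer log2, as the kernel's ilog2() for power-of-two inputs. */
static unsigned ilog2_u32(uint32_t v)
{
	unsigned r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Sketch of the SD_ZBC_RESET_WP sizing from the patch:
 * max_blocks = min_not_zero(unmap_granularity,
 *                           discard_granularity >> ilog2(logical_block_size));
 * Granularities are in bytes, the result is in logical blocks. */
static uint64_t reset_wp_max_blocks(uint64_t unmap_granularity,
				    uint64_t discard_granularity,
				    uint32_t logical_block_size)
{
	return min_not_zero(unmap_granularity,
			    discard_granularity >> ilog2_u32(logical_block_size));
}
```

For example, a hypothetical 256 MiB zone with 512-byte logical blocks yields 524288 blocks when unmap_granularity is unset.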
[PATCH v6 4/4] SCT Write Same handle ATA_DFLAG_PIO
Use the non-DMA write log when ATA_DFLAG_PIO is set. Signed-off-by: Shaun Tancheff --- v6: Added check for ATA_DFLAG_PIO and fallback to the non-DMA write log for SCT Write Same drivers/ata/libata-scsi.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index 37f456e..e50b7a7 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -3485,6 +3485,8 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) tf->device = ATA_CMD_STANDBYNOW1; tf->protocol = ATA_PROT_DMA; tf->command = ATA_CMD_WRITE_LOG_DMA_EXT; + if (unlikely(dev->flags & ATA_DFLAG_PIO)) + tf->command = ATA_CMD_WRITE_LOG_EXT; } tf->flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_DEVICE | ATA_TFLAG_LBA48 | -- 2.9.3
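The fallback is small enough to model directly. A sketch of the opcode selection — the WRITE LOG opcodes are the standard ATA values, but the ATA_DFLAG_PIO bit position is assumed here for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Standard ATA opcodes (as in include/linux/ata.h). */
#define ATA_CMD_WRITE_LOG_EXT		0x3f
#define ATA_CMD_WRITE_LOG_DMA_EXT	0x57

/* Bit position assumed for this sketch; the real value lives in libata.h. */
#define ATA_DFLAG_PIO			(1UL << 1)

/* Prefer WRITE LOG DMA EXT for the SCT key page, but fall back to the
 * non-DMA opcode when the device is restricted to PIO, as in the patch. */
static uint8_t sct_write_log_command(unsigned long dev_flags)
{
	uint8_t command = ATA_CMD_WRITE_LOG_DMA_EXT;

	if (dev_flags & ATA_DFLAG_PIO)
		command = ATA_CMD_WRITE_LOG_EXT;
	return command;
}
```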
[PATCH v6 3/4] SCT Write Same / DSM Trim
Correct handling of devices with a sector_size other than 512 bytes. Signed-off-by: Shaun Tancheff --- With a 4Kn device's sector_size it is possible to describe a much larger DSM Trim than the current fixed default of 512 bytes. This patch assumes the minimum descriptor is sector_size and fills out the descriptor accordingly. The ACS-2 specification is quite clear that the DSM command payload is sized as a number of 512-byte transfers, so a 4Kn device will operate correctly without this patch. v5: - Added support for a sector_size descriptor other than 512 bytes. drivers/ata/libata-scsi.c | 85 +++ 1 file changed, 57 insertions(+), 28 deletions(-) diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index ebf1a04..37f456e 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -3283,7 +3283,7 @@ static unsigned int ata_scsi_pass_thru(struct ata_queued_cmd *qc) /** * ata_format_dsm_trim_descr() - SATL Write Same to DSM Trim * @cmd: SCSI command being translated - * @num: Maximum number of entries (nominally 64). + * @trmax: Maximum number of entries that will fit in sector_size bytes. * @sector: Starting sector * @count: Total Range of request in logical sectors * @@ -3298,63 +3298,80 @@ static unsigned int ata_scsi_pass_thru(struct ata_queued_cmd *qc) * LBA's should be sorted order and not overlap. * * NOTE: this is the same format as ADD LBA(S) TO NV CACHE PINNED SET + * + * Return: Number of bytes copied into sglist. 
*/ -static unsigned int ata_format_dsm_trim_descr(struct scsi_cmnd *cmd, u32 num, - u64 sector, u32 count) +static size_t ata_format_dsm_trim_descr(struct scsi_cmnd *cmd, u32 trmax, + u64 sector, u32 count) { - __le64 *buffer; - u32 i = 0, used_bytes; + struct scsi_device *sdp = cmd->device; + size_t len = sdp->sector_size; + size_t r; + __le64 *buf; + u32 i = 0; unsigned long flags; - BUILD_BUG_ON(512 > ATA_SCSI_RBUF_SIZE); + WARN_ON(len > ATA_SCSI_RBUF_SIZE); + + if (len > ATA_SCSI_RBUF_SIZE) + len = ATA_SCSI_RBUF_SIZE; spin_lock_irqsave(&ata_scsi_rbuf_lock, flags); - buffer = ((void *)ata_scsi_rbuf); - while (i < num) { + buf = ((void *)ata_scsi_rbuf); + memset(buf, 0, len); + while (i < trmax) { u64 entry = sector | ((u64)(count > 0xffff ? 0xffff : count) << 48); - buffer[i++] = __cpu_to_le64(entry); + buf[i++] = __cpu_to_le64(entry); if (count <= 0xffff) break; count -= 0xffff; sector += 0xffff; } - - used_bytes = ALIGN(i * 8, 512); - memset(buffer + i, 0, used_bytes - i * 8); - sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), buffer, 512); + r = sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), buf, len); spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags); - return used_bytes; + return r; } /** * ata_format_dsm_trim_descr() - SATL Write Same to ATA SCT Write Same * @cmd: SCSI command being translated * @lba: Starting sector - * @num: Number of logical sectors to be zero'd. + * @num: Number of sectors to be zero'd. * - * Rewrite the WRITE SAME descriptor to be an SCT Write Same formatted + * Rewrite the WRITE SAME payload to be an SCT Write Same formatted * descriptor. * NOTE: Writes a pattern (0's) in the foreground. - * Large write-same requents can timeout. + * + * Return: Number of bytes copied into sglist. 
*/ -static void ata_format_sct_write_same(struct scsi_cmnd *cmd, u64 lba, u64 num) +static size_t ata_format_sct_write_same(struct scsi_cmnd *cmd, u64 lba, u64 num) { - u16 *sctpg; + struct scsi_device *sdp = cmd->device; + size_t len = sdp->sector_size; + size_t r; + u16 *buf; unsigned long flags; spin_lock_irqsave(&ata_scsi_rbuf_lock, flags); - sctpg = ((void *)ata_scsi_rbuf); + buf = ((void *)ata_scsi_rbuf); + + put_unaligned_le16(0x0002, &buf[0]); /* SCT_ACT_WRITE_SAME */ + put_unaligned_le16(0x0101, &buf[1]); /* WRITE PTRN FG */ + put_unaligned_le64(lba, &buf[2]); + put_unaligned_le64(num, &buf[6]); + put_unaligned_le32(0u, &buf[10]); /* pattern */ + + WARN_ON(len > ATA_SCSI_RBUF_SIZE); - put_unaligned_le16(0x0002, &sctpg[0]); /* SCT_ACT_WRITE_SAME */ - put_unaligned_le16(0x0101, &sctpg[1]); /* WRITE PTRN FG */ - put_unaligned_le64(lba, &sctpg[2]); - put_unaligned_le64(num, &sctpg[6]); - put_unaligned_le32(0u, &sctpg[10]); + if (len > ATA_SCSI_RBUF_SIZE
[PATCH v6 2/4] Add support for SCT Write Same
SATA drives may support write same via SCT. This is useful for setting the drive contents to a specific pattern (0's). Translate a SCSI WRITE SAME 16 command to be either a DSM TRIM command or an SCT Write Same command. Based on the UNMAP flag: - When set translate to DSM TRIM - When not set translate to SCT Write Same Signed-off-by: Shaun Tancheff --- v6: - Change to use sg_copy_from_buffer as per Christoph Hellwig v5: - Addressed review comments - Report support for ZBC only for zoned devices. - kmap page during rewrite - Fix unmap set to require trim or error, if not unmap then sct write same or error. v4: - Added partial MAINTENANCE_IN opcode simulation - Dropped all changes in drivers/scsi/* - Changed to honor the UNMAP flag -> TRIM, no UNMAP -> SCT. v3: - Demux UNMAP/TRIM from WRITE SAME v2: - Remove fugly ata hacking from sd.c drivers/ata/libata-scsi.c | 199 +++--- include/linux/ata.h | 43 ++ 2 files changed, 213 insertions(+), 29 deletions(-) diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index 7990cb2..ebf1a04 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -1159,8 +1159,6 @@ static void ata_scsi_sdev_config(struct scsi_device *sdev) { sdev->use_10_for_rw = 1; sdev->use_10_for_ms = 1; - sdev->no_report_opcodes = 1; - sdev->no_write_same = 1; /* Schedule policy is determined by ->qc_defer() callback and * it needs to see every deferred qc. Set dev_blocked to 1 to @@ -3287,7 +3285,7 @@ static unsigned int ata_scsi_pass_thru(struct ata_queued_cmd *qc) * @cmd: SCSI command being translated * @num: Maximum number of entries (nominally 64). * @sector: Starting sector - * @count: Total Range of request + * @count: Total Range of request in logical sectors * * Rewrite the WRITE SAME descriptor to be a DSM TRIM little-endian formatted * descriptor. 
@@ -3330,6 +3328,45 @@ static unsigned int ata_format_dsm_trim_descr(struct scsi_cmnd *cmd, u32 num, return used_bytes; } +/** + * ata_format_dsm_trim_descr() - SATL Write Same to ATA SCT Write Same + * @cmd: SCSI command being translated + * @lba: Starting sector + * @num: Number of logical sectors to be zero'd. + * + * Rewrite the WRITE SAME descriptor to be an SCT Write Same formatted + * descriptor. + * NOTE: Writes a pattern (0's) in the foreground. + * Large write-same requents can timeout. + */ +static void ata_format_sct_write_same(struct scsi_cmnd *cmd, u64 lba, u64 num) +{ + u16 *sctpg; + unsigned long flags; + + spin_lock_irqsave(&ata_scsi_rbuf_lock, flags); + sctpg = ((void *)ata_scsi_rbuf); + + put_unaligned_le16(0x0002, &sctpg[0]); /* SCT_ACT_WRITE_SAME */ + put_unaligned_le16(0x0101, &sctpg[1]); /* WRITE PTRN FG */ + put_unaligned_le64(lba, &sctpg[2]); + put_unaligned_le64(num, &sctpg[6]); + put_unaligned_le32(0u, &sctpg[10]); + + sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), sctpg, 512); + spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags); +} + +/** + * ata_scsi_write_same_xlat() - SATL Write Same to ATA SCT Write Same + * @qc: Command to be translated + * + * Translate a SCSI WRITE SAME command to be either a DSM TRIM command or + * an SCT Write Same command. 
+ * Based on WRITE SAME has the UNMAP flag + * When set translate to DSM TRIM + * When clear translate to SCT Write Same + */ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) { struct ata_taskfile *tf = &qc->tf; @@ -3342,6 +3379,7 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) u32 size; u16 fp; u8 bp = 0xff; + u8 unmap = cdb[1] & 0x8; /* we may not issue DMA commands if no DMA mode is set */ if (unlikely(!dev->dma_mode)) @@ -3353,11 +3391,26 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) } scsi_16_lba_len(cdb, &block, &n_block); - /* for now we only support WRITE SAME with the unmap bit set */ - if (unlikely(!(cdb[1] & 0x8))) { - fp = 1; - bp = 3; - goto invalid_fld; + if (unmap) { + /* If trim is not enabled the cmd is invalid. */ + if ((dev->horkage & ATA_HORKAGE_NOTRIM) || + !ata_id_has_trim(dev->id)) { + fp = 1; + bp = 3; + goto invalid_fld; + } + /* If the request is too large the cmd is invalid */ + if (n_block > 0xffff * trmax) { + fp = 2; + goto invalid_fld; + } + } else { + /* If write same is not available the cmd is invalid */ +
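The demux above reduces to a small decision function. A user-space sketch of that control flow — has_trim/has_sct_ws stand in for the ata_id_has_trim()/SCT capability checks, so this is illustrative rather than the libata code:

```c
#include <assert.h>
#include <stdint.h>

enum ws_xlat { WS_DSM_TRIM, WS_SCT_WRITE_SAME, WS_INVALID_FLD };

/* Sketch of ata_scsi_write_same_xlat()'s demux: the UNMAP bit (cdb[1]
 * bit 3) selects DSM TRIM, otherwise SCT Write Same; each path validates
 * device support, and TRIM additionally checks the request fits one
 * payload (0xffff sectors per entry, trmax entries). */
static enum ws_xlat write_same_translate(uint8_t cdb1, uint64_t n_block,
					 uint32_t trmax, int has_trim,
					 int has_sct_ws)
{
	int unmap = cdb1 & 0x8;

	if (unmap) {
		if (!has_trim)
			return WS_INVALID_FLD;
		if (n_block > (uint64_t)0xffff * trmax)
			return WS_INVALID_FLD;
		return WS_DSM_TRIM;
	}
	if (!has_sct_ws)
		return WS_INVALID_FLD;
	return WS_SCT_WRITE_SAME;
}
```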
[PATCH v6 0/4] SCT Write Same
At some point the method of issuing Write Same for ATA drives changed. Currently write same is commonly available via SCT so expose the SCT capabilities and use SCT Write Same when it is available. This is useful for zoned based media that prefers to support discard with lbprz set, aka discard zeroes data by mapping discard operations to reset write pointer operations. Conventional zones that do not support reset write pointer can still honor the discard zeroes data by issuing a write same over the zone. It may also be nice to know if various controllers that currently disable WRITE SAME will work with the SCT Write Same code path: aacraid, arcmsr, megaraid, 3w-9xxx, 3w-sas, 3w-, gdth, hpsa, ips, megaraid, pmcraid, storvsc_drv This patch against v4.8-rc2 is also at https://github.com/stancheff/linux/tree/v4.8-rc2%2Bbiof.v9 g...@github.com:stancheff/linux.git v4.8-rc2+biof.v9 v6: - Fix bisect bug reported by Tom Yan - Change to use sg_copy_from_buffer as per Christoph Hellwig - Added support for a sector_size descriptor other than 512 bytes. v5: - Addressed review comments - Report support for ZBC only for zoned devices. - kmap page during rewrite - Fix unmap set to require trim or error, if not unmap then sct write same or error. v4: - Added partial MAINTENANCE_IN opcode simulation - Dropped all changes in drivers/scsi/* - Changed to honor the UNMAP flag -> TRIM, no UNMAP -> SCT. v3: - Demux UNMAP/TRIM from WRITE SAME v2: - Remove fugly ata hacking from sd.c Shaun Tancheff (4): libata: Safely overwrite attached page in WRITE SAME xlat Add support for SCT Write Same SCT Write Same / DSM Trim SCT Write Same handle ATA_DFLAG_PIO drivers/ata/libata-scsi.c | 280 +- include/linux/ata.h | 69 +++- 2 files changed, 292 insertions(+), 57 deletions(-) -- 2.9.3
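The SCT Write Same key page this series builds (action 0x0002, function 0x0101 "write pattern foreground", start LBA, count, zero pattern) can be sketched in plain C. Byte offsets follow the u16-indexed put_unaligned calls in the patches of this series; treat it as an illustration rather than the kernel code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static void put_le16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)v;
	p[1] = (uint8_t)(v >> 8);
}

static void put_le64(uint8_t *p, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

/* Build the SCT Write Same key page: the kernel writes through a u16
 * pointer, so buf[0]/buf[1]/buf[2]/buf[6]/buf[10] become byte offsets
 * 0, 2, 4, 12 and 20. The pattern dword is left zero (write zeros). */
static void format_sct_write_same(uint8_t *page, size_t len, uint64_t lba,
				  uint64_t num)
{
	memset(page, 0, len);
	put_le16(&page[0], 0x0002);	/* SCT_ACT_WRITE_SAME */
	put_le16(&page[2], 0x0101);	/* WRITE PTRN FG */
	put_le64(&page[4], lba);	/* starting LBA */
	put_le64(&page[12], num);	/* sector count */
	/* 32-bit pattern at byte offset 20 stays zero */
}
```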
[PATCH v6 1/4] libata: Safely overwrite attached page in WRITE SAME xlat
Safely overwriting the attached page to ATA format from the SCSI formatted variant. Signed-off-by: Shaun Tancheff --- v6: - Fix bisect bug reported by Tom Yan - Change to use sg_copy_from_buffer as per Christoph Hellwig v5: - Added prep patch to work with non-page aligned scatterlist pages and use kmap_atomic() to lock page during modification. drivers/ata/libata-scsi.c | 56 ++- include/linux/ata.h | 26 -- 2 files changed, 51 insertions(+), 31 deletions(-) diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index e207b33..7990cb2 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -3282,6 +3282,54 @@ static unsigned int ata_scsi_pass_thru(struct ata_queued_cmd *qc) return 1; } +/** + * ata_format_dsm_trim_descr() - SATL Write Same to DSM Trim + * @cmd: SCSI command being translated + * @num: Maximum number of entries (nominally 64). + * @sector: Starting sector + * @count: Total Range of request + * + * Rewrite the WRITE SAME descriptor to be a DSM TRIM little-endian formatted + * descriptor. + * + * Upto 64 entries of the format: + * 63:48 Range Length + * 47:0 LBA + * + * Range Length of 0 is ignored. + * LBA's should be sorted order and not overlap. + * + * NOTE: this is the same format as ADD LBA(S) TO NV CACHE PINNED SET + */ +static unsigned int ata_format_dsm_trim_descr(struct scsi_cmnd *cmd, u32 num, + u64 sector, u32 count) +{ + __le64 *buffer; + u32 i = 0, used_bytes; + unsigned long flags; + + BUILD_BUG_ON(512 > ATA_SCSI_RBUF_SIZE); + + spin_lock_irqsave(&ata_scsi_rbuf_lock, flags); + buffer = ((void *)ata_scsi_rbuf); + while (i < num) { + u64 entry = sector | + ((u64)(count > 0xffff ? 
0xffff : count) << 48); + buffer[i++] = __cpu_to_le64(entry); + if (count <= 0xffff) + break; + count -= 0xffff; + sector += 0xffff; + } + + used_bytes = ALIGN(i * 8, 512); + memset(buffer + i, 0, used_bytes - i * 8); + sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), buffer, 512); + spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags); + + return used_bytes; +} + static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) { struct ata_taskfile *tf = &qc->tf; @@ -3290,8 +3338,8 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) const u8 *cdb = scmd->cmnd; u64 block; u32 n_block; + const u32 trmax = ATA_MAX_TRIM_RNUM; u32 size; - void *buf; u16 fp; u8 bp = 0xff; @@ -3319,10 +3367,8 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) if (!scsi_sg_count(scmd)) goto invalid_param_len; - buf = page_address(sg_page(scsi_sglist(scmd))); - - if (n_block <= 65535 * ATA_MAX_TRIM_RNUM) { - size = ata_set_lba_range_entries(buf, ATA_MAX_TRIM_RNUM, block, n_block); + if (n_block <= 0xffff * trmax) { + size = ata_format_dsm_trim_descr(scmd, trmax, block, n_block); } else { fp = 2; goto invalid_fld; } diff --git a/include/linux/ata.h b/include/linux/ata.h index adbc812..45a1d71 100644 --- a/include/linux/ata.h +++ b/include/linux/ata.h @@ -1071,32 +1071,6 @@ static inline void ata_id_to_hd_driveid(u16 *id) #endif } -/* - * Write LBA Range Entries to the buffer that will cover the extent from - * sector to sector + count. This is used for TRIM and for ADD LBA(S) - * TO NV CACHE PINNED SET. - */ -static inline unsigned ata_set_lba_range_entries(void *_buffer, - unsigned num, u64 sector, unsigned long count) -{ - __le64 *buffer = _buffer; - unsigned i = 0, used_bytes; - - while (i < num) { - u64 entry = sector | - ((u64)(count > 0xffff ? 
0xffff : count) << 48); - buffer[i++] = __cpu_to_le64(entry); - if (count <= 0xffff) - break; - count -= 0xffff; - sector += 0xffff; - } - - used_bytes = ALIGN(i * 8, 512); - memset(buffer + i, 0, used_bytes - i * 8); - return used_bytes; -} - static inline bool ata_ok(u8 status) { return ((status & (ATA_BUSY | ATA_DRDY | ATA_DF | ATA_DRQ | ATA_ERR)) -- 2.9.3
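The range-entry packing that ata_format_dsm_trim_descr() inherits from ata_set_lba_range_entries() is easy to check in isolation — 48 bits of LBA, 16 bits of range length, clipped at 0xffff:

```c
#include <assert.h>
#include <stdint.h>

/* One DSM TRIM range entry, per the comment block above:
 * bits 47:0 carry the LBA, bits 63:48 the range length (max 0xffff). */
static uint64_t trim_range_entry(uint64_t sector, uint32_t count)
{
	uint64_t clipped = count > 0xffff ? 0xffff : count;

	return (sector & 0xffffffffffffULL) | (clipped << 48);
}
```

A range longer than 0xffff sectors therefore needs multiple entries, which is exactly what the while loop in the quoted code emits.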
[PATCH v8 2/2] Add ioctl to issue ZBC/ZAC commands via block layer
Add support for ZBC ioctl's BLKREPORT - Issue Report Zones to device. BLKZONEACTION - Issue a Zone Action (Close, Finish, Open, or Reset) Signed-off-by: Shaun Tancheff --- v8: - Changed ioctl for zone actions to a single ioctl that takes a structure including the zone, zone action, all flag, and force option - Mapped REQ_META flag to 'force unit access' for zone operations v6: - Added GFP_DMA to gfp mask. v4: - Rebase on linux-next tag next-20160617. - Change bio flags to bio op's block/ioctl.c | 149 ++ include/uapi/linux/blkzoned_api.h | 30 +++- include/uapi/linux/fs.h | 1 + 3 files changed, 179 insertions(+), 1 deletion(-) diff --git a/block/ioctl.c b/block/ioctl.c index ed2397f..d760523 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -194,6 +194,151 @@ int blkdev_reread_part(struct block_device *bdev) } EXPORT_SYMBOL(blkdev_reread_part); +static int blk_zoned_report_ioctl(struct block_device *bdev, fmode_t mode, + void __user *parg) +{ + int error = -EFAULT; + gfp_t gfp = GFP_KERNEL | GFP_DMA; + void *iopg = NULL; + struct bdev_zone_report_io *bzrpt = NULL; + int order = 0; + struct page *pgs = NULL; + u32 alloc_size = PAGE_SIZE; + unsigned int op_flags = 0; + u8 opt = 0; + + if (!(mode & FMODE_READ)) + return -EBADF; + + iopg = (void *)get_zeroed_page(gfp); + if (!iopg) { + error = -ENOMEM; + goto report_zones_out; + } + bzrpt = iopg; + if (copy_from_user(bzrpt, parg, sizeof(*bzrpt))) { + error = -EFAULT; + goto report_zones_out; + } + if (bzrpt->data.in.return_page_count > alloc_size) { + int npages; + + alloc_size = bzrpt->data.in.return_page_count; + npages = (alloc_size + PAGE_SIZE - 1) >> PAGE_SHIFT; + pgs = alloc_pages(gfp, ilog2(npages)); + if (pgs) { + void *mem = page_address(pgs); + + if (!mem) { + error = -ENOMEM; + goto report_zones_out; + } + order = ilog2(npages); + memset(mem, 0, alloc_size); + memcpy(mem, bzrpt, sizeof(*bzrpt)); + bzrpt = mem; + } else { + /* Result requires DMA capable memory */ + pr_err("Not enough memory available for 
request.\n"); + error = -ENOMEM; + goto report_zones_out; + } + } else { + alloc_size = bzrpt->data.in.return_page_count; + } + if (bzrpt->data.in.force_unit_access) + op_flags |= REQ_META; + opt = bzrpt->data.in.report_option; + error = blkdev_issue_zone_report(bdev, op_flags, + bzrpt->data.in.zone_locator_lba, opt, + pgs ? pgs : virt_to_page(iopg), + alloc_size, GFP_KERNEL); + if (error) + goto report_zones_out; + + if (pgs) { + void *src = bzrpt; + u32 off = 0; + + /* +* When moving a multi-order page with GFP_DMA +* the copy to user can trap "" +* so instead we copy out 1 page at a time. +*/ + while (off < alloc_size && !error) { + u32 len = min_t(u32, PAGE_SIZE, alloc_size - off); + + memcpy(iopg, src + off, len); + if (copy_to_user(parg + off, iopg, len)) + error = -EFAULT; + off += len; + } + } else { + if (copy_to_user(parg, iopg, alloc_size)) + error = -EFAULT; + } + +report_zones_out: + if (pgs) + __free_pages(pgs, order); + if (iopg) + free_page((unsigned long)iopg); + return error; +} + +static int blk_zoned_action_ioctl(struct block_device *bdev, fmode_t mode, + void __user *parg) +{ + unsigned int op = 0; + unsigned int op_flags = 0; + sector_t lba; + struct bdev_zone_action za; + + if (!(mode & FMODE_WRITE)) + return -EBADF; + + /* When acting on zones we explicitly disallow using a partition. */ + if (bdev != bdev->bd_contains) { + pr_err("%s: All zone operations disallowed on this device\n", + __func__); + return -EFAULT; + } + + if (copy_from_user(&za, parg, sizeof(za))) + return -EFAULT
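The page-at-a-time copy-out loop in blk_zoned_report_ioctl() above is driven by a simple chunking calculation; a sketch of it, with PAGE_SIZE assumed to be 4096 for the example:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u		/* assumed for this sketch */

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Mirrors the copy-out loop: the multi-order GFP_DMA buffer is returned
 * to user space one page at a time; this gives the chunk length to copy
 * at a given offset, and 0 when the whole report has been copied. */
static uint32_t copyout_chunk(uint32_t alloc_size, uint32_t off)
{
	return off < alloc_size ? min_u32(PAGE_SIZE, alloc_size - off) : 0;
}
```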
[PATCH v8 1/2] Add bio/request flags to issue ZBC/ZAC commands
Add op flags to access zone information as well as open, close and reset zones: - REQ_OP_ZONE_REPORT - Query zone information (Report zones) - REQ_OP_ZONE_OPEN - Explicitly open a zone for writing - REQ_OP_ZONE_CLOSE - Explicitly close a zone - REQ_OP_ZONE_FINISH - Explicitly finish a zone - REQ_OP_ZONE_RESET - Reset Write Pointer to start of zone These op flags can be used to create bio's to control zoned devices through the block layer. This is useful for file systems and device mappers that need explicit control of zoned devices such as Host Managed and Host Aware SMR drives. Report zones is a device read that requires a buffer. Open, Close, Finish and Reset are device commands that have no associated data transfer. Open - Open a zone for writing. Close - Disallow writing to a zone. Finish - Disallow writing to a zone and set the WP to the end of the zone. Reset - Discard data in a zone and reset the WP to the start of the zone. Sending an LBA of ~0 will attempt to operate on all zones. This is typically used with Reset to wipe a drive, as a Reset behaves similarly to TRIM in that all data in the zone(s) is deleted. Report zones currently defaults to reporting on all zones. It is expected that support for the zone option flag will piggy-back on streamid support. The report option flag is useful as it can reduce the number of zones in each report, but it is not critical. Signed-off-by: Shaun Tancheff --- v8: - Added Finish Zone op - Fixed report zones copy to user to work when HARDENED_USERCOPY is enabled v6: - Added GFP_DMA to gfp mask. v5: - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent - In blk-lib fix documentation v4: - Rebase on linux-next tag next-20160617. - Change bio flags to bio op's V3: - Rebase on Mike Cristie's separate bio operations - Update blkzoned_api.h to include report zones PARTIAL bit. V2: - Changed bi_rw to op_flags to clarify separation of bio op from flags. 
- Fixed memory leak in blkdev_issue_zone_report failing to put_bio(). - Documented opt in blkdev_issue_zone_report. - Removed include/uapi/linux/fs.h from this patch. MAINTAINERS | 9 ++ block/blk-lib.c | 94 drivers/scsi/sd.c | 121 + drivers/scsi/sd.h | 1 + include/linux/bio.h | 8 +- include/linux/blk_types.h | 7 +- include/linux/blkdev.h| 1 + include/linux/blkzoned_api.h | 25 ++ include/uapi/linux/Kbuild | 1 + include/uapi/linux/blkzoned_api.h | 182 ++ 10 files changed, 447 insertions(+), 2 deletions(-) create mode 100644 include/linux/blkzoned_api.h create mode 100644 include/uapi/linux/blkzoned_api.h diff --git a/MAINTAINERS b/MAINTAINERS index a306795..aedf311 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -12984,6 +12984,15 @@ F: Documentation/networking/z8530drv.txt F: drivers/net/hamradio/*scc.c F: drivers/net/hamradio/z8530.h +ZBC AND ZBC BLOCK DEVICES +M: Shaun Tancheff +W: http://seagate.com +W: https://github.com/Seagate/ZDM-Device-Mapper +L: linux-bl...@vger.kernel.org +S: Maintained +F: include/linux/blkzoned_api.h +F: include/uapi/linux/blkzoned_api.h + ZBUD COMPRESSED PAGE ALLOCATOR M: Seth Jennings L: linux...@kvack.org diff --git a/block/blk-lib.c b/block/blk-lib.c index 083e56f..e92bd56 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -266,3 +266,97 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask); } EXPORT_SYMBOL(blkdev_issue_zeroout); + +/** + * blkdev_issue_zone_report - queue a report zones operation + * @bdev: target blockdev + * @op_flags: extra bio rw flags. If unsure, use 0. + * @sector:starting sector (report will include this sector). + * @opt: See: zone_report_option, default is 0 (all zones). + * @page: one or more contiguous pages. + * @pgsz: up to size of page in bytes, size of report. + * @gfp_mask: memory allocation flags (for bio_alloc) + * + * Description: + *Issue a zone report request for the sectors in question. 
+ */ +int blkdev_issue_zone_report(struct block_device *bdev, unsigned int op_flags, +sector_t sector, u8 opt, struct page *page, +size_t pgsz, gfp_t gfp_mask) +{ + struct bdev_zone_report *conv = page_address(page); + struct bio *bio; + unsigned int nr_iovecs = 1; + int ret = 0; + + if (pgsz < (sizeof(struct bdev_zone_report) + + sizeof(struct bdev_zone_descriptor))) + return -EINVAL; + + bio = bio_alloc(gfp_mask, nr_iovecs); + if (!bio) + return -ENOMEM; + + conv->descriptor_count = 0; + bio->bi_iter.bi_sect
[PATCH v8 0/2] Block layer support ZAC/ZBC commands
Hi Jens, This series is based on Linus' v4.8-rc2 branch. As Host Aware drives are becoming available we would like to be able to make use of such drives. This series is also intended to be suitable for use by Host Managed drives. ZBC [and ZAC] drives add new commands for discovering and working with Zones. Part one of this series expands the bio/request reserved op size from 3 to 4 bits and then adds op codes for each of the ZBC commands: Report zones, close zone, finish zone, open zone and reset zone. Part two of this series deals with integrating these new bio/request op's with Hannes' zone cache. This extends the ZBC support up to the block layer, allowing direct control by file systems or device mapper targets. Also, by deferring the zone handling to the authoritative subsystem, there is an overall lower memory usage for holding the active zone information, as well as a clear responsible party for maintaining the write pointer for each active zone. By way of example, a DM target may have several writes in progress. The sector (or LBA) for each of those writes will depend on the previous write. While the drive's write pointer will be updated as writes are completed, the DM target will be maintaining both where the next write should be scheduled from and where the write pointer is based on writes completed w/o errors. Knowing the drive zone topology enables DM targets and file systems to extend their block allocation schemes and issue write pointer resets (or discards) that are zone aligned. A perhaps non-obvious approach is that a conventional drive will return a zone report descriptor with a single large conventional zone. This is intended to allow a collection of zoned and non-zoned media to be stitched together to provide a file system with a zoned device with conventional space mapped to where it is useful. 
Patches for util-linux can be found here:
g...@github.com:stancheff/util-linux.git v2.28.1+biof
https://github.com/stancheff/util-linux/tree/v2.28.1%2Bbiof

This patch series is available here:
https://github.com/stancheff/linux/tree/v4.8-rc2%2Bbiof.v8
g...@github.com:stancheff/linux.git v4.8-rc2+biof.v8

v8:
 - Changed zone report to default to reading from the zone cache.
 - Changed ioctl for zone commands to support forcing a query or command
   to be sent to media.
 - Fixed report zones copy to user to work when HARDENED_USERCOPY is enabled.
v7:
 - Initial support for Hannes' zone cache.
v6:
 - Fix page alloc to include DMA flag for ioctl.
v5:
 - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent.
 - In blk-lib fix documentation.
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio ops.
 - Dropped ata16 hackery.
v3:
 - Rebase on Mike Christie's separate bio operations.
 - Update blkzoned_api.h to include report zones PARTIAL bit.
 - Use zoned report reserved bit for ata-passthrough flag.
v2:
 - Changed bi_rw to op_flags to clarify separation of bio op from flags.
 - Fixed memory leak in blkdev_issue_zone_report failing to bio_put().
 - Documented opt in blkdev_issue_zone_report.
 - Moved include/uapi/linux/fs.h changes to patch 3.
 - Fixed commit message for first patch in series.

Shaun Tancheff (2):
  Add bio/request flags to issue ZBC/ZAC commands
  Add ioctl to issue ZBC/ZAC commands via block layer

 MAINTAINERS                       |   9 ++
 block/blk-lib.c                   |  94 +
 block/ioctl.c                     | 149 +++
 drivers/scsi/sd.c                 | 121 ++
 drivers/scsi/sd.h                 |   1 +
 include/linux/bio.h               |   8 +-
 include/linux/blk_types.h         |   7 +-
 include/linux/blkdev.h            |   1 +
 include/linux/blkzoned_api.h      |  25 +
 include/uapi/linux/Kbuild         |   1 +
 include/uapi/linux/blkzoned_api.h | 210 ++
 include/uapi/linux/fs.h           |   1 +
 12 files changed, 625 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/blkzoned_api.h
 create mode 100644 include/uapi/linux/blkzoned_api.h
-- 
2.9.3
Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
On Tue, Aug 9, 2016 at 11:38 PM, Damien Le Moal wrote: > Shaun, > > On 8/10/16 12:58, Shaun Tancheff wrote: >> >> On Tue, Aug 9, 2016 at 3:09 AM, Damien Le Moal >> wrote: >>>> >>>> On Aug 9, 2016, at 15:47, Hannes Reinecke wrote: >> >> >> [trim] >> >>>>> Since disk type == 0 for everything that isn't HM so I would prefer the >>>>> sysfs 'zoned' file just report if the drive is HA or HM. >>>>> >>>> Okay. So let's put in the 'zoned' attribute the device type: >>>> 'host-managed', 'host-aware', or 'device managed'. >>> >>> >>> I hacked your patches and simply put a "0" or "1" in the sysfs zoned >>> file. >>> Any drive that has ZBC/ZAC command support gets a "1", "0" for everything >>> else. This means that drive managed models are not exposed as zoned block >>> devices. For HM vs HA differentiation, an application can look at the >>> device type file since it is already present. >>> >>> We could indeed set the "zoned" file to the device type, but HM drives >>> and >>> regular drives will both have "0" in it, so no differentiation possible. >>> The other choice could be the "zoned" bits defined by ZBC, but these >>> do not define a value for host managed drives, and the drive managed >>> value >>> being not "0" could be confusing too. So I settled for a simple 0/1 >>> boolean. >> >> >> This seems good to me. > > > Another option I forgot is for the "zoned" file to indicate the total number > of zones of the device, and 0 for a non zoned regular block device. That > would work as well. Clearly either is sufficient. > [...] >>> >>> Done: I hacked Shaun ioctl code and added finish zone too. The >>> difference with Shaun initial code is that the ioctl are propagated down >>> to >>> the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need >>> for >>> BIO request definition for the zone operations. So a lot less code added. >> >> >> The purpose of the BIO flags is not to enable the ioctls so much as >> the other way round. 
Creating BIO op's is to enable issuing ZBC
>> commands from device mapper targets and file systems without some
>> heinous ioctl hacks.
>> Making the resulting block layer interfaces available via ioctls is just a
>> reasonable way to exercise the code ... or that was my intent.
>
> Yes, I understood your code. However, since (or if) we keep the zone
> information in the RB-tree cache, there is no need for the report zone
> operation BIO interface. Same for reset write pointer by keeping the mapping
> to discard. blk_lookup_zone can be used in kernel as a report zone BIO
> replacement and works as well for the report zone ioctl implementation. For
> reset, there is blkdev_issue_discard in kernel, and the reset zone ioctl
> becomes equivalent to BLKDISCARD ioctl. These are simple. Open, close and
> finish zone remain. For these, adding the BIO interface seemed an overkill.
> Hence my choice of propagating the ioctl to the driver.
> This is debatable of course, and adding an in-kernel interface is not hard:
> we can implement blk_open_zone, blk_close_zone and blk_finish_zone using
> __blkdev_driver_ioctl. That looks clean to me.

Uh. I would call that "heinous" ioctl hacks myself.
Kernel -> User API -> Kernel is not really a good design, IMO.

> Overall, my concern with the BIO based interface for the ZBC commands is
> that it adds one flag for each command, which is not really the philosophy
> of the interface and potentially opens the door for more such
> implementations in the future with new standards and new commands coming up.
> Clearly that is not a sustainable path. So I think that a more specific
> interface for these zone operations is a better choice. That is consistent
> with what happens with the tons of ATA and SCSI commands not actually doing
> data I/Os (mode sense, log pages, SMART, etc). All these do not use BIOs and
> are processed as request REQ_TYPE_BLOCK_PC.
Part of the reason for following on Mike Christie's bio op/flags cleanup
was to make these ops. The advantage of being added as ops is that only
1 extra bit is needed (not 4 or 5 bits for flags). The other reason for
promoting these into the block layer as commands is that it seems to
make sense for these abstractions to be passed through a DM layer and
be handled by a file system.
Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
On Mon, Aug 15, 2016 at 11:00 PM, Damien Le Moal wrote:
>
> Shaun,
>
>> On Aug 14, 2016, at 09:09, Shaun Tancheff wrote:
> […]
>>>
>>> No, surely not.
>>> But one of the _big_ advantages for the RB tree is blkdev_discard().
>>> Without the RB tree any mkfs program will issue a 'discard' for every
>>> sector. We will be able to coalesce those into one discard per zone, but
>>> we still need to issue one for _every_ zone.
>>
>> How can you make coalesce work transparently in the
>> sd layer _without_ keeping some sort of a discard cache along
>> with the zone cache?
>>
>> Currently the block layer's blkdev_issue_discard() is breaking
>> large discards into nice granular and aligned chunks but it is
>> not preventing small discards nor coalescing them.
>>
>> In the sd layer would there be a way to persist or purge an
>> overly large discard cache? What about honoring
>> discard_zeroes_data? Once the discard is completed with
>> discard_zeroes_data you have to return zeroes whenever
>> a discarded sector is read. Isn't that a lot more than just
>> tracking a write pointer? Couldn't a zone have dozens of holes?
>
> My understanding of the standards regarding discard is that it is not
> mandatory and that it is a hint to the drive. The drive can completely
> ignore it if it thinks that is a better choice. I may be wrong on this
> though. Need to check again.

But you are currently setting discard_zeroes_data=1 in your current
patches. I believe that setting discard_zeroes_data=1 effectively
promotes discards to being mandatory.

I have a follow-on patch to my SCT Write Same series that handles
the CMR zone case in the sd_zbc_setup_discard() handler.

> For reset write pointer, the mapping to discard requires that the calls
> to blkdev_issue_discard be zone aligned for anything to happen. Specify
> less than a zone and nothing will be done. This I think preserves the
> discard semantic.

Oh. If that is the intent then there is just a bug in the handler.
I have pointed out where I believe it to be in my response to the
zone cache patch being posted.

> As for the “discard_zeroes_data” thing, I also think that is a drive
> feature, not mandatory. Drives may have it or not, which is consistent
> with the ZBC/ZAC standards regarding reading after the write pointer
> (nothing says that zeros have to be returned). In any case, discard of
> CMR zones will be a nop, so for SMR drives, discard_zeroes_data=0 may
> be a better choice.

However I am still curious about discards being coalesced.

>>> Which is (as indicated) really slow, and easily takes several minutes.
>>> With the RB tree we can short-circuit discards to empty zones, and speed
>>> up processing time dramatically.
>>> Sure we could be moving the logic into mkfs and friends, but that would
>>> require us to change the programs and agree on a library (libzbc?) which
>>> should be handling that.
>>
>> F2FS's mkfs.f2fs is already reading the zone topology via SG_IO ...
>> so I'm not sure your argument is valid here.
>
> This initial SMR support patch is just that: a first try. Jaegeuk
> used SG_IO (in fact copy-paste of parts of libzbc) because the current
> ZBC patch-set has no ioctl API for zone information manipulation. We
> will fix this mkfs.f2fs once we agree on an ioctl interface.

Which again is my point. If mkfs.f2fs wants to speed up its discard
pass by _not_ sending unnecessary Reset WP commands for zones that are
already empty, it has all the information it needs to do so.

Here it seems to me that the zone cache is _at_best_ doing double work.
At worst the zone cache could be doing the wrong thing _if_ the zone
cache got out of sync. It is certainly possible (however unlikely) that
someone was doing some raw sg activity that is not seen by the sd path.

All I am trying to do is have a discussion about the reasons for and
against having a zone cache.
Where it works and where it breaks should be entirely technical, but I
understand that we have all spent a lot of time _not_ discussing this
for various non-technical reasons.

So far the only reason I've been able to ascertain is that Host Managed
drives really don't like being stuck with the URSWRZ and would like to
have a software hack to return MUD rather than ship drives with some
weird out-of-the-box config where the last zone is marked as FINISH'd,
thereby returning MUD on reads as per spec.

I understand that it would be a strange state to see on first boot and
likely people would just do a Reset WP and have weird boot errors, which
would probably just make matters worse. I just would rather the
workaround be a bit cleaner and/or use less
Re: [PATCH] block: Fix secure erase
if (bio_op(bio) == REQ_OP_SECURE_ERASE)
> +		return 1;
> +
>  	if (bio_op(bio) == REQ_OP_WRITE_SAME)
>  		return 1;
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 2c210b6a7bcf..e79055c8b577 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -882,7 +882,7 @@ static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
>  static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
>  						     int op)
>  {
> -	if (unlikely(op == REQ_OP_DISCARD))
> +	if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE))
>  		return min(q->limits.max_discard_sectors, UINT_MAX >> 9);
>
>  	if (unlikely(op == REQ_OP_WRITE_SAME))
> @@ -913,7 +913,9 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
>  	if (unlikely(rq->cmd_type != REQ_TYPE_FS))
>  		return q->limits.max_hw_sectors;
>
> -	if (!q->limits.chunk_sectors || (req_op(rq) == REQ_OP_DISCARD))
> +	if (!q->limits.chunk_sectors ||
> +	    req_op(rq) == REQ_OP_DISCARD ||
> +	    req_op(rq) == REQ_OP_SECURE_ERASE)
>  		return blk_queue_get_max_sectors(q, req_op(rq));
>
>  	return min(blk_max_size_offset(q, offset),
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 7598e6ca817a..dbafc5df03f3 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -223,7 +223,7 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
>  	what |= MASK_TC_BIT(op_flags, META);
>  	what |= MASK_TC_BIT(op_flags, PREFLUSH);
>  	what |= MASK_TC_BIT(op_flags, FUA);
> -	if (op == REQ_OP_DISCARD)
> +	if (op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE)
>  		what |= BLK_TC_ACT(BLK_TC_DISCARD);
>  	if (op == REQ_OP_FLUSH)
>  		what |= BLK_TC_ACT(BLK_TC_FLUSH);
> --
> 1.9.1

-- 
Shaun Tancheff
Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
On Tue, Aug 9, 2016 at 1:47 AM, Hannes Reinecke wrote:
> On 08/05/2016 10:35 PM, Shaun Tancheff wrote:
>> On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal wrote:
>>> Hannes, Shaun,
>>>
>>> Let me add some more comments.
>>>
>>>> On Aug 2, 2016, at 23:35, Hannes Reinecke wrote:
>>>>
>>>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>>>>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig wrote:
>>>>>>
>>>>>> Can you please integrate this with Hannes series so that it uses
>>>>>> his cache of the zone information?
>>>>>
>>>>> Adding Hannes and Damien to Cc.
>>>>>
>>>>> Christoph,
>>>>>
>>>>> I can make a patch that marshals Hannes' RB-Tree into a block report,
>>>>> that is quite simple. I can even have the open/close/reset zone
>>>>> commands update the RB-Tree .. the non-private parts anyway. I would
>>>>> prefer to do this around the CONFIG_SD_ZBC support, offering the
>>>>> existing type of patch for setups that do not need the RB-Tree to
>>>>> function with zoned media.
>>
>> I have posted patches to integrate with the zone cache, hopefully they
>> make sense.
>>
> [ .. ]
>>>> I have thought about condensing the RB tree information, but then I
>>>> figured that for 'real' SMR handling we cannot assume all zones are of
>>>> fixed size, and hence we need all the information there.
>>>> Any condensing method would assume a given structure of the zones, which
>>>> the standard just doesn't provide.
>>>> Or am I missing something here?

Of course you can condense the zone cache without losing any
information. Here is the layout I used ... I haven't updated the patch
to the latest posted patches but this is the basic idea.

[It was originally done as a follow-on of making your zone cache work
with Seagate's HA drive.
I did not include the wp-in-arrays patch along with the HA drive support
that I sent you in May, as you were quite terse about RB trees when I
tried to discuss this approach with you at Vault]

struct blk_zone {
	unsigned type:4;
	unsigned state:5;
	unsigned extra:7;
	unsigned wp:40;
	void *private_data;
};

struct contiguous_wps {
	u64 start_lba;
	u64 last_lba;	/* or # of blocks */
	u64 zone_size;	/* size in blocks */
	unsigned is_zoned:1;
	u32 zone_count;
	spinlock_t lock;
	struct blk_zone zones[0];
};

struct zone_wps {
	u32 wps_count;
	struct contiguous_wps **wps;
};

Then in struct request_queue:

-	struct rb_root zones;
+	struct zone_wps *zones;

For each contiguous chunk of zones you need a descriptor. For current
drives you need 1 or 2 descriptors. Here a conventional drive is
encapsulated as zoned media with one drive-sized conventional zone.

I have not spent time building an ad-hoc LVM comprised of zoned and
conventional media, so it's not all ironed out yet. I think you can see
the advantage of being able to put conventional space anywhere you would
like, to work around zoned media not being laid out in the best manner
for your setup. Yes, things start to break down if every other zone is a
different size ..

The point being that even with zone descriptors on the order of 48 bytes
in size this saves a lot of space with no loss of information. I still
kind of prefer pushing blk_zone down to a u32 by reducing the max zone
size and dropping the private_data ... but that may be going a bit too
far.
blk_lookup_zone then has an [unfortunate] signature change:

/**
 * blk_lookup_zone() - Lookup zones
 * @q: Request Queue
 * @sector: Location to lookup
 * @start: Starting location zone (OUT: Required)
 * @len: Length of zone (OUT: Required)
 * @lock: Spinlock of zones (OUT: Required)
 */
struct blk_zone *blk_lookup_zone(struct request_queue *q, sector_t sector,
				 sector_t *start, sector_t *len,
				 spinlock_t **lock)
{
	int iter;
	struct blk_zone *bzone = NULL;
	struct zone_wps *zi = q->zones;

	*start = 0;
	*len = 0;
	*lock = NULL;
	if (!q->zones)
		goto out;

	for (iter = 0; iter < zi->wps_count; iter++) {
		if (sector >= zi->wps[iter]->start_lba &&
		    sector < zi->wps[iter]->last_lba) {
			struct contiguous_wps *wp = zi->wps[iter];
			u64 index = (sector - wp->start_lba) / wp->zone_size;
[PATCH] Update WRITE_SAME timeout in sd_setup_discard_cmnd
In sd_setup_discard_cmnd() there are some discard methods that fall back
to using WRITE_SAME. It appears that those paths using WRITE_SAME should
also use the SD_WRITE_SAME_TIMEOUT instead of the default SD_TIMEOUT.

Signed-off-by: Shaun Tancheff
---
I don't have a use case that breaks the current code. It just seems to
me that setups for discard and write same should be consistent.
---
 drivers/scsi/sd.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index d3e852a..3c15f3a 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -722,6 +722,8 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
 	if (!page)
 		return BLKPREP_DEFER;
 
+	rq->timeout = SD_TIMEOUT;
+
 	switch (sdkp->provisioning_mode) {
 	case SD_LBP_UNMAP:
 		buf = page_address(page);
@@ -746,6 +748,7 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
 		put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
 
 		len = sdkp->device->sector_size;
+		rq->timeout = SD_WRITE_SAME_TIMEOUT;
 		break;
 
 	case SD_LBP_WS10:
@@ -758,6 +761,7 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
 		put_unaligned_be16(nr_sectors, &cmd->cmnd[7]);
 
 		len = sdkp->device->sector_size;
+		rq->timeout = SD_WRITE_SAME_TIMEOUT;
 		break;
 
 	default:
@@ -766,8 +770,6 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
 	}
 
 	rq->completion_data = page;
-	rq->timeout = SD_TIMEOUT;
-
 	cmd->transfersize = len;
 	cmd->allowed = SD_MAX_RETRIES;
-- 
2.8.1
Re: [PATCH v5 2/2] Add support for SCT Write Same
On Wed, Aug 10, 2016 at 6:30 AM, Tom Yan wrote:
> On 10 August 2016 at 14:31, Tom Yan wrote:
>> I don't really know about SCT Write Same but there is one concern I
>> could think of.
>>
>> libata's SATL would report a maximum write same length based on the
>> number of sectors a one-block TRIM payload can describe at most, which
>> is 65535 * 64 = 4194240 (see ata_scsiop_inq_b0 in libata-scsi.c). If
>> the drive does not support TRIM, it will not report such length. That
>> is technically fine, as per SBC standard, and I suppose the SCSI disk
>> driver would use SD_MAX_WS16_BLOCKS = 0x7fffff (8388607).
>
> Actually it will use SD_MAX_WS10_BLOCKS = 0xffff (65535) in such case.
> See sd_config_write_same() in sd.c. So if the device supports TRIM,
> each SCT Write Same will cover 4194240 logical sectors; if it does
> not, each will cover 65535 logical sectors. In that case, perhaps we
> should report the same value in the Maximum Write Same Length field
> when (only) SCT Write Same is supported? (Not the Optimal Unmap
> Granularity field though).

You are correct in that we can advertise the larger limit in
ata_scsi_dev_config() when only SCT write same is supported
rather than fall back to WS10.

TRIM is bound by an interface maximum. You can only stuff 64 entries
of a 16-bit length followed by a 48-bit LBA into a 512-byte block.

SCT is not restricted (you can wipe an entire drive); however, there is
a practical limit in that I have coded the SCT to operate in the
foreground, so the command could time out depending on how fast the
media can write.

On my machine the default timeout is 30s, so to clear 4194240 (16G):
 30s  -> 547 MB/s
 60s  -> 274 MB/s
 90s  -> 183 MB/s
 120s -> 137 MB/s

So for my drives 8G and 30s or 16G and 60s is fine.
For older or slow drives 4G and 30s should be fine.

I really am not sure what would be considered the correct solution
though. I believe that the WRITE SAME defaults are currently being
chosen around physical limits.
We could reduce the trim to 16 entries when SCT is available and
bump SCT to the same 16 * 65535 maximum?

I think we can also bump the command timeout for WRITE SAME?

Suggestions are welcome.
-- 
Shaun Tancheff
Re: [PATCH v5 2/2] Add support for SCT Write Same
On Wed, Aug 10, 2016 at 11:50 AM, Tom Yan wrote:
> On 10 August 2016 at 14:34, Shaun Tancheff wrote:
>> You are correct in that we can advertise the larger limit in
>> ata_scsi_dev_config() when only SCT write same is supported
>> rather than fall back to WS10.
>
> ata_scsi_dev_config()? Not sure if I follow. We should only need to
> report Maximum Write Same Length in the Block Limit VPD
> (ata_scsiop_inq_b0).
>
>> TRIM is bound by an interface maximum. You can only stuff 64 entries
>> of a 16-bit length followed by a 48-bit LBA into a 512-byte block.
>
> Well that is actually the minimum. Modern SSDs often support more than
> a one-block payload (e.g. 8, 16...). It's just our SCSI disk driver
> statically limiting it to the minimum. Though it allows only 0xffffffff /
> 512 = 8388607 (SD_MAX_WS16_BLOCKS) blocks per WRITE SAME (16) command
> anyway, so we can at most allow only a 2-block (well, or 3-block)
> payload.

Ah. Thanks for the clarification.

>> SCT is not restricted (you can wipe an entire drive) however there
>> is a practical limit in that I have coded the SCT to operate
>> in the foreground so the command could time out depending
>> on how fast the media can write.
>>
>> On my machine the default timeout is 30s so to clear 4194240 (16G):
>
> You are talking about an AF 4Kn drive I suppose? For a 512e drive it
> should be only ~2G.

I stand corrected. Since all the kernel math is in 512-byte sectors you
are absolutely correct and this isn't an issue at all.

We should report SD_MAX_WS16_BLOCKS when only SCT is available and
4194240 when TRIM is available.

You can safely ignore the remainder of my pointless rambling.
Thanks for your patience nonetheless.
I believe that the WRITE SAME defaults >> are currently being chosen around physical limits. > > Not sure about what WRITE SAME defaults and physical limits you are > referring to. Just that the WRITE SAME limit SD_MAX_WS16_BLOCKS is derived from the request interface as opposed to some other arbitrary limit. >> >> We could reduce the trim to 16 entries when SCT is available and >> bump SCT to the same 16 * 63335 maximum? > > I am not sure if that's a good idea. Small TRIM payloads (hence more > TRIM commands) could lead to noticeable overhead in my experience. But > if 4194240 blocks is really too many for SCT Write Same in any case, I > guess we will have to compromise, since the Maximum Write Same Length > field is shared. (Now it feels unfortunate that we decided to switch > from UNMAP -> TRIM to WRITE SAME (16) -> TRIM long ago.) The question > is, do we want the value to stay at 4194240 when SCT Write Same is not > available? > > I have no idea what the value should be. But, given the fact sector > size seems to matter much in the SCT case, perhaps at the very least, > we would want to derive the multiplier from that? > >> >> I think we can also bump the command timeout for WRITE SAME? > > I have no idea where the timeout comes from. Is it even a thing in the > kernel (instead of one in the firmware of the drive or the ACS > standard)? Oh ... the timeout I was thinking of is this (or defaulted from): /sys/block/sdX/device/timeout It's the struct request_queue's 'rq_timeout' At least that's where my line of thought was going. I will update the patch to use SD_MAX_WS16_BLOCKS and 4194240 as appropriate. Regards, Shaun
Re: [PATCH v5 1/2] Use kmap_atomic when rewriting attached page
On Wed, Aug 10, 2016 at 5:56 AM, Tom Yan wrote:
> On 10 August 2016 at 09:00, Shaun Tancheff wrote:
>>  static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
>>  {
>>  	struct ata_taskfile *tf = &qc->tf;
>>  	struct scsi_cmnd *scmd = qc->scsicmd;
>>  	struct ata_device *dev = qc->dev;
>>  	const u8 *cdb = scmd->cmnd;
>> +	struct scatterlist *sg;
>>  	u64 block;
>>  	u32 n_block;
>> +	const u32 trmax = ATA_MAX_TRIM_RNUM;
>>  	u32 size;
>> -	void *buf;
>>  	u16 fp;
>>  	u8 bp = 0xff;
>>
>> @@ -3319,10 +3363,9 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
>>  	if (!scsi_sg_count(scmd))
>>  		goto invalid_param_len;
>>
>> -	buf = page_address(sg_page(scsi_sglist(scmd)));
>> -
>> -	if (n_block <= 65535 * ATA_MAX_TRIM_RNUM) {
>> -		size = ata_set_lba_range_entries(buf, ATA_MAX_TRIM_RNUM, block, n_block);
>> +	sg = scsi_sglist(scmd);
>> +	if (n_block <= 0xffff * cmax) {
>
> Although this got moved and corrected in the next patch, but perhaps
> you should still correct the `cmax` here, which should be `trmax`.

You are correct. I will fix this up.
Thanks!

>> +		size = ata_format_dsm_trim_descr(sg, trmax, block, n_block);
>>  	} else {
>>  		fp = 2;
>>  		goto invalid_fld;

-- 
Shaun Tancheff
Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
happily merge BIOs across zone
> boundaries and the discard code will not split and align calls on the
> zones. But upper layers (an FS or a device mapper) can still do all
> this by themselves if they want/can support non-constant zone sizes.
>
> The only exception is drives like the Seagate one with only the last zone of a
> different size. This case is handled exactly as if all zones are the same size
> simply because any operation on the last smaller zone will naturally align as
> the checks of operation size against the drive capacity will do the right
> things.
>
> The ioctls work for all cases (drive with constant zone size or not). This is
> again to allow supporting eventual weird drives at application level. I
> integrated all these ioctl into libzbc block device backend driver and
> everything is fine. Can't tell the difference with direct-to-drive SG_IO
> accesses. But unlike these, the zone ioctls keep the zone information
> RB-tree cache up to date.
>
>> I will be updating my patchset accordingly.
>
> I need to cleanup my code and rebase on top of 4.8-rc1. Let me do this and I
> will send everything for review. If you have any comment on the above,
> please let me know and I will be happy to incorporate changes.
>
> Best regards.
>
> Damien Le Moal, Ph.D.
> Sr. Manager, System Software Group, HGST Research,
> HGST, a Western Digital brand
> damien.lem...@hgst.com
> (+81) 0466-98-3593 (ext. 513593)
> 1 kirihara-cho, Fujisawa,
> Kanagawa, 252-0888 Japan
> www.hgst.com

-- 
Shaun Tancheff
Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
On Tue, Aug 9, 2016 at 1:47 AM, Hannes Reinecke wrote:
> On 08/05/2016 10:35 PM, Shaun Tancheff wrote:
>> On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal wrote:
>>>> On Aug 2, 2016, at 23:35, Hannes Reinecke wrote:
>>>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>>>>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig wrote:

[trim]

>> Also the zone report is 'slow' in that there is an overhead for the
>> report itself, but the number of zones per query can be quite large,
>> so 4 or 5 I/Os that run into the hundreds of milliseconds to cache
>> the entire drive isn't really unworkable for something that is used
>> infrequently.
>
> No, surely not.
> But one of the _big_ advantages for the RB tree is blkdev_discard().
> Without the RB tree any mkfs program will issue a 'discard' for every
> sector. We will be able to coalesce those into one discard per zone, but
> we still need to issue one for _every_ zone.
> Which is (as indicated) really slow, and easily takes several minutes.
> With the RB tree we can short-circuit discards to empty zones, and speed
> up processing time dramatically.
> Sure we could be moving the logic into mkfs and friends, but that would
> require us to change the programs and agree on a library (libzbc?) which
> should be handling that.

Adding an additional library dependency seems overkill for a program
that is already doing ioctls and raw block I/O ... but I would leave
that up to each file system. As it sits, issuing the ioctl and walking
the array of data returned [see blkreport.c] is already quite trivial.

I believe the goal here is for F2FS, and perhaps NILFS, to "just work"
with the DISCARD to Reset WP and zone cache in place.

Still quite skeptical about other common file systems "just working"
without their respective mkfs et al. being zone aware and handling the
topology of the media at mkfs time. Perhaps there is something I am
unaware of?

[trim]

>> I can add finish zone ... but I really can't think of a use for it,
>> myself.
>> > Which is not the point. The standard defines this, so clearly someone > found it a reasonable addendum. So let's add this for completeness. Agreed and queued for the next version. Regards, Shaun
[PATCH v5 2/2] Add support for SCT Write Same
SATA drives may support write same via SCT. This is useful for setting
the drive contents to a specific pattern (0's).

Translate a SCSI WRITE SAME command to be either a DSM TRIM command or
an SCT Write Same command, based on the UNMAP flag:
 - When set, translate to DSM TRIM
 - When not set, translate to SCT Write Same

Signed-off-by: Shaun Tancheff
---
v5:
 - Addressed review comments.
 - Report support for ZBC only for zoned devices.
 - kmap page during rewrite.
 - Fix unmap set to require trim or error; if not unmap, then SCT Write
   Same or error.
v4:
 - Added partial MAINTENANCE_IN opcode simulation.
 - Dropped all changes in drivers/scsi/*.
 - Changed to honor the UNMAP flag -> TRIM, no UNMAP -> SCT.
v3:
 - Demux UNMAP/TRIM from WRITE SAME.
v2:
 - Remove fugly ata hacking from sd.c.
---
 drivers/ata/libata-scsi.c | 189 +++---
 include/linux/ata.h       |  43 +++
 2 files changed, 205 insertions(+), 27 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index a71067a..99b0e6c 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1159,8 +1159,6 @@ static void ata_scsi_sdev_config(struct scsi_device *sdev)
 {
 	sdev->use_10_for_rw = 1;
 	sdev->use_10_for_ms = 1;
-	sdev->no_report_opcodes = 1;
-	sdev->no_write_same = 1;
 
 	/* Schedule policy is determined by ->qc_defer() callback and
 	 * it needs to see every deferred qc. Set dev_blocked to 1 to
@@ -3325,6 +3323,41 @@ static unsigned int ata_format_dsm_trim_descr(struct scatterlist *sg, u32 num,
 	return used_bytes;
 }
 
+/**
+ * ata_format_sct_write_same() - SATL Write Same to ATA SCT Write Same
+ * @sg: Scatter / Gather list attached to command.
+ * @lba: Starting sector
+ * @num: Number of sectors to be zero'd.
+ *
+ * Rewrite the WRITE SAME descriptor to be an SCT Write Same formatted
+ * descriptor.
+ * NOTE: Writes a pattern (0's) in the foreground.
+ *       Large write-same requests can time out.
+ */ +static void ata_format_sct_write_same(struct scatterlist *sg, u64 lba, u64 num) +{ + void *ptr = kmap_atomic(sg_page(sg)); + u16 *sctpg = ptr + sg->offset; + + put_unaligned_le16(0x0002, &sctpg[0]); /* SCT_ACT_WRITE_SAME */ + put_unaligned_le16(0x0101, &sctpg[1]); /* WRITE PTRN FG */ + put_unaligned_le64(lba, &sctpg[2]); + put_unaligned_le64(num, &sctpg[6]); + put_unaligned_le32(0u, &sctpg[10]); + + kunmap_atomic(ptr); +} + +/** + * ata_scsi_write_same_xlat() - SATL Write Same to ATA SCT Write Same + * @qc: Command to be translated + * + * Translate a SCSI WRITE SAME command to be either a DSM TRIM command or + * an SCT Write Same command. + * Based on whether WRITE SAME has the UNMAP flag: + * When set translate to DSM TRIM + * When clear translate to SCT Write Same + */ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) { struct ata_taskfile *tf = &qc->tf; @@ -3338,6 +3371,7 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) u32 size; u16 fp; u8 bp = 0xff; + u8 unmap = cdb[1] & 0x8; /* we may not issue DMA commands if no DMA mode is set */ if (unlikely(!dev->dma_mode)) @@ -3350,10 +3384,23 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) scsi_16_lba_len(cdb, &block, &n_block); /* for now we only support WRITE SAME with the unmap bit set */ - if (unlikely(!(cdb[1] & 0x8))) { - fp = 1; - bp = 3; - goto invalid_fld; + if (unmap) { + if ((dev->horkage & ATA_HORKAGE_NOTRIM) || + !ata_id_has_trim(dev->id)) { + fp = 1; + bp = 3; + goto invalid_fld; + } + if (n_block > 0xffff * trmax) { + fp = 2; + goto invalid_fld; + } + } else { + if (!ata_id_sct_write_same(dev->id)) { + fp = 1; + bp = 3; + goto invalid_fld; + } } /* @@ -3364,30 +3411,42 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) goto invalid_param_len; sg = scsi_sglist(scmd); - if (n_block <= 0xffff * trmax) { + if (unmap) { size = ata_format_dsm_trim_descr(sg, trmax, block, n_block); + if (ata_ncq_enabled(dev) && 
ata_fpdma_dsm_supported(dev)) { + /* Newer devices support queued TRIM commands */ + tf->protocol = ATA_PROT_NCQ; + tf->command = ATA_CMD_FPDMA_SEND; + tf->hob_nsect = ATA_SUBCMD_FPDMA_SEND_
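The SCT Write Same "key sector" assembled by ata_format_sct_write_same() above is easy to sanity-check in isolation. The sketch below mirrors the field layout in userspace C; the put_le* helpers are stand-ins for the kernel's put_unaligned_le*() (not real API), and the byte offsets follow the sctpg[] word indices in the patch (word 2 = byte 4, word 6 = byte 12, word 10 = byte 20).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for the kernel's put_unaligned_le16/32/64() helpers. */
static void put_le16(uint16_t v, uint8_t *p)
{
	p[0] = (uint8_t)v;
	p[1] = (uint8_t)(v >> 8);
}

static void put_le32(uint32_t v, uint8_t *p)
{
	for (int i = 0; i < 4; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

static void put_le64(uint64_t v, uint8_t *p)
{
	for (int i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

/*
 * Build a 512-byte SCT command key sector the way the patch does.
 * sctpg[] in the patch indexes 16-bit words, so &sctpg[2] is byte
 * offset 4 and &sctpg[6] is byte offset 12.
 */
static void sct_write_same_sector(uint8_t *buf, uint64_t lba, uint64_t nsect)
{
	memset(buf, 0, 512);
	put_le16(0x0002, buf + 0);  /* action: SCT_ACT_WRITE_SAME */
	put_le16(0x0101, buf + 2);  /* function: write pattern, foreground */
	put_le64(lba, buf + 4);     /* starting LBA */
	put_le64(nsect, buf + 12);  /* number of sectors to pattern-fill */
	put_le32(0, buf + 20);      /* 32-bit pattern: zeros */
}
```

Writing the pattern in the foreground is also why the patch warns that large requests can hit command timeouts: the drive does all the work inside this one command.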
[PATCH v5 1/2] Use kmap_atomic when rewriting attached page
The current SATL for WRITE_SAME does not protect against misaligned pages. Additionally the associated page should also be kmap'd when being modified. Signed-off-by: Shaun Tancheff --- v5: Added prep patch to work with non-page aligned scatterlist pages and use kmap_atomic() to lock page during modification. drivers/ata/libata-scsi.c | 53 ++- include/linux/ata.h | 26 --- 2 files changed, 48 insertions(+), 31 deletions(-) diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index e207b33..a71067a 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -3282,16 +3282,60 @@ static unsigned int ata_scsi_pass_thru(struct ata_queued_cmd *qc) return 1; } +/** + * ata_format_dsm_trim_descr() - SATL Write Same to DSM Trim + * @sg: Scatter / Gather list attached to command. + * @num: Maximum number of entries (nominally 64). + * @sector: Starting sector + * @count: Total Range of request + * + * Rewrite the WRITE SAME descriptor to be a DSM TRIM little-endian formatted + * descriptor. + * + * Up to 64 entries of the format: + * 63:48 Range Length + * 47:0 LBA + * + * Range Length of 0 is ignored. + * LBAs should be in sorted order and not overlap. + * + * NOTE: this is the same format as ADD LBA(S) TO NV CACHE PINNED SET + */ +static unsigned int ata_format_dsm_trim_descr(struct scatterlist *sg, u32 num, + u64 sector, u32 count) +{ + void *ptr = kmap_atomic(sg_page(sg)); + __le64 *buffer = ptr + sg->offset; + u32 i = 0, used_bytes; + + while (i < num) { + u64 entry = sector | + ((u64)(count > 0xffff ? 
0xffff : count) << 48); + buffer[i++] = __cpu_to_le64(entry); + if (count <= 0xffff) + break; + count -= 0xffff; + sector += 0xffff; + } + + used_bytes = ALIGN(i * 8, 512); + memset(buffer + i, 0, used_bytes - i * 8); + + kunmap_atomic(ptr); + return used_bytes; +} + static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) { struct ata_taskfile *tf = &qc->tf; struct scsi_cmnd *scmd = qc->scsicmd; struct ata_device *dev = qc->dev; const u8 *cdb = scmd->cmnd; + struct scatterlist *sg; u64 block; u32 n_block; + const u32 trmax = ATA_MAX_TRIM_RNUM; u32 size; - void *buf; u16 fp; u8 bp = 0xff; @@ -3319,10 +3363,9 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) if (!scsi_sg_count(scmd)) goto invalid_param_len; - buf = page_address(sg_page(scsi_sglist(scmd))); - - if (n_block <= 65535 * ATA_MAX_TRIM_RNUM) { - size = ata_set_lba_range_entries(buf, ATA_MAX_TRIM_RNUM, block, n_block); + sg = scsi_sglist(scmd); + if (n_block <= 0xffff * trmax) { + size = ata_format_dsm_trim_descr(sg, trmax, block, n_block); } else { fp = 2; goto invalid_fld; diff --git a/include/linux/ata.h b/include/linux/ata.h index adbc812..45a1d71 100644 --- a/include/linux/ata.h +++ b/include/linux/ata.h @@ -1071,32 +1071,6 @@ static inline void ata_id_to_hd_driveid(u16 *id) #endif } -/* - * Write LBA Range Entries to the buffer that will cover the extent from - * sector to sector + count. This is used for TRIM and for ADD LBA(S) - * TO NV CACHE PINNED SET. - */ -static inline unsigned ata_set_lba_range_entries(void *_buffer, - unsigned num, u64 sector, unsigned long count) -{ - __le64 *buffer = _buffer; - unsigned i = 0, used_bytes; - - while (i < num) { - u64 entry = sector | - ((u64)(count > 0xffff ? 
0xffff : count) << 48); - buffer[i++] = __cpu_to_le64(entry); - if (count <= 0xffff) - break; - count -= 0xffff; - sector += 0xffff; - } - - used_bytes = ALIGN(i * 8, 512); - memset(buffer + i, 0, used_bytes - i * 8); - return used_bytes; -} - static inline bool ata_ok(u8 status) { return ((status & (ATA_BUSY | ATA_DRDY | ATA_DF | ATA_DRQ | ATA_ERR)) -- 2.8.1
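The range-entry packing that both ata_format_dsm_trim_descr() and the old ata_set_lba_range_entries() perform can be modeled in a few lines of userspace C. Each 64-bit little-endian word carries the LBA in bits 47:0 and the range length in bits 63:48, which is where the 0xffff-sectors-per-entry cap and the `0xffff * trmax` request limit come from (helper names here are illustrative, not kernel API):

```c
#include <assert.h>
#include <stdint.h>

/*
 * One DSM TRIM LBA range entry: bits 47:0 carry the starting LBA,
 * bits 63:48 the range length, so a single entry covers at most
 * 0xffff (65535) sectors.
 */
static uint64_t trim_entry(uint64_t sector, uint32_t count)
{
	uint64_t len = count > 0xffff ? 0xffff : count;

	return sector | (len << 48);
}

/* Entries needed for a request of n_block sectors (64 fit in 512 bytes). */
static uint32_t trim_entries_needed(uint32_t n_block)
{
	return (n_block + 0xfffe) / 0xffff;
}
```

With ATA_MAX_TRIM_RNUM entries per 512-byte payload, anything larger than `0xffff * trmax` sectors cannot be described in one payload, which is exactly the `fp = 2; goto invalid_fld;` case in the translation.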
[PATCH v5 0/2] Add support for SCT Write Same
At some point the method of issuing Write Same for ATA drives changed. Currently write same is commonly available via SCT so expose the SCT capabilities and use SCT Write Same when it is available. This is useful for zone-based media that prefers to support discard with lbprz set, aka discard zeroes data, by mapping discard operations to reset write pointer operations. Conventional zones that do not support reset write pointer can still honor discard zeroes data by issuing a write same over the zone. It may also be nice to know if various controllers that currently disable WRITE SAME will work with the SCT Write Same code path: aacraid, arcmsr, megaraid, 3w-9xxx, 3w-sas, 3w-xxxx, gdth, hpsa, ips, megaraid, pmcraid, storvsc_drv This patch against v4.8-rc1 is also at https://github.com/stancheff/linux/tree/v4.8-rc1+ws.v5 git@github.com:stancheff/linux.git v4.8-rc1+ws.v5 Shaun Tancheff (2): Use kmap_atomic when rewriting attached page Add support for SCT Write Same drivers/ata/libata-scsi.c | 240 -- include/linux/ata.h | 69 - 2 files changed, 252 insertions(+), 57 deletions(-) -- 2.8.1
Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal wrote: > Hannes, Shaun, > > Let me add some more comments. > >> On Aug 2, 2016, at 23:35, Hannes Reinecke wrote: >> >> On 08/01/2016 07:07 PM, Shaun Tancheff wrote: >>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig wrote: >>>> >>>> Can you please integrate this with Hannes series so that it uses >>>> his cache of the zone information? >>> >>> Adding Hannes and Damien to Cc. >>> >>> Christoph, >>> >>> I can make a patch that marshals Hannes' RB-Tree into a block report, that >>> is >>> quite simple. I can even have the open/close/reset zone commands update the >>> RB-Tree .. the non-private parts anyway. I would prefer to do this around >>> the >>> CONFIG_SD_ZBC support, offering the existing type of patch for setups that >>> do >>> not need the RB-Tree to function with zoned media. I have posted patches to integrate with the zone cache, hopefully they make sense. >>> >>> I do still have concerns with the approach which I have shared in smaller >>> forums but perhaps I have to bring them to this group. >>> >>> First is the memory consumption. This isn't really much of a concern for >>> large >>> servers with few drives but I think the embedded NAS market will grumble as >>> well as the large data pods trying to stuff 300+ drives in a chassis. >>> >>> As of now the RB-Tree needs to hold ~30,000 zones. >>> sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields >>> around 3.5 MB per zoned drive attached. >>> Which is fine if it is really needed, but most of it is fixed information >>> and it can be significantly condensed (I have proposed 8 bytes per zone held >>> in an array as more than adequate). Worse is that the crucial piece of >>> information, the current wp needed for scheduling the next write, is mostly >>> out of date because it is updated only after the write completes and zones >>> being actively written to must work off of the last location / size that was >>> submitted, not completed. 
The work around is for that tracking to be handled >>> in the private_data member. I am not saying that updating the wp on >>> completing a write isn’t important, I am saying that the bi_end_io hook is >>> the existing hook that works just fine. >>> >> Which _actually_ is not true; with my patches I'll update the write >> pointer prior to submit the I/O (on the reasoning that most of the time >> I/O will succeed) and re-read the zone information if an I/O failed. >> (Which I'll have to do anyway as after an I/O failure the write pointer >> status is not clearly defined.) Apologies for my mis-characterization. >> I have thought about condensing the RB tree information, but then I >> figured that for 'real' SMR handling we cannot assume all zones are of >> fixed size, and hence we need all the information there. >> Any condensing method would assume a given structure of the zones, which >> the standard just doesn't provide. >> Or am I missing something here? > > Indeed, the standards do not mandate any particular zone configuration, > constant zone size, etc. So writing code so that can be handled is certainly > the right way of doing things. However, if we decide to go forward with > mapping RESET WRITE POINTER command to DISCARD, then at least a constant > zone size (minus the last zone as you said) must be assumed, and that > information can be removed from the entries in the RB tree (as it will be > saved for the sysfs "zone_size" file anyway. Adding a little code to handle > that eventual last runt zone with a different size is not a big problem. >> As for write pointer handling: yes, the write pointer on the zones is >> not really useful for upper-level usage. >> Where we do need it is to detect I/O which is crossing the write pointer >> (eg when doing reads over the entire zone). >> As per spec you will be getting an I/O error here, so we need to split >> the I/O on the write pointer to get valid results back. 
> > To be precise here, the I/O splitting will be handled by the block layer > thanks to the "chunk_sectors" setting. But that relies on a constant zone > size assumption too. > > The RB-tree here is most useful for reads over or after the write pointer as > this can have different behavior on different drives (URSWRZ bit). The RB-tree > allows us to hide these differences to upper layers and simplify support at > those levels
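The "8 bytes per zone" condensed cache that the thread keeps returning to is never spelled out. Under the constant-zone-size assumption Damien describes (all zones equal, modulo a runt last zone), one hypothetical packing is sketched below: the zone's start LBA is recovered from its array index, leaving a 64-bit word for the write pointer offset plus type and condition nibbles. The field widths are this sketch's assumption, not a proposal from the thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * HYPOTHETICAL 8-byte zone entry. With a constant zone size the start
 * LBA is index * zone_sectors, so per zone we only store:
 *   bits 47:0  write pointer offset within the zone
 *   bits 51:48 zone type
 *   bits 55:52 zone condition
 */
#define WP_MASK 0xffffffffffffULL

static uint64_t pack_zone(uint64_t wp_offset, unsigned type, unsigned cond)
{
	return (wp_offset & WP_MASK) |
	       ((uint64_t)(type & 0xf) << 48) |
	       ((uint64_t)(cond & 0xf) << 52);
}

/* Recover the absolute write pointer LBA from an entry and its index. */
static uint64_t zone_wp(uint64_t entry, uint64_t index, uint64_t zone_sectors)
{
	return index * zone_sectors + (entry & WP_MASK);
}
```

This is exactly the trade Hannes points out: the array form only works once a zone layout is assumed, whereas the RB-tree carries the full per-zone geometry the standard allows.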
Re: [PATCH 37/45] drivers: use req op accessor
On Thu, Aug 4, 2016 at 10:46 AM, Christoph Hellwig wrote: > On Wed, Aug 03, 2016 at 07:30:29PM -0500, Shaun Tancheff wrote: >> I think the translation in loop.c is suspicious here: >> >> "if use DIO && not (a flush_flag or discard_flag)" >> should translate to: >> "if use DIO && not ((a flush_flag) || op == discard)" >> >> But in the patch I read: >> "if use DIO && ((not a flush_flag) || op == discard) >> >> Which would have DIO && discards follow the AIO path? > > Indeed. Sorry for missing out on your patch, I just sent a fix > in reply to Dave's other report earlier which is pretty similar to > yours. No worries. I prefer your switch to an if conditional here. -- Shaun Tancheff
Re: [PATCH 37/45] drivers: use req op accessor
On Wed, Aug 3, 2016 at 6:47 PM, Mike Christie wrote: > On 08/03/2016 05:33 PM, Ross Zwisler wrote: >> On Sun, Jun 5, 2016 at 1:32 PM, wrote: >>> From: Mike Christie >>> >>> The req operation REQ_OP is separated from the rq_flag_bits >>> definition. This converts the block layer drivers to >>> use req_op to get the op from the request struct. >>> >>> Signed-off-by: Mike Christie >>> --- >>> drivers/block/loop.c | 6 +++--- >>> drivers/block/mtip32xx/mtip32xx.c | 2 +- >>> drivers/block/nbd.c | 2 +- >>> drivers/block/rbd.c | 4 ++-- >>> drivers/block/xen-blkfront.c | 8 +--- >>> drivers/ide/ide-floppy.c | 2 +- >>> drivers/md/dm.c | 2 +- >>> drivers/mmc/card/block.c | 7 +++ >>> drivers/mmc/card/queue.c | 6 ++ >> >> Dave Chinner reported a deadlock with XFS + DAX, which I reproduced >> and bisected to this commit: >> >> commit c2df40dfb8c015211ec55f4b1dd0587f875c7b34 >> Author: Mike Christie >> Date: Sun Jun 5 14:32:17 2016 -0500 >> drivers: use req op accessor >> >> Here are the steps to reproduce the deadlock with a BRD ramdisk: >> >> mkfs.xfs -f /dev/ram0 >> mount -o dax /dev/ram0 /mnt/scratch > > When using ramdisks, we need the attached patch like in your other bug > report. I think it will fix some hangs people are seeing. > > I do not think that it should cause the failure to run issue you saw > when doing generic/008 and ext2. > I think the translation in loop.c is suspicious here: "if use DIO && not (a flush_flag or discard_flag)" should translate to: "if use DIO && not ((a flush_flag) || op == discard)" But in the patch I read: "if use DIO && ((not a flush_flag) || op == discard) Which would have DIO && discards follow the AIO path? So I would humbly suggest something like the following (on top of commit c2df40dfb8c015211ec55f4b1dd0587f875c7b34): [Please excuse the messed up patch format ... 
gmail eats tabs] diff --git a/drivers/block/loop.c b/drivers/block/loop.c index b9b737c..0754d83 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -1659,8 +1659,9 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx, if (lo->lo_state != Lo_bound) return -EIO; - if (lo->use_dio && (!(cmd->rq->cmd_flags & REQ_FLUSH) || - req_op(cmd->rq) == REQ_OP_DISCARD)) + if (lo->use_dio && !((cmd->rq->cmd_flags & REQ_FLUSH) || + req_op(cmd->rq) == REQ_OP_DISCARD)) cmd->use_aio = true; else cmd->use_aio = false; -- Shaun Tancheff
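The disagreement above is purely about boolean grouping. Reduced to plain C, the two readings diverge exactly for a direct-I/O discard, which is the case Shaun flagged as wrongly taking the AIO path:

```c
#include <assert.h>
#include <stdbool.h>

/* Intended: AIO only when the request is neither a flush nor a discard. */
static bool use_aio_correct(bool use_dio, bool is_flush, bool is_discard)
{
	return use_dio && !(is_flush || is_discard);
}

/* As read in the patch: the negation applies to the flush test alone. */
static bool use_aio_suspect(bool use_dio, bool is_flush, bool is_discard)
{
	return use_dio && (!is_flush || is_discard);
}
```

De Morgan's law says `!(A || B)` equals `!A && !B`, not `!A || B`, so the suspect form lets discards through whenever they are not flushes.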
[PATCH 2/2] Enable support for Seagate HostAware drives (testing).
Seagate drives report a SAME code of 0 due to having: - Zones of different types (CMR zones at the low LBA space). - Zones of different size (A terminating 'runt' zone in the high lba space). Support loading the zone topology into the sd_zbc zone cache. Signed-off-by: Shaun Tancheff Cc: Hannes Reinecke Cc: Damien Le Moal --- v1: - Updated kernel version / re-sync with Hannes' zac.v3 branch. --- drivers/scsi/sd.c | 22 drivers/scsi/sd.h | 20 +-- drivers/scsi/sd_zbc.c | 150 ++ 3 files changed, 155 insertions(+), 37 deletions(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 7c38975..5fbc599 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -694,8 +694,13 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode) break; case SD_ZBC_RESET_WP: - max_blocks = sdkp->unmap_granularity; q->limits.discard_zeroes_data = 1; + q->limits.discard_granularity = + sd_zbc_discard_granularity(sdkp); + + max_blocks = min_not_zero(sdkp->unmap_granularity, + q->limits.discard_granularity >> + ilog2(logical_block_size)); break; case SD_LBP_ZERO: @@ -1952,13 +1957,12 @@ static int sd_done(struct scsi_cmnd *SCpnt) good_bytes = blk_rq_bytes(req); scsi_set_resid(SCpnt, 0); } else { -#ifdef CONFIG_SCSI_ZBC if (op == ZBC_OUT) /* RESET WRITE POINTER failed */ sd_zbc_update_zones(sdkp, blk_rq_pos(req), - 512, true); -#endif + 512, SD_ZBC_RESET_WP_ERR); + good_bytes = 0; scsi_set_resid(SCpnt, blk_rq_bytes(req)); } @@ -2031,7 +2035,6 @@ static int sd_done(struct scsi_cmnd *SCpnt) good_bytes = blk_rq_bytes(req); scsi_set_resid(SCpnt, 0); } -#ifdef CONFIG_SCSI_ZBC /* * ZBC: Unaligned write command. * Write did not start a write pointer position. @@ -2039,8 +2042,7 @@ static int sd_done(struct scsi_cmnd *SCpnt) if (sshdr.ascq == 0x04) sd_zbc_update_zones(sdkp, blk_rq_pos(req), - 512, true); -#endif + 512, SD_ZBC_WRITE_ERR); } break; default: @@ -2267,7 +2269,7 @@ static void sd_read_zones(struct scsi_disk *sdkp, unsigned char *buffer) * supports equal zone sizes. 
*/ same = buffer[4] & 0xf; - if (same == 0 || same > 3) { + if (same > 3) { sd_printk(KERN_WARNING, sdkp, "REPORT ZONES SAME type %d not supported\n", same); return; } @@ -2279,9 +2281,9 @@ static void sd_read_zones(struct scsi_disk *sdkp, unsigned char *buffer) sdkp->unmap_granularity = zone_len; blk_queue_chunk_sectors(sdkp->disk->queue, logical_to_sectors(sdkp->device, zone_len)); - sd_config_discard(sdkp, SD_ZBC_RESET_WP); - sd_zbc_setup(sdkp, buffer, SD_BUF_SIZE); + sd_zbc_setup(sdkp, zone_len, buffer, SD_BUF_SIZE); + sd_config_discard(sdkp, SD_ZBC_RESET_WP); } static void read_capacity_error(struct scsi_disk *sdkp, struct scsi_device *sdp, diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index 6ae4505..ef6c132 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -283,19 +283,24 @@ static inline void sd_dif_complete(struct scsi_cmnd *cmd, unsigned int a) #endif /* CONFIG_BLK_DEV_INTEGRITY */ + +#define SD_ZBC_INIT 0 +#define SD_ZBC_RESET_WP_ERR 1 +#define SD_ZBC_WRITE_ERR 2 + #ifdef CONFIG_SCSI_ZBC extern int sd_zbc_report_zones(struct scsi_disk *, unsigned char *, int, sector_t, enum zbc_zone_reporting_options, bool); -extern int sd_zbc_setup(struct scsi_disk *, char *, int); +extern int sd_zbc_setup(struct scsi_disk *, u64 zlen, char *buf, int buf_len); extern void sd_zbc_remove(struct scsi_disk *); extern void sd_zbc_reset_zones(struct scsi_disk *); extern int sd_zbc_setup_discard(struct scsi_disk *, struct request *, sector_t, unsigned int); extern int sd_zbc_setup_read_write(struct scsi_disk *, struct request *,
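For reference, the SAME field being relaxed in the hunk above is the low nibble of byte 4 of the REPORT ZONES header. Its ZBC-defined values explain the patch: Seagate HostAware drives report 0 (mixed layout) because of the CMR zones at low LBAs and the runt zone at the top, so rejecting SAME == 0 rejects those drives entirely. The string table below is a paraphrase of the spec, not kernel code:

```c
#include <assert.h>
#include <string.h>

/*
 * Paraphrased ZBC REPORT ZONES header SAME field (byte 4, bits 3:0).
 * Value 0 is what the Seagate HostAware drives in this series report.
 */
static const char *same_field_meaning(unsigned char hdr_byte4)
{
	switch (hdr_byte4 & 0x0f) {
	case 0: return "zone types and lengths may differ";
	case 1: return "all zones have the same length and type";
	case 2: return "all zones the same, except the last zone";
	case 3: return "all zones have the same length, types may differ";
	default: return "reserved";
	}
}
```

With the check relaxed to `same > 3`, the driver accepts SAME == 0 and relies on the zone cache to describe the mixed layout zone by zone.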
[PATCH 1/2] bio/zbc support for zone cache
Zone actions (Open/Close/Reset) update zone cache on success. Add helpers for - Zone actions to update zone cache - Zone report to translate cache to ZBC format structs Update blkreport to pull from zone cache instead of querying media. Added open explicit and closed states for zone cache Signed-off-by: Shaun Tancheff Cc: Hannes Reinecke Cc: Damien Le Moal Cc: Dan Williams Cc: Sagi Grimberg Cc: Mike Christie Cc: Toshi Kani Cc: Kent Overstreet Cc: Ming Lei --- block/blk-lib.c| 3 +- block/blk-zoned.c | 190 + block/ioctl.c | 39 +++--- include/linux/blkdev.h | 14 +++- 4 files changed, 234 insertions(+), 12 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 6dcdcbf..92898ec 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -6,7 +6,6 @@ #include #include #include -#include #include "blk.h" @@ -358,6 +357,8 @@ int blkdev_issue_zone_action(struct block_device *bdev, unsigned int op, bio_set_op_attrs(bio, op, op_flags); ret = submit_bio_wait(bio); bio_put(bio); + if (ret == 0) + update_zone_state(bdev, sector, op); return ret; } EXPORT_SYMBOL(blkdev_issue_zone_action); diff --git a/block/blk-zoned.c b/block/blk-zoned.c index 975e863..799676b 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -68,3 +68,193 @@ void blk_drop_zones(struct request_queue *q) q->zones = RB_ROOT; } EXPORT_SYMBOL_GPL(blk_drop_zones); + +static void __set_zone_state(struct blk_zone *zone, int op) +{ + if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL) + return; + + switch (op) { + case REQ_OP_ZONE_OPEN: + zone->state = BLK_ZONE_OPEN_EXPLICIT; + break; + case REQ_OP_ZONE_CLOSE: + zone->state = BLK_ZONE_CLOSED; + break; + case REQ_OP_ZONE_RESET: + zone->wp = zone->start; + break; + default: + WARN_ONCE(1, "%s: invalid op code: %u\n", __func__, op); + } +} + +void update_zone_state(struct block_device *bdev, sector_t lba, unsigned int op) +{ + struct request_queue *q = bdev_get_queue(bdev); + struct blk_zone *zone = NULL; + + if (lba == ~0ul) { + struct rb_node *node; + + for (node 
= rb_first(&q->zones); node; node = rb_next(node)) { + zone = rb_entry(node, struct blk_zone, node); + __set_zone_state(zone, op); + } + return; + } + zone = blk_lookup_zone(q, lba); + if (zone) + __set_zone_state(zone, op); +} +EXPORT_SYMBOL_GPL(update_zone_state); + +void bzrpt_fill(struct block_device *bdev, struct bdev_zone_report *bzrpt, + size_t sz, sector_t lba, u8 opt) +{ + u64 clen = ~0ul; + struct blk_zone *zone = NULL; + struct rb_node *node = NULL; + struct request_queue *q = bdev_get_queue(bdev); + u32 max_entries = (sz - sizeof(struct bdev_zone_report)) + / sizeof(struct bdev_zone_descriptor); + u32 entry; + int len_diffs = 0; + int type_diffs = 0; + u8 ctype; + u8 same = 0; + + zone = blk_lookup_zone(q, lba); + if (zone) + node = &zone->node; + + for (entry = 0; +entry < max_entries && node; +entry++, node = rb_next(node)) { + u64 wp; + u8 cond = 0; + u8 flgs = 0; + + zone = rb_entry(node, struct blk_zone, node); + if (blk_zone_is_cmr(zone)) + wp = zone->start + zone->len; + else + wp = zone->wp; + + bzrpt->descriptors[entry].lba_start = cpu_to_be64(zone->start); + bzrpt->descriptors[entry].length = cpu_to_be64(zone->len); + bzrpt->descriptors[entry].type = zone->type; + bzrpt->descriptors[entry].lba_wptr = cpu_to_be64(wp); + + switch (zone->state) { + case BLK_ZONE_NO_WP: + cond = ZCOND_CONVENTIONAL; + break; + case BLK_ZONE_OPEN: + cond = ZCOND_ZC2_OPEN_IMPLICIT; + break; + case BLK_ZONE_OPEN_EXPLICIT: + cond = ZCOND_ZC3_OPEN_EXPLICIT; + break; + case BLK_ZONE_CLOSED: + cond = ZCOND_ZC4_CLOSED; + break; + case BLK_ZONE_READONLY: + cond = ZCOND_ZC6_READ_ONLY; + break; + case BLK_ZONE_OFFLINE: + cond = ZCOND_ZC7_OFFLINE; + break; + default: +
[PATCH 0/2] Bio flags use zone cache when available.
Hi, As per Christoph's request this patch incorporates Hannes' cache of zone information. The approach is to provide the same blkreport format as would be done when BLK_DEV_ZONED is not enabled via an ioctl. Reset WP, Open, and Close zone will update the zone cache. Using blkdev_issue_zone_report() from within the kernel the report will still come directly from the media. I am considering an option for the ioctl to force pulling the zone report from the media when BLK_DEV_ZONED is enabled and using the resulting information to refresh the zone cache. It is not clear that this is useful so I have not included that in this patch. The follow-on patch just enables the zone cache to be populated using a Seagate's HostAware drive. This series is based off of Linus current tip post (f38d2e5313f0af..) and builds on top of the previous series of block layer support: Add ioctl to issue ZBC/ZAC commands via block layer Add bio/request flags to issue ZBC/ZAC commands as well as the series post by Hannes' sd_zbc: Fix handling of ZBC read after write pointer sd: Limit messages for ZBC disks capacity change sd: Implement support for ZBC devices sd: Implement new RESET_WP provisioning mode sd: configure ZBC devices ... Cc: Hannes Reinecke Cc: Damien Le Moal Shaun Tancheff (2): bio/zbc support for zone cache Enable support for Seagate HostAware drives (testing). 
block/blk-lib.c | 3 +- block/blk-zoned.c | 190 + block/ioctl.c | 39 +++--- drivers/scsi/sd.c | 22 +++--- drivers/scsi/sd.h | 20 -- drivers/scsi/sd_zbc.c | 150 -- include/linux/blkdev.h | 14 +++- 7 files changed, 389 insertions(+), 49 deletions(-) -- 2.8.1
Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig wrote: > > Can you please integrate this with Hannes series so that it uses > his cache of the zone information? Adding Hannes and Damien to Cc. Christoph, I can make a patch that marshals Hannes' RB-Tree into a block report, that is quite simple. I can even have the open/close/reset zone commands update the RB-Tree .. the non-private parts anyway. I would prefer to do this around the CONFIG_SD_ZBC support, offering the existing type of patch for setups that do not need the RB-Tree to function with zoned media. I do still have concerns with the approach which I have shared in smaller forums but perhaps I have to bring them to this group. First is the memory consumption. This isn't really much of a concern for large servers with few drives but I think the embedded NAS market will grumble as well as the large data pods trying to stuff 300+ drives in a chassis. As of now the RB-Tree needs to hold ~30,000 zones. sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields around 3.5 MB per zoned drive attached. Which is fine if it is really needed, but most of it is fixed information and it can be significantly condensed (I have proposed 8 bytes per zone held in an array as more than adequate). Worse is that the crucial piece of information, the current wp needed for scheduling the next write, is mostly out of date because it is updated only after the write completes and zones being actively written to must work off of the last location / size that was submitted, not completed. The work around is for that tracking to be handled in the private_data member. I am not saying that updating the wp on completing a write isn't important, I am saying that the bi_end_io hook is the existing hook that works just fine. This all tails into domain responsibility. With the RB-Tree doing half of the work and the 'responsible' domain handling the active path via private_data why have the split at all? 
It seems to be double work to have a second object tracking the first so that I/O scheduling can function. Finally, there is the error handling path: when the RB-Tree encounters an error it attempts to requery the drive topology, virtually guaranteeing that the private_data is now out-of-sync with the RB-Tree. Again this is something that can be better encapsulated in the bi_end_io to be informed of the failed I/O and schedule the appropriate recovery (including re-querying the zone information of the affected zone(s)). Anyway those are my concerns and why I am still reluctant to drop this line of support. I have incorporated Hannes changes at various points. Hence the SCT Write Same to attempt to work around some of the flaws in mapping discard to reset write pointer. Thanks and Regards, Shaun -- Shaun Tancheff
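The memory argument is easy to re-derive from the figures quoted in the thread (120 bytes per struct blk_zone, around 3.5 MB per drive). Assuming an 8 TB drive with 256 MiB zones, a plausible SMR geometry not stated in the thread, there are 32768 zones: at 120 bytes each that is about 3.75 MiB, while the proposed 8 bytes per zone would need only 256 KiB.

```c
#include <assert.h>
#include <stdint.h>

/* Zone-cache footprint arithmetic for a drive of the given geometry. */
static uint64_t zone_count(uint64_t capacity_bytes, uint64_t zone_bytes)
{
	return capacity_bytes / zone_bytes;
}

static uint64_t cache_bytes(uint64_t zones, uint64_t bytes_per_zone)
{
	return zones * bytes_per_zone;
}
```

Multiplied by the "300+ drives in a chassis" case mentioned above, the difference is roughly a gigabyte of zone cache versus tens of megabytes.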
Fixup direct bi_rw modifiers
bi_rw should be set using bio_set_op_attrs(). Signed-off-by: Shaun Tancheff Cc: Chris Mason Cc: Josef Bacik Cc: David Sterba Cc: Mike Christie --- Patch is against linux-next tag next-20160729 NOTE: In 4.7 this was not including the 'WRITE' macro so it may not have been operating as intended. --- fs/btrfs/extent_io.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index f67d6a1..720e6ef 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2050,7 +2050,7 @@ int repair_io_failure(struct inode *inode, u64 start, u64 length, u64 logical, return -EIO; } bio->bi_bdev = dev->bdev; - bio->bi_rw = WRITE_SYNC; + bio_set_op_attrs(bio, REQ_OP_WRITE, WRITE_SYNC); bio_add_page(bio, page, length, pg_offset); if (btrfsic_submit_bio_wait(bio)) { -- 2.8.1
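Why the one-line change matters: after the req-op split, the REQ_OP_* value lives in dedicated high bits of the bio's rw field, so assigning bare flags such as WRITE_SYNC leaves the op bits zero, i.e. a read. The sketch below models that packing; the shift value and names are illustrative, not the kernel's exact definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of op-in-high-bits packing, not the real layout. */
#define OP_SHIFT 61
#define OP_READ  0ULL
#define OP_WRITE 1ULL

/* Models what bio_set_op_attrs() does: combine the op with the flags. */
static uint64_t set_op_attrs(uint64_t op, uint64_t flags)
{
	return (op << OP_SHIFT) | flags;
}

/* Models req_op(): recover the op from the packed field. */
static uint64_t get_op(uint64_t rw)
{
	return rw >> OP_SHIFT;
}
```

So `bio->bi_rw = WRITE_SYNC;` keeps the sync semantics but silently turns the bio into a read as far as the op accessors are concerned, which is exactly what the patch repairs.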
[PATCH v6 2/2] Add ioctl to issue ZBC/ZAC commands via block layer
Add support for ZBC ioctl's BLKREPORT- Issue Report Zones to device. BLKOPENZONE - Issue Zone Action: Open Zone command. BLKCLOSEZONE - Issue Zone Action: Close Zone command. BLKRESETZONE - Issue Zone Action: Reset Zone command. Signed-off-by: Shaun Tancheff --- v6: - Added GFP_DMA to gfp mask. v4: - Rebase on linux-next tag next-20160617. - Change bio flags to bio op's --- block/ioctl.c | 110 ++ include/uapi/linux/blkzoned_api.h | 6 +++ include/uapi/linux/fs.h | 1 + 3 files changed, 117 insertions(+) diff --git a/block/ioctl.c b/block/ioctl.c index ed2397f..a2a6c2c 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -7,6 +7,7 @@ #include #include #include +#include #include #include @@ -194,6 +195,109 @@ int blkdev_reread_part(struct block_device *bdev) } EXPORT_SYMBOL(blkdev_reread_part); +static int blk_zoned_report_ioctl(struct block_device *bdev, fmode_t mode, + void __user *parg) +{ + int error = -EFAULT; + gfp_t gfp = GFP_KERNEL | GFP_DMA; + struct bdev_zone_report_io *zone_iodata = NULL; + int order = 0; + struct page *pgs = NULL; + u32 alloc_size = PAGE_SIZE; + unsigned long op_flags = 0; + u8 opt = 0; + + if (!(mode & FMODE_READ)) + return -EBADF; + + zone_iodata = (void *)get_zeroed_page(gfp); + if (!zone_iodata) { + error = -ENOMEM; + goto report_zones_out; + } + if (copy_from_user(zone_iodata, parg, sizeof(*zone_iodata))) { + error = -EFAULT; + goto report_zones_out; + } + if (zone_iodata->data.in.return_page_count > alloc_size) { + int npages; + + alloc_size = zone_iodata->data.in.return_page_count; + npages = (alloc_size + PAGE_SIZE - 1) >> PAGE_SHIFT; + pgs = alloc_pages(gfp, ilog2(npages)); + if (pgs) { + void *mem = page_address(pgs); + + if (!mem) { + error = -ENOMEM; + goto report_zones_out; + } + order = ilog2(npages); + memset(mem, 0, alloc_size); + memcpy(mem, zone_iodata, sizeof(*zone_iodata)); + free_page((unsigned long)zone_iodata); + zone_iodata = mem; + } else { + /* Result requires DMA capable memory */ + pr_err("Not enough memory 
available for request.\n"); + error = -ENOMEM; + goto report_zones_out; + } + } + opt = zone_iodata->data.in.report_option; + error = blkdev_issue_zone_report(bdev, op_flags, + zone_iodata->data.in.zone_locator_lba, opt, + pgs ? pgs : virt_to_page(zone_iodata), + alloc_size, GFP_KERNEL); + + if (error) + goto report_zones_out; + + if (copy_to_user(parg, zone_iodata, alloc_size)) + error = -EFAULT; + +report_zones_out: + if (pgs) + __free_pages(pgs, order); + else if (zone_iodata) + free_page((unsigned long)zone_iodata); + return error; +} + +static int blk_zoned_action_ioctl(struct block_device *bdev, fmode_t mode, + unsigned int cmd, unsigned long arg) +{ + unsigned int op = 0; + + if (!(mode & FMODE_WRITE)) + return -EBADF; + + /* +* When acting on zones we explicitly disallow using a partition. +*/ + if (bdev != bdev->bd_contains) { + pr_err("%s: All zone operations disallowed on this device\n", + __func__); + return -EFAULT; + } + + switch (cmd) { + case BLKOPENZONE: + op = REQ_OP_ZONE_OPEN; + break; + case BLKCLOSEZONE: + op = REQ_OP_ZONE_CLOSE; + break; + case BLKRESETZONE: + op = REQ_OP_ZONE_RESET; + break; + default: + pr_err("%s: Unknown action: %u\n", __func__, cmd); + return -EINVAL; + } + return blkdev_issue_zone_action(bdev, op, 0, arg, GFP_KERNEL); +} + static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode, unsigned long arg, unsigned long flags) { @@ -568,6 +672,12 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd, case BLKTRACESETUP: case BLKTRACETEARDOWN: return blk_trace_ioctl(bdev, cmd, argp); + case BLKREPORT: + return blk_zoned_report_ioctl(bdev, mode, argp); + case BLKOPENZONE: +
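One detail a reviewer might flag in blk_zoned_report_ioctl() above: the allocation order is computed as ilog2(npages), and the kernel's ilog2() is a floor log2. For a non-power-of-two page count that rounds the order down and under-sizes the allocation (3 pages needed, order 1 = only 2 pages allocated), whereas get_order() rounds up. A userspace model of the difference:

```c
#include <assert.h>

/* Behaves like the kernel's ilog2(): floor of log2. */
static int floor_log2(unsigned int n)
{
	int r = -1;

	while (n) {
		n >>= 1;
		r++;
	}
	return r;
}

/* Round-up order: the get_order()-style sizing for npages pages. */
static int order_for_pages(unsigned int npages)
{
	int order = 0;

	while ((1u << order) < npages)
		order++;
	return order;
}
```

Since the buffer is later filled with up to alloc_size bytes of report data, a rounded-down order would overrun the allocation for awkward return_page_count values.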
[PATCH v6 1/2] Add bio/request flags to issue ZBC/ZAC commands
Add op flags to access zone information as well as open, close and reset zones:
 - REQ_OP_ZONE_REPORT - Query zone information (Report zones)
 - REQ_OP_ZONE_OPEN - Explicitly open a zone for writing
 - REQ_OP_ZONE_CLOSE - Explicitly close a zone
 - REQ_OP_ZONE_RESET - Reset Write Pointer to start of zone

These op flags can be used to create bio's to control zoned devices through the block layer. This is useful for filesystems and device mappers that need explicit control of zoned devices such as Host Managed and Host Aware SMR drives.

Report zones is a device read that requires a buffer. Open, Close and Reset are device commands that have no associated data transfer.

Sending an LBA of ~0 will attempt to operate on all zones. This is typically used with Reset to wipe a drive, as a Reset behaves similarly to TRIM in that all data in the zone(s) is deleted.

The Finish zone command is intentionally not implemented as there is no current use case for that operation.

Report zones currently defaults to reporting on all zones. It is expected that support for the zone option flag will piggy-back on streamid support. The report option flag is useful as it can reduce the number of zones in each report, but it is not critical.

Signed-off-by: Shaun Tancheff
---
v5:
 - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent
 - In blk-lib fix documentation
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio op's
V3:
 - Rebase on Mike Cristie's separate bio operations
 - Update blkzoned_api.h to include report zones PARTIAL bit.
V2:
 - Changed bi_rw to op_flags to clarify separation of bio op from flags.
 - Fixed memory leak in blkdev_issue_zone_report failing to bio_put().
 - Documented opt in blkdev_issue_zone_report.
 - Removed include/uapi/linux/fs.h from this patch.
--- MAINTAINERS | 9 ++ block/blk-lib.c | 95 + drivers/scsi/sd.c | 118 + drivers/scsi/sd.h | 1 + include/linux/bio.h | 7 +- include/linux/blk_types.h | 6 +- include/linux/blkzoned_api.h | 25 + include/uapi/linux/Kbuild | 1 + include/uapi/linux/blkzoned_api.h | 214 ++ 9 files changed, 474 insertions(+), 2 deletions(-) create mode 100644 include/linux/blkzoned_api.h create mode 100644 include/uapi/linux/blkzoned_api.h diff --git a/MAINTAINERS b/MAINTAINERS index 771c31c..32f5598 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -12785,6 +12785,15 @@ F: Documentation/networking/z8530drv.txt F: drivers/net/hamradio/*scc.c F: drivers/net/hamradio/z8530.h +ZBC AND ZBC BLOCK DEVICES +M: Shaun Tancheff +W: http://seagate.com +W: https://github.com/Seagate/ZDM-Device-Mapper +L: linux-bl...@vger.kernel.org +S: Maintained +F: include/linux/blkzoned_api.h +F: include/uapi/linux/blkzoned_api.h + ZBUD COMPRESSED PAGE ALLOCATOR M: Seth Jennings L: linux...@kvack.org diff --git a/block/blk-lib.c b/block/blk-lib.c index 083e56f..6dcdcbf 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -6,6 +6,7 @@ #include #include #include +#include #include "blk.h" @@ -266,3 +267,97 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask); } EXPORT_SYMBOL(blkdev_issue_zeroout); + +/** + * blkdev_issue_zone_report - queue a report zones operation + * @bdev: target blockdev + * @op_flags: extra bio rw flags. If unsure, use 0. + * @sector:starting sector (report will include this sector). + * @opt: See: zone_report_option, default is 0 (all zones). + * @page: one or more contiguous pages. + * @pgsz: up to size of page in bytes, size of report. + * @gfp_mask: memory allocation flags (for bio_alloc) + * + * Description: + *Issue a zone report request for the sectors in question. 
+ */ +int blkdev_issue_zone_report(struct block_device *bdev, unsigned int op_flags, +sector_t sector, u8 opt, struct page *page, +size_t pgsz, gfp_t gfp_mask) +{ + struct bdev_zone_report *conv = page_address(page); + struct bio *bio; + unsigned int nr_iovecs = 1; + int ret = 0; + + if (pgsz < (sizeof(struct bdev_zone_report) + + sizeof(struct bdev_zone_descriptor))) + return -EINVAL; + + bio = bio_alloc(gfp_mask, nr_iovecs); + if (!bio) + return -ENOMEM; + + conv->descriptor_count = 0; + bio->bi_iter.bi_sector = sector; + bio->bi_bdev = bdev; + bio->bi_vcnt = 0; + bio->bi_iter.bi_size = 0; + + bio_add_page(bio, page, pgsz, 0); + bio_set_op_attrs(bio, REQ_OP_ZONE_REPORT, op_flags); + ret = submit_bio_wait(bio); + + /* +* When our request it nak
[PATCH v6 0/2] Block layer support ZAC/ZBC commands
Hi Jens,

This series is based on Linus' current tip after the merge of 'for-4.8/core'.

As Host Aware drives are becoming available we would like to be able to make use of such drives. This series is also intended to be suitable for use by Host Managed drives.

ZAC/ZBC drives add new commands for discovering and working with Zones. This extends the ZAC/ZBC support up to the block layer, allowing direct control by file systems or device mapper targets. Also, by deferring the zone handling to the authoritative subsystem there is an overall lower memory usage for holding the active zone information, as well as a clear responsible party for maintaining the write pointer for each active zone.

By way of example, a DM target may have several writes in progress. The sector (or LBA) for each of those writes will depend on the previous write. While the drive's write pointer will be updated as writes are completed, the DM target will be maintaining both where the next write should be scheduled from and where the write pointer is, based on writes completed without errors.

Knowing the drive's zone topology enables DM targets and file systems to extend their block allocation schemes and issue write pointer resets (or discards) that are zone aligned.

A perhaps non-obvious detail is that a conventional drive returns a zone report descriptor with a single large conventional zone.

Patches for util-linux can be found here:
https://github.com/Seagate/ZDM-Device-Mapper/tree/master/patches/util-linux

This patch is available here:
https://github.com/stancheff/linux/tree/zbc.bio.v6
g...@github.com:stancheff/linux.git zbc.bio.v6

v6:
 - Fix page alloc to include DMA flag for ioctl.
v5:
 - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent
 - In blk-lib fix documentation
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio op's
 - Dropped ata16 hackery
V3:
 - Rebase on Mike Cristie's separate bio operations
 - Update blkzoned_api.h to include report zones PARTIAL bit.
 - Use zoned report reserved bit for ata-passthrough flag.
V2:
 - Changed bi_rw to op_flags to clarify separation of bio op from flags.
 - Fixed memory leak in blkdev_issue_zone_report failing to bio_put().
 - Documented opt in blkdev_issue_zone_report.
 - Moved include/uapi/linux/fs.h changes to patch 3
 - Fixed commit message for first patch in series.

Shaun Tancheff (2):
  Add bio/request flags to issue ZBC/ZAC commands
  Add ioctl to issue ZBC/ZAC commands via block layer

 MAINTAINERS                       |   9 ++
 block/blk-lib.c                   |  95
 block/ioctl.c                     | 110 +++
 drivers/scsi/sd.c                 | 118
 drivers/scsi/sd.h                 |   1 +
 include/linux/bio.h               |   7 +-
 include/linux/blk_types.h         |   6 +-
 include/linux/blkzoned_api.h      |  25 +
 include/uapi/linux/Kbuild         |   1 +
 include/uapi/linux/blkzoned_api.h | 220 ++
 include/uapi/linux/fs.h           |   1 +
 11 files changed, 591 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/blkzoned_api.h
 create mode 100644 include/uapi/linux/blkzoned_api.h
--
2.8.1
[PATCH v4] Add support for SCT Write Same
SATA drives may support write same via SCT. This is useful for setting the drive contents to a specific pattern (0's).

If UNMAP is not set, or TRIM is not available, then fall back to SCT WRITE SAME, if it is available. In this way it would be possible to mimic lbprz for devices that support TRIM but fail to zero blocks reliably. For example, a filesystem or DM target could issue a write same w/o unmap followed by a TRIM.

Signed-off-by: Shaun Tancheff
---
At some point the method of issuing Write Same for ATA drives changed. Currently write same is commonly available via SCT, so expose the SCT capabilities and use SCT Write Same when it is available.

This is useful for zone-based media that prefers to support discard with lbprz set, aka discard zeroes data, by mapping discard operations to reset write pointer operations. Conventional zones that do not support reset write pointer can still honor discard zeroes data by issuing a write same over the zone.

With this patch, if UNMAP is set the TRIM support will follow the existing code path. When UNMAP is not set and SCT WRITE SAME support is available, the specified blocks will be zero'd.

This patch is also at:
https://github.com/stancheff/linux/tree/v4.7-sct.ws.v4
g...@github.com:stancheff/linux.git v4.7-sct.ws.v4

v4:
 - Added partial MAINTENANCE_IN opcode simulation
 - Dropped all changes in drivers/scsi/*
 - Changed to honor the UNMAP flag -> TRIM, no UNMAP -> SCT.
v3: - Demux UNMAP/TRIM from WRITE SAME v2: - Remove fugly ata hacking from sd.c --- drivers/ata/libata-scsi.c | 166 -- include/linux/ata.h | 43 2 files changed, 187 insertions(+), 22 deletions(-) diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index bfec66f..6c7a87e 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -1159,8 +1159,6 @@ static void ata_scsi_sdev_config(struct scsi_device *sdev) { sdev->use_10_for_rw = 1; sdev->use_10_for_ms = 1; - sdev->no_report_opcodes = 1; - sdev->no_write_same = 1; /* Schedule policy is determined by ->qc_defer() callback and * it needs to see every deferred qc. Set dev_blocked to 1 to @@ -3267,11 +3265,20 @@ static unsigned int ata_scsi_pass_thru(struct ata_queued_cmd *qc) return 1; } +/** + * ata_scsi_write_same_xlat - Handle WRITE_SAME commands + * + * @qc: command structure + * + * Translate WRITE_SAME w/UNMAP set to TRIM/discard request. + * Otherwise use SCT WRITE SAME when available. + */ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) { struct ata_taskfile *tf = &qc->tf; struct scsi_cmnd *scmd = qc->scsicmd; struct ata_device *dev = qc->dev; + struct scatterlist *sg; const u8 *cdb = scmd->cmnd; u64 block; u32 n_block; @@ -3279,6 +3286,8 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) void *buf; u16 fp; u8 bp = 0xff; + u8 unmap = cdb[1] & 0x8; + bool use_sct = unmap ? false : true; /* we may not issue DMA commands if no DMA mode is set */ if (unlikely(!dev->dma_mode)) @@ -3290,8 +3299,10 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) } scsi_16_lba_len(cdb, &block, &n_block); - /* for now we only support WRITE SAME with the unmap bit set */ - if (unlikely(!(cdb[1] & 0x8))) { + /* +* If UNMAP is not set and SCT write same is not available then fail. 
+*/ + if (use_sct && !ata_id_sct_write_same(dev->id)) { fp = 1; bp = 3; goto invalid_fld; @@ -3304,26 +3315,56 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc) if (!scsi_sg_count(scmd)) goto invalid_param_len; - buf = page_address(sg_page(scsi_sglist(scmd))); - size = ata_set_lba_range_entries(buf, 512, block, n_block); + sg = scsi_sglist(scmd); + buf = page_address(sg_page(sg)) + sg->offset; - if (ata_ncq_enabled(dev) && ata_fpdma_dsm_supported(dev)) { - /* Newer devices support queued TRIM commands */ - tf->protocol = ATA_PROT_NCQ; - tf->command = ATA_CMD_FPDMA_SEND; - tf->hob_nsect = ATA_SUBCMD_FPDMA_SEND_DSM & 0x1f; - tf->nsect = qc->tag << 3; - tf->hob_feature = (size / 512) >> 8; - tf->feature = size / 512; + /* +* if we only have SCT then ignore the state of unmap request +* a zero the blocks. +*/ + if (use_sct) { + u16 *sctpg = buf; + + put_unaligned_le16(0x0002, &sctpg[0]); /* SCT_ACT_WRITE_SAME */ + put_unaligned_le16(0x0101, &
[PATCH] Change MPI alloc from GFP_KERNEL to GFP_ATOMIC
Running Ubuntu 14.04 for testing against the 4.8 merge window I am getting the splat below. One quick fix is to change GFP_KERNEL to GFP_ATOMIC.

BUG: sleeping function called from invalid context at mm/slab.h:393
in_atomic(): 1, irqs_disabled(): 0, pid: 594, name: modprobe
no locks held by modprobe/594.
CPU: 1 PID: 594 Comm: modprobe Not tainted 4.7.0-zdm #64
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
 8ef5d4e5b978 8b5eded3 8ef5d9021940 8bee5c08
 8ef5d4e5b9a0 8b0b5109 8bee5c08 0189
 8ef5d4e5b9c8 8b0b5209
Call Trace:
 [] dump_stack+0x85/0xc2
 [] ___might_sleep+0x179/0x230
 [] __might_sleep+0x49/0x80
 [] kmem_cache_alloc_trace+0x1ba/0x2f0
 [] ? mpi_alloc+0x20/0x80
 [] mpi_alloc+0x20/0x80
 [] mpi_read_raw_from_sgl+0xc7/0x1d0
 [] rsa_verify+0x57/0xd0
 [] ? pkcs1pad_sg_set_buf+0x40/0xb0
 [] pkcs1pad_verify+0xbb/0x100
 [] public_key_verify_signature+0x1c0/0x2a0
 [] ? debug_check_no_locks_freed+0xd0/0x160
 [] public_key_verify_signature_2+0x15/0x20
 [] verify_signature+0x3c/0x50
 [] pkcs7_validate_trust+0x1fa/0x260
 [] verify_pkcs7_signature+0x7d/0x100
 [] mod_verify_sig+0x7c/0xb0
 [] load_module+0x170/0x2ac0
 [] ? __vfs_read+0xbd/0x110
 [] ? ima_post_read_file+0x7d/0xa0
 [] ?
kernel_read_file+0x191/0x1b0 [] SYSC_finit_module+0xc3/0xf0 [] SyS_finit_module+0xe/0x10 [] entry_SYSCALL_64_fastpath+0x23/0xc1 Signed-off-by: Shaun Tancheff --- lib/mpi/mpiutil.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/lib/mpi/mpiutil.c b/lib/mpi/mpiutil.c index 314f4df..420ace3 100644 --- a/lib/mpi/mpiutil.c +++ b/lib/mpi/mpiutil.c @@ -31,7 +31,7 @@ MPI mpi_alloc(unsigned nlimbs) { MPI a; - a = kmalloc(sizeof *a, GFP_KERNEL); + a = kmalloc(sizeof *a, GFP_ATOMIC); if (!a) return a; @@ -61,7 +61,7 @@ mpi_ptr_t mpi_alloc_limb_space(unsigned nlimbs) if (!len) return NULL; - return kmalloc(len, GFP_KERNEL); + return kmalloc(len, GFP_ATOMIC); } void mpi_free_limb_space(mpi_ptr_t a) @@ -91,14 +91,14 @@ int mpi_resize(MPI a, unsigned nlimbs) return 0; /* no need to do it */ if (a->d) { - p = kmalloc(nlimbs * sizeof(mpi_limb_t), GFP_KERNEL); + p = kmalloc(nlimbs * sizeof(mpi_limb_t), GFP_ATOMIC); if (!p) return -ENOMEM; memcpy(p, a->d, a->alloced * sizeof(mpi_limb_t)); kzfree(a->d); a->d = p; } else { - a->d = kzalloc(nlimbs * sizeof(mpi_limb_t), GFP_KERNEL); + a->d = kzalloc(nlimbs * sizeof(mpi_limb_t), GFP_ATOMIC); if (!a->d) return -ENOMEM; } -- 2.8.1
[PATCH] Missing bio_put following submit_bio_wait
submit_bio_wait() gives the caller an opportunity to examine struct bio, and so expects the caller to issue the bio_put().

This fixes a memory leak reported by a few people in 4.7-rc2 kmemleak report after 9082e87bfbf8 ("block: remove struct bio_batch")

Signed-off-by: Shaun Tancheff
Tested-by: Catalin Marinas
Tested-by: Larry fin...@lwfinger.net
Tested-by: David Drysdale
---
 block/blk-lib.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 23d7f30..9e29dc3 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -113,6 +113,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 ret = submit_bio_wait(type, bio);
 if (ret == -EOPNOTSUPP)
 ret = 0;
+ bio_put(bio);
 }
 blk_finish_plug(&plug);
@@ -165,8 +166,10 @@ int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 }
 }
- if (bio)
+ if (bio) {
 ret = submit_bio_wait(REQ_WRITE | REQ_WRITE_SAME, bio);
+ bio_put(bio);
+ }
 return ret != -EOPNOTSUPP ? ret : 0;
 }
 EXPORT_SYMBOL(blkdev_issue_write_same);
@@ -206,8 +209,11 @@ static int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 }
 }
- if (bio)
- return submit_bio_wait(WRITE, bio);
+ if (bio) {
+ ret = submit_bio_wait(WRITE, bio);
+ bio_put(bio);
+ return ret;
+ }
 return 0;
 }
--
2.8.1
[PATCH] Fix reported kmemleak
This fixes a memory leak reported by a few people in 4.7-rc1 kmemleak report after 9082e87bfbf8 ("block: remove struct bio_batch") This patch just formalizes the one in this discussion here: https://lkml.kernel.org/r/20160606112620.ga29...@e104818-lin.cambridge.arm.com The same issue appears here: https://lkml.kernel.org/r/20160607102651.ga6...@dhcp12-144.nay.redhat.com Patch is also available at: https://github.com/stancheff/linux.git branch: v4.7-rc2+bio_put Cc: Christoph Hellwig Cc: ax...@kernel.dk cc: David Drysdale Cc: Xiong Zhou Cc: Stephen Rothwell Cc: linux-n...@vger.kernel.org Cc: linux-nvd...@ml01.01.org Cc: Christoph Hellwig Cc: Larry Finger Cc: Catalin Marinas Cc: LKML , Cc: Jens Axboe , Cc: bart.vanass...@sandisk.com Shaun Tancheff (1): Missing bio_put following submit_bio_wait block/blk-lib.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) -- 2.8.1
Re: kmemleak report after 9082e87bfbf8 ("block: remove struct bio_batch")
derstand this code, this chaining may not
> even be essential to struct bio freeing (and I'm investigating the wrong
> path).
>
> --
> Catalin

I'm pretty sure it is missing a bio_put() after submit_bio_wait().

Please excuse the hack-y patch but I think you need to do something like this ...
(Note tabs eaten by gmail).

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 23d7f30..9e29dc3 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -113,6 +113,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 ret = submit_bio_wait(type, bio);
 if (ret == -EOPNOTSUPP)
 ret = 0;
+ bio_put(bio);
 }
 blk_finish_plug(&plug);
@@ -165,8 +166,10 @@ int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 }
 }
- if (bio)
+ if (bio) {
 ret = submit_bio_wait(REQ_WRITE | REQ_WRITE_SAME, bio);
+ bio_put(bio);
+ }
 return ret != -EOPNOTSUPP ? ret : 0;
 }
 EXPORT_SYMBOL(blkdev_issue_write_same);
@@ -206,8 +209,11 @@ static int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 }
 }
- if (bio)
- return submit_bio_wait(WRITE, bio);
+ if (bio) {
+ ret = submit_bio_wait(WRITE, bio);
+ bio_put(bio);
+ return ret;
+ }
 return 0;
 }
--
Shaun Tancheff
[PATCH] RAID Cleanup for bio-split
It looks like some minor changes slipped through on the RAID. A couple of checks for REQ_PREFLUSH flag should also check for bi_op matching REQ_OP_FLUSH. Wrappers for sync_page_io() are passed READ/WRITE but need to be passed REQ_OP_READ and REQ_OP_WRITE. Signed-off-by: Shaun Tancheff --- drivers/md/raid0.c | 3 ++- drivers/md/raid1.c | 13 +++-- drivers/md/raid10.c | 21 ++--- drivers/md/raid5.c | 3 ++- 4 files changed, 21 insertions(+), 19 deletions(-) diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index f95463d..46e9ba8 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -458,7 +458,8 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio) struct md_rdev *tmp_dev; struct bio *split; - if (unlikely(bio->bi_rw & REQ_PREFLUSH)) { + if (unlikely(bio->bi_rw & REQ_PREFLUSH) || + unlikely(bio->bi_op == REQ_OP_FLUSH)) { md_flush_request(mddev, bio); return; } diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 2a2c177..f7c0577 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1771,12 +1771,12 @@ static void end_sync_write(struct bio *bio) } static int r1_sync_page_io(struct md_rdev *rdev, sector_t sector, - int sectors, struct page *page, int rw) + int sectors, struct page *page, int op) { - if (sync_page_io(rdev, sector, sectors << 9, page, rw, 0, false)) + if (sync_page_io(rdev, sector, sectors << 9, page, op, 0, false)) /* success */ return 1; - if (rw == WRITE) { + if (op == REQ_OP_WRITE) { set_bit(WriteErrorSeen, &rdev->flags); if (!test_and_set_bit(WantReplacement, &rdev->flags)) @@ -1883,7 +1883,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio) rdev = conf->mirrors[d].rdev; if (r1_sync_page_io(rdev, sect, s, bio->bi_io_vec[idx].bv_page, - WRITE) == 0) { + REQ_OP_WRITE) == 0) { r1_bio->bios[d]->bi_end_io = NULL; rdev_dec_pending(rdev, mddev); } @@ -2118,7 +2118,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk, if (rdev && !test_bit(Faulty, &rdev->flags)) r1_sync_page_io(rdev, sect, s, - 
conf->tmppage, WRITE); + conf->tmppage, REQ_OP_WRITE); } d = start; while (d != read_disk) { @@ -2130,7 +2130,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk, if (rdev && !test_bit(Faulty, &rdev->flags)) { if (r1_sync_page_io(rdev, sect, s, - conf->tmppage, READ)) { + conf->tmppage, REQ_OP_READ)) { atomic_add(s, &rdev->corrected_errors); printk(KERN_INFO "md/raid1:%s: read error corrected " @@ -2204,6 +2204,7 @@ static int narrow_write_error(struct r1bio *r1_bio, int i) } wbio->bi_op = REQ_OP_WRITE; + wbio->bi_rw = 0; wbio->bi_iter.bi_sector = r1_bio->sector; wbio->bi_iter.bi_size = r1_bio->sectors << 9; diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index c5dc4e4..44e87c2 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -1364,8 +1364,7 @@ retry_write: mbio->bi_bdev = rdev->bdev; mbio->bi_end_io = raid10_end_write_request; mbio->bi_op = op; - mbio->bi_rw = - do_sync | do_fua | do_sec; + mbio->bi_rw = do_sync | do_fua | do_sec; mbio->bi_private = r10_bio; atomic_inc(&r10_bio->remaining); @@ -1408,8 +1407,7 @@ retry_write: mbio->bi_bdev = rdev->bdev; mbio->bi_end_io = raid10_end_write_request; mbio->bi_op = op; - mbio->bi_rw = - do_sync | do_fua | do_sec; + mbio->bi_rw = do_sync | do_fua | do_sec; mbio->bi_