Re: [PATCH] sd: fix uninitialized variable access in error handling

2016-10-21 Thread Shaun Tancheff
On Fri, Oct 21, 2016 at 10:32 AM, Arnd Bergmann  wrote:
> If sd_zbc_report_zones fails, the check for 'zone_blocks == 0'
> later in the function accesses uninitialized data:
>
> drivers/scsi/sd_zbc.c: In function ‘sd_zbc_read_zones’:
> drivers/scsi/sd_zbc.c:520:7: error: ‘zone_blocks’ may be used uninitialized 
> in this function [-Werror=maybe-uninitialized]
>
> This sets it to zero, which has the desired effect of leaving
> the sd_zbc_read_zones successfully with sdkp->zone_blocks = 0.
>
> Fixes: 89d947561077 ("sd: Implement support for ZBC devices")
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/scsi/sd_zbc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
> index 16d3fa62d8ac..d5b3bd915d9e 100644
> --- a/drivers/scsi/sd_zbc.c
> +++ b/drivers/scsi/sd_zbc.c
> @@ -455,8 +455,10 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
>
> /* Do a report zone to get the same field */
> ret = sd_zbc_report_zones(sdkp, buf, SD_ZBC_BUF_SIZE, 0);
> -   if (ret)
> +   if (ret) {
> +   zone_blocks = 0;
> goto out;
> +   }
>
> same = buf[4] & 0x0f;
> if (same > 0) {
> --
> 2.9.0
>

Reviewed-by: Shaun Tancheff 
-- 
Shaun Tancheff


Re: [PATCH] block: zoned: fix harmless maybe-uninitialized warning

2016-10-21 Thread Shaun Tancheff
On Fri, Oct 21, 2016 at 10:42 AM, Arnd Bergmann  wrote:
> The blkdev_report_zones produces a harmless warning when
> -Wmaybe-uninitialized is set, after gcc gets a little confused
> about the multiple 'goto' here:
>
> block/blk-zoned.c: In function 'blkdev_report_zones':
> block/blk-zoned.c:188:13: error: 'nz' may be used uninitialized in this 
> function [-Werror=maybe-uninitialized]
>
> Moving the assignment to nr_zones makes this a little simpler
> while also avoiding the warning reliably. I'm removing the
> extraneous initialization of 'int ret' in the same patch, as
> that is semi-related and could cause an uninitialized use of
> that variable to not produce a warning.
>
> Fixes: 6a0cb1bc106f ("block: Implement support for zoned block devices")
> Signed-off-by: Arnd Bergmann 
> ---
>  block/blk-zoned.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 667f95d86695..472211fa183a 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -80,7 +80,7 @@ int blkdev_report_zones(struct block_device *bdev,
> unsigned int i, n, nz;
> unsigned int ofst;
> void *addr;
> -   int ret = 0;
> +   int ret;
>
> if (!q)
> return -ENXIO;
> @@ -179,14 +179,12 @@ int blkdev_report_zones(struct block_device *bdev,
>
> }
>
> +   *nr_zones = nz;
>  out:
> bio_for_each_segment_all(bv, bio, i)
> __free_page(bv->bv_page);
>     bio_put(bio);
>
> -   if (ret == 0)
> -   *nr_zones = nz;
> -
> return ret;
>  }
>  EXPORT_SYMBOL_GPL(blkdev_report_zones);
> --
> 2.9.0

Reviewed-by: Shaun Tancheff 

> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  
> https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=DQIBAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r=Wg5NqlNlVTT7Ugl8V50qIHLe856QW0qfG3WVYGOrWzA&m=RyRS1pzTGdFENUb0PbQRSMAhvMgZx_dBftw2khYVIXU&s=p9SYS2a__p_YHv8FoZVz9kuTQQ7LIZBVKCkZuQgR0cs&e=



-- 
Shaun Tancheff


Re: [PATCH v5 2/4] fusion: remove iopriority handling

2016-10-13 Thread Shaun Tancheff
On Thu, Oct 13, 2016 at 6:00 PM, Adam Manzanares
 wrote:
> The request priority is now by default coming from the ioc. It was not
> clear what this code was trying to do based upon the iopriority class or
> data. The driver should check that a device supports priorities and use
> them according to the specificiations of ioprio.
>
> Signed-off-by: Adam Manzanares 
> ---
>  drivers/message/fusion/mptscsih.c | 5 -
>  1 file changed, 5 deletions(-)
>
> diff --git a/drivers/message/fusion/mptscsih.c 
> b/drivers/message/fusion/mptscsih.c
> index 6c9fc11..4740bb6 100644
> --- a/drivers/message/fusion/mptscsih.c
> +++ b/drivers/message/fusion/mptscsih.c
> @@ -1369,11 +1369,6 @@ mptscsih_qcmd(struct scsi_cmnd *SCpnt)
> if ((vdevice->vtarget->tflags & MPT_TARGET_FLAGS_Q_YES)
> && (SCpnt->device->tagged_supported)) {
> scsictl = scsidir | MPI_SCSIIO_CONTROL_SIMPLEQ;
> -   if (SCpnt->request && SCpnt->request->ioprio) {
> -   if (((SCpnt->request->ioprio & 0x7) == 1) ||
> -   !(SCpnt->request->ioprio & 0x7))
> -   scsictl |= MPI_SCSIIO_CONTROL_HEADOFQ;
> -   }
> } else
> scsictl = scsidir | MPI_SCSIIO_CONTROL_UNTAGGED;

Style wise you can further remove the extra parens around
  SCpnt->device->tagged_supported
As well as the now redundant braces.

Regards,
Shaun

> --
> 2.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  
> https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=DQIBAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r=Wg5NqlNlVTT7Ugl8V50qIHLe856QW0qfG3WVYGOrWzA&m=ZE7JzxXeXPEWqk9WYm42hZHj8gESRg1QoS5XklfbprM&s=C0iMyTgYbYl06F1SQ2DqfdESKBtl3Whp5rSnHSBXOc4&e=


[PATCH v6 7/7] blk-zoned: implement ioctls

2016-09-30 Thread Shaun Tancheff
Adds the new BLKREPORTZONE and BLKRESETZONE ioctls for respectively
obtaining the zone configuration of a zoned block device and resetting
the write pointer of sequential zones of a zoned block device.

The BLKREPORTZONE ioctl maps directly to a single call of the function
blkdev_report_zones. The zone information result is passed as an array
of struct blk_zone identical to the structure used internally for
processing the REQ_OP_ZONE_REPORT operation.  The BLKRESETZONE ioctl
maps to a single call of the blkdev_reset_zones function.

Signed-off-by: Shaun Tancheff 
Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
---
 block/blk-zoned.c | 93 +++
 block/ioctl.c |  4 ++
 include/linux/blkdev.h| 21 ++
 include/uapi/linux/blkzoned.h | 40 +++
 include/uapi/linux/fs.h   |  4 ++
 5 files changed, 162 insertions(+)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 1603573..667f95d 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -255,3 +255,96 @@ int blkdev_reset_zones(struct block_device *bdev,
return 0;
 }
 EXPORT_SYMBOL_GPL(blkdev_reset_zones);
+
+/**
+ * BLKREPORTZONE ioctl processing.
+ * Called from blkdev_ioctl.
+ */
+int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
+ unsigned int cmd, unsigned long arg)
+{
+   void __user *argp = (void __user *)arg;
+   struct request_queue *q;
+   struct blk_zone_report rep;
+   struct blk_zone *zones;
+   int ret;
+
+   if (!argp)
+   return -EINVAL;
+
+   q = bdev_get_queue(bdev);
+   if (!q)
+   return -ENXIO;
+
+   if (!blk_queue_is_zoned(q))
+   return -ENOTTY;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EACCES;
+
+   if (copy_from_user(&rep, argp, sizeof(struct blk_zone_report)))
+   return -EFAULT;
+
+   if (!rep.nr_zones)
+   return -EINVAL;
+
+   zones = kcalloc(rep.nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
+   if (!zones)
+   return -ENOMEM;
+
+   ret = blkdev_report_zones(bdev, rep.sector,
+ zones, &rep.nr_zones,
+ GFP_KERNEL);
+   if (ret)
+   goto out;
+
+   if (copy_to_user(argp, &rep, sizeof(struct blk_zone_report))) {
+   ret = -EFAULT;
+   goto out;
+   }
+
+   if (rep.nr_zones) {
+   if (copy_to_user(argp + sizeof(struct blk_zone_report), zones,
+sizeof(struct blk_zone) * rep.nr_zones))
+   ret = -EFAULT;
+   }
+
+ out:
+   kfree(zones);
+
+   return ret;
+}
+
+/**
+ * BLKRESETZONE ioctl processing.
+ * Called from blkdev_ioctl.
+ */
+int blkdev_reset_zones_ioctl(struct block_device *bdev, fmode_t mode,
+unsigned int cmd, unsigned long arg)
+{
+   void __user *argp = (void __user *)arg;
+   struct request_queue *q;
+   struct blk_zone_range zrange;
+
+   if (!argp)
+   return -EINVAL;
+
+   q = bdev_get_queue(bdev);
+   if (!q)
+   return -ENXIO;
+
+   if (!blk_queue_is_zoned(q))
+   return -ENOTTY;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EACCES;
+
+   if (!(mode & FMODE_WRITE))
+   return -EBADF;
+
+   if (copy_from_user(&zrange, argp, sizeof(struct blk_zone_range)))
+   return -EFAULT;
+
+   return blkdev_reset_zones(bdev, zrange.sector, zrange.nr_sectors,
+ GFP_KERNEL);
+}
diff --git a/block/ioctl.c b/block/ioctl.c
index ed2397f..448f78a 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -513,6 +513,10 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, 
unsigned cmd,
BLKDEV_DISCARD_SECURE);
case BLKZEROOUT:
return blk_ioctl_zeroout(bdev, mode, arg);
+   case BLKREPORTZONE:
+   return blkdev_report_zones_ioctl(bdev, mode, cmd, arg);
+   case BLKRESETZONE:
+   return blkdev_reset_zones_ioctl(bdev, mode, cmd, arg);
case HDIO_GETGEO:
return blkdev_getgeo(bdev, argp);
case BLKRAGET:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 252043f..90097dd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -316,6 +316,27 @@ extern int blkdev_report_zones(struct block_device *bdev,
 extern int blkdev_reset_zones(struct block_device *bdev, sector_t sectors,
  sector_t nr_sectors, gfp_t gfp_mask);
 
+extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
+unsigned int cmd, unsigned long arg);
+extern int blkdev_reset_zones_ioctl(struct block_device

[PATCH v6 5/7] block: Implement support for zoned block devices

2016-09-30 Thread Shaun Tancheff
From: Hannes Reinecke 

Implement zoned block device zone information reporting and reset.
Zone information are reported as struct blk_zone. This implementation
does not differentiate between host-aware and host-managed device
models and is valid for both. Two functions are provided:
blkdev_report_zones for discovering the zone configuration of a
zoned block device, and blkdev_reset_zones for resetting the write
pointer of sequential zones. The helper function blk_queue_zone_size
and bdev_zone_size are also provided for, as the name suggest,
obtaining the zone size (in 512B sectors) of the zones of the device.

Signed-off-by: Hannes Reinecke 

[Damien: * Removed the zone cache
 * Implement report zones operation based on earlier proposal
   by Shaun Tancheff ]
Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Shaun Tancheff 
Tested-by: Shaun Tancheff 
---
Changes from v5:
* Rebased on Jens' for-4.9/block branch (v5 is based on next-20160928).

 block/Kconfig |   8 ++
 block/Makefile|   2 +-
 block/blk-zoned.c | 257 ++
 include/linux/blkdev.h|  31 +
 include/uapi/linux/Kbuild |   1 +
 include/uapi/linux/blkzoned.h | 103 +
 6 files changed, 401 insertions(+), 1 deletion(-)
 create mode 100644 block/blk-zoned.c
 create mode 100644 include/uapi/linux/blkzoned.h

diff --git a/block/Kconfig b/block/Kconfig
index 5136ad4..7bb9bf8 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -89,6 +89,14 @@ config BLK_DEV_INTEGRITY
T10/SCSI Data Integrity Field or the T13/ATA External Path
Protection.  If in doubt, say N.
 
+config BLK_DEV_ZONED
+   bool "Zoned block device support"
+   ---help---
+   Block layer zoned block device support. This option enables
+   support for ZAC/ZBC host-managed and host-aware zoned block devices.
+
+   Say yes here if you have a ZAC or ZBC storage device.
+
 config BLK_DEV_THROTTLING
bool "Block layer bio throttling support"
depends on BLK_CGROUP=y
diff --git a/block/Makefile b/block/Makefile
index 9eda232..4676969 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -22,4 +22,4 @@ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
 obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
 obj-$(CONFIG_BLK_CMDLINE_PARSER)   += cmdline-parser.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o
-
+obj-$(CONFIG_BLK_DEV_ZONED)+= blk-zoned.o
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
new file mode 100644
index 000..1603573
--- /dev/null
+++ b/block/blk-zoned.c
@@ -0,0 +1,257 @@
+/*
+ * Zoned block device handling
+ *
+ * Copyright (c) 2015, Hannes Reinecke
+ * Copyright (c) 2015, SUSE Linux GmbH
+ *
+ * Copyright (c) 2016, Damien Le Moal
+ * Copyright (c) 2016, Western Digital
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+static inline sector_t blk_zone_start(struct request_queue *q,
+ sector_t sector)
+{
+   sector_t zone_mask = blk_queue_zone_size(q) - 1;
+
+   return sector & ~zone_mask;
+}
+
+/*
+ * Check that a zone report belongs to the partition.
+ * If yes, fix its start sector and write pointer, copy it in the
+ * zone information array and return true. Return false otherwise.
+ */
+static bool blkdev_report_zone(struct block_device *bdev,
+  struct blk_zone *rep,
+  struct blk_zone *zone)
+{
+   sector_t offset = get_start_sect(bdev);
+
+   if (rep->start < offset)
+   return false;
+
+   rep->start -= offset;
+   if (rep->start + rep->len > bdev->bd_part->nr_sects)
+   return false;
+
+   if (rep->type == BLK_ZONE_TYPE_CONVENTIONAL)
+   rep->wp = rep->start + rep->len;
+   else
+   rep->wp -= offset;
+   memcpy(zone, rep, sizeof(struct blk_zone));
+
+   return true;
+}
+
+/**
+ * blkdev_report_zones - Get zones information
+ * @bdev:  Target block device
+ * @sector:Sector from which to report zones
+ * @zones: Array of zone structures where to return the zones information
+ * @nr_zones:  Number of zone structures in the zone array
+ * @gfp_mask:  Memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *Get zone information starting from the zone containing @sector.
+ *The number of zone information reported may be less than the number
+ *requested by @nr_zones. The number of zones actually reported is
+ *returned in @nr_zones.
+ */
+int blkdev_report_zones(struct block_device *bdev,
+   sector_t sector,
+   struct blk_zone *zones,
+   unsigned int *nr_zones,
+   gfp_t gfp_mask)
+{
+   struct request_queue *q = bdev_get_queue(bdev);
+   s

[PATCH v6 6/7] sd: Implement support for ZBC devices

2016-09-30 Thread Shaun Tancheff
From: Hannes Reinecke 

Implement ZBC support functions to setup zoned disks, both
host-managed and host-aware models. Only zoned disks that satisfy
the following conditions are supported:
1) All zones are the same size, with the exception of an eventual
   last smaller runt zone.
2) For host-managed disks, reads are unrestricted (reads are not
   failed due to zone or write pointer alignement constraints).
Zoned disks that do not satisfy these 2 conditions are setup with
a capacity of 0 to prevent their use.

The function sd_zbc_read_zones, called from sd_revalidate_disk,
checks that the device satisfies the above two constraints. This
function may also change the disk capacity previously set by
sd_read_capacity for devices reporting only the capacity of
conventional zones at the beginning of the LBA range (i.e. devices
reporting rc_basis set to 0).

The capacity message output was moved out of sd_read_capacity into
a new function sd_print_capacity to include this eventual capacity
change by sd_zbc_read_zones. This new function also includes a call
to sd_zbc_print_zones to display the number of zones and zone size
of the device.

Signed-off-by: Hannes Reinecke 

[Damien: * Removed zone cache support
 * Removed mapping of discard to reset write pointer command
 * Modified sd_zbc_read_zones to include checks that the
   device satisfies the kernel constraints
 * Implemeted REPORT ZONES setup and post-processing based
   on code from Shaun Tancheff 
 * Removed confusing use of 512B sector units in functions
   interface]
Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Shaun Tancheff 
Tested-by: Shaun Tancheff 
---
Changes from v5:
* Rebased on Jens' for-4.9/block branch (v5 is based on next-20160928).

 drivers/scsi/Makefile |   1 +
 drivers/scsi/sd.c | 141 ---
 drivers/scsi/sd.h |  67 +
 drivers/scsi/sd_zbc.c | 627 ++
 include/scsi/scsi_proto.h |  17 ++
 5 files changed, 820 insertions(+), 33 deletions(-)
 create mode 100644 drivers/scsi/sd_zbc.c

diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index d539798..fabcb6d 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -179,6 +179,7 @@ hv_storvsc-y:= storvsc_drv.o
 
 sd_mod-objs:= sd.o
 sd_mod-$(CONFIG_BLK_DEV_INTEGRITY) += sd_dif.o
+sd_mod-$(CONFIG_BLK_DEV_ZONED) += sd_zbc.o
 
 sr_mod-objs:= sr.o sr_ioctl.o sr_vendor.o
 ncr53c8xx-flags-$(CONFIG_SCSI_ZALON) \
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index d3e852a..fb324ac 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -92,6 +92,7 @@ MODULE_ALIAS_BLOCKDEV_MAJOR(SCSI_DISK15_MAJOR);
 MODULE_ALIAS_SCSI_DEVICE(TYPE_DISK);
 MODULE_ALIAS_SCSI_DEVICE(TYPE_MOD);
 MODULE_ALIAS_SCSI_DEVICE(TYPE_RBC);
+MODULE_ALIAS_SCSI_DEVICE(TYPE_ZBC);
 
 #if !defined(CONFIG_DEBUG_BLOCK_EXT_DEVT)
 #define SD_MINORS  16
@@ -162,7 +163,7 @@ cache_type_store(struct device *dev, struct 
device_attribute *attr,
static const char temp[] = "temporary ";
int len;
 
-   if (sdp->type != TYPE_DISK)
+   if (sdp->type != TYPE_DISK && sdp->type != TYPE_ZBC)
/* no cache control on RBC devices; theoretically they
 * can do it, but there's probably so many exceptions
 * it's not worth the risk */
@@ -261,7 +262,7 @@ allow_restart_store(struct device *dev, struct 
device_attribute *attr,
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
 
-   if (sdp->type != TYPE_DISK)
+   if (sdp->type != TYPE_DISK && sdp->type != TYPE_ZBC)
return -EINVAL;
 
sdp->allow_restart = simple_strtoul(buf, NULL, 10);
@@ -391,6 +392,11 @@ provisioning_mode_store(struct device *dev, struct 
device_attribute *attr,
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
 
+   if (sd_is_zoned(sdkp)) {
+   sd_config_discard(sdkp, SD_LBP_DISABLE);
+   return count;
+   }
+
if (sdp->type != TYPE_DISK)
return -EINVAL;
 
@@ -458,7 +464,7 @@ max_write_same_blocks_store(struct device *dev, struct 
device_attribute *attr,
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
 
-   if (sdp->type != TYPE_DISK)
+   if (sdp->type != TYPE_DISK && sdp->type != TYPE_ZBC)
return -EINVAL;
 
err = kstrtoul(buf, 10, &max);
@@ -843,6 +849,12 @@ static int sd_setup_write_same_cmnd(struct scsi_cmnd *cmd)
 
BUG_ON(bio_offset(bio) || bio_iovec(bio).bv_len != sdp->sector_size);
 
+   if (sd_is_zoned(sdkp)) {
+   ret = sd_zbc_setup_read_write(cmd);
+   if (ret != BLKPREP_OK)
+   return ret;
+   }
+
sector >>= ilog2(sdp->sector_size) - 9;
nr_sectors 

[PATCH v6 4/7] block: Define zoned block device operations

2016-09-30 Thread Shaun Tancheff
From: Shaun Tancheff 

Define REQ_OP_ZONE_REPORT and REQ_OP_ZONE_RESET for handling zones of
host-managed and host-aware zoned block devices. With with these two
new operations, the total number of operations defined reaches 8 and
still fits with the 3 bits definition of REQ_OP_BITS.

Signed-off-by: Shaun Tancheff 
Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
---
 block/blk-core.c  | 4 
 include/linux/blk_types.h | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 14d7c07..e4eda5d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1941,6 +1941,10 @@ generic_make_request_checks(struct bio *bio)
case REQ_OP_WRITE_SAME:
if (!bdev_write_same(bio->bi_bdev))
goto not_supported;
+   case REQ_OP_ZONE_REPORT:
+   case REQ_OP_ZONE_RESET:
+   if (!bdev_is_zoned(bio->bi_bdev))
+   goto not_supported;
break;
default:
break;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cd395ec..dd50dce 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -243,6 +243,8 @@ enum req_op {
REQ_OP_SECURE_ERASE,/* request to securely erase sectors */
REQ_OP_WRITE_SAME,  /* write same block many times */
REQ_OP_FLUSH,   /* request for cache flush */
+   REQ_OP_ZONE_REPORT, /* Get zone information */
+   REQ_OP_ZONE_RESET,  /* Reset a zone write pointer */
 };
 
 #define REQ_OP_BITS 3
-- 
2.9.3



[PATCH v6 3/7] block: update chunk_sectors in blk_stack_limits()

2016-09-30 Thread Shaun Tancheff
From: Hannes Reinecke 

Signed-off-by: Hannes Reinecke 
Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Shaun Tancheff 
Tested-by: Shaun Tancheff 
---
 block/blk-settings.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index b1d5b7f..55369a6 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -631,6 +631,10 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
t->discard_granularity;
}
 
+   if (b->chunk_sectors)
+   t->chunk_sectors = min_not_zero(t->chunk_sectors,
+   b->chunk_sectors);
+
return ret;
 }
 EXPORT_SYMBOL(blk_stack_limits);
-- 
2.9.3



[PATCH v6 2/7] blk-sysfs: Add 'chunk_sectors' to sysfs attributes

2016-09-30 Thread Shaun Tancheff
From: Hannes Reinecke 

The queue limits already have a 'chunk_sectors' setting, so
we should be presenting it via sysfs.

Signed-off-by: Hannes Reinecke 

[Damien: Updated Documentation/ABI/testing/sysfs-block]

Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Shaun Tancheff 
Tested-by: Shaun Tancheff 
---
 Documentation/ABI/testing/sysfs-block | 13 +
 block/blk-sysfs.c | 11 +++
 2 files changed, 24 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-block 
b/Documentation/ABI/testing/sysfs-block
index 75a5055..ee2d5cd 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -251,3 +251,16 @@ Description:
since drive-managed zoned block devices do not support
zone commands, they will be treated as regular block
devices and zoned will report "none".
+
+What:  /sys/block//queue/chunk_sectors
+Date:  September 2016
+Contact:   Hannes Reinecke 
+Description:
+   chunk_sectors has different meaning depending on the type
+   of the disk. For a RAID device (dm-raid), chunk_sectors
+   indicates the size in 512B sectors of the RAID volume
+   stripe segment. For a zoned block device, either
+   host-aware or host-managed, chunk_sectors indicates the
+   size of 512B sectors of the zones of the device, with
+   the eventual exception of the last zone of the device
+   which may be smaller.
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index ff9cd9c..488c2e2 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -130,6 +130,11 @@ static ssize_t queue_physical_block_size_show(struct 
request_queue *q, char *pag
return queue_var_show(queue_physical_block_size(q), page);
 }
 
+static ssize_t queue_chunk_sectors_show(struct request_queue *q, char *page)
+{
+   return queue_var_show(q->limits.chunk_sectors, page);
+}
+
 static ssize_t queue_io_min_show(struct request_queue *q, char *page)
 {
return queue_var_show(queue_io_min(q), page);
@@ -455,6 +460,11 @@ static struct queue_sysfs_entry 
queue_physical_block_size_entry = {
.show = queue_physical_block_size_show,
 };
 
+static struct queue_sysfs_entry queue_chunk_sectors_entry = {
+   .attr = {.name = "chunk_sectors", .mode = S_IRUGO },
+   .show = queue_chunk_sectors_show,
+};
+
 static struct queue_sysfs_entry queue_io_min_entry = {
.attr = {.name = "minimum_io_size", .mode = S_IRUGO },
.show = queue_io_min_show,
@@ -555,6 +565,7 @@ static struct attribute *default_attrs[] = {
&queue_hw_sector_size_entry.attr,
&queue_logical_block_size_entry.attr,
&queue_physical_block_size_entry.attr,
+   &queue_chunk_sectors_entry.attr,
&queue_io_min_entry.attr,
&queue_io_opt_entry.attr,
&queue_discard_granularity_entry.attr,
-- 
2.9.3



[PATCH v6 1/7] block: Add 'zoned' queue limit

2016-09-30 Thread Shaun Tancheff
From: Damien Le Moal 

Add the zoned queue limit to indicate the zoning model of a block device.
Defined values are 0 (BLK_ZONED_NONE) for regular block devices,
1 (BLK_ZONED_HA) for host-aware zone block devices and 2 (BLK_ZONED_HM)
for host-managed zone block devices. The standards defined drive managed
model is not defined here since these block devices do not provide any
command for accessing zone information. Drive managed model devices will
be reported as BLK_ZONED_NONE.

The helper functions blk_queue_zoned_model and bdev_zoned_model return
the zoned limit and the functions blk_queue_is_zoned and bdev_is_zoned
return a boolean for callers to test if a block device is zoned.

The zoned attribute is also exported as a string to applications via
sysfs. BLK_ZONED_NONE shows as "none", BLK_ZONED_HA as "host-aware" and
BLK_ZONED_HM as "host-managed".

Signed-off-by: Damien Le Moal 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Reviewed-by: Shaun Tancheff 
Tested-by: Shaun Tancheff 
---
 Documentation/ABI/testing/sysfs-block | 16 
 block/blk-settings.c  |  1 +
 block/blk-sysfs.c | 18 ++
 include/linux/blkdev.h| 47 +++
 4 files changed, 82 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-block 
b/Documentation/ABI/testing/sysfs-block
index 71d184d..75a5055 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -235,3 +235,19 @@ Description:
write_same_max_bytes is 0, write same is not supported
by the device.
 
+What:  /sys/block//queue/zoned
+Date:  September 2016
+Contact:   Damien Le Moal 
+Description:
+   zoned indicates if the device is a zoned block device
+   and the zone model of the device if it is indeed zoned.
+   The possible values indicated by zoned are "none" for
+   regular block devices and "host-aware" or "host-managed"
+   for zoned block devices. The characteristics of
+   host-aware and host-managed zoned block devices are
+   described in the ZBC (Zoned Block Commands) and ZAC
+   (Zoned Device ATA Command Set) standards. These standards
+   also define the "drive-managed" zone model. However,
+   since drive-managed zoned block devices do not support
+   zone commands, they will be treated as regular block
+   devices and zoned will report "none".
diff --git a/block/blk-settings.c b/block/blk-settings.c
index f679ae1..b1d5b7f 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -107,6 +107,7 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->io_opt = 0;
lim->misaligned = 0;
lim->cluster = 1;
+   lim->zoned = BLK_ZONED_NONE;
 }
 EXPORT_SYMBOL(blk_set_default_limits);
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 9cc8d7c..ff9cd9c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -257,6 +257,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
 QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
 #undef QUEUE_SYSFS_BIT_FNS
 
+static ssize_t queue_zoned_show(struct request_queue *q, char *page)
+{
+   switch (blk_queue_zoned_model(q)) {
+   case BLK_ZONED_HA:
+   return sprintf(page, "host-aware\n");
+   case BLK_ZONED_HM:
+   return sprintf(page, "host-managed\n");
+   default:
+   return sprintf(page, "none\n");
+   }
+}
+
 static ssize_t queue_nomerges_show(struct request_queue *q, char *page)
 {
return queue_var_show((blk_queue_nomerges(q) << 1) |
@@ -485,6 +497,11 @@ static struct queue_sysfs_entry queue_nonrot_entry = {
.store = queue_store_nonrot,
 };
 
+static struct queue_sysfs_entry queue_zoned_entry = {
+   .attr = {.name = "zoned", .mode = S_IRUGO },
+   .show = queue_zoned_show,
+};
+
 static struct queue_sysfs_entry queue_nomerges_entry = {
.attr = {.name = "nomerges", .mode = S_IRUGO | S_IWUSR },
.show = queue_nomerges_show,
@@ -546,6 +563,7 @@ static struct attribute *default_attrs[] = {
&queue_discard_zeroes_data_entry.attr,
&queue_write_same_max_entry.attr,
&queue_nonrot_entry.attr,
+   &queue_zoned_entry.attr,
&queue_nomerges_entry.attr,
&queue_rq_affinity_entry.attr,
&queue_iostats_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c47c358..f19e16b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -261,6 +261,15 @@ struct blk_queue_tag {
 #define BLK_SCSI_MAX_CMDS  (256)
 #define BLK_SCSI_CMD_PER_LONG  (BLK_SCSI_MAX_CMDS / (sizeof(long) * 8))
 
+/*
+ * Z

[PATCH v6 0/7] ZBC / Zoned block device support

2016-09-30 Thread Shaun Tancheff
This series introduces support for zoned block devices. It integrates
earlier submissions by Hannes Reinecke, Damien Le Moal and Shaun Tancheff.
Compared to the previous series version, the code was significantly
simplified by limiting support to zoned devices satisfying the following
conditions:
1) All zones of the device are the same size, with the exception of an
   eventual last smaller runt zone.
2) For host-managed disks, reads must be unrestricted (read commands do not
   fail due to zone or write pointer alignement constraints).
Zoned disks that do not satisfy these 2 conditions are ignored.

These 2 conditions allowed dropping the zone information cache implemented
in the previous version. This simplifies the code and also reduces the memory
consumption at run time. Support for zoned devices now only require one bit
per zone (less than 8KB in total). This bit field is used to write-lock
zones and prevent the concurrent execution of multiple write commands in
the same zone. This avoids write ordering problems at dispatch time, for
both the simple queue and scsi-mq settings.

The new operations introduced to suport zone manipulation was reduced to
only the two main ZBC/ZAC defined commands: REPORT ZONES (REQ_OP_ZONE_REPORT)
and RESET WRITE POINTER (REQ_OP_ZONE_RESET). This brings the total number of
operations defined to 8, which fits in the 3 bits (REQ_OP_BITS) reserved for
operation code in bio->bi_opf and req->cmd_flags.

Most of the ZBC specific code is kept out of sd.c and implemented in the
new file sd_zbc.c. Similarly, at the block layer, most of the zoned block
device code is implemented in the new blk-zoned.c.

For host-managed zoned block devices, the sequential write constraint of
write pointer zones is exposed to the user. Users of the disk (applications,
file systems or device mappers) must sequentially write to zones. This means
that for raw block device accesses from applications, buffered writes are
unreliable and direct I/Os must be used (or buffered writes with O_SYNC).

Access to zone manipulation operations is also provided to applications
through a set of new ioctls. This allows applications operating on raw
block devices (e.g. mkfs.xxx) to discover a device zone layout and
manipulate zone state.

Changes from v5:
* Rebased on Jens' for-4.9/block branch (v5 is based on next-20160928).

Changes from v4:
* Changed interface of sd_zbc_setup_read_write

Changes from v3:
* Fixed several typos and tabs/spaces
* Added description of zoned and chunk_sectors queue attributes in
  Documentation/ABI/testing/sysfs-block
* Fixed sd_read_capacity call in sd.c and to avoid missing information on
  the first pass of a disk scan
* Fixed scsi_disk zone related field to use logical block size unit instead
  of 512B sector unit.

Changes from v2:
* Use kcalloc to allocate zone information array for ioctl
* Use kcalloc to allocate zone information array for ioctl
* Export GPL the functions blkdev_report_zones and blkdev_reset_zones
* Shuffled uapi definitions from patch 7 into patch 5


Damien Le Moal (1):
  block: Add 'zoned' queue limit

Hannes Reinecke (4):
  blk-sysfs: Add 'chunk_sectors' to sysfs attributes
  block: update chunk_sectors in blk_stack_limits()
  block: Implement support for zoned block devices
  sd: Implement support for ZBC devices

Shaun Tancheff (2):
  block: Define zoned block device operations
  blk-zoned: implement ioctls

 Documentation/ABI/testing/sysfs-block |  29 ++
 block/Kconfig |   8 +
 block/Makefile|   2 +-
 block/blk-core.c  |   4 +
 block/blk-settings.c  |   5 +
 block/blk-sysfs.c |  29 ++
 block/blk-zoned.c | 350 +++
 block/ioctl.c |   4 +
 drivers/scsi/Makefile |   1 +
 drivers/scsi/sd.c | 141 ++--
 drivers/scsi/sd.h |  67 
 drivers/scsi/sd_zbc.c | 627 ++
 include/linux/blk_types.h |   2 +
 include/linux/blkdev.h|  99 ++
 include/scsi/scsi_proto.h |  17 +
 include/uapi/linux/Kbuild |   1 +
 include/uapi/linux/blkzoned.h | 143 
 include/uapi/linux/fs.h   |   4 +
 18 files changed, 1499 insertions(+), 34 deletions(-)
 create mode 100644 block/blk-zoned.c
 create mode 100644 drivers/scsi/sd_zbc.c
 create mode 100644 include/uapi/linux/blkzoned.h

-- 
2.9.3



Re: BUG Re: mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk

2016-09-27 Thread Shaun Tancheff
Confirmed:
  - Removing DEBUG_VM_RB fixes the hang.

Also confirmed:
 - Above patch fixes the hang when DEBUG_VM_RB is re-enabled.

Thanks!



On Tue, Sep 27, 2016 at 11:05 AM, Andrea Arcangeli  wrote:
> Hello,
>
> On Tue, Sep 27, 2016 at 05:16:15AM -0500, Shaun Tancheff wrote:
>> git bisect points at commit  c9634dcf00c9c93b ("mm: vma_merge: fix
>> vm_page_prot SMP race condition against rmap_walk")
>
> I assume linux-next? But I can't find the commit, but I should know
> what this is.
>
>>
>> Last lines to console are [transcribed]:
>>
>> vma 8c3d989a7c78 start 7fe02ed4c000 end 7fe02ed52000
>> next 8c3d96de0c38 prev 8c3d989a6e40 mm 8c3d071cbac0
>> prot 8025 anon_vma 8c3d96fc9b28 vm_ops   (null)
>> pgoff 7fe02ed4c file   (null) private_data   (null)
>> flags: 0x8100073(read|write|mayread|maywrite|mayexec|account|softdirty)
>
> It's a false positive, you have DEBUG_VM_RB=y, you can disable it or
> cherry-pick the fix:
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__git.kernel.org_cgit_linux_kernel_git_andrea_aa.git_commit_-3Fid-3D74d8b44224f31153e23ca8a7f7f0700091f5a9b2&d=DQIBAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r=Wg5NqlNlVTT7Ugl8V50qIHLe856QW0qfG3WVYGOrWzA&m=mhyVFRknYnKxpypFw43nt0xMGGZX0r4k-qe6PIyp5ew&s=QjS2W4fUFnnJl4YxCk4WB30v5281AC4B7bAQeP8KWlQ&e=
>
> The assumption validate_mm_rb did isn't valid anymore on the new code
> during __vma_unlink, the validation code must be updated to skip the
> next vma instead of the current one after this change. It's a bug in
> DEBUG_VM_RB=y, if you keep DEBUG_VM_RB=n there's no bug.
>
>> Reproducer is an Ubuntu 16.04.1 LTS x86_64 running on a VM (VirtualBox).
>> Symptom is a solid hang after boot and switch to starting gnome session.
>>
>> Hang at about 35s.
>>
>> kdbg traceback is all null entries.
>>
>> Let me know what additional information I can provide.
>
> I already submitted the fix to Andrew last week:
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__marc.info_-3Fl-3Dlinux-2Dmm-26m-3D147449253801920-26w-3D2&d=DQIBAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r=Wg5NqlNlVTT7Ugl8V50qIHLe856QW0qfG3WVYGOrWzA&m=mhyVFRknYnKxpypFw43nt0xMGGZX0r4k-qe6PIyp5ew&s=EIo2P9JsNNIZSPoTgxO2vC5DJE4p6-HeOznwL1qhowo&e=
>
> I assume it's pending for merging in -mm.
>
> If you can test this patch and confirm the problem goes away with
> DEBUG_VM_RB=y it'd be great.
>
> Thanks,
> Andrea



-- 
Shaun Tancheff


BUG Re: mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk

2016-09-27 Thread Shaun Tancheff
git bisect points at commit  c9634dcf00c9c93b ("mm: vma_merge: fix
vm_page_prot SMP race condition against rmap_walk")

Last lines to console are [transcribed]:

vma 8c3d989a7c78 start 7fe02ed4c000 end 7fe02ed52000
next 8c3d96de0c38 prev 8c3d989a6e40 mm 8c3d071cbac0
prot 8025 anon_vma 8c3d96fc9b28 vm_ops   (null)
pgoff 7fe02ed4c file   (null) private_data   (null)
flags: 0x8100073(read|write|mayread|maywrite|mayexec|account|softdirty)

Reproducer is an Ubuntu 16.04.1 LTS x86_64 running on a VM (VirtualBox).
Symptom is a solid hang after boot and switch to starting gnome session.

Hang at about 35s.

kdbg traceback is all null entries.

Let me know what additional information I can provide.

Regards,
Shaun Tancheff


Re: UFS API in the kernel

2016-09-26 Thread Shaun Tancheff
On Thu, Sep 22, 2016 at 10:21 AM, Joao Pinto  wrote:
> Hi!
>
> I am designing an application that has the goal to be an utility for Unipro 
> and
> UFS testing purposes. This application is going to run on top of a recent 
> Linux
> Kernel containing the new UFS stack (including the new DWC drivers).
>
> I am considering doing the following:
> a) Create a new config item called CONFIG_UFS_CHARDEV which is going to 
> create a
> char device responsible to make some IOCTL available for user-space 
> applications
> b) Create a linux/ufs.h header file that contains data structures declarations
> that will be needed in user-space applications

I am not very familiar with UFS devices, that said you should have an
sgX chardev being created already so you can handle SG_IO requests.
There also appear to be some sysfs entries being created.

So between sg and sysfs you should be able to handle any user-space
out of band requests without resorting to making a new chardev.

Adding more sysfs entries, if you need them, should be fine.

You may find it easier to expand on the existing interfaces than to
get consensus on a new driver and ioctls.

Hope this helps,
Shaun

> Could you please advise me about what the correct approach should be to make 
> it
> as standard as possible and usable in the future?
>
> Thank you very much for your help!
>
> regards,
> Joao
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  
> https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=DQICaQ&c=IGDlg0lD0b-nebmJJ0Kp8A&r=Wg5NqlNlVTT7Ugl8V50qIHLe856QW0qfG3WVYGOrWzA&m=vJFB6pCywWtdvkgHz9Vc0jQz0xzeyZlr-7eCWYu88nM&s=yiQLPFpqmMrbqLZz1Jb3aNqOje2dRMLJHEzUDobwcXc&e=



-- 
Shaun Tancheff


Re: [PATCH v8 1/2 RESEND] Add bio/request flags to issue ZBC/ZAC commands

2016-08-25 Thread Shaun Tancheff
On Thu, Aug 25, 2016 at 9:31 PM, Damien Le Moal  wrote:
>
> Shaun,
>
> On 8/25/16 05:24, Shaun Tancheff wrote:
>>
>> (RESENDING to include f2fs, fs-devel and dm-devel)
>>
>> Add op flags to access to zone information as well as open, close
>> and reset zones:
>>   - REQ_OP_ZONE_REPORT - Query zone information (Report zones)
>>   - REQ_OP_ZONE_OPEN - Explicitly open a zone for writing
>>   - REQ_OP_ZONE_CLOSE - Explicitly close a zone
>>   - REQ_OP_ZONE_FINISH - Explicitly finish a zone
>>   - REQ_OP_ZONE_RESET - Reset Write Pointer to start of zone
>>
>> These op flags can be used to create bio's to control zoned devices
>> through the block layer.
>
>
> I still have a hard time seeing the need for the REQ_OP_ZONE_REPORT
> operation assuming that the device queue will hold a zone information cache,
> Hannes RB-tree or your array type, whichever.
>
> Let's try to think simply here: if the disk user (and FS, a device mapper or
> an application doing raw disk accesses) wants to access the disk zone
> information, why would it need to issue a BIO when calling
> blkdev_lookup_zone would exactly give that information straight out of
> memory (so much faster) ? I thought hard about this, but cannot think of any
> value for the BIO-to-disk option. It seems to me to be equivalent to
> systematically doing a page cache read even if the page cache tells us that
> the page is up-to-date...

Firstly the BIO abstraction here gives a common interface to
getting the zone information and works even for embedded
systems that are not willing / convinced to enable
SCSI_ZBC + BLK_ZONED.

Secondly when SCSI_ZBC + BLK_ZONED are enabled it just
returns from the zone cache [as you can hopefully find
in the second half of this series]. I did add a 'force' option
but it's not intended to be used lightly.

Thirdly it is my belief that BIO abstraction is more easily
adapted to working with [and through] the device mapper
layer (s).

Today we both have the issue where if a file system
supports working with a ZBC device there can be no
device mapper stacked between the file system and
the actual zoned device. This is also true of our respective
device mapper targets.

It is my current belief that teaching the device mapper
layer to include REQ_OP_ZONE* operations is relatively
straight forward and can be done w/o affecting existing
targets that don't specifically need to operate on zones.
Something similar to the way flush is handled currently.
If the target doesn't ask to see zone operations the default
mapping rules apply.

Examples of why I would like to add REQ_OP_ZONE*
support to the device mapper:

I think it would be really neat if I could just to a quick
dm-linear and put big chunk of SSD in front of dm-zoned
or dm-zdm as it would be a nice way to boost performance.

Similarly it enable using dm-linear to stitch together enough
conventional space with a ZBC drive to see if Dave Chinner's
XFS proposal from a couple of years ago could work.

> Moreover, issuing a report zone to the disk may return information that is
> in fact incorrect, as that would not take into account the eventual set of
> write requests that was dispatched but not yet processed by the disk (some
> zone write pointer may be reported with a value lower than what the zone
> cache maintains).

Yes but issuing a zone report to media is not the expected path
when the zone cache is available. It is there to 'force' a re-sync
and it is intended that the user of the call knows that the force
is being applied and wants it to happen. Perhaps I should make
it two flags? One to force a reply form the device and second
flag to re-sync the zone cache with the result? There is one
piece of information that can only be retrieved by going to the
device and that is the 'non-seq resources' flag and it is only
used by Host Aware devices ... as far as I understand.

> Dealing (and fixing) these inconsistencies would force an update of the
> report zone result using the information of the zone cache, which in itself
> sounds like a good justification of not doing a report zones in the first
> place.

When report zones is just pulling from the zone cache it should
not be a problem. So the normal activity [when SCSI_ZBC +
BLK_ZONED are enabled] should not be introducing any
inconsistencies.

> I am fine with the other operations, and in fact having a BIO interface for
> them to send down to the SCSI layer is better than any other method. It will
> causes them to be seen in sd_init_command, which is the path taken for read
> and write commands too. So all zone cache information checking and updating
> can be done in that single place and serialized with a spinlock. Maintenance
> of the zone cache information becomes very easy.
>
> 

Re: [PATCH v6 3/4 RESEND] SCT Write Same / DSM Trim

2016-08-25 Thread Shaun Tancheff
On Thu, Aug 25, 2016 at 2:01 AM, Tom Yan  wrote:
> Really please just drop this patch. There is no rational reason for
> you to associate the maximum payload size to the logical sector size.

Been over this many, many times now. It has to do with the size of
the buffer setup through WRITE SAME in drivers/scsi/sd.c

> And please stop using the ATA SCSI Response Buffer (ata_scsi_rbuf)
> that is used for response to the SCSI layer for SCSI commands that
> won't really interact with the ATA device (i.e. triggers an ATA
> command), while ata_format_sct_write_same() and
> ata_scsi_write_same_xlat() are used for constructing payload that is
> going to be send to the ATA device. Can't you even see that these are
> of different direction to different layer?

Adding a new global buffer where there is one there already is
kind of silly. The buffer already has a perfectly acceptable
spinlock and the time spent copying data around is trivially
small in comparison to the I/O operation so there is not
likely to be any contention over the buffer.

It is memory. Why do you think ata_scsi_rbuf is so special?


Re: [PATCH v6 2/4] Add support for SCT Write Same

2016-08-25 Thread Shaun Tancheff
On Thu, Aug 25, 2016 at 1:23 AM, Tom Yan  wrote:
> You only fill the bytes that you want to to set explicitly:
>
> +   put_unaligned_le16(0x0002,  &sctpg[0]); /* SCT_ACT_WRITE_SAME */
> +   put_unaligned_le16(0x0101,  &sctpg[1]); /* WRITE PTRN FG */
> +   put_unaligned_le64(lba, &sctpg[2]);
> +   put_unaligned_le64(num, &sctpg[6]);
> +   put_unaligned_le32(0u,  &sctpg[10]);
>
> What I doubted is, if you don't memset (zero-fill) the buffer first,
> will other bytes have indeterministic value that causes random
> unexpected behavior?

No.
If there is random or unexpected behaviour the device is broken
and some other remedy, such as blacklisting, is required.

---
Shaun


Re: [PATCH v6 1/4] libata: Safely overwrite attached page in WRITE SAME xlat

2016-08-25 Thread Shaun Tancheff
On Thu, Aug 25, 2016 at 1:31 AM, Tom Yan  wrote:
> On 25 August 2016 at 05:28, Shaun Tancheff  wrote:
>> On Wed, Aug 24, 2016 at 12:31 AM, Tom Yan  wrote:
>>> On 24 August 2016 at 11:33, Martin K. Petersen
>>>  wrote:
>>>>>>>>> "Tom" == Tom Yan  writes:
>>>>
>>>> Tom> Nope, SCSI Write Same commands does not have payload (or in SCSI
>>>> Tom> terms, parameter list / data-out buffer).
>>>>
>>>> WRITE SAME has a a payload of 1 logical block (unless NDOB is set but we
>>>> have had no good reason to support that yet).
>>>
>>> Interesting, I wasn't aware of the bit. I just didn't see any
>>> parameter list defined for any of the Write Same commands. Ah wait, it
>>> carries the pattern (the "same") and so.
>>>
>>> Hmm, it doesn't seem like the translation implemented in this patch
>>> series cares about the payload though?
>>
>> As repeated here and elsewhere the payload is:
>> scsi_sglist(cmd)
>> and it was put there by scsi_init_io() when it called scsi_init_sgtable()
>
> What I mean is:
>
> put_unaligned_le32(0u,  &sctpg[10]);
>
> So even if the payload of the SCSI Write Same commands instruct a
> non-zero pattern writing, SCT Write Same will conveniently ignore that
> do zero-filling anyway. That's what I mean by "doesn't care about the
> payload".

If you would like to add support for that it would be nice. I am not
planning to do so here.

> Though that would only be case with SG_IO though. SCSI Write Same
> issued from block layer (BLKZEROOUT) will be instructing zero-filling
> anyway.

>>>> UNMAP has a payload that varies based on the number of range
>>>> descriptors. The SCSI disk driver only ever issues a single descriptor
>>>> but since libata doesn't support UNMAP this doesn't really come into
>>>> play.
>>>>
>>>> Ideally there would be a way to distinguish between device limits for
>>>> WRITE SAME with the UNMAP bit and for "regular" WRITE SAME. One way to
>>>> do that would be to transition the libata discard implementation over to
>>>> single-range UNMAP, fill out the relevant VPD page B0 fields and leave
>>>> the WRITE SAME bits for writing zeroes.
>>>>
>>>> One reason that has not been particularly compelling is that the WRITE
>>>> SAME payload buffer does double duty to hold the ATA DSM TRIM range
>>>
>>> Huh? Why would the SATL care about its payload buffer for TRIM (i.e.
>>> when the UNMAP bit is set)? Doesn't it just read the LBA and NUMBER OF
>>> BLOCKS field and pack TRIM ranges/payload according to that?
>>>
>>>> descriptors and matches the required ATA payload size. Whereas the UNMAP
>>>
>>> Why would it need to "matches the required ATA payload size"?
>>>
>>>> command would only provide 24 bytes of TRIM range space.
>>>
>>> I don't really follow. The UNMAP descriptor has LBA (8 bytes / 64-bit)
>>> and NUMBER OF BLOCKS (4 bytes / 32-bit) field of the same length as
>>> Write Same (16). Even if the SCSI disk driver will only send one
>>> descriptor, it should work as good as Write Same (16).
>>
>> The "payload" is the data block transferred with the command.
>> The "descriptor" is, in this context, the contents of the payload as
>> it "describes" what the action the command is supposed to perform.
>>
>
> I know right.
>
>> The "payload" contains the "descriptor" that "describes" how
>> DSM TRIM should determine which logical blocks it should UNMAP.
>
> This should only be the case of UNMAP command, but not SCSI Write Same
> with UNMAP bit set. And either way it should not affect how large the
> ATA TRIM payload can be.

The current SATL does not report support for UNMAP.
If you think it should be added please submit a patch.

If you would like to extend the current translate to support sending
multiple blocks of trim data please submit a patch.

>>>> Also, please be careful with transfer lengths, __data_len, etc. As
>>>> mentioned, the transfer length WRITE SAME is typically 512 bytes and
>>>> that's the number of bytes that need to be DMA'ed and transferred over
>>>> the wire by the controller. But from a command completion perspective we
>>>> need to complete however many bytes the command acted upon. Unlike reads
>>>> and writes there is not a 1:1 mapping between the transfer length and
>>>> the affected area. So we do a bit of magic after the buffer has been
>>>> mapped to ensure that the completion byte count matches the number of
>>>> blocks that were affected by the command rather than the size of the
>>>> data buffer in bytes.
>>>>
>>>> --
>>>> Martin K. Petersen  Oracle Linux Engineering
>> --
>> Shaun Tancheff
-- 
Shaun Tancheff


Re: [PATCH v6 2/4] Add support for SCT Write Same

2016-08-24 Thread Shaun Tancheff
On Wed, Aug 24, 2016 at 1:10 AM, Tom Yan  wrote:
> Btw, I wonder if you need to memset your buffer with 0 first, like
> what is done in ata_scsi_rbuf_get.

It is not necessary as the defined buffer is completely filled out here.

Are you thinking as a sort of future proofing?
Ex: In the unlikely event that the SCT Write Same command
descriptor is expanded in a future ACS?

It is more likely to see the command deprecated and replaced with a
new SCT feature.

Regardless of how unlikely I would consider a memset here to clear
the remainder of the payload.


Re: [PATCH v6 1/4] libata: Safely overwrite attached page in WRITE SAME xlat

2016-08-24 Thread Shaun Tancheff
On Wed, Aug 24, 2016 at 12:31 AM, Tom Yan  wrote:
> On 24 August 2016 at 11:33, Martin K. Petersen
>  wrote:
>>>>>>> "Tom" == Tom Yan  writes:
>>
>> Tom> Nope, SCSI Write Same commands does not have payload (or in SCSI
>> Tom> terms, parameter list / data-out buffer).
>>
>> WRITE SAME has a a payload of 1 logical block (unless NDOB is set but we
>> have had no good reason to support that yet).
>
> Interesting, I wasn't aware of the bit. I just didn't see any
> parameter list defined for any of the Write Same commands. Ah wait, it
> carries the pattern (the "same") and so.
>
> Hmm, it doesn't seem like the translation implemented in this patch
> series cares about the payload though?

As repeated here and elsewhere the payload is:
scsi_sglist(cmd)
and it was put there by scsi_init_io() when it called scsi_init_sgtable()

>> UNMAP has a payload that varies based on the number of range
>> descriptors. The SCSI disk driver only ever issues a single descriptor
>> but since libata doesn't support UNMAP this doesn't really come into
>> play.
>>
>> Ideally there would be a way to distinguish between device limits for
>> WRITE SAME with the UNMAP bit and for "regular" WRITE SAME. One way to
>> do that would be to transition the libata discard implementation over to
>> single-range UNMAP, fill out the relevant VPD page B0 fields and leave
>> the WRITE SAME bits for writing zeroes.
>>
>> One reason that has not been particularly compelling is that the WRITE
>> SAME payload buffer does double duty to hold the ATA DSM TRIM range
>
> Huh? Why would the SATL care about its payload buffer for TRIM (i.e.
> when the UNMAP bit is set)? Doesn't it just read the LBA and NUMBER OF
> BLOCKS field and pack TRIM ranges/payload according to that?
>
>> descriptors and matches the required ATA payload size. Whereas the UNMAP
>
> Why would it need to "matches the required ATA payload size"?
>
>> command would only provide 24 bytes of TRIM range space.
>
> I don't really follow. The UNMAP descriptor has LBA (8 bytes / 64-bit)
> and NUMBER OF BLOCKS (4 bytes / 32-bit) field of the same length as
> Write Same (16). Even if the SCSI disk driver will only send one
> descriptor, it should work as good as Write Same (16).

The "payload" is the data block transferred with the command.
The "descriptor" is, in this context, the contents of the payload as
it "describes" what the action the command is supposed to perform.

The "payload" contains the "descriptor" that "describes" how
DSM TRIM should determine which logical blocks it should UNMAP.

>> Also, please be careful with transfer lengths, __data_len, etc. As
>> mentioned, the transfer length WRITE SAME is typically 512 bytes and
>> that's the number of bytes that need to be DMA'ed and transferred over
>> the wire by the controller. But from a command completion perspective we
>> need to complete however many bytes the command acted upon. Unlike reads
>> and writes there is not a 1:1 mapping between the transfer length and
>> the affected area. So we do a bit of magic after the buffer has been
>> mapped to ensure that the completion byte count matches the number of
>> blocks that were affected by the command rather than the size of the
>> data buffer in bytes.
>>
>> --
>> Martin K. Petersen  Oracle Linux Engineering
-- 
Shaun Tancheff


Re: [PATCH 1/2] f2fs: detect host-managed SMR by feature flag

2016-08-24 Thread Shaun Tancheff
orrectly for HASMR.

> IIRC, we discussed about HASMR with Marc Lehmann last year, and he gave the
> below link which sums up.
>
> https://urldefense.proofpoint.com/v2/url?u=http-3A__comments.gmane.org_gmane.linux.file-2Dsystems.f2fs_2854&d=CwIBAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r=Wg5NqlNlVTT7Ugl8V50qIHLe856QW0qfG3WVYGOrWzA&m=3Uw29rZTMJsKnUE9bBduFOdp3OiEwJhpgbJ9GmyO_SU&s=YNxrXKMJpd-BPHnYG8ctqg6ct58p2yqO3BRKlbjBO_s&e=
> https://urldefense.proofpoint.com/v2/url?u=http-3A__blog.schmorp.de_2015-2D10-2D08-2Dsmr-2Darchive-2Ddrives-2Dfast-2Dnow.html&d=CwIBAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r=Wg5NqlNlVTT7Ugl8V50qIHLe856QW0qfG3WVYGOrWzA&m=3Uw29rZTMJsKnUE9bBduFOdp3OiEwJhpgbJ9GmyO_SU&s=SC_RkdkfDQZTeIccq0o1zE4siI_RgDL4UhiVc1VzENY&e=

I think above applies to 'drive managed' SMR where the upper level
has no insight into the underlying zone layout.

>> So how about also introducing a F2FS_FEATURE_HASMR feature flag to
>> handle these different cases ?
>
> Yes, so once I can retrieve the zone information from the device, surely I'll
> add that.

I have posted some patches that will allow you to retrieve the zone
information for both HA and HM drives.

Can you see if this would meet your needs?

The data is returned to the BIO is in the same (drive) format as SG_IO.

>> Also, I think that the DISCARD option must be enabled by default for
>> HMSMR disks. Otherwise, zones write pointer will never get reset.
>> The same applies to HASMR devices mounted with the LFS mode.
>> (In any case, the discard handling does not look like it will always
>> align to the device zone size, which will fail on a zoned disk (discard
>> granularity is the zone size). I may be missing something though. Still
>> checking.)
>
> Yup, will do. BTW, I remember I fixed zone-aligned discard issue before.
> Could you please check the latest dev-test branch in f2fs.git?
> I'm used to rebase that branch occasionally when I changed the previous 
> patches.

I think you will need to have the have access to the zone information.

Regards,
Shaun Tancheff


[PATCH v8 2/2 RESEND] Add ioctl to issue ZBC/ZAC commands via block layer

2016-08-24 Thread Shaun Tancheff
(RESENDING to include f2fs, fs-devel and dm-devel)

Add support for ZBC ioctl's
BLKREPORT - Issue Report Zones to device.
BLKZONEACTION - Issue a Zone Action (Close, Finish, Open, or Reset)

Signed-off-by: Shaun Tancheff 
---
v8:
 - Changed ioctl for zone actions to a single ioctl that takes 
   a structure including the zone, zone action, all flag, and force option
 - Mapped REQ_META flag to 'force unit access' for zone operations
v6:
 - Added GFP_DMA to gfp mask.
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio op's

 block/ioctl.c | 149 ++
 include/uapi/linux/blkzoned_api.h |  30 +++-
 include/uapi/linux/fs.h   |   1 +
 3 files changed, 179 insertions(+), 1 deletion(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index ed2397f..d760523 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -194,6 +194,151 @@ int blkdev_reread_part(struct block_device *bdev)
 }
 EXPORT_SYMBOL(blkdev_reread_part);
 
+static int blk_zoned_report_ioctl(struct block_device *bdev, fmode_t mode,
+ void __user *parg)
+{
+   int error = -EFAULT;
+   gfp_t gfp = GFP_KERNEL | GFP_DMA;
+   void *iopg = NULL;
+   struct bdev_zone_report_io *bzrpt = NULL;
+   int order = 0;
+   struct page *pgs = NULL;
+   u32 alloc_size = PAGE_SIZE;
+   unsigned int op_flags = 0;
+   u8 opt = 0;
+
+   if (!(mode & FMODE_READ))
+   return -EBADF;
+
+   iopg = (void *)get_zeroed_page(gfp);
+   if (!iopg) {
+   error = -ENOMEM;
+   goto report_zones_out;
+   }
+   bzrpt = iopg;
+   if (copy_from_user(bzrpt, parg, sizeof(*bzrpt))) {
+   error = -EFAULT;
+   goto report_zones_out;
+   }
+   if (bzrpt->data.in.return_page_count > alloc_size) {
+   int npages;
+
+   alloc_size = bzrpt->data.in.return_page_count;
+   npages = (alloc_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+   pgs = alloc_pages(gfp, ilog2(npages));
+   if (pgs) {
+   void *mem = page_address(pgs);
+
+   if (!mem) {
+   error = -ENOMEM;
+   goto report_zones_out;
+   }
+   order = ilog2(npages);
+   memset(mem, 0, alloc_size);
+   memcpy(mem, bzrpt, sizeof(*bzrpt));
+   bzrpt = mem;
+   } else {
+   /* Result requires DMA capable memory */
+   pr_err("Not enough memory available for request.\n");
+   error = -ENOMEM;
+   goto report_zones_out;
+   }
+   } else {
+   alloc_size = bzrpt->data.in.return_page_count;
+   }
+   if (bzrpt->data.in.force_unit_access)
+   op_flags |= REQ_META;
+   opt = bzrpt->data.in.report_option;
+   error = blkdev_issue_zone_report(bdev, op_flags,
+   bzrpt->data.in.zone_locator_lba, opt,
+   pgs ? pgs : virt_to_page(iopg),
+   alloc_size, GFP_KERNEL);
+   if (error)
+   goto report_zones_out;
+
+   if (pgs) {
+   void *src = bzrpt;
+   u32 off = 0;
+
+   /*
+* When moving a multi-order page with GFP_DMA
+* the copy to user can trap ""
+* so instead we copy out 1 page at a time.
+*/
+   while (off < alloc_size && !error) {
+   u32 len = min_t(u32, PAGE_SIZE, alloc_size - off);
+
+   memcpy(iopg, src + off, len);
+   if (copy_to_user(parg + off, iopg, len))
+   error = -EFAULT;
+   off += len;
+   }
+   } else {
+   if (copy_to_user(parg, iopg, alloc_size))
+   error = -EFAULT;
+   }
+
+report_zones_out:
+   if (pgs)
+   __free_pages(pgs, order);
+   if (iopg)
+   free_page((unsigned long)iopg);
+   return error;
+}
+
+static int blk_zoned_action_ioctl(struct block_device *bdev, fmode_t mode,
+ void __user *parg)
+{
+   unsigned int op = 0;
+   unsigned int op_flags = 0;
+   sector_t lba;
+   struct bdev_zone_action za;
+
+   if (!(mode & FMODE_WRITE))
+   return -EBADF;
+
+   /* When acting on zones we explicitly disallow using a partition. */
+   if (bdev != bdev->bd_contains) {
+   pr_err("%s: All zone operations disallowed on this device\n",
+   __func__);
+   return -EFAULT;
+   }
+
+   if (copy_from_user(

[PATCH v8 1/2 RESEND] Add bio/request flags to issue ZBC/ZAC commands

2016-08-24 Thread Shaun Tancheff
(RESENDING to include f2fs, fs-devel and dm-devel)

Add op flags to access to zone information as well as open, close
and reset zones:
  - REQ_OP_ZONE_REPORT - Query zone information (Report zones)
  - REQ_OP_ZONE_OPEN - Explicitly open a zone for writing
  - REQ_OP_ZONE_CLOSE - Explicitly close a zone
  - REQ_OP_ZONE_FINISH - Explicitly finish a zone
  - REQ_OP_ZONE_RESET - Reset Write Pointer to start of zone

These op flags can be used to create bio's to control zoned devices
through the block layer.

This is useful for file systems and device mappers that need explicit
control of zoned devices such as Host Managed and Host Aware SMR drives,

Report zones is a device read that requires a buffer.

Open, Close, Finish and Reset are device commands that have no
associated data transfer.
  Open -   Open is a zone for writing.
  Close -  Disallow writing to a zone.
  Finish - Disallow writing a zone and set the WP to the end
   of the zone.
  Reset -  Discard data in a zone and reset the WP to the start
   of the zone.

Sending an LBA of ~0 will attempt to operate on all zones.
This is typically used with Reset to wipe a drive as a Reset
behaves similar to TRIM in that all data in the zone(s) is deleted.

Report zones currently defaults to reporting on all zones. It expected
that support for the zone option flag will piggy back on streamid
support. The report option flag is useful as it can reduce the number
of zones in each report, but not critical.

Signed-off-by: Shaun Tancheff 
---
v8:
 - Added Finish Zone op
 - Fixed report zones copy to user to work when HARDENED_USERCOPY is enabled
v6:
 - Added GFP_DMA to gfp mask.
v5:
 - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent
 - In blk-lib fix documentation
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio op's
V3:
 - Rebase on Mike Cristie's separate bio operations
 - Update blkzoned_api.h to include report zones PARTIAL bit.
V2:
 - Changed bi_rw to op_flags clarify sepeartion of bio op from flags.
 - Fixed memory leak in blkdev_issue_zone_report failing to put_bio().
 - Documented opt in blkdev_issue_zone_report.
 - Removed include/uapi/linux/fs.h from this patch.

 MAINTAINERS   |   9 ++
 block/blk-lib.c   |  94 
 drivers/scsi/sd.c | 121 +
 drivers/scsi/sd.h |   1 +
 include/linux/bio.h   |   8 +-
 include/linux/blk_types.h |   7 +-
 include/linux/blkdev.h|   1 +
 include/linux/blkzoned_api.h  |  25 ++
 include/uapi/linux/Kbuild |   1 +
 include/uapi/linux/blkzoned_api.h | 182 ++
 10 files changed, 447 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/blkzoned_api.h
 create mode 100644 include/uapi/linux/blkzoned_api.h

diff --git a/MAINTAINERS b/MAINTAINERS
index a306795..aedf311 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12984,6 +12984,15 @@ F: Documentation/networking/z8530drv.txt
 F: drivers/net/hamradio/*scc.c
 F: drivers/net/hamradio/z8530.h
 
+ZBC AND ZBC BLOCK DEVICES
+M: Shaun Tancheff 
+W: http://seagate.com
+W: https://github.com/Seagate/ZDM-Device-Mapper
+L: linux-bl...@vger.kernel.org
+S: Maintained
+F: include/linux/blkzoned_api.h
+F: include/uapi/linux/blkzoned_api.h
+
 ZBUD COMPRESSED PAGE ALLOCATOR
 M: Seth Jennings 
 L: linux...@kvack.org
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 083e56f..e92bd56 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -266,3 +266,97 @@ int blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
+
+/**
+ * blkdev_issue_zone_report - queue a report zones operation
+ * @bdev:  target blockdev
+ * @op_flags:  extra bio rw flags. If unsure, use 0.
+ * @sector:starting sector (report will include this sector).
+ * @opt:   See: zone_report_option, default is 0 (all zones).
+ * @page:  one or more contiguous pages.
+ * @pgsz:  up to size of page in bytes, size of report.
+ * @gfp_mask:  memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *Issue a zone report request for the sectors in question.
+ */
+int blkdev_issue_zone_report(struct block_device *bdev, unsigned int op_flags,
+sector_t sector, u8 opt, struct page *page,
+size_t pgsz, gfp_t gfp_mask)
+{
+   struct bdev_zone_report *conv = page_address(page);
+   struct bio *bio;
+   unsigned int nr_iovecs = 1;
+   int ret = 0;
+
+   if (pgsz < (sizeof(struct bdev_zone_report) +
+   sizeof(struct bdev_zone_descriptor)))
+   return -EINVAL;
+
+   bio = bio_alloc(gfp_mask, nr_iovecs);
+   if (!bio)
+   return -ENOMEM;
+
+   conv->

[PATCH v8 0/2 RESEND] Block layer support ZAC/ZBC commands

2016-08-24 Thread Shaun Tancheff
(RESENDING to include f2fs, fs-devel and dm-devel)

Hi Jens,

This series is based on linus' v4.8-rc2 branch.

As Host Aware drives are becoming available we would like to be able
to make use of such drives. This series is also intended to be
suitable for use by Host Managed drives.

ZBC [and ZAC] drives add new commands for discovering and working
with Zones.

Part one of this series expands the bio/request reserved op size from
3 to 4 bits and then adds op codes for each of the ZBC commands:
   Report zones, close zone, finish zone, open zone and reset zone.

Part two of this series deals with integrating these new bio/request
op's with Hannes' zone cache.

This extends the ZBC support up to the block layer allowing direct
control by file systems or device mapper targets. Also by deferring
the zone handling to the authoritative subsystem there is an overall
lower memory usage for holding the active zone information as well
as clarifying responsible party for maintaining the write pointer
for each active zone.

By way of example a DM target may have several writes in progress. To sector
(or lba) for those writes will each depend on the previous write. While the
drive's write pointer will be updated as writes are completed the DM target
will be maintaining both where the next write should be scheduled from and
where the write pointer is based on writes completed w/o errors.

Knowing the drive zone topology enables DM targets and file systems to
extend their block allocation schemes and issue write pointer resets (or
discards) that are zone aligned.

A perhaps non-obvious approach is that a conventional drive will
returns a zone report descriptor with a single large conventional zone.
This is intended to allow a collection of zoned and non-zoned media to
be stitched together to provide a file system with a zoned device with
conventional space mapped to where it is useful.

Patches for util-linux can be found here:
g...@github.com:stancheff/util-linux.git v2.28.1+biof

https://github.com/stancheff/util-linux/tree/v2.28.1%2Bbiof

This patch is available here:
https://github.com/stancheff/linux/tree/v4.8-rc2%2Bbiof.v8

g...@github.com:stancheff/linux.git v4.8-rc2+biof.v8

v8:
 - Changed zone report to default to reading from zone cache.
 - Changed ioctl for zone commands to support forcing a query or command
   to be sent to media.
 - Fixed report zones copy to user to work when HARDENED_USERCOPY is enabled
v7:
 - Initial support for Hannes' zone cache.
v6:
 - Fix page alloc to include DMA flag for ioctl.
v5:
 - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent
 - In blk-lib fix documentation
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio op's
 - Dropped ata16 hackery
V3:
 - Rebase on Mike Cristie's separate bio operations
 - Update blkzoned_api.h to include report zones PARTIAL bit.
 - Use zoned report reserved bit for ata-passthrough flag.

V2:
 - Changed bi_rw to op_flags clarify sepeartion of bio op from flags.
 - Fixed memory leak in blkdev_issue_zone_report failing to put_bio().
 - Documented opt in blkdev_issue_zone_report.
 - Moved include/uapi/linux/fs.h changes to patch 3
 - Fixed commit message for first patch in series.


Shaun Tancheff (2):
  Add bio/request flags to issue ZBC/ZAC commands
  Add ioctl to issue ZBC/ZAC commands via block layer

 MAINTAINERS   |   9 ++
 block/blk-lib.c   |  94 +
 block/ioctl.c | 149 +++
 drivers/scsi/sd.c | 121 ++
 drivers/scsi/sd.h |   1 +
 include/linux/bio.h   |   8 +-
 include/linux/blk_types.h |   7 +-
 include/linux/blkdev.h|   1 +
 include/linux/blkzoned_api.h  |  25 +
 include/uapi/linux/Kbuild |   1 +
 include/uapi/linux/blkzoned_api.h | 210 ++
 include/uapi/linux/fs.h   |   1 +
 12 files changed, 625 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/blkzoned_api.h
 create mode 100644 include/uapi/linux/blkzoned_api.h

-- 
2.9.3



[PATCH v6 3/4 RESEND] SCT Write Same / DSM Trim

2016-08-24 Thread Shaun Tancheff
Correct handling of devices with sector_size other that 512 bytes.

In the case of a 4Kn device sector_size it is possible to describe a much
larger DSM Trim than the current fixed default of 512 bytes.

This patch assumes the minimum descriptor is sector_size and fills out
the descriptor accordingly.

The ACS-2 specification is quite clear that the DSM command payload is
sized as number of 512 byte transfers so a 4Kn device will operate
correctly without this patch.

Signed-off-by: Shaun Tancheff 
---
v5:
 - Reshuffled descripton.
 - Added support for a sector_size descriptor other than 512 bytes.

 drivers/ata/libata-scsi.c | 85 +++
 1 file changed, 57 insertions(+), 28 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index ebf1a04..37f456e 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3283,7 +3283,7 @@ static unsigned int ata_scsi_pass_thru(struct 
ata_queued_cmd *qc)
 /**
  * ata_format_dsm_trim_descr() - SATL Write Same to DSM Trim
  * @cmd: SCSI command being translated
- * @num: Maximum number of entries (nominally 64).
+ * @trmax: Maximum number of entries that will fit in sector_size bytes.
  * @sector: Starting sector
  * @count: Total Range of request in logical sectors
  *
@@ -3298,63 +3298,80 @@ static unsigned int ata_scsi_pass_thru(struct 
ata_queued_cmd *qc)
  *  LBA's should be sorted order and not overlap.
  *
  * NOTE: this is the same format as ADD LBA(S) TO NV CACHE PINNED SET
+ *
+ * Return: Number of bytes copied into sglist.
  */
-static unsigned int ata_format_dsm_trim_descr(struct scsi_cmnd *cmd, u32 num,
- u64 sector, u32 count)
+static size_t ata_format_dsm_trim_descr(struct scsi_cmnd *cmd, u32 trmax,
+   u64 sector, u32 count)
 {
-   __le64 *buffer;
-   u32 i = 0, used_bytes;
+   struct scsi_device *sdp = cmd->device;
+   size_t len = sdp->sector_size;
+   size_t r;
+   __le64 *buf;
+   u32 i = 0;
unsigned long flags;
 
-   BUILD_BUG_ON(512 > ATA_SCSI_RBUF_SIZE);
+   WARN_ON(len > ATA_SCSI_RBUF_SIZE);
+
+   if (len > ATA_SCSI_RBUF_SIZE)
+   len = ATA_SCSI_RBUF_SIZE;
 
spin_lock_irqsave(&ata_scsi_rbuf_lock, flags);
-   buffer = ((void *)ata_scsi_rbuf);
-   while (i < num) {
+   buf = ((void *)ata_scsi_rbuf);
+   memset(buf, 0, len);
+   while (i < trmax) {
u64 entry = sector |
((u64)(count > 0x ? 0x : count) << 48);
-   buffer[i++] = __cpu_to_le64(entry);
+   buf[i++] = __cpu_to_le64(entry);
if (count <= 0x)
break;
count -= 0x;
sector += 0x;
}
-
-   used_bytes = ALIGN(i * 8, 512);
-   memset(buffer + i, 0, used_bytes - i * 8);
-   sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), buffer, 512);
+   r = sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), buf, len);
spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags);
 
-   return used_bytes;
+   return r;
 }
 
 /**
  * ata_format_dsm_trim_descr() - SATL Write Same to ATA SCT Write Same
  * @cmd: SCSI command being translated
  * @lba: Starting sector
- * @num: Number of logical sectors to be zero'd.
+ * @num: Number of sectors to be zero'd.
  *
- * Rewrite the WRITE SAME descriptor to be an SCT Write Same formatted
+ * Rewrite the WRITE SAME payload to be an SCT Write Same formatted
  * descriptor.
  * NOTE: Writes a pattern (0's) in the foreground.
- *   Large write-same requents can timeout.
+ *
+ * Return: Number of bytes copied into sglist.
  */
-static void ata_format_sct_write_same(struct scsi_cmnd *cmd, u64 lba, u64 num)
+static size_t ata_format_sct_write_same(struct scsi_cmnd *cmd, u64 lba, u64 
num)
 {
-   u16 *sctpg;
+   struct scsi_device *sdp = cmd->device;
+   size_t len = sdp->sector_size;
+   size_t r;
+   u16 *buf;
unsigned long flags;
 
spin_lock_irqsave(&ata_scsi_rbuf_lock, flags);
-   sctpg = ((void *)ata_scsi_rbuf);
+   buf = ((void *)ata_scsi_rbuf);
+
+   put_unaligned_le16(0x0002,  &buf[0]); /* SCT_ACT_WRITE_SAME */
+   put_unaligned_le16(0x0101,  &buf[1]); /* WRITE PTRN FG */
+   put_unaligned_le64(lba, &buf[2]);
+   put_unaligned_le64(num, &buf[6]);
+   put_unaligned_le32(0u,  &buf[10]); /* pattern */
+
+   WARN_ON(len > ATA_SCSI_RBUF_SIZE);
 
-   put_unaligned_le16(0x0002,  &sctpg[0]); /* SCT_ACT_WRITE_SAME */
-   put_unaligned_le16(0x0101,  &sctpg[1]); /* WRITE PTRN FG */
-   put_unaligned_le64(lba, &sctpg[2]);
-   put_unaligned_le64(num, &sctpg[6]);
-   put_unaligned_le32(0u,  &sctpg[10]);
+   if 

Re: [PATCH v2 2/4] On Discard either do Reset WP or Write Same

2016-08-23 Thread Shaun Tancheff
On Mon, Aug 22, 2016 at 8:25 PM, Damien Le Moal  wrote:
>
> Shaun,
>
> On 8/23/16 09:22, Shaun Tancheff wrote:
>> On Mon, Aug 22, 2016 at 6:57 PM, Damien Le Moal  
>> wrote:

>> Also you may note that in my patch to get Host Aware working
>> with the zone cache I do not include the runt zone in the cache.
>
> Why not ? The RB-tree will handle it just fine (the insert and lookup
> code as Hannes had them was not relying on a constant zone size).

A good point. I didn't pay too much attention while brining this
forward. I think a few of my hacks may be pointless now. I'll
try to rework it and get rid of the runt check.

>> So as it sits I need this fallback otherwise doing blkdiscard over
>> the whole device ends in a error, as well as mkfs.f2fs et. al.
>
> Got it, but I do not see a problem with including it. I have not checked
> the code, but the split of a big discard call into "chunks" should be
> already handling the last chunk and make sure that the operation does
> not exceed the device capacity (in any case, that's easy to fix in the
> sd_zbc_setup_discard code).

Yes I agree the split of big discards does handle the last chunk correctly.

>>> Some 10TB host managed disks out there have 1% conventional zone space,
>>> that is 100GB of capacity. When issuing a "reset all", doing a write
>>> same in these zones will take forever... If the user really wants zeroes
>>> in those zones, let it issue a zeroout.
>>>
>>> I think that it would a better choice to simply not report
>>> discard_zeroes_data as true and do nothing for conventional zones reset.
>>
>> I think that would be unfortunate for Host Managed but I think it's
>> the right choice for Host Aware at this time. So either we base
>> it on disk type or we have some other config flag added to sysfs.
>
> I do not see any difference between host managed and host aware. Both
> define the same behavior for reset, and both end up in a NOP for
> conventional zone reset (no data "erasure" required by the standard).
> For write pointer zones, reading unwritten LBAs returns the
> initialization pattern, with the exception of host-managed disks with
> the URSWRZ bit set to 0. But that case is covered in sd.c, so the
> behavior is consistent across all models. So why forcing data zeroing
> when the standards do not mandate it ?

Well you do have point.
It appears to be only mkfs and similar tools that are really utilizing
discard zeros data at the moment.

I did a quick test:

mkfs -t ext4 -b 4096 -g 32768 -G 32  \
 -E 
lazy_itable_init=0,lazy_journal_init=0,offset=0,num_backup_sb=0,packed_meta_blocks=1,discard
  \
 -O flex_bg,extent,sparse_super2

   - discard zeroes data true - 3 minutess
   - discard zeroes data false - 6 minutes
So for the smaller conventional space on the current HA drive
there is some advantage to enabling discard zeroes data.

However for a larger conventional space you are correct the overall
impact is worse performance.

For some reason I had been assuming that some file systems
used or relied on discard zeroes data during normal operation.
Now that I am looking for that I don't seem to be finding any
evidence of it, so aside from mkfs I don't have as good an
argument discard zeroes data as I though I did.

Regards,
Shaun


Re: [PATCH v6 2/4] Add support for SCT Write Same

2016-08-23 Thread Shaun Tancheff
On Tue, Aug 23, 2016 at 5:37 AM, Tom Yan  wrote:
> On 22 August 2016 at 04:23, Shaun Tancheff  wrote:
>> +/**
>> + * ata_format_dsm_trim_descr() - SATL Write Same to ATA SCT Write Same
>> + * @cmd: SCSI command being translated
>> + * @lba: Starting sector
>> + * @num: Number of logical sectors to be zero'd.
>> + *
>> + * Rewrite the WRITE SAME descriptor to be an SCT Write Same formatted
>> + * descriptor.
>> + * NOTE: Writes a pattern (0's) in the foreground.
>> + *   Large write-same requents can timeout.
>> + */
>> +static void ata_format_sct_write_same(struct scsi_cmnd *cmd, u64 lba, u64 
>> num)
>> +{
>> +   u16 *sctpg;
>> +   unsigned long flags;
>> +
>> +   spin_lock_irqsave(&ata_scsi_rbuf_lock, flags);
>> +   sctpg = ((void *)ata_scsi_rbuf);
>
> Because ata_scsi_rbuf is of a fixed size of ATA_SCSI_RBUF_SIZE.
>
> #define ATA_SCSI_RBUF_SIZE  4096
> ...
> static u8 ata_scsi_rbuf[ATA_SCSI_RBUF_SIZE];
>
>> +
>> +   put_unaligned_le16(0x0002,  &sctpg[0]); /* SCT_ACT_WRITE_SAME */
>> +   put_unaligned_le16(0x0101,  &sctpg[1]); /* WRITE PTRN FG */
>> +   put_unaligned_le64(lba, &sctpg[2]);
>> +   put_unaligned_le64(num, &sctpg[6]);
>> +   put_unaligned_le32(0u,  &sctpg[10]);
>> +
>> +   sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), sctpg, 
>> 512);
>
> You have no reason to use 512 here instead of ATA_SCSI_RBUF_SIZE this time.

Ah .. because SCT Write Same is a fixed 512 byte transfer?
Ah .. because I only have 512 bytes to copy?

>> +   spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags);
>> +}


Re: [PATCH v6 1/4] libata: Safely overwrite attached page in WRITE SAME xlat

2016-08-23 Thread Shaun Tancheff
On Tue, Aug 23, 2016 at 2:53 AM, Tom Yan  wrote:
> On 23 August 2016 at 06:20, Shaun Tancheff  wrote:
>> On Tue, Aug 23, 2016 at 12:26 AM, Tom Yan  wrote:
>>
>>> It is always 1 merely because we prefer sticking to that. Say we want
>>> to enable multi-block payload now, it won't be 1 anymore.
>>
>> Sorry, I though that DSM TRIM is using 512 bytes here because
>> WRITE_SAME_16 has a payload of a single logical sector.
>
> Nope, SCSI Write Same commands does not have payload (or in SCSI
> terms, parameter list / data-out buffer).

Hmm. Not sure about T10, but that's not how I read
sd_setup_write_same_cmnd(), it always setups up a transfer of
sector_size for scsi_init_io().

As I understand things, this is how the cmd's sglist is populated and why
this should be using sg_copy_from_buffer() rather than open coding
to the page.
-- 
Shaun Tancheff


Re: [PATCH v6 1/4] libata: Safely overwrite attached page in WRITE SAME xlat

2016-08-22 Thread Shaun Tancheff
On Tue, Aug 23, 2016 at 12:26 AM, Tom Yan  wrote:

> It is always 1 merely because we prefer sticking to that. Say we want
> to enable multi-block payload now, it won't be 1 anymore.

Sorry, I though that DSM TRIM is using 512 bytes here because
WRITE_SAME_16 has a payload of a single logical sector.

> Also note that the payload is NOT always fully-filled.

Oh, I was considering filling the remainder of the buffer
with 0's using memset() as you described to be "filly-filled"
in this case. Sorry for the confusion.

-- 
Shaun Tancheff


Re: [PATCH v6 2/4] Add support for SCT Write Same

2016-08-22 Thread Shaun Tancheff
On Tue, Aug 23, 2016 at 12:55 AM, Tom Yan  wrote:
> I am really not going to tell the whole story here again. We have a
> really long discussion on whether we should advertise Maximum Write
> Same Length for SCT Write Same, and the value we should advertise.
> (Didn't we come to an conclusion on that as well?)

I had thought the conclusion was leave b0 as is and let the WS10 limit
take effect per Martin.
-- 
Shaun Tancheff


Re: [PATCH v6 1/4] libata: Safely overwrite attached page in WRITE SAME xlat

2016-08-22 Thread Shaun Tancheff
On Mon, Aug 22, 2016 at 6:49 PM, Tom Yan  wrote:
> The only 512 I can see in the old code is the one in:
>
>> -   used_bytes = ALIGN(i * 8, 512);
>
> where the alignment is necessary because 'used_bytes' will be returned
> as 'size', which will be used for setting the number of 512-byte block
> payload when we construct the ATA taskfile / command. It may NOT be a
> good idea to replace it with ATA_SECT_SIZE. Some comment could be
> nice.

Not sure I agree with that analysis. Could just as well assign it directly,
as ALIGN() is just superfluous here.
Later the count is always 512/512 -> 1. Always.

"i" is used here to limit the number of bytes that need to be memset()
after filling to payload. Personally I think memset is fast enough that
it is better to do before rather than later but I figure if the code works
let it be.

> So I don't think it makes any sense to check ATA_SCSI_RBUF_SIZE
> against 512 here.

Again, not sure I agree, but I don't really care on that point. Just many
years of defensive coding.

> On 22 August 2016 at 22:00, Shaun Tancheff  wrote:
>> Because this is re-using the response buffer in a new way.
>> Such reuse could be a surprise to someone refactoring that
>> code, although it's pretty unlikely. The build bug on
>> gives some level of confidence that it won't go unnoticed.
>> It also codifies the assumption that we can write 512 bytes
>> to the global ata_scsi_rbuf buffer.
>>
>> As to using a literal 512, I was just following what was here
>> before.
>>
>> It looks like there is a ATA_SECT_SIZE that can replace most
>> of the literal 512's in here though. That might be a nice cleanup
>> overall. Not sure it belongs here though.
>>
>
>>>> +   used_bytes = ALIGN(i * 8, 512);
>>>> +   memset(buffer + i, 0, used_bytes - i * 8);
>>>> +   sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), buffer, 
>>>> 512);
>
> And I don't think you have a reason to use 512 here either. It appears
> to me that you should use ATA_SCSI_RBUF_SIZE instead (see
> ata_scsi_rbuf_put in libata-scsi). If not, it should probably be a
> _derived_ value (e.g. from `i`) that tells the _actual_ size of
> `buffer`.

Nope. We *must* copy the whole 512 byte payload into the sglist.
Otherwise what was the point of the memset to clear out an cruft?
Failing to move the whole payload into place could leave whatever
garbage is in the buffer to be interpreted as an actual trim and do
real damage. I certainly can't use ATA_SCSI_RBUF_SIZE because
the payload attached to the cmd need only be 512 bytes. Trying
to copy in 4k is going to cause bad things when you check
the return from sg_copy_from_buffer() and notice you failed to
copy in you payload.

> Again, note that even when we prefer to stick with _one_ 512-byte
> block TRIM payload, we should probably NEVER _hard code_ such limit
> (for it's really ugly and unnecessary) in functions like this. All we

The interface requires well formed 512 byte chunks so we have to
at least have to do enough to ensure that we work in multiples of
512. Since 512 is all over the spec for this type of thing I think
it would be reasonable to have a define or enum if you don't think
reusing ATA_SECT_SIZE is good maybe something
like ATA_CMD_XFER_SIZE ?

> need to do is to advertise the limit (or lower) as Maximum Write
> Length, and make sure our SATL works as per SBC that it will reject
> SCSI Write Same commands that is "larger" than that.
>>>> +   spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags);
>>>> +
>>>> +   return used_bytes;
>>>> +}


-- 
Shaun Tancheff


Re: [PATCH v6 2/4] Add support for SCT Write Same

2016-08-22 Thread Shaun Tancheff
On Mon, Aug 22, 2016 at 6:09 PM, Tom Yan  wrote:
> I am not sure about what you mean here. Rejecting SCSI Write Same
> commands that has its "number of blocks" field set to a value higher
> than the device's reported Maximum Write Same Length is only natural
> and mandated by SBC. We have no reason (even if it is practically not
> a must) not to do it while we are implementing a SCSI-ATA Translation
> Layer here as long as we advertise Maximum Write Same Length. It does
> not matter here whether the command ends up being translated to SCT
> Write Same or TRIM.
>
> How high or how lower the limit should be advertised has nothing to do
> with the checking.
>
> FWIW, letting the SCSI/block layer fall back with SD_MAX_WS10_BLOCKS
> does NOT count as advertising Maximum Write Same Length, that's why we
> may or may not (in terms of SBC) check n_block against it if we are
> really gonna leave ata_scsiop_inq_b0 in libata-scsi untouched.

Sorry I'm still a bit confused.

SCT Write Same does not have a limit ... it's a u64 of logical sectors.
Any limit specified is smaller based on other parts of the stack.
The SATL code being used to emulating SCSI Write Same which does
have a limited number of sectors .. so falling back to the SCSI limit
seems reasonable.

So the limit that is being applied is either the current TRIM limit,
or the SCSI Write Same limit.
-- 
Shaun Tancheff


Re: [PATCH v2 2/4] On Discard either do Reset WP or Write Same

2016-08-22 Thread Shaun Tancheff
On Mon, Aug 22, 2016 at 6:57 PM, Damien Le Moal  wrote:
>
> Shaun,
>
> On 8/22/16 13:31, Shaun Tancheff wrote:
> [...]
>> -int sd_zbc_setup_discard(struct scsi_disk *sdkp, struct request *rq,
>> -  sector_t sector, unsigned int num_sectors)
>> +int sd_zbc_setup_discard(struct scsi_cmnd *cmd)
>>  {
>> - struct blk_zone *zone;
>> + struct request *rq = cmd->request;
>> + struct scsi_device *sdp = cmd->device;
>> + struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
>> + sector_t sector = blk_rq_pos(rq);
>> + unsigned int nr_sectors = blk_rq_sectors(rq);
>>   int ret = BLKPREP_OK;
>> + struct blk_zone *zone;
>>   unsigned long flags;
>> + u32 wp_offset;
>> + bool use_write_same = false;
>>
>>   zone = blk_lookup_zone(rq->q, sector);
>> - if (!zone)
>> + if (!zone) {
>> + /* Test for a runt zone before giving up */
>> + if (sdp->type != TYPE_ZBC) {
>> + struct request_queue *q = rq->q;
>> + struct rb_node *node;
>> +
>> + node = rb_last(&q->zones);
>> + if (node)
>> + zone = rb_entry(node, struct blk_zone, node);
>> + if (zone) {
>> + spin_lock_irqsave(&zone->lock, flags);
>> + if ((zone->start + zone->len) <= sector)
>> + goto out;
>> + spin_unlock_irqrestore(&zone->lock, flags);
>> + zone = NULL;
>> + }
>> + }
>>   return BLKPREP_KILL;
>> + }
>
> I do not understand the point of this code here to test for the runt
> zone. As long as sector is within the device maximum capacity (in 512B
> unit), blk_lookup_zone will return the pointer to the zone structure
> containing that sector (the RB-tree does not have any constraint
> regarding zone size). The only case where NULL would be returned is if
> discard is issued super early right after the disk is probed and before
> the zone refresh work has completed. We can certainly protect against
> that by delaying the discard.

As you can see I am not including Host Managed in the
runt check.

Also you may note that in my patch to get Host Aware working
with the zone cache I do not include the runt zone in the cache.
So as it sits I need this fallback otherwise doing blkdiscard over
the whole device ends in a error, as well as mkfs.f2fs et. al.

>>   spin_lock_irqsave(&zone->lock, flags);
>> -
>>   if (zone->state == BLK_ZONE_UNKNOWN ||
>>   zone->state == BLK_ZONE_BUSY) {
>>   sd_zbc_debug_ratelimit(sdkp,
>> -"Discarding zone %zu state %x, 
>> deferring\n",
>> +"Discarding zone %zx state %x, 
>> deferring\n",
>
> Sector values are usually displayed in decimal. Why use Hex here ? At
> least "0x" would be needed to avoid confusion I think.

Yeah, my brain is lazy about converting very large
numbers to powers of 2. So it's much easier to spot
zone alignment here.



>>  zone->start, zone->state);
>>   ret = BLKPREP_DEFER;
>>   goto out;
>> @@ -406,46 +428,80 @@ int sd_zbc_setup_discard(struct scsi_disk *sdkp, 
>> struct request *rq,
>>   if (zone->state == BLK_ZONE_OFFLINE) {
>>   /* let the drive fail the command */
>>   sd_zbc_debug_ratelimit(sdkp,
>> -"Discarding offline zone %zu\n",
>> +"Discarding offline zone %zx\n",
>>  zone->start);
>>   goto out;
>>   }
>> -
>> - if (!blk_zone_is_smr(zone)) {
>> + if (blk_zone_is_cmr(zone)) {
>> + use_write_same = true;
>>   sd_zbc_debug_ratelimit(sdkp,
>> -"Discarding %s zone %zu\n",
>> -blk_zone_is_cmr(zone) ? "CMR" : 
>> "unknown",
>> +"Discarding CMR zone %zx\n",
>>  zone->start);
>> - ret = BLKPREP_DONE;
>>   goto out;
>>   }
>
> Some 10TB host managed

Re: [PATCH v6 2/4] Add support for SCT Write Same

2016-08-22 Thread Shaun Tancheff
On Mon, Aug 22, 2016 at 3:14 PM, Tom Yan  wrote:
> On 23 August 2016 at 03:43, Shaun Tancheff  wrote:
>>
>> Why would we enforce upper level limits on something that doesn't
>> have any?
>
> If we advertise a limit in our SATL, it makes sense that we should
> make sure the behaviour is consistent when we issue a write same
> through the block layer / ioctl and when we issue a SCSI Write Same
> command directly (e.g. with sg_write_same). IMHO that's pretty much
> why SBC would mandate such behaviour as well.

Breaking would be advertising a limit that is too high and failing.
Advertising a lower limit and succeeding may not be ideal for all
possible use cases, but it's not breaking behaviour.

>>
>> If the upper level, or SG_IO, chooses to set a timeout of 10 hours and
>> wipe a whole disk it should be free to do so.
>>
>
> That's why I said, "if you are going to advertise an Maximum Write Same 
> Length".

-- 
Shaun Tancheff


Re: [PATCH v6 1/4] libata: Safely overwrite attached page in WRITE SAME xlat

2016-08-22 Thread Shaun Tancheff
On Mon, Aug 22, 2016 at 4:20 PM, Tom Yan  wrote:
> On 22 August 2016 at 04:23, Shaun Tancheff  wrote:
>> Safely overwriting the attached page to ATA format from the SCSI formatted
>> variant.
>>
>> Signed-off-by: Shaun Tancheff 
>> ---
>> v6:
>>  - Fix bisect bug reported by Tom Yan 
>>  - Change to use sg_copy_from_buffer as per Christoph Hellwig 
>> v5:
>>  - Added prep patch to work with non-page aligned scatterlist pages
>>and use kmap_atomic() to lock page during modification.
>>
>>  drivers/ata/libata-scsi.c | 56 
>> ++-
>>  include/linux/ata.h   | 26 --
>>  2 files changed, 51 insertions(+), 31 deletions(-)
>>
>> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
>> index e207b33..7990cb2 100644
>> --- a/drivers/ata/libata-scsi.c
>> +++ b/drivers/ata/libata-scsi.c
>> @@ -3282,6 +3282,54 @@ static unsigned int ata_scsi_pass_thru(struct 
>> ata_queued_cmd *qc)
>> return 1;
>>  }
>>
>> +/**
>> + * ata_format_dsm_trim_descr() - SATL Write Same to DSM Trim
>> + * @cmd: SCSI command being translated
>> + * @num: Maximum number of entries (nominally 64).
>> + * @sector: Starting sector
>> + * @count: Total Range of request
>> + *
>> + * Rewrite the WRITE SAME descriptor to be a DSM TRIM little-endian 
>> formatted
>> + * descriptor.
>> + *
>> + * Upto 64 entries of the format:
>> + *   63:48 Range Length
>> + *   47:0  LBA
>> + *
>> + *  Range Length of 0 is ignored.
>> + *  LBA's should be sorted order and not overlap.
>> + *
>> + * NOTE: this is the same format as ADD LBA(S) TO NV CACHE PINNED SET
>> + */
>> +static unsigned int ata_format_dsm_trim_descr(struct scsi_cmnd *cmd, u32 
>> num,
>> + u64 sector, u32 count)
>> +{
>> +   __le64 *buffer;
>> +   u32 i = 0, used_bytes;
>> +   unsigned long flags;
>> +
>> +   BUILD_BUG_ON(512 > ATA_SCSI_RBUF_SIZE);
>
> Why do we want to check "512" (a literal number) against
> ATA_SCSI_RBUF_SIZE here?

Because this is re-using the response buffer in a new way.
Such reuse could be a surprise to someone refactoring that
code, although it's pretty unlikely. The build bug on
gives some level of confidence that it won't go unnoticed.
It also codifies the assumption that we can write 512 bytes
to the global ata_scsi_rbuf buffer.

As to using a literal 512, I was just following what was here
before.

It looks like there is a ATA_SECT_SIZE that can replace most
of the literal 512's in here though. That might be a nice cleanup
overall. Not sure it belongs here though.

>> +
>> +   spin_lock_irqsave(&ata_scsi_rbuf_lock, flags);
>> +   buffer = ((void *)ata_scsi_rbuf);
>> +   while (i < num) {
>> +   u64 entry = sector |
>> +   ((u64)(count > 0x ? 0x : count) << 48);
>> +   buffer[i++] = __cpu_to_le64(entry);
>> +   if (count <= 0x)
>> +   break;
>> +   count -= 0x;
>> +   sector += 0x;
>> +   }
>> +
>> +   used_bytes = ALIGN(i * 8, 512);
>> +   memset(buffer + i, 0, used_bytes - i * 8);
>> +   sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), buffer, 
>> 512);
>> +   spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags);
>> +
>> +   return used_bytes;
>> +}
>> +
>>  static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
>>  {
>> struct ata_taskfile *tf = &qc->tf;
>> @@ -3290,8 +3338,8 @@ static unsigned int ata_scsi_write_same_xlat(struct 
>> ata_queued_cmd *qc)
>> const u8 *cdb = scmd->cmnd;
>> u64 block;
>> u32 n_block;
>> +   const u32 trmax = ATA_MAX_TRIM_RNUM;
>> u32 size;
>> -   void *buf;
>> u16 fp;
>> u8 bp = 0xff;
>>
>> @@ -3319,10 +3367,8 @@ static unsigned int ata_scsi_write_same_xlat(struct 
>> ata_queued_cmd *qc)
>> if (!scsi_sg_count(scmd))
>> goto invalid_param_len;
>>
>> -   buf = page_address(sg_page(scsi_sglist(scmd)));
>> -
>> -   if (n_block <= 65535 * ATA_MAX_TRIM_RNUM) {
>> -   size = ata_set_lba_range_entries(buf, ATA_MAX_TRIM_RNUM, 
>> block, n_block);
>> +   if (n_block <= 0x * trmax) {
>> +   size = ata_f

Re: [PATCH v6 1/4] libata: Safely overwrite attached page in WRITE SAME xlat

2016-08-22 Thread Shaun Tancheff
On Mon, Aug 22, 2016 at 2:27 PM, Tom Yan  wrote:
> On 22 August 2016 at 12:23, Shaun Tancheff  wrote:
>> Safely overwriting the attached page to ATA format from the SCSI formatted
>> variant.
>>
>> Signed-off-by: Shaun Tancheff 
>> ---
>> v6:
>>  - Fix bisect bug reported by Tom Yan 
>>  - Change to use sg_copy_from_buffer as per Christoph Hellwig 
>> v5:
>>  - Added prep patch to work with non-page aligned scatterlist pages
>>and use kmap_atomic() to lock page during modification.
>>
>>  drivers/ata/libata-scsi.c | 56 
>> ++-
>>  include/linux/ata.h   | 26 --
>>  2 files changed, 51 insertions(+), 31 deletions(-)
>>
>> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
>> index e207b33..7990cb2 100644
>> --- a/drivers/ata/libata-scsi.c
>> +++ b/drivers/ata/libata-scsi.c
>> @@ -3282,6 +3282,54 @@ static unsigned int ata_scsi_pass_thru(struct 
>> ata_queued_cmd *qc)
>> return 1;
>>  }
>>
>> +/**
>> + * ata_format_dsm_trim_descr() - SATL Write Same to DSM Trim
>> + * @cmd: SCSI command being translated
>> + * @num: Maximum number of entries (nominally 64).
>> + * @sector: Starting sector
>> + * @count: Total Range of request
>> + *
>> + * Rewrite the WRITE SAME descriptor to be a DSM TRIM little-endian 
>> formatted
>> + * descriptor.
>> + *
>> + * Upto 64 entries of the format:
>> + *   63:48 Range Length
>> + *   47:0  LBA
>> + *
>> + *  Range Length of 0 is ignored.
>> + *  LBA's should be sorted order and not overlap.
>> + *
>> + * NOTE: this is the same format as ADD LBA(S) TO NV CACHE PINNED SET
>> + */
>> +static unsigned int ata_format_dsm_trim_descr(struct scsi_cmnd *cmd, u32 
>> num,
>> + u64 sector, u32 count)
>> +{
>> +   __le64 *buffer;
>> +   u32 i = 0, used_bytes;
>> +   unsigned long flags;
>> +
>> +   BUILD_BUG_ON(512 > ATA_SCSI_RBUF_SIZE);
>> +
>> +   spin_lock_irqsave(&ata_scsi_rbuf_lock, flags);
>> +   buffer = ((void *)ata_scsi_rbuf);
>> +   while (i < num) {
>> +   u64 entry = sector |
>> +   ((u64)(count > 0x ? 0x : count) << 48);
>> +   buffer[i++] = __cpu_to_le64(entry);
>> +   if (count <= 0x)
>> +   break;
>> +   count -= 0x;
>> +   sector += 0x;
>> +   }
>> +
>> +   used_bytes = ALIGN(i * 8, 512);
>> +   memset(buffer + i, 0, used_bytes - i * 8);
>> +   sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), buffer, 
>> 512);
>> +   spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags);
>> +
>> +   return used_bytes;
>> +}
>> +
>>  static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
>>  {
>> struct ata_taskfile *tf = &qc->tf;
>> @@ -3290,8 +3338,8 @@ static unsigned int ata_scsi_write_same_xlat(struct 
>> ata_queued_cmd *qc)
>> const u8 *cdb = scmd->cmnd;
>> u64 block;
>> u32 n_block;
>> +   const u32 trmax = ATA_MAX_TRIM_RNUM;
>
> This does not seem like a good idea.
>
>> u32 size;
>> -   void *buf;
>> u16 fp;
>> u8 bp = 0xff;
>>
>> @@ -3319,10 +3367,8 @@ static unsigned int ata_scsi_write_same_xlat(struct 
>> ata_queued_cmd *qc)
>> if (!scsi_sg_count(scmd))
>> goto invalid_param_len;
>>
>> -   buf = page_address(sg_page(scsi_sglist(scmd)));
>> -
>> -   if (n_block <= 65535 * ATA_MAX_TRIM_RNUM) {
>> -   size = ata_set_lba_range_entries(buf, ATA_MAX_TRIM_RNUM, 
>> block, n_block);
>> +   if (n_block <= 0x * trmax) {
>
> Note that the limit here would always be the advertised Maximum Write
> Same Length (ata_scsiop_inq_b0). It would be best for them to look the
> same. Besides, it doesn't seem necessary to create this trmax anyway.

It is entirely a style thing. I tend to prefer hex when describing an interface
that specifies number of bits. The trmax is just to keep the following line
under 80 chars.
I am not tied to either. I can change it if people really find it confusing.

>> +   size = ata_format_dsm_trim_descr(scmd, trmax, block, 
>> n_block);
>> } else {
>> fp = 2;
>> 

Re: [PATCH v6 2/4] Add support for SCT Write Same

2016-08-22 Thread Shaun Tancheff
On Mon, Aug 22, 2016 at 2:20 PM, Tom Yan  wrote:
> On 22 August 2016 at 12:23, Shaun Tancheff  wrote:
>> SATA drives may support write same via SCT. This is useful
>> for setting the drive contents to a specific pattern (0's).
>>
>> Translate a SCSI WRITE SAME 16 command to be either a DSM TRIM
>> command or an SCT Write Same command.
>>
>> Based on the UNMAP flag:
>>   - When set translate to DSM TRIM
>>   - When not set translate to SCT Write Same
>>
>> Signed-off-by: Shaun Tancheff 
>> ---
>> v6:
>>  - Change to use sg_copy_from_buffer as per Christoph Hellwig 
>> v5:
>>  - Addressed review comments
>>  - Report support for ZBC only for zoned devices.
>>  - kmap page during rewrite
>>  - Fix unmap set to require trim or error, if not unmap then sct write
>>same or error.
>> v4:
>>  - Added partial MAINTENANCE_IN opcode simulation
>>  - Dropped all changes in drivers/scsi/*
>>  - Changed to honor the UNMAP flag -> TRIM, no UNMAP -> SCT.
>> v3:
>>  - Demux UNMAP/TRIM from WRITE SAME
>> v2:
>>  - Remove fugly ata hacking from sd.c
>>
>>  drivers/ata/libata-scsi.c | 199 
>> +++---
>>  include/linux/ata.h   |  43 ++
>>  2 files changed, 213 insertions(+), 29 deletions(-)
>>
>> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
>> index 7990cb2..ebf1a04 100644
>> --- a/drivers/ata/libata-scsi.c
>> +++ b/drivers/ata/libata-scsi.c
>> @@ -1159,8 +1159,6 @@ static void ata_scsi_sdev_config(struct scsi_device 
>> *sdev)
>>  {
>> sdev->use_10_for_rw = 1;
>> sdev->use_10_for_ms = 1;
>> -   sdev->no_report_opcodes = 1;
>> -   sdev->no_write_same = 1;
>>
>> /* Schedule policy is determined by ->qc_defer() callback and
>>  * it needs to see every deferred qc.  Set dev_blocked to 1 to
>> @@ -3287,7 +3285,7 @@ static unsigned int ata_scsi_pass_thru(struct 
>> ata_queued_cmd *qc)
>>   * @cmd: SCSI command being translated
>>   * @num: Maximum number of entries (nominally 64).
>>   * @sector: Starting sector
>> - * @count: Total Range of request
>> + * @count: Total Range of request in logical sectors
>>   *
>>   * Rewrite the WRITE SAME descriptor to be a DSM TRIM little-endian 
>> formatted
>>   * descriptor.
>> @@ -3330,6 +3328,45 @@ static unsigned int ata_format_dsm_trim_descr(struct 
>> scsi_cmnd *cmd, u32 num,
>> return used_bytes;
>>  }
>>
>> +/**
>> + * ata_format_dsm_trim_descr() - SATL Write Same to ATA SCT Write Same
>> + * @cmd: SCSI command being translated
>> + * @lba: Starting sector
>> + * @num: Number of logical sectors to be zero'd.
>> + *
>> + * Rewrite the WRITE SAME descriptor to be an SCT Write Same formatted
>> + * descriptor.
>> + * NOTE: Writes a pattern (0's) in the foreground.
>> + *   Large write-same requents can timeout.
>> + */
>> +static void ata_format_sct_write_same(struct scsi_cmnd *cmd, u64 lba, u64 
>> num)
>> +{
>> +   u16 *sctpg;
>> +   unsigned long flags;
>> +
>> +   spin_lock_irqsave(&ata_scsi_rbuf_lock, flags);
>> +   sctpg = ((void *)ata_scsi_rbuf);
>> +
>> +   put_unaligned_le16(0x0002,  &sctpg[0]); /* SCT_ACT_WRITE_SAME */
>> +   put_unaligned_le16(0x0101,  &sctpg[1]); /* WRITE PTRN FG */
>> +   put_unaligned_le64(lba, &sctpg[2]);
>> +   put_unaligned_le64(num, &sctpg[6]);
>> +   put_unaligned_le32(0u,  &sctpg[10]);
>> +
>> +   sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), sctpg, 
>> 512);
>> +   spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags);
>> +}
>> +
>> +/**
>> + * ata_scsi_write_same_xlat() - SATL Write Same to ATA SCT Write Same
>> + * @qc: Command to be translated
>> + *
>> + * Translate a SCSI WRITE SAME command to be either a DSM TRIM command or
>> + * an SCT Write Same command.
>> + * Based on WRITE SAME has the UNMAP flag
>> + *   When set translate to DSM TRIM
>> + *   When clear translate to SCT Write Same
>> + */
>>  static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
>>  {
>> struct ata_taskfile *tf = &qc->tf;
>> @@ -3342,6 +3379,7 @@ static unsigned int ata_scsi_write_same_xlat(struct 
>> ata_queued_cmd *qc)
>> u32 size;
>> u16 fp;
>> u8 bp = 0xff;
>> +  

Re: [PATCH v6 3/4] SCT Write Same / DSM Trim

2016-08-22 Thread Shaun Tancheff
On Mon, Aug 22, 2016 at 12:02 PM, Tom Yan  wrote:
> On 22 August 2016 at 15:04, Shaun Tancheff  wrote:
>> On Mon, Aug 22, 2016 at 3:33 AM, Tom Yan  wrote:
>>> On 22 August 2016 at 08:31, Tom Yan  wrote:

> Since I have no experience with SCT Write Same at all, and neither do
> I own any spinning HDD at all, I cannot firmly suggest what to do. All
> I can suggest is: should we decrease it per sector size? Or would 2G
> per command still be too large to avoid timeout?

Timeout for WS is 120 seconds so we should be fine there.

The number to look for is the:
   Max. Sustained Transfer Rate OD (MB/s): 190 8TB (180 5TB)

Which means the above drives should complete a 2G write in
about 10 to 11 seconds.

If these were 4Kn drives and we allowed a 16G max then it
would be 80-90 seconds, assuming the write speed didn't
get any better.

So holding the maximum to around 2G is probably the best
overall, in my opinion.

-- 
Shaun Tancheff

On Mon, Aug 22, 2016 at 12:02 PM, Tom Yan  wrote:
> On 22 August 2016 at 15:04, Shaun Tancheff  wrote:
>> On Mon, Aug 22, 2016 at 3:33 AM, Tom Yan  wrote:
>>> On 22 August 2016 at 08:31, Tom Yan  wrote:
>>>> As mentioned before, as of the latest draft of ACS-4, nothing about a
>>>> larger payload size is mentioned. Conservatively speaking, it sort of
>>>
>>> *payload block size
>>>
>>>> means that we are allowing four 512-byte block payload on 4Kn device
>>>
>>> *eight 512-byte-block payload
>>>
>>>> regardless of the reported limit in the IDENTIFY DEVICE data. I am
>>>> really not sure if it's a good thing to do. Doesn't seem necessary
>>>> anyway, especially when our block layer does not support such a large
>>>> bio size (well, yet), so each request will end up using a payload of
>>>> two 512-byte blocks at max anyway.
>>>>
>>>> Also, it's IMHO better to do it in a seperate patch (series) after the
>>>> SCT Write Same support has entered libata's repo too, because this has
>>>> nothing to with it but TRIM translation. In case the future ACS
>>>> standards has clearer/better instruction on this, it will be easier
>>>> for us to revert/follow up too.
>>
>> I am certainly fine with dropping this patch as it is not critical to
>> the reset of the series.
>>
>> Nothing will break if we stick with the 512 byte fixed limit. This
>> is at most a prep patch for handling increased limits should
>> they be reported.
>>
>> All it really is doing is acknowledging that any write same
>> must have a payload of sector_size which can be something
>> larger than 512 bytes.
>
> Actually I am not sure if we should hard code the limit
> ata_format_dsm_trim_descr() / ata_set_lba_range_entries() at all. The
> current implementation (with or without your patch) seems redundant
> and unnecessary to me.
>
> All we need to do should be: making sure that the block limits VPD
> advertises a safe Maximum Write Same Length, and reject Write Same
> (16) commands that have "number of blocks" that exceeds the limit
> (which is what I introduced in commit 5c79097a28c2, "libata-scsi:
> reject WRITE SAME (16) with n_block that exceeds limit").
>
> In that case, we don't need to hard code the limit in the
> while-condition again; instead we should just make it end with the
> request size, since the accepted request could never be larger than
> the limit we advertise.
>
>>
>>>> And you'll need to fix the Block Limits VPD simulation
>>>> (ata_scsiop_inq_b0) too, so that it will advertise the Maximum Write
>>>> Same Length dynamically as per the logical sector size, otherwise your
>>>> effort will be completely in vain, even if our block layer is
>>>> overhauled in the future.
>>
>> Martin had earlier suggested that I leave the write same defaults
>> as is due to concerns with misbehaving hardware.
>
> It doesn't really apply in libata's anyway. SD_MAX_WS10_BLOCKS means
> nothing to ATA drives, except from coincidentally being the same value
> as ATA_MAX_SECTORS_LBA48 (which technically should have been 65536
> instead).
>
>>
>> I think your patch adjusting the reported limits is reasonable
>> enough. It seems to me we should have the hardware
>> report it's actual limits, for example, report what the spec
>> allows.
>
> As you mentioned yourself before, technically SCT Write Same does not
> have a limit. The only practical limit is the timeout in the SCSI
> layer, so the actual bytes being (over)written is probably our only
&g

Re: [PATCH 2/2] Migrate zone cache from RB-Tree to arrays of descriptors

2016-08-22 Thread Shaun Tancheff
On Mon, Aug 22, 2016 at 2:11 AM, Hannes Reinecke  wrote:
> On 08/22/2016 06:34 AM, Shaun Tancheff wrote:
>> Currently the RB-Tree zone cache is fast and flexible. It does
>> use a rather largish amount of ram. This model reduces the ram
>> required from 120 bytes per zone to 16 bytes per zone with a
>> moderate transformation of the blk_zone_lookup() api.
>>
>> This model is predicated on the belief that most variations
>> on zoned media will follow a pattern of using collections of same
>> sized zones on a single device. Similar to the pattern of erase
>> blocks on flash devices being progressivly larger 16K, 64K, ...
>>
>> The goal is to be able to build a descriptor which is both memory
>> efficient, performant, and flexible.
>>
>> Signed-off-by: Shaun Tancheff 
>> ---
>>  block/blk-core.c   |2 +-
>>  block/blk-sysfs.c  |   31 +-
>>  block/blk-zoned.c  |  103 +++--
>>  drivers/scsi/sd.c  |5 +-
>>  drivers/scsi/sd.h  |4 +-
>>  drivers/scsi/sd_zbc.c  | 1025 
>> +++-
>>  include/linux/blkdev.h |   82 +++-
>>  7 files changed, 716 insertions(+), 536 deletions(-)

> Have you measure the performance impact here?

As far as actual hardware (HostAware) I am seeing the same
I/O performance. I suspect its just that below 100k iops the
zone cache just isn't a bottleneck.

> The main idea behind using an RB-tree is that each single element will
> fit in the CPU cache; using an array will prevent that.
> So we will increase the number of cache flushes, and most likely a
> performance penalty, too.
> Hence I'd rather like to see a performance measurement here before going
> down that road.

I think it will have to be a simulated benchmark, if that's okay.

Of course I'm open to suggestions if there is something you have in mind.
-- 
Regards,
Shaun Tancheff


Re: [PATCH v6 3/4] SCT Write Same / DSM Trim

2016-08-22 Thread Shaun Tancheff
On Mon, Aug 22, 2016 at 3:33 AM, Tom Yan  wrote:
> On 22 August 2016 at 08:31, Tom Yan  wrote:
>> As mentioned before, as of the latest draft of ACS-4, nothing about a
>> larger payload size is mentioned. Conservatively speaking, it sort of
>
> *payload block size
>
>> means that we are allowing four 512-byte block payload on 4Kn device
>
> *eight 512-byte-block payload
>
>> regardless of the reported limit in the IDENTIFY DEVICE data. I am
>> really not sure if it's a good thing to do. Doesn't seem necessary
>> anyway, especially when our block layer does not support such a large
>> bio size (well, yet), so each request will end up using a payload of
>> two 512-byte blocks at max anyway.
>>
>> Also, it's IMHO better to do it in a seperate patch (series) after the
>> SCT Write Same support has entered libata's repo too, because this has
>> nothing to with it but TRIM translation. In case the future ACS
>> standards has clearer/better instruction on this, it will be easier
>> for us to revert/follow up too.

I am certainly fine with dropping this patch as it is not critical to
the reset of the series.

Nothing will break if we stick with the 512 byte fixed limit. This
is at most a prep patch for handling increased limits should
they be reported.

All it really is doing is acknowledging that any write same
must have a payload of sector_size which can be something
larger than 512 bytes.

>> And you'll need to fix the Block Limits VPD simulation
>> (ata_scsiop_inq_b0) too, so that it will advertise the Maximum Write
>> Same Length dynamically as per the logical sector size, otherwise your
>> effort will be completely in vain, even if our block layer is
>> overhauled in the future.

Martin had earlier suggested that I leave the write same defaults
as is due to concerns with misbehaving hardware.

I think your patch adjusting the reported limits is reasonable
enough. It seems to me we should have the hardware
report it's actual limits, for example, report what the spec
allows.

Of course there are lots of reasons to limit the absolute
maximums.

So in this case we are just enabling the limit to be
increased but not changing the current black-listing
that distrusts DSM Trim. Once we have 4Kn devices
to test then we can start white-listing and see if there
is an overall increase in performance.

>> Please be noted that, since your haven't touched ata_scsiop_inq_b0 at
>> all, the reported Maximum Write Same Length will be:
>>
>> On device with TRIM support:
>> - 4194240 LOGICAL sector per request split / command
>> -- ~=2G on non-4Kn drives
>> -- ~=16G on non-4Kn drives
>>
>> On device without TRIM support:
>> - 0 --> SD_MAX_WS10_BLOCKS (65535) per request split / command
>> -- ~= 32M on non-4Kn drives
>> -- ~=256M on non-4Kn drives
>>
>> Even if we ignore the upper limit(s) of the block layer, do we want
>> such inconsistencies?

Hmm. Overall I think it is still okay if a bit confusing.
It is possible that for devices which support SCT Write Same
and DSM TRIM will still Trim faster than they can Write Same,
However the internal implementation is opaque so I can't
say if Write Same is often implemented in terms of TRIM
or not. I mean that's how _I_ do it [Write 1 block and map
N blocks to it], But not every FTL will have come to the
same conclusion.

I also suspect that given the choice for most use casess that
TRIM is preferred over WS when TRIM supports returning
zeroed blocks.

-- 
Shaun Tancheff


Re: [PATCH 2/2] Migrate zone cache from RB-Tree to arrays of descriptors

2016-08-21 Thread Shaun Tancheff
On Sun, Aug 21, 2016 at 11:34 PM, Shaun Tancheff  wrote:
> Currently the RB-Tree zone cache is fast and flexible. It does
> use a rather largish amount of ram. This model reduces the ram
> required from 120 bytes per zone to 16 bytes per zone with a
> moderate transformation of the blk_zone_lookup() api.
>
> This model is predicated on the belief that most variations
> on zoned media will follow a pattern of using collections of same
> sized zones on a single device. Similar to the pattern of erase
> blocks on flash devices being progressivly larger 16K, 64K, ...
>
> The goal is to be able to build a descriptor which is both memory
> efficient, performant, and flexible.
>
> Signed-off-by: Shaun Tancheff 
> ---
>  block/blk-core.c   |2 +-
>  block/blk-sysfs.c  |   31 +-
>  block/blk-zoned.c  |  103 +++--
>  drivers/scsi/sd.c  |5 +-
>  drivers/scsi/sd.h  |4 +-
>  drivers/scsi/sd_zbc.c  | 1025 
> +++-
>  include/linux/blkdev.h |   82 +++-
>  7 files changed, 716 insertions(+), 536 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3a9caf7..3b084a8 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -727,7 +727,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t 
> gfp_mask, int node_id)
> INIT_LIST_HEAD(&q->blkg_list);
>  #endif
>  #ifdef CONFIG_BLK_DEV_ZONED
> -   q->zones = RB_ROOT;
> +   q->zones = NULL;
>  #endif
> INIT_DELAYED_WORK(&q->delay_work, blk_delay_work);
>
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 43f441f..ecbd434 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -232,36 +232,7 @@ static ssize_t queue_max_hw_sectors_show(struct 
> request_queue *q, char *page)
>  #ifdef CONFIG_BLK_DEV_ZONED
>  static ssize_t queue_zoned_show(struct request_queue *q, char *page)
>  {
> -   struct rb_node *node;
> -   struct blk_zone *zone;
> -   ssize_t offset = 0, end = 0;
> -   size_t size = 0, num = 0;
> -   enum blk_zone_type type = BLK_ZONE_TYPE_UNKNOWN;
> -
> -   for (node = rb_first(&q->zones); node; node = rb_next(node)) {
> -   zone = rb_entry(node, struct blk_zone, node);
> -   if (zone->type != type ||
> -   zone->len != size ||
> -   end != zone->start) {
> -   if (size != 0)
> -   offset += sprintf(page + offset, "%zu\n", 
> num);
> -   /* We can only store one page ... */
> -   if (offset + 42 > PAGE_SIZE) {
> -   offset += sprintf(page + offset, "...\n");
> -   return offset;
> -   }
> -   size = zone->len;
> -   type = zone->type;
> -   offset += sprintf(page + offset, "%zu %zu %d ",
> - zone->start, size, type);
> -   num = 0;
> -   end = zone->start + size;
> -   } else
> -   end += zone->len;
> -   num++;
> -   }
> -   offset += sprintf(page + offset, "%zu\n", num);
> -   return offset;
> +   return sprintf(page, "%u\n", q->zones ? 1 : 0);
>  }
>  #endif
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 975e863..338a1af 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -8,63 +8,84 @@
>  #include 
>  #include 
>  #include 
> -#include 
> +#include 
>
> -struct blk_zone *blk_lookup_zone(struct request_queue *q, sector_t lba)
> +/**
> + * blk_lookup_zone() - Lookup zones
> + * @q: Request Queue
> + * @sector: Location to lookup
> + * @start: Pointer to starting location zone (OUT)
> + * @len: Pointer to length of zone (OUT)
> + * @lock: Pointer to spinlock of zones in owning descriptor (OUT)
> + */
> +struct blk_zone *blk_lookup_zone(struct request_queue *q, sector_t sector,
> +sector_t *start, sector_t *len,
> +spinlock_t **lock)
>  {
> -   struct rb_root *root = &q->zones;
> -   struct rb_node *node = root->rb_node;
> +   int iter;
> +   struct blk_zone *bzone = NULL;
> +   struct zone_wps *zi = q->zones;
> +
> +   *start = 0;
> +   *len = 0;
> +   *lock = NULL;
> +
> +   if (!q->zones)
> +   goto out;
>
> -   while (node) {
> -   struct blk_zone *zone = contain

[PATCH 2/2] Migrate zone cache from RB-Tree to arrays of descriptors

2016-08-21 Thread Shaun Tancheff
Currently the RB-Tree zone cache is fast and flexible. It does
use a rather largish amount of ram. This model reduces the ram
required from 120 bytes per zone to 16 bytes per zone with a
moderate transformation of the blk_zone_lookup() api.

This model is predicated on the belief that most variations
on zoned media will follow a pattern of using collections of same
sized zones on a single device. Similar to the pattern of erase
blocks on flash devices being progressivly larger 16K, 64K, ...

The goal is to be able to build a descriptor which is both memory
efficient, performant, and flexible.

Signed-off-by: Shaun Tancheff 
---
 block/blk-core.c   |2 +-
 block/blk-sysfs.c  |   31 +-
 block/blk-zoned.c  |  103 +++--
 drivers/scsi/sd.c  |5 +-
 drivers/scsi/sd.h  |4 +-
 drivers/scsi/sd_zbc.c  | 1025 +++-
 include/linux/blkdev.h |   82 +++-
 7 files changed, 716 insertions(+), 536 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3a9caf7..3b084a8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -727,7 +727,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
INIT_LIST_HEAD(&q->blkg_list);
 #endif
 #ifdef CONFIG_BLK_DEV_ZONED
-   q->zones = RB_ROOT;
+   q->zones = NULL;
 #endif
INIT_DELAYED_WORK(&q->delay_work, blk_delay_work);
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 43f441f..ecbd434 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -232,36 +232,7 @@ static ssize_t queue_max_hw_sectors_show(struct 
request_queue *q, char *page)
 #ifdef CONFIG_BLK_DEV_ZONED
 static ssize_t queue_zoned_show(struct request_queue *q, char *page)
 {
-   struct rb_node *node;
-   struct blk_zone *zone;
-   ssize_t offset = 0, end = 0;
-   size_t size = 0, num = 0;
-   enum blk_zone_type type = BLK_ZONE_TYPE_UNKNOWN;
-
-   for (node = rb_first(&q->zones); node; node = rb_next(node)) {
-   zone = rb_entry(node, struct blk_zone, node);
-   if (zone->type != type ||
-   zone->len != size ||
-   end != zone->start) {
-   if (size != 0)
-   offset += sprintf(page + offset, "%zu\n", num);
-   /* We can only store one page ... */
-   if (offset + 42 > PAGE_SIZE) {
-   offset += sprintf(page + offset, "...\n");
-   return offset;
-   }
-   size = zone->len;
-   type = zone->type;
-   offset += sprintf(page + offset, "%zu %zu %d ",
- zone->start, size, type);
-   num = 0;
-   end = zone->start + size;
-   } else
-   end += zone->len;
-   num++;
-   }
-   offset += sprintf(page + offset, "%zu\n", num);
-   return offset;
+   return sprintf(page, "%u\n", q->zones ? 1 : 0);
 }
 #endif
 
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 975e863..338a1af 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -8,63 +8,84 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 
-struct blk_zone *blk_lookup_zone(struct request_queue *q, sector_t lba)
+/**
+ * blk_lookup_zone() - Lookup zones
+ * @q: Request Queue
+ * @sector: Location to lookup
+ * @start: Pointer to starting location zone (OUT)
+ * @len: Pointer to length of zone (OUT)
+ * @lock: Pointer to spinlock of zones in owning descriptor (OUT)
+ */
+struct blk_zone *blk_lookup_zone(struct request_queue *q, sector_t sector,
+sector_t *start, sector_t *len,
+spinlock_t **lock)
 {
-   struct rb_root *root = &q->zones;
-   struct rb_node *node = root->rb_node;
+   int iter;
+   struct blk_zone *bzone = NULL;
+   struct zone_wps *zi = q->zones;
+
+   *start = 0;
+   *len = 0;
+   *lock = NULL;
+
+   if (!q->zones)
+   goto out;
 
-   while (node) {
-   struct blk_zone *zone = container_of(node, struct blk_zone,
-node);
+   for (iter = 0; iter < zi->wps_count; iter++) {
+   if (sector >= zi->wps[iter]->start_lba &&
+   sector <  zi->wps[iter]->last_lba) {
+   struct contiguous_wps *wp = zi->wps[iter];
+   u64 index = (sector - wp->start_lba) / wp->zone_size;
 
-   if (lba < zone->start)
-   node = node->rb_left;
-   else if (lba >= zone->start + zone->len)
-   node = node->rb_rig

[PATCH 1/2] Move ZBC core setup to sd_zbc

2016-08-21 Thread Shaun Tancheff
Move the remaining ZBC specific code to sd_zbc.c

Signed-off-by: Shaun Tancheff 
---
 drivers/scsi/sd.c |  65 +--
 drivers/scsi/sd.h |  20 ++
 drivers/scsi/sd_zbc.c | 170 +++---
 3 files changed, 126 insertions(+), 129 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 9a649fa..f144df4 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2244,68 +2244,6 @@ static int sd_read_protection_type(struct scsi_disk 
*sdkp, unsigned char *buffer
return ret;
 }
 
-static void sd_read_zones(struct scsi_disk *sdkp, unsigned char *buffer)
-{
-   int retval;
-   unsigned char *desc;
-   u32 rep_len;
-   u8 same;
-   u64 zone_len, lba;
-
-   if (sdkp->zoned != 1 && sdkp->device->type != TYPE_ZBC)
-   /*
-* Device managed or normal SCSI disk,
-* no special handling required
-*/
-   return;
-
-   retval = sd_zbc_report_zones(sdkp, buffer, SD_BUF_SIZE,
-0, ZBC_ZONE_REPORTING_OPTION_ALL, false);
-   if (retval < 0)
-   return;
-
-   rep_len = get_unaligned_be32(&buffer[0]);
-   if (rep_len < 64) {
-   sd_printk(KERN_WARNING, sdkp,
- "REPORT ZONES report invalid length %u\n",
- rep_len);
-   return;
-   }
-
-   if (sdkp->rc_basis == 0) {
-   /* The max_lba field is the capacity of a zoned device */
-   lba = get_unaligned_be64(&buffer[8]);
-   if (lba + 1 > sdkp->capacity) {
-   if (sdkp->first_scan)
-   sd_printk(KERN_WARNING, sdkp,
- "Changing capacity from %zu to Max 
LBA+1 %zu\n",
- sdkp->capacity, (sector_t) lba + 1);
-   sdkp->capacity = lba + 1;
-   }
-   }
-
-   /*
-* Adjust 'chunk_sectors' to the zone length if the device
-* supports equal zone sizes.
-*/
-   same = buffer[4] & 0xf;
-   if (same > 3) {
-   sd_printk(KERN_WARNING, sdkp,
- "REPORT ZONES SAME type %d not supported\n", same);
-   return;
-   }
-   /* Read the zone length from the first zone descriptor */
-   desc = &buffer[64];
-   zone_len = get_unaligned_be64(&desc[8]);
-   sdkp->unmap_alignment = zone_len;
-   sdkp->unmap_granularity = zone_len;
-   blk_queue_chunk_sectors(sdkp->disk->queue,
-   logical_to_sectors(sdkp->device, zone_len));
-
-   sd_zbc_setup(sdkp, zone_len, buffer, SD_BUF_SIZE);
-   sd_config_discard(sdkp, SD_ZBC_RESET_WP);
-}
-
 static void read_capacity_error(struct scsi_disk *sdkp, struct scsi_device 
*sdp,
struct scsi_sense_hdr *sshdr, int sense_valid,
int the_result)
@@ -2611,7 +2549,8 @@ got_data:
  sdkp->physical_block_size);
sdkp->device->sector_size = sector_size;
 
-   sd_read_zones(sdkp, buffer);
+   if (sd_zbc_config(sdkp, buffer, SD_BUF_SIZE))
+   sd_config_discard(sdkp, SD_ZBC_RESET_WP);
 
{
char cap_str_2[10], cap_str_10[10];
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index adbf3e0..fc766db 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -289,10 +289,6 @@ static inline void sd_dif_complete(struct scsi_cmnd *cmd, 
unsigned int a)
 #define SD_ZBC_WRITE_ERR   2
 
 #ifdef CONFIG_SCSI_ZBC
-
-extern int sd_zbc_report_zones(struct scsi_disk *, unsigned char *, int,
-  sector_t, enum zbc_zone_reporting_options, bool);
-extern int sd_zbc_setup(struct scsi_disk *, u64 zlen, char *buf, int buf_len);
 extern int sd_zbc_setup_zone_report_cmnd(struct scsi_cmnd *cmd, u8 rpt_opt);
 extern int sd_zbc_setup_zone_action(struct scsi_cmnd *cmd);
 extern int sd_zbc_setup_discard(struct scsi_cmnd *cmd);
@@ -303,23 +299,15 @@ extern void sd_zbc_uninit_command(struct scsi_cmnd *cmd);
 extern void sd_zbc_remove(struct scsi_disk *);
 extern void sd_zbc_reset_zones(struct scsi_disk *);
 extern void sd_zbc_update_zones(struct scsi_disk *, sector_t, int, int reason);
+extern bool sd_zbc_config(struct scsi_disk *, void *, size_t);
+
 extern unsigned int sd_zbc_discard_granularity(struct scsi_disk *sdkp);
 
 #else /* CONFIG_SCSI_ZBC */
 
-static inline int sd_zbc_report_zones(struct scsi_disk *sdkp,
- unsigned char *buf, int buf_len,
- sector_t start_sector,
- enum zbc_zone_reporting_options option,
-   

[PATCH 0/2] Change zone cache format to use less memory

2016-08-21 Thread Shaun Tancheff
Currently the RB-Tree zone cache is fast and flexible. It does
use a rather largish amount of ram. This model reduces the ram
required from 120 bytes per zone to 16 bytes per zone with a
moderate transformation of the blk_zone_lookup() api.

This model is predicated on the belief that most variations
on zoned media will follow a pattern of using collections of same
sized zones on a single device. Similar to the pattern of erase
blocks on flash devices being progressivly larger 16K, 64K, ...

The goal is to be able to build a descriptor which is both memory
efficient, performant, and flexible.

Shaun Tancheff (2):
  Move ZBC core setup to sd_zbc
  Migrate zone cache from RB-Tree to arrays of descriptors

 block/blk-core.c   |2 +-
 block/blk-sysfs.c  |   31 +-
 block/blk-zoned.c  |  103 +++--
 drivers/scsi/sd.c  |   66 +--
 drivers/scsi/sd.h  |   20 +-
 drivers/scsi/sd_zbc.c  | 1037 +---
 include/linux/blkdev.h |   82 +++-
 7 files changed, 759 insertions(+), 582 deletions(-)

-- 
2.9.3



[PATCH v2 4/4] Integrate ZBC command requests with zone cache.

2016-08-21 Thread Shaun Tancheff
Block layer (bio/request) commands can use or update the
sd_zbc zone cache as appropriate for each command.

Report Zones [REQ_OP_ZONE_REPORT] by default uses the current
zone cache data to generate a device (ZBC spec) formatted response.
REQ_META can also be specified to force the command to the device
and the result will be used to refresh the zone cache.

Reset WP [REQ_OP_ZONE_RESET] by default will attempt to translate
the request into a discard following the SD_ZBC_RESET_WP provisioning
mode. REQ_META can also be specified to force the command to be sent
to the device.

Open, Close and Finish zones having no other analog are sent directly
to the device.

On successful completion each zone action will update the zone cache
as appropriate.

Signed-off-by: Shaun Tancheff 
---
 block/blk-lib.c   |  16 --
 drivers/scsi/sd.c |  42 +++-
 drivers/scsi/sd.h |  22 +-
 drivers/scsi/sd_zbc.c | 672 +++---
 4 files changed, 698 insertions(+), 54 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 67b9258..8cc5893 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -307,22 +307,6 @@ int blkdev_issue_zone_report(struct block_device *bdev, 
unsigned int op_flags,
bio_set_op_attrs(bio, REQ_OP_ZONE_REPORT, op_flags);
ret = submit_bio_wait(bio);
 
-   /*
-* When our request it nak'd the underlying device maybe conventional
-* so ... report a single conventional zone the size of the device.
-*/
-   if (ret == -EIO && conv->descriptor_count) {
-   /* Adjust the conventional to the size of the partition ... */
-   __be64 blksz = cpu_to_be64(bdev->bd_part->nr_sects);
-
-   conv->maximum_lba = blksz;
-   conv->descriptors[0].type = BLK_ZONE_TYPE_CONVENTIONAL;
-   conv->descriptors[0].flags = BLK_ZONE_NO_WP << 4;
-   conv->descriptors[0].length = blksz;
-   conv->descriptors[0].lba_start = 0;
-   conv->descriptors[0].lba_wptr = blksz;
-   ret = 0;
-   }
bio_put(bio);
return ret;
 }
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index b76ffbb..9a649fa 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1181,9 +1181,10 @@ static int sd_setup_zone_report_cmnd(struct scsi_cmnd 
*cmd)
struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
struct bio *bio = rq->bio;
sector_t sector = blk_rq_pos(rq);
-   struct gendisk *disk = rq->rq_disk;
unsigned int nr_bytes = blk_rq_bytes(rq);
int ret = BLKPREP_KILL;
+   bool is_fua = (rq->cmd_flags & REQ_META) ? true : false;
+   u8 rpt_opt = ZBC_ZONE_REPORTING_OPTION_ALL;
 
WARN_ON(nr_bytes == 0);
 
@@ -1194,18 +1195,35 @@ static int sd_setup_zone_report_cmnd(struct scsi_cmnd 
*cmd)
if (sdkp->zoned != 1 && sdkp->device->type != TYPE_ZBC) {
void *src;
struct bdev_zone_report *conv;
+   __be64 blksz = cpu_to_be64(sdkp->capacity);
 
-   if (nr_bytes < sizeof(struct bdev_zone_report))
+   if (nr_bytes < 512)
goto out;
 
src = kmap_atomic(bio->bi_io_vec->bv_page);
conv = src + bio->bi_io_vec->bv_offset;
conv->descriptor_count = cpu_to_be32(1);
conv->same_field = BLK_ZONE_SAME_ALL;
-   conv->maximum_lba = cpu_to_be64(disk->part0.nr_sects);
+   conv->maximum_lba = blksz;
+   conv->descriptors[0].type = BLK_ZONE_TYPE_CONVENTIONAL;
+   conv->descriptors[0].flags = BLK_ZONE_NO_WP << 4;
+   conv->descriptors[0].length = blksz;
+   conv->descriptors[0].lba_start = 0;
+   conv->descriptors[0].lba_wptr = blksz;
kunmap_atomic(src);
+   ret = BLKPREP_DONE;
goto out;
}
+   /* FUTURE ... when streamid is available */
+   /* rpt_opt = bio_get_streamid(bio); */
+
+   if (!is_fua) {
+   ret = sd_zbc_setup_zone_report_cmnd(cmd, rpt_opt);
+   if (ret == BLKPREP_DONE || ret == BLKPREP_DEFER)
+   goto out;
+   if (ret == BLKPREP_KILL)
+   pr_err("No Zone Cache, query media.\n");
+   }
 
ret = scsi_init_io(cmd);
if (ret != BLKPREP_OK)
@@ -1224,8 +1242,7 @@ static int sd_setup_zone_report_cmnd(struct scsi_cmnd 
*cmd)
cmd->cmnd[1] = ZI_REPORT_ZONES;
put_unaligned_be64(sector, &cmd->cmnd[2]);
put_unaligned_be32(nr_bytes, &cmd->cmnd[10]);
-   /* FUTURE ... when streamid is available */
-   /* cmd->cmnd[14] = bio_get_streamid(bio); */
+   cmd->cmnd[14] = rpt_opt;
 
cmd->sc_data_direction = DMA_FROM_DE

[PATCH v2 2/4] On Discard either do Reset WP or Write Same

2016-08-21 Thread Shaun Tancheff
Based on the type of zone either perform a Reset WP
for Sequential zones or a Write Same for Conventional zones.

Also detect and handle the runt zone, if there is one.

One additional check is added to error on discard requests
that do not include all the active data in zone.
By way of example when the WP indicates that 2000 blocks
in the zone are in use and the discard indicated 1000 blocks
can be unmapped the discard should fail as a Reset WP will
unmap all the 2000 blocks in the zone.

Signed-off-by: Shaun Tancheff 
---
 drivers/scsi/sd.c |  45 ++---
 drivers/scsi/sd.h |   9 ++--
 drivers/scsi/sd_zbc.c | 135 +++---
 3 files changed, 114 insertions(+), 75 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 7903e21..d5ef6d8 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -729,21 +729,19 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
sector_t sector = blk_rq_pos(rq);
unsigned int nr_sectors = blk_rq_sectors(rq);
unsigned int nr_bytes = blk_rq_bytes(rq);
-   unsigned int len;
-   int ret = 0;
+   int ret;
char *buf;
-   struct page *page = NULL;
+   struct page *page;
 
sector >>= ilog2(sdp->sector_size) - 9;
nr_sectors >>= ilog2(sdp->sector_size) - 9;
 
-   if (sdkp->provisioning_mode != SD_ZBC_RESET_WP) {
-   page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
-   if (!page)
-   return BLKPREP_DEFER;
-   }
+   page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+   if (!page)
+   return BLKPREP_DEFER;
 
rq->completion_data = page;
+   rq->timeout = SD_TIMEOUT;
 
switch (sdkp->provisioning_mode) {
case SD_LBP_UNMAP:
@@ -758,7 +756,7 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
put_unaligned_be64(sector, &buf[8]);
put_unaligned_be32(nr_sectors, &buf[16]);
 
-   len = 24;
+   cmd->transfersize = 24;
break;
 
case SD_LBP_WS16:
@@ -768,7 +766,7 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
put_unaligned_be64(sector, &cmd->cmnd[2]);
put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
 
-   len = sdkp->device->sector_size;
+   cmd->transfersize = sdp->sector_size;
break;
 
case SD_LBP_WS10:
@@ -777,35 +775,24 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
cmd->cmnd[0] = WRITE_SAME;
if (sdkp->provisioning_mode == SD_LBP_WS10)
cmd->cmnd[1] = 0x8; /* UNMAP */
+   else
+   rq->timeout = SD_WRITE_SAME_TIMEOUT;
put_unaligned_be32(sector, &cmd->cmnd[2]);
put_unaligned_be16(nr_sectors, &cmd->cmnd[7]);
 
-   len = sdkp->device->sector_size;
+   cmd->transfersize = sdp->sector_size;
break;
 
case SD_ZBC_RESET_WP:
-   /* sd_zbc_setup_discard uses block layer sector units */
-   ret = sd_zbc_setup_discard(sdkp, rq, blk_rq_pos(rq),
-  blk_rq_sectors(rq));
+   ret = sd_zbc_setup_discard(cmd);
if (ret != BLKPREP_OK)
goto out;
-   cmd->cmd_len = 16;
-   cmd->cmnd[0] = ZBC_OUT;
-   cmd->cmnd[1] = ZO_RESET_WRITE_POINTER;
-   put_unaligned_be64(sector, &cmd->cmnd[2]);
-   /* Reset Write Pointer doesn't have a payload */
-   len = 0;
-   cmd->sc_data_direction = DMA_NONE;
break;
-
default:
ret = BLKPREP_INVALID;
goto out;
}
 
-   rq->timeout = SD_TIMEOUT;
-
-   cmd->transfersize = len;
cmd->allowed = SD_MAX_RETRIES;
 
/*
@@ -816,17 +803,15 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
 * discarded on disk. This allows us to report completion on the full
 * amount of blocks described by the request.
 */
-   if (len) {
-   blk_add_request_payload(rq, page, 0, len);
+   if (cmd->transfersize) {
+   blk_add_request_payload(rq, page, 0, cmd->transfersize);
ret = scsi_init_io(cmd);
}
rq->__data_len = nr_bytes;
 
 out:
-   if (page && ret != BLKPREP_OK) {
-   rq->completion_data = NULL;
+   if (ret != BLKPREP_OK)
__free_page(page);
-   }
return ret;
 }
 
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index ef6c132..2792c10 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -295,8 +295,7 @@ extern int sd_zbc_report_zones(struct s

[PATCH v2 3/4] Merge ZBC constants

2016-08-21 Thread Shaun Tancheff
Dedupe ZBC/ZAC constants used for reporting options, same code,
zone condition and zone type.

These are all useful to programs consuming zone information from
user space as well so include them in a uapi header.

Signed-off-by: Shaun Tancheff 
---
 block/blk-lib.c   |   4 +-
 drivers/scsi/sd.c |   2 +-
 include/linux/blkdev.h|  20 -
 include/scsi/scsi_proto.h |  17 
 include/uapi/linux/blkzoned_api.h | 167 +++---
 5 files changed, 103 insertions(+), 107 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e92bd56..67b9258 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -316,8 +316,8 @@ int blkdev_issue_zone_report(struct block_device *bdev, 
unsigned int op_flags,
__be64 blksz = cpu_to_be64(bdev->bd_part->nr_sects);
 
conv->maximum_lba = blksz;
-   conv->descriptors[0].type = ZTYP_CONVENTIONAL;
-   conv->descriptors[0].flags = ZCOND_CONVENTIONAL << 4;
+   conv->descriptors[0].type = BLK_ZONE_TYPE_CONVENTIONAL;
+   conv->descriptors[0].flags = BLK_ZONE_NO_WP << 4;
conv->descriptors[0].length = blksz;
conv->descriptors[0].lba_start = 0;
conv->descriptors[0].lba_wptr = blksz;
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index d5ef6d8..b76ffbb 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1201,7 +1201,7 @@ static int sd_setup_zone_report_cmnd(struct scsi_cmnd 
*cmd)
src = kmap_atomic(bio->bi_io_vec->bv_page);
conv = src + bio->bi_io_vec->bv_offset;
conv->descriptor_count = cpu_to_be32(1);
-   conv->same_field = ZS_ALL_SAME;
+   conv->same_field = BLK_ZONE_SAME_ALL;
conv->maximum_lba = cpu_to_be64(disk->part0.nr_sects);
kunmap_atomic(src);
goto out;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 68198eb..d5cdb5d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -263,26 +263,6 @@ struct blk_queue_tag {
 #define BLK_SCSI_CMD_PER_LONG  (BLK_SCSI_MAX_CMDS / (sizeof(long) * 8))
 
 #ifdef CONFIG_BLK_DEV_ZONED
-enum blk_zone_type {
-   BLK_ZONE_TYPE_UNKNOWN,
-   BLK_ZONE_TYPE_CONVENTIONAL,
-   BLK_ZONE_TYPE_SEQWRITE_REQ,
-   BLK_ZONE_TYPE_SEQWRITE_PREF,
-   BLK_ZONE_TYPE_RESERVED,
-};
-
-enum blk_zone_state {
-   BLK_ZONE_NO_WP,
-   BLK_ZONE_EMPTY,
-   BLK_ZONE_OPEN,
-   BLK_ZONE_OPEN_EXPLICIT,
-   BLK_ZONE_CLOSED,
-   BLK_ZONE_UNKNOWN = 5,
-   BLK_ZONE_READONLY = 0xd,
-   BLK_ZONE_FULL,
-   BLK_ZONE_OFFLINE,
-   BLK_ZONE_BUSY = 0x20,
-};
 
 struct blk_zone {
struct rb_node node;
diff --git a/include/scsi/scsi_proto.h b/include/scsi/scsi_proto.h
index 6ba66e0..d1defd1 100644
--- a/include/scsi/scsi_proto.h
+++ b/include/scsi/scsi_proto.h
@@ -299,21 +299,4 @@ struct scsi_lun {
 #define SCSI_ACCESS_STATE_MASK0x0f
 #define SCSI_ACCESS_STATE_PREFERRED   0x80
 
-/* Reporting options for REPORT ZONES */
-enum zbc_zone_reporting_options {
-   ZBC_ZONE_REPORTING_OPTION_ALL = 0,
-   ZBC_ZONE_REPORTING_OPTION_EMPTY,
-   ZBC_ZONE_REPORTING_OPTION_IMPLICIT_OPEN,
-   ZBC_ZONE_REPORTING_OPTION_EXPLICIT_OPEN,
-   ZBC_ZONE_REPORTING_OPTION_CLOSED,
-   ZBC_ZONE_REPORTING_OPTION_FULL,
-   ZBC_ZONE_REPORTING_OPTION_READONLY,
-   ZBC_ZONE_REPORTING_OPTION_OFFLINE,
-   ZBC_ZONE_REPORTING_OPTION_NEED_RESET_WP = 0x10,
-   ZBC_ZONE_REPORTING_OPTION_NON_SEQWRITE,
-   ZBC_ZONE_REPORTING_OPTION_NON_WP = 0x3f,
-};
-
-#define ZBC_REPORT_ZONE_PARTIAL 0x80
-
 #endif /* _SCSI_PROTO_H_ */
diff --git a/include/uapi/linux/blkzoned_api.h 
b/include/uapi/linux/blkzoned_api.h
index cd81a9f..fa12976 100644
--- a/include/uapi/linux/blkzoned_api.h
+++ b/include/uapi/linux/blkzoned_api.h
@@ -16,97 +16,123 @@
 
 #include 
 
+#define ZBC_REPORT_OPTION_MASK  0x3f
+#define ZBC_REPORT_ZONE_PARTIAL 0x80
+
 /**
  * enum zone_report_option - Report Zones types to be included.
  *
- * @ZOPT_NON_SEQ_AND_RESET: Default (all zones).
- * @ZOPT_ZC1_EMPTY: Zones which are empty.
- * @ZOPT_ZC2_OPEN_IMPLICIT: Zones open but not explicitly opened
- * @ZOPT_ZC3_OPEN_EXPLICIT: Zones opened explicitly
- * @ZOPT_ZC4_CLOSED: Zones closed for writing.
- * @ZOPT_ZC5_FULL: Zones that are full.
- * @ZOPT_ZC6_READ_ONLY: Zones that are read-only
- * @ZOPT_ZC7_OFFLINE: Zones that are offline
- * @ZOPT_RESET: Zones that are empty
- * @ZOPT_NON_SEQ: Zones that have HA media-cache writes pending
- * @ZOPT_NON_WP_ZONES: Zones that do not have Write Pointers (conventional)
- * @ZOPT_PARTIAL_FLAG: Modifies the definition of the Zone List Length field.
+ * @ZBC_ZONE_REPORTING_OPTION_ALL: Default (all zones).
+ * @ZBC_ZONE_REPORTING_OPTION_EMPTY: Zones which are empty.
+ * @ZBC_ZONE_REPORTI

[PATCH v2 0/4] Integrate bio/request ZBC ops with zone cache

2016-08-21 Thread Shaun Tancheff
Hi,

As per Christoph's request this patch incorporates Hannes' cache of zone
information.

This approach is to have REQ_OP_ZONE_REPORT return data in the same
format regardless of the availability of the zone cache. So if the
is kernel being built with or without BLK_DEV_ZONED [and SCSI_ZBC]
users of blkdev_issue_zone_report() and/or REQ_OP_ZONE_REPORT bio's
will have a consistent data format to digest.

Additionally it seems reasonable to allow the REQ_OP_ZONE_* to
be able to indicate if the command *must* be delivered to the
device [and update the zone cache] accordingly. Here REQ_META is
being used as REQ_FUA can be dropped causing sd_done to be skipped.
Rather than special case the current code I chose to pick an otherwise
non-applicable flag.

This series is based off of Linus's v4.8-rc2 and builds on top of the
previous series of block layer support:
Add ioctl to issue ZBC/ZAC commands via block layer
Add bio/request flags to issue ZBC/ZAC commands
as well as the series posted by Hannes
sd_zbc: Fix handling of ZBC read after write pointer
sd: Limit messages for ZBC disks capacity change
sd: Implement support for ZBC devices
sd: Implement new RESET_WP provisioning mode
sd: configure ZBC devices
...

Patches for util-linux can be found here:
g...@github.com:stancheff/util-linux.git v2.28.1+biof

https://github.com/stancheff/util-linux/tree/v2.28.1%2Bbiof

This patch is available here:
https://github.com/stancheff/linux/tree/v4.8-rc2%2Bbiof.v9

g...@github.com:stancheff/linux.git v4.8-rc2+biof.v9

v2:
 - Fully integrated bio <-> zone cache [<-> device]
 - Added discard -> write same for conventional zones.
 - Merged disparate constants into a canonical set.

Shaun Tancheff (4):
  Enable support for Seagate HostAware drives (testing).
  On Discard either do Reset WP or Write Same
  Merge ZBC constants
  Integrate ZBC command requests with zone cache.

 block/blk-lib.c   |  16 -
 drivers/scsi/sd.c | 111 +++--
 drivers/scsi/sd.h |  49 ++-
 drivers/scsi/sd_zbc.c | 904 ++
 include/linux/blkdev.h|  22 +-
 include/scsi/scsi_proto.h |  17 -
 include/uapi/linux/blkzoned_api.h | 167 ---
 7 files changed, 1032 insertions(+), 254 deletions(-)

-- 
2.9.3



[PATCH v2 1/4] Enable support for Seagate HostAware drives

2016-08-21 Thread Shaun Tancheff
Seagate drives report a SAME code of 0 due to having:
  - Zones of different types (CMR zones at the low LBA space).
  - Zones of different size (A terminating 'runt' zone in the high
lba space).

Support loading the zone topology into the zone cache.

Signed-off-by: Shaun Tancheff 
---
 drivers/scsi/sd.c  |  22 +++---
 drivers/scsi/sd.h  |  20 --
 drivers/scsi/sd_zbc.c  | 183 +++--
 include/linux/blkdev.h |  16 +++--
 4 files changed, 170 insertions(+), 71 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 059a57f..7903e21 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -693,8 +693,13 @@ static void sd_config_discard(struct scsi_disk *sdkp, 
unsigned int mode)
break;
 
case SD_ZBC_RESET_WP:
-   max_blocks = sdkp->unmap_granularity;
q->limits.discard_zeroes_data = 1;
+   q->limits.discard_granularity =
+   sd_zbc_discard_granularity(sdkp);
+
+   max_blocks = min_not_zero(sdkp->unmap_granularity,
+ q->limits.discard_granularity >>
+   ilog2(logical_block_size));
break;
 
case SD_LBP_ZERO:
@@ -1955,13 +1960,12 @@ static int sd_done(struct scsi_cmnd *SCpnt)
good_bytes = blk_rq_bytes(req);
scsi_set_resid(SCpnt, 0);
} else {
-#ifdef CONFIG_SCSI_ZBC
if (op == ZBC_OUT)
/* RESET WRITE POINTER failed */
sd_zbc_update_zones(sdkp,
blk_rq_pos(req),
-   512, true);
-#endif
+   512, SD_ZBC_RESET_WP_ERR);
+
good_bytes = 0;
scsi_set_resid(SCpnt, blk_rq_bytes(req));
}
@@ -2034,7 +2038,6 @@ static int sd_done(struct scsi_cmnd *SCpnt)
good_bytes = blk_rq_bytes(req);
scsi_set_resid(SCpnt, 0);
}
-#ifdef CONFIG_SCSI_ZBC
/*
 * ZBC: Unaligned write command.
 * Write did not start a write pointer position.
@@ -2042,8 +2045,7 @@ static int sd_done(struct scsi_cmnd *SCpnt)
if (sshdr.ascq == 0x04)
sd_zbc_update_zones(sdkp,
blk_rq_pos(req),
-   512, true);
-#endif
+   512, SD_ZBC_WRITE_ERR);
}
break;
default:
@@ -2270,7 +2272,7 @@ static void sd_read_zones(struct scsi_disk *sdkp, 
unsigned char *buffer)
 * supports equal zone sizes.
 */
same = buffer[4] & 0xf;
-   if (same == 0 || same > 3) {
+   if (same > 3) {
sd_printk(KERN_WARNING, sdkp,
  "REPORT ZONES SAME type %d not supported\n", same);
return;
@@ -2282,9 +2284,9 @@ static void sd_read_zones(struct scsi_disk *sdkp, 
unsigned char *buffer)
sdkp->unmap_granularity = zone_len;
blk_queue_chunk_sectors(sdkp->disk->queue,
logical_to_sectors(sdkp->device, zone_len));
-   sd_config_discard(sdkp, SD_ZBC_RESET_WP);
 
-   sd_zbc_setup(sdkp, buffer, SD_BUF_SIZE);
+   sd_zbc_setup(sdkp, zone_len, buffer, SD_BUF_SIZE);
+   sd_config_discard(sdkp, SD_ZBC_RESET_WP);
 }
 
 static void read_capacity_error(struct scsi_disk *sdkp, struct scsi_device 
*sdp,
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 6ae4505..ef6c132 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -283,19 +283,24 @@ static inline void sd_dif_complete(struct scsi_cmnd *cmd, 
unsigned int a)
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+
+#define SD_ZBC_INIT0
+#define SD_ZBC_RESET_WP_ERR1
+#define SD_ZBC_WRITE_ERR   2
+
 #ifdef CONFIG_SCSI_ZBC
 
 extern int sd_zbc_report_zones(struct scsi_disk *, unsigned char *, int,
   sector_t, enum zbc_zone_reporting_options, bool);
-extern int sd_zbc_setup(struct scsi_disk *, char *, int);
+extern int sd_zbc_setup(struct scsi_disk *, u64 zlen, char *buf, int buf_len);
 extern void sd_zbc_remove(struct scsi_disk *);
 extern void sd_zbc_reset_zones(struct scsi_disk *);
 extern int sd_zbc_setup_discard(struct scsi_disk *, struct request *,
sector_t, unsigned int);
 extern int sd_zbc_setup_read_write(struct scsi_disk *, struct request *,
   sector_t, unsigned int *);
-extern void sd_zbc_upda

[PATCH v6 4/4] SCT Write Same handle ATA_DFLAG_PIO

2016-08-21 Thread Shaun Tancheff
Use non DMA write log when ATA_DFLAG_PIO is set.

Signed-off-by: Shaun Tancheff 
---
v6: Added check for ATA_DFLAG_PIO and fallback to non DMA write log for
SCT Write Same

 drivers/ata/libata-scsi.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 37f456e..e50b7a7 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3485,6 +3485,8 @@ static unsigned int ata_scsi_write_same_xlat(struct 
ata_queued_cmd *qc)
tf->device = ATA_CMD_STANDBYNOW1;
tf->protocol = ATA_PROT_DMA;
tf->command = ATA_CMD_WRITE_LOG_DMA_EXT;
+   if (unlikely(dev->flags & ATA_DFLAG_PIO))
+   tf->command = ATA_CMD_WRITE_LOG_EXT;
}
 
tf->flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_DEVICE | ATA_TFLAG_LBA48 |
-- 
2.9.3



[PATCH v6 3/4] SCT Write Same / DSM Trim

2016-08-21 Thread Shaun Tancheff
Correct handling of devices with sector_size other that 512 bytes.

Signed-off-by: Shaun Tancheff 
---
In the case of a 4Kn device sector_size it is possible to describe a much
larger DSM Trim than the current fixed default of 512 bytes.

This patch assumes the minimum descriptor is sector_size and fills out
the descriptor accordingly.

The ACS-2 specification is quite clear that the DSM command payload is
sized as number of 512 byte transfers so a 4Kn device will operate
correctly without this patch.

v5:
 - Added support for a sector_size descriptor other than 512 bytes.

 drivers/ata/libata-scsi.c | 85 +++
 1 file changed, 57 insertions(+), 28 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index ebf1a04..37f456e 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3283,7 +3283,7 @@ static unsigned int ata_scsi_pass_thru(struct 
ata_queued_cmd *qc)
 /**
  * ata_format_dsm_trim_descr() - SATL Write Same to DSM Trim
  * @cmd: SCSI command being translated
- * @num: Maximum number of entries (nominally 64).
+ * @trmax: Maximum number of entries that will fit in sector_size bytes.
  * @sector: Starting sector
  * @count: Total Range of request in logical sectors
  *
@@ -3298,63 +3298,80 @@ static unsigned int ata_scsi_pass_thru(struct 
ata_queued_cmd *qc)
  *  LBA's should be sorted order and not overlap.
  *
  * NOTE: this is the same format as ADD LBA(S) TO NV CACHE PINNED SET
+ *
+ * Return: Number of bytes copied into sglist.
  */
-static unsigned int ata_format_dsm_trim_descr(struct scsi_cmnd *cmd, u32 num,
- u64 sector, u32 count)
+static size_t ata_format_dsm_trim_descr(struct scsi_cmnd *cmd, u32 trmax,
+   u64 sector, u32 count)
 {
-   __le64 *buffer;
-   u32 i = 0, used_bytes;
+   struct scsi_device *sdp = cmd->device;
+   size_t len = sdp->sector_size;
+   size_t r;
+   __le64 *buf;
+   u32 i = 0;
unsigned long flags;
 
-   BUILD_BUG_ON(512 > ATA_SCSI_RBUF_SIZE);
+   WARN_ON(len > ATA_SCSI_RBUF_SIZE);
+
+   if (len > ATA_SCSI_RBUF_SIZE)
+   len = ATA_SCSI_RBUF_SIZE;
 
spin_lock_irqsave(&ata_scsi_rbuf_lock, flags);
-   buffer = ((void *)ata_scsi_rbuf);
-   while (i < num) {
+   buf = ((void *)ata_scsi_rbuf);
+   memset(buf, 0, len);
+   while (i < trmax) {
u64 entry = sector |
((u64)(count > 0x ? 0x : count) << 48);
-   buffer[i++] = __cpu_to_le64(entry);
+   buf[i++] = __cpu_to_le64(entry);
if (count <= 0x)
break;
count -= 0x;
sector += 0x;
}
-
-   used_bytes = ALIGN(i * 8, 512);
-   memset(buffer + i, 0, used_bytes - i * 8);
-   sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), buffer, 512);
+   r = sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), buf, len);
spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags);
 
-   return used_bytes;
+   return r;
 }
 
 /**
  * ata_format_dsm_trim_descr() - SATL Write Same to ATA SCT Write Same
  * @cmd: SCSI command being translated
  * @lba: Starting sector
- * @num: Number of logical sectors to be zero'd.
+ * @num: Number of sectors to be zero'd.
  *
- * Rewrite the WRITE SAME descriptor to be an SCT Write Same formatted
+ * Rewrite the WRITE SAME payload to be an SCT Write Same formatted
  * descriptor.
  * NOTE: Writes a pattern (0's) in the foreground.
- *   Large write-same requents can timeout.
+ *
+ * Return: Number of bytes copied into sglist.
  */
-static void ata_format_sct_write_same(struct scsi_cmnd *cmd, u64 lba, u64 num)
+static size_t ata_format_sct_write_same(struct scsi_cmnd *cmd, u64 lba, u64 
num)
 {
-   u16 *sctpg;
+   struct scsi_device *sdp = cmd->device;
+   size_t len = sdp->sector_size;
+   size_t r;
+   u16 *buf;
unsigned long flags;
 
spin_lock_irqsave(&ata_scsi_rbuf_lock, flags);
-   sctpg = ((void *)ata_scsi_rbuf);
+   buf = ((void *)ata_scsi_rbuf);
+
+   put_unaligned_le16(0x0002,  &buf[0]); /* SCT_ACT_WRITE_SAME */
+   put_unaligned_le16(0x0101,  &buf[1]); /* WRITE PTRN FG */
+   put_unaligned_le64(lba, &buf[2]);
+   put_unaligned_le64(num, &buf[6]);
+   put_unaligned_le32(0u,  &buf[10]); /* pattern */
+
+   WARN_ON(len > ATA_SCSI_RBUF_SIZE);
 
-   put_unaligned_le16(0x0002,  &sctpg[0]); /* SCT_ACT_WRITE_SAME */
-   put_unaligned_le16(0x0101,  &sctpg[1]); /* WRITE PTRN FG */
-   put_unaligned_le64(lba, &sctpg[2]);
-   put_unaligned_le64(num, &sctpg[6]);
-   put_unaligned_le32(0u,  &sctpg[10]);
+   if (len > ATA_SCSI_RBUF_SIZE

[PATCH v6 2/4] Add support for SCT Write Same

2016-08-21 Thread Shaun Tancheff
SATA drives may support write same via SCT. This is useful
for setting the drive contents to a specific pattern (0's).

Translate a SCSI WRITE SAME 16 command to be either a DSM TRIM
command or an SCT Write Same command.

Based on the UNMAP flag:
  - When set translate to DSM TRIM
  - When not set translate to SCT Write Same

Signed-off-by: Shaun Tancheff 
---
v6:
 - Change to use sg_copy_from_buffer as per Christoph Hellwig 
v5:
 - Addressed review comments
 - Report support for ZBC only for zoned devices.
 - kmap page during rewrite
 - Fix unmap set to require trim or error, if not unmap then sct write
   same or error.
v4:
 - Added partial MAINTENANCE_IN opcode simulation
 - Dropped all changes in drivers/scsi/*
 - Changed to honor the UNMAP flag -> TRIM, no UNMAP -> SCT.
v3:
 - Demux UNMAP/TRIM from WRITE SAME
v2:
 - Remove fugly ata hacking from sd.c

 drivers/ata/libata-scsi.c | 199 +++---
 include/linux/ata.h   |  43 ++
 2 files changed, 213 insertions(+), 29 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 7990cb2..ebf1a04 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1159,8 +1159,6 @@ static void ata_scsi_sdev_config(struct scsi_device *sdev)
 {
sdev->use_10_for_rw = 1;
sdev->use_10_for_ms = 1;
-   sdev->no_report_opcodes = 1;
-   sdev->no_write_same = 1;
 
/* Schedule policy is determined by ->qc_defer() callback and
 * it needs to see every deferred qc.  Set dev_blocked to 1 to
@@ -3287,7 +3285,7 @@ static unsigned int ata_scsi_pass_thru(struct 
ata_queued_cmd *qc)
  * @cmd: SCSI command being translated
  * @num: Maximum number of entries (nominally 64).
  * @sector: Starting sector
- * @count: Total Range of request
+ * @count: Total Range of request in logical sectors
  *
  * Rewrite the WRITE SAME descriptor to be a DSM TRIM little-endian formatted
  * descriptor.
@@ -3330,6 +3328,45 @@ static unsigned int ata_format_dsm_trim_descr(struct 
scsi_cmnd *cmd, u32 num,
return used_bytes;
 }
 
+/**
+ * ata_format_dsm_trim_descr() - SATL Write Same to ATA SCT Write Same
+ * @cmd: SCSI command being translated
+ * @lba: Starting sector
+ * @num: Number of logical sectors to be zero'd.
+ *
+ * Rewrite the WRITE SAME descriptor to be an SCT Write Same formatted
+ * descriptor.
+ * NOTE: Writes a pattern (0's) in the foreground.
+ *   Large write-same requents can timeout.
+ */
+static void ata_format_sct_write_same(struct scsi_cmnd *cmd, u64 lba, u64 num)
+{
+   u16 *sctpg;
+   unsigned long flags;
+
+   spin_lock_irqsave(&ata_scsi_rbuf_lock, flags);
+   sctpg = ((void *)ata_scsi_rbuf);
+
+   put_unaligned_le16(0x0002,  &sctpg[0]); /* SCT_ACT_WRITE_SAME */
+   put_unaligned_le16(0x0101,  &sctpg[1]); /* WRITE PTRN FG */
+   put_unaligned_le64(lba, &sctpg[2]);
+   put_unaligned_le64(num, &sctpg[6]);
+   put_unaligned_le32(0u,  &sctpg[10]);
+
+   sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), sctpg, 512);
+   spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags);
+}
+
+/**
+ * ata_scsi_write_same_xlat() - SATL Write Same to ATA SCT Write Same
+ * @qc: Command to be translated
+ *
+ * Translate a SCSI WRITE SAME command to be either a DSM TRIM command or
+ * an SCT Write Same command.
+ * Based on WRITE SAME has the UNMAP flag
+ *   When set translate to DSM TRIM
+ *   When clear translate to SCT Write Same
+ */
 static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
 {
struct ata_taskfile *tf = &qc->tf;
@@ -3342,6 +3379,7 @@ static unsigned int ata_scsi_write_same_xlat(struct 
ata_queued_cmd *qc)
u32 size;
u16 fp;
u8 bp = 0xff;
+   u8 unmap = cdb[1] & 0x8;
 
/* we may not issue DMA commands if no DMA mode is set */
if (unlikely(!dev->dma_mode))
@@ -3353,11 +3391,26 @@ static unsigned int ata_scsi_write_same_xlat(struct 
ata_queued_cmd *qc)
}
scsi_16_lba_len(cdb, &block, &n_block);
 
-   /* for now we only support WRITE SAME with the unmap bit set */
-   if (unlikely(!(cdb[1] & 0x8))) {
-   fp = 1;
-   bp = 3;
-   goto invalid_fld;
+   if (unmap) {
+   /* If trim is not enabled the cmd is invalid. */
+   if ((dev->horkage & ATA_HORKAGE_NOTRIM) ||
+   !ata_id_has_trim(dev->id)) {
+   fp = 1;
+   bp = 3;
+   goto invalid_fld;
+   }
+   /* If the request is too large the cmd is invalid */
+   if (n_block > 0x * trmax) {
+   fp = 2;
+   goto invalid_fld;
+   }
+   } else {
+   /* If write same is not available the cmd is invalid */
+ 

[PATCH v6 0/4] SCT Write Same

2016-08-21 Thread Shaun Tancheff
At some point the method of issuing Write Same for ATA drives changed.
Currently write same is commonly available via SCT so expose the SCT
capabilities and use SCT Write Same when it is available.

This is useful for zoned based media that prefers to support discard
with lbprz set, aka discard zeroes data by mapping discard operations to
reset write pointer operations. Conventional zones that do not support
reset write pointer can still honor the discard zeroes data by issuing
a write same over the zone.

It may also be nice to know if various controllers that currently
disable WRITE SAME will work with the SCT Write Same code path:
  aacraid, arcmsr, megaraid, 3w-9xxx, 3w-sas, 3w-, gdth, hpsa, ips,
  megaraid, pmcraid, storvsc_drv

This patch against v4.8-rc2 is also at
https://github.com/stancheff/linux/tree/v4.8-rc2%2Bbiof.v9

g...@github.com:stancheff/linux.git v4.8-rc2+biof.v9

v6:
 - Fix bisect bug reported by Tom Yan 
 - Change to use sg_copy_from_buffer as per Christoph Hellwig 
 - Added support for a sector_size descriptor other than 512 bytes.
v5:
 - Addressed review comments
 - Report support for ZBC only for zoned devices.
 - kmap page during rewrite
 - Fix unmap set to require trim or error, if not unmap then sct write
   same or error.
v4:
 - Added partial MAINTENANCE_IN opcode simulation
 - Dropped all changes in drivers/scsi/*
 - Changed to honor the UNMAP flag -> TRIM, no UNMAP -> SCT.
v3:
 - Demux UNMAP/TRIM from WRITE SAME
v2:
 - Remove fugly ata hacking from sd.c

Shaun Tancheff (4):
  libata: Safely overwrite attached page in WRITE SAME xlat
  Add support for SCT Write Same
  SCT Write Same / DSM Trim
  SCT Write Same handle ATA_DFLAG_PIO

 drivers/ata/libata-scsi.c | 280 +-
 include/linux/ata.h   |  69 +++-
 2 files changed, 292 insertions(+), 57 deletions(-)

-- 
2.9.3



[PATCH v6 1/4] libata: Safely overwrite attached page in WRITE SAME xlat

2016-08-21 Thread Shaun Tancheff
Safely overwriting the attached page to ATA format from the SCSI formatted
variant.

Signed-off-by: Shaun Tancheff 
---
v6:
 - Fix bisect bug reported by Tom Yan 
 - Change to use sg_copy_from_buffer as per Christoph Hellwig 
v5:
 - Added prep patch to work with non-page aligned scatterlist pages
   and use kmap_atomic() to lock page during modification.

 drivers/ata/libata-scsi.c | 56 ++-
 include/linux/ata.h   | 26 --
 2 files changed, 51 insertions(+), 31 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index e207b33..7990cb2 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3282,6 +3282,54 @@ static unsigned int ata_scsi_pass_thru(struct 
ata_queued_cmd *qc)
return 1;
 }
 
+/**
+ * ata_format_dsm_trim_descr() - SATL Write Same to DSM Trim
+ * @cmd: SCSI command being translated
+ * @num: Maximum number of entries (nominally 64).
+ * @sector: Starting sector
+ * @count: Total Range of request
+ *
+ * Rewrite the WRITE SAME descriptor to be a DSM TRIM little-endian formatted
+ * descriptor.
+ *
+ * Upto 64 entries of the format:
+ *   63:48 Range Length
+ *   47:0  LBA
+ *
+ *  Range Length of 0 is ignored.
+ *  LBA's should be sorted order and not overlap.
+ *
+ * NOTE: this is the same format as ADD LBA(S) TO NV CACHE PINNED SET
+ */
+static unsigned int ata_format_dsm_trim_descr(struct scsi_cmnd *cmd, u32 num,
+ u64 sector, u32 count)
+{
+   __le64 *buffer;
+   u32 i = 0, used_bytes;
+   unsigned long flags;
+
+   BUILD_BUG_ON(512 > ATA_SCSI_RBUF_SIZE);
+
+   spin_lock_irqsave(&ata_scsi_rbuf_lock, flags);
+   buffer = ((void *)ata_scsi_rbuf);
+   while (i < num) {
+   u64 entry = sector |
+   ((u64)(count > 0x ? 0x : count) << 48);
+   buffer[i++] = __cpu_to_le64(entry);
+   if (count <= 0x)
+   break;
+   count -= 0x;
+   sector += 0x;
+   }
+
+   used_bytes = ALIGN(i * 8, 512);
+   memset(buffer + i, 0, used_bytes - i * 8);
+   sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), buffer, 512);
+   spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags);
+
+   return used_bytes;
+}
+
 static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
 {
struct ata_taskfile *tf = &qc->tf;
@@ -3290,8 +3338,8 @@ static unsigned int ata_scsi_write_same_xlat(struct 
ata_queued_cmd *qc)
const u8 *cdb = scmd->cmnd;
u64 block;
u32 n_block;
+   const u32 trmax = ATA_MAX_TRIM_RNUM;
u32 size;
-   void *buf;
u16 fp;
u8 bp = 0xff;
 
@@ -3319,10 +3367,8 @@ static unsigned int ata_scsi_write_same_xlat(struct 
ata_queued_cmd *qc)
if (!scsi_sg_count(scmd))
goto invalid_param_len;
 
-   buf = page_address(sg_page(scsi_sglist(scmd)));
-
-   if (n_block <= 65535 * ATA_MAX_TRIM_RNUM) {
-   size = ata_set_lba_range_entries(buf, ATA_MAX_TRIM_RNUM, block, 
n_block);
+   if (n_block <= 0x * trmax) {
+   size = ata_format_dsm_trim_descr(scmd, trmax, block, n_block);
} else {
fp = 2;
goto invalid_fld;
diff --git a/include/linux/ata.h b/include/linux/ata.h
index adbc812..45a1d71 100644
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -1071,32 +1071,6 @@ static inline void ata_id_to_hd_driveid(u16 *id)
 #endif
 }
 
-/*
- * Write LBA Range Entries to the buffer that will cover the extent from
- * sector to sector + count.  This is used for TRIM and for ADD LBA(S)
- * TO NV CACHE PINNED SET.
- */
-static inline unsigned ata_set_lba_range_entries(void *_buffer,
-   unsigned num, u64 sector, unsigned long count)
-{
-   __le64 *buffer = _buffer;
-   unsigned i = 0, used_bytes;
-
-   while (i < num) {
-   u64 entry = sector |
-   ((u64)(count > 0x ? 0x : count) << 48);
-   buffer[i++] = __cpu_to_le64(entry);
-   if (count <= 0x)
-   break;
-   count -= 0x;
-   sector += 0x;
-   }
-
-   used_bytes = ALIGN(i * 8, 512);
-   memset(buffer + i, 0, used_bytes - i * 8);
-   return used_bytes;
-}
-
 static inline bool ata_ok(u8 status)
 {
return ((status & (ATA_BUSY | ATA_DRDY | ATA_DF | ATA_DRQ | ATA_ERR))
-- 
2.9.3



[PATCH v8 2/2] Add ioctl to issue ZBC/ZAC commands via block layer

2016-08-21 Thread Shaun Tancheff
Add support for ZBC ioctl's
BLKREPORT - Issue Report Zones to device.
BLKZONEACTION - Issue a Zone Action (Close, Finish, Open, or Reset)

Signed-off-by: Shaun Tancheff 
---
v8:
 - Changed ioctl for zone actions to a single ioctl that takes 
   a structure including the zone, zone action, all flag, and force option
 - Mapped REQ_META flag to 'force unit access' for zone operations
v6:
 - Added GFP_DMA to gfp mask.
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio op's

 block/ioctl.c | 149 ++
 include/uapi/linux/blkzoned_api.h |  30 +++-
 include/uapi/linux/fs.h   |   1 +
 3 files changed, 179 insertions(+), 1 deletion(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index ed2397f..d760523 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -194,6 +194,151 @@ int blkdev_reread_part(struct block_device *bdev)
 }
 EXPORT_SYMBOL(blkdev_reread_part);
 
+static int blk_zoned_report_ioctl(struct block_device *bdev, fmode_t mode,
+ void __user *parg)
+{
+   int error = -EFAULT;
+   gfp_t gfp = GFP_KERNEL | GFP_DMA;
+   void *iopg = NULL;
+   struct bdev_zone_report_io *bzrpt = NULL;
+   int order = 0;
+   struct page *pgs = NULL;
+   u32 alloc_size = PAGE_SIZE;
+   unsigned int op_flags = 0;
+   u8 opt = 0;
+
+   if (!(mode & FMODE_READ))
+   return -EBADF;
+
+   iopg = (void *)get_zeroed_page(gfp);
+   if (!iopg) {
+   error = -ENOMEM;
+   goto report_zones_out;
+   }
+   bzrpt = iopg;
+   if (copy_from_user(bzrpt, parg, sizeof(*bzrpt))) {
+   error = -EFAULT;
+   goto report_zones_out;
+   }
+   if (bzrpt->data.in.return_page_count > alloc_size) {
+   int npages;
+
+   alloc_size = bzrpt->data.in.return_page_count;
+   npages = (alloc_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+   pgs = alloc_pages(gfp, ilog2(npages));
+   if (pgs) {
+   void *mem = page_address(pgs);
+
+   if (!mem) {
+   error = -ENOMEM;
+   goto report_zones_out;
+   }
+   order = ilog2(npages);
+   memset(mem, 0, alloc_size);
+   memcpy(mem, bzrpt, sizeof(*bzrpt));
+   bzrpt = mem;
+   } else {
+   /* Result requires DMA capable memory */
+   pr_err("Not enough memory available for request.\n");
+   error = -ENOMEM;
+   goto report_zones_out;
+   }
+   } else {
+   alloc_size = bzrpt->data.in.return_page_count;
+   }
+   if (bzrpt->data.in.force_unit_access)
+   op_flags |= REQ_META;
+   opt = bzrpt->data.in.report_option;
+   error = blkdev_issue_zone_report(bdev, op_flags,
+   bzrpt->data.in.zone_locator_lba, opt,
+   pgs ? pgs : virt_to_page(iopg),
+   alloc_size, GFP_KERNEL);
+   if (error)
+   goto report_zones_out;
+
+   if (pgs) {
+   void *src = bzrpt;
+   u32 off = 0;
+
+   /*
+* When moving a multi-order page with GFP_DMA
+* the copy to user can trap ""
+* so instead we copy out 1 page at a time.
+*/
+   while (off < alloc_size && !error) {
+   u32 len = min_t(u32, PAGE_SIZE, alloc_size - off);
+
+   memcpy(iopg, src + off, len);
+   if (copy_to_user(parg + off, iopg, len))
+   error = -EFAULT;
+   off += len;
+   }
+   } else {
+   if (copy_to_user(parg, iopg, alloc_size))
+   error = -EFAULT;
+   }
+
+report_zones_out:
+   if (pgs)
+   __free_pages(pgs, order);
+   if (iopg)
+   free_page((unsigned long)iopg);
+   return error;
+}
+
+static int blk_zoned_action_ioctl(struct block_device *bdev, fmode_t mode,
+ void __user *parg)
+{
+   unsigned int op = 0;
+   unsigned int op_flags = 0;
+   sector_t lba;
+   struct bdev_zone_action za;
+
+   if (!(mode & FMODE_WRITE))
+   return -EBADF;
+
+   /* When acting on zones we explicitly disallow using a partition. */
+   if (bdev != bdev->bd_contains) {
+   pr_err("%s: All zone operations disallowed on this device\n",
+   __func__);
+   return -EFAULT;
+   }
+
+   if (copy_from_user(&za, parg, sizeof(za)))
+   return -EFAULT

[PATCH v8 1/2] Add bio/request flags to issue ZBC/ZAC commands

2016-08-21 Thread Shaun Tancheff
Add op flags to access to zone information as well as open, close
and reset zones:
  - REQ_OP_ZONE_REPORT - Query zone information (Report zones)
  - REQ_OP_ZONE_OPEN - Explicitly open a zone for writing
  - REQ_OP_ZONE_CLOSE - Explicitly close a zone
  - REQ_OP_ZONE_FINISH - Explicitly finish a zone
  - REQ_OP_ZONE_RESET - Reset Write Pointer to start of zone

These op flags can be used to create bio's to control zoned devices
through the block layer.

This is useful for file systems and device mappers that need explicit
control of zoned devices such as Host Managed and Host Aware SMR drives,

Report zones is a device read that requires a buffer.

Open, Close, Finish and Reset are device commands that have no
associated data transfer.
  Open -   Open is a zone for writing.
  Close -  Disallow writing to a zone.
  Finish - Disallow writing a zone and set the WP to the end
   of the zone.
  Reset -  Discard data in a zone and reset the WP to the start
   of the zone.

Sending an LBA of ~0 will attempt to operate on all zones.
This is typically used with Reset to wipe a drive as a Reset
behaves similar to TRIM in that all data in the zone(s) is deleted.

Report zones currently defaults to reporting on all zones. It expected
that support for the zone option flag will piggy back on streamid
support. The report option flag is useful as it can reduce the number
of zones in each report, but not critical.

Signed-off-by: Shaun Tancheff 
---
v8:
 - Added Finish Zone op
 - Fixed report zones copy to user to work when HARDENED_USERCOPY is enabled
v6:
 - Added GFP_DMA to gfp mask.
v5:
 - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent
 - In blk-lib fix documentation
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio op's
V3:
 - Rebase on Mike Cristie's separate bio operations
 - Update blkzoned_api.h to include report zones PARTIAL bit.
V2:
 - Changed bi_rw to op_flags clarify sepeartion of bio op from flags.
 - Fixed memory leak in blkdev_issue_zone_report failing to put_bio().
 - Documented opt in blkdev_issue_zone_report.
 - Removed include/uapi/linux/fs.h from this patch.

 MAINTAINERS   |   9 ++
 block/blk-lib.c   |  94 
 drivers/scsi/sd.c | 121 +
 drivers/scsi/sd.h |   1 +
 include/linux/bio.h   |   8 +-
 include/linux/blk_types.h |   7 +-
 include/linux/blkdev.h|   1 +
 include/linux/blkzoned_api.h  |  25 ++
 include/uapi/linux/Kbuild |   1 +
 include/uapi/linux/blkzoned_api.h | 182 ++
 10 files changed, 447 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/blkzoned_api.h
 create mode 100644 include/uapi/linux/blkzoned_api.h

diff --git a/MAINTAINERS b/MAINTAINERS
index a306795..aedf311 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12984,6 +12984,15 @@ F: Documentation/networking/z8530drv.txt
 F: drivers/net/hamradio/*scc.c
 F: drivers/net/hamradio/z8530.h
 
+ZBC AND ZBC BLOCK DEVICES
+M: Shaun Tancheff 
+W: http://seagate.com
+W: https://github.com/Seagate/ZDM-Device-Mapper
+L: linux-bl...@vger.kernel.org
+S: Maintained
+F: include/linux/blkzoned_api.h
+F: include/uapi/linux/blkzoned_api.h
+
 ZBUD COMPRESSED PAGE ALLOCATOR
 M: Seth Jennings 
 L: linux...@kvack.org
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 083e56f..e92bd56 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -266,3 +266,97 @@ int blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
+
+/**
+ * blkdev_issue_zone_report - queue a report zones operation
+ * @bdev:  target blockdev
+ * @op_flags:  extra bio rw flags. If unsure, use 0.
+ * @sector:starting sector (report will include this sector).
+ * @opt:   See: zone_report_option, default is 0 (all zones).
+ * @page:  one or more contiguous pages.
+ * @pgsz:  up to size of page in bytes, size of report.
+ * @gfp_mask:  memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *Issue a zone report request for the sectors in question.
+ */
+int blkdev_issue_zone_report(struct block_device *bdev, unsigned int op_flags,
+sector_t sector, u8 opt, struct page *page,
+size_t pgsz, gfp_t gfp_mask)
+{
+   struct bdev_zone_report *conv = page_address(page);
+   struct bio *bio;
+   unsigned int nr_iovecs = 1;
+   int ret = 0;
+
+   if (pgsz < (sizeof(struct bdev_zone_report) +
+   sizeof(struct bdev_zone_descriptor)))
+   return -EINVAL;
+
+   bio = bio_alloc(gfp_mask, nr_iovecs);
+   if (!bio)
+   return -ENOMEM;
+
+   conv->descriptor_count = 0;
+   bio->bi_iter.bi_sect

[PATCH v8 0/2] Block layer support ZAC/ZBC commands

2016-08-21 Thread Shaun Tancheff
Hi Jens,

This series is based on linus' v4.8-rc2 branch.

As Host Aware drives are becoming available we would like to be able
to make use of such drives. This series is also intended to be
suitable for use by Host Managed drives.

ZBC [and ZAC] drives add new commands for discovering and working
with Zones.

Part one of this series expands the bio/request reserved op size from
3 to 4 bits and then adds op codes for each of the ZBC commands:
   Report zones, close zone, finish zone, open zone and reset zone.

Part two of this series deals with integrating these new bio/request
op's with Hannes' zone cache.

This extends the ZBC support up to the block layer allowing direct
control by file systems or device mapper targets. Also by deferring
the zone handling to the authoritative subsystem there is an overall
lower memory usage for holding the active zone information as well
as clarifying responsible party for maintaining the write pointer
for each active zone.

By way of example a DM target may have several writes in progress. To sector
(or lba) for those writes will each depend on the previous write. While the
drive's write pointer will be updated as writes are completed the DM target
will be maintaining both where the next write should be scheduled from and
where the write pointer is based on writes completed w/o errors.

Knowing the drive zone topology enables DM targets and file systems to
extend their block allocation schemes and issue write pointer resets (or
discards) that are zone aligned.

A perhaps non-obvious approach is that a conventional drive will
returns a zone report descriptor with a single large conventional zone.
This is intended to allow a collection of zoned and non-zoned media to
be stitched together to provide a file system with a zoned device with
conventional space mapped to where it is useful.

Patches for util-linux can be found here:
g...@github.com:stancheff/util-linux.git v2.28.1+biof

https://github.com/stancheff/util-linux/tree/v2.28.1%2Bbiof

This patch is available here:
https://github.com/stancheff/linux/tree/v4.8-rc2%2Bbiof.v8

g...@github.com:stancheff/linux.git v4.8-rc2+biof.v8

v8:
 - Changed zone report to default to reading from zone cache.
 - Changed ioctl for zone commands to support forcing a query or command
   to be sent to media.
 - Fixed report zones copy to user to work when HARDENED_USERCOPY is enabled
v7:
 - Initial support for Hannes' zone cache.
v6:
 - Fix page alloc to include DMA flag for ioctl.
v5:
 - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent
 - In blk-lib fix documentation
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio op's
 - Dropped ata16 hackery
V3:
 - Rebase on Mike Cristie's separate bio operations
 - Update blkzoned_api.h to include report zones PARTIAL bit.
 - Use zoned report reserved bit for ata-passthrough flag.

V2:
 - Changed bi_rw to op_flags clarify sepeartion of bio op from flags.
 - Fixed memory leak in blkdev_issue_zone_report failing to put_bio().
 - Documented opt in blkdev_issue_zone_report.
 - Moved include/uapi/linux/fs.h changes to patch 3
 - Fixed commit message for first patch in series.


Shaun Tancheff (2):
  Add bio/request flags to issue ZBC/ZAC commands
  Add ioctl to issue ZBC/ZAC commands via block layer

 MAINTAINERS   |   9 ++
 block/blk-lib.c   |  94 +
 block/ioctl.c | 149 +++
 drivers/scsi/sd.c | 121 ++
 drivers/scsi/sd.h |   1 +
 include/linux/bio.h   |   8 +-
 include/linux/blk_types.h |   7 +-
 include/linux/blkdev.h|   1 +
 include/linux/blkzoned_api.h  |  25 +
 include/uapi/linux/Kbuild |   1 +
 include/uapi/linux/blkzoned_api.h | 210 ++
 include/uapi/linux/fs.h   |   1 +
 12 files changed, 625 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/blkzoned_api.h
 create mode 100644 include/uapi/linux/blkzoned_api.h

-- 
2.9.3



Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-15 Thread Shaun Tancheff
On Tue, Aug 9, 2016 at 11:38 PM, Damien Le Moal  wrote:
> Shaun,
>
> On 8/10/16 12:58, Shaun Tancheff wrote:
>>
>> On Tue, Aug 9, 2016 at 3:09 AM, Damien Le Moal 
>> wrote:
>>>>
>>>> On Aug 9, 2016, at 15:47, Hannes Reinecke  wrote:
>>
>>
>> [trim]
>>
>>>>> Since disk type == 0 for everything that isn't HM so I would prefer the
>>>>> sysfs 'zoned' file just report if the drive is HA or HM.
>>>>>
>>>> Okay. So let's put in the 'zoned' attribute the device type:
>>>> 'host-managed', 'host-aware', or 'device managed'.
>>>
>>>
>>> I hacked your patches and simply put a "0" or "1" in the sysfs zoned
>>> file.
>>> Any drive that has ZBC/ZAC command support gets a "1", "0" for everything
>>> else. This means that drive managed models are not exposed as zoned block
>>> devices. For HM vs HA differentiation, an application can look at the
>>> device type file since it is already present.
>>>
>>> We could indeed set the "zoned" file to the device type, but HM drives
>>> and
>>> regular drives will both have "0" in it, so no differentiation possible.
>>> The other choice could be the "zoned" bits defined by ZBC, but these
>>> do not define a value for host managed drives, and the drive managed
>>> value
>>> being not "0" could be confusing too. So I settled for a simple 0/1
>>> boolean.
>>
>>
>> This seems good to me.
>
>
> Another option I forgot is for the "zoned" file to indicate the total number
> of zones of the device, and 0 for a non zoned regular block device. That
> would work as well.

Clearly either is sufficient.

> [...]
>>>
>>> Done: I hacked Shaun ioctl code and added finish zone too. The
>>> difference with Shaun initial code is that the ioctl are propagated down
>>> to
>>> the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need
>>> for
>>> BIO request definition for the zone operations. So a lot less code added.
>>
>>
>> The purpose of the BIO flags is not to enable the ioctls so much as
>> the other way round. Creating BIO op's is to enable issuing ZBC
>> commands from device mapper targets and file systems without some
>> heinous ioctl hacks.
>> Making the resulting block layer interfaces available via ioctls is just a
>> reasonable way to exercise the code ... or that was my intent.
>
>
> Yes, I understood your code. However, since (or if) we keep the zone
> information in the RB-tree cache, there is no need for the report zone
> operation BIO interface. Same for reset write pointer by keeping the mapping
> to discard. blk_lookup_zone can be used in kernel as a report zone BIO
> replacement and works as well for the report zone ioctl implementation. For
> reset, there is blkdev_issue_discrad in kernel, and the reset zone ioctl
> becomes equivalent to BLKDISCARD ioctl. These are simple. Open, close and
> finish zone remains. For these, adding the BIO interface seemed an overkill.
> Hence my choice of propagating the ioctl to the driver.
> This is debatable of course, and adding an in-kernel interface is not hard:
> we can implement blk_open_zone, blk_close_zone and blk_finish_zone using
> __blkdev_driver_ioctl. That looks clean to me.

Uh. I would call that "heinous" ioctl hacks myself. Kernel -> User API
-> Kernel
is not really a good designed IMO.

> Overall, my concern with the BIO based interface for the ZBC commands is
> that it adds one flag for each command, which is not really the philosophy
> of the interface and potentially opens the door for more such
> implementations in the future with new standards and new commands coming up.
> Clearly that is not a sustainable path. So I think that a more specific
> interface for these zone operations is a better choice. That is consistent
> with what happens with the tons of ATA and SCSI commands not actually doing
> data I/Os (mode sense, log pages, SMART, etc). All these do not use BIOs and
> are processed as request REQ_TYPE_BLOCK_PC.

Part of the reason for following on Mike Christie's bio op/flags cleanup was to
make these op's. The advantage of being added as ops is that there is only
1 extra bit need (not 4 or 5 bits for flags). The other reason for being
promoted into the block layer as commands is because it seems to me
to make sense that these abstractions could be allowed to be passed through
a DM layer and be handled by a files sy

Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-15 Thread Shaun Tancheff
On Mon, Aug 15, 2016 at 11:00 PM, Damien Le Moal  wrote:
>
> Shaun,
>
>> On Aug 14, 2016, at 09:09, Shaun Tancheff  wrote:
> […]
>>>>
>>> No, surely not.
>>> But one of the _big_ advantages for the RB tree is blkdev_discard().
>>> Without the RB tree any mkfs program will issue a 'discard' for every
>>> sector. We will be able to coalesce those into one discard per zone, but
>>> we still need to issue one for _every_ zone.
>>
>> How can you make coalesce work transparently in the
>> sd layer _without_ keeping some sort of a discard cache along
>> with the zone cache?
>>
>> Currently the block layer's blkdev_issue_discard() is breaking
>> large discard's into nice granular and aligned chunks but it is
>> not preventing small discards nor coalescing them.
>>
>> In the sd layer would there be way to persist or purge an
>> overly large discard cache? What about honoring
>> discard_zeroes_data? Once the discard is completed with
>> discard_zeroes_data you have to return zeroes whenever
>> a discarded sector is read. Isn't that a log more than just
>> tracking a write pointer? Couldn't a zone have dozens of holes?
>
> My understanding of the standards regarding discard is that it is not
> mandatory and that it is a hint to the drive. The drive can completely
> ignore it if it thinks that is a better choice. I may be wrong on this
> though. Need to check again.

But you are currently setting discard_zeroes_data=1 in your
current patches. I believe that setting discard_zeroes_data=1
effectively promotes discards to being mandatory.

I have a follow on patch to my SCT Write Same series that
handles the CMR zone case in the sd_zbc_setup_discard() handler.

> For reset write pointer, the mapping to discard requires that the calls
> to blkdev_issue_discard be zone aligned for anything to happen. Specify
> less than a zone and nothing will be done. This I think preserve the
> discard semantic.

Oh. If that is the intent then there is just a bug in the handler.
I have pointed out where I believe it to be in my response to
the zone cache patch being posted.

> As for the “discard_zeroes_data” thing, I also think that is a drive
> feature not mandatory. Drives may have it or not, which is consistent
> with the ZBC/ZAC standards regarding reading after write pointer (nothing
> says that zeros have to be returned). In any case, discard of CMR zones
> will be a nop, so for SMR drives, discard_zeroes_data=0 may be a better
> choice.

However I am still curious about discard's being coalesced.

>>> Which is (as indicated) really slow, and easily takes several minutes.
>>> With the RB tree we can short-circuit discards to empty zones, and speed
>>> up processing time dramatically.
>>> Sure we could be moving the logic into mkfs and friends, but that would
>>> require us to change the programs and agree on a library (libzbc?) which
>>> should be handling that.
>>
>> F2FS's mkfs.f2fs is already reading the zone topology via SG_IO ...
>> so I'm not sure your argument is valid here.
>
> This initial SMR support patch is just that: a first try. Jaegeuk
> used SG_IO (in fact copy-paste of parts of libzbc) because the current
> ZBC patch-set has no ioctl API for zone information manipulation. We
> will fix this mkfs.f2fs once we agree on an ioctl interface.

Which again is my point. If mkfs.f2fs wants to speed up it's
discard pass in mkfs.f2fs by _not_ sending unneccessary
Reset WP for zones that are already empty it has all the
information it needs to do so.

Here it seems to me that the zone cache is _at_best_
doing double work. At works the zone cache could be
doing the wrong thing _if_ the zone cache got out of sync.
It is certainly possible (however unlikely) that someone was
doing some raw sg activity that is not seed by the sd path.

All I am trying to do is have a discussion about the reasons for
and against have a zone cache. Where it works and where it breaks
this should be entirely technical but I understand that we have all
spent a lot of time _not_ discussing this for various non-technical
reasons.

So far the only reason I've been able to ascertain is that
Host Manged drives really don't like being stuck with the
URSWRZ and would like to have a software hack to return
MUD rather than ship drives with some weird out-of-the box
config where the last zone is marked as FINISH'd thereby
returning MUD on reads as per spec.

I understand that it would be strange state to see of first
boot and likely people would just do a ResetWP and have
weird boot errors, which would probably just make matters
worse.

I just would rather the work around be a bit cleaner and/or
use less

Re: [PATCH] block: Fix secure erase

2016-08-15 Thread Shaun Tancheff
 if (bio_op(bio) == REQ_OP_SECURE_ERASE)
> +   return 1;
> +
> if (bio_op(bio) == REQ_OP_WRITE_SAME)
> return 1;
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 2c210b6a7bcf..e79055c8b577 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -882,7 +882,7 @@ static inline unsigned int blk_rq_cur_sectors(const 
> struct request *rq)
>  static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
>  int op)
>  {
> -   if (unlikely(op == REQ_OP_DISCARD))
> +   if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE))
> return min(q->limits.max_discard_sectors, UINT_MAX >> 9);
>
> if (unlikely(op == REQ_OP_WRITE_SAME))
> @@ -913,7 +913,9 @@ static inline unsigned int blk_rq_get_max_sectors(struct 
> request *rq,
> if (unlikely(rq->cmd_type != REQ_TYPE_FS))
> return q->limits.max_hw_sectors;
>
> -   if (!q->limits.chunk_sectors || (req_op(rq) == REQ_OP_DISCARD))
> +   if (!q->limits.chunk_sectors ||
> +   req_op(rq) == REQ_OP_DISCARD ||
> +   req_op(rq) == REQ_OP_SECURE_ERASE)
> return blk_queue_get_max_sectors(q, req_op(rq));
>
> return min(blk_max_size_offset(q, offset),
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 7598e6ca817a..dbafc5df03f3 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -223,7 +223,7 @@ static void __blk_add_trace(struct blk_trace *bt, 
> sector_t sector, int bytes,
> what |= MASK_TC_BIT(op_flags, META);
> what |= MASK_TC_BIT(op_flags, PREFLUSH);
> what |= MASK_TC_BIT(op_flags, FUA);
> -   if (op == REQ_OP_DISCARD)
> +   if (op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE)
> what |= BLK_TC_ACT(BLK_TC_DISCARD);
> if (op == REQ_OP_FLUSH)
> what |= BLK_TC_ACT(BLK_TC_FLUSH);
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  
> https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=DQIBAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r=Wg5NqlNlVTT7Ugl8V50qIHLe856QW0qfG3WVYGOrWzA&m=Y15JQ82w_sRbTpLbwWZih05XQoGHNqYfhx0M_42AepQ&s=LMrzTEwBuBaQO9U9V-bZnGTGnpBfUXCdpjig4yyXvDM&e=



-- 
Shaun Tancheff


Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-14 Thread Shaun Tancheff
On Tue, Aug 9, 2016 at 1:47 AM, Hannes Reinecke  wrote:
> On 08/05/2016 10:35 PM, Shaun Tancheff wrote:
>> On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal  
>> wrote:
>>> Hannes, Shaun,
>>>
>>> Let me add some more comments.
>>>
>>>> On Aug 2, 2016, at 23:35, Hannes Reinecke  wrote:
>>>>
>>>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>>>>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig  wrote:
>>>>>>
>>>>>> Can you please integrate this with Hannes series so that it uses
>>>>>> his cache of the zone information?
>>>>>
>>>>> Adding Hannes and Damien to Cc.
>>>>>
>>>>> Christoph,
>>>>>
>>>>> I can make a patch the marshal Hannes' RB-Tree into to a block report, 
>>>>> that is
>>>>> quite simple. I can even have the open/close/reset zone commands update 
>>>>> the
>>>>> RB-Tree .. the non-private parts anyway. I would prefer to do this around 
>>>>> the
>>>>> CONFIG_SD_ZBC support, offering the existing type of patch for setups 
>>>>> that do
>>>>> not need the RB-Tree to function with zoned media.
>>
>> I have posted patches to integrate with the zone cache, hopefully they
>> make sense.
>>
> [ .. ]
>>>> I have thought about condensing the RB tree information, but then I
>>>> figured that for 'real' SMR handling we cannot assume all zones are of
>>>> fixed size, and hence we need all the information there.
>>>> Any condensing method would assume a given structure of the zones, which
>>>> the standard just doesn't provide.
>>>> Or am I missing something here?

Of course you can condense the zone cache without loosing any
information. Here is the layout I used ... I haven't update the patch
to the latest posted patches but this is the basic idea.

[It was originally done as a follow on of making your zone cache work
 with Seagate's HA drive. I did not include the wp-in-arrays patch
 along with the HA drive support that I sent you in May as you were
 quite terse about RB trees when I tried to discuss this approach with
 you at Vault]

struct blk_zone {
unsigned type:4;
unsigned state:5;
unsigned extra:7;
unsigned wp:40;
void *private_data;
};

struct contiguous_wps {
u64 start_lba;
u64 last_lba; /* or # of blocks */
u64 zone_size; /* size in blocks */
unsigned is_zoned:1;
u32 zone_count;
spinlock_t lock;
struct blk_zone zones[0];
};

struct zone_wps {
u32 wps_count;
struct contiguous_wps **wps;
};

Then in struct request_queue
-struct rb_root zones;
+   struct struct zone_wps *zones;

For each contiguous chunk of zones you need a descriptor. In the current
drives you need 1 or 2 descriptors.

Here a conventional drive is encapsulated as zoned media with one
drive sized conventional zone.

I have not spent time building an ad-hoc LVM comprised of zoned and
conventional media so it's not all ironed out yet.
I think you can see the advantage of being able to put conventional space
anywhere you would like to work around zoned media not being laid out
the the best manner for your setup.

Yes things start to break down if every other zone is a different size ..

The point being that even with supporting zones that order 48 bytes.
in size this saves a lot of space with no loss of information.
I still kind of prefer pushing blk_zone down to a u32 by reducing
the max zone size and dropping the private_data ... but that may
be going a bit too far.

blk_lookup_zone then has an [unfortunate] signature change:


/**
 * blk_lookup_zone() - Lookup zones
 * @q: Request Queue
 * @sector: Location to lookup
 * @start: Starting location zone (OUT: Required)
 * @len: Length of zone (OUT: Required)
 * @lock: Spinlock of zones (OUT: Required)
 */
struct blk_zone *blk_lookup_zone(struct request_queue *q, sector_t sector,
 sector_t *start, sector_t *len,
 spinlock_t **lock)
{
int iter;
struct blk_zone *bzone = NULL;
struct zone_wps *zi = q->zones;

*start = 0;
*len = 0;
*lock = NULL;

if (!q->zones)
goto out;

for (iter = 0; iter < zi->wps_count; iter++) {
if (sector >= zi->wps[iter]->start_lba &&
sector <  zi->wps[iter]->last_lba) {
struct contiguous_wps *wp = zi->wps[iter];
u64 index = (sector - wp->start_lba) / wp->zone_size;


[PATCH] Update WRITE_SAME timeout in sd_setup_discard_cmnd

2016-08-11 Thread Shaun Tancheff
In sd_setup_discard_cmnd() there are a some discard
methods that fall back to using WRITE_SAME. It
appears that those paths using WRITE_SAME should
also use the SD_WRITE_SAME_TIMEOUT instead of the
default SD_TIMEOUT.

Signed-off-by: Shaun Tancheff 
---
I don't have a use case that breaks the current code.
It just seems to me that setups for discard and
write same should be consistent.
---
 drivers/scsi/sd.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index d3e852a..3c15f3a 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -722,6 +722,8 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
if (!page)
return BLKPREP_DEFER;
 
+   rq->timeout = SD_TIMEOUT;
+
switch (sdkp->provisioning_mode) {
case SD_LBP_UNMAP:
buf = page_address(page);
@@ -746,6 +748,7 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
 
len = sdkp->device->sector_size;
+   rq->timeout = SD_WRITE_SAME_TIMEOUT;
break;
 
case SD_LBP_WS10:
@@ -758,6 +761,7 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
put_unaligned_be16(nr_sectors, &cmd->cmnd[7]);
 
len = sdkp->device->sector_size;
+   rq->timeout = SD_WRITE_SAME_TIMEOUT;
break;
 
default:
@@ -766,8 +770,6 @@ static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
}
 
rq->completion_data = page;
-   rq->timeout = SD_TIMEOUT;
-
cmd->transfersize = len;
cmd->allowed = SD_MAX_RETRIES;
 
-- 
2.8.1



Re: [PATCH v5 2/2] Add support for SCT Write Same

2016-08-10 Thread Shaun Tancheff
On Wed, Aug 10, 2016 at 6:30 AM, Tom Yan  wrote:
> On 10 August 2016 at 14:31, Tom Yan  wrote:
>> I don't really know about SCT Write Same but there is one concern I
>> could I think of.
>>
>> libata's SATL would report a maximum write same length base on the
>> number of sectors a one-block TRIM payload can describe at most, which
>> is 65535 * 64 = 4194240 (see ata_scsiop_inq_b0 in libata-scsi.c). If
>> the drive does not support TRIM, it will not report such length. That
>> is technically fine, as per SBC standard, and I suppose the SCSI disk
>> driver would use SD_MAX_WS16_BLOCKS = 0x7f (8388607).
>
> Actually it will use SD_MAX_WS10_BLOCKS = 0x (65535) in such case.
> See sd_config_write_same() in sd.c. So if the device support TRIM,
> each SCT Write Same will cover 4194240 logical sectors; if it does
> not, each will cover 65535 logical sectors. In that case, perhaps we
> should report the same value in the Maximum Write Same Length field
> when (only) SCT Write Same is supported? (Not the Optimal Unmap
> Granularity field though).

You are correct in that we can advertise the larger limit in
ata_scsi_dev_config() when only SCT write same is supported
rather than fall back to WS10.

TRIM is bound by an interface maximum. You can only stuff 64 entries
of a 16 bit length followed by 48 bit lba into a 512 byte block.

SCT is not restricted (you can wipe an entire drive) however there
is a practical limit in that I have coded the SCT to operate
in the foreground so the command could timeout depending
on how fast the media can write.

On my machine the default timeout is 30s so to clear 4194240 (16G):
  30s -> 547 MB/s
  60s -> 274 MB/s
  90s -> 183 MB/s
120s -> 137 MB/s

So for my drives 8G and 30s or 16G and 60s is fine.
For older or slow drives 4G and 30s should be fine.

I really am not sure what would be considered the correct
solution though. I believe that the WRITE SAME defaults
are currently being chosen around physical limits.

We could reduce the trim to 16 entries when SCT is available and
bump SCT to the same 16 * 63335 maximum?

I think we can also bump the command timeout for WRITE SAME?

Suggestions are welcome.
-- 
Shaun Tancheff


Re: [PATCH v5 2/2] Add support for SCT Write Same

2016-08-10 Thread Shaun Tancheff
On Wed, Aug 10, 2016 at 11:50 AM, Tom Yan  wrote:
> On 10 August 2016 at 14:34, Shaun Tancheff  wrote:
>>
>> You are correct in that we can advertise the larger limit in
>> ata_scsi_dev_config() when only SCT write same is supported
>> rather than fall back to WS10.
>
> ata_scsi_dev_config()? Not sure if I follow. We should only need to
> report Maximum Write Same Length in the Block Limit VPD
> (ata_scsiop_inq_b0).
>
>>
>> TRIM is bound by an interface maximum. You can only stuff 64 entries
>> of a 16 bit length followed by 48 bit lba into a 512 byte block.
>
> Well that is actually the minimum. Modern SSDs often support more than
> one-block payload (e.g. 8, 16...).  It's just our SCSI disk driver
> statically limit it to the minimum. Though it allows only 0x /
> 512 = 8388607 (SD_MAX_WS16_BLOCKS) blocks per WRITE SAME (16) command
> anyway, so we can at most allow only a 2-block (well, or 3-block)
> payload.

Ah. Thanks for the clarification.

>> SCT is not restricted (you can wipe an entire drive) however there
>> is a practical limit in that I have coded the SCT to operate
>> in the foreground so the command could timeout depending
>> on how fast the media can write.
>>
>> On my machine the default timeout is 30s so to clear 4194240 (16G):
>
> You are talking about an AF 4Kn drive I suppose? For a 512e drive it
> should be only ~2G.

I stand corrected. Since all the kernel math is 512 byte sectors you are
absolutely correct and this isn't an issue at all.

We should report SD_MAX_WS16_BLOCKS when only SCT
is available and 4194240 when TRIM is available.

You can safely ignore the remainder of my pointless rambling.
Thanks for you patience none the less.

>>   30s -> 547 MB/s
>>   60s -> 274 MB/s
>>   90s -> 183 MB/s
>> 120s -> 137 MB/s
>>
>> So for my drives 8G and 30s or 16G and 60s is fine.
>> For older or slow drives 4G and 30s should be fine.
>>
>> I really am not sure what would be considered the correct
>> solution though. I believe that the WRITE SAME defaults
>> are currently being chosen around physical limits.
>
> Not sure about what WRITE SAME defaults and physical limits you are
> referring to.

Just that the WRITE SAME limit SD_MAX_WS16_BLOCKS is derived
from the request interface as opposed to some other arbitrary limit.

>>
>> We could reduce the trim to 16 entries when SCT is available and
>> bump SCT to the same 16 * 63335 maximum?
>
> I am not sure if that's a good idea. Small TRIM payloads (hence more
> TRIM commands) could lead to noticeable overhead in my experience. But
> if 4194240 blocks is really too many for SCT Write Same in any case, I
> guess we will have to compromise, since the Maximum Write Same Length
> field is shared. (Now it feels unfortunate that we decided to switch
> from UNMAP -> TRIM to WRITE SAME (16) -> TRIM long ago.) The question
> is, do we want the value to stay at 4194240 when SCT Write Same is not
> available?
>
> I have no idea what the value should be. But, given the fact sector
> size seems to matter much in the SCT case, perhaps at the very least,
> we would want to derive the multiplier from that?
>
>>
>> I think we can also bump the command timeout for WRITE SAME?
>
> I have no idea where the timeout comes from. Is it even a thing in the
> kernel (instead of one in the firmware of the drive or the ACS
> standard)?

Oh ... the timeout I was thinking of is this (or defaulted from):
/sys/block/sdX/device/timeout
It's the struct request_queue's 'rq_timeout'
At least that's where my line of thought was going.

I will update the patch to use SD_MAX_WS16_BLOCKS and 4194240 as appropriate.

Regards,
Shaun


Re: [PATCH v5 1/2] Use kmap_atomic when rewriting attached page

2016-08-10 Thread Shaun Tancheff
On Wed, Aug 10, 2016 at 5:56 AM, Tom Yan  wrote:
> On 10 August 2016 at 09:00, Shaun Tancheff  wrote:
>>  static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
>>  {
>> struct ata_taskfile *tf = &qc->tf;
>> struct scsi_cmnd *scmd = qc->scsicmd;
>> struct ata_device *dev = qc->dev;
>> const u8 *cdb = scmd->cmnd;
>> +   struct scatterlist *sg;
>> u64 block;
>> u32 n_block;
>> +   const u32 trmax = ATA_MAX_TRIM_RNUM;
>> u32 size;
>> -   void *buf;
>> u16 fp;
>> u8 bp = 0xff;
>>
>> @@ -3319,10 +3363,9 @@ static unsigned int ata_scsi_write_same_xlat(struct 
>> ata_queued_cmd *qc)
>> if (!scsi_sg_count(scmd))
>> goto invalid_param_len;
>>
>> -   buf = page_address(sg_page(scsi_sglist(scmd)));
>> -
>> -   if (n_block <= 65535 * ATA_MAX_TRIM_RNUM) {
>> -   size = ata_set_lba_range_entries(buf, ATA_MAX_TRIM_RNUM, 
>> block, n_block);
>> +   sg = scsi_sglist(scmd);
>> +   if (n_block <= 0x * cmax) {
>
> Although this got moved and corrected in the next patch, but perhaps
> you should still correct the `cmax` here, which should be `trmax`.

You are correct. I will fix this up. Thanks!

>> +   size = ata_format_dsm_trim_descr(sg, trmax, block, n_block);
>> } else {
>> fp = 2;
>> goto invalid_fld;



-- 
Shaun Tancheff


Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-09 Thread Shaun Tancheff
happily merge BIOs across zone 
> boundaries
> and the discard code will not split and align calls on the zones. But upper 
> layers
> (an FS or a device mapper) can still do all this by themselves if they 
> want/can
> support non-constant zone sizes.
>
> The only exception is drives like the Seagate one with only the last zone of a
> different size. This case is handled exactly as if all zones are the same size
> simply because any operation on the last smaller zone will naturally align as
> the checks of operation size against the drive capacity will do the right 
> things.
>
> The ioctls work for all cases (drive with constant zone size or not). This is 
> again
> to allow supporting eventual weird drives at application level. I integrated 
> all
> these ioctl into libzbc block device backend driver and everything is fine. 
> Can't
> tell the difference with direct-to-drive SG_IO accesses. But unlike these, 
> the zone
> ioctls keep the zone information RB-tree cache up to date.
>
>>
>> I will be updating my patchset accordingly.
>
> I need to cleanup my code and rebase on top of 4.8-rc1. Let me do this and I 
> will send
> everything for review. If you have any comment on the above, please let me 
> know and
> I will be happy to incorporate changes.
>
> Best regards.
>
>
> 
> Damien Le Moal, Ph.D.
> Sr. Manager, System Software Group, HGST Research,
> HGST, a Western Digital brand
> damien.lem...@hgst.com
> (+81) 0466-98-3593 (ext. 513593)
> 1 kirihara-cho, Fujisawa,
> Kanagawa, 252-0888 Japan
> www.hgst.com
>
> Western Digital Corporation (and its subsidiaries) E-mail Confidentiality 
> Notice & Disclaimer:
>
> This e-mail and any files transmitted with it may contain confidential or 
> legally privileged information of WDC and/or its affiliates, and are intended 
> solely for the use of the individual or entity to which they are addressed. 
> If you are not the intended recipient, any disclosure, copying, distribution 
> or any action taken or omitted to be taken in reliance on it, is prohibited. 
> If you have received this e-mail in error, please notify the sender 
> immediately and delete the e-mail in its entirety from your system.
>



-- 
Shaun Tancheff


Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-09 Thread Shaun Tancheff
On Tue, Aug 9, 2016 at 1:47 AM, Hannes Reinecke  wrote:
> On 08/05/2016 10:35 PM, Shaun Tancheff wrote:
>> On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal  
>> wrote:
>>>> On Aug 2, 2016, at 23:35, Hannes Reinecke  wrote:
>>>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>>>>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig  wrote:

[trim]
>> Also the zone report is 'slow' in that there is an overhead for the
>> report itself but
>> the number of zones per query can be quite large so 4 or 5 I/Os that
>> run into the
>> hundreds if milliseconds to cache the entire drive isn't really unworkable 
>> for
>> something that is used infrequently.
>>
> No, surely not.
> But one of the _big_ advantages for the RB tree is blkdev_discard().
> Without the RB tree any mkfs program will issue a 'discard' for every
> sector. We will be able to coalesce those into one discard per zone, but
> we still need to issue one for _every_ zone.
> Which is (as indicated) really slow, and easily takes several minutes.
> With the RB tree we can short-circuit discards to empty zones, and speed
> up processing time dramatically.
> Sure we could be moving the logic into mkfs and friends, but that would
> require us to change the programs and agree on a library (libzbc?) which
> should be handling that.

Adding an additional library dependency seems overkill for a program
that is already doing ioctls and raw block I/O ... but I would leave that
up to each file system. As it sits issuing the ioctl and walking the array
of data returned [see blkreport.c] is already quite trivial.

I believe the goal here is for F2FS, and perhaps NILFS? to "just
work" with the DISCARD to Reset WP and zone cache in place.

Still quite skeptical about other common file systems
"just working" without their respective mkfs et. al. being
zone aware and handling the topology of the media at mkfs time.
Perhaps there is something I am unaware of?

[trim]

>> I can add finish zone ... but I really can't think of a use for it, myself.
>>
> Which is not the point. The standard defines this, so clearly someone
> found it a reasonable addendum. So let's add this for completeness.

Agreed and queued for the next version.

Regards,
Shaun


[PATCH v5 2/2] Add support for SCT Write Same

2016-08-09 Thread Shaun Tancheff
SATA drives may support write same via SCT. This is useful
for setting the drive contents to a specific pattern (0's).

Translate a SCSI WRITE SAME command to be either a DSM TRIM command or
an SCT Write Same command.

Based on the UNMAP flag:
  - When set translate to DSM TRIM
  - When not set translate to SCT Write Same

Signed-off-by: Shaun Tancheff 
---
v5:
 - Addressed review comments
 - Report support for ZBC only for zoned devices.
 - kmap page during rewrite
 - Fix unmap set to require trim or error, if not unmap then sct write
   same or error.
v4:
 - Added partial MAINTENANCE_IN opcode simulation
 - Dropped all changes in drivers/scsi/*
 - Changed to honor the UNMAP flag -> TRIM, no UNMAP -> SCT.
v3:
 - Demux UNMAP/TRIM from WRITE SAME
v2:
 - Remove fugly ata hacking from sd.c
---
 drivers/ata/libata-scsi.c | 189 +++---
 include/linux/ata.h   |  43 +++
 2 files changed, 205 insertions(+), 27 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index a71067a..99b0e6c 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1159,8 +1159,6 @@ static void ata_scsi_sdev_config(struct scsi_device *sdev)
 {
sdev->use_10_for_rw = 1;
sdev->use_10_for_ms = 1;
-   sdev->no_report_opcodes = 1;
-   sdev->no_write_same = 1;
 
/* Schedule policy is determined by ->qc_defer() callback and
 * it needs to see every deferred qc.  Set dev_blocked to 1 to
@@ -3325,6 +3323,41 @@ static unsigned int ata_format_dsm_trim_descr(struct 
scatterlist *sg, u32 num,
return used_bytes;
 }
 
+/**
+ * ata_format_dsm_trim_descr() - SATL Write Same to ATA SCT Write Same
+ * @sg: Scatter / Gather list attached to command.
+ * @lba: Starting sector
+ * @num: Number of bytes to be zero'd.
+ *
+ * Rewrite the WRITE SAME descriptor to be an SCT Write Same formatted
+ * descriptor.
+ * NOTE: Writes a pattern (0's) in the foreground.
+ *   Large write-same requents can timeout.
+ */
+static void ata_format_sct_write_same(struct scatterlist *sg, u64 lba, u64 num)
+{
+   void *ptr = kmap_atomic(sg_page(sg));
+   u16 *sctpg = ptr + sg->offset;
+
+   put_unaligned_le16(0x0002,  &sctpg[0]); /* SCT_ACT_WRITE_SAME */
+   put_unaligned_le16(0x0101,  &sctpg[1]); /* WRITE PTRN FG */
+   put_unaligned_le64(lba, &sctpg[2]);
+   put_unaligned_le64(num, &sctpg[6]);
+   put_unaligned_le32(0u,  &sctpg[10]);
+
+   kunmap_atomic(ptr);
+}
+
+/**
+ * ata_scsi_write_same_xlat() - SATL Write Same to ATA SCT Write Same
+ * @qc: Command to be translated
+ *
+ * Translate a SCSI WRITE SAME command to be either a DSM TRIM command or
+ * an SCT Write Same command.
+ * Based on WRITE SAME has the UNMAP flag
+ *   When set translate to DSM TRIM
+ *   When clear translate to SCT Write Same
+ */
 static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
 {
struct ata_taskfile *tf = &qc->tf;
@@ -3338,6 +3371,7 @@ static unsigned int ata_scsi_write_same_xlat(struct 
ata_queued_cmd *qc)
u32 size;
u16 fp;
u8 bp = 0xff;
+   u8 unmap = cdb[1] & 0x8;
 
/* we may not issue DMA commands if no DMA mode is set */
if (unlikely(!dev->dma_mode))
@@ -3350,10 +3384,23 @@ static unsigned int ata_scsi_write_same_xlat(struct 
ata_queued_cmd *qc)
scsi_16_lba_len(cdb, &block, &n_block);
 
/* for now we only support WRITE SAME with the unmap bit set */
-   if (unlikely(!(cdb[1] & 0x8))) {
-   fp = 1;
-   bp = 3;
-   goto invalid_fld;
+   if (unmap) {
+   if ((dev->horkage & ATA_HORKAGE_NOTRIM) ||
+   !ata_id_has_trim(dev->id)) {
+   fp = 1;
+   bp = 3;
+   goto invalid_fld;
+   }
+   if (n_block > 0x * trmax) {
+   fp = 2;
+   goto invalid_fld;
+   }
+   } else {
+   if (!ata_id_sct_write_same(dev->id)) {
+   fp = 1;
+   bp = 3;
+   goto invalid_fld;
+   }
}
 
/*
@@ -3364,30 +3411,42 @@ static unsigned int ata_scsi_write_same_xlat(struct 
ata_queued_cmd *qc)
goto invalid_param_len;
 
sg = scsi_sglist(scmd);
-   if (n_block <= 0x * cmax) {
+   if (unmap) {
size = ata_format_dsm_trim_descr(sg, trmax, block, n_block);
+   if (ata_ncq_enabled(dev) && ata_fpdma_dsm_supported(dev)) {
+   /* Newer devices support queued TRIM commands */
+   tf->protocol = ATA_PROT_NCQ;
+   tf->command = ATA_CMD_FPDMA_SEND;
+   tf->hob_nsect = ATA_SUBCMD_FPDMA_SEND_

[PATCH v5 1/2] Use kmap_atomic when rewriting attached page

2016-08-09 Thread Shaun Tancheff
The current SATL for WRITE_SAME does not protect against misaligned
pages. Additionally the associated page should also kmap'd when
being modified.

Signed-off-by: Shaun Tancheff 
---
 v5: Added prep patch to work with non-page aligned scatterlist pages
 and use kmap_atomic() to lock page during modification.

 drivers/ata/libata-scsi.c | 53 ++-
 include/linux/ata.h   | 26 ---
 2 files changed, 48 insertions(+), 31 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index e207b33..a71067a 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3282,16 +3282,60 @@ static unsigned int ata_scsi_pass_thru(struct 
ata_queued_cmd *qc)
return 1;
 }
 
+/**
+ * ata_format_dsm_trim_descr() - SATL Write Same to DSM Trim
+ * @sg: Scatter / Gather list attached to command.
+ * @num: Maximum number of entries (nominally 64).
+ * @sector: Starting sector
+ * @count: Total Range of request
+ *
+ * Rewrite the WRITE SAME descriptor to be a DSM TRIM little-endian formatted
+ * descriptor.
+ *
+ * Upto 64 entries of the format:
+ *   63:48 Range Length
+ *   47:0  LBA
+ *
+ *  Range Length of 0 is ignored.
+ *  LBA's should be sorted order and not overlap.
+ *
+ * NOTE: this is the same format as ADD LBA(S) TO NV CACHE PINNED SET
+ */
+static unsigned int ata_format_dsm_trim_descr(struct scatterlist *sg, u32 num,
+ u64 sector, u32 count)
+{
+   void *ptr = kmap_atomic(sg_page(sg));
+   __le64 *buffer = ptr + sg->offset;
+   u32 i = 0, used_bytes;
+
+   while (i < num) {
+   u64 entry = sector |
+   ((u64)(count > 0x ? 0x : count) << 48);
+   buffer[i++] = __cpu_to_le64(entry);
+   if (count <= 0x)
+   break;
+   count -= 0x;
+   sector += 0x;
+   }
+
+   used_bytes = ALIGN(i * 8, 512);
+   memset(buffer + i, 0, used_bytes - i * 8);
+
+   kunmap_atomic(ptr);
+   return used_bytes;
+}
+
 static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
 {
struct ata_taskfile *tf = &qc->tf;
struct scsi_cmnd *scmd = qc->scsicmd;
struct ata_device *dev = qc->dev;
const u8 *cdb = scmd->cmnd;
+   struct scatterlist *sg;
u64 block;
u32 n_block;
+   const u32 trmax = ATA_MAX_TRIM_RNUM;
u32 size;
-   void *buf;
u16 fp;
u8 bp = 0xff;
 
@@ -3319,10 +3363,9 @@ static unsigned int ata_scsi_write_same_xlat(struct 
ata_queued_cmd *qc)
if (!scsi_sg_count(scmd))
goto invalid_param_len;
 
-   buf = page_address(sg_page(scsi_sglist(scmd)));
-
-   if (n_block <= 65535 * ATA_MAX_TRIM_RNUM) {
-   size = ata_set_lba_range_entries(buf, ATA_MAX_TRIM_RNUM, block, 
n_block);
+   sg = scsi_sglist(scmd);
+   if (n_block <= 0x * cmax) {
+   size = ata_format_dsm_trim_descr(sg, trmax, block, n_block);
} else {
fp = 2;
goto invalid_fld;
diff --git a/include/linux/ata.h b/include/linux/ata.h
index adbc812..45a1d71 100644
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -1071,32 +1071,6 @@ static inline void ata_id_to_hd_driveid(u16 *id)
 #endif
 }
 
-/*
- * Write LBA Range Entries to the buffer that will cover the extent from
- * sector to sector + count.  This is used for TRIM and for ADD LBA(S)
- * TO NV CACHE PINNED SET.
- */
-static inline unsigned ata_set_lba_range_entries(void *_buffer,
-   unsigned num, u64 sector, unsigned long count)
-{
-   __le64 *buffer = _buffer;
-   unsigned i = 0, used_bytes;
-
-   while (i < num) {
-   u64 entry = sector |
-   ((u64)(count > 0x ? 0x : count) << 48);
-   buffer[i++] = __cpu_to_le64(entry);
-   if (count <= 0x)
-   break;
-   count -= 0x;
-   sector += 0x;
-   }
-
-   used_bytes = ALIGN(i * 8, 512);
-   memset(buffer + i, 0, used_bytes - i * 8);
-   return used_bytes;
-}
-
 static inline bool ata_ok(u8 status)
 {
return ((status & (ATA_BUSY | ATA_DRDY | ATA_DF | ATA_DRQ | ATA_ERR))
-- 
2.8.1



[PATCH v5 0/2] Add support for SCT Write Same

2016-08-09 Thread Shaun Tancheff
At some point the method of issuing Write Same for ATA drives changed.
Currently write same is commonly available via SCT so expose the SCT
capabilities and use SCT Write Same when it is available.

This is useful for zoned based media that prefers to support discard
with lbprz set, aka discard zeroes data by mapping discard operations to
reset write pointer operations. Conventional zones that do not support
reset write pointer can still honor the discard zeroes data by issuing
a write same over the zone.

It may also be nice to know if various controllers that currently
disable WRITE SAME will work with the SCT Write Same code path:
  aacraid, arcmsr, megaraid, 3w-9xxx, 3w-sas, 3w-, gdth, hpsa, ips,
  megaraid, pmcraid, storvsc_drv

This patch against v4.8-rc1 is also at 

https://github.com/stancheff/linux/tree/v4.8-rc1+ws.v5

g...@github.com:stancheff/linux.git v4.8-rc1+ws.v5

Shaun Tancheff (2):
  Use kmap_atomic when rewriting attached page
  Add support for SCT Write Same

 drivers/ata/libata-scsi.c | 240 --
 include/linux/ata.h   |  69 -
 2 files changed, 252 insertions(+), 57 deletions(-)

-- 
2.8.1



Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-05 Thread Shaun Tancheff
On Tue, Aug 2, 2016 at 8:29 PM, Damien Le Moal  wrote:
> Hannes, Shaun,
>
> Let me add some more comments.
>
>> On Aug 2, 2016, at 23:35, Hannes Reinecke  wrote:
>>
>> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig  wrote:
>>>>
>>>> Can you please integrate this with Hannes series so that it uses
>>>> his cache of the zone information?
>>>
>>> Adding Hannes and Damien to Cc.
>>>
>>> Christoph,
>>>
>>> I can make a patch the marshal Hannes' RB-Tree into to a block report, that 
>>> is
>>> quite simple. I can even have the open/close/reset zone commands update the
>>> RB-Tree .. the non-private parts anyway. I would prefer to do this around 
>>> the
>>> CONFIG_SD_ZBC support, offering the existing type of patch for setups that 
>>> do
>>> not need the RB-Tree to function with zoned media.

I have posted patches to integrate with the zone cache, hopefully they
make sense.

>>>
>>> I do still have concerns with the approach which I have shared in smaller
>>> forums but perhaps I have to bring them to this group.
>>>
>>> First is the memory consumption. This isn't really much of a concern for 
>>> large
>>> servers with few drives but I think the embedded NAS market will grumble as
>>> well as the large data pods trying to stuff 300+ drives in a chassis.
>>>
>>> As of now the RB-Tree needs to hold ~3 zones.
>>> sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields
>>> around 3.5 MB per zoned drive attached.
>>> Which is fine if it is really needed, but most of it is fixed information
>>> and it can be significantly condensed (I have proposed 8 bytes per zone held
>>> in an array as more than adequate). Worse is that the crucial piece of
>>> information, the current wp needed for scheduling the next write, is mostly
>>> out of date because it is updated only after the write completes and zones
>>> being actively written to must work off of the last location / size that was
>>> submitted, not completed. The work around is for that tracking to be handled
>>> in the private_data member. I am not saying that updating the wp on
>>> completing a write isn’t important, I am saying that the bi_end_io hook is
>>> the existing hook that works just fine.
>>>
>> Which _actually_ is not true; with my patches I'll update the write
>> pointer prior to submit the I/O (on the reasoning that most of the time
>> I/O will succeed) and re-read the zone information if an I/O failed.
>> (Which I'll have to do anyway as after an I/O failure the write pointer
>> status is not clearly defined.)

Apologies for my mis-characterization.

>> I have thought about condensing the RB tree information, but then I
>> figured that for 'real' SMR handling we cannot assume all zones are of
>> fixed size, and hence we need all the information there.
>> Any condensing method would assume a given structure of the zones, which
>> the standard just doesn't provide.
>> Or am I missing something here?
>
> Indeed, the standards do not mandate any particular zone configuration,
> constant zone size, etc. So writing code so that can be handled is certainly
> the right way of doing things. However, if we decide to go forward with
> mapping RESET WRITE POINTER command to DISCARD, then at least a constant
> zone size (minus the last zone as you said) must be assumed, and that
> information can be removed from the entries in the RB tree (as it will be
> saved for the sysfs "zone_size" file anyway. Adding a little code to handle
> that eventual last runt zone with a different size is not a big problem.

>> As for write pointer handling: yes, the write pointer on the zones is
>> not really useful for upper-level usage.
>> Where we do need it is to detect I/O which is crossing the write pointer
>> (eg when doing reads over the entire zone).
>> As per spec you will be getting an I/O error here, so we need to split
>> the I/O on the write pointer to get valid results back.
>
> To be precise here, the I/O splitting will be handled by the block layer
> thanks to the "chunk_sectors" setting. But that relies on a constant zone
> size assumption too.
>
> The RB-tree here is most useful for reads over or after the write pointer as
> this can have different behavior on different drives (URSWRZ bit). The RB-tree
> allows us to hide these differences to upper layers and simplify support at
> those levels

Re: [PATCH 37/45] drivers: use req op accessor

2016-08-04 Thread Shaun Tancheff
On Thu, Aug 4, 2016 at 10:46 AM, Christoph Hellwig  wrote:
> On Wed, Aug 03, 2016 at 07:30:29PM -0500, Shaun Tancheff wrote:
>> I think the translation in loop.c is suspicious here:
>>
>> "if use DIO && not (a flush_flag or discard_flag)"
>> should translate to:
>> "if use DIO && not ((a flush_flag) || op == discard)"
>>
>> But in the patch I read:
>> "if use DIO && ((not a flush_flag) || op == discard)
>>
>> Which would have DIO && discards follow the AIO path?
>
> Indeed.  Sorry for missing out on your patch, I just sent a fix
> in reply to Dave's other report earlier which is pretty similar to
> yours.

No worries. I prefer your switch to a an if conditional here.

-- 
Shaun Tancheff


Re: [PATCH 37/45] drivers: use req op accessor

2016-08-03 Thread Shaun Tancheff
On Wed, Aug 3, 2016 at 6:47 PM, Mike Christie  wrote:
> On 08/03/2016 05:33 PM, Ross Zwisler wrote:
>> On Sun, Jun 5, 2016 at 1:32 PM,   wrote:
>>> From: Mike Christie 
>>>
>>> The req operation REQ_OP is separated from the rq_flag_bits
>>> definition. This converts the block layer drivers to
>>> use req_op to get the op from the request struct.
>>>
>>> Signed-off-by: Mike Christie 
>>> ---
>>>  drivers/block/loop.c  |  6 +++---
>>>  drivers/block/mtip32xx/mtip32xx.c |  2 +-
>>>  drivers/block/nbd.c   |  2 +-
>>>  drivers/block/rbd.c   |  4 ++--
>>>  drivers/block/xen-blkfront.c  |  8 +---
>>>  drivers/ide/ide-floppy.c  |  2 +-
>>>  drivers/md/dm.c   |  2 +-
>>>  drivers/mmc/card/block.c  |  7 +++
>>>  drivers/mmc/card/queue.c  |  6 ++
>>
>> Dave Chinner reported a deadlock with XFS + DAX, which I reproduced
>> and bisected to this commit:
>>
>> commit c2df40dfb8c015211ec55f4b1dd0587f875c7b34
>> Author: Mike Christie 
>> Date:   Sun Jun 5 14:32:17 2016 -0500
>> drivers: use req op accessor
>>
>> Here are the steps to reproduce the deadlock with a BRD ramdisk:
>>
>> mkfs.xfs -f /dev/ram0
>> mount -o dax /dev/ram0 /mnt/scratch
>
> When using ramdisks, we need the attached patch like in your other bug
> report. I think it will fix some hangs people are seeing.
>
> I do not think that it should cause the failure to run issue you saw
> when doing generic/008 and ext2.
>

I think the translation in loop.c is suspicious here:

"if use DIO && not (a flush_flag or discard_flag)"
should translate to:
"if use DIO && not ((a flush_flag) || op == discard)"

But in the patch I read:
"if use DIO && ((not a flush_flag) || op == discard)

Which would have DIO && discards follow the AIO path?

So I would humbly suggest something like the following
(on top of commit c2df40dfb8c015211ec55f4b1dd0587f875c7b34):
[Please excuse the messed up patch format ... gmail eats tabs]

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index b9b737c..0754d83 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1659,8 +1659,9 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
if (lo->lo_state != Lo_bound)
return -EIO;

-   if (lo->use_dio && (!(cmd->rq->cmd_flags & REQ_FLUSH) ||
-   req_op(cmd->rq) == REQ_OP_DISCARD))
+   if (lo->use_dio && !(
+   (cmd->rq->cmd_flags & REQ_FLUSH) ||
+req_op(cmd->rq) == REQ_OP_DISCARD))
cmd->use_aio = true;
else
cmd->use_aio = false;

-- 
Shaun Tancheff


[PATCH 2/2] Enable support for Seagate HostAware drives (testing).

2016-08-03 Thread Shaun Tancheff
Seagate drives report a SAME code of 0 due to having:
  - Zones of different types (CMR zones at the low LBA space).
  - Zones of different size (A terminating 'runt' zone in the high lba space).

Support loading the zone topology into the sd_zbc zone cache.

Signed-off-by: Shaun Tancheff 

Cc: Hannes Reinecke 
Cc: Damien Le Moal 
---
v1:
 - Updated kernel version / re-sync with Hannes' zac.v3 branch.
---
 drivers/scsi/sd.c |  22 
 drivers/scsi/sd.h |  20 +--
 drivers/scsi/sd_zbc.c | 150 ++
 3 files changed, 155 insertions(+), 37 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 7c38975..5fbc599 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -694,8 +694,13 @@ static void sd_config_discard(struct scsi_disk *sdkp, 
unsigned int mode)
break;
 
case SD_ZBC_RESET_WP:
-   max_blocks = sdkp->unmap_granularity;
q->limits.discard_zeroes_data = 1;
+   q->limits.discard_granularity =
+   sd_zbc_discard_granularity(sdkp);
+
+   max_blocks = min_not_zero(sdkp->unmap_granularity,
+ q->limits.discard_granularity >>
+   ilog2(logical_block_size));
break;
 
case SD_LBP_ZERO:
@@ -1952,13 +1957,12 @@ static int sd_done(struct scsi_cmnd *SCpnt)
good_bytes = blk_rq_bytes(req);
scsi_set_resid(SCpnt, 0);
} else {
-#ifdef CONFIG_SCSI_ZBC
if (op == ZBC_OUT)
/* RESET WRITE POINTER failed */
sd_zbc_update_zones(sdkp,
blk_rq_pos(req),
-   512, true);
-#endif
+   512, SD_ZBC_RESET_WP_ERR);
+
good_bytes = 0;
scsi_set_resid(SCpnt, blk_rq_bytes(req));
}
@@ -2031,7 +2035,6 @@ static int sd_done(struct scsi_cmnd *SCpnt)
good_bytes = blk_rq_bytes(req);
scsi_set_resid(SCpnt, 0);
}
-#ifdef CONFIG_SCSI_ZBC
/*
 * ZBC: Unaligned write command.
 * Write did not start a write pointer position.
@@ -2039,8 +2042,7 @@ static int sd_done(struct scsi_cmnd *SCpnt)
if (sshdr.ascq == 0x04)
sd_zbc_update_zones(sdkp,
blk_rq_pos(req),
-   512, true);
-#endif
+   512, SD_ZBC_WRITE_ERR);
}
break;
default:
@@ -2267,7 +2269,7 @@ static void sd_read_zones(struct scsi_disk *sdkp, 
unsigned char *buffer)
 * supports equal zone sizes.
 */
same = buffer[4] & 0xf;
-   if (same == 0 || same > 3) {
+   if (same > 3) {
sd_printk(KERN_WARNING, sdkp,
  "REPORT ZONES SAME type %d not supported\n", same);
return;
@@ -2279,9 +2281,9 @@ static void sd_read_zones(struct scsi_disk *sdkp, 
unsigned char *buffer)
sdkp->unmap_granularity = zone_len;
blk_queue_chunk_sectors(sdkp->disk->queue,
logical_to_sectors(sdkp->device, zone_len));
-   sd_config_discard(sdkp, SD_ZBC_RESET_WP);
 
-   sd_zbc_setup(sdkp, buffer, SD_BUF_SIZE);
+   sd_zbc_setup(sdkp, zone_len, buffer, SD_BUF_SIZE);
+   sd_config_discard(sdkp, SD_ZBC_RESET_WP);
 }
 
 static void read_capacity_error(struct scsi_disk *sdkp, struct scsi_device 
*sdp,
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 6ae4505..ef6c132 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -283,19 +283,24 @@ static inline void sd_dif_complete(struct scsi_cmnd *cmd, 
unsigned int a)
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+
+#define SD_ZBC_INIT0
+#define SD_ZBC_RESET_WP_ERR1
+#define SD_ZBC_WRITE_ERR   2
+
 #ifdef CONFIG_SCSI_ZBC
 
 extern int sd_zbc_report_zones(struct scsi_disk *, unsigned char *, int,
   sector_t, enum zbc_zone_reporting_options, bool);
-extern int sd_zbc_setup(struct scsi_disk *, char *, int);
+extern int sd_zbc_setup(struct scsi_disk *, u64 zlen, char *buf, int buf_len);
 extern void sd_zbc_remove(struct scsi_disk *);
 extern void sd_zbc_reset_zones(struct scsi_disk *);
 extern int sd_zbc_setup_discard(struct scsi_disk *, struct request *,
sector_t, unsigned int);
 extern int sd_zbc_setup_read_write(struct scsi_disk *, struct request *,
  

[PATCH 1/2] bio/zbc support for zone cache

2016-08-03 Thread Shaun Tancheff
Zone actions (Open/Close/Reset) update zone cache on success.

Add helpers for
- Zone actions to update zone cache
- Zone report to translate cache to ZBC format structs

Update blkreport to pull from zone cache instead of querying media.

Added open explicit and closed states for zone cache

Signed-off-by: Shaun Tancheff 

Cc: Hannes Reinecke 
Cc: Damien Le Moal 
Cc: Dan Williams 
Cc: Sagi Grimberg 
Cc: Mike Christie 
Cc: Toshi Kani 
Cc: Kent Overstreet 
Cc: Ming Lei 

---
 block/blk-lib.c|   3 +-
 block/blk-zoned.c  | 190 +
 block/ioctl.c  |  39 +++---
 include/linux/blkdev.h |  14 +++-
 4 files changed, 234 insertions(+), 12 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 6dcdcbf..92898ec 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -6,7 +6,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include "blk.h"
 
@@ -358,6 +357,8 @@ int blkdev_issue_zone_action(struct block_device *bdev, 
unsigned int op,
bio_set_op_attrs(bio, op, op_flags);
ret = submit_bio_wait(bio);
bio_put(bio);
+   if (ret == 0)
+   update_zone_state(bdev, sector, op);
return ret;
 }
 EXPORT_SYMBOL(blkdev_issue_zone_action);
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 975e863..799676b 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -68,3 +68,193 @@ void blk_drop_zones(struct request_queue *q)
q->zones = RB_ROOT;
 }
 EXPORT_SYMBOL_GPL(blk_drop_zones);
+
+static void __set_zone_state(struct blk_zone *zone, int op)
+{
+   if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
+   return;
+
+   switch (op) {
+   case REQ_OP_ZONE_OPEN:
+   zone->state = BLK_ZONE_OPEN_EXPLICIT;
+   break;
+   case REQ_OP_ZONE_CLOSE:
+   zone->state = BLK_ZONE_CLOSED;
+   break;
+   case REQ_OP_ZONE_RESET:
+   zone->wp = zone->start;
+   break;
+   default:
+   WARN_ONCE(1, "%s: invalid op code: %u\n", __func__, op);
+   }
+}
+
+void update_zone_state(struct block_device *bdev, sector_t lba, unsigned int 
op)
+{
+   struct request_queue *q = bdev_get_queue(bdev);
+   struct blk_zone *zone = NULL;
+
+   if (lba == ~0ul) {
+   struct rb_node *node;
+
+   for (node = rb_first(&q->zones); node; node = rb_next(node)) {
+   zone = rb_entry(node, struct blk_zone, node);
+   __set_zone_state(zone, op);
+   }
+   return;
+   }
+   zone = blk_lookup_zone(q, lba);
+   if (zone)
+   __set_zone_state(zone, op);
+}
+EXPORT_SYMBOL_GPL(update_zone_state);
+
+void bzrpt_fill(struct block_device *bdev, struct bdev_zone_report *bzrpt,
+   size_t sz, sector_t lba, u8 opt)
+{
+   u64 clen = ~0ul;
+   struct blk_zone *zone = NULL;
+   struct rb_node *node = NULL;
+   struct request_queue *q = bdev_get_queue(bdev);
+   u32 max_entries = (sz - sizeof(struct bdev_zone_report))
+   /  sizeof(struct bdev_zone_descriptor);
+   u32 entry;
+   int len_diffs = 0;
+   int type_diffs = 0;
+   u8 ctype;
+   u8 same = 0;
+
+   zone = blk_lookup_zone(q, lba);
+   if (zone)
+   node = &zone->node;
+
+   for (entry = 0;
+entry < max_entries && node;
+entry++, node = rb_next(node)) {
+   u64 wp;
+   u8 cond = 0;
+   u8 flgs = 0;
+
+   zone = rb_entry(node, struct blk_zone, node);
+   if (blk_zone_is_cmr(zone))
+   wp = zone->start + zone->len;
+   else
+   wp = zone->wp;
+
+   bzrpt->descriptors[entry].lba_start = cpu_to_be64(zone->start);
+   bzrpt->descriptors[entry].length = cpu_to_be64(zone->len);
+   bzrpt->descriptors[entry].type = zone->type;
+   bzrpt->descriptors[entry].lba_wptr = cpu_to_be64(wp);
+
+   switch (zone->state) {
+   case BLK_ZONE_NO_WP:
+   cond = ZCOND_CONVENTIONAL;
+   break;
+   case BLK_ZONE_OPEN:
+   cond = ZCOND_ZC2_OPEN_IMPLICIT;
+   break;
+   case BLK_ZONE_OPEN_EXPLICIT:
+   cond = ZCOND_ZC3_OPEN_EXPLICIT;
+   break;
+   case BLK_ZONE_CLOSED:
+   cond = ZCOND_ZC4_CLOSED;
+   break;
+   case BLK_ZONE_READONLY:
+   cond = ZCOND_ZC6_READ_ONLY;
+   break;
+   case BLK_ZONE_OFFLINE:
+   cond = ZCOND_ZC7_OFFLINE;
+   break;
+   default:
+  

[PATCH 0/2] Bio flags use zone cache when available.

2016-08-03 Thread Shaun Tancheff
Hi,

As per Christoph's request this patch incorporates Hannes' cache of zone
information.

The approach is to provide the same blkreport format as would be done 
when BLK_DEV_ZONED is not enabled via an ioctl.
Reset WP, Open, and Close zone will update the zone cache.

Using blkdev_issue_zone_report() from within the kernel the report
will still come directly from the media. I am considering an option
for the ioctl to force pulling the zone report from the media when
BLK_DEV_ZONED is enabled and using the resulting information to
refresh the zone cache. It is not clear that this is useful so I
have not included that in this patch.

The follow-on patch just enables the zone cache to be populated
using a Seagate's HostAware drive.

This series is based off of Linus current tip post (f38d2e5313f0af..)
and builds on top of the previous series of block layer support:
Add ioctl to issue ZBC/ZAC commands via block layer
Add bio/request flags to issue ZBC/ZAC commands

as well as the series post by Hannes'
sd_zbc: Fix handling of ZBC read after write pointer
sd: Limit messages for ZBC disks capacity change
sd: Implement support for ZBC devices
sd: Implement new RESET_WP provisioning mode
sd: configure ZBC devices
...

Cc: Hannes Reinecke 
Cc: Damien Le Moal 

Shaun Tancheff (2):
  bio/zbc support for zone cache
  Enable support for Seagate HostAware drives (testing).

 block/blk-lib.c|   3 +-
 block/blk-zoned.c  | 190 +
 block/ioctl.c  |  39 +++---
 drivers/scsi/sd.c  |  22 +++---
 drivers/scsi/sd.h  |  20 --
 drivers/scsi/sd_zbc.c  | 150 --
 include/linux/blkdev.h |  14 +++-
 7 files changed, 389 insertions(+), 49 deletions(-)

-- 
2.8.1




sd_zbc: Fix handling of ZBC read after write pointer
sd: Limit messages for ZBC disks capacity change
sd: Implement support for ZBC devices
sd: Implement new RESET_WP provisioning mode
sd: configure ZBC devices
block: Add 'BLK_MQ_RQ_QUEUE_DONE' return value
block: Introduce BLKPREP_DONE
block: Add 'zoned' sysfs queue attribute
block: Implement support for zoned block devices
block: update chunk_sectors in blk_stack_limits()
blk-sysfs: Add 'chunk_sectors' to sysfs attributes
sd: configure ZBC devices
block: Add 'BLK_MQ_RQ_QUEUE_DONE' return value
block: Introduce BLKPREP_DONE
block: Add 'zoned' sysfs queue attribute
block: Implement support for zoned block devices
block: update chunk_sectors in blk_stack_limits()
blk-sysfs: Add 'chunk_sectors' to sysfs attributes


Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-01 Thread Shaun Tancheff
On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig  wrote:
>
> Can you please integrate this with Hannes series so that it uses
> his cache of the zone information?

Adding Hannes and Damien to Cc.

Christoph,

I can make a patch the marshal Hannes' RB-Tree into to a block report, that is
quite simple. I can even have the open/close/reset zone commands update the
RB-Tree .. the non-private parts anyway. I would prefer to do this around the
CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
not need the RB-Tree to function with zoned media.

I do still have concerns with the approach which I have shared in smaller
forums but perhaps I have to bring them to this group.

First is the memory consumption. This isn't really much of a concern for large
servers with few drives but I think the embedded NAS market will grumble as
well as the large data pods trying to stuff 300+ drives in a chassis.

As of now the RB-Tree needs to hold ~3 zones.
sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields
around 3.5 MB per zoned drive attached.
Which is fine if it is really needed, but most of it is fixed information
and it can be significantly condensed (I have proposed 8 bytes per zone held
in an array as more than adequate). Worse is that the crucial piece of
information, the current wp needed for scheduling the next write, is mostly
out of date because it is updated only after the write completes and zones
being actively written to must work off of the last location / size that was
submitted, not completed. The work around is for that tracking to be handled
in the private_data member. I am not saying that updating the wp on
completing a write isn’t important, I am saying that the bi_end_io hook is
the existing hook that works just fine.

This all tails into domain responsability. With the RB-Tree doing half of the
work and the ‘responsible’ domain handling the active path via private_data
why have the split at all? It seems to be a double work to have second object
tracking the first so that I/O scheduling can function.

Finally is the error handling path when the RB-Tree encounters and error it
attempts to requery the drive topology virtually guaranteeing that the
private_data is now out-of-sync with the RB-Tree. Again this is something
that can be better encapsulated in the bi_end_io to be informed of the
failed I/O and schedule the appropriate recovery (including re-querying the
zone information of the affected zone(s)).

Anyway those are my concerns and why I am still reluctant to drop this line of
support. I have incorporated Hannes changes at various points. Hence the
SCT Write Same to attempt to work around some of the flaws in mapping
discard to reset write pointer.

Thanks and Regards,
Shaun

> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  
> https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=CwIBAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r=Wg5NqlNlVTT7Ugl8V50qIHLe856QW0qfG3WVYGOrWzA&m=0ZPyN4vfYZXSmuCmIm3wpExF1K28PYO9KmgcqDsfQBg&s=aiguzw5_op7woZCZ5Qi7c36b16SxiWTJXshN0dG3Xyo&e=



-- 
Shaun Tancheff


Fixup direct bi_rw modifiers

2016-07-30 Thread Shaun Tancheff
bi_rw should be using bio_set_op_attrs to set bi_rw.

Signed-off-by: Shaun Tancheff 

Cc: Chris Mason 
Cc: Josef Bacik 
Cc: David Sterba 
Cc: Mike Christie 
---
Patch is against linux-next tag next-20160729

NOTE: In 4.7 this was not including the 'WRITE' macro so may have
  it may not have been operating as intended.
---
 fs/btrfs/extent_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f67d6a1..720e6ef 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2050,7 +2050,7 @@ int repair_io_failure(struct inode *inode, u64 start, u64 
length, u64 logical,
return -EIO;
}
bio->bi_bdev = dev->bdev;
-   bio->bi_rw = WRITE_SYNC;
+   bio_set_op_attrs(bio, REQ_OP_WRITE, WRITE_SYNC);
bio_add_page(bio, page, length, pg_offset);
 
if (btrfsic_submit_bio_wait(bio)) {
-- 
2.8.1



[PATCH v6 2/2] Add ioctl to issue ZBC/ZAC commands via block layer

2016-07-29 Thread Shaun Tancheff
Add support for ZBC ioctl's
BLKREPORT- Issue Report Zones to device.
BLKOPENZONE  - Issue Zone Action: Open Zone command.
BLKCLOSEZONE - Issue Zone Action: Close Zone command.
BLKRESETZONE - Issue Zone Action: Reset Zone command.

Signed-off-by: Shaun Tancheff 
---
v6:
 - Added GFP_DMA to gfp mask.
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio op's
---
 block/ioctl.c | 110 ++
 include/uapi/linux/blkzoned_api.h |   6 +++
 include/uapi/linux/fs.h   |   1 +
 3 files changed, 117 insertions(+)

diff --git a/block/ioctl.c b/block/ioctl.c
index ed2397f..a2a6c2c 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -194,6 +195,109 @@ int blkdev_reread_part(struct block_device *bdev)
 }
 EXPORT_SYMBOL(blkdev_reread_part);
 
+static int blk_zoned_report_ioctl(struct block_device *bdev, fmode_t mode,
+   void __user *parg)
+{
+   int error = -EFAULT;
+   gfp_t gfp = GFP_KERNEL | GFP_DMA;
+   struct bdev_zone_report_io *zone_iodata = NULL;
+   int order = 0;
+   struct page *pgs = NULL;
+   u32 alloc_size = PAGE_SIZE;
+   unsigned long op_flags = 0;
+   u8 opt = 0;
+
+   if (!(mode & FMODE_READ))
+   return -EBADF;
+
+   zone_iodata = (void *)get_zeroed_page(gfp);
+   if (!zone_iodata) {
+   error = -ENOMEM;
+   goto report_zones_out;
+   }
+   if (copy_from_user(zone_iodata, parg, sizeof(*zone_iodata))) {
+   error = -EFAULT;
+   goto report_zones_out;
+   }
+   if (zone_iodata->data.in.return_page_count > alloc_size) {
+   int npages;
+
+   alloc_size = zone_iodata->data.in.return_page_count;
+   npages = (alloc_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+   pgs = alloc_pages(gfp, ilog2(npages));
+   if (pgs) {
+   void *mem = page_address(pgs);
+
+   if (!mem) {
+   error = -ENOMEM;
+   goto report_zones_out;
+   }
+   order = ilog2(npages);
+   memset(mem, 0, alloc_size);
+   memcpy(mem, zone_iodata, sizeof(*zone_iodata));
+   free_page((unsigned long)zone_iodata);
+   zone_iodata = mem;
+   } else {
+   /* Result requires DMA capable memory */
+   pr_err("Not enough memory available for request.\n");
+   error = -ENOMEM;
+   goto report_zones_out;
+   }
+   }
+   opt = zone_iodata->data.in.report_option;
+   error = blkdev_issue_zone_report(bdev, op_flags,
+   zone_iodata->data.in.zone_locator_lba, opt,
+   pgs ? pgs : virt_to_page(zone_iodata),
+   alloc_size, GFP_KERNEL);
+
+   if (error)
+   goto report_zones_out;
+
+   if (copy_to_user(parg, zone_iodata, alloc_size))
+   error = -EFAULT;
+
+report_zones_out:
+   if (pgs)
+   __free_pages(pgs, order);
+   else if (zone_iodata)
+   free_page((unsigned long)zone_iodata);
+   return error;
+}
+
+static int blk_zoned_action_ioctl(struct block_device *bdev, fmode_t mode,
+ unsigned int cmd, unsigned long arg)
+{
+   unsigned int op = 0;
+
+   if (!(mode & FMODE_WRITE))
+   return -EBADF;
+
+   /*
+* When acting on zones we explicitly disallow using a partition.
+*/
+   if (bdev != bdev->bd_contains) {
+   pr_err("%s: All zone operations disallowed on this device\n",
+   __func__);
+   return -EFAULT;
+   }
+
+   switch (cmd) {
+   case BLKOPENZONE:
+   op = REQ_OP_ZONE_OPEN;
+   break;
+   case BLKCLOSEZONE:
+   op = REQ_OP_ZONE_CLOSE;
+   break;
+   case BLKRESETZONE:
+   op = REQ_OP_ZONE_RESET;
+   break;
+   default:
+   pr_err("%s: Unknown action: %u\n", __func__, cmd);
+   return -EINVAL;
+   }
+   return blkdev_issue_zone_action(bdev, op, 0, arg, GFP_KERNEL);
+}
+
 static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
unsigned long arg, unsigned long flags)
 {
@@ -568,6 +672,12 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, 
unsigned cmd,
case BLKTRACESETUP:
case BLKTRACETEARDOWN:
return blk_trace_ioctl(bdev, cmd, argp);
+   case BLKREPORT:
+   return blk_zoned_report_ioctl(bdev, mode, argp);
+   case BLKOPENZONE:
+   

[PATCH v6 1/2] Add bio/request flags to issue ZBC/ZAC commands

2016-07-29 Thread Shaun Tancheff
Add op flags to access to zone information as well as open, close
and reset zones:
  - REQ_OP_ZONE_REPORT - Query zone information (Report zones)
  - REQ_OP_ZONE_OPEN - Explictly open a zone for writing
  - REQ_OP_ZONE_CLOSE - Explictly close a zone
  - REQ_OP_ZONE_RESET - Reset Write Pointer to start of zone

These op flags can be used to create bio's to control zoned devices
through the block layer.

This is useful for filesystems and device mappers that need explicit
control of zoned devices such as Host Mananged and Host Aware SMR drives,

Report zones is a device read that requires a buffer.

Open, Close and Reset are device commands that have no associated
data transfer. Sending an LBA of ~0 will attempt to operate on all
zones. This is typically used with Reset to wipe a drive as a Reset
behaves similar to TRIM in that all data in the zone(s) is deleted.

The Finish zone command is intentionally not implimented as there is no
current use case for that operation.

Report zones currently defaults to reporting on all zones. It expected
that support for the zone option flag will piggy back on streamid
support. The report option flag is useful as it can reduce the number
of zones in each report, but not critical.

Signed-off-by: Shaun Tancheff 
---
v5:
 - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent
 - In blk-lib fix documentation
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio op's
V3:
 - Rebase on Mike Cristie's separate bio operations
 - Update blkzoned_api.h to include report zones PARTIAL bit.
V2:
 - Changed bi_rw to op_flags clarify sepeartion of bio op from flags.
 - Fixed memory leak in blkdev_issue_zone_report failing to put_bio().
 - Documented opt in blkdev_issue_zone_report.
 - Removed include/uapi/linux/fs.h from this patch.
---
 MAINTAINERS   |   9 ++
 block/blk-lib.c   |  95 +
 drivers/scsi/sd.c | 118 +
 drivers/scsi/sd.h |   1 +
 include/linux/bio.h   |   7 +-
 include/linux/blk_types.h |   6 +-
 include/linux/blkzoned_api.h  |  25 +
 include/uapi/linux/Kbuild |   1 +
 include/uapi/linux/blkzoned_api.h | 214 ++
 9 files changed, 474 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/blkzoned_api.h
 create mode 100644 include/uapi/linux/blkzoned_api.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 771c31c..32f5598 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12785,6 +12785,15 @@ F: Documentation/networking/z8530drv.txt
 F: drivers/net/hamradio/*scc.c
 F: drivers/net/hamradio/z8530.h
 
+ZBC AND ZBC BLOCK DEVICES
+M: Shaun Tancheff 
+W: http://seagate.com
+W: https://github.com/Seagate/ZDM-Device-Mapper
+L: linux-bl...@vger.kernel.org
+S: Maintained
+F: include/linux/blkzoned_api.h
+F: include/uapi/linux/blkzoned_api.h
+
 ZBUD COMPRESSED PAGE ALLOCATOR
 M: Seth Jennings 
 L: linux...@kvack.org
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 083e56f..6dcdcbf 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "blk.h"
 
@@ -266,3 +267,97 @@ int blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
+
+/**
+ * blkdev_issue_zone_report - queue a report zones operation
+ * @bdev:  target blockdev
+ * @op_flags:  extra bio rw flags. If unsure, use 0.
+ * @sector:starting sector (report will include this sector).
+ * @opt:   See: zone_report_option, default is 0 (all zones).
+ * @page:  one or more contiguous pages.
+ * @pgsz:  up to size of page in bytes, size of report.
+ * @gfp_mask:  memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *Issue a zone report request for the sectors in question.
+ */
+int blkdev_issue_zone_report(struct block_device *bdev, unsigned int op_flags,
+sector_t sector, u8 opt, struct page *page,
+size_t pgsz, gfp_t gfp_mask)
+{
+   struct bdev_zone_report *conv = page_address(page);
+   struct bio *bio;
+   unsigned int nr_iovecs = 1;
+   int ret = 0;
+
+   if (pgsz < (sizeof(struct bdev_zone_report) +
+   sizeof(struct bdev_zone_descriptor)))
+   return -EINVAL;
+
+   bio = bio_alloc(gfp_mask, nr_iovecs);
+   if (!bio)
+   return -ENOMEM;
+
+   conv->descriptor_count = 0;
+   bio->bi_iter.bi_sector = sector;
+   bio->bi_bdev = bdev;
+   bio->bi_vcnt = 0;
+   bio->bi_iter.bi_size = 0;
+
+   bio_add_page(bio, page, pgsz, 0);
+   bio_set_op_attrs(bio, REQ_OP_ZONE_REPORT, op_flags);
+   ret = submit_bio_wait(bio);
+
+   /*
+* When our request it nak

[PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-07-29 Thread Shaun Tancheff
Hi Jens,

This series is based on linus' current tip after the merge of 'for-4.8/core'

As Host Aware drives are becoming available we would like to be able
to make use of such drives. This series is also intended to be suitable
for use by Host Managed drives.

ZAC/ZBC drives add new commands for discovering and working with Zones.

This extends the ZAC/ZBC support up to the block layer allowing direct control
by file systems or device mapper targets. Also by deferring the zone handling
to the authoritative subsystem there is an overall lower memory usage for
holding the active zone information as well as clarifying responsible party
for maintaining the write pointer for each active zone.
By way of example a DM target may have several writes in progress. To sector
(or lba) for those writes will each depend on the previous write. While the
drive's write pointer will be updated as writes are completed the DM target
will be maintaining both where the next write should be scheduled from and 
where the write pointer is based on writes completed w/o errors.
Knowing the drive's zone topology enables DM targets and file systems to
extend their block allocation schemes and issue write pointer resets (or 
discards) that are zone aligned.
A perhaps non-obvious approach is that a conventional drive will 
returns a zone report descriptor with a single large conventional zone.

Patches for util-linux can be found here:
https://github.com/Seagate/ZDM-Device-Mapper/tree/master/patches/util-linux

This patch is available here:
https://github.com/stancheff/linux/tree/zbc.bio.v6

g...@github.com:stancheff/linux.git zbc.bio.v6

v6:
 - Fix page alloc to include DMA flag for ioctl.
v5:
 - In sd_setup_zone_action_cmnd, remove unused vars and fix switch indent
 - In blk-lib fix documentation
v4:
 - Rebase on linux-next tag next-20160617.
 - Change bio flags to bio op's
 - Dropped ata16 hackery
V3:
 - Rebase on Mike Cristie's separate bio operations
 - Update blkzoned_api.h to include report zones PARTIAL bit.
 - Use zoned report reserved bit for ata-passthrough flag.

V2:
 - Changed bi_rw to op_flags clarify sepeartion of bio op from flags.
 - Fixed memory leak in blkdev_issue_zone_report failing to put_bio().
 - Documented opt in blkdev_issue_zone_report.
 - Moved include/uapi/linux/fs.h changes to patch 3
 - Fixed commit message for first patch in series.

Shaun Tancheff (2):
  Add bio/request flags to issue ZBC/ZAC commands
  Add ioctl to issue ZBC/ZAC commands via block layer

 MAINTAINERS   |   9 ++
 block/blk-lib.c   |  95 
 block/ioctl.c | 110 +++
 drivers/scsi/sd.c | 118 
 drivers/scsi/sd.h |   1 +
 include/linux/bio.h   |   7 +-
 include/linux/blk_types.h |   6 +-
 include/linux/blkzoned_api.h  |  25 +
 include/uapi/linux/Kbuild |   1 +
 include/uapi/linux/blkzoned_api.h | 220 ++
 include/uapi/linux/fs.h   |   1 +
 11 files changed, 591 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/blkzoned_api.h
 create mode 100644 include/uapi/linux/blkzoned_api.h

-- 
2.8.1



[PATCH v4] Add support for SCT Write Same

2016-07-29 Thread Shaun Tancheff
SATA drives may support write same via SCT. This is useful
for setting the drive contents to a specific pattern (0's).

If UNMAP is not set or TRIM is not available then
fall back to SCT WRITE SAME, if it is available.

In this way it would be possible to mimic lbprz for devices that
support TRIM but fail to zero blocks reliably. For example a
file-system or DM target could issue a write same w/o unmap followed
by an trim.

Signed-off-by: Shaun Tancheff 
---
At some point the method of issuing Write Same for ATA drives changed.
Currently write same is commonly available via SCT so expose the SCT
capabilities and use SCT Write Same when it is available.

This is useful for zoned based media that prefers to support discard
with lbprz set, aka discard zeroes data by mapping discard operations to
reset write pointer operations. Conventional zones that do not support
reset write pointer can still honor the discard zeroes data by issuing
a write same over the zone.

With this patch if UNMAP is set the TRIM support will follow the existing 
code path.
When UNMAP is not set and SCT WRITE SAME support is available the specified
blocks will be zero'd.

This patch is also at
https://github.com/stancheff/linux/tree/v4.7-sct.ws.v4

g...@github.com:stancheff/linux.git v4.7-sct.ws.v4


v4:
 - Added partial MAINTENANCE_IN opcode simulation
 - Dropped all changes in drivers/scsi/*
 - Changed to honor the UNMAP flag -> TRIM, no UNMAP -> SCT.
v3:
 - Demux UNMAP/TRIM from WRITE SAME
v2:
 - Remove fugly ata hacking from sd.c
---
 drivers/ata/libata-scsi.c | 166 --
 include/linux/ata.h   |  43 
 2 files changed, 187 insertions(+), 22 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index bfec66f..6c7a87e 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1159,8 +1159,6 @@ static void ata_scsi_sdev_config(struct scsi_device *sdev)
 {
sdev->use_10_for_rw = 1;
sdev->use_10_for_ms = 1;
-   sdev->no_report_opcodes = 1;
-   sdev->no_write_same = 1;
 
/* Schedule policy is determined by ->qc_defer() callback and
 * it needs to see every deferred qc.  Set dev_blocked to 1 to
@@ -3267,11 +3265,20 @@ static unsigned int ata_scsi_pass_thru(struct 
ata_queued_cmd *qc)
return 1;
 }
 
+/**
+ * ata_scsi_write_same_xlat - Handle WRITE_SAME commands
+ *
+ * @qc: command structure
+ *
+ * Translate WRITE_SAME w/UNMAP set to TRIM/discard request.
+ * Otherwise use SCT WRITE SAME when available.
+ */
 static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
 {
struct ata_taskfile *tf = &qc->tf;
struct scsi_cmnd *scmd = qc->scsicmd;
struct ata_device *dev = qc->dev;
+   struct scatterlist *sg;
const u8 *cdb = scmd->cmnd;
u64 block;
u32 n_block;
@@ -3279,6 +3286,8 @@ static unsigned int ata_scsi_write_same_xlat(struct 
ata_queued_cmd *qc)
void *buf;
u16 fp;
u8 bp = 0xff;
+   u8 unmap = cdb[1] & 0x8;
+   bool use_sct = unmap ? false : true;
 
/* we may not issue DMA commands if no DMA mode is set */
if (unlikely(!dev->dma_mode))
@@ -3290,8 +3299,10 @@ static unsigned int ata_scsi_write_same_xlat(struct 
ata_queued_cmd *qc)
}
scsi_16_lba_len(cdb, &block, &n_block);
 
-   /* for now we only support WRITE SAME with the unmap bit set */
-   if (unlikely(!(cdb[1] & 0x8))) {
+   /*
+* If UNMAP is not set and SCT write same is not available then fail.
+*/
+   if (use_sct && !ata_id_sct_write_same(dev->id)) {
fp = 1;
bp = 3;
goto invalid_fld;
@@ -3304,26 +3315,56 @@ static unsigned int ata_scsi_write_same_xlat(struct 
ata_queued_cmd *qc)
if (!scsi_sg_count(scmd))
goto invalid_param_len;
 
-   buf = page_address(sg_page(scsi_sglist(scmd)));
-   size = ata_set_lba_range_entries(buf, 512, block, n_block);
+   sg = scsi_sglist(scmd);
+   buf = page_address(sg_page(sg)) + sg->offset;
 
-   if (ata_ncq_enabled(dev) && ata_fpdma_dsm_supported(dev)) {
-   /* Newer devices support queued TRIM commands */
-   tf->protocol = ATA_PROT_NCQ;
-   tf->command = ATA_CMD_FPDMA_SEND;
-   tf->hob_nsect = ATA_SUBCMD_FPDMA_SEND_DSM & 0x1f;
-   tf->nsect = qc->tag << 3;
-   tf->hob_feature = (size / 512) >> 8;
-   tf->feature = size / 512;
+   /*
+* if we only have SCT then ignore the state of unmap request
+* a zero the blocks.
+*/
+   if (use_sct) {
+   u16 *sctpg = buf;
+
+   put_unaligned_le16(0x0002,  &sctpg[0]); /* SCT_ACT_WRITE_SAME */
+   put_unaligned_le16(0x0101,  &

[PATCH] Change MPI alloc from GFP_KERNEL to GFP_ATOMIC

2016-07-29 Thread Shaun Tancheff
Running Ubuntu 14.04 for testing against 4.8 merge window I am getting.

One quick fix is to change GFP_KERNEL to GFP_ATOMIC.

BUG: sleeping function called from invalid context at mm/slab.h:393
in_atomic(): 1, irqs_disabled(): 0, pid: 594, name: modprobe
no locks held by modprobe/594.
CPU: 1 PID: 594 Comm: modprobe Not tainted 4.7.0-zdm #64
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
  8ef5d4e5b978 8b5eded3 8ef5d9021940
 8bee5c08 8ef5d4e5b9a0 8b0b5109 8bee5c08
 0189  8ef5d4e5b9c8 8b0b5209
Call Trace:
 [] dump_stack+0x85/0xc2
 [] ___might_sleep+0x179/0x230
 [] __might_sleep+0x49/0x80
 [] kmem_cache_alloc_trace+0x1ba/0x2f0
 [] ? mpi_alloc+0x20/0x80
 [] mpi_alloc+0x20/0x80
 [] mpi_read_raw_from_sgl+0xc7/0x1d0
 [] rsa_verify+0x57/0xd0
 [] ? pkcs1pad_sg_set_buf+0x40/0xb0
 [] pkcs1pad_verify+0xbb/0x100
 [] public_key_verify_signature+0x1c0/0x2a0
 [] ? debug_check_no_locks_freed+0xd0/0x160
 [] public_key_verify_signature_2+0x15/0x20
 [] verify_signature+0x3c/0x50
 [] pkcs7_validate_trust+0x1fa/0x260
 [] verify_pkcs7_signature+0x7d/0x100
 [] mod_verify_sig+0x7c/0xb0
 [] load_module+0x170/0x2ac0
 [] ? __vfs_read+0xbd/0x110
 [] ? ima_post_read_file+0x7d/0xa0
 [] ? kernel_read_file+0x191/0x1b0
 [] SYSC_finit_module+0xc3/0xf0
 [] SyS_finit_module+0xe/0x10
 [] entry_SYSCALL_64_fastpath+0x23/0xc1

Signed-off-by: Shaun Tancheff 
---
 lib/mpi/mpiutil.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/mpi/mpiutil.c b/lib/mpi/mpiutil.c
index 314f4df..420ace3 100644
--- a/lib/mpi/mpiutil.c
+++ b/lib/mpi/mpiutil.c
@@ -31,7 +31,7 @@ MPI mpi_alloc(unsigned nlimbs)
 {
MPI a;
 
-   a = kmalloc(sizeof *a, GFP_KERNEL);
+   a = kmalloc(sizeof *a, GFP_ATOMIC);
if (!a)
return a;
 
@@ -61,7 +61,7 @@ mpi_ptr_t mpi_alloc_limb_space(unsigned nlimbs)
if (!len)
return NULL;
 
-   return kmalloc(len, GFP_KERNEL);
+   return kmalloc(len, GFP_ATOMIC);
 }
 
 void mpi_free_limb_space(mpi_ptr_t a)
@@ -91,14 +91,14 @@ int mpi_resize(MPI a, unsigned nlimbs)
return 0;   /* no need to do it */
 
if (a->d) {
-   p = kmalloc(nlimbs * sizeof(mpi_limb_t), GFP_KERNEL);
+   p = kmalloc(nlimbs * sizeof(mpi_limb_t), GFP_ATOMIC);
if (!p)
return -ENOMEM;
memcpy(p, a->d, a->alloced * sizeof(mpi_limb_t));
kzfree(a->d);
a->d = p;
} else {
-   a->d = kzalloc(nlimbs * sizeof(mpi_limb_t), GFP_KERNEL);
+   a->d = kzalloc(nlimbs * sizeof(mpi_limb_t), GFP_ATOMIC);
if (!a->d)
return -ENOMEM;
}
-- 
2.8.1



[PATCH] Missing bio_put following submit_bio_wait

2016-06-07 Thread Shaun Tancheff
submit_bio_wait() gives the caller an opportunity to examine
struct bio and so expects the caller to issue the put_bio()

This fixes a memory leak reported by a few people in 4.7-rc2
kmemleak report after 9082e87bfbf8 ("block: remove struct bio_batch")

Signed-off-by: Shaun Tancheff 
Tested-by: Catalin Marinas 
Tested-by: Larry fin...@lwfinger.net
Tested-by: David Drysdale 
---
 block/blk-lib.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 23d7f30..9e29dc3 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -113,6 +113,7 @@ int blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
ret = submit_bio_wait(type, bio);
if (ret == -EOPNOTSUPP)
ret = 0;
+   bio_put(bio);
}
blk_finish_plug(&plug);
 
@@ -165,8 +166,10 @@ int blkdev_issue_write_same(struct block_device *bdev, 
sector_t sector,
}
}
 
-   if (bio)
+   if (bio) {
ret = submit_bio_wait(REQ_WRITE | REQ_WRITE_SAME, bio);
+   bio_put(bio);
+   }
return ret != -EOPNOTSUPP ? ret : 0;
 }
 EXPORT_SYMBOL(blkdev_issue_write_same);
@@ -206,8 +209,11 @@ static int __blkdev_issue_zeroout(struct block_device 
*bdev, sector_t sector,
}
}
 
-   if (bio)
-   return submit_bio_wait(WRITE, bio);
+   if (bio) {
+   ret = submit_bio_wait(WRITE, bio);
+   bio_put(bio);
+   return ret;
+   }
return 0;
 }
 
-- 
2.8.1



[PATCH] Fix reported kmemleak

2016-06-07 Thread Shaun Tancheff
This fixes a memory leak reported by a few people in 4.7-rc1

kmemleak report after 9082e87bfbf8 ("block: remove struct bio_batch")

This patch just formalizes the one in this discussion here:
https://lkml.kernel.org/r/20160606112620.ga29...@e104818-lin.cambridge.arm.com

The same issue appears here:
https://lkml.kernel.org/r/20160607102651.ga6...@dhcp12-144.nay.redhat.com

Patch is also available at:

https://github.com/stancheff/linux.git
branch: v4.7-rc2+bio_put

Cc: Christoph Hellwig 
Cc: ax...@kernel.dk
cc: David Drysdale 
Cc: Xiong Zhou 
Cc: Stephen Rothwell 
Cc: linux-n...@vger.kernel.org
Cc: linux-nvd...@ml01.01.org
Cc: Christoph Hellwig 
Cc: Larry Finger 
Cc: Catalin Marinas 
Cc: LKML ,
Cc: Jens Axboe ,
Cc: bart.vanass...@sandisk.com

Shaun Tancheff (1):
  Missing bio_put following submit_bio_wait

 block/blk-lib.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

-- 
2.8.1



Re: kmemleak report after 9082e87bfbf8 ("block: remove struct bio_batch")

2016-06-06 Thread Shaun Tancheff
derstand this code, this chaining may not
> even be essential to struct bio freeing (and I'm investigating the wrong
> path).
>
> --
> Catalin
> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  
> https://urldefense.proofpoint.com/v2/url?u=http-3A__vger.kernel.org_majordomo-2Dinfo.html&d=CwIBAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r=Wg5NqlNlVTT7Ugl8V50qIHLe856QW0qfG3WVYGOrWzA&m=uHYaMtk0CBsO1MWg4alX8kHC6bvXsmWCR9wL8nrX28E&s=4EHwF1z6acQYr9R052hraaTE1KUf6IiXcEmd-CSLV48&e=


I'm pretty sure it is missing a bio_put() after submit_bio_wait().

Please excuse the hack-y patch but I think you need to do something
like this ...
(Note tabs eaten by gmail).

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 23d7f30..9e29dc3 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -113,6 +113,7 @@ int blkdev_issue_discard(struct block_device
*bdev, sector_t sector,
ret = submit_bio_wait(type, bio);
if (ret == -EOPNOTSUPP)
ret = 0;
+   bio_put(bio);
}
blk_finish_plug(&plug);

@@ -165,8 +166,10 @@ int blkdev_issue_write_same(struct block_device
*bdev, sector_t sector,
}
}

-   if (bio)
+   if (bio) {
ret = submit_bio_wait(REQ_WRITE | REQ_WRITE_SAME, bio);
+   bio_put(bio);
+   }
return ret != -EOPNOTSUPP ? ret : 0;
 }
 EXPORT_SYMBOL(blkdev_issue_write_same);
@@ -206,8 +209,11 @@ static int __blkdev_issue_zeroout(struct
block_device *bdev, sector_t sector,
}
}

-   if (bio)
-   return submit_bio_wait(WRITE, bio);
+   if (bio) {
+   ret = submit_bio_wait(WRITE, bio);
+   bio_put(bio);
+   return ret;
+   }
return 0;
 }



-- 
Shaun Tancheff


[PATCH] RAID Cleanup for bio-split

2016-04-23 Thread Shaun Tancheff
It looks like some minor changes slipped through on the RAID.

A couple of checks for REQ_PREFLUSH flag should also check for
bi_op matching REQ_OP_FLUSH.

Wrappers for sync_page_io() are passed READ/WRITE but need to
be passed REQ_OP_READ and REQ_OP_WRITE.

Signed-off-by: Shaun Tancheff 
---
 drivers/md/raid0.c  |  3 ++-
 drivers/md/raid1.c  | 13 +++--
 drivers/md/raid10.c | 21 ++---
 drivers/md/raid5.c  |  3 ++-
 4 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index f95463d..46e9ba8 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -458,7 +458,8 @@ static void raid0_make_request(struct mddev *mddev, struct 
bio *bio)
struct md_rdev *tmp_dev;
struct bio *split;
 
-   if (unlikely(bio->bi_rw & REQ_PREFLUSH)) {
+   if (unlikely(bio->bi_rw & REQ_PREFLUSH) ||
+   unlikely(bio->bi_op == REQ_OP_FLUSH)) {
md_flush_request(mddev, bio);
return;
}
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 2a2c177..f7c0577 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1771,12 +1771,12 @@ static void end_sync_write(struct bio *bio)
 }
 
 static int r1_sync_page_io(struct md_rdev *rdev, sector_t sector,
-   int sectors, struct page *page, int rw)
+   int sectors, struct page *page, int op)
 {
-   if (sync_page_io(rdev, sector, sectors << 9, page, rw, 0, false))
+   if (sync_page_io(rdev, sector, sectors << 9, page, op, 0, false))
/* success */
return 1;
-   if (rw == WRITE) {
+   if (op == REQ_OP_WRITE) {
set_bit(WriteErrorSeen, &rdev->flags);
if (!test_and_set_bit(WantReplacement,
  &rdev->flags))
@@ -1883,7 +1883,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
rdev = conf->mirrors[d].rdev;
if (r1_sync_page_io(rdev, sect, s,
bio->bi_io_vec[idx].bv_page,
-   WRITE) == 0) {
+   REQ_OP_WRITE) == 0) {
r1_bio->bios[d]->bi_end_io = NULL;
rdev_dec_pending(rdev, mddev);
}
@@ -2118,7 +2118,7 @@ static void fix_read_error(struct r1conf *conf, int 
read_disk,
if (rdev &&
!test_bit(Faulty, &rdev->flags))
r1_sync_page_io(rdev, sect, s,
-   conf->tmppage, WRITE);
+   conf->tmppage, REQ_OP_WRITE);
}
d = start;
while (d != read_disk) {
@@ -2130,7 +2130,7 @@ static void fix_read_error(struct r1conf *conf, int 
read_disk,
if (rdev &&
!test_bit(Faulty, &rdev->flags)) {
if (r1_sync_page_io(rdev, sect, s,
-   conf->tmppage, READ)) {
+   conf->tmppage, 
REQ_OP_READ)) {
atomic_add(s, &rdev->corrected_errors);
printk(KERN_INFO
   "md/raid1:%s: read error 
corrected "
@@ -2204,6 +2204,7 @@ static int narrow_write_error(struct r1bio *r1_bio, int i)
}
 
wbio->bi_op = REQ_OP_WRITE;
+   wbio->bi_rw = 0;
wbio->bi_iter.bi_sector = r1_bio->sector;
wbio->bi_iter.bi_size = r1_bio->sectors << 9;
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index c5dc4e4..44e87c2 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1364,8 +1364,7 @@ retry_write:
mbio->bi_bdev = rdev->bdev;
mbio->bi_end_io = raid10_end_write_request;
mbio->bi_op = op;
-   mbio->bi_rw =
-   do_sync | do_fua | do_sec;
+   mbio->bi_rw = do_sync | do_fua | do_sec;
mbio->bi_private = r10_bio;
 
atomic_inc(&r10_bio->remaining);
@@ -1408,8 +1407,7 @@ retry_write:
mbio->bi_bdev = rdev->bdev;
mbio->bi_end_io = raid10_end_write_request;
mbio->bi_op = op;
-   mbio->bi_rw =
-   do_sync | do_fua | do_sec;
+   mbio->bi_rw = do_sync | do_fua | do_sec;
mbio->bi_