Re: [PATCH v11 0/8] blk-mq: Implement runtime power management

2018-09-26 Thread Martin K. Petersen


Jens,

>> One of the pieces that is missing before blk-mq can be made the default
>> is implementing runtime power management support for blk-mq.  This patch
>> series not only implements runtime power management for blk-mq but also
>> fixes a starvation issue in the power management code for the legacy
>> block layer. Please consider this patch series for the upstream kernel.
>
> Thanks Bart, applied for 4.20.

I have made a note to back out the ufs change for 4.20.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v10 2/8] block, scsi: Change the preempt-only flag into a counter

2018-09-25 Thread Martin K. Petersen


Bart,

> The RQF_PREEMPT flag is used for three purposes:
> - In the SCSI core, for making sure that power management requests
>   are executed even if a device is in the "quiesced" state.
> - For domain validation by SCSI drivers that use the parallel port.
> - In the IDE driver, for IDE preempt requests.

> Rename "preempt-only" into "pm-only" because the primary purpose of
> this mode is power management. Since the power management core may but
> does not have to resume a runtime suspended device before performing
> system-wide suspend and since a later patch will set "pm-only" mode as
> long as a block device is runtime suspended, make it possible to set
> "pm-only" mode from more than one context. Since with this change
> scsi_device_quiesce() is no longer idempotent, make that function
> return early if it is called for a quiesced queue.

The SCSI pieces look OK to me...

Acked-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v4 3/3] nvme: use blk API to remap ref tags for IOs with metadata

2018-07-25 Thread Martin K. Petersen


Max,

> Also moved the logic of the remapping to the nvme core driver instead
> of implementing it in the nvme pci driver. This way all the other nvme
> transport drivers will benefit from it (in case they'll implement metadata
> support).

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v4 2/3] block: move dif_prepare/dif_complete functions to block layer

2018-07-25 Thread Martin K. Petersen


Max,

> Currently these functions are implemented in the scsi layer, but their
> actual place should be the block layer since T10-PI is a general data
> integrity feature that is used in the nvme protocol as well. Also, use
> the tuple size from the integrity profile since it may vary between
> integrity types.

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v4 1/3] block: move ref_tag calculation func to the block layer

2018-07-25 Thread Martin K. Petersen


Max,

> Currently this function is implemented in the scsi layer, but it's
> actual place should be the block layer since T10-PI is a general
> data integrity feature that is used in the nvme protocol as well.

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 1/2] block: move dif_prepare/dif_complete functions to block layer

2018-07-23 Thread Martin K. Petersen


Christoph,

>> +void blk_integrity_dif_prepare(struct request *rq, u8 protection_type,
>> +   u32 ref_tag)
>> +{
>
> Maybe call this blk_t10_pi_prepare?

The rest of these functions have a blk_integrity_ prefix. So either
stick with that or put the functions in t10-pi.c and use a t10_pi_
prefix.

I'm a bit torn on placement since the integrity metadata could contain
other stuff than T10 PI. But the remapping is very specific to T10 PI.

>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>> index 9421d9877730..4186bf027c59 100644
>> --- a/drivers/scsi/sd.c
>> +++ b/drivers/scsi/sd.c
>> @@ -1119,7 +1119,9 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd 
>> *SCpnt)
>>  SCpnt->cmnd[0] = WRITE_6;
>>  
>>  if (blk_integrity_rq(rq))
>> -sd_dif_prepare(SCpnt);
>> +blk_integrity_dif_prepare(SCpnt->request,
>> +  sdkp->protection_type,
>> +  scsi_prot_ref_tag(SCpnt));
>
> scsi_prot_ref_tag could be move to the block layer as it only uses
> the sector in the eequest and the sector size, which we can get
> from the gendisk as well.  We then don't need to pass it to the function.

For Type 2, the PI can be at intervals different from the logical block
size (although we don't support that yet). We should use the
blk_integrity profile interval instead of assuming sector size.

And wrt. Keith's comment: The tuple_size should be the one from the
integrity profile as well, not sizeof(struct t10_pi_tuple).

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] block: Fix transfer when chuck sectors exceeds max

2018-06-26 Thread Martin K. Petersen


Keith,

> A device may have boundary restrictions where the number of sectors
> between boundaries exceeds its max transfer size. In this case, we need
> to cap the max size to the smaller of the two limits.

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 0/3] Fix blk_mq_end_request() and blk_end_request() for WRITE SAME

2018-06-26 Thread Martin K. Petersen


Hi Bart,

> This series of three patches fixes a crash in the block layer core
> that I encountered while retesting the SRP tests in blktests. Please
> consider these patches for kernel v4.19.

Patches 1 and 2 look fine. However, I'm not sure I'm so keen on patch
3. It looks like it's papering over a more fundamental issue.

Can you elaborate a bit on why the existing code fails with dm in the
mix?

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-23 Thread Martin K. Petersen

Kees,

> obj-$(CONFIG_SCSI)  += scsi/
>
> So: this needs to live in block/ just like CONFIG_BLK_SCSI_REQUEST's
> scsi_ioctl.c. I will split it into CONFIG_BLK_SCSI_SENSE, but I'll
> still need to move the code from drivers/scsi/ to block/. Is this
> okay?

The reason this sucks is that scsi_normalize_sense() is an inherent core
feature in the SCSI layer. Dealing with sense data for ioctls is just a
fringe use case.

I don't want to get too hung up on what goes where. But architecturally
it really irks me to move a core piece of SCSI state machine
functionality out of the subsystem to accommodate ioctl handling.

I'm traveling today so I probably won't get a chance to look closely
until tomorrow morning.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 3/6] block: Create scsi_sense.h for SCSI and ATAPI

2018-05-22 Thread Martin K. Petersen

Christoph,

> On Tue, May 22, 2018 at 11:15:09AM -0700, Kees Cook wrote:
>> Both SCSI and ATAPI share the sense header. In preparation for using the
>> struct scsi_sense_hdr more widely, move this into a separate header and
>> move the helper function to scsi_ioctl.c which is linked with CONFIG_IDE
>> by way of CONFIG_BLK_SCSI_REQUEST.
>
> Please keep the code where it is and just depend on SCSI on the legacy
> ide driver.  No need to do gymnastics just for a legacy case.

Yup, I agree.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 2/2] xfs: add 'discard_sync' mount flag

2018-05-01 Thread Martin K. Petersen

Jens,

> diff --git a/block/blk-wbt.h b/block/blk-wbt.h
> index a232c98fbf4d..aec5bc82d580 100644
> --- a/block/blk-wbt.h
> +++ b/block/blk-wbt.h
> @@ -14,12 +14,17 @@ enum wbt_flags {
>   WBT_TRACKED = 1,/* write, tracked for throttling */
>   WBT_READ= 2,/* read */
>   WBT_KSWAPD  = 4,/* write, from kswapd */
> + WBT_TRIM= 8,

The term TRIM does not apply to NVMe, nor SCSI. Please call it
WBT_DISCARD.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v2] target: Fix Fortify_panic kernel exception

2018-04-20 Thread Martin K. Petersen

Bryant,

> The bug exists in the memcmp in which the length passed in must be
> guaranteed to be 1. This bug currently exists because the second
> pointer passed in, can be smaller than the cmd->data_length, which
> causes a fortify_panic.
>
> The fix is to use memchr_inv instead to find whether or not a 0 exists
> instead of using memcmp. This way you dont have to worry about buffer
> overflow which is the reason for the fortify_panic.

Clarified the commit description a bit and applied the patch
4.17/scsi-fixes. Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: Change device block count from userspace?

2018-04-19 Thread Martin K. Petersen

Manuel,

> It's not a kernel fault, the drive is broken.

Interesting, OK. Just wanted to make sure we didn't have a regression. I
tweaked the capacity reading code recently.

> Still, is there a userspace option (sysfs attribute perhaps) to limit
> disk capacity to a certain block count?

I'm afraid not.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Martin K. Petersen

Chris,

>> I'd like to propose that we compact the fs sessions so that we get a
>> 3-slot session reserved for "Individual filesystem discussions" one
>> afternoon. That way we've got time in the schedule for the all the
>> ext4/btrfs/XFS/NFS/CIFS devs to get together with each other and
>> talk about things of interest only to their own fileystems.
>>
>> That means we all don't have to find time outside the schedule to do
>> this, and think this wold be time very well spent for most fs people
>> at the conf
>
> I'd love this as well.

Based on feedback last year we explicitly added a third day to LSF/MM to
facilitate hack time and project meetings.

As usual the schedule is fluid and will be adjusted on the fly.
Depending on track, I am hoping we'll be done with the scheduled topics
either at the end of Tuesday or Wednesday morning.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] target: fix crash with iscsi target and dvd

2018-04-18 Thread Martin K. Petersen

Ming,

> When the current page can't be added to bio, one new bio should be
> created for adding this page again, instead of ignoring this page.
>
> This patch fixes kernel crash with iscsi target and dvd, as reported
> by Wakko.

I queued this up in 4.17/scsi-fixes.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 02/12] storsvc: don't set a bounce limit

2018-04-18 Thread Martin K. Petersen

Christoph,

> The default already is to never bounce, so the call is a no-op.

Applied to 4.18/scsi-queue. Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 01/12] iscsi_tcp: don't set a bounce limit

2018-04-18 Thread Martin K. Petersen

Christoph,

> The default already is to never bounce, so the call is a no-op.

Applied to 4.18/scsi-queue. Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] target: fix crash with iscsi target and dvd

2018-04-18 Thread Martin K. Petersen

Christoph,

> Btw, seems like someone needs to volunteer for putting together a pull
> request with target fixes for Linus - I haven't heard from Nic for a
> while, and we've got quite a few fixes out on the list.

Still happy to take things through SCSI if Nic doesn't materialize.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: Change device block count from userspace?

2018-04-17 Thread Martin K. Petersen

Manuel,

> I have a SATA SSD which suddenly reports its size as 2.2TB, 0x
> block count:

"Suddenly" as in out of the blue? Or after a drive firmware update? Or a
kernel ditto?

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v2] block: ratelimite pr_err on IO path

2018-04-13 Thread Martin K. Petersen

Jinpu,

[CC:ed the mpt3sas maintainers]

The ratelimit patch is just an attempt to treat the symptom, not the
cause.

> Thanks for asking, we updated mpt3sas driver which enables DIX support
> (prot_mask=0x7f), all disks are SATA SSDs, no DIF support.
> After reboot, kernel reports the IO errors from all the drives behind
> HBA, seems for almost every read IO, which turns the system unusable:
> [   13.079375] sda: ref tag error at location 0 (rcvd 143196159)
> [   13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
> [   13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
> [   13.080407] sda: ref tag error at location 0 (rcvd 143196159)
> [   13.080594] sda: ref tag error at location 8 (rcvd 143196159)

That sounds like a bug in the mpt3sas driver or firmware. I guess the
HBA could conceivably be operating a SATA device as DIX Type 0 and strip
the PI on the drive side. But that doesn't seem to be a particularly
useful mode of operation.

Jinpu: Which firmware are you running? Also, please send us the output
of:

sg_readcap -l /dev/sda
sg_inq -x /dev/sda
sg_vpd /dev/sda

Broadcom: How is DIX supposed to work for SATA drives behind an mpt3sas
controller?

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: sr: get/drop reference to device in revalidate and check_events

2018-04-12 Thread Martin K. Petersen

Jens,

> We can't just use scsi_cd() to get the scsi_cd structure, we have
> to grab a live reference to the device. For both callbacks, we're
> not inside an open where we already hold a reference to the device.

Applied to 4.17/scsi-fixes, thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v2] block: ratelimite pr_err on IO path

2018-04-12 Thread Martin K. Petersen

Jack,

> + pr_err_ratelimited("%s: ref tag error at 
> location %llu (rcvd %u)\n",

I'm a bit concerned about dropping records of potential data loss.

Also, what are you doing that compels all these to be logged? This
should be a very rare occurrence.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [RFC PATCH 0/2] use larger max_request_size for virtio_blk

2018-04-05 Thread Martin K. Petersen

Weiping,

> For virtio block device, actually there is no a hard limit for max
> request size, and virtio_blk driver set -1 to
> blk_queue_max_hw_sectors(q, -1U);.  But it doesn't work, because there
> is a default upper limitation BLK_DEF_MAX_SECTORS (1280 sectors).

That's intentional (although it's an ongoing debate what the actual
value should be).

> So this series want to add a new helper
> blk_queue_max_hw_sectors_no_limit to set a proper max reqeust size.

BLK_DEF_MAX_SECTORS is a kernel default empirically chosen to strike a
decent balance between I/O latency and bandwidth. It sets an upper bound
for filesystem requests only. Regardless of the capabilities of the
block device driver and underlying hardware.

You can override the limit on a per-device basis via max_sectors_kb in
sysfs. People generally do it via a udev rule.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH V5 1/5] scsi: hpsa: fix selection of reply queue

2018-03-19 Thread Martin K. Petersen

Artem,

> we have this patch-set and it fixes megaraid regression in v4.16. Do
> you plan to mege it to v4.16-rcX? I am worried - there seem to be no
> sight that this is going to me merged. They are not in the linux-next.

I merged them into scsi-fixes last week.

It happens push a combined fixes+queue to linux-next to get more zeroday
coverage. However, most of the time linux-next is one 4.x+1 material
only.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-14 Thread Martin K. Petersen

Stephen,

>> It would be useful if those configurations were not left behind so
>> that Linux could feasibly deploy offload code to a controller in the
>> PCI domain.
>
> Agreed. I think this would be great. Kind of like the XCOPY framework
> that was proposed a while back for SCSI devices [1] but updated to also
> include NVMe devices. That is definitely a use case we would like this
> framework to support.

I'm on my umpteenth rewrite of the block/SCSI offload code. It is not as
protocol-agnostic as I would like in the block layer facing downwards.
It has proven quite hard to reconcile token-based and EXTENDED COPY
semantics along with the desire to support stacking. But from an
application/filesystem perspective everything looks the same regardless
of the intricacies of the device. Nothing is preventing us from
supporting other protocols...

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH V5 0/5] SCSI: fix selection of reply(hw) queue

2018-03-14 Thread Martin K. Petersen

Ming,

> The patches fixes reply queue(virt-queue on virtio-scsi) selection on
> hpsa, megaraid_sa and virtio-scsi, and IO hang can be caused easily by
> this issue.

I clarified all the commit descriptions. There were also a bunch of
duplicate review tags and other warnings. Please run checkpatch next
time!

Applied to 4.16/scsi-fixes. Thank you.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] dm mpath: fix passing integrity data

2018-03-14 Thread Martin K. Petersen

Steffen,

> After v4.12 commit e2460f2a4bc7 ("dm: mark targets that pass integrity
> data"), dm-multipath, e.g. on DIF+DIX SCSI disk paths, does not support
> block integrity any more. So add it to the whitelist.

Ugh.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] device_handler: remove VLAs

2018-03-12 Thread Martin K. Petersen

Stephen,

> In preparation to enabling -Wvla, remove VLAs and replace them with
> fixed-length arrays instead.
>
> scsi_dh_{alua,emc,rdac} use variable-length array declarations to
> store command blocks, with the appropriate size as determined by
> COMMAND_SIZE. This patch replaces these with fixed-sized arrays using
> MAX_COMMAND_SIZE, so that the array size can be determined at compile
> time.

Applied to 4.17/scsi-queue. Thank you!

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue

2018-03-06 Thread Martin K. Petersen

Ming,

> Please consider 2/8 too since it is still a fix.

I still need the driver maintainer to ack the change.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue

2018-03-06 Thread Martin K. Petersen

Ming,

> Given both Don and Laurence have verified that patch 1 and patch 2
> does fix IO hang, could you consider to merge the two first?

Oh, and I would still need a formal Acked-by: from Don and Tested-by:
from Laurence.

Also, for 4.16/scsi-fixes I would prefer verification to be done with
just patch 1/8 and none of the subsequent changes in place. Just to make
sure we're testing the right thing.

Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue

2018-03-06 Thread Martin K. Petersen

Hi Ming,

> Given both Don and Laurence have verified that patch 1 and patch 2
> does fix IO hang, could you consider to merge the two first?

I'm not going to merge the MR patch until Kashyap acks it.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 11/11] block: Move the queue_flag_*() functions from a public into a private header file

2018-03-02 Thread Martin K. Petersen

Bart,

> This patch helps to avoid that new code gets introduced in block drivers
> that manipulates queue flags without holding the queue lock when that
> lock should be held.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 10/11] block: Complain if queue_flag_(set|clear)_unlocked() is abused

2018-03-02 Thread Martin K. Petersen

Bart,

> Since it is not safe to use queue_flag_(set|clear)_unlocked() without
> holding the queue lock after the sysfs entries for a queue have been
> created, complain if this happens.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 08/11] target/tcm_loop: Use blk_queue_flag_set()

2018-03-02 Thread Martin K. Petersen

Bart,

> Use blk_queue_flag_set() instead of open-coding this function.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 07/11] iscsi: Use blk_queue_flag_set()

2018-03-02 Thread Martin K. Petersen

Bart,

> Use blk_queue_flag_set() instead of open-coding this function.

Acked-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 06/11] bcache: Use the blk_queue_flag_{set,clear}() functions

2018-03-02 Thread Martin K. Petersen

Bart,

> Use the blk_queue_flag_{set,clear}() functions instead of open-coding
> these.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 04/11] block: Protect queue flag changes with the queue lock

2018-03-02 Thread Martin K. Petersen

Bart,

> Since the queue flags may be changed concurrently from multiple
> contexts after a queue becomes visible in sysfs, make these changes
> safe by protecting these with the queue lock.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 05/11] mtip32xx: Use the blk_queue_flag_*() functions

2018-03-02 Thread Martin K. Petersen

Bart,

> Use the blk_queue_flag_*() functions instead of open-coding these.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 03/11] block: Introduce blk_queue_flag_{set,clear,test_and_{set,clear}}()

2018-03-02 Thread Martin K. Petersen

Bart,

> Introduce functions that modify the queue flags and that protect
> these modifications with the request queue lock. Except for moving
> one wake_up_all() call from inside to outside a critical section,
> this patch does not change any functionality.

Looks OK.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 02/11] block: Use the queue_flag_*() functions instead of open-coding these

2018-03-02 Thread Martin K. Petersen

Bart,

> Except for changing the atomic queue flag manipulations that are
> protected by the queue lock into non-atomic manipulations, this
> patch does not change any functionality.

Looks fine.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 01/11] block: Reorder the queue flag manipulaton function definitions

2018-03-02 Thread Martin K. Petersen

Bart,

s/manipulaton/manipulation/ in Subject. Otherwise OK.

> Move the definition of queue_flag_clear_unlocked() up and move the
> definition of queue_in_flight() down such that all queue flag
> manipulation function definitions become contiguous.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v2] block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into

2018-02-21 Thread Martin K. Petersen

Bart,

> It happens often while I'm preparing a patch for a block driver that
> I'm wondering: is a definition of SECTOR_SIZE and/or SECTOR_SHIFT
> available for this driver? Do I have to introduce definitions of these
> constants before I can use these constants? To avoid this confusion,
> move the existing definitions of SECTOR_SIZE and SECTOR_SHIFT into the
>  header file such that these become available for all
> block drivers. Make the SECTOR_SIZE definition in the uapi msdos_fs.h
> header file conditional to avoid that including that header file after
>  causes the compiler to complain about a SECTOR_SIZE
> redefinition.

Looks good.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH V2] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-23 Thread Martin K. Petersen

Ming,

> + * Block layer and block driver specific status, which is ususally returnd
  ^^^
> + * from driver to block layer in IO path.

Given that the comment blurb is long and the flag not defined until
later, it is not entirely obvious that you are documenting
BLK_STS_DEV_RESOURCE. So please make that clear at the beginning of the
comment.

> + * If driver isn't sure if the queue can be run again for dealing with the
> + * current request after this kind of resource is available, please return
> + * BLK_STS_SOURCE, for example, when memory allocation, DMA Mapping or other
  ^^^^^^

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: how to enlarge value of max_sectors_kb

2018-01-16 Thread Martin K. Petersen

Tang,

> There is a machine with very little max_sectors_kb size:
> [root@ceph151 queue]# pwd
> /sys/block/sdd/queue
> [root@ceph151 queue]# cat max_hw_sectors_kb 
> 256
> [root@ceph151 queue]# cat max_sectors_kb 
> 256
>
> The performance is very low when I run big I/Os.
> I can not modify it directly, I need a little bigger,
> so how can I enlarge it?

max_hw_sectors_kb is set based on the DMA transfer capabilities of the
SCSI controller.

max_sectors_kb is a soft limit for regular filesystem I/O. It can not
exceed max_hw_sectors_kb since that would violate the hardware
constraints.

You need to figure out why your controller is limiting the transfer
size.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] block: Fix __bio_integrity_endio() documentation

2018-01-16 Thread Martin K. Petersen

Bart,

> Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio")
> Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>

Looks fine.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH V9 0/7] blk-mq support for ZBC disks

2018-01-08 Thread Martin K. Petersen

Jens,

> Completely up to you - I already have 1-5, I can add 6/7 as well, or
> just can do it in your tree. Let me know what you prefer.

Started my 4.16/scsi-fixes branch early based on your tree.

I queued these two up.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v2 3/3] scsi-mq-debugfs: Show more information

2018-01-08 Thread Martin K. Petersen

Bart,

> Show the request result, request timeout and SCSI command flags.
> This information is very helpful when trying to figure out why a
> queue got stuck. An example of the information that is exported
> through debugfs:

Applied to 4.16/scsi-fixes, thanks.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH V9 0/7] blk-mq support for ZBC disks

2018-01-08 Thread Martin K. Petersen

Jens,

> This looks OK for me for 4.16. I can grab all of them, or I can leave
> the last two for Martin to apply if he prefers that, though that will
> add a block tree dependency for SCSI.

I already have a block dependency for 4.16. But it doesn't matter much.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v2 1/3] scsi: Fix a scsi_show_rq() NULL pointer dereference

2017-12-11 Thread Martin K. Petersen

Hi Ming,

> This patch allocates one array for T10_PI_TYPE2_PROTECTION command,
> size of each element is SD_EXT_CDB_SIZE, and the length is
> host->can_queue, then we can retrieve one command buffer runtime
> via rq->tag.
>
> So we can avoid to allocate the command buffer runtime, also the
> recent use-after-free report[1] in scsi_show_rq() can be fixed too.

I'm still mulling over the pros and cons of this one for 4.16+...

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v2 1/3] scsi: Fix a scsi_show_rq() NULL pointer dereference

2017-12-11 Thread Martin K. Petersen

Bart,

> Avoid that scsi_show_rq() triggers a NULL pointer dereference if
> called after sd_uninit_command(). Swap the NULL pointer assignment
> and the mempool_free() call in sd_uninit_command() to make it less
> likely that scsi_show_rq() triggers a use-after-free. Note: even
> with these changes scsi_show_rq() can trigger a use-after-free but
> that's a lesser evil than e.g. suppressing debug information for
> T10-PI commands completely. This patch fixes the following oops:

Applied to 4.15/scsi-fixes. Thanks, Bart.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v2 1/3] scsi: Fix a scsi_show_rq() NULL pointer dereference

2017-12-07 Thread Martin K. Petersen

Ming,

> As I explained in [1], the use-after-free is inevitable no matter if
> clearing 'SCpnt->cmnd' before mempool_free() in sd_uninit_command() or
> not, so we need to comment the fact that cdb may point to garbage
> data, and this function(especially __scsi_format_command() has to
> survive that, so that people won't be surprised when kasan complains
> use-after-free, and guys will be careful when they try to change the
> code in future.

Longer term we really need to get rid of the separate CDB allocation. It
was a necessary evil when I did it. And not much of a concern since I
did not expect anybody sane to use Type 2 (it's designed for use inside
disk arrays).

However, I keep hearing about people using Type 2 drives. Some vendors
source drives formatted that way and use the same SKU for arrays and
standalone servers.

So we should really look into making it possible for a queue to have a
bigger than 16-byte built-in CDB. For Type 2 devices, 32-byte reads and
writes are a prerequisite. So it would be nice to be able to switch a
queue to a larger allocation post creation (we won't know the type until
after READ CAPACITY(16) has been sent).

Last I looked at this it was not entirely trivial given how we tag
things on to the end. But that really is my preferred fix.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] SCSI: run queue if SCSI device queue isn't ready and queue is idle

2017-12-07 Thread Martin K. Petersen

Ming,

> Jens, Martin, would any of you mind making this patch in V4.15? Since
> it fixes real use cases and this way is exact what we do before
> 0df21c86bdbf("scsi: implement .get_budget and .put_budget for blk-mq").

Applied to 4.15/scsi-fixes, thank you!

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 1/2] scsi-mq: Only show the CDB if available

2017-12-05 Thread Martin K. Petersen

Hi Ming,

> Please cook a patch for fixing the crash issue only, since we need
> to backport the fix to stable kernel.

I thought you were going to submit a V5 that addressed James' concerns?

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [RFC] md: make queue limits depending on limits of RAID members

2017-11-30 Thread Martin K. Petersen

Mariusz,

>>> One potential solution for that is to change type of some queue limits
>>> (at least discard_granularity and discard_alignment, maybe more, like
>>> max_discard_sectors?) from 32 bits to 64 bits.

You're still restricted by the max bio size which is 32 bits on 32-bit
platforms.

> Let me give you an example with io_min (it works also for other params):
> We've got RAID 0 with 2 disks and 32KiB chunk size. Each member disk
> reports io_min=64KiB. blk_stack_limits takes max value from set of chunk
> size and members io_min
>   t->io_min = max(t->io_min, b->io_min);
> so new array's io_min will be equal to 64KiB. Such request will be split
> for both array members, each device will get 32KiB request and it is not
> enough for them to meet minimum io size requirement.

When this was written, there was a clear benefit giving priority to the
MD/DM topology over any values reported by the hardware.

I.e. the performance was much better when honoring the MD stripe chunk
and stripe size over any values reported by the underlying devices. This
was true for both disk drives and RAID devices that reported the
relevant issues.

That's why the stacking works the way it does.

For your particular example, I'd say that if your device reports an
io_min of 64KB, then it's a user error to create an MD device with a
stripe chunk of 32KB. mdadm should discourage creating such a
configuration.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] nbd: set discard granularity properly

2017-11-30 Thread Martin K. Petersen

Josef,

> For some reason we had discard granularity set to 512 always even when
> discards were disabled.  Fix this by having the default be 0, and then
> if we turn it on set the discard granularity to the blocksize.

Originally, discard_max_bytes was the queue limit used to indicate
whether discards were supported or not. Callers were supposed to check
that for a value bigger than 0 before interpreting granularity and
alignment. However, many callers triggered on discard_granularity > 0
instead, and as a result things have been muddled somewhat.

So I don't have a problem with only reporting granularity when the
feature is actually enabled...

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH V8 6/7] sd_zbc: Initialize device request queue zoned data

2017-11-20 Thread Martin K. Petersen

Damien,

> Initialize the seq_zones_bitmap, seq_zones_wlock and nr_zones fields
> of the disk request queue on disk revalidate. As the seq_zones_bitmap
> and seq_zones_wlock allocations are identical, introduce the helper
> sd_zbc_alloc_zone_bitmap(). Using this helper, reallocate the bitmaps
> whenever the disk capacity (number of zones) changes.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 6/7] block: create 'slaves' and 'holders' entries for hidden gendisks

2017-11-09 Thread Martin K. Petersen

Christoph,

> From: Hannes Reinecke <h...@suse.de>
>
> When creating nvme multipath devices we should populate the 'slaves'
> and 'holders' directorys properly to aid userspace topology detection.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 7/7] nvme: create 'slaves' and 'holders' entries for hidden controllers

2017-11-09 Thread Martin K. Petersen

Christoph,

> From: Hannes Reinecke <h...@suse.de>
>
> When creating nvme multipath devices we should populate the 'slaves'
> and 'holders' directorys properly to aid userspace topology detection.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 5/7] nvme: also expose the namespace identification sysfs files for mpath nodes

2017-11-09 Thread Martin K. Petersen

Christoph,

> We do this by adding a helper that returns the ns_head for a device
> that can belong to either the per-controller or per-subsystem block
> device nodes, and otherwise reuse all the existing code.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 4/7] nvme: implement multipath access to nvme subsystems

2017-11-09 Thread Martin K. Petersen

Christoph,

> This patch adds native multipath support to the nvme driver.  For each
> namespace we create only single block device node, which can be used
> to access that namespace through any of the controllers that refer to
> it.  The gendisk for each controllers path to the name space still
> exists inside the kernel, but is hidden from userspace.  The character
> device nodes are still available on a per-controller basis.  A new
> link from the sysfs directory for the subsystem allows to find all
> controllers for a given subsystem.
>
> Currently we will always send I/O to the first available path, this
> will be changed once the NVMe Asynchronous Namespace Access (ANA) TP
> is ratified and implemented, at which point we will look at the ANA
> state for each namespace.  Another possibility that was prototyped is
> to use the path that is closes to the submitting NUMA code, which will
> be mostly interesting for PCI, but might also be useful for RDMA or FC
> transports in the future.  There is not plan to implement round robin
> or I/O service time path selectors, as those are not scalable with the
> performance rates provided by NVMe.
>
> The multipath device will go away once all paths to it disappear, any
> delay to keep it alive needs to be implemented at the controller
> level.

Beautiful!

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 3/7] nvme: track shared namespaces

2017-11-09 Thread Martin K. Petersen

Christoph,

> Introduce a new struct nvme_ns_head that holds information about an
> actual namespace, unlike struct nvme_ns, which only holds the
> per-controller namespace information.  For private namespaces there is
> a 1:1 relation of the two, but for shared namespaces this lets us
> discover all the paths to it.  For now only the identifiers are moved
> to the new structure, but most of the information in struct nvme_ns
> should eventually move over.
>
> To allow lockless path lookup the list of nvme_ns structures per
> nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
> structure through call_srcu.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 2/7] nvme: introduce a nvme_ns_ids structure

2017-11-09 Thread Martin K. Petersen

Christoph,

> This allows us to manage the various uniqueue namespace identifiers
> together instead needing various variables and arguments.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 1/7] nvme: track subsystems

2017-11-09 Thread Martin K. Petersen

Christoph,

> This adds a new nvme_subsystem structure so that we can track multiple
> controllers that belong to a single subsystem.  For now we only use it
> to store the NQN, and to check that we don't have duplicate NQNs
> unless the involved subsystems support multiple controllers.
>
> Includes code originally from Hannes Reinecke to expose the subsystems
> in sysfs.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH V7 7/8] scsi: sd: Remove zone write locking

2017-11-08 Thread Martin K. Petersen

Damien,

> The block layer now handles zone write locking.

Looks OK.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH V7 6/8] scsi: sd_zbc: Initialize device request queue zoned data

2017-11-08 Thread Martin K. Petersen

Damien,

> wait for the disk capacity and number of zones to stabilize on the
> second revalidation pass to allocate and initialize the bitmaps.

Stabilize how?

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH V7 5/8] block: deadline-iosched: Introduce zone locking support

2017-11-08 Thread Martin K. Petersen

Damien,

> Introduce zone write locking to avoid write request reordering with
> zoned block devices. This is achieved using a finer selection of the
> next request to dispatch:
> 1) Any non-write request is always allowed to proceed.
> 2) Any write to a conventional zone is always allowed to proceed.
> 3) For a write to a sequential zone, the zone lock is first checked.
>a) If the zone is not locked, the write is allowed to proceed after
>   its target zone is locked.
>b) If the zone is locked, the write request is skipped and the next
>   request in the dispatch queue tested (back to step 1).
>
> For a write request that has locked its target zone, the zone is
> unlocked either when the request completes and the method
> deadline_request_completed() is called, or when the request is requeued
> using the method deadline_add_request().
>
> Requests targeting a locked zone are always left in the scheduler queue
> to preserve the initial write order. If no write request can be
> dispatched, allow reads to be dispatched even if the write batch is not
> done.
>
> If the device used is not a zoned block device, or if zoned block device
> support is disabled, this patch does not modify deadline behavior.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH V7 4/8] block: deadline-iosched: Introduce dispatch helpers

2017-11-08 Thread Martin K. Petersen

Damien,

> Avoid directly referencing the next_rq and fifo_list arrays using the
> helper functions deadline_next_request() and deadline_fifo_request() to
> facilitate changes in the dispatch request selection in
> deadline_dispatch_requests() for zoned block devices.
>
> While at it, also remove the unnecessary forward declaration of the
> function deadline_move_request().

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH V7 2/8] blokc: mq-deadline: Introduce dispatch helpers

2017-11-08 Thread Martin K. Petersen

Damien,

> Avoid directly referencing the next_rq and fifo_list arrays using the
> helper functions deadline_next_request() and deadline_fifo_request()
> to facilitate changes in the dispatch request selection in
> __dd_dispatch_request() for zoned block devices.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-06 Thread Martin K. Petersen

Ming,

>> Since this is a SCSI patch the SCSI maintainer, Martin Petersen,
>> should have been Cc-ed for this patch. Additionally, I think that
>> this patch should not have been queued by Jens before Martin had
>> approved this patch.
>
> This patch has been CCed to SCSI list.

I don't need a personal CC, linux-scsi is sufficient. However, as a
general rule, changes to any file in drivers/scsi needs go through the
SCSI tree.

We sometimes make exceptions to facilitate the merge process. In those
cases an Acked-by: will be provided to indicate that it is OK that the
patch goes through a different tree.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v3 6/8] scsi/ipr: Use sgl_alloc_order() and sgl_free_order()

2017-11-06 Thread Martin K. Petersen

Bart,

> Use the sgl_alloc_order() and sgl_free_order() functions instead of
> open coding these functions.

I'll merge patches 6-8 once the plumbing goes in.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 9/9] bsg: split handling of SCSI CDBs vs transport requeues

2017-10-23 Thread Martin K. Petersen

Benjamin,

>> Not sure it's worth it especially now that Martin has merged the patch.
>
> He did? I only saw a mail that he picked patches 2-5. So all the bsg
> changes are still open I think.

Yes, I expected the bsg bits to go through Jens' tree.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v3 2/2] block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()

2017-10-18 Thread Martin K. Petersen

Ilya,

> sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to

I'm not so keen on that description.

sd is the entity that owns max_ws_blocks so it certainly doesn't ignore
it. The default is for max_ws_blocks to be set because we have no
generic, non-destructive way of finding out whether a WRITE SAME command
is going to work or not. If a WRITE SAME subsequently fails, we'll
disable the feature by clearing max_write_same_sectors in the queue
limits.

> permit trying WRITE SAME on older SCSI devices, unless ->no_write_same
> is set.  Because REQ_OP_WRITE_ZEROES is implemented in terms of WRITE
> SAME, blkdev_issue_zeroout() may fail with -EREMOTEIO:

Other than that patch looks good.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v3 1/2] block: factor out __blkdev_issue_zero_pages()

2017-10-18 Thread Martin K. Petersen

Ilya,

> blkdev_issue_zeroout() will use this in !BLKDEV_ZERO_NOFALLBACK case.

Looks good to me.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v9 09/10] block, scsi: Make SCSI quiesce and resume work reliably

2017-10-16 Thread Martin K. Petersen

Bart,

> The contexts from which a SCSI device can be quiesced or resumed are:
> * Writing into /sys/class/scsi_device/*/device/state.
> * SCSI parallel (SPI) domain validation.
> * The SCSI device power management methods. See also scsi_bus_pm_ops.

The SCSI bits look fine to me.

Acked-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v9 07/10] ide, scsi: Tell the block layer at request allocation time about preempt requests

2017-10-16 Thread Martin K. Petersen

Bart,

> Convert blk_get_request(q, op, __GFP_RECLAIM) into
> blk_get_request_flags(q, op, BLK_MQ_PREEMPT). This patch does not
> change any functionality.

Acked-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [RFC] bsg-lib interface cleanup

2017-10-16 Thread Martin K. Petersen

Christoph,

> this series cleans up various abuses of the bsg interfaces, and then
> splits bsg for SCSI passthrough from bsg for arbitrary transport
> passthrough.  This removes the scsi_request abuse in bsg-lib that is
> very confusing, and also makes sure we can sanity check the requests
> we get.  The current code will happily execute scsi commands against
> bsg-lib queues, and transport pass through against scsi nodes, without
> any indication to the driver that we are doing the wrong thing.

I applied patches 2-5 to 4.15/scsi-queue.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] block: cope with WRITE SAME failing in blkdev_issue_zeroout()

2017-09-20 Thread Martin K. Petersen

Christoph,

> I'm really not sure we should check for -EREMOTEIO specifically, but
> Martin who is more familiar with the SCSI code might be able to
> correct me, I'd feel safer about checking for any error which is
> what the old code did.
>
> Except for that the patch looks fine to me.

We explicitly return -EREMOTEIO when the device reports ILLEGAL
REQUEST. But I agree that we should fall back to manually zeroing for
any error.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 2/2] scsi_transport_fc: fix NULL pointer dereference in fc_bsg_job_timeout

2017-09-14 Thread Martin K. Petersen

Christoph,

> bsg-lib now embeddeds the job structure into the request, and
> req->special can't be used anymore.

Applied to 4.14/scsi-fixes.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] Revert "scsi: default to scsi-mq"

2017-08-14 Thread Martin K. Petersen

Christoph,

> Defaulting to scsi-mq in 4.13-rc has shown various regressions
> on setups that we didn't previously consider.  Fixes for them are
> in progress, but too invasive to make it in this cycle.  So for
> now revert the commit that defaults to blk-mq for SCSI.  For 4.14
> we'll plan to try again with these fixes.
>
> This reverts commit 5c279bd9e40624f4ab6e688671026d6005b066fa.

Applied to 4.13/scsi-fixes. Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: two small integrity cleanups

2017-08-09 Thread Martin K. Petersen

Christoph,

> Found these while coming up with the fixes just sent.

Also OK.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: integrity regression fixes for 4.13-rc

2017-08-09 Thread Martin K. Petersen

Christoph,

> this series fixes regressions in the integrity handling update in
> 4.13-rc.
>
> The first one was sent earlier by Milan and while both Martin and I
> aren't exactly happy about the way dm uses the integrity code to cause
> this regression this minimal fix gets us back to the status quo.
>
> The second one makes sure that we only verify the DIF checksums on the
> lowest layer where we attach the integrity information.

These look OK to me.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] bio-integrity: revert "stop abusing bi_end_io"

2017-08-07 Thread Martin K. Petersen

Mikulas,

> If you create the integrity tag at or above device mapper level, you
> will run into problems because the same device can be accessed using
> device mapper and using physical volume /dev/sd*. If you create
> integrity tags at device mapper level, they will contain device
> mapper's logical sector number and the sector number won't match if
> you access the device directly using /dev/sd*.

For writes, the bip seed value is adjusted every time a bio is cloned,
sliced and diced as it traverses partitioning/MD/DM. And then at the
bottom of the stack, the ref tag values in the protection information
buffer are verified against the adjusted seed value and remapped
according to the starting logical block number. The reverse is taking
place for reads.

This is much faster than doing a remapping of the actual protection
buffer values every time the I/O transitions from one address space to
the other. In addition, some HBA hardware allows us to program the PI
engine with the seed value. So the submitter value to LBA conversion can
be done on the fly in hardware.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] bio-integrity: revert "stop abusing bi_end_io"

2017-08-05 Thread Martin K. Petersen

Mikulas,

> The sector number in the integrity tag must match the physical sector 
> number. So, it must be verified at the bottom.

The ref tag seed matches the submitter block number (typically block
layer sector for the top device) and is remapped to and from the LBA by
the SCSI disk driver or HBA firmware.

So the verification is designed to be done by the top level entity that
attaches the protection information to the bio. In this case
bio_integrity_prep().

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] bio-integrity: revert "stop abusing bi_end_io"

2017-08-05 Thread Martin K. Petersen

Christoph,

> We can simply add another bio flag to get back to the previous
> behavior.  That being said thing to do in the end is to verify it
> at the top of the stack, and not the bottom eventuall.  I can cook
> up a patch for that.

Yeah, the original code was careful about only adding the verification
hook to the top bio.

A bio flag is probably the path of least resistance. It already exists,
actually: BIP_BLOCK_INTEGRITY. But we'll need to make sure it gets
masked out when a bio is cloned. And then we can key off of that and
REQ_OP_READ in the endio function.

I prefer that approach to reverting Christoph's commit.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [RFC PATCH] bio-integrity: Fix regression if profile verify_fn is NULL

2017-08-02 Thread Martin K. Petersen

Milan,

> And the integrity profile is perfect interface for this, we register
> own profile through the proper interface.  (Any other solution for
> per-sector metadata would be worse, I tried...)

The DM use case seems a bit weird and I would like to take a closer look
later (a storm took out my power and internet so I'm a bit logistically
impaired right now). But the intent of the integrity infrastructure was
to be able to carry arbitrary metadata pinned to an I/O. The callback
hooks in the profile were really only intended for the case where the
block layer itself needed to generate and verify.

We already do sanity checking on the callback pointers in the prep
function. I guess don't have a problem doing the same in the completion
path.

Fundamentally, though, the verify function should only ever be called if
the profile has BLK_INTEGRITY_VERIFY set.

Previously that was done in the prep function by only adding the verify
endio hook when it was needed. Christoph's patch to kill the double
endio changed that subtly (FWIW, Jens originally objected to having an
integrity conditional in the regular bio completion path. That's why the
verification handling abused the endio function).

Anyway. So I think that the BLK_INTEGRITY_VERIFY logic needs to be
carried over to __bio_integrity_endio()...

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] sd: add support for TCG OPAL self encrypting disks

2017-06-27 Thread Martin K. Petersen

Tejun,

>> Looks good to me. I'll queue it up for 4.13 as soon as Linus has pulled
>> in the ata bits.
>
> I can route it through libata tree w/ your ack if that's more convenient.

I don't think it will cause any headaches. I'm just a bit cautious since
I already have a ton of conflicts in linux-next for this merge window.

Acked-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 9/9] nvme: add support for streams and directives

2017-06-26 Thread Martin K. Petersen

Christoph,

>  - can we keep a module option to disable streams, or in fact for
>now maybe to explicitly enable it?  I expect this to be interesting
>at least for the first devices that implement it.  Also given that
>it needs to be explicitly enabled I would expect some overhead of
>just enabling it when never used

Yeah, based on my experiments we'll need to drive this as an opt-in
feature for now. Short term the module option is OK. Once more devices
start materializing we probably need a white/blacklist/quirk scheme.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] sd: add support for TCG OPAL self encrypting disks

2017-06-26 Thread Martin K. Petersen

Christoph,

> ping?

Looks good to me. I'll queue it up for 4.13 as soon as Linus has pulled
in the ata bits.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCHSET v9] Add support for write life time hints

2017-06-20 Thread Martin K. Petersen

Jens,

> A new iteration of this patchset, previously known as write streams.
> As before, this patchset aims at enabling applications split up
> writes into separate streams, based on the perceived life time
> of the data written. This is useful for a variety of reasons:

The grep-wielding streams police noticed several occurrences of the
string "stream" in patches 1-3 and 5. Please fix.

Otherwise OK as a baseline.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 2/2] block: stop using bio_data() in blk_write_same_mergeable

2017-06-20 Thread Martin K. Petersen

Christoph,

> While the Write Same page currently always is in low-level it is just
> as easy and safer to just compare the page and offset directly.

Also fine.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 1/2] block: remove the unused bio_to_phys macro

2017-06-20 Thread Martin K. Petersen

Christoph,

> Signed-off-by: Christoph Hellwig <h...@lst.de>

That's fine.

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 06/12] block: add helpers for setting/checking write hint validity

2017-06-16 Thread Martin K. Petersen

Jens,

> +static const unsigned int rwf_write_to_opf_flag[] = {
> + 0, REQ_WRITE_SHORT, REQ_WRITE_MEDIUM, REQ_WRITE_LONG, REQ_WRITE_EXTREME
> +};

Minor nit: When I see WRITE_SHORT I instinctively think data corruption.

Can we make these REQ_LIFETIME_SHORT or something instead? It loses the
WRITE moniker which I'm not so keen on. But I'm not sure how we'd define
read lifetime...

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 05/12] fs: add fcntl() interface for setting/getting write life time hints

2017-06-16 Thread Martin K. Petersen

Jens,

> We have a pwritev2(2) interface based on passing in flags. Add an
> fcntl interface for querying these flags, and also for setting them
> as well:
>
> F_GET_WRITE_LIFE  Returns one of the valid type of write hints,
>   like WRITE_HINT_MEDIUM.
>
> F_SET_WRITE_LIFE  Pass in a WRITE_HINT_* type to set the
>   write life time hint for this file/inode.
>   Returns 0 on succes, -1 otherwise.

It seems like an overkill to have different fcntls for different
hints. And since we are expecting more, maybe these should be
F_{GET,SET}_HINT and then the individual flags can be
WRITE_LIFETIME_FOOBAR?

Otherwise OK with the fnctl approach.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 01/12] block: add support for carrying stream information in a bio

2017-06-16 Thread Martin K. Petersen

Jens,

> No functional changes in this patch, we just add four flags
> that will be used to denote a stream type, and ensure that we
> don't merge across different stream types.

More stream terminology...

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 02/12] blk-mq: expose stream write stats through debugfs

2017-06-16 Thread Martin K. Petersen

Jens,

> Useful to verify that things are working the way they should.
> Reading the file will return number of kb written to each
> stream. Writing the file will reset the statistics. No care
> is taken to ensure that we don't race on updates.
>
> Drivers will write to q->stream_writes[] if they handle a stream.

s/stream/write_lifetime_bucket/ or something like that.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCHSET v2] Add support for write life time hints

2017-06-14 Thread Martin K. Petersen

Christoph,

> I think what Martin wants (or at least what I'd want him to want) is
> to define a few REQ_* bits that mirror the RWF bits, use that to
> transfer the information down the stack, and then only translate it
> to stream ids in the driver.

Yup. If we have enough space in the existing flags that's perfect (I
lost count after your op/flag shuffle).

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCHSET v2] Add support for write life time hints

2017-06-14 Thread Martin K. Petersen

Jens,

> So how about we just call it "write_hint"? It sounds mostly like a
> naming issue to me, as you would then map that to some specific stream
> in your driver. You're free to do that right now. They are all flags,
> it's just packed as a value to not waste too much space.

Sure, that's fine with me. But let's call them bi_hints or something. I
have a couple that I would like to add that are I/O direction agnostic.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCHSET v2] Add support for write life time hints

2017-06-14 Thread Martin K. Petersen

Jens,

> A new iteration of this patchset, previously known as write streams.
> As before, this patchset aims at enabling applications split up
> writes into separate streams, based on the perceived life time
> of the data written. This is useful for a variety of reasons:
>
> - With NVMe 1.3 compliant devices, the device can expose multiple
>   streams. Separating data written into streams based on life time
>   can drastically reduce the write amplification. This helps device
>   endurance, and increases performance. Testing just performed
>   internally at Facebook with these patches showed up to a 25%
>   reduction in NAND writes in a RocksDB setup.
>
> - Software caching solutions can make more intelligent decisions
>   on how and where to place data.
>
> Contrary to previous patches, we're not exposing numeric stream values
> anymore.  I've previously advocated for just doing a set of hints that
> makes sense instead. See the coverage from the LSFMM summit this year:

I am all for having these write hints. But one request I would like to
make is that we just make them flags and abolish all notions of the term
"streams" from block for this particular use case (since it is more
hinty than streamy).

There are devices coming where a proper stream ID is prerequisite to
separate data streams for other reasons than bucketing based on data
lifetime. So my preference would be that we make the lifetime hints be
flags in block. That does not preclude using Streams Directives to
implement them in the NVMe NAND flash case. But it does not cause
conflicts with the use cases that need "proper" stream IDs for QoS or
colocation avoidance purposes in SCSI.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 6/6] sd: add support for TCG OPAL self encrypting disks

2017-06-13 Thread Martin K. Petersen

Christoph,

>> But as we already set no_report_opcodes for all usb-storage and
>> quirked uas devices I think the worst offenders are already covered
>> anyway.
>
> Martin, how do we want to move ahead on this patch?

I was suggesting the VPD because that may be easier for the SAS HBA
vendors to accommodate for SATA passthrough. But we can cross that
bridge when we get to it.

For libata I'm fine with keying off a supports_opal:1 flag in
scsi_device or something to that effect. I'd just like to reduce the
risk of introducing more RSOC regressions.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] block: Dedicated error code fixups

2017-06-13 Thread Martin K. Petersen

Bart,

> This patch fixes two sparse warnings introduced by the "dedicated
> error codes for the block layer V3" patch series. These changes
> have not been tested.

LGTM.

-- 
Martin K. Petersen  Oracle Linux Engineering


  1   2   3   >