Re: [PATCH v3] ibmvscsis: Initial commit of IBM VSCSI Tgt Driver

2016-05-27 Thread Nicholas A. Bellinger
Hi Bryant,

On Fri, 2016-05-27 at 09:32 -0500, Bryant G. Ly wrote:
> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi_tgt.c 
> b/drivers/scsi/ibmvscsi/ibmvscsi_tgt.c
> new file mode 100644
> index 000..292d129
> --- /dev/null
> +++ b/drivers/scsi/ibmvscsi/ibmvscsi_tgt.c



> +
> +static struct se_portal_group *ibmvscsis_make_nexus(struct ibmvscsis_tport
> + *tport,
> + const char *name)
> +{
> + struct se_node_acl *acl;
> +
> + if (tport->se_sess) {
> + pr_debug("tport->se_sess already exists\n");
> + return &tport->se_tpg;
> + }
> +
> + /*
> +  *  Initialize the struct se_session pointer and setup tagpool
> +  *  for struct ibmvscsis_cmd descriptors
> +  */
> + tport->se_sess = transport_init_session(TARGET_PROT_NORMAL);
> + if (IS_ERR(tport->se_sess))
> + goto transport_init_fail;
> +
> + /*
> +  * Since we are running in 'demo mode' this call will generate a
> +  * struct se_node_acl for the ibmvscsis struct se_portal_group with
> +  * the SCSI Initiator port name of the passed configfs group 'name'.
> +  */
> +
> + acl = core_tpg_check_initiator_node_acl(&tport->se_tpg,
> + (unsigned char *)name);
> + if (!acl) {
> + pr_debug("core_tpg_check_initiator_node_acl() failed for %s\n",
> +  name);
> + goto acl_failed;
> + }
> + tport->se_sess->se_node_acl = acl;
> +
> + /*
> +  * Now register the TCM ibmvscsis virtual I_T Nexus as active.
> +  */
> + transport_register_session(&tport->se_tpg,
> +tport->se_sess->se_node_acl,
> +tport->se_sess, tport);
> +
> + tport->se_sess->se_tpg = &tport->se_tpg;
> +

FYI, starting in v4.6 these three calls are handled directly by a new
target_alloc_session() helper.

No objection to leaving this as-is for now to make it easier to run atop
unmodified v4.4 target code, but note you'll want to convert this
post-merge.
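
For reference, a minimal sketch of what that post-merge conversion might
look like (hedged: the zero tag_num/tag_size and the NULL callback are
assumptions, and this has not been tested against this driver):

	/*
	 * Hedged sketch only: collapse transport_init_session() +
	 * core_tpg_check_initiator_node_acl() + transport_register_session()
	 * into the v4.6+ helper.  target_alloc_session() allocates the
	 * session, resolves (or demo-mode creates) the node ACL for 'name',
	 * and registers the I_T nexus in one call.
	 */
	tport->se_sess = target_alloc_session(&tport->se_tpg, 0, 0,
					      TARGET_PROT_NORMAL, name,
					      tport, NULL);
	if (IS_ERR(tport->se_sess))
		return NULL;

	return &tport->se_tpg;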

> +static int ibmvscsis_shutdown_session(struct se_session *se_sess)
> +{
> + return 0;
> +}
> +
> +static void ibmvscsis_close_session(struct se_session *se_sess)
> +{
> +}
> +
>

These two target_core_fabric_ops callbacks have been made optional for v4.7+.

> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi_tgt.h 
> b/drivers/scsi/ibmvscsi/ibmvscsi_tgt.h
> new file mode 100644
> index 000..23e9449
> --- /dev/null
> +++ b/drivers/scsi/ibmvscsi/ibmvscsi_tgt.h



> +
> +struct client_info {
> +#define SRP_VERSION "16.a"
> + char srp_version[8];
> + /* root node property ibm,partition-name */
> + char partition_name[PARTITION_NAMELEN];
> + /* root node property ibm,partition-no */
> + u32 partition_number;
> + /* initially 1 */
> + u32 mad_version;
> + u32 os_type;
> +};
> +
> +struct ibmvscsis_cmd {
> + /* Used for libsrp processing callbacks */
> + struct scsi_cmnd sc;

AFAICT, struct scsi_cmnd is only being used for passing io memory
descriptors and struct iu_entry via sc->SCp.ptr between ibmvscsi_tgt +
libsrp.

Now with the other legacy libsrp.c Scsi_Host related bits removed, it
should be easy to convert the rest in order to drop the last vestiges of
SCSI host LLD structures from ibmvscsi_tgt code.

> + /* Used for TCM Core operations */
> + struct se_cmd se_cmd;
> + /* Sense buffer that will be mapped into outgoing status */
> + unsigned char sense_buf[TRANSPORT_SENSE_BUFFER];
> + u32 lun;
> +};
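
Building on the sc->SCp.ptr observation above, a hedged sketch of where
the descriptor could end up once struct scsi_cmnd is dropped (the 'iue'
member name is hypothetical, not something in the posted driver):

/*
 * Hypothetical shape of the descriptor without struct scsi_cmnd; the
 * 'iue' member name is invented for illustration.
 */
struct ibmvscsis_cmd {
	/* Direct reference to the libsrp IU, replacing sc->SCp.ptr */
	struct iu_entry *iue;
	/* Used for TCM Core operations */
	struct se_cmd se_cmd;
	/* Sense buffer that will be mapped into outgoing status */
	unsigned char sense_buf[TRANSPORT_SENSE_BUFFER];
	u32 lun;
};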
> +



> diff --git a/drivers/scsi/ibmvscsi/libsrp.h b/drivers/scsi/ibmvscsi/libsrp.h
> new file mode 100644
> index 000..bf9e30b
> --- /dev/null
> +++ b/drivers/scsi/ibmvscsi/libsrp.h
> @@ -0,0 +1,91 @@
> +#ifndef __LIBSRP_H__
> +#define __LIBSRP_H__
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +enum srp_task_attributes {
> + SRP_SIMPLE_TASK = 0,
> + SRP_HEAD_TASK = 1,
> + SRP_ORDERED_TASK = 2,
> + SRP_ACA_TASK = 4
> +};
> +
> +enum iue_flags {
> + V_DIOVER,
> + V_WRITE,
> + V_LINKED,
> + V_FLYING,
> +};
> +
> +enum {
> + SRP_TASK_MANAGEMENT_FUNCTION_COMPLETE   = 0,
> + SRP_REQUEST_FIELDS_INVALID  = 2,
> + SRP_TASK_MANAGEMENT_FUNCTION_NOT_SUPPORTED  = 4,
> + SRP_TASK_MANAGEMENT_FUNCTION_FAILED = 5
> +};
> +
> +struct srp_buf {
> + dma_addr_t dma;
> + void *buf;
> +};
> +
> +struct srp_queue {
> + void *pool;
> + void *items;
> + struct kfifo queue;
> + spinlock_t lock;
> +};
> +
> +struct srp_target {
> + struct Scsi_Host *shost;

Unused.

> + struct se_device *tgt;
> + struct device *dev;
> +
> + spinlock_t lock;
> + struct list_head cmd_queue;
> +
> + size_t srp_iu_size;
> + struct srp_queue iu_queue;
> + size_t rx_ring_size;
> + struct srp_buf **rx_ring;
> +
> 

[PATCH] scsi: fix race between simultaneous decrements of ->host_failed

2016-05-27 Thread Wei Fang
async_sas_ata_eh(), which may call scsi_eh_finish_cmd() in some cases,
can run simultaneously on multiple CPUs from sas_ata_strategy_handler().
In that case ->host_failed may be decremented concurrently in
scsi_eh_finish_cmd() on different CPUs and end up with a bogus value.

That leaves ->host_failed permanently unequal to ->host_busy, so the
SCSI error handler thread never starts running and subsequent SCSI
errors are never handled.

Use an atomic type for ->host_failed to fix this race.
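
To make the lost-update window concrete, here is a userspace analogy
(not kernel code, and not part of this patch): two threads decrementing
a plain counter can lose updates, while an atomic read-modify-write
cannot.  Build with "cc -pthread".

/* Userspace analogy of the ->host_failed race; illustration only. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define LOOPS 1000000L

static long plain_failed = 2 * LOOPS;		/* like a non-atomic ->host_failed */
static atomic_long atomic_failed = 2 * LOOPS;	/* like an atomic ->host_failed */

static void *eh_finish(void *unused)
{
	for (long i = 0; i < LOOPS; i++) {
		plain_failed--;				/* racy load/decrement/store */
		atomic_fetch_sub(&atomic_failed, 1);	/* atomic RMW, never loses a step */
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, eh_finish, NULL);
	pthread_create(&b, NULL, eh_finish, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/* plain_failed will often be wrong (nonzero); atomic_failed is always 0 */
	printf("plain=%ld atomic=%ld\n", plain_failed, atomic_load(&atomic_failed));
	return 0;
}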

Signed-off-by: Wei Fang 
---
 drivers/ata/libata-eh.c |2 +-
 drivers/scsi/libsas/sas_scsi_host.c |5 +++--
 drivers/scsi/scsi.c |2 +-
 drivers/scsi/scsi_error.c   |   15 +--
 drivers/scsi/scsi_lib.c |3 ++-
 include/scsi/scsi_host.h|2 +-
 6 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 961acc7..a0e7612 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -606,7 +606,7 @@ void ata_scsi_error(struct Scsi_Host *host)
ata_scsi_port_error_handler(host, ap);
 
/* finish or retry handled scmd's and clean up */
-   WARN_ON(host->host_failed || !list_empty(&eh_work_q));
+   WARN_ON(atomic_read(&host->host_failed) || !list_empty(&eh_work_q));
 
DPRINTK("EXIT\n");
 }
diff --git a/drivers/scsi/libsas/sas_scsi_host.c 
b/drivers/scsi/libsas/sas_scsi_host.c
index 519dac4..8d74003 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -757,7 +757,8 @@ retry:
spin_unlock_irq(shost->host_lock);
 
SAS_DPRINTK("Enter %s busy: %d failed: %d\n",
-   __func__, atomic_read(&shost->host_busy), shost->host_failed);
+   __func__, atomic_read(&shost->host_busy),
+   atomic_read(&shost->host_failed));
/*
 * Deal with commands that still have SAS tasks (i.e. they didn't
 * complete via the normal sas_task completion mechanism),
@@ -800,7 +801,7 @@ out:
 
SAS_DPRINTK("--- Exit %s: busy: %d failed: %d tries: %d\n",
__func__, atomic_read(&shost->host_busy),
-   shost->host_failed, tries);
+   atomic_read(&shost->host_failed), tries);
 }
 
 enum blk_eh_timer_return sas_scsi_timed_out(struct scsi_cmnd *cmd)
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 1deb6ad..7840915 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -527,7 +527,7 @@ void scsi_log_completion(struct scsi_cmnd *cmd, int 
disposition)
scmd_printk(KERN_INFO, cmd,
"scsi host busy %d failed %d\n",

atomic_read(>device->host->host_busy),
-   cmd->device->host->host_failed);
+   
atomic_read(>device->host->host_failed));
}
}
 }
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 984ddcb..f37666f 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -62,7 +62,8 @@ static int scsi_try_to_abort_cmd(struct scsi_host_template *,
 /* called with shost->host_lock held */
 void scsi_eh_wakeup(struct Scsi_Host *shost)
 {
-   if (atomic_read(&shost->host_busy) == shost->host_failed) {
+   if (atomic_read(&shost->host_busy) ==
+   atomic_read(&shost->host_failed)) {
trace_scsi_eh_wakeup(shost);
wake_up_process(shost->ehandler);
SCSI_LOG_ERROR_RECOVERY(5, shost_printk(KERN_INFO, shost,
@@ -250,7 +251,7 @@ int scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
eh_flag &= ~SCSI_EH_CANCEL_CMD;
scmd->eh_eflags |= eh_flag;
list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q);
-   shost->host_failed++;
+   atomic_inc(&shost->host_failed);
scsi_eh_wakeup(shost);
  out_unlock:
spin_unlock_irqrestore(shost->host_lock, flags);
@@ -1127,7 +1128,7 @@ static int scsi_eh_action(struct scsi_cmnd *scmd, int rtn)
  */
 void scsi_eh_finish_cmd(struct scsi_cmnd *scmd, struct list_head *done_q)
 {
-   scmd->device->host->host_failed--;
+   atomic_dec(&scmd->device->host->host_failed);
scmd->eh_eflags = 0;
list_move_tail(&scmd->eh_entry, done_q);
 }
@@ -2190,8 +2191,10 @@ int scsi_error_handler(void *data)
if (kthread_should_stop())
break;
 
-   if ((shost->host_failed == 0 && shost->host_eh_scheduled == 0) ||
-   shost->host_failed != atomic_read(&shost->host_busy)) {
+   if ((atomic_read(&shost->host_failed) == 0 &&
+shost->host_eh_scheduled == 0) ||
+   (atomic_read(&shost->host_failed) !=
+atomic_read(&shost->host_busy))) {
SCSI_LOG_ERROR_RECOVERY(1,
shost_printk(KERN_INFO, shost,
  

[GIT PULL] target updates for v4.7-rc1

2016-05-27 Thread Nicholas A. Bellinger
Hi Linus,

Here are the outstanding target pending updates for v4.7-rc1.

Please go ahead and pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git for-next

The highlights this round include:

- Allow external PR/ALUA metadata path be defined at runtime via
  top level configfs attribute. (Lee)
- Fix target session shutdown bug for ib_srpt multi-channel. (hch)
- Make TFO close_session() and shutdown_session() optional. (hch)
- Drop se_sess->sess_kref + convert tcm_qla2xxx to internal kref. (hch)
- Add tcm_qla2xxx endpoint attribute for basic FC jammer. (Laurence)
- Refactor iscsi-target RX/TX PDU encode/decode into common code. (Varun)
- Extend iscsit_transport with xmit_pdu, release_cmd, get_rx_pdu,
  validate_parameters, and get_r2t_ttt for generic ISO offload. (Varun)
- Initial merge of cxgb iscsi-segment offload target driver. (Varun)

The bulk of the changes are Chelsio's new driver, along with a number of
iscsi-target common code improvements made by Varun + Co along the way.

Thank you,

--nab

Christoph Hellwig (7):
  target: consolidate and fix session shutdown
  target: remove acl_stop
  target: make ->shutdown_session optional
  target: make close_session optional
  tcm_qla2xxx: introduce a private sess_kref
  iscsi-target: remove usage of ->shutdown_session
  target: remove sess_kref and ->shutdown_session

Colin Ian King (1):
  target: need_to_release is always false, remove redundant check and
kfree

Imran Haider (1):
  iscsi-target: graceful disconnect on invalid mapping to iovec

Laurence Oberman (1):
  tcm_qla2xxx Add SCSI command jammer/discard capability

Lee Duncan (2):
  target: make target db location configurable
  target: use new "dbroot" target attribute

Nicholas Bellinger (4):
  iscsi-target: Make iscsi_tpg_np driver show/store use generic code
  iscsi-target: Convert transport drivers to signal rdma_shutdown
  cxgbit: Use type ISCSI_CXGBIT + cxgbit tpg_np attribute
  iscsi-target: Fix early sk_data_ready LOGIN_FLAGS_READY race

Varun Prakash (13):
  iscsi-target: add int (*iscsit_xmit_pdu)()
  iscsi-target: add void (*iscsit_release_cmd)()
  iscsi-target: add void (*iscsit_get_rx_pdu)()
  iscsi-target: split iscsi_target_rx_thread()
  iscsi-target: add int (*iscsit_validate_params)()
  iscsi-target: add void (*iscsit_get_r2t_ttt)()
  iscsi-target: move iscsit_thread_check_cpumask()
  iscsi-target: use conn_transport->transport_type in text rsp
  iscsi-target: add new offload transport type
  iscsi-target: clear tx_thread_active
  iscsi-target: call complete on conn_logout_comp
  iscsi-target: export symbols
  cxgbit: add files for cxgbit.ko

 Documentation/scsi/tcm_qla2xxx.txt|   22 +
 Documentation/target/tcm_mod_builder.py   |   16 -
 drivers/infiniband/ulp/isert/ib_isert.c   |   11 +
 drivers/infiniband/ulp/srpt/ib_srpt.c |9 -
 drivers/scsi/qla2xxx/Kconfig  |9 +
 drivers/scsi/qla2xxx/qla_target.c |   56 +-
 drivers/scsi/qla2xxx/qla_target.h |4 +-
 drivers/scsi/qla2xxx/tcm_qla2xxx.c|   59 +-
 drivers/scsi/qla2xxx/tcm_qla2xxx.h|1 +
 drivers/target/iscsi/Kconfig  |2 +
 drivers/target/iscsi/Makefile |1 +
 drivers/target/iscsi/cxgbit/Kconfig   |7 +
 drivers/target/iscsi/cxgbit/Makefile  |6 +
 drivers/target/iscsi/cxgbit/cxgbit.h  |  353 
 drivers/target/iscsi/cxgbit/cxgbit_cm.c   | 2086 +
 drivers/target/iscsi/cxgbit/cxgbit_ddp.c  |  325 
 drivers/target/iscsi/cxgbit/cxgbit_lro.h  |   72 +
 drivers/target/iscsi/cxgbit/cxgbit_main.c |  702 +++
 drivers/target/iscsi/cxgbit/cxgbit_target.c   | 1561 +++
 drivers/target/iscsi/iscsi_target.c   |  701 +++
 drivers/target/iscsi/iscsi_target_configfs.c  |  158 +-
 drivers/target/iscsi/iscsi_target_datain_values.c |1 +
 drivers/target/iscsi/iscsi_target_erl0.c  |2 +-
 drivers/target/iscsi/iscsi_target_login.c |   17 +-
 drivers/target/iscsi/iscsi_target_nego.c  |   19 +-
 drivers/target/iscsi/iscsi_target_parameters.c|1 +
 drivers/target/iscsi/iscsi_target_util.c  |5 +
 drivers/target/loopback/tcm_loop.c|   12 -
 drivers/target/sbp/sbp_target.c   |   12 -
 drivers/target/target_core_alua.c |6 +-
 drivers/target/target_core_configfs.c |   70 +-
 drivers/target/target_core_internal.h |6 +
 drivers/target/target_core_pr.c   |2 +-
 drivers/target/target_core_rd.c   |4 -
 drivers/target/target_core_tpg.c  |   83 +-
 drivers/target/target_core_transport.c|   26 +-
 drivers/target/tcm_fc/tcm_fc.h|1 -
 drivers/target/tcm_fc/tfc_conf.c  |1 -
 

Re: bio-based DM multipath is back from the dead [was: Re: Notes from the four separate IO track sessions at LSF/MM]

2016-05-27 Thread Mike Snitzer
On Fri, May 27 2016 at 11:42am -0400,
Hannes Reinecke  wrote:

> On 05/27/2016 04:44 PM, Mike Snitzer wrote:
> >On Fri, May 27 2016 at  4:39am -0400,
> >Hannes Reinecke  wrote:
> >
> [ .. ]
> >>No, the real issue is load-balancing.
> >>If you have several paths you have to schedule I/O across all paths,
> >>_and_ you should be feeding these paths efficiently.
> >
> > <... detailed in the multipath paper I referenced>
> >
> >Right, as my patch header details, this is the only limitation that
> >remains with the reinstated bio-based DM multipath.
> >
> 
> :-)
> And the very reason why we went into request-based multipathing in
> the first place...
> 
> >>I was sort-of hoping that with the large bio work from Shaohua we
> >
> >I think you mean Ming Lei and his multipage biovec work?
> >
> Errm. Yeah, of course. Apologies.
> 
> >>could build bios which would not require any merging, i.e. building
> >>bios which would be assembled into a single request per bio.
> >>Then the above problem wouldn't exist anymore and we _could_ do
> >>scheduling on bio level.
> >>But from what I've gathered this is not always possible (eg for
> >>btrfs with delayed allocation).
> >
> >I doubt many people are running btrfs over multipath in production
> >but...
> >
> Hey. There is a company who does ...
> 
> >Taking a step back: reinstating bio-based DM multipath is _not_ at the
> >expense of request-based DM multipath.  As you can see I've made it so
> >that all modes (bio-based, request_fn rq-based, and blk-mq rq-based) are
> >supported by a single DM multipath target.  When the transition to
> >request-based happened it would've been wise to preserve bio-based but I
> >digress...
> >
> >So, the point is: there isn't any one-size-fits-all DM multipath queue
> >mode here.  If a storage config benefits from the request_fn IO
> >schedulers (but isn't hurt by .request_fn's queue lock, so slower
> >rotational storage?) then use queue_mode=2.  If the storage is connected
> >to a large NUMA system and there is some reason to want to use blk-mq
> >request_queue at the DM level: use queue_mode=3.  If the storage is
> >_really_ fast and doesn't care about extra IO grooming (e.g. sorting and
> >merging) then select bio-based using queue_mode=1.
> >
> >I collected some quick performance numbers against a null_blk device, on
> >a single NUMA node system, with various DM layers ontop -- the multipath
> >runs are only with a single path... fio workload is just 10 sec randread:
> >
> Which is precisely the point.
> Everything's nice and shiny with a single path, as then the above
> issue simply doesn't apply.

Heh, as you can see from the request_fn results, that wasn't the case
until very recently with all the DM multipath blk-mq advances..

But my broader thesis is that for really fast storage it is looking
increasingly likely that we don't _need_ or want to have the
multipathing layer dealing with requests.  Not unless there is some
inherent big win.  request cloning is definitely heavier than bio
cloning.

And as you can probably infer, my work to reinstate bio-based multipath
is focused precisely at the fast storage case in the hopes of avoiding
hch's threat to pull multipathing down into the NVMe over fabrics
driver.

> Things only start getting interesting if you have _several_ paths.
> So the benchmarks only prove that device-mapper doesn't add too much
> of an overhead; they don't prove that the above point has been
> addressed...

Right, but I don't really care if it is addressed by bio-based because
we have the request_fn mode that offers the legacy IO schedulers.  The
fact that request_fn multipath has been adequate for the enterprise
rotational storage arrays is somewhat surprising... the queue_lock is a
massive bottleneck.

But if bio merging (via multipage biovecs) does prove itself to be a win
for bio-based multipath for all storage (slow and fast) then yes that'll
be really good news.  Nice to have options... we can dial in the option
that is best for a specific usecase/deployment and fix what isn't doing
well.

> [ .. ]
> >>Have you found another way of addressing this problem?
> >
> >No, bio sorting/merging really isn't a problem for DM multipath to
> >solve.
> >
> >Though Jens did say (in the context of one of these dm-crypt bulk mode
> >threads) that the block core _could_ grow some additional _minimalist_
> >capability for bio merging:
> >https://www.redhat.com/archives/dm-devel/2015-November/msg00130.html
> >
> >I'd like to understand a bit more about what Jens is thinking in that
> >area because it could benefit DM thinp as well (though that is using bio
> >sorting rather than merging, introduced via commit 67324ea188).
> >
> >I'm not opposed to any line of future development -- but development
> >needs to be driven by observed limitations while testing on _real_
> >hardware.
> >
> In the end, with Ming Lei's multipage bvec work we essentially
> already moved some merging ability into the bios; during

Re: bio-based DM multipath is back from the dead [was: Re: Notes from the four separate IO track sessions at LSF/MM]

2016-05-27 Thread Hannes Reinecke

On 05/27/2016 04:44 PM, Mike Snitzer wrote:

On Fri, May 27 2016 at  4:39am -0400,
Hannes Reinecke  wrote:


[ .. ]

No, the real issue is load-balancing.
If you have several paths you have to schedule I/O across all paths,
_and_ you should be feeding these paths efficiently.




Right, as my patch header details, this is the only limitation that
remains with the reinstated bio-based DM multipath.



:-)
And the very reason why we went into request-based multipathing in the 
first place...



I was sort-of hoping that with the large bio work from Shaohua we


I think you mean Ming Lei and his multipage biovec work?


Errm. Yeah, of course. Apologies.


could build bios which would not require any merging, i.e. building
bios which would be assembled into a single request per bio.
Then the above problem wouldn't exist anymore and we _could_ do
scheduling on bio level.
But from what I've gathered this is not always possible (eg for
btrfs with delayed allocation).


I doubt many people are running btrfs over multipath in production
but...


Hey. There is a company who does ...


Taking a step back: reinstating bio-based DM multipath is _not_ at the
expense of request-based DM multipath.  As you can see I've made it so
that all modes (bio-based, request_fn rq-based, and blk-mq rq-based) are
supported by a single DM multipath target.  When the transition to
request-based happened it would've been wise to preserve bio-based but I
digress...

So, the point is: there isn't any one-size-fits-all DM multipath queue
mode here.  If a storage config benefits from the request_fn IO
schedulers (but isn't hurt by .request_fn's queue lock, so slower
rotational storage?) then use queue_mode=2.  If the storage is connected
to a large NUMA system and there is some reason to want to use blk-mq
request_queue at the DM level: use queue_mode=3.  If the storage is
_really_ fast and doesn't care about extra IO grooming (e.g. sorting and
merging) then select bio-based using queue_mode=1.

I collected some quick performance numbers against a null_blk device, on
a single NUMA node system, with various DM layers ontop -- the multipath
runs are only with a single path... fio workload is just 10 sec randread:


Which is precisely the point.
Everything's nice and shiny with a single path, as then the above issue 
simply doesn't apply.

Things only start getting interesting if you have _several_ paths.
So the benchmarks only prove that device-mapper doesn't add too much of 
an overhead; they don't prove that the above point has been addressed...


[ .. ]

Have you found another way of addressing this problem?


No, bio sorting/merging really isn't a problem for DM multipath to
solve.

Though Jens did say (in the context of one of these dm-crypt bulk mode
threads) that the block core _could_ grow some additional _minimalist_
capability for bio merging:
https://www.redhat.com/archives/dm-devel/2015-November/msg00130.html

I'd like to understand a bit more about what Jens is thinking in that
area because it could benefit DM thinp as well (though that is using bio
sorting rather than merging, introduced via commit 67324ea188).

I'm not opposed to any line of future development -- but development
needs to be driven by observed limitations while testing on _real_
hardware.

In the end, with Ming Lei's multipage bvec work we essentially already 
moved some merging ability into the bios; during bio_add_page() the 
block layer will already merge bios together.


(I'll probably be yelled at by hch for ignorance for the following, but 
nevertheless)

From my POV there are several areas of 'merging' which currently happen:
a) bio merging: combine several consecutive bios into a larger one; 
should be largely addressed by Ming Lei's multipage bvec work
b) bio sorting: reshuffle bios so that any requests on the request queue 
are ordered 'best' for the underlying hardware (ie the actual I/O 
scheduler). Not implemented for mq, and actually of questionable value 
for fast storage. One of the points I'll be testing in the very near 
future; ideally we find that it's not _that_ important (compared to the 
previous point), then we could drop it altogether for mq.
c) clustering: coalescing several consecutive pages/bvecs into a single 
SG element. Obviously only can happen if you have large enough requests.

But the only gain is reducing the number of SG elements for a request.
Again of questionable value as the request itself and the amount of data 
to transfer isn't changed. And another point of performance testing on 
my side.


So ideally we will find that b) and c) only contribute a small 
amount to the overall performance; then we could easily drop them for MQ 
and concentrate on making bio merging work well.
Then it wouldn't really matter whether we were doing bio-based or 
request-based multipathing, as we'd have a 1:1 relationship, and this entire 
discussion could go away.


Well. Or that's the hope, at least.

Cheers,

Hannes
--
Dr. Hannes Reinecke  

Re: bio-based DM multipath is back from the dead [was: Re: Notes from the four separate IO track sessions at LSF/MM]

2016-05-27 Thread Mike Snitzer
On Fri, May 27 2016 at  4:39am -0400,
Hannes Reinecke  wrote:

> On 05/26/2016 04:38 AM, Mike Snitzer wrote:
> >On Thu, Apr 28 2016 at 11:40am -0400,
> >James Bottomley  wrote:
> >
> >>On Thu, 2016-04-28 at 08:11 -0400, Mike Snitzer wrote:
> >>>Full disclosure: I'll be looking at reinstating bio-based DM multipath to
> >>>regain efficiencies that now really matter when issuing IO to extremely
> >>>fast devices (e.g. NVMe).  bio cloning is now very cheap (due to
> >>>immutable biovecs), coupled with the emerging multipage biovec work that
> >>>will help construct larger bios, so I think it is worth pursuing to at
> >>>least keep our options open.
> >
> >Please see the 4 topmost commits I've published here:
> >https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.8
> >
> >All request-based DM multipath support/advances have been completly
> >preserved.  I've just made it so that we can now have bio-based DM
> >multipath too.
> >
> >All of the various modes have been tested using mptest:
> >https://github.com/snitm/mptest
> >
> >>OK, but remember the reason we moved from bio to request was partly to
> >>be nearer to the device but also because at that time requests were
> >>accumulations of bios which had to be broken out, go back up the stack
> >>individually and be re-elevated, which adds to the inefficiency.  In
> >>theory the bio splitting work will mean that we only have one or two
> >>split bios per request (because they were constructed from a split up
> >>huge bio), but when we send them back to the top to be reconstructed as
> >>requests there's no guarantee that the split will be correct a second
> >>time around and we might end up resplitting the already split bios.  If
> >>you do reassembly into the huge bio again before resend down the next
> >>queue, that's starting to look like quite a lot of work as well.
> >
> >I've not even delved into the level you're laser-focused on here.
> >But I'm struggling to grasp why multipath is any different than any
> >other bio-based device...
> >
> Actually, _failover_ is not the primary concern. This is on a
> (relatively) slow path so any performance degradation during failover
> is acceptable.
> 
> No, the real issue is load-balancing.
> If you have several paths you have to schedule I/O across all paths,
> _and_ you should be feeding these paths efficiently.



Right, as my patch header details, this is the only limitation that
remains with the reinstated bio-based DM multipath.

> I was sort-of hoping that with the large bio work from Shaohua we

I think you mean Ming Lei and his multipage biovec work?

> could build bios which would not require any merging, i.e. building
> bios which would be assembled into a single request per bio.
> Then the above problem wouldn't exist anymore and we _could_ do
> scheduling on bio level.
> But from what I've gathered this is not always possible (eg for
> btrfs with delayed allocation).

I doubt many people are running btrfs over multipath in production
but...

Taking a step back: reinstating bio-based DM multipath is _not_ at the
expense of request-based DM multipath.  As you can see I've made it so
that all modes (bio-based, request_fn rq-based, and blk-mq rq-based) are
supported by a single DM multipath target.  When the transition to
request-based happened it would've been wise to preserve bio-based but I
digress...

So, the point is: there isn't any one-size-fits-all DM multipath queue
mode here.  If a storage config benefits from the request_fn IO
schedulers (but isn't hurt by .request_fn's queue lock, so slower
rotational storage?) then use queue_mode=2.  If the storage is connected
to a large NUMA system and there is some reason to want to use blk-mq
request_queue at the DM level: use queue_mode=3.  If the storage is
_really_ fast and doesn't care about extra IO grooming (e.g. sorting and
merging) then select bio-based using queue_mode=1.

I collected some quick performance numbers against a null_blk device, on
a single NUMA node system, with various DM layers ontop -- the multipath
runs are only with a single path... fio workload is just 10 sec randread:

FIO_QUEUE_DEPTH=32
FIO_RUNTIME=10
FIO_NUMJOBS=12
${FIO} --numa_cpu_nodes=${NID} --numa_mem_policy=bind:${NID} \
	--cpus_allowed_policy=split --group_reporting --rw=randread --bs=4k \
	--numjobs=${FIO_NUMJOBS} --iodepth=${FIO_QUEUE_DEPTH} --runtime=${FIO_RUNTIME} \
	--time_based --loops=1 --ioengine=libaio --direct=1 --invalidate=1 \
	--randrepeat=1 --norandommap --exitall --name task_${TASK_NAME} --filename=${DEVICE}"

I need real hardware (NVMe over Fabrics please!) to really test this
stuff properly; but I think the following results at least approximate
the relative performance of each multipath mode.

On null_blk blk-mq
--

baseline:
null_blk blk-mq   iops=1936.3K
dm-linear iops=1616.1K

multipath using round-robin 

[PATCH v3] ibmvscsis: Initial commit of IBM VSCSI Tgt Driver

2016-05-27 Thread Bryant G. Ly
Version 1:
This initial commit contains a WIP of the IBM VSCSI Target Fabric
Module. It currently supports reads/writes, and I have tested
the ability to create a file backstore with the driver, install
RHEL, and then boot up the partition via the fileio backstore through
the driver.

This driver is a pick up of the old IBM VIO scsi Target Driver
that was started by Nick and Fujita 2-4 years ago.
http://comments.gmane.org/gmane.linux.scsi/90119
and http://marc.info/?t=12973408564=1=2

The driver provides a virtual SCSI device on IBM Power Servers.

When reviving the old libsrp, I stripped out everything that utilized
scsi to submit commands to the target. Hence there are no more
scsi_tgt_if_* and scsi_transport_* files, and the driver fully utilizes
LIO instead. This driver does however use the SRP protocol
for communication between guests and/or hosts, but it's all
synchronous data transfers due to the utilization of
H_COPY_RDMA, a VIO mechanism, which means that others like
ib_srp and ib_srpt, which are asynchronous, can't use this driver.
This was also the reason for moving libsrp out of
drivers/scsi/ and into the ibmvscsi folder.

Signed-off-by: Steven Royer 
Signed-off-by: Tyrel Datwyler 
Signed-off-by: Bryant G. Ly 
---
 MAINTAINERS  |   10 +
 drivers/scsi/Kconfig |   27 +-
 drivers/scsi/Makefile|2 +-
 drivers/scsi/ibmvscsi/Makefile   |4 +
 drivers/scsi/ibmvscsi/ibmvscsi_tgt.c | 1932 ++
 drivers/scsi/ibmvscsi/ibmvscsi_tgt.h |  169 +++
 drivers/scsi/ibmvscsi/libsrp.c   |  386 +++
 drivers/scsi/ibmvscsi/libsrp.h   |   91 ++
 drivers/scsi/libsrp.c|  447 
 include/scsi/libsrp.h|   78 --
 10 files changed, 2610 insertions(+), 536 deletions(-)
 create mode 100644 drivers/scsi/ibmvscsi/ibmvscsi_tgt.c
 create mode 100644 drivers/scsi/ibmvscsi/ibmvscsi_tgt.h
 create mode 100644 drivers/scsi/ibmvscsi/libsrp.c
 create mode 100644 drivers/scsi/ibmvscsi/libsrp.h
 delete mode 100644 drivers/scsi/libsrp.c
 delete mode 100644 include/scsi/libsrp.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 9c567a4..0f412d4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5451,6 +5451,16 @@ S:   Supported
 F: drivers/scsi/ibmvscsi/ibmvscsi*
 F: drivers/scsi/ibmvscsi/viosrp.h
 
+IBM Power Virtual SCSI Device Target Driver
+M: Bryant G. Ly 
+L: linux-scsi@vger.kernel.org
+L: target-de...@vger.kernel.org
+S: Supported
+F: drivers/scsi/ibmvscsi/ibmvscsis.c
+F: drivers/scsi/ibmvscsi/ibmvscsis.h
+F: drivers/scsi/ibmvscsi/libsrp.c
+F: drivers/scsi/ibmvscsi/libsrp.h
+
 IBM Power Virtual FC Device Drivers
 M: Tyrel Datwyler 
 L: linux-scsi@vger.kernel.org
diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index c5883a5..0f8a1de 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -848,6 +848,23 @@ config SCSI_IBMVSCSI
  To compile this driver as a module, choose M here: the
  module will be called ibmvscsi.
 
+config SCSI_IBMVSCSIS
+   tristate "IBM Virtual SCSI Server support"
+   depends on PPC_PSERIES && TARGET_CORE && SCSI && PCI
+   help
+ This is the IBM POWER Virtual SCSI Target Server
+ This driver uses the SRP protocol for communication between the
+ guest and/or the host that run on the same server.
+ More information on VSCSI protocol can be found at www.power.org
+
+ The userspace configuration needed to initialize the driver can
+ be found here:
+
+ https://github.com/powervm/ibmvscsis/wiki/Configuration
+
+ To compile this driver as a module, choose M here: the
+ module will be called ibmvstgt.
+
 config SCSI_IBMVFC
tristate "IBM Virtual FC support"
depends on PPC_PSERIES && SCSI
@@ -1729,16 +1746,6 @@ config SCSI_PM8001
  This driver supports PMC-Sierra PCIE SAS/SATA 8x6G SPC 8001 chip
  based host adapters.
 
-config SCSI_SRP
-   tristate "SCSI RDMA Protocol helper library"
-   depends on SCSI && PCI
-   select SCSI_TGT
-   help
- If you wish to use SRP target drivers, say Y.
-
- To compile this driver as a module, choose M here: the
- module will be called libsrp.
-
 config SCSI_BFA_FC
tristate "Brocade BFA Fibre Channel Support"
depends on PCI && SCSI
diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index 0335d28..ea2030c 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -127,8 +127,8 @@ obj-$(CONFIG_SCSI_LASI700)  += 53c700.o lasi700.o
 obj-$(CONFIG_SCSI_SNI_53C710)  += 53c700.o sni_53c710.o
 obj-$(CONFIG_SCSI_NSP32)   += nsp32.o
 obj-$(CONFIG_SCSI_IPR) += ipr.o
-obj-$(CONFIG_SCSI_SRP) += libsrp.o
 obj-$(CONFIG_SCSI_IBMVSCSI)+= ibmvscsi/

Re: iscsi fails to attach to targets with kernel 4.4.0

2016-05-27 Thread Oleg Borisenko
I'm facing the same problem, attaching additional log (maybe it helps)

May 27 07:47:05 oldcompute02 kernel: [ 1168.239065] [ cut here ]
May 27 07:47:05 oldcompute02 kernel: [ 1168.239077] WARNING: CPU: 24 PID: 12116 at /build/linux-lts-xenial-7RlTta/linux-lts-xenial-4.4.0/fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80()
May 27 07:47:05 oldcompute02 kernel: [ 1168.239080] sysfs: cannot create duplicate filename '/bus/scsi/devices/11:0:0:0'
May 27 07:47:05 oldcompute02 kernel: [ 1168.239082] Modules linked in: vhost_net vhost macvtap macvlan veth xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables scsi_dh_alua dm_round_robin ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_multipath vport_vxlan openvswitch nf_defrag_ipv6 nf_conntrack bonding binfmt_misc x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif ipmi_devintf irqbypass crct10dif_pclmul crc32_pclmul aesni_intel dcdbas input_leds joydev cdc_ether usbnet mii aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd acpi_power_meter acpi_pad ipmi_si ipmi_msghandler 8250_fintek mei_me mei wmi sb_edac shpchp edac_core lpc_ich mac_hid lp parport hid_generic bnx2x vxlan ip6_udp_tunnel ahci usbhid udp_tunnel ptp hid libahci pps_core megaraid_sas mdio libcrc32c fjes
May 27 07:47:05 oldcompute02 kernel: [ 1168.239182] CPU: 24 PID: 12116 Comm: iscsiadm Tainted: GW 4.4.0-22-generic #40~14.04.1-Ubuntu
May 27 07:47:05 oldcompute02 kernel: [ 1168.239185] Hardware name: Dell Inc. PowerEdge M620/06GTV1, BIOS 2.5.4 01/27/2016
May 27 07:47:05 oldcompute02 kernel: [ 1168.239188]  881fe34eb9e8 813cde6c 881fe34eba30
May 27 07:47:05 oldcompute02 kernel: [ 1168.239192] 81ada130 881fe34eba20  8107d856 881fe5c2d000
May 27 07:47:05 oldcompute02 kernel: [ 1168.239196]  881fef80b350 883ff1802870 0001 ffef
May 27 07:47:05 oldcompute02 kernel: [ 1168.239200] Call Trace:
May 27 07:47:05 oldcompute02 kernel: [ 1168.239209] [] dump_stack+0x63/0x87
May 27 07:47:05 oldcompute02 kernel: [ 1168.239216] [] warn_slowpath_common+0x86/0xc0
May 27 07:47:05 oldcompute02 kernel: [ 1168.239221] [] warn_slowpath_fmt+0x4c/0x50
May 27 07:47:05 oldcompute02 kernel: [ 1168.239225] [] sysfs_warn_dup+0x64/0x80
May 27 07:47:05 oldcompute02 kernel: [ 1168.239230] [] sysfs_do_create_link_sd.isra.2+0xaa/0xb0
May 27 07:47:05 oldcompute02 kernel: [ 1168.239234] [] sysfs_create_link+0x25/0x40
May 27 07:47:05 oldcompute02 kernel: [ 1168.239242] [] bus_add_device+0x10b/0x1f0
May 27 07:47:05 oldcompute02 kernel: [ 1168.239247] [] device_add+0x3b5/0x610
May 27 07:47:05 oldcompute02 kernel: [ 1168.239252] [] scsi_sysfs_add_sdev+0xa5/0x290
May 27 07:47:05 oldcompute02 kernel: [ 1168.239258] [] scsi_probe_and_add_lun+0xb65/0xd80
May 27 07:47:05 oldcompute02 kernel: [ 1168.239266] [] ? __pm_runtime_resume+0x5c/0x70
May 27 07:47:05 oldcompute02 kernel: [ 1168.239271] [] __scsi_scan_target+0xd2/0x230
May 27 07:47:05 oldcompute02 kernel: [ 1168.239276] [] scsi_scan_target+0xd7/0xf0
May 27 07:47:05 oldcompute02 kernel: [ 1168.239293] [] iscsi_user_scan_session.part.14+0x105/0x140 [scsi_transport_iscsi]
May 27 07:47:05 oldcompute02 kernel: [ 1168.239303] [] ? iscsi_user_scan_session.part.14+0x140/0x140 [scsi_transport_iscsi]
May 27 07:47:05 oldcompute02 kernel: [ 1168.239313] [] iscsi_user_scan_session+0x1e/0x30 [scsi_transport_iscsi]
May 27 07:47:05 oldcompute02 kernel: [ 1168.239318] [] device_for_each_child+0x43/0x70
May 27 07:47:05 oldcompute02 kernel: [ 1168.239328] [] iscsi_user_scan+0x2e/0x30 [scsi_transport_iscsi]
May 27 07:47:05 oldcompute02 kernel: [ 1168.239331] [] store_scan+0xa6/0x100
May 27 07:47:05 oldcompute02 kernel: [ 1168.239338] [] ? calibrate_delay+0x5b9/0x607
May 27 07:47:05 oldcompute02 kernel: [ 1168.239346] [] ? __kmalloc+0x1d6/0x270
May 27 07:47:05 oldcompute02 kernel: [ 1168.239350] [] dev_attr_store+0x18/0x30
May 27 07:47:05 oldcompute02 kernel: [ 1168.239354] [] sysfs_kf_write+0x3a/0x50
May 27 07:47:05 oldcompute02 kernel: [ 1168.239358] [] kernfs_fop_write+0x120/0x170
May 27 07:47:05 oldcompute02 kernel: [ 1168.239364] [] __vfs_write+0x18/0x40
May 27 07:47:05 oldcompute02 kernel: [ 1168.239367] [] vfs_write+0xa2/0x1a0
May 27 07:47:05 oldcompute02 kernel: [ 1168.239371] [] ? do_sys_open+0x1b8/0x270
May 27 07:47:05 oldcompute02 kernel: [ 1168.239375] [] SyS_write+0x46/0xa0
May 27 07:47:05 oldcompute02 kernel: [ 1168.239382] [] entry_SYSCALL_64_fastpath+0x16/0x75
May 27 07:47:05 oldcompute02 kernel: [ 1168.239386] ---[ end trace a628d295136d7a5b ]---
May 27 07:47:05 oldcompute02 kernel: [ 1168.239432] scsi 11:0:0:0: 

Re: BLKZEROOUT not zeroing md dev on VMDK

2016-05-27 Thread Tom Yan
There seems to be some sort of race condition between
blkdev_issue_zeroout() and the scsi disk driver (which disables write
same after an illegal request). On my UAS drive, `blkdiscard -z
/dev/sdX` will sometimes return right away, even though if I then check
`write_same_max_bytes` it has turned 0. Sometimes it will just write
zeros with SCSI WRITE even if `write_same_max_bytes` is 33553920 before
I issue `blkdiscard -z` (`write_same_max_bytes` also turns 0, as
expected).

Not sure if it is directly related to the case here though.
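
(For anyone wanting to reproduce this at the ioctl level: `blkdiscard -z`
ultimately issues the BLKZEROOUT ioctl. Below is a minimal standalone
version; the device argument is assumed to be a scratch disk whose data
can be destroyed.)

/* Minimal BLKZEROOUT reproducer: zeroes the first 1 MiB of a scratch device. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	uint64_t range[2] = { 0, 1024 * 1024 };	/* byte offset, byte length */
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s /dev/XXX (data will be destroyed)\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, BLKZEROOUT, range))
		perror("BLKZEROOUT");	/* the md case discussed here reports success yet zeroes nothing */
	close(fd);
	return 0;
}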

On 27 May 2016 at 12:45, Sitsofe Wheeler  wrote:
> On 27 May 2016 at 05:18, Darrick J. Wong  wrote:
>>
>> It's possible that the pvscsi device advertised WRITE SAME, but if the device
>> sends back ILLEGAL REQUEST then the SCSI disk driver will set
>> write_same_max_bytes=0.  Subsequent BLKZEROOUT attempts will then issue 
>> writes
>> of zeroes to the drive.
>
> Thanks for following up on this but that's not what happens on the md
> device - you can go on to issue as many BLKZEROOUT requests as you
> like but the md disk is never zeroed nor is an error returned.
>
> I filed a bug at https://bugzilla.kernel.org/show_bug.cgi?id=118581
> (see https://bugzilla.kernel.org/show_bug.cgi?id=118581#c6 for
> alternative reproduction steps that use scsi_debug and can be reworked
> to impact device mapper) and Shaohua Li noted that
> blkdev_issue_write_same could return 0 even when the disk didn't
> support write same (see
> https://bugzilla.kernel.org/show_bug.cgi?id=118581#c8 ).
>
> Shaohua went on to create a patch for this ("block: correctly fallback
> for zeroout" - https://patchwork.kernel.org/patch/9137311/ ) which has
> yet to be reviewed.
>
> --
> Sitsofe | http://sucs.org/~sits/
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] scsi: add new async device reset support

2016-05-27 Thread Hannes Reinecke

On 05/27/2016 10:23 AM, Christoph Hellwig wrote:

Adding Hannes to the Cc list as he's been looking into EH improvements
in this area.

On Wed, May 25, 2016 at 02:55:02AM -0500, mchri...@redhat.com wrote:

From: Mike Christie 

Currently, if the SCSI EH runs, then before we do a LUN_RESET
we stop the host. This patch and the block layer one before it
begin to add infrastructure to be able to do a LUN_RESET and
eventually do a transport-level recovery without having to stop the
host.

For LUN reset, this patch adds a new callout, eh_async_device_reset_handler,
which works similar to how LLDs handle SG_SCSI_RESET_DEVICE where the
LLD manages the commands that are affected.

eh_async_device_reset_handler:

The LLD should perform a LUN RESET that affects all commands
that have been accepted by its queuecommand callout for the
device passed in to the callout. While the reset handler is running,
queuecommand will not be running or called for the device.

Unlike eh_device_reset_handler, queuecommand may still be
called for other devices, and the LLD must call scsi_done for the
commands that have been affected by the reset.

If SUCCESS or FAST_IO_FAIL is returned, the scsi_cmnds cleaned up
must be failed with DID_ABORT.

Signed-off-by: Mike Christie 
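
For illustration only, a rough sketch of how an LLD might wire up the
proposed callout. The mydrv_* types and helpers are invented (mydrv_cmd
is assumed to carry a list_head and a scsi_cmnd pointer); real locking
and completion plumbing would be driver specific:

/* Purely hypothetical LLD glue for the proposed callout; mydrv_* is invented. */
static int mydrv_eh_async_device_reset_handler(struct scsi_device *sdev)
{
	struct mydrv_host *mh = shost_priv(sdev->host);
	struct mydrv_cmd *mc, *tmp;
	LIST_HEAD(aborted);

	/* Ask the hardware to reset the logical unit. */
	if (mydrv_send_lun_reset(mh, sdev->lun))
		return FAILED;

	/* Collect every command queuecommand had accepted for this device. */
	spin_lock_irq(&mh->lock);
	list_for_each_entry_safe(mc, tmp, &mh->active_cmds, list)
		if (mc->scmd->device == sdev)
			list_move_tail(&mc->list, &aborted);
	spin_unlock_irq(&mh->lock);

	/* Per the description above, fail the affected commands with DID_ABORT. */
	list_for_each_entry_safe(mc, tmp, &aborted, list) {
		mc->scmd->result = DID_ABORT << 16;
		mc->scmd->scsi_done(mc->scmd);
	}
	return SUCCESS;
}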

In general I like the approach.
I'll be looking into it more closely next week.

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bio-based DM multipath is back from the dead [was: Re: Notes from the four separate IO track sessions at LSF/MM]

2016-05-27 Thread Hannes Reinecke

On 05/26/2016 04:38 AM, Mike Snitzer wrote:

On Thu, Apr 28 2016 at 11:40am -0400,
James Bottomley  wrote:


On Thu, 2016-04-28 at 08:11 -0400, Mike Snitzer wrote:

Full disclosure: I'll be looking at reinstating bio-based DM multipath to
regain efficiencies that now really matter when issuing IO to extremely
fast devices (e.g. NVMe).  bio cloning is now very cheap (due to
immutable biovecs), coupled with the emerging multipage biovec work that
will help construct larger bios, so I think it is worth pursuing to at
least keep our options open.


Please see the 4 topmost commits I've published here:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.8

All request-based DM multipath support/advances have been completly
preserved.  I've just made it so that we can now have bio-based DM
multipath too.

All of the various modes have been tested using mptest:
https://github.com/snitm/mptest


OK, but remember the reason we moved from bio to request was partly to
be nearer to the device but also because at that time requests were
accumulations of bios which had to be broken out, go back up the stack
individually and be re-elevated, which adds to the inefficiency.  In
theory the bio splitting work will mean that we only have one or two
split bios per request (because they were constructed from a split up
huge bio), but when we send them back to the top to be reconstructed as
requests there's no guarantee that the split will be correct a second
time around and we might end up resplitting the already split bios.  If
you do reassembly into the huge bio again before resend down the next
queue, that's starting to look like quite a lot of work as well.


I've not even delved into the level you're laser-focused on here.
But I'm struggling to grasp why multipath is any different than any
other bio-based device...

Actually, _failover_ is not the primary concern. This is on a (relatively) 
slow path so any performance degradation during failover is acceptable.


No, the real issue is load-balancing.
If you have several paths you have to schedule I/O across all paths, 
_and_ you should be feeding these paths efficiently.


With the original (bio-based) layout you had to schedule on the bio 
level, causing the requests to be inefficiently assembled.
Hence the 'rr_min_io' parameter, which switches paths after 
rr_min_io _bios_. I did some experimenting a while back (I even had a 
presentation at LSF at one point ...), and found that you would get a 
performance degradation once the rr_min_io parameter went below 100.
But this means that paths will be switched after every 100 bios, 
irrespective of into how many requests they'll be assembled.
It also means that we have a rather 'choppy' load-balancing behaviour, 
and cannot achieve 'true' load balancing as the I/O scheduler on the bio 
level doesn't have any idea when a new request will be assembled.


I was sort-of hoping that with the large bio work from Shaohua we could 
build bios which would not require any merging, i.e. building bios which 
would be assembled into a single request per bio.
Then the above problem wouldn't exist anymore and we _could_ do 
scheduling on bio level.
But from what I've gathered this is not always possible (eg for btrfs 
with delayed allocation).


Have you found another way of addressing this problem?

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] scsi: add new async device reset support

2016-05-27 Thread Christoph Hellwig
Adding Hannes to the Cc list as he's been looking into EH improvements
in this area.

On Wed, May 25, 2016 at 02:55:02AM -0500, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> Currently, if the SCSI EH runs, then before we do a LUN_RESET
> we stop the host. This patch and the block layer one before it
> begin to add infrastructure to be able to do a LUN_RESET and
> eventually do a transport-level recovery without having to stop the
> host.
> 
> For LUN reset, this patch adds a new callout, eh_async_device_reset_handler,
> which works similar to how LLDs handle SG_SCSI_RESET_DEVICE where the
> LLD manages the commands that are affected.
> 
> eh_async_device_reset_handler:
> 
> The LLD should perform a LUN RESET that affects all commands
> that have been accepted by its queuecommand callout for the
> device passed in to the callout. While the reset handler is running,
> queuecommand will not be running or called for the device.
> 
> Unlike eh_device_reset_handler, queuecommand may still be
> called for other devices, and the LLD must call scsi_done for the
> commands that have been affected by the reset.
> 
> If SUCCESS or FAST_IO_FAIL is returned, the scsi_cmnds cleaned up
> must be failed with DID_ABORT.
> 
> Signed-off-by: Mike Christie 
> ---
>  drivers/scsi/scsi_error.c | 31 ---
>  drivers/scsi/scsi_lib.c   |  6 ++
>  drivers/scsi/scsi_priv.h  |  1 +
>  include/scsi/scsi_host.h  | 17 +
>  4 files changed, 52 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 984ddcb..cec2dfb 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -853,16 +853,41 @@ static int scsi_try_bus_device_reset(struct scsi_cmnd 
> *scmd)
>  {
>   int rtn;
>   struct scsi_host_template *hostt = scmd->device->host->hostt;
> + struct scsi_device *sdev = scmd->device;
>  
> - if (!hostt->eh_device_reset_handler)
> + if (!hostt->eh_device_reset_handler &&
> + !hostt->eh_async_device_reset_handler)
>   return FAILED;
>  
> - rtn = hostt->eh_device_reset_handler(scmd);
> + if (hostt->eh_device_reset_handler) {
> + rtn = hostt->eh_device_reset_handler(scmd);
> + } else {
> + if (!blk_reset_queue(sdev->request_queue))
> + rtn = SUCCESS;
> + else
> + rtn = FAILED;
> + }
>   if (rtn == SUCCESS)
> - __scsi_report_device_reset(scmd->device, NULL);
> + __scsi_report_device_reset(sdev, NULL);
>   return rtn;
>  }
>  
> +enum blk_eh_timer_return scsi_reset_queue(struct request_queue *q)
> +{
> + struct scsi_device *sdev = q->queuedata;
> + struct scsi_host_template *hostt = sdev->host->hostt;
> + int rtn;
> +
> + if (!hostt->eh_async_device_reset_handler)
> + return -EOPNOTSUPP;
> +
> + rtn = hostt->eh_async_device_reset_handler(sdev);
> + if (rtn == SUCCESS || rtn == FAST_IO_FAIL)
> + return BLK_EH_HANDLED;
> +
> + return BLK_EH_NOT_HANDLED;
> +}
> +
>  /**
>   * scsi_try_to_abort_cmd - Ask host to abort a SCSI command
>   * @hostt:   SCSI driver host template
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 8106515..11374dd 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -779,6 +779,10 @@ static int __scsi_error_from_host_byte(struct scsi_cmnd 
> *cmd, int result)
>   set_host_byte(cmd, DID_OK);
>   error = -ENODATA;
>   break;
> + case DID_ABORT:
> + set_host_byte(cmd, DID_OK);
> + error = -EINTR;
> + break;
>   default:
>   error = -EIO;
>   break;
> @@ -2159,6 +2163,7 @@ struct request_queue *scsi_alloc_queue(struct 
> scsi_device *sdev)
>   blk_queue_softirq_done(q, scsi_softirq_done);
>   blk_queue_rq_timed_out(q, scsi_times_out);
>   blk_queue_lld_busy(q, scsi_lld_busy);
> + blk_queue_reset(q, scsi_reset_queue);
>   return q;
>  }
>  
> @@ -2167,6 +2172,7 @@ static struct blk_mq_ops scsi_mq_ops = {
>   .queue_rq   = scsi_queue_rq,
>   .complete   = scsi_softirq_done,
>   .timeout= scsi_timeout,
> + .reset  = scsi_reset_queue,
>   .init_request   = scsi_init_request,
>   .exit_request   = scsi_exit_request,
>  };
> diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
> index 27b4d0a..2e03168 100644
> --- a/drivers/scsi/scsi_priv.h
> +++ b/drivers/scsi/scsi_priv.h
> @@ -67,6 +67,7 @@ extern void scsi_exit_devinfo(void);
>  
>  /* scsi_error.c */
>  extern void scmd_eh_abort_handler(struct work_struct *work);
> +extern enum blk_eh_timer_return scsi_reset_queue(struct request_queue *q);
>  extern enum blk_eh_timer_return scsi_times_out(struct request *req);
>  extern int scsi_error_handler(void *host);
>  extern int 

Re: [PATCH 3/5] target: call queue reset if supported

2016-05-27 Thread Christoph Hellwig
On Wed, May 25, 2016 at 02:55:01AM -0500, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> Instead of waiting for commands, map the lun reset operation
> to a queue reset.
> 
> We do not check the result of blk_reset_queue because if
> it works then we need to wait for the bio/request completions,
> and if it failed we might as well wait and hope like we did before.

Please don't call into block device code from the file handler.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] mpt3sas - set num_phys after allocating phy[] space

2016-05-27 Thread Sreekanth Reddy
Hi,

This patch looks good. Please consider this patch as Ack-by: Sreekanth Reddy


Thanks,
Sreekanth

On Thu, May 26, 2016 at 12:44 AM, Joe Lawrence  wrote:
> In _scsih_sas_host_add, the number of HBA phys are determined and then
> later used to allocate an array of struct  _sas_phy's.  If the routine
> sets ioc->sas_hba.num_phys, but then fails to allocate the
> ioc->sas_hba.phy array (by kcalloc error or other intermediate
> error/exit path), ioc->sas_hba is left in a dangerous state: all readers
> of ioc->sas_hba.phy[] do so by indexing it from 0..ioc->sas_hba.num_phys
> without checking that the space was ever allocated.
>
> Modify _scsih_sas_host_add to set ioc->sas_hba.num_phys only after
> successfully allocating ioc->sas_hba.phy[].
>
> Signed-off-by: Joe Lawrence 
> ---
>  drivers/scsi/mpt3sas/mpt3sas_scsih.c | 20 +++-
>  1 file changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c 
> b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> index f2139e5604a3..6e36d20c9e0b 100644
> --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> @@ -4893,13 +4893,22 @@ _scsih_sas_host_add(struct MPT3SAS_ADAPTER *ioc)
> u16 ioc_status;
> u16 sz;
> u8 device_missing_delay;
> +   u8 num_phys;
>
> -   mpt3sas_config_get_number_hba_phys(ioc, &ioc->sas_hba.num_phys);
> -   if (!ioc->sas_hba.num_phys) {
> +   mpt3sas_config_get_number_hba_phys(ioc, &num_phys);
> +   if (!num_phys) {
> pr_err(MPT3SAS_FMT "failure at %s:%d/%s()!\n",
> ioc->name, __FILE__, __LINE__, __func__);
> return;
> }
> +   ioc->sas_hba.phy = kcalloc(num_phys,
> +   sizeof(struct _sas_phy), GFP_KERNEL);
> +   if (!ioc->sas_hba.phy) {
> +   pr_err(MPT3SAS_FMT "failure at %s:%d/%s()!\n",
> +   ioc->name, __FILE__, __LINE__, __func__);
> +   goto out;
> +   }
> +   ioc->sas_hba.num_phys = num_phys;
>
> /* sas_iounit page 0 */
> sz = offsetof(Mpi2SasIOUnitPage0_t, PhyData) + (ioc->sas_hba.num_phys 
> *
> @@ -4959,13 +4968,6 @@ _scsih_sas_host_add(struct MPT3SAS_ADAPTER *ioc)
> MPI2_SASIOUNIT1_REPORT_MISSING_TIMEOUT_MASK;
>
> ioc->sas_hba.parent_dev = &ioc->shost->shost_gendev;
> -   ioc->sas_hba.phy = kcalloc(ioc->sas_hba.num_phys,
> -   sizeof(struct _sas_phy), GFP_KERNEL);
> -   if (!ioc->sas_hba.phy) {
> -   pr_err(MPT3SAS_FMT "failure at %s:%d/%s()!\n",
> -   ioc->name, __FILE__, __LINE__, __func__);
> -   goto out;
> -   }
> for (i = 0; i < ioc->sas_hba.num_phys ; i++) {
> if ((mpt3sas_config_get_phy_pg0(ioc, &mpi_reply, &phy_pg0,
> i))) {
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] mpt3sas - avoid mpt3sas_transport_port_add NULL parent_dev

2016-05-27 Thread Sreekanth Reddy
Hi,

This patch looks good. Please consider this patch as Ack-by: Sreekanth Reddy


Thanks,
Sreekanth


On Thu, May 26, 2016 at 12:44 AM, Joe Lawrence  wrote:
> If _scsih_sas_host_add's call to mpt3sas_config_get_sas_iounit_pg0
> fails, ioc->sas_hba.parent_dev may be left uninitialized.  A later
> device probe could invoke mpt3sas_transport_port_add which will call
> sas_port_alloc_num [scsi_transport_sas] with a NULL parent_dev pointer.
>
> Signed-off-by: Joe Lawrence 
> ---
>  drivers/scsi/mpt3sas/mpt3sas_transport.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/scsi/mpt3sas/mpt3sas_transport.c 
> b/drivers/scsi/mpt3sas/mpt3sas_transport.c
> index 6a84b82d71bb..ff93286bc32f 100644
> --- a/drivers/scsi/mpt3sas/mpt3sas_transport.c
> +++ b/drivers/scsi/mpt3sas/mpt3sas_transport.c
> @@ -705,6 +705,11 @@ mpt3sas_transport_port_add(struct MPT3SAS_ADAPTER *ioc, 
> u16 handle,
> goto out_fail;
> }
>
> +   if (!sas_node->parent_dev) {
> +   pr_err(MPT3SAS_FMT "failure at %s:%d/%s()!\n",
> +   ioc->name, __FILE__, __LINE__, __func__);
> +   goto out_fail;
> +   }
> port = sas_port_alloc_num(sas_node->parent_dev);
> if ((sas_port_add(port))) {
> pr_err(MPT3SAS_FMT "failure at %s:%d/%s()!\n",
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html