Re: [PATCH] lpfc: Avoid to disable pci_dev twice

2014-09-09 Thread Mike Qiu

On 08/28/2014 02:34 AM, James Smart wrote:

Mike,

Can you confirm - the "nulls" this patch corrects are because the 
probe_one and error_detect threads are running concurrently, thus 
battling?


If so - this fix looks insufficient and we should rework it.


Yes, it is. My patch is just a workaround for this bug.



Q: why are they allowed to run concurrently? I could see this solved 
at the platform level by letting probe_one finish before error_detect is 
called (and therefore stating that error_detect only makes sense to call if 
probe_one was successful). It's also a much more driver-friendly solution. 
I could see other drivers having much the same issue with concurrency 
and data structure teardown - and if locks aren't allowed in the 
error-detect path... it's not good.




I agree with you on this point; a platform-level solution is much better. So 
maybe use a lock or a flag to show it is in such a state; this may also 
happen when the driver is in the remove state.


Thanks,
Mike

-- james s



On 7/31/2014 10:16 PM, Mike Qiu wrote:

On 07/17/2014 02:32 PM, Mike Qiu wrote:


Hi, all

How about this patch?

Any ideas?


On IBM Power servers, when a hardware error occurs during the probe
stage, the EEH subsystem will call the driver's error_detected interface,
which will call pci_disable_device(). But the driver's probe function also
calls pci_disable_device() in this situation.

So the pci_dev will be disabled twice:

Device lpfc disabling already-disabled device
[ cut here ]
WARNING: at drivers/pci/pci.c:1407
CPU: 0 PID: 8744 Comm: kworker/0:0 Tainted: GW 
3.10.42-2002.pkvm2_1_1.6.ppc64 #1

Workqueue: events .work_for_cpu_fn
task: c0274e3f5400 ti: c027d3958000 task.ti: c027d3958000
NIP: c0471b8c LR: c0471b88 CTR: c043ebe0
REGS: c027d395b650 TRAP: 0700   Tainted: GW 
(3.10.42-2002.pkvm2_1_1.6.ppc64)
MSR: 900100029032   CR: 28b52b44  XER: 
2000

CFAR: c0879ab8 SOFTE: 1
...
NIP .pci_disable_device+0xcc/0xe0
LR  .pci_disable_device+0xc8/0xe0
Call Trace:
.pci_disable_device+0xc8/0xe0 (unreliable)
.lpfc_disable_pci_dev+0x50/0x80 [lpfc]
.lpfc_pci_probe_one+0x870/0x21a0 [lpfc]
.local_pci_probe+0x68/0xb0
.work_for_cpu_fn+0x38/0x60
.process_one_work+0x1a4/0x4d0
.worker_thread+0x37c/0x490
.kthread+0xf0/0x100
.ret_from_kernel_thread+0x5c/0x80

Signed-off-by: Mike Qiu 
---
  drivers/scsi/lpfc/lpfc.h  |  1 +
  drivers/scsi/lpfc/lpfc_init.c | 59 +++

  2 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/lpfc/lpfc.h b/drivers/scsi/lpfc/lpfc.h
index 434e903..0c7bad9 100644
--- a/drivers/scsi/lpfc/lpfc.h
+++ b/drivers/scsi/lpfc/lpfc.h
@@ -813,6 +813,7 @@ struct lpfc_hba {
  #define VPD_MASK            0xf /* mask for any vpd data */

  uint8_t soft_wwn_enable;
+uint8_t probe_done;

  struct timer_list fcp_poll_timer;
  struct timer_list eratt_poll;
diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
index 06f9a5b..c2e67ae 100644
--- a/drivers/scsi/lpfc/lpfc_init.c
+++ b/drivers/scsi/lpfc/lpfc_init.c
@@ -9519,6 +9519,9 @@ lpfc_pci_probe_one_s3(struct pci_dev *pdev, const struct pci_device_id *pid)

  }
  }

+/* Set the probe flag */
+phba->probe_done = 1;
+
  /* Perform post initialization setup */
  lpfc_post_init_setup(phba);

@@ -9795,6 +9798,9 @@ lpfc_sli_prep_dev_for_recover(struct lpfc_hba *phba)

  static void
  lpfc_sli_prep_dev_for_reset(struct lpfc_hba *phba)
  {
+if (!phba)
+return;
+
  lpfc_printf_log(phba, KERN_ERR, LOG_INIT,
  "2710 PCI channel disable preparing for reset\n");

@@ -9812,7 +9818,8 @@ lpfc_sli_prep_dev_for_reset(struct lpfc_hba *phba)


  /* Disable interrupt and pci device */
  lpfc_sli_disable_intr(phba);
-pci_disable_device(phba->pcidev);
+if (phba->probe_done && phba->pcidev)
+pci_disable_device(phba->pcidev);
  }

  /**
@@ -10282,6 +10289,9 @@ lpfc_pci_probe_one_s4(struct pci_dev *pdev, const struct pci_device_id *pid)

  goto out_disable_intr;
  }

+/* Set probe_done flag */
+phba->probe_done = 1;
+
  /* Log the current active interrupt mode */
  phba->intr_mode = intr_mode;
  lpfc_log_intr_mode(phba, intr_mode);
@@ -10544,6 +10554,9 @@ lpfc_sli4_prep_dev_for_recover(struct lpfc_hba *phba)

  static void
  lpfc_sli4_prep_dev_for_reset(struct lpfc_hba *phba)
  {
+if (!phba)
+return;
+
  lpfc_printf_log(phba, KERN_ERR, LOG_INIT,
  "2826 PCI channel disable preparing for reset\n");

@@ -10562,7 +10575,9 @@ lpfc_sli4_prep_dev_for_reset(struct lpfc_hba *phba)

  /* Disable interrupt and pci device */
  lpfc_sli4_disable_intr(phba);
  lpfc_sli4_queue_destroy(phba);
-pci_disable_device(phba->pcidev);
+
+if (phba->probe_done && phba->pcidev)
+pci_disable_device(phba->pcidev);
  }

  /**
@@ -10893,9 +10908,21 @@ static pci_ers_result_t
  lpfc_io_error_detec

Re: [systemd-devel] [RFC v2 3/6] kthread: warn on kill signal if not OOM

2014-09-09 Thread Tom Gundersen
On Tue, Sep 9, 2014 at 10:45 PM, Luis R. Rodriguez
 wrote:
> On Tue, Sep 9, 2014 at 12:35 PM, James Bottomley
>  wrote:
>> On Tue, 2014-09-09 at 12:16 -0700, Luis R. Rodriguez wrote:
>>> On Mon, Sep 8, 2014 at 10:38 PM, James Bottomley
>>>  wrote:
>>> > If we want to sort out some sync/async mechanism for probing devices, as
>>> > an agreement between the init systems and the kernel, that's fine, but
>>> > its a to-be negotiated enhancement.
>>>
>>> Unfortunately as Tejun notes the train has left which already made
>>> assumptions on this.
>>
>> Well, that's why it's a bug.  It's a material regression impacting
>> users.
>
> Indeed. I believe the issue with this regression however was that the
> original commit e64fae55 (January 2012) was only accepted by *kernel
> folks* to be a real regression until recently.

Just for the record, this only caused user-visible problems after
kernel commit 786235ee (November 2013), right?

> More than two years
> have gone by on growing design and assumptions on top of that original
> commit. I'm not sure if *systemd folks* yet believe its was a design
> regression?

I don't think so. udev should not allow its workers to run for an
unbounded length of time. Whether the upper bound should be 30, 60,
180 seconds or something else is up for debate (currently it is 60,
but if that is too short for some drivers we could certainly revisit
that). Moreover, it seems from this discussion that the aim is (still)
that insmod should be near-instantaneous (i.e., not wait for probe),
so it seems to me that the basic design is correct and all we need is
some temporary work-around and a way to better detect misbehaving
drivers?

>>>  I'm afraid distributions that want to avoid this
>>> sigkill at least on the kernel front will have to work around this
>>> issue either on systemd by increasing the default timeout which is now
>>> possible thanks to Hannes' changes or by some other means such as the
>>> combination of a modified non-chatty version of this patch + a check
>>> at the end of load_module() as mentioned earlier on these threads.
>>
>> Increasing the default timeout in systemd seems like the obvious bug fix
>> to me.  If the patch exists already, having distros that want it use it
>> looks to be correct ... not every bug is a kernel bug, after all.
>
> Its merged upstream on systemd now, along with a few fixes on top of
> it. I also see Kay merged a change to the default timeout to 60 second
> on August 30. Its unclear if these discussions had any impact on that
> decision or if that was just because udev firmware loading got now
> ripped out. I'll note that the new 60 second timeout wouldn't suffice
> for cxgb4 even if it didn't do firmware loading, its probe takes over
> one full minute.
>
>> Negotiating a probe vs init split for drivers is fine too, but it's a
>> longer term thing rather than a bug fix.
>
> Indeed. What I proposed with a multiplier for the timeout for the
> different types of built in commands was deemed complex but saw no
> alternatives proposed despite my interest to work on one and
> clarifications noted that this was a design regression. Not quite sure
> what else I could have done here. I'm interested in learning what the
> better approach is for the future as if we want to marry init + kernel
> we need a smooth way for us to discuss design without getting worked
> up about it, or taking it personal. I really want this to work as I
> personally like systemd so far.

How about this: keep the timeout global, but also introduce a
(relatively short, say 10 or 15 seconds) timeout after which a warning
is printed. Even if nothing is actually killed, having workers (be it
insmod or something else) take longer than a couple of seconds is
likely a sign that something is seriously off somewhere.
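
As a minimal, purely illustrative sketch of that two-tier scheme (a toy
supervisor, not systemd code; the names, structure and the 15/60 second
values are all assumptions, the 60 only echoing the current default):

#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define WARN_SECS 15        /* warn here ... */
#define KILL_SECS 60        /* ... but only kill here */

int main(void)
{
        pid_t pid = fork();

        if (pid == 0) {                 /* stand-in for an insmod/probe worker */
                execlp("sleep", "sleep", "90", (char *)NULL);
                _exit(127);
        }

        time_t start = time(NULL);
        int warned = 0, status;

        for (;;) {
                if (waitpid(pid, &status, WNOHANG) == pid)
                        break;          /* worker finished in time */

                time_t elapsed = time(NULL) - start;

                if (!warned && elapsed >= WARN_SECS) {
                        fprintf(stderr, "worker %d still running after %d seconds\n",
                                (int)pid, WARN_SECS);
                        warned = 1;     /* warn once, keep waiting */
                }
                if (elapsed >= KILL_SECS) {
                        fprintf(stderr, "worker %d exceeded %d seconds, killing it\n",
                                (int)pid, KILL_SECS);
                        kill(pid, SIGKILL);
                        waitpid(pid, &status, 0);
                        break;
                }
                sleep(1);
        }
        return 0;
}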

Cheers,

Tom


Re: [RFC v2 3/6] kthread: warn on kill signal if not OOM

2014-09-09 Thread Tom Gundersen
On Tue, Sep 9, 2014 at 3:26 AM, Luis R. Rodriguez
 wrote:
> On Mon, Sep 8, 2014 at 6:22 PM, Tejun Heo  wrote:
>> On Tue, Sep 09, 2014 at 10:10:59AM +0900, Tejun Heo wrote:
>>> I'm not too convinced this is such a difficult problem to figure out.
>>> We already have most of logic in place and the only thing missing is
>>> how to switch it.  Wouldn't something like the following work?
>>>
>>> * Add a sysctl knob to enable asynchronous device probing on module
>>>   load and enable asynchronous probing globally if the knob is set.
>>
>> Alternatively, add a module-generic param "async_probe" or whatever
>> and use that to switch the behavior should work too.  I don't know
>> which way is better but either should work fine.
>
> I take it by this you meant a generic system-wide sysctl or kernel cmd
> line option to enable this for all drivers?

If the expectation is that this feature should be enabled
unconditionally for all systemd systems, wouldn't it make more sense
to make it a Kconfig option (possibly overridable from the kernel
commandline in case that makes testing simpler)?

Cheers,

Tom


lk 3.17-rc4 blk_mq large write problems

2014-09-09 Thread Douglas Gilbert

A few days ago I was trying to create a large file
(say 16 GB) of zeros on an ext4 file system:
   dd if=/dev/zero bs=64k count=256k of=zero_16g.bin

After about 5 seconds there was a NULL de-reference that
crashed the machine (shown below). This was with a clean
version of lk 3.17-rc4 (from kernel.org) where the target
was a SATA SSD directly connected to an LSI 9300-4i SAS-3
HBA (mpt3sas). Significantly (IMO) the kernel boot line
contained:
   scsi_mod.use_blk_mq=Y

In all cases changing that to "N" fixed the problem. I tried
many things, including a SAS SSD, but the problem persisted when
use_blk_mq=Y. It doesn't always oops as shown in the first
case below. There were also:
  - immediate reboots
  - lock-ups without any oops on the console
  - different oopses of a somewhat stranger nature (hard to catch,
    as logging everything on a real serial port is fiddly),
    like double bus errors

Rob Elliott has been unable to replicate this problem.

Today I switched to another machine running Debian 7 (the
first machine was Ubuntu 14.04 based); both are x86_64.
I built the same kernel on the second machine, this time
with an LSI 9212-4i4e SAS-2 HBA (mpt2sas) and a SAS SSD
directly connected. Roughly speaking it was the same
test case:
  # 
  # mkfs.ext4 /dev/sdb1
  # mount /dev/sdb1 /mnt/spare
  # cd /mnt/spare
  # dd if=/dev/zero bs=64k count=256k of=zero_16g.bin
  # cd
  # umount /mnt/spare

Usually the dd or the umount would crash. Then after a
crash, following a power cycle, the mount would crash.
Changing to scsi_mod.use_blk_mq=N restored sanity.

I tried some other SAS controllers: I couldn't get an MR-9240-4i
(MegaRaid) to work at all on my newer box (doesn't like
PCIe 3?). I got an ARC-1882I working and it did not have
problems with the big dd (perhaps the arcmsr driver still
uses the host_lock to serialize commands).

So it could be bad code common to the mpt2sas and mpt3sas
drivers. Or it could be somewhere else. Perhaps there is
more than one problem.

Testers out there are encouraged to run the above test case.
The SATA and SAS SSDs that I used can consume writes in the
300 to 600 MB/sec range.

Part of the strangeness of this first attached oops is that
blk_mq_timeout_check() appears twice. The second one (typically
from the umount) is a blown stack.

Enjoy.
Doug Gilbert






BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [] scsi_times_out+0xe/0x2e0
PGD 2149ec067 PUD 214265067 PMD 0 
Oops:  [#1] SMP 
Modules linked in: x86_pkg_temp_thermal kvm_intel kvm nfsd ehci_pci ehci_hcd 
crct10dif_pclmul serio_raw parport_pc auth_rpcgss oid_registry exportfs nfs 
lockd sunrpc binfmt_misc fuse lp parport ext4 crc16 jbd2 usbhid ses xhci_hcd 
r8169 usbcore usb_common
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.17.0-rc3 #69
Hardware name: Gigabyte Technology Co., Ltd. Z97M-D3H/Z97M-D3H, BIOS F5 
05/30/2014
task: 88021513e090 ti: 88021518c000 task.ti: 88021518c000
RIP: 0010:[]  [] scsi_times_out+0xe/0x2e0
RSP: 0018:88021fb83e10  EFLAGS: 00010282
RAX: 8127cd20 RBX:  RCX: 8800d3dc8d40
RDX: 88020fe9c0c8 RSI: 2007 RDI: 8800d3dc8c00
RBP: 88020fe9c0c8 R08: 880037970088 R09: 88003797
R10: 88021e8024e8 R11: 0002 R12: 0449
R13: 88003797 R14: 88021fb83ea8 R15: 88021520c000
FS:  () GS:88021fb8() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 000214321000 CR4: 001407e0
Stack:
 8800d3dc8c00 88020fe9c0c8 8118f1d7 26fb
 88020fe9d400 811905db 880214fb33c0 88003797
 81190570 88021fb83ea8 0020 81193430
Call Trace:
  
 [] ? blk_rq_timed_out+0x17/0x80
 [] ? blk_mq_timeout_check+0x6b/0x90
 [] ? blk_mq_attempt_merge+0xb0/0xb0
 [] ? blk_mq_tag_busy_iter+0x50/0x80
 [] ? blk_mq_rq_timer+0x84/0x120
 [] ? blk_mq_timeout_check+0x90/0x90
 [] ? call_timer_fn.isra.36+0x12/0x70
 [] ? run_timer_softirq+0x19a/0x230
 [] ? __do_softirq+0xd5/0x1f0
 [] ? irq_exit+0x45/0x50
 [] ? smp_apic_timer_interrupt+0x3b/0x50
 [] ? apic_timer_interrupt+0x6a/0x70
  
 [] ? cpuidle_enter_state+0x4b/0xc0
 [] ? cpuidle_enter_state+0x3d/0xc0
 [] ? cpu_startup_entry+0x237/0x270
Code: e8 d8 b3 ff ff 85 c0 75 cd e9 54 ff ff ff 66 66 66 66 66 66 2e 0f 1f 84 
00 00 00 00 00 55 be 07 20 00 00 53 48 8b 9f f8 00 00 00 <48> 8b 03 48 89 df 48 
8b 28 e8 24 ac ff ff 83 bd 54 01 00 00 ff 
RIP  [] scsi_times_out+0xe/0x2e0
 RSP 
CR2: 
---[ end trace 659752a390e3d62e ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x0 from 0x8100 (relocation range: 
0x8000-0x9fff)
---[ end Kernel panic - not syncing: Fatal exception in interrupt
BUG: unable to handle kernel paging request at 00017f6b91a0
IP: [] cpuacct_charge+0x1f/0x40
PGD 3a77e067 PUD 0 
Thread overran stack, or stack corrupt

Re: [PATCH 8/8] blk-mq: support per-dispatch_queue flush machinery

2014-09-09 Thread Ming Lei
On Wed, Sep 10, 2014 at 2:48 AM, Christoph Hellwig  wrote:
>> + if (hctx) {
>> + int cmd_sz = q->tag_set->cmd_size;
>> + int node = hctx->numa_node;
>> +
>> + fq = kzalloc_node(sizeof(*fq), GFP_KERNEL, node);
>> + if (!fq)
>> + goto failed;
>> +
>> + rq_sz = round_up(rq_sz + cmd_sz, cache_line_size());
>> + fq->flush_rq = kzalloc_node(rq_sz, GFP_KERNEL, node);
>> + if (!fq->flush_rq)
>> + goto rq_failed;
>> +
>> + spin_lock_init(&fq->mq_flush_lock);
>> + } else {
>> + fq = kzalloc(sizeof(*fq), GFP_KERNEL);
>> + if (!fq)
>> + goto failed;
>> +
>> + fq->flush_rq = kzalloc(rq_sz, GFP_KERNEL);
>> + if (!fq->flush_rq)
>> + goto rq_failed;
>> + }
>
> Seems like this would be a lot cleaner by passing the cmd_size and
> node_id explicitly.  The added benefit would be that we could also
> pass the node for the blk_init_queue_node() case.

OK.

>
>> +static void __blk_mq_exit_flush(struct request_queue *q,
>> + unsigned free_end, unsigned int exit_end)
>> +{
>> + struct blk_mq_hw_ctx *hctx;
>> + unsigned int k;
>> + struct blk_flush_queue *fq;
>> + struct blk_mq_tag_set *set = q->tag_set;
>> + unsigned start_idx = set->queue_depth;
>> +
>> + queue_for_each_hw_ctx(q, hctx, k) {
>> + if (k >= free_end)
>> + break;
>> +
>> + fq = hctx->fq;
>> + if (k < exit_end && set->ops->exit_request)
>> + set->ops->exit_request(set->driver_data,
>> + fq->flush_rq, k,
>> + start_idx + k);
>> +
>> + blk_free_flush_queue(fq);
>> + }
>
> Can we merge the mq init/exit case into some existing for each hctx
> loop in blk-mq?

I am wondering whether we can do that, because the lifetime is totally
different between flush requests and tag_set requests, which are
initialized before the request queue is created.

Thanks,


Re: [PATCH 5/8] block: introduce blk_flush_queue to drive flush machinery

2014-09-09 Thread Ming Lei
On Wed, Sep 10, 2014 at 2:55 AM, Jens Axboe  wrote:
> On 09/09/2014 12:43 PM, Christoph Hellwig wrote:
>> On Tue, Sep 09, 2014 at 09:05:46PM +0800, Ming Lei wrote:
>>> This patch introduces 'struct blk_flush_queue' and puts all
>>> flush machinery related stuff into this strcuture, so that
>>
>> s/stuff/fields/
>> s/strcuture/structure/
>>
>> Looks good, but a few more nitpicks below.
>>
>> Reviewed-by: Christoph Hellwig 
>>
>>> +int blk_init_flush(struct request_queue *q)
>>> +{
>>> +int ret;
>>> +struct blk_flush_queue *fq = kzalloc(sizeof(*fq), GFP_KERNEL);
>>>
>>> +if (!fq)
>>>  return -ENOMEM;
>>>
>>> +q->fq = fq;
>>
>> I think it would be cleaner to return the flush data structure and
>> assign it in the caller.
>
> I was going to suggest renaming because of this as well. If we do this:
>
> q->fq = blk_init_flush(q);
>
> then it's immediately clear what it does, whereas blk_init_flush(q)
> means very little on its own. I'd change the naming to
> blk_alloc_flush_queue() and blk_free_flush_queue().

Exactly these two functions are introduced in patch 8/8, and I
will try to name them earlier in V1.

Thanks,


Re: [PATCH 5/8] block: introduce blk_flush_queue to drive flush machinery

2014-09-09 Thread Ming Lei
On Wed, Sep 10, 2014 at 2:43 AM, Christoph Hellwig  wrote:
> On Tue, Sep 09, 2014 at 09:05:46PM +0800, Ming Lei wrote:
>> This patch introduces 'struct blk_flush_queue' and puts all
>> flush machinery related stuff into this strcuture, so that
>
> s/stuff/fields/
> s/strcuture/structure/
>
> Looks good, but a few more nitpicks below.
>
> Reviewed-by: Christoph Hellwig 
>
>> +int blk_init_flush(struct request_queue *q)
>> +{
>> + int ret;
>> + struct blk_flush_queue *fq = kzalloc(sizeof(*fq), GFP_KERNEL);
>>
>> + if (!fq)
>>   return -ENOMEM;
>>
>> + q->fq = fq;
>
> I think it would be cleaner to return the flush data structure and
> assign it in the caller.

OK, and most of this part is transitional.

>
>> + INIT_LIST_HEAD(&fq->flush_queue[0]);
>> + INIT_LIST_HEAD(&fq->flush_queue[1]);
>> + INIT_LIST_HEAD(&fq->flush_data_in_flight);
>> +
>> + if (q->mq_ops) {
>> + ret = blk_mq_init_flush(q);
>
> I think we can just remove blk_mq_init_flush now that it's only
> called in blk-flush.c anyway.

blk_mq_init_flush() will become bigger in a following patch.

>>  void blk_exit_flush(struct request_queue *q)
>>  {
>> + if (q->mq_ops)
>> + blk_mq_exit_flush(q);
>> + else {
>> + struct blk_flush_queue *fq = blk_get_flush_queue(q);
>> + kfree(fq->flush_rq);
>> + kfree(fq);
>> + }
>
> Similarly I would pass the flush structure here.

OK.

>
>> +struct blk_flush_queue {
>> +    unsigned int        flush_queue_delayed:1;
>> +    unsigned int        flush_pending_idx:1;
>> +    unsigned int        flush_running_idx:1;
>> +    unsigned long       flush_pending_since;
>> +    struct list_head    flush_queue[2];
>> +    struct list_head    flush_data_in_flight;
>> +    struct request      *flush_rq;
>> +    spinlock_t          mq_flush_lock;
>> +};
>
> As this isn't really a queue I would call it blk_flush_data.

It is sort of a queue, since there is a double-buffered flush queue.

>
>> +static inline struct blk_flush_queue *blk_get_flush_queue(
>> + struct request_queue *q)
>> +{
>> + return q->fq;
>> +}
>
> I don't think there is a need for this helper.

No, the helper can simplify the following patch a lot, since the
flush queue data is always obtained through this helper in both the
legacy and mq cases, and the mq case will move to a per-hw_queue
flush queue.
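
To illustrate the direction (a sketch only; the exact signature is an
assumption, but the per-hctx flush queue comes from patch 8/8, which
introduces hctx->fq):

static inline struct blk_flush_queue *
blk_get_flush_queue(struct request_queue *q, struct blk_mq_hw_ctx *hctx)
{
        /* mq case: each hw queue carries its own flush queue (patch 8/8) */
        if (hctx)
                return hctx->fq;

        /* legacy case: single flush queue hanging off the request queue */
        return q->fq;
}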

Thanks,


Re: [RFC v2 3/6] kthread: warn on kill signal if not OOM

2014-09-09 Thread Tejun Heo
On Tue, Sep 09, 2014 at 12:25:29PM +0900, Tejun Heo wrote:
> Hello,
> 
> On Mon, Sep 08, 2014 at 08:19:12PM -0700, Luis R. Rodriguez wrote:
> > On the systemd side of things it should enable this sysctl and for
> > older kernels what should it do?
> 
> Supposing the change is backported via -stable, it can try to set the
> sysctl on all kernels.  If the knob doesn't exist, the fix is not
> there and nothing can be done about it.

The more I think about it, the more I think this should be a
per-insmod instance thing rather than a system-wide switch.  Currently
the kernel param code doesn't allow a generic param outside the ones
specified by the module itself, but adding support for something like
driver.async_load=1 shouldn't be too difficult, applying that to
existing systems shouldn't be much more difficult than a system-wide
switch, and it'd be significantly cleaner than fiddling with a driver
blacklist.

Thanks.

-- 
tejun


Re: [RFC v2 3/6] kthread: warn on kill signal if not OOM

2014-09-09 Thread Dmitry Torokhov
On Tuesday, September 09, 2014 03:46:23 PM James Bottomley wrote:
> On Wed, 2014-09-10 at 07:41 +0900, Tejun Heo wrote:
> > 
> > The thing is that we have to have dynamic mechanism to listen for
> > device attachments no matter what and such mechanism has been in place
> > for a long time at this point.  The synchronous wait simply doesn't
> > serve any purpose anymore and kinda gets in the way in that it makes
> > it a possibly extremely slow process to tell whether loading of a
> > module succeeded or not because the wait for the initial round of
> > probe is piggybacked.
> 
> OK, so we just fire and forget in userland ... why bother inventing an
> elaborate new infrastructure in the kernel to do exactly what
> 
> modprobe  &
> 
> would do?

Just so we do not forget: we also want the no-modules case to be able
to probe asynchronously so that a slow device does not stall kernel booting.

Thanks.

-- 
Dmitry


Re: [RFC v2 3/6] kthread: warn on kill signal if not OOM

2014-09-09 Thread Tejun Heo
Hello, James.

On Tue, Sep 09, 2014 at 03:46:23PM -0700, James Bottomley wrote:
> If you want the return of an individual device probe a log scraper gives
> it to you ... and nothing else does currently.  The advantage of the
> printk in dd.c is that it's standard for everything and can be scanned
> for ... if you take that out, you'll get complaints about the lack of
> standard messages (you'd be surprised at the number of enterprise
> monitoring systems that actually do log scraping).

Why would a log scraper care about which task is printing the messages?
The printk can stay there.  There's nothing wrong with it.  Log
scrapers tend to be asynchronous in nature, but if a log scraper wants
to operate synchronously for whatever reason, it can simply not turn
on async probing.

> OK, so we just fire and forget in userland ... why bother inventing an
> elaborate new infrastructure in the kernel to do exactly what
> 
> modprobe  &
> 
> would do?

I think the argument there is that the issuer wants to know whether
such operations succeeded or not and wants to report and record the
result and possibly take other actions in response.  We're currently
mixing wait and error reporting for one type of operation with wait
for another.  I'm not saying it's a fatal flaw or anything but it can
get in the way.

Thanks.

-- 
tejun


Re: [RFC v2 3/6] kthread: warn on kill signal if not OOM

2014-09-09 Thread James Bottomley
On Wed, 2014-09-10 at 07:41 +0900, Tejun Heo wrote:
> Hello,
> 
> On Tue, Sep 09, 2014 at 03:26:02PM -0700, James Bottomley wrote:
> > > We no longer report back error on probe failure on module load.
> > 
> > Yes, we do; for every probe failure of a device on a driver we'll print
> > a warning (see drivers/base/dd.c).  Now if someone is proposing we
> > should report this in a better fashion, that's probably a good idea, but
> > I must have missed that patch.
> 
> We can do printks all the same from anywhere.  There's nothing special
> about printing from the module loading thread.  The only way to
> actually take advantage of the synchronicity would be propagating
> error return to the waiting issuer, which we used to do but no longer
> can.

If you want the return of an individual device probe, a log scraper gives
it to you ... and nothing else does currently.  The advantage of the
printk in dd.c is that it's standard for everything and can be scanned
for ... if you take that out, you'll get complaints about the lack of
standard messages (you'd be surprised at the number of enterprise
monitoring systems that actually do log scraping).

> > >   It
> > > used to make sense to indicate error for module load on probe failure
> > > when the hardware was a lot simpler and drivers did their own device
> > > enumeration.  With the current bus / device setup, it doesn't make any
> > > sense and driver core silently suppresses all probe failures.  There's
> > > nothing the probing thread can monitor anymore.
> > 
> > Except the length of time taken to probe.  That seems to be what systemd
> > is interested in, hence this whole thread, right?
> 
> No, systemd in this case isn't interested in the time taken to probe
> at all.  It is expecting module load to just do that - load the
> module.  Modern userlands, systemd or not, no longer depend on or make
> use of the wait.

So what's the problem?  It can just fire and forget; that's what fork()
is for.

> > But that's nothing to do with sync or async.  Nowadays we register a
> > driver, the driver may bind to multiple devices.  If one of those
> > devices encounters an error during probe, we just report the fact in
> > dmesg and move on.  The module_init thread currently returns when all
> > the probe routines for all enumerated devices have been called, so
> > module_init has no indication of any failures (because they might be
> > mixed with successes); successes are indicated as the device appears but
> > we have nothing other than the kernel log to indicate the failures.  How
> > does moving to async probing alter this?  It doesn't as far as I can
> > see, except that module_init returns earlier but now we no longer have
> > an indication of when the probe completes, so we have to add yet another
> > mechanism to tell us if we're interested in that.  I really don't see
> > what this buys us.
> 
> The thing is that we have to have dynamic mechanism to listen for
> device attachments no matter what and such mechanism has been in place
> for a long time at this point.  The synchronous wait simply doesn't
> serve any purpose anymore and kinda gets in the way in that it makes
> it a possibly extremely slow process to tell whether loading of a
> module succeeded or not because the wait for the initial round of
> probe is piggybacked.

OK, so we just fire and forget in userland ... why bother inventing an
elaborate new infrastructure in the kernel to do exactly what

modprobe  &

would do?

James




Re: [RFC v2 3/6] kthread: warn on kill signal if not OOM

2014-09-09 Thread Tejun Heo
Hello,

On Tue, Sep 09, 2014 at 03:26:02PM -0700, James Bottomley wrote:
> > We no longer report back error on probe failure on module load.
> 
> Yes, we do; for every probe failure of a device on a driver we'll print
> a warning (see drivers/base/dd.c).  Now if someone is proposing we
> should report this in a better fashion, that's probably a good idea, but
> I must have missed that patch.

We can do printks all the same from anywhere.  There's nothing special
about printing from the module loading thread.  The only way to
actually take advantage of the synchronicity would be propagating
error return to the waiting issuer, which we used to do but no longer
can.

> >   It
> > used to make sense to indicate error for module load on probe failure
> > when the hardware was a lot simpler and drivers did their own device
> > enumeration.  With the current bus / device setup, it doesn't make any
> > sense and driver core silently suppresses all probe failures.  There's
> > nothing the probing thread can monitor anymore.
> 
> Except the length of time taken to probe.  That seems to be what systemd
> is interested in, hence this whole thread, right?

No, systemd in this case isn't interested in the time taken to probe
at all.  It is expecting module load to just do that - load the
module.  Modern userlands, systemd or not, no longer depend on or make
use of the wait.

> But that's nothing to do with sync or async.  Nowadays we register a
> driver, the driver may bind to multiple devices.  If one of those
> devices encounters an error during probe, we just report the fact in
> dmesg and move on.  The module_init thread currently returns when all
> the probe routines for all enumerated devices have been called, so
> module_init has no indication of any failures (because they might be
> mixed with successes); successes are indicated as the device appears but
> we have nothing other than the kernel log to indicate the failures.  How
> does moving to async probing alter this?  It doesn't as far as I can
> see, except that module_init returns earlier but now we no longer have
> an indication of when the probe completes, so we have to add yet another
> mechanism to tell us if we're interested in that.  I really don't see
> what this buys us.

The thing is that we have to have a dynamic mechanism to listen for
device attachments no matter what, and such a mechanism has been in place
for a long time at this point.  The synchronous wait simply doesn't
serve any purpose anymore and kinda gets in the way, in that it makes
it a possibly extremely slow process to tell whether loading of a
module succeeded or not, because the wait for the initial round of
probe is piggybacked.

Thanks.

-- 
tejun


Re: [RFC v2 3/6] kthread: warn on kill signal if not OOM

2014-09-09 Thread James Bottomley
On Wed, 2014-09-10 at 06:42 +0900, Tejun Heo wrote:
> Hey, James.
> 
> On Tue, Sep 09, 2014 at 12:35:46PM -0700, James Bottomley wrote:
> > I don't have very strong views on this one.  However, I've got to say
> > from a systems point of view that if the desire is to flag when the
> > module is having problems, probing and initializing synchronously in a
> > thread spawned by init which the init process can watchdog and thus can
> > flash up warning messages seems to be more straightforwards than an
> > elaborate asynchronous mechanism with completion signalling which
> > achieves the same thing in a more complicated (and thus bug prone)
> > fashion.
> 
> We no longer report back error on probe failure on module load.

Yes, we do; for every probe failure of a device on a driver we'll print
a warning (see drivers/base/dd.c).  Now if someone is proposing we
should report this in a better fashion, that's probably a good idea, but
I must have missed that patch.

>   It
> used to make sense to indicate error for module load on probe failure
> when the hardware was a lot simpler and drivers did their own device
> enumeration.  With the current bus / device setup, it doesn't make any
> sense and driver core silently suppresses all probe failures.  There's
> nothing the probing thread can monitor anymore.

Except the length of time taken to probe.  That seems to be what systemd
is interested in, hence this whole thread, right?

> In that sense, we already separated out device probing from module
> loading simply because the hardware reality mandated it and we have
> dynamic mechanisms to listen for device probes exactly for the same
> reason, so I think it makes sense to separate out the waiting too, at
> least in the long term.  In a modern dynamic setup, the waits are
> essentially arbitrary and doesn't buy us anything.

But that's nothing to do with sync or async.  Nowadays we register a
driver, the driver may bind to multiple devices.  If one of those
devices encounters an error during probe, we just report the fact in
dmesg and move on.  The module_init thread currently returns when all
the probe routines for all enumerated devices have been called, so
module_init has no indication of any failures (because they might be
mixed with successes); successes are indicated as the device appears but
we have nothing other than the kernel log to indicate the failures.  How
does moving to async probing alter this?  It doesn't as far as I can
see, except that module_init returns earlier but now we no longer have
an indication of when the probe completes, so we have to add yet another
mechanism to tell us if we're interested in that.  I really don't see
what this buys us.

James




Re: [RFC v2 3/6] kthread: warn on kill signal if not OOM

2014-09-09 Thread Jiri Kosina
On Tue, 9 Sep 2014, Luis R. Rodriguez wrote:

> By design it seems systemd should not allow worker processes to block
> indefinitely and in fact it currently uses the same timeout for all
> types of worker processes. 

And I whole-heartedly believe this is something that fundamentally needs 
to be addressed in systemd, not in the kernel.

This approach is actually introducing user-visible regressions. Look, for 
example, exec() never times out. Therefore if your system is on its knees, 
heavily overloaded (or completely broken), you are likely to be able to 
`reboot' it, because exec("/sbin/reboot") ultimately succeeds.

But with all the timeouts, dbus, "Failed to issue method call: Did 
not receive a reply" messages, this is getting close to impossible.

-- 
Jiri Kosina
SUSE Labs



Re: [RFC v2 3/6] kthread: warn on kill signal if not OOM

2014-09-09 Thread Tejun Heo
Hey, James.

On Tue, Sep 09, 2014 at 12:35:46PM -0700, James Bottomley wrote:
> I don't have very strong views on this one.  However, I've got to say
> from a systems point of view that if the desire is to flag when the
> module is having problems, probing and initializing synchronously in a
> thread spawned by init which the init process can watchdog and thus can
> flash up warning messages seems to be more straightforwards than an
> elaborate asynchronous mechanism with completion signalling which
> achieves the same thing in a more complicated (and thus bug prone)
> fashion.

We no longer report back error on probe failure on module load.  It
used to make sense to indicate error for module load on probe failure
when the hardware was a lot simpler and drivers did their own device
enumeration.  With the current bus / device setup, it doesn't make any
sense and driver core silently suppresses all probe failures.  There's
nothing the probing thread can monitor anymore.

In that sense, we already separated out device probing from module
loading simply because the hardware reality mandated it and we have
dynamic mechanisms to listen for device probes exactly for the same
reason, so I think it makes sense to separate out the waiting too, at
least in the long term.  In a modern dynamic setup, the waits are
essentially arbitrary and don't buy us anything.

Thanks.

-- 
tejun


Re: [RFC v2 3/6] kthread: warn on kill signal if not OOM

2014-09-09 Thread Luis R. Rodriguez
On Tue, Sep 9, 2014 at 12:35 PM, James Bottomley
 wrote:
> On Tue, 2014-09-09 at 12:16 -0700, Luis R. Rodriguez wrote:
>> On Mon, Sep 8, 2014 at 10:38 PM, James Bottomley
>>  wrote:
>> > If we want to sort out some sync/async mechanism for probing devices, as
>> > an agreement between the init systems and the kernel, that's fine, but
>> > its a to-be negotiated enhancement.
>>
>> Unfortunately as Tejun notes the train has left which already made
>> assumptions on this.
>
> Well, that's why it's a bug.  It's a material regression impacting
> users.

Indeed. I believe the issue with this regression, however, was that the
original commit e64fae55 (January 2012) was only recently accepted by
*kernel folks* to be a real regression. More than two years
have gone by of growing design and assumptions on top of that original
commit. I'm not sure if *systemd folks* yet believe it was a design
regression?

>>  I'm afraid distributions that want to avoid this
>> sigkill at least on the kernel front will have to work around this
>> issue either on systemd by increasing the default timeout which is now
>> possible thanks to Hannes' changes or by some other means such as the
>> combination of a modified non-chatty version of this patch + a check
>> at the end of load_module() as mentioned earlier on these threads.
>
> Increasing the default timeout in systemd seems like the obvious bug fix
> to me.  If the patch exists already, having distros that want it use it
> looks to be correct ... not every bug is a kernel bug, after all.

It's merged upstream in systemd now, along with a few fixes on top of
it. I also see Kay merged a change to the default timeout to 60 seconds
on August 30. It's unclear if these discussions had any impact on that
decision or if that was just because udev firmware loading has now been
ripped out. I'll note that the new 60-second timeout wouldn't suffice
for cxgb4 even if it didn't do firmware loading; its probe takes over
one full minute.

> Negotiating a probe vs init split for drivers is fine too, but it's a
> longer term thing rather than a bug fix.

Indeed. What I proposed, a multiplier for the timeout for the
different types of built-in commands, was deemed complex, but no
alternatives were proposed despite my interest in working on one and
despite clarifications noting that this was a design regression. I'm not
quite sure what else I could have done here. I'm interested in learning
what the better approach is for the future: if we want to marry init +
kernel, we need a smooth way to discuss design without getting worked
up about it or taking it personally. I really want this to work, as I
personally like systemd so far.

>> > For the current bug fix, just fix  the component that broke ... which 
>> > would be systemd.
>>
>> For new systems it seems the proposed fix is to have systemd tell the
>> kernel what it thought it should be seeing and that is all pure async
>> probes through a sysctl, and then we'd do async probe on all modules
>> unless a driver is specifically flagged with a need to run synchronous
>> (we'll enable this for request_firmware() users for example to start
>> off with).
>
> I don't have very strong views on this one.  However, I've got to say
> from a systems point of view that if the desire is to flag when the
> module is having problems, probing and initializing synchronously in a
> thread spawned by init which the init process can watchdog and thus can
> flash up warning messages seems to be more straightforwards

Indeed; however, it was not understood that module loading did init +
probe synchronously, and indeed what you recommend is also what I was
hoping systemd *should do* instead of a hard sigkill at the default
timeout.

> than an
> elaborate asynchronous mechanism with completion signalling which
> achieves the same thing in a more complicated (and thus bug prone)
> fashion.

I couldn't agree with you more. It takes two to tango, though.

  Luis


Re: [RFC v2 3/6] kthread: warn on kill signal if not OOM

2014-09-09 Thread James Bottomley
On Tue, 2014-09-09 at 12:16 -0700, Luis R. Rodriguez wrote:
> On Mon, Sep 8, 2014 at 10:38 PM, James Bottomley
>  wrote:
> > If we want to sort out some sync/async mechanism for probing devices, as
> > an agreement between the init systems and the kernel, that's fine, but
> > its a to-be negotiated enhancement.
> 
> Unfortunately as Tejun notes the train has left which already made
> assumptions on this.

Well, that's why it's a bug.  It's a material regression impacting
users.

>  I'm afraid distributions that want to avoid this
> sigkill at least on the kernel front will have to work around this
> issue either on systemd by increasing the default timeout which is now
> possible thanks to Hannes' changes or by some other means such as the
> combination of a modified non-chatty version of this patch + a check
> at the end of load_module() as mentioned earlier on these threads.

Increasing the default timeout in systemd seems like the obvious bug fix
to me.  If the patch exists already, having distros that want it use it
looks to be correct ... not every bug is a kernel bug, after all.

Negotiating a probe vs init split for drivers is fine too, but it's a
longer term thing rather than a bug fix.

> > For the current bug fix, just fix  the component that broke ... which would 
> > be systemd.
> 
> For new systems it seems the proposed fix is to have systemd tell the
> kernel what it thought it should be seeing and that is all pure async
> probes through a sysctl, and then we'd do async probe on all modules
> unless a driver is specifically flagged with a need to run synchronous
> (we'll enable this for request_firmware() users for example to start
> off with).

I don't have very strong views on this one.  However, I've got to say
from a systems point of view that if the desire is to flag when the
module is having problems, probing and initializing synchronously in a
thread spawned by init which the init process can watchdog and thus can
flash up warning messages seems to be more straightforwards than an
elaborate asynchronous mechanism with completion signalling which
achieves the same thing in a more complicated (and thus bug prone)
fashion.

James




Re: [PATCH 1/2] uas: Set no_report_opcodes

2014-09-09 Thread Hans de Goede

Hi,

On 09/09/2014 06:01 PM, Christoph Hellwig wrote:

On Tue, Sep 09, 2014 at 04:59:59PM +0200, Hans de Goede wrote:

asm1051e usb <-> sata bridges hang when receiving a report opcodes scsi cmnd.
Take a page out of the usb-storage book, and simply set no_report_opcodes
outright.


Given that this device also seems broken in other ways can we wait a bit
before using the big hammer?


Actually the big hammer I would like to avoid is disabling uas altogether
on these devices, as they work fine with 2 out of the 3 disks I've
tested with, as long as report opcodes is not used.

Which is why my other patch also includes a log message to explain how
to enable uas despite the blacklist, but that will only work if we don't
send report opcodes.

> I'm still hoping UAS might give us some better SCSI implementation so
> that we don't have to disable any kind of advanced feature.


I understand, so an alternative would be to make this a quirk and only
set it for ASM1051/ASM1053 bridges.

Even better would be to combine this with figuring out why the bridge is
becoming unhappy when paired with a Crucial M500 SSD, and fix that +
a report opcodes quirk, so that we don't have to blacklist it at all.

Any ideas how to fix the unhappiness? I've already tried setting all
the sdev flags the usb-storage driver sets, but that does not help.

Regards,

Hans





Re: [RFC v2 3/6] kthread: warn on kill signal if not OOM

2014-09-09 Thread Luis R. Rodriguez
On Mon, Sep 8, 2014 at 10:38 PM, James Bottomley
 wrote:
> On Tue, 2014-09-09 at 10:10 +0900, Tejun Heo wrote:
>> Hello, Luis.
>>
>> On Mon, Sep 08, 2014 at 06:04:23PM -0700, Luis R. Rodriguez wrote:
>> > > I have no idea how the selection should be.  It could be per-insmod or
>> > > maybe just a system-wide flag with explicit exceptions marked on
>> > > drivers is good enough.  I don't know.
>> >
>> > Its perfectly understandable if we don't know what path to take yet
>> > and its also understandable for it to take time to figure out --
>> > meanwhile though systemd already has merged a policy of a 30 second
>> > timeout for *all drivers* though so we therefore need:
>>
>> I'm not too convinced this is such a difficult problem to figure out.
>> We already have most of logic in place and the only thing missing is
>> how to switch it.  Wouldn't something like the following work?
>>
>> * Add a sysctl knob to enable asynchronous device probing on module
>>   load and enable asynchronous probing globally if the knob is set.
>>
>> * Identify cases which can't be asynchronous and make them
>>   synchronous.  e.g. keep who's doing request_module() and avoid
>>   asynchronous probing if current is probing one of those.
>
> What's wrong with just fixing systemd?  Arbitrary timeouts in init
> scripts for system bring up are plain wrong ... I thought we had this
> sorted out ten years ago when we were first having the arguments about
> how long to wait for root; I'm surprised it's coming back again.

By design it seems systemd should not allow worker processes to block
indefinitely, and in fact it currently uses the same timeout for all
types of worker processes. I last recommended a multiplier to at least
allow systemd to distinguish between them and let us modify the timeout
based on the type of process, using an enum to classify these; kmod,
for example, would be one type of command:

http://lists.freedesktop.org/archives/systemd-devel/2014-August/021852.html
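
A rough sketch of that classification idea (purely illustrative; these
names and multipliers are made up, not systemd's actual code):

enum worker_cmd_type {
        WORKER_CMD_GENERIC,
        WORKER_CMD_KMOD,        /* module load + probe, known to be slow */
        WORKER_CMD_FIRMWARE,
};

static unsigned int worker_timeout_sec(enum worker_cmd_type type,
                                       unsigned int base_sec)
{
        switch (type) {
        case WORKER_CMD_KMOD:
                return base_sec * 6;    /* e.g. a 60s base -> 360s for kmod */
        case WORKER_CMD_FIRMWARE:
                return base_sec * 2;
        default:
                return base_sec;
        }
}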

This was deemed to introduce unnecessary complexity, but I believe
this was before we realized that the timeout was penalizing kmod usage
unfairly: the original assumption that it was just init that
should be penalized was incorrect, given that we batch both init +
probe together. I have been relaying updates back on that thread as we
move along with this discussion on the issues found with the timeout,
but haven't gotten feedback yet as to which path folks on systemd
would like to take in light of recent discussions / clarifications.
Perhaps your arguments might help folks here reconsider things a bit
as well.

If we want *tight* integration between the init system and the kernel,
these discussions are necessary not only when we find issues but should
also be part of the design phase for major changes.

> If we want to sort out some sync/async mechanism for probing devices, as
> an agreement between the init systems and the kernel, that's fine, but
> its a to-be negotiated enhancement.

Unfortunately, as Tejun notes, the train has left and assumptions have
already been made on this. I'm afraid distributions that want to avoid this
sigkill, at least on the kernel front, will have to work around this
issue either in systemd, by increasing the default timeout (which is now
possible thanks to Hannes' changes), or by some other means such as the
combination of a modified non-chatty version of this patch + a check
at the end of load_module(), as mentioned earlier on these threads.

> For the current bug fix, just fix  the component that broke ... which would 
> be systemd.

For new systems it seems the proposed fix is to have systemd tell the
kernel, through a sysctl, that it expects to see nothing but pure async
probes, and then we'd do async probing on all modules
unless a driver is specifically flagged with a need to run synchronously
(we'll enable this for request_firmware() users, for example, to start
off with).

 Luis


Re: [PATCH 2/2] uas: Disable uas on ASM1051 devices

2014-09-09 Thread Hans de Goede

Hi,

On 09/09/2014 05:23 PM, Alan Stern wrote:

On Tue, 9 Sep 2014, Hans de Goede wrote:


Even with REPORT SUPPORTED OPERATION CODES blacklisted, the ASM1051 chipset
still does not work when combined with some disks, e.g. a Crucial M500 SSD.

When used with a troublesome disk, the chipset throws all kinds of USB errors,
and eventually hangs, whereas in BOT mode it works fine.

To make matters worse, the ASM1051 chipset and older versions of the ASM1053
chipset use the same usb-id.

When connected over USB-3 the 2 can be told apart by the number of streams
they support. So this patch adds some less than pretty code to disable uas for
the ASM1051. When connected over USB-2, simply disable uas altogether for
devices with the shared usb-id.

This ends up disabling uas in many cases where it actually works fine, which
is unfortunate, but better than leaving some users with completely non-working
external disks. This patch adds a log message explaining how performance-conscious
users can re-enable uas.

Cc: sta...@vger.kernel.org # 3.16
Signed-off-by: Hans de Goede 
---
  drivers/usb/storage/uas-detect.h | 34 ++
  1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/drivers/usb/storage/uas-detect.h b/drivers/usb/storage/uas-detect.h
index 503ac5c..3119f3f 100644
--- a/drivers/usb/storage/uas-detect.h
+++ b/drivers/usb/storage/uas-detect.h
@@ -59,10 +59,6 @@ static int uas_use_uas_driver(struct usb_interface *intf,
unsigned long flags = id->driver_info;
int r, alt;

-   usb_stor_adjust_quirks(udev, &flags);
-
-   if (flags & US_FL_IGNORE_UAS)
-   return 0;

alt = uas_find_uas_alt_setting(intf);
if (alt < 0)
@@ -72,6 +68,36 @@ static int uas_use_uas_driver(struct usb_interface *intf,
if (r < 0)
return 0;

+   /*
+* ASM1051 and older ASM1053 devices have the same usb-id, and UAS is
+* broken with some disks on the ASM1051, use the number of streams to
+* differentiate between the 2. Newer ASM1053 devices also support 32
+* streams, but have a different product-id.
+*/
+   if (udev->descriptor.idVendor == 0x174c &&
+   udev->descriptor.idProduct == 0x55aa) {
+   if (udev->speed < USB_SPEED_SUPER) {
+   /* No streams info, assume ASM1051E */
+   dev_warn(&udev->dev, "Assuming ASM1051E chipset\n");
+   flags |= US_FL_IGNORE_UAS;
+   } else if (usb_ss_max_streams(&eps[1]->ss_ep_comp) == 32) {
+   dev_warn(&udev->dev, "Detected ASM1051E chipset\n");
+   flags |= US_FL_IGNORE_UAS;
+   }
+   }
+
+   usb_stor_adjust_quirks(udev, &flags);


This won't work.  usb_stor_adjust_quirks masks out all the quirks that
it handles before applying the user-specified quirks.  Therefore it
will erase your US_FL_IGNORE_UAS flag.


Only if there is a matching quirk entry in the quirks parameter, and then
masking it out is fine; we want to allow this, as we're setting the
US_FL_IGNORE_UAS flag on bridges where it is known to cause issues when
combined with *some* disks, while uas works fine with other disks, so
users may want to override it.

Regards,

Hans


Re: [PATCH 5/8] block: introduce blk_flush_queue to drive flush machinery

2014-09-09 Thread Jens Axboe
On 09/09/2014 12:43 PM, Christoph Hellwig wrote:
> On Tue, Sep 09, 2014 at 09:05:46PM +0800, Ming Lei wrote:
>> This patch introduces 'struct blk_flush_queue' and puts all
>> flush machinery related stuff into this strcuture, so that
> 
> s/stuff/fields/
> s/strcuture/structure/
> 
> Looks good, but a few more nitpicks below.
> 
> Reviewed-by: Christoph Hellwig 
> 
>> +int blk_init_flush(struct request_queue *q)
>> +{
>> +int ret;
>> +struct blk_flush_queue *fq = kzalloc(sizeof(*fq), GFP_KERNEL);
>>  
>> +if (!fq)
>>  return -ENOMEM;
>>  
>> +q->fq = fq;
> 
> I think it would be cleaner to return the flush data structure and
> assign it in the caller.

I was going to suggest renaming because of this as well. If we do this:

q->fq = blk_init_flush(q);

then it's immediately clear what it does, whereas blk_init_flush(q)
means very little on its own. I'd change the naming to
blk_alloc_flush_queue() and blk_free_flush_queue().
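
A short sketch of what that calling convention could look like
(illustrative only, not the final kernel code; error handling trimmed,
fields taken from the patch under review):

struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q)
{
        struct blk_flush_queue *fq = kzalloc(sizeof(*fq), GFP_KERNEL);

        if (fq) {
                INIT_LIST_HEAD(&fq->flush_queue[0]);
                INIT_LIST_HEAD(&fq->flush_queue[1]);
                INIT_LIST_HEAD(&fq->flush_data_in_flight);
        }
        return fq;
}

        /* caller: the assignment now says what happens */
        q->fq = blk_alloc_flush_queue(q);
        if (!q->fq)
                return -ENOMEM;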

-- 
Jens Axboe



Re: [PATCH 8/8] blk-mq: support per-dispatch_queue flush machinery

2014-09-09 Thread Christoph Hellwig
> + if (hctx) {
> + int cmd_sz = q->tag_set->cmd_size;
> + int node = hctx->numa_node;
> +
> + fq = kzalloc_node(sizeof(*fq), GFP_KERNEL, node);
> + if (!fq)
> + goto failed;
> +
> + rq_sz = round_up(rq_sz + cmd_sz, cache_line_size());
> + fq->flush_rq = kzalloc_node(rq_sz, GFP_KERNEL, node);
> + if (!fq->flush_rq)
> + goto rq_failed;
> +
> + spin_lock_init(&fq->mq_flush_lock);
> + } else {
> + fq = kzalloc(sizeof(*fq), GFP_KERNEL);
> + if (!fq)
> + goto failed;
> +
> + fq->flush_rq = kzalloc(rq_sz, GFP_KERNEL);
> + if (!fq->flush_rq)
> + goto rq_failed;
> + }

Seems like this would be a lot cleaner by passing the cmd_size and
node_id explicitly.  The added benefit would be that we could also
pass the node for the blk_init_queue_node() case.
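
As a sketch of that interface (not the final code, and combining it with
the allocator naming suggested in the other subthread): the non-mq
caller could pass NUMA_NO_NODE and a zero cmd_size and share the same
allocator.

struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size)
{
        struct blk_flush_queue *fq;
        int rq_sz = sizeof(struct request);

        fq = kzalloc_node(sizeof(*fq), GFP_KERNEL, node);
        if (!fq)
                return NULL;

        /* room for the driver's per-request payload, cache aligned */
        rq_sz = round_up(rq_sz + cmd_size, cache_line_size());
        fq->flush_rq = kzalloc_node(rq_sz, GFP_KERNEL, node);
        if (!fq->flush_rq) {
                kfree(fq);
                return NULL;
        }

        spin_lock_init(&fq->mq_flush_lock);
        return fq;
}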

> +static void __blk_mq_exit_flush(struct request_queue *q,
> + unsigned free_end, unsigned int exit_end)
> +{
> + struct blk_mq_hw_ctx *hctx;
> + unsigned int k;
> + struct blk_flush_queue *fq;
> + struct blk_mq_tag_set *set = q->tag_set;
> + unsigned start_idx = set->queue_depth;
> +
> + queue_for_each_hw_ctx(q, hctx, k) {
> + if (k >= free_end)
> + break;
> +
> + fq = hctx->fq;
> + if (k < exit_end && set->ops->exit_request)
> + set->ops->exit_request(set->driver_data,
> + fq->flush_rq, k,
> + start_idx + k);
> +
> + blk_free_flush_queue(fq);
> + }

Can we merge the mq init/exit case into some existing for each hctx
loop in blk-mq?



Re: [PATCH 5/8] block: introduce blk_flush_queue to drive flush machinery

2014-09-09 Thread Christoph Hellwig
On Tue, Sep 09, 2014 at 09:05:46PM +0800, Ming Lei wrote:
> This patch introduces 'struct blk_flush_queue' and puts all
> flush machinery related stuff into this strcuture, so that

s/stuff/fields/
s/strcuture/structure/

Looks good, but a few more nitpicks below.

Reviewed-by: Christoph Hellwig 

> +int blk_init_flush(struct request_queue *q)
> +{
> + int ret;
> + struct blk_flush_queue *fq = kzalloc(sizeof(*fq), GFP_KERNEL);
>  
> + if (!fq)
>   return -ENOMEM;
>  
> + q->fq = fq;

I think it would be cleaner to return the flush data structure and
assign it in the caller.

> + INIT_LIST_HEAD(&fq->flush_queue[0]);
> + INIT_LIST_HEAD(&fq->flush_queue[1]);
> + INIT_LIST_HEAD(&fq->flush_data_in_flight);
> +
> + if (q->mq_ops) {
> + ret = blk_mq_init_flush(q);

I think we can just remove blk_mq_init_flush now that it's only
called in blk-flush.c anyway.

>  void blk_exit_flush(struct request_queue *q)
>  {
> + if (q->mq_ops)
> + blk_mq_exit_flush(q);
> + else {
> + struct blk_flush_queue *fq = blk_get_flush_queue(q);
> + kfree(fq->flush_rq);
> + kfree(fq);
> + }

Similarly I would pass the flush structure here.

> +struct blk_flush_queue {
> + unsigned intflush_queue_delayed:1;
> + unsigned intflush_pending_idx:1;
> + unsigned intflush_running_idx:1;
> + unsigned long   flush_pending_since;
> + struct list_headflush_queue[2];
> + struct list_headflush_data_in_flight;
> + struct request  *flush_rq;
> + spinlock_t  mq_flush_lock;
> +};

As this isn't really a queue I would call it blk_flush_data.

> +static inline struct blk_flush_queue *blk_get_flush_queue(
> + struct request_queue *q)
> +{
> + return q->fq;
> +}

I don't think there is a need for this helper.



Re: [PATCH 6/8] block: flush: avoid to figure out flush queue unnecessarily

2014-09-09 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 4/8] block: avoid to use q->flush_rq directly

2014-09-09 Thread Christoph Hellwig
On Tue, Sep 09, 2014 at 09:05:45PM +0800, Ming Lei wrote:
> This patch tries to use a local variable to access the flush request,
> so that we can convert to per-queue flush machinery a bit easier.
> 
> Signed-off-by: Ming Lei 

Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 3/8] block: move flush initialized stuff to blk_flush_init

2014-09-09 Thread Christoph Hellwig
On Tue, Sep 09, 2014 at 09:05:44PM +0800, Ming Lei wrote:
> These stuff is always used with flush req together, so
> we can do that safely.

Little wording nitpick:

in the subject s/initialized stuff/initialization/

and in the body:

These fields are always used with the flush request, so initialize them
together.

Otherwise looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 2/8] block: introduce blk_init_flush and its pair

2014-09-09 Thread Christoph Hellwig
On Tue, Sep 09, 2014 at 09:05:43PM +0800, Ming Lei wrote:
> These two functions are introduced to initialize and de-initialize
> flush stuff centrally.
> 
> Signed-off-by: Ming Lei 

Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 1/8] blk-mq: allocate flush_rq in blk_mq_init_flush()

2014-09-09 Thread Christoph Hellwig
On Tue, Sep 09, 2014 at 09:05:42PM +0800, Ming Lei wrote:
> It is reasonable to allocate flush req in blk_mq_init_flush().
> 
> Signed-off-by: Ming Lei 

Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH] Block: fix unbalanced bypass-disable in blk_register_queue

2014-09-09 Thread Shirish Pargaonkar
Tested-by: Shirish Pargaonkar 

On Tue, Sep 9, 2014 at 10:50 AM, Alan Stern  wrote:
> When a queue is registered, the block layer turns off the bypass
> setting (because bypass is enabled when the queue is created).  This
> doesn't work well for queues that are unregistered and then registered
> again; we get a WARNING because of the unbalanced calls to
> blk_queue_bypass_end().
>
> This patch fixes the problem by making blk_register_queue() call
> blk_queue_bypass_end() only the first time the queue is registered.
>
> Signed-off-by: Alan Stern 
> CC: Tejun Heo 
> CC: James Bottomley 
> CC: Jens Axboe 
>
> ---
>
> [as1765]
>
>
>  block/blk-sysfs.c |6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> Index: usb-3.17/block/blk-sysfs.c
> ===
> --- usb-3.17.orig/block/blk-sysfs.c
> +++ usb-3.17/block/blk-sysfs.c
> @@ -554,8 +554,10 @@ int blk_register_queue(struct gendisk *d
>  * Initialization must be complete by now.  Finish the initial
>  * bypass from queue allocation.
>  */
> -   queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
> -   blk_queue_bypass_end(q);
> +   if (!blk_queue_init_done(q)) {
> +   queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
> +   blk_queue_bypass_end(q);
> +   }
>
> ret = blk_trace_init_sysfs(dev);
> if (ret)
>


Re: [PATCH] Block: fix unbalanced bypass-disable in blk_register_queue

2014-09-09 Thread Tejun Heo
On Tue, Sep 09, 2014 at 11:50:58AM -0400, Alan Stern wrote:
> When a queue is registered, the block layer turns off the bypass
> setting (because bypass is enabled when the queue is created).  This
> doesn't work well for queues that are unregistered and then registered
> again; we get a WARNING because of the unbalanced calls to
> blk_queue_bypass_end().
> 
> This patch fixes the problem by making blk_register_queue() call
> blk_queue_bypass_end() only the first time the queue is registered.
> 
> Signed-off-by: Alan Stern 
> CC: Tejun Heo 
> CC: James Bottomley 
> CC: Jens Axboe 

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [PATCH v3 16/17] arcmsr: support new adapter ARC12x4 series

2014-09-09 Thread Christoph Hellwig
Ching,

have you had a chance to address Thomas' second concern below?  As
far as I can tell (Thomas, please correct me) that's the last
outstanding concern, and I'd really like to merge the arcmsr updates
for the Linux 3.18 merge window.



RE: [PATCH 04/11] megaraid_sas : Firmware crash dump feature support

2014-09-09 Thread Elliott, Robert (Server Storage)


> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Tomas Henzl
> Sent: Tuesday, 09 September, 2014 10:54 AM
> Subject: Re: [PATCH 04/11] megaraid_sas : Firmware crash dump feature
> support
> 
> On 09/06/2014 03:25 PM, sumit.sax...@avagotech.com wrote:
> > This feature will provide similar interface as kernel crash dump
> feature.
> > When megaraid firmware encounter any crash, driver will collect the
> firmware raw image and
> > dump it into pre-configured location.
> >
...
> With several controllers in a system this may take a lot of memory;
> could you also lower it when a kdump kernel is running, by not using
> this feature?

What is the correct way for a driver to determine that it is
running in a kdump kernel?  The reset_devices global variable?
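
For reference, reset_devices (declared in include/linux/reboot.h) is what a
few other drivers key off when they want to shrink their footprint in a kdump
kernel; a hedged sketch of what that could look like here (the field name and
message are illustrative, not from the patch):

#include <linux/reboot.h>       /* reset_devices */

/* skip the large FW crash dump buffers when booted as a kdump kernel,
 * where memory is scarce anyway */
if (reset_devices) {
        instance->crash_dump_drv_support = 0;   /* illustrative field name */
        dev_info(&instance->pdev->dev,
                 "kdump kernel detected, disabling FW crash dump support\n");
}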


---
Rob Elliott    HP Server Storage




Re: [PATCH 1/2] uas: Set no_report_opcodes

2014-09-09 Thread Christoph Hellwig
On Tue, Sep 09, 2014 at 04:59:59PM +0200, Hans de Goede wrote:
> asm1051e usb <-> sata bridges hang when receiving a report opcodes scsi cmnd.
> Take a page out of the usb-storage book, and simply set no_report_opcodes
> outright.

Given that this device also seems broken in other ways can we wait a bit
before using the big hammer?  I'm still hoping UAS might give us some
better SCSI implementation so that we don't have to disable any kind of
advanced feature.



Re: [PATCH 04/11] megaraid_sas : Firmware crash dump feature support

2014-09-09 Thread Tomas Henzl
On 09/06/2014 03:25 PM, sumit.sax...@avagotech.com wrote:
> This feature will provide similar interface as kernel crash dump feature.
> When megaraid firmware encounter any crash, driver will collect the firmware 
> raw image and 
> dump it into pre-configured location.
>
> Driver will allocate two different segments of memory. 
> #1 Non-DMA able large buffer (will be allocated on demand) to capture actual 
> FW crash dump.
> #2 DMA buffer (persistent allocation) just to do an arbitrator job. 
>
> Firmware will keep writing Crash dump data in chunks of DMA buffer size into 
> #2, 
> which will be copied back by the driver to the host memory as described in #1.
>
> Driver-Firmware interface:
> ==
> A.) Host driver can allocate maximum 512MB Host memory to store crash dump 
> data. 
>
> This memory will be internal to the host and will not be exposed to the 
> Firmware.
> Driver may not be able to allocate 512 MB. In that case, driver will do 
> possible memory 
> (available at run time) allocation to store crash dump data. 
>
> Let’s call this buffer as Host Crash Buffer. 
>
> Host Crash buffer will not be contiguous as a whole, but it will have 
> multiple chunks of contiguous memory. 
> This will be internal to driver and firmware/application are unaware of it. 
> Partial allocation of Host Crash buffer may have valid information to debug 
> depending upon 
> what was collected in that buffer and depending on nature of failure. 
>
> Complete Crash dump is the best case, but we do want to capture partial 
> buffer just to grab something rather than nothing.
> Host Crash buffer will be allocated only when FW Crash dump data is 
> available, 
> and will be deallocated once application copy Host Crash buffer to the file. 
> Host Crash buffer size can be anything between 1MB to 512MB. (It will be 
> multiple of 1MBs)
>
>
> B.) Irrespective of underlying Firmware capability of crash dump support, 
> driver will allocate DMA buffer at start of the day for each MR controllers. 
> Let’s call this buffer as “DMA Crash Buffer”.
>
> For this feature, size of DMA crash buffer will be 1MB. 
> (We will not gain much even if DMA buffer size is increased.) 
>
> C.) Driver will now read Controller Info sending existing dcmd 
> “MR_DCMD_CTRL_GET_INFO”. 
> Driver should extract the information from ctrl info provided by firmware and 
> figure out if firmware support crash dump feature or not.
>
> Driver will enable crash dump feature only if
> “Firmware support Crash dump” +
> “Driver was able to create DMA Crash Buffer”.
>
> If either one from above is not set, Crash dump feature should be disable in 
> driver.
> Firmware will enable crash dump feature only if “Driver Send DCMD- 
> MR_DCMD_SET_CRASH_BUF_PARA with MR_CRASH_BUF_TURN_ON”
>
> Helper application/script should use sysfs parameter fw_crash_xxx to actually 
> copy data from
> host memory to the filesystem.

Is it possible to store the crash dump data on a filesystem on the same
controller after the controller has crashed, or do you expect the use of
another disk/controller?
With several controllers in a system this may take a lot of memory; could you
also lower it when a kdump kernel is running, by not using this feature?

see other comments inside

>
> Signed-off-by: Sumit Saxena 
> Signed-off-by: Kashyap Desai 
> ---
>  drivers/scsi/megaraid/megaraid_sas.h|  58 +-
>  drivers/scsi/megaraid/megaraid_sas_base.c   | 292 +++-
>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 172 +++-
>  3 files changed, 517 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas.h 
> b/drivers/scsi/megaraid/megaraid_sas.h
> index bc7adcf..e0f03e2 100644
> --- a/drivers/scsi/megaraid/megaraid_sas.h
> +++ b/drivers/scsi/megaraid/megaraid_sas.h
> @@ -105,6 +105,9 @@
>  #define MFI_STATE_READY  0xB000
>  #define MFI_STATE_OPERATIONAL0xC000
>  #define MFI_STATE_FAULT  0xF000
> +#define MFI_STATE_FORCE_OCR  0x0080
> +#define MFI_STATE_DMADONE0x0008
> +#define MFI_STATE_CRASH_DUMP_DONE0x0004
>  #define MFI_RESET_REQUIRED   0x0001
>  #define MFI_RESET_ADAPTER0x0002
>  #define MEGAMFI_FRAME_SIZE   64
> @@ -191,6 +194,9 @@
>  #define MR_DCMD_CLUSTER_RESET_LD 0x08010200
>  #define MR_DCMD_PD_LIST_QUERY   0x02010100
>  
> +#define MR_DCMD_CTRL_SET_CRASH_DUMP_PARAMS   0x01190100
> +#define MR_DRIVER_SET_APP_CRASHDUMP_MODE (0xF001 | 0x0600)
> +
>  /*
>   * Global functions
>   */
> @@ -264,6 +270,25 @@ enum MFI_STAT {
>  };
>  
>  /*
> + * Crash dump related defines
> + */
> +#define MAX_CRASH_DUMP_SIZE 512
> +#define CRASH_DMA_BUF_SIZE  (1024 * 1024)
> +
> +enum MR_FW_CRASH_DUMP_STATE {
> + UNAVAILABLE = 0,
> + AVAILABLE = 1,
> + COPYING = 2,
> + CO

[PATCH] Block: fix unbalanced bypass-disable in blk_register_queue

2014-09-09 Thread Alan Stern
When a queue is registered, the block layer turns off the bypass
setting (because bypass is enabled when the queue is created).  This
doesn't work well for queues that are unregistered and then registered
again; we get a WARNING because of the unbalanced calls to
blk_queue_bypass_end().

This patch fixes the problem by making blk_register_queue() call
blk_queue_bypass_end() only the first time the queue is registered.

Signed-off-by: Alan Stern 
CC: Tejun Heo 
CC: James Bottomley 
CC: Jens Axboe 

---

[as1765]


 block/blk-sysfs.c |6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

Index: usb-3.17/block/blk-sysfs.c
===
--- usb-3.17.orig/block/blk-sysfs.c
+++ usb-3.17/block/blk-sysfs.c
@@ -554,8 +554,10 @@ int blk_register_queue(struct gendisk *d
 * Initialization must be complete by now.  Finish the initial
 * bypass from queue allocation.
 */
-   queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
-   blk_queue_bypass_end(q);
+   if (!blk_queue_init_done(q)) {
+   queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
+   blk_queue_bypass_end(q);
+   }
 
ret = blk_trace_init_sysfs(dev);
if (ret)



Re: [PATCH 0/8] block: per-distpatch_queue flush machinery

2014-09-09 Thread Ming Lei
On Tue, Sep 9, 2014 at 11:20 PM, Christoph Hellwig  wrote:
> Hi Ming,
>
> thanks for doing this work!
>
> I've only taken a very cursory look at the series and I like most of it.
>
> But can you explain why you're not simply incrementing the number of
> allocated requests in the blk-mq request allocation code instead of
> allocating the flush request separately in the last patch?

Requests in the tag_set may be shared by more than one queue, but the
per-hctx flush rq can't be shared between request queues at all.

Also it is simpler to put it inside the introduced blk_flush_queue.
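
In other words, the layout the last patch ends up with looks roughly like
this (hctx->fq is the field from patch 8/8):

/* tags (and their requests) can be shared across request queues through
 * the tag_set, but every hw queue carries its own private flush machinery */
struct blk_mq_hw_ctx {
        /* ... */
        struct blk_flush_queue  *fq;    /* per-hctx, never shared */
        /* ... */
};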

Thanks,


Re: [PATCH] scsi_debug: deadlock between completions and surprise module removal

2014-09-09 Thread Christoph Hellwig
On Mon, Sep 08, 2014 at 04:31:01PM -0400, Douglas Gilbert wrote:
> stop_all_queued() is doing hrtimer_cancel(), del_timer_sync()
> or tasklet_kill() on all the scsi_cmnd objects that are
> "in play". Unless another mechanism calls the .eh_abort_handler
> entry point reliably on each "in play" command then the module
> cannot be removed. That is because some timer expiry callbacks
> are pending.

scsi_remove_host disables all queueing of new commands, so all these
timers and tasklets will eventually expire or run and allow the
removal to complete.  Of course this could be sped up by cancelling
them, but you don't need the sync versions.

> >Something like the (untested) patch below would do the trick.
> >We'd still need Dougs patch for the EH case, though.

> 
> The only other call to stop_all_queued() is from the
> .eh_host_reset_handler entry point.

True, but you also have stop_queued_cmnd for the abort case, which
also needs that treatment.



Re: [REGRESSION 3.17] scsi (uas) disks no longer using tagged command queuing

2014-09-09 Thread Christoph Hellwig
On Tue, Sep 09, 2014 at 11:15:24AM +0200, Hans de Goede wrote:
> Taking the uas.c file from 3.17, and building it for 3.16 restores
> the use of tcq (debugged by adding a printk blk_rq_tagged + request->tag).
> 
> So either uas is doing something wrong which happened to work in
> 3.16, or something has broken in 3.17.
> 
> I've already added debug printk-s of scsi_device->tagged_supported,
> queue_depth, ordered_tags and simple_tags and those all look good
> (1, 29, 1, 1).
> 
> I've also tried setting disable_blk_mq and that does not help.
> 
> Any hints to help debug this further (other than a bisect) are
> appreciated. If no-one has any smart ideas I guess I'll end up doing
> a full bisect.

scsi-mq isn't enabled by default, so setting disable_blk_mq should
indeed not make a difference.

One interesting thing with uas is that it uses scsi_init_shared_tag_map,
which only few drivers do.

Can you apply a debug patch like the one below and see if that helps to
pinpoint the issue?

diff --git a/block/blk-tag.c b/block/blk-tag.c
index a185b86..584db25 100644
--- a/block/blk-tag.c
+++ b/block/blk-tag.c
@@ -175,8 +175,10 @@ int blk_queue_init_tags(struct request_queue *q, int depth,
return rc;
queue_flag_set(QUEUE_FLAG_QUEUED, q);
return 0;
-   } else
+   } else {
+   printk("%s: grabbing reference to shared tag structure!\n", 
__func__);
atomic_inc(&tags->refcnt);
+   }
 
/*
 * assign it, all done
diff --git a/include/scsi/scsi_tcq.h b/include/scsi/scsi_tcq.h
index cdcc90b..7f3b5cb 100644
--- a/include/scsi/scsi_tcq.h
+++ b/include/scsi/scsi_tcq.h
@@ -164,8 +164,13 @@ static inline int scsi_init_shared_tag_map(struct Scsi_Host *shost, int depth)
 */
if (!shost->bqt) {
shost->bqt = blk_init_tags(depth);
-   if (!shost->bqt)
+   if (!shost->bqt) {
+   printk("failed to init host-wide tag map!\n");
return -ENOMEM;
+   }
+   printk("initialized host-wide tag map!\n");
+   } else {
+   printk("host-wide tag map already set!\n");
}
 
return 0;


Re: [PATCH 2/2] uas: Disable uas on ASM1051 devices

2014-09-09 Thread Alan Stern
On Tue, 9 Sep 2014, Hans de Goede wrote:

> Even with REPORT SUPPORTED OPERATION CODES blacklisted the ASM1051 chipset
> still does not work when combined with some disks, e.g. a Crucial M500 ssd.
> 
> When used with a troublesome disk, the chipset throws all kinds of USB errors,
> and eventually hangs, whereas in BOT mode it works fine.
> 
> To make matters worse the ASM1051 chipset and older versions of the ASM1053
> chipset use the same usb-id.
> 
> When connected over USB-3 the two can be told apart by the number of streams
> they support. So this patch adds some less than pretty code to disable uas for
> the ASM1051. When connected over USB-2, simply disable uas altogether for
> devices with the shared usb-id.
> 
> This ends up disabling uas in many cases where it actually works fine, which
> is unfortunate, but better than leaving some users with completely non-working
> external disks. This patch adds a log message explaining how performance-conscious
> users can re-enable uas.
> 
> Cc: sta...@vger.kernel.org # 3.16
> Signed-off-by: Hans de Goede 
> ---
>  drivers/usb/storage/uas-detect.h | 34 ++
>  1 file changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/usb/storage/uas-detect.h 
> b/drivers/usb/storage/uas-detect.h
> index 503ac5c..3119f3f 100644
> --- a/drivers/usb/storage/uas-detect.h
> +++ b/drivers/usb/storage/uas-detect.h
> @@ -59,10 +59,6 @@ static int uas_use_uas_driver(struct usb_interface *intf,
>   unsigned long flags = id->driver_info;
>   int r, alt;
>  
> - usb_stor_adjust_quirks(udev, &flags);
> -
> - if (flags & US_FL_IGNORE_UAS)
> - return 0;
>  
>   alt = uas_find_uas_alt_setting(intf);
>   if (alt < 0)
> @@ -72,6 +68,36 @@ static int uas_use_uas_driver(struct usb_interface *intf,
>   if (r < 0)
>   return 0;
>  
> + /*
> +  * ASM1051 and older ASM1053 devices have the same usb-id, and UAS is
> +  * broken with some disks on the ASM1051, use the number of streams to
> +  * differentiate between the 2. Newer ASM1053 devices also support 32
> +  * streams, but have a different product-id.
> +  */
> + if (udev->descriptor.idVendor == 0x174c &&
> + udev->descriptor.idProduct == 0x55aa) {
> + if (udev->speed < USB_SPEED_SUPER) {
> + /* No streams info, assume ASM1051E */
> + dev_warn(&udev->dev, "Assuming ASM1051E chipset\n");
> + flags |= US_FL_IGNORE_UAS;
> + } else if (usb_ss_max_streams(&eps[1]->ss_ep_comp) == 32) {
> + dev_warn(&udev->dev, "Detected ASM1051E chipset\n");
> + flags |= US_FL_IGNORE_UAS;
> + }
> + }
> +
> + usb_stor_adjust_quirks(udev, &flags);

This won't work.  usb_stor_adjust_quirks masks out all the quirks that 
it handles before applying the user-specified quirks.  Therefore it 
will erase your US_FL_IGNORE_UAS flag.

You might as well remove this call and leave the earlier call to
usb_stor_adjust_quirks (in the first hunk of the patch) as is.

Alan Stern



Re: [PATCH 0/8] block: per-distpatch_queue flush machinery

2014-09-09 Thread Christoph Hellwig
Hi Ming,

thanks for doing this work!

I've only taken a very cursory look at the series and I like most of it.

But can you explain why you're not simply incrementing the number of
allocated requests in the blk-mq request allocation code instead of
allocating the flush request separately in the last patch?

(A more throughout review will follow too)



Re: WARNING in block layer triggered in 3.17-rc3

2014-09-09 Thread Tejun Heo
Hello, Alan.

On Tue, Sep 09, 2014 at 11:08:22AM -0400, Alan Stern wrote:
> Sorry, the meaning wasn't clear.  I meant that the existing code 
> doesn't work.  The patch seems to work okay (except that I can't use 
> queue_flag_test_and_set because the queue isn't locked at that point).  
> I'll submit the patch shortly.

Given it's fully synchronized, the following should work fine and
probably is less misleading than using atomic test_and_set.

	if (!blk_queue_init_done(q)) {
		queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
		blk_queue_bypass_end(q);
	}

Thanks.

-- 
tejun


Re: WARNING in block layer triggered in 3.17-rc3

2014-09-09 Thread Alan Stern
On Tue, 9 Sep 2014, Tejun Heo wrote:

> Hello,
> 
> On Mon, Sep 08, 2014 at 01:42:44PM -0400, Alan Stern wrote:
> > > This looks like a nasty hack.  In theory the QUEUE_FLAG_INIT_DONE should
> > > be unset on blk_unregister_queue() to match the teardown; it's only
> > > accident it isn't.  del_gendisk() in sd_remove() is supposed to tear a
> > > lot of queue stuff down.
> > 
> > It's not clear what the operative assumptions are.  The comment in
> > blk_register_queue() implies that bypass is active only because it was
> > set up that way when the queue was created.  The fact that
> > blk_unregister_queue() doesn't call blk_queue_bypass_start() seems to
> > support this view -- although it could also be a simple oversight.
> > 
> > Hopefully Tejun can clear this iup.
> 
> Maintaining the initial bypass till queue registration is an
> optimization because shutting down a fully functional queue is a
> costly operation and there are drivers which set and destroy queues
> repeatedly while probing, so, yeah, it's really a special case for
> when the queue is being registered for the first time.
> 
> > > Your hack seems to indicate that this doesn't work on the add->del->add
> > > transtion of a gendisk.
> > 
> > Indeed, it does not work.
> 
> As such, the change you suggested makes perfect sense to me.  Why
> wouldn't it work?

Sorry, the meaning wasn't clear.  I meant that the existing code 
doesn't work.  The patch seems to work okay (except that I can't use 
queue_flag_test_and_set because the queue isn't locked at that point).  
I'll submit the patch shortly.

Alan Stern



[PATCH 1/2] uas: Set no_report_opcodes

2014-09-09 Thread Hans de Goede
asm1051e usb <-> sata bridges hang when receiving a report opcodes scsi cmnd.
Take a page out of the usb-storage book, and simply set no_report_opcodes
outright.

Cc: sta...@vger.kernel.org # 3.16
Signed-off-by: Hans de Goede 
---
 drivers/usb/storage/uas.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/usb/storage/uas.c b/drivers/usb/storage/uas.c
index 3f42785..4e2f576 100644
--- a/drivers/usb/storage/uas.c
+++ b/drivers/usb/storage/uas.c
@@ -944,6 +944,8 @@ static int uas_slave_alloc(struct scsi_device *sdev)
 */
blk_queue_update_dma_alignment(sdev->request_queue, (512 - 1));
 
+   sdev->no_report_opcodes = 1;
+
return 0;
 }
 
-- 
2.1.0



[PATCH 2/2] uas: Disable uas on ASM1051 devices

2014-09-09 Thread Hans de Goede
Even with REPORT SUPPORTED OPERATION CODES blacklisted the ASM1051 chipset
still does not work when combined with some disks, e.g. a Crucial M500 ssd.

When used with a troublesome disk, the chipset throws all kinds of USB errors,
and eventually hangs, whereas in BOT mode it works fine.

To make matters worse the ASM1051 chipset and older versions of the ASM1053
chipset use the same usb-id.

When connected over USB-3 the two can be told apart by the number of streams
they support. So this patch adds some less than pretty code to disable uas for
the ASM1051. When connected over USB-2, simply disable uas altogether for
devices with the shared usb-id.

This ends up disabling uas in many cases where it actually works fine, which
is unfortunate, but better than leaving some users with completely non-working
external disks. This patch adds a log message explaining how performance-conscious
users can re-enable uas.

Cc: sta...@vger.kernel.org # 3.16
Signed-off-by: Hans de Goede 
---
 drivers/usb/storage/uas-detect.h | 34 ++
 1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/drivers/usb/storage/uas-detect.h b/drivers/usb/storage/uas-detect.h
index 503ac5c..3119f3f 100644
--- a/drivers/usb/storage/uas-detect.h
+++ b/drivers/usb/storage/uas-detect.h
@@ -59,10 +59,6 @@ static int uas_use_uas_driver(struct usb_interface *intf,
unsigned long flags = id->driver_info;
int r, alt;
 
-   usb_stor_adjust_quirks(udev, &flags);
-
-   if (flags & US_FL_IGNORE_UAS)
-   return 0;
 
alt = uas_find_uas_alt_setting(intf);
if (alt < 0)
@@ -72,6 +68,36 @@ static int uas_use_uas_driver(struct usb_interface *intf,
if (r < 0)
return 0;
 
+   /*
+* ASM1051 and older ASM1053 devices have the same usb-id, and UAS is
+* broken with some disks on the ASM1051, use the number of streams to
+* differentiate between the 2. Newer ASM1053 devices also support 32
+* streams, but have a different product-id.
+*/
+   if (udev->descriptor.idVendor == 0x174c &&
+   udev->descriptor.idProduct == 0x55aa) {
+   if (udev->speed < USB_SPEED_SUPER) {
+   /* No streams info, assume ASM1051E */
+   dev_warn(&udev->dev, "Assuming ASM1051E chipset\n");
+   flags |= US_FL_IGNORE_UAS;
+   } else if (usb_ss_max_streams(&eps[1]->ss_ep_comp) == 32) {
+   dev_warn(&udev->dev, "Detected ASM1051E chipset\n");
+   flags |= US_FL_IGNORE_UAS;
+   }
+   }
+
+   usb_stor_adjust_quirks(udev, &flags);
+
+   if (flags & US_FL_IGNORE_UAS) {
+   dev_warn(&udev->dev,
+   "UAS does not work with this chipset with some disks, 
using usb-storage instead\n");
+   dev_warn(&udev->dev,
+   "Add 'usb-storage.quirks=%04x:%04x:' to the kernel 
commandline to try UAS anyways\n",
+   le16_to_cpu(udev->descriptor.idVendor),
+   le16_to_cpu(udev->descriptor.idProduct));
+   return 0;
+   }
+
if (udev->bus->sg_tablesize == 0) {
dev_warn(&udev->dev,
"The driver for the USB controller %s does not support 
scatter-gather which is\n",
-- 
2.1.0



[PATCH fixes for 3.17 0/2] uas: Disable uas on ASM1051 devices

2014-09-09 Thread Hans de Goede
Hi Greg,

I've received a number of bug-reports from users related to uas on devices
using the ASM1051 chipset. After some searching around I've managed to get
myself an ASM1051 device.

As a result I've spent the last 4 days trying to get the ASM1051 chipset to
work. After some initial success which made it work with most of the disks I
have lying around (patch 1), I hit problems with one disk where it just would
not work.

To make matters worse the ASM1051 chipset shares its usb-id with older
versions of the ASM1053, which works fine with all disks.

So I've come up with a patch which disables uas on all ASM1051 devices and on
some ASM1053 devices. Both the need to blacklist and the actual blacklisting
code are less than satisfactory, esp. after 4 days of work. But this is the
best I can come up with :|

It seems that the streams support in older uas-bridge chips and xhci
controllers suffers from quite a few hw bugs, and that blacklisting is the
best we can do.

Regards,

Hans


Re: [PATCH 03/11] megaraid_sas : Update threshold based reply post host index register

2014-09-09 Thread Tomas Henzl
On 09/06/2014 03:25 PM, sumit.sax...@avagotech.com wrote:
> Current driver updates reply post host index to let firmware know that 
> replies are processed,
> while returning from the ISR function, only if there are no outstanding replies in 
> reply queue.
>
> Driver will free the request frame immediately from ISR but reply post host 
> index is not yet updated.
> It means freed request can be used by submission path and there may be a 
> tight loop in request/reply
> path. In such condition, firmware may crash when it tries to post reply and 
> there is no free 
> reply post descriptor.
>
> Eventually two things need to be changed to avoid this issue.
>
> Increase reply queue depth (double the request queue depth) to accommodate the worst 
> case scenario.
> Update reply post host index to firmware once it reaches a pre-defined 
> threshold value.
>
> This change will make sure that firmware will always have some buffer of 
> reply descriptors and 
> will never find an empty reply descriptor in the completion path.
>
> Signed-off-by: Sumit Saxena 
> Signed-off-by: Kashyap Desai 
> ---
>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 23 ++-
>  drivers/scsi/megaraid/megaraid_sas_fusion.h |  1 +
>  2 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c 
> b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> index c69c1ac..f30297d 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> @@ -971,7 +971,7 @@ megasas_init_adapter_fusion(struct megasas_instance 
> *instance)
>  
>   max_cmd = instance->max_fw_cmds;
>  
> - fusion->reply_q_depth = ((max_cmd + 1 + 15)/16)*16;
> + fusion->reply_q_depth = (((max_cmd * 2) + 1 + 15)/16)*16;

This computation gives the same result as before for certain values, for some
the result is 1.5x higher and for others 2x higher. Is this intended?
(I'm not asking what all the magic numbers mean..)
What about a simple
fusion->reply_q_depth = 2*(((max_cmd + 1 + 15)/16)*16);
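
A quick standalone check of that observation (plain integer math, nothing
driver specific):

#include <stdio.h>

int main(void)
{
        int vals[] = { 1, 8, 16 };

        for (int i = 0; i < 3; i++) {
                int max_cmd = vals[i];
                int old = ((max_cmd + 1 + 15) / 16) * 16;       /* current */
                int new = (((max_cmd * 2) + 1 + 15) / 16) * 16; /* patched */
                printf("max_cmd=%2d old=%2d new=%2d\n", max_cmd, old, new);
        }
        return 0;       /* 1: 16/16 (same), 8: 16/32 (2x), 16: 32/48 (1.5x) */
}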

>  
>   fusion->request_alloc_sz =
>   sizeof(union MEGASAS_REQUEST_DESCRIPTOR_UNION) *max_cmd;
> @@ -1876,6 +1876,7 @@ complete_cmd_fusion(struct megasas_instance *instance, 
> u32 MSIxIndex)
>   u32 status, extStatus, device_id;
>   union desc_value d_val;
>   struct LD_LOAD_BALANCE_INFO *lbinfo;
> + int threshold_reply_count = 0;
>  
>   fusion = instance->ctrl_context;
>  
> @@ -1963,6 +1964,7 @@ complete_cmd_fusion(struct megasas_instance *instance, 
> u32 MSIxIndex)
>  
>   desc->Words = ULLONG_MAX;
>   num_completed++;
> + threshold_reply_count++;
>  
>   /* Get the next reply descriptor */
>   if (!fusion->last_reply_idx[MSIxIndex])
> @@ -1982,6 +1984,25 @@ complete_cmd_fusion(struct megasas_instance *instance, 
> u32 MSIxIndex)
>  
>   if (reply_descript_type == MPI2_RPY_DESCRIPT_FLAGS_UNUSED)
>   break;
> + /*
> +  * Write to reply post host index register after completing threshold
> +  * number of reply counts and still there are more replies in reply queue
> +  * pending to be completed
> +  */
> + if (threshold_reply_count >= THRESHOLD_REPLY_COUNT) {
> + if ((instance->pdev->device ==
> + PCI_DEVICE_ID_LSI_INVADER) ||
> + (instance->pdev->device ==
> + PCI_DEVICE_ID_LSI_FURY))
> + writel(((MSIxIndex & 0x7) << 24) |
> + fusion->last_reply_idx[MSIxIndex],
> + instance->reply_post_host_index_addr[MSIxIndex/8]);
> + else
> + writel((MSIxIndex << 24) |
> + fusion->last_reply_idx[MSIxIndex],
> + instance->reply_post_host_index_addr[0]);
> + threshold_reply_count = 0;
> + }
>   }
>  
>   if (!num_completed)
> diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.h 
> b/drivers/scsi/megaraid/megaraid_sas_fusion.h
> index e76af54..d660c4d 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.h
> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.h
> @@ -86,6 +86,7 @@ enum MR_RAID_FLAGS_IO_SUB_TYPE {
>  
>  #define MEGASAS_FP_CMD_LEN   16
>  #define MEGASAS_FUSION_IN_RESET 0
> +#define THRESHOLD_REPLY_COUNT 50
>  
>  /*
>   * Raid Context structure which describes MegaRAID specific IO Parameters



Re: [PATCH 02/11] megaraid_sas : Use writeq for 64bit pci write to avoid spinlock overhead

2014-09-09 Thread Tomas Henzl
On 09/06/2014 03:25 PM, sumit.sax...@avagotech.com wrote:
> Use writeq() for 64bit PCI write instead of writel() to avoid additional lock 
> overhead.
>
> Signed-off-by: Sumit Saxena 
> Signed-off-by: Kashyap Desai 
> ---
>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c 
> b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> index 57b47fe..c69c1ac 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> @@ -1065,6 +1065,13 @@ megasas_fire_cmd_fusion(struct megasas_instance 
> *instance,
>   u32 req_desc_hi,
>   struct megasas_register_set __iomem *regs)

Hi Sumit,
the fn params are a bit confusing: req_desc_lo is of type dma_addr_t
while req_desc_hi is u32, is it possible to unify them in the future?

>  {
> +#if defined(writeq) && defined(CONFIG_64BIT)

In a similar place mpt2sas (_base_writeq) uses only "#ifndef writeq";
if that's incorrect fix it there too, or remove the CONFIG_64BIT check here.

> + u64 req_data = 0;
> +
> + req_data = req_desc_hi;
> + req_data = ((req_data << 32) | (u32)req_desc_lo);

This seems to be a critical path (you are removing a spinlock to avoid overhead),
so why do you have three consecutive assignments to the same variable?
(~(u64 req_data = r_hi << 32 | r_lo))
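
i.e. something like this (a sketch, reusing the names from the patch):

/* one expression, no intermediate zero-initialization */
u64 req_data = ((u64)req_desc_hi << 32) | (u32)req_desc_lo;

writeq(le64_to_cpu(req_data), &(regs)->inbound_low_queue_port);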

Cheers,
Tomas

> + writeq(le64_to_cpu(req_data), &(regs)->inbound_low_queue_port);
> +#else
>   unsigned long flags;
>  
>   spin_lock_irqsave(&instance->hba_lock, flags);
> @@ -1072,6 +1079,7 @@ megasas_fire_cmd_fusion(struct megasas_instance 
> *instance,
>   writel(le32_to_cpu(req_desc_lo), &(regs)->inbound_low_queue_port);
>   writel(le32_to_cpu(req_desc_hi), &(regs)->inbound_high_queue_port);
>   spin_unlock_irqrestore(&instance->hba_lock, flags);
> +#endif
>  }
>  
>  /**



[PATCH 3/8] block: move flush initialized stuff to blk_flush_init

2014-09-09 Thread Ming Lei
These stuff is always used with flush req together, so
we can do that safely.

Signed-off-by: Ming Lei 
---
 block/blk-core.c  |3 ---
 block/blk-flush.c |4 
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 0a9d172..222fe84 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -600,9 +600,6 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
 #ifdef CONFIG_BLK_CGROUP
INIT_LIST_HEAD(&q->blkg_list);
 #endif
-   INIT_LIST_HEAD(&q->flush_queue[0]);
-   INIT_LIST_HEAD(&q->flush_queue[1]);
-   INIT_LIST_HEAD(&q->flush_data_in_flight);
INIT_DELAYED_WORK(&q->delay_work, blk_delay_work);
 
kobject_init(&q->kobj, &blk_queue_ktype);
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 6932ee8..a5b2a00 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -490,6 +490,10 @@ static int blk_mq_init_flush(struct request_queue *q)
 
 int blk_init_flush(struct request_queue *q)
 {
+   INIT_LIST_HEAD(&q->flush_queue[0]);
+   INIT_LIST_HEAD(&q->flush_queue[1]);
+   INIT_LIST_HEAD(&q->flush_data_in_flight);
+
if (q->mq_ops)
return blk_mq_init_flush(q);
 
-- 
1.7.9.5



[PATCH 5/8] block: introduce blk_flush_queue to drive flush machinery

2014-09-09 Thread Ming Lei
This patch introduces 'struct blk_flush_queue' and puts all
flush machinery related stuff into this strcuture, so that

- flush implementation details aren't exposed to driver
- it is easy to convert to per dispatch-queue flush machinery

This patch is basically a mechanical replacement.

Signed-off-by: Ming Lei 
---
 block/blk-core.c   |3 +-
 block/blk-flush.c  |  106 
 block/blk-mq.c |   10 +++--
 block/blk.h|   22 +-
 include/linux/blkdev.h |   10 +
 5 files changed, 101 insertions(+), 50 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 222fe84..d278a30 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -390,11 +390,12 @@ static void __blk_drain_queue(struct request_queue *q, 
bool drain_all)
 * be drained.  Check all the queues and counters.
 */
if (drain_all) {
+   struct blk_flush_queue *fq = blk_get_flush_queue(q);
drain |= !list_empty(&q->queue_head);
for (i = 0; i < 2; i++) {
drain |= q->nr_rqs[i];
drain |= q->in_flight[i];
-   drain |= !list_empty(&q->flush_queue[i]);
+   drain |= !list_empty(&fq->flush_queue[i]);
}
}
 
diff --git a/block/blk-flush.c b/block/blk-flush.c
index a59dd1a..894a149 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -28,7 +28,7 @@
  *
  * The actual execution of flush is double buffered.  Whenever a request
  * needs to execute PRE or POSTFLUSH, it queues at
- * q->flush_queue[q->flush_pending_idx].  Once certain criteria are met, a
+ * fq->flush_queue[fq->flush_pending_idx].  Once certain criteria are met, a
  * flush is issued and the pending_idx is toggled.  When the flush
  * completes, all the requests which were pending are proceeded to the next
  * step.  This allows arbitrary merging of different types of FLUSH/FUA
@@ -157,7 +157,7 @@ static bool blk_flush_queue_rq(struct request *rq, bool 
add_front)
  * completion and trigger the next step.
  *
  * CONTEXT:
- * spin_lock_irq(q->queue_lock or q->mq_flush_lock)
+ * spin_lock_irq(q->queue_lock or fq->mq_flush_lock)
  *
  * RETURNS:
  * %true if requests were added to the dispatch queue, %false otherwise.
@@ -166,7 +166,8 @@ static bool blk_flush_complete_seq(struct request *rq, 
unsigned int seq,
   int error)
 {
struct request_queue *q = rq->q;
-   struct list_head *pending = &q->flush_queue[q->flush_pending_idx];
+   struct blk_flush_queue *fq = blk_get_flush_queue(q);
+   struct list_head *pending = &fq->flush_queue[fq->flush_pending_idx];
bool queued = false, kicked;
 
BUG_ON(rq->flush.seq & seq);
@@ -182,12 +183,12 @@ static bool blk_flush_complete_seq(struct request *rq, 
unsigned int seq,
case REQ_FSEQ_POSTFLUSH:
/* queue for flush */
if (list_empty(pending))
-   q->flush_pending_since = jiffies;
+   fq->flush_pending_since = jiffies;
list_move_tail(&rq->flush.list, pending);
break;
 
case REQ_FSEQ_DATA:
-   list_move_tail(&rq->flush.list, &q->flush_data_in_flight);
+   list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
queued = blk_flush_queue_rq(rq, true);
break;
 
@@ -222,17 +223,18 @@ static void flush_end_io(struct request *flush_rq, int 
error)
bool queued = false;
struct request *rq, *n;
unsigned long flags = 0;
+   struct blk_flush_queue *fq = blk_get_flush_queue(q);
 
if (q->mq_ops) {
-   spin_lock_irqsave(&q->mq_flush_lock, flags);
+   spin_lock_irqsave(&fq->mq_flush_lock, flags);
flush_rq->tag = -1;
}
 
-   running = &q->flush_queue[q->flush_running_idx];
-   BUG_ON(q->flush_pending_idx == q->flush_running_idx);
+   running = &fq->flush_queue[fq->flush_running_idx];
+   BUG_ON(fq->flush_pending_idx == fq->flush_running_idx);
 
/* account completion of the flush request */
-   q->flush_running_idx ^= 1;
+   fq->flush_running_idx ^= 1;
 
if (!q->mq_ops)
elv_completed_request(q, flush_rq);
@@ -256,13 +258,13 @@ static void flush_end_io(struct request *flush_rq, int 
error)
 * directly into request_fn may confuse the driver.  Always use
 * kblockd.
 */
-   if (queued || q->flush_queue_delayed) {
+   if (queued || fq->flush_queue_delayed) {
WARN_ON(q->mq_ops);
blk_run_queue_async(q);
}
-   q->flush_queue_delayed = 0;
+   fq->flush_queue_delayed = 0;
if (q->mq_ops)
-   spin_unlock_irqrestore(&q->mq_flush_lock,

[PATCH 6/8] block: flush: avoid to figure out flush queue unnecessarily

2014-09-09 Thread Ming Lei
Just figure out the flush queue once, at the entry of kicking off the flush
machinery and in the request's completion handler, then pass it through.

Signed-off-by: Ming Lei 
---
 block/blk-flush.c |   30 --
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 894a149..a9d6e01 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -91,7 +91,8 @@ enum {
FLUSH_PENDING_TIMEOUT   = 5 * HZ,
 };
 
-static bool blk_kick_flush(struct request_queue *q);
+static bool blk_kick_flush(struct request_queue *q,
+  struct blk_flush_queue *fq);
 
 static unsigned int blk_flush_policy(unsigned int fflags, struct request *rq)
 {
@@ -150,6 +151,7 @@ static bool blk_flush_queue_rq(struct request *rq, bool 
add_front)
 /**
  * blk_flush_complete_seq - complete flush sequence
  * @rq: FLUSH/FUA request being sequenced
+ * @fq: flush queue
  * @seq: sequences to complete (mask of %REQ_FSEQ_*, can be zero)
  * @error: whether an error occurred
  *
@@ -162,11 +164,11 @@ static bool blk_flush_queue_rq(struct request *rq, bool 
add_front)
  * RETURNS:
  * %true if requests were added to the dispatch queue, %false otherwise.
  */
-static bool blk_flush_complete_seq(struct request *rq, unsigned int seq,
-  int error)
+static bool blk_flush_complete_seq(struct request *rq,
+  struct blk_flush_queue *fq,
+  unsigned int seq, int error)
 {
struct request_queue *q = rq->q;
-   struct blk_flush_queue *fq = blk_get_flush_queue(q);
struct list_head *pending = &fq->flush_queue[fq->flush_pending_idx];
bool queued = false, kicked;
 
@@ -212,7 +214,7 @@ static bool blk_flush_complete_seq(struct request *rq, 
unsigned int seq,
BUG();
}
 
-   kicked = blk_kick_flush(q);
+   kicked = blk_kick_flush(q, fq);
return kicked | queued;
 }
 
@@ -244,7 +246,7 @@ static void flush_end_io(struct request *flush_rq, int 
error)
unsigned int seq = blk_flush_cur_seq(rq);
 
BUG_ON(seq != REQ_FSEQ_PREFLUSH && seq != REQ_FSEQ_POSTFLUSH);
-   queued |= blk_flush_complete_seq(rq, seq, error);
+   queued |= blk_flush_complete_seq(rq, fq, seq, error);
}
 
/*
@@ -270,6 +272,7 @@ static void flush_end_io(struct request *flush_rq, int 
error)
 /**
  * blk_kick_flush - consider issuing flush request
  * @q: request_queue being kicked
+ * @fq: flush queue
  *
  * Flush related states of @q have changed, consider issuing flush request.
  * Please read the comment at the top of this file for more info.
@@ -280,9 +283,8 @@ static void flush_end_io(struct request *flush_rq, int 
error)
  * RETURNS:
  * %true if flush was issued, %false otherwise.
  */
-static bool blk_kick_flush(struct request_queue *q)
+static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq)
 {
-   struct blk_flush_queue *fq = blk_get_flush_queue(q);
struct list_head *pending = &fq->flush_queue[fq->flush_pending_idx];
struct request *first_rq =
list_first_entry(pending, struct request, flush.list);
@@ -319,12 +321,13 @@ static bool blk_kick_flush(struct request_queue *q)
 static void flush_data_end_io(struct request *rq, int error)
 {
struct request_queue *q = rq->q;
+   struct blk_flush_queue *fq = blk_get_flush_queue(q);
 
/*
 * After populating an empty queue, kick it to avoid stall.  Read
 * the comment in flush_end_io().
 */
-   if (blk_flush_complete_seq(rq, REQ_FSEQ_DATA, error))
+   if (blk_flush_complete_seq(rq, fq, REQ_FSEQ_DATA, error))
blk_run_queue_async(q);
 }
 
@@ -344,7 +347,7 @@ static void mq_flush_data_end_io(struct request *rq, int 
error)
 * the comment in flush_end_io().
 */
spin_lock_irqsave(&fq->mq_flush_lock, flags);
-   if (blk_flush_complete_seq(rq, REQ_FSEQ_DATA, error))
+   if (blk_flush_complete_seq(rq, fq, REQ_FSEQ_DATA, error))
blk_mq_run_hw_queue(hctx, true);
spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
 }
@@ -366,6 +369,7 @@ void blk_insert_flush(struct request *rq)
struct request_queue *q = rq->q;
unsigned int fflags = q->flush_flags;   /* may change, cache */
unsigned int policy = blk_flush_policy(fflags, rq);
+   struct blk_flush_queue *fq = blk_get_flush_queue(q);
 
/*
 * @policy now records what operations need to be done.  Adjust
@@ -414,18 +418,16 @@ void blk_insert_flush(struct request *rq)
rq->cmd_flags |= REQ_FLUSH_SEQ;
rq->flush.saved_end_io = rq->end_io; /* Usually NULL */
if (q->mq_ops) {
-   struct blk_flush_queue *fq = blk_get_flush_queue(q);
-
rq->end_io = mq_flush_data_end_io;
 
spin_lock_irq(&fq->mq_flush_lock);
-   blk_fl

[PATCH 2/8] block: introduce blk_init_flush and its pair

2014-09-09 Thread Ming Lei
These two functions are introduced to initialize and de-initialize
flush stuff centrally.

Signed-off-by: Ming Lei 
---
 block/blk-core.c  |5 ++---
 block/blk-flush.c |   19 ++-
 block/blk-mq.c|2 +-
 block/blk-mq.h|1 -
 block/blk-sysfs.c |4 ++--
 block/blk.h   |3 +++
 6 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6946a42..0a9d172 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -705,8 +705,7 @@ blk_init_allocated_queue(struct request_queue *q, 
request_fn_proc *rfn,
if (!q)
return NULL;
 
-   q->flush_rq = kzalloc(sizeof(struct request), GFP_KERNEL);
-   if (!q->flush_rq)
+   if (blk_init_flush(q))
return NULL;
 
if (blk_init_rl(&q->root_rl, q, GFP_KERNEL))
@@ -742,7 +741,7 @@ blk_init_allocated_queue(struct request_queue *q, 
request_fn_proc *rfn,
return q;
 
 fail:
-   kfree(q->flush_rq);
+   blk_exit_flush(q);
return NULL;
 }
 EXPORT_SYMBOL(blk_init_allocated_queue);
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 75ca6cd0..6932ee8 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -474,7 +474,7 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t 
gfp_mask,
 }
 EXPORT_SYMBOL(blkdev_issue_flush);
 
-int blk_mq_init_flush(struct request_queue *q)
+static int blk_mq_init_flush(struct request_queue *q)
 {
struct blk_mq_tag_set *set = q->tag_set;
 
@@ -487,3 +487,20 @@ int blk_mq_init_flush(struct request_queue *q)
return -ENOMEM;
return 0;
 }
+
+int blk_init_flush(struct request_queue *q)
+{
+   if (q->mq_ops)
+   return blk_mq_init_flush(q);
+
+   q->flush_rq = kzalloc(sizeof(struct request), GFP_KERNEL);
+   if (!q->flush_rq)
+   return -ENOMEM;
+
+   return 0;
+}
+
+void blk_exit_flush(struct request_queue *q)
+{
+   kfree(q->flush_rq);
+}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d500793..706181a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1841,7 +1841,7 @@ struct request_queue *blk_mq_init_queue(struct 
blk_mq_tag_set *set)
 
blk_mq_add_queue_tag_set(set, q);
 
-   if (blk_mq_init_flush(q))
+   if (blk_init_flush(q))
goto err_hw_queues;
 
blk_mq_map_swqueue(q);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index b0bd9bc..a39cfa9 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -27,7 +27,6 @@ struct blk_mq_ctx {
 
 void __blk_mq_complete_request(struct request *rq);
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
-int blk_mq_init_flush(struct request_queue *q);
 void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 void blk_mq_clone_flush_request(struct request *flush_rq,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 4db5abf..28d3a11 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -517,11 +517,11 @@ static void blk_release_queue(struct kobject *kobj)
if (q->queue_tags)
__blk_queue_free_tags(q);
 
+   blk_exit_flush(q);
+
if (q->mq_ops)
blk_mq_free_queue(q);
 
-   kfree(q->flush_rq);
-
blk_trace_shutdown(q);
 
bdi_destroy(&q->backing_dev_info);
diff --git a/block/blk.h b/block/blk.h
index 6748c4f..261f734 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -22,6 +22,9 @@ static inline void __blk_get_queue(struct request_queue *q)
kobject_get(&q->kobj);
 }
 
+int blk_init_flush(struct request_queue *q);
+void blk_exit_flush(struct request_queue *q);
+
 int blk_init_rl(struct request_list *rl, struct request_queue *q,
gfp_t gfp_mask);
 void blk_exit_rl(struct request_list *rl);
-- 
1.7.9.5



[PATCH 4/8] block: avoid to use q->flush_rq directly

2014-09-09 Thread Ming Lei
This patch tries to use a local variable to access the flush request,
so that we can convert to per-queue flush machinery a bit easier.

Signed-off-by: Ming Lei 
---
 block/blk-flush.c |   17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index a5b2a00..a59dd1a 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -225,7 +225,7 @@ static void flush_end_io(struct request *flush_rq, int 
error)
 
if (q->mq_ops) {
spin_lock_irqsave(&q->mq_flush_lock, flags);
-   q->flush_rq->tag = -1;
+   flush_rq->tag = -1;
}
 
running = &q->flush_queue[q->flush_running_idx];
@@ -283,6 +283,7 @@ static bool blk_kick_flush(struct request_queue *q)
struct list_head *pending = &q->flush_queue[q->flush_pending_idx];
struct request *first_rq =
list_first_entry(pending, struct request, flush.list);
+   struct request *flush_rq = q->flush_rq;
 
/* C1 described at the top of this file */
if (q->flush_pending_idx != q->flush_running_idx || list_empty(pending))
@@ -300,16 +301,16 @@ static bool blk_kick_flush(struct request_queue *q)
 */
q->flush_pending_idx ^= 1;
 
-   blk_rq_init(q, q->flush_rq);
+   blk_rq_init(q, flush_rq);
if (q->mq_ops)
-   blk_mq_clone_flush_request(q->flush_rq, first_rq);
+   blk_mq_clone_flush_request(flush_rq, first_rq);
 
-   q->flush_rq->cmd_type = REQ_TYPE_FS;
-   q->flush_rq->cmd_flags = WRITE_FLUSH | REQ_FLUSH_SEQ;
-   q->flush_rq->rq_disk = first_rq->rq_disk;
-   q->flush_rq->end_io = flush_end_io;
+   flush_rq->cmd_type = REQ_TYPE_FS;
+   flush_rq->cmd_flags = WRITE_FLUSH | REQ_FLUSH_SEQ;
+   flush_rq->rq_disk = first_rq->rq_disk;
+   flush_rq->end_io = flush_end_io;
 
-   return blk_flush_queue_rq(q->flush_rq, false);
+   return blk_flush_queue_rq(flush_rq, false);
 }
 
 static void flush_data_end_io(struct request *rq, int error)
-- 
1.7.9.5



[PATCH 1/8] blk-mq: allocate flush_rq in blk_mq_init_flush()

2014-09-09 Thread Ming Lei
It is reasonable to allocate flush req in blk_mq_init_flush().

Signed-off-by: Ming Lei 
---
 block/blk-flush.c |   11 ++-
 block/blk-mq.c|   16 ++--
 block/blk-mq.h|2 +-
 3 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 3cb5e9e..75ca6cd0 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -474,7 +474,16 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t 
gfp_mask,
 }
 EXPORT_SYMBOL(blkdev_issue_flush);
 
-void blk_mq_init_flush(struct request_queue *q)
+int blk_mq_init_flush(struct request_queue *q)
 {
+   struct blk_mq_tag_set *set = q->tag_set;
+
spin_lock_init(&q->mq_flush_lock);
+
+   q->flush_rq = kzalloc(round_up(sizeof(struct request) +
+   set->cmd_size, cache_line_size()),
+   GFP_KERNEL);
+   if (!q->flush_rq)
+   return -ENOMEM;
+   return 0;
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b5af123..d500793 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1830,17 +1830,10 @@ struct request_queue *blk_mq_init_queue(struct 
blk_mq_tag_set *set)
if (set->ops->complete)
blk_queue_softirq_done(q, set->ops->complete);
 
-   blk_mq_init_flush(q);
blk_mq_init_cpu_queues(q, set->nr_hw_queues);
 
-   q->flush_rq = kzalloc(round_up(sizeof(struct request) +
-   set->cmd_size, cache_line_size()),
-   GFP_KERNEL);
-   if (!q->flush_rq)
-   goto err_hw;
-
if (blk_mq_init_hw_queues(q, set))
-   goto err_flush_rq;
+   goto err_hw;
 
mutex_lock(&all_q_mutex);
list_add_tail(&q->all_q_node, &all_q_list);
@@ -1848,12 +1841,15 @@ struct request_queue *blk_mq_init_queue(struct 
blk_mq_tag_set *set)
 
blk_mq_add_queue_tag_set(set, q);
 
+   if (blk_mq_init_flush(q))
+   goto err_hw_queues;
+
blk_mq_map_swqueue(q);
 
return q;
 
-err_flush_rq:
-   kfree(q->flush_rq);
+err_hw_queues:
+   blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
 err_hw:
blk_cleanup_queue(q);
 err_hctxs:
diff --git a/block/blk-mq.h b/block/blk-mq.h
index ca4964a..b0bd9bc 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -27,7 +27,7 @@ struct blk_mq_ctx {
 
 void __blk_mq_complete_request(struct request *rq);
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
-void blk_mq_init_flush(struct request_queue *q);
+int blk_mq_init_flush(struct request_queue *q);
 void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 void blk_mq_clone_flush_request(struct request *flush_rq,
-- 
1.7.9.5



[PATCH 7/8] block: introduce 'blk_mq_ctx' parameter to blk_get_flush_queue

2014-09-09 Thread Ming Lei
This patch adds a 'blk_mq_ctx' parameter to blk_get_flush_queue(),
so that the function can find the blk_flush_queue bound to the
current mq context, since the flush queue will become per-hw-queue.

For a legacy queue, the parameter can simply be 'NULL'.

For the multiqueue case, the parameter should be set to the context
from which the related request originated. With the context info,
the hw queue and its flush queue can be found.
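
A minimal sketch of how the lookup can work once the flush queue is
per-hw-queue (patch 8/8); this is an illustration rather than the exact
helper, and the hctx->fq field is taken from patch 8/8:

static inline struct blk_flush_queue *
blk_get_flush_queue(struct request_queue *q, struct blk_mq_ctx *ctx)
{
	struct blk_mq_hw_ctx *hctx;

	/* legacy queue, or no context given: use the queue-wide flush queue */
	if (!q->mq_ops || !ctx)
		return q->fq;

	/* blk-mq: map the sw context to its hw queue and use its flush queue */
	hctx = q->mq_ops->map_queue(q, ctx->cpu);
	return hctx->fq;
}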

Signed-off-by: Ming Lei 
---
 block/blk-core.c  |    2 +-
 block/blk-flush.c |   17 -
 block/blk-mq.c    |    3 ++-
 block/blk.h       |    4 ++--
 4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d278a30..40a5d37 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -390,7 +390,7 @@ static void __blk_drain_queue(struct request_queue *q, bool drain_all)
 * be drained.  Check all the queues and counters.
 */
if (drain_all) {
-   struct blk_flush_queue *fq = blk_get_flush_queue(q);
+   struct blk_flush_queue *fq = blk_get_flush_queue(q, NULL);
drain |= !list_empty(&q->queue_head);
for (i = 0; i < 2; i++) {
drain |= q->nr_rqs[i];
diff --git a/block/blk-flush.c b/block/blk-flush.c
index a9d6e01..4a445a1 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -225,7 +225,7 @@ static void flush_end_io(struct request *flush_rq, int error)
bool queued = false;
struct request *rq, *n;
unsigned long flags = 0;
-   struct blk_flush_queue *fq = blk_get_flush_queue(q);
+   struct blk_flush_queue *fq = blk_get_flush_queue(q, flush_rq->mq_ctx);
 
if (q->mq_ops) {
spin_lock_irqsave(&fq->mq_flush_lock, flags);
@@ -321,7 +321,7 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq)
 static void flush_data_end_io(struct request *rq, int error)
 {
struct request_queue *q = rq->q;
-   struct blk_flush_queue *fq = blk_get_flush_queue(q);
+   struct blk_flush_queue *fq = blk_get_flush_queue(q, NULL);
 
/*
 * After populating an empty queue, kick it to avoid stall.  Read
@@ -335,11 +335,10 @@ static void mq_flush_data_end_io(struct request *rq, int error)
 {
struct request_queue *q = rq->q;
struct blk_mq_hw_ctx *hctx;
-   struct blk_mq_ctx *ctx;
+   struct blk_mq_ctx *ctx = rq->mq_ctx;
unsigned long flags;
-   struct blk_flush_queue *fq = blk_get_flush_queue(q);
+   struct blk_flush_queue *fq = blk_get_flush_queue(q, ctx);
 
-   ctx = rq->mq_ctx;
hctx = q->mq_ops->map_queue(q, ctx->cpu);
 
/*
@@ -369,7 +368,7 @@ void blk_insert_flush(struct request *rq)
struct request_queue *q = rq->q;
unsigned int fflags = q->flush_flags;   /* may change, cache */
unsigned int policy = blk_flush_policy(fflags, rq);
-   struct blk_flush_queue *fq = blk_get_flush_queue(q);
+   struct blk_flush_queue *fq = blk_get_flush_queue(q, rq->mq_ctx);
 
/*
 * @policy now records what operations need to be done.  Adjust
@@ -486,7 +485,7 @@ EXPORT_SYMBOL(blkdev_issue_flush);
 static int blk_mq_init_flush(struct request_queue *q)
 {
struct blk_mq_tag_set *set = q->tag_set;
-   struct blk_flush_queue *fq = blk_get_flush_queue(q);
+   struct blk_flush_queue *fq = blk_get_flush_queue(q, NULL);
 
spin_lock_init(&fq->mq_flush_lock);
 
@@ -500,7 +499,7 @@ static int blk_mq_init_flush(struct request_queue *q)
 
 static void blk_mq_exit_flush(struct request_queue *q)
 {
-   struct blk_flush_queue *fq = blk_get_flush_queue(q);
+   struct blk_flush_queue *fq = blk_get_flush_queue(q, NULL);
kfree(fq->flush_rq);
kfree(fq);
 }
@@ -542,7 +541,7 @@ void blk_exit_flush(struct request_queue *q)
if (q->mq_ops)
blk_mq_exit_flush(q);
else {
-   struct blk_flush_queue *fq = blk_get_flush_queue(q);
+   struct blk_flush_queue *fq = blk_get_flush_queue(q, NULL);
kfree(fq->flush_rq);
kfree(fq);
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7782d14..9ca4442 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -518,7 +518,8 @@ static inline bool is_flush_request(struct request *rq,
 struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 {
struct request *rq = tags->rqs[tag];
-   struct blk_flush_queue *fq = blk_get_flush_queue(rq->q);
+   /* mq_ctx of flush rq is always cloned from the corresponding req */
+   struct blk_flush_queue *fq = blk_get_flush_queue(rq->q, rq->mq_ctx);
 
if (!is_flush_request(rq, fq, tag))
return rq;
diff --git a/block/blk.h b/block/blk.h
index 2637349..30f8033 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -29,7 +29,7 @@ extern struct kobj_type b

[PATCH 8/8] blk-mq: support per-dispatch_queue flush machinery

2014-09-09 Thread Ming Lei
This patch supports running one single flush machinery for
each blk-mq dispatch queue, so that:

- the current init_request and exit_request callbacks can
cover the flush request too, so the ugly and buggy way of
initializing the flush request's pdu can be fixed

- flushing performance gets improved in the multi hw-queue case

In both fio write and randwrite tests over virtio-blk (4 hw queues,
backed by nullblk) with sync=1, ioengine=sync, iodepth=64, numjobs=4,
throughput is observed to increase by about 70% in a VM on my
laptop environment.

The multi-virtqueue feature isn't merged into QEMU yet; patches for
the feature can be found in the tree below:

git://kernel.ubuntu.com/ming/qemu.git   v2.1.0-mq.3

Simply passing 'num_queues=4 vectors=5' should be enough to
enable the multi-queue feature for QEMU virtio-blk.
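
For readers skimming the diff, here is a condensed sketch of the
per-hw-queue flush state this patch moves to; the field list is trimmed
to what shows up in this series and is not a verbatim copy of the header:

struct blk_flush_queue {
	spinlock_t		mq_flush_lock;		/* blk-mq only */
	struct list_head	flush_queue[2];		/* pending/running flush stages */
	struct list_head	flush_data_in_flight;
	struct request		*flush_rq;		/* preallocated flush request (+ driver pdu) */
};

Each blk_mq_hw_ctx then owns one such instance through its 'fq' pointer,
allocated per hw queue in blk_alloc_flush_queue() below.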

Suggested-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/blk-flush.c      |  141 ++--
 block/blk.h            |   12 -
 include/linux/blk-mq.h |    2 +
 3 files changed, 125 insertions(+), 30 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 4a445a1..2fc79bf 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -482,57 +482,143 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
 }
 EXPORT_SYMBOL(blkdev_issue_flush);
 
-static int blk_mq_init_flush(struct request_queue *q)
+static int blk_alloc_flush_queue(struct request_queue *q,
+   struct blk_mq_hw_ctx *hctx,
+   struct blk_flush_queue **pfq)
 {
-   struct blk_mq_tag_set *set = q->tag_set;
-   struct blk_flush_queue *fq = blk_get_flush_queue(q, NULL);
+   struct blk_flush_queue *fq;
+   int rq_sz = sizeof(struct request);
 
-   spin_lock_init(&fq->mq_flush_lock);
+   if (hctx) {
+   int cmd_sz = q->tag_set->cmd_size;
+   int node = hctx->numa_node;
+
+   fq = kzalloc_node(sizeof(*fq), GFP_KERNEL, node);
+   if (!fq)
+   goto failed;
+
+   rq_sz = round_up(rq_sz + cmd_sz, cache_line_size());
+   fq->flush_rq = kzalloc_node(rq_sz, GFP_KERNEL, node);
+   if (!fq->flush_rq)
+   goto rq_failed;
+
+   spin_lock_init(&fq->mq_flush_lock);
+   } else {
+   fq = kzalloc(sizeof(*fq), GFP_KERNEL);
+   if (!fq)
+   goto failed;
+
+   fq->flush_rq = kzalloc(rq_sz, GFP_KERNEL);
+   if (!fq->flush_rq)
+   goto rq_failed;
+   }
+
+   INIT_LIST_HEAD(&fq->flush_queue[0]);
+   INIT_LIST_HEAD(&fq->flush_queue[1]);
+   INIT_LIST_HEAD(&fq->flush_data_in_flight);
 
-   fq->flush_rq = kzalloc(round_up(sizeof(struct request) +
-   set->cmd_size, cache_line_size()),
-   GFP_KERNEL);
-   if (!fq->flush_rq)
-   return -ENOMEM;
+   *pfq = fq;
return 0;
+
+ rq_failed:
+   kfree(fq);
+ failed:
+   return -ENOMEM;
 }
 
-static void blk_mq_exit_flush(struct request_queue *q)
+static void blk_free_flush_queue(struct blk_flush_queue *fq)
 {
-   struct blk_flush_queue *fq = blk_get_flush_queue(q, NULL);
+   if (!fq)
+   return;
kfree(fq->flush_rq);
kfree(fq);
 }
 
-int blk_init_flush(struct request_queue *q)
+static void __blk_mq_exit_flush(struct request_queue *q,
+   unsigned free_end, unsigned int exit_end)
+{
+   struct blk_mq_hw_ctx *hctx;
+   unsigned int k;
+   struct blk_flush_queue *fq;
+   struct blk_mq_tag_set *set = q->tag_set;
+   unsigned start_idx = set->queue_depth;
+
+   queue_for_each_hw_ctx(q, hctx, k) {
+   if (k >= free_end)
+   break;
+
+   fq = hctx->fq;
+   if (k < exit_end && set->ops->exit_request)
+   set->ops->exit_request(set->driver_data,
+   fq->flush_rq, k,
+   start_idx + k);
+
+   blk_free_flush_queue(fq);
+   }
+
+}
+
+static int blk_mq_init_flush(struct request_queue *q)
 {
+   struct blk_mq_hw_ctx *hctx;
+   unsigned int i, j = 0;
+   struct blk_flush_queue *fq;
int ret;
-   struct blk_flush_queue *fq = kzalloc(sizeof(*fq), GFP_KERNEL);
+   struct blk_mq_tag_set *set = q->tag_set;
+   unsigned start_idx = set->queue_depth;
 
-   if (!fq)
-   return -ENOMEM;
+   queue_for_each_hw_ctx(q, hctx, i) {
+   ret = blk_alloc_flush_queue(q, hctx, &fq);
+   if (ret)
+   goto fail;
+   hctx->fq = fq;
+   }
 
-   q->fq = fq;
-   INIT_LIST_HEAD(&fq->flush_queue[0]);
-   INIT_LIST_HEAD(&fq->flush_queue[1]);
-   INIT_LIST_HEAD(&fq->flush_data_in_flight);
+   queue_for_each_hw_ctx(q, hctx, j) {
+   fq = 

[PATCH 0/8] block: per-dispatch_queue flush machinery

2014-09-09 Thread Ming Lei
Hi,

Following recent discussion, and as suggested by Christoph in particular,
this patchset implements per-dispatch_queue flush machinery, so that:

- the current init_request and exit_request callbacks can
cover the flush request too, so the ugly and buggy copying
way of initializing the flush request's pdu can be fixed

- flushing performance gets improved in the multi hw-queue case

About 70% throughput improvement is observed in sync write/randwrite
over multi-dispatch-queue virtio-blk; see details in the commit log
of patch 8/8.
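
To make the first point concrete: once each hw queue has its own
preallocated flush request, the existing init_request/exit_request
callbacks can be invoked on it as well (patch 8/8 passes an index past
the normal queue depth for it). A hypothetical driver callback — names
below are illustrative, not taken from this series — then needs no
flush-specific handling at all:

static int mydrv_init_request(void *data, struct request *rq,
			      unsigned int hctx_idx, unsigned int rq_idx,
			      unsigned int numa_node)
{
	struct mydrv_cmd *cmd = blk_mq_rq_to_pdu(rq);

	/* per-request pdu setup; the flush request now passes through here too */
	cmd->dev = data;
	return 0;
}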

This patchset can also be pulled from the tree below:

git://kernel.ubuntu.com/ming/linux.git v3.17-block-dev


 block/blk-core.c       |   11 +--
 block/blk-flush.c      |  239 +++-
 block/blk-mq.c         |   27 +++---
 block/blk-mq.h         |    1 -
 block/blk-sysfs.c      |    4 +-
 block/blk.h            |   35 ++-
 include/linux/blk-mq.h |    2 +
 include/linux/blkdev.h |   10 +-
 8 files changed, 251 insertions(+), 78 deletions(-)

Thanks,
--
Ming Lei



[REGRESSION 3.17] scsi (uas) disks no longer using tagged command queuing

2014-09-09 Thread Hans de Goede
Hi All,

While working on making error handling in the uas driver more robust,
I noticed that all the commands being sent to a SATA SSD hooked up
over uas were untagged, whereas I would expect TCQ to be used, as that
is the big advantage of uas over usb-storage / BOT.

Taking the uas.c file from 3.17 and building it for 3.16 restores
the use of TCQ (debugged by adding a printk of blk_rq_tagged and request->tag).
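
(Roughly along these lines in the command submission path, with cmnd
being the scsi_cmnd passed to queuecommand; exact placement doesn't
matter much:)

	printk(KERN_INFO "uas: tagged=%d tag=%d\n",
	       blk_rq_tagged(cmnd->request), cmnd->request->tag);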

So either uas is doing something wrong that happened to work in
3.16, or something has broken in 3.17.

I've already added debug printks of scsi_device->tagged_supported,
queue_depth, ordered_tags and simple_tags, and those all look good
(1, 29, 1, 1).

I've also tried setting disable_blk_mq and that does not help.

Any hints to help debug this further (other than a bisect) are
appreciated. If no one has any smart ideas, I guess I'll end up doing
a full bisect.

Regards,

Hans


Re: [PATCH] drivers: target: target_core_transport.c: build warning

2014-09-09 Thread Sudip Mukherjee
On Tue, Sep 09, 2014 at 12:53:50PM +0530, Sudip Mukherjee wrote:
> The build is giving:
> warning: passing argument 1 of 'strlen' makes pointer from integer without a cast [enabled by default]
> 
> The snprintf after the strlen is trying to put the "Unsupported" string
> at the end of the existing string, so len should be the string length here.
> 
> Signed-off-by: Sudip Mukherjee 
> ---
>  drivers/target/target_core_transport.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
> index 1dd1181..3ce85ed 100644
> --- a/drivers/target/target_core_transport.c
> +++ b/drivers/target/target_core_transport.c
> @@ -953,7 +953,7 @@ int transport_dump_vpd_ident_type(
>   strlcat(buf, "SCSI name string\n", sizeof(buf));
>   break;
>   default:
> - len = strlen(len);
> + len = strlen(buf);
>   snprintf(&buf[len], sizeof(buf) - len, "Unsupported: 0x%02x\n",
>   vpd->device_identifier_type);
>   ret = -EINVAL;
> -- 
> 1.8.1.2
> 

Please discard this patch. It is based on next-20140908, and the
issue has been corrected in next-20140909.

thanks
sudip


[PATCH] drivers: target: target_core_transport.c: build warning

2014-09-09 Thread Sudip Mukherjee
The build is giving:
warning: passing argument 1 of 'strlen' makes pointer from integer without a cast [enabled by default]

The snprintf after the strlen is trying to put the "Unsupported" string
at the end of the existing string, so len should be the string length here.
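
For illustration only (the buffer content here is made up), the intended
pattern is to append after whatever is already in the fixed-size buffer:

	char buf[64] = "VPD identifier: ";	/* text built so far */
	int len = strlen(buf);			/* length of the existing text */
	snprintf(&buf[len], sizeof(buf) - len,
		 "Unsupported: 0x%02x\n", 0xab);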

Signed-off-by: Sudip Mukherjee 
---
 drivers/target/target_core_transport.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 1dd1181..3ce85ed 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -953,7 +953,7 @@ int transport_dump_vpd_ident_type(
strlcat(buf, "SCSI name string\n", sizeof(buf));
break;
default:
-   len = strlen(len);
+   len = strlen(buf);
snprintf(&buf[len], sizeof(buf) - len, "Unsupported: 0x%02x\n",
vpd->device_identifier_type);
ret = -EINVAL;
-- 
1.8.1.2
