from:"Guilherme G. Piccoli"

Re: Should a raid-0 array immediately stop if a component disk is removed?

2018-04-27 Thread Guilherme G. Piccoli

Thanks for your quick reply Anthony!
Inline comments below:

On 27/04/2018 19:11, Wols Lists wrote:
> On 27/04/18 22:49, Guilherme G. Piccoli wrote:
> [...]
> Sounds like you're not using mdadm to remove the disk. So why do you
> expect mdadm to stop the array immediately? It doesn't know anything is
> wrong until it trips over the missing disk.

In fact, mdadm is aware something is wrong - it tries to stop the array,
running "mdadm -If ", but it fails because
the mount point prevents it to stop the array.
And the question lies exactly in this point: should it be (successfully)
stopped? I think it should, since we can continue writing on disks
causing data corruption.

> [...]
> Is your array linear or striped? If it's striped, I would expect it to
> fall over in a heap very quickly. If it's linear, it depends whether you

It's stripped. I was able to keep writing for some time (minutes).

> [...] 
> Note that raid-0 is NOT redundant. Standard advice is "if a drive fails,
> expect to lose your data". So the fact that your array limps on should
> be the pleasant surprise, not that it blows up in ways you didn't expect.

OK, I understand that. But imagine the following scenario: a regular
user gets for some reason a component disk removed, and they don't look
the logs before (or after) writes - the user can write stuff thinking
everything is fine, and that data is corrupted. I'd expect userspace
writes to fail as soon as possible in case one of raid-0 components is gone.

Thanks,

Guilherme

Should a raid-0 array immediately stop if a component disk is removed?

2018-04-27 Thread Guilherme G. Piccoli

Hello, we've noticed an interesting behavior when using a raid-0 md
array. Suppose we have a 2-disk raid-0 array that has a mount point
set - in our tests, we've used ext4 filesystem. If we remove one of
the component disks via sysfs[0], userspace is notified, but mdadm tool
fails to stop the array[1] (it cannot open the array device node with
O_EXCL flag, hence it fails to issue the STOP_ARRAY ioctl). Even if we
circumvent the mdadm O_EXCL open, md driver will fail to execute the
ioctl given the array is mounted.

As a result, the array keeps mounted and we can even read/write from
it, although it's possible to observe filesystem errors on dmesg[2].
Eventually, after some _minutes_, the filesystem gets remounted as
read-only.

During this weird window in which the array had a component disk removed
but is still mounted/active (and accepting read/writes), we tried to
perform reads and writes and sync command, which "succeed" (meaning the
commands themselves didn't fail, although the errors were observed in
dmesg). When "dd" was executed with "oflag=direct", the writes failed
immediately. This was observed with both nvme and scsi disks composing
the raid-0 array.

We've started to pursue a solution to this, which seems to be an odd
behavior. But worth to check in the CC'ed lists if perhaps this is "by
design" or if it was already discussed in the past (maybe an idea was
proposed). Tests were executed with v4.17-rc2 and upstream mdadm tool.

Thanks in advance,


Guilherme


[0] To remove the array component disk, we've used:

a) For nvme: "echo 1 > /sys/block/nvme0n1/device/device/remove"
b) For scsi: "echo 1 > /sys/block/sda/device/delete"


[1] mdadm tool tries to fail the array by executing: "mdadm -If
"


[2] Example:
[...]
[88.719] Buffer I/O error on device md0, logical block 157704
[88.722] Buffer I/O error on device md0, logical block 157705
[88.725] EXT4-fs warning (device md0): ext4_end_bio:323: I/O error 10
writing to inode 14 (offset 0 size 8388608 starting block 79744)
[88.725] EXT4-fs warning (device md0): ext4_end_bio:323: I/O error 10
writing to inode 14 (offset 0 size 8388608 starting block 8)
[...]

Re: [PATCH 00/28] aacraid: Refactor for sas transport and bug fixes

2017-12-27 Thread Guilherme G. Piccoli

On 12/26/2017 11:27 PM, Raghava Aditya Renukunta wrote:
> [...]
>> I was thinking...if possible, and just in case you plan to send a V2
>> after Bart's comments (or any other future reviews this patchset gets),
>> how about splitting in 2 patchsets, one for bug fixes and the other for
>> improvements/foundation of sas support?
>>
>> It could help to speed-up the merge of bug fixes.
> 
> Hi Guilherme,
> It did cross my mind, but I wanted to get the sas transport and bug fixes in 
> by 4.16. I thought
> that breaking the patches into 3 patch sets might not help with it.
> 

Yeah, makes sense in this case! Thanks for considering it :)
Cheers,


Guilherme

> Regards,
> Raghava Aditya
> 
>  
>> Thanks,
>>
>>
>> Guilherme
>>>
>>>  - Fixed a udev inquiry race condition
>>>  - Fixed a kdump hang issue which occurs in case of error recovery in kdump
>>>  - Made improvements to ioctl reset and reset_host sysfs reset paths
>>>  - Changed the code to retrieve lun information into stand alone functions.
>>>  - Merged container and hba hotplug event processing (device addition and
>>>and removal into single function)
>>>  - Removed scsi_scan_host for safw devices and now explicitly add devices
>>>retrieved from the fw.
>>>  - Reschedule scan in driver fails to retrieve lun information from fw.
>>>(usually works in a few attempts)
>>>  - Rescan worker waits for any pending EH recovery before rescanning
>>>  - Do not trigger rescan worker in kdump kernel
>>>
>>> Raghava Aditya Renukunta (29):
>>>   scsi: aacraid: Fix udev inquiry race condition
>>>   scsi: aacraid: Do not attempt abort when Fw panicked
>>>   scsi: aacraid: Fix hang in kdump
>>>   scsi: aacraid: Do not remove offlined devices
>>>   scsi: aacraid: Fix ioctl reset hang
>>>   scsi: aacraid: Allow reset_host sysfs var to recover Panicked Fw
>>>   scsi: aacraid: Refactor reset_host store function
>>>   scsi: aacraid: Move code to wait for IO completion to shutdown func
>>>   scsi: aacraid: Create bmic submission function from bmic identify
>>>   scsi: aacraid: Change phy luns function to use common bmic function
>>>   scsi: aacraid: Refactor and rename to make mirror existing changes
>>>   scsi: aacraid: Add target setup helper function
>>>   scsi: aacraid: Untangle targets setup from report phy luns
>>>   scsi: aacraid: Move function around to match existing code
>>>   scsi: aacraid: Create helper functions to get lun info
>>>   scsi: aacraid: Save bmic phy information for each phy
>>>   scsi: aacraid: Add helper function to set queue depth
>>>   scsi: aacraid: Merge func to get container information
>>>   scsi: aacraid: Process hba and container hot plug events in single
>>>   function
>>>   scsi: aacraid: Added macros to help loop through known buses and targets
>>>   scsi: aacraid: Refactor resolve luns code and scsi functions
>>>   scsi: aacraid: Merge adapter setup with resolve luns
>>>   scsi: aacraid: Block concurrent hotplug event handling
>>>   scsi: aacraid: Use hotplug handling function in place of scsi_scan_host
>>>   scsi: aacraid: Reschedule host scan in case of failure
>>>   scsi: aacraid: Fix hang while scanning in eh recovery
>>>   scsi: aacraid: Skip schedule rescan in case of kdump
>>>   scsi: aacraid: Remove unused rescan variable
>>>   scsi: aacraid: Remove AAC_HIDE_DISK check in queue command
>>>
>>>  drivers/scsi/aacraid/aachba.c   | 479 +++
>> -
>>>  drivers/scsi/aacraid/aacraid.h  |  52 -
>>>  drivers/scsi/aacraid/commctrl.c |   6 +-
>>>  drivers/scsi/aacraid/comminit.c |  49 +++-
>>>  drivers/scsi/aacraid/commsup.c  | 224 ++-
>>>  drivers/scsi/aacraid/linit.c|  23 +-
>>>  6 files changed, 561 insertions(+), 272 deletions(-)
>>>
>

Re: [PATCH 03/29] scsi: aacraid: Fix hang in kdump

2017-12-22 Thread Guilherme G. Piccoli

On 12/21/2017 03:33 PM, Raghava Aditya Renukunta wrote:
> Driver attempts to perform a device scan and device add after coming out
> of reset. At times when the kdump kernel loads and it tries to perform
> eh recovery, the device scan hangs since its commands are blocked because
> of the eh recovery. This should have shown up in normal eh recovery path
> (Should have been obvious)
> 
> Remove the code that performs scanning.I can live without the rescanning
> support in the stable kernels but a hanging kdump/eh recovery needs to be
> fixed.
> 
> Fixes: a2d0321dd532901e (scsi: aacraid: Reload offlined drives after 
> controller reset)
> Cc: <sta...@vger.kernel.org>
> Reported-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>

(Sorry in advance for flooding the thread heheh)
I guess it'd be more appropriate to:

Reported-by: Douglas Miller <dougm...@linux.vnet.ibm.com>

Although I've tested it, Doug isolated the race condition based on code
analysis...

Thanks,


Guilherme

> Tested-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>
> Fixes: a2d0321dd532901e (scsi: aacraid: Reload offlined drives after 
> controller reset)
> Signed-off-by: Raghava Aditya Renukunta 
> <raghavaaditya.renuku...@microsemi.com>
> ---
>  drivers/scsi/aacraid/commsup.c | 9 +
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/drivers/scsi/aacraid/commsup.c b/drivers/scsi/aacraid/commsup.c
> index 525a652..ffbfd04 100644
> --- a/drivers/scsi/aacraid/commsup.c
> +++ b/drivers/scsi/aacraid/commsup.c
> @@ -1672,14 +1672,7 @@ static int _aac_reset_adapter(struct aac_dev *aac, int 
> forced, u8 reset_type)
>  out:
>   aac->in_reset = 0;
>   scsi_unblock_requests(host);
> - /*
> -  * Issue bus rescan to catch any configuration that might have
> -  * occurred
> -  */
> - if (!retval) {
> - dev_info(>pdev->dev, "Issuing bus rescan\n");
> - scsi_scan_host(host);
> - }
> +
>   if (jafo) {
>   spin_lock_irq(host->host_lock);
>   }
>

Re: [PATCH 00/28] aacraid: Refactor for sas transport and bug fixes

2017-12-22 Thread Guilherme G. Piccoli

On 12/21/2017 03:33 PM, Raghava Aditya Renukunta wrote:
> This patchset primarily lays the foundation for adding sas transport
> support to the aacraid driver. Being said these patches are mainly code
> refactors, bug fixes and improvements.

I was thinking...if possible, and just in case you plan to send a V2
after Bart's comments (or any other future reviews this patchset gets),
how about splitting in 2 patchsets, one for bug fixes and the other for
improvements/foundation of sas support?

It could help to speed-up the merge of bug fixes.

Thanks,


Guilherme
> 
>  - Fixed a udev inquiry race condition
>  - Fixed a kdump hang issue which occurs in case of error recovery in kdump
>  - Made improvements to ioctl reset and reset_host sysfs reset paths
>  - Changed the code to retrieve lun information into stand alone functions.
>  - Merged container and hba hotplug event processing (device addition and
>and removal into single function)
>  - Removed scsi_scan_host for safw devices and now explicitly add devices
>retrieved from the fw.
>  - Reschedule scan in driver fails to retrieve lun information from fw.
>(usually works in a few attempts)
>  - Rescan worker waits for any pending EH recovery before rescanning
>  - Do not trigger rescan worker in kdump kernel
> 
> Raghava Aditya Renukunta (29):
>   scsi: aacraid: Fix udev inquiry race condition
>   scsi: aacraid: Do not attempt abort when Fw panicked
>   scsi: aacraid: Fix hang in kdump
>   scsi: aacraid: Do not remove offlined devices
>   scsi: aacraid: Fix ioctl reset hang
>   scsi: aacraid: Allow reset_host sysfs var to recover Panicked Fw
>   scsi: aacraid: Refactor reset_host store function
>   scsi: aacraid: Move code to wait for IO completion to shutdown func
>   scsi: aacraid: Create bmic submission function from bmic identify
>   scsi: aacraid: Change phy luns function to use common bmic function
>   scsi: aacraid: Refactor and rename to make mirror existing changes
>   scsi: aacraid: Add target setup helper function
>   scsi: aacraid: Untangle targets setup from report phy luns
>   scsi: aacraid: Move function around to match existing code
>   scsi: aacraid: Create helper functions to get lun info
>   scsi: aacraid: Save bmic phy information for each phy
>   scsi: aacraid: Add helper function to set queue depth
>   scsi: aacraid: Merge func to get container information
>   scsi: aacraid: Process hba and container hot plug events in single
>   function
>   scsi: aacraid: Added macros to help loop through known buses and targets
>   scsi: aacraid: Refactor resolve luns code and scsi functions
>   scsi: aacraid: Merge adapter setup with resolve luns
>   scsi: aacraid: Block concurrent hotplug event handling
>   scsi: aacraid: Use hotplug handling function in place of scsi_scan_host
>   scsi: aacraid: Reschedule host scan in case of failure
>   scsi: aacraid: Fix hang while scanning in eh recovery
>   scsi: aacraid: Skip schedule rescan in case of kdump
>   scsi: aacraid: Remove unused rescan variable
>   scsi: aacraid: Remove AAC_HIDE_DISK check in queue command
> 
>  drivers/scsi/aacraid/aachba.c   | 479 
> +++-
>  drivers/scsi/aacraid/aacraid.h  |  52 -
>  drivers/scsi/aacraid/commctrl.c |   6 +-
>  drivers/scsi/aacraid/comminit.c |  49 +++-
>  drivers/scsi/aacraid/commsup.c  | 224 ++-
>  drivers/scsi/aacraid/linit.c|  23 +-
>  6 files changed, 561 insertions(+), 272 deletions(-)
>

Re: [PATCH 03/29] scsi: aacraid: Fix hang in kdump

2017-12-21 Thread Guilherme G. Piccoli

On 12/21/2017 03:33 PM, Raghava Aditya Renukunta wrote:
> Driver attempts to perform a device scan and device add after coming out
> of reset. At times when the kdump kernel loads and it tries to perform
> eh recovery, the device scan hangs since its commands are blocked because
> of the eh recovery. This should have shown up in normal eh recovery path
> (Should have been obvious)
> 
> Remove the code that performs scanning.I can live without the rescanning
> support in the stable kernels but a hanging kdump/eh recovery needs to be
> fixed.
> 
> Fixes: a2d0321dd532901e (scsi: aacraid: Reload offlined drives after 
> controller reset)
> Cc: <sta...@vger.kernel.org>
> Reported-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>
> Tested-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>
> Fixes: a2d0321dd532901e (scsi: aacraid: Reload offlined drives after 
> controller reset)
> Signed-off-by: Raghava Aditya Renukunta 
> <raghavaaditya.renuku...@microsemi.com>

Thanks a lot Raghava =)

> ---
>  drivers/scsi/aacraid/commsup.c | 9 +
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/drivers/scsi/aacraid/commsup.c b/drivers/scsi/aacraid/commsup.c
> index 525a652..ffbfd04 100644
> --- a/drivers/scsi/aacraid/commsup.c
> +++ b/drivers/scsi/aacraid/commsup.c
> @@ -1672,14 +1672,7 @@ static int _aac_reset_adapter(struct aac_dev *aac, int 
> forced, u8 reset_type)
>  out:
>   aac->in_reset = 0;
>   scsi_unblock_requests(host);
> - /*
> -  * Issue bus rescan to catch any configuration that might have
> -  * occurred
> -  */
> - if (!retval) {
> - dev_info(>pdev->dev, "Issuing bus rescan\n");
> - scsi_scan_host(host);
> - }
> +
>   if (jafo) {
>   spin_lock_irq(host->host_lock);
>   }
>

[PATCH 0/3] Some fixes to aacraid

2017-11-17 Thread Guilherme G. Piccoli

This series presents 3 small fixes for aacraid driver.
The most important is the crash prevention, IMHO.

Tested them against v4.14.

Guilherme G. Piccoli (3):
  scsi: aacraid: Check for PCI state of device in a generic way
  scsi: aacraid: Perform initialization reset only once
  scsi: aacraid: Prevent crash in case of free interrupt during scsi EH path

 drivers/scsi/aacraid/aacraid.h |  1 +
 drivers/scsi/aacraid/commsup.c | 35 +++
 drivers/scsi/aacraid/linit.c   |  3 +++
 drivers/scsi/aacraid/rx.c  | 15 ++-
 drivers/scsi/aacraid/src.c | 20 ++--
 5 files changed, 31 insertions(+), 43 deletions(-)

-- 
2.15.0

[PATCH 2/3] scsi: aacraid: Perform initialization reset only once

2017-11-17 Thread Guilherme G. Piccoli

Currently the driver accepts two ways of requesting an initialization
reset on the adapter: by passing aac_reset_devices module parameter,
or the generic kernel parameter reset_devices.

It's working as intended...but if we end up reaching a scsi hang and
the scsi EH mechanism takes place, aacraid performs resets as part of
the scsi error recovery procedure. These EH routines might reinitialize
the device, and if we have provided some of the reset parameters in the
kernel command-line, we again perform an "initialization" reset.

So, to avoid this duplication of resets in case of scsi EH path, this
patch adds a field to aac_dev struct to keep per-adapter track of the
init reset request - once it's done, we set it to false and don't
proactively reset anymore in case of reinitializations.

Signed-off-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>
---
 drivers/scsi/aacraid/aacraid.h |  1 +
 drivers/scsi/aacraid/linit.c   |  3 +++
 drivers/scsi/aacraid/rx.c  | 15 ++-
 drivers/scsi/aacraid/src.c | 20 ++--
 4 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/aacraid/aacraid.h b/drivers/scsi/aacraid/aacraid.h
index 403a639574e5..6e3d81969a77 100644
--- a/drivers/scsi/aacraid/aacraid.h
+++ b/drivers/scsi/aacraid/aacraid.h
@@ -1673,6 +1673,7 @@ struct aac_dev
struct aac_hba_map_info hba_map[AAC_MAX_BUSES][AAC_MAX_TARGETS];
u8  adapter_shutdown;
u32 handle_pci_error;
+   boolinit_reset;
 };
 
 #define aac_adapter_interrupt(dev) \
diff --git a/drivers/scsi/aacraid/linit.c b/drivers/scsi/aacraid/linit.c
index c9252b138c1f..bdf127aaab41 100644
--- a/drivers/scsi/aacraid/linit.c
+++ b/drivers/scsi/aacraid/linit.c
@@ -1680,6 +1680,9 @@ static int aac_probe_one(struct pci_dev *pdev, const 
struct pci_device_id *id)
aac->cardtype = index;
INIT_LIST_HEAD(>entry);
 
+   if (aac_reset_devices || reset_devices)
+   aac->init_reset = true;
+
aac->fibs = kzalloc(sizeof(struct fib) * (shost->can_queue + 
AAC_NUM_MGT_FIB), GFP_KERNEL);
if (!aac->fibs)
goto out_free_host;
diff --git a/drivers/scsi/aacraid/rx.c b/drivers/scsi/aacraid/rx.c
index 93ef7c37e568..ff2af06e7dd9 100644
--- a/drivers/scsi/aacraid/rx.c
+++ b/drivers/scsi/aacraid/rx.c
@@ -561,11 +561,16 @@ int _aac_rx_init(struct aac_dev *dev)
dev->a_ops.adapter_sync_cmd = rx_sync_cmd;
dev->a_ops.adapter_enable_int = aac_rx_disable_interrupt;
dev->OIMR = status = rx_readb (dev, MUnit.OIMR);
-   if status & 0x0c) != 0x0c) || aac_reset_devices || reset_devices) &&
- !aac_rx_restart_adapter(dev, 0, IOP_HWSOFT_RESET))
-   /* Make sure the Hardware FIFO is empty */
-   while ((++restart < 512) &&
- (rx_readl(dev, MUnit.OutboundQueue) != 0xL));
+
+   if (((status & 0x0c) != 0x0c) || dev->init_reset) {
+   dev->init_reset = false;
+   if (!aac_rx_restart_adapter(dev, 0, IOP_HWSOFT_RESET)) {
+   /* Make sure the Hardware FIFO is empty */
+   while ((++restart < 512) &&
+ (rx_readl(dev, MUnit.OutboundQueue) != 
0xL));
+   }
+   }
+
/*
 *  Check to see if the board panic'd while booting.
 */
diff --git a/drivers/scsi/aacraid/src.c b/drivers/scsi/aacraid/src.c
index 0c9361c87ec8..fde6b6aa86e3 100644
--- a/drivers/scsi/aacraid/src.c
+++ b/drivers/scsi/aacraid/src.c
@@ -868,9 +868,13 @@ int aac_src_init(struct aac_dev *dev)
/* Failure to reset here is an option ... */
dev->a_ops.adapter_sync_cmd = src_sync_cmd;
dev->a_ops.adapter_enable_int = aac_src_disable_interrupt;
-   if ((aac_reset_devices || reset_devices) &&
-   !aac_src_restart_adapter(dev, 0, IOP_HWSOFT_RESET))
-   ++restart;
+
+   if (dev->init_reset) {
+   dev->init_reset = false;
+   if (!aac_src_restart_adapter(dev, 0, IOP_HWSOFT_RESET))
+   ++restart;
+   }
+
/*
 *  Check to see if the board panic'd while booting.
 */
@@ -1014,9 +1018,13 @@ int aac_srcv_init(struct aac_dev *dev)
/* Failure to reset here is an option ... */
dev->a_ops.adapter_sync_cmd = src_sync_cmd;
dev->a_ops.adapter_enable_int = aac_src_disable_interrupt;
-   if ((aac_reset_devices || reset_devices) &&
-   !aac_src_restart_adapter(dev, 0, IOP_HWSOFT_RESET))
-   ++restart;
+
+   if (dev->init_reset) {
+   dev->init_reset = false;
+   if (!aac_src_restart_adapter(dev, 0, IOP_HWSOFT_RESET))
+   ++restart;
+   }
+
/*
 *  Check to see if flash update is running.
 *  Wait for the adapter to be up and running. Wait up to 5 minutes
-- 
2.15.0

[PATCH 3/3] scsi: aacraid: Prevent crash in case of free interrupt during scsi EH path

2017-11-17 Thread Guilherme G. Piccoli

As part of the scsi EH path, aacraid performs a reinitialization of
the adapter, which encompass freeing resources and IRQs, NULLifying
lots of pointers, and then initialize it all over again.
We've identified a problem during the free IRQ portion of this path
if CONFIG_DEBUG_SHIRQ is enabled on kernel config file.

Happens that, in case this flag was set, right after free_irq()
effectively clears the interrupt, it checks if it was requested
as IRQF_SHARED. In positive case, it performs another call to the
IRQ handler on driver. Problem is: since aacraid currently free
some resources *before* freeing the IRQ, once free_irq() path
calls the handler again (due to CONFIG_DEBUG_SHIRQ), aacraid
crashes due to NULL pointer dereference with the following trace:

  aac_src_intr_message+0xf8/0x740 [aacraid]
  __free_irq+0x33c/0x4a0
  free_irq+0x78/0xb0
  aac_free_irq+0x13c/0x150 [aacraid]
  aac_reset_adapter+0x2e8/0x970 [aacraid]
  aac_eh_reset+0x3a8/0x5d0 [aacraid]
  scsi_try_host_reset+0x74/0x180
  scsi_eh_ready_devs+0xc70/0x1510
  scsi_error_handler+0x624/0xa20

This patch prevents the crash by changing the order of the
deinitialization in this path of aacraid: first we clear the IRQ,
then we free other resources. No functional change intended.

Signed-off-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>
---
 drivers/scsi/aacraid/commsup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/aacraid/commsup.c b/drivers/scsi/aacraid/commsup.c
index 2abe8fd83494..bec9f3193f60 100644
--- a/drivers/scsi/aacraid/commsup.c
+++ b/drivers/scsi/aacraid/commsup.c
@@ -1554,6 +1554,7 @@ static int _aac_reset_adapter(struct aac_dev *aac, int 
forced, u8 reset_type)
 * will ensure that i/o is queisced and the card is flushed in that
 * case.
 */
+   aac_free_irq(aac);
aac_fib_map_free(aac);
dma_free_coherent(>pdev->dev, aac->comm_size, aac->comm_addr,
  aac->comm_phys);
@@ -1561,7 +1562,6 @@ static int _aac_reset_adapter(struct aac_dev *aac, int 
forced, u8 reset_type)
aac->comm_phys = 0;
kfree(aac->queues);
aac->queues = NULL;
-   aac_free_irq(aac);
kfree(aac->fsa_dev);
aac->fsa_dev = NULL;
 
-- 
2.15.0

[PATCH 1/3] scsi: aacraid: Check for PCI state of device in a generic way

2017-11-17 Thread Guilherme G. Piccoli

Commit 16ae9dd35d37 ("scsi: aacraid: Fix for excessive prints on EEH")
introduced checks about the state of device before any PCI operations
in the driver. Basically, this prevents it to perform PCI accesses
when device is in the process of recover from a PCI error. In PowerPC,
such mechanism is called EEH, and the aforementioned commit introduced
checks that are based on EEH-specific primitives for that.

The potential problems with this approach are three: first, these checks
are "locked" to powerpc only - another archs could have error recovery
methods too, like AER in Intel. Also, the powerpc primitives perform
expensive FW accesses to validate the precise PCI state of a device.
Finally, code becomes more complicated and needs ifdef validation
based on arch config being set.

So, this patch makes use of generic PCI state checks, which are
lightweight and non-dependent of arch configs - also, it makes
the code cleaner.

Signed-off-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>
---
 drivers/scsi/aacraid/commsup.c | 33 ++---
 1 file changed, 2 insertions(+), 31 deletions(-)

diff --git a/drivers/scsi/aacraid/commsup.c b/drivers/scsi/aacraid/commsup.c
index 525a652dab48..2abe8fd83494 100644
--- a/drivers/scsi/aacraid/commsup.c
+++ b/drivers/scsi/aacraid/commsup.c
@@ -467,35 +467,6 @@ int aac_queue_get(struct aac_dev * dev, u32 * index, u32 
qid, struct hw_fib * hw
return 0;
 }
 
-#ifdef CONFIG_EEH
-static inline int aac_check_eeh_failure(struct aac_dev *dev)
-{
-   /* Check for an EEH failure for the given
-* device node. Function eeh_dev_check_failure()
-* returns 0 if there has not been an EEH error
-* otherwise returns a non-zero value.
-*
-* Need to be called before any PCI operation,
-* i.e.,before aac_adapter_check_health()
-*/
-   struct eeh_dev *edev = pci_dev_to_eeh_dev(dev->pdev);
-
-   if (eeh_dev_check_failure(edev)) {
-   /* The EEH mechanisms will handle this
-* error and reset the device if
-* necessary.
-*/
-   return 1;
-   }
-   return 0;
-}
-#else
-static inline int aac_check_eeh_failure(struct aac_dev *dev)
-{
-   return 0;
-}
-#endif
-
 /*
  * Define the highest level of host to adapter communication routines.
  * These routines will support host to adapter FS commuication. These
@@ -701,7 +672,7 @@ int aac_fib_send(u16 command, struct fib *fibptr, unsigned 
long size,
return -ETIMEDOUT;
}
 
-   if (aac_check_eeh_failure(dev))
+   if (unlikely(pci_channel_offline(dev->pdev)))
return -EFAULT;
 
if ((blink = aac_adapter_check_health(dev)) > 
0) {
@@ -801,7 +772,7 @@ int aac_hba_send(u8 command, struct fib *fibptr, 
fib_callback callback,
 
spin_unlock_irqrestore(>event_lock, flags);
 
-   if (aac_check_eeh_failure(dev))
+   if (unlikely(pci_channel_offline(dev->pdev)))
return -EFAULT;
 
fibptr->flags |= FIB_CONTEXT_FLAG_WAIT;
-- 
2.15.0

Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-25 Thread Guilherme G. Piccoli

On 09/21/2017 01:19 PM, Dave Carroll wrote:
>> [...]
>> ---
> Acked-by: Dave Carroll 
> 

Thanks Dave!

James/Martin, am I expected to send a v2 with some change? Perhaps with
Dave's ack?
Sorry to annoy, thanks in advance for any advice!

Cheers,


Guilherme

Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-19 Thread Guilherme G. Piccoli

On 09/19/2017 02:05 PM, James Bottomley wrote:
> Actually, the whole problem sounds like a posted write.  Likely the
> write that causes the reset doesn't get flushed until the read checking
> if the reset has succeeded, which might explain the 100% initial
> failure.  Why not throw away that first value if it's a failure and
> then do your polled wait and timeout on the reset success.  We should
> anyway be waiting some time for a reset to be issued, so even on non-
> posted write systems we could see this problem intermittently.
> 
> James
> 

Thanks for this suggestion James.

I tried to remove the sleep and did a dummy read to register using
readl() - issue reproduced.

I did expect that, since in aac_is_ctrl_up_and_running() we indeed read
a register and if it shows us reset is not complete, we wait and read it
again. So, we can think in this 1st read as a dummy one heheh

My theory here is that we're observing a failure similar to one we
already did in some specific NVMe adapters - the readl before some delay
(in nvme it was 2s) corrupts the adapter FW procedure. It's as if the
adapter doesn't like to deal with this read while the reset procedure is
ongoing. So, we wait a bit to issue a readl and everything goes fine.

Cheers,

Guilherme

Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-19 Thread Guilherme G. Piccoli

On 09/19/2017 12:52 PM, Christoph Hellwig wrote:
> On Tue, Sep 19, 2017 at 12:49:21PM -0300, Guilherme G. Piccoli wrote:
>> On 09/19/2017 12:37 PM, Christoph Hellwig wrote:
>>> On Tue, Sep 19, 2017 at 12:11:55PM -0300, Guilherme G. Piccoli wrote:
>>>>src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK);
>>>> +
>>>> +  msleep(5000);
>>>
>>> src_writel is a writel, and thus a posted MMIO write.  You'll need
>>> to have to a read first to make it a reliable timing base.
>>>
>>
>> Just for my full understanding - you're saying a readl BEFORE
>> src_writel() or AFTER src_writel() ?
> 
> AFTER.

Thanks!

> 
>> I could add a read to some dummy register, but notice it was a sequence
>> of readl's on aac_is_ctrl_up_and_running() that caused the failure of
>> reset...
> 
> Oh, ouch.  I guess in that case we'll need to do the writel and pray,
> but that would need a big comment explaining what's going on there.
> 

Heheh you're right!

But do you remember that quirk added on nvme? Basically, it was a
similar scenario in which some adapters weren't happy in poll a register
in nvme_wait_ready() right after we wrote in the Controller Config
register when disabling a controller.

Seems the same thing here, the controller is not being able to handle a
read right after some internal procedure (reset) were initiated.

If you have suggestion to improve the commit msg, let me know :)
Thanks!

Re: [PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-19 Thread Guilherme G. Piccoli

On 09/19/2017 12:37 PM, Christoph Hellwig wrote:
> On Tue, Sep 19, 2017 at 12:11:55PM -0300, Guilherme G. Piccoli wrote:
>>  src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK);
>> +
>> +msleep(5000);
> 
> src_writel is a writel, and thus a posted MMIO write.  You'll need
> to have to a read first to make it a reliable timing base.
> 

Just for my full understanding - you're saying a readl BEFORE
src_writel() or AFTER src_writel() ?

I could add a read to some dummy register, but notice it was a sequence
of readl's on aac_is_ctrl_up_and_running() that caused the failure of
reset...

Thanks

[PATCH] scsi: aacraid: Add a small delay after IOP reset

2017-09-19 Thread Guilherme G. Piccoli

Commit 0e9973ed3382 ("scsi: aacraid: Add periodic checks to see IOP reset
status") changed the way driver checks if a reset succeeded. Now, after an
IOP reset, aacraid immediately start polling a register to verify the reset
is complete.

This behavior cause regressions on the reset path in PowerPC (at least).
Since the delay after the IOP reset was removed by the aforementioned patch,
the fact driver just starts to read a register instantly after the reset
was issued (by writing in another register) "corrupts" the reset procedure,
which ends up failing all the time.

The issue highly impacted kdump on PowerPC, since on kdump path we
proactively issue a reset in adapter (through the reset_devices kernel
parameter).

This patch (re-)adds a delay right after IOP reset is issued. Empirically
we measured that 3 seconds is enough, but for safety reasons we delay
for 5s (and since it was 30s before, 5s is still a small amount).

For reference, without this patch we observe the following messages
on kdump kernel boot process:

  [ 76.294] aacraid 0003:01:00.0: IOP reset failed
  [ 76.294] aacraid 0003:01:00.0: ARC Reset attempt failed
  [ 86.524] aacraid 0003:01:00.0: adapter kernel panic'd ff.
  [ 86.524] aacraid 0003:01:00.0: Controller reset type is 3
  [ 86.524] aacraid 0003:01:00.0: Issuing IOP reset
  [146.534] aacraid 0003:01:00.0: IOP reset failed
  [146.534] aacraid 0003:01:00.0: ARC Reset attempt failed

Fixes: 0e9973ed3382 ("scsi: aacraid: Add periodic checks to see IOP reset 
status")
Cc: sta...@vger.kernel.org # v4.13+
Signed-off-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>
---
 drivers/scsi/aacraid/src.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/aacraid/src.c b/drivers/scsi/aacraid/src.c
index 48c2b2b34b72..0c9361c87ec8 100644
--- a/drivers/scsi/aacraid/src.c
+++ b/drivers/scsi/aacraid/src.c
@@ -740,6 +740,8 @@ static void aac_send_iop_reset(struct aac_dev *dev)
aac_set_intx_mode(dev);
 
src_writel(dev, MUnit.IDR, IOP_SRC_RESET_MASK);
+
+   msleep(5000);
 }
 
 static void aac_send_hardware_soft_reset(struct aac_dev *dev)
-- 
2.14.1

Re: smp-induced oops/NULL pointer dereference in mpt3sas, from kernel >= 4.11

2017-09-04 Thread Guilherme G. Piccoli

On 09/04/2017 05:44 PM, Guilherme G. Piccoli wrote:
> [Hopefully this messages gets threaded properly...]

And of course, I forgot to CC linux-scsi heheh
Sorry!

> 
> Chaitra, the following 4 patches fix this issue (the last one is really
> the fix, the others are part of the patchset):
> 
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git/commit/?h=4.14/scsi-queue=c1225f01a
> ("scsi: bsg-lib: pass the release callback through bsg_setup_queue")
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git/commit/?h=4.14/scsi-queue=9a664f492
> ("scsi: hpsa: remove the smp_handler stub")
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git/commit/?h=4.14/scsi-queue=eaa79a6cd
> ("scsi: smartpqi: remove the smp_handler stub")
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git/commit/?h=4.14/scsi-queue=651a01364
> ("scsi: scsi_transport_sas: switch to bsg-lib for SMP passthrough")
> 
> Issue happens because mpt3asas is using a non-allocated sense buffer on
> scsi_request; it started after the scsi_req refactoring, that made the
> sense buffer a pointer (scsi_transport_sas was an user that didn't
> allocate it, neither used - before the patches above - the bsg alloc
> API, hence the bug).
> 
> Cheers,
> 
> 
> Guilherme
>

Re: nvmf question - synchronization between target/initiator regarding partitions

2017-08-10 Thread Guilherme G. Piccoli

On 08/10/2017 06:16 AM, Christoph Hellwig wrote:
> On Mon, Aug 07, 2017 at 02:29:47PM -0300, Guilherme G. Piccoli wrote:
>> Thanks for your feedback Hannes, agreed!
> 
> And btw, you'll see similar results with the SCSI target or nbd,
> so it's not really nvme specific.

Thanks, I agree - noticed the same stuff. I've used nvmf as a "trigger"
to the subject, in order to discuss a possible generic solution. But
since everything is working as expected, no need to pursue a "fix" heheh

Re: nvmf question - synchronization between target/initiator regarding partitions

2017-08-07 Thread Guilherme G. Piccoli

Thanks for your feedback Hannes, agreed!

Cheers,


Guilherme

nvmf question - synchronization between target/initiator regarding partitions

2017-08-07 Thread Guilherme G. Piccoli

We observed that it's possible to perform partition operations in both
nvmf target and initiator block devices, like creating and deleting
partitions.

But there is no sync mechanism between target and initiator regarding
the partitions operations. After creating a partition on initiator, for
example, we don't see it on target side. We could format it like ext4 on
initiator, and still target cannot see it. It's possible to trigger a
BLKRRPART ioctl on target, which end up calling revalidate_disk() on
nvme driver and then partitions are perceived.

So, question: is this behavior expected/acceptable? Is it completely up
to userspace to deal with the synchronization between host/target?
I think answer might be yes since partitions are a higher level of
abstraction than nvmf (which is purely block device aware).

But if kernel could/should jump in, we possibly could try to issue a
revalidate_disk on target whenever this operation is performed on
initiator side (and vice-versa). I confess I couldn't find such sync
idea in NVMe over fabrics spec, though.
Also, reading NVMe spec 1.3, we do have the optional feature
"reservations", but seems it doesn't mention target (only multiple hosts).

Thanks in advance for any ideas on this subject.
Cheers,



Guilherme

[PATCH] scsi: lpfc: Fix crash on PCI hotplug remove path

2017-05-28 Thread Guilherme G. Piccoli

During a PCI hotplug remove event we could have a NULL pointer
dereference on lpfc_sli_abort_iocb(), if pring is NULL. This
patch adds a check for this case and is able to circumvent the
failure and continue the hotplug remove process with success.

This issue was introduced after the driver refactor made on
commit 895427bd012c ("scsi: lpfc: NVME Initiator: Base modifications").

Fixes: 895427bd012c ("scsi: lpfc: NVME Initiator: Base modifications")
Reported-by: Naresh Bannoth <nbann...@in.ibm.com>
Signed-off-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>
---
This patch was rebased against Martin's 4.12/scsi-fixes.

 drivers/scsi/lpfc/lpfc_sli.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/lpfc/lpfc_sli.c b/drivers/scsi/lpfc/lpfc_sli.c
index d6b184839bc2..134c60a66fb8 100644
--- a/drivers/scsi/lpfc/lpfc_sli.c
+++ b/drivers/scsi/lpfc/lpfc_sli.c
@@ -11003,9 +11003,13 @@ lpfc_sli_abort_iocb(struct lpfc_vport *vport, struct 
lpfc_sli_ring *pring,
 
/* Setup callback routine and issue the command. */
abtsiocb->iocb_cmpl = lpfc_sli_abort_fcp_cmpl;
-   ret_val = lpfc_sli_issue_iocb(phba, pring->ringno,
- abtsiocb, 0);
-   if (ret_val == IOCB_ERROR) {
+
+   /* In PCI hotplug remove path, pring might be NULL */
+   if (pring)
+   ret_val = lpfc_sli_issue_iocb(phba, pring->ringno,
+ abtsiocb, 0);
+
+   if (!pring || ret_val == IOCB_ERROR) {
lpfc_sli_release_iocbq(phba, abtsiocb);
errcnt++;
continue;
-- 
2.12.0.rc0

[PATCH] scsi: lpfc: Avoid NULL pointer dereference in lpfc_els_abort()

2017-05-24 Thread Guilherme G. Piccoli

We might have a NULL pring in lpfc_els_abort(), for example on
error recovery path, since queues are destroyed during error
recovery mechanism.

In this case, we should just drop the abort since the queues will
be recreated anyway. This patch just verifies for NULL pointer
and stop the abortion of the queue in case of a NULL pring.

Also, this patch converts return type of lpfc_els_abort() from int
to void, since it's not checked anywhere.

Reported-by: Harsha Thyagaraja <hathy...@in.ibm.com>
Reported-by: Naresh Bannoth <nbann...@in.ibm.com>
Tested-by: Raphael Silva <rapha...@linux.vnet.ibm.com>
Signed-off-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>
---
* This patch was rebased against Martin's 4.12/scsi-fixes.

 drivers/scsi/lpfc/lpfc_crtn.h  | 2 +-
 drivers/scsi/lpfc/lpfc_nportdisc.c | 7 +--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/lpfc/lpfc_crtn.h b/drivers/scsi/lpfc/lpfc_crtn.h
index 8912767e7bc8..da669dce12fe 100644
--- a/drivers/scsi/lpfc/lpfc_crtn.h
+++ b/drivers/scsi/lpfc/lpfc_crtn.h
@@ -127,7 +127,7 @@ int lpfc_disc_state_machine(struct lpfc_vport *, struct 
lpfc_nodelist *, void *,
 void lpfc_do_scr_ns_plogi(struct lpfc_hba *, struct lpfc_vport *);
 int lpfc_check_sparm(struct lpfc_vport *, struct lpfc_nodelist *,
 struct serv_parm *, uint32_t, int);
-int lpfc_els_abort(struct lpfc_hba *, struct lpfc_nodelist *);
+void lpfc_els_abort(struct lpfc_hba *, struct lpfc_nodelist *);
 void lpfc_more_plogi(struct lpfc_vport *);
 void lpfc_more_adisc(struct lpfc_vport *);
 void lpfc_end_rscn(struct lpfc_vport *);
diff --git a/drivers/scsi/lpfc/lpfc_nportdisc.c 
b/drivers/scsi/lpfc/lpfc_nportdisc.c
index bff3de053df4..f74cb0142fd4 100644
--- a/drivers/scsi/lpfc/lpfc_nportdisc.c
+++ b/drivers/scsi/lpfc/lpfc_nportdisc.c
@@ -206,7 +206,7 @@ lpfc_check_elscmpl_iocb(struct lpfc_hba *phba, struct 
lpfc_iocbq *cmdiocb,
  * associated with a LPFC_NODELIST entry. This
  * routine effectively results in a "software abort".
  */
-int
+void
 lpfc_els_abort(struct lpfc_hba *phba, struct lpfc_nodelist *ndlp)
 {
LIST_HEAD(abort_list);
@@ -215,6 +215,10 @@ lpfc_els_abort(struct lpfc_hba *phba, struct lpfc_nodelist 
*ndlp)
 
pring = lpfc_phba_elsring(phba);
 
+   /* In case of error recovery path, we might have a NULL pring here */
+   if (!pring)
+   return;
+
/* Abort outstanding I/O on NPort  */
lpfc_printf_vlog(ndlp->vport, KERN_INFO, LOG_DISCOVERY,
 "2819 Abort outstanding I/O on NPort x%x "
@@ -273,7 +277,6 @@ lpfc_els_abort(struct lpfc_hba *phba, struct lpfc_nodelist 
*ndlp)
  IOSTAT_LOCAL_REJECT, IOERR_SLI_ABORTED);
 
lpfc_cancel_retry_delay_tmo(phba->pport, ndlp);
-   return 0;
 }
 
 static int
-- 
2.12.0.rc0

Re: [PATCH] lpfc: Fix NULL pointer dereference during PCI error recovery

2017-05-18 Thread Guilherme G. Piccoli

On 05/17/2017 09:21 PM, Martin K. Petersen wrote:
> 
> Guilherme,
> 
>> Recent commit on patchset "lpfc updates for 11.2.0.14" fixed an issue
>> about dereferencing a NULL pointer on port reset. The specific commit,
>> named "lpfc: Fix system crash when port is reset.", is missing a check
>> against NULL pointer on lpfc_els_flush_cmd() though.
>>
>> Since we destroy the queues on adapter resets, like in PCI error
>> recovery path, we need the validation present on this patch in order
>> to avoid a NULL pointer dereference when trying to flush commands of
>> ELS wq, after it has been destroyed (which would lead to a kernel
>> oops).
> 
> Applied to 4.12/scsi-fixes. Thank you!
> 

Thanks Martin!

Re: [PATCH v2 02/15] lpfc: Fix system crash when port is reset.

2017-05-17 Thread Guilherme G. Piccoli

On 05/17/2017 03:49 PM, James Smart wrote:
> Your patch is still needed.  Can you repost your patch with a new 
> subject and add my signature:
> 
> Signed-off-by: James Smart 
> 
> Sorry, I didn't see it to roll it into the set.

No problem James, already sent it to linux-scsi:
https://marc.info/?l=linux-scsi=149505857715991

Cheers,


Guilherme

> 
> -- james
> 
> 
> 
> ___
> Linux-nvme mailing list
> linux-n...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
>

[PATCH] lpfc: Fix NULL pointer dereference during PCI error recovery

2017-05-17 Thread Guilherme G. Piccoli

Recent commit on patchset "lpfc updates for 11.2.0.14" fixed an issue
about dereferencing a NULL pointer on port reset. The specific commit,
named "lpfc: Fix system crash when port is reset.", is missing a check
against NULL pointer on lpfc_els_flush_cmd() though.

Since we destroy the queues on adapter resets, like in PCI error
recovery path, we need the validation present on this patch in order
to avoid a NULL pointer dereference when trying to flush commands of
ELS wq, after it has been destroyed (which would lead to a kernel oops).

Tested-by: Raphael Silva <rapha...@linux.vnet.ibm.com>
Signed-off-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>
Signed-off-by: James Smart <james.sm...@broadcom.com>
---

This patch was rebased against 4.12/scsi-fixes on mkp tree.
I couldn't figure a better way to refer to commits in this
message because they weren't merged on linus tree yet, so
the sha hashes wouldn't make sense. If you have some idea,
please let me know and I can send v2 if desired.

 drivers/scsi/lpfc/lpfc_els.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/scsi/lpfc/lpfc_els.c b/drivers/scsi/lpfc/lpfc_els.c
index 1d36f82fa369..8e532b39ae93 100644
--- a/drivers/scsi/lpfc/lpfc_els.c
+++ b/drivers/scsi/lpfc/lpfc_els.c
@@ -7451,6 +7451,13 @@ lpfc_els_flush_cmd(struct lpfc_vport *vport)
 */
spin_lock_irq(>hbalock);
pring = lpfc_phba_elsring(phba);
+
+   /* Bail out if we've no ELS wq, like in PCI error recovery case. */
+   if (unlikely(!pring)) {
+   spin_unlock_irq(>hbalock);
+   return;
+   }
+
if (phba->sli_rev == LPFC_SLI_REV4)
spin_lock(>ring_lock);
 
-- 
2.12.0.rc0

Re: [PATCH 02/15] lpfc: Fix system crash when port is reset.

2017-05-15 Thread Guilherme G. Piccoli

Hi James and Dick, thanks for this patch. We were investigating a pretty
similar issue, and we raised a patch that looks like yours.
Since you sent this one, we reviewed and seems it's missing a hunk to
prevent the issue we are dealing here, in PCI error recovery.

Please see the attached patch. Basically, lpfc is dereferencing a NULL
pointer on PCI error recovery path in lpfc_els_flush_cmd(). Feel free to
take this hunk on your patch, or if you prefer we can send a complete patch.

Thanks in advance,


Guilherme

---
 drivers/scsi/lpfc/lpfc_els.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/scsi/lpfc/lpfc_els.c b/drivers/scsi/lpfc/lpfc_els.c
index 67827e397431..4e354194cb54 100644
--- a/drivers/scsi/lpfc/lpfc_els.c
+++ b/drivers/scsi/lpfc/lpfc_els.c
@@ -7441,6 +7441,13 @@ lpfc_els_flush_cmd(struct lpfc_vport *vport)
 */
spin_lock_irq(>hbalock);
pring = lpfc_phba_elsring(phba);
+
+   /* Bail out if we've no ELS wq, like in PCI error recovery case. */
+   if (unlikely(!pring)) {
+   spin_unlock_irq(>hbalock);
+   return;
+   }
+
if (phba->sli_rev == LPFC_SLI_REV4)
spin_lock(>ring_lock);
 
-- 
2.12.0.rc0

[PATCH] scsi: aacraid: fix PCI error recovery path

2017-04-06 Thread Guilherme G. Piccoli

During a PCI error recovery, if aac_check_health() is not aware that
a PCI error happened and we have an offline PCI channel, it might
trigger some errors (like NULL pointer dereference) and inhibit the
error recovery process to complete.

This patch makes the health check procedure aware of PCI channel
issues, and in case of error recovery process, the function
aac_adapter_check_health() returns -1 and let the recovery process
to complete successfully. This patch was tested on upstream kernel
v4.11-rc5 in PowerPC ppc64le architecture with adapter 9005:028d
(VID:DID) - the error recovery procedure was able to recover fine.

Fixes: 5c63f7f710bd ("aacraid: Added EEH support")
Cc: sta...@vger.kernel.org # v4.6+
Signed-off-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>
---
 drivers/scsi/aacraid/aacraid.h | 11 ---
 drivers/scsi/aacraid/commsup.c |  3 ++-
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/aacraid/aacraid.h b/drivers/scsi/aacraid/aacraid.h
index d036a806f31c..d281492009fb 100644
--- a/drivers/scsi/aacraid/aacraid.h
+++ b/drivers/scsi/aacraid/aacraid.h
@@ -1690,9 +1690,6 @@ struct aac_dev
 #define aac_adapter_sync_cmd(dev, command, p1, p2, p3, p4, p5, p6, status, r1, 
r2, r3, r4) \
(dev)->a_ops.adapter_sync_cmd(dev, command, p1, p2, p3, p4, p5, p6, 
status, r1, r2, r3, r4)
 
-#define aac_adapter_check_health(dev) \
-   (dev)->a_ops.adapter_check_health(dev)
-
 #define aac_adapter_restart(dev, bled, reset_type) \
((dev)->a_ops.adapter_restart(dev, bled, reset_type))
 
@@ -2615,6 +2612,14 @@ static inline unsigned int cap_to_cyls(sector_t 
capacity, unsigned divisor)
return capacity;
 }
 
+static inline int aac_adapter_check_health(struct aac_dev *dev)
+{
+   if (unlikely(pci_channel_offline(dev->pdev)))
+   return -1;
+
+   return (dev)->a_ops.adapter_check_health(dev);
+}
+
 /* SCp.phase values */
 #define AAC_OWNER_MIDLEVEL 0x101
 #define AAC_OWNER_LOWLEVEL 0x102
diff --git a/drivers/scsi/aacraid/commsup.c b/drivers/scsi/aacraid/commsup.c
index c8172f16cf33..1f4918355fdb 100644
--- a/drivers/scsi/aacraid/commsup.c
+++ b/drivers/scsi/aacraid/commsup.c
@@ -1873,7 +1873,8 @@ int aac_check_health(struct aac_dev * aac)
spin_unlock_irqrestore(>fib_lock, flagv);
 
if (BlinkLED < 0) {
-   printk(KERN_ERR "%s: Host adapter dead %d\n", aac->name, 
BlinkLED);
+   printk(KERN_ERR "%s: Host adapter is dead (or got a PCI error) 
%d\n",
+   aac->name, BlinkLED);
goto out;
}
 
-- 
2.11.0

Re: [PATCH v6] sd: Check for unaligned partial completion

2017-03-01 Thread Guilherme G. Piccoli

Looks good, thanks for the patch.


Reviewed-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>


Cheers,

Guilherme

Re: [PATCH v2] libiscsi: add lock around task lists to fix list corruption regression

2017-02-28 Thread Guilherme G. Piccoli

Thanks very much Chris, really good patch. Hopefully it can reach 4.11!

Cheers,


Guilherme

Re: [PATCH] libiscsi: add lock around task lists to fix list corruption regression

2017-02-24 Thread Guilherme G. Piccoli

On 02/23/2017 07:25 PM, Chris Leech wrote:
> Yikes, my git-send-email settings suppressed the important CCs.  Sorry!
> 
> Guilherme and Ilkka, can you comment about your testing results or review 
> please?

Hi Chris, thanks for looping me. Patch seems nice, I do have some points
below, most nitpicks heheh
Feel free to accept or not the suggestions!

Also, you can add my:

Reviewed-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>


> 
> - Chris Leech
> 
> - Original Message -
>> There's a rather long standing regression from commit
>> 659743b [SCSI] libiscsi: Reduce locking contention in fast path
>>

Perhaps worth to follow the checkpatch "rule" of citing commits here?
659743b02c41 ("[SCSI] libiscsi: Reduce locking contention in fast path")


>> Depending on iSCSI target behavior, it's possible to hit the case in
>> iscsi_complete_task where the task is still on a pending list
>> (!list_empty(>running)).  When that happens the task is removed
>> from the list while holding the session back_lock, but other task list
>> modification occur under the frwd_lock.  That leads to linked list
>> corruption and eventually a panicked system.
>>
>> Rather than back out the session lock split entirely, in order to try
>> and keep some of the performance gains this patch adds another lock to
>> maintain the task lists integrity.
>>
>> Major enterprise supported kernels have been backing out the lock split
>> for while now, thanks to the efforts at IBM where a lab setup has the
>> most reliable reproducer I've seen on this issue.  This patch has been
>> tested there successfully.
>>
>> Signed-off-by: Chris Leech <cle...@redhat.com>

Guess we're missing here:

(a) Fixes: 659743b02c41 ("[SCSI] libiscsi: Reduce locking contention in
fast path")

(b) CC:  #v3.15+

Also, would be nice to point the original reporter, if possible:

Reported-by: Prashantha Subbarao <psubb...@us.ibm.com>


>> ---
>>  drivers/scsi/libiscsi.c | 26 +-
>>  include/scsi/libiscsi.h |  1 +
>>  2 files changed, 26 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
>> index 834d121..acb5ef3 100644
>> --- a/drivers/scsi/libiscsi.c
>> +++ b/drivers/scsi/libiscsi.c
>> @@ -560,8 +560,12 @@ static void iscsi_complete_task(struct iscsi_task *task,
>> int state)
>>  WARN_ON_ONCE(task->state == ISCSI_TASK_FREE);
>>  task->state = state;
>>  
>> -if (!list_empty(>running))
>> +spin_lock_bh(>taskqueuelock);
>> +if (!list_empty(>running)) {
>> +WARN_ONCE(1, "iscsi_complete_task while task on list");

I'm retesting this patch right now, and again I saw this giant warning.
Perhaps worth to make it pr_debug()? So, it can be enabled as user
desire and don't alarm all users too much.

Thanks,


Guilherme


>>  list_del_init(>running);
>> +}
>> +spin_unlock_bh(>taskqueuelock);
>>  
>>  if (conn->task == task)
>>  conn->task = NULL;
>> @@ -783,7 +787,9 @@ __iscsi_conn_send_pdu(struct iscsi_conn *conn, struct
>> iscsi_hdr *hdr,
>>  if (session->tt->xmit_task(task))
>>  goto free_task;
>>  } else {
>> +spin_lock_bh(>taskqueuelock);
>>  list_add_tail(>running, >mgmtqueue);
>> +spin_unlock_bh(>taskqueuelock);
>>  iscsi_conn_queue_work(conn);
>>  }
>>  
>> @@ -1474,8 +1480,10 @@ void iscsi_requeue_task(struct iscsi_task *task)
>>   * this may be on the requeue list already if the xmit_task callout
>>   * is handling the r2ts while we are adding new ones
>>   */
>> +spin_lock_bh(>taskqueuelock);
>>  if (list_empty(>running))
>>  list_add_tail(>running, >requeue);
>> +spin_unlock_bh(>taskqueuelock);
>>  iscsi_conn_queue_work(conn);
>>  }
>>  EXPORT_SYMBOL_GPL(iscsi_requeue_task);
>> @@ -1512,22 +1520,26 @@ static int iscsi_data_xmit(struct iscsi_conn *conn)
>>   * only have one nop-out as a ping from us and targets should not
>>   * overflow us with nop-ins
>>   */
>> +spin_lock_bh(>taskqueuelock);
>>  check_mgmt:
>>  while (!list_empty(>mgmtqueue)) {
>>  conn->task = list_entry(conn->mgmtqueue.next,
>>   struct iscsi_task, running);
>>  list_del_init(>task->running);
>> +

Re: [PATCH v4] sd: Check for unaligned partial completion

2017-02-16 Thread Guilherme G. Piccoli

On 15/02/2017 05:06, Ram Pai wrote:
> On Wed, Feb 15, 2017 at 03:48:52PM +0900, Damien Le Moal wrote:
>> Christoph,
>>
>> On 2/15/17 15:34, Christoph Hellwig wrote:
>>> this looks reasonable, but we should ask Guilherme and Ram to confirm
>>> it fixes their originally reported issue.  I've added them to Cc.
>>
>> Thank you.
>>
>> Guilherme, Ram,
>>
>> Please test ! The original fix breaks the zoned block device support
>> that was newly added to 4.10. So we need a fix for that and your issue
>> combined before Linus releases 4.10 stable this weekend.
> 
> Yes logically it looks correct and should fix our issue. 
> 
> I doubt Guilherme has access to the same adapter firmware that exposed
> the original problem. We will probably have to induce the erroneous
> behavior in the driver to reproduce the issue and verify the fix.

Ram, I do have access but the environment changed a lot, including
disks, setup of the distributed filesystem and tests...and so I wasn't
able to reproduce the issue anymore.

Since the logic of the new approach looks correct, I'm ok with the patch.

Thanks,


Guilherme

> 
> Will let you know,
> RP
>

Re: [PATCH 1/1] iscsi: fix regression caused by session lock patch

2017-02-06 Thread Guilherme G. Piccoli

On 06/02/2017 15:27, Chris Leech wrote:
> - Original Message -
>> On 09/11/2016 03:21, Chris Leech wrote:
>>> On Mon, Nov 07, 2016 at 04:23:10PM -0200, Guilherme G. Piccoli wrote:
>>>>
>>>> Sure! Count on us to test any patches. I guess the first step is to
>>>> reproduce on upstream right? We haven't tested specifically this
>>>> scenario for long time. Will try to reproduce on 4.9-rc4 and update here.
>>>
>>> Great, I'm looking forward to hearing the result.
>>>
>>> Assuming it reproduces, I don't think this level of fine grained locking
>>> is necessarily the best solution, but it may help confirm the problem.
>>> Especially if the WARN_ONCE I slipped in here triggers.
>>
>> Chris, sorry for my huge delay.
>> Finally I was able to perform tests and I have good news - seems your
>> patch below fixed the issue.
> 
> Thanks for the testing, looks like you have the magic target to reproduce 
> this.
> 
> I think this verifies what Mike's idea of what was going wrong, and we're way 
> overdue to get this fixed upstream.  Thanks to IBM for pushing this, I don't 
> think any major distro is shipping this patch and we don't want to keep 
> having to back it out.
> 
> The options look like
> 1) back out the session lock changes that split it into two locks
> 2) add in the additional locking from this test patch
> 3) some other fix for the issue of targets that complete tasks oddly
> 
> I'm leaning to #1, as I don't want to keep adding more locks for this.

Thanks Chris! IIRC, the lock changes from Shlomo/Or are not on RHEL,
SLES and Ubuntu anymore, as you mentioned. We requested them to revert
the patch, and it was accepted.

On the other hand, your patch is great and a cool fix to this. If we
have any good numbers and/or reasons to keep their patch, guess the
alternative #2 is cool too. I can perform more testing if you plan to
send this (or similar) patch to iscsi list.


> Sagi, Or, Shlomo?  You pushed to keep this from being backed out before.  
> Here's your cause, any better ideas on fixing it?  I also tried to go back in 
> the mailing list archives, but I don't see any real numbers for the 
> performance gains.

I'll loop Sagi here based on the email I see he's using on NVMe list
currently - seems it's different from the one showed in the header of
this message.


Thanks,



Guilherme

> 
> - Chris
> 
>> Firstly, I was able to reproduce with kernel 4.10-rc6. See the file
>> repro.out - it's a dump from xmon, the kernel debugger from PowerPC.
>> With this tool we can dump the exception details, registers, PACA
>> (Processor Address Communication Area, ppc specific structure) and
>> dmesg. It took me less than 15 minutes to reproduce.
>>
>> Then, I applied your patch on the top of this kernel and the benchmark
>> was able to successfully complete, in about 3 hours. We can see the
>> WARN() you added was reached, the attached file dmesg-cleech_patch shows
>> the kernel log with your patch.
>>
>> The workload was FIO benchmark doing both reads and writes to the remote
>> storage via iSCSI, connected over ethernet direct cabling (10Gb speed).
>> Distro was Ubuntu 16.04.1 .
>>
>> Any more tests or info you need, please let me know!
>> Cheers,
>>
>>
>> Guilherme
>>
>>
>>> - Chris
>>>
>>> ---
>>>
>>> diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
>>> index f9b6fba..fbd18ab 100644
>>> --- a/drivers/scsi/libiscsi.c
>>> +++ b/drivers/scsi/libiscsi.c
>>> @@ -560,8 +560,12 @@ static void iscsi_complete_task(struct iscsi_task
>>> *task, int state)
>>> WARN_ON_ONCE(task->state == ISCSI_TASK_FREE);
>>> task->state = state;
>>>
>>> -   if (!list_empty(>running))
>>> +   spin_lock_bh(>taskqueuelock);
>>> +   if (!list_empty(>running)) {
>>> +   WARN_ONCE(1, "iscsi_complete_task while task on list");
>>> list_del_init(>running);
>>> +   }
>>> +   spin_unlock_bh(>taskqueuelock);
>>>
>>> if (conn->task == task)
>>> conn->task = NULL;
>>> @@ -783,7 +787,9 @@ __iscsi_conn_send_pdu(struct iscsi_conn *conn, struct
>>> iscsi_hdr *hdr,
>>> if (session->tt->xmit_task(task))
>>> goto free_task;
>>> } else {
>>> +   spin_lock_bh(>taskqueuelock);
>>> list_add_tail(>running, >mgmtqueue);
>>> +   spin_unlock_bh(>taskqueuelock);
>>

Re: [PATCH v2] mpt3sas: Force request partial completion alignment

2017-01-26 Thread Guilherme G. Piccoli

On 01/26/2017 03:02 PM, Ram Pai wrote:
> On Thu, Jan 26, 2017 at 11:31:53AM -0200, Guilherme G. Piccoli wrote:
>> On 01/25/2017 09:46 PM, Martin K. Petersen wrote:
>>>>>>>> "Guilherme" == Guilherme G Piccoli <gpicc...@linux.vnet.ibm.com> 
>>>>>>>> writes:
>>>
>>> Hi Guilherme,
>>
>> Hi Martin, thanks for the review!
>>
>>
>>>
>>> diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c 
>>> b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
>>> index 75f3fce..e52c942 100644
>>> --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
>>> +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
>>> @@ -4657,6 +4657,8 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, 
>>> u8 msix_index, u32 reply)
>>> struct MPT3SAS_DEVICE *sas_device_priv_data;
>>> u32 response_code = 0;
>>> unsigned long flags;
>>> +   unsigned int sector_sz;
>>> +   struct request *req;
>>>
>>> mpi_reply = mpt3sas_base_get_reply_virt_addr(ioc, reply);
>>> scmd = _scsih_scsi_lookup_get_clear(ioc, smid);
>>> @@ -4715,6 +4717,21 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 
>>> smid, u8 msix_index, u32 reply)
>>> }
>>>
>>> xfer_cnt = le32_to_cpu(mpi_reply->TransferCount);
>>> +
>>> +   /* In case of bogus fw or device, we could end up having
>>> +* unaligned partial completion. We can force alignment here,
>>> +* then scsi-ml does not need to handle this misbehavior.
>>> +*/
>>> +   sector_sz = scmd->device->sector_size;
>>> +   req = scmd->request;
>>> +   if (unlikely(sector_sz && req && (req->cmd_type == REQ_TYPE_FS) &&
>>> +   (xfer_cnt % sector_sz))) {
>>>
>>> Maybe a bit zealous on the sanity checking...
>>
>> A bit...? heheh
>> Too much I'd say. Since this is dealing with a bogus FW scenario, I
>> found more safe to check everything...of course we can remove checks if
>> it's sure req isn't NULL ever. The sector_sz check is avoiding
>> degenerate cases, since our division below.
>>
>>
>>>
>>> +   sdev_printk(KERN_INFO, scmd->device,
>>> +   "unaligned partial completion avoided (xfer_cnt=%u, 
>>> sector_sz=%u)\n",
>>> +   xfer_cnt, sector_sz);
>>> +   xfer_cnt = (xfer_cnt / sector_sz) * sector_sz;
>>>
>>> Not so keen on divisions. xfer_cnt = round_down(xfer_cnt, sector_sz), maybe?
>>>
>>
>> Martin, I might be completely wrong here (please correct me if this is
>> the case), but isn't C standard integer division a truncation that acts
>> like a round down? I checked (what I think is) the specification of C
>> language (ISO/IEC 9899:1999), and it seems the division proposed by Ram
>> Pai is accurate in this case. Also, both variables are unsigned.
> 
> Guilherme,  Its better to use round_down() instead of division. Among
> other things it saves a few nanoseconds.

Thanks Ram and Martin for the suggestion and explanation. I just sent a V3.

Cheers,


Guilherme

> 
> RP
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3] mpt3sas: Force request partial completion alignment

2017-01-26 Thread Guilherme G. Piccoli

From: Ram Pai <linux...@us.ibm.com>

The firmware or device, possibly under a heavy I/O load, can return
on a partial unaligned boundary. Scsi-ml expects these requests to be
completed on an alignment boundary. Scsi-ml blindly requeues the I/O
without checking the alignment boundary of the I/O request for the
remaining bytes. This leads to errors, since devices cannot perform
non-aligned read/write operations.

This patch fixes the issue in the driver. It aligns unaligned
completions of FS requests, by truncating them to the nearest
alignment boundary.

Reported-by: Mauricio Faria De Oliveira <mauri...@linux.vnet.ibm.com>
Signed-off-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linux...@us.ibm.com>
Acked-by: Sreekanth Reddy <sreekanth.re...@broadcom.com>
---
v2->v3:
  * Changed division to round_down() [suggestion by Martin].

 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c 
b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index 75f3fce..a6b6f58 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -4657,6 +4657,8 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 
msix_index, u32 reply)
struct MPT3SAS_DEVICE *sas_device_priv_data;
u32 response_code = 0;
unsigned long flags;
+   unsigned int sector_sz;
+   struct request *req;
 
mpi_reply = mpt3sas_base_get_reply_virt_addr(ioc, reply);
scmd = _scsih_scsi_lookup_get_clear(ioc, smid);
@@ -4715,6 +4717,21 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 
msix_index, u32 reply)
}
 
xfer_cnt = le32_to_cpu(mpi_reply->TransferCount);
+
+   /* In case of bogus fw or device, we could end up having
+* unaligned partial completion. We can force alignment here,
+* then scsi-ml does not need to handle this misbehavior.
+*/
+   sector_sz = scmd->device->sector_size;
+   req = scmd->request;
+   if (unlikely(sector_sz && req && (req->cmd_type == REQ_TYPE_FS) &&
+   (xfer_cnt % sector_sz))) {
+   sdev_printk(KERN_INFO, scmd->device,
+   "unaligned partial completion avoided (xfer_cnt=%u, 
sector_sz=%u)\n",
+   xfer_cnt, sector_sz);
+   xfer_cnt = round_down(xfer_cnt, sector_sz);
+   }
+
scsi_set_resid(scmd, scsi_bufflen(scmd) - xfer_cnt);
if (ioc_status & MPI2_IOCSTATUS_FLAG_LOG_INFO_AVAILABLE)
log_info =  le32_to_cpu(mpi_reply->IOCLogInfo);
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2] mpt3sas: Force request partial completion alignment

2017-01-26 Thread Guilherme G. Piccoli

On 01/25/2017 09:46 PM, Martin K. Petersen wrote:
>>>>>> "Guilherme" == Guilherme G Piccoli <gpicc...@linux.vnet.ibm.com> writes:
> 
> Hi Guilherme,

Hi Martin, thanks for the review!


> 
> diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c 
> b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> index 75f3fce..e52c942 100644
> --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> @@ -4657,6 +4657,8 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, 
> u8 msix_index, u32 reply)
>   struct MPT3SAS_DEVICE *sas_device_priv_data;
>   u32 response_code = 0;
>   unsigned long flags;
> + unsigned int sector_sz;
> + struct request *req;
> 
>   mpi_reply = mpt3sas_base_get_reply_virt_addr(ioc, reply);
>   scmd = _scsih_scsi_lookup_get_clear(ioc, smid);
> @@ -4715,6 +4717,21 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, 
> u8 msix_index, u32 reply)
>   }
> 
>   xfer_cnt = le32_to_cpu(mpi_reply->TransferCount);
> +
> + /* In case of bogus fw or device, we could end up having
> +  * unaligned partial completion. We can force alignment here,
> +  * then scsi-ml does not need to handle this misbehavior.
> +  */
> + sector_sz = scmd->device->sector_size;
> + req = scmd->request;
> + if (unlikely(sector_sz && req && (req->cmd_type == REQ_TYPE_FS) &&
> + (xfer_cnt % sector_sz))) {
> 
> Maybe a bit zealous on the sanity checking...

A bit...? heheh
Too much I'd say. Since this is dealing with a bogus FW scenario, I
found more safe to check everything...of course we can remove checks if
it's sure req isn't NULL ever. The sector_sz check is avoiding
degenerate cases, since our division below.


> 
> + sdev_printk(KERN_INFO, scmd->device,
> + "unaligned partial completion avoided (xfer_cnt=%u, 
> sector_sz=%u)\n",
> + xfer_cnt, sector_sz);
> + xfer_cnt = (xfer_cnt / sector_sz) * sector_sz;
> 
> Not so keen on divisions. xfer_cnt = round_down(xfer_cnt, sector_sz), maybe?
>

Martin, I might be completely wrong here (please correct me if this is
the case), but isn't C standard integer division a truncation that acts
like a round down? I checked (what I think is) the specification of C
language (ISO/IEC 9899:1999), and it seems the division proposed by Ram
Pai is accurate in this case. Also, both variables are unsigned.

Let me know your thoughts.
Thanks,


Guilherme

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2] mpt3sas: Force request partial completion alignment

2017-01-24 Thread Guilherme G. Piccoli

From: Ram Pai <linux...@us.ibm.com>

The firmware or device, possibly under a heavy I/O load, can return
on a partial unaligned boundary. Scsi-ml expects these requests to be
completed on an alignment boundary. Scsi-ml blindly requeues the I/O
without checking the alignment boundary of the I/O request for the
remaining bytes. This leads to errors, since devices cannot perform
non-aligned read/write operations.

This patch fixes the issue in the driver. It aligns unaligned
completions of FS requests, by truncating them to the nearest
alignment boundary.

Reported-by: Mauricio Faria De Oliveira <mauri...@linux.vnet.ibm.com>
Signed-off-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linux...@us.ibm.com>
---
v1->v2:
  * Improved printk, by showing some variables too [suggested by Sreekanth].

 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c 
b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index 75f3fce..e52c942 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -4657,6 +4657,8 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 
msix_index, u32 reply)
struct MPT3SAS_DEVICE *sas_device_priv_data;
u32 response_code = 0;
unsigned long flags;
+   unsigned int sector_sz;
+   struct request *req;
 
mpi_reply = mpt3sas_base_get_reply_virt_addr(ioc, reply);
scmd = _scsih_scsi_lookup_get_clear(ioc, smid);
@@ -4715,6 +4717,21 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 
msix_index, u32 reply)
}
 
xfer_cnt = le32_to_cpu(mpi_reply->TransferCount);
+
+   /* In case of bogus fw or device, we could end up having
+* unaligned partial completion. We can force alignment here,
+* then scsi-ml does not need to handle this misbehavior.
+*/
+   sector_sz = scmd->device->sector_size;
+   req = scmd->request;
+   if (unlikely(sector_sz && req && (req->cmd_type == REQ_TYPE_FS) &&
+   (xfer_cnt % sector_sz))) {
+   sdev_printk(KERN_INFO, scmd->device,
+   "unaligned partial completion avoided (xfer_cnt=%u, 
sector_sz=%u)\n",
+   xfer_cnt, sector_sz);
+   xfer_cnt = (xfer_cnt / sector_sz) * sector_sz;
+   }
+
scsi_set_resid(scmd, scsi_bufflen(scmd) - xfer_cnt);
if (ioc_status & MPI2_IOCSTATUS_FLAG_LOG_INFO_AVAILABLE)
log_info =  le32_to_cpu(mpi_reply->IOCLogInfo);
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] mpt3sas: Force request partial completion alignment

2017-01-23 Thread Guilherme G. Piccoli

On 01/23/2017 07:05 AM, Sreekanth Reddy wrote:
> On Wed, Dec 28, 2016 at 6:21 PM, Guilherme G. Piccoli
> <gpicc...@linux.vnet.ibm.com> wrote:
>> From: Ram Pai <linux...@us.ibm.com>
>>
>> The firmware or device, possibly under a heavy I/O load, can return
>> on a partial unaligned boundary. Scsi-ml expects these requests to be
>> completed on an alignment boundary. Scsi-ml blindly requeues the I/O
>> without checking the alignment boundary of the I/O request for the
>> remaining bytes. This leads to errors, since devices cannot perform
>> non-aligned read/write operations.
>>
>> This patch fixes the issue in the driver. It aligns unaligned
>> completions of FS requests, by truncating them to the nearest
>> alignment boundary.
>>
>> Reported-by: Mauricio Faria De Oliveira <mauri...@linux.vnet.ibm.com>
>> Signed-off-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>
>> Signed-off-by: Ram Pai <linux...@us.ibm.com>
>> ---
>>  drivers/scsi/mpt3sas/mpt3sas_scsih.c | 16 
>>  1 file changed, 16 insertions(+)
>>
>> diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c 
>> b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
>> index b5c966e..55332a3 100644
>> --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
>> +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
>> @@ -4644,6 +4644,8 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, 
>> u8 msix_index, u32 reply)
>> struct MPT3SAS_DEVICE *sas_device_priv_data;
>> u32 response_code = 0;
>> unsigned long flags;
>> +   unsigned int sector_sz;
>> +   struct request *req;
>>
>> mpi_reply = mpt3sas_base_get_reply_virt_addr(ioc, reply);
>> scmd = _scsih_scsi_lookup_get_clear(ioc, smid);
>> @@ -4703,6 +4705,20 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, 
>> u8 msix_index, u32 reply)
>> }
>>
>> xfer_cnt = le32_to_cpu(mpi_reply->TransferCount);
>> +
>> +   /* In case of bogus fw or device, we could end up having
>> +* unaligned partial completion. We can force alignment here,
>> +* then scsi-ml does not need to handle this misbehavior.
>> +*/
>> +   sector_sz = scmd->device->sector_size;
>> +   req = scmd->request;
>> +   if (unlikely(sector_sz && req && (req->cmd_type == REQ_TYPE_FS) &&
>> +   (xfer_cnt % sector_sz))) {
>> +   sdev_printk(KERN_INFO, scmd->device,
>> +   "unaligned partial completion avoided\n");
> 
> [Sreekanth] Patch looks good. But can we print xfer_cnt & sector_sz
> values along with above print.
> 
> Also if it is generic drive issue, then can we move this work around
> to SCSI Mid Layer?
> 

Thank you! I'll send a v2 including your suggestion.
Regarding a fix in scsi-ml, we tried already:
https://lkml.org/lkml/2016/12/19/591

Reception wasn't in favor of the patch; they suggested we patch the
driver instead, then we sent the current change only for mpt3sas.

Thanks,


Guilherme

>> +   xfer_cnt = (xfer_cnt / sector_sz) * sector_sz;
>> +   }
>> +
>> scsi_set_resid(scmd, scsi_bufflen(scmd) - xfer_cnt);
>> if (ioc_status & MPI2_IOCSTATUS_FLAG_LOG_INFO_AVAILABLE)
>> log_info =  le32_to_cpu(mpi_reply->IOCLogInfo);
>> --
>> 2.1.0
>>
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] mpt3sas: Force request partial completion alignment

2016-12-28 Thread Guilherme G. Piccoli

From: Ram Pai <linux...@us.ibm.com>

The firmware or device, possibly under a heavy I/O load, can return
on a partial unaligned boundary. Scsi-ml expects these requests to be
completed on an alignment boundary. Scsi-ml blindly requeues the I/O
without checking the alignment boundary of the I/O request for the
remaining bytes. This leads to errors, since devices cannot perform
non-aligned read/write operations.

This patch fixes the issue in the driver. It aligns unaligned
completions of FS requests, by truncating them to the nearest
alignment boundary.

Reported-by: Mauricio Faria De Oliveira <mauri...@linux.vnet.ibm.com>
Signed-off-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linux...@us.ibm.com>
---
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c 
b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index b5c966e..55332a3 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -4644,6 +4644,8 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 
msix_index, u32 reply)
struct MPT3SAS_DEVICE *sas_device_priv_data;
u32 response_code = 0;
unsigned long flags;
+   unsigned int sector_sz;
+   struct request *req;
 
mpi_reply = mpt3sas_base_get_reply_virt_addr(ioc, reply);
scmd = _scsih_scsi_lookup_get_clear(ioc, smid);
@@ -4703,6 +4705,20 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 
msix_index, u32 reply)
}
 
xfer_cnt = le32_to_cpu(mpi_reply->TransferCount);
+
+   /* In case of bogus fw or device, we could end up having
+* unaligned partial completion. We can force alignment here,
+* then scsi-ml does not need to handle this misbehavior.
+*/
+   sector_sz = scmd->device->sector_size;
+   req = scmd->request;
+   if (unlikely(sector_sz && req && (req->cmd_type == REQ_TYPE_FS) &&
+   (xfer_cnt % sector_sz))) {
+   sdev_printk(KERN_INFO, scmd->device,
+   "unaligned partial completion avoided\n");
+   xfer_cnt = (xfer_cnt / sector_sz) * sector_sz;
+   }
+
scsi_set_resid(scmd, scsi_bufflen(scmd) - xfer_cnt);
if (ioc_status & MPI2_IOCSTATUS_FLAG_LOG_INFO_AVAILABLE)
log_info =  le32_to_cpu(mpi_reply->IOCLogInfo);
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] scsi: do not requeue requests unaligned with device sector size

2016-12-20 Thread Guilherme G. Piccoli

72 bytes done.
> [...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment
> (sector size 4096, remainder 3072, resid 1024)
> [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
> [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done.
> [...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment
> (sector size 4096, remainder 3072, resid 1024)
> [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
> [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done.
> [...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment
> (sector size 4096, remainder 3072, resid 1024)
> [...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_TIME_OUT 
> driverbyte=DRIVER_OK
> [...] sd 0:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 01 02 03 04 00 00 01 00
> [...] blk_update_request: I/O error, dev sda, sector 135272480
> [...] sd 0:0:0:0: tag#0 0 sectors total, 0 bytes done.
> 
> And now the one sector corner case when eventually one retry succeeds
> (the original request plus three retries, of which the last retry passes)
> and the request is successfully passed up the stack and finished:
> 
> Test 3)
> 
> # dmesg -c >/dev/null; dd if=/dev/sda of=/dev/null bs=4096 count=1
>   skip=16909064 iflag=direct 2>/dev/null; dmesg
> [...] sd 0:0:0:0: [sda] sd_setup_read_write_cmnd: block=135272512, count=8
> [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
> [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done.
> [...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment
>   (sector size 4096, remainder 3072, resid 1024)
> [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
> [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done.
> [...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment
>   (sector size 4096, remainder 3072, resid 1024)
>     [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
> [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 3072 bytes done.
> [...] sd 0:0:0:0: [sda] tag#0 checking 3072 bytes for alignment
>   (sector size 4096, remainder 3072, resid 1024)
> [...] sd 0:0:0:0: [sda] tag#0 sd_done: completed 4096 of 4096 bytes
> [...] sd 0:0:0:0: [sda] tag#0 8 sectors total, 4096 bytes done.
> [...] sd 0:0:0:0: tag#0 0 sectors total, 0 bytes done.
> 
> Apologies for the ridiculously long commit message with description and
> test-cases, but this problem has been relatively difficult to reproduce
> and understand, so I thought the documentation/instructions for working
> with that could be helpful for others too.
> 
> Signed-off-by: Mauricio Faria de Oliveira <mauri...@linux.vnet.ibm.com>

Thanks Maurício, great patch!
Feel free to add my:

Tested-by: Guilherme G. Piccoli <gpicc...@linux.vnet.ibm.com>


> ---
>  drivers/scsi/scsi_lib.c | 41 +
>  1 file changed, 41 insertions(+)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 2cca9cffc63f..190ab28cfb2d 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -821,6 +821,45 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned 
> int good_bytes)
>   }
> 
>   /*
> +  * If the SCSI command is successful but has residual bytes,
> +  * align good_bytes to the device sector size to ensure the
> +  * requeued request (to complete the remainder transfer) is
> +  * also aligned and does not fail alignment/size validation.
> +  */
> + if (unlikely(!result && scsi_get_resid(cmd) &&
> +  req->cmd_type == REQ_TYPE_FS)) {
> +
> + unsigned int sector_size = cmd->device->sector_size;
> + unsigned int remainder = good_bytes % sector_size;
> + int resid = scsi_get_resid(cmd);
> +
> + SCSI_LOG_HLCOMPLETE(1, scmd_printk(KERN_INFO, cmd,
> + "checking %d bytes for alignment "
> + "(sector size %u, remainder %u, resid %d)\n",
> + good_bytes, sector_size, remainder, resid));
> +
> + if (remainder) {
> + good_bytes -= remainder;
> + resid += remainder;
> + scsi_set_resid(cmd, resid);
> +
> + /*
> +  * If less than one device sector has completed, retry 
> the
> +  * request after delay (up to the retry limit) or 
> timeout.
> +

Re: [PATCH 1/1] iscsi: fix regression caused by session lock patch

2016-11-11 Thread Guilherme G. Piccoli

On 11/09/2016 03:21 AM, Chris Leech wrote:
> On Mon, Nov 07, 2016 at 04:23:10PM -0200, Guilherme G. Piccoli wrote:
>>
>> Sure! Count on us to test any patches. I guess the first step is to
>> reproduce on upstream right? We haven't tested specifically this
>> scenario for long time. Will try to reproduce on 4.9-rc4 and update here.
> 
> Great, I'm looking forward to hearing the result.
> 
> Assuming it reproduces, I don't think this level of fine grained locking
> is necessarily the best solution, but it may help confirm the problem.
> Especially if the WARN_ONCE I slipped in here triggers.
> 
> - Chris

Chris, I'm not able to reproduce right now - environment was
misconfigured, and now I'm leaving the office for 20 days.

Will test as soon as I'm back!
Thanks,


Guilherme


> 
> ---
> 
> diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
> index f9b6fba..fbd18ab 100644
> --- a/drivers/scsi/libiscsi.c
> +++ b/drivers/scsi/libiscsi.c
> @@ -560,8 +560,12 @@ static void iscsi_complete_task(struct iscsi_task *task, 
> int state)
>   WARN_ON_ONCE(task->state == ISCSI_TASK_FREE);
>   task->state = state;
> 
> - if (!list_empty(>running))
> + spin_lock_bh(>taskqueuelock);
> + if (!list_empty(>running)) {
> + WARN_ONCE(1, "iscsi_complete_task while task on list");
>   list_del_init(>running);
> + }
> + spin_unlock_bh(>taskqueuelock);
> 
>   if (conn->task == task)
>   conn->task = NULL;
> @@ -783,7 +787,9 @@ __iscsi_conn_send_pdu(struct iscsi_conn *conn, struct 
> iscsi_hdr *hdr,
>   if (session->tt->xmit_task(task))
>   goto free_task;
>   } else {
> + spin_lock_bh(>taskqueuelock);
>   list_add_tail(>running, >mgmtqueue);
> + spin_unlock_bh(>taskqueuelock);
>   iscsi_conn_queue_work(conn);
>   }
> 
> @@ -1474,8 +1480,10 @@ void iscsi_requeue_task(struct iscsi_task *task)
>* this may be on the requeue list already if the xmit_task callout
>* is handling the r2ts while we are adding new ones
>*/
> + spin_lock_bh(>taskqueuelock);
>   if (list_empty(>running))
>   list_add_tail(>running, >requeue);
> + spin_unlock_bh(>taskqueuelock);
>   iscsi_conn_queue_work(conn);
>  }
>  EXPORT_SYMBOL_GPL(iscsi_requeue_task);
> @@ -1512,22 +1520,26 @@ static int iscsi_data_xmit(struct iscsi_conn *conn)
>* only have one nop-out as a ping from us and targets should not
>* overflow us with nop-ins
>*/
> + spin_lock_bh(>taskqueuelock);
>  check_mgmt:
>   while (!list_empty(>mgmtqueue)) {
>   conn->task = list_entry(conn->mgmtqueue.next,
>struct iscsi_task, running);
>   list_del_init(>task->running);
> + spin_unlock_bh(>taskqueuelock);
>   if (iscsi_prep_mgmt_task(conn, conn->task)) {
>   /* regular RX path uses back_lock */
>   spin_lock_bh(>session->back_lock);
>   __iscsi_put_task(conn->task);
>   spin_unlock_bh(>session->back_lock);
>   conn->task = NULL;
> + spin_lock_bh(>taskqueuelock);
>   continue;
>   }
>   rc = iscsi_xmit_task(conn);
>   if (rc)
>   goto done;
> + spin_lock_bh(>taskqueuelock);
>   }
> 
>   /* process pending command queue */
> @@ -1535,19 +1547,24 @@ static int iscsi_data_xmit(struct iscsi_conn *conn)
>   conn->task = list_entry(conn->cmdqueue.next, struct iscsi_task,
>   running);
>   list_del_init(>task->running);
> + spin_unlock_bh(>taskqueuelock);
>   if (conn->session->state == ISCSI_STATE_LOGGING_OUT) {
>   fail_scsi_task(conn->task, DID_IMM_RETRY);
> + spin_lock_bh(>taskqueuelock);
>   continue;
>   }
>   rc = iscsi_prep_scsi_cmd_pdu(conn->task);
>   if (rc) {
>   if (rc == -ENOMEM || rc == -EACCES) {
> + spin_lock_bh(>taskqueuelock);
>   list_add_tail(>task->running,
> >cmdqueue);
>   conn->task = NULL;
> + spin_unlock_bh(>ta

Re: [PATCH 1/1] iscsi: fix regression caused by session lock patch

2016-11-07 Thread Guilherme G. Piccoli

On 11/07/2016 04:15 PM, Chris Leech wrote:
> Hi,
> 
> I'm kicking this old thread because I don't think this ever got
> resolved.  I wish I had more info, but it seems to involve target
> specific behavior that hasn't come up in our test labs.

Thanks very much for reopening this thread! We have the Or's patch
reverted in multiple distros, so the issue is not likely to happen on
customer's from IBM, but upstream kernel never saw a proper fix for this.


> 
> So what can I do at this point to help resolve this?
> 
> On Sun, Nov 15, 2015 at 12:10:48PM +0200, Or Gerlitz wrote:
>> Mike, Chris
>>
>> After the locking change, adding a task to any of the connection
>> mgmtqueue, cmdqueue, or requeue lists is under the session forward lock.
>>
>> Removing tasks from any of these lists in iscsi_data_xmit is under
>> the session forward lock and **before** calling down to the transport
>> to handle the task.
>>
>> The iscsi_complete_task helper was added by Mike's commit
>> 3bbaaad95fd38ded "[SCSI] libiscsi: handle cleanup task races"
>> and is indeed typically called under the backward lock && has this section
>>
>> +   if (!list_empty(>running))
>> +   list_del_init(>running);
>>
>> which per my reading of the code never comes into play, can you comment?
>>
>> Lets address this area before we move to the others claims made on the patch.
> 
> This bit in particular is where I see a cause for concern.  If that
> list_del_init call ever races against other list operations, there's a
> potential corruption.  It's presumably there for a reason, and Mike
> explained a case where some targets have been known to send a check
> condition at unexpected times that would hit it.
> 
> I don't like having known list locking violations hanging around, based
> on an expectation that we'll never hit that path with well behaved
> targets.
> 
> If we can get a fix worked up for the list locking here, can we get any
> testing on it from the original reports at IBM?  That was very helpful
> in testing a full reversion patch.

Sure! Count on us to test any patches. I guess the first step is to
reproduce on upstream right? We haven't tested specifically this
scenario for long time. Will try to reproduce on 4.9-rc4 and update here.

Cheers,



Guilherme


> 
> - Chris Leech
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

40 matches

Mail list logo