Re: [PATCH v2] block: ratelimite pr_err on IO path

2018-04-16 Thread Sreekanth Reddy
On Mon, Apr 16, 2018 at 1:46 PM, Jinpu Wang  wrote:
> On Fri, Apr 13, 2018 at 6:59 PM, Martin K. Petersen
>  wrote:
>>
>> Jinpu,
>>
>> [CC:ed the mpt3sas maintainers]
>>
>> The ratelimit patch is just an attempt to treat the symptom, not the
>> cause.
> Agree. If we can fix the root cause, it will be great.
>>
>>> Thanks for asking, we updated mpt3sas driver which enables DIX support
>>> (prot_mask=0x7f), all disks are SATA SSDs, no DIF support.
>>> After reboot, kernel reports the IO errors from all the drives behind
>>> HBA, seems for almost every read IO, which turns the system unusable:
>>> [   13.079375] sda: ref tag error at location 0 (rcvd 143196159)
>>> [   13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
>>> [   13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
>>> [   13.080407] sda: ref tag error at location 0 (rcvd 143196159)
>>> [   13.080594] sda: ref tag error at location 8 (rcvd 143196159)
>>
>> That sounds like a bug in the mpt3sas driver or firmware. I guess the
>> HBA could conceivably be operating a SATA device as DIX Type 0 and strip
>> the PI on the drive side. But that doesn't seem to be a particularly
>> useful mode of operation.
>>
>> Jinpu: Which firmware are you running? Also, please send us the output
>> of:
>>
>> sg_readcap -l /dev/sda
>> sg_inq -x /dev/sda
>> sg_vpd /dev/sda
>>
> Disks are INTEL SSDSC2BX48, directly attached to HBA.
> LSISAS3008: FWVersion(13.00.00.00), ChipRevision(0x02), 
> BiosVersion(08.11.00.00)
> mpt3sas_cm2: Protocol=(Initiator,Target),
> Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set
> Full,NCQ)
>
> jwang@x:~$ sudo sg_vpd /dev/sdz
> Supported VPD pages VPD page:
>   Supported VPD pages [sv]
>   Unit serial number [sn]
>   Device identification [di]
>   Mode page policy [mpp]
>   ATA information (SAT) [ai]
>   Block limits (SBC) [bl]
>   Block device characteristics (SBC) [bdc]
>   Logical block provisioning (SBC) [lbpv]
> jwang@x:~$ sudo sg_inq -x /dev/sdz
> VPD INQUIRY: extended INQUIRY data page
> inquiry: field in cdb illegal (page not supported)
> jwang@x:~$ sudo sg_readcap -l /dev/sdz
> Read Capacity results:
>Protection: prot_en=0, p_type=0, p_i_exponent=0
>Logical block provisioning: lbpme=1, lbprz=1
>Last logical block address=937703087 (0x37e436af), Number of
> logical blocks=937703088
>Logical block length=512 bytes
>Logical blocks per physical block exponent=3 [so physical block
> length=4096 bytes]
>Lowest aligned logical block address=0
> Hence:
>Device size: 480103981056 bytes, 457862.8 MiB, 480.10 GB
>
>
>> Broadcom: How is DIX supposed to work for SATA drives behind an mpt3sas
>> controller?

[Sreekanth] Current Upstream mpt3sas driver doesn't have DIX support
capabilities,
it supports only DIF feature.


Thanks,
Sreekanth
>>
>> --
>> Martin K. Petersen  Oracle Linux Engineering
>
>
> Thanks!
>
> --
> Jack Wang
> Linux Kernel Developer
>
> ProfitBricks GmbH
> Greifswalder Str. 207
> D - 10405 Berlin


Re: [PATCH v2] block: ratelimite pr_err on IO path

2018-04-16 Thread Jinpu Wang
On Fri, Apr 13, 2018 at 6:59 PM, Martin K. Petersen
 wrote:
>
> Jinpu,
>
> [CC:ed the mpt3sas maintainers]
>
> The ratelimit patch is just an attempt to treat the symptom, not the
> cause.
Agree. If we can fix the root cause, it will be great.
>
>> Thanks for asking, we updated mpt3sas driver which enables DIX support
>> (prot_mask=0x7f), all disks are SATA SSDs, no DIF support.
>> After reboot, kernel reports the IO errors from all the drives behind
>> HBA, seems for almost every read IO, which turns the system unusable:
>> [   13.079375] sda: ref tag error at location 0 (rcvd 143196159)
>> [   13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
>> [   13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
>> [   13.080407] sda: ref tag error at location 0 (rcvd 143196159)
>> [   13.080594] sda: ref tag error at location 8 (rcvd 143196159)
>
> That sounds like a bug in the mpt3sas driver or firmware. I guess the
> HBA could conceivably be operating a SATA device as DIX Type 0 and strip
> the PI on the drive side. But that doesn't seem to be a particularly
> useful mode of operation.
>
> Jinpu: Which firmware are you running? Also, please send us the output
> of:
>
> sg_readcap -l /dev/sda
> sg_inq -x /dev/sda
> sg_vpd /dev/sda
>
Disks are INTEL SSDSC2BX48, directly attached to HBA.
LSISAS3008: FWVersion(13.00.00.00), ChipRevision(0x02), BiosVersion(08.11.00.00)
mpt3sas_cm2: Protocol=(Initiator,Target),
Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set
Full,NCQ)

jwang@x:~$ sudo sg_vpd /dev/sdz
Supported VPD pages VPD page:
  Supported VPD pages [sv]
  Unit serial number [sn]
  Device identification [di]
  Mode page policy [mpp]
  ATA information (SAT) [ai]
  Block limits (SBC) [bl]
  Block device characteristics (SBC) [bdc]
  Logical block provisioning (SBC) [lbpv]
jwang@x:~$ sudo sg_inq -x /dev/sdz
VPD INQUIRY: extended INQUIRY data page
inquiry: field in cdb illegal (page not supported)
jwang@x:~$ sudo sg_readcap -l /dev/sdz
Read Capacity results:
   Protection: prot_en=0, p_type=0, p_i_exponent=0
   Logical block provisioning: lbpme=1, lbprz=1
   Last logical block address=937703087 (0x37e436af), Number of
logical blocks=937703088
   Logical block length=512 bytes
   Logical blocks per physical block exponent=3 [so physical block
length=4096 bytes]
   Lowest aligned logical block address=0
Hence:
   Device size: 480103981056 bytes, 457862.8 MiB, 480.10 GB


> Broadcom: How is DIX supposed to work for SATA drives behind an mpt3sas
> controller?
>
> --
> Martin K. Petersen  Oracle Linux Engineering


Thanks!

-- 
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin


Re: [PATCH v2] block: ratelimite pr_err on IO path

2018-04-13 Thread Martin K. Petersen

Jinpu,

[CC:ed the mpt3sas maintainers]

The ratelimit patch is just an attempt to treat the symptom, not the
cause.

> Thanks for asking, we updated mpt3sas driver which enables DIX support
> (prot_mask=0x7f), all disks are SATA SSDs, no DIF support.
> After reboot, kernel reports the IO errors from all the drives behind
> HBA, seems for almost every read IO, which turns the system unusable:
> [   13.079375] sda: ref tag error at location 0 (rcvd 143196159)
> [   13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
> [   13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
> [   13.080407] sda: ref tag error at location 0 (rcvd 143196159)
> [   13.080594] sda: ref tag error at location 8 (rcvd 143196159)

That sounds like a bug in the mpt3sas driver or firmware. I guess the
HBA could conceivably be operating a SATA device as DIX Type 0 and strip
the PI on the drive side. But that doesn't seem to be a particularly
useful mode of operation.

Jinpu: Which firmware are you running? Also, please send us the output
of:

sg_readcap -l /dev/sda
sg_inq -x /dev/sda
sg_vpd /dev/sda

Broadcom: How is DIX supposed to work for SATA drives behind an mpt3sas
controller?

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v2] block: ratelimite pr_err on IO path

2018-04-13 Thread Jinpu Wang
On Thu, Apr 12, 2018 at 11:20 PM, Martin K. Petersen
 wrote:
>
> Jack,
>
>> + pr_err_ratelimited("%s: ref tag error at 
>> location %llu (rcvd %u)\n",
>
> I'm a bit concerned about dropping records of potential data loss.
>
> Also, what are you doing that compels all these to be logged? This
> should be a very rare occurrence.
>
> --
> Martin K. Petersen  Oracle Linux Engineering
Hi Martin,

Thanks for asking, we updated mpt3sas driver which enables DIX support
(prot_mask=0x7f), all disks are SATA SSDs, no DIF support.
After reboot, kernel reports the IO errors from all the drives behind
HBA, seems for almost every read IO, which turns the system unusable:
[   13.079375] sda: ref tag error at location 0 (rcvd 143196159)
[   13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
[   13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
[   13.080407] sda: ref tag error at location 0 (rcvd 143196159)
[   13.080594] sda: ref tag error at location 8 (rcvd 143196159)
[   13.080996] sda: ref tag error at location 0 (rcvd 143196159)
[   13.089878] sdb: ref tag error at location 0 (rcvd 143196159)
[   13.090275] sdb: ref tag error at location 937702912 (rcvd 277413887)
[   13.090448] sdb: ref tag error at location 937703072 (rcvd 143196159)
[   13.090655] sdb: ref tag error at location 0 (rcvd 143196159)
[   13.090823] sdb: ref tag error at location 8 (rcvd 277413887)
[   13.091218] sdb: ref tag error at location 0 (rcvd 143196159)
[   13.095412] sdc: ref tag error at location 0 (rcvd 143196159)
[   13.095859] sdc: ref tag error at location 937702912 (rcvd 143196159)
[   13.096058] sdc: ref tag error at location 937703072 (rcvd 143196159)
[   13.096228] sdc: ref tag error at location 0 (rcvd 143196159)
[   13.096445] sdc: ref tag error at location 8 (rcvd 143196159)
[   13.096833] sdc: ref tag error at location 0 (rcvd 277413887)
[   13.097187] sds: ref tag error at location 0 (rcvd 277413887)
[   13.097707] sds: ref tag error at location 937702912 (rcvd 143196159)
[   13.097855] sds: ref tag error at location 937703072 (rcvd 277413887)

Kernel version 4.15 and 4.14.28, I scan the commits in upstream,
haven't found any relevant.
in  4.4.112, there's no such errors.
Diable DIX support (prot_mask=0x7) in mpt3sas fixes the problem.

Regards,
-- 
Jack Wang
Linux Kernel DeveloperProfitBricks GmbH


Re: [PATCH v2] block: ratelimite pr_err on IO path

2018-04-12 Thread Martin K. Petersen

Jack,

> + pr_err_ratelimited("%s: ref tag error at 
> location %llu (rcvd %u)\n",

I'm a bit concerned about dropping records of potential data loss.

Also, what are you doing that compels all these to be logged? This
should be a very rare occurrence.

-- 
Martin K. Petersen  Oracle Linux Engineering