Re: [PATCH v2 11/53] mtd: nand: denali: fix bitflips calculation in handle_ecc()

2017-03-23 Thread Boris Brezillon
On Thu, 23 Mar 2017 16:02:02 +0900
Masahiro Yamada  wrote:

> Hi Boris,
> 
> 2017-03-23 5:57 GMT+09:00 Boris Brezillon 
> :
> > On Wed, 22 Mar 2017 23:07:18 +0900
> > Masahiro Yamada  wrote:
> >  
> >> + do {
> >> + err_addr = ioread32(denali->flash_reg + ECC_ERROR_ADDRESS);
> >> + err_sector = ECC_SECTOR(err_addr);
> >> + err_byte = ECC_BYTE(err_addr);
> >> +
> >> + err_cor_info = ioread32(denali->flash_reg + 
> >> ERR_CORRECTION_INFO);
> >> + err_cor_value = ECC_CORRECTION_VALUE(err_cor_info);
> >> + err_device = ECC_ERR_DEVICE(err_cor_info);
> >> +
> >> + /* reset the bitflip counter when crossing ECC sector */
> >> + if (err_sector != prev_sector)
> >> + bitflips = 0;
> >> +
> >> + if (ECC_ERROR_UNCORRECTABLE(err_cor_info)) {
> >> + /*
> >> +  * if the error is not correctable, need to look at 
> >> the
> >> +  * page to see if it is an erased page. if so, then
> >> +  * it's not a real ECC error
> >> +  */
> >> + ret = -EBADMSG;  
> >
> > You should never return -EBADMSG directly. Just increment
> > ecc_stats.failed and let the core return -EBADMSG to the upper layer.
> >  
> 
> Here, -EBADMSG is used like that returned from  ->ecc.correct()
> 
> 
> Please notice denali_read_page() never returns -EBADMSG.
> 
>  -EBADMSG is used as a mark "we need erased page check".
> 
> 
> I think nand_read_page_syndrome() does similar;
> -EBADMSG is used internally.

That's not exactly what happens. nand_read_page_syndrome() calls
ecc->correct() for each chunk, and if this method returns -EBADMSG (and
nand_check_erased_ecc_chunk() returns -EBADMSG too) it increments the
ecc_stats.failed counter.

Here you check all chunks in the same function and only increment
ecc_stats.failed once in denali_read_page() even if several chunks are
uncorrectable.
You handle_ecc() should act like nand_read_page_syndrome() WRT ECC
checking: check each block one by one, call
nand_check_erased_ecc_chunk() if needed, increment ecc_stats.failed
when an uncorrectable error is detected, and return max_bitflips at the
end.


Re: [PATCH v2 11/53] mtd: nand: denali: fix bitflips calculation in handle_ecc()

2017-03-23 Thread Masahiro Yamada
Hi Boris,

2017-03-23 5:57 GMT+09:00 Boris Brezillon :
> On Wed, 22 Mar 2017 23:07:18 +0900
> Masahiro Yamada  wrote:
>
>> + do {
>> + err_addr = ioread32(denali->flash_reg + ECC_ERROR_ADDRESS);
>> + err_sector = ECC_SECTOR(err_addr);
>> + err_byte = ECC_BYTE(err_addr);
>> +
>> + err_cor_info = ioread32(denali->flash_reg + 
>> ERR_CORRECTION_INFO);
>> + err_cor_value = ECC_CORRECTION_VALUE(err_cor_info);
>> + err_device = ECC_ERR_DEVICE(err_cor_info);
>> +
>> + /* reset the bitflip counter when crossing ECC sector */
>> + if (err_sector != prev_sector)
>> + bitflips = 0;
>> +
>> + if (ECC_ERROR_UNCORRECTABLE(err_cor_info)) {
>> + /*
>> +  * if the error is not correctable, need to look at the
>> +  * page to see if it is an erased page. if so, then
>> +  * it's not a real ECC error
>> +  */
>> + ret = -EBADMSG;
>
> You should never return -EBADMSG directly. Just increment
> ecc_stats.failed and let the core return -EBADMSG to the upper layer.
>

Here, -EBADMSG is used like that returned from  ->ecc.correct()


Please notice denali_read_page() never returns -EBADMSG.

 -EBADMSG is used as a mark "we need erased page check".


I think nand_read_page_syndrome() does similar;
-EBADMSG is used internally.



-- 
Best Regards
Masahiro Yamada


Re: [PATCH v2 11/53] mtd: nand: denali: fix bitflips calculation in handle_ecc()

2017-03-22 Thread Boris Brezillon
On Wed, 22 Mar 2017 23:07:18 +0900
Masahiro Yamada  wrote:

> + do {
> + err_addr = ioread32(denali->flash_reg + ECC_ERROR_ADDRESS);
> + err_sector = ECC_SECTOR(err_addr);
> + err_byte = ECC_BYTE(err_addr);
> +
> + err_cor_info = ioread32(denali->flash_reg + 
> ERR_CORRECTION_INFO);
> + err_cor_value = ECC_CORRECTION_VALUE(err_cor_info);
> + err_device = ECC_ERR_DEVICE(err_cor_info);
> +
> + /* reset the bitflip counter when crossing ECC sector */
> + if (err_sector != prev_sector)
> + bitflips = 0;
> +
> + if (ECC_ERROR_UNCORRECTABLE(err_cor_info)) {
> + /*
> +  * if the error is not correctable, need to look at the
> +  * page to see if it is an erased page. if so, then
> +  * it's not a real ECC error
> +  */
> + ret = -EBADMSG;

You should never return -EBADMSG directly. Just increment
ecc_stats.failed and let the core return -EBADMSG to the upper layer.



[RESEND PATCH v2 11/53] mtd: nand: denali: fix bitflips calculation in handle_ecc()

2017-03-22 Thread Masahiro Yamada
This function is wrong in multiple ways:

[1] Counting corrected bytes instead of corrected bits.

The following code is counting the number of corrected _bytes_.

/* correct the ECC error */
buf[offset] ^= err_cor_value;
mtd->ecc_stats.corrected++;
bitflips++;

What the core framework expects is the number of corrected _bits_.
They can be different if multiple bitflips occur within one byte.

[2] total number of errors instead of max of per-sector errors

The core framework expects that corrected errors are counted per
sector, then the max value should be taken.  The current code simply
iterates over the whole page, i.e. counts the total number of
correction in the page.  This means "too many bitflips" is triggered
earlier than it should be, i.e. the NAND device is worn out sooner.

Besides those bugs, this function is unreadable due to the deep
nesting.  Notice the whole code in this function is wrapped in
if (irq_status & INTR__ECC_ERR), so this conditional can be moved
out of the function.  Also, use shorter names for local variables.

Re-work the function to fix all the issues.

Signed-off-by: Masahiro Yamada 
---

Changes in v2:
  - Use shorter names for local variables.
  - Fix bugs addressed by [1], [2]

 drivers/mtd/nand/denali.c | 157 --
 1 file changed, 82 insertions(+), 75 deletions(-)

diff --git a/drivers/mtd/nand/denali.c b/drivers/mtd/nand/denali.c
index 86381ac..608fe6f 100644
--- a/drivers/mtd/nand/denali.c
+++ b/drivers/mtd/nand/denali.c
@@ -888,80 +888,87 @@ static void read_oob_data(struct mtd_info *mtd, uint8_t 
*buf, int page)
 #define ECC_SECTOR(x)  (((x) & ECC_ERROR_ADDRESS__SECTOR_NR) >> 12)
 #define ECC_BYTE(x)(((x) & ECC_ERROR_ADDRESS__OFFSET))
 #define ECC_CORRECTION_VALUE(x) ((x) & ERR_CORRECTION_INFO__BYTEMASK)
-#define ECC_ERROR_CORRECTABLE(x) (!((x) & ERR_CORRECTION_INFO__ERROR_TYPE))
+#define ECC_ERROR_UNCORRECTABLE(x) ((x) & ERR_CORRECTION_INFO__ERROR_TYPE)
 #define ECC_ERR_DEVICE(x)  (((x) & ERR_CORRECTION_INFO__DEVICE_NR) >> 8)
 #define ECC_LAST_ERR(x)((x) & 
ERR_CORRECTION_INFO__LAST_ERR_INFO)
 
-static bool handle_ecc(struct denali_nand_info *denali, uint8_t *buf,
-  uint32_t irq_status, unsigned int *max_bitflips)
+static int handle_ecc(struct mtd_info *mtd,
+ struct denali_nand_info *denali, uint8_t *buf)
 {
-   bool check_erased_page = false;
unsigned int bitflips = 0;
+   unsigned int max_bitflips = 0;
+   unsigned int total_bitflips = 0;
+   uint32_t err_addr, err_cor_info;
+   unsigned int err_byte, err_sector, err_device;
+   uint8_t err_cor_value;
+   unsigned int prev_sector = 0;
+   int ret = 0;
+
+   /* read the ECC errors. we'll ignore them for now */
+   denali_set_intr_modes(denali, false);
 
-   if (irq_status & INTR__ECC_ERR) {
-   /* read the ECC errors. we'll ignore them for now */
-   uint32_t err_address, err_correction_info, err_byte,
-err_sector, err_device, err_correction_value;
-   denali_set_intr_modes(denali, false);
-
-   do {
-   err_address = ioread32(denali->flash_reg +
-   ECC_ERROR_ADDRESS);
-   err_sector = ECC_SECTOR(err_address);
-   err_byte = ECC_BYTE(err_address);
-
-   err_correction_info = ioread32(denali->flash_reg +
-   ERR_CORRECTION_INFO);
-   err_correction_value =
-   ECC_CORRECTION_VALUE(err_correction_info);
-   err_device = ECC_ERR_DEVICE(err_correction_info);
-
-   if (ECC_ERROR_CORRECTABLE(err_correction_info)) {
-   /*
-* If err_byte is larger than ECC_SECTOR_SIZE,
-* means error happened in OOB, so we ignore
-* it. It's no need for us to correct it
-* err_device is represented the NAND error
-* bits are happened in if there are more
-* than one NAND connected.
-*/
-   if (err_byte < ECC_SECTOR_SIZE) {
-   struct mtd_info *mtd =
-   nand_to_mtd(&denali->nand);
-   int offset;
-
-   offset = (err_sector *
-   ECC_SECTOR_SIZE +
-   err_byte) *
-   denali->devnum +
-   err_device;
-

[PATCH v2 11/53] mtd: nand: denali: fix bitflips calculation in handle_ecc()

2017-03-22 Thread Masahiro Yamada
This function is wrong in multiple ways:

[1] Counting corrected bytes instead of corrected bits.

The following code is counting the number of corrected _bytes_.

/* correct the ECC error */
buf[offset] ^= err_cor_value;
mtd->ecc_stats.corrected++;
bitflips++;

What the core framework expects is the number of corrected _bits_.
They can be different if multiple bitflips occur within one byte.

[2] total number of errors instead of max of per-sector errors

The core framework expects that corrected errors are counted per
sector, then the max value should be taken.  The current code simply
iterates over the whole page, i.e. counts the total number of
correction in the page.  This means "too many bitflips" is triggered
earlier than it should be, i.e. the NAND device is worn out sooner.

Besides those bugs, this function is unreadable due to the deep
nesting.  Notice the whole code in this function is wrapped in
if (irq_status & INTR__ECC_ERR), so this conditional can be moved
out of the function.  Also, use shorter names for local variables.

Re-work the function to fix all the issues.

Signed-off-by: Masahiro Yamada 
---

Changes in v2:
  - Use shorter names for local variables.
  - Fix bugs addressed by [1], [2]

 drivers/mtd/nand/denali.c | 157 --
 1 file changed, 82 insertions(+), 75 deletions(-)

diff --git a/drivers/mtd/nand/denali.c b/drivers/mtd/nand/denali.c
index 86381ac..608fe6f 100644
--- a/drivers/mtd/nand/denali.c
+++ b/drivers/mtd/nand/denali.c
@@ -888,80 +888,87 @@ static void read_oob_data(struct mtd_info *mtd, uint8_t 
*buf, int page)
 #define ECC_SECTOR(x)  (((x) & ECC_ERROR_ADDRESS__SECTOR_NR) >> 12)
 #define ECC_BYTE(x)(((x) & ECC_ERROR_ADDRESS__OFFSET))
 #define ECC_CORRECTION_VALUE(x) ((x) & ERR_CORRECTION_INFO__BYTEMASK)
-#define ECC_ERROR_CORRECTABLE(x) (!((x) & ERR_CORRECTION_INFO__ERROR_TYPE))
+#define ECC_ERROR_UNCORRECTABLE(x) ((x) & ERR_CORRECTION_INFO__ERROR_TYPE)
 #define ECC_ERR_DEVICE(x)  (((x) & ERR_CORRECTION_INFO__DEVICE_NR) >> 8)
 #define ECC_LAST_ERR(x)((x) & 
ERR_CORRECTION_INFO__LAST_ERR_INFO)
 
-static bool handle_ecc(struct denali_nand_info *denali, uint8_t *buf,
-  uint32_t irq_status, unsigned int *max_bitflips)
+static int handle_ecc(struct mtd_info *mtd,
+ struct denali_nand_info *denali, uint8_t *buf)
 {
-   bool check_erased_page = false;
unsigned int bitflips = 0;
+   unsigned int max_bitflips = 0;
+   unsigned int total_bitflips = 0;
+   uint32_t err_addr, err_cor_info;
+   unsigned int err_byte, err_sector, err_device;
+   uint8_t err_cor_value;
+   unsigned int prev_sector = 0;
+   int ret = 0;
+
+   /* read the ECC errors. we'll ignore them for now */
+   denali_set_intr_modes(denali, false);
 
-   if (irq_status & INTR__ECC_ERR) {
-   /* read the ECC errors. we'll ignore them for now */
-   uint32_t err_address, err_correction_info, err_byte,
-err_sector, err_device, err_correction_value;
-   denali_set_intr_modes(denali, false);
-
-   do {
-   err_address = ioread32(denali->flash_reg +
-   ECC_ERROR_ADDRESS);
-   err_sector = ECC_SECTOR(err_address);
-   err_byte = ECC_BYTE(err_address);
-
-   err_correction_info = ioread32(denali->flash_reg +
-   ERR_CORRECTION_INFO);
-   err_correction_value =
-   ECC_CORRECTION_VALUE(err_correction_info);
-   err_device = ECC_ERR_DEVICE(err_correction_info);
-
-   if (ECC_ERROR_CORRECTABLE(err_correction_info)) {
-   /*
-* If err_byte is larger than ECC_SECTOR_SIZE,
-* means error happened in OOB, so we ignore
-* it. It's no need for us to correct it
-* err_device is represented the NAND error
-* bits are happened in if there are more
-* than one NAND connected.
-*/
-   if (err_byte < ECC_SECTOR_SIZE) {
-   struct mtd_info *mtd =
-   nand_to_mtd(&denali->nand);
-   int offset;
-
-   offset = (err_sector *
-   ECC_SECTOR_SIZE +
-   err_byte) *
-   denali->devnum +
-   err_device;
-