Re: NAND BBT corruption on MPC83xx

2011-07-11 Thread Matthew L. Creech
On Tue, Jul 5, 2011 at 3:58 PM, Matthew L. Creech wrote:
>
> Separately, I set up 2 test devices to run while I was away last week.
>  One of them contained 2 patches:
>
> - Mike Hench's patch which eliminates this block of code in fsl_elbc_nand.c
> - Adam Thomson's patch
> (http://lists.infradead.org/pipermail/linux-mtd/2011-June/036427.html)
> which initializes oob_poi correctly
>
> Upon my return, the device with these patches saw no problems at all,
> and had no additional bad blocks.  The device without these patches
> had some 200+ blocks which had been newly marked as bad in the BBT
> over the course of 10 days.  After rebooting, this latter device then
> failed to boot, as shown here:
>
> http://mcreech.com/work/bbt-ecc-error4.txt
>
> I'm currently running another test to verify which of the two patches
> actually fixed this problem (which might take a few days), but it
> seems like removing that block of code in fsl_elbc_nand.c is a good
> idea.
>

Just an update: my tests confirmed that the patch to fsl_elbc_nand.c
(http://lists.infradead.org/pipermail/linux-mtd/2011-July/036893.html)
seems to have fixed these BBT corruption problems.

I ran a torture test on 2 devices for several days: the one which had
only that patch had no further issues, while the one which didn't have
it (but did have the other oob_poi patch from Adam) experienced BBT
corruption.

Thanks, everyone.

-- 
Matthew L. Creech
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: NAND BBT corruption on MPC83xx

2011-07-05 Thread Matthew L. Creech
On Fri, Jun 17, 2011 at 5:34 PM, Scott Wood wrote:
>
> It seems that the generic code always passes -1 with PAGEPROG, and only
> provides the actual page address on SEQIN.
>
> I don't think the ECC readback is needed, and the fact that it looks like
> it has always been broken would seem to confirm that.  It's broken in
> other ways, too -- it assumes a particular ECC layout.  Let's get rid of it.
>
> As for the corruption, could it be degradation from repeated reads of that
> one page?
>

I modified nanddump to do repeated reads, and compare the data
obtained from the first iteration with that obtained later (to detect
bit-flips).  I tried 3 different variations:

- one which reads the first page (2k) of the last block
- one which reads the second page (2k) of the last block
- one which reads the entire last block (128k), just for comparison

As I understand it, read-disturb would primarily come into play when
the second page is read, since it's adjacent to the first page (please
correct me if I'm wrong there).  Anyway, all 3 of these tests were run
for at least 50 million read cycles, with no bit-flips detected.  So
I'm somewhat doubtful that this is the cause of the BBT corruption
I've been seeing.
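
For illustration, the comparison step of such a test can be sketched as
below.  This is not the modified nanddump (which is C and was not posted
in this thread); count_bit_flips and check_repeated_reads are
hypothetical names, and read_page stands in for whatever actually reads
the region from the MTD device.

```python
from typing import Callable


def count_bit_flips(baseline: bytes, sample: bytes) -> int:
    """Return the number of bit positions that differ between two reads."""
    if len(baseline) != len(sample):
        raise ValueError("reads must have the same length")
    # XOR leaves a 1 in every bit position that changed between the reads.
    return sum(bin(a ^ b).count("1") for a, b in zip(baseline, sample))


def check_repeated_reads(read_page: Callable[[], bytes], cycles: int) -> int:
    """Read the same region `cycles` times; return total flips vs. the first read.

    `read_page` is a hypothetical callable returning the raw bytes of the
    region under test (one 2k page or one 128k block in the tests above).
    """
    baseline = read_page()
    total = 0
    for _ in range(cycles - 1):
        total += count_bit_flips(baseline, read_page())
    return total
```

A result of 0 after all cycles corresponds to the "no bit-flips
detected" outcome reported above.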



Separately, I set up 2 test devices to run while I was away last week.
 One of them contained 2 patches:

- Mike Hench's patch which eliminates the ECC readback block in fsl_elbc_nand.c
- Adam Thomson's patch
(http://lists.infradead.org/pipermail/linux-mtd/2011-June/036427.html)
which initializes oob_poi correctly

Upon my return, the device with these patches saw no problems at all,
and had no additional bad blocks.  The device without these patches
had some 200+ blocks which had been newly marked as bad in the BBT
over the course of 10 days.  After rebooting, this latter device then
failed to boot, as shown here:

http://mcreech.com/work/bbt-ecc-error4.txt

I'm currently running another test to verify which of the two patches
actually fixed this problem (which might take a few days), but it
seems like removing that block of code in fsl_elbc_nand.c is a good
idea.

-- 
Matthew L. Creech


RE: NAND BBT corruption on MPC83xx

2011-06-23 Thread Artem Bityutskiy
On Mon, 2011-06-20 at 07:22 -0400, Atlant Schmidt wrote:
> 
>   As far as I know (and I'm sure the list will correct
>   me if I'm wrong! ;-) ), neither UBI nor UBIFS nor any
>   Linux layer provides this routine scrubbing; you have
>   to code it up yourself, probably by accessing the
>   device at the UBI (underlying block device/LEB) layer. 

UBI will scrub all LEBs with bit-flips once they are read.
But if you have bit-flips in an LEB and it is never read, it will never
be scrubbed. And erasures of the neighboring PEBs may turn bit-flips
into hard errors.

To force scrubbing, the easiest way is to just read all volumes, like

dd if=/dev/ubi0_i of=/dev/null bs=4096

for each i.
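
A minimal sketch of that loop, assuming the volumes show up as
/dev/ubi0_0, /dev/ubi0_1, and so on (the glob pattern and both function
names are assumptions for illustration, not part of UBI).  It reads each
volume to the end and discards the data, which is what the dd command
above does for a single volume.

```python
import glob


def read_through(path: str, block_size: int = 4096) -> int:
    """Read `path` from start to end, discarding the data; return bytes read."""
    total = 0
    with open(path, "rb", buffering=0) as dev:
        while True:
            chunk = dev.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def read_all_volumes(pattern: str = "/dev/ubi0_*") -> dict:
    """Read every device node matching `pattern` (assumed naming: ubi0_<i>)."""
    return {path: read_through(path) for path in sorted(glob.glob(pattern))}
```

Calling read_all_volumes() periodically (from cron, for example) would
make sure every LEB gets read, so that UBI has a chance to notice and
scrub bit-flips.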

-- 
Best Regards,
Artem Bityutskiy



Re: NAND BBT corruption on MPC83xx

2011-06-20 Thread Matthew L. Creech
On Fri, Jun 17, 2011 at 5:34 PM, Scott Wood wrote:
>
> As for the corruption, could it be degradation from repeated reads of that
> one page?
>

Could be.  I think Mike's theory was that the -1 page_addr sort of
"wrapped around", and caused us to read in the last block on flash
each time NAND_CMD_PAGEPROG was performed.  So with a lot of writes
happening, we could end up with a BBT that looks like this.

That makes sense I guess, since set_addr() in fsl_elbc_nand.c uses
page_addr to set FBAR.  I don't see anything about it in the manual,
but if FBAR wraps beyond the end of the chip, maybe the bits that
don't make sense are simply ignored.  (In which case we should
probably add a check in set_addr() to prevent anything like this in
the future.)
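
To make the suspected wrap-around concrete, here is a small model of the
arithmetic.  It is only a sketch under stated assumptions: 64 pages per
block and 8192 blocks (from the 2k page, 128k block and 1GB chip size
mentioned in this thread), a 32-bit register, and a chip that ignores
block-address bits beyond its size.  That last point is the conjecture
above, not documented behavior, and the function names are hypothetical
rather than taken from fsl_elbc_nand.c.

```python
PAGES_PER_BLOCK = 64      # 128k block / 2k page
TOTAL_BLOCKS = 8192       # 1GB chip / 128k block


def block_register_value(page_addr: int) -> int:
    """Block number as a 32-bit register would hold it (assumed model)."""
    return (page_addr // PAGES_PER_BLOCK) & 0xFFFFFFFF


def block_seen_by_chip(page_addr: int) -> int:
    """Block actually addressed if out-of-range bits are ignored (conjecture)."""
    return block_register_value(page_addr) & (TOTAL_BLOCKS - 1)


def checked_page_addr(page_addr: int) -> int:
    """The kind of guard suggested above: reject addresses outside the chip."""
    if not 0 <= page_addr < PAGES_PER_BLOCK * TOTAL_BLOCKS:
        raise ValueError(f"page_addr {page_addr} is outside the chip")
    return page_addr
```

With page_addr = -1, this model puts 0xFFFFFFFF in the register and
addresses block 8191, the last block on the chip, which is consistent
with Mike's theory.  The guard would reject -1 before it got that far.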

In theory I should be able to prove it out by running 2 devices in
parallel - one with that block of code still there, and one with it
removed.  If the former device sees bit-flips in the BBT and the
latter one doesn't, we'll be sure of the culprit.  I'll try this and
come back with the results.

Thanks!

-- 
Matthew L. Creech


RE: NAND BBT corruption on MPC83xx

2011-06-20 Thread Atlant Schmidt
Mike:

> It is not a permanent damage thing.

  A "read disturb" does no permanent damage to the chip
  but if the read disturb event involves more bits than
  can be corrected by your ECC code, it can do permanent
  damage to the *DATA* you've stored in that block.

  For this reason, a good flash management system manages
  to at least occasionally read through *ALL* of the in-use
  blocks in the device so that single-bit errors can be
  scrubbed out (read and successfully corrected) before
  an adjacent bit in the block also fails (which would
  eventually lead to a multi-bit error that might be
  beyond the ability to be corrected by the ECC).

  As far as I know (and I'm sure the list will correct
  me if I'm wrong! ;-) ), neither UBI nor UBIFS nor any
  Linux layer provides this routine scrubbing; you have
  to code it up yourself, probably by accessing the
  device at the UBI (underlying block device/LEB) layer.

Atlant

-----Original Message-----
From: linux-mtd-boun...@lists.infradead.org 
[mailto:linux-mtd-boun...@lists.infradead.org] On Behalf Of Mike Hench
Sent: Saturday, June 18, 2011 13:55
To: Scott Wood; Matthew L. Creech
Cc: linuxppc-dev@lists.ozlabs.org; linux-...@lists.infradead.org
Subject: RE: NAND BBT corruption on MPC83xx

Scott Wood wrote:
> As for the corruption, could it be degradation from repeated reads of that
> one page?

Read disturb. I did not know SLC did that.
It just takes 10x as long as MLC, on the order of a million reads.
Supposedly erasing the block fixes it.
It is not a permanent damage thing.
I was seeing ~9 hours before failure with heavy writes.
~4GByte/hour = 2M pages, total ~18 million reads before errors in that
last block showed up.

Cool. Now we know.
Thanks.

Mike Hench



__
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/



RE: NAND BBT corruption on MPC83xx

2011-06-18 Thread Mike Hench
Scott Wood wrote:
> As for the corruption, could it be degradation from repeated reads of that
> one page?

Read disturb. I did not know SLC did that.
It just takes 10x as long as MLC, on the order of a million reads.
Supposedly erasing the block fixes it.
It is not a permanent damage thing.
I was seeing ~9 hours before failure with heavy writes.
~4GByte/hour = 2M pages, total ~18 million reads before errors in that
last block showed up.
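
Spelled out, that estimate works out as follows.  This assumes 2 KiB
pages (the page size mentioned elsewhere in this thread), reads "4GByte"
as 4 GiB, and assumes one readback of the last block per page
programmed, which is the theory under discussion rather than an
established fact.

```python
PAGE_SIZE = 2 * 1024        # bytes per page (2k, assumed from the thread)
WRITE_RATE = 4 * 1024**3    # bytes written per hour (~4GByte, read as GiB)
HOURS_TO_FAILURE = 9        # ~9 hours of heavy writes

pages_per_hour = WRITE_RATE // PAGE_SIZE
readbacks_before_failure = pages_per_hour * HOURS_TO_FAILURE

print(pages_per_hour)            # 2097152, i.e. ~2M pages per hour
print(readbacks_before_failure)  # 18874368, i.e. ~18 million reads
```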

Cool. Now we know.
Thanks.

Mike Hench




Re: NAND BBT corruption on MPC83xx

2011-06-17 Thread Scott Wood
On Fri, 17 Jun 2011 16:54:27 -0400
"Matthew L. Creech" wrote:

> Hi, I posted this on the Linux-MTD list but haven't gotten any hits.
> Since it looks like it could be MPC83xx-specific, I'm reposting here.
> Rick Johnson noted a problem in fsl_elbc_nand.c back in May which
> might be related:
> 
> http://lists.infradead.org/pipermail/linux-mtd/2011-May/035372.html

It seems that the generic code always passes -1 with PAGEPROG, and only
provides the actual page address on SEQIN.

I don't think the ECC readback is needed, and the fact that it looks like
it has always been broken would seem to confirm that.  It's broken in
other ways, too -- it assumes a particular ECC layout.  Let's get rid of it.
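
The command sequence described above can be modeled roughly as follows.
This is a simplified sketch, not the kernel code: the class and method
names are made up for illustration, and only the two facts stated above
are taken from the thread (the real page address arrives with SEQIN, and
PAGEPROG carries -1).  It shows why any code that wants the page address
at PAGEPROG time would have to remember the SEQIN value instead.

```python
NAND_CMD_SEQIN = 0x80      # standard NAND "serial data input" opcode
NAND_CMD_PAGEPROG = 0x10   # standard NAND "page program" opcode


class CmdfuncModel:
    """Toy model of a driver command hook; all names here are hypothetical."""

    def __init__(self):
        self.seqin_page = None   # page address remembered from SEQIN
        self.programmed = []     # pages this model believes were programmed

    def cmdfunc(self, command: int, column: int, page_addr: int) -> None:
        if command == NAND_CMD_SEQIN:
            # The only point where the real page address is available.
            self.seqin_page = page_addr
        elif command == NAND_CMD_PAGEPROG:
            # page_addr is -1 here, so use the value saved at SEQIN instead.
            self.programmed.append(self.seqin_page)


model = CmdfuncModel()
model.cmdfunc(NAND_CMD_SEQIN, 0, 1234)
model.cmdfunc(NAND_CMD_PAGEPROG, -1, -1)
print(model.programmed)  # [1234]
```

Note that the suggestion above is to remove the readback altogether
rather than to repair it this way.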

As for the corruption, could it be degradation from repeated reads of that
one page?

> More info on this board:
> - MPC 8313 SoC
> - 1GB Samsung NAND flash (K9K8G08U0B)
> - Linux 2.6.31
> - U-Boot 2009.06

Hmm, 2.6.31... it's probably not related to this problem, but you
should cherry pick b3a70f0bc32d1b70584bcaa6019fa4260b0da92e and
476459a6cf46d20ec73d9b211f3894ced5f9871e.

-Scott
