Re: NAND BBT corruption on MPC83xx
On Tue, Jul 5, 2011 at 3:58 PM, Matthew L. Creech wrote:
>
> Separately, I set up 2 test devices to run while I was away last week.
> One of them contained 2 patches:
>
> - Mike Hench's patch which eliminates this block of code in fsl_elbc_nand.c
> - Adam Thomson's patch
> (http://lists.infradead.org/pipermail/linux-mtd/2011-June/036427.html)
> which initializes oob_poi correctly
>
> Upon my return, the device with these patches saw no problems at all,
> and had no additional bad blocks. The device without these patches
> had some 200+ blocks which had been newly marked as bad in the BBT
> over the course of 10 days. After rebooting, this latter device then
> failed to boot, as shown here:
>
> http://mcreech.com/work/bbt-ecc-error4.txt
>
> I'm currently running another test to verify which of the two patches
> actually fixed this problem (which might take a few days), but it
> seems like removing that block of code in fsl_elbc_nand.c is a good
> idea.
>

Just an update: my tests confirmed that the patch to fsl_elbc_nand.c
(http://lists.infradead.org/pipermail/linux-mtd/2011-July/036893.html)
seems to have fixed these BBT corruption problems. I ran a torture
test on 2 devices for several days: the one which had only that patch
had no further issues, while the one which didn't have it (but did
have the other oob_poi patch from Adam) experienced BBT corruption.

Thanks everyone

--
Matthew L. Creech
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: NAND BBT corruption on MPC83xx
On Fri, Jun 17, 2011 at 5:34 PM, Scott Wood wrote:
>
> It seems that the generic code always passes -1 with PAGEPROG, and only
> provides the actual page address on SEQIN.
>
> I don't think the ECC readback is needed, and the fact that it looks like
> it has always been broken would seem to confirm that. It's broken in
> other ways, too -- it assumes a particular ECC layout. Let's get rid of it.
>
> As for the corruption, could it be degradation from repeated reads of that
> one page?
>

I modified nanddump to do repeated reads, and compare the data
obtained from the first iteration with that obtained later (to detect
bit-flips). I tried 3 different variations:

- one which reads the first page (2k) of the last block
- one which reads the second page (2k) of the last block
- one which reads the entire last block (128k), just for comparison

As I understand it, read-disturb would primarily come into play when
the second page is read, since it's adjacent to the first page (please
correct me if I'm wrong there). Anyway, all 3 of these tests were run
for at least 50 million read cycles, with no bit-flips detected. So
I'm somewhat doubtful that this is the cause of the BBT corruption
I've been seeing.

Separately, I set up 2 test devices to run while I was away last week.
One of them contained 2 patches:

- Mike Hench's patch which eliminates this block of code in fsl_elbc_nand.c
- Adam Thomson's patch
(http://lists.infradead.org/pipermail/linux-mtd/2011-June/036427.html)
which initializes oob_poi correctly

Upon my return, the device with these patches saw no problems at all,
and had no additional bad blocks. The device without these patches
had some 200+ blocks which had been newly marked as bad in the BBT
over the course of 10 days. After rebooting, this latter device then
failed to boot, as shown here:

http://mcreech.com/work/bbt-ecc-error4.txt

I'm currently running another test to verify which of the two patches
actually fixed this problem (which might take a few days), but it
seems like removing that block of code in fsl_elbc_nand.c is a good
idea.

--
Matthew L. Creech
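The repeated-read comparison described in this message (read a region once, then re-read it many times and look for bit-flips against the first read) can be sketched roughly as follows. This is not the modified nanddump from the thread; the function names and the idea of reading a device node through a plain file handle are assumptions made for illustration. As far as I know, an ordinary read on an MTD character device returns ECC-corrected data, so a sketch like this would only see flips that survive correction.

```python
def count_bit_flips(reference: bytes, sample: bytes) -> int:
    """Number of bit positions that differ between two equal-length buffers."""
    if len(reference) != len(sample):
        raise ValueError("buffers must have the same length")
    return sum(bin(a ^ b).count("1") for a, b in zip(reference, sample))


def repeated_read_test(path: str, offset: int, length: int, iterations: int) -> int:
    """Read the same region `iterations` times.

    Returns the total number of bit-flips seen relative to the first read.
    `path` is a hypothetical device node or file, not a name from the thread.
    """
    total_flips = 0
    # buffering=0 so that every iteration issues a fresh read on the handle
    with open(path, "rb", buffering=0) as dev:
        dev.seek(offset)
        reference = dev.read(length)
        for _ in range(iterations - 1):
            dev.seek(offset)
            total_flips += count_bit_flips(reference, dev.read(length))
    return total_flips
```

With the geometry mentioned in the thread (1GB chip, 128k blocks, 2k pages), the first page of the last block would start at offset `1024**3 - 128 * 1024` with a length of `2048`, and the second page 2048 bytes after that.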
RE: NAND BBT corruption on MPC83xx
On Mon, 2011-06-20 at 07:22 -0400, Atlant Schmidt wrote:
>
> As far as I know (and I'm sure the list will correct
> me if I'm wrong! ;-) ), neither UBI nor UBIFS nor any
> Linux layer provides this routine scrubbing; you have
> to code it up yourself, probably by accessing the
> device at the UBI (underlying block device/LEB) layer.

UBI will scrub all LEBs with bit-flips once they are read. But if you
have bit-flips in an LEB and it is never read, it will never be
scrubbed. And erasures of the neighboring PEBs may turn bit-flips into
hard errors.

To force scrubbing, the easiest way is to just read all volumes, like

    dd if=/dev/ubi0_i of=/dev/null bs=4096

for each i.

--
Best Regards,
Artem Bityutskiy
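The "read every volume" approach suggested above can also be scripted. The following is a minimal sketch of the same idea as the dd command, assuming the volume nodes match a glob pattern such as `/dev/ubi0_*` (the pattern, function name, and return value are illustrative choices, not something from the message). The script only performs the reads and discards the data; any scrubbing is done by UBI itself.

```python
import glob


def read_all_volumes(pattern: str = "/dev/ubi0_*", chunk_size: int = 4096) -> dict:
    """Read every node matching `pattern` to the end, discarding the data.

    Returns a mapping of path -> number of bytes read, which is handy for
    a quick sanity check that each volume was actually traversed.
    """
    totals = {}
    for path in sorted(glob.glob(pattern)):
        bytes_read = 0
        with open(path, "rb") as vol:
            while True:
                chunk = vol.read(chunk_size)
                if not chunk:
                    break  # end of volume
                bytes_read += len(chunk)
        totals[path] = bytes_read
    return totals
```

Run periodically (for example from a cron job), this would make sure no in-use LEB goes unread for long, which is the condition the message says prevents scrubbing.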
Re: NAND BBT corruption on MPC83xx
On Fri, Jun 17, 2011 at 5:34 PM, Scott Wood wrote:
>
> As for the corruption, could it be degradation from repeated reads of that
> one page?
>

Could be. I think Mike's theory was that the -1 page_addr sort of
"wrapped around", and caused us to read in the last block on flash
each time NAND_CMD_PAGEPROG was performed. So with a lot of writes
happening, we could end up with a BBT that looks like this.

That makes sense I guess, since set_addr() in fsl_elbc_nand.c uses
page_addr to set FBAR. I don't see anything about it in the manual,
but if FBAR wraps beyond the end of the chip, maybe the bits that
don't make sense are simply ignored. (In which case we should
probably add a check in set_addr() to prevent anything like this in
the future.)

In theory I should be able to prove it out by running 2 devices in
parallel - one with that block of code still there, and one with it
removed. If the former device sees bit-flips in the BBT and the
latter one doesn't, we'll be sure of the culprit. I'll try this and
come back with the results. Thanks!

--
Matthew L. Creech
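To illustrate the wrap-around theory and the kind of range check proposed above, here is a small model in Python. It is not driver code. It assumes the geometry mentioned in the thread (1GB chip, 128k blocks, 2k pages), assumes the block number is the page address divided by the pages per block, and assumes (as speculated in the message) that block-number bits beyond the end of the chip are simply ignored. Both function names are hypothetical.

```python
PAGE_SIZE = 2 * 1024                         # 2k pages, per the thread
BLOCK_SIZE = 128 * 1024                      # 128k blocks, per the thread
CHIP_SIZE = 1024 ** 3                        # 1GB chip, per the thread
PAGES_PER_BLOCK = BLOCK_SIZE // PAGE_SIZE    # 64
NUM_BLOCKS = CHIP_SIZE // BLOCK_SIZE         # 8192


def decode_page_addr(page_addr: int) -> tuple:
    """Model how a page address might map to (block, page within block).

    Assumption: out-of-range block bits are silently dropped by the hardware.
    """
    raw = page_addr & 0xFFFFFFFF             # two's-complement view of a 32-bit int
    block = (raw // PAGES_PER_BLOCK) % NUM_BLOCKS
    page_in_block = raw % PAGES_PER_BLOCK
    return block, page_in_block


def checked_page_addr(page_addr: int) -> int:
    """Hypothetical guard: reject addresses outside the chip before use."""
    if not 0 <= page_addr < NUM_BLOCKS * PAGES_PER_BLOCK:
        raise ValueError(f"page address {page_addr} out of range")
    return page_addr
```

Under these assumptions, `decode_page_addr(-1)` lands in block 8191, the last block of the chip, which matches the theory that every PAGEPROG triggered a read there. `checked_page_addr(-1)` raises instead of letting the address through.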
RE: NAND BBT corruption on MPC83xx
Mike:

> It is not a permanent damage thing.

A "read disturb" does no permanent damage to the chip
but if the read disturb event involves more bits than
can be corrected by your ECC code, it can do permanent
damage to the *DATA* you've stored in that block.

For this reason, a good flash management system manages
to at least occasionally read through *ALL* of the in-use
blocks in the device so that single-bit errors can be
scrubbed out (read and successfully corrected) before an
adjacent bit in the block also fails (which would eventually
lead to a multi-bit error that might be beyond the ability
to be corrected by the ECC).

As far as I know (and I'm sure the list will correct
me if I'm wrong! ;-) ), neither UBI nor UBIFS nor any
Linux layer provides this routine scrubbing; you have
to code it up yourself, probably by accessing the
device at the UBI (underlying block device/LEB) layer.

Atlant

-----Original Message-----
From: linux-mtd-boun...@lists.infradead.org [mailto:linux-mtd-boun...@lists.infradead.org] On Behalf Of Mike Hench
Sent: Saturday, June 18, 2011 13:55
To: Scott Wood; Matthew L. Creech
Cc: linuxppc-dev@lists.ozlabs.org; linux-...@lists.infradead.org
Subject: RE: NAND BBT corruption on MPC83xx

Scott Wood wrote:
> As for the corruption, could it be degradation from repeated reads of that
> one page?

Read Disturb.
I did not know SLC did that.
It just takes 10x as long as MLC, on the order of a million reads.
Supposedly erasing the block fixes it.
It is not a permanent damage thing.
I was seeing ~9 hours before failure with heavy writes.
~4GByte/hour = 2M pages, total ~18 million reads before errors in that
last block showed up.

Cool.
Now we know.
Thanks.

Mike Hench

__
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/
RE: NAND BBT corruption on MPC83xx
Scott Wood wrote:
> As for the corruption, could it be degradation from repeated reads of that
> one page?

Read Disturb.
I did not know SLC did that.
It just takes 10x as long as MLC, on the order of a million reads.
Supposedly erasing the block fixes it.
It is not a permanent damage thing.
I was seeing ~9 hours before failure with heavy writes.
~4GByte/hour = 2M pages, total ~18 million reads before errors in that
last block showed up.

Cool.
Now we know.
Thanks.

Mike Hench
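The estimate in this message can be checked with a few lines of arithmetic. The sketch below assumes 2k pages (the page size given elsewhere in the thread) and one read of the last block per page written, which is the theory under discussion rather than an established fact. The function name is made up for illustration.

```python
def readback_estimate(bytes_per_hour: int, page_size: int, hours: int) -> tuple:
    """Return (pages written per hour, total reads of the last block).

    Assumes exactly one unintended read of the last block per page program.
    """
    pages_per_hour = bytes_per_hour // page_size
    return pages_per_hour, pages_per_hour * hours


# ~4 GByte/hour and ~9 hours are the figures from the message;
# the 2048-byte page size is taken from elsewhere in the thread.
pages_per_hour, total_reads = readback_estimate(4 * 1024 ** 3, 2048, 9)
print(pages_per_hour)   # 2097152, i.e. about 2M pages per hour
print(total_reads)      # 18874368, i.e. about 18 million reads
```

Both figures agree with the rounded numbers quoted in the message.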
Re: NAND BBT corruption on MPC83xx
On Fri, 17 Jun 2011 16:54:27 -0400
"Matthew L. Creech" wrote:

> Hi, I posted this on the Linux-MTD list but haven't gotten any hits.
> Since it looks like it could be MPC83xx-specific, I'm reposting here.
> Rick Johnson noted a problem in fsl_elbc_nand.c back in May which
> might be related:
>
> http://lists.infradead.org/pipermail/linux-mtd/2011-May/035372.html

It seems that the generic code always passes -1 with PAGEPROG, and only
provides the actual page address on SEQIN.

I don't think the ECC readback is needed, and the fact that it looks like
it has always been broken would seem to confirm that. It's broken in
other ways, too -- it assumes a particular ECC layout. Let's get rid of it.

As for the corruption, could it be degradation from repeated reads of that
one page?

> More info on this board:
> - MPC 8313 SoC
> - 1GB Samsung NAND flash (K9K8G08U0B)
> - Linux 2.6.31
> - U-Boot 2009.06

Hmm, 2.6.31... it's probably not related to this problem, but you should
cherry pick b3a70f0bc32d1b70584bcaa6019fa4260b0da92e and
476459a6cf46d20ec73d9b211f3894ced5f9871e.

-Scott