On 12 July 2018 18:46:11 GMT+10:00, Richard Weinberger <rich...@nod.at> wrote:
Mark,

On Thursday, 12 July 2018, 07:22:13 CEST, Heiko Schocher wrote:
Hello Mark,

added Richard Weinberger to cc...

On 12.07.2018 at 02:28, Mark Spieth wrote:
> Hi
>
> In the process of investigating a boot failure on one of our devices, the
>   UBI: fixable bit-flip detected at PEB
> message was seen with the following behaviour during kernel load in u-boot.
>
> Read [2285568] bytes
> UBI: fixable bit-flip detected at PEB 415
> UBI: schedule PEB 415 for scrubbing
> UBI: fixable bit-flip detected at PEB 415
> UBI: fixable bit-flip detected at PEB 419
> UBI: schedule PEB 419 for scrubbing
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: schedule PEB 420 for scrubbing
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
>
> This repeats until reset.

Do you see the same symptom also on Linux?
We need to be very sure that it is actually a UBI problem.

The Linux kernel provided has an up-to-date mtd/ubi driver, so it already has the 75% bit-flip threshold, which hides the issue on new flash. So the two are not the same. This is untested on Linux.


> This fix is not a root cause fix though. Investigating further led to the
> following root cause solution. The following is AFAICT.
>
> When the scrubber chooses a PEB to move the data to, it takes it from the
> free balanced tree. This tree is sorted by EC (erase count) and then by
> PEB number.
>
> The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF, which is
> 8192 in this config. So the find_wl_entry function will find a PEB that is
> better in error count than the current PEB EC. This

error count? You mean erase count?

Yes of course.


> can easily cause it to find the PEB that was just moved from if it is the
> lowest numbered PEB in the free tree. Waiting for EC to go above 8192 would
> take a long time and cause premature aging of the flash PEBs in question.
>
> The easy solution is to change the max parameter to this call to 0 so it
> finds a PEB with a smaller EC than the one being replaced. This means it
> won't use the previously discarded PEB as its first choice.

For scrubbing this might be a good idea, but not for regular wear-leveling.

Yes, only for scrubbing, not wear leveling.
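
The selection rule under discussion can be sketched as a small model. This is not the UBI code: `struct free_peb` and `pick_free_peb` are hypothetical names, and a sorted array stands in for the free tree. The model assumes the picker returns the free entry with the highest EC that is still below the lowest free EC plus `diff`, and falls back to the lowest-EC entry when nothing qualifies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a free-tree entry; not the UBI struct. */
struct free_peb {
    int pnum; /* physical eraseblock number */
    int ec;   /* erase counter */
};

/*
 * Simplified model of the free-PEB choice (an assumption, not UBI code).
 * free_pebs must be sorted by (ec, pnum) ascending, like the free tree.
 * Returns the index of the entry with the highest EC that is still below
 * lowest_ec + diff, or index 0 (the lowest EC) when no entry qualifies.
 */
static size_t pick_free_peb(const struct free_peb *free_pebs, size_t n, int diff)
{
    assert(n > 0);
    int max = free_pebs[0].ec + diff;
    size_t best = 0;

    for (size_t i = 0; i < n; i++) {
        if (free_pebs[i].ec < max)
            best = i; /* later entries have an equal or higher EC */
    }
    return best;
}
```

With a free pool of PEB 100 (EC 1), PEB 101 (EC 1) and a just-erased PEB 419 (EC 2), this model returns PEB 419 for a diff of 8192 and PEB 100 for a diff of 0.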

See comment in UBI:
/*
 * When a physical eraseblock is moved, the WL sub-system has to pick the target
 * physical eraseblock to move to. The simplest way would be just to pick the
 * one with the highest erase counter. But in certain workloads this could lead
 * to an unlimited wear of one or few physical eraseblock. Indeed, imagine a
 * situation when the picked physical eraseblock is constantly erased after the
 * data is written to it. So, we have a constant which limits the highest erase
 * counter of the free physical eraseblock to pick. Namely, the WL sub-system
 * does not pick eraseblocks with erase counter greater than the lowest erase
 * counter plus %WL_FREE_MAX_DIFF.
 */
#define WL_FREE_MAX_DIFF (2*UBI_WL_THRESHOLD)

So we could change the logic such that for regular wear-leveling we keep
using WL_FREE_MAX_DIFF, but for scrubbing (which is 1:1 wear-leveling but
the source PEB is showing bit-flips) we use a lower value. IMHO
WL_FREE_MAX_DIFF/2 would be a good choice. I'm not sure whether 0 is too
extreme and might cause other distortions.
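
A minimal sketch of that split, for discussion only. The enum `move_reason` and the helper `free_max_diff` are hypothetical and do not exist in UBI, and UBI_WL_THRESHOLD is assumed to be 4096 so that WL_FREE_MAX_DIFF matches the 8192 reported for this config.

```c
#include <assert.h>

#define UBI_WL_THRESHOLD 4096 /* assumption: yields the 8192 reported here */
#define WL_FREE_MAX_DIFF (2*UBI_WL_THRESHOLD)

/* Hypothetical: why a PEB is being moved. */
enum move_reason { MOVE_WEAR_LEVELING, MOVE_SCRUBBING };

/*
 * Hypothetical helper: the EC window to allow when picking the target PEB.
 * Regular wear-leveling keeps the full window, scrubbing gets half of it.
 */
static int free_max_diff(enum move_reason reason)
{
    assert(reason == MOVE_WEAR_LEVELING || reason == MOVE_SCRUBBING);
    return reason == MOVE_SCRUBBING ? WL_FREE_MAX_DIFF / 2 : WL_FREE_MAX_DIFF;
}
```

Under these assumptions the helper yields 8192 for wear-leveling and 4096 for scrubbing. Replacing the scrubbing branch with 0 would give the variant Mark describes.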

Yes, the wear-leveling threshold is still WL_FREE_MAX_DIFF and the scrubbing threshold is 0.

This is why I'm asking. Because the two PEBs will track each other's EC, I'm not sure that will work.
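
Whether the PEBs keep chasing each other can be explored with a toy simulation. It is a sketch under strong assumptions, not UBI code: eight PEBs that all start at EC 1, a picker that takes the free PEB with the highest EC below the lowest free EC plus `diff` (falling back to the lowest EC), and a worst case in which every PEB that receives the data shows a bit-flip on the next read. The names `pick` and `distinct_pebs_used` are made up for this example.

```c
#include <assert.h>

#define NPEBS 8

/*
 * Toy picker: highest-EC free PEB below lowest_free_ec + diff (ties go to
 * the highest PEB number), else the lowest-EC free PEB (ties go to the
 * lowest PEB number).
 */
static int pick(const int *ec, const int *is_free, int diff)
{
    int lowest = -1, best = -1;

    for (int i = 0; i < NPEBS; i++)
        if (is_free[i] && (lowest < 0 || ec[i] < ec[lowest]))
            lowest = i;
    assert(lowest >= 0); /* the pool must not be empty */

    for (int i = 0; i < NPEBS; i++)
        if (is_free[i] && ec[i] < ec[lowest] + diff &&
            (best < 0 || ec[i] >= ec[best]))
            best = i;
    return best >= 0 ? best : lowest;
}

/* Counts how many distinct PEBs held the data over `scrubs` moves. */
static int distinct_pebs_used(int diff, int scrubs)
{
    int ec[NPEBS], is_free[NPEBS], used[NPEBS] = {0};
    int src = 0, count = 0;

    for (int i = 0; i < NPEBS; i++) {
        ec[i] = 1;
        is_free[i] = 1;
    }
    is_free[src] = 0; /* PEB 0 holds the data at the start */
    used[src] = 1;

    for (int s = 0; s < scrubs; s++) {
        int dst = pick(ec, is_free, diff);

        is_free[dst] = 0;
        used[dst] = 1;
        ec[src]++;        /* the source is erased ... */
        is_free[src] = 1; /* ... and returned to the free pool */
        src = dst;
    }
    for (int i = 0; i < NPEBS; i++)
        count += used[i];
    return count;
}
```

In this toy model, 100 moves with a diff of 8192 or 4096 bounce between just two PEBs, while a diff of 0 rotates through all eight. Real flash behaves differently in at least one respect: a freshly written PEB does not necessarily show bit-flips again.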

Mark, can you please file a patch and send it to the linux-mtd mailing list?
Such a change needs to go through Linux and then to u-boot.
But first we need to think about and discuss it in detail.

Will do.


  I am not sure if it is so easy ...

> This fix was implemented and fixable bit-flip errors no longer hang/freeze
> the boot process! UBI erase and reformat was used between re-tests to get
> consistent results.
>
> Adding the above 75% correctable bitflip threshold is also a good thing as
> less movement will ensue when the FLASH is new, but as the flash ages, the
> root cause will once again be invoked causing un-recoverable boot failures.
>
> Note this fault is also in the latest kernel drivers for UBI and may also
> exist in other wear leveling implementations. The kernel driver issue may
> be at fault for android devices locking up/freezing sporadically during
> FLASH read when scrubbing due to a relatively full flash and correctable
> errors causing ping pong PEB moves.
>
> The question is, is my root cause solution sound or have I missed something?

I have to think about it before I write nonsense, but maybe Richard has
deeper insight here.


Thanks for your input.

Mark


--
Mark Spieth, PhD
Digivation Pty Ltd
9 Catalina Ave
ASHBURTON VIC 3147
Australia
Phone: +61 4 11 515717 (0411515717)
Fax: +61 3 9885 5774
_______________________________________________
U-Boot mailing list
U-Boot@lists.denx.de
https://lists.denx.de/listinfo/u-boot
