Re: [linux-sunxi] Re: [PATCH v3 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability

2019-12-12 Thread Roman Beránek


On Wednesday, December 4, 2019 at 5:19:23 AM UTC+1, Vasily Khoruzhick wrote:
>
> On Mon, Jan 14, 2019 at 1:25 AM Marc Zyngier  > wrote: 
> > 
> > Hi Samuel, 
>
> Hi Samuel, 
>
> > On 13/01/2019 02:17, Samuel Holland wrote: 
> > > The Allwinner A64 SoC is known[1] to have an unstable architectural 
> > > timer, which manifests itself most obviously in the time jumping 
> forward 
> > > a multiple of 95 years[2][3]. This coincides with 2^56 cycles at a 
> > > timer frequency of 24 MHz, implying that the time went slightly 
> backward 
> > > (and this was interpreted by the kernel as it jumping forward and 
> > > wrapping around past the epoch). 
> > > 
> > > Investigation revealed instability in the low bits of CNTVCT at the 
> > > point a high bit rolls over. This leads to power-of-two cycle forward 
> > > and backward jumps. (Testing shows that forward jumps are about twice 
> as 
> > > likely as backward jumps.) Since the counter value returns to normal 
> > > after an indeterminate read, each "jump" really consists of both a 
> > > forward and backward jump from the software perspective. 
> > > 
> > > Unless the kernel is trapping CNTVCT reads, a userspace program is 
> able 
> > > to read the register in a loop faster than it changes. A test program 
> > > running on all 4 CPU cores that reported jumps larger than 100 ms was 
> > > run for 13.6 hours and reported the following: 
> > > 
> > >  Count | Event 
> > > ---+--- 
> > >   9940 | jumped backward  699ms 
> > >268 | jumped backward 1398ms 
> > >  1 | jumped backward 2097ms 
> > >  16020 | jumped forward   175ms 
> > >   6443 | jumped forward   699ms 
> > >   2976 | jumped forward  1398ms 
> > >  9 | jumped forward356516ms 
> > >  9 | jumped forward357215ms 
> > >  4 | jumped forward714430ms 
> > >  1 | jumped forward   3578440ms 
> > > 
> > > This works out to a jump larger than 100 ms about every 5.5 seconds on 
> > > each CPU core. 
> > > 
> > > The largest jump (almost an hour!) was the following sequence of 
> reads: 
> > > 0x007f → 0x0093feff → 0x0080 
> > > 
> > > Note that the middle bits don't necessarily all read as all zeroes or 
> > > all ones during the anomalous behavior; however the low 10 bits 
> checked 
> > > by the function in this patch have never been observed with any other 
> > > value. 
> > > 
> > > Also note that smaller jumps are much more common, with backward jumps 
> > > of 2048 (2^11) cycles observed over 400 times per second on each core. 
> > > (Of course, this is partially explained by lower bits rolling over 
> more 
> > > frequently.) Any one of these could have caused the 95 year time skip. 
> > > 
> > > Similar anomalies were observed while reading CNTPCT (after patching 
> the 
> > > kernel to allow reads from userspace). However, the CNTPCT jumps are 
> > > much less frequent, and only small jumps were observed. The same 
> program 
> > > as before (except now reading CNTPCT) observed after 72 hours: 
> > > 
> > >  Count | Event 
> > > ---+--- 
> > > 17 | jumped backward  699ms 
> > > 52 | jumped forward   175ms 
> > >   2831 | jumped forward   699ms 
> > >  5 | jumped forward  1398ms 
> > > 
> > > Further investigation showed that the instability in CNTPCT/CNTVCT 
> also 
> > > affected the respective timer's TVAL register. The following values 
> were 
> > > observed immediately after writing CNVT_TVAL to 0x1000: 
> > > 
> > >  CNTVCT | CNTV_TVAL  | CNTV_CVAL  | CNTV_TVAL 
> Error 
> > > 
> +++- 
> > >  0x00d4a2d8bfff | 0x10003fff | 0x00d4b2d8bfff | +0x4000 
> > >  0x00d4a2d94000 | 0x0fff | 0x00d4b2d97fff | -0x4000 
> > >  0x00d4a2d97fff | 0x10003fff | 0x00d4b2d97fff | +0x4000 
> > >  0x00d4a2d9c000 | 0x0fff | 0x00d4b2d9 | -0x4000 
> > > 
> > > The pattern of errors in CNTV_TVAL seemed to depend on exactly which 
> > > value was written to it. For example, after writing 0x10101010: 
> > > 
> > >  CNTVCT | CNTV_TVAL  | CNTV_CVAL  | CNTV_TVAL 
> Error 
> > > 
> +++- 
> > >  0x01ac3eff | 0x1110100f | 0x01ac4f10100f | +0x100 
> > >  0x01ac4000 | 0x1010100f | 0x01ac5110100f | -0x100 
> > >  0x01ac58ff | 0x1110100f | 0x01ac6910100f | +0x100 
> > >  0x01ac6600 | 0x1010100f | 0x01ac7710100f | -0x100 
> > >  0x01ac6aff | 0x1110100f | 0x01ac7b10100f | +0x100 
> > >  0x01ac6e00 | 0x1010100f | 0x01ac7f10100f | -0x100 
> > > 
> > > I was also twice able to reproduce the issue covered by Allwinner's 
> > > workaround[4], that writing to TVAL sometimes fails, and both CVAL and 
> > > TVAL are left with entirely 

Re: [linux-sunxi] Re: [PATCH v3 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability

2019-12-04 Thread Vasily Khoruzhick
On Wed, Dec 4, 2019 at 4:21 AM Marc Zyngier  wrote:
>
> [please note that my email address has changed]
>
> On 2019-12-04 04:18, Vasily Khoruzhick wrote:
>
> [...]
>
> > Unfortunately this patch doesn't completely eliminate the jumps.
> > There
> > have been reports from users who still saw 95y jump even with the
> > patch applied.
> >
> > Personally I've seen it once or twice on my Pine64-LTS.
> >
> > Looks like we need bigger hammer. Does anyone have any idea what it
> > could be?
>
> Which kernel version did you see this happening on?

I've seen it on 5.3

>  M.
> --
> Jazz is not dead. It just smells funny...

-- 
You received this message because you are subscribed to the Google Groups 
"linux-sunxi" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to linux-sunxi+unsubscr...@googlegroups.com.
To view this discussion on the web, visit 
https://groups.google.com/d/msgid/linux-sunxi/CA%2BE%3DqVc-BA_W8O1qpkKgg5pDax-Jbvmpc-TB7gWB7CfYAxXCXQ%40mail.gmail.com.


Re: [linux-sunxi] Re: [PATCH v3 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability

2019-12-03 Thread Vasily Khoruzhick
On Mon, Jan 14, 2019 at 1:25 AM Marc Zyngier  wrote:
>
> Hi Samuel,

Hi Samuel,

> On 13/01/2019 02:17, Samuel Holland wrote:
> > The Allwinner A64 SoC is known[1] to have an unstable architectural
> > timer, which manifests itself most obviously in the time jumping forward
> > a multiple of 95 years[2][3]. This coincides with 2^56 cycles at a
> > timer frequency of 24 MHz, implying that the time went slightly backward
> > (and this was interpreted by the kernel as it jumping forward and
> > wrapping around past the epoch).
> >
> > Investigation revealed instability in the low bits of CNTVCT at the
> > point a high bit rolls over. This leads to power-of-two cycle forward
> > and backward jumps. (Testing shows that forward jumps are about twice as
> > likely as backward jumps.) Since the counter value returns to normal
> > after an indeterminate read, each "jump" really consists of both a
> > forward and backward jump from the software perspective.
> >
> > Unless the kernel is trapping CNTVCT reads, a userspace program is able
> > to read the register in a loop faster than it changes. A test program
> > running on all 4 CPU cores that reported jumps larger than 100 ms was
> > run for 13.6 hours and reported the following:
> >
> >  Count | Event
> > ---+---
> >   9940 | jumped backward  699ms
> >268 | jumped backward 1398ms
> >  1 | jumped backward 2097ms
> >  16020 | jumped forward   175ms
> >   6443 | jumped forward   699ms
> >   2976 | jumped forward  1398ms
> >  9 | jumped forward356516ms
> >  9 | jumped forward357215ms
> >  4 | jumped forward714430ms
> >  1 | jumped forward   3578440ms
> >
> > This works out to a jump larger than 100 ms about every 5.5 seconds on
> > each CPU core.
> >
> > The largest jump (almost an hour!) was the following sequence of reads:
> > 0x007f → 0x0093feff → 0x0080
> >
> > Note that the middle bits don't necessarily all read as all zeroes or
> > all ones during the anomalous behavior; however the low 10 bits checked
> > by the function in this patch have never been observed with any other
> > value.
> >
> > Also note that smaller jumps are much more common, with backward jumps
> > of 2048 (2^11) cycles observed over 400 times per second on each core.
> > (Of course, this is partially explained by lower bits rolling over more
> > frequently.) Any one of these could have caused the 95 year time skip.
> >
> > Similar anomalies were observed while reading CNTPCT (after patching the
> > kernel to allow reads from userspace). However, the CNTPCT jumps are
> > much less frequent, and only small jumps were observed. The same program
> > as before (except now reading CNTPCT) observed after 72 hours:
> >
> >  Count | Event
> > ---+---
> > 17 | jumped backward  699ms
> > 52 | jumped forward   175ms
> >   2831 | jumped forward   699ms
> >  5 | jumped forward  1398ms
> >
> > Further investigation showed that the instability in CNTPCT/CNTVCT also
> > affected the respective timer's TVAL register. The following values were
> > observed immediately after writing CNVT_TVAL to 0x1000:
> >
> >  CNTVCT | CNTV_TVAL  | CNTV_CVAL  | CNTV_TVAL Error
> > +++-
> >  0x00d4a2d8bfff | 0x10003fff | 0x00d4b2d8bfff | +0x4000
> >  0x00d4a2d94000 | 0x0fff | 0x00d4b2d97fff | -0x4000
> >  0x00d4a2d97fff | 0x10003fff | 0x00d4b2d97fff | +0x4000
> >  0x00d4a2d9c000 | 0x0fff | 0x00d4b2d9 | -0x4000
> >
> > The pattern of errors in CNTV_TVAL seemed to depend on exactly which
> > value was written to it. For example, after writing 0x10101010:
> >
> >  CNTVCT | CNTV_TVAL  | CNTV_CVAL  | CNTV_TVAL Error
> > +++-
> >  0x01ac3eff | 0x1110100f | 0x01ac4f10100f | +0x100
> >  0x01ac4000 | 0x1010100f | 0x01ac5110100f | -0x100
> >  0x01ac58ff | 0x1110100f | 0x01ac6910100f | +0x100
> >  0x01ac6600 | 0x1010100f | 0x01ac7710100f | -0x100
> >  0x01ac6aff | 0x1110100f | 0x01ac7b10100f | +0x100
> >  0x01ac6e00 | 0x1010100f | 0x01ac7f10100f | -0x100
> >
> > I was also twice able to reproduce the issue covered by Allwinner's
> > workaround[4], that writing to TVAL sometimes fails, and both CVAL and
> > TVAL are left with entirely bogus values. One was the following values:
> >
> >  CNTVCT | CNTV_TVAL  | CNTV_CVAL
> > ++--
> >  0x00d4a2d6014c | 0x8fbd5721 | 0x00d132935fff (615s in the past)
> >
> > 
> >
> > Because the CPU can read the CNTPCT/CNTVCT registers faster than