Re: [linux-sunxi] Re: [PATCH v3 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability
On Wednesday, December 4, 2019 at 5:19:23 AM UTC+1, Vasily Khoruzhick wrote: > > On Mon, Jan 14, 2019 at 1:25 AM Marc Zyngier > wrote: > > > > Hi Samuel, > > Hi Samuel, > > > On 13/01/2019 02:17, Samuel Holland wrote: > > > The Allwinner A64 SoC is known[1] to have an unstable architectural > > > timer, which manifests itself most obviously in the time jumping > forward > > > a multiple of 95 years[2][3]. This coincides with 2^56 cycles at a > > > timer frequency of 24 MHz, implying that the time went slightly > backward > > > (and this was interpreted by the kernel as it jumping forward and > > > wrapping around past the epoch). > > > > > > Investigation revealed instability in the low bits of CNTVCT at the > > > point a high bit rolls over. This leads to power-of-two cycle forward > > > and backward jumps. (Testing shows that forward jumps are about twice > as > > > likely as backward jumps.) Since the counter value returns to normal > > > after an indeterminate read, each "jump" really consists of both a > > > forward and backward jump from the software perspective. > > > > > > Unless the kernel is trapping CNTVCT reads, a userspace program is > able > > > to read the register in a loop faster than it changes. A test program > > > running on all 4 CPU cores that reported jumps larger than 100 ms was > > > run for 13.6 hours and reported the following: > > > > > > Count | Event > > > ---+--- > > > 9940 | jumped backward 699ms > > >268 | jumped backward 1398ms > > > 1 | jumped backward 2097ms > > > 16020 | jumped forward 175ms > > > 6443 | jumped forward 699ms > > > 2976 | jumped forward 1398ms > > > 9 | jumped forward356516ms > > > 9 | jumped forward357215ms > > > 4 | jumped forward714430ms > > > 1 | jumped forward 3578440ms > > > > > > This works out to a jump larger than 100 ms about every 5.5 seconds on > > > each CPU core. > > > > > > The largest jump (almost an hour!) was the following sequence of > reads: > > > 0x007f → 0x0093feff → 0x0080 > > > > > > Note that the middle bits don't necessarily all read as all zeroes or > > > all ones during the anomalous behavior; however the low 10 bits > checked > > > by the function in this patch have never been observed with any other > > > value. > > > > > > Also note that smaller jumps are much more common, with backward jumps > > > of 2048 (2^11) cycles observed over 400 times per second on each core. > > > (Of course, this is partially explained by lower bits rolling over > more > > > frequently.) Any one of these could have caused the 95 year time skip. > > > > > > Similar anomalies were observed while reading CNTPCT (after patching > the > > > kernel to allow reads from userspace). However, the CNTPCT jumps are > > > much less frequent, and only small jumps were observed. The same > program > > > as before (except now reading CNTPCT) observed after 72 hours: > > > > > > Count | Event > > > ---+--- > > > 17 | jumped backward 699ms > > > 52 | jumped forward 175ms > > > 2831 | jumped forward 699ms > > > 5 | jumped forward 1398ms > > > > > > Further investigation showed that the instability in CNTPCT/CNTVCT > also > > > affected the respective timer's TVAL register. The following values > were > > > observed immediately after writing CNVT_TVAL to 0x1000: > > > > > > CNTVCT | CNTV_TVAL | CNTV_CVAL | CNTV_TVAL > Error > > > > +++- > > > 0x00d4a2d8bfff | 0x10003fff | 0x00d4b2d8bfff | +0x4000 > > > 0x00d4a2d94000 | 0x0fff | 0x00d4b2d97fff | -0x4000 > > > 0x00d4a2d97fff | 0x10003fff | 0x00d4b2d97fff | +0x4000 > > > 0x00d4a2d9c000 | 0x0fff | 0x00d4b2d9 | -0x4000 > > > > > > The pattern of errors in CNTV_TVAL seemed to depend on exactly which > > > value was written to it. For example, after writing 0x10101010: > > > > > > CNTVCT | CNTV_TVAL | CNTV_CVAL | CNTV_TVAL > Error > > > > +++- > > > 0x01ac3eff | 0x1110100f | 0x01ac4f10100f | +0x100 > > > 0x01ac4000 | 0x1010100f | 0x01ac5110100f | -0x100 > > > 0x01ac58ff | 0x1110100f | 0x01ac6910100f | +0x100 > > > 0x01ac6600 | 0x1010100f | 0x01ac7710100f | -0x100 > > > 0x01ac6aff | 0x1110100f | 0x01ac7b10100f | +0x100 > > > 0x01ac6e00 | 0x1010100f | 0x01ac7f10100f | -0x100 > > > > > > I was also twice able to reproduce the issue covered by Allwinner's > > > workaround[4], that writing to TVAL sometimes fails, and both CVAL and > > > TVAL are left with entirely
Re: [linux-sunxi] Re: [PATCH v3 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability
On Wed, Dec 4, 2019 at 4:21 AM Marc Zyngier wrote: > > [please note that my email address has changed] > > On 2019-12-04 04:18, Vasily Khoruzhick wrote: > > [...] > > > Unfortunately this patch doesn't completely eliminate the jumps. > > There > > have been reports from users who still saw 95y jump even with the > > patch applied. > > > > Personally I've seen it once or twice on my Pine64-LTS. > > > > Looks like we need bigger hammer. Does anyone have any idea what it > > could be? > > Which kernel version did you see this happening on? I've seen it on 5.3 > M. > -- > Jazz is not dead. It just smells funny... -- You received this message because you are subscribed to the Google Groups "linux-sunxi" group. To unsubscribe from this group and stop receiving emails from it, send an email to linux-sunxi+unsubscr...@googlegroups.com. To view this discussion on the web, visit https://groups.google.com/d/msgid/linux-sunxi/CA%2BE%3DqVc-BA_W8O1qpkKgg5pDax-Jbvmpc-TB7gWB7CfYAxXCXQ%40mail.gmail.com.
Re: [linux-sunxi] Re: [PATCH v3 1/2] arm64: arch_timer: Workaround for Allwinner A64 timer instability
On Mon, Jan 14, 2019 at 1:25 AM Marc Zyngier wrote: > > Hi Samuel, Hi Samuel, > On 13/01/2019 02:17, Samuel Holland wrote: > > The Allwinner A64 SoC is known[1] to have an unstable architectural > > timer, which manifests itself most obviously in the time jumping forward > > a multiple of 95 years[2][3]. This coincides with 2^56 cycles at a > > timer frequency of 24 MHz, implying that the time went slightly backward > > (and this was interpreted by the kernel as it jumping forward and > > wrapping around past the epoch). > > > > Investigation revealed instability in the low bits of CNTVCT at the > > point a high bit rolls over. This leads to power-of-two cycle forward > > and backward jumps. (Testing shows that forward jumps are about twice as > > likely as backward jumps.) Since the counter value returns to normal > > after an indeterminate read, each "jump" really consists of both a > > forward and backward jump from the software perspective. > > > > Unless the kernel is trapping CNTVCT reads, a userspace program is able > > to read the register in a loop faster than it changes. A test program > > running on all 4 CPU cores that reported jumps larger than 100 ms was > > run for 13.6 hours and reported the following: > > > > Count | Event > > ---+--- > > 9940 | jumped backward 699ms > >268 | jumped backward 1398ms > > 1 | jumped backward 2097ms > > 16020 | jumped forward 175ms > > 6443 | jumped forward 699ms > > 2976 | jumped forward 1398ms > > 9 | jumped forward356516ms > > 9 | jumped forward357215ms > > 4 | jumped forward714430ms > > 1 | jumped forward 3578440ms > > > > This works out to a jump larger than 100 ms about every 5.5 seconds on > > each CPU core. > > > > The largest jump (almost an hour!) was the following sequence of reads: > > 0x007f → 0x0093feff → 0x0080 > > > > Note that the middle bits don't necessarily all read as all zeroes or > > all ones during the anomalous behavior; however the low 10 bits checked > > by the function in this patch have never been observed with any other > > value. > > > > Also note that smaller jumps are much more common, with backward jumps > > of 2048 (2^11) cycles observed over 400 times per second on each core. > > (Of course, this is partially explained by lower bits rolling over more > > frequently.) Any one of these could have caused the 95 year time skip. > > > > Similar anomalies were observed while reading CNTPCT (after patching the > > kernel to allow reads from userspace). However, the CNTPCT jumps are > > much less frequent, and only small jumps were observed. The same program > > as before (except now reading CNTPCT) observed after 72 hours: > > > > Count | Event > > ---+--- > > 17 | jumped backward 699ms > > 52 | jumped forward 175ms > > 2831 | jumped forward 699ms > > 5 | jumped forward 1398ms > > > > Further investigation showed that the instability in CNTPCT/CNTVCT also > > affected the respective timer's TVAL register. The following values were > > observed immediately after writing CNVT_TVAL to 0x1000: > > > > CNTVCT | CNTV_TVAL | CNTV_CVAL | CNTV_TVAL Error > > +++- > > 0x00d4a2d8bfff | 0x10003fff | 0x00d4b2d8bfff | +0x4000 > > 0x00d4a2d94000 | 0x0fff | 0x00d4b2d97fff | -0x4000 > > 0x00d4a2d97fff | 0x10003fff | 0x00d4b2d97fff | +0x4000 > > 0x00d4a2d9c000 | 0x0fff | 0x00d4b2d9 | -0x4000 > > > > The pattern of errors in CNTV_TVAL seemed to depend on exactly which > > value was written to it. For example, after writing 0x10101010: > > > > CNTVCT | CNTV_TVAL | CNTV_CVAL | CNTV_TVAL Error > > +++- > > 0x01ac3eff | 0x1110100f | 0x01ac4f10100f | +0x100 > > 0x01ac4000 | 0x1010100f | 0x01ac5110100f | -0x100 > > 0x01ac58ff | 0x1110100f | 0x01ac6910100f | +0x100 > > 0x01ac6600 | 0x1010100f | 0x01ac7710100f | -0x100 > > 0x01ac6aff | 0x1110100f | 0x01ac7b10100f | +0x100 > > 0x01ac6e00 | 0x1010100f | 0x01ac7f10100f | -0x100 > > > > I was also twice able to reproduce the issue covered by Allwinner's > > workaround[4], that writing to TVAL sometimes fails, and both CVAL and > > TVAL are left with entirely bogus values. One was the following values: > > > > CNTVCT | CNTV_TVAL | CNTV_CVAL > > ++-- > > 0x00d4a2d6014c | 0x8fbd5721 | 0x00d132935fff (615s in the past) > > > > > > > > Because the CPU can read the CNTPCT/CNTVCT registers faster than