Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) board.
> On 9 May 2018, at 11:27, Daniel Danzberger wrote:
>
> On 05/
>>
>> So that smells more of a race condition between the writer filling with 0xFF
>> and the reader catching up.
> The open() syscall does the memset(0xff) and blocks. This ensures the memory is
> initialized before the open() returns. I don't think there is a race.
>>
>> Again, assume that I am an idiot and am missing something fundamental.

I did say assume I’m an idiot.

OK so I did a bit more playing because this has piqued my interest as a ‘fun’ thing to play with :-)

First off, I further modded the user code to report the byte value seen when non 0xFF. I further modified it to carry on, to see if there were discontinuous errors. I only ever see values of 0 or 0xFF. What is more curious is that I do see discontiguous segments of 0’s. Occasionally for 2 or 3 bytes only:

mmap addr: 0x779ae000
check memory ...FAIL 00 (at byte 96)
FAIL 00 (at byte 97)
FAIL 00 (at byte 416)
FAIL 00 (at byte 417)
FAIL 00 (at byte 418)
FAIL 00 (at byte 419)
FAIL 00 (at byte 420)
FAIL 00 (at byte 421)
FAIL 00 (at byte 422)
FAIL 00 (at byte 423)
FAIL 00 (at byte 424)
FAIL 00 (at byte 425)
FAIL 00 (at byte 426)
FAIL 00 (at byte 427)
FAIL 00 (at byte 428)
FAIL 00 (at byte 429)
FAIL 00 (at byte 430)
FAIL 00 (at byte 431)
FAIL 00 (at byte 432)
FAIL 00 (at byte 433)
FAIL 00 (at byte 434)
FAIL 00 (at byte 435)
FAIL 00 (at byte 436)
FAIL 00 (at byte 437)
FAIL 00 (at byte 438)
FAIL 00 (at byte 439)
FAIL 00 (at byte 440)
FAIL 00 (at byte 441)
FAIL 00 (at byte 442)
FAIL 00 (at byte 443)
FAIL 00 (at byte 444)
FAIL 00 (at byte 445)
FAIL 00 (at byte 446)
FAIL 00 (at byte 447)

I further modified the kernel code to set 2 intermediate values (0 allocated, 55, AA, FF) to see if there was some sort of ‘race’. Again, the only values I saw were 0 or FF.

No idea what this means, or whether it’s helpful or not. :-/

Cheers,

Kevin D-B

012C ACB2 28C6 C53E 9775 9123 B3A2 389B 9DE2 334A
___
Lede-dev mailing list
Lede-dev@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/lede-dev
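For readers who want to reproduce the "carry on and report every bad byte" variant of the check described above, here is a minimal sketch. It is not the code from the mmaptest repository: `struct bad_run` and `find_bad_runs` are hypothetical names chosen for illustration, and the sketch groups consecutive bad bytes into runs instead of printing one line per byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous run of bytes that differ from the expected fill value.
 * (Hypothetical helper type, not taken from the mmaptest sources.) */
struct bad_run {
    size_t start;  /* offset of the first bad byte in the run */
    size_t len;    /* number of consecutive bad bytes */
};

/*
 * Scan buf[0..size) without stopping at the first mismatch and record
 * every run of bytes that are not `expected`.
 * At most max_runs entries are stored in out; the return value is the
 * total number of runs found, which may exceed max_runs.
 */
static size_t find_bad_runs(const uint8_t *buf, size_t size, uint8_t expected,
                            struct bad_run *out, size_t max_runs)
{
    size_t nruns = 0;
    size_t i = 0;

    assert(buf != NULL);

    while (i < size) {
        if (buf[i] == expected) {
            i++;
            continue;
        }

        /* Start of a run: advance until the data matches again. */
        size_t start = i;
        while (i < size && buf[i] != expected)
            i++;

        if (nruns < max_runs) {
            out[nruns].start = start;
            out[nruns].len = i - start;
        }
        nruns++;
    }
    return nruns;
}
```

Applied to a page with the failures listed above, this would report two runs: bytes 96..97 (2 bytes) and bytes 416..447 (32 bytes). Since 416 = 13 × 32, the second run covers exactly one 32-byte block, which fits the cache-line observation made earlier in the thread if the line size really is 32 bytes.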
Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) board.
On 05/08/2018 10:33 PM, Kevin Darbyshire-Bryant wrote:
>> On 8 May 2018, at 21:11, Rosen Penev wrote:
>>
>>> So out of curiosity I built this for my Archer C7 v2 ar71xx device. I also
>>> modified the code to not give up on ‘OK’, so it always iterated 10 times.
>>> I then ran this repeatedly using ‘watch’. Observations:
>>>
>>> 1) Failure only occurred on 1st check, it never appeared/re-appeared on
>>> subsequent passes.
>>> 2) Failure offsets are always at 32 byte intervals. That corresponds
>>> nicely with cache-line size.
>> Yeah the L1
>>>
>>> Grabbing at straws to some extent.
>
> So I modified the user space code a little more, namely moving the usleep to
> before the check (obviously still after the mmap) - I am yet to see an error.
>
> printf("mmap addr: %p\n", addr);
> data = addr;
>
> for (i = 0; i < 10; i++) {
>         usleep(50);
>         check_data(data, page_size);
> }
>
> So that smells more of a race condition between the writer filling with 0xFF
> and the reader catching up.

The open() syscall does the memset(0xff) and blocks. This ensures the memory is
initialized before the open() returns. I don't think there is a race.

> Again, assume that I am an idiot and am missing something fundamental.

--
Regards

Daniel Danzberger
embeDD GmbH, Alter Postplatz 2, CH-6370 Stans
Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) board.
The sender domain has a DMARC Reject/Quarantine policy which disallows sending
mailing list messages using the original "From" header. To mitigate this
problem, the original message has been wrapped automatically by the mailing
list software.

--- Begin Message ---
> On 8 May 2018, at 21:11, Rosen Penev wrote:
>
>> So out of curiosity I built this for my Archer C7 v2 ar71xx device. I also
>> modified the code to not give up on ‘OK’, so it always iterated 10 times. I
>> then ran this repeatedly using ‘watch’. Observations:
>>
>> 1) Failure only occurred on 1st check, it never appeared/re-appeared on
>> subsequent passes.
>> 2) Failure offsets are always at 32 byte intervals. That corresponds nicely
>> with cache-line size.
> Yeah the L1
>>
>> Grabbing at straws to some extent.

So I modified the user space code a little more, namely moving the usleep to before the check (obviously still after the mmap) - I am yet to see an error.

printf("mmap addr: %p\n", addr);
data = addr;

for (i = 0; i < 10; i++) {
        usleep(50);
        check_data(data, page_size);
}

So that smells more of a race condition between the writer filling with 0xFF and the reader catching up.

Again, assume that I am an idiot and am missing something fundamental.
--- End Message ---
Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) board.
On Tue, May 8, 2018 at 1:11 PM, Rosen Penev <ros...@gmail.com> wrote:
> On Tue, May 8, 2018 at 1:08 PM, Kevin Darbyshire-Bryant via Lede-dev
> <lede-dev@lists.infradead.org> wrote:
>> -- Forwarded message --
>> From: Kevin Darbyshire-Bryant <ke...@darbyshire-bryant.me.uk>
>> To: Daniel Danzberger <dan...@dd-wrt.com>
>> Cc: Rosen Penev <ros...@gmail.com>, "lede-dev@lists.infradead.org"
>> <lede-dev@lists.infradead.org>
>> Bcc:
>> Date: Tue, 8 May 2018 20:07:56 +
>> Subject: Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan
>> (8devices) board.
>>
>>> On 8 May 2018, at 11:16, Daniel Danzberger <dan...@dd-wrt.com> wrote:
>>>
>>>>> Did you encounter this issue with kernel 4.9? For me, 4.4 caused no
>>>>> data corruption on my external hard drive.
>>>> I can't tell right now. I was trying to use an older version with kernel
>>>> 4.4, but it fails to build.
>>> I just tested with 4.4 and the problem is present there as well.

Then 3.18. Something tells me the issue is present there as well. I remember the btrfs driver crashing within 10 seconds after mounting a hard drive on that kernel.

>> So out of curiosity I built this for my Archer C7 v2 ar71xx device. I also
>> modified the code to not give up on ‘OK’, so it always iterated 10 times. I
>> then ran this repeatedly using ‘watch’. Observations:
>>
>> 1) Failure only occurred on 1st check, it never appeared/re-appeared on
>> subsequent passes.
>> 2) Failure offsets are always at 32 byte intervals. That corresponds nicely
>> with cache-line size.
> Yeah the L1 cache is not being flushed. This plagued ramips for a while.

That particular issue was that the L1 cache on the first CPU was being flushed but not the second (L1 cache is per core). I tested this on mt7621 and found no errors. MIPS, the gift that keeps on giving...

Now that I think about it, one of the workarounds for mt7621 was to limit the number of CPUs to 1 with the nr_cpus=1 kernel command line flag. This should be valid no matter what though. The issue is probably different.

Yeah I also have the ramips issue backported to kernel 4.9 (generic kernel patch) in my tree and I still see this issue.

Hmm I have an mr3020. Trunk has ath79 with support for it (from what I see). Will report on 4.14.

>> Grabbing at straws to some extent.
>>
>> Cheers,
>>
>> Kevin D-B
>>
>> 012C ACB2 28C6 C53E 9775 9123 B3A2 389B 9DE2 334A
Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) board.
On Tue, May 8, 2018 at 1:08 PM, Kevin Darbyshire-Bryant via Lede-dev
<lede-dev@lists.infradead.org> wrote:
> -- Forwarded message --
> From: Kevin Darbyshire-Bryant <ke...@darbyshire-bryant.me.uk>
> To: Daniel Danzberger <dan...@dd-wrt.com>
> Cc: Rosen Penev <ros...@gmail.com>, "lede-dev@lists.infradead.org"
> <lede-dev@lists.infradead.org>
> Bcc:
> Date: Tue, 8 May 2018 20:07:56 +
> Subject: Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices)
> board.
>
>> On 8 May 2018, at 11:16, Daniel Danzberger <dan...@dd-wrt.com> wrote:
>>
>>>> Did you encounter this issue with kernel 4.9? For me, 4.4 caused no
>>>> data corruption on my external hard drive.
>>> I can't tell right now. I was trying to use an older version with kernel
>>> 4.4, but it fails to build.
>> I just tested with 4.4 and the problem is present there as well.
>
> So out of curiosity I built this for my Archer C7 v2 ar71xx device. I also
> modified the code to not give up on ‘OK’, so it always iterated 10 times. I
> then ran this repeatedly using ‘watch’. Observations:
>
> 1) Failure only occurred on 1st check, it never appeared/re-appeared on
> subsequent passes.
> 2) Failure offsets are always at 32 byte intervals. That corresponds nicely
> with cache-line size.

Yeah the L1

> Grabbing at straws to some extent.
>
> Cheers,
>
> Kevin D-B
>
> 012C ACB2 28C6 C53E 9775 9123 B3A2 389B 9DE2 334A
Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) board.
> On 8 May 2018, at 11:16, Daniel Danzberger wrote:
>
>>> Did you encounter this issue with kernel 4.9? For me, 4.4 caused no
>>> data corruption on my external hard drive.
>> I can't tell right now. I was trying to use an older version with kernel 4.4,
>> but it fails to build.
> I just tested with 4.4 and the problem is present there as well.

So out of curiosity I built this for my Archer C7 v2 ar71xx device. I also modified the code to not give up on ‘OK’, so it always iterated 10 times. I then ran this repeatedly using ‘watch’. Observations:

1) Failure only occurred on 1st check, it never appeared/re-appeared on subsequent passes.
2) Failure offsets are always at 32 byte intervals. That corresponds nicely with cache-line size.

Grabbing at straws to some extent.

Cheers,

Kevin D-B

012C ACB2 28C6 C53E 9775 9123 B3A2 389B 9DE2 334A
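The second observation can be checked mechanically. The sketch below is only an illustration: `all_line_aligned` is a hypothetical helper, and the 32-byte line size is the assumption made in this thread, not a value read from the hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if every failure offset is a multiple of line_size,
 * i.e. each reported failure starts exactly on a (presumed) cache-line
 * boundary. line_size is an assumption supplied by the caller.
 */
static bool all_line_aligned(const size_t *offsets, size_t n, size_t line_size)
{
    assert(line_size > 0);

    for (size_t i = 0; i < n; i++) {
        if (offsets[i] % line_size != 0)
            return false;
    }
    return true;
}
```

With the first-failure offsets from the original report (0, 96, 96, 96, 128) and a line size of 32, the function returns true: 0 = 0 × 32, 96 = 3 × 32 and 128 = 4 × 32. This only shows where the first mismatch in each pass starts; it says nothing about how many bytes in the line are affected.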
Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) board.
On 05/07/2018 01:16 PM, Daniel Danzberger wrote:
> On 05/06/2018 09:00 PM, Rosen Penev wrote:
>> On Sun, May 6, 2018 at 10:08 AM, Rosen Penev wrote:
>>> On Sun, May 6, 2018 at 3:52 AM, Daniel Danzberger wrote:
>>>> MMAP'ed memory that has been allocated via 'get_zeroed_page(GFP_KERNEL)' or
>>>> 'vmalloc()' doesn't always contain the same data when accessed from
>>>> userspace.
>>>>
>>>> This means all userspace programs using mmap to access kernel memory aren't
>>>> always working properly on the Rambutan board. I am currently testing if
>>>> other ar71xx devices are affected as well.
>>>>
>>>> I first noticed this when using ALSA's mmap api to capture audio.
>>>>
>>>> Here is the feed for a kmod + userspace util to reproduce the issue:
>>>> g...@github.com:dddaniel/mmaptest.git
>>>>
>>>> The kernel module simply allocates a page and initializes it with 0xff.
>>>> The userspace application then mmap's and reads this page 10 times with a
>>>> 500ms delay and checks if all data is 0xff.
>>>>
>>>> ---
>>>> root@OpenWrt:/# mmaptest-user
>>>> mmap addr: 0x77a04000
>>>> [ 760.464968] mmap page 7573000 at va 87573000
>>>> check memory ...FAIL (at byte 0)
>>>> check memory ...FAIL (at byte 96)
>>>> check memory ...FAIL (at byte 96)
>>>> check memory ...FAIL (at byte 96)
>>>> check memory ...FAIL (at byte 128)
>>>> ---
>>>>
>>>> I have no idea what's causing it. Does anybody have a hint on how to fix
>>>> this?
>>> Try reverting
>>> https://github.com/torvalds/linux/commit/c00ab4896ed5f7d89af6f90b809e2c0197c6d170
>> Disregard that. That commit should have no impact.
>>
>> I just tested it on an Archer C7v4 and the issue is present as well.
>> This was probably causing data corruption for me when I used an
>> external hard drive...
>>
>> Strange that sometimes it works and sometimes not.
> Yes, the same in my tests.
>> Looks like this issue appears on all ar71xx devices.
>>
>> Did you encounter this issue with kernel 4.9? For me, 4.4 caused no
>> data corruption on my external hard drive.
> I can't tell right now. I was trying to use an older version with kernel 4.4,
> but it fails to build.

I just tested with 4.4 and the problem is present there as well.

>> There seems to be a pattern with kernel 4.9 breaking various MIPS devices...

--
Regards

Daniel Danzberger
embeDD GmbH, Alter Postplatz 2, CH-6370 Stans
Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) board.
On 05/06/2018 09:00 PM, Rosen Penev wrote:
> On Sun, May 6, 2018 at 10:08 AM, Rosen Penev wrote:
>> On Sun, May 6, 2018 at 3:52 AM, Daniel Danzberger wrote:
>>> MMAP'ed memory that has been allocated via 'get_zeroed_page(GFP_KERNEL)' or
>>> 'vmalloc()' doesn't always contain the same data when accessed from
>>> userspace.
>>>
>>> This means all userspace programs using mmap to access kernel memory aren't
>>> always working properly on the Rambutan board. I am currently testing if
>>> other ar71xx devices are affected as well.
>>>
>>> I first noticed this when using ALSA's mmap api to capture audio.
>>>
>>> Here is the feed for a kmod + userspace util to reproduce the issue:
>>> g...@github.com:dddaniel/mmaptest.git
>>>
>>> The kernel module simply allocates a page and initializes it with 0xff.
>>> The userspace application then mmap's and reads this page 10 times with a
>>> 500ms delay and checks if all data is 0xff.
>>>
>>> ---
>>> root@OpenWrt:/# mmaptest-user
>>> mmap addr: 0x77a04000
>>> [ 760.464968] mmap page 7573000 at va 87573000
>>> check memory ...FAIL (at byte 0)
>>> check memory ...FAIL (at byte 96)
>>> check memory ...FAIL (at byte 96)
>>> check memory ...FAIL (at byte 96)
>>> check memory ...FAIL (at byte 128)
>>> ---
>>>
>>> I have no idea what's causing it. Does anybody have a hint on how to fix
>>> this?
>> Try reverting
>> https://github.com/torvalds/linux/commit/c00ab4896ed5f7d89af6f90b809e2c0197c6d170
> Disregard that. That commit should have no impact.
>
> I just tested it on an Archer C7v4 and the issue is present as well.
> This was probably causing data corruption for me when I used an
> external hard drive...
>
> Strange that sometimes it works and sometimes not.

Yes, the same in my tests.

> Looks like this issue appears on all ar71xx devices.
>
> Did you encounter this issue with kernel 4.9? For me, 4.4 caused no
> data corruption on my external hard drive.

I can't tell right now. I was trying to use an older version with kernel 4.4, but it fails to build.

> There seems to be a pattern with kernel 4.9 breaking various MIPS devices...

--
Regards

Daniel Danzberger
embeDD GmbH, Alter Postplatz 2, CH-6370 Stans
Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) board.
On Sun, May 6, 2018 at 3:52 AM, Daniel Danzberger wrote:
> MMAP'ed memory that has been allocated via 'get_zeroed_page(GFP_KERNEL)' or
> 'vmalloc()' doesn't always contain the same data when accessed from userspace.
>
> This means all userspace programs using mmap to access kernel memory aren't
> always working properly on the Rambutan board. I am currently testing if other
> ar71xx devices are affected as well.
>
> I first noticed this when using ALSA's mmap api to capture audio.
>
> Here is the feed for a kmod + userspace util to reproduce the issue:
> g...@github.com:dddaniel/mmaptest.git
>
> The kernel module simply allocates a page and initializes it with 0xff.
> The userspace application then mmap's and reads this page 10 times with a
> 500ms delay and checks if all data is 0xff.
>
> ---
> root@OpenWrt:/# mmaptest-user
> mmap addr: 0x77a04000
> [ 760.464968] mmap page 7573000 at va 87573000
> check memory ...FAIL (at byte 0)
> check memory ...FAIL (at byte 96)
> check memory ...FAIL (at byte 96)
> check memory ...FAIL (at byte 96)
> check memory ...FAIL (at byte 128)
> ---
>
> I have no idea what's causing it. Does anybody have a hint on how to fix this?

Try reverting
https://github.com/torvalds/linux/commit/c00ab4896ed5f7d89af6f90b809e2c0197c6d170

> --
> Regards
>
> Daniel Danzberger
> embeDD GmbH, Alter Postplatz 2, CH-6370 Stans
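For anyone without the mmaptest sources at hand, the userspace side of the test described in the original report can be sketched roughly as follows. This is an assumption-laden illustration, not the code from the repository: `check_mapped` is a hypothetical helper, it does a single pass instead of ten passes with a delay, and the path to map is left to the caller because the device node name used by the kernel module is not given in this thread.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the first `len` bytes of `path` read-only and compare every byte
 * against `expected`.
 * Returns the offset of the first mismatching byte, `len` if all bytes
 * match, or -1 if open() or mmap() fails.
 */
static long check_mapped(const char *path, size_t len, uint8_t expected)
{
    assert(len > 0);  /* mmap() rejects zero-length mappings */

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    uint8_t *data = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping remains valid after the descriptor is closed */
    if (data == MAP_FAILED) {
        perror("mmap");
        return -1;
    }

    long first_bad = (long)len;
    for (size_t i = 0; i < len; i++) {
        if (data[i] != expected) {
            first_bad = (long)i;
            break;
        }
    }

    munmap(data, len);
    return first_bad;
}
```

On an affected board one would call this on the module's device node with the page size and 0xff, and treat any return value other than the page size as a failure. The same function can be tried against an ordinary file filled with 0xff to confirm that the checker itself behaves.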