Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) board.

2018-05-10 Thread Kevin Darbyshire-Bryant


> On 9 May 2018, at 11:27, Daniel Danzberger  wrote:
> 
> On 05/
>> 
>> 
>> So that smells more of a race condition between the writer filling with 0xFF 
>> and the reader catching up.
> The open() syscall does the memset(0xff) and blocks. This ensures the memory 
> is
> initialized before the open() returns. I don't think there is a race.
>> 
>> Again, assume that I am an idiot and am missing something fundamental.

I did say assume I’m an idiot.  OK so I did a bit more playing because this has 
piqued my interest as a ‘fun’ thing to play with :-)

First off, I further modded the user code to report the byte value seen when 
non 0xFF.  I further modified it to carry on, to see if there were 
discontinuous errors.

I only ever see values of 0 or 0xFF.  What is more curious is that I do see 
discontiguous segments of 0’s.  Occasionally for 2 or 3 bytes only:

mmap addr: 0x779ae000
check memory ...FAIL 00 (at byte 96)
FAIL 00 (at byte 97)
FAIL 00 (at byte 416)
FAIL 00 (at byte 417)
FAIL 00 (at byte 418)
FAIL 00 (at byte 419)
FAIL 00 (at byte 420)
FAIL 00 (at byte 421)
FAIL 00 (at byte 422)
FAIL 00 (at byte 423)
FAIL 00 (at byte 424)
FAIL 00 (at byte 425)
FAIL 00 (at byte 426)
FAIL 00 (at byte 427)
FAIL 00 (at byte 428)
FAIL 00 (at byte 429)
FAIL 00 (at byte 430)
FAIL 00 (at byte 431)
FAIL 00 (at byte 432)
FAIL 00 (at byte 433)
FAIL 00 (at byte 434)
FAIL 00 (at byte 435)
FAIL 00 (at byte 436)
FAIL 00 (at byte 437)
FAIL 00 (at byte 438)
FAIL 00 (at byte 439)
FAIL 00 (at byte 440)
FAIL 00 (at byte 441)
FAIL 00 (at byte 442)
FAIL 00 (at byte 443)
FAIL 00 (at byte 444)
FAIL 00 (at byte 445)
FAIL 00 (at byte 446)
FAIL 00 (at byte 447)

I further modified the kernel code to set 2 intermediate values (0 as 
allocated, then 0x55, 0xAA, 0xFF) to see if there was some sort of ‘race’.  
Again, the only values I saw were 0 or 0xFF.  No idea what this means, or 
whether it’s helpful or not. :-/
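
For reference, the modified userspace check described above can be sketched 
roughly as follows.  This is a minimal sketch, not the actual mmaptest code: 
the name check_data_verbose and its signature are hypothetical, and it only 
assumes that the check scans the mapped page byte by byte, prints every 
unexpected value, and carries on rather than stopping at the first mismatch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not taken from mmaptest): scan `len` bytes, print
 * every byte that differs from `expected`, and keep going instead of
 * stopping at the first mismatch. Returns the number of mismatching bytes.
 */
static size_t check_data_verbose(const volatile unsigned char *data,
                                 size_t len, unsigned char expected)
{
    size_t failures = 0;

    assert(data != NULL);

    for (size_t i = 0; i < len; i++) {
        unsigned char seen = data[i];   /* read each byte exactly once */

        if (seen != expected) {
            printf("FAIL %02X (at byte %zu)\n", seen, i);
            failures++;
        }
    }
    return failures;
}
```

Calling it as check_data_verbose(data, page_size, 0xFF) on the mapped page 
would print one line per bad byte in the same style as the log above and 
return how many bytes were wrong.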


Cheers,

Kevin D-B

012C ACB2 28C6 C53E 9775  9123 B3A2 389B 9DE2 334A



signature.asc
Description: Message signed with OpenPGP
___
Lede-dev mailing list
Lede-dev@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/lede-dev


Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) board.

2018-05-09 Thread Daniel Danzberger
On 05/08/2018 10:33 PM, Kevin Darbyshire-Bryant wrote:
> 
> 
>> On 8 May 2018, at 21:11, Rosen Penev  wrote:
>>
>>>
>>> So out of curiosity I built this for my Archer C7 v2 ar71xx device.  I also 
>>> modified the code to not give up on ‘OK’, so it always iterated 10 times.  
>>> I then ran this repeatedly using ‘watch’.  Observations:
>>>
>>> 1) Failure only occurred on 1st check, it never appeared/re-appeared on 
>>> subsequent passes.
>>> 2) Failure offsets are always at 32 byte intervals.  That corresponds 
>>> nicely with cache-line size.
>> Yeah the L1
>>>
>>> Grabbing at straws to some extent.
> 
> So I modified the user space code a little more, namely moving the usleep to 
> before the check (obviously still after the mmap) - I am yet to see an error.
> 
>     printf("mmap addr: %p\n", addr);
>     data = addr;
>
>     for (i = 0; i < 10; i++) {
>         usleep(50);
>         check_data(data, page_size);
>     }
> 
> So that smells more of a race condition between the writer filling with 0xFF 
> and the reader catching up.
The open() syscall does the memset(0xff) and blocks. This ensures the memory is
initialized before the open() returns. I don't think there is a race.
> 
> Again, assume that I am an idiot and am missing something fundamental.
> 
> 
> 

-- 
Regards

Daniel Danzberger
embeDD GmbH, Alter Postplatz 2, CH-6370 Stans



Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) board.

2018-05-08 Thread Kevin Darbyshire-Bryant via Lede-dev
The sender domain has a DMARC Reject/Quarantine policy which disallows
sending mailing list messages using the original "From" header.

To mitigate this problem, the original message has been wrapped
automatically by the mailing list software.

--- Begin Message ---


> On 8 May 2018, at 21:11, Rosen Penev  wrote:
> 
>> 
>> So out of curiosity I built this for my Archer C7 v2 ar71xx device.  I also 
>> modified the code to not give up on ‘OK’, so it always iterated 10 times.  I 
>> then ran this repeatedly using ‘watch’.  Observations:
>> 
>> 1) Failure only occurred on 1st check, it never appeared/re-appeared on 
>> subsequent passes.
>> 2) Failure offsets are always at 32 byte intervals.  That corresponds nicely 
>> with cache-line size.
> Yeah the L1
>> 
>> Grabbing at straws to some extent.

So I modified the user space code a little more, namely moving the usleep to 
before the check (obviously still after the mmap) - I am yet to see an error.

    printf("mmap addr: %p\n", addr);
    data = addr;

    for (i = 0; i < 10; i++) {
        usleep(50);
        check_data(data, page_size);
    }

So that smells more of a race condition between the writer filling with 0xFF 
and the reader catching up.

Again, assume that I am an idiot and am missing something fundamental.
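
For anyone wanting to reproduce the reordered test, here is a slightly fuller 
sketch of the sleep-then-check loop.  It is an illustration under stated 
assumptions rather than the mmaptest source: first_bad_byte and 
check_after_delay are hypothetical names, and the function takes an already 
open file descriptor because the device node used by the kmod is not named in 
this thread.

```c
#define _DEFAULT_SOURCE   /* for usleep() under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return the offset of the first byte that is not 0xFF, or -1 if all match. */
static long first_bad_byte(const volatile unsigned char *data, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (data[i] != 0xFF)
            return (long)i;
    return -1;
}

/*
 * Hypothetical helper: map `len` bytes of `fd`, then run `passes` checks,
 * sleeping `delay_us` microseconds BEFORE each check (after the mmap).
 * Returns the number of failed passes, or -1 if mmap() fails.
 */
static int check_after_delay(int fd, size_t len, int passes,
                             unsigned int delay_us)
{
    int failed = 0;
    void *addr;

    assert(passes > 0);

    addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        return -1;
    }
    printf("mmap addr: %p\n", addr);

    for (int i = 0; i < passes; i++) {
        long bad;

        usleep(delay_us);               /* delay first, then check */
        bad = first_bad_byte(addr, len);
        if (bad >= 0) {
            printf("check memory ...FAIL (at byte %ld)\n", bad);
            failed++;
        } else {
            printf("check memory ...OK\n");
        }
    }

    munmap(addr, len);
    return failed;
}
```

With the fd of the test device, check_after_delay(fd, page_size, 10, 50) 
mirrors the loop above.  Counting failed passes instead of exiting early makes 
it easy to see whether only the first pass is affected.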





--- End Message ---


Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) board.

2018-05-08 Thread Rosen Penev
On Tue, May 8, 2018 at 1:11 PM, Rosen Penev <ros...@gmail.com> wrote:
> On Tue, May 8, 2018 at 1:08 PM, Kevin Darbyshire-Bryant via Lede-dev
> <lede-dev@lists.infradead.org> wrote:
>> The sender domain has a DMARC Reject/Quarantine policy which disallows
>> sending mailing list messages using the original "From" header.
>>
>> To mitigate this problem, the original message has been wrapped
>> automatically by the mailing list software.
>>
>> -- Forwarded message --
>> From: Kevin Darbyshire-Bryant <ke...@darbyshire-bryant.me.uk>
>> To: Daniel Danzberger <dan...@dd-wrt.com>
>> Cc: Rosen Penev <ros...@gmail.com>, "lede-dev@lists.infradead.org" 
>> <lede-dev@lists.infradead.org>
>> Bcc:
>> Date: Tue, 8 May 2018 20:07:56 +
>> Subject: Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan 
>> (8devices) board.
>>
>>
>>> On 8 May 2018, at 11:16, Daniel Danzberger <dan...@dd-wrt.com> wrote:
>>>
>>>>>
>> 
>>>>> Did you encounter this issue with kernel 4.9? For me, 4.4 caused no
>>>>> data corruption on my external hard drive.
>>>> I can't tell right now. I was trying to use an older version with kernel 
>>>> 4.4,
>>>> but it fails to build.
>>> I just tested with 4.4 and the problem is present there as well.
Then 3.18. Something tells me the issue is present there as well. I
remember the btrfs driver crashing within 10 seconds after mounting a
hard drive on that kernel.
>>>>>
>>
>> So out of curiosity I built this for my Archer C7 v2 ar71xx device.  I also 
>> modified the code to not give up on ‘OK’, so it always iterated 10 times.  I 
>> then ran this repeatedly using ‘watch’.  Observations:
>>
>> 1) Failure only occurred on 1st check, it never appeared/re-appeared on 
>> subsequent passes.
>> 2) Failure offsets are always at 32 byte intervals.  That corresponds nicely 
>> with cache-line size.
> Yeah the L1
cache is not being flushed. This plagued ramips for a while. That
particular issue was that the L1 cache on the first CPU was being
flushed but not the second (L1 cache is per core).

I tested this on mt7621 and found no errors.

MIPS, the gift that keeps on giving...

Now that I think about it, one of the workarounds for mt7621 was to
limit the number of CPUs to 1 with the nr_cpus=1 kernel command line
parameter. This should be valid no matter what though. The issue is
probably different. I also have the fix for the ramips issue backported
to kernel 4.9 (generic kernel patch) in my tree and I still see this
issue.

Hmm, I have an mr3020. Trunk has ath79 with support for it (from what I
see). Will report on 4.14.
>>
>> Grabbing at straws to some extent.
>>
>>
>> Cheers,
>>
>> Kevin D-B
>>
>> 012C ACB2 28C6 C53E 9775  9123 B3A2 389B 9DE2 334A
>>
>>


Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) board.

2018-05-08 Thread Rosen Penev
On Tue, May 8, 2018 at 1:08 PM, Kevin Darbyshire-Bryant via Lede-dev
<lede-dev@lists.infradead.org> wrote:
> The sender domain has a DMARC Reject/Quarantine policy which disallows
> sending mailing list messages using the original "From" header.
>
> To mitigate this problem, the original message has been wrapped
> automatically by the mailing list software.
>
> -- Forwarded message --
> From: Kevin Darbyshire-Bryant <ke...@darbyshire-bryant.me.uk>
> To: Daniel Danzberger <dan...@dd-wrt.com>
> Cc: Rosen Penev <ros...@gmail.com>, "lede-dev@lists.infradead.org" 
> <lede-dev@lists.infradead.org>
> Bcc:
> Date: Tue, 8 May 2018 20:07:56 +
> Subject: Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) 
> board.
>
>
>> On 8 May 2018, at 11:16, Daniel Danzberger <dan...@dd-wrt.com> wrote:
>>
>>>>
> 
>>>> Did you encounter this issue with kernel 4.9? For me, 4.4 caused no
>>>> data corruption on my external hard drive.
>>> I can't tell right now. I was trying to use an older version with kernel 
>>> 4.4,
>>> but it fails to build.
>> I just tested with 4.4 and the problem is present there as well.
>>>>
>
> So out of curiosity I built this for my Archer C7 v2 ar71xx device.  I also 
> modified the code to not give up on ‘OK’, so it always iterated 10 times.  I 
> then ran this repeatedly using ‘watch’.  Observations:
>
> 1) Failure only occurred on 1st check, it never appeared/re-appeared on 
> subsequent passes.
> 2) Failure offsets are always at 32 byte intervals.  That corresponds nicely 
> with cache-line size.
Yeah the L1
>
> Grabbing at straws to some extent.
>
>
> Cheers,
>
> Kevin D-B
>
> 012C ACB2 28C6 C53E 9775  9123 B3A2 389B 9DE2 334A
>
>


Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) board.

2018-05-08 Thread Kevin Darbyshire-Bryant via Lede-dev
The sender domain has a DMARC Reject/Quarantine policy which disallows
sending mailing list messages using the original "From" header.

To mitigate this problem, the original message has been wrapped
automatically by the mailing list software.

--- Begin Message ---


> On 8 May 2018, at 11:16, Daniel Danzberger  wrote:
> 
>>> 

>>> Did you encounter this issue with kernel 4.9? For me, 4.4 caused no
>>> data corruption on my external hard drive.
>> I can't tell right now. I was trying to use an older version with kernel 4.4,
>> but it fails to build.
> I just tested with 4.4 and the problem is present there as well.
>>> 

So out of curiosity I built this for my Archer C7 v2 ar71xx device.  I also 
modified the code to not give up on ‘OK’, so it always iterated 10 times.  I 
then ran this repeatedly using ‘watch’.  Observations:

1) Failure only occurred on 1st check, it never appeared/re-appeared on 
subsequent passes.
2) Failure offsets are always at 32 byte intervals.  That corresponds nicely 
with cache-line size.

Grabbing at straws to some extent.
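
To make the cache-line observation concrete, here is a small sketch that maps 
a failing byte offset to a cache-line index.  The helper names are 
hypothetical, and the 32-byte line size is an assumption taken from the 
failure pattern, not something read from the CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Index of the cache line that holds byte `offset`, for a given line size. */
static size_t cache_line_index(size_t offset, size_t line_size)
{
    assert(line_size > 0);
    return offset / line_size;
}

/* Nonzero if `offset` is the first byte of a cache line. */
static int starts_cache_line(size_t offset, size_t line_size)
{
    assert(line_size > 0);
    return offset % line_size == 0;
}

/* Print where a failing offset falls relative to cache-line boundaries. */
static void describe_offset(size_t offset, size_t line_size)
{
    printf("byte %zu: line %zu, %s\n", offset,
           cache_line_index(offset, line_size),
           starts_cache_line(offset, line_size) ? "line start" : "mid-line");
}
```

With a 32-byte line, the offsets 0, 96 and 128 from the original report are 
the first bytes of lines 0, 3 and 4, which is what one would expect if whole 
cache lines are stale rather than individual bytes.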


Cheers,

Kevin D-B

012C ACB2 28C6 C53E 9775  9123 B3A2 389B 9DE2 334A



--- End Message ---


Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) board.

2018-05-08 Thread Daniel Danzberger
On 05/07/2018 01:16 PM, Daniel Danzberger wrote:
> On 05/06/2018 09:00 PM, Rosen Penev wrote:
>> On Sun, May 6, 2018 at 10:08 AM, Rosen Penev  wrote:
>>> On Sun, May 6, 2018 at 3:52 AM, Daniel Danzberger  wrote:
>>>> MMAP'ed memory that has been allocated via 'get_zeroed_page(GFP_KERNEL)' or
>>>> 'vmalloc()' doesn't always contain the same data when accessed from
>>>> userspace.
>>>>
>>>> This means all userspace programs using mmap to access kernel memory aren't
>>>> always working properly on the Rambutan board. I am currently testing if
>>>> other ar71xx devices are affected as well.
>>>>
>>>> I first noticed this when using ALSA's mmap api to capture audio.
>>>>
>>>> Here is the feed for a kmod + userspace util to reproduce the issue:
>>>> g...@github.com:dddaniel/mmaptest.git
>>>>
>>>> The kernel module simply allocates a page and initializes it with 0xff.
>>>> The userspace application then mmap's and reads this page 10 times with a
>>>> 500ms delay and checks if all data is 0xff.
>>>>
>>>> ---
>>>> root@OpenWrt:/# mmaptest-user
>>>> mmap addr: 0x77a04000
>>>> [  760.464968] mmap page 7573000 at va 87573000
>>>> check memory ...FAIL (at byte 0)
>>>> check memory ...FAIL (at byte 96)
>>>> check memory ...FAIL (at byte 96)
>>>> check memory ...FAIL (at byte 96)
>>>> check memory ...FAIL (at byte 128)
>>>> ---
>>>>
>>>> I have no idea what's causing it. Does anybody have a hint on how to fix
>>>> this?
>>>>
>>> Try reverting 
>>> https://github.com/torvalds/linux/commit/c00ab4896ed5f7d89af6f90b809e2c0197c6d170
>> Disregard that. That commit should have no impact.
>>
>> I just tested it on an Archer C7v4 and the issue is present as well.
>> This was probably causing data corruption for me when I used an
>> external hard drive...
>>
>> Strange that sometimes it works and sometimes not.
> Yes, the same in my tests. Looks like this issue appears on all ar71xx
> devices.
>>
>> Did you encounter this issue with kernel 4.9? For me, 4.4 caused no
>> data corruption on my external hard drive.
> I can't tell right now. I was trying to use an older version with kernel 4.4,
> but it fails to build.
I just tested with 4.4 and the problem is present there as well.
>>
>> There seems to be a pattern with kernel 4.9 breaking various MIPS devices...
> 
>>>> --
>>>> Regards
>>>>
>>>> Daniel Danzberger
>>>> embeDD GmbH, Alter Postplatz 2, CH-6370 Stans
>>>>
>>
> 

-- 
Regards

Daniel Danzberger
embeDD GmbH, Alter Postplatz 2, CH-6370 Stans



Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) board.

2018-05-07 Thread Daniel Danzberger
On 05/06/2018 09:00 PM, Rosen Penev wrote:
> On Sun, May 6, 2018 at 10:08 AM, Rosen Penev  wrote:
>> On Sun, May 6, 2018 at 3:52 AM, Daniel Danzberger  wrote:
>>> MMAP'ed memory that has been allocated via 'get_zeroed_page(GFP_KERNEL)' or
>>> 'vmalloc()' doesn't always contain the same data when accessed from 
>>> userspace.
>>>
>>> This means all userspace programs using mmap to access kernel memory aren't
>>> always working properly on the Rambutan board. I am currently testing if 
>>> other
>>> ar71xx devices are affected as well.
>>>
>>> I first noticed this when using ALSA's mmap api to capture audio.
>>>
>>> Here is the feed for a kmod + userspace util to reproduce the issue:
>>> g...@github.com:dddaniel/mmaptest.git
>>>
>>> The kernel module simply allocates a page and initializes it with 0xff.
>>> The userspace application then mmap's and reads this page 10 times with a
>>> 500ms delay and checks if all data is 0xff.
>>>
>>> ---
>>> root@OpenWrt:/# mmaptest-user
>>> mmap addr: 0x77a04000
>>> [  760.464968] mmap page 7573000 at va 87573000
>>> check memory ...FAIL (at byte 0)
>>> check memory ...FAIL (at byte 96)
>>> check memory ...FAIL (at byte 96)
>>> check memory ...FAIL (at byte 96)
>>> check memory ...FAIL (at byte 128)
>>> ---
>>>
>>> I have no idea what's causing it. Does anybody have a hint on how to fix
>>> this?
>>>
>> Try reverting 
>> https://github.com/torvalds/linux/commit/c00ab4896ed5f7d89af6f90b809e2c0197c6d170
> Disregard that. That commit should have no impact.
> 
> I just tested it on an Archer C7v4 and the issue is present as well.
> This was probably causing data corruption for me when I used an
> external hard drive...
> 
> Strange that sometimes it works and sometimes not.
Yes, the same in my tests. Looks like this issue appears on all ar71xx
devices.
> 
> Did you encounter this issue with kernel 4.9? For me, 4.4 caused no
> data corruption on my external hard drive.
I can't tell right now. I was trying to use an older version with kernel 4.4,
but it fails to build.
> 
> There seems to be a pattern with kernel 4.9 breaking various MIPS devices...

>>> --
>>> Regards
>>>
>>> Daniel Danzberger
>>> embeDD GmbH, Alter Postplatz 2, CH-6370 Stans
>>>
> 

-- 
Regards

Daniel Danzberger
embeDD GmbH, Alter Postplatz 2, CH-6370 Stans



Re: [LEDE-DEV] MMAP memory out of sync on AR71xx Rambutan (8devices) board.

2018-05-06 Thread Rosen Penev
On Sun, May 6, 2018 at 3:52 AM, Daniel Danzberger  wrote:
> MMAP'ed memory that has been allocated via 'get_zeroed_page(GFP_KERNEL)' or
> 'vmalloc()' doesn't always contain the same data when accessed from userspace.
>
> This means all userspace programs using mmap to access kernel memory aren't
> always working properly on the Rambutan board. I am currently testing if other
> ar71xx devices are affected as well.
>
> I first noticed this when using ALSA's mmap api to capture audio.
>
> Here is the feed for a kmod + userspace util to reproduce the issue:
> g...@github.com:dddaniel/mmaptest.git
>
> The kernel module simply allocates a page and initializes it with 0xff.
> The userspace application then mmap's and reads this page 10 times with a
> 500ms delay and checks if all data is 0xff.
>
> ---
> root@OpenWrt:/# mmaptest-user
> mmap addr: 0x77a04000
> [  760.464968] mmap page 7573000 at va 87573000
> check memory ...FAIL (at byte 0)
> check memory ...FAIL (at byte 96)
> check memory ...FAIL (at byte 96)
> check memory ...FAIL (at byte 96)
> check memory ...FAIL (at byte 128)
> ---
>
> I have no idea what's causing it. Does anybody have a hint on how to fix this?
>
Try reverting 
https://github.com/torvalds/linux/commit/c00ab4896ed5f7d89af6f90b809e2c0197c6d170
> --
> Regards
>
> Daniel Danzberger
> embeDD GmbH, Alter Postplatz 2, CH-6370 Stans
>