Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-09 Thread Arnd Bergmann
On Wed, Aug 8, 2018 at 11:51 PM Arnd Bergmann wrote:

> I already found a couple of things:
>
> - Failure to copy always happens at the *end* of a 16 byte aligned
>   physical address, it misses between 1 and 7 bytes, never 8 or more,
>   and it's more likely to be fewer bytes that are affected.
>
> - The first byte that fails to get copied is always 16 bytes after the
>   memcpy target. Since we only observe it at the end of the 16 byte
>   range, it means this happens specifically for addresses ending in
>   0x9 (7 bytes missed) to 0xf (1 byte missed).
>
> - Out of 7445 corruptions, 4358 were of the kind that misses a copy at the
>   end of a 16-byte area, they were for copies between 41 and 64 bytes,
>   more to the larger end of the scale (note that with your test program,
>   smaller memcpys happen more frequently than larger ones).

Thinking about it some more, this scenario can be explained by a
read-modify-write logic gone wrong somewhere in the hardware,
leading to the original bytes being written back after we write the
correct data.

The code path we hit most commonly in glibc is like this one:

// offset = 0xd, could be 0x9..0xf + n*0x10
// length = 0x3f, could be 0x29..0x3f
memcpy(map + 0xd, data + 0xd, 0x3f);

   stp B_l, B_h, [dstin, 16]   # offset 0x1d
   stp C_l, C_h, [dstend, -32]  # offset 0x2c
   stp A_l, A_h, [dstin] # offset 0x0d
   stp D_l, D_h, [dstend, -16]  # offset 0x3c

The corruption here always appears in bytes 0x1d..0x1f. A theory
that matches this corruption is that the stores for B, C and D get
combined into a single write transaction of length 0x2f, spanning bytes
0x1d..0x4b in the map. This may prefetch either 8 bytes at 0x18 or
16 bytes at 0x10 into a temporary HW buffer, which gets modified
with the correct data for 0x1d..0x1f before writing back that
prefetched data.

The key here is the write of A to offset 0x0d..0x1c. This also
prefetches the data at 0x18..0x1f, and modifies bytes 0x18..0x1c
in it. When this is prefetched before the first write, but written
back after it, offsets 0x1d..0x1f have the original data again!
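
To make the offsets in these listings easier to re-derive, here is a
minimal C sketch that models where the four 16-byte stp stores of the
33..64 byte path land. It only assumes the layout shown in the listings
(A at dstin, B at dstin+16, C at dstend-32, D at dstend-16); the struct
and function names are made up for illustration and are not glibc code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: first byte written by each of the four stores. */
struct stp_layout {
    size_t a, b, c, d;
};

/* dst_off: offset of the memcpy target in the map, len: copy length */
static struct stp_layout stp_layout_for(size_t dst_off, size_t len)
{
    struct stp_layout l;
    size_t dst_end = dst_off + len;

    /* B and C are only stored for copies of 33..64 bytes */
    assert(len > 32 && len <= 64);
    l.a = dst_off;        /* stp A_l, A_h, [dstin]       */
    l.b = dst_off + 16;   /* stp B_l, B_h, [dstin, 16]   */
    l.c = dst_end - 32;   /* stp C_l, C_h, [dstend, -32] */
    l.d = dst_end - 16;   /* stp D_l, D_h, [dstend, -16] */
    return l;
}

/* Number of bytes written by both 16-byte stores starting at x and y. */
static size_t stp_overlap(size_t x, size_t y)
{
    size_t lo = x > y ? x : y;          /* later start  */
    size_t hi = (x < y ? x : y) + 16;   /* earlier end  */

    return hi > lo ? hi - lo : 0;
}
```

Calling stp_layout_for(0x0d, 0x3f) gives the four offsets used in the
first listing, and stp_overlap() on any two of them gives the number of
bytes that both stores write.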

Variations that trigger the same thing include the modified
sequence:

   stp C_l, C_h, [dstend, -32]  # offset 0x2c
   stp B_l, B_h, [dstin, 16]   # offset 0x1d
   stp D_l, D_h, [dstend, -16]  # offset 0x3c
   stp A_l, A_h, [dstin] # offset 0x0d

and the special case for 64 byte memcpy that uses a completely
different sequence, either (original, corruption is common for 64 byte)

   stp A_l, A_h, [dstin]        # offset 0x0d
   stp B_l, B_h, [dstin, 16]    # offset 0x1d
   stp C_l, C_h, [dstin, 32]    # offset 0x2d
   stp D_l, D_h, [dstin, 48]    # offset 0x3d
   stp E_l, E_h, [dstend, -32]  # offset 0x2d again
   stp F_l, F_h, [dstend, -16]  # offset 0x3d again

or (patched libc, corruption happens very rarely for 64 byte
compared to other sizes)

   stp E_l, E_h, [dstend, -32]  # offset 0x2d
   stp F_l, F_h, [dstend, -16]  # offset 0x3d
   stp A_l, A_h, [dstin]        # offset 0x0d
   stp B_l, B_h, [dstin, 16]    # offset 0x1d
   stp C_l, C_h, [dstin, 32]    # offset 0x2d again
   stp D_l, D_h, [dstin, 48]    # offset 0x3d again

The corruption for both also happens at 0x1d..0x1f, which unfortunately
is not easily explained by the theory above, but maybe my glibc sources
are slightly different from the ones that were used on the system.

> - All corruption with data copied to the wrong place happened for copies
>   between 33 and 47 bytes, mostly to the smaller end of the scale:
> 391 0x21
> 360 0x22
...
>  33 0x2e
>   1 0x2f
>
> - One common (but not the only, still investigating) case for data getting
>   written to the wrong place is:
>    * corruption starts 16 bytes after the memcpy start
>    * corrupt bytes are the same as the bytes written to the start
>    * start address ends in 0x1 through 0x7
>    * length of corruption is at most memcpy length - 32, always
>      between 1 and 7.

This is only observed with the original sequence (B, C, A, D) in
glibc, and only when C overlaps with both A and B. A typical
example would be

// offset = 0x12, can be [0x01..0x07,0x09..0x0f] + n*0x10
// length = 0x23, could be 0x21..0x2f
memcpy(map + 0x12, data + 0x12, 0x23);

   stp B_l, B_h, [dstin, 16]   # offset 0x22
   stp C_l, C_h, [dstend, -32]  # offset 0x15
   stp A_l, A_h, [dstin] # offset 0x12
   stp D_l, D_h, [dstend, -16]  # offset 0x25

In this example, bytes 0x22..0x24 incorrectly contain the data that
was written to bytes 0x12..0x14. I would guess that only the stores
to C and D get combined here, so we actually have three separate
store transactions rather than the two in 

Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Arnd Bergmann
On Wed, Aug 8, 2018 at 8:25 PM Mikulas Patocka wrote:
> On Wed, 8 Aug 2018, Arnd Bergmann wrote:
>
> > On Wed, Aug 8, 2018 at 5:15 PM Catalin Marinas wrote:
> > >
> > > On Wed, Aug 08, 2018 at 04:01:12PM +0100, Richard Earnshaw wrote:
> > > > On 08/08/18 15:12, Mikulas Patocka wrote:
> > > > > On Wed, 8 Aug 2018, Catalin Marinas wrote:
> > > > >> On Fri, Aug 03, 2018 at 01:09:02PM -0400, Mikulas Patocka wrote:
> > > - failing to write a few bytes
> > > - writing a few bytes that were written 16 bytes before
> > > - writing a few bytes that were written 16 bytes after
> > >
> > > > The overlapping writes in memcpy never write different values to the
> > > > same location, so I still feel this must be some sort of HW issue, not a
> > > > SW one.
> > >
> > > So do I (my interpretation is that it combines or rather skips some of
> > > the writes to the same 16-byte address as it ignores the data strobes).
> >
> > Maybe it just always writes to the wrong location, 16 bytes apart for one of
> > the stp instructions. Since we are usually dealing with a pair of overlapping
> > 'stp', both unaligned, that could explain both the missing bytes (we write
> > data to the wrong place, but overwrite it with the correct data right away)
> > and the extra copy (we write it to the wrong place, but then write the correct
> > data to the correct place as well).
> >
> > This sounds a bit like what the original ARM CPUs did on unaligned
> > memory access, where a single aligned 4-byte location was accessed,
> > but the bytes swapped around.
> >
> > There may be a few more things worth trying out or analysing from
> > the recorded past failures to understand more about how it goes
> > wrong:
> >
> > - For which data lengths does it fail? Having two overlapping
> >   unaligned stp is something that only happens for 16..96 byte
> >   memcpy.
>
> If you want to research the corruptions in detail, I uploaded a file
> containing 7k corruptions here:
> http://people.redhat.com/~mpatocka/testcases/arm-pcie-corruption/

Nice!

I already found a couple of things:

- Failure to copy always happens at the *end* of a 16 byte aligned
  physical address, it misses between 1 and 7 bytes, never 8 or more,
  and it's more likely to be fewer bytes that are affected.
   279 7
   389 6
   484 5
   683 4
   741 3
   836 2
   946 1

- The first byte that fails to get copied is always 16 bytes after the
  memcpy target. Since we only observe it at the end of the 16 byte
  range, it means this happens specifically for addresses ending in
  0x9 (7 bytes missed) to 0xf (1 byte missed).

- Out of 7445 corruptions, 4358 were of the kind that misses a copy at the
  end of a 16-byte area, they were for copies between 41 and 64 bytes,
  more to the larger end of the scale (note that with your test program,
  smaller memcpys happen more frequently than larger ones).
 47 0x29
 36 0x2a
 47 0x2b
 23 0x2c
 29 0x2d
 31 0x2e
 36 0x2f
 46 0x30
 45 0x31
 51 0x32
 62 0x33
 64 0x34
 77 0x35
 91 0x36
 90 0x37
100 0x38
100 0x39
209 0x3a
279 0x3b
366 0x3c
498 0x3d
602 0x3e
682 0x3f
747 0x40

- All corruption with data copied to the wrong place happened for copies
  between 33 and 47 bytes, mostly to the smaller end of the scale:
391 0x21
360 0x22
319 0x23
273 0x24
273 0x25
241 0x26
224 0x27
221 0x28
231 0x29
208 0x2a
163 0x2b
 86 0x2c
 63 0x2d
 33 0x2e
  1 0x2f

- One common (but not the only, still investigating) case for data getting
  written to the wrong place is:
   * corruption starts 16 bytes after the memcpy start
   * corrupt bytes are the same as the bytes written to the start
   * start address ends in 0x1 through 0x7
   * length of corruption is at most memcpy length - 32, always
     between 1 and 7.
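
The arithmetic behind the "fails to copy" observations can be written
down as a small C sketch. It only encodes what the list above says (the
first lost byte is 16 bytes after the memcpy target, and the loss runs
to the end of that 16-byte block); the function names are hypothetical
and the result is a prediction from the recorded data, not a statement
about what the hardware does.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the first byte expected to be lost for a given target. */
static uint64_t predicted_first_missed(uint64_t dst)
{
    return dst + 16;
}

/*
 * Number of bytes expected to be lost, or 0 if the target address does
 * not end in 0x9..0xf (no such failures were observed there).
 */
static unsigned predicted_missed_bytes(uint64_t dst)
{
    unsigned low = (unsigned)(dst & 0xf);
    unsigned n;

    if (low < 0x9)
        return 0;
    n = 16 - low;               /* up to the end of the 16-byte block */
    assert(n >= 1 && n <= 7);   /* matches the observed 1..7 range */
    return n;
}
```

For a target ending in 0xd this predicts that three bytes are lost,
starting 16 bytes after the target.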

   Arnd


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Mikulas Patocka



On Mon, 6 Aug 2018, Ard Biesheuvel wrote:

> No that works fine for me. VDPAU acceleration works as well, but it
> depends on your chromium build whether it can actually use it, I
> think? In any case, mplayer can use vdpau to play 1080p h264 without
> breaking a sweat on this system.

I didn't install the vdpau libraries and firmware. mplayer plays through 
xv and works (it can't play through vdpau). Chromium uses 
I-don't-know-what and locks up.

> Note that the VDPAU driver also relies on memory semantics, i.e., it
> may use DC ZVA (zero cacheline) instructions which are not permitted
> on device mappings. This is probably just glibc's memset() being
> invoked, but I remember hitting this on another PCIe-impaired arm64
> system with Synopsys PCIe IP

> Are you setting the pstate to auto? That helps a lot in my experience.
>
> I.e.,
>
> echo auto > /sys/kernel/debug/dri/0/pstate

I tried that, but it didn't help.

Mikulas


RE: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Mikulas Patocka



On Wed, 8 Aug 2018, David Laight wrote:

> From: Arnd Bergmann
> > Sent: 08 August 2018 17:31
> ..
> > > They do modify the same byte, but with the same value. Suppose that you
> > > want to copy a piece of data that is between 8 and 16 bytes long. You can
> > > do this:
> > >
> > > add src_end, src, len
> > > add dst_end, dst, len
> > > ldr x0, [src]
> > > ldr x1, [src_end - 8]
> > > str x0, [dst]
> > > str x1, [dst_end - 8]
> 
> I've done that myself (on x86) copied the last 'word' first then
> everything else in increasing address order.
> 
> > > The ARM64 memcpy uses this trick heavily in order to reduce branching, and
> > > this is what makes the PCIe controller choke.
> 
> More likely the write combining buffer?

When I write to memory (using the NC mapping - that is also used in the 
PCI BAR), I get no corruption. So the corruption must be in the PCIe 
controller, not the core or memory subsystem.

I also tried to disable write streaming on NC mapping with a chicken bit, 
but it didn't help.
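
For reference, the overlapping-copy trick from the assembly quoted
earlier in this message can be sketched in C roughly as follows. This is
only an illustration of the idea, not the glibc code: the function name
is hypothetical, and the fixed-size memcpy() calls stand in for the
unaligned 8-byte ldr/str instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy 8..16 bytes with two 8-byte accesses and no branch on len.
 * The two stores overlap whenever len < 16, but they always write the
 * same value to the bytes they share.
 */
static void copy_8_to_16(unsigned char *dst, const unsigned char *src,
                         size_t len)
{
    uint64_t head, tail;

    assert(len >= 8 && len <= 16);
    memcpy(&head, src, 8);              /* ldr x0, [src]         */
    memcpy(&tail, src + len - 8, 8);    /* ldr x1, [src_end - 8] */
    memcpy(dst, &head, 8);              /* str x0, [dst]         */
    memcpy(dst + len - 8, &tail, 8);    /* str x1, [dst_end - 8] */
}
```

With len = 11, for example, bytes 3..7 of the destination are written by
both stores.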

> > So when a single unaligned 'stp' gets translated into a PCIe TLP
> > with length=5 (20 bytes) and LastBE = ~1stBE, write combining the
> > overlapping stores gives us a TLP with a longer length (5..8 for two
> > stores), and byte-enable bits that are not exactly a complement.
> 
> Write combining should generate a much longer TLP.
> Depending on the size of the write combining buffer.
> 
> But in the above case I'd have thought that the second write
> would fail to 'combine' - because it isn't contiguous with the
> stored data.
> 
> So something more complex will be going on.
> 
>   David
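
The length and byte-enable numbers in the quoted text can be checked
with a short C sketch. It assumes that one contiguous write becomes
exactly one memory-write TLP, with the length counted in 4-byte DWs and
bit i of a byte-enable field covering byte i of that DW; the struct and
function names are made up for illustration. Whether the controller on
this board really builds its TLPs this way is not known from the thread.

```c
#include <assert.h>
#include <stdint.h>

struct tlp_shape {
    unsigned length_dw;   /* payload length in 4-byte DWs */
    unsigned first_be;    /* byte enables of the first DW */
    unsigned last_be;     /* byte enables of the last DW, 0 for 1 DW */
};

/* Shape of a TLP carrying one contiguous write of size bytes at addr. */
static struct tlp_shape tlp_for_write(uint64_t addr, unsigned size)
{
    struct tlp_shape t;
    uint64_t last;

    assert(size > 0);
    last = addr + size - 1;             /* last byte written */
    t.length_dw = (unsigned)((last >> 2) - (addr >> 2)) + 1;
    t.first_be = (0xfu << (addr & 3)) & 0xfu;
    t.last_be = 0xfu >> (3 - (last & 3));
    if (t.length_dw == 1) {
        /* a single-DW request carries all enables in the first BE */
        t.first_be &= t.last_be;
        t.last_be = 0;
    }
    return t;
}
```

For a single 16-byte store at offset 0x1d this gives length 5 with
LastBE equal to the complement of 1stBE. For two such stores merged into
one write of 0x1f bytes from 0x1d it gives length 8, and the byte
enables are no longer complements of each other.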

Mikulas


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Mikulas Patocka



On Wed, 8 Aug 2018, Catalin Marinas wrote:

> On Wed, Aug 08, 2018 at 10:12:27AM -0400, Mikulas Patocka wrote:
> > On Wed, 8 Aug 2018, Catalin Marinas wrote:
> > > On Fri, Aug 03, 2018 at 01:09:02PM -0400, Mikulas Patocka wrote:
> > > > while (1) {
> > > > start = (unsigned)random() % (LEN + 1);
> > > > end = (unsigned)random() % (LEN + 1);
> > > > if (start > end)
> > > > continue;
> > > > for (i = start; i < end; i++)
> > > > data[i] = val++;
> > > > memcpy(map + start, data + start, end - start);
> > > > if (memcmp(map, data, LEN)) {
> > > 
> > > It may be worth trying to do a memcmp(map+start, data+start, end-start)
> > > here to see whether the hazard logic fails when the writes are unaligned
> > > but the reads are not.
> > > 
> > > This problem may as well appear if you do byte writes and read longs
> > > back (and I consider this a hardware problem on this specific board).
> > 
> > I tried to insert usleep(1) between the memcpy and memcmp, but the
> > same corruption occurs. So, it can't be a read-after-write hazard. It is
> > caused by the improper handling of the hazard between the overlapping writes
> > inside memcpy.
> 
> It could get it wrong between subsequent writes to the same 64-byte range
> (e.g. the address & ~63 is the same but the data strobes for which bytes
> to write are different). If it somehow thinks that it's a
> write-after-write hazard even though the strobes are different, it could
> cancel one of the writes.

I believe that the SoC has logic for write-after-write detection, but the 
logic is broken and corrupts data.

If I insert "dmb sy" between the overlapping writes, there's no corruption 
(the PCIe controller won't see any overlapping writes in that case).

> It may be worth trying with a byte-only memcpy() function while keeping
> the default memcmp().

I tried that and byte-only memcpy works without any corruption.
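
A byte-only copy of the kind discussed here could look like the sketch
below. The function name is hypothetical and this is not necessarily the
exact routine used in the test; the volatile qualifier is there so that
the compiler keeps one store per byte instead of turning the loop back
into wider stores or a call to memcpy().

```c
#include <assert.h>
#include <stddef.h>

/* Copy len bytes one at a time, so no two stores ever overlap. */
static void *memcpy_bytewise(void *dst, const void *src, size_t len)
{
    volatile unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < len; i++)
        d[i] = s[i];
    return dst;
}
```

In the test loop this would replace the memcpy(map + start, ...) call,
while the memcmp() check stays unchanged.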

> -- 
> Catalin

Mikulas


RE: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Mikulas Patocka



On Wed, 8 Aug 2018, David Laight wrote:

> From: Mikulas Patocka
> > Sent: 08 August 2018 14:47
> ...
> > The problem on ARM is that I see data corruption when the overlapping
> > unaligned writes are done just by a single core.
> 
> Is this a sequence of unaligned writes (that shouldn't modify the
> same physical locations) or an aligned write followed by an
> unaligned one that updates part of the earlier write.
> (Or the opposite order?)
> 
> It might be that the unaligned writes are bypassing the write-combining
> buffer (without flushing it) - so overtake the aligned write.
> 
> Alternatively the unaligned writes go through the write-combining
> buffer but the byte-enables aren't handled in the expected way.
> 
> It ought to be possible to work out which sequence is actually broken.
> 
>   David

All the writes inside memcpy, unaligned or aligned, write the same value to
the overlapping bytes. So, the corruption can't be explained just by
reordering the writes or failing to detect a hazard between them.
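
This argument can be checked on ordinary memory with a small C sketch.
It assumes the four-store layout that glibc uses for 33..64 byte copies
(A at dst, B at dst+16, C at end-32, D at end-16) and tries all 24
orderings of the stores; the function name and the buffer size are made
up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if every ordering of the four stores gives a correct copy. */
static int all_orders_copy_correctly(size_t off, size_t len)
{
    unsigned char src[96], dst[96];
    size_t start[4];
    size_t i;
    int p, q, r, s;

    assert(len > 32 && len <= 64 && off + len <= sizeof(src));
    for (i = 0; i < sizeof(src); i++)
        src[i] = (unsigned char)(i + 1);

    start[0] = off;              /* A */
    start[1] = off + 16;         /* B */
    start[2] = off + len - 32;   /* C */
    start[3] = off + len - 16;   /* D */

    for (p = 0; p < 4; p++)
    for (q = 0; q < 4; q++)
    for (r = 0; r < 4; r++)
    for (s = 0; s < 4; s++) {
        int order[4] = { p, q, r, s };

        if (p == q || p == r || p == s || q == r || q == s || r == s)
            continue;            /* not a permutation */
        memset(dst, 0, sizeof(dst));
        for (i = 0; i < 4; i++)  /* one 16-byte store per step */
            memcpy(dst + start[order[i]], src + start[order[i]], 16);
        if (memcmp(dst + off, src + off, len) != 0)
            return 0;
    }
    return 1;
}
```

On normal memory this returns 1 for any valid offset and length, which
is why pure reordering of the stores cannot produce the observed
corruption.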

Mikulas


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Mikulas Patocka



On Mon, 6 Aug 2018, Robin Murphy wrote:

> I would strongly suspect this issue is particular to Armada 8k, so it's
> probably one for the Marvell folks to take a closer look at - I believe 
> some previous interconnect issues on those SoCs were actually fixable in 
> firmware.
> 
> Robin.

Do you have any contact for them? I suppose that corporate support would
ignore just a single user.

Mikulas


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Mikulas Patocka



On Wed, 8 Aug 2018, Arnd Bergmann wrote:

> On Wed, Aug 8, 2018 at 5:15 PM Catalin Marinas wrote:
> >
> > On Wed, Aug 08, 2018 at 04:01:12PM +0100, Richard Earnshaw wrote:
> > > On 08/08/18 15:12, Mikulas Patocka wrote:
> > > > On Wed, 8 Aug 2018, Catalin Marinas wrote:
> > > >> On Fri, Aug 03, 2018 at 01:09:02PM -0400, Mikulas Patocka wrote:
> > - failing to write a few bytes
> > - writing a few bytes that were written 16 bytes before
> > - writing a few bytes that were written 16 bytes after
> >
> > > The overlapping writes in memcpy never write different values to the
> > > same location, so I still feel this must be some sort of HW issue, not a
> > > SW one.
> >
> > So do I (my interpretation is that it combines or rather skips some of
> > the writes to the same 16-byte address as it ignores the data strobes).
> 
> Maybe it just always writes to the wrong location, 16 bytes apart for one of
> the stp instructions. Since we are usually dealing with a pair of overlapping
> 'stp', both unaligned, that could explain both the missing bytes (we write
> data to the wrong place, but overwrite it with the correct data right away)
> and the extra copy (we write it to the wrong place, but then write the correct
> data to the correct place as well).
> 
> This sounds a bit like what the original ARM CPUs did on unaligned
> memory access, where a single aligned 4-byte location was accessed,
> but the bytes swapped around.
> 
> There may be a few more things worth trying out or analysing from
> the recorded past failures to understand more about how it goes
> wrong:
> 
> - For which data lengths does it fail? Having two overlapping
>   unaligned stp is something that only happens for 16..96 byte
>   memcpy.

If you want to research the corruptions in detail, I uploaded a file 
containing 7k corruptions here: 
http://people.redhat.com/~mpatocka/testcases/arm-pcie-corruption/

> - What if we use a pair of str instructions instead of an stp in
>   a modified memcpy? Does it still write to the wrong place
>   16 bytes away, just 8 bytes away, or correctly?

I replaced all stp instructions with str and it had no effect on the
corruptions. Either a few bytes are omitted, or a value that belongs 16
bytes before or after is written.

> - Does it change in any way if we do the overlapping writes
>   in the reverse order? E.g. for the 16..64 byte case:
> 
> diff --git a/sysdeps/aarch64/memcpy.S b/sysdeps/aarch64/memcpy.S
> index 7e1163e6a0..09d0160bdf 100644
> --- a/sysdeps/aarch64/memcpy.S
> +++ b/sysdeps/aarch64/memcpy.S
> @@ -102,11 +102,11 @@ ENTRY (MEMCPY)
> tbz tmp1, 5, 1f
> ldp B_l, B_h, [src, 16]
> ldp C_l, C_h, [srcend, -32]
> -   stp B_l, B_h, [dstin, 16]
> stp C_l, C_h, [dstend, -32]
> +   stp B_l, B_h, [dstin, 16]
>  1:
> -   stp A_l, A_h, [dstin]
> stp D_l, D_h, [dstend, -16]
> +   stp A_l, A_h, [dstin]
> ret
> 
> .p2align 4
> 
> Arnd

After reordering them, I observe only omitted writes; there are no longer
any misdirected writes:

http://people.redhat.com/~mpatocka/testcases/arm-pcie-corruption/reorder-test/

Mikulas


RE: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread David Laight
From: Arnd Bergmann
> Sent: 08 August 2018 17:31
..
> > They do modify the same byte, but with the same value. Suppose that you
> > want to copy a piece of data that is between 8 and 16 bytes long. You can
> > do this:
> >
> > add src_end, src, len
> > add dst_end, dst, len
> > ldr x0, [src]
> > ldr x1, [src_end - 8]
> > str x0, [dst]
> > str x1, [dst_end - 8]

I've done that myself (on x86): copied the last 'word' first, then
everything else in increasing address order.

> > The ARM64 memcpy uses this trick heavily in order to reduce branching, and
> > this is what makes the PCIe controller choke.

More likely the write combining buffer?

> So when a single unaligned 'stp' gets translated into a PCIe TLP
> with length=5 (20 bytes) and LastBE = ~1stBE, write combining the
> overlapping stores gives us a TLP with a longer length (5..8 for two
> stores), and byte-enable bits that are not exactly a complement.

Write combining should generate a much longer TLP, depending on the
size of the write combining buffer.

But in the above case I'd have thought that the second write
would fail to 'combine' - because it isn't contiguous with the
stored data.

So something more complex will be going on.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Arnd Bergmann
On Wed, Aug 8, 2018 at 6:22 PM Mikulas Patocka  wrote:
>
> On Wed, 8 Aug 2018, Catalin Marinas wrote:
>
> > On Wed, Aug 08, 2018 at 02:26:11PM +, David Laight wrote:
> > > From: Mikulas Patocka
> > > > Sent: 08 August 2018 14:47
> > > ...
> > > > The problem on ARM is that I see data corruption when the overlapping
> > > > unaligned writes are done just by a single core.
> > >
> > > Is this a sequence of unaligned writes (that shouldn't modify the
> > > same physical locations) or an aligned write followed by an
> > > unaligned one that updates part of the earlier write.
> > > (Or the opposite order?)
> >
> > In the memcpy() case, there can be a sequence of unaligned writes but
> > they would not modify the same byte (so no overlapping address at the
> > byte level).
>
> They do modify the same byte, but with the same value. Suppose that you
> want to copy a piece of data that is between 8 and 16 bytes long. You can
> do this:
>
> add src_end, src, len
> add dst_end, dst, len
> ldr x0, [src]
> ldr x1, [src_end - 8]
> str x0, [dst]
> str x1, [dst_end - 8]
>
> The ARM64 memcpy uses this trick heavily in order to reduce branching, and
> this is what makes the PCIe controller choke.

So when a single unaligned 'stp' gets translated into a PCIe TLP
with length=5 (20 bytes) and LastBE = ~1stBE, write combining the
overlapping stores gives us a TLP with a longer length (5..8 for two
stores), and byte-enable bits that are not exactly a complement.

If the explanation is just that the byte-enable settings of the merged
TLP are wrong, maybe the problem is that one of them is always
the complement of the other, which would work for power-of-two
length but not the odd length of the TLP post write-combining?
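
To make the length=5 and LastBE = ~1stBE arithmetic concrete, here is a
minimal sketch in C. It is a simplified model, not a description of what this
root complex does: it assumes each store becomes exactly one memory write TLP,
that bit n of a byte-enable field covers byte n of its DW, and it ignores
payload-size and boundary splitting. The names tlp_shape and tlp_for_write are
made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the header fields of one memory write TLP. */
struct tlp_shape {
    unsigned length_dw; /* payload length in 4-byte DWs */
    unsigned first_be;  /* byte enables of the first DW, bit n = byte n */
    unsigned last_be;   /* byte enables of the last DW, 0 for a 1-DW TLP */
};

/* Compute length and byte enables for a write of len bytes at addr. */
static struct tlp_shape tlp_for_write(uint64_t addr, unsigned len)
{
    struct tlp_shape t;
    uint64_t end, first_dw, last_dw;

    assert(len > 0);
    end = addr + len;                       /* one past the last byte */
    first_dw = addr & ~(uint64_t)3;         /* DW holding the first byte */
    last_dw = (end - 1) & ~(uint64_t)3;     /* DW holding the last byte */

    t.length_dw = (unsigned)((last_dw - first_dw) / 4) + 1;
    t.first_be = (0xFu << (addr & 3)) & 0xFu;
    t.last_be = 0xFu >> (3 - (unsigned)((end - 1) & 3));
    if (t.length_dw == 1) {
        /* A single DW carries all of its enables in the first field. */
        t.first_be &= t.last_be;
        t.last_be = 0;
    }
    return t;
}
```

For a 16-byte store at offset 0x0d this gives length_dw = 5, first_be = 0xE
and last_be = 0x1, so the two fields are exact complements. Under the same
model that holds for any 16-byte store that is not 4-byte aligned, but not in
general for a merged write whose length is not a multiple of 4.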

  Arnd


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Mikulas Patocka



On Wed, 8 Aug 2018, Catalin Marinas wrote:

> On Wed, Aug 08, 2018 at 02:26:11PM +, David Laight wrote:
> > From: Mikulas Patocka
> > > Sent: 08 August 2018 14:47
> > ...
> > > The problem on ARM is that I see data corruption when the overlapping
> > > unaligned writes are done just by a single core.
> > 
> > Is this a sequence of unaligned writes (that shouldn't modify the
> > same physical locations) or an aligned write followed by an
> > unaligned one that updates part of the earlier write.
> > (Or the opposite order?)
> 
> In the memcpy() case, there can be a sequence of unaligned writes but
> they would not modify the same byte (so no overlapping address at the
> byte level).

They do modify the same byte, but with the same value. Suppose that you 
want to copy a piece of data that is between 8 and 16 bytes long. You can 
do this:

add src_end, src, len
add dst_end, dst, len
ldr x0, [src]
ldr x1, [src_end - 8]
str x0, [dst]
str x1, [dst_end - 8]

The ARM64 memcpy uses this trick heavily in order to reduce branching, and 
this is what makes the PCIe controller choke.
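
The same trick can be sketched in C. This is only an illustration of the
access pattern: the function name copy_8_to_16 is made up, and a compiler is
free to turn the fixed-size memcpy() calls into something other than the
ldr/str pairs shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy len bytes (8 <= len <= 16) with two 8-byte accesses that may overlap.
 * Bytes in the overlap are stored twice, always with the same value. */
static void copy_8_to_16(unsigned char *dst, const unsigned char *src,
                         size_t len)
{
    uint64_t head, tail;

    assert(len >= 8 && len <= 16);
    /* Both loads are done before either store, as in the assembly. */
    memcpy(&head, src, 8);              /* ldr x0, [src] */
    memcpy(&tail, src + len - 8, 8);    /* ldr x1, [src_end - 8] */
    memcpy(dst, &head, 8);              /* str x0, [dst] */
    memcpy(dst + len - 8, &tail, 8);    /* str x1, [dst_end - 8] */
}
```

With len = 11, for example, the two stores cover bytes 0..7 and 3..10, so
bytes 3..7 are written twice and no length-dependent branch is needed.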

Mikulas


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Arnd Bergmann
On Wed, Aug 8, 2018 at 5:15 PM Catalin Marinas  wrote:
>
> On Wed, Aug 08, 2018 at 04:01:12PM +0100, Richard Earnshaw wrote:
> > On 08/08/18 15:12, Mikulas Patocka wrote:
> > > On Wed, 8 Aug 2018, Catalin Marinas wrote:
> > >> On Fri, Aug 03, 2018 at 01:09:02PM -0400, Mikulas Patocka wrote:
> - failing to write a few bytes
> - writing a few bytes that were written 16 bytes before
> - writing a few bytes that were written 16 bytes after
>
> > The overlapping writes in memcpy never write different values to the
> > same location, so I still feel this must be some sort of HW issue, not a
> > SW one.
>
> So do I (my interpretation is that it combines or rather skips some of
> the writes to the same 16-byte address as it ignores the data strobes).

Maybe it just always writes to the wrong location, 16 bytes apart for one of
the stp instructions. Since we are usually dealing with a pair of overlapping
'stp', both unaligned, that could explain both the missing bytes (we write
data to the wrong place, but overwrite it with the correct data right away)
and the extra copy (we write it to the wrong place, but then write the correct
data to the correct place as well).

This sounds a bit like what the original ARM CPUs did on unaligned
memory access, where a single aligned 4-byte location was accessed,
but the bytes swapped around.

There may be a few more things worth trying out or analysing from
the recorded past failures to understand more about how it goes
wrong:

- For which data lengths does it fail? Having two overlapping
  unaligned stp is something that only happens for 16..96 byte
  memcpy.

- What if we use a pair of str instructions instead of an stp in
  a modified memcpy? Does it still write to the
  wrong place 16 bytes away, just 8 bytes away, or correctly?

- Does it change in any way if we do the overlapping writes
  in the reverse order? E.g. for the 16..64 byte case:

diff --git a/sysdeps/aarch64/memcpy.S b/sysdeps/aarch64/memcpy.S
index 7e1163e6a0..09d0160bdf 100644
--- a/sysdeps/aarch64/memcpy.S
+++ b/sysdeps/aarch64/memcpy.S
@@ -102,11 +102,11 @@ ENTRY (MEMCPY)
tbz tmp1, 5, 1f
ldp B_l, B_h, [src, 16]
ldp C_l, C_h, [srcend, -32]
-   stp B_l, B_h, [dstin, 16]
stp C_l, C_h, [dstend, -32]
+   stp B_l, B_h, [dstin, 16]
 1:
-   stp A_l, A_h, [dstin]
stp D_l, D_h, [dstend, -16]
+   stp A_l, A_h, [dstin]
ret

.p2align 4
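
For anyone analysing the recorded failures, a small C helper can list which
byte ranges each stp covers for a given destination offset and length. This
is a sketch under assumptions: it models only the 16..64 byte path in its
original (unpatched) order, and it assumes tmp1 holds length - 1 when the tbz
is executed, which is not visible in the hunk above. The names store and
stp_ranges are made up.

```c
#include <assert.h>
#include <stddef.h>

/* One 16-byte store: register pair name and first/last byte offset. */
struct store {
    char name;
    size_t first;
    size_t last;
};

/* Fill out[] with the stores issued for a copy of len bytes (16..64) to
 * offset dst, in the original program order. Returns 2 or 4. */
static int stp_ranges(size_t dst, size_t len, struct store out[4])
{
    size_t dstend = dst + len;
    int n = 0;

    assert(len >= 16 && len <= 64);
    if ((len - 1) & 32) {   /* assumed meaning of: tbz tmp1, 5, 1f */
        out[n++] = (struct store){ 'B', dst + 16, dst + 31 };
        out[n++] = (struct store){ 'C', dstend - 32, dstend - 17 };
    }
    out[n++] = (struct store){ 'A', dst, dst + 15 };
    out[n++] = (struct store){ 'D', dstend - 16, dstend - 1 };
    return n;
}
```

For dst = 0x0d and len = 0x3f this yields B at 0x1d, C at 0x2c, A at 0x0d and
D at 0x3c, with D ending at 0x4b. Feeding the logged offsets and lengths
through it shows which pairs of stores overlap in each failing copy.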

Arnd


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Catalin Marinas
On Wed, Aug 08, 2018 at 04:01:12PM +0100, Richard Earnshaw wrote:
> On 08/08/18 15:12, Mikulas Patocka wrote:
> > On Wed, 8 Aug 2018, Catalin Marinas wrote:
> >> On Fri, Aug 03, 2018 at 01:09:02PM -0400, Mikulas Patocka wrote:
> >>>   while (1) {
> >>>   start = (unsigned)random() % (LEN + 1);
> >>>   end = (unsigned)random() % (LEN + 1);
> >>>   if (start > end)
> >>>   continue;
> >>>   for (i = start; i < end; i++)
> >>>   data[i] = val++;
> >>>   memcpy(map + start, data + start, end - start);
> >>>   if (memcmp(map, data, LEN)) {
> >>
> >> It may be worth trying to do a memcmp(map+start, data+start, end-start)
> >> here to see whether the hazard logic fails when the writes are unaligned
> >> but the reads are not.
> >>
> >> This problem may as well appear if you do byte writes and read longs
> >> back (and I consider this a hardware problem on this specific board).
> > 
> > I tried to insert usleep(1) between the memcpy and memcmp, but the 
> > same corruption occurs. So, it can't be read-after-write hazard. It is 
> > caused by the improper handling of hazard between the overlapping writes 
> > inside memcpy.
> 
> I don't think you've told us what form the corruption takes.  Does it
> lose some bytes?  Modify values beyond the copy range?  Write completely
> arbitrary values?

From this message:

https://lore.kernel.org/lkml/alpine.lrh.2.02.1808060553130.30...@file01.intranet.prod.int.rdu2.redhat.com/

- failing to write a few bytes
- writing a few bytes that were written 16 bytes before
- writing a few bytes that were written 16 bytes after

> The overlapping writes in memcpy never write different values to the
> same location, so I still feel this must be some sort of HW issue, not a
> SW one.

So do I (my interpretation is that it combines or rather skips some of
the writes to the same 16-byte address as it ignores the data strobes).

-- 
Catalin


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Richard Earnshaw (lists)
On 08/08/18 15:12, Mikulas Patocka wrote:
> 
> 
> On Wed, 8 Aug 2018, Catalin Marinas wrote:
> 
>> On Fri, Aug 03, 2018 at 01:09:02PM -0400, Mikulas Patocka wrote:
>>> while (1) {
>>> start = (unsigned)random() % (LEN + 1);
>>> end = (unsigned)random() % (LEN + 1);
>>> if (start > end)
>>> continue;
>>> for (i = start; i < end; i++)
>>> data[i] = val++;
>>> memcpy(map + start, data + start, end - start);
>>> if (memcmp(map, data, LEN)) {
>>
>> It may be worth trying to do a memcmp(map+start, data+start, end-start)
>> here to see whether the hazard logic fails when the writes are unaligned
>> but the reads are not.
>>
>> This problem may as well appear if you do byte writes and read longs
>> back (and I consider this a hardware problem on this specific board).
> 
> I tried to insert usleep(1) between the memcpy and memcmp, but the 
> same corruption occurs. So, it can't be read-after-write hazard. It is 
> caused by the improper handling of hazard between the overlapping writes 
> inside memcpy.
> 
> Mikulas
> 

I don't think you've told us what form the corruption takes.  Does it
lose some bytes?  Modify values beyond the copy range?  Write completely
arbitrary values?

The overlapping writes in memcpy never write different values to the
same location, so I still feel this must be some sort of HW issue, not a
SW one.

R.


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Catalin Marinas
On Wed, Aug 08, 2018 at 02:26:11PM +, David Laight wrote:
> From: Mikulas Patocka
> > Sent: 08 August 2018 14:47
> ...
> > The problem on ARM is that I see data corruption when the overlapping
> > unaligned writes are done just by a single core.
> 
> Is this a sequence of unaligned writes (that shouldn't modify the
> same physical locations) or an aligned write followed by an
> unaligned one that updates part of the earlier write.
> (Or the opposite order?)

In the memcpy() case, there can be a sequence of unaligned writes but
they would not modify the same byte (so no overlapping address at the
byte level).

-- 
Catalin


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Catalin Marinas
On Wed, Aug 08, 2018 at 10:12:27AM -0400, Mikulas Patocka wrote:
> On Wed, 8 Aug 2018, Catalin Marinas wrote:
> > On Fri, Aug 03, 2018 at 01:09:02PM -0400, Mikulas Patocka wrote:
> > >   while (1) {
> > >   start = (unsigned)random() % (LEN + 1);
> > >   end = (unsigned)random() % (LEN + 1);
> > >   if (start > end)
> > >   continue;
> > >   for (i = start; i < end; i++)
> > >   data[i] = val++;
> > >   memcpy(map + start, data + start, end - start);
> > >   if (memcmp(map, data, LEN)) {
> > 
> > It may be worth trying to do a memcmp(map+start, data+start, end-start)
> > here to see whether the hazard logic fails when the writes are unaligned
> > but the reads are not.
> > 
> > This problem may as well appear if you do byte writes and read longs
> > back (and I consider this a hardware problem on this specific board).
> 
> I tried to insert usleep(1) between the memcpy and memcmp, but the 
> same corruption occurs. So, it can't be read-after-write hazard. It is 
> caused by the improper handling of hazard between the overlapping writes 
> inside memcpy.

It could get it wrong between subsequent writes to the same 64-bit range
(e.g. the address & ~63 is the same but the data strobes for which bytes
to write are different). If it somehow thinks that it's a
write-after-write hazard even though the strobes are different, it could
cancel one of the writes.

It may be worth trying with a byte-only memcpy() function while keeping
the default memcmp().
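
A byte-only copy for that experiment could look roughly like the sketch below.
The name bytewise_copy is made up. The volatile qualifier is there only to
stop the compiler from merging the stores or substituting a memcpy() call; it
does not stop the hardware from combining the byte writes further down.

```c
#include <assert.h>
#include <stddef.h>

/* Copy n bytes one at a time, in increasing address order. Each byte of
 * the destination is written exactly once, so no two stores overlap. */
static void bytewise_copy(volatile unsigned char *dst,
                          const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

In the test loop this would replace only the memcpy(map + start, ...) call,
leaving the memcmp(map, data, LEN) check unchanged.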

-- 
Catalin


RE: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread David Laight
From: Mikulas Patocka
> Sent: 08 August 2018 14:47
...
> The problem on ARM is that I see data corruption when the overlapping
> unaligned writes are done just by a single core.

Is this a sequence of unaligned writes (that shouldn't modify the
same physical locations) or an aligned write followed by an
unaligned one that updates part of the earlier write.
(Or the opposite order?)

It might be that the unaligned writes are bypassing the write-combining
buffer (without flushing it) - so overtake the aligned write.

Alternatively the unaligned writes go through the write-combining
buffer but the byte-enables aren't handled in the expected way.

It ought to be possible to work out which sequence is actually broken.

David




RE: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Mikulas Patocka



On Tue, 7 Aug 2018, David Laight wrote:

> From: Mikulas Patocka
> > Sent: 07 August 2018 15:07
> ...
> > Unaccelerated scrolling is still painfully slow
> > even on modern computers because of slow framebuffer read.
> 
> I solved that many years ago on a strongarm system by mapping
> the screen memory at two separate virtual addresses.
> One uncached used for writes, the second cached using the
> 'minicache' for reads.
> (and immediately fell foul of a memcpy() function that compared
> the two virtual addresses and decided to copy backwards)
> 
> I suspect some modern cpus don't like you doing that and the
> graphics 'drivers' won't use different mappings.

Intel says that you can't mix PAT memory attributes - but the non-temporal
store instructions use write-combining semantics on memory that is
normally cacheable - and it is allowed to mix non-temporal stores with
other cacheable memory accesses - so I believe that the CPU will snoop the
cache for wc accesses and handle the conflict.

> Even in glibc you want a more general copy_to/from_io_memory()
> rather than just 'copy_from_framebuffer()'.
> Best to define both - even if they end up identical.
> Other drivers allow PCIe space be mmap()ed into user space.
> 
> While your tests show vmovntdqa being slightly slower than an
> avx read for uncached mappings it is still much better than
> all the other options.

This was a measurement glitch - movntdqa is as fast as movdqa on non-cached
mappings.

Mikulas

>   David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> 1PT, UK
> Registration No: 1397386 (Wales)
> 


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Mikulas Patocka



On Wed, 8 Aug 2018, Catalin Marinas wrote:

> On Fri, Aug 03, 2018 at 01:09:02PM -0400, Mikulas Patocka wrote:
> > while (1) {
> > start = (unsigned)random() % (LEN + 1);
> > end = (unsigned)random() % (LEN + 1);
> > if (start > end)
> > continue;
> > for (i = start; i < end; i++)
> > data[i] = val++;
> > memcpy(map + start, data + start, end - start);
> > if (memcmp(map, data, LEN)) {
> 
> It may be worth trying to do a memcmp(map+start, data+start, end-start)
> here to see whether the hazard logic fails when the writes are unaligned
> but the reads are not.
> 
> This problem may as well appear if you do byte writes and read longs
> back (and I consider this a hardware problem on this specific board).

I tried inserting usleep(1) between the memcpy and memcmp, but the
same corruption occurs. So it can't be a read-after-write hazard. It is
caused by improper handling of the hazard between the overlapping writes
inside memcpy.
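
For reference, the modified test loop looks roughly like the sketch below.
It is not the original program: the function name, the iteration limit and
the `settle_us` parameter are assumptions added so that the loop can be
run against any buffer, and the resync after a mismatch is also an
addition. On the board, `map` would be the mmap()ed framebuffer.

```c
#define _DEFAULT_SOURCE /* for random() and usleep() on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Random-range copy test with an optional delay between the copy and
 * the comparison. Returns the number of iterations that mismatched.
 */
unsigned run_copy_test(unsigned char *map, unsigned char *data, size_t len,
                       unsigned iterations, unsigned settle_us)
{
    unsigned failures = 0;
    unsigned char val = 0;

    assert(map != NULL && data != NULL && len > 0);

    /* Start from identical contents. */
    memset(data, 0, len);
    memcpy(map, data, len);

    while (iterations--) {
        size_t start = (size_t)random() % (len + 1);
        size_t end = (size_t)random() % (len + 1);
        size_t i;

        if (start > end)
            continue;
        for (i = start; i < end; i++)
            data[i] = val++;
        memcpy(map + start, data + start, end - start);

        /* Give any pending writes time to drain before reading back. */
        if (settle_us)
            usleep(settle_us);

        if (memcmp(map, data, len)) {
            failures++;
            memcpy(map, data, len); /* resync so one error is counted once */
        }
    }
    return failures;
}
```

If the failure count stays the same with `settle_us` set to 0 and to 1,
the delay makes no difference, which is what I observe on the framebuffer.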

Mikulas


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Mikulas Patocka



On Wed, 8 Aug 2018, Marcin Wojtas wrote:

> Hi Mikulas,
> 
> On Tue, 7 Aug 2018 at 19:39, Mikulas Patocka wrote:
> >
> >
> >
> > On Tue, 7 Aug 2018, Marcin Wojtas wrote:
> >
> > > Ard, Mikulas,
> > >
> > > After some self-caused setup issues I was able to run the test on my
> > > MacchiatoBin with the kernel v4.18-rc8. It's been running for 1h+ now,
> > > loading the CPU to 100% and no single error event...
> > >
> > > I built the binary file with:
> > > gcc-linaro-7.2.1-2017.11-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu-gcc
> > >  -O2
> > >
> > > Maybe it's the older firmware issue?
> >
> > I have downloaded and built the firmware recently (it has timestamp Jul 30
> > 2018).
> >
> > Do you still have your firmware file "flash-image.bin" that you used, so
> > that I could try it?
> 
> Attached. Please let me know if you see any difference.

I booted this image, but the same corruption happens.

Mikulas


RE: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Mikulas Patocka



On Wed, 8 Aug 2018, David Laight wrote:

> From: Catalin Marinas
> > Sent: 08 August 2018 13:17
> ...
> > I think hazarding is what goes wrong here, especially since with
> > overlapping unaligned addresses. However, I disagree that it is
> > impossible to implement this properly on a platform with PCIe so that
> > Normal NC mappings can be used.
> 
> I've been trying to follow this discussion...
> 
> Is the problem just that reads don't snoop/flush the write-combining buffer?

No. The pixel corruption is permanently visible on the monitor (even if
there are no reads from the framebuffer at all). So it can't be explained
as a mishandled read-after-write hazard.

> Aligned writes that end on an appropriate boundary will leave the write
> combining buffer empty.
> But if the buffer isn't emptied the PCIe read gets ahead of the PCIe write.
> 
> ISTR even x86 requires a fence instruction in some sequence associated
> with write-combining writes.

Other x86 cores may observe wc writes out of order - but a single x86 
core is self-consistent - i.e. if you do
movl $0x, (%ebx)
movl $0x, 3(%ebx)
then the byte at ebx+3 will always contain 0xFF. The core can't just 
corrupt data while doing reordering.
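
The same property can be written as a small C check. The immediates of the
two movl instructions did not survive in the archived text, so the values
passed in the usage note below are placeholders chosen only to match the
stated result; the function name is likewise made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Perform two overlapping 32-bit stores, the first at buf (bytes 0..3)
 * and the second at buf + 3 (bytes 3..6), and return the shared byte.
 * buf must have room for at least 7 bytes.
 */
unsigned char overlapping_store_result(unsigned char *buf,
                                       uint32_t first, uint32_t second)
{
    assert(buf != NULL);

    memcpy(buf, &first, sizeof(first));        /* like movl ..., (%ebx)  */
    memcpy(buf + 3, &second, sizeof(second));  /* like movl ..., 3(%ebx) */
    return buf[3];
}
```

Calling it with first = 0x00000000 and second = 0xFFFFFFFF must return 0xFF
on a self-consistent core, because the later store wins for the byte that
both stores cover.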

The problem on ARM is that I see data corruption when the overlapping 
unaligned writes are done just by a single core.

>   David

Mikulas


RE: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread David Laight
From: Catalin Marinas
> Sent: 08 August 2018 13:17
...
> I think hazarding is what goes wrong here, especially since with
> overlapping unaligned addresses. However, I disagree that it is
> impossible to implement this properly on a platform with PCIe so that
> Normal NC mappings can be used.

I've been trying to follow this discussion...

Is the problem just that reads don't snoop/flush the write-combining buffer?
Aligned writes that end on an appropriate boundary will leave the write
combining buffer empty.
But if the buffer isn't emptied the PCIe read gets ahead of the PCIe write.

ISTR even x86 requires a fence instruction in some sequence associated
with write-combining writes.

David




Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Catalin Marinas
Hi Matt,

On Fri, Aug 03, 2018 at 03:44:44PM -0500, Matt Sealey wrote:
> On 3 August 2018 at 13:25, Mikulas Patocka  wrote:
> > On Fri, 3 Aug 2018, Ard Biesheuvel wrote:
> >> Are we still talking about overlapping unaligned accesses here? Or do
> >> you see other failures as well?
> >
> > Yes - it is caused by overlapping unaligned accesses inside memcpy.
> > When I put "dmb sy" between the overlapping accesses in
> > glibc/sysdeps/aarch64/memcpy.S, this program doesn't detect any
> > memory corruption.
> 
> It is a symptom of generating reorderable accesses inside memcpy. It's nothing
> to do with alignment, per se (see below). A dmb sy just hides the symptoms.
> 
> What we're talking about here - yes, Ard, within certain amounts of
> reason - is that you cannot use PCI BAR memory as 'Normal' - certainly
> never cacheable memory, but Normal NC isn't good either. That is that
> your CPU cannot post writes or reads towards PCI memory spaces unless
> it is dealing with it as Device memory or very strictly controlled use
> of Normal Non-Cacheable.

I disagree that it's not possible to use Normal NC on prefetchable BARs.
This particular case looks more like a hardware issue to me as other
platforms don't exhibit the same behaviour.
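
For readers who have not looked at the glibc code, the workaround quoted
above amounts to something like the following C sketch. It is not the
glibc implementation (that is hand-written assembly in
sysdeps/aarch64/memcpy.S); the function and macro names are made up, and
on non-arm64 builds a generic full barrier stands in for "dmb sy".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__aarch64__)
#define STORE_BARRIER() __asm__ volatile("dmb sy" ::: "memory")
#else
#define STORE_BARRIER() __sync_synchronize() /* stand-in on other targets */
#endif

/*
 * Copy 16..32 bytes as two 16-byte blocks, one anchored at the start and
 * one at the end of the range. For n < 32 the two stores overlap; the
 * barrier keeps them from being in flight at the same time.
 */
void copy16_32_with_barrier(unsigned char *dst, const unsigned char *src,
                            size_t n)
{
    unsigned char head[16], tail[16];

    assert(n >= 16 && n <= 32);

    /* Load both blocks first, as the memcpy fast path does. */
    memcpy(head, src, sizeof(head));
    memcpy(tail, src + n - sizeof(tail), sizeof(tail));

    memcpy(dst, head, sizeof(head));
    STORE_BARRIER();
    memcpy(dst + n - sizeof(tail), tail, sizeof(tail));
}
```

The barrier only serialises the two stores; whether that fixes the
underlying problem or merely avoids triggering it is exactly what is being
argued about here.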

Note that allowing Normal NC mapping of prefetchable BARs together with
unaligned accesses is also a requirement for SBSA-compliant platforms
([1]; though I don't find the text in D.2 very clear).

> >> > I tried to run it on system RAM mapped with the NC attribute and I didn't
> >> > get any corruption - that suggests the bug may be in the PCIE
> >> > subsystem.
> 
> Pure fluke.

Do you mean you don't expect Mikulas' test to run fine on system RAM
with Normal NC mapping? We would have bigger issues if this was the
case.

> I'll give a simple explanation. The Arm Architecture defines
> single-copy and multi-copy atomic transactions. You can treat
> 'single-copy' to mean that that transaction cannot be made partial, or
> reordered within itself, i.e. it must modify memory (if it is a store)
> in a single swift effort and any future reads from that memory must
> return the FULL result of that write.
> 
> Multi-copy means it can be resized and reordered a bit. Will Deacon is
> going to crucify me for simplifying it, but.. let's proceed with a
> poor example:

Not sure about Will but I think you got them wrong ;). The single/multi
copy atomicity is considered in respect to (multiple) observers, a.k.a.
masters, and nothing to do with reordering a bit (see B2.2 in the ARMv8
ARM).

> STR X0,[X1] on a 32-bit bus cannot ever be single-copy atomic, because
> you cannot write 64-bits of data on a 32-bit bus in a single,
> unbreakable transaction. This is because from one bus cycle to the
> next, one half of the transaction will be in a different place. Your
> interconnect will have latched and buffered 32-bits and the CPU is
> holding the other.

It depends on the implementation, interconnect, buses. Since single-copy
atomicity refers to master accesses, the above transaction could be a
burst of two 32-bit writes and treated atomically by the interconnect
(i.e. not interruptible).

> STP X0, X1, [X2] on a 64-bit bus can be single-copy atomic with
> respect to the element size. But it is on the whole multi-copy atomic
> - that is to say that it can provide a single transaction with
> multiple elements which are transmitted, and those elements could be
> messed with on the way down the pipe.

This has nothing to do with multi-copy atomicity which actually refers
to multiple observers seeing the same write. The ARM architecture is not
exactly multi-copy atomic anyway (rather "other-multi-copy atomic").

Architecturally, STP is treated as two single-copy accesses (as you
mentioned already).

Anyway, the single/multiple copy atomicity is irrelevant for the C test
from Mikulas where you have the same observer (the CPU) writing and
reading the memory. I wonder whether writing a byte and reading a long
back would show similar corruption.
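
A minimal sketch of that experiment, assuming a suitably aligned mapping
whose length is a multiple of 8 bytes (the function name and the seed
parameter are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Write `len` bytes one at a time, then read the region back as 64-bit
 * words. Returns the number of words that differ from what was written.
 */
size_t byte_write_long_read(void *map, size_t len, unsigned char seed)
{
    volatile unsigned char *bytes = map;
    volatile uint64_t *words = map;
    size_t i, bad = 0;

    assert(map != NULL);
    assert(len % sizeof(uint64_t) == 0);
    assert((uintptr_t)map % sizeof(uint64_t) == 0);

    /* Byte-sized writes. */
    for (i = 0; i < len; i++)
        bytes[i] = (unsigned char)(seed + i);

    /* Long-sized reads, compared against the expected byte pattern. */
    for (i = 0; i < len / sizeof(uint64_t); i++) {
        unsigned char expect_bytes[sizeof(uint64_t)];
        uint64_t expect, got = words[i];
        size_t j;

        for (j = 0; j < sizeof(expect_bytes); j++)
            expect_bytes[j] = (unsigned char)(seed + i * sizeof(uint64_t) + j);
        memcpy(&expect, expect_bytes, sizeof(expect));
        bad += (got != expect);
    }
    return bad;
}
```

On system RAM this should always return 0; a non-zero result on the PCIe
mapping would show the problem without any unaligned access being involved.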

> And the granularity of the hazarding in your system, from the CPU
> store buffer to the bus interface to the interconnect buffering to the
> PCIe bridge to the PCIe EP is.. what? Not the same all the way down,
> I'll bet you.

I think hazarding is what goes wrong here, especially with
overlapping unaligned addresses. However, I disagree that it is
impossible to implement this properly on a platform with PCIe so that
Normal NC mappings can be used.

Thanks.

[1] 
https://developer.arm.com/docs/den0029/latest/server-base-system-architecture

-- 
Catalin


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-08 Thread Catalin Marinas
On Fri, Aug 03, 2018 at 01:09:02PM -0400, Mikulas Patocka wrote:
>   while (1) {
>   start = (unsigned)random() % (LEN + 1);
>   end = (unsigned)random() % (LEN + 1);
>   if (start > end)
>   continue;
>   for (i = start; i < end; i++)
>   data[i] = val++;
>   memcpy(map + start, data + start, end - start);
>   if (memcmp(map, data, LEN)) {

It may be worth trying to do a memcmp(map+start, data+start, end-start)
here to see whether the hazard logic fails when the writes are unaligned
but the reads are not.

This problem may as well appear if you do byte writes and read longs
back (and I consider this a hardware problem on this specific board).
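
As a sketch of the first suggestion, the comparison could be split so that
the range just written and the rest of the buffer are reported separately.
The helper name and the two flag values are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { RANGE_BAD = 1, OUTSIDE_BAD = 2 };

/*
 * Compare map and data over [start, end) and, separately, over the rest
 * of the buffer. Returns a bitmask of RANGE_BAD and OUTSIDE_BAD.
 */
int compare_split(const unsigned char *map, const unsigned char *data,
                  size_t len, size_t start, size_t end)
{
    int result = 0;

    assert(start <= end && end <= len);

    /* The narrower check: only the bytes the last memcpy wrote. */
    if (memcmp(map + start, data + start, end - start))
        result |= RANGE_BAD;

    /* Everything before and after the written range. */
    if (memcmp(map, data, start) ||
        memcmp(map + end, data + end, len - end))
        result |= OUTSIDE_BAD;

    return result;
}
```

In the test loop this would replace the single memcmp(map, data, LEN), so
that a failure report says whether the damage is inside the copied range
or next to it.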

-- 
Catalin


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-07 Thread Mikulas Patocka



On Tue, 7 Aug 2018, Ard Biesheuvel wrote:

> On 7 August 2018 at 19:39, Mikulas Patocka  wrote:
> >
> >
> > On Tue, 7 Aug 2018, Marcin Wojtas wrote:
> >
> >> Ard, Mikulas,
> >>
> >> After some self-caused setup issues I was able to run the test on my
> >> MacchiatoBin with the kernel v4.18-rc8. It's been running for 1h+ now,
> >> loading the CPU to 100% and no single error event...
> >>
> >> I built the binary file with:
> >> gcc-linaro-7.2.1-2017.11-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu-gcc
> >>  -O2
> >>
> >> Maybe it's the older firmware issue?
> >
> > I have downloaded and built the firmware recently (it has timestamp Jul 30
> > 2018).
> >
> > Do you still have your firmware file "flash-image.bin" that you used, so
> > that I could try it?
> >
> >> Please send the full bootlog with
> >> the very first line after reset. My board rev is v1.3 and I use
> >> mainline UEFI (newest edk2 + edk2-platforms) + newest publicly
> >> available ARM-TF and earliest firmware for this board.
> >>
> >> Best regards,
> >> Marcin
> >
> 
> Mikulas,
> 
> Is the issue reproducible with an nvidia card + nouveau driver as well ?
> 
> Given the screen corruption i see with radeon even on other arm
> systems, i'd like to ensure that this is a platform bug not a driver
> bug.

I see the same memcpy-to-framebuffer corruption on Radeon HD 6450 and 
nVidia Quadro NVS 285.

3D acceleration on nVidia is slow, but it doesn't have visible glitches.

Mikulas


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-07 Thread Ard Biesheuvel
On 7 August 2018 at 19:39, Mikulas Patocka  wrote:
>
>
> On Tue, 7 Aug 2018, Marcin Wojtas wrote:
>
>> Ard, Mikulas,
>>
>> After some self-caused setup issues I was able to run the test on my
>> MacchiatoBin with the kernel v4.18-rc8. It's been running for 1h+ now,
>> loading the CPU to 100% and no single error event...
>>
>> I built the binary file with:
>> gcc-linaro-7.2.1-2017.11-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu-gcc 
>> -O2
>>
>> Maybe it's the older firmware issue?
>
> I have downloaded and built the firmware recently (it has timestamp Jul 30
> 2018).
>
> Do you still have your firmware file "flash-image.bin" that you used, so
> that I could try it?
>
>> Please send the full bootlog with
>> the very first line after reset. My board rev is v1.3 and I use
>> mainline UEFI (newest edk2 + edk2-platforms) + newest publicly
>> available ARM-TF and earliest firmware for this board.
>>
>> Best regards,
>> Marcin
>

Mikulas,

Is the issue reproducible with an nvidia card + nouveau driver as well?

Given the screen corruption I see with radeon even on other arm
systems, I'd like to ensure that this is a platform bug, not a driver
bug.


> This is my bootlog:
>
> BootROM - 2.03
> Starting CP-0 IOROM 1.07
> Booting from SD 0 (0x29)
> Found valid image at boot postion 0x002
> lNOTICE:  Starting binary extension
> NOTICE:  SVC: SW Revision 0x0. SVC is not supported
> mv_ddr: mv_ddr-devel-18.05.0-g84dd1d9 (Jul 30 2018 - 04:58:51 PM)
> mv_ddr: completed successfully
> NOTICE:  Cold boot
> NOTICE:  Booting Trusted Firmware
> NOTICE:  BL1: v1.4(release):armada-18.05.2:80bbf686
> NOTICE:  BL1: Built : 17:00:18, Jul 30 2018
> NOTICE:  BL1: Booting BL2
> lNOTICE:  BL2: v1.4(release):armada-18.05.2:80bbf686
> NOTICE:  BL2: Built : 17:00:21, Jul 30 2018
> BL2: Initiating SCP_BL2 transfer to SCP
> NOTICE:  SCP_BL2 contains 2 concatenated images
> NOTICE:  Load image to CP1 MSS AP0
> NOTICE:  Loading MSS image from address 0x4023020 Size 0x135c to MSS at 
> 0xf428
> NOTICE:  Done
> NOTICE:  Load image to AP0 MSS
> NOTICE:  Loading MSS image from address 0x402437c Size 0x1f6c to MSS at 
> 0xf058
> N
>
> FreeRTOS 7.3.0 - Marvell cm3 - A8K release armada-18.05.1
>
> OTICE:  Done
> NOTICE:  SCP Image doesn't contain PM firmware
> NOTICE:  BL1: Booting BL31
> lNOTICE:  MSS PM is not supported in this build
> NOTICE:  BL31: v1.4(release):armada-18.05.2:80bbf686
> NOTICE:  BL31: Built : 17:00:21, Jul 30 2018
> lUEFI firmware (version MARVELL_EFI built at 16:50:27 on Jul 30 2018)
>
> Armada 8040 MachiatoBin Platform Init
>
> Comphy0-0: PCIE0 5 Gbps
> Comphy0-1: PCIE0 5 Gbps
> Comphy0-2: PCIE0 5 Gbps
> Comphy0-3: PCIE0 5 Gbps
> Comphy0-4: SFI   10.31 Gbps
> Comphy0-5: SATA1 5 Gbps
>
> Comphy1-0: SGMII1 1.25 Gbps
> Comphy1-1: SATA2 5 Gbps
> Comphy1-2: USB3_HOST0 5 Gbps
> Comphy1-3: SATA3 5 Gbps
> Comphy1-4: SFI   10.31 Gbps
> Comphy1-5: SGMII2 3.125 Gbps
>
> UTMI PHY 0 initialized to USB Host0
> UTMI PHY 1 initialized to USB Host1
> UTMI PHY 2 initialized to USB Host0
> Succesfully installed protocol interfaces
> Error: Image at 000BF6F8000 start failed: 0001
> remove-symbol-file 
> /usr/src/git/macchiato/edk2/Build/Armada80x0McBin-AARCH64/RELEASE_GCC5/AARCH64/MdeModulePkg/Universal/Acpi/AcpiPlatformDxe/AcpiPlatformDxe/DEBUG/AcpiPlatform.dll
>  0xBF6F9000
> Detected w25q32bv SPI flash with page size 256 B, erase size 4 KB, total 4 MB
> ramdisk:blckio install. Status=Success
> Connect: PcieRoot(0x0)/Pci(0x0,0x0): Not Found
> 3h3h3hTianocore/EDK2 firmware version MARVELL_EFI
> Press ESCAPE for boot options ...error: no suitable video mode found.
> error: no video mode activated.
> GNU GRUB  version 2.02~beta3-5
>
> /\||\/
>  Use the ^ and v keys to select which entry is highlighted.
>   Press enter to boot the selected OS, `e' to edit the commands
>   before booting or `c' for a command-line.
> *Debian GNU/Linux
> Advanced options for Debian GNU/Linux   
> System setup
>The highlighted entry will be executed automatically in 5s.
> The highlighted entry will be executed automatically in 4s.   
>   
>   
> Loading Linux 4.17.11 ...
> EFI stub: Booting Linux Kernel...
> EFI stub: Using DTB from configuration table
> EFI stub: Exiting boot services and installing virtual address map...
> [0.00] Booting Linux on physical CPU 0x00 [0x410fd081]
> [0.00] Linux version 4.17.11 (root@leontynka) (gcc version 8.2.0 
> (Debian 8.2.0-2)) #10 SMP 

Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-07 Thread Mikulas Patocka



On Tue, 7 Aug 2018, Marcin Wojtas wrote:

> Ard, Mikulas,
> 
> After some self-caused setup issues I was able to run the test on my
> MacchiatoBin with the kernel v4.18-rc8. It's been running for 1h+ now,
> loading the CPU to 100% and no single error event...
> 
> I built the binary file with:
> gcc-linaro-7.2.1-2017.11-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu-gcc 
> -O2
> 
> Maybe it's the older firmware issue?

I have downloaded and built the firmware recently (it has timestamp Jul 30 
2018).

Do you still have your firmware file "flash-image.bin" that you used, so 
that I could try it?

> Please send the full bootlog with
> the very first line after reset. My board rev is v1.3 and I use
> mainline UEFI (newest edk2 + edk2-platforms) + newest publicly
> available ARM-TF and earliest firmware for this board.
> 
> Best regards,
> Marcin

This is my bootlog:

BootROM - 2.03
Starting CP-0 IOROM 1.07
Booting from SD 0 (0x29)
Found valid image at boot postion 0x002
lNOTICE:  Starting binary extension
NOTICE:  SVC: SW Revision 0x0. SVC is not supported
mv_ddr: mv_ddr-devel-18.05.0-g84dd1d9 (Jul 30 2018 - 04:58:51 PM)
mv_ddr: completed successfully
NOTICE:  Cold boot
NOTICE:  Booting Trusted Firmware
NOTICE:  BL1: v1.4(release):armada-18.05.2:80bbf686
NOTICE:  BL1: Built : 17:00:18, Jul 30 2018
NOTICE:  BL1: Booting BL2
lNOTICE:  BL2: v1.4(release):armada-18.05.2:80bbf686
NOTICE:  BL2: Built : 17:00:21, Jul 30 2018
BL2: Initiating SCP_BL2 transfer to SCP
NOTICE:  SCP_BL2 contains 2 concatenated images
NOTICE:  Load image to CP1 MSS AP0
NOTICE:  Loading MSS image from address 0x4023020 Size 0x135c to MSS at 
0xf428
NOTICE:  Done
NOTICE:  Load image to AP0 MSS
NOTICE:  Loading MSS image from address 0x402437c Size 0x1f6c to MSS at 
0xf058
N

FreeRTOS 7.3.0 - Marvell cm3 - A8K release armada-18.05.1

OTICE:  Done
NOTICE:  SCP Image doesn't contain PM firmware
NOTICE:  BL1: Booting BL31
lNOTICE:  MSS PM is not supported in this build
NOTICE:  BL31: v1.4(release):armada-18.05.2:80bbf686
NOTICE:  BL31: Built : 17:00:21, Jul 30 2018
lUEFI firmware (version MARVELL_EFI built at 16:50:27 on Jul 30 2018)

Armada 8040 MachiatoBin Platform Init

Comphy0-0: PCIE0 5 Gbps
Comphy0-1: PCIE0 5 Gbps
Comphy0-2: PCIE0 5 Gbps
Comphy0-3: PCIE0 5 Gbps
Comphy0-4: SFI   10.31 Gbps
Comphy0-5: SATA1 5 Gbps

Comphy1-0: SGMII1 1.25 Gbps
Comphy1-1: SATA2 5 Gbps
Comphy1-2: USB3_HOST0 5 Gbps
Comphy1-3: SATA3 5 Gbps
Comphy1-4: SFI   10.31 Gbps
Comphy1-5: SGMII2 3.125 Gbps

UTMI PHY 0 initialized to USB Host0
UTMI PHY 1 initialized to USB Host1
UTMI PHY 2 initialized to USB Host0
Succesfully installed protocol interfaces
Error: Image at 000BF6F8000 start failed: 0001
remove-symbol-file 
/usr/src/git/macchiato/edk2/Build/Armada80x0McBin-AARCH64/RELEASE_GCC5/AARCH64/MdeModulePkg/Universal/Acpi/AcpiPlatformDxe/AcpiPlatformDxe/DEBUG/AcpiPlatform.dll
 0xBF6F9000
Detected w25q32bv SPI flash with page size 256 B, erase size 4 KB, total 4 MB
ramdisk:blckio install. Status=Success
Connect: PcieRoot(0x0)/Pci(0x0,0x0): Not Found
3h3h3hTianocore/EDK2 firmware version MARVELL_EFI
Press ESCAPE for boot options ...error: no suitable video mode found.
error: no video mode activated.
GNU GRUB  version 2.02~beta3-5

 Use the ^ and v keys to select which entry is highlighted.
 Press enter to boot the selected OS, `e' to edit the commands
 before booting or `c' for a command-line.
*Debian GNU/Linux
 Advanced options for Debian GNU/Linux
 System setup
 The highlighted entry will be executed automatically in 5s.
 The highlighted entry will be executed automatically in 4s.

Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-07 Thread Marcin Wojtas
Ard, Mikulas,

On Mon, 6 Aug 2018 at 22:11, Ard Biesheuvel wrote:
>
> On 6 August 2018 at 21:54, Mikulas Patocka  wrote:
> >
> >
> > On Mon, 6 Aug 2018, Ard Biesheuvel wrote:
> >
> >> On 6 August 2018 at 19:09, Mikulas Patocka  wrote:
> >> >
> >> >
> >> > On Mon, 6 Aug 2018, Ard Biesheuvel wrote:
> >> >
> >> >> On 6 August 2018 at 14:42, Robin Murphy  wrote:
> >> >> > On 06/08/18 11:25, Mikulas Patocka wrote:
> >> >> > [...]
> >> >> >>>
> >> >> >>> None of this explains why some transactions fail to make it across
> >> >> >>> entirely. The overlapping writes in question write the same data to
> >> >> >>> the memory locations that are covered by both, and so the ordering in
> >> >> >>> which the transactions are received should not affect the outcome.
> >> >> >>
> >> >> >>
> >> >> >> You're right that the corruption couldn't be explained just by reordering
> >> >> >> writes. My hypothesis is that the PCIe controller tries to disambiguate
> >> >> >> the overlapping writes, but the disambiguation logic was not tested and it
> >> >> >> is buggy. If there's a barrier between the overlapping writes, the PCIe
> >> >> >> controller won't see any overlapping writes, so it won't trigger the
> >> >> >> faulty disambiguation logic and it works.
> >> >> >>
> >> >> >> Could the ARM engineers look if there's some chicken bit in Cortex-A72
> >> >> >> that could insert barriers between non-cached writes automatically?
> >> >> >
> >> >> >
> >> >> > I don't think there is, and even if there was I imagine it would have a
> >> >> > pretty hideous effect on non-coherent DMA buffers and the various other
> >> >> > places in which we have Normal-NC mappings of actual system RAM.
> >> >> >
> >> >>
> >> >> Looking at the A72 manual, there is one chicken bit that looks like it
> >> >> may be related:
> >> >>
> >> >> CPUACTLR_EL1 bit #50:
> >> >>
> >> >> 0 Enables store streaming on NC/GRE memory type. This is the reset value.
> >> >> 1 Disables store streaming on NC/GRE memory type.
> >> >>
> >> >> so putting something like
> >> >>
> >> >> mrs x0, S3_1_C15_C2_0
> >> >> orr x0, x0, #(1 << 50)
> >> >> msr S3_1_C15_C2_0, x0
> >> >>
> >> >> in __cpu_setup() would be worth a try.
> >> >
> >> > It won't boot.
> >> >
> >> > But if I write the same value that was read, it also won't boot.
> >> >
> >> > I created a simple kernel module that reads this register and it has bit
> >> > 32 set, all other bits clear. But when I write the same value into it, the
> >> > core that does the write is stuck in an infinite loop.
> >> >
> >> > So, it seems that we are writing this register from the wrong place.
> >> >
> >>
> >> Ah, my bad. I didn't look closely enough at the description:
> >>
> >> """
> >> The accessibility to the CPUACTLR_EL1 by Exception level is:
> >>
> >> EL0  -
> >> EL1(NS)  RW (a)
> >> EL1(S)   RW (a)
> >> EL2  RW (b)
> >> EL3(SCR.NS = 1)  RW
> >> EL3(SCR.NS = 0)  RW
> >>
> >> (a) Write access if ACTLR_EL3.CPUACTLR is 1 and ACTLR_EL2.CPUACTLR is
> >> 1, or ACTLR_EL3.CPUACTLR is 1 and SCR.NS is 0.
> >> """
> >>
> >> so you'll have to do this from ARM Trusted Firmware. If you're
> >> comfortable rebuilding that:
> >>
> >> diff --git a/include/lib/cpus/aarch64/cortex_a72.h
> >> b/include/lib/cpus/aarch64/cortex_a72.h
> >> index bfd64918625b..a7b8cf4be0c6 100644
> >> --- a/include/lib/cpus/aarch64/cortex_a72.h
> >> +++ b/include/lib/cpus/aarch64/cortex_a72.h
> >> @@ -31,6 +31,7 @@
> >>  #define CORTEX_A72_ACTLR_EL1   S3_1_C15_C2_0
> >>
> >>  #define CORTEX_A72_ACTLR_DISABLE_L1_DCACHE_HW_PFTCH (1 << 56)
> >> +#define CORTEX_A72_ACTLR_DIS_NC_GRE_STORE_STREAMING (1 << 50)
> >>  #define CORTEX_A72_ACTLR_NO_ALLOC_WBWA (1 << 49)
> >>  #define CORTEX_A72_ACTLR_DCC_AS_DCCI   (1 << 44)
> >>  #define CORTEX_A72_ACTLR_EL1_DIS_INSTR_PREFETCH (1 << 32)
> >> diff --git a/lib/cpus/aarch64/cortex_a72.S b/lib/cpus/aarch64/cortex_a72.S
> >> index 55e508678284..5914d6ee3ba6 100644
> >> --- a/lib/cpus/aarch64/cortex_a72.S
> >> +++ b/lib/cpus/aarch64/cortex_a72.S
> >> @@ -133,6 +133,15 @@ func cortex_a72_reset_func
> >>         orr x0, x0, #CORTEX_A72_ECTLR_SMP_BIT
> >>         msr CORTEX_A72_ECTLR_EL1, x0
> >>         isb
> >> +
> >> +       /* -
> >> +        * Disables store streaming on NC/GRE memory type.
> >> +        * -
> >> +        */
> >> +       mrs x0, CORTEX_A72_ACTLR_EL1
> >> +       orr x0, x0, #CORTEX_A72_ACTLR_DIS_NC_GRE_STORE_STREAMING
> >> +       msr CORTEX_A72_ACTLR_EL1, x0
> >> +       isb
> >>         ret x19
> >>  endfunc cortex_a72_reset_func
> >
> > > Unfortunately, it doesn't work. I verified that the bit is set after
> > booting Linux, but the memcpy corruption was still present.
> >
> > I also tried the other chicken bits, 

Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-07 Thread Ard Biesheuvel
On 7 August 2018 at 16:14, Mikulas Patocka  wrote:
>
>
> On Mon, 6 Aug 2018, Ard Biesheuvel wrote:
>
>> No that works fine for me. VDPAU acceleration works as well, but it
>> depends on your chromium build whether it can actually use it, I
>> think? In any case, mplayer can use vdpau to play 1080p h264 without
>> breaking a sweat on this system.
>>
>> Note that the VDPAU driver also relies on memory semantics, i.e., it
>> may use DC ZVA (zero cacheline) instructions which are not permitted
>> on device mappings. This is probably just glibc's memset() being
>> invoked, but I remember hitting this on another PCIe-impaired arm64
>> system with Synopsys PCIe IP
>
> DC ZVA can be disabled with the SCTLR_EL1.DZE bit, so that neither kernel
> nor userspace will use it.

Of course, but only the OS can do that, and only system wide unless
we're eager to create infrastructure for managing this per process.

But it is also beside the point: I mentioned it to illustrate that
even use cases like libvdpau that don't operate on the 'framebuffer'
abstraction make assumptions about VRAM having true memory semantics.

> If the mapping didn't support unaligned writes,
> it would be worse.
>
> Mikulas


RE: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-07 Thread David Laight
From: Mikulas Patocka
> Sent: 07 August 2018 15:07
...
> Unaccelerated scrolling is still painfully slow
> even on modern computers because of slow framebuffer read.

I solved that many years ago on a strongarm system by mapping
the screen memory at two separate virtual addresses.
One uncached used for writes, the second cached using the
'minicache' for reads.
(and immediately fell foul of a memcpy() function that compared
the two virtual addresses and decided to copy backwards)

I suspect some modern cpus don't like you doing that and the
graphics 'drivers' won't use different mappings.

Even in glibc you want a more general copy_to/from_io_memory()
rather than just 'copy_from_framebuffer()'.
Best to define both - even if they end up identical.
Other drivers allow PCIe space to be mmap()ed into user space.

While your tests show vmovntdqa being slightly slower than an
avx read for uncached mappings it is still much better than
all the other options.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-07 Thread Mikulas Patocka



On Mon, 6 Aug 2018, Ard Biesheuvel wrote:

> No that works fine for me. VDPAU acceleration works as well, but it
> depends on your chromium build whether it can actually use it, I
> think? In any case, mplayer can use vdpau to play 1080p h264 without
> breaking a sweat on this system.
> 
> Note that the VDPAU driver also relies on memory semantics, i.e., it
> may use DC ZVA (zero cacheline) instructions which are not permitted
> on device mappings. This is probably just glibc's memset() being
> invoked, but I remember hitting this on another PCIe-impaired arm64
> system with Synopsys PCIe IP

DC ZVA can be disabled with the SCTLR_EL1.DZE bit, so that neither kernel 
nor userspace will use it. If the mapping didn't support unaligned writes, 
it would be worse.

Mikulas


RE: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-07 Thread Mikulas Patocka



On Mon, 6 Aug 2018, David Laight wrote:

> From: Mikulas Patocka
> > Sent: 05 August 2018 15:36
> > To: David Laight
> ...
> > There's an instruction movntdqa (and vmovntdqa) that can actually do
> > prefetch on write-combining memory type. It's the only instruction that
> > can do it.
> > 
> > If this instruction is used on a non-write-combining memory type, it behaves
> > like movdqa.
> > 
> ...
> > I benchmarked it on a processor with ERMS - for writes to the framebuffer,
> > there's no difference between memcpy, 8-byte writes, rep stosb, rep stosq,
> > mmx, sse, avx - all of these methods achieve 16-17 GB/s
> 
> The combination of write-combining, posted writes and a fast PCIe slave
> are probably why there is little difference.
> 
> > For reading from the framebuffer:
> >  323 MB/s - memcpy (using avx2)
> >   91 MB/s - explicit 8-byte reads
> >  249 MB/s - rep movsq
> >  307 MB/s - rep movsb
> 
> You must be getting the ERMS hardware optimised 'rep movsb'.
> 
> >   90 MB/s - mmx
> >  176 MB/s - sse
> > 4750 MB/s - sse movntdqa
> >  330 MB/s - avx
> 
> avx512 is probably faster still.
> 
> > 5369 MB/s - avx vmovntdqa
> > 
> > So - it may make sense to introduce a function memcpy_from_framebuffer()
> > that uses movntdqa or vmovntdqa on CPUs that support it.
> 
> For kernel space it ought to be just memcpy_fromio().

I meant for userspace. Unaccelerated scrolling is still painfully slow 
even on modern computers because of slow framebuffer read. If glibc 
provided a function memcpy_from_framebuffer() that used movntdqa and the 
fbdev Xorg driver used it, it would help the users who use unaccelerated 
drivers for some reason.

> Can you easily repeat the tests using a non-write-combining map of the
> same PCIe slave?

I mapped the framebuffer as uncached and these are the results:

reading from the framebuffer:
318 MB/s - memcpy
 74 MB/s - explicit 8-byte reads
 73 MB/s - rep movsq
 11 MB/s - rep movsb
 87 MB/s - mmx
173 MB/s - sse
173 MB/s - sse movntdqa
323 MB/s - avx
284 MB/s - avx vmovntdqa

zeroing the framebuffer:
 19 MB/s - memset
154 MB/s - explicit 8-byte writes
152 MB/s - rep stosq
 19 MB/s - rep stosb
152 MB/s - mmx
306 MB/s - sse
621 MB/s - avx

copying data to the framebuffer:
618 MB/s - memcpy (using avx2)
152 MB/s - explicit 8-byte writes
139 MB/s - rep movsq
 17 MB/s - rep movsb
154 MB/s - mmx
305 MB/s - sse
306 MB/s - sse movntdqa
619 MB/s - avx
619 MB/s - avx vmovntdqa

> I can probably run the same measurements against our rather leisurely
> FPGA based PCIe slave.
> IIRC PCIe reads happen every 128 clocks of the card's 62.5MHz clock,
> increasing the size of the registers makes a significant difference.
> I've not tried mapping write-combining and using (v)movntdqa.
> I'm not sure what effect write-combining would have if the whole BAR
> were mapped that way - so I'll either have to map the physical addresses
> twice or add in another BAR.
> 
>   David

Mikulas


RE: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-07 Thread Mikulas Patocka



On Mon, 6 Aug 2018, David Laight wrote:

> From: Mikulas Patocka
> > Sent: 05 August 2018 15:36
> > To: David Laight
> ...
> > There's an instruction movntdqa (and vmovntdqa) that can actually do
> > prefetch on write-combining memory type. It's the only instruction that
> > can do it.
> > 
> > It this instruction is used on non-write-combining memory type, it behaves
> > like movdqa.
> > 
> ...
> > I benchmarked it on a processor with ERMS - for writes to the framebuffer,
> > there's no difference between memcpy, 8-byte writes, rep stosb, rep stosq,
> > mmx, sse, avx - all this method achieve 16-17 GB/s
> 
> The combination of write-combining, posted writes and a fast PCIe slave
> are probably why there is little difference.
> 
> > For reading from the framebuffer:
> >  323 MB/s - memcpy (using avx2)
> >   91 MB/s - explicit 8-byte reads
> >  249 MB/s - rep movsq
> >  307 MB/s - rep movsb
> 
> You must be getting the ERMS hardware optimised 'rep movsb'.
> 
> >   90 MB/s - mmx
> >  176 MB/s - sse
> > 4750 MB/s - sse movntdqa
> >  330 MB/s - avx
> 
> avx512 is probably faster still.
> 
> > 5369 MB/s - avx vmovntdqa
> > 
> > So - it may make sense to introduce a function memcpy_from_framebuffer()
> > that uses movntdqa or vmovntdqa on CPUs that support it.
> 
> For kernel space it ought to be just memcpy_fromio().

I meant for userspace. Unaccelerated scrolling is still painfully slow 
even on modern computers because of slow framebuffer read. If glibc 
provided a function memcpy_from_framebuffer() that used movntdqa and the 
fbdev Xorg driver used it, it would help the users who use unaccelerated 
drivers for some reason.

> Can you easily repeat the tests using a non-write-combining map of the
> same PCIe slave?

I mapped the framebuffer as uncached and these are the results:

reading from the framebuffer:
318 MB/s - memcpy
 74 MB/s - explicit 8-byte reads
 73 MB/s - rep movsq
 11 MB/s - rep movsb
 87 MB/s - mmx
173 MB/s - sse
173 MB/s - sse movntdqa
323 MB/s - avx
284 MB/s - avx vmovntdqa

zeroing the framebuffer:
 19 MB/s - memset
154 MB/s - explicit 8-byte writes
152 MB/s - rep stosq
 19 MB/s - rep stosb
152 MB/s - mmx
306 MB/s - sse
621 MB/s - avx

copying data to the framebuffer:
618 MB/s - memcpy (using avx2)
152 MB/s - explicit 8-byte writes
139 MB/s - rep movsq
 17 MB/s - rep movsb
154 MB/s - mmx
305 MB/s - sse
306 MB/s - sse movntdqa
619 MB/s - avx
619 MB/s - avx vmovntdqa
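
For anyone who wants to reproduce figures of this kind, the following is a rough sketch of how a MB/s number could be obtained. It is not the program used for the results above; the function names are made up for illustration, 1 MB is taken as 10^6 bytes, and the buffers are passed in by the caller (for a real test, one of them would be an mmap() of the framebuffer).

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Convert a byte count and an elapsed time to MB/s (1 MB = 10^6 bytes here). */
double throughput_mb_s(size_t bytes, double seconds)
{
	return (double)bytes / 1e6 / seconds;
}

/* Time `iterations` copies of `len` bytes and return the elapsed seconds. */
double time_copies(void *dst, const void *src, size_t len, unsigned iterations)
{
	struct timespec t0, t1;
	unsigned i;

	timespec_get(&t0, TIME_UTC);
	for (i = 0; i < iterations; i++) {
		memcpy(dst, src, len);
		/* Compiler barrier so the copies are not optimised away. */
		__asm__ volatile("" : : "r"(dst) : "memory");
	}
	timespec_get(&t1, TIME_UTC);

	return (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

The throughput is then throughput_mb_s(len * iterations, elapsed). Replacing the memcpy() call with a different copy loop gives the per-method rows.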

> I can probably run the same measurements against our rather leisurely
> FPGA based PCIe slave.
> IIRC PCIe reads happen every 128 clocks of the card's 62.5MHz clock,
> increasing the size of the registers makes a significant difference.
> I've not tried mapping write-combining and using (v)movntdqa.
> I'm not sure what effect write-combining would have if the whole BAR
> were mapped that way - so I'll either have to map the physical addresses
> twice or add in another BAR.
> 
>   David

Mikulas
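
As a back-of-the-envelope check on David's FPGA numbers: if one read really completes every 128 cycles of a 62.5 MHz clock (he hedges this with "IIRC"), that is about 488,281 reads per second, so throughput scales directly with the read size: roughly 3.9 MB/s for 8-byte reads, 7.8 MB/s for 16-byte reads and 15.6 MB/s for 32-byte reads. The helper below just encodes that arithmetic; its name is made up for illustration.

```c
/*
 * Read bandwidth in MB/s (1 MB = 10^6 bytes) if one read of
 * bytes_per_read completes every clocks_per_read cycles of clock_hz.
 */
double read_bandwidth_mb_s(double clock_hz, unsigned clocks_per_read,
			   unsigned bytes_per_read)
{
	double reads_per_second = clock_hz / clocks_per_read;

	return reads_per_second * bytes_per_read / 1e6;
}
```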


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-06 Thread Mikulas Patocka



On Mon, 6 Aug 2018, Ard Biesheuvel wrote:

> > Unfortunately, it doesn't work. I verified that the bit is set after
> > booting Linux, but the memcpy corruption was still present.
> >
> > I also tried the other chicken bits, it slowed down the system noticeably,
> > but had no effect on the memcpy corruption.
> >
> 
> OK, it was worth a shot
> 
> Let's wait and see if Marcin has any results.

BTW, is there documentation for that DesignWare PCIe controller somewhere?

Mikulas


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-06 Thread Ard Biesheuvel
On 6 August 2018 at 21:54, Mikulas Patocka  wrote:
>
>
> On Mon, 6 Aug 2018, Ard Biesheuvel wrote:
>
>> On 6 August 2018 at 19:09, Mikulas Patocka  wrote:
>> >
>> >
>> > On Mon, 6 Aug 2018, Ard Biesheuvel wrote:
>> >
>> >> On 6 August 2018 at 14:42, Robin Murphy  wrote:
>> >> > On 06/08/18 11:25, Mikulas Patocka wrote:
>> >> > [...]
>> >> >>>
>> >> >>> None of this explains why some transactions fail to make it across
>> >> >>> entirely. The overlapping writes in question write the same data to
>> >> >>> the memory locations that are covered by both, and so the ordering in
>> >> >>> which the transactions are received should not affect the outcome.
>> >> >>
>> >> >>
>> >> >> You're right that the corruption couldn't be explained just by reordering
>> >> >> writes. My hypothesis is that the PCIe controller tries to disambiguate
>> >> >> the overlapping writes, but the disambiguation logic was not tested and it
>> >> >> is buggy. If there's a barrier between the overlapping writes, the PCIe
>> >> >> controller won't see any overlapping writes, so it won't trigger the
>> >> >> faulty disambiguation logic and it works.
>> >> >>
>> >> >> Could the ARM engineers look if there's some chicken bit in Cortex-A72
>> >> >> that could insert barriers between non-cached writes automatically?
>> >> >
>> >> >
>> >> > I don't think there is, and even if there was I imagine it would have a
>> >> > pretty hideous effect on non-coherent DMA buffers and the various other
>> >> > places in which we have Normal-NC mappings of actual system RAM.
>> >> >
>> >>
>> >> Looking at the A72 manual, there is one chicken bit that looks like it
>> >> may be related:
>> >>
>> >> CPUACTLR_EL1 bit #50:
>> >>
>> >> 0 Enables store streaming on NC/GRE memory type. This is the reset value.
>> >> 1 Disables store streaming on NC/GRE memory type.
>> >>
>> >> so putting something like
>> >>
>> >> mrs x0, S3_1_C15_C2_0
>> >> orr x0, x0, #(1 << 50)
>> >> msr S3_1_C15_C2_0, x0
>> >>
>> >> in __cpu_setup() would be worth a try.
>> >
>> > It won't boot.
>> >
>> > But if I write the same value that was read, it also won't boot.
>> >
>> > I created a simple kernel module that reads this register and it has bit
>> > 32 set, all other bits clear. But when I write the same value into it, the
>> > core that does the write is stuck in infinite loop.
>> >
>> > So, it seems that we are writing this register from a wrong place.
>> >
>>
>> Ah, my bad. I didn't look closely enough at the description:
>>
>> """
>> The accessibility to the CPUACTLR_EL1 by Exception level is:
>>
>> EL0  -
>> EL1(NS)  RW (a)
>> EL1(S)   RW (a)
>> EL2  RW (b)
>> EL3(SCR.NS = 1)  RW
>> EL3(SCR.NS = 0)  RW
>>
>> (a) Write access if ACTLR_EL3.CPUACTLR is 1 and ACTLR_EL2.CPUACTLR is
>> 1, or ACTLR_EL3.CPUACTLR is 1 and SCR.NS is 0.
>> """
>>
>> so you'll have to do this from ARM Trusted Firmware. If you're
>> comfortable rebuilding that:
>>
>> diff --git a/include/lib/cpus/aarch64/cortex_a72.h
>> b/include/lib/cpus/aarch64/cortex_a72.h
>> index bfd64918625b..a7b8cf4be0c6 100644
>> --- a/include/lib/cpus/aarch64/cortex_a72.h
>> +++ b/include/lib/cpus/aarch64/cortex_a72.h
>> @@ -31,6 +31,7 @@
>>  #define CORTEX_A72_ACTLR_EL1   S3_1_C15_C2_0
>>
>>  #define CORTEX_A72_ACTLR_DISABLE_L1_DCACHE_HW_PFTCH(1 << 56)
>> +#define CORTEX_A72_ACTLR_DIS_NC_GRE_STORE_STREAMING(1 << 50)
>>  #define CORTEX_A72_ACTLR_NO_ALLOC_WBWA (1 << 49)
>>  #define CORTEX_A72_ACTLR_DCC_AS_DCCI   (1 << 44)
>>  #define CORTEX_A72_ACTLR_EL1_DIS_INSTR_PREFETCH(1 << 32)
>> diff --git a/lib/cpus/aarch64/cortex_a72.S b/lib/cpus/aarch64/cortex_a72.S
>> index 55e508678284..5914d6ee3ba6 100644
>> --- a/lib/cpus/aarch64/cortex_a72.S
>> +++ b/lib/cpus/aarch64/cortex_a72.S
>> @@ -133,6 +133,15 @@ func cortex_a72_reset_func
>> orr x0, x0, #CORTEX_A72_ECTLR_SMP_BIT
>> msr CORTEX_A72_ECTLR_EL1, x0
>> isb
>> +
>> +   /* -
>> +* Disables store streaming on NC/GRE memory type.
>> +* -
>> +*/
>> +   mrs x0, CORTEX_A72_ACTLR_EL1
>> +   orr x0, x0, #CORTEX_A72_ACTLR_DIS_NC_GRE_STORE_STREAMING
>> +   msr CORTEX_A72_ACTLR_EL1, x0
>> +   isb
>> ret x19
>>  endfunc cortex_a72_reset_func
>
> Unfortunately, it doesn't work. I verified that the bit is set after
> booting Linux, but the memcpy corruption was still present.
>
> I also tried the other chicken bits, it slowed down the system noticeably,
> but had no effect on the memcpy corruption.
>

OK, it was worth a shot

Let's wait and see if Marcin has any results.


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-06 Thread Mikulas Patocka



On Mon, 6 Aug 2018, Ard Biesheuvel wrote:

> On 6 August 2018 at 19:09, Mikulas Patocka  wrote:
> >
> >
> > On Mon, 6 Aug 2018, Ard Biesheuvel wrote:
> >
> >> On 6 August 2018 at 14:42, Robin Murphy  wrote:
> >> > On 06/08/18 11:25, Mikulas Patocka wrote:
> >> > [...]
> >> >>>
> >> >>> None of this explains why some transactions fail to make it across
> >> >>> entirely. The overlapping writes in question write the same data to
> >> >>> the memory locations that are covered by both, and so the ordering in
> >> >>> which the transactions are received should not affect the outcome.
> >> >>
> >> >>
> > >> >> You're right that the corruption couldn't be explained just by reordering
> > >> >> writes. My hypothesis is that the PCIe controller tries to disambiguate
> > >> >> the overlapping writes, but the disambiguation logic was not tested and it
> > >> >> is buggy. If there's a barrier between the overlapping writes, the PCIe
> >> >> controller won't see any overlapping writes, so it won't trigger the
> >> >> faulty disambiguation logic and it works.
> >> >>
> >> >> Could the ARM engineers look if there's some chicken bit in Cortex-A72
> >> >> that could insert barriers between non-cached writes automatically?
> >> >
> >> >
> >> > I don't think there is, and even if there was I imagine it would have a
> >> > pretty hideous effect on non-coherent DMA buffers and the various other
> >> > places in which we have Normal-NC mappings of actual system RAM.
> >> >
> >>
> >> Looking at the A72 manual, there is one chicken bit that looks like it
> >> may be related:
> >>
> >> CPUACTLR_EL1 bit #50:
> >>
> >> 0 Enables store streaming on NC/GRE memory type. This is the reset value.
> >> 1 Disables store streaming on NC/GRE memory type.
> >>
> >> so putting something like
> >>
> >> mrs x0, S3_1_C15_C2_0
> >> orr x0, x0, #(1 << 50)
> >> msr S3_1_C15_C2_0, x0
> >>
> >> in __cpu_setup() would be worth a try.
> >
> > It won't boot.
> >
> > > But if I write the same value that was read, it also won't boot.
> >
> > I created a simple kernel module that reads this register and it has bit
> > 32 set, all other bits clear. But when I write the same value into it, the
> > core that does the write is stuck in infinite loop.
> >
> > So, it seems that we are writing this register from a wrong place.
> >
> 
> Ah, my bad. I didn't look closely enough at the description:
> 
> """
> The accessibility to the CPUACTLR_EL1 by Exception level is:
> 
> EL0  -
> EL1(NS)  RW (a)
> EL1(S)   RW (a)
> EL2  RW (b)
> EL3(SCR.NS = 1)  RW
> EL3(SCR.NS = 0)  RW
> 
> (a) Write access if ACTLR_EL3.CPUACTLR is 1 and ACTLR_EL2.CPUACTLR is
> 1, or ACTLR_EL3.CPUACTLR is 1 and SCR.NS is 0.
> """
> 
> so you'll have to do this from ARM Trusted Firmware. If you're
> comfortable rebuilding that:
> 
> diff --git a/include/lib/cpus/aarch64/cortex_a72.h
> b/include/lib/cpus/aarch64/cortex_a72.h
> index bfd64918625b..a7b8cf4be0c6 100644
> --- a/include/lib/cpus/aarch64/cortex_a72.h
> +++ b/include/lib/cpus/aarch64/cortex_a72.h
> @@ -31,6 +31,7 @@
>  #define CORTEX_A72_ACTLR_EL1   S3_1_C15_C2_0
> 
>  #define CORTEX_A72_ACTLR_DISABLE_L1_DCACHE_HW_PFTCH(1 << 56)
> +#define CORTEX_A72_ACTLR_DIS_NC_GRE_STORE_STREAMING(1 << 50)
>  #define CORTEX_A72_ACTLR_NO_ALLOC_WBWA (1 << 49)
>  #define CORTEX_A72_ACTLR_DCC_AS_DCCI   (1 << 44)
>  #define CORTEX_A72_ACTLR_EL1_DIS_INSTR_PREFETCH(1 << 32)
> diff --git a/lib/cpus/aarch64/cortex_a72.S b/lib/cpus/aarch64/cortex_a72.S
> index 55e508678284..5914d6ee3ba6 100644
> --- a/lib/cpus/aarch64/cortex_a72.S
> +++ b/lib/cpus/aarch64/cortex_a72.S
> @@ -133,6 +133,15 @@ func cortex_a72_reset_func
> orr x0, x0, #CORTEX_A72_ECTLR_SMP_BIT
> msr CORTEX_A72_ECTLR_EL1, x0
> isb
> +
> +   /* -
> +* Disables store streaming on NC/GRE memory type.
> +* -
> +*/
> +   mrs x0, CORTEX_A72_ACTLR_EL1
> +   orr x0, x0, #CORTEX_A72_ACTLR_DIS_NC_GRE_STORE_STREAMING
> +   msr CORTEX_A72_ACTLR_EL1, x0
> +   isb
> ret x19
>  endfunc cortex_a72_reset_func

Unfortunately, it doesn't work. I verified that the bit is set after
booting Linux, but the memcpy corruption was still present.

I also tried the other chicken bits; they slowed the system down noticeably
but had no effect on the memcpy corruption.

Mikulas
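
For readers following along, the "overlapping writes" at issue can be modelled in plain C. This is only a model of the store pattern that the thread attributes to glibc's arm64 memcpy for medium-sized copies, not glibc's actual code, and copy_32_to_64() is a made-up name. Four 16-byte chunks are loaded up front and then stored in the order B, C, A, D; for any length below 64 the chunks taken from the end overlap the chunks taken from the start, and the overlapping bytes carry identical data.

```c
#include <stddef.h>
#include <string.h>

/*
 * Model of a 32..64 byte copy built from four 16-byte chunks.
 * Returns 0 on success, -1 if n is outside the supported range.
 */
int copy_32_to_64(void *dst, const void *src, size_t n)
{
	unsigned char a[16], b[16], c[16], d[16];
	unsigned char *dp = dst;
	const unsigned char *sp = src;

	if (n < 32 || n > 64)
		return -1;

	/* Load everything first, as the assembly does with its register pairs. */
	memcpy(a, sp, 16);		/* bytes 0..15 */
	memcpy(b, sp + 16, 16);		/* bytes 16..31 */
	memcpy(c, sp + n - 32, 16);	/* bytes n-32..n-17 */
	memcpy(d, sp + n - 16, 16);	/* bytes n-16..n-1 */

	/* Four stores; for n < 64 the C and D stores overlap A and B. */
	memcpy(dp + 16, b, 16);
	memcpy(dp + n - 32, c, 16);
	memcpy(dp, a, 16);
	memcpy(dp + n - 16, d, 16);
	return 0;
}
```

On ordinary memory the result is identical to memcpy() regardless of the order in which the four stores land, which is why reordering alone cannot explain the corruption.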


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-06 Thread Ard Biesheuvel
On 6 August 2018 at 19:09, Mikulas Patocka  wrote:
>
>
> On Mon, 6 Aug 2018, Ard Biesheuvel wrote:
>
>> On 6 August 2018 at 14:42, Robin Murphy  wrote:
>> > On 06/08/18 11:25, Mikulas Patocka wrote:
>> > [...]
>> >>>
>> >>> None of this explains why some transactions fail to make it across
>> >>> entirely. The overlapping writes in question write the same data to
>> >>> the memory locations that are covered by both, and so the ordering in
>> >>> which the transactions are received should not affect the outcome.
>> >>
>> >>
>> >> You're right that the corruption couldn't be explained just by reordering
>> >> writes. My hypothesis is that the PCIe controller tries to disambiguate
>> >> the overlapping writes, but the disambiguation logic was not tested and it
>> >> is buggy. If there's a barrier between the overlapping writes, the PCIe
>> >> controller won't see any overlapping writes, so it won't trigger the
>> >> faulty disambiguation logic and it works.
>> >>
>> >> Could the ARM engineers look if there's some chicken bit in Cortex-A72
>> >> that could insert barriers between non-cached writes automatically?
>> >
>> >
>> > I don't think there is, and even if there was I imagine it would have a
>> > pretty hideous effect on non-coherent DMA buffers and the various other
>> > places in which we have Normal-NC mappings of actual system RAM.
>> >
>>
>> Looking at the A72 manual, there is one chicken bit that looks like it
>> may be related:
>>
>> CPUACTLR_EL1 bit #50:
>>
>> 0 Enables store streaming on NC/GRE memory type. This is the reset value.
>> 1 Disables store streaming on NC/GRE memory type.
>>
>> so putting something like
>>
>> mrs x0, S3_1_C15_C2_0
>> orr x0, x0, #(1 << 50)
>> msr S3_1_C15_C2_0, x0
>>
>> in __cpu_setup() would be worth a try.
>
> It won't boot.
>
> But if I write the same value that was read, it also won't boot.
>
> I created a simple kernel module that reads this register and it has bit
> 32 set, all other bits clear. But when I write the same value into it, the
> core that does the write is stuck in infinite loop.
>
> So, it seems that we are writing this register from a wrong place.
>

Ah, my bad. I didn't look closely enough at the description:

"""
The accessibility to the CPUACTLR_EL1 by Exception level is:

EL0  -
EL1(NS)  RW (a)
EL1(S)   RW (a)
EL2  RW (b)
EL3(SCR.NS = 1)  RW
EL3(SCR.NS = 0)  RW

(a) Write access if ACTLR_EL3.CPUACTLR is 1 and ACTLR_EL2.CPUACTLR is
1, or ACTLR_EL3.CPUACTLR is 1 and SCR.NS is 0.
"""

so you'll have to do this from ARM Trusted Firmware. If you're
comfortable rebuilding that:

diff --git a/include/lib/cpus/aarch64/cortex_a72.h b/include/lib/cpus/aarch64/cortex_a72.h
index bfd64918625b..a7b8cf4be0c6 100644
--- a/include/lib/cpus/aarch64/cortex_a72.h
+++ b/include/lib/cpus/aarch64/cortex_a72.h
@@ -31,6 +31,7 @@
 #define CORTEX_A72_ACTLR_EL1                           S3_1_C15_C2_0

 #define CORTEX_A72_ACTLR_DISABLE_L1_DCACHE_HW_PFTCH    (1 << 56)
+#define CORTEX_A72_ACTLR_DIS_NC_GRE_STORE_STREAMING    (1 << 50)
 #define CORTEX_A72_ACTLR_NO_ALLOC_WBWA                 (1 << 49)
 #define CORTEX_A72_ACTLR_DCC_AS_DCCI                   (1 << 44)
 #define CORTEX_A72_ACTLR_EL1_DIS_INSTR_PREFETCH        (1 << 32)
diff --git a/lib/cpus/aarch64/cortex_a72.S b/lib/cpus/aarch64/cortex_a72.S
index 55e508678284..5914d6ee3ba6 100644
--- a/lib/cpus/aarch64/cortex_a72.S
+++ b/lib/cpus/aarch64/cortex_a72.S
@@ -133,6 +133,15 @@ func cortex_a72_reset_func
        orr     x0, x0, #CORTEX_A72_ECTLR_SMP_BIT
        msr     CORTEX_A72_ECTLR_EL1, x0
        isb
+
+       /* ---------------------------------------------
+        * Disables store streaming on NC/GRE memory type.
+        * ---------------------------------------------
+        */
+       mrs     x0, CORTEX_A72_ACTLR_EL1
+       orr     x0, x0, #CORTEX_A72_ACTLR_DIS_NC_GRE_STORE_STREAMING
+       msr     CORTEX_A72_ACTLR_EL1, x0
+       isb
        ret     x19
 endfunc cortex_a72_reset_func
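
As a quick sanity check on the expected register contents, the read-modify-write in the patch can be mirrored in C. The helper names below are made up for illustration; the bit positions come from the defines in the patch. Note that a C version needs a 64-bit constant, since a plain (1 << 50) would overflow int. Starting from the value Mikulas reported (only bit 32 set), the register should read back as 0x0004000100000000 once the bit is applied.

```c
#include <stdint.h>

/* Bit positions taken from the cortex_a72.h defines in the patch. */
#define A72_ACTLR_DIS_NC_GRE_STORE_STREAMING	(UINT64_C(1) << 50)
#define A72_ACTLR_DIS_INSTR_PREFETCH		(UINT64_C(1) << 32)

/* Mirror of the mrs/orr/msr sequence: return the value with bit 50 set. */
uint64_t actlr_disable_store_streaming(uint64_t actlr)
{
	return actlr | A72_ACTLR_DIS_NC_GRE_STORE_STREAMING;
}

/* Non-zero if store streaming on NC/GRE memory is disabled in this value. */
int actlr_store_streaming_disabled(uint64_t actlr)
{
	return (actlr & A72_ACTLR_DIS_NC_GRE_STORE_STREAMING) != 0;
}
```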


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-06 Thread Mikulas Patocka



On Mon, 6 Aug 2018, Catalin Marinas wrote:

> On Mon, Aug 06, 2018 at 05:47:36PM +0200, Ard Biesheuvel wrote:
> > On 6 August 2018 at 14:42, Robin Murphy  wrote:
> > > On 06/08/18 11:25, Mikulas Patocka wrote:
> > > [...]
> > >>>
> > >>> None of this explains why some transactions fail to make it across
> > >>> entirely. The overlapping writes in question write the same data to
> > >>> the memory locations that are covered by both, and so the ordering in
> > >>> which the transactions are received should not affect the outcome.
> > >>
> > >> You're right that the corruption couldn't be explained just by reordering
> > >> writes. My hypothesis is that the PCIe controller tries to disambiguate
> > >> the overlapping writes, but the disambiguation logic was not tested and it
> > >> is buggy. If there's a barrier between the overlapping writes, the PCIe
> > >> controller won't see any overlapping writes, so it won't trigger the
> > >> faulty disambiguation logic and it works.
> > >>
> > >> Could the ARM engineers look if there's some chicken bit in Cortex-A72
> > >> that could insert barriers between non-cached writes automatically?
> > >
> > > I don't think there is, and even if there was I imagine it would have a
> > > pretty hideous effect on non-coherent DMA buffers and the various other
> > > places in which we have Normal-NC mappings of actual system RAM.
> > 
> > Looking at the A72 manual, there is one chicken bit that looks like it
> > may be related:
> > 
> > CPUACTLR_EL1 bit #50:
> > 
> > 0 Enables store streaming on NC/GRE memory type. This is the reset value.
> > 1 Disables store streaming on NC/GRE memory type.
> > 
> > so putting something like
> > 
> > mrs x0, S3_1_C15_C2_0
> > orr x0, x0, #(1 << 50)
> > msr S3_1_C15_C2_0, x0
> > 
> > in __cpu_setup() would be worth a try.
> 
> Note that access to this register may be disabled at EL3 by firmware
> (ACTLR_EL3.CPUACTLR).
> 
> FWIW, Mikulas' test seems to run fine on a ThunderX1 with AMD
> FirePro W2100 (on /dev/fb1)

I have the EDK EFI firmware sources (and I can load the firmware from an SD
card, so there's no risk of bricking the board), so I can insert the write
into it, if you say where.

Mikulas
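
Catalin's note about ACTLR_EL3.CPUACTLR corresponds to footnote (a) of the CPUACTLR_EL1 access table in the A72 manual, as quoted elsewhere in this thread: EL1 has write access if ACTLR_EL3.CPUACTLR is 1 and ACTLR_EL2.CPUACTLR is 1, or if ACTLR_EL3.CPUACTLR is 1 and SCR.NS is 0. Written out as a predicate (the function name is made up for illustration):

```c
#include <stdbool.h>

/*
 * Footnote (a) of the CPUACTLR_EL1 access table: whether a write from
 * EL1 is permitted, given the two enable bits and the SCR.NS state.
 */
bool el1_can_write_cpuactlr(bool actlr_el3_cpuactlr, bool actlr_el2_cpuactlr,
			    bool scr_ns)
{
	return (actlr_el3_cpuactlr && actlr_el2_cpuactlr) ||
	       (actlr_el3_cpuactlr && !scr_ns);
}
```

Either way the EL3 bit has to be set, so if the firmware leaves it clear, a kernel running at non-secure EL1 cannot write the register and the write has to be done by the firmware itself.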


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-06 Thread Catalin Marinas
On Mon, Aug 06, 2018 at 05:47:36PM +0200, Ard Biesheuvel wrote:
> On 6 August 2018 at 14:42, Robin Murphy  wrote:
> > On 06/08/18 11:25, Mikulas Patocka wrote:
> > [...]
> >>>
> >>> None of this explains why some transactions fail to make it across
> >>> entirely. The overlapping writes in question write the same data to
> >>> the memory locations that are covered by both, and so the ordering in
> >>> which the transactions are received should not affect the outcome.
> >>
> >> You're right that the corruption couldn't be explained just by reordering
> >> writes. My hypothesis is that the PCIe controller tries to disambiguate
> >> the overlapping writes, but the disambiguation logic was not tested and it
> >> is buggy. If there's a barrier between the overlapping writes, the PCIe
> >> controller won't see any overlapping writes, so it won't trigger the
> >> faulty disambiguation logic and it works.
> >>
> >> Could the ARM engineers look if there's some chicken bit in Cortex-A72
> >> that could insert barriers between non-cached writes automatically?
> >
> > I don't think there is, and even if there was I imagine it would have a
> > pretty hideous effect on non-coherent DMA buffers and the various other
> > places in which we have Normal-NC mappings of actual system RAM.
> 
> Looking at the A72 manual, there is one chicken bit that looks like it
> may be related:
> 
> CPUACTLR_EL1 bit #50:
> 
> 0 Enables store streaming on NC/GRE memory type. This is the reset value.
> 1 Disables store streaming on NC/GRE memory type.
> 
> so putting something like
> 
> mrs x0, S3_1_C15_C2_0
> orr x0, x0, #(1 << 50)
> msr S3_1_C15_C2_0, x0
> 
> in __cpu_setup() would be worth a try.

Note that access to this register may be disabled at EL3 by firmware
(ACTLR_EL3.CPUACTLR).

FWIW, Mikulas' test seems to run fine on a ThunderX1 with AMD
FirePro W2100 (on /dev/fb1)

-- 
Catalin


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-06 Thread Mikulas Patocka



On Mon, 6 Aug 2018, Ard Biesheuvel wrote:

> On 6 August 2018 at 14:42, Robin Murphy  wrote:
> > On 06/08/18 11:25, Mikulas Patocka wrote:
> > [...]
> >>>
> >>> None of this explains why some transactions fail to make it across
> >>> entirely. The overlapping writes in question write the same data to
> >>> the memory locations that are covered by both, and so the ordering in
> >>> which the transactions are received should not affect the outcome.
> >>
> >>
> >> You're right that the corruption couldn't be explained just by reordering
> >> writes. My hypothesis is that the PCIe controller tries to disambiguate
> >> the overlapping writes, but the disambiguation logic was not tested and it
> >> is buggy. If there's a barrier between the overlapping writes, the PCIe
> >> controller won't see any overlapping writes, so it won't trigger the
> >> faulty disambiguation logic and it works.
> >>
> >> Could the ARM engineers look if there's some chicken bit in Cortex-A72
> >> that could insert barriers between non-cached writes automatically?
> >
> >
> > I don't think there is, and even if there was I imagine it would have a
> > pretty hideous effect on non-coherent DMA buffers and the various other
> > places in which we have Normal-NC mappings of actual system RAM.
> >
> 
> Looking at the A72 manual, there is one chicken bit that looks like it
> may be related:
> 
> CPUACTLR_EL1 bit #50:
> 
> 0 Enables store streaming on NC/GRE memory type. This is the reset value.
> 1 Disables store streaming on NC/GRE memory type.
> 
> so putting something like
> 
> mrs x0, S3_1_C15_C2_0
> orr x0, x0, #(1 << 50)
> msr S3_1_C15_C2_0, x0
> 
> in __cpu_setup() would be worth a try.

It won't boot.

But if I write the same value that was read, it also won't boot.

I created a simple kernel module that reads this register, and it has bit 
32 set, all other bits clear. But when I write the same value into it, the 
core that does the write is stuck in an infinite loop.

So, it seems that we are writing this register from the wrong place.

Mikulas
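
As a rough illustration of the reading described above, the following C helper lists which bits are set in a 64-bit register value. The name set_bits64 is made up for this sketch, and the value is passed in as a plain integer: actually reading the register needs an mrs instruction at a suitable exception level, which is not shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Store the positions of all set bits of `val` in `out` (lowest first)
 * and return how many there are. `out` must have room for 64 entries.
 * Hypothetical helper, not part of any kernel API.
 */
static unsigned set_bits64(uint64_t val, unsigned out[64])
{
    unsigned n = 0;
    unsigned bit;

    assert(out != NULL);
    for (bit = 0; bit < 64; bit++) {
        if ((val >> bit) & 1u)
            out[n++] = bit;
    }
    return n;
}
```

For the value reported above (bit 32 set, everything else clear) this returns a single entry, 32. After setting bit 50 as well it would return two entries, 32 and 50.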


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-06 Thread Ard Biesheuvel
On 6 August 2018 at 14:42, Robin Murphy  wrote:
> On 06/08/18 11:25, Mikulas Patocka wrote:
> [...]
>>>
>>> None of this explains why some transactions fail to make it across
>>> entirely. The overlapping writes in question write the same data to
>>> the memory locations that are covered by both, and so the ordering in
>>> which the transactions are received should not affect the outcome.
>>
>>
>> You're right that the corruption couldn't be explained just by reordering
>> writes. My hypothesis is that the PCIe controller tries to disambiguate
>> the overlapping writes, but the disambiguation logic was not tested and it
>> is buggy. If there's a barrier between the overlapping writes, the PCIe
>> controller won't see any overlapping writes, so it won't trigger the
>> faulty disambiguation logic and it works.
>>
>> Could the ARM engineers look if there's some chicken bit in Cortex-A72
>> that could insert barriers between non-cached writes automatically?
>
>
> I don't think there is, and even if there was I imagine it would have a
> pretty hideous effect on non-coherent DMA buffers and the various other
> places in which we have Normal-NC mappings of actual system RAM.
>

Looking at the A72 manual, there is one chicken bit that looks like it
may be related:

CPUACTLR_EL1 bit #50:

0 Enables store streaming on NC/GRE memory type. This is the reset value.
1 Disables store streaming on NC/GRE memory type.

so putting something like

mrs x0, S3_1_C15_C2_0
orr x0, x0, #(1 << 50)
msr S3_1_C15_C2_0, x0

in __cpu_setup() would be worth a try.
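
For readers who prefer C to the assembly above, here is a minimal sketch of the same read-modify-write step on a plain 64-bit value. The macro and function names are hypothetical, only the bit number (50) comes from the message, and the mrs/msr accesses themselves are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Bit number taken from the message above; the macro name is hypothetical. */
#define NC_STORE_STREAMING_DISABLE_BIT 50u

/* Return `val` with bit number `bit` set. */
static uint64_t set_bit64(uint64_t val, unsigned bit)
{
    assert(bit < 64);
    return val | (UINT64_C(1) << bit);
}

/*
 * Mirror of the `orr x0, x0, #(1 << 50)` step: keep every other bit of
 * the register value and set the "disable store streaming" bit.
 */
static uint64_t disable_nc_store_streaming(uint64_t cpuactlr)
{
    return set_bit64(cpuactlr, NC_STORE_STREAMING_DISABLE_BIT);
}
```

Because the operation is a plain OR, applying it twice gives the same value as applying it once, so it would be harmless if firmware had already set the bit.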


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-06 Thread Mikulas Patocka



On Sun, 5 Aug 2018, Pavel Machek wrote:

> Hi!
> 
> > > Can you run the test program on x86 using the similar framebuffer
> > > setup?  Does doing two writes (one aligned and one unaligned but
> > > overlapping with previous one) cause the same issue?  I suspect it
> > > does, then using memcpy for frame buffers is wrong.
> 
> I'm pretty sure it will work ok on x86.
> 
> > Overlapping unaligned writes work on x86 - they have to, because of 
> > backward compatibility.
> 
> It is not that easy. 8086s (and similar) did not have MTRRs and PATs
> either. Overlapping unaligned writes _on main memory_, _with normal
> MTRR settings_ certainly work ok on x86.

It works even with write-combining. Write-combining specifies that the 
writes may hit the framebuffer in unspecified order. But if the writes are 
overlapping, the CPU can't just reorder them and write the wrong result to 
the framebuffer.
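
The point about ordering can be checked with a small software model. The sketch below is only a model, under the assumption that a 33..64 byte copy is done as four 16-byte stores, two anchored at the start and two at the end, so that they overlap in the middle. It issues the four stores in every possible order and compares against a plain memcpy. The names copy_4chunks and order_independent are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 16u  /* one stp of two 64-bit registers stores 16 bytes */

/*
 * Copy `len` bytes (32 < len <= 64) as four 16-byte stores, issued in
 * the order given by `order`, which must be a permutation of 0..3.
 */
static void copy_4chunks(unsigned char *dst, const unsigned char *src,
                         size_t len, const int order[4])
{
    size_t off[4];
    int i;

    assert(len > 2 * CHUNK && len <= 4 * CHUNK);
    off[0] = 0;                 /* first 16 bytes              */
    off[1] = CHUNK;             /* next 16 bytes               */
    off[2] = len - 2 * CHUNK;   /* 16 bytes ending 16 from end */
    off[3] = len - CHUNK;       /* last 16 bytes               */
    for (i = 0; i < 4; i++)
        memcpy(dst + off[order[i]], src + off[order[i]], CHUNK);
}

/* Return 1 if all 24 store orders give the same result as memcpy. */
static int order_independent(size_t len)
{
    unsigned char src[64], ref[64], dst[64];
    int a, b, c, d;
    size_t i;

    assert(len > 2 * CHUNK && len <= sizeof src);
    for (i = 0; i < sizeof src; i++)
        src[i] = (unsigned char)(i * 7 + 1);
    memset(ref, 0xAA, sizeof ref);
    memcpy(ref, src, len);

    for (a = 0; a < 4; a++)
    for (b = 0; b < 4; b++)
    for (c = 0; c < 4; c++)
    for (d = 0; d < 4; d++) {
        int order[4];

        if (a == b || a == c || a == d || b == c || b == d || c == d)
            continue;           /* not a permutation */
        order[0] = a; order[1] = b; order[2] = c; order[3] = d;
        memset(dst, 0xAA, sizeof dst);
        copy_4chunks(dst, src, len, order);
        if (memcmp(dst, ref, sizeof dst) != 0)
            return 0;
    }
    return 1;
}
```

Since every overlapping byte receives the same source value from both stores, the model gives the same result in every order. That is why reordering alone should not produce the corruption.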

> Chances are the memory type can be configured to work in a similar way in
> your ARM/PCIe case?

ARM has the Device memory types GRE, nGRE, nGnRE and nGnRnE, which allow or 
disallow gathering, reordering and early write acknowledgement. 
Unfortunately, all these memory types will trigger a fault on unaligned 
accesses.

It also has a Non-Cached memory type (some people on this thread believe 
that it can't be used for GPUs, some believe that it can) - this memory 
type supports unaligned accesses, so it is actually used for framebuffers 
on ARM.
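
The naming scheme of the four types can be spelled out in code. This is a small sketch that decodes a name such as "nGnRE" into its three properties, where a leading "n" before a letter means that property is not allowed. The struct and function names are made up for the example, and the parser does not check whether a given combination is one the architecture actually defines.

```c
#include <assert.h>
#include <stddef.h>

/* 1 = allowed, 0 = not allowed. Hypothetical type for this sketch. */
struct dev_attr {
    int gathering;
    int reordering;
    int early_ack;
};

/*
 * Parse a name of the form [n]G[n]R[n]E. Returns 0 on success and
 * fills `out`, or returns -1 if the name does not follow the pattern.
 */
static int parse_device_attr(const char *name, struct dev_attr *out)
{
    static const char letters[3] = { 'G', 'R', 'E' };
    int flags[3];
    int i;

    assert(name != NULL && out != NULL);
    for (i = 0; i < 3; i++) {
        int allowed = 1;

        if (*name == 'n') {     /* "n" prefix negates the next letter */
            allowed = 0;
            name++;
        }
        if (*name != letters[i])
            return -1;
        name++;
        flags[i] = allowed;
    }
    if (*name != '\0')
        return -1;              /* trailing characters */

    out->gathering = flags[0];
    out->reordering = flags[1];
    out->early_ack = flags[2];
    return 0;
}
```

So "nGnRnE" decodes to no gathering, no reordering and no early write acknowledgement, while "GRE" allows all three.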

If we had a memory type that didn't do early write acknowledgement and 
supported unaligned accesses, it would solve this problem.

Mikulas


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-06 Thread Tulio Magno Quites Machado Filho
Florian Weimer  writes:

> On 08/04/2018 01:04 PM, Mikulas Patocka wrote:
>> There's plenty of memcpy's in the graphics stack. No one will be rewriting
>> all the graphics drivers because of tiny market share that ARM has in
>> desktop computers. So if you refuse to fix things and blame everyone else,
>> you can as well announce that you don't want to have PCIe graphics on ARM
>> at all.
>
> The POWER toolchain maintainers said pretty much the same thing not too 
> long ago.  I wonder how many architectures need to fail until the 
> graphics stack is finally fixed.

Unfortunately, it is not just the graphics stack.
This is being used in other userspace programs that benefit from GPUs and
accelerators.

But can we say they are nonportable programs?
I'm not convinced yet.

-- 
Tulio Magno



Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-06 Thread Mikulas Patocka



On Mon, 6 Aug 2018, Marcin Wojtas wrote:

> > Hi Marcin,
> >
> > Could you please try running his reproducer?
> 
> This is exactly what I plan to do, as soon as I can plug my GFX card
> back to the board (tomorrow). Just to remain aligned - is it ok, if I
> boot my debian with GT630 plugged, compile the program with -O2 and
> simply run it on /dev/fb0?
> 
> Best regards,
> Marcin

Yes - when you run it, don't switch consoles (it will obviously trigger 
a false warning), don't move the mouse to the upper left corner, and if you 
want to run it for a long time, turn off console blanking.

Mikulas
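
The reproducer itself is not included in this part of the thread. Purely as an assumption about the general shape of such a test, the sketch below copies random-length chunks to random (unaligned) offsets with memcpy and reads every byte back. It is not Mikulas' program: stress_memcpy and lcg_next are invented names, and the target here is any caller-supplied buffer, whereas a real run would presumably point it at a mapping of the framebuffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_COPY 64u

/* Small deterministic pseudo-random generator (24-bit output). */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;
}

/*
 * Perform `iterations` copies of 1..64 random bytes to random offsets
 * inside `target` and return the number of bytes that read back wrong.
 * `target_len` must be at least MAX_COPY and below 2^24.
 */
static size_t stress_memcpy(unsigned char *target, size_t target_len,
                            unsigned iterations, uint32_t seed)
{
    const volatile unsigned char *rd = target; /* force real read-back */
    unsigned char pattern[MAX_COPY];
    size_t bad = 0;
    unsigned it;

    assert(target != NULL && target_len >= MAX_COPY);
    for (it = 0; it < iterations; it++) {
        size_t len = 1 + lcg_next(&seed) % MAX_COPY;
        size_t off = lcg_next(&seed) % (target_len - len + 1);
        size_t i;

        for (i = 0; i < len; i++)
            pattern[i] = (unsigned char)lcg_next(&seed);
        memcpy(target + off, pattern, len);
        for (i = 0; i < len; i++) {
            if (rd[off + i] != pattern[i])
                bad++;
        }
    }
    return bad;
}
```

On ordinary cached RAM the function should always return 0. A non-zero count on a framebuffer mapping would be the kind of mismatch being hunted in this thread, which is also why anything else drawing to the same area (console switches, the mouse pointer, blanking) would show up as a false warning.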


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-06 Thread Marcin Wojtas
Hi Ard, Mikulas,

On Mon, 6 Aug 2018 at 15:48, Ard Biesheuvel wrote:
>
> On 6 August 2018 at 15:41, Marcin Wojtas  wrote:
> > Hi Mikulas,
> >
> > On Mon, 6 Aug 2018 at 14:42, Robin Murphy wrote:
> >>
> >> On 06/08/18 11:25, Mikulas Patocka wrote:
> >> [...]
> >> >> None of this explains why some transactions fail to make it across
> >> >> entirely. The overlapping writes in question write the same data to
> >> >> the memory locations that are covered by both, and so the ordering in
> >> >> which the transactions are received should not affect the outcome.
> >> >
> >> > You're right that the corruption couldn't be explained just by reordering
> >> > writes. My hypothesis is that the PCIe controller tries to disambiguate
> >> > the overlapping writes, but the disambiguation logic was not tested and it
> >> > is buggy. If there's a barrier between the overlapping writes, the PCIe
> >> > controller won't see any overlapping writes, so it won't trigger the
> >> > faulty disambiguation logic and it works.
> >> >
> >> > Could the ARM engineers look if there's some chicken bit in Cortex-A72
> >> > that could insert barriers between non-cached writes automatically?
> >>
> >> I don't think there is, and even if there was I imagine it would have a
> >> pretty hideous effect on non-coherent DMA buffers and the various other
> >> places in which we have Normal-NC mappings of actual system RAM.
> >>
> >> > I observe these kinds of corruptions:
> >> > - failing to write a few bytes
> >>
> >> That could potentially be explained by the reordering/atomicity issues
> >> Matt mentioned, i.e. the load is observing part of the store, before the
> >> store has fully completed.
> >>
> >> > - writing a few bytes that were written 16 bytes before
> >> > - writing a few bytes that were written 16 bytes after
> >>
> >> Those sound more like the interconnect or root complex ignoring the byte
> >> strobes on an unaligned burst, of which I think the simplistic view
> >> would be "it's broken".
> >>
> >> FWIW I stuck my old Nvidia 7600GT card in my Arm Juno r2 board (2x
> >> Cortex-A72), built your test program natively with GCC 8.1.1 at -O2, and
> >> it's still happily flickering pixels in the corner of the console after
> >> nearly an hour (in parallel with some iperf3 just to ensure plenty of
> >> PCIe traffic). I would strongly suspect this issue is particular to
> >> Armada 8k, so it's probably one for the Marvell folks to take a closer
> >> look at - I believe some previous interconnect issues on those SoCs were
> >> actually fixable in firmware.
> >>
> >>
> >
> > On my Macchiato I use a GT630 card (nouveau driver) + debian + xfce
> > desktop and in dual monitor mode, I could run a couple of 1080p
> > streams. All smooth and I've never noticed any image corruption
> > whatsoever (I spent a lot of time in front of such a setup). Just to be
> > on the safe side, can you send me a bootlog and your board revision? I'd
> > like to see your firmware version and type.
> >
>
> Hi Marcin,
>
> Could you please try running his reproducer?

This is exactly what I plan to do, as soon as I can plug my GFX card
back to the board (tomorrow). Just to remain aligned - is it ok, if I
boot my debian with GT630 plugged, compile the program with -O2 and
simply run it on /dev/fb0?

Best regards,
Marcin
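
The three corruption patterns quoted above (bytes not written, bytes that belong 16 bytes earlier, bytes that belong 16 bytes later) could be told apart mechanically when checking a copy. The following is a sketch under stated assumptions: the caller knows the previous buffer contents and the data it intended to write, and "written 16 bytes before/after" is read as "the value intended for the address 16 bytes lower/higher". All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum byte_status {
    BYTE_OK,              /* holds the intended value                 */
    BYTE_STALE,           /* still holds the old value (write missed) */
    BYTE_FROM_16_BEFORE,  /* holds the value intended for offset i-16 */
    BYTE_FROM_16_AFTER,   /* holds the value intended for offset i+16 */
    BYTE_UNKNOWN          /* none of the above                        */
};

/*
 * Classify byte `i` of a copy of `len` bytes.
 *   expected: data that should have been written
 *   before:   contents of the destination prior to the copy
 *   actual:   contents read back after the copy
 * A single matching byte can be a coincidence, so real use would want
 * to look at runs of bytes rather than at one byte in isolation.
 */
static enum byte_status classify_byte(const unsigned char *expected,
                                      const unsigned char *before,
                                      const unsigned char *actual,
                                      size_t len, size_t i)
{
    assert(expected != NULL && before != NULL && actual != NULL);
    assert(i < len);

    if (actual[i] == expected[i])
        return BYTE_OK;
    if (actual[i] == before[i])
        return BYTE_STALE;
    if (i >= 16 && actual[i] == expected[i - 16])
        return BYTE_FROM_16_BEFORE;
    if (i + 16 < len && actual[i] == expected[i + 16])
        return BYTE_FROM_16_AFTER;
    return BYTE_UNKNOWN;
}
```

Counting how often each status occurs would give a rough picture of whether a board mostly drops writes or mostly misplaces data by 16 bytes.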


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-06 Thread Ard Biesheuvel
On 6 August 2018 at 15:41, Marcin Wojtas  wrote:
> Hi Mikulas,
>
> On Mon, 6 Aug 2018 at 14:42, Robin Murphy wrote:
>>
>> On 06/08/18 11:25, Mikulas Patocka wrote:
>> [...]
>> >> None of this explains why some transactions fail to make it across
>> >> entirely. The overlapping writes in question write the same data to
>> >> the memory locations that are covered by both, and so the ordering in
>> >> which the transactions are received should not affect the outcome.
>> >
>> > You're right that the corruption couldn't be explained just by reordering
>> > writes. My hypothesis is that the PCIe controller tries to disambiguate
>> > the overlapping writes, but the disambiguation logic was not tested and it
>> > is buggy. If there's a barrier between the overlapping writes, the PCIe
>> > controller won't see any overlapping writes, so it won't trigger the
>> > faulty disambiguation logic and it works.
>> >
>> > Could the ARM engineers look if there's some chicken bit in Cortex-A72
>> > that could insert barriers between non-cached writes automatically?
>>
>> I don't think there is, and even if there was I imagine it would have a
>> pretty hideous effect on non-coherent DMA buffers and the various other
>> places in which we have Normal-NC mappings of actual system RAM.
>>
>> > I observe these kinds of corruptions:
>> > - failing to write a few bytes
>>
>> That could potentially be explained by the reordering/atomicity issues
>> Matt mentioned, i.e. the load is observing part of the store, before the
>> store has fully completed.
>>
>> > - writing a few bytes that were written 16 bytes before
>> > - writing a few bytes that were written 16 bytes after
>>
>> Those sound more like the interconnect or root complex ignoring the byte
>> strobes on an unaligned burst, of which I think the simplistic view
>> would be "it's broken".
>>
>> FWIW I stuck my old Nvidia 7600GT card in my Arm Juno r2 board (2x
>> Cortex-A72), built your test program natively with GCC 8.1.1 at -O2, and
>> it's still happily flickering pixels in the corner of the console after
>> nearly an hour (in parallel with some iperf3 just to ensure plenty of
>> PCIe traffic). I would strongly suspect this issue is particular to
>> Armada 8k, so it's probably one for the Marvell folks to take a closer
>> look at - I believe some previous interconnect issues on those SoCs were
>> actually fixable in firmware.
>>
>>
>
> On my Macchiato I use a GT630 card (nouveau driver) + debian + xfce
> desktop and in dual monitor mode, I could run a couple of 1080p
> streams. All smooth and I've never noticed any image corruption
> whatsoever (I spent a lot of time in front of such a setup). Just to be
> on the safe side, can you send me a bootlog and your board revision? I'd
> like to see your firmware version and type.
>

Hi Marcin,

Could you please try running his reproducer?


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-06 Thread Marcin Wojtas
Hi Mikulas,

On Mon, 6 Aug 2018 at 14:42, Robin Murphy wrote:
>
> On 06/08/18 11:25, Mikulas Patocka wrote:
> [...]
> >> None of this explains why some transactions fail to make it across
> >> entirely. The overlapping writes in question write the same data to
> >> the memory locations that are covered by both, and so the ordering in
> >> which the transactions are received should not affect the outcome.
> >
> > You're right that the corruption couldn't be explained just by reordering
> > writes. My hypothesis is that the PCIe controller tries to disambiguate
> > the overlapping writes, but the disambiguation logic was not tested and it
> > is buggy. If there's a barrier between the overlapping writes, the PCIe
> > controller won't see any overlapping writes, so it won't trigger the
> > faulty disambiguation logic and it works.
> >
> > Could the ARM engineers look if there's some chicken bit in Cortex-A72
> > that could insert barriers between non-cached writes automatically?
>
> I don't think there is, and even if there was I imagine it would have a
> pretty hideous effect on non-coherent DMA buffers and the various other
> places in which we have Normal-NC mappings of actual system RAM.
>
> > I observe these kinds of corruptions:
> > - failing to write a few bytes
>
> That could potentially be explained by the reordering/atomicity issues
> Matt mentioned, i.e. the load is observing part of the store, before the
> store has fully completed.
>
> > - writing a few bytes that were written 16 bytes before
> > - writing a few bytes that were written 16 bytes after
>
> Those sound more like the interconnect or root complex ignoring the byte
> strobes on an unaligned burst, of which I think the simplistic view
> would be "it's broken".
>
> FWIW I stuck my old Nvidia 7600GT card in my Arm Juno r2 board (2x
> Cortex-A72), built your test program natively with GCC 8.1.1 at -O2, and
> it's still happily flickering pixels in the corner of the console after
> nearly an hour (in parallel with some iperf3 just to ensure plenty of
> PCIe traffic). I would strongly suspect this issue is particular to
> Armada 8k, so it's probably one for the Marvell folks to take a closer
> look at - I believe some previous interconnect issues on those SoCs were
> actually fixable in firmware.
>
>

On my Macchiato I use a GT630 card (nouveau driver) + Debian + Xfce
desktop, and in dual-monitor mode I could run a couple of 1080p
streams. All smooth, and I've never noticed any image corruption
whatsoever (I spent a lot of time in front of this setup). Just to be
on the safe side, can you send me a boot log and your board revision?
I'd like to see your firmware version and type.

Thanks,
Marcin


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-06 Thread Ard Biesheuvel
On 6 August 2018 at 14:42, Robin Murphy  wrote:
> On 06/08/18 11:25, Mikulas Patocka wrote:
> [...]
>>>
>>> None of this explains why some transactions fail to make it across
>>> entirely. The overlapping writes in question write the same data to
>>> the memory locations that are covered by both, and so the ordering in
>>> which the transactions are received should not affect the outcome.
>>
>>
>> You're right that the corruption couldn't be explained just by reordering
>> writes. My hypothesis is that the PCIe controller tries to disambiguate
>> the overlapping writes, but the disambiguation logic was not tested and it
>> is buggy. If there's a barrier between the overlapping writes, the PCIe
>> controller won't see any overlapping writes, so it won't trigger the
>> faulty disambiguation logic and it works.
>>
>> Could the ARM engineers look if there's some chicken bit in Cortex-A72
>> that could insert barriers between non-cached writes automatically?
>
>
> I don't think there is, and even if there was I imagine it would have a
> pretty hideous effect on non-coherent DMA buffers and the various other
> places in which we have Normal-NC mappings of actual system RAM.
>
>> I observe these kinds of corruptions:
>> - failing to write a few bytes
>
>
> That could potentially be explained by the reordering/atomicity issues Matt
> mentioned, i.e. the load is observing part of the store, before the store
> has fully completed.
>

OK, so that means the unaligned transaction gets split, and the
subtransactions are reordered with the aligned transaction so that the
sub-writes contain stale values from the sub-reads?

>> - writing a few bytes that were written 16 bytes before
>> - writing a few bytes that were written 16 bytes after
>
>
> Those sound more like the interconnect or root complex ignoring the byte
> strobes on an unaligned burst, of which I think the simplistic view would be
> "it's broken".
>
> FWIW I stuck my old Nvidia 7600GT card in my Arm Juno r2 board (2x
> Cortex-A72), built your test program natively with GCC 8.1.1 at -O2, and
> it's still happily flickering pixels in the corner of the console after
> nearly an hour (in parallel with some iperf3 just to ensure plenty of PCIe
> traffic). I would strongly suspect this issue is particular to Armada 8k, so
> it's probably one for the Marvell folks to take a closer look at - I believe
> some previous interconnect issues on those SoCs were actually fixable in
> firmware.
>

IIRC that was DVM dropping a few VA bits at the top, and it took a single
MMIO control bit to put it back into 'non-broken' mode.


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-06 Thread Robin Murphy

On 06/08/18 11:25, Mikulas Patocka wrote:
[...]

>> None of this explains why some transactions fail to make it across
>> entirely. The overlapping writes in question write the same data to
>> the memory locations that are covered by both, and so the ordering in
>> which the transactions are received should not affect the outcome.
>
> You're right that the corruption couldn't be explained just by reordering
> writes. My hypothesis is that the PCIe controller tries to disambiguate
> the overlapping writes, but the disambiguation logic was not tested and it
> is buggy. If there's a barrier between the overlapping writes, the PCIe
> controller won't see any overlapping writes, so it won't trigger the
> faulty disambiguation logic and it works.
>
> Could the ARM engineers look if there's some chicken bit in Cortex-A72
> that could insert barriers between non-cached writes automatically?

I don't think there is, and even if there was I imagine it would have a
pretty hideous effect on non-coherent DMA buffers and the various other
places in which we have Normal-NC mappings of actual system RAM.

> I observe these kinds of corruptions:
> - failing to write a few bytes

That could potentially be explained by the reordering/atomicity issues
Matt mentioned, i.e. the load is observing part of the store, before the
store has fully completed.

> - writing a few bytes that were written 16 bytes before
> - writing a few bytes that were written 16 bytes after


Those sound more like the interconnect or root complex ignoring the byte 
strobes on an unaligned burst, of which I think the simplistic view 
would be "it's broken".


FWIW I stuck my old Nvidia 7600GT card in my Arm Juno r2 board (2x 
Cortex-A72), built your test program natively with GCC 8.1.1 at -O2, and 
it's still happily flickering pixels in the corner of the console after 
nearly an hour (in parallel with some iperf3 just to ensure plenty of 
PCIe traffic). I would strongly suspect this issue is particular to 
Armada 8k, so it's probably one for the Marvell folks to take a closer 
look at - I believe some previous interconnect issues on those SoCs were 
actually fixable in firmware.


Robin.


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-06 Thread Ard Biesheuvel
On 6 August 2018 at 14:19, Ard Biesheuvel  wrote:
> On 6 August 2018 at 14:09, Mikulas Patocka  wrote:
>>
>>
>> On Mon, 6 Aug 2018, Ard Biesheuvel wrote:
>>
>>> >> Are we talking about a quirk for the Armada 8040 or about PCIe on ARM
>>> >> in general?
>>> >
>>> > I don't know - there are not any other easily available PCIe ARM boards
>>> > except for Armada 8040.
>>>
>>> ... indeed, and sadly, the ones that are available all have this
>>> horrible Synopsys DesignWare PCIe IP that does not implement a true
>>> root complex at all, but is simply repurposed endpoint IP with some
>>> tweaks so it vaguely resembles a root complex.
>>>
>>> But this is exactly why I am asking: I use an AMD Seattle Overdrive as
>>> my main Linux development system, and it runs the gnome-shell stack
>>> flawlessly (using the nouveau driver), as well as a UEFI framebuffer
>>> using efifb. So my suspicion is that this is either a Synopsys IP
>>> issue or an interconnect issue, and has nothing to do with the
>>> impedance mismatch between AMBA and PCIe.
>>
>> If you run the program for testing memcpy on framebuffer that I posted in
>> this thread - does it detect some corruption for you?
>>
>
> I won't be able to check that for a while - I'm currently travelling.
>
>>
>> BTW. does the Radeon GPU driver work for you?
>>
>> My observation is that OpenGL with Nouveau works, but it's slow and the
>> whole system locks up when playing video in chromium.
>>

Are you setting the pstate to auto? That helps a lot in my experience.

I.e.,

echo auto > /sys/kernel/debug/dri/0/pstate


Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-06 Thread Ard Biesheuvel
On 6 August 2018 at 14:09, Mikulas Patocka  wrote:
>
>
> On Mon, 6 Aug 2018, Ard Biesheuvel wrote:
>
>> >> Are we talking about a quirk for the Armada 8040 or about PCIe on ARM
>> >> in general?
>> >
>> > I don't know - there are not any other easily available PCIe ARM boards
>> > except for Armada 8040.
>>
>> ... indeed, and sadly, the ones that are available all have this
>> horrible Synopsys DesignWare PCIe IP that does not implement a true
>> root complex at all, but is simply repurposed endpoint IP with some
>> tweaks so it vaguely resembles a root complex.
>>
>> But this is exactly why I am asking: I use an AMD Seattle Overdrive as
>> my main Linux development system, and it runs the gnome-shell stack
>> flawlessly (using the nouveau driver), as well as a UEFI framebuffer
>> using efifb. So my suspicion is that this is either a Synopsys IP
>> issue or an interconnect issue, and has nothing to do with the
>> impedance mismatch between AMBA and PCIe.
>
> If you run the program for testing memcpy on framebuffer that I posted in
> this thread - does it detect some corruption for you?
>

I won't be able to check that for a while - I'm currently travelling.

>
> BTW. does the Radeon GPU driver work for you?
>
> My observation is that OpenGL with Nouveau works, but it's slow and the
> whole system locks up when playing video in chromium.
>

No, that works fine for me. VDPAU acceleration works as well, but I
think it depends on your chromium build whether it can actually use
it. In any case, mplayer can use vdpau to play 1080p h264 without
breaking a sweat on this system.

Note that the VDPAU driver also relies on memory semantics, i.e., it
may use DC ZVA (zero cacheline) instructions which are not permitted
on device mappings. This is probably just glibc's memset() being
invoked, but I remember hitting this on another PCIe-impaired arm64
system with Synopsys PCIe IP.

> Radeon HD 6350 (pre-GCN), doesn't lock up, but OpenGL (and Glamour) has
> many artifacts and corrupted textures. When I switch it to EXA
> acceleration and don't use OpenGL, it works.
>
> The artifacts are not fixed by preloading a glibc with fixed memcpy, so
> there's supposedly some other bug somewhere.
>

Yes, I have the same experience, and I have been meaning to report it
to the maintainers/developers. Good to have another data point.

