Re: Suspected cache coherency problem on V4L2 and AR7100 CPU

2013-10-09 Thread Krzysztof Hałasa
Ralf Baechle r...@linux-mips.org writes:

 16K is a silver bullet solution to all cache aliasing problems.  So if
 your issue persists with 16K page size, it's not a cache aliasing issue.
 Aside there are generally performance gains from the bigger page size.

I wonder why isn't the issue present in other cases. Perhaps remapping
of a userspace address and accessing it with kseg0 isn't a frequent
operation.

Shouldn't we change the default page size (on affected CPUs) to 16 KB
then? Alternatively, we could flush/invalidate the cache when needed -
is it a viable option?

Thanks.
-- 
Krzysztof Halasa

Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland
--
To unsubscribe from this list: send the line unsubscribe linux-media in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Suspected cache coherency problem on V4L2 and AR7100 CPU

2013-10-09 Thread Ralf Baechle
On Wed, Oct 09, 2013 at 08:53:20AM +0200, Krzysztof Hałasa wrote:

  16K is a silver bullet solution to all cache aliasing problems.  So if
  your issue persists with 16K page size, it's not a cache aliasing issue.
  Aside there are generally performance gains from the bigger page size.
 
 I wonder why isn't the issue present in other cases. Perhaps remapping
 of a userspace address and accessing it with kseg0 isn't a frequent
 operation.
 
 Shouldn't we change the default page size (on affected CPUs) to 16 KB
 then? Alternatively, we could flush/invalidate the cache when needed -
 is it a viable option?

The kernel is supposed to perform the necessary cache flushing, so any
remaining aliasing issue would be considered a bug.  But the code is
performance sensitive, some of the problem cases are twisted and complex
so bugs and unsolved corner cases show up every now and then.

The historic default is 4K page size - on some processors such as the
venerable R3000 it's also the only page size available.  Some application
code wants to know the page size and has wisely hardcoded 4K.  Also
a fix to binutils many years ago reduced the alignment of generated
binaries so they'd not run on a kernel with larger page size.  The
kernel configuration defaults are chosen to just work out of the box,
and 4K page size is the safest choice.

Anyway, binutils got unfixed again years ago so chances are 16K
will just work.

Does it work for you, even solve your problem?

  Ralf
--
To unsubscribe from this list: send the line unsubscribe linux-media in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Suspected cache coherency problem on V4L2 and AR7100 CPU

2013-10-09 Thread Krzysztof Hałasa
Ralf Baechle r...@linux-mips.org writes:

 The kernel is supposed to perform the necessary cache flushing, so any
 remaining aliasing issue would be considered a bug.  But the code is
 performance sensitive, some of the problem cases are twisted and complex
 so bugs and unsolved corner cases show up every now and then.

Ok. This means I should also investigate the V4L2 and the hw driver
code, because the cache aliasing shouldn't be there in the first place.

 Does it work for you, even solve your problem?

Sure, with 16 KB page size everything works fine.

-- 
Krzysztof Halasa

Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland
--
To unsubscribe from this list: send the line unsubscribe linux-media in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Suspected cache coherency problem on V4L2 and AR7100 CPU

2013-10-08 Thread Krzysztof Hałasa
Ralf Baechle r...@linux-mips.org writes:

 That's fine.  You just need to ensure that there are no virtual
 aliases.

Does this include virtual aliasing between a 4 KB TLB-mapped page and
a kseg0 address? I don't really have two TLBs pointing to the same page.

 One way to do so is to increase the page size to 16kB.

Right, this way we will have a unique mapping from the virtual address
to the data cache, as the cache size (per way) is 8 KB here. Is it the
correct fix in this situation?

 Note that there is a variant of the 24K which has a VIPT cache but uses
 hardware to resolve cache aliases.  That is, from a kernel cache management
 perspective it behaves like a PIPT cache.

It seems it's not the case here. What I have here is:
CPU revision is: 00019374 (MIPS 24Kc)
SoC: Atheros AR7161 rev 2
Primary instruction cache 64kB, VIPT, 4-way, linesize 32 bytes.
Primary data cache 32kB, 4-way, VIPT, cache aliases, linesize 32 bytes

 However as I understand what you're mapping to userspace is actually
 device memory, right?

Not exactly - I'm using PCI DMA to userspace SG buffers in RAM.

The userspace first allocates the buffers in normal RAM (with vmalloc()
or something, there is an mmap ioctl() for this), the address returned
is 0x7xxx. Then the buffers (which consist of several pages each)
are presented to the hw driver which obtains separate (kernel) mapping
for each page (kseg0) and does dma_map_sg() and so on. The driver also
simply writes to the buffers. This isn't a problem, though - only the
incoherence between TLB and kseg0 is.

The problem is the userspace doesn't see the kernel writes - The
0x7xxx TLB-mapped pages are read-cached and not invalidated while
the kernel writes to the same pages using kseg0 addresses.

Thanks for looking at this.
-- 
Krzysztof Halasa

Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland
--
To unsubscribe from this list: send the line unsubscribe linux-media in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Suspected cache coherency problem on V4L2 and AR7100 CPU

2013-10-08 Thread Krzysztof Hałasa
Ralf Baechle r...@linux-mips.org writes:

 That's fine.  You just need to ensure that there are no virtual aliases.
 One way to do so is to increase the page size to 16kB.

Checked, this thing works fine with 16 KB pages.
-- 
Krzysztof Halasa

Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland
--
To unsubscribe from this list: send the line unsubscribe linux-media in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Suspected cache coherency problem on V4L2 and AR7100 CPU

2013-10-08 Thread Ralf Baechle
On Tue, Oct 08, 2013 at 10:24:13AM +0200, Krzysztof Hałasa wrote:

  That's fine.  You just need to ensure that there are no virtual
  aliases.
 
 Does this include virtual aliasing between a 4 KB TLB-mapped page and
 a kseg0 address? I don't really have two TLBs pointing to the same page.

Yes.

Note that the terminology used in the manuals may be confusing here.
They call KSEG0/1 and - on 64 bit - XKPHYS unmapped spaces.  But
obviously there is a mapping from virtual to physical addresses.  It's
just that there is no TLB entry being used for these mappings.

It's easier to understand if you see that all memory accesses are going
through the cache, TLB mapped or not.

  One way to do so is to increase the page size to 16kB.
 
 Right, this way we will have a unique mapping from the virtual address
 to the data cache, as the cache size (per way) is 8 KB here. Is it the
 correct fix in this situation?

16K is a silver bullet solution to all cache aliasing problems.  So if
your issue persists with 16K page size, it's not a cache aliasing issue.
Aside there are generally performance gains from the bigger page size.

  Note that there is a variant of the 24K which has a VIPT cache but uses
  hardware to resolve cache aliases.  That is, from a kernel cache management
  perspective it behaves like a PIPT cache.
 
 It seems it's not the case here. What I have here is:
 CPU revision is: 00019374 (MIPS 24Kc)
 SoC: Atheros AR7161 rev 2
 Primary instruction cache 64kB, VIPT, 4-way, linesize 32 bytes.
 Primary data cache 32kB, 4-way, VIPT, cache aliases, linesize 32 bytes

Yes; it would print PIPT if the cache was aliasing-free.

  However as I understand what you're mapping to userspace is actually
  device memory, right?
 
 Not exactly - I'm using PCI DMA to userspace SG buffers in RAM.
 
 The userspace first allocates the buffers in normal RAM (with vmalloc()
 or something, there is an mmap ioctl() for this), the address returned
 is 0x7xxx. Then the buffers (which consist of several pages each)
 are presented to the hw driver which obtains separate (kernel) mapping
 for each page (kseg0) and does dma_map_sg() and so on. The driver also
 simply writes to the buffers. This isn't a problem, though - only the
 incoherence between TLB and kseg0 is.

Now that very much sounds like an aliasing issue!

 The problem is the userspace doesn't see the kernel writes - The
 0x7xxx TLB-mapped pages are read-cached and not invalidated while
 the kernel writes to the same pages using kseg0 addresses.
 
 Thanks for looking at this.

You're welcome.

I'm just wondering the underlying issue might be a generic problem.

  Ralf
--
To unsubscribe from this list: send the line unsubscribe linux-media in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Suspected cache coherency problem on V4L2 and AR7100 CPU

2013-10-07 Thread Krzysztof Hałasa
Please forgive me my MIPS TLB ignorance.

It seems there is a TLB entry pointing to the userspace buffer at the
time the kernel pointer (kseg0) is used. Is is an allowed situation on
MIPS 24K?

buffer: len 0x1000 (first page),
userspace pointer 0x77327000,
kernel pointer 0x867ac000 (physical address = 0x067ac000)

TLB Index: 15 pgmask=4kb va=77326000 asid=be
   [pa=01149000 c=3 d=1 v=1 g=0] [pa=067ac000 c=3 d=1 v=1 g=0]

Should the TLB entry be deleted before using the kernel pointer (which
points at the same page)?
-- 
Krzysztof Halasa

Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland
--
To unsubscribe from this list: send the line unsubscribe linux-media in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Suspected cache coherency problem on V4L2 and AR7100 CPU

2013-10-07 Thread Ralf Baechle
On Mon, Oct 07, 2013 at 10:38:49AM +0200, Krzysztof Hałasa wrote:

 Please forgive me my MIPS TLB ignorance.

May the manual be with you :-)

 It seems there is a TLB entry pointing to the userspace buffer at the
 time the kernel pointer (kseg0) is used. Is is an allowed situation on
 MIPS 24K?
 
 buffer: len 0x1000 (first page),
   userspace pointer 0x77327000,
   kernel pointer 0x867ac000 (physical address = 0x067ac000)
 
 TLB Index: 15 pgmask=4kb va=77326000 asid=be
[pa=01149000 c=3 d=1 v=1 g=0] [pa=067ac000 c=3 d=1 v=1 g=0]
 
 Should the TLB entry be deleted before using the kernel pointer (which
 points at the same page)?

That's fine.  You just need to ensure that there are no virtual aliases.
One way to do so is to increase the page size to 16kB.

Note that there is a variant of the 24K which has a VIPT cache but uses
hardware to resolve cache aliases.  That is, from a kernel cache management
perspective it behaves like a PIPT cache.

However as I understand what you're mapping to userspace is actually
device memory, right?  You probably want to map that uncached.  That's a
long standing issue in these two macros:

/*
 * Convert a physical pointer to a virtual kernel pointer for /dev/mem
 * access
 */
#define xlate_dev_mem_ptr(p)__va(p)

/*
 * Convert a virtual cached pointer to an uncached pointer
 */
#define xlate_dev_kmem_ptr(p)   p

which are defined in arch/mips/include/asm/io.h.  These should return
a KSEG1 (uncached XKPHYS) address for anything but RAM.

Would that explain your observations?

  Ralf
--
To unsubscribe from this list: send the line unsubscribe linux-media in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Suspected cache coherency problem on V4L2 and AR7100 CPU

2013-10-04 Thread Krzysztof Hałasa
 I'm debugging a problem with a SOLO6110-based H.264 PCI video encoder on
 Atheros AR7100-based (MIPS, big-endian) platform.

BTW this CPU obviously has VIPT data cache, this means a physical page
with multiple virtual addresses (e.g. mapped multiple times) may and
will be cached multiple times.

AR7100 = arch/mips/ath79.
-- 
Krzysztof Halasa

Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, Poland
--
To unsubscribe from this list: send the line unsubscribe linux-media in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Suspected cache coherency problem on V4L2 and AR7100 CPU

2013-10-03 Thread Krzysztof Hałasa
Hi,

I'm debugging a problem with a SOLO6110-based H.264 PCI video encoder on
Atheros AR7100-based (MIPS, big-endian) platform.

The problem manifests itself with stale data being returned by the
driver (using ioctl VIDIOC_DQBUF). The stale date always starts and ends
on 32-byte cache line boundary.

The driver is drivers/staging/media/solo6x10.

Initially I thought the encoder hardware is at fault (though it works on
i686 and on (both endians) ARM). But I've eliminated actual DMA accesses
from the driver and the problems still persists.

The control flow is now basically the following:
- userspace program initializes the adapter and allocates 192 KB long
  buffers (at least 2 of them):
open(/dev/video1);

various ioctl() calls

for (cnt = 0; cnt  buffer_count; cnt++) {
struct v4l2_buffer buf = {
.type = V4L2_BUF_TYPE_VIDEO_CAPTURE,
.memory = V4L2_MEMORY_MMAP,
.index = cnt,
};
ioctl(stream-fd, VIDIOC_QUERYBUF, buf);
mmap(NULL, buf.length, PROT_READ | PROT_WRITE, MAP_SHARED, 
stream-fd, buf.m.offset);
}

and then:

for (cnt = 0; cnt  buffer_count; cnt++) {
struct v4l2_buffer buf = {
.type = V4L2_BUF_TYPE_VIDEO_CAPTURE,
.memory = V4L2_MEMORY_MMAP,
.index = cnt,
ioctl(stream-fd, VIDIOC_QBUF, buf);
}

The buffers are now queued. The driver (upon receiving an encoded frame)
now mostly does:

various spin_lock() etc.
vb = list_first_entry(solo_enc-vidq_active, struct solo_vb2_buf, 
list);
list_del(vb-list);

struct vb2_dma_sg_desc *vbuf = vb2_dma_sg_plane_desc(vb, 0);

/* now we have vbuf-sglist which corresponds to a single
userspace 192-KB buffer */

vb2_set_plane_payload(vb, 0, 1024 /* bytes */);

static u32 n = 0; /* a counter to mark each buffer */

/* the following is normally done using dma_map_sg() and DMA,
and also with sg_copy_from_buffer() - eliminated for now */

/* I do the following instead */
struct page *page = sg_page(vbuf-sglist);
u32 *addr = kmap_atomic(page);

/* 4 times as large, I know, the buffer is much longer though */
for (i = 0; i  1024; i++)
addr[i] = n;

flush_dcache_page(page); /* and/or */
flush_kernel_dcache_page(page);

kunmap_atomic(addr);

vb-v4l2_buf.sequence = solo_enc-sequence++;
vb-v4l2_buf.timestamp.tv_sec = vop_sec(vh);
vb-v4l2_buf.timestamp.tv_usec = vop_usec(vh);

vb2_buffer_done(vb, VB2_BUF_STATE_DONE);

The userspace server now does ioctl(VIDIOC_DQBUF), sends it using UDP,
and populates buffer pool again with ioctl(VIDIOC_QBUF).

The driver uses directly-mapped cached (kernel) pointers to access the
buffers (0x8000-0x9FFF kseg0 region) while (obviously)
userspace uses TLB-mapped pointers.

I have verified with a JTAG-based debugger (OpenOCD) that the buffers
are flushed to DRAM (0xAxxx uncached directly-mapped region has
valid data), however the userspace TLB-mapped buffers (which correspond
to the same physical DRAM addresses) partially contain old cached data
(from previous iterations).

The question is which part of the code is at fault, and how do I fix it.
I understand invalidating (and perhaps first flushing) userspace buffers
(cache) should generally fix the problem.

This could also be a simple bug rather than API/platform incompatibility
because usually (though not always) only 1 of the buffers gets corrupted
(the second one of two).

It looks like this - valid buffer, counter n = 0x499, a fragment
of actual UDP packet:
0x0030:   0499  0499  0499  0499  
0x0040:   0499  0499  0499  0499  
0x0050:   0499  0499  0499  0499  
0x0060:   0499  0499  0499  0499  
0x0070:   0499  0499  0499  0499  
0x0080:   0499  0499  0499  0499  
0x0090:   0499  0499  0499  0499  
0x00a0:   0499  0499  0499  0499  
0x00b0:   0499  0499  0499  0499  

next buffer is corrupted, n = 0x49A:
0x0030:   049a  049a  049a  049a  
0x0040:   049a  0468  0468  0468  ...h...h...h
0x0050:   0468  0468  0468  0468  ...h...h...h...h
0x0060:   0468  049a  049a  049a  ...h
0x0070:   049a  049a  049a  049a  
0x0080:   049a  049a  049a  049a  
0x0090: