RE: [RFC] Global video buffers pool / Samsung SoC's

2009-11-12 Thread Marek Szyprowski
Hello,

On Wednesday, November 11, 2009 8:13 AM Harald Welte wrote:

 Hi Guennadi and others,
 
 first of all sorry for breaking the thread, but I am new to this list
 and could not find the message-id of the original mails nor a .mbox
 format archive for the list :(
 
 As I was one of the people giving comments to Guennadi's talk at ELCE,
 let me give some feedback here, too.
 
 I'm currently helping the Samsung System LSI Linux kernel team with
 bringing their various ports for their ARM SoCs mainline.  So far we
 have excluded much of the multimedia related parts due to the complexity
 and lack of kernel infrastructure.
 
 Let me briefly describe the SoCs in question: They have an ARM9, ARM11
 or Cortex-A8 core and multiple video input and output paths, such as
 * camera interface
 * 2d acceleration engine
 * 3d acceleration engine
 * post-processor (colorspace conversion, scaling, rotating)
 * LCM output for classic digital RGB+sync interfaces
 * TV scaler
 * TV encoder
 * HDMI interface (simple serial-HDMI with DMA from/to system memory)
 * Transport Stream interface (MPEG-transport stream input with PID
   filter which can DMA to system memory)
 * MIPI-HSI LCM output device
 * Multi-Function codec for H.264 and other stuff
 * Hardware JPEG codec.
 plus even some more that I might have missed.
 
 One of the issues is that, at least in many current and upcoming
 products, all those integrated peripherals can only use physically
 contiguous memory.
 
 For the classic output path (e.g. Xorg+EXA+XAA+3D), that is fine.  The
 framebuffer driver can simply allocate some large chunk of physical
 system memory at boot time, map that into userspace and be happy.  This
 includes things like Xvideo support in the Xserver.  Also, HDMI output
 and TV output can be handled inside X or switch to a new KMS model.
 
 However, the input side looks quite different.  On the one hand we have
 the camera driver; on the other, HDMI input and transport stream input,
 which are less easy to handle.
 
 Also, given the plethora of such subsystems in a device, you definitely
 don't want to have one static big boot-time allocation for each of those
 devices.  You don't want to waste that much memory all the time just in
 case at some time you start an application that actually needs this.
 Also, it is unlikely that all of the subsystems will operate at the same
 time.
 
 So having an in-kernel allocator for physically contiguous memory is
 something that is needed to properly support this hardware.  At boot
 time you allocate one big pool, from which you then on-demand allocate
 and free physically contiguous buffers, even at a much later time.
 
 Furthermore, think of something like the JPEG codec acceleration, which
 you also want to use zero-copy from userspace.  So userspace (like
 libjpeg for decode, or a camera application for encode) would also need
 to be able to allocate such a buffer inside the kernel for input and
 output data of the codec, mmap it, put its jpeg data into it and then
 run the actual codec.
 
 How would that relate to the proposed global video buffers pool? Well,
 I think before thinking strictly about video buffers for camera chips,
 we have to think much more generically!

We have been working on multimedia drivers for Samsung SoCs for a while
and have already worked through all these problems.

The current version of our drivers uses private ioctls, zero-copy user
space memory access and our custom memory manager. Our solution is described
in the following thread:
http://article.gmane.org/gmane.linux.ports.arm.kernel/66463
We posted it as a base (or reference) for the discussion on the Global Video
Buffers Pool.

We also found that most of the multimedia devices (FIMC, JPEG, Rotator/Scaler,
MFC, Post Processor, TVOUT, maybe others) can be successfully wrapped into the
V4L2 framework. We only need to extend the framework a bit, but this is doable
and has been discussed at the V4L2 mini summit at LPC 2009.

The most important issue is how a device that only processes multimedia data
from one buffer in system memory to another should be implemented in the V4L2
framework. A quite long but successful discussion can be found here:
http://article.gmane.org/gmane.linux.drivers.video-input-infrastructure/10668/
We are currently implementing a reference test driver for a v4l2 mem2mem device.
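
To make the mem2mem model concrete, here is a minimal user-space sketch of how
such a node could be driven: one buffer is fed on the OUTPUT queue and the
processed result is collected on the CAPTURE queue of the same device. The
formats, single-buffer usage and missing mmap()/error handling are only
illustrative assumptions, not the actual reference driver:

/*
 * Sketch only: drive a hypothetical mem2mem node (one frame in, one out).
 */
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

static int process_one_frame(int fd)
{
	struct v4l2_format src = { .type = V4L2_BUF_TYPE_VIDEO_OUTPUT };
	struct v4l2_format dst = { .type = V4L2_BUF_TYPE_VIDEO_CAPTURE };
	struct v4l2_requestbuffers req;
	struct v4l2_buffer buf;
	int type;

	/* Source format: what the application hands to the hardware. */
	src.fmt.pix.width = 640;
	src.fmt.pix.height = 480;
	src.fmt.pix.pixelformat = V4L2_PIX_FMT_YUYV;
	if (ioctl(fd, VIDIOC_S_FMT, &src) < 0)
		return -1;

	/* Destination format: what comes back (e.g. scaled/converted). */
	dst.fmt.pix.width = 320;
	dst.fmt.pix.height = 240;
	dst.fmt.pix.pixelformat = V4L2_PIX_FMT_RGB565;
	if (ioctl(fd, VIDIOC_S_FMT, &dst) < 0)
		return -1;

	/* One MMAP buffer on each queue (mapping them is omitted here). */
	memset(&req, 0, sizeof(req));
	req.count = 1;
	req.memory = V4L2_MEMORY_MMAP;
	req.type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
	if (ioctl(fd, VIDIOC_REQBUFS, &req) < 0)
		return -1;
	req.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
	if (ioctl(fd, VIDIOC_REQBUFS, &req) < 0)
		return -1;

	/* Queue one buffer on each side and start streaming. */
	memset(&buf, 0, sizeof(buf));
	buf.index = 0;
	buf.memory = V4L2_MEMORY_MMAP;
	buf.type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
	if (ioctl(fd, VIDIOC_QBUF, &buf) < 0)
		return -1;
	buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
	if (ioctl(fd, VIDIOC_QBUF, &buf) < 0)
		return -1;

	type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
	ioctl(fd, VIDIOC_STREAMON, &type);
	type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
	ioctl(fd, VIDIOC_STREAMON, &type);

	/* Dequeue the processed frame; this blocks until it is ready. */
	buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
	return ioctl(fd, VIDIOC_DQBUF, &buf);
}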

The other important issue that came up while preparing multimedia drivers
for the V4L2 framework is proper support for multi-plane buffers (like those
required by the MFC on newer Samsung SoCs). Here are more details:
http://article.gmane.org/gmane.linux.drivers.video-input-infrastructure/11212/

Best regards
--
Marek Szyprowski
Samsung Poland R&D Center




Re: [RFC] Global video buffers pool / Samsung SoC's

2009-11-11 Thread Guennadi Liakhovetski
On Wed, 11 Nov 2009, Harald Welte wrote:

 Hi Guennadi and others,
 
 first of all sorry for breaking the thread, but I am new to this list
 and could not find the message-id of the original mails nor a .mbox
 format archive for the list :(
 
 As I was one of the people giving comments to Guennadi's talk at ELCE,
 let me give some feedback here, too.

Adding the author of the RFC to CC.

 I'm currently helping the Samsung System LSI Linux kernel team with
 bringing their various ports for their ARM SoCs mainline.  So far we
 have excluded much of the multimedia related parts due to the complexity
 and lack of kernel infrastructure.
 
 Let me briefly describe the SoCs in question: They have an ARM9, ARM11
 or Cortex-A8 core and multiple video input and output paths, such as
 * camera interface
 * 2d acceleration engine
 * 3d acceleration engine
 * post-processor (colorspace conversion, scaling, rotating)
 * LCM output for classic digital RGB+sync interfaces
 * TV scaler
 * TV encoder
 * HDMI interface (simple serial-HDMI with DMA from/to system memory)
 * Transport Stream interface (MPEG-transport stream input with PID
   filter which can DMA to system memory)
 * MIPI-HSI LCM output device
 * Multi-Function codec for H.264 and other stuff
 * Hardware JPEG codec.
 plus even some more that I might have missed.
 
 One of the issues is that, at least in many current and upcoming
 products, all those integrated peripherals can only use physically
 contiguous memory.
 
 For the classic output path (e.g. Xorg+EXA+XAA+3D), that is fine.  The
 framebuffer driver can simply allocate some large chunk of physical
 system memory at boot time, map that into userspace and be happy.  This
 includes things like Xvideo support in the Xserver.  Also, HDMI output
 and TV output can be handled inside X or switch to a new KMS model.
 
 However, the input side looks quite different.  On the one hand we have
 the camera driver; on the other, HDMI input and transport stream input,
 which are less easy to handle.
 
 Also, given the plethora of such subsystems in a device, you definitely
 don't want to have one static big boot-time allocation for each of those
 devices.  You don't want to waste that much memory all the time just in
 case at some time you start an application that actually needs this.
 Also, it is unlikely that all of the subsystems will operate at the same
 time.
 
 So having an in-kernel allocator for physically contiguous memory is
 something that is needed to properly support this hardware.  At boot
 time you allocate one big pool, from which you then on-demand allocate
 and free physically contiguous buffers, even at a much later time.
 
 Furthermore, think of something like the JPEG codec acceleration, which
 you also want to use zero-copy from userspace.  So userspace (like
 libjpeg for decode, or a camera application for encode) would also need
 to be able to allocate such a buffer inside the kernel for input and
 output data of the codec, mmap it, put its jpeg data into it and then
 run the actual codec.
 
 How would that relate to the proposed global video buffers pool? Well,
 I think before thinking strictly about video buffers for camera chips,
 we have to think much more generically!
 
 Also, has anyone investigated if GEM or TTM could be used in unmodified
 or modified form for this?  After all, they are intended to allocate
 (and possibly map) video buffers...

I don't think I can contribute much to the actual matter of the discussion:
yes, there is a problem, the RFC is trying to address it, and there have been
attempts to implement similar things before (as you write above), so it just
has to be done eventually.

One question about your SoCs though - do they have SRAM, usable and
sufficient for graphics buffers? In any case, any such implementation will
have to be able to handle RAM other than main system memory too,
including card memory, NUMA, sparse RAM, etc., which is probably obvious
anyway.

Thanks
Guennadi
---
Guennadi Liakhovetski, Ph.D.
Freelance Open-Source Software Developer
http://www.open-technology.de/


Re: [RFC] Global video buffers pool / Samsung SoC's

2009-11-10 Thread Harald Welte
Hi Guennadi and others,

first of all sorry for breaking the thread, but I am new to this list
and could not find the message-id of the original mails nor a .mbox
format archive for the list :(

As I was one of the people giving comments to Guennadi's talk at ELCE,
let me give some feedback here, too.

I'm currently helping the Samsung System LSI Linux kernel team with
bringing their various ports for their ARM SoCs mainline.  So far we
have excluded much of the multimedia related parts due to the complexity
and lack of kernel infrastructure.

Let me briefly describe the SoCs in question: They have an ARM9, ARM11
or Cortex-A8 core and multiple video input and output paths, such as
* camera interface
* 2d acceleration engine
* 3d acceleration engine
* post-processor (colorspace conversion, scaling, rotating)
* LCM output for classic digital RGB+sync interfaces
* TV scaler
* TV encoder
* HDMI interface (simple serial-HDMI with DMA from/to system memory)
* Transport Stream interface (MPEG-transport stream input with PID
  filter which can DMA to system memory)
* MIPI-HSI LCM output device
* Multi-Function codec for H.264 and other stuff
* Hardware JPEG codec.
plus even some more that I might have missed.

One of the issues is that, at least in many current and upcoming
products, all those integrated peripherals can only use physically
contiguous memory.

For the classic output path (e.g. Xorg+EXA+XAA+3D), that is fine.  The
framebuffer driver can simply allocate some large chunk of physical
system memory at boot time, map that into userspace and be happy.  This
includes things like Xvideo support in the Xserver.  Also, HDMI output
and TV output can be handled inside X or switch to a new KMS model.

However, the input side looks quite different.  On the one hand we have
the camera driver; on the other, HDMI input and transport stream input,
which are less easy to handle.

Also, given the plethora of such subsystems in a device, you definitely
don't want to have one static big boot-time allocation for each of those
devices.  You don't want to waste that much memory all the time just in
case at some time you start an application that actually needs this.
Also, it is unlikely that all of the subsystems will operate at the same
time.

So having an in-kernel allocator for physically contiguous memory is
something that is needed to properly support this hardware.  At boot
time you allocate one big pool, from which you then on-demand allocate
and free physically contiguous buffers, even at a much later time.
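
Just to illustrate the idea (this is not an existing API, only a toy
user-space model of the bookkeeping, and the sizes and first-fit policy are
assumptions): reserve one large region once, then carve contiguous buffers
out of it on demand and give them back later:

/*
 * Toy model of a boot-time contiguous pool.  A real implementation would
 * reserve the region with the boot allocator and return physical/bus
 * addresses; this only models the first-fit bookkeeping.
 */
#include <stddef.h>
#include <stdio.h>

#define POOL_SIZE   (64u << 20)		/* pretend 64 MiB were reserved at boot */
#define MAX_REGIONS 64

static struct region {
	size_t offset, size;
	int used;
} regions[MAX_REGIONS] = { { 0, POOL_SIZE, 0 } };
static int nregions = 1;

/* First-fit allocation of a contiguous buffer; returns an offset or -1. */
static long pool_alloc(size_t size)
{
	for (int i = 0; i < nregions; i++) {
		if (regions[i].used || regions[i].size < size)
			continue;
		if (regions[i].size > size && nregions < MAX_REGIONS) {
			/* Split the free region; keep the tail free. */
			for (int j = nregions; j > i + 1; j--)
				regions[j] = regions[j - 1];
			regions[i + 1] = (struct region){ regions[i].offset + size,
							  regions[i].size - size, 0 };
			nregions++;
			regions[i].size = size;
		}
		regions[i].used = 1;
		return (long)regions[i].offset;
	}
	return -1;	/* exhausted or too fragmented */
}

/* Release a buffer (coalescing of free neighbours omitted for brevity). */
static void pool_free(long offset)
{
	for (int i = 0; i < nregions; i++)
		if ((long)regions[i].offset == offset && regions[i].used) {
			regions[i].used = 0;
			return;
		}
}

int main(void)
{
	long a = pool_alloc(5 << 20);	/* e.g. a 5 MiB capture buffer */
	long b = pool_alloc(1 << 20);	/* e.g. a 1 MiB codec buffer   */

	printf("a at offset %ld, b at offset %ld\n", a, b);
	pool_free(a);			/* the hole can be reused later */
	return 0;
}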

Furthermore, think of something like the JPEG codec acceleration, which
you also want to use zero-copy from userspace.  So userspace (like
libjpeg for decode, or a camera application for encode) would also need
to be able to allocate such a buffer inside the kernel for input and
to be able to allocate such a buffer inside the kernel for input and
output data of the codec, mmap it, put its jpeg data into it and then
run the actual codec.
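
As a rough sketch of that flow: the pool device, its allocation ioctl and the
codec ioctl below are purely hypothetical placeholders invented for this
illustration; only the mmap()/memcpy() usage is standard:

#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/ioctl.h>

/* Placeholder ioctls, made up for illustration only. */
#define POOL_ALLOC_BUF	_IOWR('P', 0, unsigned long)	/* in: size, out: handle */
#define CODEC_DECODE	_IOW('J', 0, unsigned long)	/* decode from handle    */

static int decode_with_hw(const void *jpeg, size_t len)
{
	int pool = open("/dev/vbpool", O_RDWR);		/* hypothetical pool node  */
	int codec = open("/dev/jpeg-codec", O_RDWR);	/* hypothetical codec node */
	unsigned long handle = len;
	void *buf;

	if (pool < 0 || codec < 0)
		return -1;

	/* Ask the kernel pool for a physically contiguous buffer... */
	if (ioctl(pool, POOL_ALLOC_BUF, &handle) < 0)
		return -1;

	/* ...map it and fill it with the compressed data... */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, pool, 0);
	if (buf == MAP_FAILED)
		return -1;
	memcpy(buf, jpeg, len);

	/* ...and let the codec DMA straight out of it: no extra copy. */
	return ioctl(codec, CODEC_DECODE, &handle);
}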

How would that relate to the proposed global video buffers pool? Well,
I think before thinking strictly about video buffers for camera chips,
we have to think much more generically!

Also, has anyone investigated if GEM or TTM could be used in unmodified
or modified form for this?  After all, they are intended to allocate
(and possibly map) video buffers...

Regards,
Harald
-- 
- Harald Welte lafo...@gnumonks.org   http://laforge.gnumonks.org/

Privacy in residential applications is a desirable marketing option.
  (ETSI EN 300 175-7 Ch. A6)


Re: [RFC] Global video buffers pool

2009-10-28 Thread Laurent Pinchart
Hi Guennadi,

On Tuesday 27 October 2009 08:49:15 Guennadi Liakhovetski wrote:
 Hi
 
 This is a general comment to the whole (contiguous) video buffer work:
 having given a talk at the ELC-E in Grenoble on soc-camera, I mentioned
 briefly a few related RFCs, including this one. I've got a couple of
 comments back, including the following ones (which is to say, opinions are
 not mine and may or may not be relevant, I'm just fulfilling my promise to
 pass them on;)):
 
 1) has been requested to move this discussion to a generic mailing list
 like LKML.

 2) the reason for (1) was, obviously, to consider making such a buffer
 pool also available to other subsystems, of which video / framebuffer
 drivers have been mentioned as likely interested parties.

Those are good ideas. The global video buffers pool will sooner or later (and
my guess is sooner) need to interact with X buffers (either for Xv rendering
or OpenGL textures). This needs to be discussed globally on the LKML.
 
 (btw, not sure if this has also been mentioned among those wishes - what
 about DVB? Can they also use such buffers?)

If I'm not mistaken DVB uses read/write syscalls to transfer data from/to the 
driver. A video buffers pool wouldn't fit well in that scheme.

-- 
Regards,

Laurent Pinchart


Re: [RFC] Global video buffers pool

2009-10-02 Thread Robert Tivy
Laurent Pinchart laurent.pinchart at ideasonboard.com writes:

 
 Hi Stefan,
 
 On Monday 28 September 2009 16:04:58 Stefan.Kost at nokia.com wrote:
  hi,
  
  -Original Message-
  From: ext Laurent Pinchart [mailto:laurent.pinchart at ideasonboard.com]
  Sent: 16 September, 2009 18:47
  To: linux-media at vger.kernel.org; Hans Verkuil; Sakari Ailus;
  Cohen David.A (Nokia-D/Helsinki); Koskipaa Antti
  (Nokia-D/Helsinki); Zutshi Vimarsh (Nokia-D/Helsinki); Kost
  Stefan (Nokia-D/Helsinki)
  Subject: [RFC] Global video buffers pool
  
   Hi everybody,
  
   I didn't want to miss this year's pretty flourishing RFC
   season, so here's another one about a global video buffers pool.
  
  Sorry for the very late reply.
 
 No worries, better late than never.
 
  I have been thinking about the problem on a bit broader scale and see the
  need for something more kernel wide. E.g. there is some work done from
  intel for graphics:
  http://keithp.com/blogs/gem_update/
  
  and this is not so much embedded even. If the buffer pools are
  v4l2-specific then we need to make all those other subsystems like xvideo,
  opengl, dsp-bridges become v4l2 media controllers.
 
 The global video buffers pool topic has been discussed during the v4l2 mini-
 summit at Portland last week, and we all agreed that it needs more research.
 
 The idea of having pools at the media controller level has been dropped in 
 favor of a kernel-wide video buffers pool. Whether we can make the buffers 
 pool not v4l2-specific still needs to be tested. As you have pointed out, we 
 currently have a GPU memory manager in the kernel, and being able to 
interact 
 with it would be very interesting if we want to DMA video data to OpenGL 
 texture buffers for instance. I'm not sure if that would be possible though, 
 as the GPU and the video acquisition hardware might have different memory 
 requirements, at least in the general case. I will contact the GEM guys at 
 Intel to discuss the topic.
 
 If we can't share the buffers between the GPU and the rest of the system, we 
 could at least create a V4L2 wrapper on top of the DSP bridge core (which 
will 
 require a major cleanup/restructuring), making it possible to share video 
 buffers between the ISP and the DSP.
 


TI has been providing this sort of contiguous buffer support for quite a few
years now.  TI provides a SW package named LinuxUtils, which contains a
module named CMEM (short for Contiguous MEMory manager).

The latest LinuxUtils release contains the CMEM docs:
http://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/linuxutils/2_24_03/exports/linuxutils_2_24_03.tar.gz

And the background/usage article here:
http://tiexpressdsp.com/index.php/CMEM_Overview

CMEM solves lots of the same sorts of things that the driver described in this 
thread does.  However, it doesn't integrate into other drivers, and it's 
accessed through the CMEM user interface.  Also, CMEM alleviates some of the 
issues raised in this thread since it uses memory not known to the kernel 
(user carves out a chunk by reducing kernel memory through u-boot mem= 
param), which IMO can be both good and bad (good - alleviates 
locking/unavailable memory issues, bad - doesn't cooperate with the kernel in 
getting memory, requiring user intervention).

Regards,

Robert Tivy
MGTS
Systems Software
Texas Instruments, Santa Barbara





RE: [RFC] Global video buffers pool

2009-09-28 Thread Stefan.Kost
hi, 

-Original Message-
From: ext Laurent Pinchart [mailto:laurent.pinch...@ideasonboard.com] 
Sent: 16 September, 2009 18:47
To: linux-media@vger.kernel.org; Hans Verkuil; Sakari Ailus; 
Cohen David.A (Nokia-D/Helsinki); Koskipaa Antti 
(Nokia-D/Helsinki); Zutshi Vimarsh (Nokia-D/Helsinki); Kost 
Stefan (Nokia-D/Helsinki)
Subject: [RFC] Global video buffers pool

Hi everybody,

I didn't want to miss this year's pretty flourishing RFC 
season, so here's another one about a global video buffers pool.


Sorry for the very late reply. I have been thinking about the problem on a bit
broader scale and see the need for something more kernel-wide. E.g. there is
some work done from intel for graphics:
http://keithp.com/blogs/gem_update/

and this is not so much embedded even. If the buffer pools are v4l2-specific
then we need to make all those other subsystems like xvideo, opengl,
dsp-bridges become v4l2 media controllers.

Stefan


All comments are welcome, but please don't trash this proposal 
too fast. It's a first shot at real problems encountered in 
real situations with real hardware (namely high resolution 
still image capture on OMAP3). It's far from perfect, and I'm 
open to completely different solutions if someone thinks of one.


Introduction


The V4L2 video buffers handling API makes use of a queue of 
video buffers to exchange data between video devices and 
userspace applications (the read method don't expose the 
buffers objects directly but uses them underneath). 
Although quite efficient for simple video capture and output 
use cases, the current implementation doesn't scale well when 
used with complex hardware and large video resolutions. This 
RFC will list the current limitations of the API and propose a 
possible solution.

The document is at this stage a work in progress. Its main 
purpose is to be used as support material for discussions at 
the Linux Plumbers Conference.


Limitations
===

Large buffers allocation


Many video devices still require physically contiguous memory. The
introduction of IOMMUs on high-end systems will probably make that a distant
nightmare in the future, but we have to deal with this situation for the
moment (I'm not sure if the most recent PCI devices support scatter-gather
lists, but many embedded systems still require physically contiguous memory).

Allocating large amounts of physically contiguous memory needs to be done as
soon as possible after (or even during) system bootup, otherwise memory
fragmentation will cause the allocation to fail.

As the amount of required video memory depends on the frame size and the
number of buffers, the driver can't pre-allocate the buffers beforehand. A few
drivers allocate a large chunk of memory when they are loaded and then use it
when a userspace application requests video buffers to be allocated. However,
that method requires guessing how much memory will be needed, and can lead to
waste of system memory (if the guess was too large) or allocation failures (if
the guess was too low).

Buffer queuing latency
---

VIDIOC_QBUF is becoming a performance bottleneck when capturing large images
on some systems (especially in the embedded world). When capturing high
resolution still pictures, the VIDIOC_QBUF delay adds to the shot latency,
making the camera appear slow to the user.

The delay is caused by several operations required by DMA transfers that all
happen when queuing buffers.

- Cache coherency management

When the processor has a non-coherent cache (which is the case with most
embedded devices, especially ARM-based) the device driver needs to invalidate
(for video capture) or flush (for video output) the cache (either a range, or
the whole cache) every time a buffer is queued. This ensures that stale data
in the cache will not be written back to memory during or after DMA and that
all data written by the CPU is visible to the device.

Invalidating the cache for large resolutions take a considerable amount of
time. Preliminary tests showed that cache invalidation for a 5MP buffer
requires several hundreds of milliseconds on an OMAP3 platform for range
invalidation, or several tens of milliseconds when invalidating the whole D
cache.

When video buffers are passed between two devices (for instance when passing
the same USERPTR buffer to a video capture device and a hardware codec)
without any userspace access to the memory, CPU cache invalidation/flushing
isn't required on either side (video capture and hardware codec) and could be
skipped.

- Memory locking and IOMMU

Drivers need to lock the video buffer pages in memory to make sure that the
physical pages will not be freed while DMA is in progress under low-memory
conditions. This requires looping over all pages (typically 4kB long) that
back the video buffer (10MB for a 5MP YUV image) and takes a considerable
amount

Re: [RFC] Global video buffers pool

2009-09-28 Thread Laurent Pinchart
Hi Stefan,

On Monday 28 September 2009 16:04:58 stefan.k...@nokia.com wrote:
 hi,
 
 -Original Message-
 From: ext Laurent Pinchart [mailto:laurent.pinch...@ideasonboard.com]
 Sent: 16 September, 2009 18:47
 To: linux-media@vger.kernel.org; Hans Verkuil; Sakari Ailus;
 Cohen David.A (Nokia-D/Helsinki); Koskipaa Antti
 (Nokia-D/Helsinki); Zutshi Vimarsh (Nokia-D/Helsinki); Kost
 Stefan (Nokia-D/Helsinki)
 Subject: [RFC] Global video buffers pool
 
  Hi everybody,
 
  I didn't want to miss this year's pretty flourishing RFC
  season, so here's another one about a global video buffers pool.
 
 Sorry for the very late reply.

No worries, better late than never.

 I have been thinking about the problem on a bit broader scale and see the
 need for something more kernel wide. E.g. there is some work done from intel
 for graphics:
 http://keithp.com/blogs/gem_update/
 
 and this is not so much embedded even. If the buffer pools are
 v4l2-specific then we need to make all those other subsystems like xvideo,
 opengl, dsp-bridges become v4l2 media controllers.

The global video buffers pool topic has been discussed during the v4l2 mini-
summit at Portland last week, and we all agreed that it needs more research.

The idea of having pools at the media controller level has been dropped in 
favor of a kernel-wide video buffers pool. Whether we can make the buffers 
pool not v4l2-specific still needs to be tested. As you have pointed out, we 
currently have a GPU memory manager in the kernel, and being able to interact 
with it would be very interesting if we want to DMA video data to OpenGL 
texture buffers for instance. I'm not sure if that would be possible though, 
as the GPU and the video acquisition hardware might have different memory 
requirements, at least in the general case. I will contact the GEM guys at 
Intel to discuss the topic.

If we can't share the buffers between the GPU and the rest of the system, we 
could at least create a V4L2 wrapper on top of the DSP bridge core (which will 
require a major cleanup/restructuring), making it possible to share video 
buffers between the ISP and the DSP.

-- 
Regards,

Laurent Pinchart


Re: [RFC] Global video buffers pool

2009-09-21 Thread Marek Szyprowski
Hello Laurent,

We have been developing quite a similar solution for Samsung SoC multimedia
drivers to the one mentioned in this RFC.

Our solution is based on a global buffer manager that provides buffers
(contiguous in physical memory) to user applications. The application can then
pass the buffers (as input or output) to different multimedia device drivers.
Please note that our solution is aimed at UMA systems, where all multimedia
devices can access system memory directly.

We decided not to use any special buffer identifiers. In our solution 
applications must mmap the buffer (even if they don't plan to read/write it 
directly) and pass the buffer user pointer to the multimedia driver.

To get access to a specified buffer we prepared a special layer that checks
whether the passed user pointer points to a buffer that is contiguous in
physical memory, properly locks the buffer memory and returns the buffer's
physical address. More details on this solution can be found here:
http://thread.gmane.org/gmane.linux.ports.arm.kernel/56879

Using the user pointer access type gave us the possibility to transfer
multimedia data directly to frame buffer memory and to create a SysV SHM
area from it (by some additional hacks in kernel mm). This gave us real
power, especially for hardware acceleration of the X server - with the XSHM
extensions we were able to blit frames directly from a user application's
buffer to the frame buffer memory.

Our multimedia devices do not use the V4L framework currently, but moving
towards V4L2 is possible.

Now let's get back to the RFC thesis.

The idea behind the global memory pool is really good and especially needed
in embedded-like systems. One of the important features of the buffer manager
is cache coherency control. The user who allocated a buffer can request that
the buffer be mapped as a cacheable area or not, depending on the intended use
case. Queueing non-cacheable buffers is faster of course (no cache flush is
required), but CPU read access is much slower (note the write-combining here).

A global memory pool should also reduce system memory requirements, however it
should be kept in mind that some use cases might cause memory fragmentation
issues. A pluggable memory manager should also be considered. With some
standard allocation methods (all buffers of the same size, first fit, best
fit, etc.) in the buffer manager, most of the typical use cases can be covered.
Some statistics on buffer allocation/deallocation and usage can also easily be
gathered by the buffer manager.

However, one should consider whether introducing a new v4l2 buffer access
method (V4L2_MEMORY_POOL) is really required. One of the key features of the
introduced pool buffer identifiers is the much quicker buffer locking, as no
per-page locking needs to be done. However, a similar effect can be achieved
with the USERPTR access method. A user application can allocate a buffer from
the buffer manager (global pool), mmap it and pass it to the driver with the
USERPTR method. The driver can quite easily check whether the passed user
pointer is a pointer to a buffer from the pool and then lock it quickly with
the similar method we used in our drivers for SoC multimedia hardware.
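
A rough sketch of that USERPTR fast path (illustrative only; pool_lookup is a
hypothetical helper of the buffer manager, not an existing kernel function).
The point is simply that a pointer falling inside a pool buffer is already
contiguous and resident, so no per-page pinning is needed:

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/sched.h>

struct pool_buf;	/* opaque descriptor kept by the buffer manager */

/* Hypothetical: return the pool buffer backing this user mapping, if any. */
struct pool_buf *pool_lookup(struct mm_struct *mm, unsigned long uaddr,
			     size_t len, dma_addr_t *paddr);

static int lock_userptr_buffer(unsigned long uaddr, size_t len,
			       dma_addr_t *paddr)
{
	if (pool_lookup(current->mm, uaddr, len, paddr))
		return 0;	/* fast path: contiguous and already locked */

	/*
	 * Slow path for arbitrary user memory: pin every page with
	 * get_user_pages(), build a scatter list, etc. (omitted here).
	 */
	return -EINVAL;
}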

Best regards
--
Marek Szyprowski
Samsung Poland R&D Center





RE: [RFC] Global video buffers pool

2009-09-18 Thread Hiremath, Vaibhav
 -Original Message-
 From: linux-media-ow...@vger.kernel.org [mailto:linux-media-
 ow...@vger.kernel.org] On Behalf Of Laurent Pinchart
 Sent: Wednesday, September 16, 2009 9:17 PM
 To: linux-media@vger.kernel.org; Hans Verkuil; Sakari Ailus; Cohen
 David Abraham; Koskipää Antti Jussi Petteri; Zutshi Vimarsh (Nokia-
 D-MSW/Helsinki); stefan.k...@nokia.com
 Subject: [RFC] Global video buffers pool
 
 Hi everybody,
 
 I didn't want to miss this year's pretty flourishing RFC season, so here's
 another one about a global video buffers pool.
 
 All comments are welcome, but please don't trash this proposal too fast.
 It's a first shot at real problems encountered in real situations with real
 hardware (namely high resolution still image capture on OMAP3). It's far
 from perfect, and I'm open to completely different solutions if someone
 thinks of one.
 
[Hiremath, Vaibhav] Thanks Laurent for putting this together. I believe memory
fragmentation is a critical issue for most of the new drivers. We need some
sort of solution to address this.

Please find some observations/issues/Q below - 

 
 Introduction
 
 
 The V4L2 video buffers handling API makes use of a queue of video buffers to
 exchange data between video devices and userspace applications (the read
 method don't expose the buffers objects directly but uses them underneath).
 Although quite efficient for simple video capture and output use cases, the
 current implementation doesn't scale well when used with complex hardware
 and large video resolutions. This RFC will list the current limitations of
 the API and propose a possible solution.
 
 The document is at this stage a work in progress. Its main purpose is to be
 used as support material for discussions at the Linux Plumbers Conference.
 
 
 Limitations
 ===
 
 Large buffers allocation
 
 
 Many video devices still require physically contiguous memory. The
 introduction of IOMMUs on high-end systems will probably make that a distant
 nightmare in the future, but we have to deal with this situation for the
 moment (I'm not sure if the most recent PCI devices support scatter-gather
 lists, but many embedded systems still require physically contiguous
 memory).
 
 Allocating large amounts of physically contiguous memory needs to be done as
 soon as possible after (or even during) system bootup, otherwise memory
 fragmentation will cause the allocation to fail.
 
 As the amount of required video memory depends on the frame size and the
 number of buffers, the driver can't pre-allocate the buffers beforehand. A
 few drivers allocate a large chunk of memory when they are loaded and then
 use it when a userspace application requests video buffers to be allocated.
 However, that method requires guessing how much memory will be needed, and
 can lead to waste of system memory (if the guess was too large) or
 allocation failures (if the guess was too low).
 
[Hiremath, Vaibhav] Could it be possible to fine-tune this based on the use
case? At least in the OMAP display driver we have a boot argument to control
the number and size of buffers, which the user can pass at boot time. The
default setting is 3 buffers at maximum resolution (720p).
With this it won't be guessing any more, right?

 Buffer queuing latency
 ---
 
 VIDIOC_QBUF is becoming a performance bottleneck when capturing large images
 on some systems (especially in the embedded world). When capturing high
 resolution still pictures, the VIDIOC_QBUF delay adds to the shot latency,
 making the camera appear slow to the user.
 
 The delay is caused by several operations required by DMA transfers that all
 happen when queuing buffers.
 
 - Cache coherency management
 
[Hiremath, Vaibhav] Agreed.

 When the processor has a non-coherent cache (which is the case with most
 embedded devices, especially ARM-based) the device driver needs to
 invalidate (for video capture) or flush (for video output) the cache (either
 a range, or the whole cache) every time a buffer is queued. This ensures
 that stale data in the cache will not be written back to memory during or
 after DMA and that all data written by the CPU is visible to the device.
 
 Invalidating the cache for large resolutions take a considerable amount of
 time. Preliminary tests showed that cache invalidation for a 5MP buffer
 requires several hundreds of milliseconds on an OMAP3 platform for range
 invalidation, or several tens of milliseconds when invalidating the whole D
 cache.
 
 When video buffers are passed between two devices (for instance when passing
 the same USERPTR buffer to a video capture device and a hardware codec)
 without any userspace access to the memory, CPU cache invalidation/flushing
 isn't required on either side (video capture and hardware codec) and could
 be skipped.
 
 - Memory locking and IOMMU
 
 Drivers need to lock the video buffer pages in memory to make sure

Re: [RFC] Global video buffers pool

2009-09-18 Thread Laurent Pinchart
Hi Hans,

On Thursday 17 September 2009 23:19:24 Hans Verkuil wrote:
 On Thursday 17 September 2009 20:49:49 Mauro Carvalho Chehab wrote:
  On Wed, 16 Sep 2009 17:46:39 +0200
  Laurent Pinchart laurent.pinch...@ideasonboard.com wrote:
   Hi everybody,
  
   I didn't want to miss this year's pretty flourishing RFC season, so
   here's another one about a global video buffers pool.
  
   All comments are welcome, but please don't trash this proposal too
   fast. It's a first shot at real problems encountered in real situations
   with real hardware (namely high resolution still image capture on
   OMAP3). It's far from perfect, and I'm open to completely different
   solutions if someone thinks of one.
 
 First of all, thank you Laurent for working on this! Much appreciated.
 
  Some comments about your proposal:
 
  1) For embedded systems, probably the best is to create it at boot
  time, instead of controlling it via userspace, since the earlier it is
  done, the better.
 
 I agree with Mauro here. The only way you can allocate the required memory
 is in general to do it early in the boot sequence.

I agree with you there as well, but there's one obvious problem with that 
approach: the Linux kernel doesn't know how much memory you will need.
Let me take the OMAP3 camera as an example. The sensor has a native 5MP 
resolution (2548x1938). When taking a still picture, we want to display live 
video on the device's screen in a lower resolution (840x400) and, when the 
user presses the camera button, switch to the 5MP resolution and capture 3 
images.

For this we need a few (let's say 5) 840x400 buffers (672000 bytes each in 
YUV) and 3 2548x1938 buffers (9876048 bytes each). Those requirements come 
from the product specifications, and the device driver has no way to know 
about them. Allocating several huge buffers at boot time big enough for all 
use cases will here use 75MB of memory instead of 31.5MB.

That's why I was thinking about allowing a userspace application to allocate 
those buffers very early after boot. One other possible solution would be to 
use a kernel command line parameter set to something like 
5x672000,3x9876048.
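
For illustration, here is a toy parser for that "countxsize,..." format; the
parameter name and the way the kernel would actually consume the result are
assumptions, not part of the proposal:

#include <stdio.h>
#include <stdlib.h>

struct buf_group {
	unsigned long count;
	unsigned long size;
};

/* Parse e.g. "5x672000,3x9876048" into (count, size) pairs. */
static int parse_pool_arg(const char *arg, struct buf_group *g, int max)
{
	int n = 0;

	while (*arg && n < max) {
		char *end;

		g[n].count = strtoul(arg, &end, 10);
		if (*end != 'x')
			return -1;
		g[n].size = strtoul(end + 1, &end, 10);
		n++;
		if (*end == ',')
			arg = end + 1;
		else if (*end == '\0')
			break;
		else
			return -1;
	}
	return n;	/* number of buffer groups parsed */
}

int main(void)
{
	struct buf_group g[8];
	int n = parse_pool_arg("5x672000,3x9876048", g, 8);

	for (int i = 0; i < n; i++)
		printf("group %d: %lu buffers of %lu bytes\n",
		       i, g[i].count, g[i].size);
	return 0;
}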

Another reason to allow applications to allocate buffers in the pool was to be 
able to pre-queue buffers to avoid cache invalidation and memory pinning 
delays at VIDIOC_QBUF. This is a very important topic that I might not have 
stressed enough in the RFC. VIDIOC_QBUF currently hurts performances. For the 
camera use case I've explained above, we need a way to pre-queue the 3 5MP 
buffers while still streaming video in 840x400.
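
To illustrate where that time goes: on a non-coherent ARM system the cache
maintenance happens in the streaming DMA API, once per queued buffer. A rough
sketch (not code from any driver; the cpu_touched flag is an invented
placeholder for whatever mechanism would track CPU access) of how the sync
could be skipped when userspace never touched the buffer:

#include <linux/device.h>
#include <linux/dma-mapping.h>

struct pool_vbuf {
	dma_addr_t dma;
	size_t size;
	bool cpu_touched;	/* set whenever userspace maps/writes it */
};

static void queue_capture_buffer(struct device *dev, struct pool_vbuf *b)
{
	if (b->cpu_touched) {
		/* Invalidate so stale cache lines can't clobber DMA data. */
		dma_sync_single_for_device(dev, b->dma, b->size,
					   DMA_FROM_DEVICE);
		b->cpu_touched = false;
	}
	/* ...program the capture DMA engine with b->dma here... */
}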

  2) As I've posted at the media controller RFC, we should take care to not
  abuse about its usage. Media controller has two specific objetives:
  topology enumeration/change and subdev parameter send.
 
 True, but perhaps it can also be used for other purposes. I'm not saying we
 should, but neither should we stop thinking about it. Someone may come up
  with a great idea for which a mc is ideally suited. We are still in the
  brainstorming stage, so any idea is welcome.

Agreed. The media controller RFC described its intended purpose, but I don't 
see why it couldn't be extended if we find a use case for which the media 
controller is ideally suited.

  For the last, as I've explained there, the proper solution is to create
  devices for each v4l subdev that requires control from userspace.
 
 The proper solution *in your opinion*. I'm still on the fence on that one.
 
  In the case of a video buffers memory poll, it is none of the usecases of
  media controller. So, it is needed to think better about where to
  implement it.
 
 Why couldn't it be one of the use cases? Again, it is your opinion, not a
  fact. Note that I share this opinion, but try to avoid presenting opinions
  as facts.
 
  3) I don't think that having a buffer pool per media controller will be
  so useful. A media controller groups /dev/video with their audio, IR,
  I2C... resources. On systems with more than one different board (for
  example a cellular phone with a camera and an DVB-H receiver), you'll
  likely have more than one media controller. So, controlling video buffer
  pools at /dev/video or at media controller will give the same results on
  several environments;
 
 I don't follow the logic here, sorry.
 
  4) As you've mentioned, a global set of buffers seem to be the better
  alternative. This means that V4L2 core will take care of controlling the
  pool, instead of leaving this task to the drivers. This makes easier to
  have a boot-time parameter specifying the size of the memory pool and
  will optimize memory usage. We may even have a Kconfig var specifying the
  default size of the memory pool (although this is not really needed,
  since new kernels allow specifying default line command parameters).
 
 Different devices may have quite different buffer requirements (size,
  number of buffers). Would it be safe to have them all allocated from a
  global 

Re: [RFC] Global video buffers pool

2009-09-18 Thread Laurent Pinchart
Hi Mauro,

thanks for the review. A few comments.

On Friday 18 September 2009 00:45:42 Mauro Carvalho Chehab wrote:
 On Thu, 17 Sep 2009 23:19:24 +0200
 Hans Verkuil hverk...@xs4all.nl wrote:

[snip]

   4) As you've mentioned, a global set of buffers seem to be the better
   alternative. This means that V4L2 core will take care of controlling
   the pool, instead of leaving this task to the drivers. This makes
   easier to have a boot-time parameter specifying the size of the memory
   pool and will optimize memory usage. We may even have a Kconfig var
   specifying the default size of the memory pool (although this is not
   really needed, since new kernels allow specifying default line command
   parameters).
 
  Different devices may have quite different buffer requirements (size,
  number of buffers). Would it be safe to have them all allocated from a
  global pool? I do not feel confident myself that I understand all the
  implications of a global pool or whether you actually always want that.
 
 This is a problem with the pool concept. Even having the same driver,
  you'll still be needing different resolutions, frame rates, formats and
  bits per pixel on each /dev/video interface.

That's right (the frame rate doesn't matter though), but not different memory 
type (low-mem, non-cacheable, contiguous, ...) requirements. The only thing 
that matters in the end is the number of buffers and their size. The pool 
doesn't care about the formats and resolutions separately.

  I'm not sure how to deal.

My idea was to have several groups of video buffers. You could allocate one
large group of low-resolution buffers for video preview, and a small group
of high-resolution buffers for still image capture. Video devices could then
pick buffers from one of those groups depending on their needs.
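
A minimal sketch of such groups (names and layout invented purely for
illustration): each group holds same-sized buffers, and a device picks a free
buffer from whichever group is large enough:

#include <stdbool.h>
#include <stddef.h>

struct pool_buffer {
	unsigned long phys;	/* physical/bus address of the buffer */
	bool in_use;
};

struct pool_group {
	size_t buf_size;	/* e.g. 672000 bytes for 840x400 YUV preview */
	unsigned int count;	/* e.g. 5 preview buffers, 3 still buffers   */
	struct pool_buffer *bufs;
};

/* Pick a free buffer from the first group whose buffers are big enough. */
static struct pool_buffer *pool_pick(struct pool_group *groups, int ngroups,
				     size_t needed)
{
	for (int i = 0; i < ngroups; i++) {
		if (groups[i].buf_size < needed)
			continue;
		for (unsigned int j = 0; j < groups[i].count; j++)
			if (!groups[i].bufs[j].in_use) {
				groups[i].bufs[j].in_use = true;
				return &groups[i].bufs[j];
			}
	}
	return NULL;	/* no free buffer of a suitable size */
}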

  Maybe we'll need to allocate the buffers considering the worse case that
  can be passed to the driver. For example, in the case of a kernel
  parameter, it could be something like:
   videobuf=buffers=32,size=256K
 To allocate 32 buffers with 256K each. This way, even if application asks
  for a smaller buffer, it will keep reserving 256K for each buffer. If bad
  specified, memory will be wasted, but the memory will be there.
 Eventually, after allocating that memory, some API could be provided for
 example to rearrange the allocated space into 64 x 128K.

We still need separate groups, otherwise we will waste too much memory. 5MP 
sensors are common today, and the size will probably grow in the years to 
come. We can't allocate 32 5MP buffers on an embedded system.

   5) The step to have a a global-wide video buffers pool allocation, as
   you mentioned at the RFC, is to make sure that all drivers will use
   v4l2 framework to allocate memory. So, this means porting a few drivers
   (ivtv, uvcvideo, cx18 and gspca) to use videobuf. As videobuf already
   supports all sorts of different memory types and configs (contig and
   Scatter/Gather DMA, vmalloced buffers, mmap, userptr, read, overlay
   modes), it should fits well on the needs.
 
  Why would I want to change ivtv for this? In fact, I see no reason to
  modify any of the existing drivers. A mc-wide or global memory pool is
  only of interest for very complex devices where you want to pass buffers
  around between various sub-devices (and possibly to other media devices
  or DSPs). And yes, they probably will have to use the framework in order
  to be able to coordinate these pools properly.
 
 The issue here is not necessarily related to device complexity. It can be
 motivated by other factors, for example:
 
   - arch's with non-coherent cache;
   - devices that aren't capable of doing DMA scatter/gather;
   - high memory fragmentation.
 
 Just as an example, I used an old laptop with only 256 MB of RAM, running
  a new distro, when I started developing the tm6000 drivers. On that
  hardware, I needed buffers of about 600 KB each. It was very common
  to not be able to allocate such buffers there, due to high memory
  fragmentation, since the USB driver was trying to allocate a contiguous
  buffer on that hardware.
 
 So, the same argument we used with the EMBEDDED Kconfig option also applies
  here: it is not everything black or white. For example, surveillance
  systems need to be very reliable. So, the possibility of allocating memory
  during boot will help them.
 
 Just to take a random real use case, David Liontooth mentioned recently at
  the ML his intention of maybe using ivtv hardware to capture TV signals at
  remote locations, having the hardware minimally assisted. He mentioned the
  need of capturing data continuously for 15 hours. That means that the
  machine will likely close devices and reopen once a day, during years. In
  such application, a video buffer pool will for sure reduce the risk of
  memory fragmentation on such systems, giving more reliability to the
  system, especially if the hardware it will 

Re: [RFC] Global video buffers pool

2009-09-18 Thread Mauro Carvalho Chehab
I'm joining your comments to Vaibhav with your comments to me, in order to
avoid duplicating comments.

On Fri, 18 Sep 2009 10:39:17 +0200
Laurent Pinchart laurent.pinch...@ideasonboard.com wrote:

   Different devices may have quite different buffer requirements (size,
   number of buffers). Would it be safe to have them all allocated from a
   global pool? I do not feel confident myself that I understand all the
   implications of a global pool or whether you actually always want that.
  
  This is a problem with the pool concept. Even having the same driver,
   you'll still be needing different resolutions, frame rates, formats and
   bits per pixel on each /dev/video interface.
 
 That's right (the frame rate doesn't matter though), but not different memory 
 type (low-mem, non-cacheable, contiguous, ...) requirements. The only thing 
 that matters in the end is the number of buffers and their size. The pool 
 doesn't care about the formats and resolutions separately.

For raw formats, that's right. However, with some compressed formats, there are
other parameters that affect the size of a framebuffer. For example, just
knowing the resolution is not enough for h.264/mpeg/jpeg formats. Even frame
rate can affect some of them, since they'll affect the temporal estimations. For
compressed formats, maybe the right approach would be to allocate buffers based
on the maximum allowed bandwidth.

 
   I'm not sure how to deal.
 
 My idea was to have several groups of video buffers. You could allocate one
 large group of low-resolution buffers for video preview, and a small group
 of high-resolution buffers for still image capture. Video devices could then
 pick buffers from one of those groups depending on their needs.
 
   Maybe we'll need to allocate the buffers considering the worse case that
   can be passed to the driver. For example, in the case of a kernel
   parameter, it could be something like:
  videobuf=buffers=32,size=256K
  To allocate 32 buffers with 256K each. This way, even if application asks
   for a smaller buffer, it will keep reserving 256K for each buffer. If bad
   specified, memory will be wasted, but the memory will be there.
  Eventually, after allocating that memory, some API could be provided for
  example to rearrange the allocated space into 64 x 128K.
 
 We still need separate groups, otherwise we will waste too much memory. 5MP 
 sensors are common today, and the size will probably grow in the years to 
 come. We can't allocate 32 5MP buffers on an embedded system.

  I agree with Mauro here. The only way you can allocate the required memory
  is in general to do it early in the boot sequence.  
 
 I agree with you there as well, but there's one obvious problem with that 
 approach: the Linux kernel doesn't know how much memory you will need.
 Let me take the OMAP3 camera as an example. The sensor has a native 5MP 
 resolution (2548x1938). When taking a still picture, we want to display live 
 video on the device's screen in a lower resolution (840x400) and, when the 
 user presses the camera button, switch to the 5MP resolution and capture 3 
 images.

 For this we need a few (let's say 5) 840x400 buffers (672000 bytes each in 
 YUV) and 3 2548x1938 buffers (9876048 bytes each). Those requirements come 
 from the product specifications, and the device driver has no way to know 
 about them. Allocating several huge buffers at boot time big enough for all 
 use cases will here use 75MB of memory instead of 31.5MB.
 
 That's why I was thinking about allowing a userspace application to allocate 
 those buffers very early after boot. One other possible solution would be to 
 use a kernel command line parameter set to something like 
 5x672000,3x9876048.

Interesting approach. Another alternative would be to allocate a flat memory
block during boot time, and provide a set of controls to control how the
memory will be divided.

  So, while I agree that it is not a mandatory requirement to port the
   existing drivers to benefit with the memory pool, by not doing it, those
   drivers will be less reliable than the other drivers on professional
   usage.
 
 Good point. No need to be too clever though. I think that the memory pool
 concept can be restricted to use cases where the user knows in advance what's
 going to happen with the hardware. A video monitoring system is one of them,
 a digital camera is another one.

Agreed.

 In those cases the system designer knows what resolutions will be streamed
 at, and how many buffers will be needed. This information can come from
 userspace or the kernel command line, and the memory pool won't need to
 become a complete memory management system. An application that wants to use
 buffers from the pool will then explicitly tell which set of buffers it
 wants to use.

I'm not sure that reserving the memory size from userspace would be good
enough, even if done very early. On the other hand, a complex command line
won't be good enough either.


Re: [RFC] Global video buffers pool

2009-09-17 Thread Hans de Goede

Hi,

On 09/16/2009 05:46 PM, Laurent Pinchart wrote:

Hi everybody,

I didn't want to miss this year's pretty flourishing RFC season, so here's
another one about a global video buffers pool.

All comments are welcome, but please don't trash this proposal too fast. It's
a first shot at real problems encountered in real situations with real
hardware (namely high resolution still image capture on OMAP3). It's far from
perfect, and I'm open to completely different solutions if someone thinks of
one.



Sounds like a reasonable and useful proposal to me. Not much to add other than
that.

Regards,

Hans


Re: [RFC] Global video buffers pool

2009-09-17 Thread Mauro Carvalho Chehab
On Wed, 16 Sep 2009 17:46:39 +0200
Laurent Pinchart laurent.pinch...@ideasonboard.com wrote:

 Hi everybody,
 
 I didn't want to miss this year's pretty flourishing RFC season, so here's 
 another one about a global video buffers pool.
 
 All comments are welcome, but please don't trash this proposal too fast. It's 
 a first shot at real problems encountered in real situations with real 
 hardware (namely high resolution still image capture on OMAP3). It's far from 
 perfect, and I'm open to completely different solutions if someone thinks of 
 one.

Some comments about your proposal:

1) For embedded systems, probably the best is to create it at boot time,
instead of controlling it via userspace, since the earlier it is done, the
better.

2) As I've posted at the media controller RFC, we should take care not to
abuse its usage. The media controller has two specific objectives: topology
enumeration/change and subdev parameter send. For the latter, as I've
explained there, the proper solution is to create devices for each v4l subdev
that requires control from userspace. A video buffers memory pool is none of
the use cases of the media controller. So, we need to think better about
where to implement it.

3) I don't think that having a buffer pool per media controller will be so
useful. A media controller groups /dev/video with their audio, IR, I2C...
resources. On systems with more than one different board (for example a
cellular phone with a camera and a DVB-H receiver), you'll likely have more
than one media controller. So, controlling video buffer pools at /dev/video
or at the media controller will give the same results in several
environments;

4) As you've mentioned, a global set of buffers seems to be the better
alternative. This means that the V4L2 core will take care of controlling the
pool, instead of leaving this task to the drivers. This makes it easier to
have a boot-time parameter specifying the size of the memory pool and will
optimize memory usage. We may even have a Kconfig var specifying the default
size of the memory pool (although this is not really needed, since new
kernels allow specifying default command line parameters).

5) The step to have a global-wide video buffers pool allocation, as you
mentioned in the RFC, is to make sure that all drivers will use the v4l2
framework to allocate memory. So, this means porting a few drivers (ivtv,
uvcvideo, cx18 and gspca) to use videobuf. As videobuf already supports all
sorts of different memory types and configs (contig and scatter/gather DMA,
vmalloc'ed buffers, mmap, userptr, read, overlay modes), it should fit the
needs well.

6) As videobuf uses a common method of allocating memory, and all memory
requests pass via videobuf-core (the videobuf_alloc function), implementing a
global-wide set of video buffers means touching just one function there, at
the abstraction layer, and double-checking in
videobuf-dma-sg/videobuf-vmalloc/videobuf-contig that they don't call their
own allocation methods directly. If they do, a simple change would be needed.

7) IMO, the better interface for it is to add some sysfs attributes to the
media class, providing there the means to control the video buffer pools. If
the size of a video buffer pool is set to zero, it will use normal memory
allocation. Otherwise, it will work in pool mode.
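
As an illustration of 7), a minimal sketch of such a sysfs attribute; the
attribute name, helpers and where it is registered are only an example, not
an existing interface:

#include <linux/device.h>
#include <linux/kernel.h>

static unsigned long pool_size;		/* 0 = pool mode disabled */

static ssize_t pool_size_show(struct device *dev,
			      struct device_attribute *attr, char *buf)
{
	return sprintf(buf, "%lu\n", pool_size);
}

static ssize_t pool_size_store(struct device *dev,
			       struct device_attribute *attr,
			       const char *buf, size_t count)
{
	int ret = kstrtoul(buf, 0, &pool_size);

	return ret ? ret : count;
}

/* Registered on the media class device, e.g. via device_create_file(). */
static DEVICE_ATTR(pool_size, 0644, pool_size_show, pool_size_store);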

8) By using videobuf, we can also export usage statistics via debugfs,
providing runtime statistics about how much memory is being used by which
drivers and /dev devices.



Cheers,
Mauro


RE: [RFC] Global video buffers pool

2009-09-17 Thread Karicheri, Muralidharan
Laurent,

Thanks for working on this. I might need some more time to review it, as
there are many RFCs up for review (including one from myself). From TI's
point of view we need something like a global buffer allocator for video
drivers.

Two ideas that came up while discussing this internally are given below,
which I thought I would share with you.

1) Add a common contiguous buffer allocator/deallocator to the video buffer
   layer which pre-allocates the buffers at bootup; the contiguous buffer
   layer uses it for allocating buffers. When it runs out, it falls back to
   its current scheme.

2) Similar to 1), except that the allocator uses the bootargs mem variable to
   calculate the available memory on the board and uses this memory for
   buffer allocation. This way the user can customize it based on a system
   design goal.

We might want to have user applications request buffers from the same pool
through an API. User space applications would use this API to allocate
contiguous buffers as needed and would use USERPTR IO in all drivers, or use
MMAP IO in one driver and USERPTR IO in other drivers using these buffer
pointers.

Murali Karicheri
Software Design Engineer
Texas Instruments Inc.
Germantown, MD 20874
new phone: 301-407-9583
Old Phone : 301-515-3736 (will be deprecated)
email: m-kariche...@ti.com

-Original Message-
From: linux-media-ow...@vger.kernel.org [mailto:linux-media-
ow...@vger.kernel.org] On Behalf Of Laurent Pinchart
Sent: Wednesday, September 16, 2009 11:47 AM
To: linux-media@vger.kernel.org; Hans Verkuil; Sakari Ailus; Cohen David
Abraham; Koskipää Antti Jussi Petteri; Zutshi Vimarsh (Nokia-D-
MSW/Helsinki); stefan.k...@nokia.com
Subject: [RFC] Global video buffers pool

Hi everybody,

I didn't want to miss this year's pretty flourishing RFC season, so here's
another one about a global video buffers pool.

All comments are welcome, but please don't trash this proposal too fast. It's
a first shot at real problems encountered in real situations with real
hardware (namely high resolution still image capture on OMAP3). It's far from
perfect, and I'm open to completely different solutions if someone thinks of
one.


Introduction


The V4L2 video buffers handling API makes use of a queue of video buffers to
exchange data between video devices and userspace applications (the read
method don't expose the buffers objects directly but uses them underneath).
Although quite efficient for simple video capture and output use cases, the
current implementation doesn't scale well when used with complex hardware and
large video resolutions. This RFC will list the current limitations of the API
and propose a possible solution.

The document is at this stage a work in progress. Its main purpose is to be
used as support material for discussions at the Linux Plumbers Conference.


Limitations
===

Large buffers allocation


Many video devices still require physically contiguous memory. The
introduction of IOMMUs on high-end systems will probably make that a distant
nightmare in the future, but we have to deal with this situation for the
moment (I'm not sure if the most recent PCI devices support scatter-gather
lists, but many embedded systems still require physically contiguous memory).

Allocating large amounts of physically contiguous memory needs to be done as
soon as possible after (or even during) system bootup, otherwise memory
fragmentation will cause the allocation to fail.

As the amount of required video memory depends on the frame size and the
number of buffers, the driver can't pre-allocate the buffers beforehand. A few
drivers allocate a large chunk of memory when they are loaded and then use it
when a userspace application requests video buffers to be allocated. However,
that method requires guessing how much memory will be needed, and can lead to
waste of system memory (if the guess was too large) or allocation failures (if
the guess was too low).

Buffer queuing latency
---

VIDIOC_QBUF is becoming a performance bottleneck when capturing large images
on some systems (especially in the embedded world). When capturing high
resolution still pictures, the VIDIOC_QBUF delay adds to the shot latency,
making the camera appear slow to the user.

The delay is caused by several operations required by DMA transfers that all
happen when queuing buffers.

- Cache coherency management

When the processor has a non-coherent cache (which is the case with most
embedded devices, especially ARM-based) the device driver needs to invalidate
(for video capture) or flush (for video output) the cache (either a range, or
the whole cache) every time a buffer is queued. This ensures that stale data
in the cache will not be written back to memory during or after DMA and that
all data written by the CPU is visible to the device.

Invalidating the cache for large resolutions take a considerable amount of
time. Preliminary tests showed that cache

Re: [RFC] Global video buffers pool

2009-09-17 Thread Hans Verkuil
On Thursday 17 September 2009 20:49:49 Mauro Carvalho Chehab wrote:
 On Wed, 16 Sep 2009 17:46:39 +0200, Laurent Pinchart
 laurent.pinch...@ideasonboard.com wrote:
 
  Hi everybody,
  
  I didn't want to miss this year's pretty flourishing RFC season, so here's
  another one about a global video buffers pool.
  
  All comments are welcome, but please don't trash this proposal too fast. It's
  a first shot at real problems encountered in real situations with real
  hardware (namely high resolution still image capture on OMAP3). It's far from
  perfect, and I'm open to completely different solutions if someone thinks of
  one.

First of all, thank you Laurent for working on this! Much appreciated.
 
 Some comments about your proposal:
 
 1) For embedded systems, it is probably better to create it at boot time,
 instead of controlling it via userspace, since the earlier it is done, the
 better.

I agree with Mauro here. The only way you can allocate the required memory is
in general to do it early in the boot sequence.

 2) As I've posted at the media controller RFC, we should take care not to
 abuse its usage. The media controller has two specific objectives: topology
 enumeration/change and sending parameters to sub-devices.

True, but perhaps it can also be used for other purposes. I'm not saying we
should, but neither should we stop thinking about it. Someone may come up with
a great idea for which a mc is ideally suited. We are still in the brainstorming
stage, so any idea is welcome.

 For the latter, as I've explained there, the proper solution is to create
 devices for each v4l subdev that requires control from userspace.

The proper solution *in your opinion*. I'm still on the fence on that one.

 In the case of a video buffers memory pool, it is not one of the use cases
 of the media controller. So, we need to think more about where to implement
 it.

Why couldn't it be one of the use cases? Again, it is your opinion, not a fact.
Note that I share this opinion, but try to avoid presenting opinions as facts.
 
 3) I don't think that having a buffer pool per media controller will be so
 useful. A media controller groups /dev/video with its audio, IR, I2C...
 resources. On systems with more than one board (for example a cellular phone
 with a camera and a DVB-H receiver), you'll likely have more than one media
 controller. So, controlling video buffer pools at /dev/video or at the media
 controller will give the same results in several environments;

I don't follow the logic here, sorry.

 4) As you've mentioned, a global set of buffers seems to be the better
 alternative. This means that the V4L2 core will take care of controlling the
 pool, instead of leaving this task to the drivers. This makes it easier to
 have a boot-time parameter specifying the size of the memory pool and will
 optimize memory usage. We may even have a Kconfig var specifying the default
 size of the memory pool (although this is not really needed, since new
 kernels allow specifying default command line parameters).

Different devices may have quite different buffer requirements (size, number
of buffers). Would it be safe to have them all allocated from a global pool?
I do not feel confident myself that I understand all the implications of a
global pool or whether you actually always want that.
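As a purely illustrative aside, the boot-time parameter mentioned in point 4
could be wired up roughly as follows; the vbpool= name and the variable are
made up:

/* Purely illustrative: a "vbpool=<size>" command line parameter. */
#include <linux/init.h>
#include <linux/kernel.h>

static unsigned long vb_pool_size;      /* default could come from Kconfig */

static int __init vb_pool_setup(char *str)
{
        vb_pool_size = memparse(str, &str);     /* accepts e.g. vbpool=16M */
        return 1;
}
__setup("vbpool=", vb_pool_setup);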
 
 5) The first step toward having a global-wide video buffers pool, as you
 mentioned at the RFC, is to make sure that all drivers will use the v4l2
 framework to allocate memory. So, this means porting a few drivers (ivtv,
 uvcvideo, cx18 and gspca) to use videobuf. As videobuf already supports all
 sorts of different memory types and configs (contig and scatter/gather DMA,
 vmalloced buffers, mmap, userptr, read, overlay modes), it should fit the
 needs well.

Why would I want to change ivtv for this? In fact, I see no reason to modify
any of the existing drivers. A mc-wide or global memory pool is only of
interest for very complex devices where you want to pass buffers around
between various sub-devices (and possibly to other media devices or DSPs).
And yes, they probably will have to use the framework in order to be able to
coordinate these pools properly.
 
 6) As videobuf uses a common method of allocating memory, and all memory
 requests pass via videobuf-core (the videobuf_alloc function), implementing a
 global-wide set of video buffers means touching just one function there, at
 the abstraction layer, and double-checking that the
 videobuf-dma-sg/videobuf-vmalloc/videobuf-contig layers don't call their own
 allocation methods directly. If they do, a simple change would be needed.
 
 7) IMO, the best interface for this is to add some sysfs attributes to the
 media class, providing there the means to control the video buffer pools. If
 the size of a video buffer pool is set to zero, normal memory allocation will
 be used. Otherwise, it will work in pool mode.

Or you use the existing 

[RFC] Global video buffers pool

2009-09-16 Thread Laurent Pinchart
Hi everybody,

I didn't want to miss this year's pretty flourishing RFC season, so here's 
another one about a global video buffers pool.

All comments are welcome, but please don't trash this proposal too fast. It's 
a first shot at real problems encountered in real situations with real 
hardware (namely high resolution still image capture on OMAP3). It's far from 
perfect, and I'm open to completely different solutions if someone thinks of 
one.


Introduction
============

The V4L2 video buffers handling API makes use of a queue of video buffers to 
exchange data between video devices and userspace applications (the read 
method doesn't expose the buffer objects directly but uses them underneath). 
Although quite efficient for simple video capture and output use cases, the 
current implementation doesn't scale well when used with complex hardware and 
large video resolutions. This RFC will list the current limitations of the API 
and propose a possible solution.

The document is at this stage a work in progress. Its main purpose is to be 
used as support material for discussions at the Linux Plumbers Conference.


Limitations
===========

Large buffers allocation
------------------------

Many video devices still require physically contiguous memory. The 
introduction of IOMMUs on high-end systems will probably make that a distant 
nightmare in the future, but we have to deal with this situation for the 
moment (I'm not sure if the most recent PCI devices support scatter-gather 
lists, but many embedded systems still require physically contiguous memory).

Allocating large amounts of physically contiguous memory needs to be done as 
soon as possible after (or even during) system bootup, otherwise memory 
fragmentation will cause the allocation to fail.

As the amount of required video memory depends on the frame size and the 
number of buffers, the driver can't pre-allocate the buffers beforehand. A few 
drivers allocate a large chunk of memory when they are loaded and then use it 
when a userspace application requests video buffers to be allocated. However, 
that method requires guessing how much memory will be needed, and can lead to 
waste of system memory (if the guess was too large) or allocation failures (if 
the guess was too low).

Buffer queuing latency
----------------------

VIDIOC_QBUF is becoming a performance bottleneck when capturing large images 
on some systems (especially in the embedded world). When capturing high 
resolution still pictures, the VIDIOC_QBUF delay adds to the shot latency, 
making the camera appear slow to the user.

The delay is caused by several operations required by DMA transfers that all 
happen when queuing buffers.

- Cache coherency management

When the processor has a non-coherent cache (which is the case with most 
embedded devices, especially ARM-based) the device driver needs to invalidate 
(for video capture) or flush (for video output) the cache (either a range, or 
the whole cache) every time a buffer is queued. This ensures that stale data 
in the cache will not be written back to memory during or after DMA and that 
all data written by the CPU is visible to the device.

Invalidating the cache for large resolutions takes a considerable amount of 
time. Preliminary tests showed that cache invalidation for a 5MP buffer 
requires several hundred milliseconds on an OMAP3 platform for range 
invalidation, or several tens of milliseconds when invalidating the whole D 
cache.
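For readers less familiar with where this cost appears in a driver, the sketch
below shows the usual DMA mapping pattern; on non-coherent architectures the
map/unmap calls are what perform the cache maintenance described above (buffer
handling is heavily simplified):

/* Simplified capture path: on non-coherent CPUs the map/unmap calls
 * below are where the cache invalidation time is spent. */
#include <linux/dma-mapping.h>

static int capture_one_frame(struct device *dev, void *vaddr, size_t size)
{
        dma_addr_t dma;

        /* Invalidates (capture) or flushes (output) the CPU cache over
         * the whole buffer: expensive for multi-megapixel frames. */
        dma = dma_map_single(dev, vaddr, size, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, dma))
                return -EIO;

        /* ... program the hardware, start DMA, wait for completion ... */

        /* Hands the buffer back to the CPU; more cache maintenance. */
        dma_unmap_single(dev, dma, size, DMA_FROM_DEVICE);
        return 0;
}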

When video buffers are passed between two devices (for instance when passing 
the same USERPTR buffer to a video capture device and a hardware codec) 
without any userspace access to the memory, CPU cache invalidation/flushing 
isn't required on either side (video capture and hardware codec) and could be 
skipped.

- Memory locking and IOMMU

Drivers need to lock the video buffer pages in memory to make sure that the 
physical pages will not be freed while DMA is in progress under low-memory 
conditions. This requires looping over all pages (typically 4kB long) that 
back the video buffer (10MB for a 5MP YUV image) and takes a considerable 
amount of time.

When using the MMAP streaming method, the buffers can be locked in memory when 
allocated (VIDIOC_REQBUFS). However, when using the USERPTR streaming method, 
the buffers can only be locked the first time they are queued, adding to the 
VIDIOC_QBUF latency.

A similar issue arises when using IOMMUs. The IOMMU needs to be programmed to 
translate physically scattered pages into a contiguous memory range on the 
bus. This operation is done the first time buffers are queued for USERPTR 
buffers.
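As a rough illustration of the per-page work involved, pinning a USERPTR buffer
looks more or less like the sketch below (using the get_user_pages_fast()
signature of that era, which has since changed; the helper name and error
handling are simplified):

/* Sketch of pinning the pages behind a USERPTR buffer. */
#include <linux/mm.h>
#include <linux/slab.h>

static int pin_userptr(unsigned long userptr, size_t size,
                       struct page ***pages_out)
{
        int nr_pages = PAGE_ALIGN(size + (userptr & ~PAGE_MASK)) >> PAGE_SHIFT;
        struct page **pages;
        int ret;

        pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return -ENOMEM;

        /* One pass over every 4kB page backing the buffer: roughly 2500
         * pages for a 10MB 5MP YUV frame, hence the latency. */
        ret = get_user_pages_fast(userptr & PAGE_MASK, nr_pages,
                                  1 /* write */, pages);
        if (ret < nr_pages) {
                while (ret > 0)
                        put_page(pages[--ret]);
                kfree(pages);
                return -EFAULT;
        }

        *pages_out = pages;
        return nr_pages;
}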

Sharing buffers between devices
-------------------------------

Video buffers memory can be shared between several devices when at most one of 
them uses the MMAP method, and the others the USERPTR method. This avoids 
memcpy() operations when transferring video data from one device to another 
through memory (video acquisition - hardware