Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2004-01-04 Thread Andrew Stevens

> In floating point, all you have to do is flip a sign bit.  But with
> integers, it's not so easy.  There is no instruction for absolute value in
> MMX, you have to use a four instruction sequence and two registers.  Slower
> than squaring a value, which only takes two instructions.

> I finally found an MMX2 reference, and you're right about that.  MMX2 added
> psadbw, packed sum of absolute differences.  If you have 8-bit unsigned
> data it makes computing SAD pretty darn easy; you can find the SAD of 8
> pixels in one instruction.

There are really very very few current CPUs that are MMX-only.   Basically 
everything after the K6 with MMX supports the psadbw instruction.

> Though why did mpeg2enc use variance in the first place?  Maybe it's a
> better estimator than SAD for motion compensation fit?

It's just not that simple.  It uses SAD for 'coarse' motion estimation and 
switches to variance for the final selection of the particular motion 
estimation mode.  This combination provides a good speed/quality trade-off.
Experiments with 'only variance' were unimpressive in their quality 
improvements and 'only SAD' costs quite a lot of quality for modest speed 
gain.   All the 'low hanging fruit' in the current motion estimation 
algorithm has long since been picked...
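
Roughly, the two-stage split looks like the sketch below (a minimal illustration
with made-up helper functions sad_16x16(), block_variance() and
pred_error_variance(); it is not mpeg2enc's actual code):

    #include <limits.h>
    #include <stdint.h>

    extern int sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride);
    extern int block_variance(const uint8_t *blk, int stride);
    extern int pred_error_variance(const uint8_t *cur, const uint8_t *pred, int stride);

    /* Stage 1: a cheap SAD scan over the search window picks the motion vector. */
    static void coarse_search(const uint8_t *cur, const uint8_t *ref, int stride,
                              int radius, int *best_dx, int *best_dy)
    {
        int best_sad = INT_MAX;
        for (int dy = -radius; dy <= radius; dy++)
            for (int dx = -radius; dx <= radius; dx++) {
                int s = sad_16x16(cur, ref + dy * stride + dx, stride);
                if (s < best_sad) {
                    best_sad = s;
                    *best_dx = dx;
                    *best_dy = dy;
                }
            }
    }

    /* Stage 2: the final inter/intra decision for the winning vector is made
     * on variance, which tracks residual coding cost better than SAD does. */
    static int prefer_inter(const uint8_t *cur, const uint8_t *ref, int stride,
                            int dx, int dy)
    {
        const uint8_t *pred = ref + dy * stride + dx;
        return pred_error_variance(cur, pred, stride) < block_variance(cur, stride);
    }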

> The level of AltiVec optimization in ffmpeg vs mpeg2enc is probably an important
> factor in any speed difference, and one that wouldn't matter for other
> CPUs, where it's the level of MMX/MMX2/SSE optimization that makes the large
> difference.

You have to be very careful to compare like coding profiles, motion search 
radii and suchlike too.  

It's easy to be twice as fast if you're simply trying half as hard to find a good 
encoding.   However, there *are* bottlenecks in mpeg2enc.   For a really speedy 
mode, predictive motion estimation algorithms would need to be used.

Andrew





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-20 Thread Steven M. Schultz

On 19 Dec 2003, Florin Andrei wrote:

> On Fri, 2003-12-19 at 01:49, Steven M. Schultz wrote:
> 
> > At any rate I checked out ffmpeg's mpeg2 encoding vs mpeg2enc on
> > my G4 Powerbook.  Yes, ffmpeg has a big speed advantage (~2x) but

The difference is even larger than I thought...   ffmpeg was decoding
the DV file and encoding the audio at the same time but I had mpeg2enc
reading a pre-staged .y4m file.

> Any chance repeating that on an Intel or AMD processor?

ffmpeg -i input.dv -vcodec mpeg2video -f mpeg -b 5000 -g 15 foo.mpg

then compare to 

mpeg2enc -R 0 -b 5000 -4 3 -2 2 -o foo1.mpg < input.y4m

Have fun :)

At the moment my AMD system's booked solid for other encoding jobs; it'll
be a couple of days before I can run some tests on it.   Hmmm, time to
fire up the dual P4 system and get it sync'd up on all the projects.
Maybe over the weekend.

Cheers,
Steven Schultz





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-19 Thread Florin Andrei
On Fri, 2003-12-19 at 01:49, Steven M. Schultz wrote:

>   At any rate I checked out ffmpeg's mpeg2 encoding vs mpeg2enc on
>   my G4 Powerbook.  Yes, ffmpeg has a big speed advantage (~2x) but
>   the resulting output is 'grainy' (same bitrate, no B frames) (and the
>   rate control is, well, almost non existent - ~2x spikes that'd drive
>   a hardware player nuts).   

Any chance repeating that on an Intel or AMD processor?

-- 
Florin Andrei

http://florin.myip.org/





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-19 Thread Richard Ellis
On Fri, Dec 19, 2003 at 01:34:38AM -0800, Trent Piepho wrote:
> On Fri, 19 Dec 2003, Andrew Stevens wrote:
> > The next bottlenecks would be the run-length coding and the use
> > of variance instead of SAD in motion compensation mode and DCT
> > mode selection.  Sadly
> 
> Is SAD really any faster to calculate than variance?  SAD uses an
> absolute value-add operation while variance is multiply-add. 
> Multiply-add is usually the most heavily optimized operation a cpu
> can perform.

You are thinking of DSP chips, not general purpose CPUs.  For DSPs,
yes, multiply-add is very heavily optimized, but for general purpose
CPUs, it's often not quite so heavily optimized.

Additionally, if you've got an SSE2 capable x86 chip, it's got
parallel SAD operations in the SSE2 instruction set.  There isn't an
SSE2 mul-add operation yet.





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-19 Thread Trent Piepho
On Fri, 19 Dec 2003, Steven M. Schultz wrote:
> On Fri, 19 Dec 2003, Trent Piepho wrote:
> 
> > On Fri, 19 Dec 2003, Andrew Stevens wrote:
> > 
> > Is SAD really any faster to calculate than variance?  SAD uses an absolute
> > value-add operation while variance is multiply-add.  Multiply-add is usually
> > the most heavily optimized operation a cpu can perform.
> 
>   Au contraire.   Multiply is a lot slower than abs().  All abs() has
>   to do is flip a sign bit (effectively) and that's going to be a lot

In floating point, all you have to do is flip a sign bit.  But with integers,
it's not so easy.  There is no instruction for absolute value in MMX, you have
to use a four instruction sequence and two registers.  Slower than squaring a
value, which only takes two instructions. 

Though you can cleverly combine an unsigned subtraction and absolute value
operation into four instructions total, and perform it on eight unsigned bytes
at a time.  So you can compute an absolute value of differences quite a bit
faster under MMX than I was thinking you could.  Clearly SAD would be faster
than variance.
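
For reference, the trick in plain MMX intrinsics looks roughly like this (an
illustrative sketch; abs_diff_pu8 is just a made-up name, and summing the
resulting bytes still takes extra work without psadbw):

    #include <mmintrin.h>   /* MMX intrinsics; needs an MMX-capable x86 compiler */

    /* |a - b| for 8 unsigned bytes at once: saturating subtracts in both
     * directions, then OR the two results.  One subtraction saturates to
     * zero, the other yields the absolute difference. */
    static __m64 abs_diff_pu8(__m64 a, __m64 b)
    {
        return _mm_or_si64(_mm_subs_pu8(a, b), _mm_subs_pu8(b, a));
    }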

Another advantage of SAD is that you can make use of an intermediate result more
easily than with variance.  That way you can short-circuit the SAD calculation
once it has already exceeded the best SAD found so far.
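
In plain C the short-circuit looks something like this (a minimal sketch, not
mpeg2enc's code):

    #include <stdint.h>
    #include <stdlib.h>

    /* SAD of a 16x16 macroblock against one candidate position, abandoning
     * the candidate as soon as the running sum can no longer beat the best
     * SAD found so far.  'stride' is the line pitch of both buffers. */
    static int sad_16x16_cutoff(const uint8_t *blk, const uint8_t *ref,
                                int stride, int best_so_far)
    {
        int sad = 0;
        for (int y = 0; y < 16; y++) {
            for (int x = 0; x < 16; x++)
                sad += abs(blk[x] - ref[x]);
            if (sad >= best_so_far)     /* cannot improve on the current best */
                return sad;             /* short-circuit the remaining rows */
            blk += stride;
            ref += stride;
        }
        return sad;
    }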

>   faster than any multiply.   And aren't there MMX2/SSE abs+add 
>   instructions - that would make abs/add quite fast.

I finally found an MMX2 reference, and you're right about that.  MMX2 added
psadbw, packed sum of absolute differences.  If you have 8-bit unsigned data
it makes computing SAD pretty darn easy; you can find the SAD of 8 pixels in
one instruction.
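
As a sketch (assuming an MMX2/SSE-capable CPU and a compiler that provides the
corresponding intrinsics), that single instruction looks like:

    #include <stdint.h>
    #include <mmintrin.h>
    #include <xmmintrin.h>   /* _mm_sad_pu8() wraps the psadbw instruction */

    /* SAD of 8 pairs of unsigned bytes in a single psadbw. */
    static int sad8(const uint8_t *a, const uint8_t *b)
    {
        __m64 va = *(const __m64 *)a;                   /* load 8 pixels each */
        __m64 vb = *(const __m64 *)b;
        int s = _mm_cvtsi64_si32(_mm_sad_pu8(va, vb));  /* sum lands in the low 16 bits */
        _mm_empty();                                    /* leave the FPU/MMX state clean */
        return s;
    }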

Though why did mpeg2enc use variance in the first place?  Maybe it's a better
estimator than SAD for motion compensation fit?

>   At any rate I checked out ffmpeg's mpeg2 encoding vs mpeg2enc on
>   my G4 Powerbook.  Yes, ffmpeg has a big speed advantage (~2x) but

The level of AltiVec optimization in ffmpeg vs mpeg2enc is probably an important
factor in any speed difference, and one that wouldn't matter for other CPUs,
where it's the level of MMX/MMX2/SSE optimization that makes the large difference.





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-19 Thread Steven M. Schultz

On Fri, 19 Dec 2003, Trent Piepho wrote:

> On Fri, 19 Dec 2003, Andrew Stevens wrote:
> 
> Is SAD really any faster to calculate than variance?  SAD uses an absolute
> value-add operation while variance is multiply-add.  Multiply-add is usually
> the most heavily optimized operation a cpu can perform.

Au contraire.   Multiply is a lot slower than abs().  All abs() has
to do is flip a sign bit (effectively) and that's going to be a lot
faster than any multiply.   And aren't there MMX2/SSE abs+add 
instructions - that would make abs/add quite fast.

At any rate I checked out ffmpeg's mpeg2 encoding vs mpeg2enc on
my G4 Powerbook.  Yes, ffmpeg has a big speed advantage (~2x) but
the resulting output is 'grainy' (same bitrate, no B frames) (and the
rate control is, well, almost non existent - ~2x spikes that'd drive
a hardware player nuts).   

Cheers,
Steven Schultz





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-19 Thread Trent Piepho
On Fri, 19 Dec 2003, Andrew Stevens wrote:
> The next bottlenecks would be the run-length coding and the use of variance 
> instead of SAD in motion compensation mode and DCT mode selection.  Sadly 

Is SAD really any faster to calculate than variance?  SAD uses an absolute
value-add operation while variance is multiply-add.  Multiply-add is usually
the most heavily optimized operation a cpu can perform.
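
For a concrete picture of what each one costs per pixel, here are the two inner
loops in plain C (a rough sketch of the kind of measures involved, not
mpeg2enc's actual code):

    #include <stdint.h>
    #include <stdlib.h>

    /* abs + add per sample */
    static int sad_mb(const uint8_t *cur, const uint8_t *pred, int stride)
    {
        int sad = 0;
        for (int y = 0; y < 16; y++, cur += stride, pred += stride)
            for (int x = 0; x < 16; x++)
                sad += abs(cur[x] - pred[x]);
        return sad;
    }

    /* multiply + add per sample, plus the mean correction at the end */
    static int var_mb(const uint8_t *cur, const uint8_t *pred, int stride)
    {
        int sum = 0, sumsq = 0;
        for (int y = 0; y < 16; y++, cur += stride, pred += stride)
            for (int x = 0; x < 16; x++) {
                int d = cur[x] - pred[x];
                sum   += d;
                sumsq += d * d;
            }
        /* variance of the prediction error, scaled by the 256 samples */
        return sumsq - (int)(((long long)sum * sum) / 256);
    }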






Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-19 Thread Andrew Stevens
On Tuesday 16 December 2003 23:35, Richard Ellis wrote:
Hi Richard,

> In that case it will kill the majority of the performance benefit
> provided by the caches, because there's very little locality of
> reference for the cache to compensate for.  It moves through at least
> 512k for pass one, then through the same 512k again for pass two, but
> the data in the cache is from the end of the frame, and we are
> starting over at the beginning of the frame.  Massive cache thrash in
> that case.  Memory bandwidth becomes a much more limiting factor.

Exactly what I thought when I restructured encoding to a per-macroblock basis a 
few months back.  The performance gain was not measurable.

The key 'thinko' here is that most of the time goes into motion estimation, 
and in motion estimation the search windows of neighbouring macroblocks 
overlap > 90%.  Cache locality is pretty good. Playing around with prefetch 
(etc etc) has never brought measurable gains.

The main bottleneck in the current encoder (for modern CPUs) is the first 
phase (4X4 subsampling) of the subsampling motion estimation hierarchy.   For 
speed this would need to be replaced with a predictive estimator.

The next bottlenecks would be the run-length coding and the use of variance 
instead of SAD in motion compensation mode and DCT mode selection.  Sadly 
there's not much that can easily be done about the former, and the latter 
cannot be removed without a noticeable reduction in encoding quality (I tried 
it :-().

However, I have some ideas to try when I get back in the new year!

Andrew






Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-17 Thread Richard Ellis
On Tue, Dec 16, 2003 at 06:54:22PM -0700, Slepp Lukwai wrote:
> As a side note, I'm also using a 200Hz timer, instead of the standard
> 100Hz. Though I don't see this doing anything but making it quicker, as
> it reduces latency on scheduling, while slightly increasing scheduler
> overhead and context switching (or is an SSE/3Dnow! CS really expensive,
> anyone know?).

A 200Hz timer will have only one effect on batch type processes,
slowing them down.  And mpeg2enc is essentially a batch type process. 
Why?  Because of the increased scheduler overhead.  Now, you may be
hard put to measure the slowdown because so many other effects will
swamp it (one HD seek that takes a few ms would swamp a large part of
the scheduler overhead) but it's still there.

The only thing that's "quicker" with a 200hz timer is interactive
response where you want to see your X cursor move the instant you
touch the mouse.

Yes, context switching (at least for SSE) is more expensive, because
the 8 128bit SSE registers may need to be saved.  I don't know off
the top of my head if Intel implemented lazy context saves for SSE
like with the x86-fpu stack.  If they did, then not all context swaps
incur the SSE save overhead, but when one does, there is more data to
save.

> I wonder if it comes back to the increased timing of the scheduler?
> (Though it's using a supposed O(1) scheduler, which should offset
> that).

The O(1) scheduler does not change the context switch overhead
timing.  The O(1) scheduler simply says that no matter how many
processes are waiting to run, it's a constant time to find the "next"
one when we do need to context switch.  But a 200Hz timer will still
use up 2x as much cpu time running the scheduler as a 100Hz timer will.





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-17 Thread Slepp Lukwai
On Tue, 2003-12-16 at 23:17, Bernhard Praschinger wrote:
> > -M 0: 2m 11.9s
> > -M 1: 2m 10.6s, -1.3s
> > -M 2: 1m 27.7s, -44.2s
> > -M 3: 1m 26.5s, -45.4s
> Those values look much better.  :-)
> Now you have seen that mpeg2enc can go faster.

It's like it used to be. :> I'm going to try it on a full video, with a
few options. I figure I'll let it run through 24 hours of encoding time
(about 6 different trials) and see how each result turns out, and so on.
I'll let you all know when it's done. :>

> I have tried the command you used on my machine, and I have seen the 
> same "problem". Also 3 processes and each only 33% .
> 
> (time lav2yuv n1000.eli | mpeg2enc -I 0 -f 8 -b 9800 -p -a 3 -o test.m2v
> -S  -M 3 -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0)

Yes.. So it's definitely the -R 0, but -R 1 is faster than the default
of -R 2 (I think that's the default?)

> > Note that I responded in an earlier message with a total of 24 timings
> > across -M 0-3 -I 0-1 -R 0-2 settings, which turned up some interesting
> > results that -M 3 -I 0 -R 1 worked fastest of all of them (same source
> > material I used for the above, and it took 51 seconds). So, I think the
> > -I 1 is on, which makes a huge boost in -M ratings from 0 to 3, but it
> > is still quite a bit slower than -I 0 (which I use since the input is
> > Progressive 23.976fps)
> That's strange.

It makes a mild bit of sense.. But just a little.

> I'm just running some encodings to see which option causes the problem. 
> 
> On my machine the -R 0 caused the problem. If I used -R 1/2 or no -R
> option, I got 3 processes each using about 45-50%. 

Which should total about 150% CPU instead of 99% that it uses with -R 0.

> > > My brain had given up the time I started my computer that evening ;)
> > Mine usually does that at about 8am. :>
> Just as you enter work ? ;)

Self-employed, thereby just as I crawl out of bed, and my brain stays
broken until about noon. That's what I get for staying up till 4am
playing with mpeg2enc. :>

> Encoding without the -R 0 seems to solve the problem, by now.

I'm going to see what speeds I get about halfway through a video, when
nothing from disk is in cache anymore, the encoders/decoders are in full
swing, and everything sort of settles down.. Should be interesting.






Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-17 Thread Slepp Lukwai
Just a side note, I find it interesting your name is Andrew Stevens,
whereby mine is Stephen Andrew (middle name).

On Tue, 2003-12-16 at 14:41, Andrew Stevens wrote:
> Yep.   You should (in theory) get a lot closer to that with the current 
> MPEG_DEVEL branch mpeg2enc.   However, your scaling is really remarkably bad 
> as even the -R 2 values where two CPUs should be fairly busy are unusually 
> bad.  I've never heard of worse than 70% utilisation on dual CPU machines.

And I'm wondering why it's not scaling... Hence the original post about
this.

> Here's a fairly typical snapshot of mpeg2enc -M 2 -I 1 -R 2 in action on my 
> dual P-III machine...
> 
>   PID USER PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
> 12620 as18   0 46464  45M   768 R80.9 24.3   0:18 lt-mpeg2enc
> 12621 as18   0 46464  45M   768 R70.8 24.3   0:18 lt-mpeg2enc
> 12619 as 9   0 46464  45M   768 S 3.9 24.3   0:01 lt-mpeg2enc

Which is nothing like I see. I rarely see two of them break 60%, but
they hover closer to 45%.

> You're getting very very symmetrical CPU loads and very very poor utilisation.  
> What kernel are you using... I vaguely recall the 2.6.x series radically changed 
> the threading libs.  It could be something pathological is happening in the 
> scheduling.  

It's 2.4.20-gentoo-r9, actually. I'm wondering if a patch in here is
causing problems, but I'm very hesitant to try any other kernels since
this chipset/board are rather flaky and now that it's working again, I
don't want it to break (I couldn't run mpeg2enc, let alone
transcode/dvdrip for almost 5 months because it would lock the system
hard when it was under load). The newest kernels, 2.6.x, don't let me
disable the APIC in the kernel itself, and that causes problems. Perhaps
tonight I'll test a 2.4.23 without any patches (just vanilla) and see
what happens with scheduling.

As a side note, I'm also using a 200Hz timer, instead of the standard
100Hz. Though I don't see this doing anything but making it quicker, as
it reduces latency on scheduling, while slightly increasing scheduler
overhead and context switching (or is an SSE/3Dnow! CS really expensive,
anyone know?).

> The  2100+ is of course  a lot faster than the P-III but: I doubt the balance 
> between the motion estimation and the rest of the code is hugely shifted.  
> Certainly, the approximate proportions of time spent in each are quite similar 
> on my 2100+ single-CPU machine and a P-III.

On the single 2000 XP we have, it runs about 90% of the speed of my
machine in SMP mode (-M 3).

> > Also, encoding with one B frame is a touch faster in -I 1 mode than
> > encoding without them, but it is slower when you encode two B frames
> ...
>
> Not really. However: I would expect going to two B frames to greatly increase 
> your CPU utilisation without much wall-clock time increase, due to the increased 
> scope for parallel computation.

but I was more or less pointing out the timings from the message that -I
1 -R 1 was faster than -I 0 -R 1, for some reason. Not all that much,
but noticeably.

> This is what you'd expect: -R 2 offers much more scope for the 3 worker 
> threads of -M 3 to do something useful.

It still worked out 3-0-1 was the shortest overall time spent, even if
CPU usage was still not peaked.

> The usefulness of B frames depends a *lot* on the type of material.  For 
> captured stuff they rarely buy you much apart from free room heating from 
> your CPU. Hence the provision of -R 0 ;-).  They should get a little more 
> useful when I add dynamic frame type selection to mpeg2enc in the new year.

Strictly DVD copies. With three cats and being lazy, I have more DVDs
ruined than I'd like to count. (Speaking of room heating, it's about
-20degC at the moment outside, and my window is about 10cm open, and
it's still a toasty 25 degrees in here. my office doubles as the server
closet).

> > > - There is also a parallel read-ahead thread but this rarely soaks much
> > > CPU on modern CPUs.
> 
> Weirdly enough on your machine the reader thread is exceedingly busy

I use LVM, but can read 35MB/s off the disks with that. The memory
buffer cache is about 250MB/s. I wonder if it comes back to the
increased timing of the scheduler? (Though it's using a supposed O(1)
scheduler, which should offset that).

> cvs co -d :ext:[EMAIL PROTECTED]:/cvsroot/mjpeg mjpeg_play
> cd mjpeg_play
> cvs update -r MPEG_DEVEL mpeg2enc

:ext: wanted a password for anonymous, and 'enter' didn't work. So, I
used :pserver:. I sent you a message about the problems I encountered
thereof.

> The 'mjpeg_play' is a bit of a historical oddity but it is monumentally 
> painful to change directory names in CVS...

I tend to just drop the entire project, clean it up, and reimport it
into a fresh tree to rename it. :> You'd think that in the years and
revisions CVS has undergone, renaming of directories wouldn't be nearly
as painful.





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-17 Thread Slepp Lukwai
On Tue, 2003-12-16 at 10:27, Steven M. Schultz wrote:
> On Tue, 16 Dec 2003, Slepp Lukwai wrote:
> 
> > Tried it without any options, same effect. I'm definitely seeing nowhere
> > near 40% speedup, which is what boggles me. I expected at least
> > reasonable gains of 25%.
> 
>   I think that has to do with the -I setting...

The -I is frightening me. Take a look at my previous post with the -M x
-I x -R x settings on each. The -I 1 with two B frames (-R 2) shows a
huge gain over the -M 0 -I 1 -R 2, but it is still significantly slower
than -M 3 -I 0 -R 1 or 2.

> > Sorry, upon further testing, I actually average around 14fps at DVD
> > quality (720x480, 9800kbit/s). (see all the details of my command lines
> 
>   Ah, that's more like it then.   

Yup.

> > It's interesting that I'm faster with dual 2100s than the dual 2800 (or
> > at least on par). I suppose it really comes down to command line
> > options, but you would need to compare those yourself (since I haven't
> 
>   Friend of mine has dual 2400s and my setup is ~10-15% faster as I
>   recall - he's getting around 11fps as a rule where I see 14 or so.
> 
>   I'm usually adding a bit of overhead with the chroma conversion.  I
>   build smilutils with ffmpeg/libavcodec (to use ffmpeg's DV codec)
>   and then run the data thru something like:
>   "smil2yuv -i 2 file.dv | filters | y4mscaler -O chromass=420_MPEG2 |..."
> 
>   Produces better output that the default which uses libdv but does
>   cost a bit in cpu use.

You could run my test case. I pre-decoded an MPEG2 DVD stream into 1010
frames and then used this:

for M in 0 1 2 3 ; do
  for I in 0 1 ; do
    for R in 0 1 2 ; do
      export LD_LIBRARY_PATH=/home/slepp/mp/lib
      echo -M $M -I $I -R $R
      time /home/slepp/mp/bin/mpeg2enc -f 8 -M $M -g 9 -G 18 -I $I -v 0 \
        -E -10 -K kvcd -4 2 -2 1 -R $R -F 1 -o test-$M$I$R.m2v < pgmy4m.raw
    done
  done
done

It was rather handy, but it took a long time to run. The source
pgmy4m.raw is 536MB here.

> > According to the docs -I 1 turns on interlacing support, and causes
> > un-needed overhead if it is known progressive material. Hence the -I 0
> > (plus transcode sets that, though I could override it).
> 
>   But unless you have the raw 23.976fps progressive data (with the 3:2
>   pulldown undone) then I think '-I 1' is the option to use.   But then 
>   I might be confused (wouldn't be the first time ;)).

Yup. It's MPEG-2 DVD @ 23.976 fps, and I need to add in the pulldown.

>   That would explain why the encoding rate I see is lower since I'm
>   using -I 1.

See above. :>

> > >   wrong I'm sure someone will tactfully point that out ;)) the speedup
> > >   comes from the motion estimation of the 2 fields/frame being done in
> > >   parallel.
> > 
> > Oh. Son of a... If that's all it is...
> 
>   Yep - I'm fairly sure that is why you're not seeing any improvement
>   when using "-M 2".

You're right that -I improves performance over baselines later on, but
it doesn't improve over -I 0.

>   On noisy source material the -E option has almost no effect  but the
>   cleaner the input the more effect even modest values of -E have.

So a -E helps with transcoding these DVDs.

> > now, in combination with -Q, but I find the artifacts are almost never
> > there (I used to do -q 4 and -Q 4.0, and it looked about the same as the
> > 5/3.0).
> 
>   Perhaps Richard Ellis could chime in with his experiences with -Q ;)

I will need to look at that. :> Any idea how it impacts performance?

>   Looking back on it that makes sense though.   A P frame depends on the
>   preceeding P frame - rather sequential in nature since you can't
>   move on to the next one without completing the first one...

But it would make sense, I thought, to do IPBBPBBPBB on thread 1 and
IPBBPBBPBB on thread 2. So each thread does a GOP (in fact, I found a
marginal increase in speed by using a gop size of 3 (-g 3 -G 3) and -M
3, it was a little faster than when I had variable gop sizing).

But, I think that the encoder doesn't stream it that way, but rather
does the I frame, then fires off the P frame, then each B is done in
parallel, and then it does another P, etc.

> > The MPEG decoding doesn't take much, and the pipe overhead is negligble,
> 
>   Pipe overhead sneaks up on you though.   One pipe?  Not a real problem,
>   two?  Begins to be noticed but isn't too bad.   Four or five?   Yeah,
>   it starts to take a hit on the overall speed of the system - the data
>   has to go up/down thru the kernel all those times and that's not "free".

Everything should be free. ;> What I'd like to test, just for fun, is to
fire up mpeg2enc on one of the 16 processor SGI systems at the
University and see how that threads across 16. It says it doesn't do
well above 4, but again, it's fun to try.

>   You might try, for timing purposes, without -I 0 and see what, if any
>   effect that has.   Might b

Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-17 Thread Slepp Lukwai
On Tue, 2003-12-16 at 13:15, Richard Ellis wrote:
> On Tue, Dec 16, 2003 at 12:33:52AM -0700, Slepp Lukwai wrote:
> >.. It's a dual Athlon, which inherently means 266FSB (DDR 266),
> > though the memory is actually Hynix PC3200 w/ timings set as low as
> > they go on this board (2-2-2), which gives me about 550MB/s memory
> > bandwidth according to memtest, with a 13GB/s L1 and something like
> > 6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1.  At 550MB/s, it
> > SHOULD be able to push enough to keep the frames encoding at 100%
> > CPU, in theory.
> 
> Yes, but just one 720x480 DVD quality frame is larger than 256k in
> size, so a 256k cache per CPU isn't helping too much overall
> considering how many frames there are in a typical video to be
> encoded.  Plus, my experience with Athlons is that they are actually
> faster at mpeg2enc encoding than Intel chips of equivalent speed
> ratings (the Athlon's 3dnow/mmx implementation is faster) and so they
> put a heavier stress on one's memory bandwidth than an equivalent
> speed Intel chip would.  It's possible that 275MB/s per CPU just
> isn't fast enough to keep up with the rate that mpeg2enc can consume
> data on an Athlon.

Yes, I expect the cache to only be able to fit the mpeg2enc code
sections, not any of the data it uses. If the code keeps getting bumped
out, then that's a problem. And 275MB/s may not be enough, true... It's
too bad the Athlon dual chipset (AMD 768MPX) can't do above about 140
MHz bus speeds to see how much memory speed affects it.

> Of course, Andrew would be much better suited to discuss mpeg2enc's
> memory access patterns during encoding, which depending on how it
> does go about accessing memory can better make use of the 256k of
> cache, or cause the 256k of cache to be constantly thrashed in and
> out.

It could be interesting to use cachegrind on mpeg2enc and see what it
declares for cache hit/miss, but I find cachegrind tends to make a 1
minute runtime hit 10 minutes, so I may not bother..

> > Now that's just silly. Why would you hurt the CPUs by running such bloat
> > as Mozilla? I can't think of how many times Mozilla has gone nuts on me
> > and used 100% CPU without reason, and you can't kill it any normal UI
> > way.. Good ol' killall. However, I love it. It's a great browser. Just
> > rather hungry at times. I suppose there's a reason the logo is a
> > dinosaur. :>
> 
> Hmm... Interesting.  I've had it sometimes just stop but never go
> nuts with 100% CPU, and although I usually do CLI kill it if need be,
> FVWM2's "destroy" window command has never failed to get rid of it if
> I don't bother to go CLI to do so.  In fact, FVWM2's "destroy" has
> never failed to get rid of anything that went wonky.  It's the X
> windows equivalent to a "kill -9" from the CLI.

I've had it lock up and X becomes unresponsive since it's in a loop
doing some expensive operation of some sort. It's strange. I don't see
it nearly as often with the newer Mozillas as I did the old ones (in
fact, haven't seen it in over a month).





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-17 Thread Slepp Lukwai
On Tue, 2003-12-16 at 12:57, Bernhard Praschinger wrote:
> Could you run a few tests (please).  Get some frames (100-1000) in yuv
> format. I guess that should be possible even with transcode. ;)
> (I do not use transcode so I can't help, or get the test streams on
> mjpeg.sf.net)

With about 1010 frames of YUV using < to dump it in (instead of cat), I
get these:

-M 0: 2m 11.9s
-M 1: 2m 10.6s, -1.3s
-M 2: 1m 27.7s, -44.2s
-M 3: 1m 26.5s, -45.4s

Note that I responded in an earlier message with a total of 24 timings
across -M 0-3 -I 0-1 -R 0-2 settings, which turned up some interesting
results that -M 3 -I 0 -R 1 worked fastest of all of them (same source
material I used for the above, and it took 51 seconds). So, I think the
-I 1 is on, which makes a huge boost in -M ratings from 0 to 3, but it
is still quite a bit slower than -I 0 (which I use since the input is
Progressive 23.976fps)

> And do afterwards something like that:
> cat stream.yuv | mpeg2enc -f8 -M 0-3 -o test.m2v 
> or 
> lav2yuv stream.avi | mpeg2enc -f 8 -M 0-3 -o test.m2v
> 
> So you can be sure that nothing else makes any trouble. And check
> how it is going. That should not take too long. Then you can add
> the options you used, to see if anything there causes the problem of the
> framerate not increasing.

Compared to the run with my long options line, these are 

> Bad. WHich board do you have ? (Mine is a Tyan Tiger MPX) 

Nice board, that one. Asus A7M-266D.. I should've grabbed the MSI K7D
Master for the same price, I hear much nicer things about it.

> My brain had given up the time I started my computer that evening ;)

Mine usually does that at about 8am. :>

> But I'm not really knowing why the situation is that bad.

I'm just not seeing the dual CPU usage that would warrant even running
in multiple threads, when I could instead transcode two entirely
separate items as though I had two machines, which makes some sense (I
did that the other day, worked rather well). But, if I can make a single
copy work by flooding both CPUs with activity, then I'll be happier,
since it should take quite a bit less time to encode a full movie.





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-17 Thread Slepp Lukwai
On Tue, 2003-12-16 at 12:33, Andrew Stevens wrote:
> Hi all,
> 
> First off a bit of background to the multi-threading in the current stable 
> branch.  First off:
> 
> - Parallelism is primarily frame-by-frame.  This means that the final phases 
> of the encoding lock on completion of the reference frame (prediction and DCT 
> transform) and the predecessor (bit allocation).   If you have a really fast 
> CPU that motion estimates and DCT's very fast you will get lower 
> parallelisation.  If you use -R 0 you will get very little parallelism *at 
> all*.   Certainly not enough to make -M 3 sensible.

Yet again, good to know.

This line (generally, a triple loop for 0-3 M, 0-1 I and 0-2 R):

Produces this (approximately 1010 frames), encoding times (real time /
user time, gives a bit of a view as to how busy the CPUs were during the
real time, optimal should be 1m realtime, 2m user time, right? and
average system time was 3.0s, with +/- 0.2s for all tests):

(options on each call were:
 -f 8 -g 9 -G 18 -v 0 -E -10 -K kvcd -4 2 -2 1 -F 1 < rawstream.yuv
)

-M 0 -I 0 -R 0: 1m  6.082s  0m 50.050s  baselines
-M 0 -I 0 -R 1: 1m 16.545s  0m 58.980s  ..
-M 0 -I 0 -R 2: 1m 34.511s  1m 17.045s  ..
-M 0 -I 1 -R 0: 2m  7.344s  1m 49.495s  ..
-M 0 -I 1 -R 1: 1m 59.665s  1m 42.215s  ..
-M 0 -I 1 -R 2: 2m 30.990s  2m 30.990s  ..

-M 1 -I 0 -R 0: 1m  5.713s  0m 49.800s  -0.35s
-M 1 -I 0 -R 1: 1m 15.305s  0m 58.975s  -1.2s
-M 1 -I 0 -R 2: 1m 34.057s  1m 17.090s  -0.5s
-M 1 -I 1 -R 0: 2m  5.928s  1m 49.700s  -1.3s
-M 1 -I 1 -R 1: 1m 59.019s  1m 41.955s  -0.6s
-M 1 -I 1 -R 2: 2m 49.149s  2m 31.440s  +19.2s

-M 2 -I 0 -R 0: 1m  0.503s  0m 25.930s  -5.5s
-M 2 -I 0 -R 1: 0m 53.418s  0m 58.950s  -23s
-M 2 -I 0 -R 2: 1m  7.418s  1m 18.145s  -27s
-M 2 -I 1 -R 0: 1m 54.534s  1m 50.060s  -13s
-M 2 -I 1 -R 1: 1m 15.489s  0m  1.040s -- uhm...?
-M 2 -I 1 -R 2: 1m 54.720s  1m 16.720s  -36s

-M 3 -I 0 -R 0: 0m 57.533s  0m 50.610s  -8.5s
-M 3 -I 0 -R 1: 0m 51.541s  0m 40.265s  -25s
-M 3 -I 0 -R 2: 1m  5.996s  0m 54.325s  -29s
-M 3 -I 1 -R 0: 1m 50.570s  1m 49.715s  -17s
-M 3 -I 1 -R 1: 1m 14.462s  1m  8.530s  -45s
-M 3 -I 1 -R 2: 1m 36.192   0m 52.145s  -54s

Interestingly, and I think this has to do with the I/O buffering, -M 0
is slower than -M 1 by a small fraction in all tests. And as Steven
Schultz had suggested, -I 1 is a bad bad idea. It never improved
performance, and made it in fact quite a bit worse (the man page is
right :). (Of course, -M 1 will be at least two processes, and since I
have a real dual system, it makes sense, and may not hold true for a
single CPU)

Also, encoding with one B frame is a touch faster in -I 1 mode than
encoding without them, but it is slower when you encode two B frames
instead of just one. I find this interesting.. I would have expected a
single B frame to take a bit longer than none at all, and that is the
case when -I 0 is on, but not when it's -I 1. Any ideas on that one?

In the end -M 3 is not appreciably faster at -I 0 -R 0, but flies along at
-I 0 -R 2 compared to baseline, and gets fair gains at -I 0 -R 1, while
dropping encoding time by another 14 seconds for the same frameset. So,
does this boil down to the fastest being -M 3 -I 0 -R 1?

The numbers on -M 3 -I 1 -R 2 show a 54 second improvement over the
tests with -M 0, but it takes almost 50% longer than -M 3 -I 0 -R 1. The
file size of 3-1-2 is 13,807,067 and the file size of 3-0-1 is
13,402,673. The file is smaller, and is encoded faster, and viewing them
now, the quality is at least on par (3-0-1 looked a tad better).

> - There is also a parallel read-ahead thread but this rarely soaks much CPU on 
> modern CPUs.
> 
> The MPEG_DEVEL branch encoder stripes all encoding phases to allow much more 
> scalable parallelisation.  You might want to give it a go - I'd be interested 
> in the results!

I'd love to, but I couldn't find it in CVS. I found everything else in
the SF CVS branch, but not mjpegtools itself.

> N.b. in a 'realistic' scenario you're running the multiplexer and audio 
> encoding in parallel with the encoder and video filters communicating via 
> pipes and named FIFO's.   This setup usually saturate a modern dual machine 

No multiplexing and no audio encoding (AC3 pass through and multiplexing
of DVD streams is done after completion of the video encoding). There is
the overhead of decoding the original MPEG2 stream into YUV, but that's
about all else that transcode (which I'm using) is dumping into the
pipe. I avoided any of that on this run by just dumping the file in an
already decoded format (pgmtoy4m output).

> cheers,
> 
>   Andrew
> PS
> I'm away on vacation for a couple of weeks from friday so there'll be a bit of 
> pause in answering emails / posts from then ;-)





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Bernhard Praschinger
Hello

> On Tue, 2003-12-16 at 12:57, Bernhard Praschinger wrote:
> > Could you run a few tests (please).  Get some frames (100-1000) in yuv
> > format. I guess that should be possible even with transcode. ;)
> > (I do not use transcode so I can't help, or get the test streams on
> > mjpeg.sf.net)
> 
> With about 1010 frames of YUV using < to dump it in (instead of cat), I
> get these:
> 
> -M 0: 2m 11.9s
> -M 1: 2m 10.6s, -1.3s
> -M 2: 1m 27.7s, -44.2s
> -M 3: 1m 26.5s, -45.4s
Those values look much better.  :-)
Now you have seen that mpeg2enc can go faster.

I have tried the command you used on my machine, and I have seen the 
same "problem". Also 3 processes and each only 33% .

(time lav2yuv n1000.eli | mpeg2enc -I 0 -f 8 -b 9800 -p -a 3 -o test.m2v
-S  -M 3 -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0)


> Note that I responded in an earlier message with a total of 24 timings
> across -M 0-3 -I 0-1 -R 0-2 settings, which turned up some interesting
> results that -M 3 -I 0 -R 1 worked fastest of all of them (same source
> material I used for the above, and it took 51 seconds). So, I think the
> -I 1 is on, which makes a huge boost in -M ratings from 0 to 3, but it
> is still quite a bit slower than -I 0 (which I use since the input is
> Progressive 23.976fps)
That's strange.

> > And do afterwards something like that:
> > cat stream.yuv | mpeg2enc -f8 -M 0-3 -o test.m2v
> > or
> > lav2yuv stream.avi | mpeg2enc -f 8 -M 0-3 -o test.m2v
> >
> > So you can be sure that nothing else makes any trouble. And check
> > how it is going. That should not take too long. Then you can add
> > the options you used, to see if anything there causes the problem of the
> > framerate not increasing.
> Compared to the run with my long options line, these are 
I'm just running some encodings to see which option causes the problem. 

On my machine the -R 0 caused the problem. If I used -R 1/2 or no -R
option, I got 3 processes each using about 45-50%. 

> > Bad. WHich board do you have ? (Mine is a Tyan Tiger MPX)
> Nice board, that one. Asus A7M-266D.. I should've grabbed the MSI K7D
> Master for the same price, I hear much nicer things about it.


> > My brain had given up the time I started my computer that evening ;)
> Mine usually does that at about 8am. :>
Just as you enter work ? ;)

> > But I'm not really knowing why the situation is that bad.
> 
> I'm just not seeing the dual CPU usage that would warrant even running
> in multiple threads, when I could instead transcode two entirely
> separate items as though I had two machines, which makes some sense (I
> did that the other day, worked rather well). But, if I can make a single
> copy work by flooding both CPUs with activity, then I'll be happier,
> since it should take quite a bit less time to encode a full movie.
Encoding without the -R 0 seems to solve the problem, by now.


Hopefully see you soon,

Berni the Chaos of Woodquarter

Email: [EMAIL PROTECTED]
www: http://www.lysator.liu.se/~gz/bernhard




Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Richard Ellis
On Tue, Dec 16, 2003 at 12:45:48PM -0800, Trent Piepho wrote:
> On Tue, 16 Dec 2003, Richard Ellis wrote:
> > > 6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1.  At 550MB/s,
> > > it SHOULD be able to push enough to keep the frames encoding at
> > > 100% CPU, in theory.
> > 
> > Yes, but just one 720x480 DVD quality frame is larger than 256k
> > in size, so a 256k cache per CPU isn't helping too much overall
> > considering how many frames there are in a typical video to be
> 
> A 720x480 4:2:0 frame is about 512KB, at 550MB/sec there is enough
> memory bandwidth to encode at about 1000 frames/sec if all you had
> to do was read the data.  Obviously the encoder runs somewhat
> slower than that, so each byte of data must be accessed multiple
> times.  That's where the cache helps.

With motion estimation each byte would end up being accessed more
than once for each new "radius" that was examined.  Plus motion
estimation is between at least two frames, so we are dealing with at
least about 1M of data to be accessed eventually in the course of
encoding one frame.

> > Of course, Andrew would be much better suited to discuss
> > mpeg2enc's memory access patterns during encoding, which
> > depending on how it does go about accessing memory can better
> > make use of the 256k of cache, or cause the 256k of cache to be
> > constantly thrashed in and out.
> 
> I seem to recall that one of the biggest performance bottlenecks of
> mpeg2enc is the way it accesses memory.  It runs each step of the
> encoding process on an entire frame at a time.  It's much more
> cache friendly to run every stage of the encoding process on a single
> macroblock before moving on to the next macroblock.

In that case it will kill the majority of the performance benefit
provided by the caches, because there's very little locality of
reference for the cache to compensate for.  It moves through at least
512k for pass one, then through the same 512k again for pass two, but
the data in the cache is from the end of the frame, and we are
starting over at the beginning of the frame.  Massive cache thrash in
that case.  Memory bandwidth becomes a much more limiting factor.





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Andrew Stevens
Hi Steven,  Trent,

> But what about bit allocation?  You need to know how big the last GOP was
> to figure out how many bits you can use for the next GOP.

Actually, this is not such a big deal provided the GOPs are well separated.  
Simplifying a little, you just need to ensure that you have at least as much 
decoder buffer fullness at the end of each 'chunk' as you assumed when 
starting to encode its successor.

However, this idea came to mind more as a sneaky way of doing accurately sized 
single-pass encoding: work on multiple 'segments' spread across the video 
sequence so you get a good statistical sample of how your total 
bit-consumption is going relative to your target.  This is rotten for 
parallelism though, because you have two more or less totally uncorrelated 
memory footprints.  For DVD, 'segments' would kind of naturally correlate with 
'chapters' at the authoring level.

In the MPEG_DEVEL branch encoding of each frame (apart from the bit-packed 
coding and bit allocation which is only a small fraction of the CPU load) is 
simply striped across the available CPUs.  This has a nice side effect of 
reducing each CPUs working set too as it only deals with a fraction of a 
frame.

Having said all that I'll probably simply do a simple two-pass encoding mode 
first (much simpler frame feeding!).


> > Of course, Andrew would be much better suited to discuss mpeg2enc's
> > memory access patterns during encoding, which depending on how it
> > does go about accessing memory can better make use of the 256k of
> > cache, or cause the 256k of cache to be constantly thrashed in and
> > out.
>
> I seem to recall that one of the biggest performance bottlenecks of
> mpeg2enc is the way it accesses memory.  It runs each step of the encoding
> process on an entire frame at a time.  It's much more cache friendly to run
> every stage of the encoding process on a single macroblock before moving on
> to the next macroblock.

The single-macroblock approach has been implemented for quite some time now 
(since the move to C++ roughly).  In rather basic English speed improved 
by... bugger all.  I was *most* surprised, it could well be that the story is 
rather different on multi-CPU machines.  At least I like to hope the work 
wasn't wasted ;-)

Actually, the memory footprint of encoding is much larger than you'd think.  
Remember each 16x16 int16_t difference macroblock gets generated from nastily 
unaligned 16x16 or 16x8 uint8_t predictors and a 16x16 uint8_t picture 
macroblock.  The difference is then DCT-ed in place into 4 8x8 int16_t DCT 
blocks which are then quantised into 4 8x8 int16_t quantised DCT blocks.
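
Back-of-the-envelope, the per-macroblock working set described above comes to
something like this (an illustration only, not mpeg2enc's actual data
structures):

    #include <stdint.h>

    struct mb_workset {
        uint8_t picture[16][16];    /*  256 bytes: source macroblock                 */
        uint8_t predictor[16][16];  /*  256 bytes: (unaligned) motion-comp predictor */
        int16_t diff[16][16];       /*  512 bytes: prediction error, DCT-ed in place */
        int16_t qdct[4][8][8];      /*  512 bytes: quantised DCT blocks              */
    };                              /* ~1.5 KB per macroblock, before search windows */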

Where mpeg2enc could speed up is:

- DCT blocks are in 'correct' and not transposed form.  This is simply a waste, 
as by transposing the quantiser matrices and the scan sequence you can 
skip this step.

- Each quantised DCT block is separately stored.  Nice for debugging, poor for 
memory performance ;-)

- DCT is not combined with quantisation when this is possible.

- Motion estimation (probably wastefully) computes a lot of variances that 
could probably better be replaced by SAD for fast encoding modes.

- The current GOP sizing approach is wasteful.   Frame type should only be 
decided once the best encoding mode (Intra, various inter motion prediction 
modes) is known.  Basically, you turn a B/P frame into an I frame if you've 
reached your GOP length limit or it has enough Intra coded blocks that it is 
more compact that way.   Unfortunately, the current allocation algorithm 
still has a few 'left over' elements that need to know the GOP size in advance, 
and these need to be replaced before this can be fixed.   I'm currently working on 
bit-allocation (basically, a two-pass / look-ahead mode plus the above 
improvement).

A similar approach can be used for deciding B/P frame selection, but this is 
expensive in CPU as you basically have to encode each potential B frame's 
reference frame twice.  I'm playing around with ideas for trying B frames out 
and, if they don't seem worthwhile, turning them off and then periodically 
checking if it might make sense to turn them on again.


Andrew





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Andrew Stevens


> Produces this (approximately 1010 frames), encoding times (real time /
> user time, gives a bit of a view as to how busy the CPUs were during the
> real time, optimal should be 1m realtime, 2m user time, right? and
> average system time was 3.0s, with +/- 0.2s for all tests):
...

Yep.   You should (in theory) get a lot closer to that with the current 
MPEG_DEVEL branch mpeg2enc.   However, your scaling is really remarkably bad 
as even the -R 2 values where two CPUs should be fairly busy are unusually 
bad.  I've never heard of worse than 70% utilisation on dual CPU machines.

Here's a fairly typical snapshot of mpeg2enc -M 2 -I 1 -R 2 in action on my 
dual P-III machine...

  PID USER PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
12620 as18   0 46464  45M   768 R80.9 24.3   0:18 lt-mpeg2enc
12621 as18   0 46464  45M   768 R70.8 24.3   0:18 lt-mpeg2enc
12619 as 9   0 46464  45M   768 S 3.9 24.3   0:01 lt-mpeg2enc

You're getting very very symmetrical CPU loads and very very poor utilisation.  
What kernel are you using... I vaguely recall the 2.6.x series radically changed 
the threading libs.  It could be something pathological is happening in the 
scheduling.  

The  2100+ is of course  a lot faster than the P-III but: I doubt the balance 
between the motion estimation and the rest of the code is hugely shifted.  
Certainly, the approximate proportions of time spent in each are quite similar 
on my 2100+ single-CPU machine and a P-III.


> Also, encoding with one B frame is a touch faster in -I 1 mode than
> encoding without them, but it is slower when you encode two B frames
> instead of just one. I find this interesting.. I would have expected a
> single B frame to take a bit longer than none at all, and that is the
> case when -I 0 is on, but not when it's -I 1. Any ideas on that one?

Not really. However: I would expect going to two B frames to greatly increase 
your CPU utilisation without much wall-clock time increase, due to the increased 
scope for parallel computation.

> In the end -M 3 is not reasonably faster in -I 0 -R 0, but flys along at
> -I 0 -R 2 compared to baseline, and gets fair gains at -I 0 -R 1, while
> dropping encoding time by another 14 seconds for the same frameset.
This is what you'd expect: -R 2 offers much more scope for the 3 worker 
threads of -M 3 to do something useful.

> The numbers on -M 3 -I 1 -R 2 show a 54 second improvement over the
> tests with -M 0, but it takes almost 50% longer than -M 3 -I 0 -R 1. The
> file size of 3-1-2 is 13,807,067 and the file size of 3-0-1 is
> 13,402,673. The file is smaller, and is encoded faster, and viewing them
> now, the quality is at least on par (3-0-1 looked a tad better).

The usefulness of B frames depends a *lot* on the type of material.  For 
captured stuff they rarely buy you much apart from free room heating from 
your CPU. Hence the provision of -R 0 ;-).  They should get a little more 
useful when I add dynamic frame type selection to mpeg2enc in the new year.


> > - There is also a parallel read-ahead thread but this rarely soaks much
> > CPU on modern CPUs.

Weirdly enough, on your machine the reader thread is exceedingly busy.

> > The MPEG_DEVEL branch encoder stripes all encoding phases to allow much
> > more scalable parallelisation.  You might want to give it a go - I'd be
> > interested in the results!
>
> I'd love to, but I couldn't find it in CVS. I found everything else in
> the SF CVS branch, but not mjpegtools itself.

cvs co -d :ext:[EMAIL PROTECTED]:/cvsroot/mjpeg mjpeg_play
cd mjpeg_play
cvs update -r MPEG_DEVEL mpeg2enc

The 'mjpeg_play' is a bit of a historical oddity but it is monumentally 
painful to change directory names in CVS...

Andrew






Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Steven M. Schultz

On Tue, 16 Dec 2003, Trent Piepho wrote:

> But what about bit allocation?  You need to know how big the last GOP was to
> figure out how many bits you can use for the next GOP.

Well, you know the maximum bitrate allowed (via the -b option) - you could
encode each GOP with that limit in mind.   I'm not sure how bits
"carry over" from GOP to GOP.

Nice self-contained chunks of data should parallelize nicely - perhaps 
not that hard to extend to a "cluster".   That'd be fast, I'd think.

Cheers,
Steven Schultz





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Trent Piepho
On Tue, 16 Dec 2003, Steven M. Schultz wrote:
> > First off a bit of background to the multi-threading in the current stable 
> > branch.  First off:
> > 
> > - Parallelism is primarily frame-by-frame.  This means that the final phases 
> > of the encoding lock on completion of the reference frame (prediction and DCT 
> 
>   If one were using closed and fixed length GOPs would it make
>   sense to parallelize the encoding of complete GOPs?   Each cpu
>   could be dispatched a set of N frames that comprise a closed GOP and
>   a master thread could write the GOPs out in the correct order.

But what about bit allocation?  You need to know how big the last GOP was to
figure out how many bits you can use for the next GOP.






Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Trent Piepho
On Tue, 16 Dec 2003, Richard Ellis wrote:
> > 6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1.  At 550MB/s, it
> > SHOULD be able to push enough to keep the frames encoding at 100%
> > CPU, in theory.
> 
> Yes, but just one 720x480 DVD quality frame is larger than 256k in
> size, so a 256k cache per CPU isn't helping too much overall
> considering how many frames there are in a typical video to be

A 720x480 4:2:0 frame is about 512KB, at 550MB/sec there is enough memory
bandwidth to encode at about 1000 frames/sec if all you had to do was read the
data.  Obviously the encoder runs somewhat slower than that, so each byte of
data must be accessed multiple times.  That's where the cache helps.

> Of course, Andrew would be much better suited to discuss mpeg2enc's
> memory access patterns during encoding, which depending on how it
> does go about accessing memory can better make use of the 256k of
> cache, or cause the 256k of cache to be constantly thrashed in and
> out.

I seem to recall that one of the biggest performance bottlenecks of mpeg2enc
is the way it accesses memory.  It runs each step of the encoding process
on an entire frame at a time.  It's much more cache friendly to run every stage
of the encoding process on a single macroblock before moving on to the next
macroblock.
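
Roughly, the difference is between these two loop structures (an illustrative
C++ sketch only - the stage names are placeholders, not the real mpeg2enc
functions):

// Frame-at-a-time streams the whole ~500KB frame through a 256KB cache at
// every stage, so later stages always miss; macroblock-at-a-time finishes
// each 16x16 block while it is still cache-hot.
struct MacroBlock { unsigned char pels[16 * 16 * 3 / 2]; };  // one 4:2:0 block

void motion_estimate(MacroBlock&) { /* search against the reference frame */ }
void dct_quantise(MacroBlock&)    { /* forward DCT + quantisation         */ }
void entropy_code(MacroBlock&)    { /* VLC coding of the coefficients     */ }

// Frame-at-a-time: three full passes over the frame (cache-unfriendly).
void encode_frame_by_stage(MacroBlock* mb, int n) {
    for (int i = 0; i < n; ++i) motion_estimate(mb[i]);
    for (int i = 0; i < n; ++i) dct_quantise(mb[i]);
    for (int i = 0; i < n; ++i) entropy_code(mb[i]);
}

// Macroblock-at-a-time: one pass, every stage applied while the block is hot.
void encode_frame_by_block(MacroBlock* mb, int n) {
    for (int i = 0; i < n; ++i) {
        motion_estimate(mb[i]);
        dct_quantise(mb[i]);
        entropy_code(mb[i]);
    }
}

int main() {
    static MacroBlock frame[(720 / 16) * (480 / 16)];   // 1350 macroblocks
    encode_frame_by_stage(frame, 1350);
    encode_frame_by_block(frame, 1350);
}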






Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Steven M. Schultz

On Tue, 16 Dec 2003, Andrew Stevens wrote:

> Hi all,
> 
> First off a bit of background to the multi-threading in the current stable 
> branch.  First off:
> 
> - Parallelism is primarily frame-by-frame.  This means that the final phases 
> of the encoding lock on completion of the reference frame (prediction and DCT 

If one were using closed and fixed length GOPs would it make
sense to parallelize the encoding of complete GOPs?   Each cpu
could be dispatched a set of N frames that comprise a closed GOP and
a master thread could write the GOPs out in the correct order.

But as Andrew mentioned - by the time filters and other processing
are added, a dual cpu system's pretty well saturated.   Quad cpu
systems are very much a niche (and expensive) item (not to mention
the noise they make ;))

Cheers,
Steven Schultz





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Richard Ellis
On Tue, Dec 16, 2003 at 12:33:52AM -0700, Slepp Lukwai wrote:
> On Mon, 2003-12-15 at 21:08, Richard Ellis wrote:
> > Additionally, what kind of memory do you have attached to the cpu's? 
> > Mpeg encoding is very memory bandwidth hungry to begin with, and with
> > two cpu's trying to eat at the same trough, a not quite as fast as it
> > should be memory subsystem can produce results like what you are
> > seeing. ...

> ... It's a dual Athlon, which inherently means 266FSB (DDR 266),
> though the memory is actually Hynix PC3200 w/ timings set as low as
> they go on this board (2-2-2), which gives me about 550MB/s memory
> bandwidth according to memtest, with a 13GB/s L1 and something like
> 6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1.  At 550MB/s, it
> SHOULD be able to push enough to keep the frames encoding at 100%
> CPU, in theory.

Yes, but just one 720x480 DVD quality frame is larger than 256k in
size, so a 256k cache per CPU isn't helping too much overall
considering how many frames there are in a typical video to be
encoded.  Plus, my experience with Athlons is that they are actually
faster at mpeg2enc encoding than Intel chips of equivalent speed
ratings (the Athlon's 3dnow/mmx implementation is faster) and so they
put a heavier stress on one's memory bandwidth than an equivalent
speed Intel chip would.  It's possible that 275MB/s per CPU just
isn't fast enough to keep up with the rate that mpeg2enc can consume
data on an Athlon.

Of course, Andrew would be much better suited to discuss mpeg2enc's
memory access patterns during encoding, which depending on how it
does go about accessing memory can better make use of the 256k of
cache, or cause the 256k of cache to be constantly thrashed in and
out.

> > FWIW, when my desktop machine was a dual PII-400Mhz box, I almost
> > always had two mpeg2enc threads eating up 97-98%cpu on both PII
> > chips.  The few times both cpu's were not fully saturated at mpeg
> > encoding was when I'd bother them with something silly like browsing
> > the web with mozilla. :)
> 
> Now that's just silly. Why would you hurt the CPUs by running such bloat
> as Mozilla? I can't think of how many times Mozilla has gone nuts on me
> and used 100% CPU without reason, and you can't kill it any normal UI
> way.. Good ol' killall. However, I love it. It's a great browser. Just
> rather hungry at times. I suppose there's a reason the logo is a
> dinosaur. :>

Hmm... Interesting.  I've had it sometimes just stop but never go
nuts with 100% CPU, and although I usually do CLI kill it if need be,
FVWM2's "destroy" window command has never failed to get rid of it if
I don't bother to go CLI to do so.  In fact, FVWM2's "destroy" has
never failed to get rid of anything that went wonky.  It's the X
windows equivalent to a "kill -9" from the CLI.





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Richard Ellis
On Tue, Dec 16, 2003 at 09:27:53AM -0800, Steven M. Schultz wrote:
> 
> Perhaps Richard Ellis could chime in with his experiences with -Q
> ;)

It seems that with the right set of options, and the right set of
input data, -Q can help to create some really nasty looking
artifacts.  

> > And again, son of I didn't realize the parallelization was
> > done based on interlacing settings.
>   
> Looking back on it that makes sense though.   A P frame depends on
> the preceding P frame - rather sequential in nature since you
> can't move on to the next one without completing the first one...

The P frame dependency chain is how the artifacts come about based on
Andrew's explanation.  It's accumulated round off error in the iDCT
routines.  Made worse by -Q as well as -R 0 and a few other options.





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Bernhard Praschinger
Hallo

> Top output of the 3 running mpeg2enc with mjpegtools 1.6.1.92 on the
> Dual Athlon MP 2100+. That's with -M3. Top usage is 2% and the decoder
> is only about 10% intermittent. So, I'm neglecting those for the moment.
> I'm using transcode, by the way (though I found the same results when
> not using transcode and doing a straight pipe from decoded MPEG2
> frames). Note the top dumps below ignore the memory usage (which has
> approximately 640MB of free RAM (really free, not cache or anything,
> it's a clean boot, 127 processes running in all cases)).
> 
>  Cpu0 :  50.0% user,   8.6% system,   0.0% nice,  41.4% idle
>  Cpu1 :  53.4% user,   4.3% system,   0.0% nice,  42.2% idle
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
> 11234 slepp 16   0 43436  42m  968 S 38.2  4.2   0:16.96 mpeg2enc
> 12422 slepp 16   0 43436  42m  968 S 34.5  4.2   0:16.86 mpeg2enc
>   623 slepp 16   0 43436  42m  968 R 33.6  4.2   0:17.14 mpeg2enc
> 
> Command line:
> time /usr/bin/transcode -u 120,2 -M 0 -V -q 1 -f 24,1 --color 1 -x
> mpeg2,null -y mpeg2enc,null -e 48000,16 -A -N 0x2000 -F 8,'-S  -M 3
> -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0' --pulldown -w 9800
> -i 28DaysLater.m2v -o test3 --print_status 50 -c 0-1000
Could you run a few tests (please)?  Get some frames (100-1000) in yuv
format.  I guess that should be possible even with transcode. ;)
(I do not use transcode so I can't help; or get the test streams from
mjpeg.sf.net)

And do afterwards something like that:
cat stream.yuv | mpeg2enc -f8 -M 0-3 -o test.m2v 
or 
lav2yuv stream.avi | mpeg2enc -f 8 -M 0-3 -o test.m2v

So you can be sure that nothing else causes any trouble, and check
how it is going.  That should not take too long.  Then you can add
the options you used, to see if anything there causes the problem of the
non-increasing framerate. 

> > I use the 2.6.0-test8 kernel. Maybe that changes the situation.
> I used to be using 2.5.63 or similar, but have rebuilt the machine with
> 2.4.20 with scheduling optimizations and other goodies (gentoo). I
> noticed a number of speed ups in most other parallel processes
> (cinelerra, MPI povray, gcc). Of course, most of the patches in the
> gentoo 2.4.20 kernel are stock in 2.5+ (I also used 2.6.0-test8, but
> this Asus board doesn't behave under that kernel, and it crashed
> whenever i'd load the CPUs or IDE buses :<)
Bad.  Which board do you have? (Mine is a Tyan Tiger MPX) 

> > Sorry if the mail is a bit confusing,
[...]
> Hopefully this one didn't ramble on TOO long.
My brain had given up by the time I started my computer that evening ;)

But I don't really know why the situation is that bad.


auf hoffentlich bald,

Berni the Chaos of Woodquarter

Email: [EMAIL PROTECTED]
www: http://www.lysator.liu.se/~gz/bernhard




Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Andrew Stevens
Hi all,

First off a bit of background to the multi-threading in the current stable 
branch.  First off:

- Parallelism is primarily frame-by-frame.  This means that the final phases 
of the encoding lock on completion of the reference frame (prediction and DCT 
transform) and the predecessor (bit allocation).   If you have a really fast 
CPU that motion estimates and DCT's very fast you will get lower 
parallelisation.  If you use -R 0 you will get very little parallelism *at 
all*.   Certainly not enough to make -M 3 sensible.  (A rough sketch of this 
dependency structure follows after these notes.)

- There is also a parallel read-ahead thread but this rarely soaks much CPU on 
modern CPUs.
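
(A very rough reading of that dependency structure, in C++ pseudocode - my
interpretation of the description above, not the mpeg2enc source:)

// A frame's final coding phase waits for (a) prediction/DCT of its reference
// frame and (b) bit allocation of its immediate predecessor.  With -R 0 every
// frame is I or P, so the chain is strictly sequential and there is very
// little left to overlap across threads.
#include <condition_variable>
#include <mutex>
#include <vector>

struct FrameState {
    bool prediction_done = false;   // motion comp + DCT finished
    bool bits_allocated  = false;   // rate control finished
};

std::mutex m;
std::condition_variable cv;
std::vector<FrameState> frames(16);

void wait_until_codable(int reference, int predecessor) {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&] {
        return frames[reference].prediction_done &&
               frames[predecessor].bits_allocated;
    });
}

void mark_prediction_done(int f) {
    { std::lock_guard<std::mutex> lk(m); frames[f].prediction_done = true; }
    cv.notify_all();
}

void mark_bits_allocated(int f) {
    { std::lock_guard<std::mutex> lk(m); frames[f].bits_allocated = true; }
    cv.notify_all();
}

int main() {
    // Frame 1 (a P frame) can only be coded once frame 0 is fully dealt with.
    mark_prediction_done(0);
    mark_bits_allocated(0);
    wait_until_codable(/*reference=*/0, /*predecessor=*/0);   // returns at once
}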

The MPEG_DEVEL branch encoder stripes all encoding phases to allow much more 
scalable parallelisation.  You might want to give it a go - I'd be interested 
in the results!

N.b. in a 'realistic' scenario you're running the multiplexer and audio 
encoding in parallel with the encoder and video filters communicating via 
pipes and named FIFO's.   This setup usually saturates a modern dual machine.

cheers,

Andrew
PS
I'm away on vacation for a couple of weeks from Friday so there'll be a bit of 
a pause in answering emails / posts from then ;-)







Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Steven M. Schultz

On Tue, 16 Dec 2003, Slepp Lukwai wrote:

> Tried it without any options, same effect. I'm definitely seeing nowhere
> near 40% speedup, which is what boggles me. I expected at least
> reasonable gains of 25%.

I think that has to do with the -I setting...

> Sorry, upon further testing, I actually average around 14fps at DVD
> quality (720x480, 9800kbit/s). (see all the details of my command lines

Ah, that's more like it then.   

> It's interesting that I'm faster with dual 2100s than the dual 2800 (or
> at least on par). I suppose it really comes down to command line
> options, but you would need to compare those yourself (since I haven't

Friend of mine has dual 2400s and my setup is ~10-15% faster as I
recall - he's getting around 11fps as a rule where I see 14 or so.

I'm usually adding a bit of overhead with the chroma conversion.  I
build smilutils with ffmpeg/libavcodec (to use ffmpeg's DV codec)
and then run the data thru something like:
"smil2yuv -i 2 file.dv | filters | y4mscaler -O chromass=420_MPEG2 |..."

Produces better output than the default (which uses libdv) but does
cost a bit in cpu use.

> According to the docs -I 1 turns on interlacing support, and causes
> un-needed overhead if it is known progressive material. Hence the -I 0
> (plus transcode sets that, though I could override it).

But unless you have the raw 23.976fps progressive data (with the 3:2
pulldown undone) then I think '-I 1' is the option to use.   But then 
I might be confused (wouldn't be the first time ;)).

That would explain why the encoding rate I see is lower since I'm
using -I 1.

> > wrong I'm sure someone will tactfully point that out ;)) the speedup
> > comes from the motion estimation of the 2 fields/frame being done in
> > parallel.
> 
> Oh. Son of a... If that's all it is...

Yep - I'm fairly sure that is why you're not seeing any improvement
when using "-M 2".

> > without B frames.   Those are computationally a lot more expensive
> > than I or P frames.   "-R 0" will disable B frames.
> 
> I just enabled that, and that's how I'm hitting 15fps instead of 8, and
> the quality is good and the size is just fine.

Great!   It takes, from what I've seen, extraordinarily clean sources
before -R 0 has little or no effect.

> to their potentials and give me the equivalent of a 4200+ ;> If it takes
> 6 hours to transcode a movie because I set -r32 (I noticed a larger
> difference with -4 -2 options, btw, than -r16 vs -r32), that's fine, but

Yep - "-4 1" will close to double the time over "-4 2" and the 
difference in bitrate/filesize is measured in tenths of a percent. 
Hardly worth it.   Not all that much difference between "-4 2" and
"-4 3" though.

> > better results (especially with clean source material) can be obtained
> > with "-E -8" or perhaps "-E -10".
> 
> Until I upgraded to .92, I didn't have those options. I'm using them

On noisy source material the -E option has almost no effect  but the
cleaner the input the more effect even modest values of -E have.

> now, in combination with -Q, but I find the artifacts are almost never
> there (I used to do -q 4 and -Q 4.0, and it looked about the same as the
> 5/3.0).

Perhaps Richard Ellis could chime in with his experiences with -Q ;)

> > Right, with -I 0 the cpus take turns but there's little parallelism.
> 
> And again, son of I didn't realize the parallelization was done
> based on interlacing settings.

Looking back on it that makes sense though.   A P frame depends on the
preceding P frame - rather sequential in nature since you can't
move on to the next one without completing the first one...

> The MPEG decoding doesn't take much, and the pipe overhead is negligble,

Pipe overhead sneaks up on you though.   One pipe?  Not a real problem,
two?  Begins to be noticed but isn't too bad.   Four or five?   Yeah,
it starts to take a hit on the overall speed of the system - the data
has to go up/down thru the kernel all those times and that's not "free".

> (As I write this, I'm still waiting for the -M 2 run to finish, so it'll
> arrive before the tests results to Bernhard make it out).

You might try, for timing purposes, without -I 0 and see what, if any
effect that has.   Might be a useful data point.

Cheers,
Steven Schultz




Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Slepp Lukwai
On Mon, 2003-12-15 at 22:44, Bernhard Praschinger wrote:
> Hallo
> 
> > I was doing some testing of both the older version (1.6.1.90) and the
> > newer version of mpeg2enc (1.6.1.92). First off, the .92 was somewhat
> > faster to begin with. However, in both cases, after multiple tests and
> > trying different things, I can't get the SMP modes to be fast at all. In
> > fact, they're slower than the non-SMP modes.
> With slower, I hope you mean "mpeg2enc needs more time to encode the
> movie". 
> And not the time the encoding needs in "realtime". 

Slower as in wallclock slower.  It took less time to re-encode the entire
thing with -M 0 than when I used -M 3.  (I didn't let it run through 2,
since it takes over 4 hours as is).  (OK, after all these tests, the dual
stuff is running faster, but not fast enough over a full movie to even
warrant the extra threads).

Top output of the 3 running mpeg2enc with mjpegtools 1.6.1.92 on the
Dual Athlon MP 2100+. That's with -M3. Top usage is 2% and the decoder
is only about 10% intermittent. So, I'm neglecting those for the moment.
I'm using transcode, by the way (though I found the same results when
not using transcode and doing a straight pipe from decoded MPEG2
frames). Note the top dumps below ignore the memory usage (which has
approximately 640MB of free RAM (really free, not cache or anything,
it's a clean boot, 127 processes running in all cases)).

 Cpu0 :  50.0% user,   8.6% system,   0.0% nice,  41.4% idle
 Cpu1 :  53.4% user,   4.3% system,   0.0% nice,  42.2% idle
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
11234 slepp 16   0 43436  42m  968 S 38.2  4.2   0:16.96 mpeg2enc
12422 slepp 16   0 43436  42m  968 S 34.5  4.2   0:16.86 mpeg2enc
  623 slepp 16   0 43436  42m  968 R 33.6  4.2   0:17.14 mpeg2enc

Command line:
time /usr/bin/transcode -u 120,2 -M 0 -V -q 1 -f 24,1 --color 1 -x
mpeg2,null -y mpeg2enc,null -e 48000,16 -A -N 0x2000 -F 8,'-S  -M 3
-g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0' --pulldown -w 9800
-i 28DaysLater.m2v -o test3 --print_status 50 -c 0-1000

Results:[import_mpeg2.so] tcextract -x mpeg2 -i "28DaysLater.m2v" -d 1 |
tcdecode -x mpeg2 -d 1 -y yv12
[export_mpeg2enc.so] *** init-v *** !
[export_mpeg2enc.so] cmd=mpeg2enc -v 0 -I 0 -f 8 -b 9800 -F 1 -n n -p -a
3 -o "test3".m2v -S  -M 3 -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K
kvcd -R 0
++ WARN: [mpeg2enc] 3:2 movie pulldown with frame rate set to decode
rate not display rate
++ WARN: [mpeg2enc] 3:2 Setting frame rate code to display rate = 4
(29.970 fps)
encoding frame [950],  14.93 fps, 95.2%, ETA: 0:00:03, ( 0| 0|116)
clean up | frame threads | unload modules | cancel signal | internal
threads | done
[transcode] encoded 999 frames (0 dropped, 0 cloned), clip length 41.67s

73.56user 7.76system 1:09.29elapsed 117%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (2055major+31007minor)pagefaults 0swaps

(I can't find how to turn off line wrap. Sorry...)

Note I used 120 incoming frame buffers with 2 threads decoding the
video. The buffer usage of transcode never dropped below 90 frames
buffered, so the buffering was keeping pace.

Here's the identical command, the only thing changed is -M 3 to -M 2
(this time I included a snapshot of tcdecode, but note that it isn't
always in the top 3 of the list, it comes and goes quite frequently, and
the transcode buffers stay right around 110 to 116 frames):

 Cpu0 :  61.8% user,   7.3% system,   0.0% nice,  30.9% idle
 Cpu1 :  50.5% user,  12.8% system,   0.0% nice,  36.7% idle
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
20631 slepp 19   0 39824  38m  984 R 51.1  3.9   0:03.79 mpeg2enc
14434 slepp 17   0 39824  38m  984 R 45.7  3.9   0:03.94 mpeg2enc
29969 slepp 16   0  2644 2644  668 S 13.7  0.3   0:01.95 tcdecode

And the output of time (and the end of transcode):
encoding frame [950],  14.33 fps, 95.2%, ETA: 0:00:03, ( 0| 0|116)
clean up | frame threads | unload modules | cancel signal | internal
threads | done
[transcode] encoded 999 frames (0 dropped, 0 cloned), clip length 41.67s

74.89user 7.68system 1:11.95elapsed 114%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (1979major+26920minor)pagefaults 0swaps


And with -M 1 instead of -M 2:

 Cpu0 :  87.0% user,  13.0% system,   0.0% nice,   0.0% idle
 Cpu1 :  22.2% user,   5.6% system,   0.0% nice,  72.2% idle
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
31916 slepp 25   0 36192  35m  984 R 90.3  3.5   0:07.58 mpeg2enc
 3690 slepp 16   0  2644 2644  668 S 14.7  0.3   0:01.91 tcdecode

Note that it's now using an entire CPU (other processes keep sharing,
but it's still using a full CPU).

And the transcode/time results:

encoding frame [950],  14.19 fps, 95.2%, ETA: 0:00:03, ( 0| 0|117)
clean up | frame threads | unload modules | cancel signal | internal
threads | done
[transcode] encoded 999 frames (0 dropped, 0 cloned), clip length 41.67s

73.98user 7.51system 1:12

Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Slepp Lukwai
On Mon, 2003-12-15 at 20:27, Steven M. Schultz wrote:
> On Mon, 15 Dec 2003, Slepp Lukwai wrote:
> 
> > faster to begin with. However, in both cases, after multiple tests and
> > trying different things, I can't get the SMP modes to be fast at all. In
> > fact, they're slower than the non-SMP modes.
> 
>   I think I see what you're doing that could cause that.   I've never
>   seen the problem - using "-M 2" is not going to be 2x as fast though
>   if that was the expectation.   ~40% speedup or so is what I see
>   (from about 10fps to 14fps) typically.

Tried it without any options, same effect. I'm definitely seeing nowhere
near 40% speedup, which is what boggles me. I expected at least
reasonable gains of 25%.

> > When encoding with the -M 0 with .92, I get around 19fps. When I use -M
> 
>   That's full sized (720x480) is it?   Sounds more like an SVCD 
>   or perhaps "1/2 D1" (bit of a misnomer - D1 is actually a digital
>   video tape deck) at 352x480.   At 1/2 size yes, around 20fps or a bit
>   more I've seen.   But I'm usually tossing in a bit of filtering so
>   the process is a bit slower.

Sorry, upon further testing, I actually average around 14fps at DVD
quality (720x480, 9800kbit/s). (see all the details of my command lines
in the post I sent in responce to Bernhard).

> > I installed 'buffer', set it up with a 32MB buffer and put it in the
> 
>   10MB is about all I use - it's just a cushion to prevent the encoder
>   from having to wait (-M 1 is the default - there's I/O readahead
>   going on) for input.

Yeh, I tried 20 first, then 32, but in the end, it made no difference at
all.

> > Has anyone found a way around this, or is it time to look at the source
> > and see what's up?
>   
> > And for reference, it's a dual Athlon MP 2100+, which is below the
> > '2600' that the Howto references as fast.
>   
>   I'm using dual 2800s and around 14-15fps for DVD encoding is what I
>   usually get.

It's interesting that I'm faster with dual 2100s than the dual 2800 (or
at least on par). I suppose it really comes down to command line
options, but you would need to compare those yourself (since I haven't
seen yours).

> > The actual command line is:
> > mpeg2enc -v 0 -I 0 -f 8 -b 9800 -F 1 -n n -p -a 3 -o test.m2v -S  -M
> > 3 -4 2 -2 1 -r 32 -q 5 -Q 3.0 -K kvcd
> 
>   You have progressive non-interlaced source?   If not then "-I 0" is
>   not the right option. 

According to the docs -I 1 turns on interlacing support, and causes
un-needed overhead if it is known progressive material. Hence the -I 0
(plus transcode sets that, though I could override it).

>   The speed up from multiple processors comes, I believe (but if I'm
>   wrong I'm sure someone will tactfully point that out ;)), from the
>   motion estimation of the 2 fields/frame being done in parallel.

Oh. Son of a... If that's all it is...

>   Try "-I 1" (or just leave out the '-I" and let it default.
> 
>   Oh, and there's no real benefit from going above -M 2.   I had a 4
>   cpu box and tried "-M 4" and saw no gain over -M 3 (which in turn
>   was a very minimal increase over -M 2).

I've never even bothered with -M 4 (well, not for a real run, anyway,
just as a quick test).

>   If you want to speed things up by a good percentage try encoding
>   without B frames.   Those are computationally a lot more expensive
>   than I or P frames.   "-R 0" will disable B frames.

I just enabled that, and that's how I'm hitting 15fps instead of 8, and
the quality is good and the size is just fine.

>   And do you realize that increasing the search radius (-r) slows
>   things down?   Leave the -r value defaulted to 16 and you should
>   see encoding speed up.

Yup, entirely aware. I do like the minor difference it makes, though.
I'm not in it for speed, really, I just want to see both CPUs get used
to their potentials and give me the equivalent of a 4200+ ;> If it takes
6 hours to transcode a movie because I set -r32 (I noticed a larger
difference with -4 -2 options, btw, than -r16 vs -r32), that's fine, but
I feel it could be faster.

>   All in all - the defaults are fairly sane so if you're not certain
>   about an option, well, let it default.
> 
>   And drop the -Q unless you want artifacting - especially values over 2.   
>   Under some conditions (it's partly material dependent) the -Q can
> >   generate really obnoxious color blocks and similar artifacts.   Much
>   better results (especially with clean source material) can be obtained
>   with "-E -8" or perhaps "-E -10".

Until I upgraded to .92, I didn't have those options. I'm using them
now, in combination with -Q, but I find the artifacts are almost never
there (I used to do -q 4 and -Q 4.0, and it looked about the same as the
5/3.0).

> > Of course the -M 3 changes to 2 and 0 in testing. I also tested it 

Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-16 Thread Slepp Lukwai
On Mon, 2003-12-15 at 21:08, Richard Ellis wrote:
> What program are you using to monitor CPU usage while mpeg2enc runs? 
> Some versions of top (if you are using top) report percentages as a
> roll-up of the whole SMP machine, so that 3x33% usage really means
> 99% utilization of the machine, where "the machine" means both
> processors combined.  Other versions report a per-cpu percentage
> instead of rolling everything together.

I hate the combined ratings, so I already set up top to report per-CPU
usage, so I can see 200% usage instead of it showing 50% as 100% on one
CPU (it's misleading when you deal with single CPUs almost all day for
work).

> Additionally, what kind of memory do you have attached to the cpu's? 
> Mpeg encoding is very memory bandwidth hungry to begin with, and with
> two cpu's trying to eat at the same trough, a not quite as fast as it
> should be memory subsystem can produce results like what you are
> seeing.  It's because with the two cpu's trying to run mpeg2enc, they
> together oversaturate the memory bus, causing both to wait.  But with
> only one mpeg2enc thread running, the entire memory bus bandwidth is
> available to that one cpu alone.

I've noticed.  I never really saw how much memory it used until I used the
buffer program with -t. It was moving gigs of data for a short period of
frames (perhaps 10,000 frames). It's a dual Athlon, which inherently
means 266FSB (DDR 266), though the memory is actually Hynix PC3200 w/
timings set as low as they go on this board (2-2-2), which gives me
about 550MB/s memory bandwidth according to memtest, with a 13GB/s L1
and something like 6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1.

At 550MB/s, it SHOULD be able to push enough to keep the frames encoding
at 100% CPU, in theory. I don't think there's enough overhead on this
machine to qualify as keeping it even half saturated. This is why I want
the Corsair XMS Pro memory with load meters on them. (Per bank load
meters, even).

> FWIW, when my desktop machine was a dual PII-400Mhz box, I almost
> always had two mpeg2enc threads eating up 97-98%cpu on both PII
> chips.  The few times both cpu's were not fully saturated at mpeg
> encoding was when I'd bother them with something silly like browsing
> the web with mozilla. :)

Now that's just silly. Why would you hurt the CPUs by running such bloat
as Mozilla? I can't think of how many times Mozilla has gone nuts on me
and used 100% CPU without reason, and you can't kill it any normal UI
way.. Good ol' killall. However, I love it. It's a great browser. Just
rather hungry at times. I suppose there's a reason the logo is a
dinosaur. :>





Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-15 Thread Bernhard Praschinger
Hallo

> I was doing some testing of both the older version (1.6.1.90) and the
> newer version of mpeg2enc (1.6.1.92). First off, the .92 was somewhat
> faster to begin with. However, in both cases, after multiple tests and
> trying different things, I can't get the SMP modes to be fast at all. In
> fact, they're slower than the non-SMP modes.
With slower, I hope you mean "mpeg2enc needs more time to encode the
movie". 
And not the time the encoding needs in "realtime". 

> When encoding with the -M 0 with .92, I get around 19fps. When I use -M
> 2 or -M 3, I get around 14fps. The CPU utilization sits at about 60 to
> 70% across both CPUs, but hits 99.9% when using just one.
That's really strange. 
Which program did you use for monitoring your CPU utilisation?
top and/or xosview?

If you used time to find out what amount of time is used, the important
value for you is the "real" line, and not the "user" line.
The user line reports the time the command needed on both CPUs.  On a
dual machine that has nothing else to do, the real time is lower
than the user time.  The "overhead" you need for 2 threads increases the
user time a little, but lowers the real time. 

> I installed 'buffer', set it up with a 32MB buffer and put it in the
> stream, and it didn't make any difference at all. It would be nice to
> use mpeg2enc on two CPUs to it's full speed, which would net me faster
> than real-time, but thus far I haven't been able to.
What was your full comand ?

When I use lav2yuv files | mpeg2enc -f 8 -o test.m2v:
on my system (the 2600 Athlon MP I mentioned in the howto) mpeg2enc needs
nearly 100% of one cpu and lav2yuv needs another 5-10%. 
Encoding of 1000 frames takes this amount of time: 2m16.944s

When I add -M 2 
the speedup is nice: mpeg2enc has two threads each needing about 65-70%,
lav2yuv needs about 15%.
Encoding of 1000 frames takes this amount of time: 1m37.881s

Adding buffer to a simple command line does not speed up anything.
buffer helps if you have a pipeline with several stages like: lav2yuv |
yuvdenoise | yuvscaler | mpeg2enc

> Has anyone found a way around this, or is it time to look at the source
> and see what's up?
I have no need, because I think it works properly. 
 
> And for reference, it's a dual Athlon MP 2100+, which is below the
> '2600' that the Howto references as fast.
>
> Of course the -M 3 changes to 2 and 0 in testing. I also tested it with
> and without the buffer program in the list. Another notable thing, is
> that with the newest version .92, -M3 causes three 33% usage processes
> to exist (leaving an entire CPU idle), while M2 causes two 60% processes
> to exist. With .90, -Mx causes 2 50-70% processes and the rest never do
> anything.
Just for fun, I tested it with -M 3, and then I saw 3 mpeg2enc
threads each using about 45-50%; that improved the needed time compared
to -M 2 by another 10 seconds.  -M 4 didn't change much at all, only a 4th
process needing about 10%.

I use the 2.6.0-test8 kernel. Maybe that changes the situation. 

The percent numbers reported by top have to be read carefully.  At least
my top reports them per single CPU, so you can have processes using up
to 200% and then both cpus are at full load. 
But in the task/cpu stats line, 100% utilisation covers both CPUs. 

Sorry if the mail is a bit confusing,
auf hoffentlich bald,

Berni the Chaos of Woodquarter

Email: [EMAIL PROTECTED]
www: http://www.lysator.liu.se/~gz/bernhard




Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-15 Thread Richard Ellis
On Mon, Dec 15, 2003 at 01:46:32AM -0700, Slepp Lukwai wrote:
> ...
> 
> Of course the -M 3 changes to 2 and 0 in testing. I also tested it
> with and without the buffer program in the list. Another notable
> thing, is that with the newest version .92, -M3 causes three 33%
> usage processes to exist (leaving an entire CPU idle), while M2
> causes two 60% processes to exist. With .90, -Mx causes 2 50-70%
> processes and the rest never do anything.

What program are you using to monitor CPU usage while mpeg2enc runs? 
Some versions of top (if you are using top) report percentages as a
roll-up of the whole SMP machine, so that 3x33% usage really means
99% utilization of the machine, where "the machine" means both
processors combined.  Other versions report a per-cpu percentage
instead of rolling everything together.

Additionally, what kind of memory do you have attached to the cpu's? 
Mpeg encoding is very memory bandwidth hungry to begin with, and with
two cpu's trying to eat at the same trough, a not quite as fast as it
should be memory subsystem can produce results like what you are
seeing.  It's because with the two cpu's trying to run mpeg2enc, they
together oversaturate the memory bus, causing both to wait.  But with
only one mpeg2enc thread running, the entire memory bus bandwidth is
available to that one cpu alone.

FWIW, when my desktop machine was a dual PII-400Mhz box, I almost
always had two mpeg2enc threads eating up 97-98%cpu on both PII
chips.  The few times both cpu's were not fully saturated at mpeg
encoding was when I'd bother them with something silly like browsing
the web with mozilla. :)




Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0

2003-12-15 Thread Steven M. Schultz

On Mon, 15 Dec 2003, Slepp Lukwai wrote:

> faster to begin with. However, in both cases, after multiple tests and
> trying different things, I can't get the SMP modes to be fast at all. In
> fact, they're slower than the non-SMP modes.

I think I see what you're doing that could cause that.   I've never
seen the problem - using "-M 2" is not going to be 2x as fast though
if that was the expectation.   ~40% speedup or so is what I see
(from about 10fps to 14fps) typically.

> When encoding with the -M 0 with .92, I get around 19fps. When I use -M

That's full sized (720x480) is it?   Sounds more like an SVCD 
or perhaps "1/2 D1" (bit of a misnomer - D1 is actually a digital
video tape deck) at 352x480.   At 1/2 size yes, around 20fps or a bit
more I've seen.   But I'm usually tossing in a bit of filtering so
the process is a bit slower.

> I installed 'buffer', set it up with a 32MB buffer and put it in the

10MB is about all I use - it's just a cushion to prevent the encoder
from having to wait (-M 1 is the default - there's I/O readahead
going on) for input.

> Has anyone found a way around this, or is it time to look at the source
> and see what's up?

> And for reference, it's a dual Athlon MP 2100+, which is below the
> '2600' that the Howto references as fast.

I'm using dual 2800s and around 14-15fps for DVD encoding is what I
usually get.

> The actual command line is:
> mpeg2enc -v 0 -I 0 -f 8 -b 9800 -F 1 -n n -p -a 3 -o test.m2v -S  -M
> 3 -4 2 -2 1 -r 32 -q 5 -Q 3.0 -K kvcd

You have progressive non-interlaced source?   If not then "-I 0" is
not the right option. 

The speed up from multiple processors comes, I believe (but if I'm
wrong I'm sure someone will tactfully point that out ;)), from the
motion estimation of the 2 fields/frame being done in parallel.
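
Something along these lines, purely as an illustration of that idea (the names
and structure are mine, not mpeg2enc's):

// With interlaced coding (-I 1) the top- and bottom-field motion searches are
// independent, so two CPUs can each take one field; with -I 0 there is a
// single frame search and this particular source of parallelism disappears.
#include <thread>

struct Field         { const unsigned char* pels = nullptr; int w = 0, h = 0; };
struct MotionVectors { /* per-macroblock vectors would live here */ };

MotionVectors search_field(const Field&, const Field&) {
    // ... block search of the current field against the reference field ...
    return {};
}

void estimate_frame(const Field cur[2], const Field ref[2], MotionVectors mv[2]) {
    std::thread top([&] { mv[0] = search_field(cur[0], ref[0]); });
    std::thread bot([&] { mv[1] = search_field(cur[1], ref[1]); });
    top.join();
    bot.join();
}

int main() {
    Field cur[2], ref[2];
    MotionVectors mv[2];
    estimate_frame(cur, ref, mv);   // both field searches run concurrently
}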

Try "-I 1" (or just leave out the '-I" and let it default.

Oh, and there's no real benefit from going above -M 2.   I had a 4
cpu box and tried "-M 4" and saw no gain over -M 3 (which in turn
was a very minimal increase over -M 2).

If you want to speed things up by a good percentage try encoding
without B frames.   Those are computationally a lot more expensive
than I or P frames.   "-R 0" will disable B frames.

And do you realize that increasing the search radius (-r) slows
things down?   Leave the -r value defaulted to 16 and you should
see encoding speed up.
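
(For a rough feel of why, assuming a naive full search - the real search is
much smarter, but the trend is the same:)

// Candidate positions a naive full search checks per macroblock: (2r+1)^2.
#include <cstdio>
int main() {
    const int radii[] = {16, 24, 32};
    for (int r : radii)
        std::printf("-r %-2d -> %5d candidate positions per macroblock\n",
                    r, (2 * r + 1) * (2 * r + 1));
}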

All in all - the defaults are fairly sane so if you're not certain
about an option, well, let it default.

And drop the -Q unless you want artifacting - especially values over 2.   
Under some conditions (it's partly material dependent) the -Q can
generate really obnoxious color blocks and similar artifacts.   Much
better results (especially with clean source material) can be obtained
with "-E -8" or perhaps "-E -10".

> Of course the -M 3 changes to 2 and 0 in testing. I also tested it with
> and without the buffer program in the list. Another notable thing, is
> that with the newest version .92, -M3 causes three 33% usage processes

Right, with -I 0 the cpus take turns but there's little parallelism.

> to exist (leaving an entire CPU idle), while M2 causes two 60% processes
> to exist. With .90, -Mx causes 2 50-70% processes and the rest never do

Hmmm, I see 100% use on the two 2800s - but some of that would be
the DV decoding and pipe overhead of course.

First thing I'd try is lowering -r to 24 at most or just defaulting it.

Cheers,
Steven Schultz


