Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
> In floating point, all you have to do is flip a sign bit. But with
> integers, it's not so easy. There is no instruction for absolute value in
> MMX, you have to use a four instruction sequence and two registers. Slower
> than squaring a value, which only takes two instructions.
>
> I finally found a mmx2 reference, and you're right about that. MMX2 added
> psadbw, packed sum of absolute differences. If you have 8-bit unsigned
> data it makes computing SAD pretty darn easy, you can find the SAD of 8
> pixels in one instruction.

There are really very, very few MMX-only current CPUs. Basically everything after K6 with MMX supports the psadbw instruction.

> Though why did mpeg2enc use variance in the first place? Maybe it's a
> better estimator than SAD for motion compensation fit?

It's just not that simple. It uses SAD for 'coarse' motion estimation and switches to variance for the final selection of the particular motion estimation mode. This combination provides a good speed/quality trade-off. Experiments with 'only variance' were unimpressive in their quality improvements, and 'only SAD' costs quite a lot of quality for a modest speed gain. All the 'low hanging fruit' in the current motion estimation algorithm has long since been picked...

> The level of altivec optimizations ffmpeg vs mpeg2 is probably an important
> factor in any speed difference, and one that wouldn't matter for other
> CPUs, which the level of MMX/MMX2/SSE optimizations makes a large
> difference.

You have to be very careful to compare like coding profiles, motion search radii and suchlike too. It's easy to be twice as fast if you're simply trying half as hard to find a good encoding.

However, there *are* bottlenecks in mpeg2enc. For a really speedy mode, predictive motion estimation algorithms would need to be used.

Andrew

---
This SF.net email is sponsored by: IBM Linux Tutorials. Become an expert in LINUX or just sharpen your skills. Sign up for IBM's Free Linux Tutorials. Learn everything from the bash shell to sys admin. Click now! http://ads.osdn.com/?ad_id=1278&alloc_id=3371&op=click
___
Mjpeg-users mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/mjpeg-users
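A minimal sketch of the two block-matching measures discussed in this message, written in Python for readability rather than speed. It assumes that "variance" here amounts to a sum of squared differences between the block and a candidate. The function names and the `coarse_keep` parameter are hypothetical and are not taken from mpeg2enc; the two-stage selection only loosely mirrors the "SAD for coarse search, variance for the final choice" idea.

```python
def sad(a, b):
    """Sum of absolute differences between two equal-length pixel sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def ssd(a, b):
    """Sum of squared differences (stand-in for the 'variance' measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_match(block, candidates, coarse_keep=2):
    """Shortlist candidates by SAD, then pick the final one by SSD.

    Returns the index of the chosen candidate.
    """
    # Coarse stage: rank every candidate by the cheaper SAD measure.
    ranked = sorted(range(len(candidates)), key=lambda i: sad(block, candidates[i]))
    shortlist = ranked[:coarse_keep]
    # Final stage: choose among the shortlist with the squared-error measure.
    return min(shortlist, key=lambda i: ssd(block, candidates[i]))


block = [10, 10, 10, 10]
candidates = [
    [10, 10, 10, 18],  # one large error
    [12, 12, 12, 12],  # several small errors
    [0, 0, 0, 0],      # poor match
]
chosen = best_match(block, candidates)
```

The first two candidates have the same SAD, but the squared-error measure penalises the single large error more heavily, so the second one is chosen. That is one way the two measures can disagree.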
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On 19 Dec 2003, Florin Andrei wrote:

> On Fri, 2003-12-19 at 01:49, Steven M. Schultz wrote:
>
> > At any rate I checked out ffmpeg's mpeg2 encoding vs mpeg2enc on
> > my G4 Powerbook. Yes, ffmpeg has a big speed advantage (~2x) but

The difference is even larger than I thought... ffmpeg was decoding the DV file and encoding the audio at the same time, but I had mpeg2enc reading a pre-staged .y4m file.

> Any chance repeating that on an Intel or AMD processor?

    ffmpeg -i input.dv -vcodec mpeg2video -f mpeg -b 5000 -g 15 foo.mpg

then compare to

    mpeg2enc -R 0 -b 5000 -4 3 -2 2 -o foo1.mpg < input.y4m

Have fun :)

At the moment my AMD system's booked solid for other encoding jobs; it'll be a couple of days before I can run some tests on it. Hmmm, time to fire up the dual P4 system and get it sync'd up on all the projects. Maybe over the weekend.

Cheers,
Steven Schultz
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Fri, 2003-12-19 at 01:49, Steven M. Schultz wrote:

> At any rate I checked out ffmpeg's mpeg2 encoding vs mpeg2enc on
> my G4 Powerbook. Yes, ffmpeg has a big speed advantage (~2x) but
> the resulting output is 'grainy' (same bitrate, no B frames) (and the
> rate control is, well, almost non existent - ~2x spikes that'd drive
> a hardware player nuts).

Any chance repeating that on an Intel or AMD processor?

--
Florin Andrei
http://florin.myip.org/
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Fri, Dec 19, 2003 at 01:34:38AM -0800, Trent Piepho wrote:

> On Fri, 19 Dec 2003, Andrew Stevens wrote:
>
> > The next bottlenecks would be the run-length coding and the use
> > of variance instead of SAD in motion compensation mode and DCT
> > mode selection. Sadly
>
> Is SAD really any faster to calculate than variance? SAD uses an
> absolute value-add operation while variance is multiply-add.
> Multiply-add is usually the most heavily optimized operation a cpu
> can perform.

You are thinking of DSP chips, not general purpose CPUs. For DSPs, yes, multiply-add is very heavily optimized, but for general purpose CPUs it's often not quite so heavily optimized.

Additionally, if you've got an SSE2 capable x86 chip, it's got parallel SAD operations in the SSE2 instruction set. There isn't an SSE2 mul-add operation yet.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Fri, 19 Dec 2003, Steven M. Schultz wrote:

> On Fri, 19 Dec 2003, Trent Piepho wrote:
>
> > On Fri, 19 Dec 2003, Andrew Stevens wrote:
> >
> > Is SAD really any faster to calculate than variance? SAD uses an absolute
> > value-add operation while variance is multiply-add. Multiply-add is usually
> > the most heavily optimized operation a cpu can perform.
>
> Au contraire. Multiply is a lot slower than abs(). All abs() has
> to do is flip a sign bit (effectively) and that's going to be a lot

In floating point, all you have to do is flip a sign bit. But with integers, it's not so easy. There is no instruction for absolute value in MMX, you have to use a four instruction sequence and two registers. Slower than squaring a value, which only takes two instructions.

Though you can cleverly combine an unsigned subtraction and absolute value operation into four instructions total, and perform it on eight unsigned bytes at a time. So you can compute an absolute value of differences quite a bit faster under MMX than I was thinking you could. Clearly SAD would be faster than variance.

Another advantage of SAD is that you can use an intermediate result more easily than with variance. That way you can short-circuit the SAD calculation as soon as the running total reaches the best SAD found so far.

> faster than any multiply. And aren't there MMX2/SSE abs+add
> instructions - that would make abs/add quite fast.

I finally found a mmx2 reference, and you're right about that. MMX2 added psadbw, packed sum of absolute differences. If you have 8-bit unsigned data it makes computing SAD pretty darn easy, you can find the SAD of 8 pixels in one instruction.

Though why did mpeg2enc use variance in the first place? Maybe it's a better estimator than SAD for motion compensation fit?

> At any rate I checked out ffmpeg's mpeg2 encoding vs mpeg2enc on
> my G4 Powerbook. Yes, ffmpeg has a big speed advantage (~2x) but

The level of altivec optimizations in ffmpeg vs mpeg2enc is probably an important factor in any speed difference, and one that wouldn't matter for other CPUs, for which the level of MMX/MMX2/SSE optimizations makes a large difference.
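The short-circuit idea mentioned in this message can be sketched as follows. This is an illustrative Python version, not code from mpeg2enc; the function name `sad_with_cutoff` and its `None` return convention are assumptions made for the example.

```python
def sad_with_cutoff(a, b, best_so_far):
    """Accumulate the SAD of two pixel sequences, stopping early.

    Returns the SAD if it is strictly better (smaller) than best_so_far,
    or None as soon as the running total shows it cannot be.
    """
    total = 0
    for x, y in zip(a, b):
        total += abs(x - y)
        # The total only ever grows, so once it reaches the best value
        # already found this candidate can be abandoned.
        if total >= best_so_far:
            return None
    return total
```

The early exit works because each term of the sum is non-negative, so a partial sum is a lower bound on the final value. A search loop would pass the best SAD seen so far and skip any candidate that returns `None`.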
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Fri, 19 Dec 2003, Trent Piepho wrote:

> On Fri, 19 Dec 2003, Andrew Stevens wrote:
>
> Is SAD really any faster to calculate than variance? SAD uses an absolute
> value-add operation while variance is multiply-add. Multiply-add is usually
> the most heavily optimized operation a cpu can perform.

Au contraire. Multiply is a lot slower than abs(). All abs() has to do is flip a sign bit (effectively) and that's going to be a lot faster than any multiply. And aren't there MMX2/SSE abs+add instructions - that would make abs/add quite fast.

At any rate I checked out ffmpeg's mpeg2 encoding vs mpeg2enc on my G4 Powerbook. Yes, ffmpeg has a big speed advantage (~2x) but the resulting output is 'grainy' (same bitrate, no B frames) (and the rate control is, well, almost non existent - ~2x spikes that'd drive a hardware player nuts).

Cheers,
Steven Schultz
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Fri, 19 Dec 2003, Andrew Stevens wrote:

> The next bottlenecks would be the run-length coding and the use of variance
> instead of SAD in motion compensation mode and DCT mode selection. Sadly

Is SAD really any faster to calculate than variance? SAD uses an absolute value-add operation while variance is multiply-add. Multiply-add is usually the most heavily optimized operation a cpu can perform.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tuesday 16 December 2003 23:35, Richard Ellis wrote:

Hi Richard,

> In that case it will kill the majority of the performance benefit
> provided by the caches, because there's very little locality of
> reference for the cache to compensate for. It moves through at least
> 512k for pass one, then through the same 512k again for pass two, but
> the data in the cache is from the end of the frame, and we are
> starting over at the beginning of the frame. Massive cache thrash in
> that case. Memory bandwidth becomes a much more limiting factor.

Exactly what I thought when I restructured encoding to a per-macroblock basis a few months back. The performance gain was not measurable. The key 'thinko' here is that most of the time goes into motion estimation, and in motion estimation the search windows of neighbouring macroblocks overlap > 90%. Cache locality is pretty good. Playing around with prefetch (etc. etc.) has never brought measurable gains.

The main bottleneck in the current encoder (for modern CPUs) is the first phase (4X4 subsampling) of the subsampling motion estimation hierarchy. For speed this would need to be replaced with a predictive estimator.

The next bottlenecks would be the run-length coding and the use of variance instead of SAD in motion compensation mode and DCT mode selection. Sadly, there's not much that can be done easily about the former, and the latter cannot be removed without a noticeable reduction in encoding quality (I tried it :-().

However, I have some ideas to try when I get back in the new year!

Andrew
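To make the "4X4 subsampling" phase a little more concrete, here is a rough Python sketch of reducing an image by a factor of four in each direction, so that a coarse motion search can run on one sixteenth of the pixels. Averaging each 4x4 block is an assumption made for illustration; the message does not say how mpeg2enc actually computes its subsampled images, and the function name is hypothetical.

```python
def subsample(image, factor=4):
    """Shrink a 2-D list of pixel values by averaging factor x factor blocks.

    Rows or columns that do not fill a complete block are dropped.
    """
    out_h = len(image) // factor
    out_w = len(image[0]) // factor
    result = []
    for by in range(out_h):
        row = []
        for bx in range(out_w):
            total = sum(
                image[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            )
            # Integer mean of the block, as 8-bit video samples are integers.
            row.append(total // (factor * factor))
        result.append(row)
    return result


# A 4x8 test image: left half black (0), right half value 16.
image = [[0] * 4 + [16] * 4 for _ in range(4)]
small = subsample(image)
```

A hierarchical search would find a rough match on the small image first and then refine it at higher resolution around that position.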
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, Dec 16, 2003 at 06:54:22PM -0700, Slepp Lukwai wrote:

> As a side note, I'm also using a 200Hz timer, instead of the standard
> 100Hz. Though I don't see this doing anything but making it quicker, as
> it reduces latency on scheduling, while slightly increasing scheduler
> overhead and context switching (or is an SSE/3Dnow! CS really expensive,
> anyone know?).

A 200Hz timer will have only one effect on batch type processes: slowing them down. And mpeg2enc is essentially a batch type process. Why? Because of the increased scheduler overhead. Now, you may be hard put to measure the slowdown because so many other effects will swamp it (one HD seek that takes a few ms would swamp a large part of the scheduler overhead), but it's still there. The only thing that's "quicker" with a 200Hz timer is interactive response, where you want to see your X cursor move the instant you touch the mouse.

Yes, context switching (at least for SSE) is more expensive, because the 8 128-bit SSE registers may need to be saved. I don't know off the top of my head if Intel implemented lazy context saves for SSE like with the x86 FPU stack. If they did, then not all context swaps incur the SSE save overhead, but when one does, there is more data to save.

> I wonder if it comes back to the increased timing of the scheduler?
> (Though it's using a supposed O(1) scheduler, which should offset
> that).

The O(1) scheduler does not change the context switch overhead timing. The O(1) scheduler simply says that no matter how many processes are waiting to run, it takes constant time to find the "next" one when we do need to context switch. But a 200Hz timer will still use up 2x as much CPU time running the scheduler as a 100Hz timer will.
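The "2x as much CPU time" point is simple arithmetic, sketched below in Python. The per-tick cost of 10 microseconds is a made-up placeholder, not a measured figure for any kernel; only the ratio between the two timer rates is meaningful.

```python
def tick_overhead_us(timer_hz, cost_per_tick_us):
    """Microseconds per second of wall time spent servicing timer ticks."""
    return timer_hz * cost_per_tick_us


HYPOTHETICAL_COST_US = 10  # placeholder value, chosen only for illustration

at_100hz = tick_overhead_us(100, HYPOTHETICAL_COST_US)
at_200hz = tick_overhead_us(200, HYPOTHETICAL_COST_US)

# Fraction of each second lost to tick handling at each rate.
fraction_100 = at_100hz / 1_000_000
fraction_200 = at_200hz / 1_000_000
```

Whatever the real per-tick cost is, doubling the tick rate doubles this overhead, while the absolute fraction stays small enough that a single disk seek can hide it in a benchmark.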
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, 2003-12-16 at 23:17, Bernhard Praschinger wrote:

> > -M 0: 2m 11.9s
> > -M 1: 2m 10.6s, -1.3s
> > -M 2: 1m 27.7s, -44.2s
> > -M 3: 1m 26.5s, -45.4s
> Those values look much better. :-)
> Now you have seen that mpeg2enc can go faster.

It's like it used to be. :> I'm going to try it on a full video, with a few options. I figure I'll let it run through 24 hours of encoding time (about 6 different trials) and see how each result turns out, and so on. I'll let you all know when it's done. :>

> I have tried the command you used on my machine, and I have seen the
> same "problem". Also 3 processes and each only 33%.
>
> (time lav2yuv n1000.eli | mpeg2enc -I 0 -f 8 -b 9800 -p -a 3 -o test.m2v
> -S -M 3 -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0)

Yes.. So it's definitely the -R 0, but -R 1 is faster than the default of -R 2 (I think that's the default?)

> > Note that I responded in an earlier message with a total of 24 timings
> > across -M 0-3 -I 0-1 -R 0-2 settings, which turned up some interesting
> > results that -M 3 -I 0 -R 1 worked fastest of all of them (same source
> > material I used for the above, and it took 51 seconds). So, I think the
> > -I 1 is on, which makes a huge boost in -M ratings from 0 to 3, but it
> > is still quite a bit slower than -I 0 (which I use since the input is
> > Progressive 23.976fps)
> That's strange.

It makes a mild bit of sense.. But just a little.

> I'm just running some encodings to see which option causes the problem.
>
> On my machine the -R 0 caused the problem. If I used -R 1/2 or no -R
> option, I got 3 processes each using about 45-50%.

Which should total about 150% CPU instead of the 99% that it uses with -R 0.

> > > My brain had given up the time I started my computer that evening ;)
> > Mine usually does that at about 8am. :>
> Just as you enter work ? ;)

Self employed, thereby just as I crawl out of bed, and my brain stays broken until about noon. That's what I get for staying up till 4am playing with mpeg2enc. :>

> Encoding without the -R 0 seems to solve the problem, by now.

I'm going to see what speeds I get about halfway through a video, when nothing from disk is in cache anymore, the encoders/decoders are in full swing, and everything sort of settles down.. Should be interesting.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
Just a side note, I find it interesting your name is Andrew Stevens, whereby mine is Stephen Andrew (middle name).

On Tue, 2003-12-16 at 14:41, Andrew Stevens wrote:

> Yep. You should (in theory) get a lot closer to that with the current
> MPEG_DEVEL branch mpeg2enc. However, your scaling is really remarkably bad
> as even the -R 2 values where two CPUs should be fairly busy are unusually
> bad. I've never heard of worse than 70% utilisation on dual CPU machines.

And I'm wondering why it's not scaling... Hence the original post about this.

> Here's a fairly typical snapshot of mpeg2enc -M 2 -I 1 -R 2 in action on my
> dual P-III machine...
>
>   PID USER PRI NI  SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
> 12620 as    18  0 46464 45M   768 R    80.9 24.3 0:18 lt-mpeg2enc
> 12621 as    18  0 46464 45M   768 R    70.8 24.3 0:18 lt-mpeg2enc
> 12619 as     9  0 46464 45M   768 S     3.9 24.3 0:01 lt-mpeg2enc

Which is nothing like what I see. I rarely see two of them break 60%; they hover closer to 45%.

> You're getting very very symmetrical CPU loads and very very poor utilisation.
> What kernel are you using... I vaguely recall 2.6.x series radically changed
> the threading libs. It could be something pathological is happening in the
> scheduling.

It's 2.4.20-gentoo-r9, actually. I'm wondering if a patch in here is causing problems, but I'm very hesitant to try any other kernels since this chipset/board are rather flaky and now that it's working again, I don't want it to break (I couldn't run mpeg2enc, let alone transcode/dvdrip, for almost 5 months because it would lock the system hard when it was under load). The newest kernels, 2.6.x, don't let me disable the APIC in the kernel itself, and that causes problems. Perhaps tonight I'll test a 2.4.23 without any patches (just vanilla) and see what happens with scheduling.

As a side note, I'm also using a 200Hz timer, instead of the standard 100Hz. Though I don't see this doing anything but making it quicker, as it reduces latency on scheduling, while slightly increasing scheduler overhead and context switching (or is an SSE/3Dnow! CS really expensive, anyone know?).

> The 2100+ is of course a lot faster than the P-III but: I doubt the balance
> between the motion estimation and the rest of the code is hugely shifted.
> Certainly, the approximate proportions of time spent in each are quite similar
> on my 2100+ single-CPU machine and a P-III.

On the single 2000 XP we have, it runs at about 90% of the speed of my machine in SMP mode (-M 3).

> > Also, encoding with one B frame is a touch faster in -I 1 mode than
> > encoding without them, but it is slower when you encode two B frames
> ...
>
> Not really. However: I would expect going to two B frames to greatly increase
> your CPU utilisation without much wall-clock time increase due the increased
> scope for parallel computation.

But I was more or less pointing out the timings from the message that -I 1 -R 1 was faster than -I 0 -R 1, for some reason. Not all that much, but noticeably.

> This is what you'd expect: -R 2 offers much more scope for the 3 worker
> threads of -M 3 to do something useful.

It still worked out that 3-0-1 was the shortest overall time spent, even if CPU usage was still not peaked.

> The usefulness of B frames depends a *lot* on the type of material. For
> captured stuff they rarely buy you much apart from free room heating from
> your CPU. Hence the provision of -R 0 ;-). They should get a little more
> useful when I add dynamic frame type selection to mpeg2enc in the new year.

Strictly DVD copies. With three cats and being lazy, I have more DVDs ruined than I'd like to count. (Speaking of room heating, it's about -20 degC outside at the moment, and my window is about 10cm open, and it's still a toasty 25 degrees in here. My office doubles as the server closet.)

> > > - There is also a parallel read-ahead thread but this rarely soaks much
> > > CPU on modern CPUs.
>
> Weirdly enough on your machine the reader thread is exceedingly busy

I use LVM, but can read 35MB/s off the disks with that. The memory buffer cache is about 250MB/s. I wonder if it comes back to the increased timing of the scheduler? (Though it's using a supposed O(1) scheduler, which should offset that).

> cvs co -d :ext:[EMAIL PROTECTED]:/cvsroot/mjpeg mjpeg_play
> cd mjpeg_play
> cvs update -r MPEG_DEVEL mpeg2enc

:ext: wanted a password for anonymous, and 'enter' didn't work. So, I used :pserver:. I sent you a message about the problems I encountered thereof.

> The 'mjpeg_play' is a bit of a historical oddity but it is monumentally
> painful to change directory names in CVS...

I tend to just drop the entire project, clean it up, and reimport it into a fresh tree to rename it. :> You'd think that in the years and revisions CVS has undergone, renaming of directories wouldn't be nearly as painful.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, 2003-12-16 at 10:27, Steven M. Schultz wrote:

> On Tue, 16 Dec 2003, Slepp Lukwai wrote:
>
> > Tried it without any options, same effect. I'm definitely seeing nowhere
> > near 40% speedup, which is what boggles me. I expected at least
> > reasonable gains of 25%.
>
> I think that has to do with the -I setting...

The -I is frightening me. Take a look at my previous post with the -M x -I x -R x settings on each. The -I 1 with two B frames (-R 2) shows a huge gain over the -M 0 -I 1 -R 2, but it is still significantly slower than -M 3 -I 0 -R 1 or 2.

> > Sorry, upon further testing, I actually average around 14fps at DVD
> > quality (720x480, 9800kbit/s). (see all the details of my command lines
>
> Ah, that's more like it then.

Yup.

> > It's interesting that I'm faster with dual 2100s than the dual 2800 (or
> > at least on par). I suppose it really comes down to command line
> > options, but you would need to compare those yourself (since I haven't
>
> Friend of mine has dual 2400s and my setup is ~10-15% faster as I
> recall - he's getting around 11fps as a rule where I see 14 or so.
>
> I'm usually adding a bit of overhead with the chroma conversion. I
> build smilutils with ffmpeg/libavcodec (to use ffmpeg's DV codec)
> and then run the data thru something like:
> "smil2yuv -i 2 file.dv | filters | y4mscaler -O chromass=420_MPEG2 |..."
>
> Produces better output than the default which uses libdv but does
> cost a bit in cpu use.

You could run my test case. I pre-decoded an MPEG2 DVD stream into 1010 frames and then used this:

    for M in 0 1 2 3 ; do
      for I in 0 1 ; do
        for R in 0 1 2 ; do
          export LD_LIBRARY_PATH=/home/slepp/mp/lib
          echo -M $M -I $I -R $R
          time /home/slepp/mp/bin/mpeg2enc -f 8 -M $M -g 9 -G 18 -I $I -v 0 -E -10 -K kvcd -4 2 -2 1 -R $R -F 1 -o test-$M$I$R.m2v < pgmy4m.raw
        done
      done
    done

It was rather handy, but it took a long time to run. The source pgmy4m.raw is 536MB here.

> > According to the docs -I 1 turns on interlacing support, and causes
> > un-needed overhead if it is known progressive material. Hence the -I 0
> > (plus transcode sets that, though I could override it).
>
> But unless you have the raw 23.976fps progressive data (with the 3:2
> pulldown undone) then I think '-I 1' is the option to use. But then
> I might be confused (wouldn't be the first time ;)).

Yup. It's MPEG-2 DVD @ 23.976 fps, and I need to add in the pulldown.

> That would explain why the encoding rate I see is lower since I'm
> using -I 1.

See above. :>

> > > wrong I'm sure someone will tactfully point that out ;)) the speedup
> > > comes from the motion estimation of the 2 fields/frame being done in
> > > parallel.
> >
> > Oh. Son of a... If that's all it is...
>
> Yep - I'm fairly sure that is why you're not seeing any improvement
> when using "-M 2".

You're right that -I improves performance over baselines later on, but it doesn't improve over -I 0.

> On noisy source material the -E option has almost no effect but the
> cleaner the input the more effect even modest values of -E have.

So a -E helps with transcoding these DVDs.

> > now, in combination with -Q, but I find the artifacts are almost never
> > there (I used to do -q 4 and -Q 4.0, and it looked about the same as the
> > 5/3.0).
>
> Perhaps Richard Ellis could chime in with his experiences with -Q ;)

I will need to look at that. :> Any idea how it impacts performance?

> Looking back on it that makes sense though. A P frame depends on the
> preceding P frame - rather sequential in nature since you can't
> move on to the next one without completing the first one...

But it would make sense, I thought, to do IPBBPBBPBB on thread 1 and IPBBPBBPBB on thread 2. So each thread does a GOP (in fact, I found a marginal increase in speed by using a GOP size of 3 (-g 3 -G 3) and -M 3; it was a little faster than when I had variable GOP sizing). But, I think that the encoder doesn't stream it that way, but rather does the I frame, then fires off the P frame, then each B is done in parallel, and then it does another P, etc.

> > The MPEG decoding doesn't take much, and the pipe overhead is negligible,
>
> Pipe overhead sneaks up on you though. One pipe? Not a real problem,
> two? Begins to be noticed but isn't too bad. Four or five? Yeah,
> it starts to take a hit on the overall speed of the system - the data
> has to go up/down thru the kernel all those times and that's not "free".

Everything should be free. ;> What I'd like to test, just for fun, is to fire up mpeg2enc on one of the 16 processor SGI systems at the University and see how that threads across 16. It says it doesn't do well above 4, but again, it's fun to try.

> You might try, for timing purposes, without -I 0 and see what, if any
> effect that has. Might b
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, 2003-12-16 at 13:15, Richard Ellis wrote:

> On Tue, Dec 16, 2003 at 12:33:52AM -0700, Slepp Lukwai wrote:
> >.. It's a dual Athlon, which inherently means 266FSB (DDR 266),
> > though the memory is actually Hynix PC3200 w/ timings set as low as
> > they go on this board (2-2-2), which gives me about 550MB/s memory
> > bandwidth according to memtest, with a 13GB/s L1 and something like
> > 6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1. At 550MB/s, it
> > SHOULD be able to push enough to keep the frames encoding at 100%
> > CPU, in theory.
>
> Yes, but just one 720x480 DVD quality frame is larger than 256k in
> size, so a 256k cache per CPU isn't helping too much overall
> considering how many frames there are in a typical video to be
> encoded. Plus, my experience with Athlons is that they are actually
> faster at mpeg2enc encoding than Intel chips of equivalent speed
> ratings (the Athlon's 3dnow/mmx implementation is faster) and so they
> put a heavier stress on one's memory bandwidth than an equivalent
> speed Intel chip would. It's possible that 275MB/s per CPU just
> isn't fast enough to keep up with the rate that mpeg2enc can consume
> data on an Athlon.

Yes, I expect the cache to only be able to fit the mpeg2enc code sections, not any of the data it uses. If the code keeps getting bumped out, then that's a problem. And 275MB/s may not be enough, true... It's too bad the Athlon dual chipset (AMD 768MPX) can't do above about 140 MHz bus speeds, to see how much memory speed affects it.

> Of course, Andrew would be much better suited to discuss mpeg2enc's
> memory access patterns during encoding, which depending on how it
> does go about accessing memory can better make use of the 256k of
> cache, or cause the 256k of cache to be constantly thrashed in and
> out.

It could be interesting to use cachegrind on mpeg2enc and see what it reports for cache hits/misses, but I find cachegrind tends to make a 1 minute runtime hit 10 minutes, so I may not bother..

> > Now that's just silly. Why would you hurt the CPUs by running such bloat
> > as Mozilla? I can't think of how many times Mozilla has gone nuts on me
> > and used 100% CPU without reason, and you can't kill it any normal UI
> > way.. Good ol' killall. However, I love it. It's a great browser. Just
> > rather hungry at times. I suppose there's a reason the logo is a
> > dinosaur. :>
>
> Hmm... Interesting. I've had it sometimes just stop but never go
> nuts with 100% CPU, and although I usually do CLI kill it if need be,
> FVWM2's "destroy" window command has never failed to get rid of it if
> I don't bother to go CLI to do so. In fact, FVWM2's "destroy" has
> never failed to get rid of anything that went wonky. It's the X
> windows equivalent to a "kill -9" from the CLI.

I've had it lock up so that X becomes unresponsive, since it's in a loop doing some expensive operation of some sort. It's strange. I don't see it nearly as often with the newer Mozillas as I did with the old ones (in fact, I haven't seen it in over a month).
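The claim that one 720x480 frame does not fit in a 256k cache is easy to check. The Python sketch below assumes 8-bit samples with 4:2:0 chroma subsampling, which is what DVD MPEG-2 video uses; the function name is just for this example.

```python
def yuv420_frame_bytes(width, height):
    """Size in bytes of one 8-bit 4:2:0 frame (full luma, quarter-size chroma x2)."""
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return luma + chroma


frame_bytes = yuv420_frame_bytes(720, 480)   # whole frame
luma_bytes = 720 * 480                       # luma plane only
cache_bytes = 256 * 1024                     # 256k L2 cache per CPU

frames_per_cache = cache_bytes / frame_bytes
```

Under these assumptions a frame is 518400 bytes, roughly twice the cache size, and even the luma plane alone (345600 bytes) is larger than 256k. This is in line with the "at least 512k" per pass figure quoted elsewhere in the thread.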
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, 2003-12-16 at 12:57, Bernhard Praschinger wrote:

> Could you run a few tests (please). Get some frames (100-1000) in yuv
> format. I guess that should be possible even with transcode. ;)
> (I do not use transcode so I can't help, or get the test streams on
> mjpeg.sf.net)

With about 1010 frames of YUV, using < to dump it in (instead of cat), I get these:

    -M 0: 2m 11.9s
    -M 1: 2m 10.6s, -1.3s
    -M 2: 1m 27.7s, -44.2s
    -M 3: 1m 26.5s, -45.4s

Note that I responded in an earlier message with a total of 24 timings across -M 0-3 -I 0-1 -R 0-2 settings, which turned up some interesting results that -M 3 -I 0 -R 1 worked fastest of all of them (same source material I used for the above, and it took 51 seconds). So, I think the -I 1 is on, which makes a huge boost in -M ratings from 0 to 3, but it is still quite a bit slower than -I 0 (which I use since the input is Progressive 23.976fps)

> And do afterwards something like that:
> cat stream.yuv | mpeg2enc -f8 -M 0-3 -o test.m2v
> or
> lav2yuv stream.avi | mpeg2enc -f 8 -M 0-3 -o test.m2v
>
> So you can be sure that nothing else makes any trouble. And check
> then how it is going. That should not take too long. Then you can add
> the options you used, to see if anything there causes the problem of non
> increasing framerate.

Compared to the run with my long options line, these are

> Bad. Which board do you have? (Mine is a Tyan Tiger MPX)

Nice board, that one. Asus A7M-266D.. I should've grabbed the MSI K7D Master for the same price; I hear much nicer things about it.

> My brain had given up the time I started my computer that evening ;)

Mine usually does that at about 8am. :>

> But I'm not really knowing why the situation is that bad.

I'm just not seeing the dual CPU usage that would warrant even running in multiple threads, when I could instead transcode two entirely separate items as though I had two machines, which makes some sense (I did that the other day, worked rather well). But, if I can make a single copy work by flooding both CPUs with activity, then I'll be happier, since it should take quite a bit less time to encode a full movie.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, 2003-12-16 at 12:33, Andrew Stevens wrote:
> Hi all,
>
> First off a bit of background to the multi-threading in the current stable
> branch. First off:
>
> - Parallelism is primarily frame-by-frame. This means that the final phases
> of the encoding lock on completion of the reference frame (prediction and DCT
> transform) and the predecessor (bit allocation). If you have a really fast
> CPU that motion estimates and DCT's very fast you will get lower
> parallelisation. If you use -R 0 you will get very little parallelism *at
> all*. Certainly not enough to make -M 3 sensible.

Yet again, good to know. This line (generally, a triple loop for 0-3 M, 0-1 I and 0-2 R):

Produces this (approximately 1010 frames). The encoding times are real time / user time, which gives a bit of a view as to how busy the CPUs were during the real time. Optimal should be 1m real time, 2m user time, right? Average system time was 3.0s, +/- 0.2s, for all tests.

(options on each call were: -f 8 -g 9 -G 18 -v 0 -E -10 -K kvcd -4 2 -2 1 -F 1 < rawstream.yuv )

Options           Real         User         Real time vs -M 0
-M 0 -I 0 -R 0    1m 6.082s    0m 50.050s   baseline
-M 0 -I 0 -R 1    1m 16.545s   0m 58.980s   baseline
-M 0 -I 0 -R 2    1m 34.511s   1m 17.045s   baseline
-M 0 -I 1 -R 0    2m 7.344s    1m 49.495s   baseline
-M 0 -I 1 -R 1    1m 59.665s   1m 42.215s   baseline
-M 0 -I 1 -R 2    2m 30.990s   2m 30.990s   baseline
-M 1 -I 0 -R 0    1m 5.713s    0m 49.800s   -0.35s
-M 1 -I 0 -R 1    1m 15.305s   0m 58.975s   -1.2s
-M 1 -I 0 -R 2    1m 34.057s   1m 17.090s   -0.5s
-M 1 -I 1 -R 0    2m 5.928s    1m 49.700s   -1.3s
-M 1 -I 1 -R 1    1m 59.019s   1m 41.955s   -0.6s
-M 1 -I 1 -R 2    2m 49.149s   2m 31.440s   +19.2s
-M 2 -I 0 -R 0    1m 0.503s    0m 25.930s   -5.5s
-M 2 -I 0 -R 1    0m 53.418s   0m 58.950s   -23s
-M 2 -I 0 -R 2    1m 7.418s    1m 18.145s   -27s
-M 2 -I 1 -R 0    1m 54.534s   1m 50.060s   -13s
-M 2 -I 1 -R 1    1m 15.489s   0m 1.040s    -- uhm...?
-M 2 -I 1 -R 2    1m 54.720s   1m 16.720s   -36s
-M 3 -I 0 -R 0    0m 57.533s   0m 50.610s   -8.5s
-M 3 -I 0 -R 1    0m 51.541s   0m 40.265s   -25s
-M 3 -I 0 -R 2    1m 5.996s    0m 54.325s   -29s
-M 3 -I 1 -R 0    1m 50.570s   1m 49.715s   -17s
-M 3 -I 1 -R 1    1m 14.462s   1m 8.530s    -45s
-M 3 -I 1 -R 2    1m 36.192s   0m 52.145s   -54s

Interestingly, and I think this has to do with the I/O buffering, -M 0 is slower than -M 1 by a small fraction in nearly every test (the exception being -I 1 -R 2). And as Steven Schultz had suggested, -I 1 is a bad bad idea. It never improved performance, and in fact made it quite a bit worse (the man page is right :). (Of course, -M 1 will be at least two processes, and since I have a real dual system, it makes sense, and may not hold true for a single CPU)

Also, encoding with one B frame is a touch faster in -I 1 mode than encoding without them, but it is slower when you encode two B frames instead of just one. I find this interesting.. I would have expected a single B frame to take a bit longer than none at all, and that is the case when -I 0 is on, but not when it's -I 1. Any ideas on that one?

In the end -M 3 is not reasonably faster in -I 0 -R 0, but flies along at -I 0 -R 2 compared to baseline, and gets fair gains at -I 0 -R 1, while dropping encoding time by another 14 seconds for the same frameset.

So, does this boil down to -M 3 -I 0 -R 1 being the fastest?

The numbers on -M 3 -I 1 -R 2 show a 54 second improvement over the tests with -M 0, but it takes almost 50% longer than -M 3 -I 0 -R 1. The file size of 3-1-2 is 13,807,067 and the file size of 3-0-1 is 13,402,673. The 3-0-1 file is smaller and is encoded faster, and viewing them now, the quality is at least on par (3-0-1 looked a tad better).

> - There is also a parallel read-ahead thread but this rarely soaks much CPU on
> modern CPUs.
>
> The MPEG_DEVEL branch encoder stripes all encoding phases to allow much more
> scalable parallelisation. You might want to give it a go - I'd be interested
> in the results!

I'd love to, but I couldn't find it in CVS. I found everything else in the SF CVS branch, but not mjpegtools itself.

> N.b. in a 'realistic' scenario you're running the multiplexer and audio
> encoding in parallel with the encoder and video filters communicating via
> pipes and named FIFO's. This setup usually saturates a modern dual machine

No multiplexing and no audio encoding (AC3 pass through and multiplexing of DVD streams is done after completion of the video encoding). There is the overhead of decoding the original MPEG2 stream into YUV, but that's about all else that transcode (which I'm using) is dumping into the pipe. I avoided any of that on this run by just dumping the file in an already decoded format (pgmtoy4m output).

> cheers,
>
> Andrew
> PS
> I'm away on vacation for a couple of weeks from friday so there'll be a bit of
> pause in answering emails / posts from then ;-)
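The "1m real time, 2m user time" rule of thumb from the message above can be turned into a rough utilisation figure. This is only a sketch: the helper names and the two-CPU default are assumptions for illustration, not anything from mjpegtools, and the sample rows are copied from the timing table above.

```python
# Illustrative helpers (hypothetical names); timings copied from the table above.

def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '1m 6.082s' to seconds."""
    minutes, seconds = stamp.split()
    return int(minutes.rstrip("m")) * 60 + float(seconds.rstrip("s"))

def utilisation(real: str, user: str, cpus: int = 2) -> float:
    """Fraction of the available CPU time spent in user code (assumes `cpus` CPUs)."""
    return to_seconds(user) / (to_seconds(real) * cpus)

runs = {
    "-M 0 -I 0 -R 1": ("1m 16.545s", "0m 58.980s"),
    "-M 3 -I 0 -R 1": ("0m 51.541s", "0m 40.265s"),
    "-M 3 -I 0 -R 2": ("1m 5.996s", "0m 54.325s"),
}
for options, (real, user) in runs.items():
    print(f"{options}: {utilisation(real, user):.0%} of two CPUs busy in user code")
```

With the ideal case of 1m real and 2m user this returns 1.0; the three sample rows all come out at roughly 40%, which is the poor scaling being discussed in this thread.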
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
Hello

> On Tue, 2003-12-16 at 12:57, Bernhard Praschinger wrote:
> > Could you run a few tests (please). Get some frames (100-1000) in yuv
> > format. I guess that should be possible even with transcode. ;)
> > (I do not use transcode so I can't help, or get the test streams on
> > mjpeg.sf.net)
>
> With about 1010 frames of YUV using < to dump it in (instead of cat), I
> get these:
>
> -M 0: 2m 11.9s
> -M 1: 2m 10.6s, -1.3s
> -M 2: 1m 27.7s, -44.2s
> -M 3: 1m 26.5s, -45.4s

Those values look much better. :-)
Now you have seen that mpeg2enc can go faster. I have tried the command you used on my machine, and I have seen the same "problem": also 3 processes, each at only 33%.
(time lav2yuv n1000.eli | mpeg2enc -I 0 -f 8 -b 9800 -p -a 3 -o test.m2v -S -M 3 -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0)

> Note that I responded in an earlier message with a total of 24 timings
> across -M 0-3 -I 0-1 -R 0-2 settings, which turned up some interesting
> results that -M 3 -I 0 -R 1 worked fastest of all of them (same source
> material I used for the above, and it took 51 seconds). So, I think the
> -I 1 is on, which makes a huge boost in -M ratings from 0 to 3, but it
> is still quite a bit slower than -I 0 (which I use since the input is
> Progressive 23.976fps)

That's strange.

> > And do afterwards something like that:
> > cat stream.yuv | mpeg2enc -f8 -M 0-3 -o test.m2v
> > or
> > lav2yuv stream.avi | mpeg2enc -f 8 -M 0-3 -o test.m2v
> >
> > So you can be sure that nothing else makes any trouble. And then check
> > how it is going. That should not take too long. Then you can add
> > the options you used, to see if anything there causes the problem of
> > non-increasing framerate.
>
> Compared to the run with my long options line, these are

I'm just running some encodings to see which option causes the problem. On my machine the -R 0 caused the problem. If I used -R 1/2 or no -R option, I got 3 processes each using about 45-50%.

> > Bad. Which board do you have? (Mine is a Tyan Tiger MPX)
>
> Nice board, that one. Asus A7M-266D.. I should've grabbed the MSI K7D
> Master for the same price, I hear much nicer things about it.
>
> > My brain had given up the time I started my computer that evening ;)
>
> Mine usually does that at about 8am. :>

Just as you enter work? ;)

> > But I'm not really knowing why the situation is that bad.
>
> I'm just not seeing the dual CPU usage that would warrant even running
> in multiple threads, when I could instead transcode two entirely
> separate items as though I had two machines, which makes some sense (I
> did that the other day, worked rather well). But, if I can make a single
> copy work by flooding both CPUs with activity, then I'll be happier,
> since it should take quite a bit less time to encode a full movie.

Encoding without the -R 0 seems to solve the problem, for now.

Hopefully see you soon,

Berni the Chaos of Woodquarter

Email: [EMAIL PROTECTED]
www: http://www.lysator.liu.se/~gz/bernhard
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, Dec 16, 2003 at 12:45:48PM -0800, Trent Piepho wrote:
> On Tue, 16 Dec 2003, Richard Ellis wrote:
> > > 6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1. At 550MB/s,
> > > it SHOULD be able to push enough to keep the frames encoding at
> > > 100% CPU, in theory.
> >
> > Yes, but just one 720x480 DVD quality frame is larger than 256k
> > in size, so a 256k cache per CPU isn't helping too much overall
> > considering how many frames there are in a typical video to be
>
> A 720x480 4:2:0 frame is about 512KB, at 550MB/sec there is enough
> memory bandwidth to encode at about 1000 frames/sec if all you had
> to do was read the data. Obviously the encoder runs somewhat
> slower than that, so each byte of data must be accessed multiple
> times. That's where the cache helps.

With motion estimation each byte would end up being accessed more than once for each new "radius" that was examined. Plus motion estimation is between at least two frames, so we are dealing with at least about 1M of data to be accessed eventually in the course of encoding one frame.

> > Of course, Andrew would be much better suited to discuss
> > mpeg2enc's memory access patterns during encoding, which
> > depending on how it does go about accessing memory can better
> > make use of the 256k of cache, or cause the 256k of cache to be
> > constantly thrashed in and out.
>
> I seem to recall that one of the biggest performance bottlenecks of
> mpeg2enc is the way it accesses memory. It runs each step of the
> encoding process on an entire frame at a time. It's much more
> cache friendly to run every stage of the encoding process on a single
> macroblock before moving on to the next macroblock.

In that case it will kill the majority of the performance benefit provided by the caches, because there's very little locality of reference for the cache to exploit. It moves through at least 512k for pass one, then through the same 512k again for pass two, but by then the data in the cache is from the end of the frame, and we are starting over at the beginning of the frame. Massive cache thrash in that case. Memory bandwidth becomes a much more limiting factor.
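The figures quoted in this exchange can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes one byte per sample, reads "550MB/s" as 550 * 10^6 bytes per second, and ignores everything an encoder does besides reading the frame once.

```python
# Back-of-the-envelope check of the numbers quoted above (assumptions in comments).
WIDTH, HEIGHT = 720, 480

luma_bytes = WIDTH * HEIGHT                        # Y plane, one byte per pixel
chroma_bytes = 2 * (WIDTH // 2) * (HEIGHT // 2)    # 4:2:0: two quarter-size planes
frame_bytes = luma_bytes + chroma_bytes

MEMORY_BANDWIDTH = 550 * 10**6   # "550MB/s", assumed to mean 10^6-byte megabytes
CACHE_BYTES = 256 * 1024         # 256k of L2 cache per CPU

print(frame_bytes)                        # 518400 bytes
print(frame_bytes / 1024)                 # 506.25 KiB, i.e. "about 512KB"
print(MEMORY_BANDWIDTH / frame_bytes)     # roughly 1061 frame reads per second
print(2 * frame_bytes / CACHE_BYTES)      # two frames are nearly 4x the L2 cache
```

So "about 512KB" per frame and "about 1000 frames/sec" of raw read bandwidth both hold up, and the two frames involved in motion estimation come to roughly four times the 256k cache, which is consistent with the thrashing concern.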
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
Hi Steven, Trent,

> But what about bit allocation? You need to know how big the last GOP was
> to figure out how many bits you can use for the next GOP.

Actually, this is not such a big deal provided the GOPs are well separated. Simplifying a little, you just need to ensure that the decoder buffer is at least as full at the end of each 'chunk' as you assumed it would be when you started encoding its successor.

However, this idea came to mind more as a sneaky way of doing accurately sized single-pass encoding: work on multiple 'segments' spread across the video sequence so you get a good statistical sample of how your total bit-consumption is going relative to your target. This is rotten for parallelism though, because you have two more or less totally uncorrelated memory footprints. For DVD, 'segments' would kind of naturally correlate with 'chapters' at the authoring level.

In the MPEG_DEVEL branch encoding of each frame (apart from the bit-packed coding and bit allocation, which is only a small fraction of the CPU load) is simply striped across the available CPUs. This has a nice side effect of reducing each CPU's working set too, as it only deals with a fraction of a frame.

Having said all that I'll probably simply do a simple two-pass encoding mode first (much simpler frame feeding!).

> > Of course, Andrew would be much better suited to discuss mpeg2enc's
> > memory access patterns during encoding, which depending on how it
> > does go about accessing memory can better make use of the 256k of
> > cache, or cause the 256k of cache to be constantly thrashed in and
> > out.
>
> I seem to recall that one of the biggest performance bottlenecks of
> mpeg2enc is the way it accesses memory. It runs each step of the encoding
> process on an entire frame at a time. It's much more cache friendly to run
> every stage of the encoding process on a single macroblock before moving on
> to the next macroblock.

The single-macroblock approach has been implemented for quite some time now (since the move to C++, roughly). In rather basic English, speed improved by... bugger all. I was *most* surprised; it could well be that the story is rather different on multi-CPU machines. At least I like to hope the work wasn't wasted ;-)

Actually, the memory footprint of encoding is much larger than you'd think. Remember each 16x16 int16_t difference macroblock gets generated from nastily unaligned 16x16 or 16x8 uint8_t predictors and a 16x16 uint8_t picture macroblock. The difference is then DCT-ed in place into 4 8x8 int16_t DCT blocks, which are then quantised into 4 8x8 int16_t quantised DCT blocks.

Where mpeg2enc could speed up is:

- DCT blocks are in 'correct' and not transposed form. This is simply a waste, as by transposing the quantiser matrices and the scan sequence you can simply skip this.

- Each quantised DCT block is separately stored. Nice for debugging, poor for memory performance ;-)

- DCT is not combined with quantisation when this is possible.

- Motion estimation (probably wastefully) computes a lot of variances that could probably better be replaced by SAD for fast encoding modes.

- The current GOP sizing approach is wasteful. Frame type should only be decided once the best encoding modes (Intra, various inter motion prediction modes) are known. Basically, you turn a B/P frame into an I frame if you've reached your GOP length limit or it has enough Intra coded blocks that it is more compact that way. Unfortunately, the current allocation algorithm still has a few 'left over' elements that need to know GOP size in advance and that need to be replaced before this can be fixed. I'm currently working on bit-allocation (basically, a two-pass / look-ahead mode plus the above improvement).

A similar approach can be used for deciding B/P frame selection, but this is expensive in CPU as you basically have to encode each potential B frame's reference frame twice. I'm playing around with ideas for trying B frames out and, if they don't seem worthwhile, turning them off and then periodically checking if it might make sense to turn them on again.

Andrew
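To put a number on "much larger than you'd think", the buffer sizes described above can simply be added up. This is a rough sketch, not a measurement of mpeg2enc: it counts luminance only, assumes the 16x16 predictor case, and assumes a 720x480 frame as used elsewhere in this thread.

```python
# Per-macroblock buffers as described above; luminance only, chroma ignored.
PIXELS = 16 * 16                 # samples in one macroblock

picture_mb = PIXELS * 1          # uint8_t picture macroblock
predictor_mb = PIXELS * 1        # uint8_t predictor (assuming the 16x16 case)
difference_mb = PIXELS * 2       # int16_t difference, DCT-ed in place
quantised_mb = 4 * 8 * 8 * 2     # four 8x8 int16_t quantised DCT blocks

per_macroblock = picture_mb + predictor_mb + difference_mb + quantised_mb
macroblocks = (720 // 16) * (480 // 16)   # assumed 720x480 frame

print(per_macroblock)                 # 1536 bytes touched per macroblock
print(per_macroblock // picture_mb)   # 6 times the 256 source bytes
print(macroblocks)                    # 1350 macroblocks per frame
print(per_macroblock * macroblocks)   # 2073600 bytes, about 2 MB per frame
```

Even under these simplifications the data touched per frame is several times the size of the luminance plane itself, before counting motion-search reads.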
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
> Produces this (approximately 1010 frames), encoding times (real time /
> user time, gives a bit of a view as to how busy the CPUs were during the
> real time, optimal should be 1m realtime, 2m user time, right? and
> average system time was 3.0s, with +/- 0.2s for all tests):
...

Yep. You should (in theory) get a lot closer to that with the current MPEG_DEVEL branch mpeg2enc. However, your scaling is really remarkably bad, as even the -R 2 values, where two CPUs should be fairly busy, are unusually bad. I've never heard of worse than 70% utilisation on dual CPU machines. Here's a fairly typical snapshot of mpeg2enc -M 2 -I 1 -R 2 in action on my dual P-III machine...

  PID USER PRI NI  SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
12620 as    18  0 46464 45M   768 R    80.9 24.3 0:18 lt-mpeg2enc
12621 as    18  0 46464 45M   768 R    70.8 24.3 0:18 lt-mpeg2enc
12619 as     9  0 46464 45M   768 S     3.9 24.3 0:01 lt-mpeg2enc

You're getting very very symmetrical CPU loads and very very poor utilisation. What kernel are you using... I vaguely recall the 2.6.x series radically changed the threading libs. It could be something pathological is happening in the scheduling. The 2100+ is of course a lot faster than the P-III but I doubt the balance between the motion estimation and the rest of the code is hugely shifted. Certainly, the approximate proportions of time spent in each are quite similar on my 2100+ single-CPU machine and a P-III.

> Also, encoding with one B frame is a touch faster in -I 1 mode than
> encoding without them, but it is slower when you encode two B frames
> instead of just one. I find this interesting.. I would have expected a
> single B frame to take a bit longer than none at all, and that is the
> case when -I 0 is on, but not when it's -I 1. Any ideas on that one?

Not really. However, I would expect going to two B frames to greatly increase your CPU utilisation without much wall-clock time increase, due to the increased scope for parallel computation.

> In the end -M 3 is not reasonably faster in -I 0 -R 0, but flies along at
> -I 0 -R 2 compared to baseline, and gets fair gains at -I 0 -R 1, while
> dropping encoding time by another 14 seconds for the same frameset.

This is what you'd expect: -R 2 offers much more scope for the 3 worker threads of -M 3 to do something useful.

> The numbers on -M 3 -I 1 -R 2 show a 54 second improvement over the
> tests with -M 0, but it takes almost 50% longer than -M 3 -I 0 -R 1. The
> file size of 3-1-2 is 13,807,067 and the file size of 3-0-1 is
> 13,402,673. The file is smaller, and is encoded faster, and viewing them
> now, the quality is at least on par (3-0-1 looked a tad better).

The usefulness of B frames depends a *lot* on the type of material. For captured stuff they rarely buy you much apart from free room heating from your CPU. Hence the provision of -R 0 ;-). They should get a little more useful when I add dynamic frame type selection to mpeg2enc in the new year.

> > - There is also a parallel read-ahead thread but this rarely soaks much
> > CPU on modern CPUs.

Weirdly enough, on your machine the reader thread is exceedingly busy.

> > The MPEG_DEVEL branch encoder stripes all encoding phases to allow much
> > more scalable parallelisation. You might want to give it a go - I'd be
> > interested in the results!
>
> I'd love to, but I couldn't find it in CVS. I found everything else in
> the SF CVS branch, but not mjpegtools itself.

cvs -d :ext:[EMAIL PROTECTED]:/cvsroot/mjpeg co mjpeg_play
cd mjpeg_play
cvs update -r MPEG_DEVEL mpeg2enc

The 'mjpeg_play' is a bit of a historical oddity but it is monumentally painful to change directory names in CVS...

Andrew
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, 16 Dec 2003, Trent Piepho wrote:
> But what about bit allocation? You need to know how big the last GOP was to
> figure out how many bits you can use for the next GOP.

Well, you know the maximum bitrate allowed (via the -b option) - could encode each GOP with that limit in mind. I'm not sure how bits "carry over" from GOP to GOP.

Nice self-contained chunks of data should parallelize nicely - perhaps not that hard to extend to a "cluster". That'd be fast I'd think.

Cheers,
Steven Schultz
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, 16 Dec 2003, Steven M. Schultz wrote:
> > First off a bit of background to the multi-threading in the current stable
> > branch. First off:
> >
> > - Parallelism is primarily frame-by-frame. This means that the final phases
> > of the encoding lock on completion of the reference frame (prediction and DCT
>
> If one were using closed and fixed length GOPs would it make
> sense to parallelize the encoding of complete GOPs? Each cpu
> could be dispatched a set of N frames that comprise a closed GOP and
> a master thread could write the GOPs out in the correct order.

But what about bit allocation? You need to know how big the last GOP was to figure out how many bits you can use for the next GOP.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, 16 Dec 2003, Richard Ellis wrote:
> > 6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1. At 550MB/s, it
> > SHOULD be able to push enough to keep the frames encoding at 100%
> > CPU, in theory.
>
> Yes, but just one 720x480 DVD quality frame is larger than 256k in
> size, so a 256k cache per CPU isn't helping too much overall
> considering how many frames there are in a typical video to be

A 720x480 4:2:0 frame is about 512KB, so at 550MB/sec there is enough memory bandwidth to encode at about 1000 frames/sec if all you had to do was read the data. Obviously the encoder runs somewhat slower than that, so each byte of data must be accessed multiple times. That's where the cache helps.

> Of course, Andrew would be much better suited to discuss mpeg2enc's
> memory access patterns during encoding, which depending on how it
> does go about accessing memory can better make use of the 256k of
> cache, or cause the 256k of cache to be constantly thrashed in and
> out.

I seem to recall that one of the biggest performance bottlenecks of mpeg2enc is the way it accesses memory. It runs each step of the encoding process on an entire frame at a time. It's much more cache friendly to run every stage of the encoding process on a single macroblock before moving on to the next macroblock.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, 16 Dec 2003, Andrew Stevens wrote:
> Hi all,
>
> First off a bit of background to the multi-threading in the current stable
> branch. First off:
>
> - Parallelism is primarily frame-by-frame. This means that the final phases
> of the encoding lock on completion of the reference frame (prediction and DCT

If one were using closed and fixed length GOPs would it make sense to parallelize the encoding of complete GOPs? Each cpu could be dispatched a set of N frames that comprise a closed GOP and a master thread could write the GOPs out in the correct order.

But as Andrew mentioned, by the time filters and other processing are added in, a dual cpu system's pretty well saturated. Quad cpu systems are very much a niche (and expensive) item (not to mention the noise they make ;))

Cheers,
Steven Schultz
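The GOP-level dispatch proposed above can be sketched in a few lines. Everything here is hypothetical: `encode_gop` is a stand-in that only labels its input, the function names are invented for illustration, and nothing in this thread says mpeg2enc works this way. Threads are used purely to show the ordering; a real encoder would need genuinely parallel workers.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_gops(frames, gop_length):
    """Cut the frame list into fixed-length, self-contained chunks."""
    return [frames[i:i + gop_length] for i in range(0, len(frames), gop_length)]

def encode_gop(gop):
    """Stand-in for a real encoder: returns a labelled placeholder per GOP."""
    return ("GOP[" + ",".join(str(frame) for frame in gop) + "]").encode()

def encode_parallel(frames, gop_length, workers=2):
    """Hand each closed GOP to a worker and join the results in stream order."""
    gops = split_into_gops(frames, gop_length)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, which plays the role
        # of the master thread writing GOPs out in the correct sequence.
        return b"".join(pool.map(encode_gop, gops))

print(encode_parallel(list(range(6)), gop_length=3))
```

The sketch sidesteps the bit allocation question raised elsewhere in this thread: each worker here knows nothing about how many bits its neighbours used.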
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, Dec 16, 2003 at 12:33:52AM -0700, Slepp Lukwai wrote:
> On Mon, 2003-12-15 at 21:08, Richard Ellis wrote:
> > Additionally, what kind of memory do you have attached to the cpu's?
> > Mpeg encoding is very memory bandwidth hungry to begin with, and with
> > two cpu's trying to eat at the same trough, a not quite as fast as it
> > should be memory subsystem can produce results like what you are
> > seeing.
...
> ... It's a dual Athlon, which inherently means 266FSB (DDR 266),
> though the memory is actually Hynix PC3200 w/ timings set as low as
> they go on this board (2-2-2), which gives me about 550MB/s memory
> bandwidth according to memtest, with a 13GB/s L1 and something like
> 6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1. At 550MB/s, it
> SHOULD be able to push enough to keep the frames encoding at 100%
> CPU, in theory.

Yes, but just one 720x480 DVD quality frame is larger than 256k in size, so a 256k cache per CPU isn't helping too much overall considering how many frames there are in a typical video to be encoded. Plus, my experience with Athlons is that they are actually faster at mpeg2enc encoding than Intel chips of equivalent speed ratings (the Athlon's 3dnow/mmx implementation is faster) and so they put a heavier stress on one's memory bandwidth than an equivalent speed Intel chip would. It's possible that 275MB/s per CPU just isn't fast enough to keep up with the rate that mpeg2enc can consume data on an Athlon.

Of course, Andrew would be much better suited to discuss mpeg2enc's memory access patterns during encoding, which depending on how it does go about accessing memory can better make use of the 256k of cache, or cause the 256k of cache to be constantly thrashed in and out.

> > FWIW, when my desktop machine was a dual PII-400Mhz box, I almost
> > always had two mpeg2enc threads eating up 97-98%cpu on both PII
> > chips. The few times both cpu's were not fully saturated at mpeg
> > encoding was when I'd bother them with something silly like browsing
> > the web with mozilla. :)
>
> Now that's just silly. Why would you hurt the CPUs by running such bloat
> as Mozilla? I can't think of how many times Mozilla has gone nuts on me
> and used 100% CPU without reason, and you can't kill it any normal UI
> way.. Good ol' killall. However, I love it. It's a great browser. Just
> rather hungry at times. I suppose there's a reason the logo is a
> dinosaur. :>

Hmm... Interesting. I've had it sometimes just stop but never go nuts with 100% CPU, and although I usually do CLI kill it if need be, FVWM2's "destroy" window command has never failed to get rid of it if I don't bother to go CLI to do so. In fact, FVWM2's "destroy" has never failed to get rid of anything that went wonky. It's the X windows equivalent to a "kill -9" from the CLI.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, Dec 16, 2003 at 09:27:53AM -0800, Steven M. Schultz wrote:
>
> Perhaps Richard Ellis could chime in with his experiences with -Q
> ;)

It seems that with the right set of options, and the right set of input data, -Q can help to create some really nasty looking artifacts.

> > And again, son of I didn't realize the parallelization was
> > done based on interlacing settings.
>
> Looking back on it that makes sense though. A P frame depends on
> the preceding P frame - rather sequential in nature since you
> can't move on to the next one without completing the first one...

The P frame dependency chain is how the artifacts come about, based on Andrew's explanation. It's accumulated round-off error in the iDCT routines, made worse by -Q as well as -R 0 and a few other options.
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
Hello

> Top output of the 3 running mpeg2enc with mjpegtools 1.6.1.92 on the
> Dual Athlon MP 2100+. That's with -M3. Top usage is 2% and the decoder
> is only about 10% intermittent. So, I'm neglecting those for the moment.
> I'm using transcode, by the way (though I found the same results when
> not using transcode and doing a straight pipe from decoded MPEG2
> frames). Note the top dumps below ignore the memory usage (which has
> approximately 640MB of free RAM (really free, not cache or anything,
> it's a clean boot, 127 processes running in all cases)).
>
> Cpu0 : 50.0% user, 8.6% system, 0.0% nice, 41.4% idle
> Cpu1 : 53.4% user, 4.3% system, 0.0% nice, 42.2% idle
>   PID USER  PR NI  VIRT RES SHR S %CPU %MEM   TIME+ COMMAND
> 11234 slepp 16  0 43436 42m 968 S 38.2  4.2 0:16.96 mpeg2enc
> 12422 slepp 16  0 43436 42m 968 S 34.5  4.2 0:16.86 mpeg2enc
>   623 slepp 16  0 43436 42m 968 R 33.6  4.2 0:17.14 mpeg2enc
>
> Command line:
> time /usr/bin/transcode -u 120,2 -M 0 -V -q 1 -f 24,1 --color 1 -x
> mpeg2,null -y mpeg2enc,null -e 48000,16 -A -N 0x2000 -F 8,'-S -M 3
> -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0' --pulldown -w 9800
> -i 28DaysLater.m2v -o test3 --print_status 50 -c 0-1000

Could you run a few tests (please). Get some frames (100-1000) in yuv format. I guess that should be possible even with transcode. ;)
(I do not use transcode so I can't help, or get the test streams on mjpeg.sf.net)

And do afterwards something like that:
cat stream.yuv | mpeg2enc -f8 -M 0-3 -o test.m2v
or
lav2yuv stream.avi | mpeg2enc -f 8 -M 0-3 -o test.m2v

So you can be sure that nothing else makes any trouble. And then check how it is going. That should not take too long. Then you can add the options you used, to see if anything there causes the problem of non-increasing framerate.

> > I use the 2.6.0-test8 kernel. Maybe that changes the situation.
>
> I used to be using 2.5.63 or similar, but have rebuilt the machine with
> 2.4.20 with scheduling optimizations and other goodies (gentoo). I
> noticed a number of speed ups in most other parallel processes
> (cinelerra, MPI povray, gcc). Of course, most of the patches in the
> gentoo 2.4.20 kernel are stock in 2.5+ (I also used 2.6.0-test8, but
> this Asus board doesn't behave under that kernel, and it crashed
> whenever i'd load the CPUs or IDE buses :<)

Bad. Which board do you have? (Mine is a Tyan Tiger MPX)

> > Sorry if the mail is a bit confusing, [...]
>
> Hopefully this one didn't ramble on TOO long.

My brain had given up the time I started my computer that evening ;)
But I'm not really knowing why the situation is that bad.

Hopefully see you soon,

Berni the Chaos of Woodquarter

Email: [EMAIL PROTECTED]
www: http://www.lysator.liu.se/~gz/bernhard
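The test suggested above, extended to the -I and -R settings also discussed in this thread, amounts to a small sweep over option combinations. The sketch below only builds the command lines and does not run anything; the function name and the default file names are illustrative, and the flags are limited to ones that already appear in these messages.

```python
import itertools

BASE = ["mpeg2enc", "-f", "8"]   # format flag as used in the commands above

def sweep_commands(source="stream.yuv", output="test.m2v"):
    """Build one timed mpeg2enc command line per -M/-I/-R combination."""
    commands = []
    for m, i, r in itertools.product(range(4), range(2), range(3)):
        args = BASE + ["-M", str(m), "-I", str(i), "-R", str(r), "-o", output]
        commands.append("time " + " ".join(args) + " < " + source)
    return commands

for line in sweep_commands():
    print(line)
```

Printing the commands rather than executing them keeps the sketch harmless; pasting them into a shell, or feeding each line to a shell one at a time, would give the 24-run sweep.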
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
Hi all,

First off, a bit of background on the multi-threading in the current stable branch:

- Parallelism is primarily frame-by-frame. This means that the final phases of the encoding lock on completion of the reference frame (prediction and DCT transform) and the predecessor (bit allocation). If you have a really fast CPU that motion estimates and DCTs very fast, you will get lower parallelisation. If you use -R 0 you will get very little parallelism *at all*. Certainly not enough to make -M 3 sensible.
- There is also a parallel read-ahead thread, but this rarely soaks up much CPU on modern CPUs.

The MPEG_DEVEL branch encoder stripes all encoding phases to allow much more scalable parallelisation. You might want to give it a go - I'd be interested in the results!

N.b. in a 'realistic' scenario you're running the multiplexer and audio encoding in parallel with the encoder and video filters, communicating via pipes and named FIFOs. This setup usually saturates a modern dual machine.

cheers,

Andrew

PS I'm away on vacation for a couple of weeks from Friday, so there'll be a bit of a pause in answering emails / posts from then ;-)
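One rough way to read the point about frame-by-frame locking is through Amdahl's law: if only part of the work can overlap, extra threads give diminishing returns. The sketch below is an illustration only. The function names are made up, the 10fps to 14fps figure is the -M 2 result Steven reports elsewhere in this thread, and treating mpeg2enc as a simple "serial part plus perfectly parallel part" workload is an assumption, not a description of the encoder's internals.

```python
def parallel_fraction(speedup: float, workers: int) -> float:
    """Invert Amdahl's law: the fraction of the work that would have to be
    parallel to give `speedup` on `workers` threads."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / workers)


def amdahl_speedup(fraction: float, workers: int) -> float:
    """Amdahl's law: predicted speedup for a given parallel fraction."""
    return 1.0 / ((1.0 - fraction) + fraction / workers)


# Assumption: 10 fps single-threaded vs 14 fps with -M 2, as reported in the thread.
observed = 14.0 / 10.0
p = parallel_fraction(observed, 2)
print(f"implied parallel fraction: {p:.2f}")

# What the same (assumed) fraction would predict for more threads.
for n in (2, 3, 4):
    print(f"{n} threads: {amdahl_speedup(p, n):.2f}x")
```

Under this simplified model the gain from each additional thread shrinks quickly, which is at least consistent with the reports in this thread that -M 3 and -M 4 add little over -M 2.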
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Tue, 16 Dec 2003, Slepp Lukwai wrote: > Tried it without any options, same effect. I'm definitely seeing nowhere > near 40% speedup, which is what boggles me. I expected at least > reasonable gains of 25%. I think that has to do with the -I setting... > Sorry, upon further testing, I actually average around 14fps at DVD > quality (720x480, 9800kbit/s). (see all the details of my command lines Ah, that's more like it then. > It's interesting that I'm faster with dual 2100s than the dual 2800 (or > at least on par). I suppose it really comes down to command line > options, but you would need to compare those yourself (since I haven't Friend of mine has dual 2400s and my setup is ~10-15% faster as I recall - he's getting around 11fps as a rule where I see 14 or so. I'm usually adding a bit of overhead with the chroma conversion. I build smilutils with ffmpeg/libavcodec (to use ffmpeg's DV codec) and then run the data thru something like: "smil2yuv -i 2 file.dv | filters | y4mscaler -O chromass=420_MPEG2 |..." Produces better output that the default which uses libdv but does cost a bit in cpu use. > According to the docs -I 1 turns on interlacing support, and causes > un-needed overhead if it is known progressive material. Hence the -I 0 > (plus transcode sets that, though I could override it). But unless you have the raw 23.976fps progressive data (with the 3:2 pulldown undone) then I think '-I 1' is the option to use. But then I might be confused (wouldn't be the first time ;)). That would explain why the encoding rate I see is lower since I'm using -I 1. > > wrong I'm sure someone will tactfully point that out ;)) the speedup > > comes from the motion estimation of the 2 fields/frame being done in > > parallel. > > Oh. Son of a... If that's all it is... Yep - I'm fairly sure that is why you're not seeing any improvement when using "-M 2". > > without B frames. Those are computationally a lot more expensive > > than I or P frames. "-R 0" will disable B frames. 
> > I just enabled that, and that's how I'm hitting 15fps instead of 8, and > the quality is good and the size is just fine. Great! It takes, from what I've seen, extraordinarily clean sources before -R 0 has no or little effect. > to their potentials and give me the equivalent of a 4200+ ;> If it takes > 6 hours to transcode a movie because I set -r32 (I noticed a larger > difference with -4 -2 options, btw, than -r16 vs -r32), that's fine, but Yep - "-4 1" will close to double the time over "-4 2" and the difference in bitrate/filesize is measured in tenths of a percent. Hardly worth it. Not all that much difference between "-4 2" and "-4 3" though. > > better results (especially with clean source material) can be obtained > > with "-E -8" or perhaps "-E -10". > > Until I upgraded to .92, I didn't have those options. I'm using them On noisy source material the -E option has almost no effect but the cleaner the input the more effect even modest values of -E have. > now, in combination with -Q, but I find the artifacts are almost never > there (I used to do -q 4 and -Q 4.0, and it looked about the same as the > 5/3.0). Perhaps Richard Ellis could chime in with his experiences with -Q ;) > > Right, with -I 0 the cpus take turns but there's little parallelism. > > And again, son of I didn't realize the parallelization was done > based on interlacing settings. Looking back on it that makes sense though. A P frame depends on the preceeding P frame - rather sequential in nature since you can't move on to the next one without completing the first one... > The MPEG decoding doesn't take much, and the pipe overhead is negligble, Pipe overhead sneaks up on you though. One pipe? Not a real problem, two? Begins to be noticed but isn't too bad. Four or five? Yeah, it starts to take a hit on the overall speed of the system - the data has to go up/down thru the kernel all those times and that's not "free". 
> (As I write this, I'm still waiting for the -M 2 run to finish, so it'll > arrive before the tests results to Bernhard make it out). You might try, for timing purposes, without -I 0 and see what, if any effect that has. Might be a useful data point. Cheers, Steven Schultz --- This SF.net email is sponsored by: IBM Linux Tutorials. Become an expert in LINUX or just sharpen your skills. Sign up for IBM's Free Linux Tutorials. Learn everything from the bash shell to sys admin. Click now! http://ads.osdn.com/?ad_id=1278&alloc_id=3371&op=click ___ Mjpeg-users mailing li
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Mon, 2003-12-15 at 22:44, Bernhard Praschinger wrote: > Hallo > > > I was doing some testing of both the older version (1.6.1.90) and the > > newer version of mpeg2enc (1.6.1.92). First off, the .92 was somewhat > > faster to begin with. However, in both cases, after multiple tests and > > trying different things, I can't get the SMP modes to be fast at all. In > > fact, they're slower than the non-SMP modes. > With slower, I hope you mean "mpeg2enc needs more time to encode the > movie". > And not the time the encoding need in the "realtime". Slower by wallclock slower. It took less time to re-encode the entire thing with -M 0 than when I used -M 3. (I didn't let it run through 2, since it takes over 4 hours as is). (K, after all these tests, the dual stuff is running faster, but not fast enough over a full movie to even warrant the extra threads). Top output of the 3 running mpeg2enc with mjpegtools 1.6.1.92 on the Dual Athlon MP 2100+. That's with -M3. Top usage is 2% and the decoder is only about 10% intermittent. So, I'm neglecting those for the moment. I'm using transcode, by the way (though I found the same results when not using transcode and doing a straight pipe from decoded MPEG2 frames). Note the top dumps below ignore the memory usage (which has approximately 640MB of free RAM (really free, not cache or anything, it's a clean boot, 127 processes running in all cases)). 
Cpu0 : 50.0% user, 8.6% system, 0.0% nice, 41.4% idle
Cpu1 : 53.4% user, 4.3% system, 0.0% nice, 42.2% idle
  PID USER  PR NI  VIRT RES SHR S %CPU %MEM   TIME+ COMMAND
11234 slepp 16  0 43436 42m 968 S 38.2  4.2 0:16.96 mpeg2enc
12422 slepp 16  0 43436 42m 968 S 34.5  4.2 0:16.86 mpeg2enc
  623 slepp 16  0 43436 42m 968 R 33.6  4.2 0:17.14 mpeg2enc

Command line:
time /usr/bin/transcode -u 120,2 -M 0 -V -q 1 -f 24,1 --color 1 -x mpeg2,null -y mpeg2enc,null -e 48000,16 -A -N 0x2000 -F 8,'-S -M 3 -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0' --pulldown -w 9800 -i 28DaysLater.m2v -o test3 --print_status 50 -c 0-1000

Results:
[import_mpeg2.so] tcextract -x mpeg2 -i "28DaysLater.m2v" -d 1 | tcdecode -x mpeg2 -d 1 -y yv12
[export_mpeg2enc.so] *** init-v *** !
[export_mpeg2enc.so] cmd=mpeg2enc -v 0 -I 0 -f 8 -b 9800 -F 1 -n n -p -a 3 -o "test3".m2v -S -M 3 -g 9 -G 18 -4 2 -2 1 -r 32 -q 4 -Q 3.0 -K kvcd -R 0
++ WARN: [mpeg2enc] 3:2 movie pulldown with frame rate set to decode rate not display rate
++ WARN: [mpeg2enc] 3:2 Setting frame rate code to display rate = 4 (29.970 fps)
encoding frame [950], 14.93 fps, 95.2%, ETA: 0:00:03, ( 0| 0|116)
clean up | frame threads | unload modules | cancel signal | internal threads | done
[transcode] encoded 999 frames (0 dropped, 0 cloned), clip length 41.67s
73.56user 7.76system 1:09.29elapsed 117%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (2055major+31007minor)pagefaults 0swaps

(I can't find how to turn off line wrap. Sorry...)

Note I used 120 incoming frame buffers with 2 threads decoding the video. The buffer usage of transcode never dropped below 90 frames buffered, so the buffering was keeping pace.
Here's the identical command, the only thing changed is -M 3 to -M 2 (this time I included a snapshot of tcdecode, but note that it isn't always in the top 3 of the list, it comes and goes quite frequently, and the transcode buffers stay right around 110 to 116 frames): Cpu0 : 61.8% user, 7.3% system, 0.0% nice, 30.9% idle Cpu1 : 50.5% user, 12.8% system, 0.0% nice, 36.7% idle PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND 20631 slepp 19 0 39824 38m 984 R 51.1 3.9 0:03.79 mpeg2enc 14434 slepp 17 0 39824 38m 984 R 45.7 3.9 0:03.94 mpeg2enc 29969 slepp 16 0 2644 2644 668 S 13.7 0.3 0:01.95 tcdecode And the output of time (and the end of transcode): encoding frame [950], 14.33 fps, 95.2%, ETA: 0:00:03, ( 0| 0|116) clean up | frame threads | unload modules | cancel signal | internal threads | done [transcode] encoded 999 frames (0 dropped, 0 cloned), clip length 41.67s 74.89user 7.68system 1:11.95elapsed 114%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (1979major+26920minor)pagefaults 0swaps And with -M 1 instead of -M 2: Cpu0 : 87.0% user, 13.0% system, 0.0% nice, 0.0% idle Cpu1 : 22.2% user, 5.6% system, 0.0% nice, 72.2% idle PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND 31916 slepp 25 0 36192 35m 984 R 90.3 3.5 0:07.58 mpeg2enc 3690 slepp 16 0 2644 2644 668 S 14.7 0.3 0:01.91 tcdecode Note that it's now using an entire CPU (other processes keep sharing, but it's still using a full CPU). And the transcode/time results: encoding frame [950], 14.19 fps, 95.2%, ETA: 0:00:03, ( 0| 0|117) clean up | frame threads | unload modules | cancel signal | internal threads | done [transcode] encoded 999 frames (0 dropped, 0 cloned), clip length 41.67s 73.98user 7.51system 1:12
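The `time` summaries quoted in this message can be cross-checked by hand: the %CPU figure is (user + system) divided by elapsed wall-clock time. The sketch below redoes that arithmetic for the -M 3 and -M 2 runs. The helper names are invented for illustration, and truncating to a whole percent is an assumption about how this `time` prints the value.

```python
def elapsed_seconds(stamp: str) -> float:
    """Convert an elapsed field such as '1:09.29' (m:ss.ss) to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def cpu_percent(user: float, system: float, elapsed: str) -> int:
    """CPU time consumed as a percentage of wall-clock time, truncated."""
    return int(100 * (user + system) / elapsed_seconds(elapsed))


# (user seconds, system seconds, elapsed) copied from the runs in this message.
runs = {
    "-M 3": (73.56, 7.76, "1:09.29"),
    "-M 2": (74.89, 7.68, "1:11.95"),
}

for flag, (user, system, elapsed) in runs.items():
    print(flag, f"{cpu_percent(user, system, elapsed)}%CPU")
```

Both runs come out well below 200%, so averaged over the whole clip little more than one CPU's worth of work was being done at any moment, whatever the per-thread numbers in top suggest.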
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Mon, 2003-12-15 at 20:27, Steven M. Schultz wrote: > On Mon, 15 Dec 2003, Slepp Lukwai wrote: > > > faster to begin with. However, in both cases, after multiple tests and > > trying different things, I can't get the SMP modes to be fast at all. In > > fact, they're slower than the non-SMP modes. > > I think I see what you're doing that could cause that. I've never > seen the problem - using "-M 2" is not going to be 2x as fast though > if that was the expectation. ~40% speedup or so is what I see > (from about 10fps to 14fps) typically. Tried it without any options, same effect. I'm definitely seeing nowhere near 40% speedup, which is what boggles me. I expected at least reasonable gains of 25%. > > When encoding with the -M 0 with .92, I get around 19fps. When I use -M > > That's full sized (720x480) is it? Sounds more like a SVCD > or perhaps "1/2 D1" (bit of a misnomer - D1 is actually a digital > video tape deck) at 352x480.At 1/2 size yes, around 20fps or a bit > more I've seen. But I'm usually tossing in a bit of filtering so > the process is a slower. Sorry, upon further testing, I actually average around 14fps at DVD quality (720x480, 9800kbit/s). (see all the details of my command lines in the post I sent in responce to Bernhard). > > I installed 'buffer', set it up with a 32MB buffer and put it in the > > 10MB is about all I use - it's just a cushion to prevent the encoder > from having to wait (-M 1 is the default - there's I/O readahead > going on) for input. Yeh, I tried 20 first, then 32, but in the end, it made no difference at all. > > Has anyone found a way around this, or is it time to look at the source > > and see what's up? > > > And for reference, it's a dual Athlon MP 2100+, which is below the > > '2600' that the Howto references as fast. > > I'm using dual 2800s and around 14-15fps for DVD encoding is what I > usually get. It's interesting that I'm faster with dual 2100s than the dual 2800 (or at least on par). 
I suppose it really comes down to command line options, but you would need to compare those yourself (since I haven't seen yours). > > The actual command line is: > > mpeg2enc -v 0 -I 0 -f 8 -b 9800 -F 1 -n n -p -a 3 -o test.m2v -S -M > > 3 -4 2 -2 1 -r 32 -q 5 -Q 3.0 -K kvcd > > You have progressive non-interlaced source? If not then "-I 0" is > not the right option. According to the docs -I 1 turns on interlacing support, and causes un-needed overhead if it is known progressive material. Hence the -I 0 (plus transcode sets that, though I could override it). > The speed up from multiple processors comes, I believe (but if I'm > wrong I'm sure someone will tactfully point that out ;)) the speedup > comes from the motion estimation of the 2 fields/frame being done in > parallel. Oh. Son of a... If that's all it is... > Try "-I 1" (or just leave out the '-I" and let it default. > > Oh, and there's no real benefit from going above -M 2. I had a 4 > cpu box and tried "-M 4" and saw no gain over -M 3 (which in turn > was a very minimal increase over -M 2). I've never even bothered with -M 4 (well, not for a real run, anyway, just as a quick test). > If you want to speed things up by a good percentage try encoding > without B frames. Those are computationally a lot more expensive > than I or P frames. "-R 0" will disable B frames. I just enabled that, and that's how I'm hitting 15fps instead of 8, and the quality is good and the size is just fine. > And do you realize that increasing the search radius (-r) slows > things down?Leave the -r value defaulted to 16 and you should > see encoding speed up. Yup, entirely aware. I do like the minor difference it makes, though. 
I'm not in it for speed, really, I just want to see both CPUs get used to their potentials and give me the equivalent of a 4200+ ;> If it takes 6 hours to transcode a movie because I set -r32 (I noticed a larger difference with -4 -2 options, btw, than -r16 vs -r32), that's fine, but I feel it could be faster. > All in all - the defaults are fairly sane so if you're not certain > about an option, well, let it default. > > And drop the -Q unless you want artifacting - especially values over 2. > Under some conditions (it's partly material dependent) the -Q can > generate really obnoxious color blocks and similar artifacts.Much > better results (especially with clean source material) can be obtained > with "-E -8" or perhaps "-E -10". Until I upgraded to .92, I didn't have those options. I'm using them now, in combination with -Q, but I find the artifacts are almost never there (I used to do -q 4 and -Q 4.0, and it looked about the same as the 5/3.0). > > Of course the -M 3 changes to 2 and 0 in testing. I also tested it
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Mon, 2003-12-15 at 21:08, Richard Ellis wrote:

> What program are you using to monitor CPU usage while mpeg2enc runs?
> Some versions of top (if you are using top) report percentages as a
> roll-up of the whole SMP machine, so that 3x33% usage really means
> 99% utilization of the machine, where "the machine" means both
> processors combined. Other versions report a per-cpu percentage
> instead of rolling everything together.

I hate the combined ratings, so I already set up top to report per-CPU usage, so I can see 200% usage instead of it showing 50% as 100% on one CPU (it's misleading when you deal with single CPUs almost all day for work).

> Additionally, what kind of memory do you have attached to the cpu's?
> Mpeg encoding is very memory bandwidth hungry to begin with, and with
> two cpu's trying to eat at the same trough, a not quite as fast as it
> should be memory subsystem can produce results like what you are
> seeing. It's because with the two cpu's trying to run mpeg2enc, they
> together oversaturate the memory bus, causing both to wait. But with
> only one mpeg2enc thread running, the entire memory bus bandwidth is
> available to that one cpu alone.

I've noticed. I never really saw how much memory it used until I used the buffer program with -t. It was moving gigs of data for a short period of frames (perhaps 10,000 frames). It's a dual Athlon, which inherently means 266FSB (DDR 266), though the memory is actually Hynix PC3200 w/ timings set as low as they go on this board (2-2-2), which gives me about 550MB/s memory bandwidth according to memtest, with a 13GB/s L1 and something like 6 or 8GB/s L2. The cache size is 256k/CPU, 64k L1. At 550MB/s, it SHOULD be able to push enough to keep the frames encoding at 100% CPU, in theory. I don't think there's enough overhead on this machine to qualify as keeping it even half saturated. This is why I want the Corsair XMS Pro memory with load meters on them. (Per bank load meters, even).
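For scale, the raw input stream at these settings is small next to the ~550MB/s figure. A back-of-the-envelope sketch, assuming 4:2:0 YUV frames at 8 bits per sample (the yv12 format shown in the transcode log) and the roughly 14fps rate reported in this thread. It only counts reading each input frame once; motion estimation re-reads reference data many times, so this is a lower bound and not a measurement of mpeg2enc's real memory traffic.

```python
WIDTH, HEIGHT = 720, 480
FPS = 14                   # assumed encoding rate, taken from the thread
BYTES_PER_PIXEL = 1.5      # 4:2:0: full-res luma plus two quarter-res chroma planes
MEMTEST_MB_PER_S = 550     # bandwidth figure quoted from memtest

frame_bytes = int(WIDTH * HEIGHT * BYTES_PER_PIXEL)
stream_mb_per_s = frame_bytes * FPS / 1e6
share = stream_mb_per_s / MEMTEST_MB_PER_S

print(f"{frame_bytes} bytes per frame")
print(f"{stream_mb_per_s:.2f} MB/s raw input at {FPS} fps")
print(f"{share:.1%} of the quoted memory bandwidth")
```

So the input alone is nowhere near saturating the bus; whether the encoder's repeated passes over each frame get close is something only a measurement could settle.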
> FWIW, when my desktop machine was a dual PII-400Mhz box, I almost > always had two mpeg2enc threads eating up 97-98%cpu on both PII > chips. The few times both cpu's were not fully saturated at mpeg > encoding was when I'd bother them with something silly like browsing > the web with mozilla. :) Now that's just silly. Why would you hurt the CPUs by running such bloat as Mozilla? I can't think of how many times Mozilla has gone nuts on me and used 100% CPU without reason, and you can't kill it any normal UI way.. Good ol' killall. However, I love it. It's a great browser. Just rather hungry at times. I suppose there's a reason the logo is a dinosaur. :> --- This SF.net email is sponsored by: IBM Linux Tutorials. Become an expert in LINUX or just sharpen your skills. Sign up for IBM's Free Linux Tutorials. Learn everything from the bash shell to sys admin. Click now! http://ads.osdn.com/?ad_id=1278&alloc_id=3371&op=click ___ Mjpeg-users mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/mjpeg-users
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
Hello,

> I was doing some testing of both the older version (1.6.1.90) and the
> newer version of mpeg2enc (1.6.1.92). First off, the .92 was somewhat
> faster to begin with. However, in both cases, after multiple tests and
> trying different things, I can't get the SMP modes to be fast at all. In
> fact, they're slower than the non-SMP modes.

By slower, I hope you mean "mpeg2enc needs more time to encode the movie", and not the time the encoding needs in "realtime".

> When encoding with the -M 0 with .92, I get around 19fps. When I use -M
> 2 or -M 3, I get around 14fps. The CPU utilization sits at about 60 to
> 70% across both CPUs, but hits 99.9% when using just one.

That's really strange. Which program did you use for monitoring your CPU utilisation? top and/or xosview? If you used time to find out how much time was used, the important value for you is the "real" line, not the "user" line. The user line reports the time the command needed on both CPUs. On a dual machine that has nothing else to do, the real time is lower than the user time. The "overhead" you need for 2 threads increases the user time a little, but lowers the real time.

> I installed 'buffer', set it up with a 32MB buffer and put it in the
> stream, and it didn't make any difference at all. It would be nice to
> use mpeg2enc on two CPUs to it's full speed, which would net me faster
> than real-time, but thus far I haven't been able to.

What was your full command? When I use

  lav2yuv files | mpeg2enc -f 8 -o test.m2v

on my system (the 2600 Athlon MP I mentioned in the howto), mpeg2enc needs nearly 100% of one CPU and lav2yuv needs another 5-10%. Encoding 1000 frames takes this amount of time: 2m16.944s

When I add -M 2 the speedup is nice: mpeg2enc has two threads, each needing about 65-70%, and lav2yuv needs about 15%. Encoding 1000 frames takes this amount of time: 1m37.881s

Adding buffer to a simple command line does not speed up anything. buffer helps if you have a pipeline with several stages, like:

  lav2yuv | yuvdenoise | yuvscaler | mpeg2enc

> Has anyone found a way around this, or is it time to look at the source
> and see what's up?

I have no need to, because I think it works properly.

> And for reference, it's a dual Athlon MP 2100+, which is below the
> '2600' that the Howto references as fast.
>
> Of course the -M 3 changes to 2 and 0 in testing. I also tested it with
> and without the buffer program in the list. Another notable thing, is
> that with the newest version .92, -M3 causes three 33% usage processes
> to exist (leaving an entire CPU idle), while M2 causes two 60% processes
> to exist. With .90, -Mx causes 2 50-70% processes and the rest never do
> anything.

Just for fun, I tested it with -M 3, and then I saw 3 mpeg2enc threads each using about 45-50%; that improved the time needed compared to -M 2 by another 10 seconds. -M 4 didn't change much at all, only a 4th process needing about 10%. I use the 2.6.0-test8 kernel. Maybe that changes the situation.

The percent numbers reported by top have to be read carefully. At least my top reports them for a single CPU, so you can have processes using up to 200%, and then both CPUs have full load. But in the task/cpu stats line, 100% utilisation is for both CPUs.

Sorry if the mail is a bit confusing,

Hoping to see you soon,

Berni the Chaos of Woodquarter
Email: [EMAIL PROTECTED]
www: http://www.lysator.liu.se/~gz/bernhard
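The two timings in this message translate directly into frame rates and a wall-clock speedup. A small sketch of that arithmetic follows; the helper name is made up, and the timings are taken to be the "real" values from `time`, as the message recommends.

```python
def to_seconds(stamp: str) -> float:
    """Parse a shell `time` value such as '2m16.944s' into seconds."""
    minutes, rest = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(rest)


FRAMES = 1000
single = to_seconds("2m16.944s")   # default threading
dual = to_seconds("1m37.881s")     # with -M 2

speedup = single / dual
print(f"{FRAMES / single:.2f} fps -> {FRAMES / dual:.2f} fps, "
      f"speedup {speedup:.2f}x")
```

That works out to roughly a 1.4x wall-clock gain for -M 2, in line with the ~40% Steven reports in this thread.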
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Mon, Dec 15, 2003 at 01:46:32AM -0700, Slepp Lukwai wrote:

> ...
>
> Of course the -M 3 changes to 2 and 0 in testing. I also tested it
> with and without the buffer program in the list. Another notable
> thing, is that with the newest version .92, -M3 causes three 33%
> usage processes to exist (leaving an entire CPU idle), while M2
> causes two 60% processes to exist. With .90, -Mx causes 2 50-70%
> processes and the rest never do anything.

What program are you using to monitor CPU usage while mpeg2enc runs? Some versions of top (if you are using top) report percentages as a roll-up of the whole SMP machine, so that 3x33% usage really means 99% utilization of the machine, where "the machine" means both processors combined. Other versions report a per-cpu percentage instead of rolling everything together.

Additionally, what kind of memory do you have attached to the cpu's? Mpeg encoding is very memory bandwidth hungry to begin with, and with two cpu's trying to eat at the same trough, a not quite as fast as it should be memory subsystem can produce results like what you are seeing. It's because with the two cpu's trying to run mpeg2enc, they together oversaturate the memory bus, causing both to wait. But with only one mpeg2enc thread running, the entire memory bus bandwidth is available to that one cpu alone.

FWIW, when my desktop machine was a dual PII-400MHz box, I almost always had two mpeg2enc threads eating up 97-98% cpu on both PII chips. The few times both cpu's were not fully saturated at mpeg encoding was when I'd bother them with something silly like browsing the web with mozilla. :)
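The two reporting conventions described here differ only by a factor of the CPU count, which is why the same three "33%" threads can mean a nearly full machine or a half-idle one. A minimal sketch of the conversion, with hypothetical function and parameter names; it is not tied to any particular top version or option:

```python
def machine_utilisation(per_process_percent, ncpus, rolled_up):
    """Fraction of the whole machine in use (0.0 to 1.0).

    rolled_up=True:  100% means every CPU is busy.
    rolled_up=False: 100% means one CPU is busy.
    """
    total = sum(per_process_percent)
    if rolled_up:
        return total / 100.0
    return total / (100.0 * ncpus)


threads = [33.0, 33.0, 33.0]   # the three mpeg2enc threads seen with -M 3

print(machine_utilisation(threads, ncpus=2, rolled_up=True))   # 0.99
print(machine_utilisation(threads, ncpus=2, rolled_up=False))  # 0.495
```

Read as a roll-up, the machine is 99% busy; read per CPU, it is just under half busy, which matches the "entire CPU idle" observation.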
Re: [Mjpeg-users] -M 2/3 on SMP is slower than -M 0
On Mon, 15 Dec 2003, Slepp Lukwai wrote:

> faster to begin with. However, in both cases, after multiple tests and
> trying different things, I can't get the SMP modes to be fast at all. In
> fact, they're slower than the non-SMP modes.

I think I see what you're doing that could cause that. I've never seen the problem - using "-M 2" is not going to be 2x as fast though, if that was the expectation. ~40% speedup or so is what I see (from about 10fps to 14fps) typically.

> When encoding with the -M 0 with .92, I get around 19fps. When I use -M

That's full sized (720x480), is it? Sounds more like SVCD or perhaps "1/2 D1" (bit of a misnomer - D1 is actually a digital video tape deck) at 352x480. At 1/2 size, yes, around 20fps or a bit more I've seen. But I'm usually tossing in a bit of filtering so the process is a bit slower.

> I installed 'buffer', set it up with a 32MB buffer and put it in the

10MB is about all I use - it's just a cushion to prevent the encoder from having to wait for input (-M 1 is the default - there's I/O readahead going on).

> Has anyone found a way around this, or is it time to look at the source
> and see what's up?

> And for reference, it's a dual Athlon MP 2100+, which is below the
> '2600' that the Howto references as fast.

I'm using dual 2800s and around 14-15fps for DVD encoding is what I usually get.

> The actual command line is:
> mpeg2enc -v 0 -I 0 -f 8 -b 9800 -F 1 -n n -p -a 3 -o test.m2v -S -M
> 3 -4 2 -2 1 -r 32 -q 5 -Q 3.0 -K kvcd

You have progressive non-interlaced source? If not then "-I 0" is not the right option.

The speedup from multiple processors comes, I believe (but if I'm wrong I'm sure someone will tactfully point that out ;)), from the motion estimation of the 2 fields/frame being done in parallel.

Try "-I 1" (or just leave out the "-I" and let it default).

Oh, and there's no real benefit from going above -M 2. I had a 4 cpu box and tried "-M 4" and saw no gain over -M 3 (which in turn was a very minimal increase over -M 2).

If you want to speed things up by a good percentage, try encoding without B frames. Those are computationally a lot more expensive than I or P frames. "-R 0" will disable B frames.

And do you realize that increasing the search radius (-r) slows things down? Leave the -r value defaulted to 16 and you should see encoding speed up.

All in all - the defaults are fairly sane so if you're not certain about an option, well, let it default.

And drop the -Q unless you want artifacting - especially values over 2. Under some conditions (it's partly material dependent) the -Q can generate really obnoxious color blocks and similar artifacts. Much better results (especially with clean source material) can be obtained with "-E -8" or perhaps "-E -10".

> Of course the -M 3 changes to 2 and 0 in testing. I also tested it with
> and without the buffer program in the list. Another notable thing, is
> that with the newest version .92, -M3 causes three 33% usage processes

Right, with -I 0 the cpus take turns but there's little parallelism.

> to exist (leaving an entire CPU idle), while M2 causes two 60% processes
> to exist. With .90, -Mx causes 2 50-70% processes and the rest never do

Hmmm, I see 100% use on the two 2800s - but some of that would be the DV decoding and pipe overhead of course.

First thing I'd try is lowering -r to 24 at most or just defaulting it.

Cheers,

Steven Schultz