Re: CPU cache utilities? and P4 info

Evan T. Kaiser Thu, 28 Dec 2000 13:40:27 -0800

All L1 and L2 cache handling is done in hardware, and I doubt there are any utilities that would actually speed them up in any way. Hardware cache is supposed to be transparent to the operating system and software, and although in theory I imagine you could optimize software to work better with a certain cache configuration (e.g. 256k of on-die L2 cache or 512k of slower, on-cartridge cache), it'd involve recompilation of whatever software is involved, which is complicated. I'm no programmer, so I can't talk about such things with authority. If there are any programmers here that know anything about that kind of optimization, I'd be interested in hearing your input.
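For what it's worth, cache-aware optimization of the kind mentioned above usually comes down to the order in which a program touches memory. Here is a rough Python sketch using a toy direct-mapped cache model. The cache geometry (256 lines of 32 bytes), the 512x512 array, and all function names are made up for illustration and are not meant to describe any real CPU.

```python
def count_misses(addresses, cache_lines=256, line_bytes=32):
    """Count misses in a toy direct-mapped cache (illustrative model only)."""
    tags = [None] * cache_lines          # which memory block each slot holds
    misses = 0
    for addr in addresses:
        block = addr // line_bytes       # memory block containing this byte
        slot = block % cache_lines       # direct-mapped: one possible slot
        if tags[slot] != block:          # slot holds something else: a miss
            tags[slot] = block
            misses += 1
    return misses


def row_major(rows, cols, elem_bytes=4):
    """Byte addresses touched when walking a 2D array row by row."""
    return [(r * cols + c) * elem_bytes
            for r in range(rows) for c in range(cols)]


def col_major(rows, cols, elem_bytes=4):
    """Byte addresses touched when walking the same array column by column."""
    return [(r * cols + c) * elem_bytes
            for c in range(cols) for r in range(rows)]


ROWS = COLS = 512                        # hypothetical 1 MB array of 4-byte values
row_misses = count_misses(row_major(ROWS, COLS))
col_misses = count_misses(col_major(ROWS, COLS))
print("row-major misses:   ", row_misses)
print("column-major misses:", col_misses)
```

With these made-up parameters, the row-by-row walk misses once per cache line, while the column-by-column walk misses on every access. Restructuring software around that kind of difference is what tuning for a particular cache configuration amounts to.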
Now, someone asked not terribly long ago about the new Pentium 4, so I'll give a quick overview of what's different about it. The P4 is currently being marketed as a workstation and multimedia system chip, and for good reason. The platform is too expensive for the average user, and its performance is currently NOT that great in most applications.
The P4 is a new core, meaning it's not just a glorified Pentium Pro like the Celeron, Celeron II, Pentium II, and Pentium III are. The first substantial change with the core is that it has a 20-stage pipeline, which is pretty long. The pipeline is essentially the chain of steps a processor has to go through to execute an instruction, and different pipeline lengths provide different advantages and disadvantages.
Short pipelines get more work done per clock cycle and suffer only small penalties from branch mispredictions (see below), but they are limited in clock speed. Motorola's PowerPC 7400, a.k.a. G4, is a good example of this. I think it has a 4-stage pipeline, which is quite short, but it maxes out at only 500MHz on a .22-micron fabrication process. It gets a lot more work done per cycle than a P3, because the P3 has a 10-stage pipeline, but as we know the P3 maxes out at a much higher clock frequency (1GHz+ on a .18-micron fab). As you have surely guessed by now, longer pipelines allow for higher clock speeds, but they also get less done per clock cycle because each stage in the pipeline does less work. Thus, the 20-stage pipeline of the P4 will allow it to get to very high clock speeds, but the tradeoff is that it's not as fast as other processors per MHz.
Another disadvantage of long pipelines is that branch mispredictions are very costly. Modern processors include branch prediction hardware, which attempts to guess the outcome of a decision (a branch) in the code so the processor can start on the follow-up work before the real answer is known. Here's an oversimplified and unrealistic example. Say one part of the processor is working on figuring out what X + 2 is. If the result is 3, the next step would be to multiply 3 by Y. But if the result isn't 3, the next step is to divide X by Y. If the branch predictor thinks that X + 2 is most likely to be 3, the processor will start calculating 3 x Y with otherwise idle parts of the chip. But if the result isn't 3, that work is wasted, and the pipeline stages holding it have to be flushed so they can start over with the correct calculation. If you're flushing 20 stages, that's a lot of lost work, so in software where branches are difficult to predict, the P4 will suffer, even though its branch prediction unit is quite good, simply because of its very long pipeline.
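To put rough numbers on the misprediction penalty, here is a back-of-the-envelope Python sketch. It assumes a very simplified model in which a flush costs about as many cycles as the pipeline has stages. The branch frequency, the misprediction rate, and the function name are hypothetical values chosen for illustration, not measurements of any real chip.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Average cycles per instruction once misprediction flushes are included.

    Simplified model: every mispredicted branch costs a fixed number of
    wasted cycles on top of the base cost of each instruction.
    """
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


# Hypothetical workload: 20% of instructions are branches, 10% of those are
# mispredicted, and the flush penalty is taken to equal the pipeline depth.
ten_stage = effective_cpi(1.0, 0.20, 0.10, flush_penalty_cycles=10)
twenty_stage = effective_cpi(1.0, 0.20, 0.10, flush_penalty_cycles=20)

print("10-stage pipeline:", round(ten_stage, 2), "cycles per instruction")
print("20-stage pipeline:", round(twenty_stage, 2), "cycles per instruction")
```

Under these assumptions the deeper pipeline loses twice as many cycles to flushes, so it needs a correspondingly higher clock speed, better branch prediction, or both just to break even.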
The P4 has only 8k of L1 data cache, which is half of the P3's 16k. This was done partly to conserve die space (cache takes up a lot of space) and decrease complexity, but it also allows for somewhat higher clock speeds, since the fewer transistors there are, the easier it is to reach high frequencies. Instead of using an instruction cache like previous x86 processors do, the P4 uses an interesting new cache called a trace cache. Modern x86 processors use RISC cores, but since the x86 instruction set is a CISC one, the x86 instructions need to be translated to RISC ones so the cores can execute them. Previous instruction caches stored instructions in x86 form, but the P4's trace cache stores them in decoded RISC form, so they don't have to be decoded again each time they run. The trace cache is sized in decoded micro-ops rather than bytes, and Intel quotes it at about 12K micro-ops. I believe it uses a compression algorithm to increase the number of decoded instructions it can store, which is good. The P4's L2 cache is also much faster than the P3's, as it can be accessed every clock cycle, whereas the P3's could be accessed only every other clock cycle.
Another interesting thing about the P4 is that it has a fairly weak FPU (math co-processor). This was another thing that was stripped down to reduce die space. Instead, Intel concentrated on SSE2, the second generation of Streaming SIMD Extensions, which are new instructions intended to take over most of the functionality of the old x86 FPU. Unfortunately, software needs to be compiled to take advantage of SSE2, meaning that when a P4 runs software that isn't, it doesn't do terribly well. However, it's extremely powerful with SSE- and SSE2-optimized software, like Photoshop and Quake 3 Arena. That's a big reason why the P4 is currently not that great: more software needs to be optimized for it before it becomes a worthwhile buy.
As you have probably heard, the P4 uses a new chipset, the i850, which is a glorified yet stripped-down i840, Intel's P3 workstation chipset. The 850 uses dual-channel Rambus memory, which means memory will cost out the nose, but its most impressive feature is its FSB. It uses a quad-rate 100MHz FSB, giving an effective bus speed of 400MHz. That's a lot. Matched with dual-channel Rambus, that's an awful lot of memory bandwidth. Not much software today takes advantage of that kind of bandwidth, so yet again the P4 will work better with future software that does.
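For anyone who wants the bandwidth spelled out, here is a small Python calculation of peak (theoretical) transfer rates. The bus widths are not stated above and are filled in as assumptions: a 64-bit front-side bus on both the P3 and P4, and PC800 RDRAM with 16-bit channels clocked at 400MHz and transferring on both clock edges. The function name is made up for this sketch.

```python
def peak_bandwidth_mb_s(clock_mhz, transfers_per_clock, bus_bytes, channels=1):
    """Theoretical peak bandwidth in MB/s (millions of bytes per second)."""
    return clock_mhz * transfers_per_clock * bus_bytes * channels


# Assumed widths: 64-bit (8-byte) FSB; 16-bit (2-byte) RDRAM channels.
p4_fsb = peak_bandwidth_mb_s(100, 4, 8)                   # quad-rate 100MHz bus
dual_rdram = peak_bandwidth_mb_s(400, 2, 2, channels=2)   # two PC800 channels
p3_fsb = peak_bandwidth_mb_s(133, 1, 8)                   # single-rate 133MHz bus

print("P4 FSB:            ", p4_fsb, "MB/s")
print("Dual-channel RDRAM:", dual_rdram, "MB/s")
print("P3 133MHz FSB:     ", p3_fsb, "MB/s")
```

Under those assumptions the memory and the bus are evenly matched at about 3.2GB/s each, roughly three times what a 133MHz P3 bus can move. These are peak figures; sustained throughput in real software will be lower.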
In terms of SETI@home performance, I'm inclined to think that the P4 will not be all that great. Although it runs at high frequencies and has very powerful cache, its FPU is fairly weak, definitely weaker than that of a 1GHz P3. I don't know if its impressive cache throughput can compensate for that. Now, if SETI@home were optimized for SSE2, it'd be a different story, but I wouldn't hold my breath for that to happen.
In general, I would NOT recommend that any of you buy a Pentium 4 unless you run Photoshop all day. It's very expensive and currently just not worth it for most applications. Additionally, the P4s that are out now will be very difficult to upgrade in the future, as Intel plans to release a different socket for the P4 in the middle of next year, meaning that you'd be stuck with your current motherboard and might not be able to upgrade past 1.7GHz or so. If you're thinking about getting a P4, I'd seriously wait until the second half of 2001 before you even consider buying one. By then, more software will take advantage of its strengths, and you won't be stuck with an almost non-upgradeable motherboard.

Evan

At 05:39 PM 12/28/2000 +1000, you wrote:

Hi there from Australia :o)

I'm relatively new to the list and have been religiously processing SETI@home data with 50% of my resources only since accidentally discovering the software on November 5th 2000, after a life-long interest in the SETI Project.

It seems there are many intelligent technically-minded people on this list, so I have a quick question...

I've recently updated my machine to a P3-800MHz but the motherboard is only running at 133MHz, so for every floating point operation (FLOP) the reasonably modest CPU has to wait 6 clock cycles for the bus to catch up.

Do any of you bright people out there know of any Level 1 and/or 2 cache optimisation utilities out there? Or do you reckon the existing L1 / L2 architecture on board the Pentium-III is already running as well as it's going to get? (I'm shying away from hardware procedures such as "clocking-up" since I only do this on my older, superseded machines which are past their shelf-life anyway...)

With something like 3½ teraFLOPs (3.54 x 10^12 operations) to calculate per workunit in SETI@home v3.03, even a minutely optimised method would be cumulatively advantageous.

Feel free to reply directly if this has already been discussed before or u want to avoid mailing list clutter :-)
Thanks in advance!

--
Regards, Brett M.

Home:           [EMAIL PROTECTED] (2mb mailbox only)
Work:           [EMAIL PROTECTED] (Political-correctness filter!)
Site:           ftp://ninja:duck@ae????.powerup.com.au (ask for dynamic ip)
ICQ:            icq#6969350
SMS:   (+61) 040 7020 168
