Re: LOCK overheads (was Re: objtrm problem probably found)
A bit late, but some more data points.

90MHz Pentium, FreeBSD 2.2.7

mode  0   60.80 ns/loop nproc=1 lcks=EMPTY
mode  1   91.13 ns/loop nproc=1 lcks=no
mode  2   91.11 ns/loop nproc=2 lcks=no
mode  3  242.59 ns/loop nproc=1 lcks=yes
mode  4  242.69 ns/loop nproc=2 lcks=yes
mode  5  586.27 ns/loop nproc=1 lcks=no
mode  6  586.91 ns/loop nproc=2 lcks=no
mode  7  749.28 ns/loop nproc=1 lcks=yes
mode  8  746.70 ns/loop nproc=2 lcks=yes
mode  9  181.96 ns/loop nproc=1 lcks=EMPTY
mode 10  242.56 ns/loop nproc=1 lcks=no
mode 11  242.69 ns/loop nproc=2 lcks=no
mode 12  343.80 ns/loop nproc=1 lcks=yes
mode 13  343.77 ns/loop nproc=2 lcks=yes
mode 14  727.79 ns/loop nproc=1 lcks=no
mode 15  729.95 ns/loop nproc=2 lcks=no
mode 16  850.10 ns/loop nproc=1 lcks=yes
mode 17  848.02 ns/loop nproc=2 lcks=yes

200MHz Pentium Pro, -current, same binary as above

mode  0   42.76 ns/loop nproc=1 lcks=EMPTY
mode  1   32.01 ns/loop nproc=1 lcks=no
mode  2   33.30 ns/loop nproc=2 lcks=no
mode  3  191.30 ns/loop nproc=1 lcks=yes
mode  4  191.62 ns/loop nproc=2 lcks=yes
mode  5   93.12 ns/loop nproc=1 lcks=no
mode  6   94.54 ns/loop nproc=2 lcks=no
mode  7  195.16 ns/loop nproc=1 lcks=yes
mode  8  200.91 ns/loop nproc=2 lcks=yes
mode  9   65.83 ns/loop nproc=1 lcks=EMPTY
mode 10   90.32 ns/loop nproc=1 lcks=no
mode 11   90.33 ns/loop nproc=2 lcks=no
mode 12  236.61 ns/loop nproc=1 lcks=yes
mode 13  236.70 ns/loop nproc=2 lcks=yes
mode 14  120.83 ns/loop nproc=1 lcks=no
mode 15  122.12 ns/loop nproc=2 lcks=no
mode 16  276.92 ns/loop nproc=1 lcks=yes
mode 17  277.19 ns/loop nproc=2 lcks=yes

200MHz Pentium Pro, -current, compiled with -current compiler

mode  0   35.30 ns/loop nproc=1 lcks=EMPTY
mode  1   22.13 ns/loop nproc=1 lcks=no
mode  2   22.31 ns/loop nproc=2 lcks=no
mode  3  186.26 ns/loop nproc=1 lcks=yes
mode  4  186.39 ns/loop nproc=2 lcks=yes
mode  5   75.61 ns/loop nproc=1 lcks=no
mode  6   78.52 ns/loop nproc=2 lcks=no
mode  7  191.46 ns/loop nproc=1 lcks=yes
mode  8  191.65 ns/loop nproc=2 lcks=yes
mode  9   69.34 ns/loop nproc=1 lcks=EMPTY
mode 10   86.68 ns/loop nproc=1 lcks=no
mode 11   86.49 ns/loop nproc=2 lcks=no
mode 12  237.49 ns/loop nproc=1 lcks=yes
mode 13  236.67 ns/loop nproc=2 lcks=yes
mode 14  134.96 ns/loop nproc=1 lcks=no
mode 15  134.99 ns/loop nproc=2 lcks=no
mode 16  276.90 ns/loop nproc=1 lcks=yes
mode 17  277.33 ns/loop nproc=2 lcks=yes

Not exactly sure what all this means, but whatever mode 17 is, it can sure be expensive.. of course this is a UP machine...

julian

To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-current" in the body of the message
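One way to read julian's numbers is to subtract an lcks=no time from the matching lcks=yes time. The sketch below does this for modes 1 and 3 (both nproc=1) on the three configurations. Pairing those two modes is an assumption, since the post does not say what each mode does, and the `TIMINGS` table and `lock_overhead_ns` helper are names made up for this illustration.

```python
# (unlocked_ns, locked_ns) per loop for mode 1 (lcks=no) and mode 3 (lcks=yes),
# copied from the three tables in julian's message. Pairing these two modes is
# an assumption; the post does not describe what each mode does.
TIMINGS = {
    "Pentium 90, FreeBSD 2.2.7": (91.13, 242.59),
    "Pentium Pro 200, same binary": (32.01, 191.30),
    "Pentium Pro 200, -current compiler": (22.13, 186.26),
}


def lock_overhead_ns(unlocked_ns, locked_ns):
    """Extra time per loop attributed to locking (hypothetical helper)."""
    return locked_ns - unlocked_ns


for machine, (unlocked, locked) in TIMINGS.items():
    extra = lock_overhead_ns(unlocked, locked)
    print(f"{machine}: {extra:.2f} ns/loop extra with lcks=yes")
```

Under that pairing the extra cost comes out at roughly 151, 159 and 164 ns per loop respectively.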
Re: LOCK overheads (was Re: objtrm problem probably found)
According to Matthew Dillon:
> Wow, now that *is* expensive! The K6 must be implementing it in microcode for it to be that bad.

K6-200:

244 [21:57] roberto@keltia:src/C ./locktest 0
...
empty            26.84 ns/loop
1proc            22.62 ns/loop
2proc            22.64 ns/loop
empty w/locks    17.58 ns/loop
1proc w/locks   288.28 ns/loop
2proc w/locks   288.16 ns/loop

It hurts :(

-- 
Ollivier ROBERT -=- FreeBSD: The Power to Serve! -=- [EMAIL PROTECTED]
FreeBSD keltia.freenix.fr 4.0-CURRENT #72: Mon Jul 12 08:26:43 CEST 1999
Re: LOCK overheads (was Re: objtrm problem probably found)
Matthew Dillon [EMAIL PROTECTED] wrote:
> :mode 1 17.99 ns/loop nproc=1 lcks=no
> :mode 3 166.33 ns/loop nproc=1 lcks=yes
> ...
> :This is a K6-2 350. Locks are pretty expensive on them.
>
> Wow, now that *is* expensive! The K6 must be implementing it in microcode for it to be that bad.

I wouldn't be surprised if lock prefixes did result in microcode execution. As I stated yesterday, I don't believe locked instructions are executed frequently enough to warrant special handling, and they are therefore likely to be implemented in whichever way needs the least chip area. Since you need to be able to track and mark the memory references associated with the instruction, the cheapest implementation (in terms of dedicated chip area) is likely to be something like:

 1. wait until all currently executing instructions are complete,
 2. wait until all pending memory writes are complete (at least to L1 cache),
 3. assert the lock pin and execute the RMW instruction without allowing any other instructions to commence,
 4. deassert the lock pin.

This is (of course) probably the worst case for execution time as seen by that CPU - though it's not far off optimal as seen by other CPUs. (`Assert lock pin' should also be mapped into a `begin locked memory reference' using whatever cache coherency protocol is being used.)

I'm not saying that you can't implement a locked RMW sequence a lot better, but until the chip architects decide that the performance is an issue, they aren't likely to spend any silicon on it. The big IA-32 market is UP systems running games - and locked RMW instructions don't affect this market. Intel see the high end of the market (where SMP, and hence locked RMW, is more of an issue) moving to Merced. This suggests that it's unlikely that the IA-32 will ever acquire a decent lock capability (though at least the PIII is no worse than the PII).

That said, the above timings make a lock prefix cost over 50 core clocks (or 15 bus clocks) - even microcode couldn't be that bad.

My other timings (core/bus cycles) were: 486DX2: 20/10, Pentium: 28/7, P-II: 34/8.5, P-III: 34/7.5. I suspect that these timings are a combination of inefficient on-chip implementation of the lock prefix (see above for my reasoning behind this), together with poor off-chip handling of locked cycles.

Peter
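The "over 50 core clocks (or 15 bus clocks)" figure in Peter's message can be checked against the K6-2 350 timings quoted at the top of it. The sketch below is only that arithmetic: the core clock is taken from the "K6-2 350" description, while the 100 MHz bus clock is an assumption not stated anywhere in the thread, and `lock_cost_in_clocks` is a hypothetical helper name.

```python
def lock_cost_in_clocks(unlocked_ns, locked_ns, clock_mhz):
    """Express the locked-minus-unlocked time as clock periods at clock_mhz."""
    extra_ns = locked_ns - unlocked_ns
    # One clock period lasts 1000 / clock_mhz nanoseconds.
    return extra_ns * clock_mhz / 1000.0


# K6-2 350 figures quoted in the message (mode 1 = lcks=no, mode 3 = lcks=yes).
UNLOCKED_NS = 17.99
LOCKED_NS = 166.33

CORE_MHZ = 350.0  # from the "K6-2 350" description
BUS_MHZ = 100.0   # assumption: the bus speed is not given in the thread

core_clocks = lock_cost_in_clocks(UNLOCKED_NS, LOCKED_NS, CORE_MHZ)
bus_clocks = lock_cost_in_clocks(UNLOCKED_NS, LOCKED_NS, BUS_MHZ)

print(f"core clocks: {core_clocks:.1f}")  # about 51.9
print(f"bus clocks:  {bus_clocks:.1f}")   # about 14.8
```

With those inputs the difference of 148.34 ns works out to about 52 core clocks and just under 15 bus clocks, in line with the figures quoted.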