Re: [HACKERS] Spinlock backoff algorithm
Gregory Stark wrote:
> "Tom Lane" <[EMAIL PROTECTED]> writes:
>> Mark Mielke <[EMAIL PROTECTED]> writes:
>>> Tom Lane wrote:
>>>> My goodness that's a hardware-dependent proposal.  Shall we discuss
>>>> how many CPUs there are where an integer division is *slower* than
>>>> a floating-point op?
>>> Do you have one in mind, or is this a straw man? :-)
>> I've got one upstairs (HPPA), and I believe that it's actually a pretty
>> common situation in scientifically-oriented workstations from a few
>> years back.
> I think floating point is fast on many common platforms, even many i386
> variants.  But usually that's assuming you're comparing doing a whole
> bunch of work in floating point or integer math.  Converting a bunch of
> integers to floating point for a single operation doesn't seem like a
> case that's going to shine on any floating point unit.

Just for completeness: a task context switch is more complex (slower) when the application uses FP operations.  It is not important for PostgreSQL, because it already uses FP operations in many places, but it is good to know.

Zdenek

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Spinlock backoff algorithm
"Tom Lane" <[EMAIL PROTECTED]> writes:
> Mark Mielke <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> My goodness that's a hardware-dependent proposal.  Shall we discuss
>>> how many CPUs there are where an integer division is *slower* than
>>> a floating-point op?
>
>> Do you have one in mind, or is this a straw man? :-)
>
> I've got one upstairs (HPPA), and I believe that it's actually a pretty
> common situation in scientifically-oriented workstations from a few
> years back.

I think floating point is fast on many common platforms, even many i386 variants.  But usually that's assuming you're comparing doing a whole bunch of work in floating point or integer math.  Converting a bunch of integers to floating point for a single operation doesn't seem like a case that's going to shine on any floating point unit.

And there are plenty of platforms with *no* floating point.  On embedded 486, ARM, etc. the floating point will be emulated by either the compiler or the kernel.  But as Tom said, there are plenty of places using floating point arithmetic in the backend.  The planner does all of its selectivity and costing in floating point, for example.

I wonder if you wouldn't be better off going whole hog with soft floating point emulation.  I bet Sun's cc has an option to implement floating point arithmetic entirely in user-side software.  Another option might be to compile just some modules with soft float and do utils/adt/*.c using hard float -- so user-defined types still use the floating point unit, but the database's internal arithmetic is done using software-emulated math.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
Off-Topic: Floating point division academia? (was: Re: [HACKERS] Spinlock backoff algorithm)
Martijn van Oosterhout wrote:
> On Wed, Nov 14, 2007 at 06:57:18PM -0800, Josh Berkus wrote:
>> The question is, for our most common platforms (like AMD and Intel) is
>> the FPU notably slower/more contended than integer division?  I had the
>> impression that it was, but my knowledge of chip architectures is liable
>> to be out of date.  Can we have a hardware geek speak up?
> I did some searching and the best figures I can find for integer division
> are 15-40 cycles, whereas for floating point the best I found was 5
> cycles.  The FP units on modern processors are very highly optimised, but
> the figures quoted are for pipelined division, which is not what we're
> doing here.

I agree with Tom that, although the user might be experiencing odd issues because PostgreSQL was not designed for 32 light-weight cores like the Niagara's, this particular issue is unlikely to be the primary one.

In response to the academic discussion above: did you also look up the time to convert from INT -> FLOAT, and then FLOAT -> INT?  Also, when dividing by a constant, compilers perform automatic optimization tricks, such as:

$ cat a.c
int f(int i) { return i / 7; }
$ gcc -O3 -S a.c
$ cat a.s
...
.globl f
        .type   f, @function
f:
.LFB2:
        movl    %edi, %eax
        movl    $-1840700269, %edx
        imull   %edx
        leal    (%rdx,%rdi), %eax
        sarl    $31, %edi
        sarl    $2, %eax
        subl    %edi, %eax
        ret

No divide.  No need to convert from INT -> FLOAT -> INT.
Here is non-conclusive evidence both that integer division by a constant may still be faster on a fairly modern processor (AMD X2 3800+), and that, as Tom believes, this issue is a waste of time:

$ cat a.c
#define MAX_RANDOM_VALUE (0x7FFF)

int old_f(int i) { return ((double) i / (double) MAX_RANDOM_VALUE) + 0.5; }
int new_f(int i) { return (i + MAX_RANDOM_VALUE - 1) / MAX_RANDOM_VALUE; }
$ cat old.c
int old_f(int);
int new_f(int);
int main() { for (int i = 0; i < 100000000; i++) old_f(50); return 0; }
$ cat new.c
int old_f(int);
int new_f(int);
int main() { for (int i = 0; i < 100000000; i++) new_f(50); return 0; }
$ gcc -o old -O3 -std=c99 old.c a.c
$ gcc -o new -O3 -std=c99 new.c a.c
$ time ./old
./old  1.58s user 0.00s system 99% cpu 1.590 total
$ time ./old
./old  1.31s user 0.00s system 99% cpu 1.314 total
$ time ./old
./old  1.14s user 0.00s system 99% cpu 1.142 total
$ time ./old
./old  1.46s user 0.00s system 99% cpu 1.461 total
$ time ./new
./new  0.68s user 0.00s system 99% cpu 0.682 total
$ time ./new
./new  0.70s user 0.01s system 99% cpu 0.703 total
$ time ./new
./new  0.70s user 0.00s system 99% cpu 0.705 total
$ time ./new
./new  0.70s user 0.00s system 99% cpu 0.703 total

I say non-conclusive, because:

1) My CPU is probably doing branch prediction and may possibly be doing out-of-order execution.  Since it has more integer units than floating point units, this works in favour of integers.  The contention case described by the original poster does not allow for multiple integer units to be in use at the same time.

2) As Tom has already said, 100 million divides in either 0.7 seconds or 1.5 seconds is still orders of magnitude more than the contended case would ever be called.

For the future, please consider doing integer division when working with integers and the divisor is a constant.  For this case, it doesn't seem worth the risk of breaking the code.

Cheers,
mark

--
Mark Mielke <[EMAIL PROTECTED]>
Re: [HACKERS] Spinlock backoff algorithm
On Wed, Nov 14, 2007 at 06:57:18PM -0800, Josh Berkus wrote:
> The question is, for our most common platforms (like AMD and Intel) is the
> FPU notably slower/more contended than integer division?  I had the
> impression that it was, but my knowledge of chip architectures is liable
> to be out of date.
>
> Can we have a hardware geek speak up?

I did some searching and the best figures I can find for integer division are 15-40 cycles, whereas for floating point the best I found was 5 cycles.  The FP units on modern processors are very highly optimised, but the figures quoted are for pipelined division, which is not what we're doing here.

http://www.behardware.com/art/imprimer/623/

I did find a thread about how integer division on the Itanium was implemented by copying the integers to the FP registers, doing the division and rounding there, and copying back.  So at least there it's not true.

http://gcc.gnu.org/ml/gcc-patches/2005-12/msg01518.html

Hope this helps,
--
Martijn van Oosterhout <[EMAIL PROTECTED]>  http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution
> inevitable.
>   -- John F Kennedy
Re: [HACKERS] Spinlock backoff algorithm
Josh Berkus <[EMAIL PROTECTED]> writes:
> Nah, if it's only Niagara, it's not worth bothering.

It's not only that aspect of it --- it's that I am 100% convinced that Magne has misidentified the source of whatever FPU contention he's seeing.  The floating-point code in s_lock() is executed only just after having returned from a sleep that is at least one millisecond and often many times that.  If Niagara cannot handle a few kiloflops then you need to find some other company to work for ;-)

I am interested to find out what the true cause of the reported FPU contention is, but I'll bet our next lunch that s_lock.c ain't it.

regards, tom lane
Re: [HACKERS] Spinlock backoff algorithm
Greg,

> That says precisely nothing about the matter at hand.  Someone should
> simply change it and benchmark it in pgsql.  I doubt you'll see a
> difference there on regular AMD/Intel ... and if it makes the sun
> hyperthreaded cpu happier...

Nah, if it's only Niagara, it's not worth bothering.  I was under the impression that most x86 chips had similar issues around having fewer FP units than integer units, but I could be wrong.  It's also worth thinking about the new Intel multi-core architectures; do they starve FP as well?

On a busy OLTP system, spinlock waits get called hundreds of times per second, so, like ProcArrayLock, it's frequent enough to call for micro-optimization.  I think maybe we should try making the change and testing it on Niagara and on some standard x86 platforms, and then on the new x86 architectures, to see if it makes any difference in CPU utilization.

FYI, if nobody had guessed, this is coming out of a study Magne is doing on improving PostgreSQL SMP scalability.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
Re: [HACKERS] Spinlock backoff algorithm
Gregory Maxwell wrote:
> On Nov 14, 2007 10:12 PM, Joshua D. Drake <[EMAIL PROTECTED]> wrote:
>> http://www.intel.com/performance/server/xeon/intspd.htm
>> http://www.intel.com/performance/server/xeon/fpspeed.htm
> That says precisely nothing about the matter at hand.  Someone should
> simply change it and benchmark it in pgsql.  I doubt you'll see a
> difference there on regular AMD/Intel ... and if it makes the sun
> hyperthreaded cpu happier...

Are you offering?

Joshua D. Drake
Re: [HACKERS] Spinlock backoff algorithm
On Nov 14, 2007 10:12 PM, Joshua D. Drake <[EMAIL PROTECTED]> wrote:
> http://www.intel.com/performance/server/xeon/intspd.htm
> http://www.intel.com/performance/server/xeon/fpspeed.htm

That says precisely nothing about the matter at hand.  Someone should simply change it and benchmark it in pgsql.  I doubt you'll see a difference there on regular AMD/Intel ... and if it makes the sun hyperthreaded cpu happier...
Re: [HACKERS] Spinlock backoff algorithm
On Nov 14, 2007, at 6:57 PM, Josh Berkus wrote:
> Tom,
>> I've got one upstairs (HPPA), and I believe that it's actually a pretty
>> common situation in scientifically-oriented workstations from a few
>> years back.
> Last I checked, scientific workstations aren't exactly a common platform
> for PostgreSQL servers.  The question is, for our most common platforms
> (like AMD and Intel) is the FPU notably slower/more contended than
> integer division?  I had the impression that it was, but my knowledge of
> chip architectures is liable to be out of date.  Can we have a hardware
> geek speak up?

Somewhat.  The last version of K7 I looked at had three integer execution units versus one floating point unit.  They're also scheduled fairly independently, meaning that casts from double to integer or back again will have some minor negative effects on the pipeline or the scheduler, more than the use of floating point itself.

In the grand scheme of things, though, I don't believe it's a big deal for typical code on most modern desktop CPUs, certainly not compared to memory starvation, use of less than optimal compilers, and all the other reasons the pipeline might stall.  I might care in the innermost of inner loops, but possibly not even then unless a profiler told me differently.

Cheers,
  Steve
Re: [HACKERS] Spinlock backoff algorithm
Josh Berkus wrote:
> Tom,
>> I've got one upstairs (HPPA), and I believe that it's actually a pretty
>> common situation in scientifically-oriented workstations from a few
>> years back.
> Last I checked, scientific workstations aren't exactly a common platform
> for PostgreSQL servers.  The question is, for our most common platforms
> (like AMD and Intel) is the FPU notably slower/more contended than
> integer division?  I had the impression that it was, but my knowledge of
> chip architectures is liable to be out of date.  Can we have a hardware
> geek speak up?

http://www.intel.com/performance/server/xeon/intspd.htm
http://www.intel.com/performance/server/xeon/fpspeed.htm

Sincerely,

Joshua D. Drake
Re: [HACKERS] Spinlock backoff algorithm
Tom,

> I've got one upstairs (HPPA), and I believe that it's actually a pretty
> common situation in scientifically-oriented workstations from a few
> years back.

Last I checked, scientific workstations aren't exactly a common platform for PostgreSQL servers.

The question is, for our most common platforms (like AMD and Intel), is the FPU notably slower/more contended than integer division?  I had the impression that it was, but my knowledge of chip architectures is liable to be out of date.

Can we have a hardware geek speak up?

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
Re: [HACKERS] Spinlock backoff algorithm
On 11/14/07, Tom Lane <[EMAIL PROTECTED]> wrote:
> The other problem with using modulo is that it makes the result depend
> mostly on the low-order bits of the random() result, rather than mostly
> on the high-order bits; with lower-grade implementations of random(),
> the lower bits are materially less random than the higher.  Now
> admittedly high-grade randomness is probably not too important for this
> specific context, but I dislike putting in poor coding practices that
> someone might see and copy without thinking...

If there's a dependency on a particular quality of random() implementation, why not just include one?  Mersenne Twister is easy, while not being cryptographic strength.

http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html
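For scale: the Mersenne Twister reference implementation is a few hundred lines of C, and a bundled generator could be far smaller still.  As a purely illustrative sketch, here is Marsaglia's xorshift32, a different and statistically weaker algorithm than MT, shown only to suggest how little code a self-contained PRNG can take; nothing here is proposed PostgreSQL code:

```c
#include <stdint.h>

/* Minimal xorshift32 PRNG (George Marsaglia's published 13/17/5 triple).
 * Not cryptographic, but all 32 output bits are of far more uniform
 * quality than the low-order bits of a low-grade libc random(). */
static uint32_t xorshift32_state = 2463534242u;   /* any nonzero seed */

static uint32_t xorshift32(void)
{
    uint32_t x = xorshift32_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return xorshift32_state = x;                  /* state is never zero */
}
```

Because the update is a bijection on nonzero 32-bit states, the generator cycles through all 2^32 - 1 nonzero values before repeating.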
Re: [HACKERS] Spinlock backoff algorithm
On Wed, 14 Nov 2007, Mark Mielke wrote:
>> The other problem with using modulo is that it makes the result depend
>> mostly on the low-order bits of the random() result, rather than mostly
>> on the high-order bits; with lower-grade implementations of random(),
>> the lower bits are materially less random than the higher.
> If this was a serious problem, there is the >> operator.  I see it as a
> poor coding practice to make assumptions about which bits are most
> "random" in a call to random().

There are many types of pseudo-random number generators where the low-order bits are not so random, and the assumption Tom has described is pretty likely to be true.  See http://www.fourmilab.ch/random/ as one comment about the badness of the standard UNIX random generator, for example.

There is an interesting discussion of this issue, along with code showing a way to improve things while only using integer math (which in some cases uses >> as you suggest), as part of the Java standard library:

http://java.sun.com/j2se/1.4.2/docs/api/java/util/Random.html#nextInt(int)

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
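The integer-only technique behind the linked Java Random.nextInt(int), rejecting raw values that would bias the modulo, can be sketched in C.  This is an illustration of the general rejection idea, not the exact Java algorithm, and the little LCG here is a hypothetical stand-in generator, not libc's random():

```c
/* Stand-in generator for illustration: a minimal LCG returning 0..32767. */
static unsigned long lcg_state = 12345;

static long lcg_next(void)
{
    lcg_state = lcg_state * 1103515245UL + 12345UL;
    return (long) ((lcg_state >> 16) & 0x7FFF);
}

/* Uniform value in [0, n) using only integer math: reject raw values at
 * or above the largest multiple of n that fits in the generator's range,
 * so the final modulo introduces no bias. */
static long bounded_random(long n)
{
    long range = 0x8000;                /* generator produces 0..range-1 */
    long limit = range - (range % n);   /* largest multiple of n <= range */
    long r;

    do
        r = lcg_next();
    while (r >= limit);                 /* rare rejection, then retry */

    return r % n;
}
```

For n = 7, only the single raw value 32767 is rejected, so the expected number of retries is negligible while the result is exactly uniform.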
Re: [HACKERS] Spinlock backoff algorithm
Mark Mielke <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> My goodness that's a hardware-dependent proposal.  Shall we discuss
>> how many CPUs there are where an integer division is *slower* than
>> a floating-point op?

> Do you have one in mind, or is this a straw man? :-)

I've got one upstairs (HPPA), and I believe that it's actually a pretty common situation in scientifically-oriented workstations from a few years back.

>> Why do you think that a couple of FP ops here are a problem, anyway?
>> This is a code path where we've already yielded the processor, so
>> by definition the repetition rate has to be pretty low.

> Yielded the processor?

Yielded the processor, as in pg_usleep.  It is absolutely impossible for any thread to execute that line of code more than 1000 times per second, and the expected rate would be very much less.  Furthermore, the entire point of the function is to try to make processes come out of the sleep at different times, so they shouldn't be ganging up on the FPU anyway.

There may be some place where we have an unnecessarily high amount of FP usage, but I very much doubt that this is it.

regards, tom lane
Re: [HACKERS] Spinlock backoff algorithm
Tom Lane wrote:
> Magne Mæhre <[EMAIL PROTECTED]> writes:
>> I understand the reasoning for the backoff (as of the discussion on
>> 2003-08-05), but is there any particular reason for using floating
>> point operations here?  Maybe a modulo would be just as good (or
>> better since it doesn't involve the FPU)?
> My goodness that's a hardware-dependent proposal.  Shall we discuss
> how many CPUs there are where an integer division is *slower* than
> a floating-point op?

Hi Tom:

Do you have one in mind, or is this a straw man? :-)

> Why do you think that a couple of FP ops here are a problem, anyway?
> This is a code path where we've already yielded the processor, so
> by definition the repetition rate has to be pretty low.

Yielded the processor?  Or yielded the lock?  With 32 active threads contending for the lock, but first contending for the FPU, one could see how it might be relevant.  I think I agree with you that this won't be the only problem; however, the FPU need not contribute to the problem either.

> The other problem with using modulo is that it makes the result depend
> mostly on the low-order bits of the random() result, rather than mostly
> on the high-order bits; with lower-grade implementations of random(),
> the lower bits are materially less random than the higher.  Now
> admittedly high-grade randomness is probably not too important for this
> specific context, but I dislike putting in poor coding practices that
> someone might see and copy without thinking...

If this was a serious problem, there is the >> operator.  I see it as a poor coding practice to make assumptions about which bits are most "random" in a call to random().  If anybody fixes the problem you describe, then the opposite may become true: perhaps the lowest bits become the most random.  If random() is broken, random() should be fixed.  Coding in specific implementation details about specific implementations of random() is just as bad. :-)

IMHO, use of the FPU should be avoided wherever possible on any platform.  On some platforms, the FPU may be fully implemented in software.  My memory is faint, but I think SPARC v7 either implemented floating-point division in software, or had a trap that implemented it in software.

Cheers,
mark

--
Mark Mielke <[EMAIL PROTECTED]>
Re: [HACKERS] Spinlock backoff algorithm
Magne Mæhre <[EMAIL PROTECTED]> writes:
> I understand the reasoning for the backoff (as of the discussion on
> 2003-08-05), but is there any particular reason for using floating
> point operations here ? Maybe a modulo would be just as good (or
> better since it doesn't involve the FPU) ?

My goodness that's a hardware-dependent proposal.  Shall we discuss how many CPUs there are where an integer division is *slower* than a floating-point op?

Why do you think that a couple of FP ops here are a problem, anyway?  This is a code path where we've already yielded the processor, so by definition the repetition rate has to be pretty low.

The other problem with using modulo is that it makes the result depend mostly on the low-order bits of the random() result, rather than mostly on the high-order bits; with lower-grade implementations of random(), the lower bits are materially less random than the higher.  Now admittedly high-grade randomness is probably not too important for this specific context, but I dislike putting in poor coding practices that someone might see and copy without thinking...

regards, tom lane
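For what it's worth, there is also an integer-only way to key the result off random()'s high-order bits: scale by a 64-bit fixed-point multiply instead of taking a modulo.  A hedged sketch, with MAX_RANDOM_VALUE assumed to be 0x7FFFFFFF, and not proposed code for the tree:

```c
#define MAX_RANDOM_VALUE 0x7FFFFFFF     /* assumed 31-bit random() range */

/* Random increment in [0, cur_delay), computed without the FPU and without
 * modulo.  The 64-bit multiply effectively scales cur_delay by
 * random()/(MAX_RANDOM_VALUE+1), so the result is driven by the generator's
 * high-order bits, which is what the plain modulo version gets backwards. */
static int backoff_increment(int cur_delay, long r)
{
    return (int) (((long long) r * cur_delay) /
                  ((long long) MAX_RANDOM_VALUE + 1));
}
```

Unlike the existing floating-point expression, this truncates instead of rounding, so the increment tops out at cur_delay - 1 rather than cur_delay; for backoff purposes that difference should be immaterial.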
Re: [HACKERS] Spinlock backoff algorithm
Magne Mæhre wrote:
> I was playing with a Nevada server and noticed a rush on the FPU (the
> Nevada has a single shared FPU for its 32 threads).

Probably you mean Niagara ;-).

Zdenek
[HACKERS] Spinlock backoff algorithm
I was playing with a Nevada server and noticed a rush on the FPU (the Nevada has a single shared FPU for its 32 threads).  Looking at the spinlock code, I found:

    cur_delay += (int) (cur_delay * ((double) random() /
                        (double) MAX_RANDOM_VALUE) + 0.5);

I understand the reasoning for the backoff (as of the discussion on 2003-08-05), but is there any particular reason for using floating point operations here?  Maybe a modulo would be just as good (or better, since it doesn't involve the FPU)?  Something like:

    cur_delay += random() % (cur_delay + 1);

--Magne
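For reference, the existing floating-point expression and the proposed modulo both add a random increment of up to roughly cur_delay, so they are behaviorally interchangeable.  A self-contained sketch of the two, taking the raw random value as a parameter so they can be compared deterministically, with MAX_RANDOM_VALUE assumed to be 0x7FFFFFFF:

```c
#define MAX_RANDOM_VALUE 0x7FFFFFFF     /* assumed 31-bit random() range */

/* Existing s_lock.c style: scale cur_delay by random()/MAX in floating
 * point, rounding to the nearest integer; increment is in [0, cur_delay]. */
static int fp_backoff(int cur_delay, long r)
{
    return cur_delay +
        (int) (cur_delay * ((double) r / (double) MAX_RANDOM_VALUE) + 0.5);
}

/* Proposed integer style: modulo keeps the increment in [0, cur_delay],
 * but, as noted in the thread, it depends on r's low-order bits. */
static int mod_backoff(int cur_delay, long r)
{
    return cur_delay + (int) (r % (cur_delay + 1));
}
```

Either way, cur_delay at most doubles per iteration, so the backoff remains exponential with random jitter.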