Re: Performance degradation w/ -current - GENERIC.MP {amd64,i386}
>>>>> "a" == Ariane van der Steldt ari...@stack.nl writes:

a> On Mon, Apr 20, 2009 at 01:57:55PM -0400, RD Thrush wrote:
>> I've recently noticed reduced performance when building ports for
>> amd64 and i386 platforms on multiprocessor boxes. I found the problem
>> was associated with running a 'nice'd dnetc [1] process on each
>> processor. Without the 'nice'd processes, performance improves
>> dramatically. In a test case, elapsed time increased 25X (from ~24
>> seconds to more than 650 seconds) in one case and 12X (from ~30
>> seconds to more than 350 seconds) in another case.
>>
>> Since I received the 4.5 CD on Saturday (thanks for another very cool
>> release!), I used the bsd.mp kernels from that release and found that
>> the reduced performance has occurred since the 4.5 release. The
>> problem can be reproduced by busying each core w/ a 'nice'd process.
>> Then, 'make clean; time make fake' in $PORTSDIR/devel/libtool
>> illustrates the problem.

a> Using make and recording the time used is useless: the most
a> important numbers (user and sys) are only recorded for the initial
a> 'make' program, not the programs it starts.

I didn't know that time(1) didn't accumulate user and sys. Thanks for
that.

a> You may want to redo the test with a program that doesn't spawn
a> other processes, like gzipping a large file.

Ok, I've since updated both build boxes with the 20090419 snap and have
used a simpler test case of 'gzcat comp45.tgz | wc'. Without the pipe,
I was unable to stimulate the condition that resulted in the
performance degradation.

The new testing adds more wrinkles... The i386 and amd64 platforms
behave differently than expected, ie. I was expecting both platforms to
show the dramatic degradation noticed with the make test. However, only
the 2-core i386 showed significant degradation; the 4-core amd64
platform showed much less degradation with -current.

In the following mini-analysis, time(1) is used as the measuring stick:

  4.5         refers to the 4.5 release as found on the official CD.
  snap        refers to the 20090419 -current snapshot.
  baseline    means no niced processes.
  niced busy  means one busy process per core.
  x4          refers to an AMD Phenom 9550 quad-core processor on an
              ASUS M3A78-EM motherboard.
  v1          refers to an AMD Athlon X2 dual-core processor on an
              ASUS M2N-E motherboard.

dmesgs are appended for both x4 and v1 running the aforementioned snap;
dmesgs for the 4.5 cases are in my previous message.

baseline shows no significant difference between 4.5 and snap on either
the v1 or x4 platform.

niced busy on x4 with 4.5 shows no significant difference from the
baseline (if anything, the numbers are unexpectedly slightly better).
The additional niced processes apparently don't offer enough
perturbation, perhaps because 4 cores are available instead of the 2
with v1.

niced busy on x4 with snap shows more than a 3 second increase in real
time. This is much less dramatic than the same test case with v1.

niced busy on v1 with 4.5 shows an increase of approximately 4 seconds
in real (elapsed) time while the user and system times are
approximately the same. The additional niced processes probably prevent
overlapping the gzcat and wc processes, thus increasing the elapsed
time.

niced busy on v1 with snap shows more than a 150 second increase in
real time. Interestingly enough, the user and system times are more
variable and lower than the baseline results. I can't explain the
results of this test.

I've appended the gory details, which include dnetc results as
described in my previous message.
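For reference, the measurement distilled to a sketch (assuming
comp45.tgz from the release sets is in the current directory; the
'sh -c' wrapper makes /usr/bin/time account for both sides of the pipe,
since the children a shell wait()s for are charged to it):

    # load every core with a niced spinner first (Do_busy, appended
    # to my first message), then repeat a few times:
    /usr/bin/time sh -c 'gzcat comp45.tgz | wc'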
The dnetc results are similar to niced busy and are only included for
completeness.

FWIW, I have a soekris 5501 that does *not* have the problem, which may
indicate the issue is not in the uniprocessor environment.

Is this new 'nice' behavior expected? Or is it a side effect of some
other updates to the multiprocessor environment? Hopefully, performance
can be restored to that of the 4.5 release.

a> A lot of cpu affinity changes have gone into the kernel.

That sounds like fertile ground to explain the observations.

a> Please note that, while a niced process may only eat left-over
a> processor time (with a lower bound), the niced process will still
a> take away responsiveness from the system: it will finish its
a> timeslice and the context switches may be expensive too. Taking away
a> responsiveness also means more delay between waiting for data from
a> disk and the process processing it.

I realize the niced processes still have an effect on performance.
However, with the 4.5 release and earlier it was seemingly
insignificant, whereas there is now a quite noticeable performance
reduction while building ports. FWIW, I have noticed additional
sluggishness in the past few months but have been unable to quantify it
with reproducible results, so I haven't reported it.

I've appended a list of the 'time make fake' results for the release
and snapshot bsd.mp kernels for both an amd quad-core (amd64 platform)
and an amd dual-core (i386 platform).
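For completeness, the port-build reproduction in one place; a sketch,
assuming the busy.c and Do_busy appended to my first message:

    cc -o busy busy.c
    ./Do_busy nice               # one niced spinner per core
    cd $PORTSDIR/devel/libtool
    make clean
    time make fake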
Re: Performance degradation w/ -current - GENERIC.MP {amd64,i386}
On Tue, 21 Apr 2009, Hannah Schroeter wrote:

> On Tue, Apr 21, 2009 at 07:07:13AM -0400, RD Thrush wrote:
>> >>>>> "a" == Ariane van der Steldt ari...@stack.nl writes:
>> a> On Mon, Apr 20, 2009 at 01:57:55PM -0400, RD Thrush wrote:
>>>> [...] The problem can be reproduced by busying each core w/ a
>>>> 'nice'd process. Then, 'make clean; time make fake' in
>>>> $PORTSDIR/devel/libtool illustrates the problem.
>> a> Using make and recording the time used is useless: the most
>> a> important numbers (user and sys) are only recorded for the
>> a> initial 'make' program, not the programs it starts.
>> I didn't know that time(1) didn't accumulate user and sys. Thanks
>> for that.
>
> It does. Ariane is wrong IMO. Only the resource usage of
> children/grandchildren that weren't wait()ed for is lost. But IIRC
> make usually *does* wait() for its children.

Also note this: time, the built-in command in ksh(1), reports different
numbers than time(1) (i.e. /usr/bin/time) when measuring pipelined
commands. Once, I was very surprised by this. Look at the man page.

Regards,
David
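P.S. A quick illustration of that difference (a sketch; any large file
will do in place of comp45.tgz):

    # ksh built-in: 'time' is a reserved word, so the whole
    # pipeline (gzcat and wc) is measured
    time gzcat comp45.tgz | wc

    # /usr/bin/time: the shell splits the pipeline first, so only
    # gzcat is measured; wc runs outside the measurement
    /usr/bin/time gzcat comp45.tgz | wc

    # to measure the whole pipe with /usr/bin/time, wrap it in a
    # shell; the user/sys of wait()ed-for children accumulate
    /usr/bin/time sh -c 'gzcat comp45.tgz | wc'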
Performance degradation w/ -current - GENERIC.MP {amd64,i386}
I've recently noticed reduced performance when building ports for amd64
and i386 platforms on multiprocessor boxes. I found the problem was
associated with running a 'nice'd dnetc [1] process on each processor.
Without the 'nice'd processes, performance improves dramatically. In a
test case, elapsed time increased 25X (from ~24 seconds to more than
650 seconds) in one case and 12X (from ~30 seconds to more than 350
seconds) in another case.

Since I received the 4.5 CD on Saturday (thanks for another very cool
release!), I used the bsd.mp kernels from that release and found that
the reduced performance has occurred since the 4.5 release. The problem
can be reproduced by busying each core w/ a 'nice'd process. Then,
'make clean; time make fake' in $PORTSDIR/devel/libtool illustrates the
problem.

FWIW, I have a soekris 5501 that does *not* have the problem, which may
indicate the issue is not in the uniprocessor environment.

Is this new 'nice' behavior expected? Or is it a side effect of some
other updates to the multiprocessor environment? Hopefully, performance
can be restored to that of the 4.5 release.

I've appended a list of the 'time make fake' results for the release
and snapshot bsd.mp kernels for both an amd quad-core (amd64 platform)
and an amd dual-core (i386 platform). The busy source and associated
dmesgs are at the end.

[1] http://www.distributed.net/

## amd64 quad-core

4.5 bsd.mp - baseline (ie. neither busy nor dnetc running)
       24.69 real         8.58 user        16.09 sys
4.5 bsd.mp - dnetc
       22.06 real         7.69 user         9.32 sys
       22.10 real         7.94 user         8.79 sys
4.5 bsd.mp - busy
       24.46 real         7.42 user         8.85 sys
       21.82 real         7.56 user         8.84 sys
       22.04 real         7.61 user         8.86 sys
       21.97 real         7.83 user         8.60 sys
       21.79 real         7.85 user         8.47 sys
20090415 bsd.mp - baseline
       25.74 real         8.29 user        17.63 sys
       25.75 real         8.28 user        18.31 sys
20090415 bsd.mp - dnetc
      667.76 real         5.30 user         4.08 sys
20090415 bsd.mp - busy
      670.28 real         4.92 user         4.92 sys
      652.32 real         6.03 user         7.11 sys
      659.84 real         5.72 user         6.44 sys

## i386 dual-core

4.5 bsd.mp - baseline
       30.14 real        11.38 user        14.26 sys
       27.80 real        11.52 user        13.90 sys
4.5 bsd.mp - dnetc
       27.37 real        11.07 user         9.06 sys
4.5 bsd.mp - busy - nice
       27.30 real        11.23 user         8.66 sys
       27.29 real        11.34 user         8.67 sys
20090415 bsd.mp - baseline
       28.85 real        11.74 user        15.55 sys
       28.81 real        11.33 user        16.08 sys
20090415 bsd.mp - dnetc
      527.43 real         6.70 user         0.84 sys
20090415 bsd.mp - busy
      380.40 real         7.14 user         1.01 sys
      367.23 real         7.22 user         0.75 sys

## a8v:Projects/busy 4024> cat busy.c
/* spin forever to keep one cpu busy */
int
main(int argc, char **argv)
{
    for (;;)
        ;
    return 0;
}

## a8v:Projects/busy 4025> cat Do_busy
#!/bin/sh
# start one cpu-bound process per core; any argument makes them niced
if [ X$1 != X ]; then
    NICE=nice
fi
CPUS=$(sysctl -n hw.ncpu)
TIME=/usr/bin/time
BUSY=./busy
CNT=0
echo Busying $CPUS processors
while :
do
    CNT=$((CNT + 1))
    $NICE $TIME -l $BUSY &    # background, so the loop starts one per core
    if [ $CNT = $CPUS ]; then
        exit
    fi
done

## amd64 quad-core 4.5 release ##
booting hd1a:/bsd.45.mp: 4803420+967047+840784+0+620768 [80+437952+277736]=0xb957d8
entry point at 0x1001e0 [7205c766, 3404, 24448b12, e0c0a304]
Ignoring 512MB above 4GB
[ using 716536 bytes of bsd ELF symbol table ]
Copyright (c) 1982, 1986, 1989, 1991, 1993
    The Regents of the University of California.  All rights reserved.
Copyright (c) 1995-2009 OpenBSD. All rights reserved.  http://www.OpenBSD.org

OpenBSD 4.5 (GENERIC.MP) #2133: Sat Feb 28 15:02:16 MST 2009
    dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 3488088064 (3326MB)
avail mem = 3372306432 (3216MB)
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.5 @ 0x9f000 (72 entries)
bios0: vendor American Megatrends Inc. version 1502 date 02/11/2009
bios0: ASUSTeK Computer INC. M3A78-EM
acpi0 at bios0: rev 2
acpi0: tables DSDT FACP APIC MCFG OEMB HPET
acpi0: wakeup devices PCE2(S4) PCE3(S4) PCE4(S4) PCE5(S4) PCE6(S4) RLAN(S4) PCE7(S4) PCE9(S4) PCEA(S4) PCEB(S4) PCEC(S4) SBAZ(S4) PS2M(S4) PS2K(S4) UAR1(S4) P0PC(S4) UHC1(S4) UHC2(S4) UHC3(S4) USB4(S4) UHC5(S4) UHC6(S4) UHC7(S4)
acpitimer0 at acpi0: 3579545 Hz, 32 bits
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot
Re: Performance degradation w/ -current - GENERIC.MP {amd64,i386}
On Mon, Apr 20, 2009 at 01:57:55PM -0400, RD Thrush wrote:
> I've recently noticed reduced performance when building ports for
> amd64 and i386 platforms on multiprocessor boxes. I found the problem
> was associated with running a 'nice'd dnetc [1] process on each
> processor. Without the 'nice'd processes, performance improves
> dramatically. In a test case, elapsed time increased 25X (from ~24
> seconds to more than 650 seconds) in one case and 12X (from ~30
> seconds to more than 350 seconds) in another case.
>
> Since I received the 4.5 CD on Saturday (thanks for another very cool
> release!), I used the bsd.mp kernels from that release and found that
> the reduced performance has occurred since the 4.5 release. The
> problem can be reproduced by busying each core w/ a 'nice'd process.
> Then, 'make clean; time make fake' in $PORTSDIR/devel/libtool
> illustrates the problem.

Using make and recording the time used is useless: the most important
numbers (user and sys) are only recorded for the initial 'make'
program, not the programs it starts.

You may want to redo the test with a program that doesn't spawn other
processes, like gzipping a large file.

> FWIW, I have a soekris 5501 that does *not* have the problem, which
> may indicate the issue is not in the uniprocessor environment.
>
> Is this new 'nice' behavior expected? Or is it a side effect of some
> other updates to the multiprocessor environment? Hopefully,
> performance can be restored to that of the 4.5 release.

A lot of cpu affinity changes have gone into the kernel.

Please note that, while a niced process may only eat left-over
processor time (with a lower bound), the niced process will still take
away responsiveness from the system: it will finish its timeslice and
the context switches may be expensive too. Taking away responsiveness
also means more delay between waiting for data from disk and the
process processing it.

> I've appended a list of the 'time make fake' results for the release
> and snapshot bsd.mp kernels for both an amd quad-core (amd64
> platform) and an amd dual-core (i386 platform). The busy source and
> associated dmesgs are at the end.
>
> ## a8v:Projects/busy 4025> cat Do_busy
> #!/bin/sh
> if [ X$1 != X ]; then
>     NICE=nice
> fi
> CPUS=$(sysctl -n hw.ncpu)
> TIME=/usr/bin/time
> BUSY=./busy
> CNT=0
> echo Busying $CPUS processors
> while :
> do
>     CNT=$((CNT + 1))
>     $NICE $TIME -l $BUSY &

Note that nice's behaviour is not to apply the maximum nice level, but
a default value which makes the process a bit nicer. Most likely, your
dnetc process switches to nice level 20 (the maximum) while this
invocation probably uses 10 (the /usr/bin/nice default).

>     if [ $CNT = $CPUS ]; then
>         exit
>     fi
> done

-- 
Ariane
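P.S. To see the two levels side by side (a sketch; that dnetc renices
itself to the maximum is my assumption):

    nice ./busy &          # nice(1) default: adds 10 to the nice level
    nice -n 20 ./busy &    # the maximum, what dnetc presumably uses
    ps -axo nice,command | grep '[b]usy'    # inspect the result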