Re: Performance degradation w/ -current - GENERIC.MP {amd64,i386}

2009-04-21 Thread RD Thrush
 a == Ariane van der Steldt ari...@stack.nl writes:
a On Mon, Apr 20, 2009 at 01:57:55PM -0400, RD Thrush wrote:
 I've recently noticed reduced performance when building ports for
 amd64 and i386 platforms on multiprocessor boxes.  I found the problem
 was associated with running a 'nice'd dnetc [1] process on each
 processor.  Without the 'nice'd processes, performance improves
 dramatically.
 
 In a test case, elapsed time increased 25X (from ~24 seconds to more
 than 650 seconds) in one case and 12X (from ~30 seconds to more than
 350 seconds) in another case.
 
 Since I received the 4.5 CD on Saturday (thanks for another very cool
 release!), I used the bsd.mp kernels from that release and found that
 the problem with reduced performance has appeared since the 4.5 release.
 
 The problem can be reproduced by busying each core w/ a 'nice'd
 process.  Then, 'make clean;time make fake' in $PORTSDIR/devel/libtool
 illustrates the problem.

a Using a make and recording the time used is useless: the most important
a numbers (user and sys) are only recorded for the initial 'make' program,
a not the programs it starts.

  I didn't know that time(1) didn't accumulate user and sys.  Thanks
  for that.

a You may want to redo the test with a program that doesn't spawn other
a processes, like gzipping a large file.

  Ok, I've since updated both build boxes with the 20090419 snap and
  have used a simpler test case of 'gzcat comp45.tgz | wc'.  Without
  the pipe, I was unable to stimulate the condition that resulted in
  the performance degradation.
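
  (For reference, a pipeline like that can be timed as a whole with
  something along these lines; the sh -c wrapper makes time(1) account
  for both sides of the pipe rather than just gzcat:)

  /usr/bin/time -l sh -c 'gzcat comp45.tgz | wc'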

  The new testing adds more wrinkles...  The i386 and amd64 platforms
  behave differently than expected, i.e. I was expecting both platforms
  to show the dramatic degradation noticed with the make test.
  However, only the 2-core i386 showed significant degradation.  The
  4-core amd64 platform showed much less degradation with -current.

  In the following mini-analysis, time(1) is used as the measuring
  stick.  "4.5" refers to the 4.5 release as found on the official CD.
  "snap" refers to the 20090419 -current snapshot.  "baseline" means
  no niced processes.  "niced busy" means one busy process per core.
  "x4" refers to an AMD Phenom 9550 quad-core processor on an ASUS
  M3A78-EM motherboard.  "v1" refers to an AMD Athlon X2 dual-core
  processor on an ASUS M2N-E motherboard.  dmesgs are appended for
  both x4 and v1 running the aforementioned snap.  dmesgs for the
  4.5 cases are in my previous message.

  baseline shows no significant difference between 4.5 and snap
  on either the v1 or x4 platforms.

  niced busy on x4 with 4.5 shows little significant difference
  from the baseline (if anything, the numbers are unexpectedly
  slightly better).  The additional niced processes apparently don't
  cause enough perturbation, perhaps because 4 cores are available
  instead of the 2 on v1.

  niced busy on x4 with snap shows more than 3 seconds increase in
  real time.  This is much less dramatic than the same test case with
  v1.

  niced busy on v1 with 4.5 shows an approx. 4 second increase in real
  (elapsed) time while the user and system times are approx. the same.
  The additional niced processes probably prevent overlapping the
  gzcat and wc processes, thus resulting in the increased elapsed time.

  niced busy on v1 with snap shows more than 150 seconds increase
  in real time.  Interestingly enough, the user and system time are
  more variable and less than the baseline results.  I can't explain
  the results of this test.

  I've appended the gory details which include dnetc results as
  described in my previous message.  The dnetc results are similar
  to niced busy and are only included for completeness.

 FWIW, I have a soekris 5501 that does *not* have the problem, which
 may indicate the issue is not present in a uniprocessor environment.

 Is this new 'nice' behavior expected?  Or is it a side effect of some
 other updates to the multiprocessor environment?  Hopefully,
 performance can be restored to that of the 4.5 release.

a A lot of cpu affinity changes have gone into the kernel.

  That sounds like fertile ground to explain the observations.

a Please note that, while a niced process may only eat left-over processor
a time (with a lower bound), the niced process will still take away
a responsiveness from the system: it will finish its timeslice and the
a context switches may be expensive too. Taking away responsiveness also
a means more delay between waiting for data from disk and the process
a processing it.

  I realize the niced processes still have an effect on performance.
  However, with the 4.5 release and earlier, it was seemingly
  insignificant; there is now a quite noticeable performance
  reduction while building ports.  FWIW, I have noticed additional
  sluggishness in the past few months but have been unable to quantify
  it with reproducible results, so I haven't reported it.


 I've appended a list of the 'time make fake' results for the release
 and snapshot bsd.mp kernels for 

Re: Performance degradation w/ -current - GENERIC.MP {amd64,i386}

2009-04-21 Thread David Vasek

On Tue, 21 Apr 2009, Hannah Schroeter wrote:

On Tue, Apr 21, 2009 at 07:07:13AM -0400, RD Thrush wrote:

a == Ariane van der Steldt ari...@stack.nl writes:

a On Mon, Apr 20, 2009 at 01:57:55PM -0400, RD Thrush wrote:

[...]

The problem can be reproduced by busying each core w/ a 'nice'd
process.  Then, 'make clean;time make fake' in $PORTSDIR/devel/libtool
illustrates the problem.

a Using a make and recording the time used is useless: the most important
a numbers (user and sys) are only recorded for the initial 'make' program,
a not the programs it starts.

 I didn't know that time(1) didn't accumulate user and sys.  Thanks
 for that.

It does. Ariane is wrong IMO. Only the resource usage of
children/grandchildren that weren't wait()ed for is lost.
But IIRC make usually *does* wait() for its children.


Also note this: time, the built-in command in ksh(1), reports different
numbers than time(1) (i.e. /usr/bin/time) when measuring pipelined
commands.  Once, I was very surprised by this.  Look at the man pages.
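
For example (just a sketch, reusing the file name from the earlier
test), the two commands below measure different things: the ksh
built-in times the whole pipeline, while /usr/bin/time only measures
gzcat, since wc is not its child:

   time gzcat comp45.tgz | wc
   /usr/bin/time gzcat comp45.tgz | wc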


Regards,
David



Performance degradation w/ -current - GENERIC.MP {amd64,i386}

2009-04-20 Thread RD Thrush
I've recently noticed reduced performance when building ports for
amd64 and i386 platforms on multiprocessor boxes.  I found the problem
was associated with running a 'nice'd dnetc [1] process on each
processor.  Without the 'nice'd processes, performance improves
dramatically.

In a test case, elapsed time increased 25X (from ~24 seconds to more
than 650 seconds) in one case and 12X (from ~30 seconds to more than
350 seconds) in another case.

Since I received the 4.5 CD on Saturday (thanks for another very cool
release!), I used the bsd.mp kernels from that release and found that
the problem with reduced performance has appeared since the 4.5 release.

The problem can be reproduced by busying each core w/ a 'nice'd
process.  Then, 'make clean;time make fake' in $PORTSDIR/devel/libtool
illustrates the problem.

FWIW, I have a soekris 5501 that does *not* have the problem, which
may indicate the issue is not present in a uniprocessor environment.

Is this new 'nice' behavior expected?  Or is it a side effect of some
other updates to the multiprocessor environment?  Hopefully,
performance can be restored to that of the 4.5 release.

I've appended a list of the 'time make fake' results for the release
and snapshot bsd.mp kernels for both an amd quad-core (amd64 platform)
and an amd dual-core (i386 platform).  The busy source and associated
dmesgs are at the end.

[1] http://www.distributed.net/

##
amd64 quad-core
4.5 bsd.mp - baseline (i.e. neither busy nor dnetc running)
   24.69 real     8.58 user    16.09 sys
4.5 bsd.mp - dnetc
   22.06 real     7.69 user     9.32 sys
   22.10 real     7.94 user     8.79 sys
4.5 bsd.mp - busy
   24.46 real     7.42 user     8.85 sys
   21.82 real     7.56 user     8.84 sys
   22.04 real     7.61 user     8.86 sys
   21.97 real     7.83 user     8.60 sys
   21.79 real     7.85 user     8.47 sys

20090415 bsd.mp - baseline
   25.74 real     8.29 user    17.63 sys
   25.75 real     8.28 user    18.31 sys
20090415 bsd.mp - dnetc
  667.76 real     5.30 user     4.08 sys
20090415 bsd.mp - busy
  670.28 real     4.92 user     4.92 sys
  652.32 real     6.03 user     7.11 sys
  659.84 real     5.72 user     6.44 sys

##
i386 dual-core
4.5 bsd.mp - baseline
   30.14 real    11.38 user    14.26 sys
   27.80 real    11.52 user    13.90 sys
4.5 bsd.mp - dnetc
   27.37 real    11.07 user     9.06 sys
4.5 bsd.mp - busy - nice
   27.30 real    11.23 user     8.66 sys
   27.29 real    11.34 user     8.67 sys

20090415 bsd.mp - baseline
   28.85 real    11.74 user    15.55 sys
   28.81 real    11.33 user    16.08 sys
20090415 bsd.mp - dnetc
  527.43 real     6.70 user     0.84 sys
20090415 bsd.mp - busy
  380.40 real     7.14 user     1.01 sys
  367.23 real     7.22 user     0.75 sys


##
a8v:Projects/busy 4024> cat busy.c
int main(int argc, char **argv)
{
   for ( ;; ) ;   /* spin forever to keep one cpu busy */
   return 0;
}
##
a8v:Projects/busy 4025> cat Do_busy
#!/bin/sh

if [ X$1 != X ]; then
  NICE=nice
fi
CPUS=$(sysctl -n hw.ncpu)
TIME=/usr/bin/time
BUSY=./busy
CNT=0
echo Busying $CPUS processors
while :
do
  CNT=$((CNT + 1))
  $NICE $TIME -l $BUSY &
  if [ $CNT = $CPUS ]; then
    exit
  fi
done
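
(Presumably the loop is built first with something like 'cc -o busy
busy.c'; the script then starts one copy per cpu, nice'd whenever any
argument is given:)

   ./Do_busy        # one busy loop per cpu at normal priority
   ./Do_busy nice   # any argument: each loop is run under nice(1)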


##
amd64 quad-core 4.5 release
##
booting hd1a:/bsd.45.mp: 4803420+967047+840784+0+620768 
[80+437952+277736]=0xb957d8
entry point at 0x1001e0 [7205c766, 3404, 24448b12, e0c0a304]
Ignoring 512MB above 4GB
[ using 716536 bytes of bsd ELF symbol table ]
Copyright (c) 1982, 1986, 1989, 1991, 1993
The Regents of the University of California.  All rights reserved.
Copyright (c) 1995-2009 OpenBSD. All rights reserved.  http://www.OpenBSD.org

OpenBSD 4.5 (GENERIC.MP) #2133: Sat Feb 28 15:02:16 MST 2009
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 3488088064 (3326MB)
avail mem = 3372306432 (3216MB)
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.5 @ 0x9f000 (72 entries)
bios0: vendor American Megatrends Inc. version 1502 date 02/11/2009
bios0: ASUSTeK Computer INC. M3A78-EM
acpi0 at bios0: rev 2
acpi0: tables DSDT FACP APIC MCFG OEMB HPET
acpi0: wakeup devices PCE2(S4) PCE3(S4) PCE4(S4) PCE5(S4) PCE6(S4) RLAN(S4) 
PCE7(S4) PCE9(S4) PCEA(S4) PCEB(S4) PCEC(S4) SBAZ(S4) PS2M(S4) PS2K(S4) 
UAR1(S4) P0PC(S4) UHC1(S4) UHC2(S4) UHC3(S4) USB4(S4) UHC5(S4) UHC6(S4) UHC7(S4)
acpitimer0 at acpi0: 3579545 Hz, 32 bits
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot 

Re: Performance degradation w/ -current - GENERIC.MP {amd64,i386}

2009-04-20 Thread Ariane van der Steldt
On Mon, Apr 20, 2009 at 01:57:55PM -0400, RD Thrush wrote:
 I've recently noticed reduced performance when building ports for
 amd64 and i386 platforms on multiprocessor boxes.  I found the problem
 was associated with running a 'nice'd dnetc [1] process on each
 processor.  Without the 'nice'd processes, performance improves
 dramatically.
 
 In a test case, elapsed time increased 25X (from ~24 seconds to more
 than 650 seconds) in one case and 12X (from ~30 seconds to more than
 350 seconds) in another case.
 
 Since I received the 4.5 CD on Saturday (thanks for another very cool
 release!), I used the bsd.mp kernels from that release and found that
 the problem with reduced performance has appeared since the 4.5 release.
 
 The problem can be reproduced by busying each core w/ a 'nice'd
 process.  Then, 'make clean;time make fake' in $PORTSDIR/devel/libtool
 illustrates the problem.

Using a make and recording the time used is useless: the most important
numbers (user and sys) are only recorded for the initial 'make' program,
not the programs it starts.

You may want to redo the test with a program that doesn't spawn other
processes, like gzipping a large file.
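
For instance, something like (any large file will do; the path is just
a placeholder):

   /usr/bin/time gzip -9 -c /path/to/largefile > /dev/null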

 FWIW, I have a soekris 5501 that does *not* have the problem, which
 may indicate the issue is not present in a uniprocessor environment.

 Is this new 'nice' behavior expected?  Or is it a side effect of some
 other updates to the multiprocessor environment?  Hopefully,
 performance can be restored to that of the 4.5 release.

A lot of cpu affinity changes have gone into the kernel.

Please note that, while a niced process may only eat left-over processor
time (with a lower bound), the niced process will still take away
responsiveness from the system: it will finish its timeslice and the
context switches may be expensive too. Taking away responsiveness also
means more delay between waiting for data from disk and the process
processing it.

 I've appended a list of the 'time make fake' results for the release
 and snapshot bsd.mp kernels for both an amd quad-core (amd64 platform)
 and an amd dual-core (i386 platform).  The busy source and associated
 dmesgs are at the end.



 ##
 a8v:Projects/busy 4025> cat Do_busy
 #!/bin/sh
 
 if [ X$1 != X ]; then
   NICE=nice
 fi
 CPUS=$(sysctl -n hw.ncpu)
 TIME=/usr/bin/time
 BUSY=./busy
 CNT=0
 echo Busying $CPUS processors
 while :
 do
   CNT=$((CNT + 1))
   $NICE $TIME -l $BUSY &

Note that nice's behaviour is not to apply the maximum nice level, but
a default increment which makes the process a bit nicer. Most likely,
your dnetc process switches to nice level 20 (the maximum) while this
invocation probably uses 10 (as per the /usr/bin/nice default); see the
example below the script.

   if [ $CNT = $CPUS ]; then
 exit
   fi
 done
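
As a small illustration of the nice levels mentioned above (a sketch
only; 'busy' is the loop from the quoted script):

   nice ./busy &         # default increment: nice level 10
   nice -n 20 ./busy &   # explicit increment: nice level 20, the level
                         # dnetc most likely runs at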

-- 
Ariane