[tools-compilers] [perf-discuss] Application runs almost 2x slower on Nevada than Linux
Yeah, I did some digging when I had a free moment. The following is
the most germane to your issue.
5070823 poor malloc() performance for small byte sizes
-j
On Thu, May 01, 2008 at 05:36:26PM -0400, Matty wrote:
> We are building our application as a 32-bit entity on both Linux and
> Solaris, so our comparison should be apples to apples. Does anyone
> happen to know what the bug id of the small malloc issue is? I
> searched the opensolaris bug database, but wasn't able to dig this up.
>
> Thanks,
> - Ryan
[tools-compilers] [perf-discuss] Application runs almost 2x slower on Nevada than Linux
Part of the problem is that these allocations are very small:
# dtrace -n 'pid$target::malloc:entry { @a["allocsz"] = quantize(arg0); }' -c
/tmp/xml
  allocsz
           value  ------------- Distribution ------------- count
               1 |                                               0
               2 |                                              30
               4 |@                                            475
               8 |@@                                           166
              16 |@                                        4300015
              32 |@@@                                         2406
              64 |                                              21
             128 |                                              41
             256 |                                              10
             512 |                                               0
            1024 |                                              10
            2048 |                                              10
            4096 |                                               0
            8192 |                                              10
           16384 |                                               0
After seeing this, I took a look at the exact breakdown of the
allocation sizes:
# dtrace -n 'pid$target::malloc:entry { @a[arg0] = count(); }' -c /tmp/xml
               12                1
               96                1
              200                1
               21               10
               43               10
               44               10
               51               10
               61               10
               75               10
               88               10
              128               10
              147               10
              181               10
              220               10
              440               10
             1024               10
             2048               10
             8194               10
                8               11
               52               11
                6               12
               36               14
               24               15
               33               20
                4               21
               17               21
                9               23
                3               30
               10               30
               13               30
               14               30
               25               30
               28               40
               11               41
               20               79
               40               90
                5               91
               16              250
                7              351
               48              381
               60             1850
The most frequent malloc call is to allocate 60 bytes. I believe that
we have a known issue with small mallocs on Solaris. There's a bug open
for this somewhere; however, I can't find its number at the moment.
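
If you want to see how much of the runtime the allocator itself is eating
(rather than just how often it is called), a pid-provider one-liner along
these lines should do it; this is only a sketch in the same spirit as the
commands above, and the aggregation name is arbitrary:

# dtrace -n 'pid$target::malloc:entry,pid$target::free:entry { self->ts = timestamp; }
  pid$target::malloc:return,pid$target::free:return /self->ts/ {
  @t[probefunc] = sum(timestamp - self->ts); self->ts = 0; }' -c /tmp/xml

That sums the wall-clock nanoseconds spent inside malloc() and free(), which
should make it easier to compare the allocator cost directly against the
Linux run.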
Another issue you may have run into is the 32-bit versus 64-bit
compilation difference. I was able to shave about 10 seconds off my
runtime by compiling your testcase as a 64-bit app instead of a 32-bit
one:
$ gcc -O3 -o xml `/usr/bin/xml2-config --libs --cflags` xml.c
$ file xml
xml:            ELF 32-bit LSB executable 80386 Version 1 [FPU], dynamically
linked, not stripped, no debugging information available
$ ./xml
10 iter in 22.749836 sec
versus:
$ gcc -m64 -O3 -o xml `/usr/bin/xml2-config --libs --cflags` xml.c
$ file xml
xml:            ELF 64-bit LSB executable AMD64 Version 1, dynamically linked,
not stripped, no debugging information available
$ ./xml
10 iter in 13.785916 sec
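
(As a sanity check before switching to -m64, isainfo(1) will tell you whether
the machine actually supports 64-bit applications:

$ isainfo -v

If it doesn't report a 64-bit instruction set, the -m64 build obviously isn't
an option.)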
-j
On Wed, Apr 30, 2008 at 06:44:31PM -0400, Matty wrote:
> On Wed, Apr 30, 2008 at 6:26 PM, David Lutz wrote:
> > If your application is single threaded, you could try using the
> > bsdmalloc library. This is a fast malloc, but it is not multi-thread
> > safe and will also tend to use more memory than the default
> > malloc. For a comparison of different malloc libraries, look
> > at the NOTES section at the end of umem_alloc(3MALLOC).
> >
> > I got the following result with your example code:
> >
> >
> > $ gcc -O3 -o xml `/usr/bin/xml2-config --libs --cflags` xml.c
> > $ ./xml
> > 10 iter in 21.445672 sec
> > $
> > $ gcc -O3 -o xml `/usr/bin/xml2-config --libs --cflags` xml.c -lbsdmalloc
> > $ ./xml
> > 10 iter in 12.761969 sec
> > $
> >
> > I got similar results using Sun Studio 12.
> >
> > Again, bsdmalloc is not multi-thread safe, so use it with caution.
>
> Thanks David. Does anyone happen to know why the memory allocation
> libraries in Solaris are so much slower than their Linux counterparts? If
> the various malloc implementations were a second or two slower, I could
> understand. But they appear to be 10 - 1
[tools-compilers] [perf-discuss] Application runs almost 2x slower on Nevada than Linux
> Could you try an experiment and compile your sources with
> /usr/lib/libast.so.1 (you need to compile the sources with
> -I/usr/include/ast before /usr/include/ since libast uses a different
> symbol namespace and cannot be used to "intercept" other
> |malloc()|/|free()| calls like libbsdmalloc)? AFAIK that should cure
> the performance problem (assuming the library functions which heavily
> use |malloc()|/|free()| are compiled against libast headers, too) ...

What would this experiment prove? The test program links to libxml2,
libz, libm, and libc. It seems like it would be a lot of work to
re-compile all of these libraries against libast headers just for
amusement. In the case of libc, I'm not even sure it's possible.

If libast has its own memory allocator, it would make a lot more sense
to move that functionality into an interposition library like we
already do for libumem, libmapmalloc, libwatchmalloc, libmtmalloc, and
libbsdmalloc.

-j
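
P.S. If you just want to see what an interposing allocator buys you on this
testcase without rebuilding anything, preloading one of the existing libraries
should be enough (I haven't timed these against the test program, so treat the
commands as a sketch of the mechanism rather than a result):

$ LD_PRELOAD=libumem.so.1 ./xml
$ LD_PRELOAD=libmtmalloc.so.1 ./xml

Each run interposes that library's malloc()/free() on the whole process, so
the timings are directly comparable with the earlier runs.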
[tools-compilers] [perf-discuss] Application runs almost 2x slower on Nevada than Linux
Hey Dude,
I pulled down a copy of your test program and ran a few experiments.
$ time ./xml
10 iter in 22.715982 sec
real    0m22.721s
user    0m22.694s
sys     0m0.007s
This seems to indicate that all of our time is being spent in usermode, so
whatever it is in Solaris that is slower than Linux, it's not the operating
system.
Just to double check, I used the following dtrace to look at the time this
application is spending in syscalls:
# dtrace -n 'syscall:::entry /execname == "xml"/ { self->traced = timestamp;
  self->func = probefunc; }
  syscall:::return /self->traced/ { @a[self->func] = sum(timestamp - self->traced);
  @b[self->func] = count(); self->func = (char *)0; self->traced = 0; }'
(This returns the time xml spends in syscalls as well as the number of syscalls
made)
Time (nanoseconds):
  getpid               1978
  sysconfig            2229
  sigpending           3071
  sysi86               3529
  getrlimit            4234
  setcontext           5763
  fstat64              7042
  close               16606
  write               19898
  getcwd              21302
  ioctl               23668
  munmap              25757
  brk                 28445
  open                40712
  resolvepath         51777
  xstat               57616
  mmap               159275
  memcntl            267070
Number of invocations:
  fstat64          1
  getcwd           1
  getpid           1
  getrlimit        1
  ioctl            1
  sigpending       1
  sysconfig        1
  sysi86           1
  write            1
  setcontext       2
  memcntl          6
  close            7
  munmap           8
  open             8
  resolvepath      9
  xstat            9
  brk             10
  mmap            32
This agrees with the output from time. You're spending a minuscule amount
of system time performing memory operations.
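
For a quick cross-check without DTrace, truss can produce a similar summary:

$ truss -c ./xml

truss -c counts the system calls made and reports the time spent in each,
though its probe effect is larger than DTrace's, so take the absolute numbers
with a grain of salt.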
The next place to look would be at libxml2. What library versions do
you have installed on the different machines? If the library versions
don't match, or you've compiled with a different set of options, or used
a different compiler to build the library, you'll get different results
for the benchmark.
I ran my test on snv_84. It has version 2.6.23 installed.
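
An easy way to compare machines is to ask xml2-config, since that is what the
build command above uses anyway:

$ /usr/bin/xml2-config --version

If the Linux box reports a different libxml2 release (or a build with
different configure options), that alone could account for part of the
difference.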
Another approach would be to use DTrace's profile provider. I obtained
usable results by performing the following:
# dtrace -n 'profile-1001 /execname == "xml"/ { @a[ustack()] = count(); } END
{ trunc(@a, 20); }' -c /tmp/xml
This returns the 20 most frequent stack frames (and the number of times
they occurred) when DTrace fires a profile probe at 1001 Hz.
xmlDictLookup shows up as a popular function, but that may be a red
herring. I'll include the top 5 here, but you can run this command, or adjust
the truncation to view any amount you'd like:
libc.so.1`_free_unlocked+0x4c
libc.so.1`free+0x2b
libxml2.so.2`xmlFreeNodeList+0x19e
libxml2.so.2`xmlFreeNodeList+0xb6
libxml2.so.2`xmlFreeNodeList+0xb6
libxml2.so.2`xmlFreeDoc+0xac
xml`main+0x55
xml`_start+0x80
61
libxml2.so.2`xmlDictLookup+0x60
libxml2.so.2`xmlParseNCName+0xab
libxml2.so.2`xmlParseQName+0x44
libxml2.so.2`xmlParseAttribute2+0x54
libxml2.so.2`xmlParseStartTag2+0x246
