Re: Memory allocation performance

2008-03-05 Thread Alexander Motin
Bruce Evans wrote: Try profiling it on another type of CPU, to get different performance counters but hopefully not very different stalls. If the other CPU doesn't stall at all, put another black mark against P4 and delete your copies of it :-). I have tried to profile the same system with

Re: Memory allocation performance

2008-02-04 Thread Dag-Erling Smørgrav
Julian Elischer writes: Dag-Erling Smørgrav writes: Julian Elischer writes: you mean FILO or LIFO right? Uh, no. You want to reuse the last-freed object, as it is most likely to still be in cache. exactly.. FILO or LIFO (last in

Re: Memory allocation performance

2008-02-03 Thread Dag-Erling Smørgrav
Julian Elischer writes: Robert Watson writes: be a good time to try to revalidate that. Basically, the goal would be to make the pcpu cache FIFO as much as possible as that maximizes the chances that the newly allocated object already has lines in the

Re: Memory allocation performance

2008-02-03 Thread Julian Elischer
Dag-Erling Smørgrav wrote: Julian Elischer writes: Robert Watson writes: be a good time to try to revalidate that. Basically, the goal would be to make the pcpu cache FIFO as much as possible as that maximizes the chances that the newly allocated object

Re: Memory allocation performance

2008-02-03 Thread Alexander Motin
Kris Kennaway wrote: You can look at the raw output from pmcstat, which is a collection of instruction pointers that you can feed to e.g. addr2line to find out exactly where in those functions the events are occurring. This will often help to track down the precise causes. Thanks to the

Re: Memory allocation performance

2008-02-03 Thread Bruce Evans
On Mon, 4 Feb 2008, Alexander Motin wrote: Kris Kennaway wrote: You can look at the raw output from pmcstat, which is a collection of instruction pointers that you can feed to e.g. addr2line to find out exactly where in those functions the events are occurring. This will often help to track

Re: Memory allocation performance

2008-02-02 Thread Alexander Motin
Robert Watson wrote: I guess the question is: where are the cycles going? Are we suffering excessive cache misses in managing the slabs? Are you effectively cycling through objects rather than using a smaller set that fits better in the cache? In my test setup only several objects from

Re: Memory allocation performance

2008-02-02 Thread Robert Watson
On Sat, 2 Feb 2008, Alexander Motin wrote: Robert Watson wrote: I guess the question is: where are the cycles going? Are we suffering excessive cache misses in managing the slabs? Are you effectively cycling through objects rather than using a smaller set that fits better in the cache?

Re: Memory allocation performance

2008-02-02 Thread Joseph Koshy
I have tried it for measuring the number of instructions. But I doubt that instructions are the right counter for performance measurement, as different instructions may have very different execution times depending on many factors, like cache misses and current memory traffic. I have tried to

Re: Memory allocation performance

2008-02-02 Thread Alexander Motin
Joseph Koshy wrote: You cannot sample with the TSC since the TSC does not interrupt the CPU. For CPU cycles you would probably want to use p4-global-power-events; see pmc(3). Thanks, I have already found this. The only problem was that by default it counts cycles only when both logical

Re: Memory allocation performance

2008-02-02 Thread Joseph Koshy
Thanks, I have already found this. The only problem was that by default it counts cycles only when both logical cores are active while one of my cores was halted. Did you try the 'active' event modifier: p4-global-power-events,active=any? Sampling on this, the profiler showed results close to

Re: Memory allocation performance

2008-02-02 Thread Peter Jeremy
On Sat, Feb 02, 2008 at 11:31:31AM +0200, Alexander Motin wrote: To check UMA dependency I have made a trivial one-element cache which in my test case allows to avoid two for four allocations per packet. You should be able to implement this lockless using atomic(9). I haven't verified it, but
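
A lockless one-element cache of the kind suggested above might look roughly like the following. This is only a userland sketch: it uses C11 <stdatomic.h> as a stand-in for the kernel's atomic(9) primitives, and the names one_cache, one_cache_get and one_cache_put are hypothetical, not taken from the thread or from netgraph.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical one-element cache: a single atomic pointer slot.
 * NULL means the slot is empty.  Illustration only; this is not
 * the code discussed in the thread.
 */
struct one_cache {
	_Atomic(void *) item;
};

/*
 * Take the cached object, if any.  The exchange empties the slot and
 * hands back its previous contents in one atomic step, so two callers
 * can never receive the same object.  Returns NULL when the slot is
 * empty; the caller then falls back to the regular allocator.
 */
static void *
one_cache_get(struct one_cache *c)
{
	return atomic_exchange(&c->item, (void *)NULL);
}

/*
 * Try to stash obj instead of freeing it.  Succeeds only if the slot
 * is currently empty.  Returns 1 on success, 0 if the slot is
 * occupied; in the latter case the caller frees obj normally.
 */
static int
one_cache_put(struct one_cache *c, void *obj)
{
	void *expected = NULL;

	assert(obj != NULL);
	return atomic_compare_exchange_strong(&c->item, &expected, obj);
}
```

With a single slot there is no linked structure to corrupt, so the usual ABA concern of lock-free free lists does not arise here; the cost is that at most one object is ever cached.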

Re: Memory allocation performance

2008-02-02 Thread Alexander Motin
Peter Jeremy wrote: On Sat, Feb 02, 2008 at 11:31:31AM +0200, Alexander Motin wrote: To check UMA dependency I have made a trivial one-element cache which in my test case allows to avoid two for four allocations per packet. You should be able to implement this lockless using atomic(9). I

Re: Memory allocation performance

2008-02-02 Thread Peter Jeremy
On Sat, Feb 02, 2008 at 09:56:42PM +0200, Alexander Motin wrote: Peter Jeremy wrote: On Sat, Feb 02, 2008 at 11:31:31AM +0200, Alexander Motin wrote: To check UMA dependency I have made a trivial one-element cache which in my test case allows to avoid two for four allocations per packet.

Re: Memory allocation performance

2008-02-02 Thread Alexander Motin
Robert Watson wrote: Hence my request for drilling down a bit on profiling -- the question I'm asking is whether profiling shows things running or taking time that shouldn't be. I have not yet understood why it happens, but hwpmc shows a huge number of p4-resource-stalls in UMA functions:

Re: Memory allocation performance

2008-02-02 Thread Kris Kennaway
Alexander Motin wrote: Robert Watson wrote: Hence my request for drilling down a bit on profiling -- the question I'm asking is whether profiling shows things running or taking time that shouldn't be. I have not yet understood why it happens, but hwpmc shows a huge number of

Re: Memory allocation performance

2008-02-02 Thread Robert Watson
On Sat, 2 Feb 2008, Kris Kennaway wrote: Alexander Motin wrote: Robert Watson wrote: Hence my request for drilling down a bit on profiling -- the question I'm asking is whether profiling shows things running or taking time that shouldn't be. I have not yet understood why it happens,

Re: Memory allocation performance

2008-02-02 Thread Max Laier
On Sat, 2.02.2008, 23:05, Alexander Motin wrote: Robert Watson wrote: Hence my request for drilling down a bit on profiling -- the question I'm asking is whether profiling shows things running or taking time that shouldn't be. I have not yet understood why it happens, but hwpmc shows

Re: Memory allocation performance

2008-02-02 Thread Alexander Motin
Robert Watson wrote: Basically, the goal would be to make the pcpu cache FIFO as much as possible as that maximizes the chances that the newly allocated object already has lines in the cache. Why FIFO? I think LIFO (stack) should be better for this goal as the last freed object has more
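
The LIFO argument above can be illustrated with a small array-backed stack: the object freed most recently is the first one handed out again, so it is the one most likely to still have lines in the CPU cache. This is a minimal sketch, not UMA's actual per-CPU cache; the names lifo_cache, lifo_cache_put, lifo_cache_get and the CACHE_SLOTS size are assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capacity, chosen only for illustration. */
#define CACHE_SLOTS 8

/*
 * Hypothetical LIFO (stack) cache of free objects.  This is not UMA's
 * real data structure; it only shows the reuse order being discussed.
 */
struct lifo_cache {
	void	*slot[CACHE_SLOTS];
	size_t	 count;		/* number of cached objects */
};

/*
 * "Free" an object into the cache by pushing it on top of the stack.
 * Returns 1 on success, 0 if the cache is full, in which case the
 * caller would release the object to the backing allocator.
 */
static int
lifo_cache_put(struct lifo_cache *c, void *obj)
{
	assert(obj != NULL);
	if (c->count == CACHE_SLOTS)
		return 0;
	c->slot[c->count++] = obj;
	return 1;
}

/*
 * "Allocate" by popping the most recently freed object.
 * Returns NULL if the cache is empty.
 */
static void *
lifo_cache_get(struct lifo_cache *c)
{
	if (c->count == 0)
		return NULL;
	return c->slot[--c->count];
}
```

A FIFO cache would instead return the object that has been sitting unused the longest, which is the one least likely to still be cache-resident.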

Re: Memory allocation performance

2008-02-02 Thread Robert Watson
On Sun, 3 Feb 2008, Alexander Motin wrote: Robert Watson wrote: Basically, the goal would be to make the pcpu cache FIFO as much as possible as that maximizes the chances that the newly allocated object already has lines in the cache. Why FIFO? I think LIFO (stack) should be better for this

Re: Memory allocation performance

2008-02-02 Thread Julian Elischer
Robert Watson wrote: be a good time to try to revalidate that. Basically, the goal would be to make the pcpu cache FIFO as much as possible as that maximizes the you mean FILO or LIFO right? chances that the newly allocated object already has lines in the cache. It's a fairly trivial

Re: Memory allocation performance

2008-02-01 Thread Kris Kennaway
Alexander Motin wrote: Kris Kennaway wrote: Alexander Motin wrote: Alexander Motin wrote: While profiling netgraph operation on a UP HEAD router I have found that a huge amount of time is spent on memory allocation/deallocation: I forgot to mention that it was a mostly GENERIC kernel, just

Re: Memory allocation performance

2008-02-01 Thread Robert Watson
On Fri, 1 Feb 2008, Alexander Motin wrote: That was actually my second question. As there are only 512 items by default and they are small, I can easily preallocate them all at boot. But is that a good approach? Why can't UMA do just the same when I have created a zone with a specified element

Re: Memory allocation performance

2008-02-01 Thread Alexander Motin
Hi. Robert Watson wrote: It would be very helpful if you could try doing some analysis with hwpmc -- high resolution profiling is of increasingly limited utility with modern CPUs, where even a high frequency timer won't run very often. It's also quite subject to cycle events that align with

Re: Memory allocation performance

2008-02-01 Thread Bruce Evans
On Fri, 1 Feb 2008, Alexander Motin wrote: Robert Watson wrote: It would be very helpful if you could try doing some analysis with hwpmc -- high resolution profiling is of increasingly limited utility with modern You mean of increasingly greater utility with modern CPUs. Low resolution

Memory allocation performance

2008-01-31 Thread Alexander Motin
Hi. While profiling netgraph operation on a UP HEAD router I have found that a huge amount of time is spent on memory allocation/deallocation: 0.14 0.05 132119/545292 ip_forward cycle 1 [12] 0.14 0.05 133127/545292 fxp_add_rfabuf [18] 0.27 0.10

Re: Memory allocation performance

2008-01-31 Thread Alexander Motin
Alexander Motin wrote: While profiling netgraph operation on a UP HEAD router I have found that a huge amount of time is spent on memory allocation/deallocation: I forgot to mention that it was a mostly GENERIC kernel, just built without INVARIANTS, WITNESS and SMP but with 'profile 2'.

Re: Memory allocation performance

2008-01-31 Thread Kris Kennaway
Alexander Motin wrote: Hi. While profiling netgraph operation on a UP HEAD router I have found that a huge amount of time is spent on memory allocation/deallocation: 0.14 0.05 132119/545292 ip_forward cycle 1 [12] 0.14 0.05 133127/545292 fxp_add_rfabuf [18]

Re: Memory allocation performance

2008-01-31 Thread Julian Elischer
Alexander Motin wrote: Hi. While profiling netgraph operation on a UP HEAD router I have found that a huge amount of time is spent on memory allocation/deallocation: 0.14 0.05 132119/545292 ip_forward cycle 1 [12] 0.14 0.05 133127/545292 fxp_add_rfabuf [18]

Re: Memory allocation performance

2008-01-31 Thread Kris Kennaway
Alexander Motin wrote: Alexander Motin wrote: While profiling netgraph operation on a UP HEAD router I have found that a huge amount of time is spent on memory allocation/deallocation: I forgot to mention that it was a mostly GENERIC kernel, just built without INVARIANTS, WITNESS and SMP but

Re: Memory allocation performance

2008-01-31 Thread Alexander Motin
Kris Kennaway wrote: Alexander Motin wrote: Alexander Motin wrote: While profiling netgraph operation on a UP HEAD router I have found that a huge amount of time is spent on memory allocation/deallocation: I forgot to mention that it was a mostly GENERIC kernel, just built without INVARIANTS,

Re: Memory allocation performance

2008-01-31 Thread Alexander Motin
Julian Elischer wrote: Alexander Motin wrote: Hi. While profiling netgraph operation on a UP HEAD router I have found that a huge amount of time is spent on memory allocation/deallocation: 0.14 0.05 132119/545292 ip_forward cycle 1 [12] 0.14 0.05 133127/545292

Re: Memory allocation performance

2008-01-31 Thread Julian Elischer
Alexander Motin wrote: Julian Elischer wrote: Alexander Motin wrote: Hi. While profiling netgraph operation on a UP HEAD router I have found that a huge amount of time is spent on memory allocation/deallocation: 0.14 0.05 132119/545292 ip_forward cycle 1 [12] 0.14 0.05