Re: ppc44x - how do i optimize driver for tlb hits

2010-10-03 Thread Benjamin Herrenschmidt
On Sun, 2010-10-03 at 14:13 -0500, Ayman El-Khashab wrote: > On Sat, Sep 25, 2010 at 08:11:04AM +1000, Benjamin Herrenschmidt wrote: > > On Fri, 2010-09-24 at 08:08 -0500, Ayman El-Khashab wrote: > > > > > > I suppose another option is to use the kernel profiling option I > > > always see but

Re: ppc44x - how do i optimize driver for tlb hits

2010-10-03 Thread Ayman El-Khashab
On Sat, Sep 25, 2010 at 08:11:04AM +1000, Benjamin Herrenschmidt wrote: > On Fri, 2010-09-24 at 08:08 -0500, Ayman El-Khashab wrote: > > > > I suppose another option is to use the kernel profiling option I > > always see but have never used. Is that a viable option to figure out > > what is h

Re: ppc44x - how do i optimize driver for tlb hits

2010-09-24 Thread Benjamin Herrenschmidt
On Fri, 2010-09-24 at 08:08 -0500, Ayman El-Khashab wrote: > > I suppose another option is to use the kernel profiling option I > always see but have never used. Is that a viable option to figure out > what is happening here? With perf and stochastic sampling? If you sample fast enough...

Re: ppc44x - how do i optimize driver for tlb hits

2010-09-24 Thread Ayman El-Khashab
On Fri, Sep 24, 2010 at 06:30:34AM -0400, Josh Boyer wrote: > On Fri, Sep 24, 2010 at 02:43:52PM +1000, Benjamin Herrenschmidt wrote: > >> The DMA is what I use in the "real world case" to get data into and out > >> of these buffers. However, I can disable the DMA completely and do only > >> the

Re: ppc44x - how do i optimize driver for tlb hits

2010-09-24 Thread Josh Boyer
On Fri, Sep 24, 2010 at 02:43:52PM +1000, Benjamin Herrenschmidt wrote: >> The DMA is what I use in the "real world case" to get data into and out >> of these buffers. However, I can disable the DMA completely and do only >> the kmalloc. In this case I still see the same poor performance. My >>

Re: ppc44x - how do i optimize driver for tlb hits

2010-09-23 Thread Benjamin Herrenschmidt
> > No. The first pinned entry (0...256M) is inserted by the asm code in > > head_44x.S. The code in 44x_mmu.c will later map the rest of lowmem > > (typically up to 768M but various settings can change that) using more > > 256M entries. > > Thanks Ben, appreciate all your wisdom and insight. >
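The arithmetic behind the "more 256M entries" remark can be sketched as follows. This is only an illustration of the numbers quoted in the message, not the code in head_44x.S or 44x_mmu.c; the function names are hypothetical, and it assumes lowmem is covered by whole 256 MiB entries starting at address 0.

```python
# Sketch: how many 256 MiB pinned TLB entries are needed to cover lowmem.
# Assumption: entries are laid out back to back from address 0, and a
# partially used final chunk still costs one full entry.

MIB = 1024 * 1024
PINNED_ENTRY_SIZE = 256 * MIB  # entry size quoted in the message above


def pinned_entries_needed(lowmem_bytes):
    """Return the number of 256 MiB entries covering lowmem_bytes."""
    if lowmem_bytes <= 0:
        raise ValueError("lowmem size must be positive")
    # Ceiling division without floating point.
    return -(-lowmem_bytes // PINNED_ENTRY_SIZE)


def pinned_entry_ranges(lowmem_bytes):
    """Return (start, end) byte ranges, one per pinned entry."""
    count = pinned_entries_needed(lowmem_bytes)
    return [(i * PINNED_ENTRY_SIZE, (i + 1) * PINNED_ENTRY_SIZE)
            for i in range(count)]


if __name__ == "__main__":
    # The "typically up to 768M" case from the message.
    for start, end in pinned_entry_ranges(768 * MIB):
        print(f"entry covers {start // MIB}M..{end // MIB}M")
```

For the 768M case this gives three entries: the first (0...256M) would correspond to the one set up in head_44x.S, and the remaining two to the ones added later.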

Re: ppc44x - how do i optimize driver for tlb hits

2010-09-23 Thread Ayman El-Khashab
On Fri, Sep 24, 2010 at 11:07:24AM +1000, Benjamin Herrenschmidt wrote: > On Thu, 2010-09-23 at 17:35 -0500, Ayman El-Khashab wrote: > > Anything you allocate with kmalloc() is going to be mapped by bolted > > > 256M TLB entries, so there should be no TLB misses happening in the > > > kernel case.

Re: ppc44x - how do i optimize driver for tlb hits

2010-09-23 Thread Benjamin Herrenschmidt
On Thu, 2010-09-23 at 17:35 -0500, Ayman El-Khashab wrote: > Anything you allocate with kmalloc() is going to be mapped by bolted > > 256M TLB entries, so there should be no TLB misses happening in the > > kernel case. > > > > Hi Ben, can you or somebody elaborate? I saw the pinned tlb in > 44x_

Re: ppc44x - how do i optimize driver for tlb hits

2010-09-23 Thread Ayman El-Khashab
On Fri, Sep 24, 2010 at 08:01:04AM +1000, Benjamin Herrenschmidt wrote: > On Thu, 2010-09-23 at 10:12 -0500, Ayman El-Khashab wrote: > > I've implemented a working driver on my 460EX. It allocates a couple > > of buffers of 4MB each. I have a custom memcmp algorithm in asm that > > is extremely f

Re: ppc44x - how do i optimize driver for tlb hits

2010-09-23 Thread Benjamin Herrenschmidt
On Thu, 2010-09-23 at 10:12 -0500, Ayman El-Khashab wrote: > I've implemented a working driver on my 460EX. It allocates a couple > of buffers of 4MB each. I have a custom memcmp algorithm in asm that > is extremely fast in user space, but 1/2 as fast when run on these > buffers. > > My tests ar

ppc44x - how do i optimize driver for tlb hits

2010-09-23 Thread Ayman El-Khashab
I've implemented a working driver on my 460EX. It allocates a couple of buffers of 4MB each. I have a custom memcmp algorithm in asm that is extremely fast in user space, but half as fast when run on these buffers. My tests show that the algorithm seems to be memory-bandwidth bound. My gu
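As a rough way to reason about TLB pressure for buffers like these, the sketch below counts how many pages a buffer spans at a given page size and checks whether a set of buffers could be covered by a given number of TLB entries. The 4 KiB page size and the 64-entry TLB used in the example are assumptions for illustration and are not stated in the post; the function names are hypothetical, and buffers are assumed to be page-aligned.

```python
# Sketch: estimate how many TLB entries a set of buffers would need.
# Assumptions (not from the post): page-aligned buffers, one TLB entry
# per page, and the page size / TLB size passed in by the caller.

KIB = 1024
MIB = 1024 * KIB


def pages_spanned(buffer_bytes, page_bytes):
    """Return the number of pages a page-aligned buffer occupies."""
    if buffer_bytes <= 0 or page_bytes <= 0:
        raise ValueError("sizes must be positive")
    return -(-buffer_bytes // page_bytes)  # ceiling division


def buffers_fit_in_tlb(n_buffers, buffer_bytes, page_bytes, tlb_entries):
    """True if all buffers can be mapped at once by tlb_entries entries."""
    return n_buffers * pages_spanned(buffer_bytes, page_bytes) <= tlb_entries


if __name__ == "__main__":
    # Two 4 MiB buffers, as in the post; page and TLB sizes are assumed.
    print(pages_spanned(4 * MIB, 4 * KIB))
    print(buffers_fit_in_tlb(2, 4 * MIB, 4 * KIB, tlb_entries=64))
    print(buffers_fit_in_tlb(2, 4 * MIB, 256 * MIB, tlb_entries=64))
```

Under these assumptions, small pages need far more entries than a small TLB holds, while a single large mapping covers each buffer with one entry, which is the kind of difference the thread goes on to discuss.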