Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()
I would expect to see dcbtst in here, no? Nah, dcbtst is expensive (it causes some non-cheap bus transactions) and not needed at all; dcbz is much better (but can only be used if you kill the whole cache line; which is true here). Both functions (copy and clear) could stand a little loop unrolling. ldu ; stdu ; bdnz is not the best loop possible, esp. not on 970/P4/P5. You guys got Mac's, use Shark (go to the code browser, cmd-shift-M, select "show 970 dispatch groups" and "show 970 details drawer"). In most cases the time spent in the loop will be dominated by memory (cache) speed, of course, but still. I can understand if you're not *really* trying to optimize these, but in that case why do you want to add dcbz? Is there a noticeable performance improvement? Yes, dcbz is (should be) a huge improvement. Segher ___ Xen-ppc-devel mailing list Xen-ppc-devel@lists.xensource.com http://lists.xensource.com/xen-ppc-devel
Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()
On Mon, 2006-08-28 at 13:20 -0400, Jimi Xenidis wrote: > On Aug 28, 2006, at 1:10 PM, Hollis Blanchard wrote: > > Right. As long as we're changing this code, let's include copy_page(). > > > > ... which brings us back to the original patch a) removing the call to > > mambo_memcpy(), and b) missing dcbtst. > > I think we should use the linux copy4kpage() strategy for this.. no > sense reinventing the wheel. I'm fine with copying whatever Linux does here. At least they've unrolled it. :) It's just too bad it's 100% comment-free. :( > We should also consider sending a linux patch upstream that has this > function use dcbz or find out why they did not. Sure. -- Hollis Blanchard IBM Linux Technology Center ___ Xen-ppc-devel mailing list Xen-ppc-devel@lists.xensource.com http://lists.xensource.com/xen-ppc-devel
Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()
On Aug 28, 2006, at 1:10 PM, Hollis Blanchard wrote: On Mon, 2006-08-28 at 11:58 -0400, Jimi Xenidis wrote: On Aug 28, 2006, at 11:49 AM, Hollis Blanchard wrote: On Mon, 2006-08-28 at 11:31 -0400, Jimi Xenidis wrote: On Aug 28, 2006, at 10:50 AM, Hollis Blanchard wrote: Also, it looks like you've removed support for mambo_memcpy(). I don't use Mambo *ahem* systemsim myself, but that seems worth keeping. I guess you could rename the function while you're in there. :) if we get these down to dcbz's then system performance is fine and am happy to drop. I think you're saying that if we have a clear_page() loop that uses dcbz, systemsim performance is fine. Is that correct? yes Is that just because it reduces the number of loops? Using 128 byte increments instead of 8 would reduce the iterations from 512 to 32. According the the patch clear_page_cachable() (which we all agreed is just clear_page()) does look only 32 times systemsym can to dcbz very fast not worth using the callthru. Is that also the case for copy_page()? No but it is not called to often, infact not at all AFAICT. I just removed it and build fine :) Thats not to say that there aren't any opportunities to use it. Right. As long as we're changing this code, let's include copy_page(). ... which brings us back to the original patch a) removing the call to mambo_memcpy(), and b) missing dcbtst. I think we should use the linux copy4kpage() strategy for this.. no sense reinventing the wheel. We should also consider sending a linux patch upstream that has this function use dcbz or find out why they did not. -JX ___ Xen-ppc-devel mailing list Xen-ppc-devel@lists.xensource.com http://lists.xensource.com/xen-ppc-devel
Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()
On Mon, 2006-08-28 at 11:58 -0400, Jimi Xenidis wrote: > On Aug 28, 2006, at 11:49 AM, Hollis Blanchard wrote: > > > On Mon, 2006-08-28 at 11:31 -0400, Jimi Xenidis wrote: > >> On Aug 28, 2006, at 10:50 AM, Hollis Blanchard wrote: > >> > >>> > >>> Also, it looks like you've removed support for mambo_memcpy(). I > >>> don't > >>> use Mambo *ahem* systemsim myself, but that seems worth keeping. I > >>> guess > >>> you could rename the function while you're in there. :) > >> > >> if we get these down to dcbz's then system performance is fine and am > >> happy to drop. > > > > I think you're saying that if we have a clear_page() loop that uses > > dcbz, systemsim performance is fine. Is that correct? > yes Is that just because it reduces the number of loops? Using 128 byte increments instead of 8 would reduce the iterations from 512 to 32. > > > > Is that also the case for copy_page()? > > No but it is not called to often, infact not at all AFAICT. I just > removed it and build fine :) > Thats not to say that there aren't any opportunities to use it. Right. As long as we're changing this code, let's include copy_page(). ... which brings us back to the original patch a) removing the call to mambo_memcpy(), and b) missing dcbtst. -- Hollis Blanchard IBM Linux Technology Center ___ Xen-ppc-devel mailing list Xen-ppc-devel@lists.xensource.com http://lists.xensource.com/xen-ppc-devel
Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()
On Aug 28, 2006, at 11:49 AM, Hollis Blanchard wrote: On Mon, 2006-08-28 at 11:31 -0400, Jimi Xenidis wrote: On Aug 28, 2006, at 10:50 AM, Hollis Blanchard wrote: Also, it looks like you've removed support for mambo_memcpy(). I don't use Mambo *ahem* systemsim myself, but that seems worth keeping. I guess you could rename the function while you're in there. :) if we get these down to dcbz's then system performance is fine and am happy to drop. I think you're saying that if we have a clear_page() loop that uses dcbz, systemsim performance is fine. Is that correct? yes Is that also the case for copy_page()? No but it is not called to often, infact not at all AFAICT. I just removed it and build fine :) Thats not to say that there aren't any opportunities to use it. ___ Xen-ppc-devel mailing list Xen-ppc-devel@lists.xensource.com http://lists.xensource.com/xen-ppc-devel
Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()
On Mon, 2006-08-28 at 11:31 -0400, Jimi Xenidis wrote: > On Aug 28, 2006, at 10:50 AM, Hollis Blanchard wrote: > > > > > Also, it looks like you've removed support for mambo_memcpy(). I don't > > use Mambo *ahem* systemsim myself, but that seems worth keeping. I > > guess > > you could rename the function while you're in there. :) > > if we get these down to dcbz's then system performance is fine and am > happy to drop. I think you're saying that if we have a clear_page() loop that uses dcbz, systemsim performance is fine. Is that correct? Is that also the case for copy_page()? -- Hollis Blanchard IBM Linux Technology Center ___ Xen-ppc-devel mailing list Xen-ppc-devel@lists.xensource.com http://lists.xensource.com/xen-ppc-devel
Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()
On Aug 28, 2006, at 10:50 AM, Hollis Blanchard wrote: Also, it looks like you've removed support for mambo_memcpy(). I don't use Mambo *ahem* systemsim myself, but that seems worth keeping. I guess you could rename the function while you're in there. :) if we get these down to dcbz's then system performance is fine and am happy to drop. -JX ___ Xen-ppc-devel mailing list Xen-ppc-devel@lists.xensource.com http://lists.xensource.com/xen-ppc-devel
Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()
On Fri, 2006-08-25 at 17:48 -0400, [EMAIL PROTECTED] wrote: > +/* assumes destination page, *dp, is cacheable */ > +static __inline__ void copy_page_cacheable(void *dp, void *sp) > +{ > + unsigned long dwords, dword_size; > + > + dword_size = 8; > + dwords = (PAGE_SIZE / dword_size) - 1; > + > + clear_page_cacheable(dp); > + > + __asm__ __volatile__( > + "mtctr %2 # copy_page\n\ > + ld %2,0(%1)\n\ > + std %2,0(%0)\n\ > +1: ldu %2,8(%1)\n\ > + stdu%2,8(%0)\n\ > + bdnz1b" > + : /* no result */ > + : "r" (dp), "r" (sp), "r" (dwords) > + : "%ctr", "memory"); > +} I would expect to see dcbtst in here, no? Both functions (copy and clear) could stand a little loop unrolling. I can understand if you're not *really* trying to optimize these, but in that case why do you want to add dcbz? Is there a noticeable performance improvement? Also, it looks like you've removed support for mambo_memcpy(). I don't use Mambo *ahem* systemsim myself, but that seems worth keeping. I guess you could rename the function while you're in there. :) -- Hollis Blanchard IBM Linux Technology Center ___ Xen-ppc-devel mailing list Xen-ppc-devel@lists.xensource.com http://lists.xensource.com/xen-ppc-devel
Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()
> Break how? build? run? hang? Never mind - Turns-out to be an out-of-memory problem. Booting goes all way to 'Waiting for /dev to be fully populated... ' then soon starts out-of-memory info, repeatedly. Had assumed it was caused by something wrong in dcbz code. But when I added debug code to page_alloc.c, same problem occured. So set rma size to 128mb, and clear_page_cacheable() now works everywhere... ___ Xen-ppc-devel mailing list Xen-ppc-devel@lists.xensource.com http://lists.xensource.com/xen-ppc-devel
Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()
On Aug 25, 2006, at 5:48 PM, [EMAIL PROTECTED] wrote: note2: in page_alloc.c, one of the three clear_page() calls breaks the system when changed to clear_page_cacheable(). I suspected why this was troubleseme, but I was wrong, can you give us a reason why you think it is? Off to the code... Since everything is caches I'll ignore the non-cacheable versions and only consider the cacheable versions. +static __inline__ void copy_page(void *dp, void *sp) [ignore] +/* assumes page, *addr, is cacheable */ +static __inline__ void clear_page_cacheable(void *addr) EVERYTHING is cacheable. +static __inline__ void clear_page(void *addr) +{ + unsigned long lines, line_size; + + line_size = CACHE_LINE_SIZE; This should really be a function defined in arch/powerpc/powerpc64/ ppc970.c maybe something like cpu_cache_line_size(), or at some point we should prolly have a struct cpu_info that contains all this stuff rather than make a call for this stuff all the time. Tho I think Hollis and I will have to debate this one :) +/* assumes destination page, *dp, is cacheable */ +static __inline__ void copy_page_cacheable(void *dp, void *sp) +static __inline__ void copy_page(void *dp, void *sp) +{ + unsigned long dwords, dword_size; + + dword_size = 8; + dwords = (PAGE_SIZE / dword_size) - 1; + + clear_page_cacheable(dp); why the clear? If your intent was to clear the cache line to avoid read-modify- write, then integrating it into the loop would have a much bigger effect. + + __asm__ __volatile__( + "mtctr %2 # copy_page\n\ + ld %2,0(%1)\n\ + std %2,0(%0)\n\ +1: ldu %2,8(%1)\n\ + stdu%2,8(%0)\n\ + bdnz1b" + : /* no result */ + : "r" (dp), "r" (sp), "r" (dwords) Nt 100% sure on this, but since, at the end of your assembler you are changing %0-%2 the should be "=r" in the output area or in the clobber list. + : "%ctr", "memory"); +} -JX ___ Xen-ppc-devel mailing list Xen-ppc-devel@lists.xensource.com http://lists.xensource.com/xen-ppc-devel
Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()
On Aug 25, 2006, at 5:48 PM, [EMAIL PROTECTED] wrote: Following code includes assembler versions of clear_page_cacheable (), by Xenidis, copy_page(), and copy_page_cacheable(). The 'cacheable' versions use 'dcbz' for clearing cache lines; the target page is assumed to be cacheable. On PowerPC, the cache is always on in RealMode and we have to do backflips (see io.S) to perform cache-inhibited actions. This code has been debugged with a small application program on JS21. Also code has been incorporated into a xen tree and runs on JS21 (though copy_page() is not being used). note: in some documentation, dcbz is for 32 byte cache lines, dcbzl for 128 Dan, we (like linux) force all our cache line operations to be 128 bytes via the HID setting in ppc970.c where we set hid5. I see not that that could be better documented, I got lazy since the user manual for 970 is now public. However, the crossbuild assembler does not recognize dcbzl. The native assembler (for building the test application) would accept either dcbz or dcbzl, and in either case the 128 byte cache line was cleared. the instruction is reserved for apple who are the only SW vendor that currently support and programming model for 32 byte cache operations. note2: in page_alloc.c, one of the three clear_page() calls breaks the system when changed to clear_page_cacheable(). Break how? build? run? hang? -JX ___ Xen-ppc-devel mailing list Xen-ppc-devel@lists.xensource.com http://lists.xensource.com/xen-ppc-devel