Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()

2006-08-28 Thread Segher Boessenkool

I would expect to see dcbtst in here, no?


Nah, dcbtst is expensive (it causes some non-cheap bus
transactions) and not needed at all; dcbz is much better
(but can only be used if you kill the whole cache line;
which is true here).


Both functions (copy and clear) could stand a little loop unrolling.


ldu ; stdu ; bdnz is not the best loop possible, esp. not on
970/P4/P5.  You guys got Mac's, use Shark (go to the code browser,
cmd-shift-M, select "show 970 dispatch groups" and "show 970
details drawer").  In most cases the time spent in the loop will
be dominated by memory (cache) speed, of course, but still.

I can understand if you're not *really* trying to optimize these,  
but in
that case why do you want to add dcbz? Is there a noticeable  
performance

improvement?


Yes, dcbz is (should be) a huge improvement.


Segher


___
Xen-ppc-devel mailing list
Xen-ppc-devel@lists.xensource.com
http://lists.xensource.com/xen-ppc-devel


Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()

2006-08-28 Thread Hollis Blanchard
On Mon, 2006-08-28 at 13:20 -0400, Jimi Xenidis wrote:
> On Aug 28, 2006, at 1:10 PM, Hollis Blanchard wrote:
> > Right. As long as we're changing this code, let's include copy_page().
> >
> > ... which brings us back to the original patch a) removing the call to
> > mambo_memcpy(), and b) missing dcbtst.
> 
> I think we should use the linux copy4kpage() strategy for this.. no  
> sense reinventing the wheel.

I'm fine with copying whatever Linux does here. At least they've
unrolled it. :) It's just too bad it's 100% comment-free. :(

> We should also consider sending a linux patch upstream that has this  
> function use dcbz or find out why they did not.

Sure.

-- 
Hollis Blanchard
IBM Linux Technology Center


___
Xen-ppc-devel mailing list
Xen-ppc-devel@lists.xensource.com
http://lists.xensource.com/xen-ppc-devel


Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()

2006-08-28 Thread Jimi Xenidis


On Aug 28, 2006, at 1:10 PM, Hollis Blanchard wrote:


On Mon, 2006-08-28 at 11:58 -0400, Jimi Xenidis wrote:

On Aug 28, 2006, at 11:49 AM, Hollis Blanchard wrote:


On Mon, 2006-08-28 at 11:31 -0400, Jimi Xenidis wrote:

On Aug 28, 2006, at 10:50 AM, Hollis Blanchard wrote:



Also, it looks like you've removed support for mambo_memcpy(). I
don't
use Mambo *ahem* systemsim myself, but that seems worth keeping. I
guess
you could rename the function while you're in there. :)


if we get these down to dcbz's then system performance is fine  
and am

happy to drop.


I think you're saying that if we have a clear_page() loop that uses
dcbz, systemsim performance is fine. Is that correct?

yes


Is that just because it reduces the number of loops? Using 128 byte
increments instead of 8 would reduce the iterations from 512 to 32.


According the the patch clear_page_cachable() (which we all agreed is  
just clear_page()) does look only 32 times

systemsym can to dcbz very fast not worth using the callthru.





Is that also the case for copy_page()?


No but it is not called to often, infact not at all AFAICT. I just
removed it and build fine :)
Thats not to say that there aren't any opportunities to use it.


Right. As long as we're changing this code, let's include copy_page().

... which brings us back to the original patch a) removing the call to
mambo_memcpy(), and b) missing dcbtst.


I think we should use the linux copy4kpage() strategy for this.. no  
sense reinventing the wheel.
We should also consider sending a linux patch upstream that has this  
function use dcbz or find out why they did not.


-JX



___
Xen-ppc-devel mailing list
Xen-ppc-devel@lists.xensource.com
http://lists.xensource.com/xen-ppc-devel


Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()

2006-08-28 Thread Hollis Blanchard
On Mon, 2006-08-28 at 11:58 -0400, Jimi Xenidis wrote:
> On Aug 28, 2006, at 11:49 AM, Hollis Blanchard wrote:
> 
> > On Mon, 2006-08-28 at 11:31 -0400, Jimi Xenidis wrote:
> >> On Aug 28, 2006, at 10:50 AM, Hollis Blanchard wrote:
> >>
> >>>
> >>> Also, it looks like you've removed support for mambo_memcpy(). I  
> >>> don't
> >>> use Mambo *ahem* systemsim myself, but that seems worth keeping. I
> >>> guess
> >>> you could rename the function while you're in there. :)
> >>
> >> if we get these down to dcbz's then system performance is fine and am
> >> happy to drop.
> >
> > I think you're saying that if we have a clear_page() loop that uses
> > dcbz, systemsim performance is fine. Is that correct?
> yes

Is that just because it reduces the number of loops? Using 128 byte
increments instead of 8 would reduce the iterations from 512 to 32.

> >
> > Is that also the case for copy_page()?
> 
> No but it is not called to often, infact not at all AFAICT. I just  
> removed it and build fine :)
> Thats not to say that there aren't any opportunities to use it.

Right. As long as we're changing this code, let's include copy_page().

... which brings us back to the original patch a) removing the call to
mambo_memcpy(), and b) missing dcbtst.

-- 
Hollis Blanchard
IBM Linux Technology Center


___
Xen-ppc-devel mailing list
Xen-ppc-devel@lists.xensource.com
http://lists.xensource.com/xen-ppc-devel


Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()

2006-08-28 Thread Jimi Xenidis


On Aug 28, 2006, at 11:49 AM, Hollis Blanchard wrote:


On Mon, 2006-08-28 at 11:31 -0400, Jimi Xenidis wrote:

On Aug 28, 2006, at 10:50 AM, Hollis Blanchard wrote:



Also, it looks like you've removed support for mambo_memcpy(). I  
don't

use Mambo *ahem* systemsim myself, but that seems worth keeping. I
guess
you could rename the function while you're in there. :)


if we get these down to dcbz's then system performance is fine and am
happy to drop.


I think you're saying that if we have a clear_page() loop that uses
dcbz, systemsim performance is fine. Is that correct?

yes


Is that also the case for copy_page()?


No but it is not called to often, infact not at all AFAICT. I just  
removed it and build fine :)

Thats not to say that there aren't any opportunities to use it.





___
Xen-ppc-devel mailing list
Xen-ppc-devel@lists.xensource.com
http://lists.xensource.com/xen-ppc-devel


Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()

2006-08-28 Thread Hollis Blanchard
On Mon, 2006-08-28 at 11:31 -0400, Jimi Xenidis wrote:
> On Aug 28, 2006, at 10:50 AM, Hollis Blanchard wrote:
> 
> >
> > Also, it looks like you've removed support for mambo_memcpy(). I don't
> > use Mambo *ahem* systemsim myself, but that seems worth keeping. I  
> > guess
> > you could rename the function while you're in there. :)
> 
> if we get these down to dcbz's then system performance is fine and am  
> happy to drop.

I think you're saying that if we have a clear_page() loop that uses
dcbz, systemsim performance is fine. Is that correct?

Is that also the case for copy_page()?

-- 
Hollis Blanchard
IBM Linux Technology Center


___
Xen-ppc-devel mailing list
Xen-ppc-devel@lists.xensource.com
http://lists.xensource.com/xen-ppc-devel


Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()

2006-08-28 Thread Jimi Xenidis


On Aug 28, 2006, at 10:50 AM, Hollis Blanchard wrote:



Also, it looks like you've removed support for mambo_memcpy(). I don't
use Mambo *ahem* systemsim myself, but that seems worth keeping. I  
guess

you could rename the function while you're in there. :)


if we get these down to dcbz's then system performance is fine and am  
happy to drop.

-JX


___
Xen-ppc-devel mailing list
Xen-ppc-devel@lists.xensource.com
http://lists.xensource.com/xen-ppc-devel



Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()

2006-08-28 Thread Hollis Blanchard
On Fri, 2006-08-25 at 17:48 -0400, [EMAIL PROTECTED] wrote:
> +/* assumes destination page, *dp, is cacheable */
> +static __inline__ void copy_page_cacheable(void *dp, void *sp)
> +{
> +   unsigned long dwords, dword_size;
> +
> +   dword_size = 8;
> +   dwords = (PAGE_SIZE / dword_size) - 1;
> +
> +   clear_page_cacheable(dp);
> +
> +   __asm__ __volatile__(
> +   "mtctr  %2  # copy_page\n\
> +   ld  %2,0(%1)\n\
> +   std %2,0(%0)\n\
> +1: ldu %2,8(%1)\n\
> +   stdu%2,8(%0)\n\
> +   bdnz1b"
> +   : /* no result */
> +   : "r" (dp), "r" (sp), "r" (dwords)
> +   : "%ctr", "memory");
> +} 

I would expect to see dcbtst in here, no?

Both functions (copy and clear) could stand a little loop unrolling.

I can understand if you're not *really* trying to optimize these, but in
that case why do you want to add dcbz? Is there a noticeable performance
improvement?

Also, it looks like you've removed support for mambo_memcpy(). I don't
use Mambo *ahem* systemsim myself, but that seems worth keeping. I guess
you could rename the function while you're in there. :)

-- 
Hollis Blanchard
IBM Linux Technology Center


___
Xen-ppc-devel mailing list
Xen-ppc-devel@lists.xensource.com
http://lists.xensource.com/xen-ppc-devel


Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()

2006-08-27 Thread poff

> Break how? build? run? hang?

Never mind -  Turns-out to be an out-of-memory problem. 

Booting goes all way to 'Waiting for /dev to be fully populated... '
then soon starts out-of-memory info, repeatedly. Had assumed it was
caused by something wrong in dcbz code. But when I added debug code
to page_alloc.c, same problem occured. So set rma size to 128mb,
and clear_page_cacheable() now works everywhere...

___
Xen-ppc-devel mailing list
Xen-ppc-devel@lists.xensource.com
http://lists.xensource.com/xen-ppc-devel


Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()

2006-08-27 Thread Jimi Xenidis


On Aug 25, 2006, at 5:48 PM, [EMAIL PROTECTED] wrote:



note2: in page_alloc.c, one of the three clear_page() calls breaks  
the system

when changed to clear_page_cacheable().

I suspected why this was troubleseme, but I was wrong, can you give  
us a reason why you think it is?

Off to the code...
Since everything is caches I'll ignore the non-cacheable versions and  
only consider the cacheable versions.



+static __inline__ void copy_page(void *dp, void *sp)

[ignore]


+/* assumes page, *addr, is cacheable */
+static __inline__ void clear_page_cacheable(void *addr)

EVERYTHING is cacheable.
  +static __inline__ void clear_page(void *addr)

+{
+   unsigned long lines, line_size;
+
+   line_size = CACHE_LINE_SIZE;
This should really be a function defined in arch/powerpc/powerpc64/ 
ppc970.c
maybe something like cpu_cache_line_size(), or at some point we  
should prolly have a struct cpu_info that contains all this stuff  
rather than make a call for this stuff all the time.

Tho I think Hollis and I will have to debate this one :)


+/* assumes destination page, *dp, is cacheable */
+static __inline__ void copy_page_cacheable(void *dp, void *sp)

+static __inline__ void copy_page(void *dp, void *sp)

+{
+   unsigned long dwords, dword_size;
+
+   dword_size = 8;
+   dwords = (PAGE_SIZE / dword_size) - 1;
+
+   clear_page_cacheable(dp);


why the clear?
If your intent was to clear the cache line to avoid read-modify- 
write, then integrating it into the loop would have a much bigger  
effect.



+
+   __asm__ __volatile__(
+   "mtctr %2  # copy_page\n\
+   ld  %2,0(%1)\n\
+   std %2,0(%0)\n\
+1: ldu %2,8(%1)\n\
+   stdu%2,8(%0)\n\
+   bdnz1b"
+   : /* no result */
+   : "r" (dp), "r" (sp), "r" (dwords)
Nt 100% sure on this, but since, at the end of your assembler you are  
changing %0-%2 the should be "=r" in the output area or in the  
clobber list.

+   : "%ctr", "memory");
+}



-JX




___
Xen-ppc-devel mailing list
Xen-ppc-devel@lists.xensource.com
http://lists.xensource.com/xen-ppc-devel


Re: [XenPPC] PATCH: Inline assembler for clear_page() and copy_page()

2006-08-27 Thread Jimi Xenidis

On Aug 25, 2006, at 5:48 PM, [EMAIL PROTECTED] wrote:


Following code includes assembler versions of clear_page_cacheable 
(), by Xenidis,
copy_page(), and copy_page_cacheable(). The 'cacheable' versions  
use 'dcbz' for

clearing cache lines; the target page is assumed to be cacheable.


On PowerPC, the cache is always on in RealMode and we have to do  
backflips (see io.S) to perform cache-inhibited actions.


This code has been debugged with a small application program on  
JS21. Also code
has been incorporated into a xen tree and runs on JS21 (though  
copy_page() is not

being used).

note: in some documentation, dcbz is for 32 byte cache lines, dcbzl  
for 128


Dan, we (like linux) force all our cache line operations to be 128  
bytes via the HID setting in ppc970.c where we set hid5.  I see not  
that that could be better documented, I got lazy since the user  
manual for 970 is now public.


However, the crossbuild assembler does not recognize dcbzl. The  
native assembler
(for building the test application) would accept either dcbz or  
dcbzl, and in either

case the 128 byte cache line was cleared.


the instruction is reserved for apple who are the only SW vendor that  
currently support and programming model for 32 byte cache operations.




note2: in page_alloc.c, one of the three clear_page() calls breaks  
the system

when changed to clear_page_cacheable().


Break how? build? run? hang?

-JX


___
Xen-ppc-devel mailing list
Xen-ppc-devel@lists.xensource.com
http://lists.xensource.com/xen-ppc-devel