Re: Small memcpy optimization

2012-11-10 Thread Stefan Fritsch

On Thu, 8 Nov 2012, Mark Kettenis wrote:

On Tuesday 21 August 2012, Stefan Fritsch wrote:

On x86, the xchg operation between reg and mem has an implicit lock
prefix, i.e. it is a relatively expensive atomic operation. This is
not needed here.


OKs, anyone?


What you say makes sense, although it might matter only on MP
(capable) systems.


True, but MP is the norm nowadays.


If you really want to make things faster, I
suppose you could change the code into something like

	pushl	%esi
	pushl	%edi
	movl	12(%esp),%edi
	movl	16(%esp),%esi


That's true. Like this (suggestions for a better label name are 
welcome):


--- locore.s
+++ locore.s
@@ -789,7 +789,7 @@ ENTRY(bcopy)
	pushl	%edi
	movl	12(%esp),%esi
	movl	16(%esp),%edi
-	movl	20(%esp),%ecx
+bcopy2:	movl	20(%esp),%ecx
	movl	%edi,%eax
	subl	%esi,%eax
	cmpl	%ecx,%eax	# overlapping?
@@ -827,13 +827,15 @@ ENTRY(bcopy)
ret

 /*
- * Emulate memcpy() by swapping the first two arguments and calling bcopy()
+ * Emulate memcpy() by loading the first two arguments in reverse order
+ * and jumping into bcopy()
  */
 ENTRY(memcpy)
-	movl	4(%esp),%ecx
-	xchg	8(%esp),%ecx
-	movl	%ecx,4(%esp)
-	jmp	_C_LABEL(bcopy)
+	pushl	%esi
+	pushl	%edi
+	movl	12(%esp),%edi
+	movl	16(%esp),%esi
+	jmp	bcopy2

 /*/
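
For illustration, the same technique as a standalone userland sketch: two
entry points that load the pointer arguments in opposite order and share a
single copy loop, which is exactly what the bcopy2 label provides.  The
symbols my_bcopy/my_memcpy and the rep movsb body are hypothetical, not the
locore.s code; only the argument handling mirrors the patch (i386 cdecl, so
after the two pushes the arguments sit at 12/16/20(%esp)).

	.text
	.globl	my_bcopy
	.globl	my_memcpy
my_bcopy:			# my_bcopy(src, dst, len)
	pushl	%esi
	pushl	%edi
	movl	12(%esp),%esi	# arg0 = src
	movl	16(%esp),%edi	# arg1 = dst
do_copy:
	movl	20(%esp),%ecx	# arg2 = len
	cld			# forward copy
	rep	movsb		# copy %ecx bytes from (%esi) to (%edi)
	popl	%edi
	popl	%esi
	ret
my_memcpy:			# my_memcpy(dst, src, len), returns dst
	pushl	%esi
	pushl	%edi
	movl	12(%esp),%edi	# arg0 = dst
	movl	16(%esp),%esi	# arg1 = src
	movl	%edi,%eax	# ANSI memcpy() returns the destination
	jmp	do_copy		# reuse the shared loop, as memcpy reuses bcopy2

The two prologues differ only in which register each pointer lands in, and
that is the only thing the jump into bcopy2 relies on.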



Re: Small memcpy optimization

2012-11-10 Thread Mark Kettenis
 Date: Sat, 10 Nov 2012 18:10:53 +0100 (CET)
 From: Stefan Fritsch s...@sfritsch.de
 
 On Thu, 8 Nov 2012, Mark Kettenis wrote:
  On Tuesday 21 August 2012, Stefan Fritsch wrote:
  On x86, the xchg operation between reg and mem has an implicit lock
  prefix, i.e. it is a relatively expensive atomic operation. This is
  not needed here.
 
  OKs, anyone?
 
  What you say makes sense, although it might matter only on MP
  (capable) systems.
 
 True, but MP is the norm nowadays.
 
  If you really want to make things faster, I
  suppose you could change the code into something like
 
 	pushl	%esi
 	pushl	%edi
 	movl	12(%esp),%edi
 	movl	16(%esp),%esi
 
 That's true. Like this (suggestions for a better label name are 
 welcome):

What about doocpy?  And I would put the label on a line of its own,
such that it stands out more.

 --- locore.s
 +++ locore.s
 @@ -789,7 +789,7 @@ ENTRY(bcopy)
 	pushl	%edi
 	movl	12(%esp),%esi
 	movl	16(%esp),%edi
 -	movl	20(%esp),%ecx
 +bcopy2:	movl	20(%esp),%ecx
 	movl	%edi,%eax
 	subl	%esi,%eax
 	cmpl	%ecx,%eax	# overlapping?
 @@ -827,13 +827,15 @@ ENTRY(bcopy)
   ret
 
   /*
 - * Emulate memcpy() by swapping the first two arguments and calling bcopy()
 + * Emulate memcpy() by loading the first two arguments in reverse order
 + * and jumping into bcopy()
*/
   ENTRY(memcpy)
 -	movl	4(%esp),%ecx
 -	xchg	8(%esp),%ecx
 -	movl	%ecx,4(%esp)
 -	jmp	_C_LABEL(bcopy)
 +	pushl	%esi
 +	pushl	%edi
 +	movl	12(%esp),%edi
 +	movl	16(%esp),%esi
 +	jmp	bcopy2
 
   
 /*/



Re: Small memcpy optimization

2012-11-08 Thread Mark Kettenis
 From: Stefan Fritsch s...@sfritsch.de
 Date: Thu, 1 Nov 2012 22:43:33 +0100
 
 On Tuesday 21 August 2012, Stefan Fritsch wrote:
  On x86, the xchg operation between reg and mem has an implicit lock
  prefix, i.e. it is a relatively expensive atomic operation. This is
  not needed here.
 
 OKs, anyone?

What you say makes sense, although it might matter only on MP
(capable) systems.  If you really want to make things faster, I
suppose you could change the code into something like

	pushl	%esi
	pushl	%edi
	movl	12(%esp),%edi
	movl	16(%esp),%esi

and then jump to the 5th instruction of bcopy().

  --- a/sys/arch/i386/i386/locore.s
  +++ b/sys/arch/i386/i386/locore.s
  @@ -802,8 +802,9 @@ ENTRY(bcopy)
 */
ENTRY(memcpy)
  	movl	4(%esp),%ecx
  -	xchg	8(%esp),%ecx
  -	movl	%ecx,4(%esp)
  +	movl	8(%esp),%eax
  +	movl	%ecx,8(%esp)
  +	movl	%eax,4(%esp)
   jmp _C_LABEL(bcopy)



Re: Small memcpy optimization

2012-11-08 Thread Ted Unangst
On Thu, Nov 01, 2012 at 22:43, Stefan Fritsch wrote:
 On Tuesday 21 August 2012, Stefan Fritsch wrote:
 On x86, the xchg operation between reg and mem has an implicit lock
 prefix, i.e. it is a relatively expensive atomic operation. This is
 not needed here.
 
 OKs, anyone?

What do other implementations do?  Benchmarks?  I'm sure
someone somewhere has spent a lot of effort making the world's fastest
memcpy.  Taking that work seems better than home-grown fiddling.



Re: Small memcpy optimization

2012-11-01 Thread Stefan Fritsch
On Tuesday 21 August 2012, Stefan Fritsch wrote:
 On x86, the xchg operation between reg and mem has an implicit lock
 prefix, i.e. it is a relatively expensive atomic operation. This is
 not needed here.

OKs, anyone?

 --- a/sys/arch/i386/i386/locore.s
 +++ b/sys/arch/i386/i386/locore.s
 @@ -802,8 +802,9 @@ ENTRY(bcopy)
*/
   ENTRY(memcpy)
 	movl	4(%esp),%ecx
 -	xchg	8(%esp),%ecx
 -	movl	%ecx,4(%esp)
 +	movl	8(%esp),%eax
 +	movl	%ecx,8(%esp)
 +	movl	%eax,4(%esp)
  jmp _C_LABEL(bcopy)



Small memcpy optimization

2012-08-21 Thread Stefan Fritsch
On x86, the xchg operation between reg and mem has an implicit lock 
prefix, i.e. it is a relatively expensive atomic operation. This is not 
needed here.


--- a/sys/arch/i386/i386/locore.s
+++ b/sys/arch/i386/i386/locore.s
@@ -802,8 +802,9 @@ ENTRY(bcopy)
  */
 ENTRY(memcpy)
	movl	4(%esp),%ecx
-	xchg	8(%esp),%ecx
-	movl	%ecx,4(%esp)
+	movl	8(%esp),%eax
+	movl	%ecx,8(%esp)
+	movl	%eax,4(%esp)
jmp _C_LABEL(bcopy)
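
For reference, the cost being removed can be seen in isolation.  The
following standalone sketch (hypothetical symbols swap_xchg/swap_movs, not
kernel code) swaps a 32-bit word in memory with a register value in the two
ways discussed above: xchg with a memory operand, which the CPU treats as a
locked operation even without an explicit lock prefix, and plain loads and
stores through a scratch register.

	.text
	.globl	swap_xchg
	.globl	swap_movs
swap_xchg:			# swap_xchg(uint32_t *p, uint32_t v)
	movl	4(%esp),%ecx	# p
	movl	8(%esp),%eax	# v
	xchgl	%eax,(%ecx)	# reg<->mem exchange: implicitly locked, atomic
	ret			# returns old *p in %eax
swap_movs:			# swap_movs(uint32_t *p, uint32_t v)
	movl	4(%esp),%ecx	# p
	movl	8(%esp),%edx	# v
	movl	(%ecx),%eax	# plain load of the old value
	movl	%edx,(%ecx)	# plain store of the new value, no bus lock
	ret			# returns old *p in %eax, not atomically

The memcpy stub never needed the atomicity: nothing else observes its own
stack slots, so the three plain movl instructions in the diff do the same
job without a locked memory cycle.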