Re: Small memcpy optimization

2012-11-10 Thread Stefan Fritsch

On Thu, 8 Nov 2012, Mark Kettenis wrote:

On Tuesday 21 August 2012, Stefan Fritsch wrote:

On x86, the xchg operation between reg and mem has an implicit lock
prefix, i.e. it is a relatively expensive atomic operation. This is
not needed here.


OKs, anyone?


What you say makes sense, although it might matter only on MP
(capable) systems.


True, but MP is the norm nowadays.


If you really want to make things faster, I
suppose you could change the code into something like

	pushl	%esi
	pushl	%edi
	movl	12(%esp),%edi
	movl	16(%esp),%esi


That's true. Like this (suggestions for a better label name are 
welcome):


--- locore.s
+++ locore.s
@@ -789,7 +789,7 @@ ENTRY(bcopy)
	pushl	%edi
	movl	12(%esp),%esi
	movl	16(%esp),%edi
-	movl	20(%esp),%ecx
+bcopy2:	movl	20(%esp),%ecx
	movl	%edi,%eax
	subl	%esi,%eax
	cmpl	%ecx,%eax	# overlapping?
@@ -827,13 +827,15 @@ ENTRY(bcopy)
ret

 /*
- * Emulate memcpy() by swapping the first two arguments and calling bcopy()
+ * Emulate memcpy() by loading the first two arguments in reverse order
+ * and jumping into bcopy()
  */
 ENTRY(memcpy)
-	movl	4(%esp),%ecx
-	xchg	8(%esp),%ecx
-	movl	%ecx,4(%esp)
-	jmp	_C_LABEL(bcopy)
+	pushl	%esi
+	pushl	%edi
+	movl	12(%esp),%edi
+	movl	16(%esp),%esi
+	jmp	bcopy2

 /*/
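
For illustration, the same technique as a standalone userland sketch: two
entry points that load the pointer arguments in opposite order and share a
single copy loop, which is exactly what the bcopy2 label provides.  The
symbols my_bcopy/my_memcpy and the rep movsb body are hypothetical, not the
locore.s code; only the argument handling mirrors the patch (i386 cdecl, so
after the two pushes the arguments sit at 12/16/20(%esp)).

	.text
	.globl	my_bcopy
	.globl	my_memcpy
my_bcopy:			# my_bcopy(src, dst, len)
	pushl	%esi
	pushl	%edi
	movl	12(%esp),%esi	# arg0 = src
	movl	16(%esp),%edi	# arg1 = dst
do_copy:
	movl	20(%esp),%ecx	# arg2 = len
	cld			# forward copy
	rep	movsb		# copy %ecx bytes from (%esi) to (%edi)
	popl	%edi
	popl	%esi
	ret
my_memcpy:			# my_memcpy(dst, src, len), returns dst
	pushl	%esi
	pushl	%edi
	movl	12(%esp),%edi	# arg0 = dst
	movl	16(%esp),%esi	# arg1 = src
	movl	%edi,%eax	# ANSI memcpy() returns the destination
	jmp	do_copy		# reuse the shared loop, as memcpy reuses bcopy2

The two prologues differ only in which register each pointer lands in, and
that is the only thing the jump into bcopy2 relies on.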



Re: Small memcpy optimization

2012-11-10 Thread Mark Kettenis
 Date: Sat, 10 Nov 2012 18:10:53 +0100 (CET)
 From: Stefan Fritsch s...@sfritsch.de
 
 On Thu, 8 Nov 2012, Mark Kettenis wrote:
  On Tuesday 21 August 2012, Stefan Fritsch wrote:
  On x86, the xchg operation between reg and mem has an implicit lock
  prefix, i.e. it is a relatively expensive atomic operation. This is
  not needed here.
 
  OKs, anyone?
 
  What you say makes sense, although it might matter only on MP
  (capable) systems.
 
 True, but MP is the norm nowadays.
 
  If you really want to make things faster, I
  suppose you could change the code into something like
 
 	pushl	%esi
 	pushl	%edi
 	movl	12(%esp),%edi
 	movl	16(%esp),%esi
 
 That's true. Like this (suggestions for a better label name are 
 welcome):

What about doocpy?  And I would put the label on a line of its own,
such that it stands out more.

 --- locore.s
 +++ locore.s
 @@ -789,7 +789,7 @@ ENTRY(bcopy)
 	pushl	%edi
 	movl	12(%esp),%esi
 	movl	16(%esp),%edi
 -	movl	20(%esp),%ecx
 +bcopy2:	movl	20(%esp),%ecx
 	movl	%edi,%eax
 	subl	%esi,%eax
 	cmpl	%ecx,%eax	# overlapping?
 @@ -827,13 +827,15 @@ ENTRY(bcopy)
   ret
 
   /*
 - * Emulate memcpy() by swapping the first two arguments and calling bcopy()
 + * Emulate memcpy() by loading the first two arguments in reverse order
 + * and jumping into bcopy()
*/
   ENTRY(memcpy)
 -	movl	4(%esp),%ecx
 -	xchg	8(%esp),%ecx
 -	movl	%ecx,4(%esp)
 -	jmp	_C_LABEL(bcopy)
 +	pushl	%esi
 +	pushl	%edi
 +	movl	12(%esp),%edi
 +	movl	16(%esp),%esi
 +	jmp	bcopy2
 
   
 /*/



Re: Small memcpy optimization

2012-11-08 Thread Mark Kettenis
 From: Stefan Fritsch s...@sfritsch.de
 Date: Thu, 1 Nov 2012 22:43:33 +0100
 
 On Tuesday 21 August 2012, Stefan Fritsch wrote:
  On x86, the xchg operation between reg and mem has an implicit lock
  prefix, i.e. it is a relatively expensive atomic operation. This is
  not needed here.
 
 OKs, anyone?

What you say makes sense, although it might matter only on MP
(capable) systems.  If you really want to make things faster, I
suppose you could change the code into something like

	pushl	%esi
	pushl	%edi
	movl	12(%esp),%edi
	movl	16(%esp),%esi

and then jump to the 5th instruction of bcopy().

  --- a/sys/arch/i386/i386/locore.s
  +++ b/sys/arch/i386/i386/locore.s
  @@ -802,8 +802,9 @@ ENTRY(bcopy)
 */
ENTRY(memcpy)
  	movl	4(%esp),%ecx
  -	xchg	8(%esp),%ecx
  -	movl	%ecx,4(%esp)
  +	movl	8(%esp),%eax
  +	movl	%ecx,8(%esp)
  +	movl	%eax,4(%esp)
   jmp _C_LABEL(bcopy)



Re: Small memcpy optimization

2012-11-08 Thread Ted Unangst
On Thu, Nov 01, 2012 at 22:43, Stefan Fritsch wrote:
 On Tuesday 21 August 2012, Stefan Fritsch wrote:
 On x86, the xchg operation between reg and mem has an implicit lock
 prefix, i.e. it is a relatively expensive atomic operation. This is
 not needed here.
 
 OKs, anyone?

What do other implementations do?  Benchmarks?  I'm sure
someone somewhere has spent a lot of effort making the world's fastest
memcpy.  Taking that work seems better than home-grown fiddling.



Re: Small memcpy optimization

2012-11-01 Thread Stefan Fritsch
On Tuesday 21 August 2012, Stefan Fritsch wrote:
 On x86, the xchg operation between reg and mem has an implicit lock
 prefix, i.e. it is a relatively expensive atomic operation. This is
 not needed here.

OKs, anyone?

 --- a/sys/arch/i386/i386/locore.s
 +++ b/sys/arch/i386/i386/locore.s
 @@ -802,8 +802,9 @@ ENTRY(bcopy)
*/
   ENTRY(memcpy)
 	movl	4(%esp),%ecx
 -	xchg	8(%esp),%ecx
 -	movl	%ecx,4(%esp)
 +	movl	8(%esp),%eax
 +	movl	%ecx,8(%esp)
 +	movl	%eax,4(%esp)
  jmp _C_LABEL(bcopy)



Small memcpy optimization

2012-08-21 Thread Stefan Fritsch
On x86, the xchg operation between reg and mem has an implicit lock 
prefix, i.e. it is a relatively expensive atomic operation. This is not 
needed here.


--- a/sys/arch/i386/i386/locore.s
+++ b/sys/arch/i386/i386/locore.s
@@ -802,8 +802,9 @@ ENTRY(bcopy)
  */
 ENTRY(memcpy)
	movl	4(%esp),%ecx
-	xchg	8(%esp),%ecx
-	movl	%ecx,4(%esp)
+	movl	8(%esp),%eax
+	movl	%ecx,8(%esp)
+	movl	%eax,4(%esp)
jmp _C_LABEL(bcopy)
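
For reference, the cost being removed can be seen in isolation.  The
following standalone sketch (hypothetical symbols swap_xchg/swap_movs, not
kernel code) swaps a 32-bit word in memory with a register value in the two
ways discussed above: xchg with a memory operand, which the CPU treats as a
locked operation even without an explicit lock prefix, and plain loads and
stores through a scratch register.

	.text
	.globl	swap_xchg
	.globl	swap_movs
swap_xchg:			# swap_xchg(uint32_t *p, uint32_t v)
	movl	4(%esp),%ecx	# p
	movl	8(%esp),%eax	# v
	xchgl	%eax,(%ecx)	# reg<->mem exchange: implicitly locked, atomic
	ret			# returns old *p in %eax
swap_movs:			# swap_movs(uint32_t *p, uint32_t v)
	movl	4(%esp),%ecx	# p
	movl	8(%esp),%edx	# v
	movl	(%ecx),%eax	# plain load of the old value
	movl	%edx,(%ecx)	# plain store of the new value, no bus lock
	ret			# returns old *p in %eax, not atomically

The memcpy stub never needed the atomicity: nothing else observes its own
stack slots, so the three plain movl instructions in the diff do the same
job without a locked memory cycle.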