Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

2010-09-08 Thread Miao Xie

On Wed, 8 Sep 2010 15:05:19 +0200, Andi Kleen wrote:

On Wed, 08 Sep 2010 20:57:07 +0800
Miao Xie  wrote:


On Wed, 8 Sep 2010 14:19:25 +0200 (cest), Andi Kleen wrote:



According to the data, the length of the most copies is>=128.


Thanks for the data. Large is easier to optimize than small, that's
good.

Could you also measure how many memsets need the backwards copy?

^^^ :-)

(should be easy to add)


I think memset doesn't need the backwards copy.


I meant for memmove of course. Obviously memset doesn't need a backwards
copy. That was just the only thing the script didn't measure because
the original version didn't have memmove support.


memmove
total 5224252
[snip]

forward copy
value |-- count
0 |@  980253


backward copy
value |-- count
0 |@  4243999

the backward copy is much more than the forward copy.

Thanks
Miao


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

2010-09-08 Thread Andi Kleen
On Wed, 08 Sep 2010 20:57:07 +0800
Miao Xie  wrote:

> On Wed, 8 Sep 2010 14:19:25 +0200 (cest), Andi Kleen wrote:
> >
> >> According to the data, the length of the most copies is>=128.
> >
> > Thanks for the data. Large is easier to optimize than small, that's
> > good.
> >
> > Could you also measure how many memsets need the backwards copy?
> > (should be easy to add)
> 
> I think memset doesn't need the backwards copy.

I meant for memmove of course. Obviously memset doesn't need a backwards
copy. That was just the only thing the script didn't measure because
the original version didn't have memmove support.

Your whole thread was about making memmove faster, right?

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

2010-09-08 Thread Miao Xie

On Wed, 8 Sep 2010 14:19:25 +0200 (cest), Andi Kleen wrote:



According to the data, the length of the most copies is>=128.


Thanks for the data. Large is easier to optimize than small, that's good.

Could you also measure how many memsets need the backwards copy?
(should be easy to add)


I think memset doesn't need the backwards copy.
The reason why memmove needs the backward copy is because the forward copy may
cover the data which will be used later if the dest string intersect with the
src string. But memset needn't worry about this problem.

Thanks!
Miao  



If the number is small that needs backwards then the easiest fix
would be to simply call the normal memcpy in the forward case.

That is for backward could also use a string instruction copy
of course, just have to set the direction flag.

That would be a very small code change.

-Andi






--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

2010-09-08 Thread Andi Kleen

> According to the data, the length of the most copies is >=128.

Thanks for the data. Large is easier to optimize than small, that's good.

Could you also measure how many memsets need the backwards copy?
(should be easy to add)

If the number is small that needs backwards then the easiest fix
would be to simply call the normal memcpy in the forward case.

That is for backward could also use a string instruction copy
of course, just have to set the direction flag.

That would be a very small code change.

-Andi


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

2010-09-08 Thread Miao Xie

Hi, Andi

On Thu, 2 Sep 2010 12:40:08 +0200, Andi Kleen wrote:

So I improve the generic version of memcpy and memmove, and x86_64's memmove
which are implemented by byte copy.


One should also add that most memmove()s and memcpy()s are actually
generated by gcc as inlines (especially if you don't use the
"make my code slow" option aka -Os) and don't use the fallback.
The fallback depends on the gcc version and if gcc thinks the
data is aligned or not.

Sometimes one can get better code in the caller by making sure
gcc knows the correct alignment (e.g. with suitable
types) and size. This might be worth looking at for btrfs
if it's really that memmove heavy.


Right! But the src address and dest address is not fixed, so it is hard to
tell gcc that the address is alignment or not.

The problem is memmove is very inefficient in fact, and it is used at some fast 
path,
such as: it is used to do metadata copy for filesystem, so we must improve it.


I have some systemtap scripts to measure size/alignment distributions of
copies on a kernel, if you have a particular workload you're interested
in those could be tried.


Good! Could you give me these script?


ftp://firstfloor.org/pub/ak/probes/csum.m4

You need to run them through .m4 first.
They don't measure memmove, but that should be easy to add.


I used your script to measure size/alignment distributions of copies on a 
kernel when
I ran the btrfs test, and got some data:

memmove
total 325903
length
value |-- count
1 |   0
2 |   0
4 |   3
8 |   0
   16 |   57062
   32 |@@  5903
   64 |@@@ 7296
  128 |@  18868
  256 |@  33790
  512 |   64886
 1024 |@@ 98450
 2048 |   39645
 4096 |   0
 8192 |   0

length upto 50
value |-- count
2 |0
3 |0
4 |3
5 |0
6 |0
  ~
   21 |0
   22 |0
   23 |   24
   24 |0
   25 |@@  57037
   26 |0
   27 |0
  ~
   29 |0
   30 |0
   31 |1
   32 |3
   33 |   78
   34 |  215
   35 | 1865
   36 |  432
   37 |0
   38 |0
   39 |0
   40 |0
   41 |  130
   42 |0
   43 |   80
   44 |0
   45 |0
   46 |0
   47 |0
   48 |   80
   49 |0
   50 | 1077
  >50 |@  264878

src unalignments
value |-- count
0 |   23173
1 |@  

Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

2010-09-02 Thread Andi Kleen
> So I improve the generic version of memcpy and memmove, and x86_64's memmove
> which are implemented by byte copy.

One should also add that most memmove()s and memcpy()s are actually
generated by gcc as inlines (especially if you don't use the
"make my code slow" option aka -Os) and don't use the fallback.
The fallback depends on the gcc version and if gcc thinks the 
data is aligned or not.

Sometimes one can get better code in the caller by making sure
gcc knows the correct alignment (e.g. with suitable 
types) and size. This might be worth looking at for btrfs
if it's really that memmove heavy.

> >
> >I have some systemtap scripts to measure size/alignment distributions of
> >copies on a kernel, if you have a particular workload you're interested
> >in those could be tried.
> 
> Good! Could you give me these script?

ftp://firstfloor.org/pub/ak/probes/csum.m4

You need to run them through .m4 first.
They don't measure memmove, but that should be easy to add.

-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

2010-09-02 Thread Miao Xie

On Thu, 02 Sep 2010 10:55:58 +0200, Andi Kleen wrote:

Miao Xie  writes:


Changes from V1 to V2:
- change the version of GPL from version 2.1 to version 2

the kernel's memcpy and memmove is very inefficient. But the glibc version is
quite fast, in some cases it is 10 times faster than the kernel version. So I



Can you elaborate on which CPUs and with what workloads you measured that?


I did this test on x86_64 box with 4 cores, and the workload is quite low,
and I just do 500 bytes copy for 5,000,000 times.

the attached file is my test program.


The kernel memcpy is optimized for copies smaller than a page size
for example (kernel very rarely does anything on larger than 4k),
the glibc isn't. etc. There are various other differences.

memcpy and memmove are very different. AFAIK noone has tried
to optimize memmove() before because traditionally it wasn't
used for anything performance critical in the kernel. Has that
that changed? memcpy on the other hand while not perfect
is actually quite optimized for typical workloads.


Yes,the performance of memcpy on the most architecture is well,

But some of memmoves are implemented by byte copy, it is quite inefficient.
Unfortunately those memmove are used to modify the metadata of some filesystems,
such as: btrfs. That is memmove is importent for the performance of those 
filesystems.

So I improve the generic version of memcpy and memmove, and x86_64's memmove
which are implemented by byte copy.


One big difference between the kernel and glibc is that kernel
is often cache cold, so you e.g. the cost of a very large code footprint
memcpy/memset is harder to amortize.

Microbenchmarks often leave out that crucial variable.

I have some systemtap scripts to measure size/alignment distributions of
copies on a kernel, if you have a particular workload you're interested
in those could be tried.


Good! Could you give me these script?


Just copying the glibc bloat uncritical is very likely
the wrong move at least.


Agree!

Thanks!
Miao
#include 
#include 
#include 
#include 
#include 
#include 
#include 

void get_start_time(struct timeval *tv)
{
	do_gettimeofday(tv);
}

void account_time(struct timeval *stv, struct timeval *etv, int loops)
{
	do_gettimeofday(etv);

	if (loops) {
		while (etv->tv_usec < stv->tv_usec) {
			etv->tv_sec--;
			etv->tv_usec += 100;
		}

		etv->tv_usec -= stv->tv_usec;
		etv->tv_sec -= stv->tv_sec;

		while (etv->tv_usec > 100) {
			etv->tv_usec -= 100;
			etv->tv_sec++;
		}

		printk("\tTotal loops: %d\n", loops);
		printk("\tTotal time: %lds%ldus\n", etv->tv_sec, etv->tv_usec);
	} else
		printk("Didn't do any loop!\n");

}

char *str;

int init_module(void)
{
	struct timeval stv, etv;
	int loops, i;

	str = kmalloc(1000, GFP_KERNEL);
	if (!str)
		return 0;
	loops = i = 500;

	printk("memcpy:\n");
	get_start_time(&stv);
	while (i--)
		memcpy(str + 400, str, 500);
	account_time(&stv, &etv, loops);

	i = loops;
	printk("\nmemmove:\n");
	get_start_time(&stv);
	while (i--)
		memmove(str + 400, str, 500);
	account_time(&stv, &etv, loops);

	return 0;
}

void cleanup_module(void)
{
	kfree(str);
}

MODULE_LICENSE("GPL");


Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions

2010-09-02 Thread Andi Kleen
Miao Xie  writes:

> Changes from V1 to V2:
> - change the version of GPL from version 2.1 to version 2
>
> the kernel's memcpy and memmove is very inefficient. But the glibc version is
> quite fast, in some cases it is 10 times faster than the kernel version. So I


Can you elaborate on which CPUs and with what workloads you measured that?

The kernel memcpy is optimized for copies smaller than a page size 
for example (kernel very rarely does anything on larger than 4k), 
the glibc isn't. etc. There are various other differences.

memcpy and memmove are very different. AFAIK noone has tried
to optimize memmove() before because traditionally it wasn't
used for anything performance critical in the kernel. Has that
that changed? memcpy on the other hand while not perfect
is actually quite optimized for typical workloads.

One big difference between the kernel and glibc is that kernel 
is often cache cold, so you e.g. the cost of a very large code footprint
memcpy/memset is harder to amortize.

Microbenchmarks often leave out that crucial variable.

I have some systemtap scripts to measure size/alignment distributions of
copies on a kernel, if you have a particular workload you're interested
in those could be tried.

Just copying the glibc bloat uncritical is very likely
the wrong move at least.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V2 1/3] lib: introduce some memory copy macros and functions

2010-09-01 Thread Miao Xie
Changes from V1 to V2:
- change the version of GPL from version 2.1 to version 2

the kernel's memcpy and memmove is very inefficient. But the glibc version is
quite fast, in some cases it is 10 times faster than the kernel version. So I
introduce some memory copy macros and functions of the glibc to improve the
kernel version's performance.

The strategy of the memory functions is:
1. Copy bytes until the destination pointer is aligned.
2. Copy words in unrolled loops.  If the source and destination are not aligned
   in the same way, use word memory operations, but shift and merge two read
   words before writing.
3. Copy the few remaining bytes.

Signed-off-by: Miao Xie 
---
 include/linux/memcopy.h |  226 ++
 lib/Makefile|3 +-
 lib/memcopy.c   |  403 +++
 3 files changed, 631 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/memcopy.h
 create mode 100644 lib/memcopy.c

diff --git a/include/linux/memcopy.h b/include/linux/memcopy.h
new file mode 100644
index 000..9c65ac8
--- /dev/null
+++ b/include/linux/memcopy.h
@@ -0,0 +1,226 @@
+/*
+ * memcopy.h -- definitions for memory copy functions.  Generic C version.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ * 
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
+ * Public License for more details.
+ * 
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * The code is derived from the GNU C Library.
+ * Copyright (C) 1991, 1992, 1993, 1997, 2004 Free Software Foundation, Inc.
+ */
+#ifndef _LINUX_MEMCOPY_H_
+#define _LINUX_MEMCOPY_H_
+
+/*
+ * The strategy of the memory functions is:
+ *
+ *   1. Copy bytes until the destination pointer is aligned.
+ *
+ *   2. Copy words in unrolled loops.  If the source and destination
+ *  are not aligned in the same way, use word memory operations,
+ *  but shift and merge two read words before writing.
+ *
+ *   3. Copy the few remaining bytes.
+ *
+ * This is fast on processors that have at least 10 registers for
+ * allocation by GCC, and that can access memory at reg+const in one
+ * instruction.
+ */
+
+#include 
+#include 
+#include 
+
+/*
+ * The macros defined in this file are:
+ *
+ * BYTE_COPY_FWD(dst_beg_ptr, src_beg_ptr, nbytes_to_copy)
+ * 
+ * BYTE_COPY_BWD(dst_end_ptr, src_end_ptr, nbytes_to_copy)
+ * 
+ * WORD_COPY_FWD(dst_beg_ptr, src_beg_ptr, nbytes_remaining, nbytes_to_copy)
+ * 
+ * WORD_COPY_BWD(dst_end_ptr, src_end_ptr, nbytes_remaining, nbytes_to_copy)
+ * 
+ * MERGE(old_word, sh_1, new_word, sh_2)
+ * 
+ * MEM_COPY_FWD(dst_beg_ptr, src_beg_ptr, nbytes_to_copy)
+ * 
+ * MEM_COPY_BWD(dst_end_ptr, src_end_ptr, nbytes_to_copy)
+ */
+
+#define OP_T_THRESHOLD 16
+
+/*
+ * Type to use for aligned memory operations.
+ * This should normally be the biggest type supported by a single load
+ * and store.
+ */
+#defineop_tunsigned long int
+#define OPSIZ  (sizeof(op_t))
+
+/* Type to use for unaligned operations.  */
+typedef unsigned char byte;
+
+#ifndef MERGE
+# ifdef __LITTLE_ENDIAN
+#  define MERGE(w0, sh_1, w1, sh_2) (((w0) >> (sh_1)) | ((w1) << (sh_2)))
+# elif defined(__BIG_ENDIAN)
+#  define MERGE(w0, sh_1, w1, sh_2) (((w0) << (sh_1)) | ((w1) >> (sh_2)))
+# else
+#  error "Macro MERGE() hasn't defined!"
+# endif
+#endif
+
+/*
+ * Copy exactly NBYTES bytes from SRC_BP to DST_BP,
+ * without any assumptions about alignment of the pointers.
+ */
+#ifndef BYTE_COPY_FWD
+#define BYTE_COPY_FWD(dst_bp, src_bp, nbytes)\
+do { \
+   size_t __nbytes = (nbytes);   \
+   while (__nbytes > 0) {\
+   byte __x = ((byte *) src_bp)[0];  \
+   src_bp += 1;  \
+   __nbytes -= 1;\
+   ((byte *) dst_bp)[0] = __x;   \
+   dst_bp += 1;  \
+   } \
+} while (0)
+#endif
+
+/*
+ * Copy exactly NBYTES_TO_COPY bytes from SRC_END_PTR to DST_END_PTR,
+ * beginning at the bytes right before the pointers and continuing towards
+ * smaller addres