Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
On Wed, 8 Sep 2010 15:05:19 +0200, Andi Kleen wrote: On Wed, 08 Sep 2010 20:57:07 +0800 Miao Xie wrote: On Wed, 8 Sep 2010 14:19:25 +0200 (cest), Andi Kleen wrote: According to the data, the length of the most copies is>=128. Thanks for the data. Large is easier to optimize than small, that's good. Could you also measure how many memsets need the backwards copy? ^^^ :-) (should be easy to add) I think memset doesn't need the backwards copy. I meant for memmove of course. Obviously memset doesn't need a backwards copy. That was just the only thing the script didn't measure because the original version didn't have memmove support. memmove total 5224252 [snip] forward copy value |-- count 0 |@ 980253 backward copy value |-- count 0 |@ 4243999 the backward copy is much more than the forward copy. Thanks Miao -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
On Wed, 08 Sep 2010 20:57:07 +0800 Miao Xie wrote: > On Wed, 8 Sep 2010 14:19:25 +0200 (cest), Andi Kleen wrote: > > > >> According to the data, the length of the most copies is>=128. > > > > Thanks for the data. Large is easier to optimize than small, that's > > good. > > > > Could you also measure how many memsets need the backwards copy? > > (should be easy to add) > > I think memset doesn't need the backwards copy. I meant for memmove of course. Obviously memset doesn't need a backwards copy. That was just the only thing the script didn't measure because the original version didn't have memmove support. Your whole thread was about making memmove faster, right? -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
On Wed, 8 Sep 2010 14:19:25 +0200 (cest), Andi Kleen wrote: According to the data, the length of the most copies is>=128. Thanks for the data. Large is easier to optimize than small, that's good. Could you also measure how many memsets need the backwards copy? (should be easy to add) I think memset doesn't need the backwards copy. The reason why memmove needs the backward copy is because the forward copy may cover the data which will be used later if the dest string intersect with the src string. But memset needn't worry about this problem. Thanks! Miao If the number is small that needs backwards then the easiest fix would be to simply call the normal memcpy in the forward case. That is for backward could also use a string instruction copy of course, just have to set the direction flag. That would be a very small code change. -Andi -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
> According to the data, the length of the most copies is >=128. Thanks for the data. Large is easier to optimize than small, that's good. Could you also measure how many memsets need the backwards copy? (should be easy to add) If the number is small that needs backwards then the easiest fix would be to simply call the normal memcpy in the forward case. That is for backward could also use a string instruction copy of course, just have to set the direction flag. That would be a very small code change. -Andi -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
Hi, Andi On Thu, 2 Sep 2010 12:40:08 +0200, Andi Kleen wrote: So I improve the generic version of memcpy and memmove, and x86_64's memmove which are implemented by byte copy. One should also add that most memmove()s and memcpy()s are actually generated by gcc as inlines (especially if you don't use the "make my code slow" option aka -Os) and don't use the fallback. The fallback depends on the gcc version and if gcc thinks the data is aligned or not. Sometimes one can get better code in the caller by making sure gcc knows the correct alignment (e.g. with suitable types) and size. This might be worth looking at for btrfs if it's really that memmove heavy. Right! But the src address and dest address is not fixed, so it is hard to tell gcc that the address is alignment or not. The problem is memmove is very inefficient in fact, and it is used at some fast path, such as: it is used to do metadata copy for filesystem, so we must improve it. I have some systemtap scripts to measure size/alignment distributions of copies on a kernel, if you have a particular workload you're interested in those could be tried. Good! Could you give me these script? ftp://firstfloor.org/pub/ak/probes/csum.m4 You need to run them through .m4 first. They don't measure memmove, but that should be easy to add. I used your script to measure size/alignment distributions of copies on a kernel when I ran the btrfs test, and got some data: memmove total 325903 length value |-- count 1 | 0 2 | 0 4 | 3 8 | 0 16 | 57062 32 |@@ 5903 64 |@@@ 7296 128 |@ 18868 256 |@ 33790 512 | 64886 1024 |@@ 98450 2048 | 39645 4096 | 0 8192 | 0 length upto 50 value |-- count 2 |0 3 |0 4 |3 5 |0 6 |0 ~ 21 |0 22 |0 23 | 24 24 |0 25 |@@ 57037 26 |0 27 |0 ~ 29 |0 30 |0 31 |1 32 |3 33 | 78 34 | 215 35 | 1865 36 | 432 37 |0 38 |0 39 |0 40 |0 41 | 130 42 |0 43 | 80 44 |0 45 |0 46 |0 47 |0 48 | 80 49 |0 50 | 1077 >50 |@ 264878 src unalignments value |-- count 0 | 23173 1 |@
Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
> So I improve the generic version of memcpy and memmove, and x86_64's memmove > which are implemented by byte copy. One should also add that most memmove()s and memcpy()s are actually generated by gcc as inlines (especially if you don't use the "make my code slow" option aka -Os) and don't use the fallback. The fallback depends on the gcc version and if gcc thinks the data is aligned or not. Sometimes one can get better code in the caller by making sure gcc knows the correct alignment (e.g. with suitable types) and size. This might be worth looking at for btrfs if it's really that memmove heavy. > > > >I have some systemtap scripts to measure size/alignment distributions of > >copies on a kernel, if you have a particular workload you're interested > >in those could be tried. > > Good! Could you give me these script? ftp://firstfloor.org/pub/ak/probes/csum.m4 You need to run them through .m4 first. They don't measure memmove, but that should be easy to add. -Andi -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
On Thu, 02 Sep 2010 10:55:58 +0200, Andi Kleen wrote: Miao Xie writes: Changes from V1 to V2: - change the version of GPL from version 2.1 to version 2 the kernel's memcpy and memmove is very inefficient. But the glibc version is quite fast, in some cases it is 10 times faster than the kernel version. So I Can you elaborate on which CPUs and with what workloads you measured that? I did this test on x86_64 box with 4 cores, and the workload is quite low, and I just do 500 bytes copy for 5,000,000 times. the attached file is my test program. The kernel memcpy is optimized for copies smaller than a page size for example (kernel very rarely does anything on larger than 4k), the glibc isn't. etc. There are various other differences. memcpy and memmove are very different. AFAIK noone has tried to optimize memmove() before because traditionally it wasn't used for anything performance critical in the kernel. Has that that changed? memcpy on the other hand while not perfect is actually quite optimized for typical workloads. Yes,the performance of memcpy on the most architecture is well, But some of memmoves are implemented by byte copy, it is quite inefficient. Unfortunately those memmove are used to modify the metadata of some filesystems, such as: btrfs. That is memmove is importent for the performance of those filesystems. So I improve the generic version of memcpy and memmove, and x86_64's memmove which are implemented by byte copy. One big difference between the kernel and glibc is that kernel is often cache cold, so you e.g. the cost of a very large code footprint memcpy/memset is harder to amortize. Microbenchmarks often leave out that crucial variable. I have some systemtap scripts to measure size/alignment distributions of copies on a kernel, if you have a particular workload you're interested in those could be tried. Good! Could you give me these script? Just copying the glibc bloat uncritical is very likely the wrong move at least. Agree! Thanks! Miao #include #include #include #include #include #include #include void get_start_time(struct timeval *tv) { do_gettimeofday(tv); } void account_time(struct timeval *stv, struct timeval *etv, int loops) { do_gettimeofday(etv); if (loops) { while (etv->tv_usec < stv->tv_usec) { etv->tv_sec--; etv->tv_usec += 100; } etv->tv_usec -= stv->tv_usec; etv->tv_sec -= stv->tv_sec; while (etv->tv_usec > 100) { etv->tv_usec -= 100; etv->tv_sec++; } printk("\tTotal loops: %d\n", loops); printk("\tTotal time: %lds%ldus\n", etv->tv_sec, etv->tv_usec); } else printk("Didn't do any loop!\n"); } char *str; int init_module(void) { struct timeval stv, etv; int loops, i; str = kmalloc(1000, GFP_KERNEL); if (!str) return 0; loops = i = 500; printk("memcpy:\n"); get_start_time(&stv); while (i--) memcpy(str + 400, str, 500); account_time(&stv, &etv, loops); i = loops; printk("\nmemmove:\n"); get_start_time(&stv); while (i--) memmove(str + 400, str, 500); account_time(&stv, &etv, loops); return 0; } void cleanup_module(void) { kfree(str); } MODULE_LICENSE("GPL");
Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
Miao Xie writes: > Changes from V1 to V2: > - change the version of GPL from version 2.1 to version 2 > > the kernel's memcpy and memmove is very inefficient. But the glibc version is > quite fast, in some cases it is 10 times faster than the kernel version. So I Can you elaborate on which CPUs and with what workloads you measured that? The kernel memcpy is optimized for copies smaller than a page size for example (kernel very rarely does anything on larger than 4k), the glibc isn't. etc. There are various other differences. memcpy and memmove are very different. AFAIK noone has tried to optimize memmove() before because traditionally it wasn't used for anything performance critical in the kernel. Has that that changed? memcpy on the other hand while not perfect is actually quite optimized for typical workloads. One big difference between the kernel and glibc is that kernel is often cache cold, so you e.g. the cost of a very large code footprint memcpy/memset is harder to amortize. Microbenchmarks often leave out that crucial variable. I have some systemtap scripts to measure size/alignment distributions of copies on a kernel, if you have a particular workload you're interested in those could be tried. Just copying the glibc bloat uncritical is very likely the wrong move at least. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH V2 1/3] lib: introduce some memory copy macros and functions
Changes from V1 to V2: - change the version of GPL from version 2.1 to version 2 the kernel's memcpy and memmove is very inefficient. But the glibc version is quite fast, in some cases it is 10 times faster than the kernel version. So I introduce some memory copy macros and functions of the glibc to improve the kernel version's performance. The strategy of the memory functions is: 1. Copy bytes until the destination pointer is aligned. 2. Copy words in unrolled loops. If the source and destination are not aligned in the same way, use word memory operations, but shift and merge two read words before writing. 3. Copy the few remaining bytes. Signed-off-by: Miao Xie --- include/linux/memcopy.h | 226 ++ lib/Makefile|3 +- lib/memcopy.c | 403 +++ 3 files changed, 631 insertions(+), 1 deletions(-) create mode 100644 include/linux/memcopy.h create mode 100644 lib/memcopy.c diff --git a/include/linux/memcopy.h b/include/linux/memcopy.h new file mode 100644 index 000..9c65ac8 --- /dev/null +++ b/include/linux/memcopy.h @@ -0,0 +1,226 @@ +/* + * memcopy.h -- definitions for memory copy functions. Generic C version. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or (at your option) + * any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General + * Public License for more details. + * + * You should have received a copy of the GNU General Public License along + * with this program; if not, write to the Free Software Foundation, Inc., + * 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. + * + * The code is derived from the GNU C Library. + * Copyright (C) 1991, 1992, 1993, 1997, 2004 Free Software Foundation, Inc. + */ +#ifndef _LINUX_MEMCOPY_H_ +#define _LINUX_MEMCOPY_H_ + +/* + * The strategy of the memory functions is: + * + * 1. Copy bytes until the destination pointer is aligned. + * + * 2. Copy words in unrolled loops. If the source and destination + * are not aligned in the same way, use word memory operations, + * but shift and merge two read words before writing. + * + * 3. Copy the few remaining bytes. + * + * This is fast on processors that have at least 10 registers for + * allocation by GCC, and that can access memory at reg+const in one + * instruction. + */ + +#include +#include +#include + +/* + * The macros defined in this file are: + * + * BYTE_COPY_FWD(dst_beg_ptr, src_beg_ptr, nbytes_to_copy) + * + * BYTE_COPY_BWD(dst_end_ptr, src_end_ptr, nbytes_to_copy) + * + * WORD_COPY_FWD(dst_beg_ptr, src_beg_ptr, nbytes_remaining, nbytes_to_copy) + * + * WORD_COPY_BWD(dst_end_ptr, src_end_ptr, nbytes_remaining, nbytes_to_copy) + * + * MERGE(old_word, sh_1, new_word, sh_2) + * + * MEM_COPY_FWD(dst_beg_ptr, src_beg_ptr, nbytes_to_copy) + * + * MEM_COPY_BWD(dst_end_ptr, src_end_ptr, nbytes_to_copy) + */ + +#define OP_T_THRESHOLD 16 + +/* + * Type to use for aligned memory operations. + * This should normally be the biggest type supported by a single load + * and store. + */ +#defineop_tunsigned long int +#define OPSIZ (sizeof(op_t)) + +/* Type to use for unaligned operations. */ +typedef unsigned char byte; + +#ifndef MERGE +# ifdef __LITTLE_ENDIAN +# define MERGE(w0, sh_1, w1, sh_2) (((w0) >> (sh_1)) | ((w1) << (sh_2))) +# elif defined(__BIG_ENDIAN) +# define MERGE(w0, sh_1, w1, sh_2) (((w0) << (sh_1)) | ((w1) >> (sh_2))) +# else +# error "Macro MERGE() hasn't defined!" +# endif +#endif + +/* + * Copy exactly NBYTES bytes from SRC_BP to DST_BP, + * without any assumptions about alignment of the pointers. + */ +#ifndef BYTE_COPY_FWD +#define BYTE_COPY_FWD(dst_bp, src_bp, nbytes)\ +do { \ + size_t __nbytes = (nbytes); \ + while (__nbytes > 0) {\ + byte __x = ((byte *) src_bp)[0]; \ + src_bp += 1; \ + __nbytes -= 1;\ + ((byte *) dst_bp)[0] = __x; \ + dst_bp += 1; \ + } \ +} while (0) +#endif + +/* + * Copy exactly NBYTES_TO_COPY bytes from SRC_END_PTR to DST_END_PTR, + * beginning at the bytes right before the pointers and continuing towards + * smaller addres