Re: [PATCH] x86_64/lib: improve the performance of memmove
On Fri, 17 Sep 2010 08:55:18 +0800, ykzhao wrote:
> On Thu, 2010-09-16 at 15:16 +0800, Miao Xie wrote:
>> On Thu, 16 Sep 2010 08:48:25 +0200 (CEST), Andi Kleen wrote:
>>>> When the dest and the src do overlap and the memory area is large,
>>>> memmove of x86_64 is very inefficient, and it leads to bad performance,
>>>> such as btrfs's file deletion performance. This patch improves the
>>>> performance of memmove on x86_64 by using __memcpy_bwd() instead of
>>>> byte copy when doing a large memory area copy (len > 64).
>>>
>>> I still don't understand why you don't simply use a backwards
>>> string copy (with std)? That should be much simpler and
>>> hopefully as optimized for kernel copies on recent CPUs.
>>
>> But according to the comment of memcpy, some CPUs don't support the
>> "REP" instruction,
>
> Where do you find that the "REP" instruction is not supported on some
> CPUs? The comment in arch/x86/lib/memcpy_64.S only states that some CPUs
> will run faster when using the string copy instruction.

Sorry! I misread the comment.

>> so I think we must implement a backwards string copy by another method
>> for those CPUs. But that implementation is complex, so I wrote it as a
>> function -- __memcpy_bwd().
>
> Will you please look at the tip/x86 tree (mem branch)? The memory copy on
> x86_64 is already optimized.

Thanks for the reminder! It is very helpful.

Miao

> thanks.
> Yakui

-- 
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] x86_64/lib: improve the performance of memmove
On Thu, 2010-09-16 at 15:16 +0800, Miao Xie wrote:
> On Thu, 16 Sep 2010 08:48:25 +0200 (CEST), Andi Kleen wrote:
>>> When the dest and the src do overlap and the memory area is large,
>>> memmove of x86_64 is very inefficient, and it leads to bad performance,
>>> such as btrfs's file deletion performance. This patch improves the
>>> performance of memmove on x86_64 by using __memcpy_bwd() instead of
>>> byte copy when doing a large memory area copy (len > 64).
>>
>> I still don't understand why you don't simply use a backwards
>> string copy (with std)? That should be much simpler and
>> hopefully as optimized for kernel copies on recent CPUs.
>
> But according to the comment of memcpy, some CPUs don't support the "REP"
> instruction,

Where do you find that the "REP" instruction is not supported on some
CPUs? The comment in arch/x86/lib/memcpy_64.S only states that some CPUs
will run faster when using the string copy instruction.

> so I think we must implement a backwards string copy by another method
> for those CPUs. But that implementation is complex, so I wrote it as a
> function -- __memcpy_bwd().

Will you please look at the tip/x86 tree (mem branch)? The memory copy on
x86_64 is already optimized.

thanks.
Yakui
[PATCH] Btrfs: add support for mixed data+metadata block groups
There are just a few things that need to be fixed in the kernel to support
mixed data+metadata block groups.  Mostly we just need to make sure that if
we are using mixed block groups we continue to allocate mixed block groups
as we need them.  Also we need to make sure __find_space_info will find our
space info if we search for DATA or METADATA only.  Tested this with
xfstests and it works nicely.  Thanks,

Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c |   24 +++++++++++++++++++++---
 1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 91a0a41..c0a9ad3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -547,7 +547,7 @@ static struct btrfs_space_info *__find_space_info(struct btrfs_fs_info *info,
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(found, head, list) {
-		if (found->flags == flags) {
+		if (found->flags & flags) {
 			rcu_read_unlock();
 			return found;
 		}
@@ -3267,6 +3267,13 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
 	spin_unlock(&space_info->lock);
 
 	/*
+	 * If we have mixed data/metadata chunks we want to make sure we keep
+	 * allocating mixed chunks instead of individual chunks.
+	 */
+	if (space_info->flags != flags)
+		flags = space_info->flags;
+
+	/*
 	 * if we're doing a data chunk, go ahead and make sure that
 	 * we keep a reasonable number of metadata chunks allocated in the
 	 * FS as well.
@@ -4793,6 +4800,7 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 	bool found_uncached_bg = false;
 	bool failed_cluster_refill = false;
 	bool failed_alloc = false;
+	bool use_cluster = true;
 	u64 ideal_cache_percent = 0;
 	u64 ideal_cache_offset = 0;
 
@@ -4807,16 +4815,26 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 		return -ENOSPC;
 	}
 
+	/*
+	 * If the space info is for both data and metadata it means we have a
+	 * small filesystem and we can't use the clustering stuff.
+	 */
+	if ((space_info->flags & (BTRFS_BLOCK_GROUP_METADATA |
+				  BTRFS_BLOCK_GROUP_DATA)) ==
+	    (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA))
+		use_cluster = false;
+
 	if (orig_root->ref_cows || empty_size)
 		allowed_chunk_alloc = 1;
 
-	if (data & BTRFS_BLOCK_GROUP_METADATA) {
+	if (data & BTRFS_BLOCK_GROUP_METADATA && use_cluster) {
 		last_ptr = &root->fs_info->meta_alloc_cluster;
 		if (!btrfs_test_opt(root, SSD))
 			empty_cluster = 64 * 1024;
 	}
 
-	if ((data & BTRFS_BLOCK_GROUP_DATA) && btrfs_test_opt(root, SSD)) {
+	if ((data & BTRFS_BLOCK_GROUP_DATA) && use_cluster &&
+	    btrfs_test_opt(root, SSD)) {
 		last_ptr = &root->fs_info->data_alloc_cluster;
 	}
-- 
1.6.6.1
[PATCH] Btrfs: check cache->caching_ctl before returning if caching has started
With the free space disk caching we can mark the block group as started
with the caching, but we don't have a caching ctl.  This can race with
anybody else who tries to get the caching ctl before we cache (this is very
hard to do btw).  So instead check to see if cache->caching_ctl is set, and
if not return NULL.  Thanks,

Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8ef6aa6..91a0a41 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -242,6 +242,12 @@ get_caching_control(struct btrfs_block_group_cache *cache)
 		return NULL;
 	}
 
+	/* We're loading it the fast way, so we don't have a caching_ctl. */
+	if (!cache->caching_ctl) {
+		spin_unlock(&cache->lock);
+		return NULL;
+	}
+
 	ctl = cache->caching_ctl;
 	atomic_inc(&ctl->count);
 	spin_unlock(&cache->lock);
-- 
1.6.6.1
[PATCH] Btrfs-progs: add support for mixed data+metadata block groups
So a lot of crazy people (I'm looking at you, MeeGo) want to use btrfs on
phones and such with small devices.  Unfortunately the way we split out
metadata/data chunks makes space usage inefficient for volumes that are
smaller than 1 gigabyte.  So add a -M option for mixing metadata+data, and
default to this mixed mode if the filesystem is less than or equal to 1
gigabyte.  I've tested this with xfstests on a 100mb filesystem and
everything is a-ok.

Signed-off-by: Josef Bacik
---
 btrfs-vol.c  |    4 +-
 btrfs_cmds.c |   13 +-
 mkfs.c       |  121 +++++++++++++++++++++++++++++++++++++++--------------
 utils.c      |   10 ++--
 utils.h      |    2 +-
 5 files changed, 103 insertions(+), 47 deletions(-)

diff --git a/btrfs-vol.c b/btrfs-vol.c
index 8069778..7200bbc 100644
--- a/btrfs-vol.c
+++ b/btrfs-vol.c
@@ -129,7 +129,9 @@ int main(int ac, char **av)
 		exit(1);
 	}
 	if (cmd == BTRFS_IOC_ADD_DEV) {
-		ret = btrfs_prepare_device(devfd, device, 1, &dev_block_count);
+		int mixed = 0;
+
+		ret = btrfs_prepare_device(devfd, device, 1, &dev_block_count, &mixed);
 		if (ret) {
 			fprintf(stderr, "Unable to init %s\n", device);
 			exit(1);
diff --git a/btrfs_cmds.c b/btrfs_cmds.c
index 8031c58..683aec0 100644
--- a/btrfs_cmds.c
+++ b/btrfs_cmds.c
@@ -705,6 +705,7 @@ int do_add_volume(int nargs, char **args)
 		int devfd, res;
 		u64 dev_block_count = 0;
 		struct stat st;
+		int mixed = 0;
 
 		devfd = open(args[i], O_RDWR);
 		if (!devfd) {
@@ -727,7 +728,7 @@ int do_add_volume(int nargs, char **args)
 			continue;
 		}
 
-		res = btrfs_prepare_device(devfd, args[i], 1, &dev_block_count);
+		res = btrfs_prepare_device(devfd, args[i], 1, &dev_block_count, &mixed);
 		if (res) {
 			fprintf(stderr, "ERROR: Unable to init '%s'\n", args[i]);
 			close(devfd);
@@ -889,8 +890,14 @@ int do_df_filesystem(int nargs, char **argv)
 		memset(description, 0, 80);
 
 		if (flags & BTRFS_BLOCK_GROUP_DATA) {
-			snprintf(description, 5, "%s", "Data");
-			written += 4;
+			if (flags & BTRFS_BLOCK_GROUP_METADATA) {
+				snprintf(description, 15, "%s",
+					 "Data+Metadata");
+				written += 14;
+			} else {
+				snprintf(description, 5, "%s", "Data");
+				written += 4;
+			}
 		} else if (flags & BTRFS_BLOCK_GROUP_SYSTEM) {
 			snprintf(description, 7, "%s", "System");
 			written += 6;
diff --git a/mkfs.c b/mkfs.c
index 2e99b95..4610f58 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -69,7 +69,7 @@ static u64 parse_size(char *s)
 	return atol(s) * mult;
 }
 
-static int make_root_dir(struct btrfs_root *root)
+static int make_root_dir(struct btrfs_root *root, int mixed)
 {
 	struct btrfs_trans_handle *trans;
 	struct btrfs_key location;
@@ -88,30 +88,47 @@ static int make_root_dir(struct btrfs_root *root)
 				     0, BTRFS_MKFS_SYSTEM_GROUP_SIZE);
 	BUG_ON(ret);
 
-	ret = btrfs_alloc_chunk(trans, root->fs_info->extent_root,
-				&chunk_start, &chunk_size,
-				BTRFS_BLOCK_GROUP_METADATA);
-	BUG_ON(ret);
-	ret = btrfs_make_block_group(trans, root, 0,
-				     BTRFS_BLOCK_GROUP_METADATA,
-				     BTRFS_FIRST_CHUNK_TREE_OBJECTID,
-				     chunk_start, chunk_size);
-	BUG_ON(ret);
+	if (mixed) {
+		ret = btrfs_alloc_chunk(trans, root->fs_info->extent_root,
+					&chunk_start, &chunk_size,
+					BTRFS_BLOCK_GROUP_METADATA |
+					BTRFS_BLOCK_GROUP_DATA);
+		BUG_ON(ret);
+		ret = btrfs_make_block_group(trans, root, 0,
+					     BTRFS_BLOCK_GROUP_METADATA |
+					     BTRFS_BLOCK_GROUP_DATA,
+					     BTRFS_FIRST_CHUNK_TREE_OBJECTID,
+					     chunk_start, chunk_size);
+		BUG_ON(ret);
+		printf("Created a data/metadata chunk of size %llu\n", chunk_size);
+	} else {
+		ret = btrfs_alloc_chunk(trans, root->fs_info->extent_root,
+					&chunk_start, &chunk_size,
+					BTRFS_BLOCK_GROUP_METADATA);
+
[PATCH] Btrfs: stop trying to shrink delalloc if there are no inodes to reclaim
In very severe ENOSPC cases we can run out of inodes to do delalloc on,
which means we'll just keep looping trying to shrink delalloc.  Instead, if
we fail to shrink delalloc 3 times in a row, break out since we're not
likely to make any progress.  Tested this with a 100mb fs and xfstests test
13.  Before the patch it would hang the box, with the patch we get -ENOSPC
like we should.  Thanks,

Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8db9330..8ef6aa6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3327,6 +3327,7 @@ static int shrink_delalloc(struct btrfs_trans_handle *trans,
 	u64 reserved;
 	u64 max_reclaim;
 	u64 reclaimed = 0;
+	int no_reclaim = 0;
 	int pause = 1;
 	int ret;
 
@@ -3343,12 +3344,16 @@ static int shrink_delalloc(struct btrfs_trans_handle *trans,
 	while (1) {
 		ret = btrfs_start_one_delalloc_inode(root, trans ? 1 : 0);
 		if (!ret) {
+			if (no_reclaim > 2)
+				break;
+			no_reclaim++;
 			__set_current_state(TASK_INTERRUPTIBLE);
 			schedule_timeout(pause);
 			pause <<= 1;
 			if (pause > HZ / 10)
 				pause = HZ / 10;
 		} else {
+			no_reclaim = 0;
 			pause = 1;
 		}
-- 
1.6.6.1
Re: [PATCH] Add the "btrfs filesystem label" command
On Thursday, 16 September, 2010, Mike Fedyk wrote:
> On Mon, Sep 13, 2010 at 12:24 PM, Goffredo Baroncelli wrote:
[...]
>> +
>> +	if(ret != 0)
>> +	{
>> +		fprintf(stderr, "FATAL: the filesystem has to be unmounted\n");
>> +		return -2;
>> +	}
>> +	get_label_unmounted(btrfs_dev);
>> +	return 0;
>> +}
>> +
>> +
>
> Why can't the label be read while the fs is mounted?  It shouldn't
> hurt anything.  I can read the superblock on my ext3 fs while it's
> mounted...  This is what people have come to expect.

The main reason is that if the filesystem is mounted, the data read from
user space may be outdated compared to the kernel's internal data.

In the original patch (the one by Morey Roof), there was kernel-space code
that handled this case. To simplify the patch I split the code in two
steps: the user-space part and the kernel-space part. This is the
user-space portion, and because it has to work without the kernel-space
portion, the code fails explicitly when the latter would be required. I
hope to publish the second patch in the next weeks.

Regards
G.Baroncelli

-- 
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo)
Key fingerprint = 4769 7E51 5293 D36C 814E C054 BF04 F161 3DC5 0512
Re: [PATCH] x86_64/lib: improve the performance of memmove
>  void *memmove(void *dest, const void *src, size_t count)
>  {
>  	if (dest < src) {
>  		return memcpy(dest, src, count);
>  	} else {
> -		char *p = dest + count;
> -		const char *s = src + count;
> -		while (count--)
> -			*--p = *--s;
> +		return memcpy_backwards(dest, src, count);
>  	}
>  	return dest;
>  }

Er... presumably, the forward-copy case is somewhat better optimized, so it
should be preferred whenever it is safe; that is, whenever dest does not
land inside the source region [src, src + count).  Assuming that size_t can
hold a pointer, one unsigned comparison covers both the dest < src case
(the subtraction wraps to a huge value) and the disjoint dest > src case:

	if ((size_t)dest - (size_t)src >= count)
		return memcpy(dest, src, count);
	else
		return memcpy_backwards(dest, src, count);

Or, using GCC's arithmetic-on-void-* extension:

	if ((size_t)(dest - src) >= count)
		...

etc.  If src == dest, it doesn't matter which you use.  You could skip the
copy entirely, but presumably that case doesn't arise often enough to be
worth testing for:

	if ((size_t)(dest - src) >= count)
		return memcpy(dest, src, count);
	else if (dest != src)
		return memcpy_backwards(dest, src, count);
	else
		return dest;
[RFC] Removing a subvolume by an ordinary user
Hi all,

currently BTRFS doesn't allow an ordinary user to remove a subvolume (or
snapshot). I think that the reason is simple: a subvolume may contain
files/directories owned by other users. Allowing an ordinary user to remove
a subvolume means allowing an ordinary user to remove files/directories
owned by other users. And this is not good. Moreover BTRFS removes a
subvolume asynchronously, so it is not possible to return an error like
"hey, you are trying to remove a file that isn't yours! Don't do it!".

My idea is to add another ioctl that permits removing a subvolume only when
it is empty and its host directory is writable by the user... like a
directory. An option is to allow removing an empty subvolume with the
unlink(2) syscall: no extra tool is needed! This would solve a lot of
problems:

- it is consistent with the current unlink(2) behaviour;
- the kernel doesn't have to do complicated checks;
- there is no need to add another interface to wait for the space to be
  released (see the other threads: reserving an IOCTL number; other
  details).

The disadvantage is that it would be slower than the current
implementation. Of course I don't want to remove the existing interface; I
only want to add another one.

Comments? Thoughts?

Regards
G.Baroncelli
Re: [PATCH] x86_64/lib: improve the performance of memmove
On Thu, 16 Sep 2010 18:47:59 +0800, Miao Xie wrote:

On Thu, 16 Sep 2010 12:11:41 +0200, Andi Kleen wrote:
> On Thu, 16 Sep 2010 17:29:32 +0800 Miao Xie wrote:
>
> Ok was a very broken patch. Sorry, should have really done some more work
> on it. Anyways, hopefully the corrected version is good for testing.
>
> -Andi

The test result is the following:

Len    Src Unalign    Dest Unalign    Patch applied    Without Patch
---    -----------    ------------    -------------    -------------
8      0              0               0s 421117us      0s  70203us
8      0              3               0s 252622us      0s  42114us
8      0              7               0s 252663us      0s  42111us
8      3              0               0s 252666us      0s  42114us
8      3              3               0s 252667us      0s  42113us
8      3              7               0s 252667us      0s  42112us
32     0              0               0s 252672us      0s 114301us
32     0              3               0s 252676us      0s 114306us
32     0              7               0s 252663us      0s 114300us
32     3              0               0s 252661us      0s 114305us
32     3              3               0s 252663us      0s 114300us
32     3              7               0s 252668us      0s 114304us
64     0              0               0s 252672us      0s 236119us
64     0              3               0s 264671us      0s 236120us
64     0              7               0s 264702us      0s 236127us
64     3              0               0s 270701us      0s 236128us
64     3              3               0s 287236us      0s 236809us
64     3              7               0s 287257us      0s 236123us

According to the above results, the old version is better than the new one
when the memory area is small.

Len    Src Unalign    Dest Unalign    Patch applied    Without Patch
---    -----------    ------------    -------------    -------------
256    0              0               0s 281886us      0s 813660us
256    0              3               0s 332169us      0s 813645us
256    0              7               0s 342961us      0s 813639us
256    3              0               0s 305305us      0s 813634us
256    3              3               0s 386939us      0s 813638us
256    3              7               0s 370511us      0s 814335us
512    0              0               0s 310716us      1s 584677us
512    0              3               0s 456420us      1s 583353us
512    0              7               0s 468236us      1s 583248us
512    3              0               0s 493987us      1s 583659us
512    3              3               0s 588041us      1s 584294us
512    3              7               0s 605489us      1s 583650us
1024   0              0               0s 406971us      3s 123644us
1024   0              3               0s 748419us      3s 126514us
1024   0              7               0s 756153us      3s 127178us
1024   3              0               0s 854681us      3s 130013us
1024   3              3               1s  46828us      3s 140190us
1024   3              7               1s  35886us      3s 135508us

The new version is better when the memory area is large.

Thanks!
Miao

title: x86_64/lib: improve the performance of memmove

Implement the 64bit memmove backwards case using string instructions

Signed-off-by: Andi Kleen
Signed-off-by: Miao Xie
---
 arch/x86/lib/memcpy_64.S  |   29 +++++++++++++++++++++++++++++
 arch/x86/lib/memmove_64.c |    8 ++++----
 2 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index bcbcd1e..9de5e9a 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -141,3 +141,32 @@ ENDPROC(__memcpy)
 	.byte .Lmemcpy_e - .Lmemcpy_c
 	.byte .Lmemcpy_e - .Lmemcpy_c
 	.previous
+
+/*
+ * Copy memory backwards (for memmove)
+ * rdi target
+ * rsi source
+ * rdx count
+ */
+
+ENTRY(memcpy_backwards)
+	CFI_STARTPROC
+	std
+	movq %rdi, %rax
+	movl %edx, %ecx
+	addq %rdx, %rdi
+	addq %rdx, %rsi
+	leaq -8(%rdi), %rdi
+	leaq -8(%rsi), %rsi
+	shrl $3, %ecx
+	andl $7, %edx
+	rep movsq
+	addq $7, %rdi
+	addq $7, %rsi
+	movl %edx, %ecx
+	rep movsb
+	cld
+	ret
+	CFI_ENDPROC
+ENDPROC(memcpy_backwards)
+
diff --git a/arch/x86/lib/memmove_64.c b/arch/x86/lib/memmove_64.c
index 0a33909..6774fd8 100644
--- a/arch/x86/lib/memmove_64.c
+++ b/arch/x86/lib/memmove_64.c
@@ -5,16 +5,16 @@
 #include <linux/string.h>
 #include <linux/module.h>
 
+extern void * asmlinkage memcpy_backwards(void *dst, const void *src,
+					  size_t count);
+
 #undef memmove
 void *memmove(void *dest, const void *src, size_t count)
 {
 	if (dest < src) {
 		return memcpy(dest, src, count);
 	} else {
-		char *p = dest + count;
-		const char *s = src + count;
-		while (count--)
-			*--p = *--s;
+		return memcpy_backwards(dest, src, count);
 	}
 	return dest;
 }
Re: [PATCH] x86_64/lib: improve the performance of memmove
On Thu, 16 Sep 2010 12:11:41 +0200, Andi Kleen wrote:
> On Thu, 16 Sep 2010 17:29:32 +0800 Miao Xie wrote:
>
> Ok was a very broken patch. Sorry, should have really done some more work
> on it. Anyways, hopefully the corrected version is good for testing.
>
> -Andi

title: x86_64/lib: improve the performance of memmove

Implement the 64bit memmove backwards case using string instructions

Signed-off-by: Andi Kleen
Signed-off-by: Miao Xie
---
 arch/x86/lib/memcpy_64.S  |   29 +++++++++++++++++++++++++++++
 arch/x86/lib/memmove_64.c |    8 ++++----
 2 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index bcbcd1e..9de5e9a 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -141,3 +141,32 @@ ENDPROC(__memcpy)
 	.byte .Lmemcpy_e - .Lmemcpy_c
 	.byte .Lmemcpy_e - .Lmemcpy_c
 	.previous
+
+/*
+ * Copy memory backwards (for memmove)
+ * rdi target
+ * rsi source
+ * rdx count
+ */
+
+ENTRY(memcpy_backwards)
+	CFI_STARTPROC
+	std
+	movq %rdi, %rax
+	movl %edx, %ecx
+	addq %rdx, %rdi
+	addq %rdx, %rsi
+	leaq -8(%rdi), %rdi
+	leaq -8(%rsi), %rsi
+	shrl $3, %ecx
+	andl $7, %edx
+	rep movsq
+	addq $7, %rdi
+	addq $7, %rsi
+	movl %edx, %ecx
+	rep movsb
+	cld
+	ret
+	CFI_ENDPROC
+ENDPROC(memcpy_backwards)
+
diff --git a/arch/x86/lib/memmove_64.c b/arch/x86/lib/memmove_64.c
index 0a33909..6774fd8 100644
--- a/arch/x86/lib/memmove_64.c
+++ b/arch/x86/lib/memmove_64.c
@@ -5,16 +5,16 @@
 #include <linux/string.h>
 #include <linux/module.h>
 
+extern void * asmlinkage memcpy_backwards(void *dst, const void *src,
+					  size_t count);
+
 #undef memmove
 void *memmove(void *dest, const void *src, size_t count)
 {
 	if (dest < src) {
 		return memcpy(dest, src, count);
 	} else {
-		char *p = dest + count;
-		const char *s = src + count;
-		while (count--)
-			*--p = *--s;
+		return memcpy_backwards(dest, src, count);
 	}
 	return dest;
 }
-- 
1.7.0.1
Re: [PATCH] x86_64/lib: improve the performance of memmove
On Thu, 16 Sep 2010 17:29:32 +0800 Miao Xie wrote:

Ok was a very broken patch. Sorry, should have really done some more work
on it. Anyways, hopefully the corrected version is good for testing.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH] x86_64/lib: improve the performance of memmove
On Thu, 16 Sep 2010 10:40:08 +0200, Andi Kleen wrote:
> On Thu, 16 Sep 2010 15:16:31 +0800 Miao Xie wrote:
>> On Thu, 16 Sep 2010 08:48:25 +0200 (CEST), Andi Kleen wrote:
>>>> When the dest and the src do overlap and the memory area is large,
>>>> memmove of x86_64 is very inefficient, and it leads to bad
>>>> performance, such as btrfs's file deletion performance. This patch
>>>> improves the performance of memmove on x86_64 by using __memcpy_bwd()
>>>> instead of byte copy when doing a large memory area copy (len > 64).
>>>
>>> I still don't understand why you don't simply use a backwards
>>> string copy (with std)? That should be much simpler and
>>> hopefully as optimized for kernel copies on recent CPUs.
>>
>> But according to the comment of memcpy, some CPUs don't support the
>> "REP" instruction,
>
> I think you misread the comment. REP prefixes are in all x86 CPUs. On
> some very old systems it wasn't optimized very well, but it probably
> doesn't make too much sense to optimize for those. On newer CPUs in fact
> REP should usually be faster than an unrolled loop.
>
> I'm not sure how optimized the backwards copy is, but most likely it is
> optimized too.
>
> Here's an untested patch that implements backwards copy with string
> instructions. Could you run it through your test harness?

Ok, I'll do it.

> +
> +/*
> + * Copy memory backwards (for memmove)
> + * rdi target
> + * rsi source
> + * rdx count
> + */
> +
> +ENTRY(memcpy_backwards):

s/://

> +	CFI_STARTPROC
> +	std
> +	movq %rdi, %rax
> +	movl %edx, %ecx
> +	add %rdx, %rdi
> +	add %rdx, %rsi

-	add %rdx, %rdi
-	add %rdx, %rsi
+	addq %rdx, %rdi
+	addq %rdx, %rsi

Besides that, the address that %rdi/%rsi point to is past the end of the
memory area that is going to be copied, so we must adjust the addresses to
be correct:

+	leaq -8(%rdi), %rdi
+	leaq -8(%rsi), %rsi

> +	shrl $3, %ecx
> +	andl $7, %edx
> +	rep movsq

The same as above:

+	leaq 8(%rdi), %rdi
+	leaq 8(%rsi), %rsi
+	decq %rsi
+	decq %rdi

> +	movl %edx, %ecx
> +	rep movsb
> +	cld
> +	ret
> +	CFI_ENDPROC
> +ENDPROC(memcpy_backwards)
> +
> diff --git a/arch/x86/lib/memmove_64.c b/arch/x86/lib/memmove_64.c
> index 0a33909..6c00304 100644
> --- a/arch/x86/lib/memmove_64.c
> +++ b/arch/x86/lib/memmove_64.c
> @@ -5,16 +5,16 @@
>  #include <linux/string.h>
>  #include <linux/module.h>
> 
> +extern void asmlinkage memcpy_backwards(void *dst, const void *src,
> +					size_t count);

The type of the return value must be "void *".

Thanks
Miao

> +
>  #undef memmove
>  void *memmove(void *dest, const void *src, size_t count)
>  {
>  	if (dest < src) {
>  		return memcpy(dest, src, count);
>  	} else {
> -		char *p = dest + count;
> -		const char *s = src + count;
> -		while (count--)
> -			*--p = *--s;
> +		return memcpy_backwards(dest, src, count);
>  	}
>  	return dest;
>  }
Re: [PATCH] x86_64/lib: improve the performance of memmove
On Thu, 16 Sep 2010 15:16:31 +0800 Miao Xie wrote:
> On Thu, 16 Sep 2010 08:48:25 +0200 (CEST), Andi Kleen wrote:
>>> When the dest and the src do overlap and the memory area is large,
>>> memmove of x86_64 is very inefficient, and it leads to bad performance,
>>> such as btrfs's file deletion performance. This patch improves the
>>> performance of memmove on x86_64 by using __memcpy_bwd() instead of
>>> byte copy when doing a large memory area copy (len > 64).
>>
>> I still don't understand why you don't simply use a backwards
>> string copy (with std)? That should be much simpler and
>> hopefully as optimized for kernel copies on recent CPUs.
>
> But according to the comment of memcpy, some CPUs don't support the "REP"
> instruction,

I think you misread the comment. REP prefixes are in all x86 CPUs. On some
very old systems it wasn't optimized very well, but it probably doesn't
make too much sense to optimize for those. On newer CPUs in fact REP should
usually be faster than an unrolled loop.

I'm not sure how optimized the backwards copy is, but most likely it is
optimized too.

Here's an untested patch that implements backwards copy with string
instructions. Could you run it through your test harness?

-Andi

Implement the 64bit memmove backwards case using string instructions

Signed-off-by: Andi Kleen

diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index bcbcd1e..6e8258d 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -141,3 +141,28 @@ ENDPROC(__memcpy)
 	.byte .Lmemcpy_e - .Lmemcpy_c
 	.byte .Lmemcpy_e - .Lmemcpy_c
 	.previous
+
+/*
+ * Copy memory backwards (for memmove)
+ * rdi target
+ * rsi source
+ * rdx count
+ */
+
+ENTRY(memcpy_backwards):
+	CFI_STARTPROC
+	std
+	movq %rdi, %rax
+	movl %edx, %ecx
+	add %rdx, %rdi
+	add %rdx, %rsi
+	shrl $3, %ecx
+	andl $7, %edx
+	rep movsq
+	movl %edx, %ecx
+	rep movsb
+	cld
+	ret
+	CFI_ENDPROC
+ENDPROC(memcpy_backwards)
+
diff --git a/arch/x86/lib/memmove_64.c b/arch/x86/lib/memmove_64.c
index 0a33909..6c00304 100644
--- a/arch/x86/lib/memmove_64.c
+++ b/arch/x86/lib/memmove_64.c
@@ -5,16 +5,16 @@
 #include <linux/string.h>
 #include <linux/module.h>
 
+extern void asmlinkage memcpy_backwards(void *dst, const void *src,
+					size_t count);
+
 #undef memmove
 void *memmove(void *dest, const void *src, size_t count)
 {
 	if (dest < src) {
 		return memcpy(dest, src, count);
 	} else {
-		char *p = dest + count;
-		const char *s = src + count;
-		while (count--)
-			*--p = *--s;
+		return memcpy_backwards(dest, src, count);
 	}
 	return dest;
 }

-- 
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH] x86_64/lib: improve the performance of memmove
On Thu, 16 Sep 2010 08:48:25 +0200 (CEST), Andi Kleen wrote:
>> When the dest and the src do overlap and the memory area is large,
>> memmove of x86_64 is very inefficient, and it leads to bad performance,
>> such as btrfs's file deletion performance. This patch improves the
>> performance of memmove on x86_64 by using __memcpy_bwd() instead of
>> byte copy when doing a large memory area copy (len > 64).
>
> I still don't understand why you don't simply use a backwards
> string copy (with std)? That should be much simpler and
> hopefully as optimized for kernel copies on recent CPUs.

But according to the comment of memcpy, some CPUs don't support the "REP"
instruction, so I think we must implement a backwards string copy by
another method for those CPUs. But that implementation is complex, so I
wrote it as a function -- __memcpy_bwd().

Thanks!
Miao