Re: Btrfs slowdown with ceph (how to reproduce)
Hi Chris,

great to hear that. Could you give me a ping once you have fixed it, so that I can retry it?

-martin

On 24.01.2012 20:40, Chris Mason wrote:
> On Tue, Jan 24, 2012 at 08:15:58PM +0100, Martin Mailand wrote:
> > Hi,
> > I tried the branch on one of my ceph OSDs, and there is a big
> > difference in performance. The average request size stayed high, but
> > after around an hour the kernel crashed.
> >
> > IOstat: http://pastebin.com/xjuriJ6J
> > Kernel trace: http://pastebin.com/SYE95GgH
>
> Aha, this I know how to fix. Thanks for trying it out.
>
> -chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs slowdown with ceph (how to reproduce)
On Tue, Jan 24, 2012 at 08:15:58PM +0100, Martin Mailand wrote:
> Hi,
> I tried the branch on one of my ceph OSDs, and there is a big
> difference in performance. The average request size stayed high, but
> after around an hour the kernel crashed.
>
> IOstat: http://pastebin.com/xjuriJ6J
>
> Kernel trace: http://pastebin.com/SYE95GgH

Aha, this I know how to fix. Thanks for trying it out.

-chris
Re: Btrfs slowdown with ceph (how to reproduce)
Hi,

I tried the branch on one of my ceph OSDs, and there is a big difference in performance. The average request size stayed high, but after around an hour the kernel crashed.

IOstat: http://pastebin.com/xjuriJ6J
Kernel trace: http://pastebin.com/SYE95GgH

-martin

On 23.01.2012 19:50, Chris Mason wrote:
> On Mon, Jan 23, 2012 at 01:19:29PM -0500, Josef Bacik wrote:
> > On Fri, Jan 20, 2012 at 01:13:37PM +0100, Christian Brunner wrote:
> > > As you might know, I have been seeing btrfs slowdowns in our ceph
> > > cluster for quite some time. Even with the latest btrfs code for 3.3
> > > I'm still seeing these problems. To make things reproducible, I've now
> > > written a small test that imitates ceph's behavior:
> > >
> > > On a freshly created btrfs filesystem (2 TB size, mounted with
> > > "noatime,nodiratime,compress=lzo,space_cache,inode_cache") I'm opening
> > > 100 files. After that I'm doing random writes on these files with a
> > > sync_file_range after each write (each write has a size of 100 bytes)
> > > and an ioctl(BTRFS_IOC_SYNC) after every 100 writes.
> > >
> > > After approximately 20 minutes, write activity suddenly increases
> > > fourfold and the average request size decreases (see chart in the
> > > attachment).
> > >
> > > You can find IOstat output here: http://pastebin.com/Smbfg1aG
> > >
> > > I hope that you are able to trace down the problem with the test
> > > program in the attachment.
> >
> > Ran it, saw the problem, tried the dangerdonteveruse branch in Chris's
> > tree and formatted the fs with 64k node and leaf sizes, and the problem
> > appeared to go away. So surprise surprise, fragmentation is biting us
> > in the ass. If you can, try running that branch with 64k node and leaf
> > sizes with your ceph cluster and see how that works out. Of course you
> > should only do that if you don't mind losing everything :). Thanks,
>
> Please keep in mind this branch is only out there for development, and
> it really might have huge flaws. scrub doesn't work with it correctly
> right now, and the IO error recovery code is probably broken too.
>
> Long term though, I think the bigger block sizes are going to make a
> huge difference in these workloads.
>
> If you use the very dangerous code:
>
>   mkfs.btrfs -l 64k -n 64k /dev/xxx
>
> (-l is leaf size, -n is node size).
>
> 64K is the max right now; 32K may help just as much at a lower CPU cost.
>
> -chris
Re: Btrfs slowdown with ceph (how to reproduce)
2012/1/23 Chris Mason:
> On Mon, Jan 23, 2012 at 01:19:29PM -0500, Josef Bacik wrote:
> > On Fri, Jan 20, 2012 at 01:13:37PM +0100, Christian Brunner wrote:
> > > As you might know, I have been seeing btrfs slowdowns in our ceph
> > > cluster for quite some time. Even with the latest btrfs code for 3.3
> > > I'm still seeing these problems. To make things reproducible, I've now
> > > written a small test that imitates ceph's behavior:
> > >
> > > On a freshly created btrfs filesystem (2 TB size, mounted with
> > > "noatime,nodiratime,compress=lzo,space_cache,inode_cache") I'm opening
> > > 100 files. After that I'm doing random writes on these files with a
> > > sync_file_range after each write (each write has a size of 100 bytes)
> > > and an ioctl(BTRFS_IOC_SYNC) after every 100 writes.
> > >
> > > After approximately 20 minutes, write activity suddenly increases
> > > fourfold and the average request size decreases (see chart in the
> > > attachment).
> > >
> > > You can find IOstat output here: http://pastebin.com/Smbfg1aG
> > >
> > > I hope that you are able to trace down the problem with the test
> > > program in the attachment.
> >
> > Ran it, saw the problem, tried the dangerdonteveruse branch in Chris's
> > tree and formatted the fs with 64k node and leaf sizes, and the problem
> > appeared to go away. So surprise surprise, fragmentation is biting us
> > in the ass. If you can, try running that branch with 64k node and leaf
> > sizes with your ceph cluster and see how that works out. Of course you
> > should only do that if you don't mind losing everything :). Thanks,
>
> Please keep in mind this branch is only out there for development, and
> it really might have huge flaws. scrub doesn't work with it correctly
> right now, and the IO error recovery code is probably broken too.
>
> Long term though, I think the bigger block sizes are going to make a
> huge difference in these workloads.
>
> If you use the very dangerous code:
>
>   mkfs.btrfs -l 64k -n 64k /dev/xxx
>
> (-l is leaf size, -n is node size).
>
> 64K is the max right now; 32K may help just as much at a lower CPU cost.

Thanks for taking a look. I'm glad to hear that there is a solution on the horizon, but I'm not brave enough to try this on our ceph cluster. I'll try it when the code has stabilized a bit.

Regards,
Christian
Re: Btrfs slowdown with ceph (how to reproduce)
On Mon, Jan 23, 2012 at 01:19:29PM -0500, Josef Bacik wrote:
> On Fri, Jan 20, 2012 at 01:13:37PM +0100, Christian Brunner wrote:
> > As you might know, I have been seeing btrfs slowdowns in our ceph
> > cluster for quite some time. Even with the latest btrfs code for 3.3
> > I'm still seeing these problems. To make things reproducible, I've now
> > written a small test that imitates ceph's behavior:
> >
> > On a freshly created btrfs filesystem (2 TB size, mounted with
> > "noatime,nodiratime,compress=lzo,space_cache,inode_cache") I'm opening
> > 100 files. After that I'm doing random writes on these files with a
> > sync_file_range after each write (each write has a size of 100 bytes)
> > and an ioctl(BTRFS_IOC_SYNC) after every 100 writes.
> >
> > After approximately 20 minutes, write activity suddenly increases
> > fourfold and the average request size decreases (see chart in the
> > attachment).
> >
> > You can find IOstat output here: http://pastebin.com/Smbfg1aG
> >
> > I hope that you are able to trace down the problem with the test
> > program in the attachment.
>
> Ran it, saw the problem, tried the dangerdonteveruse branch in Chris's
> tree and formatted the fs with 64k node and leaf sizes, and the problem
> appeared to go away. So surprise surprise, fragmentation is biting us
> in the ass. If you can, try running that branch with 64k node and leaf
> sizes with your ceph cluster and see how that works out. Of course you
> should only do that if you don't mind losing everything :). Thanks,

Please keep in mind this branch is only out there for development, and it really might have huge flaws. scrub doesn't work with it correctly right now, and the IO error recovery code is probably broken too.

Long term though, I think the bigger block sizes are going to make a huge difference in these workloads.

If you use the very dangerous code:

  mkfs.btrfs -l 64k -n 64k /dev/xxx

(-l is leaf size, -n is node size).

64K is the max right now; 32K may help just as much at a lower CPU cost.

-chris
Re: Btrfs slowdown with ceph (how to reproduce)
On Fri, Jan 20, 2012 at 01:13:37PM +0100, Christian Brunner wrote:
> As you might know, I have been seeing btrfs slowdowns in our ceph
> cluster for quite some time. Even with the latest btrfs code for 3.3
> I'm still seeing these problems. To make things reproducible, I've now
> written a small test that imitates ceph's behavior:
>
> On a freshly created btrfs filesystem (2 TB size, mounted with
> "noatime,nodiratime,compress=lzo,space_cache,inode_cache") I'm opening
> 100 files. After that I'm doing random writes on these files with a
> sync_file_range after each write (each write has a size of 100 bytes)
> and an ioctl(BTRFS_IOC_SYNC) after every 100 writes.
>
> After approximately 20 minutes, write activity suddenly increases
> fourfold and the average request size decreases (see chart in the
> attachment).
>
> You can find IOstat output here: http://pastebin.com/Smbfg1aG
>
> I hope that you are able to trace down the problem with the test
> program in the attachment.

Ran it, saw the problem, tried the dangerdonteveruse branch in Chris's tree and formatted the fs with 64k node and leaf sizes, and the problem appeared to go away. So surprise surprise, fragmentation is biting us in the ass. If you can, try running that branch with 64k node and leaf sizes with your ceph cluster and see how that works out. Of course you should only do that if you don't mind losing everything :). Thanks,

Josef
Btrfs slowdown with ceph (how to reproduce)
As you might know, I have been seeing btrfs slowdowns in our ceph cluster for quite some time. Even with the latest btrfs code for 3.3 I'm still seeing these problems. To make things reproducible, I've now written a small test that imitates ceph's behavior:

On a freshly created btrfs filesystem (2 TB size, mounted with "noatime,nodiratime,compress=lzo,space_cache,inode_cache") I'm opening 100 files. After that I'm doing random writes on these files with a sync_file_range after each write (each write has a size of 100 bytes) and an ioctl(BTRFS_IOC_SYNC) after every 100 writes.

After approximately 20 minutes, write activity suddenly increases fourfold and the average request size decreases (see chart in the attachment).

You can find IOstat output here: http://pastebin.com/Smbfg1aG

I hope that you are able to trace down the problem with the test program in the attachment.

Thanks,
Christian

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>

#define FILE_COUNT 100
#define FILE_SIZE  4194304
#define STRING "0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789"

#define BTRFS_IOCTL_MAGIC 0x94
#define BTRFS_IOC_SYNC _IO(BTRFS_IOCTL_MAGIC, 8)

int main(int argc, char *argv[])
{
    char *imgname = argv[1];
    char *tempname;
    int fd[FILE_COUNT];
    int ilen, i;

    ilen = strlen(imgname);
    tempname = malloc(ilen + 8);

    for (i = 0; i < FILE_COUNT; i++) {
        snprintf(tempname, ilen + 8, "%s.%i", imgname, i);
        fd[i] = open(tempname, O_CREAT|O_RDWR, 0644); /* mode is required with O_CREAT */
    }

    i = 0;
    while (1) {
        int start = rand() % FILE_SIZE;
        int file = rand() % FILE_COUNT;

        putc('.', stderr);
        lseek(fd[file], start, SEEK_SET);
        write(fd[file], STRING, 100);
        sync_file_range(fd[file], start, 100, SYNC_FILE_RANGE_WRITE); /* 0x2 */
        usleep(25000);

        i++;
        if (i == 100) {
            i = 0;
            ioctl(fd[file], BTRFS_IOC_SYNC);
        }
    }
}
Re: Btrfs slowdown
Hi Sage,

I did some testing with btrfs-unstable yesterday. With the recent commit from Chris it looks quite good:

"Btrfs: force unplugs when switching from high to regular priority bios"

However, I can't test it extensively, because our main environment is on ext4 at the moment.

Regards,
Christian

2011/8/8 Sage Weil:
> Hi Christian,
>
> Are you still seeing this slowness?
>
> sage
>
> On Wed, 27 Jul 2011, Christian Brunner wrote:
> > 2011/7/25 Chris Mason:
> > > Excerpts from Christian Brunner's message of 2011-07-25 03:54:47 -0400:
> > > > Hi,
> > > >
> > > > we are running a ceph cluster with btrfs as its base filesystem
> > > > (kernel 3.0). At the beginning everything worked very well, but after
> > > > a few days (2-3) things are getting very slow.
> > > >
> > > > When I look at the object store servers I see heavy disk I/O on the
> > > > btrfs filesystems (disk utilization is between 60% and 100%). I also
> > > > did some tracing on the ceph object store daemon, but I'm quite
> > > > certain that the majority of the disk I/O is not caused by ceph or
> > > > any other userland process.
> > > >
> > > > When I reboot the system(s) the problems go away for another 2-3
> > > > days, but after that it starts again. I'm not sure if the problem is
> > > > related to the kernel warning I reported last week. At least there
> > > > is no temporal relationship between the warning and the slowdown.
> > > >
> > > > Any hints on how to trace this would be welcome.
> > >
> > > The easiest way to trace this is with latencytop.
> > >
> > > Apply this patch:
> > >
> > > http://oss.oracle.com/~mason/latencytop.patch
> > >
> > > And then use latencytop -c for a few minutes while the system is slow.
> > > Send the output here and hopefully we'll be able to figure it out.
> >
> > I've now installed latencytop. Attached are two output files: the
> > first is from yesterday and was created approximately half an hour
> > after the boot. The second one is from today; uptime is 19h. The load
> > on the system is already rising. Disk utilization is approximately at 50%.
> >
> > Thanks for your help.
> >
> > Christian
Re: Btrfs slowdown
Hi Christian,

Are you still seeing this slowness?

sage

On Wed, 27 Jul 2011, Christian Brunner wrote:
> 2011/7/25 Chris Mason:
> > Excerpts from Christian Brunner's message of 2011-07-25 03:54:47 -0400:
> > > Hi,
> > >
> > > we are running a ceph cluster with btrfs as its base filesystem
> > > (kernel 3.0). At the beginning everything worked very well, but after
> > > a few days (2-3) things are getting very slow.
> > >
> > > When I look at the object store servers I see heavy disk I/O on the
> > > btrfs filesystems (disk utilization is between 60% and 100%). I also
> > > did some tracing on the ceph object store daemon, but I'm quite
> > > certain that the majority of the disk I/O is not caused by ceph or
> > > any other userland process.
> > >
> > > When I reboot the system(s) the problems go away for another 2-3
> > > days, but after that it starts again. I'm not sure if the problem is
> > > related to the kernel warning I reported last week. At least there
> > > is no temporal relationship between the warning and the slowdown.
> > >
> > > Any hints on how to trace this would be welcome.
> >
> > The easiest way to trace this is with latencytop.
> >
> > Apply this patch:
> >
> > http://oss.oracle.com/~mason/latencytop.patch
> >
> > And then use latencytop -c for a few minutes while the system is slow.
> > Send the output here and hopefully we'll be able to figure it out.
>
> I've now installed latencytop. Attached are two output files: the
> first is from yesterday and was created approximately half an hour
> after the boot. The second one is from today; uptime is 19h. The load
> on the system is already rising. Disk utilization is approximately at 50%.
>
> Thanks for your help.
>
> Christian
Re: Btrfs Slowdown (due to Memory Handling?)
Excerpts from Mitch Harder's message of 2011-08-04 14:40:20 -0400:
> On Thu, Aug 4, 2011 at 10:05 AM, Chris Mason wrote:
> > >
> > > Ok, so I'm going to guess that your problem is really with either file
> > > layout or just us using more metadata pages than the others. The file
> > > layout part is easy to test, just replace your git repo with a fresh
> > > clone (or completely repack it).
> >
> > Sorry, I should have said replace your git repo with a fresh,
> > non-hardlinked clone. git clone by default will just make hardlinks if
> > it can, so it has to be a fresh clone.
> >
> > -chris
>
> Oops, sorry, I let my responses slip off the list.
>
> You are right about there being a potentially huge difference between
> a cloned git repo and its parent. I didn't realize it could make
> such a difference.
>
> This problem now appears to have nothing to do with btrfs. I can
> replicate the problem on an ext4 partition as well if I use a copy of
> the parent git repository instead of a clone. The problem seems to lie
> in the fragmentation of the git repository.
>
> If I work with a clone of my linux-btrfs repository, subsequent clones
> are much faster. Cloning my parent linux-btrfs repo takes about 90
> minutes (when I have restricted free RAM). Cloning a clone of the
> parent drops down to less than 10 minutes.
>
> With there being several other threads relating to btrfs 'slowdowns',
> I thought this issue might be related.

Great, glad to hear it turned out to be filesystem agnostic. The original git file format was basically very filesystem unfriendly, and it tends to fragment very badly. Linus' solution to this is the pack file format, which is space efficient and very fast to access.

The only downside is that you need to repack the repo from time to time or performance tends to fall off a cliff. There is a git repack command and a git gc command that you can use to restructure things, both making it smaller and much faster.

-chris
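The repack Chris recommends is standard git maintenance; below is a minimal sketch of its effect using a throwaway repository (the temp repo, file names, and committer identity are purely illustrative; run the gc step on a real repository instead):

```shell
set -e
# Build a tiny throwaway repo, then repack it: git gc consolidates the
# loose objects into pack files, which also defragments the on-disk layout.
repo=$(mktemp -d)
cd "$repo"
git init -q .
for i in 1 2 3; do
    echo "change $i" > file
    git add file
    git -c user.email=test@example.com -c user.name=test commit -qm "commit $i"
done
echo "before gc:"; git count-objects -v | grep -E '^(count|packs):'
git gc --quiet --aggressive --prune=now
echo "after gc:";  git count-objects -v | grep -E '^(count|packs):'
```

After the gc, `count` (the number of loose objects) drops to 0 and everything lives in a pack, so sequential readahead over the pack file does the work that thousands of small-file reads did before.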
Re: Btrfs Slowdown (due to Memory Handling?)
On Thu, Aug 4, 2011 at 10:05 AM, Chris Mason wrote:
> Excerpts from Chris Mason's message of 2011-08-04 11:04:54 -0400:
> > Excerpts from Mitch Harder's message of 2011-08-04 10:45:51 -0400:
> > > On Thu, Aug 4, 2011 at 9:22 AM, Chris Mason wrote:
> > > > Excerpts from Mitch Harder's message of 2011-08-02 10:35:54 -0400:
> > > > > I'm running into a significant slowdown in Btrfs (> 10x slower than
> > > > > normal) that appears to be due to some issue between how Btrfs is
> > > > > allocating memory, and how the kernel is expecting Btrfs to
> > > > > allocate memory.
> > > > >
> > > > > The problem does seem to be somewhat hardware specific. I can
> > > > > reproduce on two of my computers (an older AMD Athlon(tm) XP 2600+
> > > > > with PATA, and a newer ACER Aspire netbook with an Atom CPU). My
> > > > > Core2Duo computer with SATA seems unaffected by this slowdown.
> > > > >
> > > > > I've replicated this on 2.6.38, 2.6.39, and 3.0 kernels. The
> > > > > following information was all obtained running on a 3.0 kernel
> > > > > merged with the latest 'for-linus' branch of Chris' git repo. I've
> > > > > also tested on ext4 (no slowdown encountered) to make sure the
> > > > > issue wasn't completely unrelated to Btrfs.
> > > >
> > > > Just to double check, what was the top commit of for-linus when you
> > > > did this?
> > > >
> > > > The tracing shows that you're spending your time in mmap'd readahead.
> > > > So one of three things is happening:
> > > >
> > > > 1) The VM is favoring our metadata over data pages for the git
> > > > packed file
> > > >
> > > > 2) We're reading ahead too aggressively, or not aggressively enough
> > > >
> > > > 3) The git pack file is somehow more fragmented, and this is making
> > > > the read ahead much less effective.
> > > >
> > > > The very first thing I'd check is to make sure the .git repo between
> > > > the slow machines and the fast machines are identical. Git does a
> > > > lot of packing behind the scenes, and so an older repo that isn't
> > > > freshly cloned is going to be slower than a new repo.
> > > >
> > > > -chris
> > >
> > > The top commit merged for the kernel used to generate the information
> > > in this post was:
> > >
> > > Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors
> > > 75c195a2cac2c3c8366c0b87de2d6814c4f4d638
> > >
> > > I have since replicated the slowdown with a kernel merged with the
> > > latest 'for-linus' branch, whose top commit was:
> > > Btrfs: don't call writepages from within write_full_page
> > > 0d10ee2e6deb5c8409ae65b970846344897d5e4e
> >
> > Ok, so I'm going to guess that your problem is really with either file
> > layout or just us using more metadata pages than the others. The file
> > layout part is easy to test, just replace your git repo with a fresh
> > clone (or completely repack it).
>
> Sorry, I should have said replace your git repo with a fresh,
> non-hardlinked clone. git clone by default will just make hardlinks if
> it can, so it has to be a fresh clone.
>
> -chris

Oops, sorry, I let my responses slip off the list.

You are right about there being a potentially huge difference between a cloned git repo and its parent. I didn't realize it could make such a difference.

This problem now appears to have nothing to do with btrfs. I can replicate the problem on an ext4 partition as well if I use a copy of the parent git repository instead of a clone. The problem seems to lie in the fragmentation of the git repository.

If I work with a clone of my linux-btrfs repository, subsequent clones are much faster. Cloning my parent linux-btrfs repo takes about 90 minutes (when I have restricted free RAM). Cloning a clone of the parent drops down to less than 10 minutes.

With there being several other threads relating to btrfs 'slowdowns', I thought this issue might be related.
Re: Btrfs Slowdown (due to Memory Handling?)
Excerpts from Mitch Harder's message of 2011-08-02 10:35:54 -0400:
> I'm running into a significant slowdown in Btrfs (> 10x slower than
> normal) that appears to be due to some issue between how Btrfs is
> allocating memory, and how the kernel is expecting Btrfs to allocate
> memory.
>
> The problem does seem to be somewhat hardware specific. I can
> reproduce on two of my computers (an older AMD Athlon(tm) XP 2600+
> with PATA, and a newer ACER Aspire netbook with an Atom CPU). My
> Core2Duo computer with SATA seems unaffected by this slowdown.
>
> I've replicated this on 2.6.38, 2.6.39, and 3.0 kernels. The
> following information was all obtained running on a 3.0 kernel merged
> with the latest 'for-linus' branch of Chris' git repo. I've also
> tested on ext4 (no slowdown encountered) to make sure the issue
> wasn't completely unrelated to Btrfs.

Just to double check, what was the top commit of for-linus when you did this?

The tracing shows that you're spending your time in mmap'd readahead. So one of three things is happening:

1) The VM is favoring our metadata over data pages for the git packed file

2) We're reading ahead too aggressively, or not aggressively enough

3) The git pack file is somehow more fragmented, and this is making the read ahead much less effective.

The very first thing I'd check is to make sure the .git repo between the slow machines and the fast machines are identical. Git does a lot of packing behind the scenes, and so an older repo that isn't freshly cloned is going to be slower than a new repo.

-chris
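One quick way to compare the packing state of the repos on the slow and fast machines, along the lines Chris suggests, is git's object accounting (throwaway repo here purely so the sketch is self-contained; point `-C` at the real repositories instead):

```shell
# Many loose objects usually mean the repo has not been repacked recently
# and its on-disk layout will be fragmented; a fresh clone is fully packed.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" count-objects -v
# Key fields:
#   count: number of loose objects (high => poor readahead behavior)
#   packs: number of pack files (a fresh clone has exactly one)
```

If the counts differ substantially between machines, the fragmentation hypothesis (3) is the first one to test.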
Re: Btrfs slowdown
I can confirm this as well (64-bit, Core i7, single-disk).

> The issue seems to be gone in 3.0.0.

After a few hours of work, 3.0.0 slows down on me too. The performance becomes unusable and a reboot is a must. Certain applications (particularly evolution and firefox) are next to permanently greyed out.

I have had a couple of corrupted tree logs recently and had to use btrfs-zero-log (mentioned in an earlier thread). Otherwise, returning to 2.6.38 is the workaround.

~mck

--
"A mind that has been stretched will never return to its original dimension." Albert Einstein | www.semb.wever.org | www.sesat.no | http://tech.finn.no | http://xss-http-filter.sf.net
Btrfs Slowdown (due to Memory Handling?)
I'm running into a significant slowdown in Btrfs (> 10x slower than normal) that appears to be due to some issue between how Btrfs is allocating memory, and how the kernel is expecting Btrfs to allocate memory.

The problem does seem to be somewhat hardware specific. I can reproduce on two of my computers (an older AMD Athlon(tm) XP 2600+ with PATA, and a newer ACER Aspire netbook with an Atom CPU). My Core2Duo computer with SATA seems unaffected by this slowdown.

I've replicated this on 2.6.38, 2.6.39, and 3.0 kernels. The following information was all obtained running on a 3.0 kernel merged with the latest 'for-linus' branch of Chris' git repo. I've also tested on ext4 (no slowdown encountered) to make sure the issue wasn't completely unrelated to Btrfs.

The steps to reproduce are as follows:

Prerequisite: have a btrfs partition with a copy of a linux kernel git repository stored.

(1) Boot with 768 MB RAM (using 'mem=768M' in the grub command line).

(2) From a second machine, run a git clone of the kernel git repository (such as 'git clone ssh://@/path/to/linux-git-repo').

The clone process slows down when it reaches the 'remote: Compressing objects:' step.

Looking at the Alt-SysRq-W output and Latencytop output (see attached), I get a steady stream of memory page faults, and other memory issues.

The git clone is definitely causing memory pressure when booted with only 768MB of RAM. However, I still see plenty of cached RAM available, and there is little or no activity on my swap partition. The dmesg output is otherwise silent except for the Alt-SysRq-W output. No OOM errors.

A typical 'top' snapshot during the affected period looks like this:

top - 08:53:08 up 32 min,  3 users,  load average: 1.06, 1.01, 0.84
Tasks: 104 total,   1 running, 103 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.3%us, 12.3%sy,  0.0%ni,  0.0%id, 85.1%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:    768452k total,   760248k used,     8204k free,     4396k buffers
Swap:  1004056k total,    13824k used,   990232k free,   352596k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2876 root      20   0     0    0    0 S 11.0  0.0   1:26.62 btrfs-endio-1
 3117 dontpani  20   0  720m 386m  52m D  4.0 51.5   2:38.78 git
  526 root      20   0     0    0    0 S  0.3  0.0   0:06.42 kswapd0
 2576 root      20   0     0    0    0 S  0.3  0.0   0:44.09 btrfs-endio-0
    1 root      20   0  1844  568  540 S  0.0  0.1   0:00.32 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
    5 root      20   0     0    0    0 S  0.0  0.0   0:00.01 kworker/u:0
    6 root      -2   0     0    0    0 S  0.0  0.0   0:04.17 rcu_kthread
    7 root       0 -20     0    0    0 S  0.0  0.0   0:00.00 cpuset
    8 root       0 -20     0    0    0 S  0.0  0.0   0:00.00 khelper

So, while I may be truly running out of RAM, the kernel doesn't seem to be handling the issue normally (i.e., pushing more off to swap or giving OOM errors).

Let me know if you have some feedback on how to track this issue down.

=== Mon Aug  1 14:24:05 2011
Globals:
  Cause                          Maximum     Percentage
  Page fault                    189.6 msec      100.0 %

Process details:
  Process kworker/0:1 (395)  Total: 27.6 msec
    .                           4.9 msec      100.0 %
        worker_thread kthread kernel_thread_helper
  Process kswapd0 (526)  Total: 11.5 msec
    kswapd() kernel thread      3.6 msec      100.0 %
        kswapd kthread kernel_thread_helper
  Process btrfs-endio-0 (2567)  Total: 878.4 msec
    [worker_loop]               5.0 msec      100.0 %
        worker_loop kthread kernel_thread_helper
  Process btrfs-endio-1 (2768)  Total: 1.1 msec
    [worker_loop]               1.1 msec      100.0 %
        worker_loop kthread kernel_thread_helper
  Process git (2769)  Total: 1117.1 msec
    Page fault                189.6 msec      100.0 %
        sleep_on_page_killable wait_on_page_bit_killable
        __lock_page_or_retry filemap_fault __do_fault
        handle_pte_fault handle_mm_fault do_page_fault error_code

=== Mon Aug  1 14:24:15 2011
Globals:
  Cause                          Maximum     Percentage
  Page fault                    388.9 msec       98.0 %
  Creating block layer request   74.6 msec        0.8 %
  Reading from file              74.1 msec        0.8 %
  [sleep_on_page]                37.9 msec        0.4 %
  Waiting for event (poll)        1.8 msec        0.0 %
  Waiting for event (select)      1.7 msec        0.0 %

Process details:
  Process sync_supers (259)  Total: 0.5 msec
    Waiting for buffer IO to complete   0.3 msec  100.0 %
        sleep_on_buffer __wait_on_buffer flush_commit_list
        do_journal_end.clone.32 journal_end_sync reiserfs_sync_fs
        reiserfs_write_super sync_supers bdi_sync_supers kthread
        kernel_thread_helper
  Process kworker/0:1 (395)  Total: 0.2 msec
    .
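For reference, the RAM restriction in step (1) can also be made persistent rather than edited at the grub menu each boot; a sketch assuming the common grub2 layout (file path and regeneration command vary by distro):

```shell
# /etc/default/grub: limit the kernel to 768 MB of RAM for testing.
# (grub2 assumed; with grub legacy, append mem=768M to the kernel line
# in menu.lst instead.)
GRUB_CMDLINE_LINUX="mem=768M"

# Then regenerate the config, e.g.:
#   grub-mkconfig -o /boot/grub/grub.cfg
```

Remember to remove the parameter and regenerate again once testing is done.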
Re: Btrfs slowdown
On Thu, 28 Jul 2011, Christian Brunner wrote:
> When I look at the latencytop results, there is a high latency when
> calling "btrfs_commit_transaction_async". Isn't "async" supposed to
> return immediately?

It depends. That function has to block until the commit has started before returning in the case where it creates a new btrfs root (i.e., snapshot creation). Otherwise a subsequent operation (after the ioctl returns) can sneak in before the snapshot is taken. (IIRC there was also another problem with keeping internal structures consistent, tho I'm forgetting the details.) And there are a bunch of things btrfs_commit_transaction() does before setting blocked = 1 that can be slow.

There is a fair bit of transaction commit optimization work that should eventually be done here that we sadly haven't had the resources to look at yet.

sage
Re: Btrfs slowdown
2011/7/28 Marcus Sorensen : > Christian, > > Have you checked up on the disks themselves and hardware? High > utilization can mean that the i/o load has increased, but it can also > mean that the i/o capacity has decreased. Your traces seem to > indicate that a good portion of the time is being spent on commits, > that could be waiting on disk. That "wait_for_commit" looks to > basically just spin waiting for the commit to complete, and at least > one thing that calls it raises a BUG_ON, not sure if it's one you've > seen even on 2.6.38. > > There could be all sorts of performance related reasons that aren't > specific to btrfs or ceph, on our various systems we've seen things > like the raid card module being upgraded in newer kernels and suddenly > our disks start to go into sleep mode after a bit, dirty_ratio causing > multiple gigs of memory to sync because its not optimized for the > workload, external SAS enclosures stop communicating a few days after > reboot (but the disks keep working with sporadic issues), things like > patrol read hitting a bad sector on a disk, causing it to go into > enhanced error recovery and stop responding, etc. I' fairly confident that the hardware is ok. We see the problem on four machines. It could be a problem with the hpsa driver/firmware, but we haven't seen the behavior with 2.6.38 and the changes in the hpsa driver are not that big. > Maybe you have already tried these things. It's where I would start > anyway. Looking at /proc/meminfo, dirty, writeback, swap, etc both > while the system is functioning desirably and when it's misbehaving. > Looking at anything else that might be in D state. Looking at not just > disk util, but the workload causing it (e.g. Was I doing 300 iops > previously with an average size of 64k, and now I'm only managing 50 > iops at 64k before the disk util reports 100%?) 
Testing the system in > a filesystem-agnostic manner, for example when performance is bad > through btrfs, is performance the same as you got on fresh boot when > testing iops on /dev/sdb or whatever? You're not by chance swapping > after a bit of uptime on any volume that's shared with the underlying > disks that make up your osd, obfuscated by a hardware raid? I didn't > see the kernel warning you're referring to, just the ixgbe malloc > failure you mentioned the other day. I've looked at most of this. What makes me point to btrfs is that the problem goes away when I reboot one server in our cluster, but persists on the other systems. So it can't be related to the number of requests that come in. > I do not mean to presume that you have not looked at these things > already. I am not very knowledgeable in btrfs specifically, but I > would expect any degradation in performance over time to be due to > what's on disk (lots of small files, fragmented, etc). This is > obviously not the case in this situation since a reboot recovers the > performance. I suppose it could also be a memory leak or something > similar, but you should be able to detect something like that by > monitoring your memory situation, /proc/slabinfo etc. It could be related to a memory leak. The machine has a lot of RAM (24 GB), but we have seen page allocation failures in the ixgbe driver when we are using jumbo frames. > Just my thoughts, good luck on this. I am currently running 2.6.39.3 > (btrfs) on the 7 node cluster I put together, but I just built it and > am comparing between various configs. It will be awhile before it is > under load for several days straight. Thanks! When I look at the latencytop results, there is a high latency when calling "btrfs_commit_transaction_async". Isn't "async" supposed to return immediately?
Regards, Christian
Re: Btrfs slowdown
Christian, Have you checked up on the disks themselves and hardware? High utilization can mean that the i/o load has increased, but it can also mean that the i/o capacity has decreased. Your traces seem to indicate that a good portion of the time is being spent on commits, that could be waiting on disk. That "wait_for_commit" looks to basically just spin waiting for the commit to complete, and at least one thing that calls it raises a BUG_ON, not sure if it's one you've seen even on 2.6.38. There could be all sorts of performance related reasons that aren't specific to btrfs or ceph, on our various systems we've seen things like the raid card module being upgraded in newer kernels and suddenly our disks start to go into sleep mode after a bit, dirty_ratio causing multiple gigs of memory to sync because its not optimized for the workload, external SAS enclosures stop communicating a few days after reboot (but the disks keep working with sporadic issues), things like patrol read hitting a bad sector on a disk, causing it to go into enhanced error recovery and stop responding, etc. Maybe you have already tried these things. It's where I would start anyway. Looking at /proc/meminfo, dirty, writeback, swap, etc both while the system is functioning desirably and when it's misbehaving. Looking at anything else that might be in D state. Looking at not just disk util, but the workload causing it (e.g. Was I doing 300 iops previously with an average size of 64k, and now I'm only managing 50 iops at 64k before the disk util reports 100%?) Testing the system in a filesystem-agnostic manner, for example when performance is bad through btrfs, is performance the same as you got on fresh boot when testing iops on /dev/sdb or whatever? You're not by chance swapping after a bit of uptime on any volume that's shared with the underlying disks that make up your osd, obfuscated by a hardware raid? 
I didn't see the kernel warning you're referring to, just the ixgbe malloc failure you mentioned the other day. I do not mean to presume that you have not looked at these things already. I am not very knowledgeable in btrfs specifically, but I would expect any degradation in performance over time to be due to what's on disk (lots of small files, fragmented, etc). This is obviously not the case in this situation since a reboot recovers the performance. I suppose it could also be a memory leak or something similar, but you should be able to detect something like that by monitoring your memory situation, /proc/slabinfo etc. Just my thoughts, good luck on this. I am currently running 2.6.39.3 (btrfs) on the 7 node cluster I put together, but I just built it and am comparing between various configs. It will be awhile before it is under load for several days straight. On Wed, Jul 27, 2011 at 2:41 AM, Christian Brunner wrote: > 2011/7/25 Chris Mason : >> Excerpts from Christian Brunner's message of 2011-07-25 03:54:47 -0400: >>> Hi, >>> >>> we are running a ceph cluster with btrfs as its base filesystem >>> (kernel 3.0). At the beginning everything worked very well, but after >>> a few days (2-3) things are getting very slow. >>> >>> When I look at the object store servers I see heavy disk-i/o on the >>> btrfs filesystems (disk utilization is between 60% and 100%). I also >>> did some tracing on the Ceph-Object-Store-Daemon, but I'm quite >>> certain that the majority of the disk I/O is not caused by ceph or >>> any other userland process. >>> >>> When I reboot the system(s) the problems go away for another 2-3 days, >>> but after that, it starts again. I'm not sure if the problem is >>> related to the kernel warning I've reported last week. At least there >>> is no temporal relationship between the warning and the slowdown. >>> >>> Any hints on how to trace this would be welcome. >> >> The easiest way to trace this is with latencytop.
>> >> Apply this patch: >> >> http://oss.oracle.com/~mason/latencytop.patch >> >> And then use latencytop -c for a few minutes while the system is slow. >> Send the output here and hopefully we'll be able to figure it out. > > I've now installed latencytop. Attached are two output files: The > first is from yesterday and was created approximately half an hour after > the boot. The second one is from today, uptime is 19h. The load on the > system is already rising. Disk utilization is approximately at 50%. > > Thanks for your help. > > Christian
Re: Btrfs slowdown
Excerpts from Christian Brunner's message of 2011-07-25 03:54:47 -0400: > Hi, > > we are running a ceph cluster with btrfs as its base filesystem > (kernel 3.0). At the beginning everything worked very well, but after > a few days (2-3) things are getting very slow. > > When I look at the object store servers I see heavy disk-i/o on the > btrfs filesystems (disk utilization is between 60% and 100%). I also > did some tracing on the Ceph-Object-Store-Daemon, but I'm quite > certain that the majority of the disk I/O is not caused by ceph or > any other userland process. > > When I reboot the system(s) the problems go away for another 2-3 days, > but after that, it starts again. I'm not sure if the problem is > related to the kernel warning I've reported last week. At least there > is no temporal relationship between the warning and the slowdown. > > Any hints on how to trace this would be welcome. The easiest way to trace this is with latencytop. Apply this patch: http://oss.oracle.com/~mason/latencytop.patch And then use latencytop -c for a few minutes while the system is slow. Send the output here and hopefully we'll be able to figure it out. -chris
Re: Btrfs slowdown
Christian Brunner wrote: > we are running a ceph cluster with btrfs as its base filesystem > (kernel 3.0). At the beginning everything worked very well, but after > a few days (2-3) things are getting very slow. We get quite a slowdown over time, doing rsyncs to different snapshots. Btrfs seems to go from using several threads in parallel (btrfs-endio-0,1,2, as shown in top) to just using a single btrfs-delalloc thread. Jeremy
Re: Btrfs slowdown
Just a quick note: The issue seems to be gone in 3.0.0. But that's just a wild guess based on 1/2 hour without thrashing. :-) Andrej Hello, I can see something similar on the machines I maintain, mostly single-disk setups with a 2.6.39 kernel: 1) Heavy and frequent disk thrashing, although less than 20% of RAM is used and no swap usage is reported. 2) During the disk thrashing, some processors (usually 2 or 3) spend 100% of their time busy waiting, according to htop. 3) Some userspace applications freeze for tens of seconds during the thrashing and busy waiting, sometimes even htop itself... The problem has only been observed on 64-bit multiprocessors (Core i7 laptop and Nehalem class server Xeons). A 32-bit multiprocessor (Intel Core Duo) and a 64-bit uniprocessor (Intel Core 2 Duo class Celeron) do not seem to have any issues. Furthermore, none of the machines had this problem with 2.6.38 and earlier kernels. Btrfs "just worked" before 2.6.39. I'll test 3.0 today to see whether some of these issues disappear. Neither ceph nor any other remote/distributed filesystem (not even NFS) runs on the machines. The second problem listed above looks like illegal blocking of a vital spinlock during a long disk operation, which freezes some kernel subsystems for an inordinate amount of time and causes a number of processors to wait actively for tens of seconds. (Needless to say that this is not acceptable on a laptop...) Web browsers (Firefox and Chromium) seem to trigger this issue slightly more often than other applications, but I have no detailed statistics to prove this. ;-) Two Core i7 class multiprocessors work 100% flawlessly with ext4, although their kernel configuration is otherwise identical to the machines that use Btrfs. Andrej Hi, we are running a ceph cluster with btrfs as it's base filesystem (kernel 3.0). At the beginning everything worked very well, but after a few days (2-3) things are getting very slow. 
When I look at the object store servers I see heavy disk-i/o on the btrfs filesystems (disk utilization is between 60% and 100%). I also did some tracing on the Ceph-Object-Store-Daemon, but I'm quite certain that the majority of the disk I/O is not caused by ceph or any other userland process. When I reboot the system(s) the problems go away for another 2-3 days, but after that, it starts again. I'm not sure if the problem is related to the kernel warning I've reported last week. At least there is no temporal relationship between the warning and the slowdown. Any hints on how to trace this would be welcome. Thanks, Christian
Re: Btrfs slowdown
Hello, I can see something similar on the machines I maintain, mostly single-disk setups with a 2.6.39 kernel: 1) Heavy and frequent disk thrashing, although less than 20% of RAM is used and no swap usage is reported. 2) During the disk thrashing, some processors (usually 2 or 3) spend 100% of their time busy waiting, according to htop. 3) Some userspace applications freeze for tens of seconds during the thrashing and busy waiting, sometimes even htop itself... The problem has only been observed on 64-bit multiprocessors (Core i7 laptop and Nehalem class server Xeons). A 32-bit multiprocessor (Intel Core Duo) and a 64-bit uniprocessor (Intel Core 2 Duo class Celeron) do not seem to have any issues. Furthermore, none of the machines had this problem with 2.6.38 and earlier kernels. Btrfs "just worked" before 2.6.39. I'll test 3.0 today to see whether some of these issues disappear. Neither ceph nor any other remote/distributed filesystem (not even NFS) runs on the machines. The second problem listed above looks like illegal blocking of a vital spinlock during a long disk operation, which freezes some kernel subsystems for an inordinate amount of time and causes a number of processors to wait actively for tens of seconds. (Needless to say that this is not acceptable on a laptop...) Web browsers (Firefox and Chromium) seem to trigger this issue slightly more often than other applications, but I have no detailed statistics to prove this. ;-) Two Core i7 class multiprocessors work 100% flawlessly with ext4, although their kernel configuration is otherwise identical to the machines that use Btrfs. Andrej Hi, we are running a ceph cluster with btrfs as it's base filesystem (kernel 3.0). At the beginning everything worked very well, but after a few days (2-3) things are getting very slow. When I look at the object store servers I see heavy disk-i/o on the btrfs filesystems (disk utilization is between 60% and 100%). 
I also did some tracing on the Ceph-Object-Store-Daemon, but I'm quite certain that the majority of the disk I/O is not caused by ceph or any other userland process. When I reboot the system(s) the problems go away for another 2-3 days, but after that, it starts again. I'm not sure if the problem is related to the kernel warning I've reported last week. At least there is no temporal relationship between the warning and the slowdown. Any hints on how to trace this would be welcome. Thanks, Christian
Btrfs slowdown
Hi, we are running a ceph cluster with btrfs as its base filesystem (kernel 3.0). At the beginning everything worked very well, but after a few days (2-3) things are getting very slow. When I look at the object store servers I see heavy disk-i/o on the btrfs filesystems (disk utilization is between 60% and 100%). I also did some tracing on the Ceph-Object-Store-Daemon, but I'm quite certain that the majority of the disk I/O is not caused by ceph or any other userland process. When I reboot the system(s) the problems go away for another 2-3 days, but after that, it starts again. I'm not sure if the problem is related to the kernel warning I've reported last week. At least there is no temporal relationship between the warning and the slowdown. Any hints on how to trace this would be welcome. Thanks, Christian