Re: [PATCH V3 0/8] Cleancache: overview
On 06/22/2010 04:48 AM, Dan Magenheimer wrote:
> [PATCH V3 0/8] Cleancache: overview
> [snip]
>  Documentation/ABI/testing/sysfs-kernel-mm-cleancache |  11 +
>  Documentation/vm/cleancache.txt                      | 194 +++
>  fs/btrfs/extent_io.c                                 |   9
>  fs/btrfs/super.c                                     |   2
>  fs/buffer.c                                          |   5
>  fs/ext3/super.c                                      |   2
>  fs/ext4/super.c                                      |   2
>  fs/mpage.c                                           |   7
>  fs/ocfs2/super.c                                     |   3
>  fs/super.c                                           |   7
>  include/linux/cleancache.h                           |  88
>  include/linux/fs.h                                   |   5
>  mm/Kconfig                                           |  22 ++
>  mm/Makefile                                          |   1
>  mm/cleancache.c                                      | 169
>  mm/filemap.c                                         |  11 +
>  mm/truncate.c                                        |  10
>  17 files changed, 548 insertions(+)
>
> (following is a copy of Documentation/vm/cleancache.txt)
>
> MOTIVATION
>
> Cleancache can be thought of as a page-granularity victim cache for clean
> pages that the kernel's pageframe replacement algorithm (PFRA) would like
> to keep around, but can't since there isn't enough memory. So when the
> PFRA evicts a page, it first attempts to put it into a synchronous
> concurrency-safe page-oriented pseudo-RAM device (such as Xen's
> Transcendent Memory, aka "tmem", or in-kernel compressed memory, aka
> "zmem", or other RAM-like devices) which is not directly accessible or
> addressable by the kernel and is of unknown and possibly time-varying
> size. And when a cleancache-enabled filesystem wishes to access a page
> in a file on disk, it first checks cleancache to see if it already
> contains it; if it does, the page is copied into the kernel and a disk
> access is avoided.

Since zcache is now one of its use cases, I think the major objection that
remains against cleancache is its intrusiveness -- in particular, the need
to change individual filesystems (even though the changes are one-liners).
The changes below should help avoid these per-fs changes and make
cleancache more self-contained. I haven't tested these changes myself, so
there might be missed cases or other mysterious problems:

1. Cleancache requires filesystem-specific changes, primarily to make a
call to cleancache init and to store the (per-fs instance) pool_id.
I think we can get rid of these by directly passing a 'struct super_block'
pointer, which is also sufficient to identify the FS instance a page
belongs to. This should then be used as a 'handle' by the cleancache_ops
provider to find the corresponding memory pool, or to create a new pool
when a new handle is encountered. This leaves out the case of ocfs2, for
which cleancache needs a 'uuid' to decide if a shared pool should be
created. IMHO, this case (and cleancache.init_shared_fs) should be removed
from cleancache_ops, since it is applicable only to Xen's cleancache_ops
provider.

2. I think the change in btrfs can be avoided by moving
cleancache_get_page() from do_mpage_readpage() to filemap_fault(), and
this should work for all filesystems. See:

  handle_pte_fault() -> do_(non)linear_fault() -> __do_fault()
    -> vma->vm_ops->fault()

which is defined as filemap_fault() for all filesystems. If some future
filesystem uses its own custom function (why?) then it will have to
arrange for a call to cleancache_get_page() itself, if it wants this
feature.

With the above changes, cleancache will be fairly self-contained:
 - cleancache_put_page() when a page is removed from the page cache
 - cleancache_get_page() when a page fault occurs (and after the page
   cache has been searched)
 - cleancache_flush_*() on truncate_*()

Thanks,
Nitin
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 0/8] Cleancache: overview
On Fri, Jul 23, 2010 at 4:36 PM, Nitin Gupta <ngu...@vflare.org> wrote:
> 2. I think the change in btrfs can be avoided by moving
> cleancache_get_page() from do_mpage_readpage() to filemap_fault(), and
> this should work for all filesystems. See:
>   handle_pte_fault() -> do_(non)linear_fault() -> __do_fault()
>     -> vma->vm_ops->fault()
> which is defined as filemap_fault() for all filesystems. If some future
> filesystem uses its own custom function (why?) then it will have to
> arrange for a call to cleancache_get_page() itself, if it wants this
> feature.

filemap_fault() works only for file-backed pages that are mapped; it does
not cover a cache page that is not mapped. So if we move the hook into
filemap_fault(), we could miss a cached page on the read() system call
path.

--
Kind regards,
Minchan Kim
Re: kernel BUG at fs/btrfs/extent-tree.c:1353
On 22 July 2010 19:07, Johannes Hirte <johannes.hi...@fem.tu-ilmenau.de> wrote:
> On Monday 19 July 2010, 10:01:46, Miao Xie wrote:
>> On Thu, 15 Jul 2010 20:14:51 +0200, Johannes Hirte wrote:
>>> On Thursday 15 July 2010, 02:11:04, Dave Chinner wrote:
>>>> On Wed, Jul 14, 2010 at 05:25:23PM +0200, Johannes Hirte wrote:
>>>>> On Thursday 08 July 2010, 16:31:09, Chris Mason wrote:
>>>>> I'm not sure if btrfs is to blame for this error. After the errors I
>>>>> switched to XFS on this system and now got this error:
>>>>>
>>>>>   ls -l .kde4/share/apps/akregator/data/
>>>>>   ls: cannot access .kde4/share/apps/akregator/data/feeds.opml: Structure needs cleaning
>>>>>   total 4
>>>>>   ?? ? ? ? ? ? feeds.opml
>>>>
>>>> What is the error reported in dmesg when the XFS filesystem shuts
>>>> down?
>>>
>>> Nothing. I double checked the logs. There are only the messages from
>>> mounting the filesystem. No errors are reported other than the
>>> inaccessible file and the output from xfs_check.
>>
>> Is there anything wrong with your disks or memory? Sometimes bad memory
>> can break the filesystem. I have met this kind of problem some time ago.
>
> I don't think that's the case. I've checked the RAM with memtest86+ and
> got no errors. I got the errors with two different disks, the first one
> with btrfs, the second one now with XFS. Before changing to the second
> disk, I ran badblocks on it to be sure it has no errors.

There are some known-buggy chipsets also. One still around is the Nvidia
CK804/MCP55: under certain patterns of spatially-local pending reads and
writes to the memory controller, a 64-byte request would occasionally be
returned with the wrong offset. I was hitting it with some 27-Gbit
adapters and managed to capture it on a PCI-e protocol analyser. Rsync
between network and local disk would sometimes hit it too.

--
Daniel J Blueman
Re: kernel BUG at fs/btrfs/extent-tree.c:1353
On Thu, Jul 22, 2010 at 2:07 PM, Johannes Hirte
<johannes.hi...@fem.tu-ilmenau.de> wrote:
>> Is there anything wrong with your disks or memory? Sometimes bad memory
>> can break the filesystem. I have met this kind of problem some time ago.
>
> I don't think that's the case. I've checked the RAM with memtest86+ and
> got no errors. I got the errors with two different disks, the first one
> with btrfs, the second one now with XFS. Before changing to the second
> disk, I ran badblocks on it to be sure it has no errors.

You might also try kmemcheck. There's a good chance that a bug that
scribbles over random memory shows up as FS corruption. I have had ext4
corruption due to an inotify bug, which kmemcheck found on the first try.

--
Bob Copeland %% www.bobcopeland.com
Re: [PATCH V3 0/8] Cleancache: overview
On Fri, Jul 23, 2010 at 06:58:03AM -0700, Dan Magenheimer wrote:
> CHRISTOPH AND ANDREW, if you disagree and your concerns have not
> been resolved, please speak up.

Anything that needs modification of a normal non-shared fs is utterly
broken and you'll get a clear NAK, so the proposal before is a good one.

There are a couple more issues, like the still-weird prototypes: e.g.,
i_ino might not be enough to uniquely identify an inode on several
filesystems that use 64-bit inode numbers on 32-bit systems.

Also, making the ops vector global is just a bad idea. There is nothing
making this sort of caching inherently global.
RE: [PATCH V3 0/8] Cleancache: overview
From: Christoph Hellwig [mailto:h...@infradead.org]
Subject: Re: [PATCH V3 0/8] Cleancache: overview

>> On Fri, Jul 23, 2010 at 06:58:03AM -0700, Dan Magenheimer wrote:
>>> CHRISTOPH AND ANDREW, if you disagree and your concerns have not
>>> been resolved, please speak up.

Hi Christoph --

Thanks very much for the quick (instantaneous?) reply!

> Anything that needs modification of a normal non-shared fs is utterly
> broken and you'll get a clear NAK, so the proposal before is a good one.

Unless/until all filesystems are 100% built on top of VFS, I have to
disagree. Abstractions (e.g. VFS) are never perfect. And the relevant
filesystem maintainers have acked, so I'm wondering who you are NAK'ing
for?

Nitin's proposal attempts to move the VFS hooks around to fix usage for
one fs (btrfs) that, for whatever reason, has chosen not to layer itself
completely on top of VFS; this sounds to me like a recipe for disaster.
I think Minchan's reply quickly pointed out one issue... what other
filesystems that haven't been changed might encounter a rare data
corruption issue because cleancache is transparently enabled for their
page cache pages?

It also requires support to be dropped entirely for another fs (ocfs2),
which one user (zcache) can't use, but the other (tmem) makes very good
use of.

No, the per-fs opt-in is very sensible, and its design is very minimal.
Could you please explain your objection further?

> There are a couple more issues, like the still-weird prototypes: e.g.,
> i_ino might not be enough to uniquely identify an inode on several
> filesystems that use 64-bit inode numbers on 32-bit systems.

This reinforces my per-fs opt-in point. Such filesystems should not
enable cleancache (or should enable it only on the appropriate systems).

> Also, making the ops vector global is just a bad idea. There is nothing
> making this sort of caching inherently global.

I'm not sure I understand your point, but two very different users of
cleancache have been provided, and more will be discussed at the MM
summit next month. Do you have a suggestion on how to avoid a global ops
vector while still serving the needs of both existing users?

Thanks,
Dan
RE: [PATCH V3 0/8] Cleancache: overview
From: Nitin Gupta [mailto:ngu...@vflare.org]
Sent: Friday, July 23, 2010 9:05 AM
To: Dan Magenheimer
Cc: Christoph Hellwig; a...@linux-foundation.org; Chris Mason;
v...@zeniv.linux.org.uk; adil...@sun.com; ty...@mit.edu; mfas...@suse.com;
Joel Becker; matt...@wil.cx; linux-bt...@vger.kernel.org;
linux-ker...@vger.kernel.org; linux-fsde...@vger.kernel.org;
linux-e...@vger.kernel.org; ocfs2-de...@oss.oracle.com;
linux...@kvack.org; jer...@goop.org; jbeul...@novell.com; Kurt Hackel;
npig...@suse.de; Dave Mccracken; r...@redhat.com; a...@redhat.com;
Konrad Wilk
Subject: Re: [PATCH V3 0/8] Cleancache: overview

> On 07/23/2010 08:14 PM, Dan Magenheimer wrote:
>>> From: Christoph Hellwig [mailto:h...@infradead.org]
>>> Also, making the ops vector global is just a bad idea. There is
>>> nothing making this sort of caching inherently global.
>>
>> I'm not sure I understand your point, but two very different users of
>> cleancache have been provided, and more will be discussed at the MM
>> summit next month. Do you have a suggestion on how to avoid a global
>> ops vector while still serving the needs of both existing users?
>
> Maybe introduce cleancache_register(struct cleancache_ops *ops)? This
> would allow making cleancache_ops non-global. No value added, but maybe
> that's cleaner?

Oh, OK, that seems reasonable.

Dan
Re: [PATCH V3 0/8] Cleancache: overview
On 07/23/2010 11:07 PM, Dan Magenheimer wrote:
>>>> Anything that needs modification of a normal non-shared fs is utterly
>>>> broken and you'll get a clear NAK, so the proposal before is a good
>>>> one.
>>
>> Unless/until all filesystems are 100% built on top of VFS, I have to
>> disagree. Abstractions (e.g. VFS) are never perfect.
>
> After thinking about this some more, I can see a way to enforce opt-in
> in the cleancache backend without any changes to non-generic fs code.
> I think it's a horrible hack and we can try it, but I expect fs
> maintainers would prefer the explicit one-line-patch opt-in.
>
> 1) Cleancache backend maintains a list of known working filesystems
> (those that have been tested).

Checks against a known-working list indeed look horrible. Isn't there any
way to identify pagecache <-> disk I/O boundaries which every filesystem
obeys? I'm not yet sure, but if this is doable, then we won't require
such hacks.

> 2) Nitin's proposed changes pass the *sb as a parameter. The string name
> of the filesystem type is available via sb->s_type->name. This can be
> compared against the known working list.

sb->s_magic could also be used -- or better if we can somehow get rid of
these checks :)

> Using the sb pointer as a handle requires an extra table search on every
> cleancache get/put/flush, and fs/super.c changes are required for fs
> unmount notification anyway (e.g. to call cleancache_flush_fs), so I'd
> prefer to keep the cleancache_poolid addition to the sb.

I'll assume this is OK since this is in generic fs code. I will also try
making changes to cleancache so it does not touch any fs-specific code.
Though IMHO one-liners to fs code should really be acceptable, this
unfortunately doesn't seem to be the case. Maybe a generic cleancache
will have better chances.

Thanks,
Nitin
btrfsctl returns bad exit status
Using btrfsctl from ubuntu maverick, (built and running on ubuntu-10.04),
I get:

  r...@eisenhower# btrfsctl -s snap1 /home || echo failed
  operation complete
  Btrfs Btrfs v0.19
  failed
  r...@eisenhower# ls -las snap1
  total 4
  4 drwxr-xr-x 1 root    root      32 2010-07-20 18:25 .
  0 drwxr-xr-x 1 rich    rich    1324 2010-07-21 12:47 ..
  0 drwxr-xr-x 1 checker build     22 2010-07-20 18:21 checker
  0 drwxr-xr-x 1 rich    rich    1322 2010-07-21 12:44 rich
  0 drwxr-xr-x 1 root    root      36 2010-07-21 10:47 za-cb

This would seem to indicate a non-zero exit status, despite the fact that
the snapshotting operation appears to have succeeded. This makes it
impossible to programmatically check whether a snapshot was created
successfully or not, as I will need to explicitly discard the exit status
from btrfsctl.

I get the same results when built from git, except that it identifies
itself as v0.19-16-g075587c.

I would expect that btrfsctl would return a zero exit status when it was
capable of doing what it was asked, and a non-zero exit status only when
it could not.

--rich
Re: btrfsctl returns bad exit status
Try using the btrfs tool; btrfsctl is/will be deprecated, I think... and
it's better anyway.

C Anthony [mobile]

On Jul 23, 2010, at 3:04 PM, K. Richard Pixley <r...@noir.com> wrote:
> Using btrfsctl from ubuntu maverick, (built and running on
> ubuntu-10.04), I get:
> [snip]
> I would expect that btrfsctl would return a zero exit status when it was
> capable of doing what it was asked, and a non-zero exit status only when
> it could not.
RE: [PATCH V3 0/8] Cleancache: overview
>> On Fri, Jul 23, 2010 at 06:58:03AM -0700, Dan Magenheimer wrote:
>>> CHRISTOPH AND ANDREW, if you disagree and your concerns have not
>>> been resolved, please speak up.
>>
>> Anything that needs modification of a normal non-shared fs is utterly
>> broken and you'll get a clear NAK, so the proposal before is a good
>> one.
>
> No, the per-fs opt-in is very sensible, and its design is very minimal.

Not to belabor the point, but maybe the right way to think about this is:

Cleancache is a new optional feature provided by the VFS layer that
potentially dramatically increases page cache effectiveness for many
workloads in many environments at a negligible cost.

Filesystems that are well-behaved and conform to certain restrictions can
utilize cleancache simply by making a call to cleancache_init_fs at mount
time. Unusual, misbehaving, or poorly layered filesystems must either add
additional hooks and/or undergo extensive additional testing... or should
just not enable the optional cleancache.