Re: Permissions model for btrfs?
On Thu, Dec 09, 2010 at 12:35:35PM -0500, Wayne Pollock wrote:
> I looked through the wiki (and searched the archives) but don't see an
> answer. Will btrfs support old POSIX-style ACLs and permissions, or the
> new NFS/NT-style ACLs like ZFS? From the patch I saw, it seems old
> POSIX ACLs and permissions, but I'd like to know for sure. (And maybe
> the FAQ on the wiki could address this?) If it is the older POSIX ACLs,
> is there any plan to support NFSv4 ACLs in the future?

Right now it supports POSIX ACLs. I don't know about future plans.

> On a related note, will btrfs support any ext4 attributes (via chattr)?

It currently supports AaDdiS.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Questions regarding COW-related behaviors
(Sorry for sending twice.)

On Mon, Nov 08, 2010 at 02:23:13PM +0000, João Eduardo Luís wrote:
> Basically, I need to be aware of how the COW works in BTRFS, and what
> it may allow one to achieve. Questions follow.

From your questions, you don't seem to understand CoW. CoW is basically
an alternative to the logging/journalling used by most filesystems.

When you change a data structure in a journalling filesystem, like ext4,
you actually write two copies--one into the journal, and one that
overwrites the old data structure. If a crash happens, at least one copy
will still be valid, making recovery possible.

When you change a data structure in a CoW filesystem, like btrfs, you
write only one copy, but you DON'T write it over the old data structure.
You write it to new, unallocated space. This means the location of the
data structure has changed, so you have to change the parent data
structure; you use CoW for that too, and so on up to the superblocks,
which actually are overwritten. Once that's finished, the old versions
are no longer needed, so they are deallocated and eventually
overwritten. If a crash happens, the superblocks will still point to the
old version of the data structures.

This makes it relatively easy to add snapshot features--just add
reference counting, and don't free old versions of data structures if
they're still being used. However, this only happens if the user
explicitly requests a snapshot. Otherwise, the old data structures are
freed as soon as the new ones are completely written.

> 1) Is COW only used when creating or updating a file? While testing
> BTRFS, using 'btrfs subvolume find-new', I got the idea that neither
> creation of directories, nor any kind of deletion, is covered by COW.
> Is this right?

CoW is used any time any structure is changed. find-new is not directly
related to CoW.

> 2) Each time a COW happens, is there any kind of implicit
> 'snapshotting' that may keep track of changes around the filesystem for
> each COW?
> By Rodeh's paper and some info on the wiki, I gather that a new root is
> created for each COW, due to shadowing, but will the previous tree be
> kept? The wiki, at BTRFS Design, states that after the commit finishes,
> the older subvolume root items may be removed. This would make it
> impossible to track changes to files, but 'btrfs subvolume find-new'
> still manages to output file generations, so there must be some info
> left behind.

The old tree is discarded unless the user requested a snapshot of it.
Every time btrfs updates the roots, a new generation begins. Some data
structures have generation fields, indicating the generation in which
they were most recently changed. This is mostly used to verify that the
filesystem is correct, but it's also possible to scan the generation
fields and find out which files have changed.

> 3) Following (2), is there any way to access this information and,
> let's say, recover an older version of a given file? Or an entire
> previous tree?

No, unless the user requests a snapshot. I'm assuming you're not talking
about tools like PhotoRec, which try to reassemble files from whatever
disk data looks valid.

> 4) From Rodeh's paper I got the idea that BTRFS uses periodic
> checkpointing, in order to assign generations to operations. Using
> 'btrfs subvolume find-new' I confirmed my suspicions. After copying two
> different directories into the same subvolume at the same time, all
> files got assigned the same generation, and it took a while until they
> all showed up. This raises the question: what triggers a new
> checkpoint? Is it based on elapsed time since the last checkpoint? Is
> it triggered by a COW, such that all COWs happening at the same time
> are put together to create one big new generation?

Again, periodic checkpointing is probably the wrong way to think about
it. It would be wasteful to overwrite the superblocks every time a
change is made; instead, btrfs may combine multiple changes into one
generation and only update the superblocks once.
I'm not sure exactly how btrfs decides when to write a new generation.

> 5) If we have multiple jobs updating the same file at the same time, I
> assume the system will shadow their updates; when the time for
> committing comes, will there be any kind of 'conflict' between
> concurrent updates, or will they be applied in order of commit,
> ignoring whether there were previous commits or not? Regarding
> checkpointing, will all the changes be shown as part of the generation,
> or will they be considered as only one?

This is handled just like in any other filesystem. There are no
concurrent generations; if two threads both update a file, btrfs will
handle the updates sequentially, one at a time.
Re: crc32c
On Mon, Oct 11, 2010 at 03:47:58AM -0500, Nathan Caza wrote:
> I think I'm on the verge of getting all my data back; the only missing
> piece is to recalculate the crc checksum of my altered superblock, and
> I'm having trouble finding the correct function/method. The data I am
> checksumming is (based on the sheet) 0x20 (directly after the checksum)
> to 0x32b + n (226 bytes). If I'm doing something wrong, let me know, or
> if there's a quick/dirty way to get the right checksum.

The checksum seems to be over 0x20 to 0x1000.
Re: On-Disk Format
On Sun, Oct 10, 2010 at 10:43:56PM -0500, Nathan Caza wrote:
> Is this up-to-date? If not, has anyone put together something like this
> that's more recent?
> https://btrfs.wiki.kernel.org/index.php/User:Wtachi/On-disk_Format

It should be up-to-date, to the extent that it contains any useful
information at all. It's basically a sketch I wrote when I was first
figuring out btrfs, and I haven't gotten around to filling in the
details.
Re: BTRFS SSD
On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:
> I know BTRFS is a kind of log-structured file system, which doesn't do
> overwrites. Here is my question: suppose file A is overwritten by A'.
> Instead of writing A' to the original place of A, a new place is
> selected to store it. However, we know that the address of a file
> should be recorded in its inode. In such a case, the corresponding part
> of A's inode should be updated from the original place of A to the new
> place of A'; is this actually a kind of overwrite? I think no matter
> what the design is for a log-structured FS, a mapping table is always
> needed, such as an inode map, DAT, etc. When an update operation
> happens to this mapping table, is it actually a kind of overwrite? If
> it is, is it a bottleneck for write performance on SSDs?

In btrfs, this is solved by doing the same thing for the inode--a new
place for the leaf holding the inode is chosen. Then the parent of the
leaf must point to the new position of the leaf, so the parent is moved,
and the parent's parent, etc. This goes all the way up to the
superblocks, which are actually overwritten one at a time.

> What do you think is the major work BTRFS can do to improve performance
> on SSDs? I know the FTL has become smarter and smarter, and the idea of
> a log-structured file system is always implemented inside the SSD by
> the FTL; in that case, it sounds like all the issues have been solved
> no matter what FS is in the upper stack. But at least the results of
> benchmarks on the internet show that the performance of different
> filesystems, such as NILFS2 and BTRFS, is quite different.
Re: BTRFS SSD
On Wed, Sep 29, 2010 at 02:45:29PM -0400, Yuehai Xu wrote:
> On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell wingedtachik...@gmail.com wrote:
>> On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:
>>> [...]
>>
>> In btrfs, this is solved by doing the same thing for the inode--a new
>> place for the leaf holding the inode is chosen. Then the parent of the
>> leaf must point to the new position of the leaf, so the parent is
>> moved, and the parent's parent, etc. This goes all the way up to the
>> superblocks, which are actually overwritten one at a time.
>
> You mean that there is no overwrite for the inode either; once the
> inode needs to be updated, it is actually written to a new place, and
> the only thing to do is to change the pointer of its parent to this new
> place. However, for the last parent, the superblock, does it need to be
> overwritten?

Yes. The idea of copy-on-write, as used by btrfs, is that whenever
*anything* is changed, it is simply written to a new location. This
applies to data, inodes, and all of the B-trees used by the filesystem.
However, it's necessary to have *something* in a fixed place on disk
pointing to everything else. So the superblocks can't move, and they are
overwritten instead.
Re: BTRFS SSD
On Wed, Sep 29, 2010 at 03:39:07PM -0400, Aryeh Gregor wrote:
> On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell wingedtachik...@gmail.com wrote:
>> In btrfs, this is solved by doing the same thing for the inode--a new
>> place for the leaf holding the inode is chosen. Then the parent of the
>> leaf must point to the new position of the leaf, so the parent is
>> moved, and the parent's parent, etc. This goes all the way up to the
>> superblocks, which are actually overwritten one at a time.
>
> Sorry for the useless question, but just out of curiosity: doesn't this
> mean that btrfs has to do quite a lot more writes than ext4 for small
> file operations? E.g., if you append one block to a file, like a log
> file, then ext3 should have to do about three writes: data, metadata,
> and journal (and the latter is always sequential, so it's cheap). But
> btrfs will need to do more, rewriting parent nodes all the way up the
> line for both the data and metadata blocks. Why doesn't this hurt
> performance a lot?

For a single change, it does write more. However, there are usually many
changes to children being performed at once, which only require one
change to the parent. Since it's moving everything to new places, btrfs
also has much more control over where writes occur, so all the leaves
and parents can be written sequentially. ext3 is a slave to the current
locations on disk.
Re: A device dedicated for metadata?
In response to your original questions, btrfs currently gives no control
over the allocation of data or metadata. I'm sure someone will implement
more control eventually.

On Wed, Jul 28, 2010 at 11:49:33PM +0800, wks1986 wrote:
> Another issue is the speed of fsck. There will always be times when the
> operating system is brought down abnormally and fsck is necessary. In
> order to make the downtime as short as possible, fsck should be fast.
> In this case, when metadata are stored on a fast device, fsck will be
> significantly faster. The hot data tracking patch is based on
> statistics of ONLINE accesses. Some data may suddenly become hot when
> the filesystem goes offline for fsck.

Actually, because of copy-on-write and other aspects of btrfs' design,
there's no need for the typical run of fsck after a crash. Even once a
proper fsck is finished, it will only be necessary when important
information is corrupted. So it generally doesn't make sense to worry
about fsck speed.
Re: raid modes, balancing, and order in which data gets written
On Thu, Jul 15, 2010 at 10:29:07AM +0200, Mathijs Kwik wrote:
> Hi all,
> I read that btrfs - in a raid mode - does not mimic the behavior of
> traditional (hw/sw) raid. After writing to a btrfs raid filesystem,
> data will only be distributed the way you expect after running a
> rebalance.

This is not the case. When you create a btrfs filesystem with RAID
enabled, anything written from then on will be written just like with
traditional RAID. The difference from traditional RAID is that different
parts of the FS can have different RAID settings. Btrfs reserves space
in ~1GiB block groups for data or metadata, each of which has its own
RAID settings. If you change the RAID mode of an existing filesystem
(not yet supported, IIUC) or add/remove devices, the existing block
groups will keep their old RAID settings if at all possible. Rebalancing
essentially moves everything into new block groups, which will use the
new RAID settings and be better balanced between data and metadata. It
isn't useful unless you change RAID settings, add/remove devices, or
have too much space reserved for either data or metadata.

[...]

> If you never rebalance manually, will the filesystem do this in the
> background (when idle)? Or will the fs never rebalance itself and only
> become more balanced again after writing/changing some files, which it
> will then place on the drive which has the lowest balance?

Rebalancing isn't done automatically, and nothing can become more
balanced until new block groups are created when you run out of space in
the old ones.

> Basically, I'm not sure I fully understood balancing, so any info on
> this would be great. In traditional raid0 and raid10 (block based), it
> is guaranteed that any big file will always be striped between disks
> equally, so a certain performance can be assumed. With non-automatic
> balancing, I'm afraid some files might not be distributed as well as
> they could be, resulting in lower performance.
> Is this an issue to be aware of, or can I safely assume that for most
> use cases the performance will be roughly the same as sw-raid? 2 cases
> I'm interested in:
> - big databases (lots of rewrites)
> - real-time video capturing (sustained writes to 1 or more big files,
>   needing a guaranteed write throughput)

If you initially create the filesystem with the right RAID settings, it
will act just like normal software RAID. Balancing only comes into play
when you start changing your mind :).

> Any info on this or balancing in general will be greatly appreciated.
Re: snapshotting - what data gets shared?
On Wed, Jul 14, 2010 at 11:27:39PM +0200, Mathijs Kwik wrote:
> Hi all,
> I'm used to snapshots with LVM, and I would like to compare them to
> btrfs. The case I want to compare is the following: at the moment a
> snapshot is created, no extra space is needed (maybe some metadata
> overhead) and all data is shared between the original and the snapshot.
> In LVM, snapshots work at the block level, so any changes made to the
> original volume trigger a COW to the snapshot. If LVM is configured to
> use 4MB blocks (the default), this means that overwriting a 100k file
> will lead to 4MB of snapshot data being backed up. An 800MB file will
> take around 800MB. So, for small files (that are not on the same
> extent/block) this can waste quite some space, while for bigger files,
> or lots of files close to each other, it doesn't matter much. How is
> this for btrfs snapshots? Do they work at the file level, or also at
> blocks/extents? I mean, does changing a 100k file lead to 100k being
> snapshotted?

Btrfs CoWs file extents, and files can use only the parts of an extent
they need, so a 1-byte change would only require one additional 4K data
block. Of course, metadata also needs to be updated, and will require a
number of additional blocks.

> What would happen if I have a 20G file (for example a disk image for
> kvm)? Would minor changes in that file lead to the entire 20G being
> COWed/backed up?

No, only the relevant portion.

> Is there a distinction between data and metadata? Or does touching
> (ctime/mtime) or visiting (atime) a file cause it to be COWed?

Metadata is CoWed separately, so there will still only be one copy of
the data.

> Thanks for any info on this.
> Mathijs
[PATCH] btrfs: handle errors for FS_IOC_SETFLAGS
Makes btrfs_ioctl_setflags return -ENOSPC and other errors when
necessary.

Signed-off-by: Sean Bartell wingedtachik...@gmail.com
---
I ran chattr -R on a full FS and btrfs crashed. This overlaps with the
patch series being worked on by Jeff Mahoney.

 fs/btrfs/ioctl.c |   17 -
 1 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4dbaf89..8db62c2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -200,19 +200,26 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 
 	trans = btrfs_join_transaction(root, 1);
-	BUG_ON(!trans);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto out_drop_write;
+	}
 
 	ret = btrfs_update_inode(trans, root, inode);
-	BUG_ON(ret);
+	if (ret)
+		goto out_endtrans;
 
 	btrfs_update_iflags(inode);
 	inode->i_ctime = CURRENT_TIME;
-	btrfs_end_transaction(trans, root);
+	ret = 0;
+out_endtrans:
+	btrfs_end_transaction(trans, root);
+out_drop_write:
 	mnt_drop_write(file->f_path.mnt);
- out_unlock:
+out_unlock:
 	mutex_unlock(inode->i_mutex);
-	return 0;
+	return ret;
 }
 
 static int btrfs_ioctl_getversion(struct file *file, int __user *arg)
--
1.7.1
Re: Is there a more aggressive fixer than btrfsck?
On Tue, Jun 29, 2010 at 02:36:14PM -0700, Freddie Cash wrote:
> On Tue, Jun 29, 2010 at 3:37 AM, Daniel Kozlowski dan.kozlow...@gmail.com wrote:
>> On Mon, Jun 28, 2010 at 10:31 PM, Rodrigo E. De León Plicet rdele...@gmail.com wrote:
>>> On Mon, Jun 28, 2010 at 8:48 AM, Daniel Kozlowski dan.kozlow...@gmail.com wrote:
>>>> Sean Bartell wingedtachikoma at gmail.com writes:
>>>>>> Is there a more aggressive filesystem restorer than btrfsck? It
>>>>>> simply gives up immediately with the following error:
>>>>>> btrfsck: disk-io.c:739: open_ctree_fd: Assertion
>>>>>> `!(!tree_root->node)' failed.
>>>>>
>>>>> btrfsck currently only checks whether a filesystem is consistent.
>>>>> It doesn't try to perform any recovery or error correction at all,
>>>>> so it's mostly useful to developers. Any error handling occurs
>>>>> while the filesystem is mounted.
>>>>
>>>> Is there any plan to implement this functionality? It would seem to
>>>> me to be a pretty basic feature that is missing.
>>>
>>> If Btrfs aims to be at least half of what ZFS is, then it will not
>>> impose a need for fsck at all. Read "No, ZFS really doesn't need a
>>> fsck" at the following URL:
>>> http://www.c0t0d0s0.org/archives/6071-No,-ZFS-really-doesnt-need-a-fsck.html
>>
>> Interesting idea. It would seem to me, however, that the functionality
>> described in that article is more concerned with a bad transaction
>> than with something like a hardware failure, where a block written
>> more than 128 transactions ago is now corrupted and consequently the
>> entire partition is now unmountable (that is what I think I am looking
>> at with BTRFS).
>
> In the ZFS case, this is handled by checksumming and redundant data,
> and can be discovered (and fixed) either by reading the affected data
> block (in which case the checksum is wrong, the data is read from a
> redundant data block, and the correct data is written over the
> incorrect data) or by running a scrub. Self-healing, checksumming, and
> data redundancy eliminate the need for online (or offline) fsck.
> Automatic transaction rollback at boot eliminates the need for fsck at
> boot, as there is no such thing as a dirty filesystem. Either the data
> is on disk and correct, or it doesn't exist. Yes, you may lose data.
> But you will never have a corrupted filesystem. Not sure how things
> work for btrfs.

btrfs works in a similar way. While it's writing new data, it keeps the
superblock pointing at the old data, so after a crash you still get the
complete old version. Once the new data is written, the superblock is
updated to point at it, ensuring that you see the new data. This
eliminates the need for any special handling after a crash. btrfs also
uses checksums and redundancy to protect against data corruption.
Thanks to its design, btrfs doesn't need to scan the filesystem or
cross-reference structures to detect problems. It can easily detect
corruption at run time, when it tries to read the problematic data, and
fix it using the redundant copies. In the event that something goes
horribly wrong--for example, if every copy of the superblock or of a
tree root is corrupted--you could still find some valid nodes and try to
piece them together; however, this is rare and falls outside the scope
of a fsck anyway.
Re: Is there a more aggressive fixer than btrfsck?
On Tue, Jun 01, 2010 at 07:29:56PM -0700, u...@sonic.net wrote:
> Is there a more aggressive filesystem restorer than btrfsck? It simply
> gives up immediately with the following error:
> btrfsck: disk-io.c:739: open_ctree_fd: Assertion `!(!tree_root->node)' failed.

btrfsck currently only checks whether a filesystem is consistent. It
doesn't try to perform any recovery or error correction at all, so it's
mostly useful to developers. Any error handling occurs while the
filesystem is mounted.

> Yet the filesystem has plenty of data on it, the discs are good, and I
> didn't do anything to the data except regular btrfs commands and normal
> mounting. That's a wildly unreliable filesystem.

btrfs is under heavy development, so make sure you're using the latest
git versions of the kernel module and tools.

> BTW, is there a way to improve delete and copy performance of btrfs?
> I'm getting about 50KB/s-500KB/s (per size of file being deleted) when
> deleting and/or copying files on a disc that can usually do about
> 80MB/s. I think it's because they were fragmented. That implies btrfs
> is too accepting of writing data in a fragmented style when it doesn't
> have to. Almost all the files on my btrfs partitions are around a gig,
> or 20 gigs, or a third of a gig, or stuff like that. The filesystem is
> 1.1TB.
> Brad
Re: Quota Support
On Wed, Jun 02, 2010 at 10:57:44AM +0200, Stephen wrote:
> I'm just wondering if subvolumes or snapshots can have quotas imposed
> on them.

Subvolume quotas are one of the many features that haven't yet been
implemented. See
https://btrfs.wiki.kernel.org/index.php/Development_timeline.
Re: [PATCH 1/4] btrfs-convert: make more use of cache_free_extents
On Tue, May 18, 2010 at 09:40:28PM +0800, Yan, Zheng wrote:
> On Sat, Mar 20, 2010 at 12:24 PM, Sean Bartell wingedtachik...@gmail.com wrote:
>> An extent_io_tree is used for all free space information. This allows
>> removal of ext2_alloc_block and ext2_free_block, and makes
>> create_ext2_image less ext2-specific.
>>
>> +	ret = ext2_cache_free_extents(ext2_fs, orig_free_tree);
>> +	if (ret) {
>> +		fprintf(stderr, "error during cache_free_extents %d\n", ret);
>> +		goto fail;
>> +	}
>> +	/* preserve first 64KiB, just in case */
>> +	clear_extent_dirty(orig_free_tree, 0, BTRFS_SUPER_INFO_OFFSET - 1, 0);
>> +
>> +	ret = copy_dirtiness(free_tree, orig_free_tree);
>> +	if (ret) {
>> +		fprintf(stderr, "error during copy_dirtiness %d\n", ret);
>> +		goto fail;
>> +	}
>
> extent_io_tree is not very space efficient. Caching free space in
> several places is not good. I prefer adding a function that checks
> whether a given block is used to the 'convert_fs' structure.

Good point. I'll change cache_free_extents to something like

	int (*iterate_used_extents)(struct convert_fs *fs, u64 start, u64 end,
				    void *priv, int (*cb)(u64 start, u64 end));

create_image_file_range and do_convert should work well with a callback.
This also opens up the possibility of finding free extents
incrementally: call iterate_used_extents on the first GB, then
custom_alloc_extent will call it on the next GB once free space runs
out.
Re: Restoring BTRFS partition
On Tue, Apr 20, 2010 at 11:55:38PM +0800, Wengang Wang wrote:
> I guess the reason is that the 300M file btrfs and the one on your
> partition have different block sizes. Thus 65k of zeros in your file
> image doesn't mean 65k on the partition. So maybe you should try with
> blocks instead of bytes.

Actually, the block size doesn't matter for this--the superblock is
always at 0x10000. Alli, I think you'll have to upload the start of the
partition so someone can take a look at it.
Re: Restoring BTRFS partition
On Tue, Apr 20, 2010 at 06:13:41PM +0000, Alli Quaknaa wrote:
> So here are the first ~12M of the partition. There was some junk
> preceding what is in the file, but it mostly looked like my swap or
> something (cached css, javascript and webpages I've recently visited) -
> though I hope the beginning of the partition isn't somewhere else.
> Hopefully you'll be able to tell from the dump.
> http://pub.yweb.cz/sda7_head.dump

The superblock in that file (starting at byte 0x10) is actually a mirror
of the real superblock. Aside from the real superblock at 0x10000, btrfs
stores mirror copies of the superblock at 0x4000000 (64 MiB),
0x4000000000 (256 GiB), and 0x4000000000000 (1 PiB). Each superblock has
a field that indicates where it is; when you made your image, you put
the mirror superblock where the real superblock was supposed to be, and
btrfs refused to mount it because that field was wrong.

The real start of the btrfs partition is 0x4000000 bytes (64 MiB) before
the place you found that mirror superblock; the real superblock should
be 0x3ff0000 bytes before the mirror. Even if the real superblock is
corrupt, if the mirror is at 0x4000000, where it's supposed to be, you
should be able to get btrfs to mount it (though I think you might need a
mount option or a patch).
Re: Restoring BTRFS partition
On Tue, Apr 20, 2010 at 10:25:34PM +0000, Alli Quaknaa wrote:
> I think I have found the real superblock you are talking about, but I'm
> afraid I may have written something in the first 64MiB. Is there a
> chance btrfsck will recover it?

btrfsck is currently very limited; it only detects a limited number of
problems, and it can't fix anything. Btrfs focuses on handling problems
when they are discovered while using the FS; generally, it should handle
corruption relatively gracefully. However, if anything really crucial
was overwritten and the FS can't be mounted, there aren't any tools to
repair it.

> Also, I think there's gotta be a better way to manipulate those huge
> files than dd and hexedit for examination - I'd like to take the raw
> file, open it in some hex editor and be able to cut off some of its
> beginning - I can't seem to be able to do it with hexedit. Is there a
> tool you'd recommend?

For viewing, you can use less, head, and tail with hexdump:

	tail -c +$((0x10000+1)) /dev/sda1 | hexdump -C | less

will view the disk starting at the superblock. For editing, dd is
probably best, though you could use a hex editor like Okteta. I've also
heard of Radare, supposedly a very advanced command-line tool. Keep in
mind that any tool that deletes the first part of a huge file will be
forced to rewrite the entire file.
Re: [PATCH 2/4] btrfs-convert: Add extent iteration functions.
Whoops, there's a major memory leak. Please apply this patch to the
patch :).

diff --git a/convert.c b/convert.c
index dfd2976..7bb4ed0 100644
--- a/convert.c
+++ b/convert.c
@@ -471,21 +471,24 @@ int finish_file_extents(struct extent_iterate_data *priv)
 			return ret;
 		}
 		*priv->inode_nbytes += priv->size;
-		return btrfs_insert_inline_extent(priv->trans, priv->root,
-						  priv->objectid,
-						  priv->file_off, priv->data,
-						  priv->size);
-	}
-
-	ret = commit_file_extents(priv);
-	if (ret)
-		return ret;
-
-	if (priv->total_size > priv->last_file_off) {
-		ret = commit_disk_extent(priv, priv->last_file_off, 0,
-					 priv->total_size - priv->last_file_off);
+		ret = btrfs_insert_inline_extent(priv->trans, priv->root,
+						 priv->objectid,
+						 priv->file_off, priv->data,
+						 priv->size);
 		if (ret)
 			return ret;
+	} else {
+		ret = commit_file_extents(priv);
+		if (ret)
+			return ret;
+
+		if (priv->total_size > priv->last_file_off) {
+			ret = commit_disk_extent(priv, priv->last_file_off, 0,
+						 priv->total_size -
+						 priv->last_file_off);
+			if (ret)
+				return ret;
+		}
 	}
 	free(priv->data);
 	return 0;
[PATCH 2/4] btrfs-convert: Add extent iteration functions.
A filesystem can have disk extents in arbitrary places on the disk, as well as extents that must be read into memory because they have compression or encryption btrfs doesn't support. These extents can be passed to the new extent iteration functions, which handle all the details of alignment, allocation, etc.
---
 convert.c | 604 -
 1 files changed, 401 insertions(+), 203 deletions(-)

diff --git a/convert.c b/convert.c
index c48f8ba..bd91990 100644
--- a/convert.c
+++ b/convert.c
@@ -357,7 +357,7 @@ error:
 }
 
 static int read_disk_extent(struct btrfs_root *root, u64 bytenr,
-			    u32 num_bytes, char *buffer)
+			    u64 num_bytes, char *buffer)
 {
 	int ret;
 	struct btrfs_fs_devices *fs_devs = root->fs_info->fs_devices;
@@ -371,6 +371,23 @@ fail:
 		ret = -1;
 	return ret;
 }
+
+static int write_disk_extent(struct btrfs_root *root, u64 bytenr,
+			     u64 num_bytes, const char *buffer)
+{
+	int ret;
+	struct btrfs_fs_devices *fs_devs = root->fs_info->fs_devices;
+
+	ret = pwrite(fs_devs->latest_bdev, buffer, num_bytes, bytenr);
+	if (ret != num_bytes)
+		goto fail;
+	ret = 0;
+fail:
+	if (ret < 0)
+		ret = -1;
+	return ret;
+}
+
 /*
  * Record a file extent. Do all the required works, such as inserting
  * file extent item, inserting extent item and backref item into extent
@@ -378,8 +395,7 @@ fail:
  */
 static int record_file_extent(struct btrfs_trans_handle *trans,
 			      struct btrfs_root *root, u64 objectid,
-			      struct btrfs_inode_item *inode,
-			      u64 file_pos, u64 disk_bytenr,
+			      u64 *inode_nbytes, u64 file_pos, u64 disk_bytenr,
 			      u64 num_bytes, int checksum)
 {
 	int ret;
@@ -391,7 +407,6 @@ static int record_file_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_path path;
 	struct btrfs_extent_item *ei;
 	u32 blocksize = root->sectorsize;
-	u64 nbytes;
 
 	if (disk_bytenr == 0) {
 		ret = btrfs_insert_file_extent(trans, root, objectid,
@@ -450,8 +465,7 @@ static int record_file_extent(struct btrfs_trans_handle *trans,
 	btrfs_set_file_extent_other_encoding(leaf, fi, 0);
 	btrfs_mark_buffer_dirty(leaf);
 
-	nbytes = btrfs_stack_inode_nbytes(inode) + num_bytes;
-	btrfs_set_stack_inode_nbytes(inode, nbytes);
+	*inode_nbytes += num_bytes;
 
 	btrfs_release_path(root, path);
@@ -492,95 +506,355 @@ fail:
 	return ret;
 }
 
-static int record_file_blocks(struct btrfs_trans_handle *trans,
-			      struct btrfs_root *root, u64 objectid,
-			      struct btrfs_inode_item *inode,
-			      u64 file_block, u64 disk_block,
-			      u64 num_blocks, int checksum)
-{
-	u64 file_pos = file_block * root->sectorsize;
-	u64 disk_bytenr = disk_block * root->sectorsize;
-	u64 num_bytes = num_blocks * root->sectorsize;
-	return record_file_extent(trans, root, objectid, inode, file_pos,
-				  disk_bytenr, num_bytes, checksum);
-}
-
-struct blk_iterate_data {
+struct extent_iterate_data {
 	struct btrfs_trans_handle *trans;
 	struct btrfs_root *root;
-	struct btrfs_inode_item *inode;
+	u64 *inode_nbytes;
 	u64 objectid;
-	u64 first_block;
-	u64 disk_block;
-	u64 num_blocks;
-	u64 boundary;
-	int checksum;
-	int errcode;
+	int checksum, packing;
+	u64 last_file_off;
+	u64 total_size;
+	enum {EXTENT_ITERATE_TYPE_NONE, EXTENT_ITERATE_TYPE_MEM,
+	      EXTENT_ITERATE_TYPE_DISK} type;
+	u64 size;
+	u64 file_off; /* always aligned to sectorsize */
+	char *data; /* for mem */
+	u64 disk_off; /* for disk */
 };
 
-static int block_iterate_proc(ext2_filsys ext2_fs,
-			      u64 disk_block, u64 file_block,
-			      struct blk_iterate_data *idata)
+static u64 extent_boundary(struct btrfs_root *root, u64 extent_start)
 {
-	int ret;
-	int sb_region;
-	int do_barrier;
-	struct btrfs_root *root = idata->root;
-	struct btrfs_trans_handle *trans = idata->trans;
-	struct btrfs_block_group_cache *cache;
-	u64 bytenr = disk_block * root->sectorsize;
-
-	sb_region = intersect_with_sb(bytenr, root->sectorsize);
-	do_barrier = sb_region || disk_block >= idata->boundary;
-	if ((idata->num_blocks > 0 && do_barrier) ||
-	    (file_block > idata->first_block + idata->num_blocks) ||
-	    (disk_block != idata->disk_block + idata->num_blocks)) {
-		if (idata->num_blocks > 0) {
-			ret = record_file_blocks(trans, root,
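The patch description says the iteration functions hide "all the details of alignment"; the core of that is rounding offsets to the btrfs sectorsize, since file extents must be sector-aligned. A minimal sketch of that arithmetic (`align_down`/`align_up` are hypothetical names for illustration, not helpers from this patch):

```c
#include <stdint.h>

/* Round an offset down or up to a power-of-two sector size -- the kind
 * of arithmetic the extent iteration helpers perform before extents can
 * be recorded, since btrfs file extents are sectorsize-aligned. */
static uint64_t align_down(uint64_t off, uint64_t sectorsize)
{
	return off & ~(sectorsize - 1);
}

static uint64_t align_up(uint64_t off, uint64_t sectorsize)
{
	return (off + sectorsize - 1) & ~(sectorsize - 1);
}
```

An unaligned source extent then covers the aligned range [align_down(start), align_up(end)), with the head and tail padded or merged with neighboring data.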
[PATCH 3/4] btrfs-convert: permit support for non-ext2 FSs
Filesystems need to provide a function open_blah that fills a struct convert_fs with some information and three function pointers. The function pointers are:
- cache_free_extents, which takes a struct extent_io_tree and marks all extents not being used by the filesystem as DIRTY
- copy_inodes, which copies the contents of the filesystem into a btrfs_root using CoW
- close

There's a void* in struct convert_fs for private use by the filesystem.

libblkid is used to determine the filesystem.
---
 Makefile  |   2 +-
 convert.c | 184 +
 2 files changed, 126 insertions(+), 60 deletions(-)

diff --git a/Makefile b/Makefile
index 525676e..755cc24 100644
--- a/Makefile
+++ b/Makefile
@@ -75,7 +75,7 @@ quick-test: $(objects) quick-test.o
 	gcc $(CFLAGS) -o quick-test $(objects) quick-test.o $(LDFLAGS) $(LIBS)
 
 convert: $(objects) convert.o
-	gcc $(CFLAGS) -o btrfs-convert $(objects) convert.o -lext2fs $(LDFLAGS) $(LIBS)
+	gcc $(CFLAGS) -o btrfs-convert $(objects) convert.o -lext2fs -lblkid $(LDFLAGS) $(LIBS)
 
 ioctl-test: $(objects) ioctl-test.o
 	gcc $(CFLAGS) -o ioctl-test $(objects) ioctl-test.o $(LDFLAGS) $(LIBS)
diff --git a/convert.c b/convert.c
index bd91990..6dfcb97 100644
--- a/convert.c
+++ b/convert.c
@@ -31,6 +31,7 @@
 #include <unistd.h>
 #include <uuid/uuid.h>
 #include <linux/fs.h>
+#include <blkid/blkid.h>
 #include "kerncompat.h"
 #include "ctree.h"
 #include "disk-io.h"
@@ -42,9 +43,26 @@
 #include <ext2fs/ext2fs.h>
 #include <ext2fs/ext2_ext_attr.h>
 
+struct convert_fs {
+	u64 total_bytes;
+	u64 blocksize;
+	const char *label;
+
+	/* Close the FS */
+	int (*close)(struct convert_fs *fs);
+	/* Mark free extents as dirty */
+	int (*cache_free_extents)(struct convert_fs *fs,
+				  struct extent_io_tree *tree);
+	/* Copy everything over */
+	int (*copy_inodes)(struct convert_fs *fs, struct btrfs_root *root,
+			   int datacsum, int packing, int noxattr);
+
+	void *privdata;
+};
+
 #define INO_OFFSET (BTRFS_FIRST_FREE_OBJECTID - EXT2_ROOT_INO)
 #define STRIPE_LEN (64 * 1024)
-#define EXT2_IMAGE_SUBVOL_OBJECTID BTRFS_FIRST_FREE_OBJECTID
+#define ORIG_IMAGE_SUBVOL_OBJECTID BTRFS_FIRST_FREE_OBJECTID
 
 /*
  * Open Ext2fs in readonly mode, read block allocation bitmap and
@@ -89,15 +107,16 @@ fail:
 	return -1;
 }
 
-static int close_ext2fs(ext2_filsys fs)
+static int ext2_close(struct convert_fs *fs)
 {
-	ext2fs_close(fs);
+	ext2fs_close((ext2_filsys)fs->privdata);
 	return 0;
 }
 
-static int ext2_cache_free_extents(ext2_filsys ext2_fs,
+static int ext2_cache_free_extents(struct convert_fs *fs,
 				   struct extent_io_tree *free_tree)
 {
+	ext2_filsys ext2_fs = fs->privdata;
 	int ret = 0;
 	blk_t block;
 	u64 bytenr;
@@ -117,19 +136,18 @@ static int ext2_cache_free_extents(ext2_filsys ext2_fs,
 }
 
 /* mark btrfs-reserved blocks as used */
-static void adjust_free_extents(ext2_filsys ext2_fs,
+static void adjust_free_extents(struct convert_fs *fs,
 				struct extent_io_tree *free_tree)
 {
 	int i;
 	u64 bytenr;
-	u64 blocksize = ext2_fs->blocksize;
 
 	clear_extent_dirty(free_tree, 0, BTRFS_SUPER_INFO_OFFSET - 1, 0);
 	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
 		bytenr = btrfs_sb_offset(i);
 		bytenr &= ~((u64)STRIPE_LEN - 1);
-		if (bytenr >= blocksize * ext2_fs->super->s_blocks_count)
+		if (bytenr >= fs->total_bytes)
 			break;
 		clear_extent_dirty(free_tree, bytenr,
 				   bytenr + STRIPE_LEN - 1, 0);
@@ -1373,9 +1391,10 @@ fail:
 /*
  * scan ext2's inode bitmap and copy all used inode.
  */
-static int copy_inodes(struct btrfs_root *root, ext2_filsys ext2_fs,
-		       int datacsum, int packing, int noxattr)
+static int ext2_copy_inodes(struct convert_fs *fs, struct btrfs_root *root,
+			    int datacsum, int packing, int noxattr)
 {
+	ext2_filsys ext2_fs = fs->privdata;
 	int ret;
 	errcode_t err;
 	ext2_inode_scan ext2_scan;
@@ -1426,8 +1445,8 @@ static int copy_inodes(struct btrfs_root *root, ext2_filsys ext2_fs,
 }
 
 /*
- * Construct a range of ext2fs image file.
- * scan block allocation bitmap, find all blocks used by the ext2fs
+ * Construct a range of the image file.
+ * scan block allocation bitmap, find all blocks used by the filesystem
 * in this range and create file extents that point to these blocks.
 *
 * Note: Before calling the function, no file extent points to blocks
@@ -1465,10 +1484,10 @@ static int create_image_file_range(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
 /*
- * Create the ext2fs image file.
+ * Create the image file.
 */
-static int create_ext2_image(struct
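The registration pattern this patch introduces can be sketched as follows. This is an illustration only: a trimmed-down copy of `struct convert_fs` (one callback instead of three) and a hypothetical `open_dummy` backend shaped like `ext2_open`, not code from the patch:

```c
#include <stddef.h>

/* Trimmed-down struct convert_fs: just enough fields and one callback
 * to show how a backend registers itself. */
struct convert_fs {
	unsigned long long total_bytes;
	unsigned long long blocksize;
	const char *label;
	int (*close)(struct convert_fs *fs);
	void *privdata;	/* private use by the filesystem backend */
};

/* Hypothetical backend state. */
struct dummyfs {
	unsigned long long blocks;
};

static int dummy_close(struct convert_fs *fs)
{
	/* a real backend would release its privdata here */
	fs->privdata = NULL;
	return 0;
}

/* Hypothetical open function, shaped like ext2_open: fill in sizes,
 * label, and callbacks, and stash backend state in privdata. */
static int open_dummy(struct convert_fs *fs, struct dummyfs *d)
{
	fs->privdata = d;
	fs->blocksize = 4096;
	fs->total_bytes = d->blocks * fs->blocksize;
	fs->label = "dummy";
	fs->close = dummy_close;
	return 0;
}
```

The generic conversion code then only ever touches the `struct convert_fs` callbacks, never ext2 (or dummy) specifics directly.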
[PATCH 4/4] btrfs-convert: split into convert/.
btrfs_trans_handle *trans;
-
-	trans = btrfs_start_transaction(root, 1);
-	if (!trans)
-		return -ENOMEM;
-	err = ext2fs_open_inode_scan(ext2_fs, 0, &ext2_scan);
-	if (err) {
-		fprintf(stderr, "ext2fs_open_inode_scan: %s\n", error_message(err));
-		return -1;
-	}
-	while (!(err = ext2fs_get_next_inode(ext2_scan, &ext2_ino,
-					     &ext2_inode))) {
-		/* no more inodes */
-		if (ext2_ino == 0)
-			break;
-		/* skip special inode in ext2fs */
-		if (ext2_ino < EXT2_GOOD_OLD_FIRST_INO &&
-		    ext2_ino != EXT2_ROOT_INO)
-			continue;
-		objectid = ext2_ino + INO_OFFSET;
-		ret = copy_single_inode(trans, root,
-					objectid, ext2_fs, ext2_ino,
-					&ext2_inode, datacsum, packing,
-					noxattr);
-		if (ret)
-			return ret;
-		if (trans->blocks_used >= 4096) {
-			ret = btrfs_commit_transaction(trans, root);
-			BUG_ON(ret);
-			trans = btrfs_start_transaction(root, 1);
-			BUG_ON(!trans);
-		}
-	}
-	if (err) {
-		fprintf(stderr, "ext2fs_get_next_inode: %s\n", error_message(err));
-		return -1;
-	}
-	ret = btrfs_commit_transaction(trans, root);
-	BUG_ON(ret);
-
-	return ret;
-}
 
 /*
  * Construct a range of the image file.
@@ -2586,26 +1805,6 @@ static int copy_dirtiness(struct extent_io_tree *out,
 	return 0;
 }
 
-int ext2_open(struct convert_fs *fs, const char *name)
-{
-	int ret;
-	ext2_filsys ext2_fs;
-	ret = open_ext2fs(name, &ext2_fs);
-	if (ret)
-		return ret;
-
-	fs->privdata = ext2_fs;
-	fs->blocksize = ext2_fs->blocksize;
-	fs->label = ext2_fs->super->s_volume_name;
-	fs->total_bytes = ext2_fs->super->s_blocks_count * fs->blocksize;
-
-	fs->cache_free_extents = ext2_cache_free_extents;
-	fs->close = ext2_close;
-	fs->copy_inodes = ext2_copy_inodes;
-
-	return 0;
-}
-
 static int open_fs(struct convert_fs *fs, const char *devname)
 {
 	static struct {
diff --git a/convert/convert.h b/convert/convert.h
new file mode 100644
index 000..4f31775
--- /dev/null
+++ b/convert/convert.h
@@ -0,0 +1,76 @@
+/*
+ * Copyright (C) 2007 Oracle. All rights reserved.
+ * Copyright (C) 2010 Sean Bartell. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#ifndef BTRFS_CONVERT_H
+#define BTRFS_CONVERT_H
+
+#include "ctree.h"
+#include "kerncompat.h"
+#include "transaction.h"
+
+struct convert_fs {
+	u64 total_bytes;
+	u64 blocksize;
+	const char *label;
+
+	/* Close the FS */
+	int (*close)(struct convert_fs *fs);
+	/* Mark free extents as dirty */
+	int (*cache_free_extents)(struct convert_fs *fs,
+				  struct extent_io_tree *tree);
+	/* Copy everything over */
+	int (*copy_inodes)(struct convert_fs *fs, struct btrfs_root *root,
+			   int datacsum, int packing, int noxattr);
+
+	void *privdata;
+};
+
+int ext2_open(struct convert_fs *fs, const char *name);
+
+struct extent_iterate_data {
+	struct btrfs_trans_handle *trans;
+	struct btrfs_root *root;
+	u64 *inode_nbytes;
+	u64 objectid;
+	int checksum, packing;
+	u64 last_file_off;
+	u64 total_size;
+	enum {EXTENT_ITERATE_TYPE_NONE, EXTENT_ITERATE_TYPE_MEM,
+	      EXTENT_ITERATE_TYPE_DISK} type;
+	u64 size;
+	u64 file_off; /* always aligned to sectorsize */
+	char *data; /* for mem */
+	u64 disk_off; /* for disk */
+};
+
+int start_file_extents(struct extent_iterate_data *priv,
+		       struct btrfs_trans_handle *trans,
+		       struct btrfs_root *root, u64 *inode_nbytes,
+		       u64 objectid, int checksum, int packing,
+		       u64 total_size);
+int start_file_extents_range(struct extent_iterate_data *priv,
+			     struct btrfs_trans_handle *trans
Re: Creation time
There is room in btrfs for a fourth time called otime, but it is not currently used or even initialized. Once there are APIs, it should be possible to add crtime support with a slight format upgrade. On Sun, Mar 14, 2010 at 02:55:12AM +0100, Hubert Kario wrote: From what I could find, btrfs supports only the trinity of UNIX time stamps: atime, ctime and mtime. Is there any plan to support crtime (creation time)? Side note: ZFS already supports it, ext4 and cifs (samba) are waiting for APIs and userland support, so it could be a good time to coordinate efforts and solidify the interface.