Re: SSD format/mount parameters questions
Martin wrote (ao):
> Are there any format/mount parameters that should be set for using
> btrfs on SSDs (other than the ssd mount option)?

If possible, format the whole device; do not partition the SSD. This guarantees proper alignment. The kernel will detect the SSD and apply the ssd mount option automatically.

> I've got a mix of various 120/128GB SSDs to newly set up. I will be
> using ext4 on the critical ones, but also wish to compare with
> btrfs...

I would use btrfs on the critical ones, as btrfs has checksums to detect data corruption.

> The mix includes some SSDs with the Sandforce controller that
> implements its own data compression and data deduplication. How well
> does btrfs fit with those compared to other non-data-compression
> controllers?

Since you have them both, you might want to find out yourself, and let us know ;-)

FWIW (not much, as you already have them), I would not buy anything else than Intel. I have had about 26 of them for years now (both in servers and workstations, several series) and never had an issue. Two of my colleagues have OCZ, and both had to RMA them.

Sander
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
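As a sketch of the advice above (the UUID and mountpoint are placeholders, not values from this thread): the kernel flags a device as an SSD when its sysfs rotational attribute reads 0, and btrfs then enables the ssd mount option on its own, so the fstab entry needs nothing special:

```shell
# /etc/fstab fragment (UUID and mountpoint are hypothetical placeholders).
# If /sys/block/<dev>/queue/rotational reads 0, btrfs applies the "ssd"
# mount option automatically; listing it explicitly only forces it on.
UUID=<fs-uuid>   /data   btrfs   defaults   0   0
```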
[PATCH 1/2] Btrfs: avoid memory leak of extent state in error handling routine
We've forgotten to clear extent states in the pinned tree, which will result in a space counter mismatch and a memory leak:

WARNING: at fs/btrfs/extent-tree.c:7537 btrfs_free_block_groups+0x1f3/0x2e0 [btrfs]()
...
space_info 2 has 8380416 free, is not full
space_info total=12582912, used=4096, pinned=4096, reserved=0, may_use=0, readonly=4194304
btrfs state leak: start 29364224 end 29376511 state 1 in tree 880075f20090 refs 1
...

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 fs/btrfs/disk-io.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a7ffc88..046a737 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3595,6 +3595,8 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 	btrfs_destroy_marked_extents(root, &cur_trans->dirty_pages,
 				     EXTENT_DIRTY);
+	btrfs_destroy_pinned_extent(root,
+				    root->fs_info->pinned_extents);
 
 	/*
 	memset(cur_trans, 0, sizeof(*cur_trans));
-- 
1.6.5.2
[PATCH] Btrfs: destroy the items of the delayed inodes in error handling routine
From: Miao Xie mi...@cn.fujitsu.com

The items of the delayed inodes were forgotten to be freed; this patch fixes it.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/delayed-inode.c |   18 ++++++++++++++++++
 fs/btrfs/delayed-inode.h |    3 +++
 fs/btrfs/disk-io.c       |    6 ++++++
 3 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 03e3748..858d6c7 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1879,3 +1879,21 @@ void btrfs_kill_all_delayed_nodes(struct btrfs_root *root)
 		}
 	}
 }
+
+void btrfs_destroy_delayed_inodes(struct btrfs_root *root)
+{
+	struct btrfs_delayed_root *delayed_root;
+	struct btrfs_delayed_node *curr_node, *prev_node;
+
+	delayed_root = btrfs_get_delayed_root(root);
+
+	curr_node = btrfs_first_delayed_node(delayed_root);
+	while (curr_node) {
+		__btrfs_kill_delayed_node(curr_node);
+
+		prev_node = curr_node;
+		curr_node = btrfs_next_delayed_node(curr_node);
+		btrfs_release_delayed_node(prev_node);
+	}
+}
+
diff --git a/fs/btrfs/delayed-inode.h b/fs/btrfs/delayed-inode.h
index 7083d08..c1cfa87 100644
--- a/fs/btrfs/delayed-inode.h
+++ b/fs/btrfs/delayed-inode.h
@@ -124,6 +124,9 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev);
 /* Used for drop dead root */
 void btrfs_kill_all_delayed_nodes(struct btrfs_root *root);
 
+/* Used for cleaning the transaction */
+void btrfs_destroy_delayed_inodes(struct btrfs_root *root);
+
 /* Used for readdir() */
 void btrfs_get_delayed_items(struct inode *inode, struct list_head *ins_list,
 			     struct list_head *del_list);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 20196f4..a56026f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3583,6 +3583,9 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 	if (waitqueue_active(&cur_trans->commit_wait))
 		wake_up(&cur_trans->commit_wait);
 
+	btrfs_destroy_delayed_inodes(root);
+	btrfs_assert_delayed_root_empty(root);
+
 	btrfs_destroy_pending_snapshots(cur_trans);
 
 	btrfs_destroy_marked_extents(root, &cur_trans->dirty_pages,
@@ -3635,6 +3638,9 @@ int btrfs_cleanup_transaction(struct btrfs_root *root)
 	if (waitqueue_active(&t->commit_wait))
 		wake_up(&t->commit_wait);
 
+	btrfs_destroy_delayed_inodes(root);
+	btrfs_assert_delayed_root_empty(root);
+
 	btrfs_destroy_pending_snapshots(t);
 
 	btrfs_destroy_delalloc_inodes(root);
-- 
1.7.6.5
[PATCH 2/2] Btrfs: make sure that we've made everything in pinned tree clean
Since we have two trees for recording pinned extents, we need to go through both of them to make sure that we've cleaned everything up.

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 fs/btrfs/disk-io.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 046a737..144f019 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3548,8 +3548,10 @@ static int btrfs_destroy_pinned_extent(struct btrfs_root *root,
 	u64 start;
 	u64 end;
 	int ret;
+	bool loop = true;
 
 	unpin = pinned_extents;
+again:
 	while (1) {
 		ret = find_first_extent_bit(unpin, 0, &start, &end,
 					    EXTENT_DIRTY);
@@ -3567,6 +3569,15 @@ static int btrfs_destroy_pinned_extent(struct btrfs_root *root,
 		cond_resched();
 	}
 
+	if (loop) {
+		if (unpin == &root->fs_info->freed_extents[0])
+			unpin = &root->fs_info->freed_extents[1];
+		else
+			unpin = &root->fs_info->freed_extents[0];
+		loop = false;
+		goto again;
+	}
+
 	return 0;
 }
-- 
1.6.5.2
Newbie questions on some of btrfs code...
Greetings everybody,

I have been studying some of the btrfs code and the developer documentation on the wiki. My primary interest at this point is to be able to search within the fs tree of a btrfs subvolume which was created as a snapshot of another subvolume. For that I have been using the debug-tree tool plus the References.png diagram on the wiki. I realize that my knowledge of btrfs is very rudimentary at this point, so please bear with me.

# How can I navigate from EXTENT_DATA within the fs tree to the appropriate CHUNK_ITEM in the chunk tree? I am basically trying to find where the file data resides on disk. For example, I have an EXTENT_DATA like this:

	item 30 key (265 EXTENT_DATA 0) itemoff 888 itemsize 53
		extent data disk byte 12648448 nr 8192
		extent data offset 0 nr 8192 ram 8192
		extent compression 0

I can navigate from here to the EXTENT_ITEM within the extent tree, using btrfs_file_extent_item::disk_bytenr/disk_num_bytes as the key for the search:

	item 3 key (12648448 EXTENT_ITEM 8192) itemoff 3870 itemsize 53
		extent refs 1 gen 8 flags 1
		extent data backref root 5 objectid 265 offset 0 count 1

But from there how can I reach the relevant CHUNK_ITEM?

# Once I have reached the CHUNK_ITEM, I assume that the btrfs_file_extent_item::offset/num_bytes fields will provide the exact location of the data on disk. Is that correct? For now I assume that btrfs was created on a single device and raid0 is used for data, so I totally ignore mirroring/striping at this point.

# I have been trying to follow btrfs_fiemap(), which seems to do the job, but it looks like it returns the disk_bytenr/disk_num_bytes fields without following to CHUNK_ITEMs. Maybe I am wrong.

Some general questions on the ctree code:

# I saw that slot==0 is special. My understanding is that btrfs maintains the property that the parent of each node/leaf has a key pointing to that node/leaf, which must be equal to the key in slot==0 of this node/leaf. That's what fixup_low_keys() tries to maintain. Is this correct?

# If my understanding in the previous bullet is correct: is that the reason that in btrfs_prev_leaf() it is assumed that, if there is a lesser key, btrfs_search_slot() will never bring us to slot==0 of the current leaf?

# btrfs_search_slot(): how can it happen that b becomes NULL and we exit the while loop (and set ret=1)?

# btrfs_insert_empty_items(): if nr>1, then an array of keys is expected to be passed. But btrfs_search_slot() searches only for the first key. What happens if the first key does not exist (as expected), but the next key in the array exists?

# Do my questions make sense?

Thanks!
Alex.
Re: Newbie questions on some of btrfs code...
On Fri, May 18, 2012 at 02:21:59PM +0300, Alex Lyakas wrote:
> # How can I navigate from EXTENT_DATA within the fs tree to the
> appropriate CHUNK_ITEM in the chunk tree? I am basically trying to
> find where the file data resides on disk. [...] But from there how
> can I reach the relevant CHUNK_ITEM?

CHUNK_ITEMs are indexed by the start address of the chunk, so for the extent at $e, you need to search for the chunk item immediately before the key (FIRST_CHUNK_TREE, CHUNK_ITEM, $e).

> # Once I have reached the CHUNK_ITEM, I assume that the
> btrfs_file_extent_item::offset/num_bytes fields will provide the
> exact location of the data on disk. Is that correct? For now I assume
> that btrfs was created on a single device and raid0 is used for data,
> so I totally ignore mirroring/striping at this point.

If you want to find the physical position of a given byte in a file on disk (and repeating some of what you already know):

 - The FS tree holds the directory structure, so you use that to find the inode number of the file by name.

 - With the inode number, you can look in the FS tree again to get the set of extents which make up the file. These extents are a mapping from [byte offset within the file] to [byte offset in virtual address space].

 - The extent tree then holds extent info, indexed by virtual address. There are two main types of extent: the extents holding file data (EXTENT_ITEM), and, overlapping with them, extents representing the block groups (BLOCK_GROUP_ITEM), which are the high-level allocation units of the FS.

 - For any given file extent (EXTENT_ITEM), you can use the tree search API to look in the chunk tree for the chunks holding this virtual data extent. (For any non-single RAID level, there will be multiple chunks involved.) You do this by simply finding CHUNK_ITEM items in the tree with a start value immediately less than or equal to the virtual-address offset of your file extent.

 - With any replicating RAID (-1 or -10), there will be multiple entries in the chunk tree for any given virtual address offset, representing the multiple mirrors. For any striped RAID level (-0, -10), each chunk record in the tree will have several btrfs_stripe records in its array.

Each btrfs_stripe record that you end up with (duplicate copies from the RAID-1/-10, and stripes from the RAID-0/-10) will then reference the device tree, which gives you the physical location of that btrfs_stripe on a specific disk.

Note that in the btrfs internal terminology, a stripe is a contiguous (256MiB or 1GiB) sequence of bytes on a single disk. RAID stripes (e.g. RAID-0, -10) are actually called sub-stripes in the btrfs code. There's also no clearly-defined use of the terms chunk and block group.

HTH,
Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- How do you become King? You stand in the marketplace and announce you're going to tax everyone. If you get out alive, you're King. ---
Re: [PATCH v3 3/3] Btrfs: read device stats on mount, write modified ones during commit
On Wed, May 16, 2012 at 06:50:47PM +0200, Stefan Behrens wrote:
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -823,6 +823,26 @@ struct btrfs_csum_item {
>  	u8 csum;
>  } __attribute__ ((__packed__));
>  
> +struct btrfs_device_stats_item {
> +	/*
> +	 * grow this item struct at the end for future enhancements and keep
> +	 * the existing values unchanged
> +	 */
> +	__le64 cnt_write_io_errs; /* EIO or EREMOTEIO from lower layers */
> +	__le64 cnt_read_io_errs;  /* EIO or EREMOTEIO from lower layers */
> +	__le64 cnt_flush_io_errs; /* EIO or EREMOTEIO from lower layers */
> +
> +	/* stats for indirect indications for I/O failures */
> +	__le64 cnt_corruption_errs; /* checksum error, bytenr error or
> +				     * contents is illegal: this is an
> +				     * indication that the block was damaged
> +				     * during read or write, or written to
> +				     * wrong location or read from wrong
> +				     * location */
> +	__le64 cnt_generation_errs; /* an indication that blocks have not
> +				     * been written */

A few spare u64s would come in handy in the future. Currently there are 5, so add something like 7 or 11 more. We might be interested in collecting more types of stats, or more fine-grained ones.

I see the comment above about enhancing the structure, but then you need to version this structure. Let's say the 3.5 kernel adds this structure as you propose it now, and kernel 3.6 adds another item 'cnt_exploded'. Accessing the 3.6-created image with a 3.5 kernel will be ok (the older kernel will not touch the new items). Accessing the 3.5-created image with a 3.6 kernel will be problematic, as the kernel would try to access ->cnt_exploded.

So, either the 3.6 kernel needs to know not to touch the missing item (ie. via reading the struct version from somewhere, stored on disk). Or there are spare items, which are zeroed in versions that do not use them and naturally used otherwise; when a new kernel uses an old image, it finds zeros (and will be safe).

> +} __attribute__ ((__packed__));
> +
Problem with restore of filesystem
I have sinned! I had a production filesystem without a replica - which is bonked :(

Running restore (I have tried Debian's btrfs-tools, and the master and dangerdonteveruse branch versions) on kernels 3.2 and 3.3, I consistently get this error message, and I would like suggestions as to how I might proceed:

root@foo:/net/users/home/bahner/src/btrfs-progs# btrfs-restore -si /dev/sdc /opt/data/restore/
failed to read /dev/sr0
failed to read /dev/sr0
parent transid verify failed on 5083380932608 wanted 332337 found 339991
parent transid verify failed on 5083380932608 wanted 332337 found 339991
parent transid verify failed on 5083380932608 wanted 332337 found 339991
parent transid verify failed on 5083380932608 wanted 332337 found 339991
Ignoring transid failure
Root objectid is 5
Skipping existing file /opt/data/restore/data/move/treungen-s01/db/BackupSet/1322218096564/2011-11-25-14-15-41.log
If you wish to overwrite use the -o option to overwrite
btrfs-restore: disk-io.c:589: btrfs_read_fs_root: Assertion `!(location->objectid == -8ULL || location->offset != (u64)-1)' failed.
Aborted
root@foo:/net/users/home/bahner/src/btrfs-progs#

There is supposed to be a folder data/users (/opt/data/restore/data/users). I thought I could somehow bypass the root access by giving a -m 'data/users/*' parameter, for instance. But thinking about it, I realize I was clutching at straws. Any pointers in the right direction are appreciated. I have run find-root and tried with the latest blockid for the -t parameter.

Kind regards,
Lars Bahner
Re: [PATCH 4/5] Btrfs: cancel the scrub when remounting a fs to ro
On Thu, May 17, 2012 at 07:58:21PM +0800, Miao Xie wrote:
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1151,6 +1151,8 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
>  		/* pause restriper - we want to resume on remount to r/w */
>  		btrfs_pause_balance(root->fs_info);
>  
> +		btrfs_scrub_cancel(root);

Can we possibly switch scrub to readonly instead? I'm not sure what the 'least surprise' is here: whether to cancel everything on the filesystem upon ro-remount, or just the minimal set of operations (and leave the rest running if possible). Looking at the scrub code, if dev->readonly is set, no repairs are done, so the only concern is to wait for any outstanding IOs and then switch to RO.

david
Re: [PATCH 1/2] Btrfs: do not resize a seeding device
On Thu, May 17, 2012 at 08:08:08PM +0800, Liu Bo wrote:
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1303,6 +1303,13 @@ static noinline int btrfs_ioctl_resize(struct btrfs_root *root,
>  		ret = -EINVAL;
>  		goto out_free;
>  	}
> +	if (device->fs_devices && device->fs_devices->seeding) {
> +		printk(KERN_INFO "btrfs: resizer unable to apply on "
> +		       "seeding device %s\n", device->name);
> +		ret = -EACCES;

I think EINVAL would be more appropriate. EACCES is about permissions, which do not make much sense in the context of resizing devices; besides that, CAP_SYS_ADMIN is required anyway (and checked a few lines above).

david
Re: Newbie questions on some of btrfs code...
Thank you, Hugo, for the detailed explanation. I am now able to find the CHUNK_ITEMs and to successfully locate the file data on disk. Can you maybe address several follow-up questions I have?

# When looking for CHUNK_ITEMs, should I check that their btrfs_chunk::type==BTRFS_BLOCK_GROUP_DATA (and not SYSTEM/METADATA etc.)? Or should a file extent always be mapped to a BTRFS_BLOCK_GROUP_DATA chunk?

# It looks like I don't even need to bother with the extent tree at this point, because from EXTENT_DATA in the fs tree I can navigate directly to the CHUNK_ITEM in the chunk tree, correct?

# For replicating RAID levels, you said there will be multiple CHUNK_ITEMs. How do I find them then? Should I know in advance how many there should be, and look for them, considering only btrfs_chunk::type==BTRFS_BLOCK_GROUP_DATA? (I don't bother with replication at this point, though.)

# If I find in the fs tree an EXTENT_DATA of type BTRFS_FILE_EXTENT_PREALLOC, how should I treat it? What does it mean? (BTRFS_FILE_EXTENT_INLINE ones are easy to treat.)

# One of my files has two EXTENT_DATAs, like this:

	item 14 key (270 EXTENT_DATA 0) itemoff 1812 itemsize 53
		extent data disk byte 432508928 nr 1474560
		extent data offset 0 nr 1470464 ram 1474560
		extent compression 0
	item 15 key (270 EXTENT_DATA 1470464) itemoff 1759 itemsize 53
		extent data disk byte 432082944 nr 126976
		extent data offset 0 nr 126976 ram 126976
		extent compression 0

Summing btrfs_file_extent_item::num_bytes gives 1470464+126976=1597440. (I know that I should not be summing btrfs_file_extent_item::disk_num_bytes, but num_bytes.) However, its INODE_ITEM gives a size of 1593360, which is less:

	item 11 key (270 INODE_ITEM 0) itemoff 1970 itemsize 160
		inode generation 26 size 1593360 block group 0 mode 100700 links 1

Is this a valid situation, or should I always consider the size in the INODE_ITEM as the correct one?

Thanks again,
Alex.
Re: Newbie questions on some of btrfs code...
On Fri, May 18, 2012 at 04:32:09PM +0300, Alex Lyakas wrote:
> # When looking for CHUNK_ITEMs, should I check that their
> btrfs_chunk::type==BTRFS_BLOCK_GROUP_DATA (and not SYSTEM/METADATA
> etc.)? Or should a file extent always be mapped to a
> BTRFS_BLOCK_GROUP_DATA chunk?

File extents will either be mapped to a data chunk, _or_ the file data will live inline in the metadata area, following the btrfs_extent_item. This is probably the trickiest piece of the on-disk data format to figure out, and I fear that I didn't document it well enough. Basically, it's non-obvious where inline extents are calculated, because there's all sorts of awkward-looking type casting to get to the data.

> # It looks like I don't even need to bother with the extent tree at
> this point, because from EXTENT_DATA in the fs tree I can navigate
> directly to the CHUNK_ITEM in the chunk tree, correct?

Mmm... possibly. Again, I'm not sure how this interacts with inline extents.

> # For replicating RAID levels, you said there will be multiple
> CHUNK_ITEMs. How do I find them then? Should I know in advance how
> many there should be, and look for them, considering only
> btrfs_chunk::type==BTRFS_BLOCK_GROUP_DATA?

Actually, thinking about it, there's a single CHUNK_ITEM, and the stripe[] array holds all of the per-disk allocations that correspond to that block group. So, for RAID-1, you'll have precisely two elements in the stripe[] array. Sorry for getting it wrong earlier.

> # If I find in the fs tree an EXTENT_DATA of type
> BTRFS_FILE_EXTENT_PREALLOC, how should I treat it? What does it mean?

I don't know, sorry.

> # One of my files has two EXTENT_DATAs [...] Summing
> btrfs_file_extent_item::num_bytes gives 1470464+126976=1597440.
> However, its INODE_ITEM gives a size of 1593360, which is less. Is
> this a valid situation, or should I always consider the size in the
> INODE_ITEM as the correct one?

Again, I don't know off the top of my head. It's been some time since I dug into these kinds of details, sorry.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- In my day, we didn't have fancy high numbers. We had nothing, one, twain and multitudes. ---
Re: Problem with restore of filesystem
On Fri, May 18, 2012 at 6:04 AM, Lars Bahner bah...@onlinebackupcompany.com wrote:
> I have sinned! I had a production filesystem without a replica -
> which is bonked :(

Grr...

> Running restore (I have tried Debian's btrfs-tools, and the master
> and dangerdonteveruse branch versions) on kernels 3.2 and 3.3 I
> consistently get this error message, and I would like suggestions as
> to how I might proceed.

It's worth trying to mount with the latest 3.4rc; it might be able to hobble along.
Re: SSD format/mount parameters questions
> I would not buy anything else than intel. I have about 26 of them
> for years now (both in servers and workstations, several series), and
> never had an issue. Two of my colleagues have OCZ, and both had to
> RMA them.

I guess it boils down to whether you want Intel also to rule the SSD market in the long term, as they do with PC processors... Comparing Intel SSDs with OCZ is not that fair, as OCZ has always been low-priced, bleeding-edge stuff. Usually ratings at Amazon are a good indicator of how reliable the product in question is.
Re: SSD format/mount parameters questions
On Fri, May 18, 2012 at 05:08:33PM +0200, Clemens Eisserer wrote:
> I guess it boils down to whether you want Intel also to rule the SSD
> market in the long term, as they do with PC processors... Comparing
> Intel SSDs with OCZ is not that fair, as OCZ has always been
> low-priced, bleeding-edge stuff.

Looking into the controllers... first there were a bunch of different ones; Intel had its own design with the SSD 320. Then came Sandforce; it got broadly used, despite sucking when used with FDE. Even Intel started to use Sandforce - the SSD 520. How is the reliability of Intel different? The latest fad is the Marvell controller; again Intel joins the pack, with the SSD 510. So, Intel is not that different anymore.

-- 
Tomasz Torcz                 "God, root, what's the difference?"
xmpp: zdzich...@chrome.pl    "God is more forgiving."
Re: SSD format/mount parameters questions
On Fri, 2012-05-18 at 17:32 +0200, Tomasz Torcz wrote:
> Looking into the controllers... first there were a bunch of
> different ones; Intel had its own design with the SSD 320. Then came
> Sandforce; it got broadly used, despite sucking when used with FDE.
> Even Intel started to use Sandforce - the SSD 520. How is the
> reliability of Intel different? The latest fad is the Marvell
> controller; again Intel joins the pack, with the SSD 510. So, Intel
> is not that different anymore.

The controllers themselves really aren't that interesting any more - an SSD controller is really just an ARM or MIPS core with some flash interfaces, a SATA interface, and some RAM, running proprietary firmware. Several of the Marvell devices actually have completely different firmwares (e.g. Intel's firmware for Marvell devices was reportedly developed by them in-house), and Intel's Sandforce firmware has some customizations for improved reliability, at the expense of some speed.

-- 
Calvin Walton calvin.wal...@kepstin.ca
Re: btrfs RAID with RAID cards (thread renamed)
> > - if a non-RAID SAS card is used, does it matter which card is
> > chosen? Does btrfs work equally well with all of them?
> If you're using btrfs RAID, you need a HBA, not a RAID card. If the
> RAID card can work as a HBA (usually labelled as JBOD mode) then
> you're good to go. For example, HP CCISS controllers can't work in
> JBOD mode.

Would you know if they implement their own checksumming, similar to what btrfs does? Or if someone uses SmartArray (CCISS) RAID1, do they simply not get the full benefit of checksumming under any possible configuration?

I've had a quick look at what is on the market; here are some observations:

- in many cases, IOPS (critical for SSDs) vary wildly, e.g.:
  - SATA-3 SSDs advertise up to 85k IOPS, so RAID1 needs 170k IOPS
  - HP's standard HBAs don't support high IOPS
  - HP Gen8 SmartArray (e.g. P420) claims up to 200k IOPS
  - previous HP arrays (e.g. P212) support only 60k IOPS
- many vendors don't advertise the IOPS prominently - I had to Google the HP site to find those figures quoted in some PDFs; they don't quote them in the quickspecs or product summary tables
- Adaptec now offers an SSD caching function in hardware - supposedly you drop it in the machine and all disks respond faster. How would this interact with btrfs checksumming? E.g. I'm guessing it would be necessary to ensure that data from both spindles is not cached on the same SSD?
- I started thinking about the possibility that data is degraded on the mechanical disk but btrfs gets a good checksum read from the SSD cache and remains blissfully unaware that the real disk is failing; then the other disk goes completely offline one day, for whatever reason the data is not in the SSD cache, and the sector can't be read reliably from the remaining physical disk. Should such caching just be avoided, or can it be managed from btrfs itself in a manner that is foolproof?

How about the combination of btrfs/root/boot filesystems and grub? Can they all play nicely together? This seems to be one compelling factor with hardware RAID: the cards have a BIOS that can boot from any drive even if the other is offline.
Re: Subdirectory creation on snapshot
On Mon, May 14, 2012 at 10:11:01AM -0700, Brendan Smithyman wrote:
> The disks that *are* still showing the subdirectory creation issue
> were both converted from ext4 (using old tools). So perhaps that's a
> direction to explore.

Yeah, thanks for the hint.

david
Re: Ceph on btrfs 3.4rc
Hi Josef, there was one line before the bug. [ 995.725105] couldn't find orphan item for 524 Am 18.05.2012 16:48, schrieb Josef Bacik: Ok hopefully this will print something out that makes sense. Thanks, -martin [ 241.754693] Btrfs loaded [ 241.755148] device fsid 43c4ebd9-3824-4b07-a710-3ec39b012759 devid 1 transid 4 /dev/sdc [ 241.755750] btrfs: setting nodatacow [ 241.755753] btrfs: enabling auto defrag [ 241.755754] btrfs: disk space caching is enabled [ 241.755755] btrfs flagging fs with big metadata feature [ 241.768683] device fsid e7e7f2df-6a4e-45b1-85cc-860cda849953 devid 1 transid 4 /dev/sdd [ 241.769028] btrfs: setting nodatacow [ 241.769030] btrfs: enabling auto defrag [ 241.769031] btrfs: disk space caching is enabled [ 241.769032] btrfs flagging fs with big metadata feature [ 241.781360] device fsid 203fdd4c-baac-49f8-bfdb-08486c937989 devid 1 transid 4 /dev/sde [ 241.781854] btrfs: setting nodatacow [ 241.781859] btrfs: enabling auto defrag [ 241.781861] btrfs: disk space caching is enabled [ 241.781864] btrfs flagging fs with big metadata feature [ 242.713741] device fsid 95c36e12-0098-48d7-a08d-9d54a299206b devid 1 transid 4 /dev/sdf [ 242.714110] btrfs: setting nodatacow [ 242.714118] btrfs: enabling auto defrag [ 242.714121] btrfs: disk space caching is enabled [ 242.714125] btrfs flagging fs with big metadata feature [ 995.725105] couldn't find orphan item for 524 [ 995.725126] [ cut here ] [ 995.725134] kernel BUG at fs/btrfs/inode.c:2227! 
[ 995.725143] invalid opcode: [#1] SMP
[ 995.725158] CPU 0
[ 995.725162] Modules linked in: btrfs zlib_deflate libcrc32c ext2 coretemp ghash_clmulni_intel aesni_intel bonding cryptd aes_x86_64 microcode psmouse serio_raw sb_edac edac_core joydev mei(C) ses ioatdma enclosure mac_hid lp parport ixgbe usbhid hid isci libsas megaraid_sas scsi_transport_sas igb dca mdio
[ 995.725285]
[ 995.725290] Pid: 2972, comm: ceph-osd Tainted: G C 3.4.0-rc7.2012051800+ #14 Supermicro X9SRi/X9SRi
[ 995.725324] RIP: 0010:[a028535f] [a028535f] btrfs_orphan_del+0x14f/0x160 [btrfs]
[ 995.725354] RSP: 0018:881016ed9d18 EFLAGS: 00010292
[ 995.725364] RAX: 0037 RBX: 88101485fdb0 RCX:
[ 995.725378] RDX: RSI: 0082 RDI: 0246
[ 995.725392] RBP: 881016ed9d58 R08: R09:
[ 995.725405] R10: R11: 00b6 R12: 88101efe9f90
[ 995.725419] R13: 88101efe9c00 R14: 0001 R15: 0001
[ 995.725433] FS: 7f58e5dbc700() GS:88107fc0() knlGS:
[ 995.725466] CS: 0010 DS: ES: CR0: 80050033
[ 995.725492] CR2: 03f28000 CR3: 00101acac000 CR4: 000407f0
[ 995.725522] DR0: DR1: DR2:
[ 995.725551] DR3: DR6: 0ff0 DR7: 0400
[ 995.725581] Process ceph-osd (pid: 2972, threadinfo 881016ed8000, task 88101618)
[ 995.725626] Stack:
[ 995.725646] 0c02 88101deaf550 881016ed9d38 88101deaf550
[ 995.725700] 88101efe9c00 88101485fdb0 880be890c1e0
[ 995.725757] 881016ed9e08 a02897a8 88101485fdb0
[ 995.725807] Call Trace:
[ 995.725835] [a02897a8] btrfs_truncate+0x5e8/0x6d0 [btrfs]
[ 995.725869] [a028b121] btrfs_setattr+0xc1/0x1b0 [btrfs]
[ 995.725898] [811955c3] notify_change+0x183/0x320
[ 995.725925] [8117889e] do_truncate+0x5e/0xa0
[ 995.725951] [81178a24] sys_truncate+0x144/0x1b0
[ 995.725979] [8165fd29] system_call_fastpath+0x16/0x1b
[ 995.726006] Code: 45 31 ff e9 3c ff ff ff 48 8b b3 58 fe ff ff 48 85 f6 74 19 80 bb 60 fe ff ff 84 74 10 48 c7 c7 08 48 2e a0 31 c0 e8 09 7c 3c e1 0f 0b 48 8b 73 40 eb ea 66 0f 1f 84 00 00 00 00 00 55 48 89 e5
[ 995.726221] RIP [a028535f] btrfs_orphan_del+0x14f/0x160 [btrfs]
[ 995.726258] RSP 881016ed9d18
[ 995.726574] ---[ end trace 4bde8f513a6d106d ]---
[PATCH] Btrfs: merge contiguous regions when loading free space cache
When we write out the free space cache we will write out everything that is in our in-memory tree, and then we will just walk the pinned extents tree and write anything we see there. The problem with this is that during normal operations the pinned extents will be merged back into the free space tree normally, and then we can allocate space from the merged areas and commit them to the tree log. If we crash and replay the tree log we will crash again, because the tree log will try to free up space from what looks like 2 separate but contiguous entries, since one entry is from the original free space cache and the other was a pinned extent that was merged back. To fix this we just need to walk the free space tree after we load it and merge contiguous entries back together. This will keep the tree log stuff from breaking and it will make the allocator behave more nicely. Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
 fs/btrfs/free-space-cache.c |   41 +
 1 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index cecf8df..19a0d85 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -33,6 +33,8 @@
 static int link_free_space(struct btrfs_free_space_ctl *ctl,
			   struct btrfs_free_space *info);
+static void unlink_free_space(struct btrfs_free_space_ctl *ctl,
+			      struct btrfs_free_space *info);
 
 static struct inode *__lookup_free_space_inode(struct btrfs_root *root,
					       struct btrfs_path *path,
@@ -584,6 +586,44 @@ static int io_ctl_read_bitmap(struct io_ctl *io_ctl,
 	return 0;
 }
 
+/*
+ * Since we attach pinned extents after the fact we can have contiguous sections
+ * of free space that are split up in entries.  This poses a problem with the
+ * tree logging stuff since it could have allocated across what appears to be 2
+ * entries since we would have merged the entries when adding the pinned extents
+ * back to the free space cache.  So run through the space cache that we just
+ * loaded and merge contiguous entries.  This will make the log replay stuff not
+ * blow up and it will make for nicer allocator behavior.
+ */
+static void merge_space_tree(struct btrfs_free_space_ctl *ctl)
+{
+	struct btrfs_free_space *e, *prev = NULL;
+	struct rb_node *n;
+
+again:
+	spin_lock(&ctl->tree_lock);
+	for (n = rb_first(&ctl->free_space_offset); n; n = rb_next(n)) {
+		e = rb_entry(n, struct btrfs_free_space, offset_index);
+		if (!prev)
+			goto next;
+		if (e->bitmap || prev->bitmap)
+			goto next;
+		if (prev->offset + prev->bytes == e->offset) {
+			unlink_free_space(ctl, prev);
+			unlink_free_space(ctl, e);
+			prev->bytes += e->bytes;
+			kmem_cache_free(btrfs_free_space_cachep, e);
+			link_free_space(ctl, prev);
+			prev = NULL;
+			spin_unlock(&ctl->tree_lock);
+			goto again;
+		}
+next:
+		prev = e;
+	}
+	spin_unlock(&ctl->tree_lock);
+}
+
 int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
			    struct btrfs_free_space_ctl *ctl,
			    struct btrfs_path *path, u64 offset)
@@ -726,6 +766,7 @@ int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
 	}
 
 	io_ctl_drop_pages(&io_ctl);
+	merge_space_tree(ctl);
 	ret = 1;
 out:
 	io_ctl_free(&io_ctl);
-- 
1.7.7.6
Re: trim malfunction in linux 3.3.6
Hello.

2012/5/18 Liu Bo liubo2...@cn.fujitsu.com:
> On 05/18/2012 12:10 AM, Sergey E. Kolesnikov wrote:
> Could you please show some logs about the corruption?

Ugh. Sorry, but the logs got corrupted too :-( Today I've tested the 3.4-rc7 kernel and everything seems to be fine. Maybe the fix mentioned by Tomasz should be backported to 3.3.x, since that is the current stable release and this bug is really dangerous.

Thanks,
Sergey.
Re: Ceph on btrfs 3.4rc
Hi Josef,

now I get:

[ 2081.142669] couldn't find orphan item for 2039, nlink 1, root 269, root being deleted no

-martin

On 18.05.2012 21:01, Josef Bacik wrote:
> *sigh* ok try this, hopefully it will point me in the right direction. Thanks,

[ 126.389847] Btrfs loaded
[ 126.390284] device fsid 0c9d8c6d-2982-4604-b32a-fc443c4e2c50 devid 1 transid 4 /dev/sdc
[ 126.391246] btrfs: setting nodatacow
[ 126.391252] btrfs: enabling auto defrag
[ 126.391254] btrfs: disk space caching is enabled
[ 126.391257] btrfs flagging fs with big metadata feature
[ 126.405700] device fsid e8a0dc27-8714-49bd-a14f-ac37525febb1 devid 1 transid 4 /dev/sdd
[ 126.406162] btrfs: setting nodatacow
[ 126.406167] btrfs: enabling auto defrag
[ 126.406170] btrfs: disk space caching is enabled
[ 126.406172] btrfs flagging fs with big metadata feature
[ 126.419819] device fsid f67cd977-ebf4-41f2-9821-f2989e985954 devid 1 transid 4 /dev/sde
[ 126.420198] btrfs: setting nodatacow
[ 126.420206] btrfs: enabling auto defrag
[ 126.420210] btrfs: disk space caching is enabled
[ 126.420214] btrfs flagging fs with big metadata feature
[ 127.274555] device fsid 3001355e-c2e2-46c7-9eba-dfecb441d6a6 devid 1 transid 4 /dev/sdf
[ 127.274980] btrfs: setting nodatacow
[ 127.274986] btrfs: enabling auto defrag
[ 127.274989] btrfs: disk space caching is enabled
[ 127.274992] btrfs flagging fs with big metadata feature
[ 2081.142669] couldn't find orphan item for 2039, nlink 1, root 269, root being deleted no
[ 2081.142735] [ cut here ]
[ 2081.142750] kernel BUG at fs/btrfs/inode.c:2228!
[ 2081.142766] invalid opcode: [#1] SMP
[ 2081.142786] CPU 10
[ 2081.142794] Modules linked in: btrfs zlib_deflate libcrc32c ext2 bonding coretemp ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode psmouse serio_raw sb_edac edac_core joydev mei(C) ioatdma ses enclosure mac_hid lp parport usbhid hid megaraid_sas isci libsas scsi_transport_sas igb ixgbe dca mdio
[ 2081.142974]
[ 2081.142985] Pid: 2966, comm: ceph-osd Tainted: G C 3.4.0-rc7.2012051802+ #16 Supermicro X9SRi/X9SRi
[ 2081.143020] RIP: 0010:[a0269383] [a0269383] btrfs_orphan_del+0x173/0x180 [btrfs]
[ 2081.143080] RSP: 0018:881016d83d18 EFLAGS: 00010292
[ 2081.143096] RAX: 0062 RBX: 881017ad4770 RCX:
[ 2081.143115] RDX: RSI: 0082 RDI: 0246
[ 2081.143134] RBP: 881016d83d58 R08: R09:
[ 2081.143154] R10: R11: 0116 R12: 88101e7baf90
[ 2081.143173] R13: 88101e7bac00 R14: 0001 R15: 0001
[ 2081.143193] FS: 7fcc1e736700() GS:88107fd4() knlGS:
[ 2081.143243] CS: 0010 DS: ES: CR0: 80050033
[ 2081.143274] CR2: 09269000 CR3: 00101ba87000 CR4: 000407e0
[ 2081.143308] DR0: DR1: DR2:
[ 2081.143341] DR3: DR6: 0ff0 DR7: 0400
[ 2081.143376] Process ceph-osd (pid: 2966, threadinfo 881016d82000, task 881023c744a0)
[ 2081.143424] Stack:
[ 2081.143447] 0c07 88101e1dac30 881016d83d38 88101e1dac30
[ 2081.143510] 88101e7bac00 881017ad4770 88101f0f7d60
[ 2081.143572] 881016d83e08 a026d7c8 881017ad4770
[ 2081.143634] Call Trace:
[ 2081.143684] [a026d7c8] btrfs_truncate+0x5e8/0x6d0 [btrfs]
[ 2081.143737] [a026f141] btrfs_setattr+0xc1/0x1b0 [btrfs]
[ 2081.143773] [811955c3] notify_change+0x183/0x320
[ 2081.143807] [8117889e] do_truncate+0x5e/0xa0
[ 2081.143839] [81178a24] sys_truncate+0x144/0x1b0
[ 2081.143873] [8165fd29] system_call_fastpath+0x16/0x1b
[ 2081.143903] Code: a0 49 8b 8d f0 02 00 00 8b 53 48 4c 0f 44 c0 48 85 f6 74 19 80 bb 60 fe ff ff 84 74 10 48 c7 c7 10 88 2c a0 31 c0 e8 e5 3b 3e e1 0f 0b 48 8b 73 40 eb ea 0f 1f 44 00 00 55 48 89 e5 48 83 ec 10
[ 2081.144199] RIP [a0269383] btrfs_orphan_del+0x173/0x180 [btrfs]
[ 2081.144258] RSP 881016d83d18
[ 2081.144614] ---[ end trace 8d0829d100639242 ]---
btrfs: Probably the largest filesystem I will see for a long time
Probably the largest filesystem I will ever see. Tried 8 exabytes, but it failed.

[root@CentOS6-A:/root] # df
Filesystem             1K-blocks             Used        Available Use% Mounted
/dev/mapper/vg01-root    17915884         11533392          5513572  68% /
/dev/sda1                  508745           140314           342831  30% /boot
/dev/mapper/data_0       66993872          1644372         61994060   3% /mnt/data_0
/dev/mapper/data_1 7881299347898368           508360 7881248224091896   1% /mnt/data_1

[root@CentOS6-A:/root] # df -h
Filesystem             Size  Used Avail Use% Mounted
/dev/mapper/vg01-root   18G   11G  5.3G  68% /
/dev/sda1              497M  138M  335M  30% /boot
/dev/mapper/data_0      64G  1.6G   60G   3% /mnt/data_0
/dev/mapper/data_1     7.0E  497M  7.0E   1% /mnt/data_1

[root@CentOS6-A:/root] # df -Th
Filesystem            Type   Size  Used Avail Use%
/dev/mapper/vg01-root ext4    18G   11G  5.3G  68%
/dev/sda1             ext4   497M  138M  335M  30%
/dev/mapper/data_0    ext4    64G  1.6G   60G   3%
/dev/mapper/data_1    btrfs  7.0E  499M  7.0E   1%

[root@CentOS6-A:/root] # uname -rv
3.4.0-rc7+ #23 SMP Wed May 16 20:20:47 EDT 2012

Made with a dm-thin device sitting on a device pair composed of metadata (256 megs) and data (23 gigs), running on my laptop at home. Yes, this is 7 exabytes, or 7,168 petabytes, or 7,340,032 terabytes, or 7,516,192,768 gigabytes.

Please do not answer; it is just a statement of fact at 3.4-rc7 (it was not working at 3.4-rc3, if I remember correctly).

Xtian.
Re: [RFC PATCH 0/7] bcache: md conversion
Dan Williams wrote:
> The consensus from LSF was that bcache need not invent a new interface
> when md and dm can both do the job. As mentioned in patch 7, this series
> aims to be a minimal conversion. Other refactoring items, like deprecating
> register_lock for mddev->reconfig_mutex, are deferred.
>
> This supports assembly of an already established cache array:
>
>   mdadm -A /dev/md/bcache /dev/sd[ab]
>
> ...will create the /dev/md/bcache container and a subarray representing
> the cache volume. Flash-only, or backing-device-only, volumes were not
> tested. Create support and hot-add/hot-remove come later.
>
> Note:
> * When attempting to test with small loopback devices (100MB), assembly
>   soft locks in bcache_journal_read(). That hang went away with larger
>   devices, so there seems to be a minimum component device size that
>   needs to be considered in the tooling.

Is there any plan to separate the on-disk layout (per-device headers, etc.) from the logic, for the purpose of reuse? I can think of at least one case where this would be extremely useful: integration in BtrFS.

BtrFS already has its own methods for making sure a group of devices are all present when the filesystem is mounted, so it doesn't really need the formatting of the backing device that bcache does to prevent it from being mounted solo. Putting bcache under BtrFS would be silly in the same way as putting it under a raid array, but bcache can't be put on top of BtrFS. Logically, looking at BtrFS' architecture, a cache would likely fit best at the 'block group' level, which IIUC would be roughly equivalent to the recommended 'over raid, under lvm' method of using bcache.