Re: How can btrfs take 23sec to stat 23K files from an SSD?
On Fri, Jul 27, 2012 at 11:42:39AM -0700, Marc MERLIN wrote: https://oss.oracle.com/~mason/latencytop.patch Thanks for the patch, and yes I can confirm I'm definitely not pegged on CPU (not even close and I get the same problem with unencrypted filesystem, actually du -sh is exactly the same speed on encrypted and unecrypted). Here's the result I think you were looking for. I'm not good at reading this, but hopefully it tells you something useful :) The full run is here if that helps: http://marc.merlins.org/tmp/latencytop.txt I did some other tests since last week since my laptop is hard to use considering how slow the SSD is. (TL;DR: ntfs on linux via fuse is 33% faster than ext4, which is 2x faster than btrfs, but 3x slower than the same filesystem on spinning disk :( ) Ok, just to help with debuggging this, 1) I put my samsung 830 SSD into another thinkpad and it wasn't faster or slower. 2) Then I put a crucial 256 C300 SSD (the replacement for the one I had that just died and killed all my data), and du took 0.3 seconds on both my old and new thinkpads. The old thinkpad is running ubuntu 32bit the new one debian testing 64bit both with kernel 3.4.4. So, clearly, there is something wrong with the samsung 830 SSD with linux but I have no clue what :( In raw speed (dd) the samsung is faster than the crucial (350MB/s vs 500MB/s). It it were a random crappy SSD from a random vendor, I'd blame the SSD, but I have a hard time believing that samsung is selling SSDs that are slower than hard drives at random IO and 'seeks'. 3) I just got a 2nd ssd from samsung (same kind), just to make sure the one I had wasn't bad. It's brand new, and I formatted it carefully on 512 boundaries: /dev/sda12048 502271 250112 83 Linux /dev/sda2 50227252930559262141447 HPFS/NTFS/exFAT /dev/sda3529305607390207910485760 82 Linux swap / Solaris /dev/sda473902080 1000215215 463156568 83 Linux I also upgraded to 3.5.0 in the meantime but unfortunately the results are similar. First: btrfs is the slowest: gandalfthegreat:/mnt/ssd/var/local# time du -sh src/ 514Msrc/ real0m25.741s gandalfthegreat:/mnt/ssd/var/local# grep /mnt/ssd/var /proc/mounts /dev/mapper/ssd /mnt/ssd/var btrfs rw,noatime,compress=lzo,ssd,discard,space_cache 0 0 Second: ext4 is 2x faster than btrfs with mkfs.ext4 -O extent -b 4096 /dev/sda3 gandalfthegreat:/mnt/mnt3# reset_cache gandalfthegreat:/mnt/mnt3# time du -sh src/ 519Msrc/ real0m12.459s gandalfthegreat:~# grep mnt3 /proc/mounts /dev/sda3 /mnt/mnt3 ext4 rw,noatime,discard,data=ordered 0 0 Third, A freshly made ntfs filesystem through fuse is actually FASTER! gandalfthegreat:/mnt/mnt2# reset_cache gandalfthegreat:/mnt/mnt2# time du -sh src/ 506Msrc/ real0m8.928s gandalfthegreat:/mnt/mnt2# grep mnt2 /proc/mounts /dev/sda2 /mnt/mnt2 fuseblk rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other,blksize=4096 0 0 How can ntfs via fuse be the fastest and btrfs so slow? Of course, all 3 are slower than the same filesystem on spinning too, but I'm wondering if there is a scheduling issue that is somehow causing the extreme slowness I'm seeing. Did the latencytop trace I got help in any way? Thanks, Marc -- A mouse is a device used to point at the xterm you want to type in - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On Wed, Aug 1, 2012 at 1:01 PM, Marc MERLIN m...@merlins.org wrote: So, clearly, there is something wrong with the samsung 830 SSD with linux It it were a random crappy SSD from a random vendor, I'd blame the SSD, but I have a hard time believing that samsung is selling SSDs that are slower than hard drives at random IO and 'seeks'. You'd be surprised on how badly some vendors can screw up :) First: btrfs is the slowest: gandalfthegreat:/mnt/ssd/var/local# grep /mnt/ssd/var /proc/mounts /dev/mapper/ssd /mnt/ssd/var btrfs rw,noatime,compress=lzo,ssd,discard,space_cache 0 0 Just checking, did you explicitly activate discard? Cause on my setup (with corsair SSD) it made things MUCH slower. Also, try adding noatime (just in case the slow down was because du cause many access time updates) -- Fajar -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On Wed, Aug 01, 2012 at 01:08:46PM +0700, Fajar A. Nugraha wrote: It it were a random crappy SSD from a random vendor, I'd blame the SSD, but I have a hard time believing that samsung is selling SSDs that are slower than hard drives at random IO and 'seeks'. You'd be surprised on how badly some vendors can screw up :) At some point, it may come down to that indeed :-/ I'm still hopefully that Samsung didn't, but we'll see. First: btrfs is the slowest: gandalfthegreat:/mnt/ssd/var/local# grep /mnt/ssd/var /proc/mounts /dev/mapper/ssd /mnt/ssd/var btrfs rw,noatime,compress=lzo,ssd,discard,space_cache 0 0 Just checking, did you explicitly activate discard? Cause on my Yes. Note that it should a noop when all you're doing is stating inodes and not writing (I'm using noatime). setup (with corsair SSD) it made things MUCH slower. Also, try adding noatime (just in case the slow down was because du cause many access time updates) I have noatime in there already :) Marc -- A mouse is a device used to point at the xterm you want to type in - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On 01/08/12 16:01, Marc MERLIN wrote: Third, A freshly made ntfs filesystem through fuse is actually FASTER! Could it be that Samsungs FTL has optimisations in it for NTFS ? A horrible thought, but not impossible.. -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On Wed, Aug 01, 2012 at 04:36:22PM +1000, Chris Samuel wrote: On 01/08/12 16:01, Marc MERLIN wrote: Third, A freshly made ntfs filesystem through fuse is actually FASTER! Could it be that Samsungs FTL has optimisations in it for NTFS ? A horrible thought, but not impossible.. Not impossible, but it's can be the main reason since it clocks still 2x slower with ntfs than a spinning disk with encrypted btrfs. Since SSDs should seek 10-100x faster than spinning disks, that can't be the only reason. Marc -- A mouse is a device used to point at the xterm you want to type in - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RESEND] Btrfs: fix that error value is changed by mistake
In iterate_inodes_from_logical() the error result from extent_from_logical() is patched by mistake. Typically ENOENT is patched to EINVAL because (-ENOENT BTRFS_EXTENT_FLAG_TREE_BLOCK) evaluates to true. Signed-off-by: Stefan Behrens sbehr...@giantdisaster.de --- Resent because of the previously forgotten Signed-off-by line. fs/btrfs/backref.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index a256f3b..ff6475f 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -1438,10 +1438,10 @@ int iterate_inodes_from_logical(u64 logical, struct btrfs_fs_info *fs_info, ret = extent_from_logical(fs_info, logical, path, found_key); btrfs_release_path(path); - if (ret BTRFS_EXTENT_FLAG_TREE_BLOCK) - ret = -EINVAL; if (ret 0) return ret; + if (ret BTRFS_EXTENT_FLAG_TREE_BLOCK) + return -EINVAL; extent_item_pos = logical - found_key.objectid; ret = iterate_extent_inodes(fs_info, found_key.objectid, -- 1.7.11.4 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Questions and notes about send/receive
On Tue, Jul 31, 2012 at 6:32 PM, Alex Lyakas alex.bolshoy.bt...@gmail.com wrote: Hi Alexander, I relooked at my list of questions, and it seems that there are more general questions, and more focused questions. So here I list the more focused ones. I really appreciate if you can address them. I hope it's ok that I opened a separate thread for them. So here goes... iterate_dir_item(): tmp_path is not used Removed. name_cache_insert(): if radix_tree_insert() fails, should we free the nce_head? __get_cur_name_and_parent() if name_cache_insert() fails, should we free the nce? Both, nce_head and nce are now freed in name_cache_insert. __get_cur_name_and_parent() if is_inode_existent() is false, and we generate unique name, we don't set *parent_ino/*parent_genI guess it's ok, it just looked strange when it showed up in my prints. Yepp, that's ok. We don't have a valid parent ino/gen in that case. I added a note to the function comment. btrfs_ioctl_send() struct file *filp = NULL; = this is not used btrfs_ioctl_set_received_subvol() if we fail to commit the transaction, should we rollback the in-memory values that we set, like stransid/rtransid etc? or they get rolled back automatically somehow (I need to study the transaction subsystem). Hmm not sure about that. In case of transaction abort, the FS is not writable anymore as far as I know, so no matter which values are in-memory, the user won't be able to use them. process_recorded_refs(): /* * If the inode is still orphan, unlink the orphan. This may * happen when a previous inode did overwrite the first ref * of this inode and no new refs were added for the current * inode. */ This comment is partially misleading, I think. It implies that the orphan inode will be deleted. But it may happen that it will only be unlinked, because another hardlink to it remains (one of my tests goes through this path - inode is orphanized, then orphan unlinked, but another hardlink exists). This behavior is expected. We do the orphan handling only on first refs and never check if that may generate unnecessary commands. I made it this way because it was the easiest and most secure solution, but it has room for optimization. A added a note about this to the comment. btrfs-progs::read_and_process_cmd() if (strstr(path, .bak_1.log)) { ret = 0; } this block is some leftover? Hehe, yes :) I love such statements so that I can add a breakpoint to ret=0 :) I removed it. __get_cur_name_and_parent() if we decided to call get_first_ref() on the send_root (due to ino sctx-send_progress), do we really need to call did_overwrite_ref()? Because it will just lookup the send root again. I mean if we know that this inode has been already handled, then it's not an orphan anymore, so no need for this additional check. If my understanding is wrong, pls give some small example? get_first_ref does not return the current state. It returns the permanent state. The result of did_overwrite_ref however depends on the current send progress. There is however some room for optimization here. In many cases, I did not think much about double tree lookups due to the different calls, as I wanted it to work first and later optimize it. One possible optimization for example would be to cache the results of lookups. But only for functions which return permanent state, e.g. get_inode_info. process_recorded_refs(): why we need to call did_overwrite_first_ref(), and not just call get_cur_path()? It looks like we need this only to set the is_orphan flag? Because, otherwise, get_cur_path() already does the overwrite_first_ref logic, and will return the correct path. Is my understanding correct? (I ran some tests and looks correct to me). Hmm this is probably true. It was probably required in the past to call did_overwrite_first_ref, not sure. I would like to keep it this way for the moment as I don't want to risk introducing new bugs. find_extent_clone(): /* * Now collect all backrefs. */ extent_item_pos = logical - found_key.objectid; Is my understanding correct that extent_item_pos will now be equal to btrfs_file_extent_offset(eb, fi)? Yepp. It's probably not needed to do it this way, but I left it this way as this part of the code is copied from Jan's initial version of btrfs send. process_recorded_refs(): if (ret == inode_state_did_create || ret == inode_state_no_change) { /* TODO delayed utimes */ ret = send_utimes(sctx, un-val, un-aux); This code results in UTIMES sent twice for a parent dir of new inode: once because parent_dir gets changed (stamped with a transid) and second time because of this code. Is this fine? Also what TODO delayed utimes means? I'm collection TODOs here:
Re: [PATCH 3/3] Btrfs-progs: list snapshots by generation
On 08/01/2012 05:16 AM, Goffredo Baroncelli wrote: Hi Bo, On 07/31/2012 07:49 AM, Liu Bo wrote: The idea is that we usually use snapshot to backup/restore our data, and the common way can be a cron script which makes lots of snapshots, so we can end up with spending some time to find the latest snapshot to restore. This adds a feature for 'btrfs subvolume list' to let it list snapshots by their _created_ generation. What we need to do is just to list them in descending order and get the latest snapshot. What's more, we can find the oldest snapshot as well by listing snapshots in ascending order. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- btrfs-list.c | 176 -- cmds-subvolume.c | 19 +- 2 files changed, 185 insertions(+), 10 deletions(-) [] static const char * const cmd_subvol_list_usage[] = { -btrfs subvolume list [-p] path, +btrfs subvolume list [-ps] path, List subvolumes (and snapshots), , --p print parent ID, +-p print parent ID, +-s value list snapshots with generation in ascending/descending order, + (1: ascending, 0: descending), Please change the user interface. I suggest something like: -s|-S list snapshots with generation in ascending|descending order Or better -ssort by generation -Psort by path -rreverse the sort order I prefer to the first one, since I have no any idea how to sort by path by then. Anyway, whichever your choice will be, please remember to update the man page too. ah, I should have remembered to update it, thanks for reminding. :) Will do it soon, thanks for reviewing this! thanks, liubo NULL }; [...] -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] Btrfs: remove superblock writing after fatal error
With commit acce952b0, btrfs was changed to flag the filesystem with BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal error happened like a write I/O errors of all mirrors. In such situations, on unmount, the superblock is written in btrfs_error_commit_super(). This is done with the intention to be able to evaluate the error flag on the next mount. A warning is printed in this case during the next mount and the log tree is ignored. The issue is that it is possible that the superblock points to a root that was not written (due to write I/O errors). The result is that the filesystem cannot be mounted. btrfsck also does not start and all the other btrfs-progs tools fail to start as well. However, mount -o recovery is working well and does the right things to recover the filesystem (i.e., don't use the log root, clear the free space cache and use the next mountable root that is stored in the root backup array). This patch removes the writing of the superblock when BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error flag in the mount function. These lines can be used to reproduce the issue (using /dev/sdm): SCRATCH_DEV=/dev/sdm SCRATCH_MNT=/mnt echo 0 25165824 linear $SCRATCH_DEV 0 | dmsetup create foo ls -alLF /dev/mapper/foo mkfs.btrfs /dev/mapper/foo mount /dev/mapper/foo $SCRATCH_MNT echo bar $SCRATCH_MNT/foo sync echo 0 25165824 error | dmsetup reload foo dmsetup resume foo ls -alF $SCRATCH_MNT touch $SCRATCH_MNT/1 ls -alF $SCRATCH_MNT sleep 35 echo 0 25165824 linear $SCRATCH_DEV 0 | dmsetup reload foo dmsetup resume foo sleep 1 umount $SCRATCH_MNT btrfsck /dev/mapper/foo dmsetup remove foo Signed-off-by: Stefan Behrens sbehr...@giantdisaster.de Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net --- Changes v1 - v2: - Removed one large comment entirely which was not needed anymore after the code changes. - Removed a warning message which was dead code since the super block is not written anymore when the flag BTRFS_SUPER_FLAG_ERROR is set. - Removed the return code of btrfs_error_commit_super() since it now does not return any errors anymore. fs/btrfs/disk-io.c | 36 fs/btrfs/disk-io.h | 2 +- 2 files changed, 5 insertions(+), 33 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 502b20c..f6a1d0f 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2531,8 +2531,7 @@ retry_root_backup: goto fail_trans_kthread; /* do not make disk changes in broken FS */ - if (btrfs_super_log_root(disk_super) != 0 - !(fs_info-fs_state BTRFS_SUPER_FLAG_ERROR)) { + if (btrfs_super_log_root(disk_super) != 0) { u64 bytenr = btrfs_super_log_root(disk_super); if (fs_devices-rw_devices == 0) { @@ -3192,30 +3191,14 @@ int close_ctree(struct btrfs_root *root) /* clear out the rbtree of defraggable inodes */ btrfs_run_defrag_inodes(fs_info); - /* -* Here come 2 situations when btrfs is broken to flip readonly: -* -* 1. when btrfs flips readonly somewhere else before -* btrfs_commit_super, sb-s_flags has MS_RDONLY flag, -* and btrfs will skip to write sb directly to keep -* ERROR state on disk. -* -* 2. when btrfs flips readonly just in btrfs_commit_super, -* and in such case, btrfs cannot write sb via btrfs_commit_super, -* and since fs_state has been set BTRFS_SUPER_FLAG_ERROR flag, -* btrfs will cleanup all FS resources first and write sb then. -*/ if (!(fs_info-sb-s_flags MS_RDONLY)) { ret = btrfs_commit_super(root); if (ret) printk(KERN_ERR btrfs: commit super ret %d\n, ret); } - if (fs_info-fs_state BTRFS_SUPER_FLAG_ERROR) { - ret = btrfs_error_commit_super(root); - if (ret) - printk(KERN_ERR btrfs: commit super ret %d\n, ret); - } + if (fs_info-fs_state BTRFS_SUPER_FLAG_ERROR) + btrfs_error_commit_super(root); btrfs_put_block_group_cache(fs_info); @@ -3437,18 +3420,11 @@ static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info, if (read_only) return 0; - if (fs_info-fs_state BTRFS_SUPER_FLAG_ERROR) { - printk(KERN_WARNING warning: mount fs with errors, - running btrfsck is recommended\n); - } - return 0; } -int btrfs_error_commit_super(struct btrfs_root *root) +void btrfs_error_commit_super(struct btrfs_root *root) { - int ret; - mutex_lock(root-fs_info-cleaner_mutex); btrfs_run_delayed_iputs(root); mutex_unlock(root-fs_info-cleaner_mutex); @@ -3458,10 +3434,6 @@ int btrfs_error_commit_super(struct btrfs_root *root) /* cleanup FS via transaction */ btrfs_cleanup_transaction(root); - - ret =
Re: [PATCH v2] Btrfs: remove superblock writing after fatal error
On 08/01/2012 07:45 PM, Stefan Behrens wrote: With commit acce952b0, btrfs was changed to flag the filesystem with BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal error happened like a write I/O errors of all mirrors. In such situations, on unmount, the superblock is written in btrfs_error_commit_super(). This is done with the intention to be able to evaluate the error flag on the next mount. A warning is printed in this case during the next mount and the log tree is ignored. The issue is that it is possible that the superblock points to a root that was not written (due to write I/O errors). The result is that the filesystem cannot be mounted. btrfsck also does not start and all the other btrfs-progs tools fail to start as well. However, mount -o recovery is working well and does the right things to recover the filesystem (i.e., don't use the log root, clear the free space cache and use the next mountable root that is stored in the root backup array). This patch removes the writing of the superblock when BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error flag in the mount function. Yes, I have to admit that this can be a serious problem. But we'll need to send the error flag stored in the super block into disk in the future so that the next mount can find it unstable and do fsck by itself maybe. If we do not write the super any more, it cannot find the flag. So can we find a way to make both things happy? thanks, liubo These lines can be used to reproduce the issue (using /dev/sdm): SCRATCH_DEV=/dev/sdm SCRATCH_MNT=/mnt echo 0 25165824 linear $SCRATCH_DEV 0 | dmsetup create foo ls -alLF /dev/mapper/foo mkfs.btrfs /dev/mapper/foo mount /dev/mapper/foo $SCRATCH_MNT echo bar $SCRATCH_MNT/foo sync echo 0 25165824 error | dmsetup reload foo dmsetup resume foo ls -alF $SCRATCH_MNT touch $SCRATCH_MNT/1 ls -alF $SCRATCH_MNT sleep 35 echo 0 25165824 linear $SCRATCH_DEV 0 | dmsetup reload foo dmsetup resume foo sleep 1 umount $SCRATCH_MNT btrfsck /dev/mapper/foo dmsetup remove foo Signed-off-by: Stefan Behrens sbehr...@giantdisaster.de Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net --- Changes v1 - v2: - Removed one large comment entirely which was not needed anymore after the code changes. - Removed a warning message which was dead code since the super block is not written anymore when the flag BTRFS_SUPER_FLAG_ERROR is set. - Removed the return code of btrfs_error_commit_super() since it now does not return any errors anymore. fs/btrfs/disk-io.c | 36 fs/btrfs/disk-io.h | 2 +- 2 files changed, 5 insertions(+), 33 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 502b20c..f6a1d0f 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2531,8 +2531,7 @@ retry_root_backup: goto fail_trans_kthread; /* do not make disk changes in broken FS */ - if (btrfs_super_log_root(disk_super) != 0 - !(fs_info-fs_state BTRFS_SUPER_FLAG_ERROR)) { + if (btrfs_super_log_root(disk_super) != 0) { u64 bytenr = btrfs_super_log_root(disk_super); if (fs_devices-rw_devices == 0) { @@ -3192,30 +3191,14 @@ int close_ctree(struct btrfs_root *root) /* clear out the rbtree of defraggable inodes */ btrfs_run_defrag_inodes(fs_info); - /* - * Here come 2 situations when btrfs is broken to flip readonly: - * - * 1. when btrfs flips readonly somewhere else before - * btrfs_commit_super, sb-s_flags has MS_RDONLY flag, - * and btrfs will skip to write sb directly to keep - * ERROR state on disk. - * - * 2. when btrfs flips readonly just in btrfs_commit_super, - * and in such case, btrfs cannot write sb via btrfs_commit_super, - * and since fs_state has been set BTRFS_SUPER_FLAG_ERROR flag, - * btrfs will cleanup all FS resources first and write sb then. - */ if (!(fs_info-sb-s_flags MS_RDONLY)) { ret = btrfs_commit_super(root); if (ret) printk(KERN_ERR btrfs: commit super ret %d\n, ret); } - if (fs_info-fs_state BTRFS_SUPER_FLAG_ERROR) { - ret = btrfs_error_commit_super(root); - if (ret) - printk(KERN_ERR btrfs: commit super ret %d\n, ret); - } + if (fs_info-fs_state BTRFS_SUPER_FLAG_ERROR) + btrfs_error_commit_super(root); btrfs_put_block_group_cache(fs_info); @@ -3437,18 +3420,11 @@ static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info, if (read_only) return 0; - if (fs_info-fs_state BTRFS_SUPER_FLAG_ERROR) { - printk(KERN_WARNING warning: mount fs with errors, -running btrfsck is recommended\n); - } - return 0; } -int
Re: [PATCH] Btrfs: unlock extent when the dio completes for reads
On Tue, Jul 31, 2012 at 03:30:49PM -0600, Zach Brown wrote: +static void btrfs_direct_read_iodone(struct kiocb *iocb, loff_t offset, +ssize_t bytes, void *private, int ret, +bool is_async) +{ + struct inode *inode = iocb-ki_filp-f_mapping-host; + u64 lockend; + + if (get_state_private(BTRFS_I(inode)-io_tree, offset, lockend)) { + WARN_ON(1); + goto out; + } + + unlock_extent(BTRFS_I(inode)-io_tree, offset, lockend); +out: + if (is_async) + aio_complete(iocb, ret, 0); + inode_dio_done(inode); +} Hmm. I'm worried that this can be called from interrupt context and the state/extent calls there seem to use bare spin locks. Did you run this with lockdep? Am I confused? Our read completions are run in a helper thread since we need to do things like clear the extent state stuff and such so it's safe in this regard. Course I hit a lockup with xfstests 91 because there are some cases where we can return errors without doing the dio complete (rightly so since we haven't setup the dio yet), so now I have to figure out how to know when we've been unlocked via dio complete or that I've gotten an error back and already unlocked it... Josef -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] Btrfs: fix some endian bugs handling the root times
On Mon, Jul 30, 2012 at 10:10 AM, Dan Carpenter dan.carpen...@oracle.com wrote: trans-transid is cpu endian but we want to store the data as little endian. item-ctime.nsec is only 32 bits, not 64. Signed-off-by: Dan Carpenter dan.carpen...@oracle.com --- Applies to linux-next. diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 43f0012..a1fbca0 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -424,7 +424,7 @@ static noinline int create_subvol(struct btrfs_root *root, uuid_le_gen(new_uuid); memcpy(root_item.uuid, new_uuid.b, BTRFS_UUID_SIZE); root_item.otime.sec = cpu_to_le64(cur_time.tv_sec); - root_item.otime.nsec = cpu_to_le64(cur_time.tv_nsec); + root_item.otime.nsec = cpu_to_le32(cur_time.tv_nsec); root_item.ctime = root_item.otime; btrfs_set_root_ctransid(root_item, trans-transid); btrfs_set_root_otransid(root_item, trans-transid); diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 7ac7cdc..7208ada 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1061,7 +1061,7 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans, memcpy(new_root_item-parent_uuid, root-root_item.uuid, BTRFS_UUID_SIZE); new_root_item-otime.sec = cpu_to_le64(cur_time.tv_sec); - new_root_item-otime.nsec = cpu_to_le64(cur_time.tv_nsec); + new_root_item-otime.nsec = cpu_to_le32(cur_time.tv_nsec); btrfs_set_root_otransid(new_root_item, trans-transid); memset(new_root_item-stime, 0, sizeof(new_root_item-stime)); memset(new_root_item-rtime, 0, sizeof(new_root_item-rtime)); diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c index 6bb465c..10d8e4d 100644 --- a/fs/btrfs/root-tree.c +++ b/fs/btrfs/root-tree.c @@ -544,8 +544,8 @@ void btrfs_update_root_times(struct btrfs_trans_handle *trans, struct timespec ct = CURRENT_TIME; spin_lock(root-root_times_lock); - item-ctransid = trans-transid; + item-ctransid = cpu_to_le64(trans-transid); item-ctime.sec = cpu_to_le64(ct.tv_sec); - item-ctime.nsec = cpu_to_le64(ct.tv_nsec); + item-ctime.nsec = cpu_to_le32(ct.tv_nsec); spin_unlock(root-root_times_lock); } Thanks. I will include this patch in the next send/receive pull request. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] add crtime to the snapshot list
On 08/01/2012 08:01 PM, Anand jain wrote: From: Anand anand.j...@oracle.com This patch adds creation-time to the snapshot list display, which would help user to better manage the snapshots when number of snapshots grow substantially. This patch is developed and on top of the send/receive btrfs and btrfs-progs repo at git://github.com/ablock84/linux-btrfs.git (send-v2) git://github.com/ablock84/btrfs-progs.git (send-v2) respectively. Hi Arand, Can you please also post the patch itself here? thanks, liubo Further this patch has the dependency on the following patches Liu Bo: [PATCH 2/3 RESEND] Btrfs-progs: show generation in command btrfs subvol list [PATCH 3/3] Btrfs-progs: list snapshots by generation Eg output: #btrfs su list -s 1 /btrfs ID 258 gen 39 cgen 6 top level 5 crtime 2012-07-27 17:43:55 path ss1 ID 260 gen 8 cgen 8 top level 5 crtime 2012-07-27 17:47:51 path ss2 ID 263 gen 16 cgen 16 top level 5 crtime 2012-07-29 00:50:19 path ss3 ID 264 gen 25 cgen 25 top level 5 crtime 2012-07-30 09:56:50 path sv1/.snap Anand Jain (1): Btrfs-progs: show crtime in the snapshot list btrfs-list.c | 45 - 1 files changed, 36 insertions(+), 9 deletions(-) -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 7/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2)
On Mon, Jul 23, 2012 at 5:17 PM, Alex Lyakas alex.bolshoy.bt...@gmail.com wrote: Hi Alexander, I did some testing of the case where same inode, but with a different generation, exists both in send_root and in parent_root. I know that this can happen primarily when inode_cache option is enabled. So first I just tested some differential sends, where parent and root are unrelated subvolumes. Here are some issues: 1) The top subvolume inode (ino=BTRFS_FIRST_FREE_OBJECTID) is treated also as deleted + recreated. So the code goes into process_all_refs() path and does several strange things, such as trying to orphanize the top inode. Also get_cur_path() always returns for the top subvolume (without checking whether it is an orphan). Another complication for the top inode is that its parent dir is itself. I made the following fix: @@ -3782,7 +3972,13 @@ static int changed_inode(struct send_ctx *sctx, right_gen = btrfs_inode_generation(sctx-right_path-nodes[0], right_ii); - if (left_gen != right_gen) + if (left_gen != right_gen sctx-cur_ino != BTRFS_FIRST_FREE_OBJECTID) sctx-cur_inode_new_gen = 1; So basically, don't try to delete and re-create it, but treat it like a change. Since the top subvolume inode is S_IFDIR, and dir can have only one hardlink (and hopefully it is always ..), we will never need to change anything for this INODE_REF. I also added: @@ -2526,6 +2615,14 @@ static int process_recorded_refs(struct send_ctx *sctx) int did_overwrite = 0; int is_orphan = 0; + BUG_ON(sctx-cur_ino = BTRFS_FIRST_FREE_OBJECTID); I applied both fixes to for-chris now. 2) After I fixed this, I hit another issue, where inodes under the top subvolume dir, attempt to rmdir() the top dir, while iterating over check_dirs in process_recorded_refs(), because (top_dir_ino, top_dir_gen) indicate that it was deleted. So I added: @@ -2714,10 +2857,19 @@ verbose_printk(btrfs: process_recorded_refs %llu\n, sctx-cur_ino); */ ULIST_ITER_INIT(uit); while ((un = ulist_next(check_dirs, uit))) { + /* Do not attempt to rmdir it the top subvolume dir */ + if (un-val == BTRFS_FIRST_FREE_OBJECTID) + continue; + if (un-val sctx-cur_ino) continue; I applied a similar fix by adding a check to can_rmdir. The way you suggested would also skip utime updates for the top dir. 3) process_recorded_refs() always increments the send_progress: /* * Current inode is now at it's new position, so we must increase * send_progress */ sctx-send_progress = sctx-cur_ino + 1; However, in the changed_inode() path I am testing, process_all_refs() is called twice with the same inode (once for deleted inode, once for the recreated inode), so after the first call, send_progress is incremented and doesn't match the inode anymore. I don't think I hit any issues because of this, just that it's confusing. I fixed this issue some days ago. 4) +/* + * Record and process all refs at once. Needed when an inode changes the + * generation number, which means that it was deleted and recreated. + */ +static int process_all_refs(struct send_ctx *sctx, + enum btrfs_compare_tree_result cmd) +{ + int ret; + struct btrfs_root *root; + struct btrfs_path *path; + struct btrfs_key key; + struct btrfs_key found_key; + struct extent_buffer *eb; + int slot; + iterate_inode_ref_t cb; + + path = alloc_path_for_send(); + if (!path) + return -ENOMEM; + + if (cmd == BTRFS_COMPARE_TREE_NEW) { + root = sctx-send_root; + cb = __record_new_ref; + } else if (cmd == BTRFS_COMPARE_TREE_DELETED) { + root = sctx-parent_root; + cb = __record_deleted_ref; + } else { + BUG(); + } + + key.objectid = sctx-cmp_key-objectid; + key.type = BTRFS_INODE_REF_KEY; + key.offset = 0; + while (1) { + ret = btrfs_search_slot_for_read(root, key, path, 1, 0); + if (ret 0) { + btrfs_release_path(path); + goto out; + } + if (ret) { + btrfs_release_path(path); + break; + } + + eb = path-nodes[0]; + slot = path-slots[0]; + btrfs_item_key_to_cpu(eb, found_key, slot); + + if (found_key.objectid != key.objectid || + found_key.type != key.type) { + btrfs_release_path(path); + break; + } + + ret = iterate_inode_ref(sctx,
Re: [PATCH v2] Btrfs: remove superblock writing after fatal error
On 08/01/2012 09:07 PM, Jan Schmidt wrote: On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote: On 08/01/2012 07:45 PM, Stefan Behrens wrote: With commit acce952b0, btrfs was changed to flag the filesystem with BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal error happened like a write I/O errors of all mirrors. In such situations, on unmount, the superblock is written in btrfs_error_commit_super(). This is done with the intention to be able to evaluate the error flag on the next mount. A warning is printed in this case during the next mount and the log tree is ignored. The issue is that it is possible that the superblock points to a root that was not written (due to write I/O errors). The result is that the filesystem cannot be mounted. btrfsck also does not start and all the other btrfs-progs tools fail to start as well. However, mount -o recovery is working well and does the right things to recover the filesystem (i.e., don't use the log root, clear the free space cache and use the next mountable root that is stored in the root backup array). This patch removes the writing of the superblock when BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error flag in the mount function. Yes, I have to admit that this can be a serious problem. But we'll need to send the error flag stored in the super block into disk in the future so that the next mount can find it unstable and do fsck by itself maybe. Hum, that's possible. However, I neither see a) a safe way to get that flag to disk nor b) a situation where this flag would help. When we abort a transaction, we just roll everything back to the last commit, i.e. a consistent state. So if we stop writing a potentially corrupt super block, we should be fine anyway. Or am I missing something? I'm just wondering if we can roll everything back well, why do we need fsck? thanks, liubo -Jan -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Btrfs: remove superblock writing after fatal error
On 01.08.2012 15:31, Liu Bo wrote: On 08/01/2012 09:07 PM, Jan Schmidt wrote: On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote: On 08/01/2012 07:45 PM, Stefan Behrens wrote: With commit acce952b0, btrfs was changed to flag the filesystem with BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal error happened like a write I/O errors of all mirrors. In such situations, on unmount, the superblock is written in btrfs_error_commit_super(). This is done with the intention to be able to evaluate the error flag on the next mount. A warning is printed in this case during the next mount and the log tree is ignored. The issue is that it is possible that the superblock points to a root that was not written (due to write I/O errors). The result is that the filesystem cannot be mounted. btrfsck also does not start and all the other btrfs-progs tools fail to start as well. However, mount -o recovery is working well and does the right things to recover the filesystem (i.e., don't use the log root, clear the free space cache and use the next mountable root that is stored in the root backup array). This patch removes the writing of the superblock when BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error flag in the mount function. Yes, I have to admit that this can be a serious problem. But we'll need to send the error flag stored in the super block into disk in the future so that the next mount can find it unstable and do fsck by itself maybe. Hum, that's possible. However, I neither see a) a safe way to get that flag to disk nor b) a situation where this flag would help. When we abort a transaction, we just roll everything back to the last commit, i.e. a consistent state. So if we stop writing a potentially corrupt super block, we should be fine anyway. Or am I missing something? I'm just wondering if we can roll everything back well, why do we need fsck? Mostly for undetected errors. thanks, liubo -Jan -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] add crtime to the snapshot list
Eg output: #btrfs su list -s 1 /btrfs ID 258 gen 39 cgen 6 top level 5 crtime 2012-07-27 17:43:55 path ss1 ID 260 gen 8 cgen 8 top level 5 crtime 2012-07-27 17:47:51 path ss2 ID 263 gen 16 cgen 16 top level 5 crtime 2012-07-29 00:50:19 path ss3 ID 264 gen 25 cgen 25 top level 5 crtime 2012-07-30 09:56:50 path sv1/.snap Is it possible to rename crtime to otime? I used otime to refer creation time all the time. Using crtime could be confusing, as ctime is the change time and rtime is the receive time. Yeah. Will rename it to otime just to mention the choices were.. Birth-time (as in stat output), crtime and otime. Thanks, Anand -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: unlock extent when the dio completes for reads V2
A deadlock in xfstests 113 was uncovered by commit d187663ef24cd3d033f0cbf2867e70b36a3a90b8 This is because we would not return EIOCBQUEUED for short AIO reads, instead we'd wait for the DIO to complete and then return the amount of data we transferred, which would allow our stuff to unlock the remaning amount. But with this change this no longer happens, so if we have a short AIO read (for example if we try to read past EOF), we could leave the section from EOF to the end of where we tried to read locked. To fix this we need to store the end of the region we've locked in our locked state bit and then only unlock when the dio has finished completely. This makes sure we always unlock the entire range we locked when reading. Before this patch we would hang when running 113, no this no longer happens. Thanks to Zach Brown for the idea, Signed-off-by: Josef Bacik jba...@fusionio.com --- V1-V2: fix a problem where we didn't unlock the range if io was never submitted. fs/btrfs/inode.c | 38 -- 1 files changed, 32 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 48bdfd2..4b82ae2 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5810,9 +5810,6 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock, if (!create (em-block_start == EXTENT_MAP_HOLE || test_bit(EXTENT_FLAG_PREALLOC, em-flags))) { free_extent_map(em); - /* DIO will do one hole at a time, so just unlock a sector */ - unlock_extent(BTRFS_I(inode)-io_tree, start, - start + root-sectorsize - 1); return 0; } @@ -5961,8 +5958,6 @@ static void btrfs_endio_direct_read(struct bio *bio, int err) bvec++; } while (bvec = bvec_end); - unlock_extent(BTRFS_I(inode)-io_tree, dip-logical_offset, - dip-logical_offset + dip-bytes - 1); bio-bi_private = dip-private; kfree(dip-csums); @@ -6340,6 +6335,26 @@ static ssize_t check_direct_IO(struct btrfs_root *root, int rw, struct kiocb *io out: return retval; } + +static void btrfs_direct_read_iodone(struct kiocb *iocb, loff_t offset, +ssize_t bytes, void *private, int ret, +bool is_async) +{ + struct inode *inode = iocb-ki_filp-f_mapping-host; + u64 lockend; + + if (get_state_private(BTRFS_I(inode)-io_tree, offset, lockend)) { + WARN_ON(1); + goto out; + } + + unlock_extent(BTRFS_I(inode)-io_tree, offset, lockend); +out: + if (is_async) + aio_complete(iocb, ret, 0); + inode_dio_done(inode); +} + static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_t offset, unsigned long nr_segs) @@ -6353,6 +6368,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb, int writing = rw WRITE; int write_bits = 0; size_t count = iov_length(iov, nr_segs); + dio_iodone_t *iodone = NULL; if (check_direct_IO(BTRFS_I(inode)-root, rw, iocb, iov, offset, nr_segs)) { @@ -6438,6 +6454,9 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb, 1, 0, cached_state, GFP_NOFS); goto out; } + } else { + set_state_private(BTRFS_I(inode)-io_tree, lockstart, lockend); + iodone = btrfs_direct_read_iodone; } free_extent_state(cached_state); @@ -6445,9 +6464,16 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb, ret = __blockdev_direct_IO(rw, iocb, inode, BTRFS_I(inode)-root-fs_info-fs_devices-latest_bdev, - iov, offset, nr_segs, btrfs_get_blocks_direct, NULL, + iov, offset, nr_segs, btrfs_get_blocks_direct, iodone, btrfs_submit_direct, 0); + /* +* If we're reading we will be unlocked by the iodone call, unless ret +* is 0 then we need to unlock since we didn't submit any io. +*/ + if (!writing ret) + goto out; + if (ret 0 ret != -EIOCBQUEUED) { clear_extent_bit(BTRFS_I(inode)-io_tree, offset, offset + iov_length(iov, nr_segs) - 1, -- 1.7.7.6 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Questions and notes about send/receive
Alexander, thanks for addressing the issues. __get_cur_name_and_parent() if we decided to call get_first_ref() on the send_root (due to ino sctx-send_progress), do we really need to call did_overwrite_ref()? Because it will just lookup the send root again. I mean if we know that this inode has been already handled, then it's not an orphan anymore, so no need for this additional check. If my understanding is wrong, pls give some small example? get_first_ref does not return the current state. It returns the permanent state. The result of did_overwrite_ref however depends on the current send progress. There is however some room for optimization here. In many cases, I did not think much about double tree lookups due to the different calls, as I wanted it to work first and later optimize it. One possible optimization for example would be to cache the results of lookups. But only for functions which return permanent state, e.g. get_inode_info. So do you think my understanding is correct that if (ino sctx-send_progress), then both functions will return the same value? Or perhaps: if (ino sctx-send_progress), then no need to check anymore the did_overwrite_ref(), because we have handled that already? I think that the call to did_overwrite_ref() is relevant only as long as we haven't handled this ino. After that, we are good...I think. I'm not saying the code has be changed/optimized at this point, just testing my understanding:) struct btrfs_root_item: what is the difference between existing generation field that you mimic in generation_v2 and ctransid? (I know I need to study the root tree code). Did you read the comments for btrfs_root_item and btrfs_read_root_item? They should answer your question. Yes, I read the comments, but they address only generation and generation_v2, which I understand. Basically, I don't understand how generation differs from ctransid (I need to dig more there). Thanks! Alex. Thank you, Alex. Big thanks for your reviews :) I pushed a new version of the kernel and progs side. The branches are now for-chris -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
kernel BUG at fs/btrfs/extent-tree.c:5038 (linux 3.4.7)
Hi, I have some trouble with a btrfs filesystem. As you can see in logs, there is lines which are from btrfs (I supposed), then some warnings at fs/btrfs/extent-tree.c, and finally a kernel BUG at fs/btrfs/extent-tree.c:5038. Do you need more informations ? Aug 1 21:23:10 backup2 kernel: [7.015184] Btrfs loaded Aug 1 21:23:10 backup2 kernel: [7.023314] device fsid a5dfe512-c8a3-46f0-bc8c-6365b8eccdcb devid 1 transid 110423 /dev/mapper/vg--backupplug-backup Aug 1 21:23:10 backup2 kernel: [7.024405] btrfs: force zlib compression Aug 1 21:23:10 backup2 kernel: [7.024414] btrfs: not using ssd allocation scheme Aug 1 21:23:10 backup2 kernel: [ 17.938241] NET: Registered protocol family 10 Aug 1 21:26:53 backup2 kernel: [ 28.880027] eth0: no IPv6 routers present Aug 1 21:33:39 backup2 kernel: [ 435.285128] leaf 615015878656 total ptrs 28 free space 1651 Aug 1 21:33:39 backup2 kernel: [ 435.285146] item 0 key (642867957760 a8 4096) itemoff 3944 itemsize 51 Aug 1 21:33:39 backup2 kernel: [ 435.285153] extent refs 1 gen 108436 flags 2 Aug 1 21:33:39 backup2 kernel: [ 435.285162] tree block key (18446744073709551606 80 1215799304192) level 0 Aug 1 21:33:39 backup2 kernel: [ 435.285171] tree block backref root 7 Aug 1 21:33:39 backup2 kernel: [ 435.285179] item 1 key (642867970048 a8 4096) itemoff 3893 itemsize 51 Aug 1 21:33:39 backup2 kernel: [ 435.285186] extent refs 1 gen 38936 flags 258 Aug 1 21:33:39 backup2 kernel: [ 435.285196] tree block key (401571 54 2912287038) level 0 Aug 1 21:33:39 backup2 kernel: [ 435.285202] shared block backref parent 615081041920 Aug 1 21:33:39 backup2 kernel: [ 435.285208] item 2 key (642867994624 a8 4096) itemoff 3824 itemsize 69 Aug 1 21:33:39 backup2 kernel: [ 435.285214] extent refs 3 gen 38909 flags 258 Aug 1 21:33:39 backup2 kernel: [ 435.285221] tree block key (13092 c 12250) level 0 Aug 1 21:33:39 backup2 kernel: [ 435.285226] shared block backref parent 1306797543424 Aug 1 21:33:39 backup2 kernel: [ 435.285231] shared block backref parent 1306797494272 Aug 1 21:33:39 backup2 kernel: [ 435.285236] shared block backref parent 647007399936 Aug 1 21:33:39 backup2 kernel: [ 435.285242] item 3 key (642868002816 a8 4096) itemoff 3773 itemsize 51 Aug 1 21:33:39 backup2 kernel: [ 435.285248] extent refs 1 gen 44304 flags 2 Aug 1 21:33:39 backup2 kernel: [ 435.285255] tree block key (1824364519424 a8 94208) level 0 Aug 1 21:33:39 backup2 kernel: [ 435.285260] tree block backref root 2 Aug 1 21:33:39 backup2 kernel: [ 435.285265] item 4 key (642868006912 a8 4096) itemoff 3722 itemsize 51 Aug 1 21:33:39 backup2 kernel: [ 435.285271] extent refs 1 gen 44304 flags 2 Aug 1 21:33:39 backup2 kernel: [ 435.285278] tree block key (1824369225728 a8 118784) level 0 Aug 1 21:33:39 backup2 kernel: [ 435.285284] tree block backref root 2 Aug 1 21:33:39 backup2 kernel: [ 435.285289] item 5 key (642868011008 a8 4096) itemoff 3671 itemsize 51 Aug 1 21:33:39 backup2 kernel: [ 435.285295] extent refs 1 gen 101712 flags 258 Aug 1 21:33:39 backup2 kernel: [ 435.285302] tree block key (833 6c 162816000) level 0 Aug 1 21:33:39 backup2 kernel: [ 435.285307] shared block backref parent 617940144128 Aug 1 21:33:39 backup2 kernel: [ 435.285313] item 6 key (642868015104 a8 4096) itemoff 3593 itemsize 78 Aug 1 21:33:39 backup2 kernel: [ 435.285321] extent refs 4 gen 38909 flags 258 Aug 1 21:33:39 backup2 kernel: [ 435.285326] tree block key (17721 60 119) level 0 Aug 1 21:33:39 backup2 kernel: [ 435.285331] shared block backref parent 1314062098432 Aug 1 21:33:39 backup2 kernel: [ 435.285336] shared block backref parent 1314062094336 Aug 1 21:33:39 backup2 kernel: [ 435.285341] shared block backref parent 1312740683776 Aug 1 21:33:39 backup2 kernel: [ 435.285348] shared block backref parent 1312740675584 Aug 1 21:33:39 backup2 kernel: [ 435.285354] item 7 key (642868019200 a8 4096) itemoff 3542 itemsize 51 Aug 1 21:33:39 backup2 kernel: [ 435.285360] extent refs 1 gen 44307 flags 2 Aug 1 21:33:39 backup2 kernel: [ 435.285365] tree block key (18446744073709551606 80 1825532395520) level 0 Aug 1 21:33:39 backup2 kernel: [ 435.285373] tree block backref root 7 Aug 1 21:33:39 backup2 kernel: [ 435.285378] item 8 key (642868023296 a8 4096) itemoff 3491 itemsize 51 Aug 1 21:33:39 backup2 kernel: [ 435.285384] extent refs 1 gen 101712 flags 258 Aug 1 21:33:39 backup2 kernel: [ 435.285389] tree block key (833 6c 161284096) level 0 Aug 1 21:33:39 backup2 kernel: [ 435.285396] shared block backref parent 617940144128 Aug 1 21:33:39 backup2 kernel: [ 435.285402] item
[PATCH] Btrfs: barrier before waitqueue_active
We need an smb_mb() before waitqueue_active to avoid missing wakeups. Before Mitch was hitting a deadlock between the ordered flushers and the transaction commit because the ordered flushers were waiting for more refs and were never woken up, so those smp_mb()'s are the most important. Everything else I added for correctness sake and to avoid getting bitten by this again somewhere else. Thanks, Signed-off-by: Josef Bacik jba...@fusionio.com --- fs/btrfs/compression.c |1 + fs/btrfs/delayed-inode.c | 16 ++-- fs/btrfs/delayed-ref.c | 18 -- fs/btrfs/disk-io.c | 11 --- fs/btrfs/inode.c |8 +--- fs/btrfs/volumes.c |8 +--- 6 files changed, 41 insertions(+), 21 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 86eff48..43d1c5a 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -818,6 +818,7 @@ static void free_workspace(int type, struct list_head *workspace) btrfs_compress_op[idx]-free_workspace(workspace); atomic_dec(alloc_workspace); wake: + smp_mb(); if (waitqueue_active(workspace_wait)) wake_up(workspace_wait); } diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index 335605c..8cc9b19 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -513,9 +513,11 @@ static void __btrfs_remove_delayed_item(struct btrfs_delayed_item *delayed_item) rb_erase(delayed_item-rb_node, root); delayed_item-delayed_node-count--; atomic_dec(delayed_root-items); - if (atomic_read(delayed_root-items) BTRFS_DELAYED_BACKGROUND - waitqueue_active(delayed_root-wait)) - wake_up(delayed_root-wait); + if (atomic_read(delayed_root-items) BTRFS_DELAYED_BACKGROUND) { + smp_mb(); + if (waitqueue_active(delayed_root-wait)) + wake_up(delayed_root-wait); + } } static void btrfs_release_delayed_item(struct btrfs_delayed_item *item) @@ -1057,9 +1059,11 @@ static void btrfs_release_delayed_inode(struct btrfs_delayed_node *delayed_node) delayed_root = delayed_node-root-fs_info-delayed_root; atomic_dec(delayed_root-items); if (atomic_read(delayed_root-items) - BTRFS_DELAYED_BACKGROUND - waitqueue_active(delayed_root-wait)) - wake_up(delayed_root-wait); + BTRFS_DELAYED_BACKGROUND) { + smp_mb(); + if (waitqueue_active(delayed_root-wait)) + wake_up(delayed_root-wait); + } } } diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index da7419e..858ef02 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -662,9 +662,12 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info, add_delayed_tree_ref(fs_info, trans, ref-node, bytenr, num_bytes, parent, ref_root, level, action, for_cow); - if (!need_ref_seq(for_cow, ref_root) - waitqueue_active(fs_info-tree_mod_seq_wait)) - wake_up(fs_info-tree_mod_seq_wait); + if (!need_ref_seq(for_cow, ref_root)) { + smp_mb(); + if (waitqueue_active(fs_info-tree_mod_seq_wait)) + wake_up(fs_info-tree_mod_seq_wait); + } + spin_unlock(delayed_refs-lock); if (need_ref_seq(for_cow, ref_root)) btrfs_qgroup_record_ref(trans, ref-node, extent_op); @@ -713,9 +716,11 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info, add_delayed_data_ref(fs_info, trans, ref-node, bytenr, num_bytes, parent, ref_root, owner, offset, action, for_cow); - if (!need_ref_seq(for_cow, ref_root) - waitqueue_active(fs_info-tree_mod_seq_wait)) - wake_up(fs_info-tree_mod_seq_wait); + if (!need_ref_seq(for_cow, ref_root)) { + smp_mb(); + if (waitqueue_active(fs_info-tree_mod_seq_wait)) + wake_up(fs_info-tree_mod_seq_wait); + } spin_unlock(delayed_refs-lock); if (need_ref_seq(for_cow, ref_root)) btrfs_qgroup_record_ref(trans, ref-node, extent_op); @@ -744,6 +749,7 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info, num_bytes, BTRFS_UPDATE_DELAYED_HEAD, extent_op-is_data); + smp_mb(); if (waitqueue_active(fs_info-tree_mod_seq_wait)) wake_up(fs_info-tree_mod_seq_wait); spin_unlock(delayed_refs-lock); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 502b20c..a355c89 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -756,9
Re: How can btrfs take 23sec to stat 23K files from an SSD?
Hi Marc, Am Mittwoch, 1. August 2012 schrieb Marc MERLIN: On Wed, Aug 01, 2012 at 01:08:46PM +0700, Fajar A. Nugraha wrote: It it were a random crappy SSD from a random vendor, I'd blame the SSD, but I have a hard time believing that samsung is selling SSDs that are slower than hard drives at random IO and 'seeks'. You'd be surprised on how badly some vendors can screw up :) At some point, it may come down to that indeed :-/ I'm still hopefully that Samsung didn't, but we'll see. Its getting quite strange. I lost track of whether you did that already or not, but if you didnĀ“t please post some vmstat 1 iostat -xd 1 on the device while it is being slow. I am interested in wait I/O and latencies and disk utilization. Comparison data of Intel SSD 320 in ThinkPad T520 during merkaba:~ echo 3 /proc/sys/drop_caches ; du -sch /usr on BTRFS with Kernel 3.5: martin@merkaba:~ vmstat 1 procs ---memory-- ---swap-- -io -system-- cpu r b swpd free buff cache si sobibo in cs us sy id wa 2 1 21556 4442668 2056 50235200 19485 247 120 11 2 87 0 1 2 21556 440 2448 51488400 11684 328 4975 24585 5 16 65 14 1 0 21556 4389880 2448 52806000 13400 0 4574 23452 2 16 68 14 3 1 21556 4370068 2448 54505200 18132 0 5499 27220 1 18 64 16 1 0 21556 4350228 2448 8000 10856 0 4122 25339 3 16 67 14 1 1 21556 4315604 2448 56975600 12648 0 4647 31153 5 14 66 15 0 1 21556 4295652 2456 58148000 1154856 4093 24618 2 13 69 16 0 1 21556 4286720 2456 59158000 10824 0 3750 21445 1 12 71 16 0 1 21556 4266308 2456 60362000 12932 0 4841 26447 4 12 68 17 1 0 21556 4248228 2456 61380800 10264 4 3703 22108 1 13 71 15 5 1 21556 4231976 2456 62435600 10540 0 3581 20436 1 10 72 17 0 1 21556 4197168 2456 63910800 12952 0 4738 28223 4 15 66 15 4 1 21556 4178456 2456 65055200 11656 0 4234 23480 2 14 68 16 0 1 21556 4163616 2456 66299200 13652 0 4619 26580 1 16 70 13 4 1 21556 4138288 2456 67569600 13352 0 4422 22254 1 16 70 13 1 0 21556 4113204 2456 68906000 13232 0 4312 21936 1 15 70 14 0 1 21556 4085532 2456 70416000 14972 0 4820 24238 1 16 69 14 2 0 21556 4055740 2456 71964400 15736 0 5099 25513 3 17 66 14 0 1 21556 4028612 2456 73438000 14504 0 4795 25052 3 15 68 14 2 0 21556 3999108 2456 74904000 1465616 4672 21878 1 17 69 13 1 1 21556 3972732 2456 76210800 12972 0 4717 22411 1 17 70 13 5 0 21556 3949684 2584 77348400 1152852 4837 24107 3 15 67 15 1 0 21556 3912504 2584 78742000 12156 0 4883 25201 4 15 67 14 martin@merkaba:~ iostat -xd 1 /dev/sda Linux 3.5.0-tp520 (merkaba) 01.08.2012 _x86_64_(4 CPU) Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 1,29 1,44 11,58 12,78 684,74 299,7580,81 0,249,860,95 17,93 0,29 0,71 Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 2808,000,00 11232,00 0,00 8,00 0,570,210,210,00 0,19 54,50 Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 2967,000,00 11868,00 0,00 8,00 0,630,210,210,00 0,21 60,90 Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,0011,00 2992,004,00 11968,0056,00 8,03 0,640,220,220,25 0,21 62,00 Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 2680,000,00 10720,00 0,00 8,00 0,700,260,260,00 0,25 66,70 Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 3153,000,00 12612,00 0,00 8,00 0,720,230,230,00 0,22 69,30 Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 2769,000,00 11076,00 0,00 8,00 0,630,230,230,00 0,21 58,00 Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 2523,001,00 10092,00 4,00 8,00 0,740,290,290,00 0,28
Re: [PATCH] Btrfs: barrier before waitqueue_active
On Wed, Aug 1, 2012 at 3:25 PM, Josef Bacik jba...@fusionio.com wrote: We need an smb_mb() before waitqueue_active to avoid missing wakeups. Before Mitch was hitting a deadlock between the ordered flushers and the transaction commit because the ordered flushers were waiting for more refs and were never woken up, so those smp_mb()'s are the most important. Everything else I added for correctness sake and to avoid getting bitten by this again somewhere else. Thanks, This patch seems to make it tougher to hit a deadlock, but I'm still encountering intermittent deadlocks using this patch when running multiple rsync threads. I've also tested Patch 2, and that has me hitting a deadlock even quicker (when starting several copying threads). I also found a slight performance hit using this patch. On a 3.4.6 kernel (merged with the 3.5_rc for-linus branch), I would typically complete my rsync test in ~265 seconds. Also, I can't recall hitting a deadlock on the 3.4.6 kernel (with 3.5_rc for-linus). When using this patch, the test would take ~310 seconds (when it didn't hit a deadlock). Here's the Delayed Tasks (Ctrl-SysRq-W) when using JUST this patch: [ 1568.794030] SysRq : Show Blocked State [ 1568.794101] taskPC stack pid father [ 1568.794123] btrfs-endio-wri D 88012579c000 0 3845 2 0x [ 1568.794128] 8801254f3c20 0046 8801254f2000 8801241b5a80 [ 1568.794132] 00012280 8801254f3fd8 00012280 4000 [ 1568.794136] 8801254f3fd8 00012280 880129af16a0 8801241b5a80 [ 1568.794140] Call Trace: [ 1568.794179] [a0068785] ? memcpy_extent_buffer+0x159/0x17a [btrfs] [ 1568.794200] [a0082ab7] ? find_ref_head+0xa3/0xc6 [btrfs] [ 1568.794220] [a008343c] ? btrfs_find_ref_cluster+0xdd/0x117 [btrfs] [ 1568.794225] [8162d58c] schedule+0x64/0x66 [ 1568.794241] [a003fc86] btrfs_run_delayed_refs+0x269/0x3f0 [btrfs] [ 1568.794246] [8104b10e] ? wake_up_bit+0x2a/0x2a [ 1568.794265] [a004fdc4] __btrfs_end_transaction+0xca/0x283 [btrfs] [ 1568.794283] [a004ffda] btrfs_end_transaction+0x15/0x17 [btrfs] [ 1568.794302] [a00555da] btrfs_finish_ordered_io+0x2e4/0x334 [btrfs] [ 1568.794306] [8103b980] ? run_timer_softirq+0x2d4/0x2d4 [ 1568.794325] [a005563f] finish_ordered_fn+0x15/0x17 [btrfs] [ 1568.794344] [a0070ef8] worker_loop+0x188/0x4e0 [btrfs] [ 1568.794365] [a0070d70] ? btrfs_queue_worker+0x275/0x275 [btrfs] [ 1568.794384] [a0070d70] ? btrfs_queue_worker+0x275/0x275 [btrfs] [ 1568.794387] [8104ac37] kthread+0x89/0x91 [ 1568.794391] [8162fd74] kernel_thread_helper+0x4/0x10 [ 1568.794395] [8104abae] ? kthread_freezable_should_stop+0x57/0x57 [ 1568.794398] [8162fd70] ? gs_change+0xb/0xb [ 1568.794400] btrfs-transacti D 88009912ba50 0 3851 2 0x [ 1568.794403] 8801241cfc70 0046 8801241ce000 8801248cda80 [ 1568.794407] 00012280 8801241cffd8 00012280 4000 [ 1568.794411] 8801241cffd8 00012280 8801254b8000 8801248cda80 [ 1568.794415] Call Trace: [ 1568.794436] [a0066646] ? extent_writepages+0x53/0x5d [btrfs] [ 1568.794455] [a005357b] ? uncompress_inline.clone.33+0x15f/0x15f [btrfs] [ 1568.794459] [810c9ada] ? pagevec_lookup_tag+0x24/0x2e [ 1568.794478] [a0052e0e] ? btrfs_writepages+0x27/0x29 [btrfs] [ 1568.794481] [810c90b1] ? do_writepages+0x20/0x29 [ 1568.794485] [8162d58c] schedule+0x64/0x66 [ 1568.794505] [a0061547] btrfs_start_ordered_extent+0xde/0xfa [btrfs] [ 1568.794508] [8104b10e] ? wake_up_bit+0x2a/0x2a [ 1568.794529] [a0061984] ? btrfs_lookup_first_ordered_extent+0x65/0x99 [btrfs] [ 1568.794549] [a0061a6a] btrfs_wait_ordered_range+0xb2/0xda [btrfs] [ 1568.794569] [a0061bcc] btrfs_run_ordered_operations+0x13a/0x1c1 [btrfs] [ 1568.794587] [a004f5f5] btrfs_commit_transaction+0x287/0x960 [btrfs] [ 1568.794606] [a00502b1] ? start_transaction+0x2d5/0x310 [btrfs] [ 1568.794609] [8104b10e] ? wake_up_bit+0x2a/0x2a [ 1568.794627] [a004913b] transaction_kthread+0x187/0x258 [btrfs] [ 1568.794644] [a0048fb4] ? btrfs_alloc_root+0x42/0x42 [btrfs] [ 1568.794661] [a0048fb4] ? btrfs_alloc_root+0x42/0x42 [btrfs] [ 1568.794664] [8104ac37] kthread+0x89/0x91 [ 1568.794668] [8162fd74] kernel_thread_helper+0x4/0x10 [ 1568.794671] [8104abae] ? kthread_freezable_should_stop+0x57/0x57 [ 1568.794674] [8162fd70] ? gs_change+0xb/0xb [ 1568.794676] flush-btrfs-1 D 88012579c000 0 3857 2 0x [ 1568.794680] 880037125670 0046 880037124000 8801254b8000 [ 1568.794684] 00012280 880037125fd8 00012280 4000 [
[josef-btrfs:master 9/11] fs/btrfs/transaction.c:1118:6: warning: 'parent' may be used uninitialized in this function
Hi Miao, There are new compile warnings show up in tree: git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git master head: 8e1163044779e90662e96887cdd692c1594146ad commit: dd817e81e81bbb83b63317b169d0e57a5d7ae0e1 [9/11] Btrfs: fix error path in create_pending_snapshot() config: x86_64-randconfig-h031 (attached as .config) All error/warnings: fs/btrfs/transaction.c: In function 'create_pending_snapshot': fs/btrfs/transaction.c:1118:6: warning: 'parent' may be used uninitialized in this function [-Wmaybe-uninitialized] fs/btrfs/transaction.c:1119:19: warning: 'rsv' may be used uninitialized in this function [-Wmaybe-uninitialized] vim +1118 fs/btrfs/transaction.c 1115 if (ret) 1116 goto abort_trans; 1117 fail: 1118 dput(parent); 1119 trans-block_rsv = rsv; 1120 no_free_objectid: 1121 kfree(new_root_item); --- 0-DAY kernel build testing backend Open Source Technology Centre Fengguang Wu w...@linux.intel.com Intel Corporation # # Automatically generated file; DO NOT EDIT. # Linux/x86_64 3.5.0 Kernel Configuration # CONFIG_64BIT=y # CONFIG_X86_32 is not set CONFIG_X86_64=y CONFIG_X86=y CONFIG_INSTRUCTION_DECODER=y CONFIG_OUTPUT_FORMAT=elf64-x86-64 CONFIG_ARCH_DEFCONFIG=arch/x86/configs/x86_64_defconfig CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_HAVE_LATENCYTOP_SUPPORT=y CONFIG_MMU=y CONFIG_NEED_DMA_MAP_STATE=y CONFIG_NEED_SG_DMA_LENGTH=y # CONFIG_GENERIC_ISA_DMA is not set CONFIG_GENERIC_BUG=y CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y CONFIG_GENERIC_HWEIGHT=y # CONFIG_ARCH_MAY_HAVE_PC_FDC is not set # CONFIG_RWSEM_GENERIC_SPINLOCK is not set CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_ARCH_HAS_CPU_RELAX=y CONFIG_ARCH_HAS_DEFAULT_IDLE=y CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y CONFIG_ARCH_HAS_CPU_AUTOPROBE=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y CONFIG_ZONE_DMA32=y CONFIG_AUDIT_ARCH=y CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y CONFIG_X86_64_SMP=y CONFIG_X86_HT=y CONFIG_ARCH_HWEIGHT_CFLAGS=-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11 CONFIG_ARCH_CPU_PROBE_RELEASE=y CONFIG_ARCH_SUPPORTS_UPROBES=y CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config CONFIG_HAVE_IRQ_WORK=y CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_EXTABLE_SORT=y # # General setup # CONFIG_EXPERIMENTAL=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_CROSS_COMPILE= CONFIG_LOCALVERSION= CONFIG_LOCALVERSION_AUTO=y CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_BZIP2=y CONFIG_HAVE_KERNEL_LZMA=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_HAVE_KERNEL_LZO=y # CONFIG_KERNEL_GZIP is not set # CONFIG_KERNEL_BZIP2 is not set # CONFIG_KERNEL_LZMA is not set CONFIG_KERNEL_XZ=y # CONFIG_KERNEL_LZO is not set CONFIG_DEFAULT_HOSTNAME=(none) CONFIG_SWAP=y CONFIG_SYSVIPC=y # CONFIG_POSIX_MQUEUE is not set CONFIG_BSD_PROCESS_ACCT=y CONFIG_BSD_PROCESS_ACCT_V3=y # CONFIG_FHANDLE is not set # CONFIG_TASKSTATS is not set # CONFIG_AUDIT is not set CONFIG_HAVE_GENERIC_HARDIRQS=y # # IRQ subsystem # CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_ARCH_CLOCKSOURCE_DATA=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BUILD=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y CONFIG_GENERIC_CMOS_UPDATE=y # # Timers subsystem # CONFIG_TICK_ONESHOT=y # CONFIG_NO_HZ is not set CONFIG_HIGH_RES_TIMERS=y # # RCU Subsystem # CONFIG_TREE_PREEMPT_RCU=y CONFIG_PREEMPT_RCU=y CONFIG_RCU_FANOUT=64 CONFIG_RCU_FANOUT_LEAF=16 # CONFIG_RCU_FANOUT_EXACT is not set CONFIG_TREE_RCU_TRACE=y CONFIG_RCU_BOOST=y CONFIG_RCU_BOOST_PRIO=1 CONFIG_RCU_BOOST_DELAY=500 CONFIG_IKCONFIG=y CONFIG_LOG_BUF_SHIFT=17 CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y CONFIG_CGROUPS=y # CONFIG_CGROUP_DEBUG is not set # CONFIG_CGROUP_FREEZER is not set CONFIG_CGROUP_DEVICE=y CONFIG_CPUSETS=y CONFIG_PROC_PID_CPUSET=y # CONFIG_CGROUP_CPUACCT is not set CONFIG_RESOURCE_COUNTERS=y # CONFIG_CGROUP_MEM_RES_CTLR is not set # CONFIG_CGROUP_PERF is not set CONFIG_CGROUP_SCHED=y CONFIG_FAIR_GROUP_SCHED=y # CONFIG_CFS_BANDWIDTH is not set # CONFIG_RT_GROUP_SCHED is not set CONFIG_BLK_CGROUP=y # CONFIG_DEBUG_BLK_CGROUP is not set # CONFIG_CHECKPOINT_RESTORE is not set CONFIG_NAMESPACES=y # CONFIG_UTS_NS is not set # CONFIG_IPC_NS is not set CONFIG_PID_NS=y CONFIG_NET_NS=y CONFIG_SCHED_AUTOGROUP=y # CONFIG_SYSFS_DEPRECATED is not set # CONFIG_RELAY is not set CONFIG_BLK_DEV_INITRD=y CONFIG_INITRAMFS_SOURCE= CONFIG_RD_GZIP=y # CONFIG_RD_BZIP2 is not set # CONFIG_RD_LZMA is not set # CONFIG_RD_XZ is not set # CONFIG_RD_LZO is not set #
Re: [josef-btrfs:master 9/11] fs/btrfs/transaction.c:1118:6: warning: 'parent' may be used uninitialized in this function
On Thu, 2 Aug 2012 10:31:23 +0800, Fengguang Wu wrote: There are new compile warnings show up in tree: git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git master head: 8e1163044779e90662e96887cdd692c1594146ad commit: dd817e81e81bbb83b63317b169d0e57a5d7ae0e1 [9/11] Btrfs: fix error path in create_pending_snapshot() config: x86_64-randconfig-h031 (attached as .config) All error/warnings: fs/btrfs/transaction.c: In function 'create_pending_snapshot': fs/btrfs/transaction.c:1118:6: warning: 'parent' may be used uninitialized in this function [-Wmaybe-uninitialized] fs/btrfs/transaction.c:1119:19: warning: 'rsv' may be used uninitialized in this function [-Wmaybe-uninitialized] vim +1118 fs/btrfs/transaction.c 1115if (ret) 1116goto abort_trans; 1117fail: 1118 dput(parent); 1119trans-block_rsv = rsv; 1120no_free_objectid: 1121kfree(new_root_item); The patch I sent before is based on the v3.5.0, not Chris's latest for-linus branch, so... I will send new patches to fix it. Thanks for your report. Miao --- 0-DAY kernel build testing backend Open Source Technology Centre Fengguang Wu w...@linux.intel.com Intel Corporation -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH V3 2/2] Btrfs: fix the snapshot that should not exist
The snapshot should be the image of the fs tree before it was created, so the metadata of the snapshot should not exist in the its tree. But now, we found the directory item and directory name index is in both the snapshot tree and the fs tree. It introduces some problems and makes the users feel strange: # mkfs.btrfs /dev/sda1 # mount /dev/sda1 /mnt # mkdir /mnt/1 # cd /mnt/1 # btrfs subvolume snapshot /mnt snap0 # ls -a /mnt/1/snap0/1 . .. [no other file/dir] # ll /mnt/1/snap0/ total 0 drwxr-xr-x 1 root root 10 Ju1 24 12:11 1 ^^^ There is no file/dir in it, but it's size is 10 # cd /mnt/1/snap0/1/snap0 [Enter a unexisted directory successfully...] There is nothing in the directory 1 in snap0, but btrfs told the length of this directory is 10. Beside that, we can enter an unexisted directory, it is very strange to the users. # btrfs subvolume snapshot /mnt/1/snap0 /mnt/snap1 # ll /mnt/1/snap0/1/ total 0 [None] # ll /mnt/snap1/1/ total 0 drwxr-xr-x 1 root root 0 Ju1 24 12:14 snap0 And the source of snap1 did have any directory in Directory 1, but snap1 have a snap0, it is different between the source and the snapshot. So I think we should insert directory item and directory name index and update the parent inode as the last step of snapshot creation, and do not leave the useless metadata in the file tree. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v2 - v3: - rebase on the latest for-linus branch Changelog v1 - v2: - add comment to explain why we need deal with the delayed items after snapshot creation and why this operation do not corrupt the metadata. - move dput() to patch 1/2 --- fs/btrfs/transaction.c | 66 +-- 1 files changed, 52 insertions(+), 14 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 4e9c106..7943dc2 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -950,6 +950,8 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans, struct btrfs_root *parent_root; struct btrfs_block_rsv *rsv; struct inode *parent_inode; + struct btrfs_path *path; + struct btrfs_dir_item *dir_item; struct dentry *parent; struct dentry *dentry; struct extent_buffer *tmp; @@ -962,6 +964,12 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans, u64 root_flags; uuid_le new_uuid; + path = btrfs_alloc_path(); + if (!path) { + ret = pending-error = -ENOMEM; + goto path_alloc_fail; + } + new_root_item = kmalloc(sizeof(*new_root_item), GFP_NOFS); if (!new_root_item) { ret = pending-error = -ENOMEM; @@ -1010,22 +1018,20 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans, */ ret = btrfs_set_inode_index(parent_inode, index); BUG_ON(ret); /* -ENOMEM */ - ret = btrfs_insert_dir_item(trans, parent_root, - dentry-d_name.name, dentry-d_name.len, - parent_inode, key, - BTRFS_FT_DIR, index); - if (ret == -EEXIST) { + + /* check if there is a file/dir which has the same name. */ + dir_item = btrfs_lookup_dir_item(NULL, parent_root, path, +btrfs_ino(parent_inode), +dentry-d_name.name, +dentry-d_name.len, 0); + if (dir_item != NULL !IS_ERR(dir_item)) { pending-error = -EEXIST; goto fail; - } else if (ret) { + } else if (IS_ERR(dir_item)) { + ret = PTR_ERR(dir_item); goto abort_trans; } - - btrfs_i_size_write(parent_inode, parent_inode-i_size + -dentry-d_name.len * 2); - ret = btrfs_update_inode(trans, parent_root, parent_inode); - if (ret) - goto abort_trans; + btrfs_release_path(path); /* * pull in the delayed directory update @@ -1113,12 +1119,29 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans, ret = btrfs_reloc_post_snapshot(trans, pending); if (ret) goto abort_trans; + + ret = btrfs_insert_dir_item(trans, parent_root, + dentry-d_name.name, dentry-d_name.len, + parent_inode, key, + BTRFS_FT_DIR, index); + /* We have check then name at the beginning, so it is impossible. */ + BUG_ON(ret == -EEXIST); + if (ret) + goto abort_trans; + + btrfs_i_size_write(parent_inode, parent_inode-i_size + +dentry-d_name.len * 2); + ret =