Re: How can btrfs take 23sec to stat 23K files from an SSD?

2012-08-01 Thread Marc MERLIN
On Fri, Jul 27, 2012 at 11:42:39AM -0700, Marc MERLIN wrote:
  https://oss.oracle.com/~mason/latencytop.patch
 
 Thanks for the patch, and yes I can confirm I'm definitely not pegged on CPU 
 (not even close and I get the same problem with unencrypted filesystem, 
 actually
 du -sh is exactly the same speed on encrypted and unecrypted).
 
 Here's the result I think you were looking for. I'm not good at reading this,
 but hopefully it tells you something useful :)
 
 The full run is here if that helps:
 http://marc.merlins.org/tmp/latencytop.txt
 
I did some other tests since last week since my laptop is hard to use
considering how slow the SSD is.

(TL;DR: ntfs on linux via fuse is 33% faster than ext4, which is 2x faster
than btrfs, but 3x slower than the same filesystem on spinning disk :( )


Ok, just to help with debuggging this,
1)  I put my samsung 830 SSD into another thinkpad and it wasn't faster or
slower.

2) Then I put a crucial 256 C300 SSD (the replacement for the one I had that
just died and killed all my data), and du took 0.3 seconds on both my old
and new thinkpads.
The old thinkpad is running ubuntu 32bit the new one debian testing 64bit
both with kernel 3.4.4.

So, clearly, there is something wrong with the samsung 830 SSD with linux
but I have no clue what :(
In raw speed (dd) the samsung is faster than the crucial (350MB/s vs
500MB/s).
It it were a random crappy SSD from a random vendor, I'd blame the SSD, but
I have a hard time believing that samsung is selling SSDs that are slower
than hard drives at random IO and 'seeks'.

3) I just got a 2nd ssd from samsung (same kind), just to make sure the one
I had wasn't bad. It's brand new, and I formatted it carefully on 512
boundaries:
/dev/sda12048  502271  250112   83  Linux
/dev/sda2  50227252930559262141447  HPFS/NTFS/exFAT
/dev/sda3529305607390207910485760   82  Linux swap / Solaris
/dev/sda473902080  1000215215   463156568   83  Linux

I also upgraded to 3.5.0 in the meantime but unfortunately the results are
similar.

First: btrfs is the slowest:
gandalfthegreat:/mnt/ssd/var/local# time du -sh src/
514Msrc/
real0m25.741s
gandalfthegreat:/mnt/ssd/var/local# grep /mnt/ssd/var /proc/mounts 
/dev/mapper/ssd /mnt/ssd/var btrfs 
rw,noatime,compress=lzo,ssd,discard,space_cache 0 0


Second: ext4 is 2x faster than btrfs with mkfs.ext4 -O extent -b 4096 /dev/sda3
gandalfthegreat:/mnt/mnt3# reset_cache
gandalfthegreat:/mnt/mnt3# time du -sh src/
519Msrc/
real0m12.459s
gandalfthegreat:~# grep mnt3 /proc/mounts
/dev/sda3 /mnt/mnt3 ext4 rw,noatime,discard,data=ordered 0 0

Third, A freshly made ntfs filesystem through fuse is actually FASTER!
gandalfthegreat:/mnt/mnt2# reset_cache 
gandalfthegreat:/mnt/mnt2# time du -sh src/
506Msrc/
real0m8.928s
gandalfthegreat:/mnt/mnt2# grep mnt2 /proc/mounts
/dev/sda2 /mnt/mnt2 fuseblk 
rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other,blksize=4096 0 0

How can ntfs via fuse be the fastest and btrfs so slow?
Of course, all 3 are slower than the same filesystem on spinning too, but
I'm wondering if there is a scheduling issue that is somehow causing the
extreme slowness I'm seeing.

Did the latencytop trace I got help in any way?

Thanks,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How can btrfs take 23sec to stat 23K files from an SSD?

2012-08-01 Thread Fajar A. Nugraha
On Wed, Aug 1, 2012 at 1:01 PM, Marc MERLIN m...@merlins.org wrote:

 So, clearly, there is something wrong with the samsung 830 SSD with linux


 It it were a random crappy SSD from a random vendor, I'd blame the SSD, but
 I have a hard time believing that samsung is selling SSDs that are slower
 than hard drives at random IO and 'seeks'.

You'd be surprised on how badly some vendors can screw up :)


 First: btrfs is the slowest:

 gandalfthegreat:/mnt/ssd/var/local# grep /mnt/ssd/var /proc/mounts
 /dev/mapper/ssd /mnt/ssd/var btrfs 
 rw,noatime,compress=lzo,ssd,discard,space_cache 0 0

Just checking, did you explicitly activate discard? Cause on my
setup (with corsair SSD) it made things MUCH slower. Also, try adding
noatime (just in case the slow down was because du cause many
access time updates)

-- 
Fajar
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How can btrfs take 23sec to stat 23K files from an SSD?

2012-08-01 Thread Marc MERLIN
On Wed, Aug 01, 2012 at 01:08:46PM +0700, Fajar A. Nugraha wrote:
  It it were a random crappy SSD from a random vendor, I'd blame the SSD, but
  I have a hard time believing that samsung is selling SSDs that are slower
  than hard drives at random IO and 'seeks'.
 
 You'd be surprised on how badly some vendors can screw up :)
 
At some point, it may come down to that indeed :-/
I'm still hopefully that Samsung didn't, but we'll see.
 
  First: btrfs is the slowest:
 
  gandalfthegreat:/mnt/ssd/var/local# grep /mnt/ssd/var /proc/mounts
  /dev/mapper/ssd /mnt/ssd/var btrfs 
  rw,noatime,compress=lzo,ssd,discard,space_cache 0 0
 
 Just checking, did you explicitly activate discard? Cause on my

Yes. Note that it should a noop when all you're doing is stating inodes and
not writing (I'm using noatime).

 setup (with corsair SSD) it made things MUCH slower. Also, try adding
 noatime (just in case the slow down was because du cause many
 access time updates)

I have noatime in there already :)

Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How can btrfs take 23sec to stat 23K files from an SSD?

2012-08-01 Thread Chris Samuel
On 01/08/12 16:01, Marc MERLIN wrote:

 Third, A freshly made ntfs filesystem through fuse is actually FASTER!

Could it be that Samsungs FTL has optimisations in it for NTFS ?

A horrible thought, but not impossible..

-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How can btrfs take 23sec to stat 23K files from an SSD?

2012-08-01 Thread Marc MERLIN
On Wed, Aug 01, 2012 at 04:36:22PM +1000, Chris Samuel wrote:
 On 01/08/12 16:01, Marc MERLIN wrote:
 
  Third, A freshly made ntfs filesystem through fuse is actually FASTER!
 
 Could it be that Samsungs FTL has optimisations in it for NTFS ?
 
 A horrible thought, but not impossible..

Not impossible, but it's can be the main reason since it clocks still 2x
slower with ntfs than a spinning disk with encrypted btrfs.
Since SSDs should seek 10-100x faster than spinning disks, that can't be
the only reason.

Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RESEND] Btrfs: fix that error value is changed by mistake

2012-08-01 Thread Stefan Behrens
In iterate_inodes_from_logical() the error result from
extent_from_logical() is patched by mistake. Typically ENOENT is
patched to EINVAL because (-ENOENT  BTRFS_EXTENT_FLAG_TREE_BLOCK)
evaluates to true.

Signed-off-by: Stefan Behrens sbehr...@giantdisaster.de
---
Resent because of the previously forgotten Signed-off-by line.

 fs/btrfs/backref.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index a256f3b..ff6475f 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1438,10 +1438,10 @@ int iterate_inodes_from_logical(u64 logical, struct 
btrfs_fs_info *fs_info,
ret = extent_from_logical(fs_info, logical, path,
found_key);
btrfs_release_path(path);
-   if (ret  BTRFS_EXTENT_FLAG_TREE_BLOCK)
-   ret = -EINVAL;
if (ret  0)
return ret;
+   if (ret  BTRFS_EXTENT_FLAG_TREE_BLOCK)
+   return -EINVAL;
 
extent_item_pos = logical - found_key.objectid;
ret = iterate_extent_inodes(fs_info, found_key.objectid,
-- 
1.7.11.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Questions and notes about send/receive

2012-08-01 Thread Alexander Block
On Tue, Jul 31, 2012 at 6:32 PM, Alex Lyakas
alex.bolshoy.bt...@gmail.com wrote:
 Hi Alexander,
 I relooked at my list of questions, and it seems that there are more
 general questions, and more focused questions. So here I list the
 more focused ones. I really appreciate if you can address them. I
 hope it's ok that I opened a separate thread for them. So here goes...

 iterate_dir_item(): tmp_path is not used
Removed.

 name_cache_insert(): if radix_tree_insert() fails, should we free the 
 nce_head?

 __get_cur_name_and_parent()
 if name_cache_insert() fails, should we free the nce?
Both, nce_head and nce are now freed in name_cache_insert.

 __get_cur_name_and_parent()
 if is_inode_existent() is false, and we generate unique name, we don't
 set *parent_ino/*parent_genI guess it's ok, it just looked strange
 when it showed up in my prints.
Yepp, that's ok. We don't have a valid parent ino/gen in that case. I
added a note to the function comment.

 btrfs_ioctl_send()
 struct file *filp = NULL; = this is not used

 btrfs_ioctl_set_received_subvol()
 if we fail to commit the transaction, should we rollback the in-memory
 values that we set, like stransid/rtransid etc? or they get rolled
 back automatically somehow (I need to study the transaction
 subsystem).
Hmm not sure about that. In case of transaction abort, the FS is not
writable anymore as far as I know, so no matter which values are
in-memory, the user won't be able to use them.

 process_recorded_refs():
 /*
  * If the inode is still orphan, unlink the orphan. This may
  * happen when a previous inode did overwrite the first ref
  * of this inode and no new refs were added for the current
  * inode.
  */
 This comment is partially misleading, I think. It implies that the
 orphan inode will be deleted. But it may happen that it will only be
 unlinked, because another hardlink to it remains (one of my tests goes
 through this path - inode is orphanized, then orphan unlinked, but
 another hardlink exists).
This behavior is expected. We do the orphan handling only on first
refs and never check if that may generate unnecessary commands. I made
it this way because it was the easiest and most secure solution, but
it has room for optimization. A added a note about this to the
comment.

 btrfs-progs::read_and_process_cmd()
 if (strstr(path, .bak_1.log)) {
 ret = 0;
 }
 this block is some leftover?
Hehe, yes :) I love such statements so that I can add a breakpoint to
ret=0 :) I removed it.

 __get_cur_name_and_parent()
 if we decided to call get_first_ref() on the send_root (due to ino 
 sctx-send_progress), do we really need to call did_overwrite_ref()?
 Because it will just lookup the send root again. I mean if we know
 that this inode has been already handled, then it's not an orphan
 anymore, so no need for this additional check. If my understanding is
 wrong, pls give some small example?
get_first_ref does not return the current state. It returns the
permanent state. The result of did_overwrite_ref however depends on
the current send progress. There is however some room for optimization
here. In many cases, I did not think much about double tree lookups
due to the different calls, as I wanted it to work first and later
optimize it. One possible optimization for example would be to cache
the results of lookups. But only for functions which return permanent
state, e.g. get_inode_info.

 process_recorded_refs():
 why we need to call did_overwrite_first_ref(), and not just call
 get_cur_path()? It looks like we need this only to set the is_orphan
 flag? Because, otherwise, get_cur_path() already does the
 overwrite_first_ref logic, and will return the correct path. Is my
 understanding correct? (I ran some tests and looks correct to me).
Hmm this is probably true. It was probably required in the past to
call did_overwrite_first_ref, not sure. I would like to keep it this
way for the moment as I don't want to risk introducing new bugs.

 find_extent_clone():
 /*
  * Now collect all backrefs.
  */
 extent_item_pos = logical - found_key.objectid;
 Is my understanding correct that extent_item_pos will now be equal to
 btrfs_file_extent_offset(eb, fi)?
Yepp. It's probably not needed to do it this way, but I left it this
way as this part of the code is copied from Jan's initial version of
btrfs send.

 process_recorded_refs():
 if (ret == inode_state_did_create ||
 ret == inode_state_no_change) {
 /* TODO delayed utimes */
 ret = send_utimes(sctx, un-val, un-aux);
 This code results in UTIMES sent twice for a parent dir of new inode:
 once because parent_dir gets changed (stamped with a transid) and
 second time because of this code. Is this fine? Also what TODO
 delayed utimes means?
I'm collection TODOs here:

Re: [PATCH 3/3] Btrfs-progs: list snapshots by generation

2012-08-01 Thread Liu Bo
On 08/01/2012 05:16 AM, Goffredo Baroncelli wrote:
 Hi Bo,
 
 On 07/31/2012 07:49 AM, Liu Bo wrote:
 The idea is that we usually use snapshot to backup/restore our data, and the
 common way can be a cron script which makes lots of snapshots, so we can end
 up with spending some time to find the latest snapshot to restore.

 This adds a feature for 'btrfs subvolume list' to let it list snapshots by 
 their
 _created_ generation.

 What we need to do is just to list them in descending order and get the 
 latest
 snapshot.  What's more, we can find the oldest snapshot as well by listing
 snapshots in ascending order.

 Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
 ---
  btrfs-list.c |  176 
 --
  cmds-subvolume.c |   19 +-
  2 files changed, 185 insertions(+), 10 deletions(-)

 []
  
  static const char * const cmd_subvol_list_usage[] = {
 -btrfs subvolume list [-p] path,
 +btrfs subvolume list [-ps] path,
  List subvolumes (and snapshots),
  ,
 --p print parent ID,
 +-p   print parent ID,
 +-s value list snapshots with generation in ascending/descending 
 order,
 + (1: ascending, 0: descending),
 
 Please change the user interface. I suggest something like:
 
 -s|-S list snapshots with generation in ascending|descending
   order   
 
 Or better
 
 -ssort by generation
 -Psort by path
 -rreverse the sort order
 

I prefer to the first one, since I have no any idea how to sort by path
by then.

 
 Anyway, whichever your choice will be, please remember to update the man
 page too.
 

ah, I should have remembered to update it, thanks for reminding. :)

Will do it soon, thanks for reviewing this!

thanks,
liubo

  NULL
  };
 
 [...]
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] Btrfs: remove superblock writing after fatal error

2012-08-01 Thread Stefan Behrens
With commit acce952b0, btrfs was changed to flag the filesystem with
BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal
error happened like a write I/O errors of all mirrors.
In such situations, on unmount, the superblock is written in
btrfs_error_commit_super(). This is done with the intention to be able
to evaluate the error flag on the next mount. A warning is printed
in this case during the next mount and the log tree is ignored.

The issue is that it is possible that the superblock points to a root
that was not written (due to write I/O errors).
The result is that the filesystem cannot be mounted. btrfsck also does
not start and all the other btrfs-progs tools fail to start as well.
However, mount -o recovery is working well and does the right things
to recover the filesystem (i.e., don't use the log root, clear the
free space cache and use the next mountable root that is stored in the
root backup array).

This patch removes the writing of the superblock when
BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error
flag in the mount function.

These lines can be used to reproduce the issue (using /dev/sdm):
SCRATCH_DEV=/dev/sdm
SCRATCH_MNT=/mnt
echo 0 25165824 linear $SCRATCH_DEV 0 | dmsetup create foo
ls -alLF /dev/mapper/foo
mkfs.btrfs /dev/mapper/foo
mount /dev/mapper/foo $SCRATCH_MNT
echo bar  $SCRATCH_MNT/foo
sync
echo 0 25165824 error | dmsetup reload foo
dmsetup resume foo
ls -alF $SCRATCH_MNT
touch $SCRATCH_MNT/1
ls -alF $SCRATCH_MNT
sleep 35
echo 0 25165824 linear $SCRATCH_DEV 0 | dmsetup reload foo
dmsetup resume foo
sleep 1
umount $SCRATCH_MNT
btrfsck /dev/mapper/foo
dmsetup remove foo

Signed-off-by: Stefan Behrens sbehr...@giantdisaster.de
Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
Changes v1 - v2:
- Removed one large comment entirely which was not needed anymore
  after the code changes.
- Removed a warning message which was dead code since the super block
  is not written anymore when the flag BTRFS_SUPER_FLAG_ERROR is set.
- Removed the return code of btrfs_error_commit_super() since it now
  does not return any errors anymore.

 fs/btrfs/disk-io.c | 36 
 fs/btrfs/disk-io.h |  2 +-
 2 files changed, 5 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 502b20c..f6a1d0f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2531,8 +2531,7 @@ retry_root_backup:
goto fail_trans_kthread;
 
/* do not make disk changes in broken FS */
-   if (btrfs_super_log_root(disk_super) != 0 
-   !(fs_info-fs_state  BTRFS_SUPER_FLAG_ERROR)) {
+   if (btrfs_super_log_root(disk_super) != 0) {
u64 bytenr = btrfs_super_log_root(disk_super);
 
if (fs_devices-rw_devices == 0) {
@@ -3192,30 +3191,14 @@ int close_ctree(struct btrfs_root *root)
/* clear out the rbtree of defraggable inodes */
btrfs_run_defrag_inodes(fs_info);
 
-   /*
-* Here come 2 situations when btrfs is broken to flip readonly:
-*
-* 1. when btrfs flips readonly somewhere else before
-* btrfs_commit_super, sb-s_flags has MS_RDONLY flag,
-* and btrfs will skip to write sb directly to keep
-* ERROR state on disk.
-*
-* 2. when btrfs flips readonly just in btrfs_commit_super,
-* and in such case, btrfs cannot write sb via btrfs_commit_super,
-* and since fs_state has been set BTRFS_SUPER_FLAG_ERROR flag,
-* btrfs will cleanup all FS resources first and write sb then.
-*/
if (!(fs_info-sb-s_flags  MS_RDONLY)) {
ret = btrfs_commit_super(root);
if (ret)
printk(KERN_ERR btrfs: commit super ret %d\n, ret);
}
 
-   if (fs_info-fs_state  BTRFS_SUPER_FLAG_ERROR) {
-   ret = btrfs_error_commit_super(root);
-   if (ret)
-   printk(KERN_ERR btrfs: commit super ret %d\n, ret);
-   }
+   if (fs_info-fs_state  BTRFS_SUPER_FLAG_ERROR)
+   btrfs_error_commit_super(root);
 
btrfs_put_block_group_cache(fs_info);
 
@@ -3437,18 +3420,11 @@ static int btrfs_check_super_valid(struct btrfs_fs_info 
*fs_info,
if (read_only)
return 0;
 
-   if (fs_info-fs_state  BTRFS_SUPER_FLAG_ERROR) {
-   printk(KERN_WARNING warning: mount fs with errors, 
-  running btrfsck is recommended\n);
-   }
-
return 0;
 }
 
-int btrfs_error_commit_super(struct btrfs_root *root)
+void btrfs_error_commit_super(struct btrfs_root *root)
 {
-   int ret;
-
mutex_lock(root-fs_info-cleaner_mutex);
btrfs_run_delayed_iputs(root);
mutex_unlock(root-fs_info-cleaner_mutex);
@@ -3458,10 +3434,6 @@ int btrfs_error_commit_super(struct btrfs_root *root)
 
/* cleanup FS via transaction */
btrfs_cleanup_transaction(root);
-
-   ret = 

Re: [PATCH v2] Btrfs: remove superblock writing after fatal error

2012-08-01 Thread Liu Bo
On 08/01/2012 07:45 PM, Stefan Behrens wrote:
 With commit acce952b0, btrfs was changed to flag the filesystem with
 BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal
 error happened like a write I/O errors of all mirrors.
 In such situations, on unmount, the superblock is written in
 btrfs_error_commit_super(). This is done with the intention to be able
 to evaluate the error flag on the next mount. A warning is printed
 in this case during the next mount and the log tree is ignored.
 
 The issue is that it is possible that the superblock points to a root
 that was not written (due to write I/O errors).
 The result is that the filesystem cannot be mounted. btrfsck also does
 not start and all the other btrfs-progs tools fail to start as well.
 However, mount -o recovery is working well and does the right things
 to recover the filesystem (i.e., don't use the log root, clear the
 free space cache and use the next mountable root that is stored in the
 root backup array).
 
 This patch removes the writing of the superblock when
 BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error
 flag in the mount function.
 

Yes, I have to admit that this can be a serious problem.

But we'll need to send the error flag stored in the super block into
disk in the future so that the next mount can find it unstable and do
fsck by itself maybe.

If we do not write the super any more, it cannot find the flag.

So can we find a way to make both things happy?

thanks,
liubo

 These lines can be used to reproduce the issue (using /dev/sdm):
 SCRATCH_DEV=/dev/sdm
 SCRATCH_MNT=/mnt
 echo 0 25165824 linear $SCRATCH_DEV 0 | dmsetup create foo
 ls -alLF /dev/mapper/foo
 mkfs.btrfs /dev/mapper/foo
 mount /dev/mapper/foo $SCRATCH_MNT
 echo bar  $SCRATCH_MNT/foo
 sync
 echo 0 25165824 error | dmsetup reload foo
 dmsetup resume foo
 ls -alF $SCRATCH_MNT
 touch $SCRATCH_MNT/1
 ls -alF $SCRATCH_MNT
 sleep 35
 echo 0 25165824 linear $SCRATCH_DEV 0 | dmsetup reload foo
 dmsetup resume foo
 sleep 1
 umount $SCRATCH_MNT
 btrfsck /dev/mapper/foo
 dmsetup remove foo
 
 Signed-off-by: Stefan Behrens sbehr...@giantdisaster.de
 Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
 ---
 Changes v1 - v2:
 - Removed one large comment entirely which was not needed anymore
   after the code changes.
 - Removed a warning message which was dead code since the super block
   is not written anymore when the flag BTRFS_SUPER_FLAG_ERROR is set.
 - Removed the return code of btrfs_error_commit_super() since it now
   does not return any errors anymore.
 
  fs/btrfs/disk-io.c | 36 
  fs/btrfs/disk-io.h |  2 +-
  2 files changed, 5 insertions(+), 33 deletions(-)
 
 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 502b20c..f6a1d0f 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -2531,8 +2531,7 @@ retry_root_backup:
   goto fail_trans_kthread;
  
   /* do not make disk changes in broken FS */
 - if (btrfs_super_log_root(disk_super) != 0 
 - !(fs_info-fs_state  BTRFS_SUPER_FLAG_ERROR)) {
 + if (btrfs_super_log_root(disk_super) != 0) {
   u64 bytenr = btrfs_super_log_root(disk_super);
  
   if (fs_devices-rw_devices == 0) {
 @@ -3192,30 +3191,14 @@ int close_ctree(struct btrfs_root *root)
   /* clear out the rbtree of defraggable inodes */
   btrfs_run_defrag_inodes(fs_info);
  
 - /*
 -  * Here come 2 situations when btrfs is broken to flip readonly:
 -  *
 -  * 1. when btrfs flips readonly somewhere else before
 -  * btrfs_commit_super, sb-s_flags has MS_RDONLY flag,
 -  * and btrfs will skip to write sb directly to keep
 -  * ERROR state on disk.
 -  *
 -  * 2. when btrfs flips readonly just in btrfs_commit_super,
 -  * and in such case, btrfs cannot write sb via btrfs_commit_super,
 -  * and since fs_state has been set BTRFS_SUPER_FLAG_ERROR flag,
 -  * btrfs will cleanup all FS resources first and write sb then.
 -  */
   if (!(fs_info-sb-s_flags  MS_RDONLY)) {
   ret = btrfs_commit_super(root);
   if (ret)
   printk(KERN_ERR btrfs: commit super ret %d\n, ret);
   }
  
 - if (fs_info-fs_state  BTRFS_SUPER_FLAG_ERROR) {
 - ret = btrfs_error_commit_super(root);
 - if (ret)
 - printk(KERN_ERR btrfs: commit super ret %d\n, ret);
 - }
 + if (fs_info-fs_state  BTRFS_SUPER_FLAG_ERROR)
 + btrfs_error_commit_super(root);
  
   btrfs_put_block_group_cache(fs_info);
  
 @@ -3437,18 +3420,11 @@ static int btrfs_check_super_valid(struct 
 btrfs_fs_info *fs_info,
   if (read_only)
   return 0;
  
 - if (fs_info-fs_state  BTRFS_SUPER_FLAG_ERROR) {
 - printk(KERN_WARNING warning: mount fs with errors, 
 -running btrfsck is recommended\n);
 - }
 -
   return 0;
  }
  
 -int 

Re: [PATCH] Btrfs: unlock extent when the dio completes for reads

2012-08-01 Thread Josef Bacik
On Tue, Jul 31, 2012 at 03:30:49PM -0600, Zach Brown wrote:
  +static void btrfs_direct_read_iodone(struct kiocb *iocb, loff_t offset,
  +ssize_t bytes, void *private, int ret,
  +bool is_async)
  +{
  +   struct inode *inode = iocb-ki_filp-f_mapping-host;
  +   u64 lockend;
  +
  +   if (get_state_private(BTRFS_I(inode)-io_tree, offset, lockend)) {
  +   WARN_ON(1);
  +   goto out;
  +   }
  +
  +   unlock_extent(BTRFS_I(inode)-io_tree, offset, lockend);
  +out:
  +   if (is_async)
  +   aio_complete(iocb, ret, 0);
  +   inode_dio_done(inode);
  +}
 
 Hmm.  I'm worried that this can be called from interrupt context and the
 state/extent calls there seem to use bare spin locks.  Did you run this
 with lockdep?  Am I confused?
 

Our read completions are run in a helper thread since we need to do things like
clear the extent state stuff and such so it's safe in this regard.  Course I hit
a lockup with xfstests 91 because there are some cases where we can return
errors without doing the dio complete (rightly so since we haven't setup the dio
yet), so now I have to figure out how to know when we've been unlocked via dio
complete or that I've gotten an error back and already unlocked it...

Josef
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] Btrfs: fix some endian bugs handling the root times

2012-08-01 Thread Alexander Block
On Mon, Jul 30, 2012 at 10:10 AM, Dan Carpenter
dan.carpen...@oracle.com wrote:
 trans-transid is cpu endian but we want to store the data as little
 endian.  item-ctime.nsec is only 32 bits, not 64.

 Signed-off-by: Dan Carpenter dan.carpen...@oracle.com
 ---
 Applies to linux-next.

 diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
 index 43f0012..a1fbca0 100644
 --- a/fs/btrfs/ioctl.c
 +++ b/fs/btrfs/ioctl.c
 @@ -424,7 +424,7 @@ static noinline int create_subvol(struct btrfs_root *root,
 uuid_le_gen(new_uuid);
 memcpy(root_item.uuid, new_uuid.b, BTRFS_UUID_SIZE);
 root_item.otime.sec = cpu_to_le64(cur_time.tv_sec);
 -   root_item.otime.nsec = cpu_to_le64(cur_time.tv_nsec);
 +   root_item.otime.nsec = cpu_to_le32(cur_time.tv_nsec);
 root_item.ctime = root_item.otime;
 btrfs_set_root_ctransid(root_item, trans-transid);
 btrfs_set_root_otransid(root_item, trans-transid);
 diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
 index 7ac7cdc..7208ada 100644
 --- a/fs/btrfs/transaction.c
 +++ b/fs/btrfs/transaction.c
 @@ -1061,7 +1061,7 @@ static noinline int create_pending_snapshot(struct 
 btrfs_trans_handle *trans,
 memcpy(new_root_item-parent_uuid, root-root_item.uuid,
 BTRFS_UUID_SIZE);
 new_root_item-otime.sec = cpu_to_le64(cur_time.tv_sec);
 -   new_root_item-otime.nsec = cpu_to_le64(cur_time.tv_nsec);
 +   new_root_item-otime.nsec = cpu_to_le32(cur_time.tv_nsec);
 btrfs_set_root_otransid(new_root_item, trans-transid);
 memset(new_root_item-stime, 0, sizeof(new_root_item-stime));
 memset(new_root_item-rtime, 0, sizeof(new_root_item-rtime));
 diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
 index 6bb465c..10d8e4d 100644
 --- a/fs/btrfs/root-tree.c
 +++ b/fs/btrfs/root-tree.c
 @@ -544,8 +544,8 @@ void btrfs_update_root_times(struct btrfs_trans_handle 
 *trans,
 struct timespec ct = CURRENT_TIME;

 spin_lock(root-root_times_lock);
 -   item-ctransid = trans-transid;
 +   item-ctransid = cpu_to_le64(trans-transid);
 item-ctime.sec = cpu_to_le64(ct.tv_sec);
 -   item-ctime.nsec = cpu_to_le64(ct.tv_nsec);
 +   item-ctime.nsec = cpu_to_le32(ct.tv_nsec);
 spin_unlock(root-root_times_lock);
  }

Thanks. I will include this patch in the next send/receive pull request.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] add crtime to the snapshot list

2012-08-01 Thread Liu Bo
On 08/01/2012 08:01 PM, Anand jain wrote:
 From: Anand anand.j...@oracle.com
 
  This patch adds creation-time to the snapshot list display,
  which would help user to better manage the snapshots when
  number of snapshots grow substantially. This patch is developed
  and on top of the send/receive btrfs and btrfs-progs repo at 
   git://github.com/ablock84/linux-btrfs.git (send-v2)
   git://github.com/ablock84/btrfs-progs.git (send-v2)
  respectively.
 

Hi Arand,

Can you please also post the patch itself here?

thanks,
liubo

  Further this patch has the dependency on the following patches
   Liu Bo:
[PATCH 2/3 RESEND] Btrfs-progs: show generation in command btrfs subvol 
 list
[PATCH 3/3] Btrfs-progs: list snapshots by generation
 
  Eg output:
#btrfs su list -s 1 /btrfs
ID 258 gen 39 cgen 6 top level 5 crtime 2012-07-27 17:43:55 path ss1
ID 260 gen 8 cgen 8 top level 5 crtime 2012-07-27 17:47:51 path ss2
ID 263 gen 16 cgen 16 top level 5 crtime 2012-07-29 00:50:19 path ss3
ID 264 gen 25 cgen 25 top level 5 crtime 2012-07-30 09:56:50 path sv1/.snap
 
 Anand Jain (1):
   Btrfs-progs: show crtime in the snapshot list
 
  btrfs-list.c |   45 -
  1 files changed, 36 insertions(+), 9 deletions(-)
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 7/7] Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive (part 2)

2012-08-01 Thread Alexander Block
On Mon, Jul 23, 2012 at 5:17 PM, Alex Lyakas
alex.bolshoy.bt...@gmail.com wrote:
 Hi Alexander,
 I did some testing of the case where same inode, but with a different
 generation, exists both in send_root and in parent_root.
 I know that this can happen primarily when inode_cache option is
 enabled. So first I just tested some differential sends, where parent
 and root are unrelated subvolumes. Here are some issues:

 1) The top subvolume inode (ino=BTRFS_FIRST_FREE_OBJECTID) is treated
 also as deleted + recreated. So the code goes into process_all_refs()
 path and does several strange things, such as trying to orphanize the
 top inode. Also get_cur_path() always returns  for the top subvolume
 (without checking whether it is an orphan).  Another complication for
 the top inode is that its parent dir is itself.
 I made the following fix:
 @@ -3782,7 +3972,13 @@ static int changed_inode(struct send_ctx *sctx,

 right_gen = btrfs_inode_generation(sctx-right_path-nodes[0],
 right_ii);
 -   if (left_gen != right_gen)
 +   if (left_gen != right_gen  sctx-cur_ino !=
 BTRFS_FIRST_FREE_OBJECTID)
 sctx-cur_inode_new_gen = 1;

 So basically, don't try to delete and re-create it, but treat it like
 a change. Since the top subvolume inode is S_IFDIR, and dir can have
 only one hardlink (and hopefully it is always ..), we will never
 need to change anything for this INODE_REF. I also added:

 @@ -2526,6 +2615,14 @@ static int process_recorded_refs(struct send_ctx *sctx)
 int did_overwrite = 0;
 int is_orphan = 0;

 +   BUG_ON(sctx-cur_ino = BTRFS_FIRST_FREE_OBJECTID);
I applied both fixes to for-chris now.

 2) After I fixed this, I hit another issue, where inodes under the top
 subvolume dir, attempt to rmdir() the top dir, while iterating over
 check_dirs in process_recorded_refs(), because (top_dir_ino,
 top_dir_gen) indicate that it was deleted. So I added:

 @@ -2714,10 +2857,19 @@ verbose_printk(btrfs: process_recorded_refs
 %llu\n, sctx-cur_ino);
  */
 ULIST_ITER_INIT(uit);
 while ((un = ulist_next(check_dirs, uit))) {
 +   /* Do not attempt to rmdir it the top subvolume dir */
 +   if (un-val == BTRFS_FIRST_FREE_OBJECTID)
 +   continue;
 +
 if (un-val  sctx-cur_ino)
 continue;
I applied a similar fix by adding a check to can_rmdir. The way you
suggested would also skip utime updates for the top dir.

 3) process_recorded_refs() always increments the send_progress:
 /*
  * Current inode is now at it's new position, so we must increase
  * send_progress
  */
 sctx-send_progress = sctx-cur_ino + 1;

 However, in the changed_inode() path I am testing, process_all_refs()
 is called twice with the same inode (once for deleted inode, once for
 the recreated inode), so after the first call, send_progress is
 incremented and doesn't match the inode anymore. I don't think I hit
 any issues because of this, just that it's confusing.
I fixed this issue some days ago.

 4)

 +/*
 + * Record and process all refs at once. Needed when an inode changes the
 + * generation number, which means that it was deleted and recreated.
 + */
 +static int process_all_refs(struct send_ctx *sctx,
 +   enum btrfs_compare_tree_result cmd)
 +{
 +   int ret;
 +   struct btrfs_root *root;
 +   struct btrfs_path *path;
 +   struct btrfs_key key;
 +   struct btrfs_key found_key;
 +   struct extent_buffer *eb;
 +   int slot;
 +   iterate_inode_ref_t cb;
 +
 +   path = alloc_path_for_send();
 +   if (!path)
 +   return -ENOMEM;
 +
 +   if (cmd == BTRFS_COMPARE_TREE_NEW) {
 +   root = sctx-send_root;
 +   cb = __record_new_ref;
 +   } else if (cmd == BTRFS_COMPARE_TREE_DELETED) {
 +   root = sctx-parent_root;
 +   cb = __record_deleted_ref;
 +   } else {
 +   BUG();
 +   }
 +
 +   key.objectid = sctx-cmp_key-objectid;
 +   key.type = BTRFS_INODE_REF_KEY;
 +   key.offset = 0;
 +   while (1) {
 +   ret = btrfs_search_slot_for_read(root, key, path, 1, 0);
 +   if (ret  0) {
 +   btrfs_release_path(path);
 +   goto out;
 +   }
 +   if (ret) {
 +   btrfs_release_path(path);
 +   break;
 +   }
 +
 +   eb = path-nodes[0];
 +   slot = path-slots[0];
 +   btrfs_item_key_to_cpu(eb, found_key, slot);
 +
 +   if (found_key.objectid != key.objectid ||
 +   found_key.type != key.type) {
 +   btrfs_release_path(path);
 +   break;
 +   }
 +
 +   ret = iterate_inode_ref(sctx, 

Re: [PATCH v2] Btrfs: remove superblock writing after fatal error

2012-08-01 Thread Liu Bo
On 08/01/2012 09:07 PM, Jan Schmidt wrote:
 On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote:
 On 08/01/2012 07:45 PM, Stefan Behrens wrote:
 With commit acce952b0, btrfs was changed to flag the filesystem with
 BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal
 error happened like a write I/O errors of all mirrors.
 In such situations, on unmount, the superblock is written in
 btrfs_error_commit_super(). This is done with the intention to be able
 to evaluate the error flag on the next mount. A warning is printed
 in this case during the next mount and the log tree is ignored.

 The issue is that it is possible that the superblock points to a root
 that was not written (due to write I/O errors).
 The result is that the filesystem cannot be mounted. btrfsck also does
 not start and all the other btrfs-progs tools fail to start as well.
 However, mount -o recovery is working well and does the right things
 to recover the filesystem (i.e., don't use the log root, clear the
 free space cache and use the next mountable root that is stored in the
 root backup array).

 This patch removes the writing of the superblock when
 BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error
 flag in the mount function.


 Yes, I have to admit that this can be a serious problem.

 But we'll need to send the error flag stored in the super block into
 disk in the future so that the next mount can find it unstable and do
 fsck by itself maybe.
 
 Hum, that's possible. However, I neither see
 
 a) a safe way to get that flag to disk
 
 nor
 
 b) a situation where this flag would help. When we abort a transaction, we 
 just
 roll everything back to the last commit, i.e. a consistent state. So if we 
 stop
 writing a potentially corrupt super block, we should be fine anyway. Or am I
 missing something?
 

I'm just wondering if we can roll everything back well, why do we need fsck?

thanks,
liubo

 -Jan
 

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: remove superblock writing after fatal error

2012-08-01 Thread Arne Jansen
On 01.08.2012 15:31, Liu Bo wrote:
 On 08/01/2012 09:07 PM, Jan Schmidt wrote:
 On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote:
 On 08/01/2012 07:45 PM, Stefan Behrens wrote:
 With commit acce952b0, btrfs was changed to flag the filesystem with
 BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal
 error happened like a write I/O errors of all mirrors.
 In such situations, on unmount, the superblock is written in
 btrfs_error_commit_super(). This is done with the intention to be able
 to evaluate the error flag on the next mount. A warning is printed
 in this case during the next mount and the log tree is ignored.

 The issue is that it is possible that the superblock points to a root
 that was not written (due to write I/O errors).
 The result is that the filesystem cannot be mounted. btrfsck also does
 not start and all the other btrfs-progs tools fail to start as well.
 However, mount -o recovery is working well and does the right things
 to recover the filesystem (i.e., don't use the log root, clear the
 free space cache and use the next mountable root that is stored in the
 root backup array).

 This patch removes the writing of the superblock when
 BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error
 flag in the mount function.


 Yes, I have to admit that this can be a serious problem.

 But we'll need to send the error flag stored in the super block into
 disk in the future so that the next mount can find it unstable and do
 fsck by itself maybe.

 Hum, that's possible. However, I neither see

 a) a safe way to get that flag to disk

 nor

 b) a situation where this flag would help. When we abort a transaction, we 
 just
 roll everything back to the last commit, i.e. a consistent state. So if we 
 stop
 writing a potentially corrupt super block, we should be fine anyway. Or am I
 missing something?

 
 I'm just wondering if we can roll everything back well, why do we need fsck?

Mostly for undetected errors.

 
 thanks,
 liubo
 
 -Jan

 
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] add crtime to the snapshot list

2012-08-01 Thread anand jain



  Eg output:
#btrfs su list -s 1 /btrfs
ID 258 gen 39 cgen 6 top level 5 crtime 2012-07-27 17:43:55 path ss1
ID 260 gen 8 cgen 8 top level 5 crtime 2012-07-27 17:47:51 path ss2
ID 263 gen 16 cgen 16 top level 5 crtime 2012-07-29 00:50:19 path ss3
ID 264 gen 25 cgen 25 top level 5 crtime 2012-07-30 09:56:50 path sv1/.snap

Is it possible to rename crtime to otime? I used otime to refer
creation time all the time. Using crtime could be confusing, as ctime
is the change time and rtime is the receive time.


 Yeah.  Will rename it to otime

 just to mention the choices were..
 Birth-time (as in stat output), crtime and otime.

Thanks, Anand
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: unlock extent when the dio completes for reads V2

2012-08-01 Thread Josef Bacik
A deadlock in xfstests 113 was uncovered by commit

d187663ef24cd3d033f0cbf2867e70b36a3a90b8

This is because we would not return EIOCBQUEUED for short AIO reads, instead
we'd wait for the DIO to complete and then return the amount of data we
transferred, which would allow our stuff to unlock the remaning amount.  But
with this change this no longer happens, so if we have a short AIO read (for
example if we try to read past EOF), we could leave the section from EOF to
the end of where we tried to read locked.  To fix this we need to store the
end of the region we've locked in our locked state bit and then only unlock
when the dio has finished completely.  This makes sure we always unlock the
entire range we locked when reading.  Before this patch we would hang when
running 113, no this no longer happens.  Thanks to Zach Brown for the idea,

Signed-off-by: Josef Bacik jba...@fusionio.com
---
V1-V2: fix a problem where we didn't unlock the range if io was never
submitted.
 fs/btrfs/inode.c |   38 --
 1 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 48bdfd2..4b82ae2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5810,9 +5810,6 @@ static int btrfs_get_blocks_direct(struct inode *inode, 
sector_t iblock,
if (!create  (em-block_start == EXTENT_MAP_HOLE ||
test_bit(EXTENT_FLAG_PREALLOC, em-flags))) {
free_extent_map(em);
-   /* DIO will do one hole at a time, so just unlock a sector */
-   unlock_extent(BTRFS_I(inode)-io_tree, start,
- start + root-sectorsize - 1);
return 0;
}
 
@@ -5961,8 +5958,6 @@ static void btrfs_endio_direct_read(struct bio *bio, int 
err)
bvec++;
} while (bvec = bvec_end);
 
-   unlock_extent(BTRFS_I(inode)-io_tree, dip-logical_offset,
- dip-logical_offset + dip-bytes - 1);
bio-bi_private = dip-private;
 
kfree(dip-csums);
@@ -6340,6 +6335,26 @@ static ssize_t check_direct_IO(struct btrfs_root *root, 
int rw, struct kiocb *io
 out:
return retval;
 }
+
+static void btrfs_direct_read_iodone(struct kiocb *iocb, loff_t offset,
+ssize_t bytes, void *private, int ret,
+bool is_async)
+{
+   struct inode *inode = iocb-ki_filp-f_mapping-host;
+   u64 lockend;
+
+   if (get_state_private(BTRFS_I(inode)-io_tree, offset, lockend)) {
+   WARN_ON(1);
+   goto out;
+   }
+
+   unlock_extent(BTRFS_I(inode)-io_tree, offset, lockend);
+out:
+   if (is_async)
+   aio_complete(iocb, ret, 0);
+   inode_dio_done(inode);
+}
+
 static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
const struct iovec *iov, loff_t offset,
unsigned long nr_segs)
@@ -6353,6 +6368,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
int writing = rw  WRITE;
int write_bits = 0;
size_t count = iov_length(iov, nr_segs);
+   dio_iodone_t *iodone = NULL;
 
if (check_direct_IO(BTRFS_I(inode)-root, rw, iocb, iov,
offset, nr_segs)) {
@@ -6438,6 +6454,9 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
 1, 0, cached_state, GFP_NOFS);
goto out;
}
+   } else {
+   set_state_private(BTRFS_I(inode)-io_tree, lockstart, lockend);
+   iodone = btrfs_direct_read_iodone;
}
 
free_extent_state(cached_state);
@@ -6445,9 +6464,16 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb 
*iocb,
 
ret = __blockdev_direct_IO(rw, iocb, inode,
   BTRFS_I(inode)-root-fs_info-fs_devices-latest_bdev,
-  iov, offset, nr_segs, btrfs_get_blocks_direct, NULL,
+  iov, offset, nr_segs, btrfs_get_blocks_direct, iodone,
   btrfs_submit_direct, 0);
 
+   /*
+* If we're reading we will be unlocked by the iodone call, unless ret
+* is 0 then we need to unlock since we didn't submit any io.
+*/
+   if (!writing  ret)
+   goto out;
+
if (ret  0  ret != -EIOCBQUEUED) {
clear_extent_bit(BTRFS_I(inode)-io_tree, offset,
  offset + iov_length(iov, nr_segs) - 1,
-- 
1.7.7.6

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Questions and notes about send/receive

2012-08-01 Thread Alex Lyakas
Alexander,
thanks for addressing the issues.

 __get_cur_name_and_parent()
 if we decided to call get_first_ref() on the send_root (due to ino 
 sctx-send_progress), do we really need to call did_overwrite_ref()?
 Because it will just lookup the send root again. I mean if we know
 that this inode has been already handled, then it's not an orphan
 anymore, so no need for this additional check. If my understanding is
 wrong, pls give some small example?
 get_first_ref does not return the current state. It returns the
 permanent state. The result of did_overwrite_ref however depends on
 the current send progress. There is however some room for optimization
 here. In many cases, I did not think much about double tree lookups
 due to the different calls, as I wanted it to work first and later
 optimize it. One possible optimization for example would be to cache
 the results of lookups. But only for functions which return permanent
 state, e.g. get_inode_info.
So do you think my understanding is correct that if (ino 
sctx-send_progress), then both functions will return the same value?
Or perhaps: if (ino  sctx-send_progress), then no need to check
anymore the did_overwrite_ref(), because we have handled that already?
I think that the call to did_overwrite_ref() is relevant only as long
as we haven't handled this ino. After that, we are good...I think.

I'm not saying the code has be changed/optimized at this point, just
testing my understanding:)


 struct btrfs_root_item:
 what is the difference between existing generation field that you
 mimic in generation_v2 and ctransid? (I know I need to study the
 root tree code).
 Did you read the comments for btrfs_root_item and
 btrfs_read_root_item? They should answer your question.
Yes, I read the comments, but they address only generation and
generation_v2, which I understand. Basically, I don't understand how
generation differs from ctransid (I need to dig more there).

Thanks!
Alex.




 Thank you,
 Alex.
 Big thanks for your reviews :)
 I pushed a new version of the kernel and progs side. The branches are
 now for-chris
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


kernel BUG at fs/btrfs/extent-tree.c:5038 (linux 3.4.7)

2012-08-01 Thread Olivier Bonvalet
Hi,

I have some trouble with a btrfs filesystem.
As you can see in logs, there is lines which are from btrfs (I supposed), then 
some warnings at fs/btrfs/extent-tree.c, and finally a kernel BUG at 
fs/btrfs/extent-tree.c:5038.

Do you need more informations ?


Aug  1 21:23:10 backup2 kernel: [7.015184] Btrfs loaded
Aug  1 21:23:10 backup2 kernel: [7.023314] device fsid 
a5dfe512-c8a3-46f0-bc8c-6365b8eccdcb devid 1 transid 110423 
/dev/mapper/vg--backupplug-backup
Aug  1 21:23:10 backup2 kernel: [7.024405] btrfs: force zlib compression
Aug  1 21:23:10 backup2 kernel: [7.024414] btrfs: not using ssd allocation 
scheme
Aug  1 21:23:10 backup2 kernel: [   17.938241] NET: Registered protocol family 
10
Aug  1 21:26:53 backup2 kernel: [   28.880027] eth0: no IPv6 routers present
Aug  1 21:33:39 backup2 kernel: [  435.285128] leaf 615015878656 total ptrs 28 
free space 1651
Aug  1 21:33:39 backup2 kernel: [  435.285146]  item 0 key (642867957760 a8 
4096) itemoff 3944 itemsize 51
Aug  1 21:33:39 backup2 kernel: [  435.285153]  extent refs 1 gen 
108436 flags 2
Aug  1 21:33:39 backup2 kernel: [  435.285162]  tree block key 
(18446744073709551606 80 1215799304192) level 0
Aug  1 21:33:39 backup2 kernel: [  435.285171]  tree block backref root 
7
Aug  1 21:33:39 backup2 kernel: [  435.285179]  item 1 key (642867970048 a8 
4096) itemoff 3893 itemsize 51
Aug  1 21:33:39 backup2 kernel: [  435.285186]  extent refs 1 gen 38936 
flags 258
Aug  1 21:33:39 backup2 kernel: [  435.285196]  tree block key (401571 
54 2912287038) level 0
Aug  1 21:33:39 backup2 kernel: [  435.285202]  shared block backref 
parent 615081041920
Aug  1 21:33:39 backup2 kernel: [  435.285208]  item 2 key (642867994624 a8 
4096) itemoff 3824 itemsize 69
Aug  1 21:33:39 backup2 kernel: [  435.285214]  extent refs 3 gen 38909 
flags 258
Aug  1 21:33:39 backup2 kernel: [  435.285221]  tree block key (13092 c 
12250) level 0
Aug  1 21:33:39 backup2 kernel: [  435.285226]  shared block backref 
parent 1306797543424
Aug  1 21:33:39 backup2 kernel: [  435.285231]  shared block backref 
parent 1306797494272
Aug  1 21:33:39 backup2 kernel: [  435.285236]  shared block backref 
parent 647007399936
Aug  1 21:33:39 backup2 kernel: [  435.285242]  item 3 key (642868002816 a8 
4096) itemoff 3773 itemsize 51
Aug  1 21:33:39 backup2 kernel: [  435.285248]  extent refs 1 gen 44304 
flags 2
Aug  1 21:33:39 backup2 kernel: [  435.285255]  tree block key 
(1824364519424 a8 94208) level 0
Aug  1 21:33:39 backup2 kernel: [  435.285260]  tree block backref root 
2
Aug  1 21:33:39 backup2 kernel: [  435.285265]  item 4 key (642868006912 a8 
4096) itemoff 3722 itemsize 51
Aug  1 21:33:39 backup2 kernel: [  435.285271]  extent refs 1 gen 44304 
flags 2
Aug  1 21:33:39 backup2 kernel: [  435.285278]  tree block key 
(1824369225728 a8 118784) level 0
Aug  1 21:33:39 backup2 kernel: [  435.285284]  tree block backref root 
2
Aug  1 21:33:39 backup2 kernel: [  435.285289]  item 5 key (642868011008 a8 
4096) itemoff 3671 itemsize 51
Aug  1 21:33:39 backup2 kernel: [  435.285295]  extent refs 1 gen 
101712 flags 258
Aug  1 21:33:39 backup2 kernel: [  435.285302]  tree block key (833 6c 
162816000) level 0
Aug  1 21:33:39 backup2 kernel: [  435.285307]  shared block backref 
parent 617940144128
Aug  1 21:33:39 backup2 kernel: [  435.285313]  item 6 key (642868015104 a8 
4096) itemoff 3593 itemsize 78
Aug  1 21:33:39 backup2 kernel: [  435.285321]  extent refs 4 gen 38909 
flags 258
Aug  1 21:33:39 backup2 kernel: [  435.285326]  tree block key (17721 
60 119) level 0
Aug  1 21:33:39 backup2 kernel: [  435.285331]  shared block backref 
parent 1314062098432
Aug  1 21:33:39 backup2 kernel: [  435.285336]  shared block backref 
parent 1314062094336
Aug  1 21:33:39 backup2 kernel: [  435.285341]  shared block backref 
parent 1312740683776
Aug  1 21:33:39 backup2 kernel: [  435.285348]  shared block backref 
parent 1312740675584
Aug  1 21:33:39 backup2 kernel: [  435.285354]  item 7 key (642868019200 a8 
4096) itemoff 3542 itemsize 51
Aug  1 21:33:39 backup2 kernel: [  435.285360]  extent refs 1 gen 44307 
flags 2
Aug  1 21:33:39 backup2 kernel: [  435.285365]  tree block key 
(18446744073709551606 80 1825532395520) level 0
Aug  1 21:33:39 backup2 kernel: [  435.285373]  tree block backref root 
7
Aug  1 21:33:39 backup2 kernel: [  435.285378]  item 8 key (642868023296 a8 
4096) itemoff 3491 itemsize 51
Aug  1 21:33:39 backup2 kernel: [  435.285384]  extent refs 1 gen 
101712 flags 258
Aug  1 21:33:39 backup2 kernel: [  435.285389]  tree block key (833 6c 
161284096) level 0
Aug  1 21:33:39 backup2 kernel: [  435.285396]  shared block backref 
parent 617940144128
Aug  1 21:33:39 backup2 kernel: [  435.285402]  item 

[PATCH] Btrfs: barrier before waitqueue_active

2012-08-01 Thread Josef Bacik
We need an smb_mb() before waitqueue_active to avoid missing wakeups.
Before Mitch was hitting a deadlock between the ordered flushers and the
transaction commit because the ordered flushers were waiting for more refs
and were never woken up, so those smp_mb()'s are the most important.
Everything else I added for correctness sake and to avoid getting bitten by
this again somewhere else.  Thanks,

Signed-off-by: Josef Bacik jba...@fusionio.com
---
 fs/btrfs/compression.c   |1 +
 fs/btrfs/delayed-inode.c |   16 ++--
 fs/btrfs/delayed-ref.c   |   18 --
 fs/btrfs/disk-io.c   |   11 ---
 fs/btrfs/inode.c |8 +---
 fs/btrfs/volumes.c   |8 +---
 6 files changed, 41 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 86eff48..43d1c5a 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -818,6 +818,7 @@ static void free_workspace(int type, struct list_head 
*workspace)
btrfs_compress_op[idx]-free_workspace(workspace);
atomic_dec(alloc_workspace);
 wake:
+   smp_mb();
if (waitqueue_active(workspace_wait))
wake_up(workspace_wait);
 }
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 335605c..8cc9b19 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -513,9 +513,11 @@ static void __btrfs_remove_delayed_item(struct 
btrfs_delayed_item *delayed_item)
rb_erase(delayed_item-rb_node, root);
delayed_item-delayed_node-count--;
atomic_dec(delayed_root-items);
-   if (atomic_read(delayed_root-items)  BTRFS_DELAYED_BACKGROUND 
-   waitqueue_active(delayed_root-wait))
-   wake_up(delayed_root-wait);
+   if (atomic_read(delayed_root-items)  BTRFS_DELAYED_BACKGROUND) {
+   smp_mb();
+   if (waitqueue_active(delayed_root-wait))
+   wake_up(delayed_root-wait);
+   }
 }
 
 static void btrfs_release_delayed_item(struct btrfs_delayed_item *item)
@@ -1057,9 +1059,11 @@ static void btrfs_release_delayed_inode(struct 
btrfs_delayed_node *delayed_node)
delayed_root = delayed_node-root-fs_info-delayed_root;
atomic_dec(delayed_root-items);
if (atomic_read(delayed_root-items) 
-   BTRFS_DELAYED_BACKGROUND 
-   waitqueue_active(delayed_root-wait))
-   wake_up(delayed_root-wait);
+   BTRFS_DELAYED_BACKGROUND) {
+   smp_mb();
+   if (waitqueue_active(delayed_root-wait))
+   wake_up(delayed_root-wait);
+   }
}
 }
 
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index da7419e..858ef02 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -662,9 +662,12 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info 
*fs_info,
add_delayed_tree_ref(fs_info, trans, ref-node, bytenr,
   num_bytes, parent, ref_root, level, action,
   for_cow);
-   if (!need_ref_seq(for_cow, ref_root) 
-   waitqueue_active(fs_info-tree_mod_seq_wait))
-   wake_up(fs_info-tree_mod_seq_wait);
+   if (!need_ref_seq(for_cow, ref_root)) {
+   smp_mb();
+   if (waitqueue_active(fs_info-tree_mod_seq_wait))
+   wake_up(fs_info-tree_mod_seq_wait);
+   }
+
spin_unlock(delayed_refs-lock);
if (need_ref_seq(for_cow, ref_root))
btrfs_qgroup_record_ref(trans, ref-node, extent_op);
@@ -713,9 +716,11 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
add_delayed_data_ref(fs_info, trans, ref-node, bytenr,
   num_bytes, parent, ref_root, owner, offset,
   action, for_cow);
-   if (!need_ref_seq(for_cow, ref_root) 
-   waitqueue_active(fs_info-tree_mod_seq_wait))
-   wake_up(fs_info-tree_mod_seq_wait);
+   if (!need_ref_seq(for_cow, ref_root)) {
+   smp_mb();
+   if (waitqueue_active(fs_info-tree_mod_seq_wait))
+   wake_up(fs_info-tree_mod_seq_wait);
+   }
spin_unlock(delayed_refs-lock);
if (need_ref_seq(for_cow, ref_root))
btrfs_qgroup_record_ref(trans, ref-node, extent_op);
@@ -744,6 +749,7 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info 
*fs_info,
   num_bytes, BTRFS_UPDATE_DELAYED_HEAD,
   extent_op-is_data);
 
+   smp_mb();
if (waitqueue_active(fs_info-tree_mod_seq_wait))
wake_up(fs_info-tree_mod_seq_wait);
spin_unlock(delayed_refs-lock);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 502b20c..a355c89 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -756,9 

Re: How can btrfs take 23sec to stat 23K files from an SSD?

2012-08-01 Thread Martin Steigerwald
Hi Marc,

Am Mittwoch, 1. August 2012 schrieb Marc MERLIN:
 On Wed, Aug 01, 2012 at 01:08:46PM +0700, Fajar A. Nugraha wrote:
   It it were a random crappy SSD from a random vendor, I'd blame the
   SSD, but I have a hard time believing that samsung is selling SSDs
   that are slower than hard drives at random IO and 'seeks'.
  
  You'd be surprised on how badly some vendors can screw up :)
 
 At some point, it may come down to that indeed :-/
 I'm still hopefully that Samsung didn't, but we'll see.

Its getting quite strange.

I lost track of whether you did that already or not, but if you didnĀ“t
please post some

vmstat 1

iostat -xd 1

on the device while it is being slow.

I am interested in wait I/O and latencies and disk utilization.


Comparison data of Intel SSD 320 in ThinkPad T520 during

merkaba:~ echo 3  /proc/sys/drop_caches ; du -sch /usr

on BTRFS with Kernel 3.5:


martin@merkaba:~ vmstat 1
procs ---memory-- ---swap-- -io -system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
 2  1  21556 4442668   2056 50235200   19485  247  120 11  2 87  0
 1  2  21556 440   2448 51488400 11684   328 4975 24585  5 16 65 14
 1  0  21556 4389880   2448 52806000 13400 0 4574 23452  2 16 68 14
 3  1  21556 4370068   2448 54505200 18132 0 5499 27220  1 18 64 16
 1  0  21556 4350228   2448 8000 10856 0 4122 25339  3 16 67 14
 1  1  21556 4315604   2448 56975600 12648 0 4647 31153  5 14 66 15
 0  1  21556 4295652   2456 58148000 1154856 4093 24618  2 13 69 16
 0  1  21556 4286720   2456 59158000 10824 0 3750 21445  1 12 71 16
 0  1  21556 4266308   2456 60362000 12932 0 4841 26447  4 12 68 17
 1  0  21556 4248228   2456 61380800 10264 4 3703 22108  1 13 71 15
 5  1  21556 4231976   2456 62435600 10540 0 3581 20436  1 10 72 17
 0  1  21556 4197168   2456 63910800 12952 0 4738 28223  4 15 66 15
 4  1  21556 4178456   2456 65055200 11656 0 4234 23480  2 14 68 16
 0  1  21556 4163616   2456 66299200 13652 0 4619 26580  1 16 70 13
 4  1  21556 4138288   2456 67569600 13352 0 4422 22254  1 16 70 13
 1  0  21556 4113204   2456 68906000 13232 0 4312 21936  1 15 70 14
 0  1  21556 4085532   2456 70416000 14972 0 4820 24238  1 16 69 14
 2  0  21556 4055740   2456 71964400 15736 0 5099 25513  3 17 66 14
 0  1  21556 4028612   2456 73438000 14504 0 4795 25052  3 15 68 14
 2  0  21556 3999108   2456 74904000 1465616 4672 21878  1 17 69 13
 1  1  21556 3972732   2456 76210800 12972 0 4717 22411  1 17 70 13
 5  0  21556 3949684   2584 77348400 1152852 4837 24107  3 15 67 15
 1  0  21556 3912504   2584 78742000 12156 0 4883 25201  4 15 67 14


martin@merkaba:~ iostat -xd 1 /dev/sda
Linux 3.5.0-tp520 (merkaba) 01.08.2012  _x86_64_(4 CPU)

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sda   1,29 1,44   11,58   12,78   684,74   299,7580,81 
0,249,860,95   17,93   0,29   0,71

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sda   0,00 0,00 2808,000,00 11232,00 0,00 8,00 
0,570,210,210,00   0,19  54,50

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sda   0,00 0,00 2967,000,00 11868,00 0,00 8,00 
0,630,210,210,00   0,21  60,90

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sda   0,0011,00 2992,004,00 11968,0056,00 8,03 
0,640,220,220,25   0,21  62,00

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sda   0,00 0,00 2680,000,00 10720,00 0,00 8,00 
0,700,260,260,00   0,25  66,70

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sda   0,00 0,00 3153,000,00 12612,00 0,00 8,00 
0,720,230,230,00   0,22  69,30

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sda   0,00 0,00 2769,000,00 11076,00 0,00 8,00 
0,630,230,230,00   0,21  58,00

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sda   0,00 0,00 2523,001,00 10092,00 4,00 8,00 
0,740,290,290,00   0,28  

Re: [PATCH] Btrfs: barrier before waitqueue_active

2012-08-01 Thread Mitch Harder
On Wed, Aug 1, 2012 at 3:25 PM, Josef Bacik jba...@fusionio.com wrote:
 We need an smb_mb() before waitqueue_active to avoid missing wakeups.
 Before Mitch was hitting a deadlock between the ordered flushers and the
 transaction commit because the ordered flushers were waiting for more refs
 and were never woken up, so those smp_mb()'s are the most important.
 Everything else I added for correctness sake and to avoid getting bitten by
 this again somewhere else.  Thanks,


This patch seems to make it tougher to hit a deadlock, but I'm still
encountering intermittent deadlocks using this patch when running
multiple rsync threads.

I've also tested Patch 2, and that has me hitting a deadlock even
quicker (when starting several copying threads).

I also found a slight performance hit using this patch.  On a 3.4.6
kernel (merged with the 3.5_rc for-linus branch), I would typically
complete my rsync test in ~265 seconds.  Also, I can't recall hitting
a deadlock on the 3.4.6 kernel (with 3.5_rc for-linus).  When using
this patch, the test would take ~310 seconds (when it didn't hit a
deadlock).

Here's the Delayed Tasks (Ctrl-SysRq-W) when using JUST this patch:

[ 1568.794030] SysRq : Show Blocked State
[ 1568.794101]   taskPC stack   pid father
[ 1568.794123] btrfs-endio-wri D 88012579c000 0  3845  2 0x
[ 1568.794128]  8801254f3c20 0046 8801254f2000
8801241b5a80
[ 1568.794132]  00012280 8801254f3fd8 00012280
4000
[ 1568.794136]  8801254f3fd8 00012280 880129af16a0
8801241b5a80
[ 1568.794140] Call Trace:
[ 1568.794179]  [a0068785] ? memcpy_extent_buffer+0x159/0x17a [btrfs]
[ 1568.794200]  [a0082ab7] ? find_ref_head+0xa3/0xc6 [btrfs]
[ 1568.794220]  [a008343c] ? btrfs_find_ref_cluster+0xdd/0x117 [btrfs]
[ 1568.794225]  [8162d58c] schedule+0x64/0x66
[ 1568.794241]  [a003fc86] btrfs_run_delayed_refs+0x269/0x3f0 [btrfs]
[ 1568.794246]  [8104b10e] ? wake_up_bit+0x2a/0x2a
[ 1568.794265]  [a004fdc4] __btrfs_end_transaction+0xca/0x283 [btrfs]
[ 1568.794283]  [a004ffda] btrfs_end_transaction+0x15/0x17 [btrfs]
[ 1568.794302]  [a00555da] btrfs_finish_ordered_io+0x2e4/0x334 [btrfs]
[ 1568.794306]  [8103b980] ? run_timer_softirq+0x2d4/0x2d4
[ 1568.794325]  [a005563f] finish_ordered_fn+0x15/0x17 [btrfs]
[ 1568.794344]  [a0070ef8] worker_loop+0x188/0x4e0 [btrfs]
[ 1568.794365]  [a0070d70] ? btrfs_queue_worker+0x275/0x275 [btrfs]
[ 1568.794384]  [a0070d70] ? btrfs_queue_worker+0x275/0x275 [btrfs]
[ 1568.794387]  [8104ac37] kthread+0x89/0x91
[ 1568.794391]  [8162fd74] kernel_thread_helper+0x4/0x10
[ 1568.794395]  [8104abae] ? kthread_freezable_should_stop+0x57/0x57
[ 1568.794398]  [8162fd70] ? gs_change+0xb/0xb
[ 1568.794400] btrfs-transacti D 88009912ba50 0  3851  2 0x
[ 1568.794403]  8801241cfc70 0046 8801241ce000
8801248cda80
[ 1568.794407]  00012280 8801241cffd8 00012280
4000
[ 1568.794411]  8801241cffd8 00012280 8801254b8000
8801248cda80
[ 1568.794415] Call Trace:
[ 1568.794436]  [a0066646] ? extent_writepages+0x53/0x5d [btrfs]
[ 1568.794455]  [a005357b] ?
uncompress_inline.clone.33+0x15f/0x15f [btrfs]
[ 1568.794459]  [810c9ada] ? pagevec_lookup_tag+0x24/0x2e
[ 1568.794478]  [a0052e0e] ? btrfs_writepages+0x27/0x29 [btrfs]
[ 1568.794481]  [810c90b1] ? do_writepages+0x20/0x29
[ 1568.794485]  [8162d58c] schedule+0x64/0x66
[ 1568.794505]  [a0061547]
btrfs_start_ordered_extent+0xde/0xfa [btrfs]
[ 1568.794508]  [8104b10e] ? wake_up_bit+0x2a/0x2a
[ 1568.794529]  [a0061984] ?
btrfs_lookup_first_ordered_extent+0x65/0x99 [btrfs]
[ 1568.794549]  [a0061a6a] btrfs_wait_ordered_range+0xb2/0xda [btrfs]
[ 1568.794569]  [a0061bcc]
btrfs_run_ordered_operations+0x13a/0x1c1 [btrfs]
[ 1568.794587]  [a004f5f5]
btrfs_commit_transaction+0x287/0x960 [btrfs]
[ 1568.794606]  [a00502b1] ? start_transaction+0x2d5/0x310 [btrfs]
[ 1568.794609]  [8104b10e] ? wake_up_bit+0x2a/0x2a
[ 1568.794627]  [a004913b] transaction_kthread+0x187/0x258 [btrfs]
[ 1568.794644]  [a0048fb4] ? btrfs_alloc_root+0x42/0x42 [btrfs]
[ 1568.794661]  [a0048fb4] ? btrfs_alloc_root+0x42/0x42 [btrfs]
[ 1568.794664]  [8104ac37] kthread+0x89/0x91
[ 1568.794668]  [8162fd74] kernel_thread_helper+0x4/0x10
[ 1568.794671]  [8104abae] ? kthread_freezable_should_stop+0x57/0x57
[ 1568.794674]  [8162fd70] ? gs_change+0xb/0xb
[ 1568.794676] flush-btrfs-1   D 88012579c000 0  3857  2 0x
[ 1568.794680]  880037125670 0046 880037124000
8801254b8000
[ 1568.794684]  00012280 880037125fd8 00012280
4000
[ 

[josef-btrfs:master 9/11] fs/btrfs/transaction.c:1118:6: warning: 'parent' may be used uninitialized in this function

2012-08-01 Thread Fengguang Wu
Hi Miao,

There are new compile warnings show up in

tree:   git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git 
master
head:   8e1163044779e90662e96887cdd692c1594146ad
commit: dd817e81e81bbb83b63317b169d0e57a5d7ae0e1 [9/11] Btrfs: fix error path 
in create_pending_snapshot()
config: x86_64-randconfig-h031 (attached as .config)

All error/warnings:

fs/btrfs/transaction.c: In function 'create_pending_snapshot':
fs/btrfs/transaction.c:1118:6: warning: 'parent' may be used uninitialized in 
this function [-Wmaybe-uninitialized]
fs/btrfs/transaction.c:1119:19: warning: 'rsv' may be used uninitialized in 
this function [-Wmaybe-uninitialized]

vim +1118 fs/btrfs/transaction.c
  1115  if (ret)
  1116  goto abort_trans;
  1117  fail:
 1118  dput(parent);
  1119  trans-block_rsv = rsv;
  1120  no_free_objectid:
  1121  kfree(new_root_item);

---
0-DAY kernel build testing backend Open Source Technology Centre
Fengguang Wu w...@linux.intel.com Intel Corporation
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 3.5.0 Kernel Configuration
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT=elf64-x86-64
CONFIG_ARCH_DEFCONFIG=arch/x86/configs/x86_64_defconfig
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
# CONFIG_GENERIC_ISA_DMA is not set
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_ARCH_MAY_HAVE_PC_FDC is not set
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_CPU_AUTOPROBE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_ARCH_HWEIGHT_CFLAGS=-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx 
-fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 
-fcall-saved-r11
CONFIG_ARCH_CPU_PROBE_RELEASE=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config
CONFIG_HAVE_IRQ_WORK=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=
CONFIG_LOCALVERSION=
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
# CONFIG_KERNEL_GZIP is not set
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
CONFIG_KERNEL_XZ=y
# CONFIG_KERNEL_LZO is not set
CONFIG_DEFAULT_HOSTNAME=(none)
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_POSIX_MQUEUE is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
# CONFIG_FHANDLE is not set
# CONFIG_TASKSTATS is not set
# CONFIG_AUDIT is not set
CONFIG_HAVE_GENERIC_HARDIRQS=y

#
# IRQ subsystem
#
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y

#
# RCU Subsystem
#
CONFIG_TREE_PREEMPT_RCU=y
CONFIG_PREEMPT_RCU=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set
CONFIG_TREE_RCU_TRACE=y
CONFIG_RCU_BOOST=y
CONFIG_RCU_BOOST_PRIO=1
CONFIG_RCU_BOOST_DELAY=500
CONFIG_IKCONFIG=y
CONFIG_LOG_BUF_SHIFT=17
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
# CONFIG_CGROUP_FREEZER is not set
CONFIG_CGROUP_DEVICE=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
# CONFIG_CGROUP_CPUACCT is not set
CONFIG_RESOURCE_COUNTERS=y
# CONFIG_CGROUP_MEM_RES_CTLR is not set
# CONFIG_CGROUP_PERF is not set
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_CFS_BANDWIDTH is not set
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
# CONFIG_CHECKPOINT_RESTORE is not set
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
# CONFIG_IPC_NS is not set
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_SCHED_AUTOGROUP=y
# CONFIG_SYSFS_DEPRECATED is not set
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=
CONFIG_RD_GZIP=y
# CONFIG_RD_BZIP2 is not set
# CONFIG_RD_LZMA is not set
# CONFIG_RD_XZ is not set
# CONFIG_RD_LZO is not set
# 

Re: [josef-btrfs:master 9/11] fs/btrfs/transaction.c:1118:6: warning: 'parent' may be used uninitialized in this function

2012-08-01 Thread Miao Xie
On Thu, 2 Aug 2012 10:31:23 +0800, Fengguang Wu wrote:
 There are new compile warnings show up in
 
 tree:   git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git 
 master
 head:   8e1163044779e90662e96887cdd692c1594146ad
 commit: dd817e81e81bbb83b63317b169d0e57a5d7ae0e1 [9/11] Btrfs: fix error path 
 in create_pending_snapshot()
 config: x86_64-randconfig-h031 (attached as .config)
 
 All error/warnings:
 
 fs/btrfs/transaction.c: In function 'create_pending_snapshot':
 fs/btrfs/transaction.c:1118:6: warning: 'parent' may be used uninitialized in 
 this function [-Wmaybe-uninitialized]
 fs/btrfs/transaction.c:1119:19: warning: 'rsv' may be used uninitialized in 
 this function [-Wmaybe-uninitialized]
 
 vim +1118 fs/btrfs/transaction.c
   1115if (ret)
   1116goto abort_trans;
   1117fail:
 1118 dput(parent);
   1119trans-block_rsv = rsv;
   1120no_free_objectid:
   1121kfree(new_root_item);

The patch I sent before is based on the v3.5.0, not Chris's latest for-linus 
branch, so...

I will send new patches to fix it. Thanks for your report.

Miao

 
 ---
 0-DAY kernel build testing backend Open Source Technology Centre
 Fengguang Wu w...@linux.intel.com Intel Corporation
 


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V3 2/2] Btrfs: fix the snapshot that should not exist

2012-08-01 Thread Miao Xie
The snapshot should be the image of the fs tree before it was created,
so the metadata of the snapshot should not exist in the its tree. But now, we
found the directory item and directory name index is in both the snapshot tree
and the fs tree. It introduces some problems and makes the users feel strange:

 # mkfs.btrfs /dev/sda1
 # mount /dev/sda1 /mnt
 # mkdir /mnt/1
 # cd /mnt/1
 # btrfs subvolume snapshot /mnt snap0
 # ls -a /mnt/1/snap0/1
 .  ..  [no other file/dir]

 # ll /mnt/1/snap0/
 total 0
 drwxr-xr-x 1 root root 10 Ju1 24 12:11 1
^^^
There is no file/dir in it, but it's size is 10

 # cd /mnt/1/snap0/1/snap0
 [Enter a unexisted directory successfully...]

There is nothing in the directory 1 in snap0, but btrfs told the length of
this directory is 10. Beside that, we can enter an unexisted directory, it is
very strange to the users.

 # btrfs subvolume snapshot /mnt/1/snap0 /mnt/snap1
 # ll /mnt/1/snap0/1/
 total 0
 [None]
 # ll /mnt/snap1/1/
 total 0
 drwxr-xr-x 1 root root 0 Ju1 24 12:14 snap0

And the source of snap1 did have any directory in Directory 1, but snap1 have
a snap0, it is different between the source and the snapshot.

So I think we should insert directory item and directory name index and update
the parent inode as the last step of snapshot creation, and do not leave the
useless metadata in the file tree.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v2 - v3:
- rebase on the latest for-linus branch

Changelog v1 - v2:
- add comment to explain why we need deal with the delayed items after
  snapshot creation and why this operation do not corrupt the metadata.
- move dput() to patch 1/2
---
 fs/btrfs/transaction.c |   66 +--
 1 files changed, 52 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 4e9c106..7943dc2 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -950,6 +950,8 @@ static noinline int create_pending_snapshot(struct 
btrfs_trans_handle *trans,
struct btrfs_root *parent_root;
struct btrfs_block_rsv *rsv;
struct inode *parent_inode;
+   struct btrfs_path *path;
+   struct btrfs_dir_item *dir_item;
struct dentry *parent;
struct dentry *dentry;
struct extent_buffer *tmp;
@@ -962,6 +964,12 @@ static noinline int create_pending_snapshot(struct 
btrfs_trans_handle *trans,
u64 root_flags;
uuid_le new_uuid;
 
+   path = btrfs_alloc_path();
+   if (!path) {
+   ret = pending-error = -ENOMEM;
+   goto path_alloc_fail;
+   }
+
new_root_item = kmalloc(sizeof(*new_root_item), GFP_NOFS);
if (!new_root_item) {
ret = pending-error = -ENOMEM;
@@ -1010,22 +1018,20 @@ static noinline int create_pending_snapshot(struct 
btrfs_trans_handle *trans,
 */
ret = btrfs_set_inode_index(parent_inode, index);
BUG_ON(ret); /* -ENOMEM */
-   ret = btrfs_insert_dir_item(trans, parent_root,
-   dentry-d_name.name, dentry-d_name.len,
-   parent_inode, key,
-   BTRFS_FT_DIR, index);
-   if (ret == -EEXIST) {
+
+   /* check if there is a file/dir which has the same name. */
+   dir_item = btrfs_lookup_dir_item(NULL, parent_root, path,
+btrfs_ino(parent_inode),
+dentry-d_name.name,
+dentry-d_name.len, 0);
+   if (dir_item != NULL  !IS_ERR(dir_item)) {
pending-error = -EEXIST;
goto fail;
-   } else if (ret) {
+   } else if (IS_ERR(dir_item)) {
+   ret = PTR_ERR(dir_item);
goto abort_trans;
}
-
-   btrfs_i_size_write(parent_inode, parent_inode-i_size +
-dentry-d_name.len * 2);
-   ret = btrfs_update_inode(trans, parent_root, parent_inode);
-   if (ret)
-   goto abort_trans;
+   btrfs_release_path(path);
 
/*
 * pull in the delayed directory update
@@ -1113,12 +1119,29 @@ static noinline int create_pending_snapshot(struct 
btrfs_trans_handle *trans,
ret = btrfs_reloc_post_snapshot(trans, pending);
if (ret)
goto abort_trans;
+
+   ret = btrfs_insert_dir_item(trans, parent_root,
+   dentry-d_name.name, dentry-d_name.len,
+   parent_inode, key,
+   BTRFS_FT_DIR, index);
+   /* We have check then name at the beginning, so it is impossible. */
+   BUG_ON(ret == -EEXIST);
+   if (ret)
+   goto abort_trans;
+
+   btrfs_i_size_write(parent_inode, parent_inode-i_size +
+dentry-d_name.len * 2);
+   ret =