Re: [PATCH] [RFC] Add btrfs autosnap feature

2012-03-05 Thread Anand Jain




Prior to making a new snapshot, grab the (stored) transid of the
previous snapshot, and check if any files have been modified in the
source since that transid: btrfs sub find ${source}
${previous_transid}. If nothing is returned, then you don't have to
bother making the snapshot at all, otherwise after making the
snapshot, grab the transid via btrfs sub find ${new_snapshot} -1,
and store it some place (even a dot file in the root of the snapshot
would work).



 There might be a small window of time in which the transid and the
 snapshot are out of sync, since there is no atomic command that
 provides both the snapshot and its transid, as in the example below.

 Assume tgw is a transaction group write that happens after we have
 read the transaction group id.
 ---
 sync; read current tran-id and compare
 (new tgw occurs)
 snapshot
 new tgw  occurs
 sync; read current tran-id again and store
 ---

 which will result in failing to take a snapshot even though there are changes.

 Certainly there will be some trade-off, and the logic below seems to be
 safer...
 ---
 sync; read current tran-id and compare with previous
 new tgw occurs
 snapshot
 new tgw occurs
 store tran_id+2 (since tran_id is incremented by two for a snapshot)
 ---

 which might lead to a situation where we have two identical snapshots,
 but that is a safer trade-off.
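
 The second sequence can be modelled with a minimal Python sketch. It is
 only a simulation under stated assumptions: FakeVolume, autosnap_run and
 the starting transid are hypothetical names and values, and the only
 behaviour taken from this mail is that a snapshot advances the transid
 by two.

```python
SNAPSHOT_TRANSID_COST = 2  # per the mail above: a snapshot adds two to the transid


class FakeVolume:
    """Hypothetical stand-in for a subvolume; it only tracks a transid."""

    def __init__(self, transid=10):
        self.transid = transid

    def write(self):
        # Models one transaction group write ("tgw").
        self.transid += 1

    def snapshot(self):
        self.transid += SNAPSHOT_TRANSID_COST


def autosnap_run(vol, stored, racing_write=False):
    """One pass of the conservative scheme.

    Returns (snapshot_taken, transid_to_store).
    """
    observed = vol.transid            # sync; read current tran-id and compare
    if observed == stored:
        return False, stored          # nothing changed, skip the snapshot
    if racing_write:
        vol.write()                   # new tgw occurs before the snapshot
    vol.snapshot()
    return True, observed + SNAPSHOT_TRANSID_COST


if __name__ == "__main__":
    vol = FakeVolume()
    stored = 0
    for racing in (True, False, False):
        taken, stored = autosnap_run(vol, stored, racing_write=racing)
        print(taken, stored, vol.transid)
```

 In this model a write that races with the snapshot costs one redundant
 snapshot on the next run, after which the stored and observed values
 agree again; no change is ever skipped.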

thanks, Anand
 
--

To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [RFC] Add btrfs autosnap feature

2012-03-05 Thread cwillu
  ---
  sync; read current tran-id and compare
  (new tgw occurs)
  snapshot
  new tgw  occurs
  sync; read current tran-id again and store
  ---

  which will result in failing to take snapshot even if there are changes.

btrfs sub find-new /snapshot- -1 shows the transid of the latest
change of the snapshot, not the whole filesystem.
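
A script that wants this value could parse it from the command output. The
sketch below is a hedged illustration, not part of btrfs-progs: it assumes
the output ends with a line of the form "transid marker was N" (the format
used by the printf in find_updated_files), and parse_transid_marker and
latest_transid are hypothetical helper names.

```python
import re
import subprocess

# Matches the trailing marker line, e.g. "transid marker was 1234".
_MARKER = re.compile(r"^transid marker was (\d+)\s*$", re.MULTILINE)


def parse_transid_marker(output):
    """Return N from a 'transid marker was N' line, or None if absent."""
    match = _MARKER.search(output)
    return int(match.group(1)) if match else None


def latest_transid(subvol, last_gen):
    """Hypothetical helper: run find-new on subvol and return its marker.

    last_gen is passed through unchanged as the generation argument.
    """
    result = subprocess.run(
        ["btrfs", "subvolume", "find-new", subvol, str(last_gen)],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_transid_marker(result.stdout)
```

Calling latest_transid() on the new snapshot rather than on the source is
what keeps later filesystem-wide transactions out of the stored value.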


Re: getdents - ext4 vs btrfs performance

2012-03-05 Thread Jacek Luczak
2012/3/4 Jacek Luczak difrost.ker...@gmail.com:
 2012/3/3 Jacek Luczak difrost.ker...@gmail.com:
 2012/3/2 Chris Mason chris.ma...@oracle.com:
 On Fri, Mar 02, 2012 at 03:16:12PM +0100, Jacek Luczak wrote:
 2012/3/2 Chris Mason chris.ma...@oracle.com:
  On Fri, Mar 02, 2012 at 11:05:56AM +0100, Jacek Luczak wrote:
 
  I've took both on tests. The subject is acp and spd_readdir used with
  tar, all on ext4:
  1) acp: http://91.234.146.107/~difrost/seekwatcher/acp_ext4.png
  2) spd_readdir: 
  http://91.234.146.107/~difrost/seekwatcher/tar_ext4_readir.png
  3) both: http://91.234.146.107/~difrost/seekwatcher/acp_vs_spd_ext4.png
 
  The acp looks much better than spd_readdir but directory copy with
  spd_readdir decreased to 52m 39sec (30 min less).
 
  Do you have stats on how big these files are, and how fragmented they
  are?  For acp and spd to give us this, I think something has gone wrong
  at writeback time (creating individual fragmented files).

 How big? Which files?

 All the files you're reading ;)

 filefrag will tell you how many extents each file has, any file with
 more than one extent is interesting.  (The ext4 crowd may have better
 suggestions on measuring fragmentation).

 Since you mention this is a compile farm, I'm guessing there are a bunch
 of .o files created by parallel builds.  There are a lot of chances for
 delalloc and the kernel writeback code to do the wrong thing here.


 [Most of files are B and K size]

 All files scanned: 1978149
 Files fragmented: 313 (0.015%) where 11 have 3+ extents
 Total size of fragmented files: 7GB (~13% of dir size)

 BTRFS: None of the files are fragmented according to filefrag - all fit
 into one extent.
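
 Counts like the ones above can be gathered by summarising filefrag
 output. This is a minimal sketch, assuming the default one-line-per-file
 format ("path: N extents found"); summarize_filefrag is a hypothetical
 helper and not the script used for the numbers in this mail.

```python
import re

# filefrag prints "path: 1 extent found" or "path: N extents found".
_EXTENTS = re.compile(r"^(?P<path>.+): (?P<count>\d+) extents? found\s*$")


def summarize_filefrag(lines):
    """Count scanned files, files with >1 extent, and files with 3+ extents."""
    scanned = fragmented = three_plus = 0
    for line in lines:
        match = _EXTENTS.match(line.rstrip("\n"))
        if not match:
            continue  # skip anything that is not a per-file summary line
        scanned += 1
        count = int(match.group("count"))
        if count > 1:
            fragmented += 1
        if count >= 3:
            three_plus += 1
    return {"scanned": scanned, "fragmented": fragmented, "three_plus": three_plus}
```

 Feeding it the output of filefrag run over every file in the tree gives
 the "scanned / fragmented / 3+ extents" breakdown directly.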

 tar cf on fragmented files:
 1) time: 7sec
 2) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented.png
 3) sw graph with spd_readdir:
 http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_spd.png
 4) both on one:
 http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_pure_spd.png

 BTRFS: tar on ext4 fragmented files
 1) time: 6sec
 2) sw graph: 
 http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_btrfs.png

 tar cf of fragmented files disturbed with [40,50) K files (in total
 4373 files). K files before fragmented M files:
 1) size: 7.2GB
 2) time: 1m 14sec
 3) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed.png
 4) sw graph with spd_readdir:
 http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_spd.png
 5) both on one:
 http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_pure_spd.png

 BTRFS: tar on [40,50) K and ext4 fragmented
 1) time: 56sec
 2) sw graph: 
 http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_btrfs.png

 New test I've included - randomly selected files:
 - size 240MB
 1) ext4 (time: 34sec) sw graph:
 http://91.234.146.107/~difrost/seekwatcher/tar_random_ext4.png
 2) btrfs (time: 55sec) sw graph:
 http://91.234.146.107/~difrost/seekwatcher/tar_random_btrfs.png

Yet another test. The original issue is in the directory data
handling. In my case a lot of dirs are introduced due to the extra .svn
directories. Let's see what tar on those dirs looks like.

Number of .svn directories: 61605
1) Ext4:
 - tar time: 10m 53sec
 - sw tar graph: http://91.234.146.107/~difrost/seekwatcher/svn_dir_ext4.png
 - sw tar graph with spd_readdir:
http://91.234.146.107/~difrost/seekwatcher/svn_dir_spd_ext4.png
2) Btrfs:
 - tar time: 4m 35sec
 - sw tar graph: http://91.234.146.107/~difrost/seekwatcher/svn_dir_btrfs.png
 - sw tar graph with ext4:
http://91.234.146.107/~difrost/seekwatcher/svn_dir_btrfs_ext4.png

IMO this is not a writeback issue (well, it could be, but then it would
mean that writeback is broken in general), and it's not fragmentation.
Sorting files in readdir helps a bit but is still far behind btrfs.

Any ideas? Is this an issue, or are things just the way they are and one
needs to live with it?

-Jacek


Re: getdents - ext4 vs btrfs performance

2012-03-05 Thread Jan Kara
On Fri 02-03-12 14:32:15, Ted Tso wrote:
 On Fri, Mar 02, 2012 at 09:26:51AM -0500, Chris Mason wrote:
 It would be interesting to have a project where someone added
 fallocate() support into libelf, and then added some heuristics into
 ext4 so that if a file is fallocated to a precise size, or if the file
 is fully written and closed before writeback begins, that we use this
 to more efficiently pack the space used by the files by the block
 allocator.  This is a place where I would not be surprised that XFS
 has some better code to avoid accelerated file system aging, and where
 we could do better with ext4 with some development effort.
  AFAIK XFS people actually prefer that applications let them do their work
using delayed allocation and do not interfere with fallocate(2) calls. The
problem they have with fallocate(2) is that it forces you to allocate
blocks while with delayed allocation you can make the decision about
allocation later. So for small files which completely fit into pagecache
before they get pushed out by writeback, they can make better decisions
from delayed allocation. Just dumping my memory from some other thread...

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR


[PATCH 17/19] btrfs: Convert to new freezing mechanism

2012-03-05 Thread Jan Kara
We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
freeze protection to btrfs_page_mkwrite(). Checks in cleaner_kthread() and
transaction_kthread() can be safely removed since btrfs_freeze() will lock
the mutexes and thus block the threads (and they shouldn't have anything to
do anyway).

CC: linux-btrfs@vger.kernel.org
CC: Chris Mason chris.ma...@oracle.com
Signed-off-by: Jan Kara j...@suse.cz
---
 fs/btrfs/disk-io.c |    3 ---
 fs/btrfs/file.c    |    3 ++-
 fs/btrfs/inode.c   |    6 +++++-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 811d9f9..fc0f74c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1586,8 +1586,6 @@ static int cleaner_kthread(void *arg)
 	struct btrfs_root *root = arg;
 
 	do {
-		vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
-
 		if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
 		    mutex_trylock(&root->fs_info->cleaner_mutex)) {
 			btrfs_run_delayed_iputs(root);
@@ -1618,7 +1616,6 @@ static int transaction_kthread(void *arg)
 
 	do {
 		delay = HZ * 30;
-		vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
 		mutex_lock(&root->fs_info->transaction_kthread_mutex);
 
 		spin_lock(&root->fs_info->trans_lock);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 859ba2d..1aac7ca 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1348,7 +1348,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 	ssize_t err = 0;
 	size_t count, ocount;
 
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	sb_start_write(inode->i_sb);
 
 	mutex_lock(&inode->i_mutex);
 
@@ -1439,6 +1439,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 			num_written = err;
 	}
 out:
+	sb_end_write(inode->i_sb);
 	current->backing_dev_info = NULL;
 	return num_written ? num_written : err;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 32214fe..63c9006 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6405,6 +6405,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	u64 page_start;
 	u64 page_end;
 
+	sb_start_pagefault(inode->i_sb);
 	ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
 	if (!ret) {
 		ret = btrfs_update_time(vma->vm_file);
@@ -6495,12 +6496,15 @@ again:
 	unlock_extent_cached(io_tree, page_start, page_end, &cached_state, GFP_NOFS);
 
 out_unlock:
-	if (!ret)
+	if (!ret) {
+		sb_end_pagefault(inode->i_sb);
 		return VM_FAULT_LOCKED;
+	}
 	unlock_page(page);
 out:
 	btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
 out_noreserve:
+	sb_end_pagefault(inode->i_sb);
 	return ret;
 }
 
-- 
1.7.1



[PATCH 00/19] Fix filesystem freezing deadlocks

2012-03-05 Thread Jan Kara
  Hallelujah,

  after a couple of weeks and several rewrites, here comes the third iteration
of my patches to improve filesystem freezing.  Filesystem freezing is currently
racy and thus we can end up with dirty data on a frozen filesystem (see the
changelog of patch 06 for a detailed race description). This patch series aims
at fixing this.

To be able to block all places where inodes get dirtied, I've moved filesystem
freeze handling into mnt_want_write() / mnt_drop_write(). This however required
some code shuffling and changes to kern_path_create() (see patches 02-05). I
think the result is OK but opinions may differ ;). The advantage of this change
also is that all filesystems get freeze protection almost for free - even ext2
can handle freezing well now.

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs) so I'm
reasonably confident this could finally be the right solution.

And for people wanting to test - this patchset is based on the patch series
"Push file_update_time() into .page_mkwrite", so you'll need to pull that one
in as well.

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
    and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
    m_active_trans counter

Honza

CC: Alex Elder el...@kernel.org
CC: Anton Altaparmakov an...@tuxera.com
CC: Ben Myers b...@sgi.com
CC: Chris Mason chris.ma...@oracle.com
CC: cluster-de...@redhat.com
CC: David S. Miller da...@davemloft.net
CC: fuse-de...@lists.sourceforge.net
CC: J. Bruce Fields bfie...@fieldses.org
CC: Joel Becker jl...@evilplan.org
CC: KONISHI Ryusuke konishi.ryus...@lab.ntt.co.jp
CC: linux-btrfs@vger.kernel.org
CC: linux-e...@vger.kernel.org
CC: linux-...@vger.kernel.org
CC: linux-ni...@vger.kernel.org
CC: linux-ntfs-...@lists.sourceforge.net
CC: Mark Fasheh mfas...@suse.com
CC: Miklos Szeredi mik...@szeredi.hu
CC: ocfs2-de...@oss.oracle.com
CC: OGAWA Hirofumi hirof...@mail.parknet.co.jp
CC: Steven Whitehouse swhit...@redhat.com
CC: Theodore Ts'o ty...@mit.edu
CC: x...@oss.sgi.com


[PATCH 04/19] btrfs: Push mnt_want_write() outside of i_mutex

2012-03-05 Thread Jan Kara
When mnt_want_write() starts to handle freezing it will get full lock
semantics, requiring proper lock ordering. So push the mnt_want_write() call
consistently outside of i_mutex.

CC: Chris Mason chris.ma...@oracle.com
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara j...@suse.cz
---
 fs/btrfs/ioctl.c |   23 +++++++++++------------
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 03bb62a..c855e55 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -192,6 +192,10 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 	if (!inode_owner_or_capable(inode))
 		return -EACCES;
 
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+
 	mutex_lock(&inode->i_mutex);
 
 	ip_oldflags = ip->flags;
@@ -206,10 +210,6 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 		}
 	}
 
-	ret = mnt_want_write_file(file);
-	if (ret)
-		goto out_unlock;
-
 	if (flags & FS_SYNC_FL)
 		ip->flags |= BTRFS_INODE_SYNC;
 	else
@@ -271,9 +271,9 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 		inode->i_flags = i_oldflags;
 	}
 
-	mnt_drop_write_file(file);
  out_unlock:
 	mutex_unlock(&inode->i_mutex);
+	mnt_drop_write_file(file);
 	return ret;
 }
 
@@ -624,6 +624,10 @@ static noinline int btrfs_mksubvol(struct path *parent,
 	struct dentry *dentry;
 	int error;
 
+	error = mnt_want_write(parent->mnt);
+	if (error)
+		return error;
+
 	mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);
 
 	dentry = lookup_one_len(name, parent->dentry, namelen);
@@ -635,13 +639,9 @@ static noinline int btrfs_mksubvol(struct path *parent,
 	if (dentry->d_inode)
 		goto out_dput;
 
-	error = mnt_want_write(parent->mnt);
-	if (error)
-		goto out_dput;
-
 	error = btrfs_may_create(dir, dentry);
 	if (error)
-		goto out_drop_write;
+		goto out_dput;
 
 	down_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
 
@@ -659,12 +659,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
 		fsnotify_mkdir(dir, dentry);
 out_up_read:
 	up_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
-out_drop_write:
-	mnt_drop_write(parent->mnt);
 out_dput:
 	dput(dentry);
 out_unlock:
 	mutex_unlock(&dir->i_mutex);
+	mnt_drop_write(parent->mnt);
 	return error;
 }
 
-- 
1.7.1



Re: getdents - ext4 vs btrfs performance

2012-03-05 Thread Chris Mason
On Mon, Mar 05, 2012 at 12:32:45PM +0100, Jacek Luczak wrote:
 2012/3/4 Jacek Luczak difrost.ker...@gmail.com:
  2012/3/3 Jacek Luczak difrost.ker...@gmail.com:
  2012/3/2 Chris Mason chris.ma...@oracle.com:
  On Fri, Mar 02, 2012 at 03:16:12PM +0100, Jacek Luczak wrote:
  2012/3/2 Chris Mason chris.ma...@oracle.com:
   On Fri, Mar 02, 2012 at 11:05:56AM +0100, Jacek Luczak wrote:
  
   I've took both on tests. The subject is acp and spd_readdir used with
   tar, all on ext4:
   1) acp: http://91.234.146.107/~difrost/seekwatcher/acp_ext4.png
   2) spd_readdir: 
   http://91.234.146.107/~difrost/seekwatcher/tar_ext4_readir.png
   3) both: 
   http://91.234.146.107/~difrost/seekwatcher/acp_vs_spd_ext4.png
  
   The acp looks much better than spd_readdir but directory copy with
   spd_readdir decreased to 52m 39sec (30 min less).
  
   Do you have stats on how big these files are, and how fragmented they
   are?  For acp and spd to give us this, I think something has gone wrong
   at writeback time (creating individual fragmented files).
 
  How big? Which files?
 
  All the files you're reading ;)
 
  filefrag will tell you how many extents each file has, any file with
  more than one extent is interesting.  (The ext4 crowd may have better
  suggestions on measuring fragmentation).
 
  Since you mention this is a compile farm, I'm guessing there are a bunch
  of .o files created by parallel builds.  There are a lot of chances for
  delalloc and the kernel writeback code to do the wrong thing here.
 
 
  [Most of files are B and K size]
 
  All files scanned: 1978149
  Files fragmented: 313 (0.015%) where 11 have 3+ extents
  Total size of fragmented files: 7GB (~13% of dir size)

Ok, so I don't have a lot of great new ideas.  My guess is that inode
order and disk order for the blocks aren't matching up.  You can confirm
this with:

acp -b some_dir

You can also try telling acp to make a bigger read ahead window:

acp -s 4096 -r 128 some_dir

You can tell acp to scan all the files in the directory tree first
(warning, this might use a good chunk of ram)

acp -w some_dir

and you can combine all of these together.  None of the above
will actually help in your workload, but it'll help narrow down what is
actually seeky on disk.

-chris


Understanding metadata efficiency of btrfs

2012-03-05 Thread Kai Ren
I've run a little weird benchmark comparing Btrfs v0.19 and XFS:

There are 2000 directories and each directory contains 1000 files.
The workload randomly stats or chmods a file, 200 times in total.
The operations are split 50% stat and 50% chmod.
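
The workload described above can be approximated with a minimal Python
sketch. It is an illustration only, not the benchmark that produced the
numbers below: build_tree and run_workload are hypothetical helpers, the
per-operation coin flip is an assumed way of getting the 50/50 split, and
the demo uses a tiny tree so that it is cheap to run.

```python
import os
import random
import stat
import tempfile


def build_tree(root, n_dirs, n_files):
    """Create n_dirs directories with n_files empty files each; return file paths."""
    paths = []
    for d in range(n_dirs):
        dir_path = os.path.join(root, "dir%04d" % d)
        os.makedirs(dir_path, exist_ok=True)
        for f in range(n_files):
            path = os.path.join(dir_path, "file%04d" % f)
            with open(path, "w"):
                pass  # empty file, only metadata matters here
            paths.append(path)
    return paths


def run_workload(paths, n_ops, seed=0):
    """Pick a random file n_ops times and either stat or chmod it (about 50/50)."""
    rng = random.Random(seed)
    counts = {"stat": 0, "chmod": 0}
    for _ in range(n_ops):
        path = rng.choice(paths)
        if rng.random() < 0.5:
            os.stat(path)
            counts["stat"] += 1
        else:
            os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
            counts["chmod"] += 1
    return counts


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        files = build_tree(root, n_dirs=4, n_files=5)
        print(run_workload(files, n_ops=100))
```

To mirror the setup described above, the tree would be built with
n_dirs=2000 and n_files=1000 on the filesystem under test.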

I monitored the number of disk requests and sectors:

        #Disk Write Requests  #Disk Read Requests  #Disk Write Sectors  #Disk Read Sectors
Btrfs   2403520  157118329249216  13512248
XFS     625493                396080               10302718             4932800
I found that the number of write requests for Btrfs is significantly larger
than for XFS.
I am not very familiar with how btrfs commits metadata changes to disk.
According to the website, btrfs uses a COW B-tree which never overwrites
previous disk pages.
I assume that Btrfs also keeps an in-memory buffer for metadata changes,
but it is unclear to me how often Btrfs commits these changes and what
the mechanism behind it is.

Could anyone please comment on the experiment results and give a brief
explanation of Btrfs's metadata committing mechanism?

Sincerely,

Kai Ren


Re: Understanding metadata efficiency of btrfs

2012-03-05 Thread Duncan
Kai Ren posted on Mon, 05 Mar 2012 21:16:34 -0500 as excerpted:

 I've run a little wired benchmark on comparing Btrfs v0.19 and XFS:

[snip description of test]
 
 I monitor the number of disk read requests
 
         #WriteRq  #ReadRq  #WriteSect  #ReadSect
 Btrfs   2403520  157118329249216   13512248
 XFS     625493    396080   10302718    4932800
 
 I found the number of write quests of Btrfs is significant larger than
 XFS.

 I am not quite familiar with how btrfs commits the metadata change into
 the disks. From the website, it is said that btrfs uses COW B-tree
 which never overwrite previous disk pages. I assume that Btrfs also
 keep an in-memory buffer to keep the metadata changes.  But it is
 unclear to me that how often Btrfs will commit these changes
 and what is the behind mechanism.
 
 Could anyone please comment on the experiment results and give a brief
 explanation of Btrfs's metadata committing mechanism?

First...

You mentioned the web site, but didn't specify which one.  FWIW, the 
kernel.org breakin of some months ago threw a monkey wrench in a lot of 
things, one of them being the btrfs wiki.  The official 
btrfs.wiki.kernel.org site is currently a static copy of the wiki from 
before the breakin, so while it has the general btrfs ideas which haven't 
changed from back then, current status, etc, is now rather stale.

But there's a temporary (that could end up being permanent, it's been 
months...) btrfs wiki that's MUCH more current, at:

http://btrfs.ipv5.de/index.php?title=Main_Page

So before going further, catch up with things on the current
(temporary?) wiki.  From your post, I'd suggest you read up a bit more 
than you have, because you failed to mention at all the most important 
metadata differences between the two filesystems.  I'm not deep enough 
into filesystem internals to know if these facts explain the whole 
differences above; in fact, the wiki's where I got most of my btrfs 
specific info myself, but they certainly explain a good portion of it!

The #1 biggest difference between btrfs and most other filesystems is 
that btrfs, by default, duplicates all metadata -- two copies of all 
metadata, one copy of data, by default.  On a single disk/partition, 
that's called DUP mode, else it's referred to (not entirely correctly) as 
raid1 or raid10 mode depending on layout.  (The not entirely correctly 
bit is because a true raid1 will have as many copies as there are active 
disks, while btrfs presently only does two-way mirroring.  As such, with 
three plus disks, it's not proper raid1, only two-way-mirroring.  3-way 
and possibly N-way mirroring is on the roadmap for after raid5/6 support, 
which is roadmapped for kernels 3.4 or 3.5, so multi-way-mirroring is 
presumably 3.5 or 3.6.)

It IS possible to set up only single-copy metadata, SINGLE mode, or to 
mirror data as well, but by default, btrfs keeps two copies of metadata, 
only one of data.

So that doubles the btrfs metadata writes, right there, since by default, 
btrfs double-copies all metadata.

The #2 big factor is that btrfs (again, by default, but this is a major 
feature of btrfs, otherwise, you might as well run something else) does 
full checksumming for both data and metadata.  Unlike most filesystems, 
if cosmic rays or whatever start flipping bits on your data, btrfs will 
catch that, and if possible, retrieve a correct copy from elsewhere.  
This is actually one of the reasons for dual-copy metadata... and data 
too if you configure btrfs for it -- if the one copy is bad (fails the 
checksum validation) and there's another copy, btrfs will try to use it, 
instead.

And of course all these checksums must be written somewhere as well, so 
that's another huge increase in written metadata, even for 0-length 
files, since the metadata itself is checksummed!

And the checksumming goes some way toward explaining all those extra 
reads, as well, as any sysadmin who has run raid5/6 against raid1 can 
tell you, because in order to write out the new checksums, unchanged 
(meta)data must be read in, and on btrfs, existing checksums read in and 
verified as well, to make sure the existing version is valid, before 
making the change and writing it back out.

As I said, I don't know if this explains /all/ the difference that you're 
seeing, but it should be quite plain that the btrfs double-metadata and 
integrity checking is going to be MULTIPLE TIMES more work and I/O than 
what more traditional filesystems such as the xfs you're comparing 
against must do.

That's all covered in the wiki, actually, both of them, since those are 
btrfs basics that haven't changed (except the multi-way-mirroring roadmap) 
in some time.  That they're such big factors and that you didn't mention 
them at all, indicates to me that you've quite some reading to do about 
btrfs, since they're so very basic to what makes it what it is.  
Otherwise, you might as well just be using some other filesystem instead, 
especially since 

[PATCH 1/2] Make find_updated_files to return value instead of printing

2012-03-05 Thread Anand jain
From: Anand Jain anand.j...@oracle.com

This patch makes find_updated_files() return the transid through a
pointer instead of printing it on stdout. This is needed by autosnap
and any other program that may want to find the current transid. Note
that when the third parameter (last_gen) is not -1, find_updated_files()
might still print values on stdout.

Signed-off-by: Anand Jain anand.j...@oracle.com
---
 btrfs-list.c |    4 ++--
 btrfs_cmds.c |    5 ++++-
 btrfs_cmds.h |    2 +-
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/btrfs-list.c b/btrfs-list.c
index 61eddf9..6b642fb 100644
--- a/btrfs-list.c
+++ b/btrfs-list.c
@@ -872,7 +872,7 @@ static int print_one_extent(int fd, struct btrfs_ioctl_search_header *sh,
 	return 0;
 }
 
-int find_updated_files(int fd, u64 root_id, u64 oldest_gen)
+int find_updated_files(int fd, u64 root_id, u64 oldest_gen, u64 *transid)
 {
 	int ret;
 	struct btrfs_ioctl_search_args args;
@@ -969,7 +969,7 @@ int find_updated_files(int fd, u64 root_id, u64 oldest_gen)
 	}
 	free(cache_dir_name);
 	free(cache_full_name);
-	printf("transid marker was %llu\n", (unsigned long long)max_found);
+	*transid = max_found;
 	return ret;
 }
 
diff --git a/btrfs_cmds.c b/btrfs_cmds.c
index 7aab105..9357305 100644
--- a/btrfs_cmds.c
+++ b/btrfs_cmds.c
@@ -275,6 +275,7 @@ int do_find_newer(int argc, char **argv)
 	int ret;
 	char *subvol;
 	u64 last_gen;
+	u64 tranid = 0;
 
 	subvol = argv[1];
 	last_gen = atoll(argv[2]);
@@ -294,9 +295,11 @@ int do_find_newer(int argc, char **argv)
 		fprintf(stderr, "ERROR: can't access '%s'\n", subvol);
 		return 12;
 	}
-	ret = find_updated_files(fd, 0, last_gen);
+	ret = find_updated_files(fd, 0, last_gen, &tranid);
 	if (ret)
 		return 19;
+
+	printf("transid marker was %llu\n", (unsigned long long)tranid);
 	return 0;
 }
 
diff --git a/btrfs_cmds.h b/btrfs_cmds.h
index f53c113..218ed20 100644
--- a/btrfs_cmds.h
+++ b/btrfs_cmds.h
@@ -35,7 +35,7 @@ int do_set_default_subvol(int nargs, char **argv);
 int do_get_default_subvol(int nargs, char **argv);
 int list_subvols(int fd, int print_parent, struct sv_list **head, char *mnt);
 int do_df_filesystem(int nargs, char **argv);
-int find_updated_files(int fd, u64 root_id, u64 oldest_gen);
+int find_updated_files(int fd, u64 root_id, u64 oldest_gen, u64 *transid);
 int do_find_newer(int argc, char **argv);
 int do_change_label(int argc, char **argv);
 int open_file_or_dir(const char *fname);
-- 
1.7.9.2.315.g25a78



[PATCH 2/2] Use transaction id to determine if there is any change in the subvol

2012-03-05 Thread Anand jain
From: Anand Jain anand.j...@oracle.com

Move from the hash method of detecting filesystem changes to the
transaction id method.

Signed-off-by: Anand Jain anand.j...@oracle.com
---
 autosnap.c |  106 ++--
 autosnap.h |4 +--
 2 files changed, 70 insertions(+), 40 deletions(-)

diff --git a/autosnap.c b/autosnap.c
index beddf68..1adaf01 100644
--- a/autosnap.c
+++ b/autosnap.c
@@ -45,7 +45,7 @@
 /* during run time if not the below we use /var/spool/cron; */
 char cron_path[]="/var/spool/cron/crontabs";
 char autosnap_conf_file[]="/etc/autosnap/config";
-char tmp_file[]="/etc/autosnap/tmpfile";
+//char tmp_file[]="/etc/autosnap/tmpfile";
 
 
 /* Take a snapshot with the default dest and adds attributes */
@@ -59,10 +59,10 @@ int do_autosnap_now(int argc, char **argv)
 	char	**ap;
 	char	subvol[BTRFS_VOL_NAME_MAX];
 	char	sspath[BTRFS_VOL_NAME_MAX + 128];
-	char	tag[100];
-	char	new_hash[65];
+	char	tag[TAG_MAX_LEN];
+	u64	cur_tranid = 0;
+	u64	ss_tranid = 0;
 	char	*mnt;
-	FILE	*fp;
 	u8	fsid[BTRFS_FSID_SIZE];
 	struct stat sb;
 	struct rpolicy_cfg rp;
@@ -101,6 +101,7 @@ int do_autosnap_now(int argc, char **argv)
 		return -1;
 	fd = open_file_or_dir(mnt);
 	get_fsid(fd,&fsid[0]);
+	close(fd);
 	if ((res = read_config(subvol+strlen(mnt),tag,&rp,NULL,&fsid[0])) == 1) {
 		fprintf(stderr,"need to run autosnap enable for this subvol and tag pair\n");
 		return 1;
@@ -109,28 +110,46 @@ int do_autosnap_now(int argc, char **argv)
 		return 1;
 	}
 
+	/* Check if there is any change in the FS by comparing the transaction id*/
+	if (strcmp(rp.idcal, "older") == 0 ) {
+		/* Sync Subvol*/
+		a[1] = subvol;
+		ap = a;
+		res = do_fssync(1, ap);
+		if(res != 0) {
+			return -1;
+		}
+		fd = open_file_or_dir(subvol);
+		if (fd < 0) {
+			fprintf(stderr, "ERROR: can't access '%s'\n", subvol);
+			return -1;
+		}
+		res = find_updated_files(fd, 0, -1, &cur_tranid);
+		close(fd);
+		if (res)
+			return -1;
+
+		if((stat(rp.last_ss, &sb) == 0) && (rp.last_ss_tranid == cur_tranid)) {
+			printf("FS is identical to the last snapshot. Aborting.\n");
+			return -1;
+		}
+	}
+
 	if ( take_autosnap(subvol, tag, sspath) !=0 )
 		return -1;
 
-	if (strcmp(rp.idcal, "older") == 0 ) {
-		fp = fopen(tmp_file, "w");
-		tree_scan(sspath, fp);
-		fclose(fp);
-		get_sha256(tmp_file, new_hash);
-		if((stat(rp.last_ss, &sb) == 0) && (strcmp(rp.last_ss_hash,new_hash) == 0)) {
-			printf("Newer snapshot is identical to the previous snapshot, deleting the newer\n");
-			a[1] = sspath;
-			ap = a;
-			res = do_delete_subvolume(2,ap);
-			if(res)
-				printf("do_delete_subvolume failed %d\n",res);
-		} else {
-			/* hash does not match so keep the new snasphot  OR
-			Last snapshot was deleted. */
-			update_last_hash(subvol+strlen(mnt),tag,&fsid[0],sspath,new_hash);
-		}
-		unlink(tmp_file);
+	fd = open_file_or_dir(sspath);
+	if (fd < 0) {
+		fprintf(stderr, "ERROR: can't access '%s'\n", sspath);
+		return -1;
 	}
+	res = find_updated_files(fd, 0, -1, &ss_tranid);
+	close(fd);
+	if (res)
+		return -1;
+
+	/* tranid does not match or Last snapshot was deleted. go ahead*/
+	update_last_tranid(subvol+strlen(mnt),tag,&fsid[0],sspath,ss_tranid);
 
 #if 0
 	/* Un-def this when we have synchronous snapshot delete */
@@ -141,7 +160,8 @@ int do_autosnap_now(int argc, char **argv)
 	if (rp.rpval != -1) {
 		res = chk_retain_bynum(subvol, rp.rpval, tag);
 		if(res != 0 ) {
-			fprintf(stderr,"Error: Check for the retainable subvol failed %d\n",res);
+			fprintf(stderr,"Error: Check for the retainable subvol failed %d\n",
+				res);
 			return -1;
 		}
 	}
@@ -457,7 +477,8 @@ int do_autosnap_enable(int argc, char **argv)
 		case 'm':
 			fcnt++;
 			if ((atoi(optarg) > 60) || (atoi(optarg) < 1)) {
-				fprintf(stderr, "Value for option -m: Minutes should be between 1 to 60\n");
+