Re: Permissions model for btrfs?

2010-12-09 Thread Sean Bartell
On Thu, Dec 09, 2010 at 12:35:35PM -0500, Wayne Pollock wrote:
 I looked through the wiki (and searched the archives) but
 don't see an answer.  Will btrfs support old POSIX-style ACLs
 and permissions, or the new NFS/NT style ACLs like ZFS?  From
 the patch I saw, it seems old POSIX ACLs and permissions, but
 I'd like to know for sure.  (And maybe the FAQ on the wiki could
 address this?)  If it is the older POSIX ACLs, is there any plan
 to support NFSv4 ACLs in the future?

Right now it supports POSIX ACLs. I don't know about future plans.

 On a related note, will btrfs support any ext4 attributes (via chattr)?

It currently supports AaDdiS.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Questions regarding COW-related behaviors

2010-11-08 Thread Sean Bartell
(sorry for sending twice)

On Mon, Nov 08, 2010 at 02:23:13PM +, João Eduardo Luís wrote:

 Basically, I need to be aware how the COW works in BTRFS, and what it may 
 allow one to achieve. Questions follow.

From your questions, you don't seem to understand CoW. CoW is basically
an alternative to the logging/journalling used by most filesystems.

When you change a data structure in a journalling filesystem, like ext4,
you actually write two copies--one into the journal, and one that
overwrites the old data structure. If a crash happens, at least one copy
will still be valid, making recovery possible.

When you change a data structure in a CoW filesystem, like btrfs, you
only write one copy, but you DON'T write it over the old data structure.
You write it to a new, unallocated space. This means the location of the
data structure changed, so you have to change the parent data structure;
you use CoW for that and so on up to the superblocks, which actually are
overwritten. Once that's finished, the old versions are no longer
needed, so they will be unallocated and eventually overwritten. If a
crash happens, the superblocks will still point to the old version of
the data structures.

This makes it relatively easy to add snapshot features--just add
reference counting, and don't free old versions of data structures if
they're still being used. However, this only happens if the user
explicitly requests a snapshot. Otherwise, the old data structures are
freed immediately once the new ones are completely written.
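
The update path just described can be shown with a toy model (plain Python, not btrfs code; all the names here are made up for illustration). A CoW tree shares every unchanged subtree and rewrites only the path from the changed leaf up to the root:

```python
class Node:
    """An immutable tree node: once 'written', it is never modified."""
    def __init__(self, value, children=()):
        self.value = value
        self.children = tuple(children)

def cow_update(root, path, new_value):
    """Return a new root; only nodes on `path` are rewritten, the rest is shared."""
    if not path:
        return Node(new_value, root.children)
    i = path[0]
    new_child = cow_update(root.children[i], path[1:], new_value)
    children = list(root.children)
    children[i] = new_child
    return Node(root.value, children)  # the parent gets a new copy, too

# The "superblock" is the one mutable pointer, switched only after the
# new tree is safely on disk.
leaf_a, leaf_b = Node("a"), Node("b")
old_root = Node("root", [leaf_a, leaf_b])
new_root = cow_update(old_root, [1], "b2")

assert old_root.children[1].value == "b"   # old version intact: crash-safe
assert new_root.children[1].value == "b2"
assert new_root.children[0] is leaf_a      # unchanged subtree is shared
```

If a crash happens before the "superblock" pointer is switched to `new_root`, the old tree is still complete; a snapshot is just keeping `old_root` reachable instead of freeing it.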

 1) Is COW only used when creating or updating a file? While testing BTRFS, 
 using 'btrfs subvolume find-new', I got the idea that neither creation of 
 directories, nor any kind of deletion are covered by COW. Is this right?

CoW is used any time any structure is changed. find-new is not directly
related to CoW.

 2) Each time a COW happens, is there any kind of implicit 'snapshotting' that 
 may keep track of changes around the filesystem for each COW? 
 By Rodeh's paper and some info on the wiki, I gather that a new root is 
 created for each COW, due to shadowing, but will the previous tree be kept? 
 The wiki, at BTRFS Design, states that after the commit finishes, the 
 older subvolume root items may be removed. This would make it impossible to 
 track changes to files, but 'btrfs subvolume find-new' still manages to 
 output file generations, so there must be some info left behind. 

The old tree is discarded unless the user requested a snapshot of it.

Every time btrfs updates the roots, a new generation begins. Some data
structures have generation fields, indicating the generation in which
they were most recently changed. This is mostly used to verify the
filesystem is correct, but it's also possible to scan the generation
fields and find out which files have changed.
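
As a rough model of how that scan works (illustrative Python only; `Volume` and its methods are invented here, not btrfs's real structures): changes are batched, each commit bumps the generation, and every item records the generation that last touched it:

```python
class Volume:
    """Toy model of generation-based change tracking; not the on-disk format."""
    def __init__(self):
        self.generation = 1
        self.files = {}     # name -> (data, generation in which it last changed)
        self.pending = {}   # buffered writes, not yet committed

    def write(self, name, data):
        self.pending[name] = data

    def commit(self):
        # One new generation: many buffered changes, one root update.
        self.generation += 1
        for name, data in self.pending.items():
            self.files[name] = (data, self.generation)
        self.pending.clear()

    def find_new(self, since):
        """Files changed after generation `since`, like 'btrfs subvolume find-new'."""
        return sorted(n for n, (_, g) in self.files.items() if g > since)

v = Volume()
v.write("a", "1"); v.write("b", "2"); v.commit()  # both land in one generation
mark = v.generation
v.write("b", "3"); v.commit()
assert v.find_new(mark) == ["b"]
```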

 3) Following (2), is there any way to access this information and, let's 
 say, recover an older version of a given file? Or an entire previous tree?

No, unless the user requests a snapshot. I'm assuming you're not talking
about tools like PhotoRec, that try to reassemble files from whatever
disk data looks valid.

 4) From Rodeh's paper I got the idea that BTRFS uses periodic checkpointing, 
 in order to assign generations to operations. Using 'btrfs subvolume 
 find-new' I confirmed my suspicions. After copying two different directories 
 into the same subvolume at the same time, all files got assigned the same 
 generation and it took a while until they all showed up. This raises the 
 question: what triggers a new checkpoint? Is it based on elapsed time since 
 last checkpoint? Is it triggered by a COW and then, all COWs happening at the 
 same time will be put together and create a big new generation?

Again, periodic checkpointing is probably the wrong way to think about
it. It would be wasteful to overwrite the superblocks every time a
change is made; instead, btrfs may combine multiple changes into one
generation and only update the superblocks once. I'm not sure exactly
how btrfs decides when to write a new generation.

 5) If we have multiple jobs updating the same file at the same time, I assume 
 the system will shadow their updates; when the time for committing comes, 
 will there be any kind of 'conflict' between concurrent updates, or will they 
 be applied by order of commit, ignoring whether there were previous commits 
 or not? Regarding checkpointing, will all the changes be shown as part of the 
 generation, or will they be considered as only one?

This is handled just like in any other filesystem. There are no
concurrent generations; if two threads both update a file, btrfs will
handle the updates sequentially, one at a time.


Re: crc32c

2010-10-11 Thread Sean Bartell
On Mon, Oct 11, 2010 at 03:47:58AM -0500, Nathan Caza wrote:
 I think I'm on the verge of getting all my data back; the only missing
 piece is to recalculate the crc checksum of my altered superblock and
 I'm having trouble finding the correct function/method; the data I am
 checksumming is (based on the sheet) 0x20 (directly after the
 checksum) to 0x32b + n (226 bytes); if I'm doing something wrong let me
 know; or if there's a quick/dirty way to get the right checksum

The checksum seems to be from 0x20 to 0x1000.
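
For reference, btrfs uses CRC-32C (the Castagnoli polynomial, not the zlib/Ethernet CRC-32) for the superblock, computed over everything after the csum field and stored little-endian at offset 0. A slow bitwise sketch in Python (the kernel uses table-driven or hardware crc32c, but the value is the same):

```python
CSUM_START, SUPER_SIZE = 0x20, 0x1000

def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli); slow, but matches the kernel's crc32c."""
    poly = 0x82F63B78            # reflected Castagnoli polynomial
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (poly if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def superblock_csum(sb: bytes) -> bytes:
    """Checksum of bytes 0x20..0x1000, little-endian, as stored at offset 0."""
    assert len(sb) >= SUPER_SIZE
    return crc32c(sb[CSUM_START:SUPER_SIZE]).to_bytes(4, "little")

# Standard CRC-32C check value:
assert crc32c(b"123456789") == 0xE3069283
```

So for a 4 KiB superblock buffer `sb`, writing `superblock_csum(sb)` into its first four bytes should make the copy self-consistent again.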


Re: On-Disk Format

2010-10-10 Thread Sean Bartell
On Sun, Oct 10, 2010 at 10:43:56PM -0500, Nathan Caza wrote:
 Is this up-to-date? if not, has anyone put together something like
 this more recent??
 
 https://btrfs.wiki.kernel.org/index.php/User:Wtachi/On-disk_Format

It should be up-to-date, to the extent that it contains any useful
information at all. It's basically a sketch I wrote when I was first
figuring out btrfs, and I haven't gotten around to filling in the
details.


Re: BTRFS SSD

2010-09-29 Thread Sean Bartell
On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:
 I know BTRFS is a kind of Log-structured File System, which doesn't do
 overwrite. Here is my question, suppose file A is overwritten by A',
 instead of writing A' to the original place of A, a new place is
 selected to store it. However, we know that the address of a file
 should be recorded in its inode. In such case, the corresponding part
 in inode of A should update from the original place A to the new place
 A', is this a kind of overwrite actually? I think no matter what
 design it is for Log-Structured FS, a mapping table is always needed,
 such as inode map, DAT, etc. When a update operation happens for this
 mapping table, is it actually a kind of over-write? If it is, is it a
 bottleneck for the performance of write for SSD?

In btrfs, this is solved by doing the same thing for the inode--a new
place for the leaf holding the inode is chosen. Then the parent of the
leaf must point to the new position of the leaf, so the parent is moved,
and the parent's parent, etc. This goes all the way up to the
superblocks, which are actually overwritten one at a time.

 What do you think the major work that BTRFS can do to improve the
 performance for SSD? I know FTL has becomes smarter and smarter, the
 idea of log-structured file system is always implemented inside the
 SSD by FTL, in that case, it sounds all the issues have been solved no
 matter what the FS it is in upper stack. But at least, from the
 results of benchmarks on the internet show that the performance from
 different FS are quite different, such as NILFS2 and BTRFS.


Re: BTRFS SSD

2010-09-29 Thread Sean Bartell
On Wed, Sep 29, 2010 at 02:45:29PM -0400, Yuehai Xu wrote:
 On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell wingedtachik...@gmail.com 
 wrote:
  On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:
  I know BTRFS is a kind of Log-structured File System, which doesn't do
  overwrite. Here is my question, suppose file A is overwritten by A',
  instead of writing A' to the original place of A, a new place is
  selected to store it. However, we know that the address of a file
  should be recorded in its inode. In such case, the corresponding part
  in inode of A should update from the original place A to the new place
  A', is this a kind of overwrite actually? I think no matter what
  design it is for Log-Structured FS, a mapping table is always needed,
  such as inode map, DAT, etc. When a update operation happens for this
  mapping table, is it actually a kind of over-write? If it is, is it a
  bottleneck for the performance of write for SSD?
 
  In btrfs, this is solved by doing the same thing for the inode--a new
  place for the leaf holding the inode is chosen. Then the parent of the
  leaf must point to the new position of the leaf, so the parent is moved,
  and the parent's parent, etc. This goes all the way up to the
  superblocks, which are actually overwritten one at a time.
 
 You mean that there is no over-write for inode too, once the inode
 needs to be updated, this inode is actually written to a new place
 while the only thing to do is to change the point of its parent to
 this new place. However, for the last parent, or the superblock, does
 it need to be overwritten?

Yes. The idea of copy-on-write, as used by btrfs, is that whenever
*anything* is changed, it is simply written to a new location. This
applies to data, inodes, and all of the B-trees used by the filesystem.
However, it's necessary to have *something* in a fixed place on disk
pointing to everything else. So the superblocks can't move, and they are
overwritten instead.


Re: BTRFS SSD

2010-09-29 Thread Sean Bartell
On Wed, Sep 29, 2010 at 03:39:07PM -0400, Aryeh Gregor wrote:
 On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell wingedtachik...@gmail.com 
 wrote:
  In btrfs, this is solved by doing the same thing for the inode--a new
  place for the leaf holding the inode is chosen. Then the parent of the
  leaf must point to the new position of the leaf, so the parent is moved,
  and the parent's parent, etc. This goes all the way up to the
  superblocks, which are actually overwritten one at a time.
 
 Sorry for the useless question, but just out of curiosity: doesn't
 this mean that btrfs has to do quite a lot more writes than ext4 for
 small file operations?  E.g., if you append one block to a file, like
 a log file, then ext3 should have to do about three writes: data,
 metadata, and journal (and the latter is always sequential, so it's
 cheap).  But btrfs will need to do more, rewriting parent nodes all
 the way up the line for both the data and metadata blocks.  Why
 doesn't this hurt performance a lot?

For a single change, it does write more. However, there are usually many
changes to children being performed at once, which only require one
change to the parent. Since it's moving everything to new places, btrfs
also has much more control over where writes occur, so all the leaves
and parents can be written sequentially. ext3 is a slave to the current
locations on disk.


Re: A device dedicated for metadata?

2010-07-28 Thread Sean Bartell
In response to your original questions, btrfs currently gives no control
over the allocation of data or metadata. I'm sure someone will implement
more control eventually.

On Wed, Jul 28, 2010 at 11:49:33PM +0800, wks1986 wrote:
 Another issue is the speed of fsck.  There will always be times when
 the operating system is brought down abnormally and fsck is necessary.
  In order to make the downtime as short as possible, fsck should be
 fast.  In this case, when metadata are stored in a fast device, fsck
 will be significantly faster.  The hot data tracking patch is based
 on the statistics of ONLINE accesses.  Some data may suddenly become
 hot when the filesystem goes offline for fsck.

Actually, because of copy-on-write and other aspects of btrfs' design,
there's no need for the typical use of fsck after a crash. Even once a
proper fsck is finished, it will only be necessary when important
information is corrupted. So it generally doesn't make sense to worry
about fsck speed.


Re: raid modes, balancing, and order in which data gets written

2010-07-15 Thread Sean Bartell
On Thu, Jul 15, 2010 at 10:29:07AM +0200, Mathijs Kwik wrote:
 Hi all,
 
 I read that btrfs - in a raid mode - does not mimic the behavior of
 traditional (hw/sw) raid.
 After writing to a btrfs raid filesystem, data will only be
 distributed the way you expect after running a rebalance.

This is not the case. When you create a btrfs filesystem with RAID
enabled, stuff written from then on will be written just like with
traditional RAID.

The difference with traditional RAID is that different parts of the FS
can have different RAID settings. Btrfs reserves space in ~1GiB block
groups for data or metadata, each of which has its own RAID settings.
If you change the RAID mode for an existing filesystem (not yet
supported IIUC) or add/remove devices, the existing block groups will
keep their old RAID settings if at all possible.

Rebalancing essentially moves everything into new block groups, which
will use the new RAID settings and be more balanced between data and
metadata. It isn't useful unless you change RAID settings, add/remove
devices, or have too much space reserved for either data or metadata.
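
A toy model of that behavior (illustrative Python; `ToyFS` and `BlockGroup` are invented names, and real block groups are of course disk-space reservations, not lists). Each group keeps the profile it was created with; only a balance rewrites everything into groups with the current profile:

```python
class BlockGroup:
    """A ~1 GiB reservation that remembers the RAID profile it was created with."""
    def __init__(self, profile):
        self.profile = profile
        self.extents = []

class ToyFS:
    def __init__(self, profile):
        self.profile = profile      # profile used for *new* block groups only
        self.groups = []

    def allocate(self, extent):
        # Reuse a group matching the current profile, else create a new one.
        for g in self.groups:
            if g.profile == self.profile:
                g.extents.append(extent)
                return
        g = BlockGroup(self.profile)
        g.extents.append(extent)
        self.groups.append(g)

    def balance(self):
        # Rebalance: move everything into fresh groups with the new settings.
        old = [e for g in self.groups for e in g.extents]
        self.groups = []
        for e in old:
            self.allocate(e)

fs = ToyFS("raid0")
fs.allocate("old-data")
fs.profile = "raid1"                # settings change: old group stays raid0
fs.allocate("new-data")
assert {g.profile for g in fs.groups} == {"raid0", "raid1"}
fs.balance()
assert {g.profile for g in fs.groups} == {"raid1"}
```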

 [...]

 If you never rebalance manually, will the filesystem do this in the
 background (when idle)?
 Or will the fs never rebalance itself and only become more balanced
 again after writing/changing some files, which it will then place on
 the drive which has the lowest balance?

Rebalancing isn't done automatically, and nothing can become more
balanced until new block groups are created when you run out of space
in the old ones.

 Basically, I'm not sure I fully understood balancing, so any info on
 this would be great.
 In traditional raid0 and raid10 (block based), it is guaranteed that
 any big file will always be stiped between disks equally, so a certain
 performance can be assumed.
 With non-automatic balancing, I'm afraid some files might not be
 distributed as well as they could, resulting in lower performance.
 Is this an issue to be aware of, or can I safely assume that for most
 use cases the performance will roughly be the same as sw-raid?
 2 cases I'm interested in:
 - big databases(lots of rewrites)
 - real-time video-capturing (sustained write to 1 or more big files,
 needing a guaranteed write throughput)

If you initially create the filesystem with the right RAID settings, it
will act just like normal software RAID. Balancing only comes into play
when you start changing your mind :).

 Any info on this or balancing in general will be greatly appreciated.


Re: snapshotting - what data gets shared?

2010-07-14 Thread Sean Bartell
On Wed, Jul 14, 2010 at 11:27:39PM +0200, Mathijs Kwik wrote:
 Hi all,
 
 I'm used to snapshots with LVM and I would like to compare them to btrfs.
 
 The case I want to compare is the following:
 At the moment a snapshot is created, no extra space is needed (maybe
 some metadata overhead) and all data is shared between the original
 and the snapshot.
 In LVM, snapshots work at the block-level, so any changes done to the
 original volume trigger a COW to the snapshot.
 If LVM is configured to use 4Mb blocks (default), this means that
 overwriting a 100k file, will lead to 4Mb snapshot data to be backed
 up.
 A 800Mb file will take around 800Mb.
 So, for small files (that are not on the same extent/block) this can
 waste quite some space, while for bigger files, or lots of files
 close to each other, it doesn't matter much.
 
 How is this for btrfs snapshots?
 Do they work at the file-level? or also at blocks/extents?
 
 I mean, does changing a 100k file lead to 100k being snapshotted?

Btrfs CoWs file extents, and files can use only the parts of an extent
they need, so a 1-byte change would only require one additional 4K data
block. Of course, metadata also needs to be updated, and will require
a number of additional blocks.
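
In other words, the unit of data CoW is the block. A quick sketch of how much new data space a rewrite dirties, assuming 4 KiB blocks (the usual sector size; metadata updates come on top of this):

```python
BLOCK = 4096  # assumed data block size

def cow_bytes(offset: int, length: int) -> int:
    """New data space a write dirties: only the whole blocks the range touches."""
    if length == 0:
        return 0
    first = offset // BLOCK
    last = (offset + length - 1) // BLOCK
    return (last - first + 1) * BLOCK

assert cow_bytes(100, 1) == 4096               # 1-byte change: one new block
assert cow_bytes(4095, 2) == 8192              # straddles a block boundary
assert cow_bytes(0, 20 * 2**30) == 20 * 2**30  # full rewrite of a 20G image
```

So a small change inside a 20G disk image costs a few blocks, not 20G; contrast this with LVM's fixed 4 MB copy unit described above.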

 What would happen if I have a 20G file (for example a disk image for kvm)?
 Would minor changes in that file lead to the entire 20G to be COWed/backed 
 up?

No, only the relevant portion.

 Is there a distinction between data and metadata?
 Or does touching (ctime/mtime) or visiting (atime) a file cause it to be 
 COWed?

Metadata is CoWed separately, so there will still only be one copy of
the data.

 Thanks for any info on this.
 Mathijs


[PATCH] btrfs: handle errors for FS_IOC_SETFLAGS

2010-06-30 Thread Sean Bartell
Makes btrfs_ioctl_setflags return -ENOSPC and other errors when
necessary.

Signed-off-by: Sean Bartell wingedtachik...@gmail.com
---
I ran chattr -R on a full FS and btrfs crashed. This overlaps with the patch
series being worked on by Jeff Mahoney.

 fs/btrfs/ioctl.c |   17 -
 1 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4dbaf89..8db62c2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -200,19 +200,26 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 
 
trans = btrfs_join_transaction(root, 1);
-   BUG_ON(!trans);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   goto out_drop_write;
+   }
 
ret = btrfs_update_inode(trans, root, inode);
-   BUG_ON(ret);
+   if (ret)
+   goto out_endtrans;
 
btrfs_update_iflags(inode);
	inode->i_ctime = CURRENT_TIME;
-   btrfs_end_transaction(trans, root);
 
+   ret = 0;
+out_endtrans:
+   btrfs_end_transaction(trans, root);
+out_drop_write:
	mnt_drop_write(file->f_path.mnt);
- out_unlock:
+out_unlock:
mutex_unlock(inode-i_mutex);
-   return 0;
+   return ret;
 }
 
 static int btrfs_ioctl_getversion(struct file *file, int __user *arg)
-- 
1.7.1



Re: Is there a more aggressive fixer than btrfsck?

2010-06-29 Thread Sean Bartell
On Tue, Jun 29, 2010 at 02:36:14PM -0700, Freddie Cash wrote:
 On Tue, Jun 29, 2010 at 3:37 AM, Daniel Kozlowski
 dan.kozlow...@gmail.com wrote:
  On Mon, Jun 28, 2010 at 10:31 PM, Rodrigo E. De León Plicet
  rdele...@gmail.com wrote:
  On Mon, Jun 28, 2010 at 8:48 AM, Daniel Kozlowski
  dan.kozlow...@gmail.com wrote:
  Sean Bartell wingedtachikoma at gmail.com writes:
 
   Is there a more aggressive filesystem restorer than btrfsck?  It simply
   gives up immediately with the following error:
  
   btrfsck: disk-io.c:739: open_ctree_fd: Assertion `!(!tree_root->node)'
   failed.
 
  btrfsck currently only checks whether a filesystem is consistent. It
  doesn't try to perform any recovery or error correction at all, so it's
  mostly useful to developers. Any error handling occurs while the
  filesystem is mounted.
 
 
  Is there any plan to implement this functionality. It would seem to me to 
  be a
  pretty basic feature that is missing ?
 
  If Btrfs aims to be at least half of what ZFS is, then it will not
  impose a need for fsck at all.
 
  Read No, ZFS really doesn't need a fsck at the following URL:
 
  http://www.c0t0d0s0.org/archives/6071-No,-ZFS-really-doesnt-need-a-fsck.html
 
 
  Interesting idea. it would seem to me however that the functionality
  described in that article is more concerned with a bad transaction
  rather then something like a hardware failure where a block written
  more then 128 transactions ago is now corrupted and consiquently the
  entire partition is now unmountable( that is what I think i am looking
  at with BTRFS )
 
 In the ZFS case, this is handled by checksumming and redundant data,
 and can be discovered (and fixed) via either reading the affected data
 block (in which case, the checksum is wrong, the data is read from a
 redundant data block, and the correct data is written over the
 incorrect data) or by running a scrub.
 
 Self-healing, checksumming, data redundancy eliminate the need for
 online (or offline) fsck.
 
 Automatic transaction rollback at boot eliminates the need for fsck at
 boot, as there is no such thing as a dirty filesystem.  Either the
 data is on disk and correct, or it doesn't exist.  Yes, you may lose
 data.  But you will never have a corrupted filesystem.
 
 Not sure how things work for btrfs.

btrfs works in a similar way. While it's writing new data, it keeps the
superblock pointing at the old data, so after a crash you still get the
complete old version. Once the new data is written, the superblock is
updated to point at it, ensuring that you see the new data. This
eliminates the need for any special handling after a crash.

btrfs also uses checksums and redundancy to protect against data
corruption. Thanks to its design, btrfs doesn't need to scan the
filesystem or cross-reference structures to detect problems. It can
easily detect corruption at run-time when it tries to read the
problematic data, and fixes it using the redundant copies.

In the event that something goes horribly wrong, for example if each
copy of the superblock or of a tree root is corrupted, you could still
find some valid nodes and try to piece them together; however, this is
rare and falls outside the scope of a fsck anyway.


Re: Is there a more aggressive fixer than btrfsck?

2010-06-02 Thread Sean Bartell
On Tue, Jun 01, 2010 at 07:29:56PM -0700, u...@sonic.net wrote:
 Is there a more aggressive filesystem restorer than btrfsck?  It simply
 gives up immediately with the following error:
 
  btrfsck: disk-io.c:739: open_ctree_fd: Assertion `!(!tree_root->node)'
 failed.

btrfsck currently only checks whether a filesystem is consistent. It
doesn't try to perform any recovery or error correction at all, so it's
mostly useful to developers. Any error handling occurs while the
filesystem is mounted.

 Yet, the filesystem has plenty of data on it, and the discs are good and I
 didn't do anything to the data except regular btrfs commands and normal
 mounting.  That's a wildly unreliable filesystem.

btrfs is under heavy development, so make sure you're using the latest
git versions of the kernel module and tools.

 BTW, is there a way to improve delete and copy performance of btrfs?  I'm
 getting about 50KB/s-500KB/s (per size of file being deleted) in deleting
 and/or copying files on a disc that usually can go about 80MB/s.  I think
 it's because they were fragmented.  That implies btrfs is too accepting of
 writing data in fragmented style when it doesn't have to.  Almost all the
 files on my btrfs partitions are around a gig, or 20 gigs, or a third of a
 gig, or stuff like that.  The filesystem is 1.1TB.
 
 Brad


Re: Quota Support

2010-06-02 Thread Sean Bartell
On Wed, Jun 02, 2010 at 10:57:44AM +0200, Stephen wrote:
 I'm just wondering if subvolumes or snapshots can have quotas imposed on
 them.

Subvolume quotas are one of the many features that haven't yet been
implemented. See
https://btrfs.wiki.kernel.org/index.php/Development_timeline.


Re: [PATCH 1/4] btrfs-convert: make more use of cache_free_extents

2010-05-18 Thread Sean Bartell
On Tue, May 18, 2010 at 09:40:28PM +0800, Yan, Zheng  wrote:
 On Sat, Mar 20, 2010 at 12:24 PM, Sean Bartell
 wingedtachik...@gmail.com wrote:
  An extent_io_tree is used for all free space information. This allows
  removal of ext2_alloc_block and ext2_free_block, and makes
  create_ext2_image less ext2-specific.

  +       ret = ext2_cache_free_extents(ext2_fs, orig_free_tree);
  +       if (ret) {
  +               fprintf(stderr, "error during cache_free_extents %d\n", ret);
  +               goto fail;
  +       }
  +       /* preserve first 64KiB, just in case */
  +       clear_extent_dirty(orig_free_tree, 0, BTRFS_SUPER_INFO_OFFSET - 1, 0);
  +
  +       ret = copy_dirtiness(free_tree, orig_free_tree);
  +       if (ret) {
  +               fprintf(stderr, "error during copy_dirtiness %d\n", ret);
  +               goto fail;
  +       }

 extent_io_tree is not very space efficient. caching free space in
 several places is not good. I prefer adding a function that checks
 if a given block is used to the 'convert_fs' structure.

Good point. I'll change cache_free_extents to something like
  int (*iterate_used_extents)(struct convert_fs *fs, u64 start, u64 end,
  void *priv, int (*cb)(u64 start, u64 end))
create_image_file_range and do_convert should work well with a callback.

This also opens up the possibility of finding free extents
incrementally: call iterate_used_extents on the first GB, then
custom_alloc_extent will call it on the next GB once free space runs
out.


Re: Restoring BTRFS partition

2010-04-20 Thread Sean Bartell
On Tue, Apr 20, 2010 at 11:55:38PM +0800, Wengang Wang wrote:
 I guess the reason is that the 300M file btrfs and the one on your
 partition have different block size. Thus 65k zeros on your file image
 doesn't mean 65k on the partition. So maybe you will try with blocks
 instead of bytes.

Actually, the block size doesn't matter for this--the superblock is
always at 0x10000. Alli, I think you'll have to upload the start of the
partition so someone can take a look at it.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Restoring BTRFS partition

2010-04-20 Thread Sean Bartell
On Tue, Apr 20, 2010 at 06:13:41PM +, Alli Quaknaa wrote:
 So here are first ~12M of the partition. There was some junk preceding
 what is in the file, but it mostly looked like my swap or something
 (cached css, javascript and webpages I've recently visited - though I
 hope the beginning of the partition isn't somewhere else. Hopefully you
 would be able to tell from the dump).
 
 http://pub.yweb.cz/sda7_head.dump

The superblock in that file (starting at byte 0x10000) is actually a mirror
of the real superblock. Aside from the real superblock at 0x10000,
btrfs stores mirror copies of the superblock at 0x4000000 (64 MiB),
0x4000000000 (256 GiB), and 0x4000000000000 (1 PiB). Each superblock has
a field that indicates where it is; when you made your image, you put
the mirror superblock where the real superblock was supposed to be, and
btrfs refused to mount it because that field was wrong.

The real start of the btrfs partition is 0x4000000 bytes (64 MiB) before
the place you found that mirror superblock; the real superblock should
be 0x3ff0000 bytes before the mirror. Even if the real superblock is
corrupt, if the mirror is at 0x4000000, where it's supposed to be, you
should be able to get btrfs to mount it (though I think you might need a
mount option or a patch).
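
Those fixed locations make a misplaced superblock easy to spot in a raw image: check each well-known offset for the btrfs magic and compare the superblock's own bytenr field (at byte 0x30, after the 32-byte csum and 16-byte fsid) against where the copy actually sits. An illustrative sketch (`find_superblocks` is my own helper, not an existing tool):

```python
import io

# Fixed superblock locations: primary at 64 KiB, mirrors at 64 MiB,
# 256 GiB, and 1 PiB (when the device is large enough).
SUPER_OFFSETS = [0x10000, 0x4000000, 0x4000000000, 0x4000000000000]
BTRFS_MAGIC = b"_BHRfS_M"   # at byte 0x40 of the superblock
BYTENR_OFF = 0x30           # le64: where this copy believes it lives

def find_superblocks(img):
    """Yield (file_offset, stored_bytenr) for every superblock copy in `img`."""
    for off in SUPER_OFFSETS:
        img.seek(off + 0x40)
        if img.read(8) != BTRFS_MAGIC:
            continue
        img.seek(off + BYTENR_OFF)
        yield off, int.from_bytes(img.read(8), "little")

# Fake image reproducing the situation above: a mirror superblock copied to
# the primary location, so its bytenr (0x4000000) disagrees with its offset.
sb = bytearray(0x1000)
sb[0x30:0x38] = (0x4000000).to_bytes(8, "little")
sb[0x40:0x48] = BTRFS_MAGIC
img = io.BytesIO(bytes(0x10000) + bytes(sb))
(off, bytenr), = find_superblocks(img)
assert off == 0x10000 and bytenr == 0x4000000   # mismatch => misplaced mirror
```

A mismatch between the two numbers is exactly the "that field was wrong" condition that makes btrfs refuse the mount.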


Re: Restoring BTRFS partition

2010-04-20 Thread Sean Bartell
On Tue, Apr 20, 2010 at 10:25:34PM +, Alli Quaknaa wrote:
 I think I have found the real superblock you are talking about, but
 I'm afraid I may have written something in the first 64MiB. Is there a
 chance btrfsck will recover it?
btrfsck is currently very limited; it only detects a limited number of
problems, and it can't fix anything. Btrfs focuses on handling problems
when they are discovered while using the FS; generally, it should handle
corruption relatively gracefully. However, if anything really crucial
was overwritten and the FS can't be mounted, there aren't any tools to
repair it.

 Also, I think there's gotta be a better way to manipulate those huge
 files than dd and hexedit for examination - I'd like to take the raw
 file, open it in some hex editor and be able to cut off some of its
 beginning - I can't seem to be able to do it with hexedit. Is there a
 tool you'd recommend?
For viewing, you can use less, head, and tail with hexdump:
tail -c +$((0x10000+1)) /dev/sda1|hexdump -C|less
will view the disk starting at the superblock. For editing, dd is
probably best, though you could use a hex editor like Okteta. I've also
heard of Radare, supposedly a very advanced command-line tool. Keep in
mind that any tool that deletes the first part of a huge file will be
forced to rewrite the entire file.


Re: [PATCH 2/4] btrfs-convert: Add extent iteration functions.

2010-03-21 Thread Sean Bartell
Whoops, there's a major memory leak. Please apply this patch to the
patch :).

diff --git a/convert.c b/convert.c
index dfd2976..7bb4ed0 100644
--- a/convert.c
+++ b/convert.c
@@ -471,21 +471,24 @@ int finish_file_extents(struct extent_iterate_data *priv)
 				return ret;
 	}
 	*priv->inode_nbytes += priv->size;
-	return btrfs_insert_inline_extent(priv->trans, priv->root,
-					  priv->objectid,
-					  priv->file_off, priv->data,
-					  priv->size);
-	}
-
-	ret = commit_file_extents(priv);
-	if (ret)
-		return ret;
-
-	if (priv->total_size > priv->last_file_off) {
-		ret = commit_disk_extent(priv, priv->last_file_off, 0,
-			priv->total_size - priv->last_file_off);
+	ret = btrfs_insert_inline_extent(priv->trans, priv->root,
+					 priv->objectid,
+					 priv->file_off, priv->data,
+					 priv->size);
 	if (ret)
 		return ret;
+	} else {
+		ret = commit_file_extents(priv);
+		if (ret)
+			return ret;
+
+		if (priv->total_size > priv->last_file_off) {
+			ret = commit_disk_extent(priv, priv->last_file_off, 0,
+						 priv->total_size -
+						 priv->last_file_off);
+			if (ret)
+				return ret;
+		}
 	}
 	free(priv->data);
 	return 0;


[PATCH 2/4] btrfs-convert: Add extent iteration functions.

2010-03-19 Thread Sean Bartell
A filesystem can have disk extents in arbitrary places on the disk, as
well as extents that must be read into memory because they have
compression or encryption btrfs doesn't support. These extents can be
passed to the new extent iteration functions, which will handle all the
details of alignment, allocation, etc.
---
 convert.c |  604 -
 1 files changed, 401 insertions(+), 203 deletions(-)

diff --git a/convert.c b/convert.c
index c48f8ba..bd91990 100644
--- a/convert.c
+++ b/convert.c
@@ -357,7 +357,7 @@ error:
 }
 
 static int read_disk_extent(struct btrfs_root *root, u64 bytenr,
-			    u32 num_bytes, char *buffer)
+			    u64 num_bytes, char *buffer)
 {
 	int ret;
 	struct btrfs_fs_devices *fs_devs = root->fs_info->fs_devices;
@@ -371,6 +371,23 @@ fail:
 	ret = -1;
 	return ret;
 }
+
+static int write_disk_extent(struct btrfs_root *root, u64 bytenr,
+			     u64 num_bytes, const char *buffer)
+{
+	int ret;
+	struct btrfs_fs_devices *fs_devs = root->fs_info->fs_devices;
+
+	ret = pwrite(fs_devs->latest_bdev, buffer, num_bytes, bytenr);
+	if (ret != num_bytes)
+		goto fail;
+	ret = 0;
+fail:
+	if (ret < 0)
+		ret = -1;
+	return ret;
+}
+
 /*
  * Record a file extent. Do all the required works, such as inserting
  * file extent item, inserting extent item and backref item into extent
@@ -378,8 +395,7 @@ fail:
  */
 static int record_file_extent(struct btrfs_trans_handle *trans,
  struct btrfs_root *root, u64 objectid,
- struct btrfs_inode_item *inode,
- u64 file_pos, u64 disk_bytenr,
+ u64 *inode_nbytes, u64 file_pos, u64 disk_bytenr,
  u64 num_bytes, int checksum)
 {
int ret;
@@ -391,7 +407,6 @@ static int record_file_extent(struct btrfs_trans_handle *trans,
struct btrfs_path path;
struct btrfs_extent_item *ei;
 	u32 blocksize = root->sectorsize;
-   u64 nbytes;
 
if (disk_bytenr == 0) {
ret = btrfs_insert_file_extent(trans, root, objectid,
@@ -450,8 +465,7 @@ static int record_file_extent(struct btrfs_trans_handle *trans,
btrfs_set_file_extent_other_encoding(leaf, fi, 0);
btrfs_mark_buffer_dirty(leaf);
 
-   nbytes = btrfs_stack_inode_nbytes(inode) + num_bytes;
-   btrfs_set_stack_inode_nbytes(inode, nbytes);
+   *inode_nbytes += num_bytes;
 
 	btrfs_release_path(root, &path);
 
@@ -492,95 +506,355 @@ fail:
return ret;
 }
 
-static int record_file_blocks(struct btrfs_trans_handle *trans,
- struct btrfs_root *root, u64 objectid,
- struct btrfs_inode_item *inode,
- u64 file_block, u64 disk_block,
- u64 num_blocks, int checksum)
-{
-	u64 file_pos = file_block * root->sectorsize;
-	u64 disk_bytenr = disk_block * root->sectorsize;
-	u64 num_bytes = num_blocks * root->sectorsize;
-   return record_file_extent(trans, root, objectid, inode, file_pos,
- disk_bytenr, num_bytes, checksum);
-}
-
-struct blk_iterate_data {
+struct extent_iterate_data {
struct btrfs_trans_handle *trans;
struct btrfs_root *root;
-   struct btrfs_inode_item *inode;
+   u64 *inode_nbytes;
u64 objectid;
-   u64 first_block;
-   u64 disk_block;
-   u64 num_blocks;
-   u64 boundary;
-   int checksum;
-   int errcode;
+   int checksum, packing;
+   u64 last_file_off;
+   u64 total_size;
+   enum {EXTENT_ITERATE_TYPE_NONE, EXTENT_ITERATE_TYPE_MEM,
+ EXTENT_ITERATE_TYPE_DISK} type;
+   u64 size;
+   u64 file_off; /* always aligned to sectorsize */
+   char *data; /* for mem */
+   u64 disk_off; /* for disk */
 };
 
-static int block_iterate_proc(ext2_filsys ext2_fs,
- u64 disk_block, u64 file_block,
- struct blk_iterate_data *idata)
+static u64 extent_boundary(struct btrfs_root *root, u64 extent_start)
 {
-   int ret;
-   int sb_region;
-   int do_barrier;
-	struct btrfs_root *root = idata->root;
-	struct btrfs_trans_handle *trans = idata->trans;
-	struct btrfs_block_group_cache *cache;
-	u64 bytenr = disk_block * root->sectorsize;
-
-	sb_region = intersect_with_sb(bytenr, root->sectorsize);
-	do_barrier = sb_region || disk_block >= idata->boundary;
-	if ((idata->num_blocks > 0 && do_barrier) ||
-	    (file_block > idata->first_block + idata->num_blocks) ||
-	    (disk_block != idata->disk_block + idata->num_blocks)) {
-		if (idata->num_blocks > 0) {
-   ret = record_file_blocks(trans, root, 

[PATCH 3/4] btrfs-convert: permit support for non-ext2 FSs

2010-03-19 Thread Sean Bartell
Filesystems need to provide a function open_blah that fills a struct
convert_fs with some information and three function pointers.
The function pointers are:
- cache_free_extents, which takes a struct extent_io_tree and marks all
  extents not being used by the filesystem as DIRTY
- copy_inodes, which copies the contents of the filesystem into a
  btrfs_root using CoW.
- close
There's a void* in struct convert_fs for private use by the filesystem.

libblkid is used to determine the filesystem.
---
 Makefile  |2 +-
 convert.c |  184 +
 2 files changed, 126 insertions(+), 60 deletions(-)

diff --git a/Makefile b/Makefile
index 525676e..755cc24 100644
--- a/Makefile
+++ b/Makefile
@@ -75,7 +75,7 @@ quick-test: $(objects) quick-test.o
gcc $(CFLAGS) -o quick-test $(objects) quick-test.o $(LDFLAGS) $(LIBS)
 
 convert: $(objects) convert.o
-	gcc $(CFLAGS) -o btrfs-convert $(objects) convert.o -lext2fs $(LDFLAGS) $(LIBS)
+	gcc $(CFLAGS) -o btrfs-convert $(objects) convert.o -lext2fs -lblkid $(LDFLAGS) $(LIBS)
 
 ioctl-test: $(objects) ioctl-test.o
gcc $(CFLAGS) -o ioctl-test $(objects) ioctl-test.o $(LDFLAGS) $(LIBS)
diff --git a/convert.c b/convert.c
index bd91990..6dfcb97 100644
--- a/convert.c
+++ b/convert.c
@@ -31,6 +31,7 @@
 #include <unistd.h>
 #include <uuid/uuid.h>
 #include <linux/fs.h>
+#include <blkid/blkid.h>
 #include "kerncompat.h"
 #include "ctree.h"
 #include "disk-io.h"
@@ -42,9 +43,26 @@
 #include <ext2fs/ext2fs.h>
 #include <ext2fs/ext2_ext_attr.h>
 
+struct convert_fs {
+   u64 total_bytes;
+   u64 blocksize;
+   const char *label;
+
+   /* Close the FS */
+   int (*close)(struct convert_fs *fs);
+   /* Mark free extents as dirty */
+   int (*cache_free_extents)(struct convert_fs *fs,
+ struct extent_io_tree *tree);
+   /* Copy everything over */
+   int (*copy_inodes)(struct convert_fs *fs, struct btrfs_root *root,
+  int datacsum, int packing, int noxattr);
+
+   void *privdata;
+};
+
 #define INO_OFFSET (BTRFS_FIRST_FREE_OBJECTID - EXT2_ROOT_INO)
 #define STRIPE_LEN (64 * 1024)
-#define EXT2_IMAGE_SUBVOL_OBJECTID BTRFS_FIRST_FREE_OBJECTID
+#define ORIG_IMAGE_SUBVOL_OBJECTID BTRFS_FIRST_FREE_OBJECTID
 
 /*
  * Open Ext2fs in readonly mode, read block allocation bitmap and
@@ -89,15 +107,16 @@ fail:
return -1;
 }
 
-static int close_ext2fs(ext2_filsys fs)
+static int ext2_close(struct convert_fs *fs)
 {
-   ext2fs_close(fs);
+	ext2fs_close((ext2_filsys)fs->privdata);
return 0;
 }
 
-static int ext2_cache_free_extents(ext2_filsys ext2_fs,
+static int ext2_cache_free_extents(struct convert_fs *fs,
   struct extent_io_tree *free_tree)
 {
+	ext2_filsys ext2_fs = fs->privdata;
int ret = 0;
blk_t block;
u64 bytenr;
@@ -117,19 +136,18 @@ static int ext2_cache_free_extents(ext2_filsys ext2_fs,
 }
 
 /* mark btrfs-reserved blocks as used */
-static void adjust_free_extents(ext2_filsys ext2_fs,
+static void adjust_free_extents(struct convert_fs *fs,
 				struct extent_io_tree *free_tree)
 {
 	int i;
 	u64 bytenr;
-	u64 blocksize = ext2_fs->blocksize;
 
 	clear_extent_dirty(free_tree, 0, BTRFS_SUPER_INFO_OFFSET - 1, 0);
 
 	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
 		bytenr = btrfs_sb_offset(i);
 		bytenr &= ~((u64)STRIPE_LEN - 1);
-		if (bytenr >= blocksize * ext2_fs->super->s_blocks_count)
+		if (bytenr >= fs->total_bytes)
 			break;
 		clear_extent_dirty(free_tree, bytenr, bytenr + STRIPE_LEN - 1,
 				   0);
@@ -1373,9 +1391,10 @@ fail:
 /*
  * scan ext2's inode bitmap and copy all used inode.
  */
-static int copy_inodes(struct btrfs_root *root, ext2_filsys ext2_fs,
-  int datacsum, int packing, int noxattr)
+static int ext2_copy_inodes(struct convert_fs *fs, struct btrfs_root *root,
+   int datacsum, int packing, int noxattr)
 {
+	ext2_filsys ext2_fs = fs->privdata;
int ret;
errcode_t err;
ext2_inode_scan ext2_scan;
@@ -1426,8 +1445,8 @@ static int copy_inodes(struct btrfs_root *root, ext2_filsys ext2_fs,
 }
 
 /*
- * Construct a range of ext2fs image file.
- * scan block allocation bitmap, find all blocks used by the ext2fs
+ * Construct a range of the image file.
+ * scan block allocation bitmap, find all blocks used by the filesystem
  * in this range and create file extents that point to these blocks.
  *
  * Note: Before calling the function, no file extent points to blocks
@@ -1465,10 +1484,10 @@ static int create_image_file_range(struct btrfs_trans_handle *trans,
return ret;
 }
 /*
- * Create the ext2fs image file.
+ * Create the image file.
  */
-static int create_ext2_image(struct 

[PATCH 4/4] btrfs-convert: split into convert/.

2010-03-19 Thread Sean Bartell
 btrfs_trans_handle *trans;
-
-	trans = btrfs_start_transaction(root, 1);
-	if (!trans)
-		return -ENOMEM;
-	err = ext2fs_open_inode_scan(ext2_fs, 0, &ext2_scan);
-	if (err) {
-		fprintf(stderr, "ext2fs_open_inode_scan: %s\n",
-			error_message(err));
-		return -1;
-	}
-	while (!(err = ext2fs_get_next_inode(ext2_scan, &ext2_ino,
-					     &ext2_inode))) {
-		/* no more inodes */
-		if (ext2_ino == 0)
-			break;
-		/* skip special inode in ext2fs */
-		if (ext2_ino < EXT2_GOOD_OLD_FIRST_INO &&
-		    ext2_ino != EXT2_ROOT_INO)
-			continue;
-		objectid = ext2_ino + INO_OFFSET;
-		ret = copy_single_inode(trans, root,
-					objectid, ext2_fs, ext2_ino,
-					&ext2_inode, datacsum, packing,
-					noxattr);
-		if (ret)
-			return ret;
-		if (trans->blocks_used >= 4096) {
-			ret = btrfs_commit_transaction(trans, root);
-			BUG_ON(ret);
-			trans = btrfs_start_transaction(root, 1);
-			BUG_ON(!trans);
-		}
-	}
-	if (err) {
-		fprintf(stderr, "ext2fs_get_next_inode: %s\n",
-			error_message(err));
-		return -1;
-	}
-	ret = btrfs_commit_transaction(trans, root);
-	BUG_ON(ret);
-
-	return ret;
-}
 
 /*
  * Construct a range of the image file.
@@ -2586,26 +1805,6 @@ static int copy_dirtiness(struct extent_io_tree *out,
return 0;
 }
 
-int ext2_open(struct convert_fs *fs, const char *name)
-{
-	int ret;
-	ext2_filsys ext2_fs;
-	ret = open_ext2fs(name, &ext2_fs);
-	if (ret)
-		return ret;
-
-	fs->privdata = ext2_fs;
-	fs->blocksize = ext2_fs->blocksize;
-	fs->label = ext2_fs->super->s_volume_name;
-	fs->total_bytes = ext2_fs->super->s_blocks_count * fs->blocksize;
-
-	fs->cache_free_extents = ext2_cache_free_extents;
-	fs->close = ext2_close;
-	fs->copy_inodes = ext2_copy_inodes;
-
-	return 0;
-}
-
 static int open_fs(struct convert_fs *fs, const char *devname)
 {
static struct {
diff --git a/convert/convert.h b/convert/convert.h
new file mode 100644
index 000..4f31775
--- /dev/null
+++ b/convert/convert.h
@@ -0,0 +1,76 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ * Copyright (C) 2010 Sean Bartell.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#ifndef BTRFS_CONVERT_H
+#define BTRFS_CONVERT_H
+
+#include "ctree.h"
+#include "kerncompat.h"
+#include "transaction.h"
+
+struct convert_fs {
+   u64 total_bytes;
+   u64 blocksize;
+   const char *label;
+
+   /* Close the FS */
+   int (*close)(struct convert_fs *fs);
+   /* Mark free extents as dirty */
+   int (*cache_free_extents)(struct convert_fs *fs,
+ struct extent_io_tree *tree);
+   /* Copy everything over */
+   int (*copy_inodes)(struct convert_fs *fs, struct btrfs_root *root,
+  int datacsum, int packing, int noxattr);
+
+   void *privdata;
+};
+
+int ext2_open(struct convert_fs *fs, const char *name);
+
+struct extent_iterate_data {
+   struct btrfs_trans_handle *trans;
+   struct btrfs_root *root;
+   u64 *inode_nbytes;
+   u64 objectid;
+   int checksum, packing;
+   u64 last_file_off;
+   u64 total_size;
+   enum {EXTENT_ITERATE_TYPE_NONE, EXTENT_ITERATE_TYPE_MEM,
+ EXTENT_ITERATE_TYPE_DISK} type;
+   u64 size;
+   u64 file_off; /* always aligned to sectorsize */
+   char *data; /* for mem */
+   u64 disk_off; /* for disk */
+};
+
+int start_file_extents(struct extent_iterate_data *priv,
+  struct btrfs_trans_handle *trans,
+  struct btrfs_root *root, u64 *inode_nbytes,
+  u64 objectid, int checksum, int packing,
+  u64 total_size);
+int start_file_extents_range(struct extent_iterate_data *priv,
+struct btrfs_trans_handle *trans

Re: Creation time

2010-03-15 Thread Sean Bartell
There is room in btrfs for a fourth time called otime, but it is not
currently used or even initialized. Once there are APIs, it should be
possible to add crtime support with a slight format upgrade.

On Sun, Mar 14, 2010 at 02:55:12AM +0100, Hubert Kario wrote:
 From what I could find, btrfs supports only the trinity of UNIX time stamps: 
 atime, ctime and mtime.
 
 Is there any plan to support crtime (creation time)?
 
 Side note: ZFS already supports it, ext4 and cifs (samba) are waiting for 
 APIs 
 and userland support, so it could be a good time to coordinate efforts and 
 solidify the interface.