Re: btrfsck: unresolved ref root

2011-12-02 Thread Andrey Kuzmin
Just curious: would the observed behavior change if one modified the head in
any way between the snap and the umount?

Regards,
Andrey


 On 02.12.2011 13:28, Jan Schmidt list.bt...@jan-o-sch.net wrote:


 While hunting another bug, I got distracted by btrfsck error output,
 which is reproducible as simply as creating a snapshot. Either btrfsck
 is strange or subvolume snapshot does something wrong.

 # mkfs.btrfs /dev/sdo
 # mount /dev/sdo /mnt/scratch/
 # umount /mnt/scratch/
 # btrfsck /dev/sdo

 - everything ok

 # mount /dev/sdo /mnt/scratch/
 # btrfs subvol snap /mnt/scratch/ /mnt/scratch/snap1
 # umount /mnt/scratch
 # btrfsck /dev/sdo
 fs tree 257 refs 2
        unresolved ref root 257 dir 256 index 2 namelen 5 name snap1
 error 600
 [...]

 Tested with current for-linus and the most current btrfsck. I also have
 older filesystems with snapshots that I never ran btrfsck on before; they
 also show the unresolved ref error.

 From a quick look at the btrfsck code, this complaint means that btrfsck
 expects to find two BTRFS_ROOT_REF_KEY items and two BTRFS_ROOT_BACKREF_KEY
 items in the tree of tree roots. However, there's only one of each (as I
 would expect):

        item 4 key (FS_TREE ROOT_REF 257) itemoff 3238 itemsize 23
                root ref key dirid 256 sequence 2 name snap1
 ...
        item 12 key (257 ROOT_BACKREF 5) itemoff 2315 itemsize 23
                root backref key dirid 256 sequence 2 name snap1
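
 A guess at the check involved, reduced to its core. This is an illustrative
 sketch only, not the actual btrfsck code:

	/* btrfsck appears to compare the ref count recorded for an fs tree
	 * (here: "fs tree 257 refs 2") with the number of ROOT_REF /
	 * ROOT_BACKREF pairs it managed to resolve (here: one). */
	if (root->resolved_refs != root->refs)
		report_unresolved_ref(root);	/* "unresolved ref root ..." */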

 -Jan


Re: Quota Implementation

2011-06-03 Thread Andrey Kuzmin
On Fri, Jun 3, 2011 at 8:47 PM, Hugo Mills h...@carfax.org.uk wrote:
 On Fri, Jun 03, 2011 at 06:24:41PM +0200, Arne Jansen wrote:
 Hi,

 If no one is already working on it, I'd like to take the Quota lock and
 see how far I come.
 Let me sketch out in short what I'm planning to do:

  - Quota will be subvolume based. Only the FS-trees and data extents
    will be accounted.
  - Quota Groups can be defined. Every quota group can comprise any
    number of subvolumes. A subvolume can be assigned to any number
    of quota groups.
  - A Quota Group can account/limit the total amount of space that is
    referenced by it and/or the amount of space that is exclusively
    referenced (i.e. referenced by no other quota group).
  - With this it is possible to define a hierarchical quota that need
    not necessarily reflect the filesystem hierarchy.
  - It is also possible to decide for each snapshot whether it should be
    accounted to the parent group. So in a scenario where each
    subvolume reflects a user's home, it's possible to have some snapshots
    accounted to the user and others not (e.g. the ones needed for system
    backups).
  - Quota information will be stored in new records, possibly in a
    separate tree.
  - It should be possible to change the Quota config and group
    assignments online, though this might need a full re-scan of the fs.
  - It does NOT include any kind of user/group (UID/GID) quota.

 Any addenda or arguments why it's impossible or insane welcome.
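
 A hypothetical sketch of the state such a quota group might carry, per the
 design above; the names and layout are illustrative only, not an on-disk
 format:

	/* one record per quota group, possibly in a separate tree */
	struct quota_group {
		u64 referenced;  /* space referenced by any member subvolume */
		u64 exclusive;   /* space referenced by this group only      */
		u64 ref_limit;   /* 0 == unlimited                           */
		u64 excl_limit;
		/* plus the member subvolumes; a subvolume may belong to
		 * any number of groups */
	};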

   There's a problem in that in some cases, it's possible to get into
 a situation where you can't *delete* files because you're going over
 quota. If I have two subvolumes that share most of their data
 (e.g. one is a snapshot of the other), and both subvolumes have a
 limit under the "exclusive use" clause, then deleting material from
 subvolume A could cause subvolume B to go over quota.

    If users can create their own subvolumes, then using the exclusive
 use form is also pointless, because as a user, I can simply snapshot
 (or otherwise CoW copy) all my data into a snapshot, and I then don't
 pay for it. That one probably comes under "the admin shot himself in
 the foot", though.

    Getting out the bike-shed brush, I might suggest the use of some
 name other than "quota", because inevitably people will think of
 UID/GID-type quotas, and we've got enough confusingly-modified
 terminology already. "Size bounds"? "Storage bounds", possibly?

Budget :)?

Regards,
Andrey


   Hugo.

 --
 === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
         --- Is it true that last known good on Windows XP ---
                            boots into CP/M?



Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-25 Thread Andrey Kuzmin
On Fri, Mar 25, 2011 at 6:39 AM, Steven Rostedt rost...@goodmis.org wrote:
 On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote:
 Adaptive owner spinning used to be applied only to mutex_lock().  This
 patch applies it also to mutex_trylock().

 btrfs has developed custom locking to avoid excessive context switches
 in its btree implementation.  Generally, doing away with the custom
 implementation and just using the mutex shows better behavior;
 however, there's an interesting distinction in the custom implementation
 of trylock.  It distinguishes between simple trylock and tryspin,
 where the former just tries once and then fails while the latter does
 some spinning before giving up.

 Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
 just once.  I got curious whether using adaptive spinning on
 mutex_trylock() would be beneficial and it seems so, for btrfs anyway.

 The following results are from dbench 50 run on an opteron two
 socket eight core machine with 4GiB of memory and an OCZ vertex SSD.
 During the run, disk stays mostly idle and all CPUs are fully occupied
 and the difference in locking performance becomes quite visible.

 SIMPLE is with the locking simplification patch[1] applied.  i.e. it
 basically just uses mutex.  SPIN is with this patch applied on top -
 mutex_trylock() uses adaptive spinning.

         USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
  SIMPLE 61107  354977    217  8099529  845.100 MB/sec
  SPIN   63140  364888    214  6840527  879.077 MB/sec

 On various runs, the adaptive spinning trylock consistently posts
 higher throughput.  The amount of difference varies but it outperforms
 consistently.

 In general, using adaptive spinning on trylock makes sense as trylock
 failure usually leads to costly unlock-relock sequence.

 [1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658

 Signed-off-by: Tejun Heo t...@kernel.org

 I'm curious about the effects that this has on those places that do:

 again:
        mutex_lock(A);
        if (!mutex_trylock(B)) {
                mutex_unlock(A);
                goto again;
        }


 Where the normal locking order is:
  B -> A

 If another location does:

        mutex_lock(B);
        [...]
        mutex_lock(A);

 But if another process has A already, and is running, it may spin waiting
 for A, as A's owner is still running.

 But now, mutex_trylock(B) becomes a spinner too, and since B's owner
 is running (spinning on A) it will spin as well, waiting for B's owner to
 release it. Unfortunately, A's owner is also spinning waiting for B to
 be released.

 If both A and B's owners are real time tasks, then boom! deadlock.

Turning trylock into an indefinitely spinning one breaks its semantics,
so deadlock is to be expected. But what's wrong with this scenario if
trylock spins a bit before giving up?

Regards,
Andrey


 -- Steve



Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-25 Thread Andrey Kuzmin
On Fri, Mar 25, 2011 at 4:12 PM, Steven Rostedt rost...@goodmis.org wrote:
 On Fri, 2011-03-25 at 14:13 +0300, Andrey Kuzmin wrote:
 Turning trylock into an indefinitely spinning one breaks its semantics,
 so deadlock is to be expected. But what's wrong with this scenario if
 trylock spins a bit before giving up?

 Because that will cause this scenario to always spin that little bit
 longer, introducing latencies that did not exist before. Either the
 solution does not break this scenario, or it should not go in.

Broken semantics and extra latency are two separate issues. If the
former is fixed, the latter is easily handled by introducing a new
mutex_trylock_spin() call that lets one either stick to the existing
behavior (try/fail) or choose a new one where the latency penalty is
justified by the locking patterns.
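
A minimal sketch of what such a call might look like, assuming a bounded
spin while the owner stays on a CPU; MAX_TRYLOCK_SPINS and owner_running()
are made-up names here, not existing kernel API:

	int mutex_trylock_spin(struct mutex *lock)
	{
		unsigned int tries = MAX_TRYLOCK_SPINS;	/* assumed tunable */

		do {
			if (mutex_trylock(lock))
				return 1;	/* got the lock */
			cpu_relax();
		} while (owner_running(lock) && --tries);

		return 0;	/* bounded spin: give up without blocking */
	}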

Regards,
Andrey


 -- Steve





Re: [RFC] Tree fragmentation and prefetching

2011-03-23 Thread Andrey Kuzmin
On Wed, Mar 23, 2011 at 4:06 PM, Arne Jansen sensi...@gmx.net wrote:
 While looking into the performance of scrub I noticed that a significant
 amount of time is being used for loading the extent tree and the csum
 tree. While this is no surprise I did some prototyping on how to improve
 on it.
 The main idea is to load the tree (or parts of it) top-down, order the
 needed blocks and distribute them over all disks.
 To keep you interested, some results first.

 a) by tree enumeration with reada=2
   reading extent tree: 242s
   reading csum tree: 140s
   reading both trees: 324s

 b) prefetch prototype
   reading extent tree: 23.5s
   reading csum tree: 20.4s
   reading both trees: 25.7s

A 10x speed-up looks impressive indeed. Just to be sure, did I
get you right in that you attribute this effect specifically to
enumerating tree leaves in key order vs. disk order when these
two are not aligned?

Regards,
Andrey


 The test setup consists of a filesystem on 7 Seagate ES.2 1TB disks, filled
 28%. It is created with the current git tree + the round robin patch and
 filled with

 fs_mark -D 512 -t 16 -n 4096 -F -S0

 The 'normal' read is done by enumerating the leaves by btrfs_next_leaf()
 with path->reada=2. Both trees are being enumerated one after the other.
 The prototype currently just uses raw bios, does not make use of the
 page cache and does not enter the read pages into the cache. This will
 probably add some overhead. It also does not check the crcs.

 While it is very promising to implement it for scrub, I think a more
 general interface which can be used for every enumeration would be
 beneficial. Use cases that come to mind are rebalance, reflink, deletion
 of large files, listing of large directories etc..

 I'd imagine an interface along the lines of

 int btrfs_readahead_init(struct btrfs_reada_ctx *reada);
 int btrfs_readahead_add(struct btrfs_root *root,
                        struct btrfs_key *start,
                        struct btrfs_key *end,
                        struct btrfs_reada_ctx *reada);
 void btrfs_readahead_wait(struct btrfs_reada_ctx *reada);

 to trigger the readahead of parts of a tree. Multiple readahead
 requests can be given before waiting. This would enable the very
 beneficial folding seen above for 'reading both trees'.
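
 For instance, a sketch of how the proposed calls might be used to get
 exactly that folding (key setup and error handling omitted; the interface
 itself is still only a proposal):

	struct btrfs_reada_ctx reada;

	btrfs_readahead_init(&reada);
	btrfs_readahead_add(extent_root, &start_key, &end_key, &reada);
	btrfs_readahead_add(csum_root, &start_key, &end_key, &reada);
	btrfs_readahead_wait(&reada);	/* both trees fetched, I/O folded */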

 Also it would be possible to add a cascading readahead, where the
 content of leaves would trigger readaheads in other trees, maybe by
 giving a callback for the decisions what to read instead of the fixed
 start/end range.

 For the implementation I'd need an interface which I haven't been able
 to find yet. Currently I can trigger the read of several pages / tree
 blocks and wait for the completion of each of them. What I'd need would
 be an interface that gives me a callback on each completion or a waiting
 function that wakes up on each completion with the information which
 pages just completed.
 One way to achieve this would be to add a hook, but I'll gladly take any
 implementation hints.

 --
 Arne




Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-23 Thread Andrey Kuzmin
On Wed, Mar 23, 2011 at 6:48 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Wed, Mar 23, 2011 at 8:37 AM, Tejun Heo t...@kernel.org wrote:

 Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
 just once.  I got curious whether using adaptive spinning on
 mutex_trylock() would be beneficial and it seems so, at least for
 btrfs anyway.

 Hmm. Seems reasonable to me.

TAS/spin with exponential back-off has been the preferred locking approach
in Postgres (and, I believe, in other DBMSes) for years, at least since '04
when I last touched the Postgres code. Even with the 'false negative' cost
in user space being much higher than in the kernel, it's still just a
question of scale (no wonder a measurable improvement is reported here from
dbench on an SSD capable of a few dozen thousand IOPS).
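
For reference, a minimal user-space sketch of the TAS/spin-with-back-off
idea; this is a simplification in the spirit of the Postgres s_lock code,
not a copy of it:

	#include <stdatomic.h>
	#include <sched.h>

	void backoff_lock(atomic_flag *lock)
	{
		unsigned delay = 1;

		/* TAS: keep testing-and-setting until we win the flag */
		while (atomic_flag_test_and_set_explicit(lock,
							 memory_order_acquire)) {
			for (volatile unsigned i = 0; i < delay; i++)
				;			/* busy wait */
			if (delay < 1024)
				delay <<= 1;		/* exponential back-off */
			else
				sched_yield();		/* eventually yield the CPU */
		}
	}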

Regards,
Andrey

 The patch looks clean, although part of that is just the mutex_spin()
 cleanup that is independent of actually using it in trylock.

 So no objections from me.

                    Linus


Re: [RFC] Tree fragmentation and prefetching

2011-03-23 Thread Andrey Kuzmin
On Wed, Mar 23, 2011 at 11:28 PM, Arne Jansen sensi...@gmx.net wrote:
 On 23.03.2011 20:26, Andrey Kuzmin wrote:

 On Wed, Mar 23, 2011 at 4:06 PM, Arne Jansen sensi...@gmx.net wrote:

 While looking into the performance of scrub I noticed that a significant
 amount of time is being used for loading the extent tree and the csum
 tree. While this is no surprise I did some prototyping on how to improve
 on it.
 The main idea is to load the tree (or parts of it) top-down, order the
 needed blocks and distribute them over all disks.
 To keep you interested, some results first.

 a) by tree enumeration with reada=2
   reading extent tree: 242s
   reading csum tree: 140s
   reading both trees: 324s

 b) prefetch prototype
   reading extent tree: 23.5s
   reading csum tree: 20.4s
   reading both trees: 25.7s

 A 10x speed-up looks impressive indeed. Just to be sure, did I
 get you right in that you attribute this effect specifically to
 enumerating tree leaves in key order vs. disk order when these
 two are not aligned?

 Yes. Leaves and the intermediate nodes tend to be quite scattered
 around the disk with respect to their logical order.
 Reading them in logical (ascending/descending) order requires lots
 of seeks.

And the patch actually does on-the-fly defragmentation, right? Why
lose it then :)?

Regards,
Andrey



 Regards,
 Andrey


 The test setup consists of a filesystem on 7 Seagate ES.2 1TB disks, filled
 28%. It is created with the current git tree + the round robin patch and
 filled with

 fs_mark -D 512 -t 16 -n 4096 -F -S0

 The 'normal' read is done by enumerating the leaves by btrfs_next_leaf()
 with path->reada=2. Both trees are being enumerated one after the other.
 The prototype currently just uses raw bios, does not make use of the
 page cache and does not enter the read pages into the cache. This will
 probably add some overhead. It also does not check the crcs.

 While it is very promising to implement it for scrub, I think a more
 general interface which can be used for every enumeration would be
 beneficial. Use cases that come to mind are rebalance, reflink, deletion
 of large files, listing of large directories etc..

 I'd imagine an interface along the lines of

 int btrfs_readahead_init(struct btrfs_reada_ctx *reada);
 int btrfs_readahead_add(struct btrfs_root *root,
                        struct btrfs_key *start,
                        struct btrfs_key *end,
                        struct btrfs_reada_ctx *reada);
 void btrfs_readahead_wait(struct btrfs_reada_ctx *reada);

 to trigger the readahead of parts of a tree. Multiple readahead
 requests can be given before waiting. This would enable the very
 beneficial folding seen above for 'reading both trees'.

 Also it would be possible to add a cascading readahead, where the
 content of leaves would trigger readaheads in other trees, maybe by
 giving a callback for the decisions what to read instead of the fixed
 start/end range.

 For the implementation I'd need an interface which I haven't been able
 to find yet. Currently I can trigger the read of several pages / tree
 blocks and wait for the completion of each of them. What I'd need would
 be an interface that gives me a callback on each completion or a waiting
 function that wakes up on each completion with the information which
 pages just completed.
 One way to achieve this would be to add a hook, but I'll gladly take any
 implementation hints.

 --
 Arne




Re: [PATCH] Btrfs: check items for correctness as we search V3

2011-03-18 Thread Andrey Kuzmin
On Fri, Mar 18, 2011 at 3:52 AM, Chris Mason chris.ma...@oracle.com wrote:
 Excerpts from Andrey Kuzmin's message of 2011-03-17 15:12:32 -0400:
 On Thu, Mar 17, 2011 at 9:18 PM, Josef Bacik jo...@redhat.com wrote:
   Currently if we have corrupted items things will blow up in spectacular ways.
   So as we read in blocks and they are leaves, check the entire leaf to make sure
   all of the items are correct and point to valid parts in the leaf for the item
   data they are responsible for.  If the item is corrupt we will kick back EIO and
   not read any of the copies since they are likely to not be correct either.  This
   will catch generic corruptions, it will be up to the individual callers of
   btrfs_search_slot to make sure their items are right.  Thanks,
 
   diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
   index 495b1ac..9f31e11 100644
   --- a/fs/btrfs/disk-io.c
   +++ b/fs/btrfs/disk-io.c
   @@ -323,6 +323,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
          int num_copies = 0;
          int mirror_num = 0;

   +       clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
          io_tree = &BTRFS_I(root->fs_info->btree_inode)->io_tree;
          while (1) {
                  ret = read_extent_buffer_pages(io_tree, eb, start, 1,
   @@ -331,6 +332,14 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
                      !verify_parent_transid(io_tree, eb, parent_transid))
                          return ret;

   +               /*
   +                * This buffer's crc is fine, but its contents are corrupted, so
   +                * there is no reason to read the other copies, they won't be
   +                * any less wrong.
   +                */

  This sounds like an overstatement to me. You may be dealing with an
  error pattern the CRC failed to catch, so giving up on reading a mirror
  at this point seems premature.

 But we have no way to tell which one is more correct, at least not
 without a full fsck.

Voting with two participants is naturally deficient (it would be better to
have at least three, though theory says even that is insufficient in the
presence of failures :)), so you are right in general, except for one
particular case: when the 2nd copy passes CRC _and_ verification, and the
two copies differ by a bit pattern undetectable by the CRC in use.

This is a corner case, of course, but the price to pay for a false
positive (a full fsck with its associated downtime) is high enough to make
it worth a deeper dive.
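
A sketch of the alternative argued for here: keep trying mirrors, accepting
a copy only if it passes both checks. read_copy(), crc_ok() and leaf_ok()
are illustrative stand-ins, not the actual btrfs functions:

	for (mirror = 1; mirror <= num_copies; mirror++) {
		if (read_copy(eb, mirror))	/* I/O error: next mirror */
			continue;
		if (crc_ok(eb) && leaf_ok(eb))
			return 0;		/* found a good copy */
	}
	return -EIO;	/* all copies bad: give up, as the patch does */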

Regards,
Andrey


 -chris



Re: How to implement raid1 repair

2011-03-17 Thread Andrey Kuzmin
On Thu, Mar 17, 2011 at 8:42 PM, Chris Mason chris.ma...@oracle.com wrote:
 Excerpts from Jan Schmidt's message of 2011-03-17 13:37:54 -0400:
 On 03/17/2011 06:09 PM, Andrey Kuzmin wrote:
   On Thu, Mar 17, 2011 at 5:46 PM, Jan Schmidt list.bt...@jan-o-sch.net wrote:
      - Is it acceptable to retry reading a block immediately after the disk
      said it won't work? Or in case of a successful read followed by a
      checksum error? (Which is already being done right now in btrfs.)
 
 
   These are two pretty different cases. When the disk firmware fails a read,
   it means it has retried a number of times but gave up (suggesting a media
   error), so an upper-layer retry would hardly make sense. A checksum error
   catches an on-disk EDC fault, so a retry is, on the contrary, quite
   reasonable.

 Agreed.

      - Is it acceptable to always write both mirrors if one is found to be
      bad (also consider ssds)?
 
 
   Writing on the read path, bypassing the file-system transaction mechanism,
   doesn't seem a good idea to me. Just imagine losing power while overwriting
   the last good copy.

 Okay, sounds reasonable to me. Let's say we're bypassing the transaction
 mechanism in the same rude manner, but only write the bad mirror. Does
 that seem reasonable?

 The bad mirror is fair game.  Write away, as long as you're sure you're
 excluding nodatacow and you don't allow that block to get reallocated
 elsewhere.  You don't actually need to bypass the transaction
 mechanism, just those two things.

What happens if multiple readers (allowed by read lock) attempt an overwrite?


Regards,
Andrey



 -chris



Re: [PATCH] Btrfs: check items for correctness as we search V3

2011-03-17 Thread Andrey Kuzmin
On Thu, Mar 17, 2011 at 9:18 PM, Josef Bacik jo...@redhat.com wrote:
 Currently if we have corrupted items things will blow up in spectacular ways.
 So as we read in blocks and they are leaves, check the entire leaf to make sure
 all of the items are correct and point to valid parts in the leaf for the item
 data they are responsible for.  If the item is corrupt we will kick back EIO and
 not read any of the copies since they are likely to not be correct either.  This
 will catch generic corruptions, it will be up to the individual callers of
 btrfs_search_slot to make sure their items are right.  Thanks,

 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 495b1ac..9f31e11 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -323,6 +323,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
        int num_copies = 0;
        int mirror_num = 0;

 +       clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
        io_tree = &BTRFS_I(root->fs_info->btree_inode)->io_tree;
        while (1) {
                ret = read_extent_buffer_pages(io_tree, eb, start, 1,
 @@ -331,6 +332,14 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
                    !verify_parent_transid(io_tree, eb, parent_transid))
                        return ret;

 +               /*
 +                * This buffer's crc is fine, but its contents are corrupted, so
 +                * there is no reason to read the other copies, they won't be
 +                * any less wrong.
 +                */

This sounds like an overstatement to me. You may be dealing with an
error pattern the CRC failed to catch, so giving up on reading a mirror
at this point seems premature.

Regards,
Andrey


Re: Appending data to the middle of a file using btrfs-specific features

2010-12-06 Thread Andrey Kuzmin
On Mon, Dec 6, 2010 at 7:05 PM, Chris Mason chris.ma...@oracle.com wrote:
 Excerpts from Nirbheek Chauhan's message of 2010-12-06 07:41:16 -0500:
 Hello,

 I'd like to know if there has been any discussion about adding a new
 feature to write (add) data at an offset, but without overwriting
 existing data, or re-writing the existing data. Essentially, in-place
 addition/removal of data to a file at a place other than the end of
 the file.

 Some possible use-cases of such a feature would be:

 (a) Databases (currently hack around this by allocating sparse files)
 (b) Delta-patching (rsync, patch, xdelta, etc)
 (c) Video editors (especially if combined with reflink copies)

 Besides I/O savings, it would also have significant space savings if
 the current subvolume being written to has been snapshotted (a common
 use-case for incremental backups).

 I've been told that the problem is somewhat difficult to solve
 properly under block-based representation of data, but I was hoping
 that btrfs' reflink mechanism and its space-efficient packing of small
 files might make it doable.

 A hack I can think of is to do a BTRFS_IOC_CLONE_RANGE into a new file
 (up to the offset), writing whatever data is required, and then doing
 another BTRFS_IOC_CLONE_RANGE with an offset for the rest of the
 original file. This can be followed by a rename() over the original
 file. Similarly for removing data from the middle of a file. Would
 this work? Would it be cleaner to implement something equivalent
 internally?

 It would work yes.  The operation has three cases:

 1) file size doesn't change
 2) extend the file with new bytes in the middle
 3) make the file smaller removing bytes in the middle

 #1 is the easiest case, you can just use the clone range ioctl directly

This doesn't seem to be interesting; it looks just like a traditional COW
overwrite.


 For #2 and #3, all of the file pointers past the bytes you want to add
 or remove need to be updated with a new file offset.  I'd say for an
 initial implementation to use the IOC_CLONE_RANGE code, and after
 everything is working we can look at optimizing it with a shift ioctl if
 it makes sense.
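
 A minimal sketch of the clone-range hack for case #2 (inserting bytes in
 the middle): clone the head, write the new bytes, clone the tail at a
 shifted offset, then rename() over the original. The ioctl and args struct
 are real, though where their definitions live has varied (copy them from
 the btrfs sources if <linux/btrfs.h> is unavailable); the sketch assumes
 block-aligned offsets and lengths, which the ioctl requires, and omits
 error handling:

	#include <sys/ioctl.h>
	#include <unistd.h>
	#include <linux/btrfs.h>	/* BTRFS_IOC_CLONE_RANGE */

	static int insert_range(int src, int dst, __u64 file_len,
				__u64 off, const void *buf, __u64 len)
	{
		struct btrfs_ioctl_clone_range_args a;

		a.src_fd = src;		/* head: [0, off) stays in place */
		a.src_offset = 0;
		a.src_length = off;
		a.dest_offset = 0;
		if (ioctl(dst, BTRFS_IOC_CLONE_RANGE, &a))
			return -1;

		if (pwrite(dst, buf, len, off) < 0)	/* the new bytes */
			return -1;

		a.src_offset = off;	/* tail, shifted by len */
		a.src_length = file_len - off;
		a.dest_offset = off + len;
		return ioctl(dst, BTRFS_IOC_CLONE_RANGE, &a);
	}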

Not sure how btrfs implements versioned B-trees, but other
snapshot-capable file-systems I'm aware of utilize a DITTO B-tree entry
that says "for this range, consult the previous version tree". One can
imagine a DITTO(n) extension that would say "subtract n from the look-up
key, then consult the previous version tree", effectively achieving
range-shift behavior. FWIW.
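
A purely conceptual sketch of that DITTO(n) lookup redirection; every name
below is made up for illustration, and nothing of the sort exists in btrfs:

	struct ditto { u64 start, len, n; };	/* n == key shift */

	static struct item *lookup(struct tree *t, u64 key)
	{
		struct entry *e = tree_find(t, key);

		if (is_ditto(e))	/* redirect into the older version */
			return lookup(t->prev_version, key - ditto_of(e)->n);
		return item_of(e);
	}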

Regards,
Andrey



 Of the use cases you list, video editors seems the most useful.
 Databases already have things pretty much under control, and delta
 patching wants to go to a new file anyway.  Video editing software has
 long been looking for ways to do this.

 -chris


Re: Default to read-only on snapshot creation and have a flag if snapshot should be writable (was: [PATCH 0/5] btrfs: Readonly snapshots)

2010-11-30 Thread Andrey Kuzmin
In my opinion, the point is not the default snapshot creation mode but
rather the default usage, which equals the user's expectation.

On 11/30/10, Li Zefan l...@cn.fujitsu.com wrote:
 C Anthony Risinger wrote:
 On Nov 29, 2010, at 3:48 PM, Andrey Kuzmin andrey.v.kuz...@gmail.com
 wrote:

 I'm not sure why zfs came up, they don't own the term :). As to the
 zfs/overhead topic, I doubt there's any difference between a clone and a
 writable snapshot (there should be none, of course; it's just two
 different names for the same concept).

 Regards,
 Andrey




 On Tue, Nov 30, 2010 at 12:43 AM, Mike Fedyk mfe...@mikefedyk.com
 wrote:
 On Mon, Nov 29, 2010 at 1:31 PM, Andrey Kuzmin
 andrey.v.kuz...@gmail.com wrote:
 This may sound excessive, as with any new concept introduced this late
 in development, but readonly/writable snapshots could be further
 differentiated by naming the latter 'clones'. This way the end user would
 naturally perceive a snapshot as a read-only PIT fs image, while 'clone'
 would naturally refer to a (writable) head fork.

 I'm not sure we want to take all of the terminology that zfs uses, as
 it may bring the perceived drawbacks as well.  Isn't there some
 additional overhead for a zfs clone compared to a snapshot?  I'm not
 very familiar with zfs, so that's why I ask.


 I don't like the idea of readonly by default, or further changes to
 terminology, for several reasons:


 I quite agree with you. LVM2 also defaults to read/write for snapshots.

 - readonly by default offers no real enhancement whatsoever other than
 breaking _anything_ that's written right now

 This was the first thing that came to my mind.

 - btrfs readonly is not even really readonly; a superuser could
 simply flip a flag to enable writes, readonly merely prevents
 accidental writes or misbehaving apps... ie. protecting you from
 yourself
 - backups are the simple/obvious use case; I personally use btrfs
 heavily for LXC containers, in which case nearly every single snapshot
 is intended to be writable -- usually cloning a template into a new
 domain
 - I also use an initramfs hook to provide system rollbacks, also
 writable; the hook also provides multiple versions of the branch...
 all writable
 - adding new terms is not a good idea imo; I've already spewed out
 many sentences explaining the difference between subvolumes and
 snapshots, ie. that there is none... adding another term only adds to
 this problem; they each describe the same thing, but differentiate
 based on origin or current state, neither of which actually describe
 what it _is_ -- a new named pointer to a tree, like a git branch -- a
 subvolume.

 I think a better solution/compromise would be to leave snapshots
 writeable by default, since that's more true to what's happening
 internally anyway, but maybe introduce a mount option controlling the
 default action for that mount point.

 C Anthony [mobile]



-- 
Regards,
Andrey


Re: Default to read-only on snapshot creation and have a flag if snapshot should be writable (was: [PATCH 0/5] btrfs: Readonly snapshots)

2010-11-29 Thread Andrey Kuzmin
This may sound excessive, as with any new concept introduced this late in
development, but readonly/writable snapshots could be further
differentiated by naming the latter 'clones'. This way the end user would
naturally perceive a snapshot as a read-only PIT fs image, while 'clone'
would naturally refer to a (writable) head fork.

Regards,
Andrey




On Tue, Nov 30, 2010 at 12:08 AM, Mike Fedyk mfe...@mikefedyk.com wrote:
 On Mon, Nov 29, 2010 at 12:41 PM, David Arendt ad...@prnet.org wrote:
 On 11/29/10 21:02, Mike Fedyk wrote:

 On Mon, Nov 29, 2010 at 12:02 AM, Li Zefan l...@cn.fujitsu.com wrote:

 (Cc: Sage Weil s...@newdream.net for changes in async snapshots)

 This patchset adds readonly-snapshots support. You can create a
 readonly snapshot, and you can also set a snapshot readonly/writable
 on the fly.

 A few readonly checks are added in setattr, permission, remove_xattr
 and set_xattr callbacks, as well as in some ioctls.

 Great work!

 I have a suggestion on defaults when snapshots are created.  I think
 they should default to being read-only and if they are meant to be
 read-write a flag can be set at creation time (and changable at a
 later time as well of course).

 This way user/admin preconceptions of a snapshot being read-only can
 be enforced by default, and the exception when you want a read-write
 snapshot can be available with a switch at the cli level (and probably
 a flag at the ioctl level).

 It gives one more natural distinction between a snapshot and a
 subvolume at the user conceptual level.

 What do you think?

 I completely agree with you. I think lots of people use snapshots for backup
 purposes and these ones shouldn't be writable.

  by default.


Re: snapshots of directories

2010-01-13 Thread Andrey Kuzmin
Did I get you right in that btrfs does not support snapshots of an
arbitrary directory?

Regards,
Andrey




On Tue, Jan 12, 2010 at 5:19 AM, TARUISI Hiroaki
taruishi.hir...@jp.fujitsu.com wrote:
 In btrfs, a snapshot is a clone of a subvolume, not of an arbitrary
 directory.
 You specified the '/root' directory, and since it is not a subvolume,
 the snapshot was created for the parent subvolume, the root of the
 filesystem.

 Regards,
 taruisi

 (2010/01/12 11:12), Michael Niederle wrote:
 I try to take a snapshot of a single directory, e.g. root:

 btrfsctl -s root.2010-01-12 /root
 operation complete
 Btrfs v0.19-4-gab8fb4c-dirty

 Then I take look what's inside the newly created snapshot:

 ls -l /root.2010-01-12/
 total 0
 drwxr-xr-x 1 root root 1192 2010-01-03 20:32:12 bin
 drwxr-xr-x 1 root root   76 2009-06-25  0:40:35 boot
 drwxr-xr-x 1 root root 1756 2010-01-12  2:33:07 cmds
 drwxr-xr-x 1 root root    0 2010-01-06 12:21:46 data
 drwxr-xr-x 1 root root 4356 2010-01-12  2:07:00 dev
 drwxr-xr-x 1 root root   42 2010-01-04 12:29:45 downloads
 drwxr-xr-x 1 root root 4528 2010-01-12  2:12:12 etc
 drwxr-xr-x 1 root root   52 2010-01-11 12:57:47 home
 drwxr-xr-x 1 root root    0 2007-11-10  4:44:07 initrd
 drwxr-xr-x 1 root root 4490 2010-01-05 20:15:53 lib
 drwxr-xr-x 1 root root  124 2008-04-27 14:53:39 mnt
 drwxr-xr-x 1 root root   62 2008-01-08  0:21:58 net
 drwxr-xr-x 1 root root    0 2008-04-09  3:19:16 objects
 drwxr-xr-x 1 root root  316 2009-12-28 23:23:13 opt
 dr-xr-xr-x 1 root root    0 2007-11-10  3:35:28 proc
 drwxr-xr-x 1 root root 7676 2010-01-11  0:35:41 root
 drwxr-xr-x 1 root root    0 2010-01-12  1:56:17 save
 drwxr-xr-x 1 root root    0 2010-01-12  1:55:58 save2
 drwxr-xr-x 1 root root 3804 2010-01-06  2:36:08 sbin
 drwxr-xr-x 1 root root    0 2007-11-10  3:35:28 sys
 drwxr-xr-x 1 root root  358 2010-01-11 18:44:29 tmp
 drwxr-xr-x 1 root root  176 2009-12-29 17:08:37 usr
 drwxr-xr-x 1 root root   72 2010-01-05 20:03:00 var

 It seems that a snapshot of the root is always taken instead of one of the
 specified directory. Is this by design?

 Snapshotting the root works fine, but if you take several snapshots it's a
 bit recursive, because every new snapshot contains all previous snapshots.

 Greetings, Michael


Re: committing new snapshots

2009-12-08 Thread Andrey Kuzmin
On Tue, Dec 8, 2009 at 7:05 PM, Josef Bacik jo...@redhat.com wrote:
 On Mon, Dec 07, 2009 at 02:25:50PM -0800, Sage Weil wrote:
 When you create a new snap or subvol, first a new ROOT_ITEM is created
 while everything commits, and then the referring directory entry is set up
 (with a corresponding ROOT_BACKREF).

 First, if you say 'btrfsctl -s foo .' and then 'reboot -f -n' before the
 next regularly scheduled commit, the snap is created, but lost.. there's
 no reference.  Second, the unreferenced ROOT_ITEM is never cleaned up.

 Are there any existing plans for this?  It would be nice if the reference
 could be committed as well the first time around.  That probably requires
 a bit of futzing to determine what the root objectid is going to be
 beforehand, then adding the link in the namespace, then flushing things
 out and updating the root item in the right order?


 We could probably use the orphan code for this.  Just create an orphan item
 for the snapshot and then delete it when the snapshot is fully created; that
 way, if somebody does reboot -fn, we clean up the root item and such.
 Thanks,
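
 In outline, the sequence being described might look as follows; the
 function names are illustrative, not the actual btrfs helpers:

	add_orphan_item(tree_root, objectid);	/* survives a crash         */
	create_root_item(objectid);		/* commit 1: snap exists    */
	add_dir_entry_and_backref(objectid);	/* commit 2: now referenced */
	del_orphan_item(tree_root, objectid);	/* fully created            */

 If the machine dies between the commits, orphan cleanup finds the item on
 the next mount and deletes the unreferenced ROOT_ITEM instead of leaking it.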


It would be nice to have atomic behavior. Perhaps something similar to
rename() with its atomicity guarantees could help?

Regards,
Andrey


 Josef


Re: UI issues around RAID1

2009-11-17 Thread Andrey Kuzmin
On Tue, Nov 17, 2009 at 6:25 PM, jim owens jow...@hp.com wrote:
 snip
 So we know the raw free blocks, but cannot guarantee
 how many raw blocks each new user write-block will
 consume, because we do not know what topology will be
 in effect for a new write.

 We could cheat and use worst-case topology numbers
 if all writes are the current default raid.  Of course
 this ignores DUP unless it is set on the whole filesystem.

 And we also have the problem of metadata - which is dynamic
 and allocated in large chunks and has a DUP type - how do we
 account for that in worst-case calculations?

 The worst-case is probably wrong but may be more useful in letting
 people know when they will run out of space. Or at least
 it might make some of our ENOSPC complaints go away :)

 Only raw and worst-case can be explained to users, and which one
 we report is up to Chris.  Today we report raw.

 After spending 10 years on a multi-volume filesystem that
 had (unsolvable) confusing df output, I'm just of the
 opinion that nothing we do will make everyone happy.

df is user-centric, and is therefore naturally expected to return
used/available _logical_ capacity (how this translates to used
physical space is up to file-system-specific tools to find
out/report). Returning raw capacity is counter-intuitive and causes
surprise similar to Roland's.

With topology configurable so flexibly, down to per-file, the only
option I see for df to return the logical capacity available is to compute
the latter off the file-system object for which df is invoked. For
instance, 'df /path/to/some/file' could return the logical capacity for
the mountpoint where that file resides, computed from the underlying
physical capacity available _and_ the topology for this file. 'df
/mount-point' would under this implementation return the available
logical capacity assuming the default topology for the referenced
file-system.
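
A hypothetical sketch of that computation; struct topology and its copies
field are illustrative only:

	/* logical space left for an object, given raw free space and the
	 * replication factor of the topology in effect for that object */
	static u64 logical_avail(u64 raw_avail, const struct topology *t)
	{
		return raw_avail / t->copies;	/* e.g. RAID1/DUP: copies == 2 */
	}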

As to used logical space accounting, this is file-system-specific and
I'm not yet familiar enough with the btrfs code base to argue for any
approach.

Regards,
Andrey

 But feel free to run a patch proposal by Chris.

 jim



Re: [PATCH] Snapshot/subvolume listing feature

2009-11-16 Thread Andrey Kuzmin
Just for clarity, getdents is exactly the other interface option
discussed a couple of weeks back (use virtual directories and the standard
file-system API).

Regards,
Andrey



On Mon, Nov 16, 2009 at 11:58 AM, TARUISI Hiroaki
taruishi.hir...@jp.fujitsu.com wrote:
 Thank you for your advice.

 I'm aware of the redundant search, but I didn't think of a
 getdents-like interface.

 I'll remake it without the redundant search.

 Regards,
 taruisi

 Yan, Zheng wrote:
 2009/11/16 TARUISI Hiroaki taruishi.hir...@jp.fujitsu.com:
 I made a snapshot/subvolume listing feature.

 This feature consists of two patches, one for the kernel (ioctl)
 and one for progs (btrfsctl). I will send these two patches as responses
 to this mail soon.

 A new option '-l' is introduced to btrfsctl for listing.

 If this option is specified, btrfsctl calls the new ioctl. The new ioctl
 searches the root tree and enumerates subtrees. For each subtree, the
 ioctl searches the directory path to the tree root, and enumerates
 more descendants until no more subtrees are found.

 MANPAGE-like option description and examples are as follows.

  OPTIONS
        -l _file_
                List all snapshot/subvolume directories under a tree
                which _file_ belongs to.

  EXAMPLES
        # btrfsctl -l /work/btrfs
    Base path = /work/btrfs/
    No.    Tree ID      Subvolume Relative Path
     1         256      ss1/
     2         257      ss2/
     3         258      svs1/ss1/
     4         259      svs1/ss2/
     5         260      svs2/ss1/
     6         261      svs2/ss2/
     7         262      ss3/
     8         263      ss4/
     9         264      sv_pool/
    10         265      sv_pool/ss01/
    11         266      sv_pool/ss02/
    12         267      sv_pool/ss03/
    13         268      sv_pool/ss04/
    14         269      sv_pool/ss05/
    15         270      sv_pool/ss06/
    16         271      sv_pool/ss07/
    17         272      sv_pool/ss08/
    18         273      sv_pool/ss09/
    19         274      sv_pool/ss10/
  operation complete
  Btrfs v0.19-9-gd67dad2


 Thank you for doing this.

 I had a quick look at the patches. It seems the ioctl returns the full path
 to each subvolume and uses a sequence ID to indicate the progress
 of the listing. Every time the ioctl is called, it tries building the full
 list of subvolumes, then skips entries that were already returned.  I think
 the API is suboptimal; a getdents-like API is better. (The ioctl would only
 list subvolumes within a given subvolume, and the user program would call
 the ioctl recursively to list all subvolumes.)

 Yan, Zheng


Re: [RFC] big fat transaction ioctl

2009-11-11 Thread Andrey Kuzmin
On Wed, Nov 11, 2009 at 6:03 PM, Chris Mason chris.ma...@oracle.com wrote:
 On Tue, Nov 10, 2009 at 02:13:10PM -0800, Sage Weil wrote:
 On Tue, 10 Nov 2009, Andrey Kuzmin wrote:

  On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil s...@newdream.net wrote:
   Hi all,
  
   This is an alternative approach to atomic user transactions for btrfs.
   The old start/end ioctls suffer from some basic limitations, namely
  
    - We can't properly reserve space ahead of time to avoid ENOSPC part
   way through the transaction, and
    - The process may die (seg fault, SIGKILL) part way through the
   transaction.  Currently when that happens the partial transaction will
   commit.
  
   This patch implements an ioctl that lets the application completely
   specify the entire transaction in a single syscall.  If the process gets
   killed or seg faults part way through, the entire transaction will still
   complete.
  
   The goal is to atomically commit updates to multiple files, xattrs,
   directories.  But this is still a file system: we don't get rollback if
   things go wrong.  Instead, do what we can up front to make sure things
   will work out.  And if things do go wrong, optionally prevent a partial
   result from reaching the disk.
 
   Why not snapshot the respective root (doesn't work if a transaction spans
   multiple file-systems, but this doesn't look like a real-world
   limitation), run the txn against that snapshot, and roll back on failure
   instead? Snapshots are writable, cheap, and this looks like a real
   transaction abort mechanism.

 Good question.  :)

 I hadn't looked into this before, but I think the snapshots could be used
 to achieve both atomicity and rollback.  If userspace uses an rw mutex to
 quiesce writes, it can make sure all transactions complete before creating
 a snapshot (commit).  The problem with this currently is the create
 snapshot ioctl is relatively slow... it calls commit_transaction, which
 blocks until everything reaches disk.  I think to perform well this
 approach would need a hook to start a commit and then return as soon as it
 can guarantee than any subsequent operation's start_transaction can't join
 in that commit.

 This may be a better way to go about this, though.  Does that sound
 reasonable, Chris?

 Yes, we could do this, but I don't think it will perform very well
 compared to your multi-operation ioctl.  It really does depend on how
 often you need to do atomic ops (my guess is very).

 Honestly you'll get better performance with a simple write-ahead log
 from userland:

Write-ahead logging is necessary anyway if the aim is to provide
transactional semantics to an application. But, at the same time, w/o
snapshot there is no synchronization between the log and file-system
state.

Regards,
Andrey


 step1: write redo log somewhere in the FS, with enough information to
 bring all the objects you're about to touch to a consistent state.
 step2: fsync the log
 step3: do your operations
 step4: append a record to the undo log that invalidates the last log
 op, or just truncate it to zero.
 step5: fsync the log.
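
 In code, the five steps come out to something like this minimal sketch
 (using the truncate variant of step 4; the record format and the apply
 step are application-defined):

	write(log_fd, redo_rec, redo_len);	/* step 1: redo information */
	fsync(log_fd);				/* step 2: log is durable   */
	apply_operations();			/* step 3: touch the FS     */
	ftruncate(log_fd, 0);			/* step 4: invalidate log   */
	fsync(log_fd);				/* step 5 */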

 The big advantage of the log is that you won't be tied to btrfs, but
 it's two fsyncs where the big transaction framework does none.  This
 should allow you to turn on the fast fsync log again, but I think the
 multi-operation ioctl would do that as well.

 -chris




Re: [RFC] big fat transaction ioctl

2009-11-11 Thread Andrey Kuzmin
On Wed, Nov 11, 2009 at 8:19 PM, Sage Weil s...@newdream.net wrote:
 On Wed, 11 Nov 2009, Chris Mason wrote:

 On Tue, Nov 10, 2009 at 02:13:10PM -0800, Sage Weil wrote:
  On Tue, 10 Nov 2009, Andrey Kuzmin wrote:
 
   On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil s...@newdream.net wrote:
Hi all,
   
This is an alternative approach to atomic user transactions for btrfs.
The old start/end ioctls suffer from some basic limitations, namely
   
 - We can't properly reserve space ahead of time to avoid ENOSPC part
way through the transaction, and
 - The process may die (seg fault, SIGKILL) part way through the
transaction.  Currently when that happens the partial transaction will
commit.
   
 This patch implements an ioctl that lets the application completely
 specify the entire transaction in a single syscall.  If the process gets
 killed or seg faults part way through, the entire transaction will still
 complete.
   
The goal is to atomically commit updates to multiple files, xattrs,
directories.  But this is still a file system: we don't get rollback if
things go wrong.  Instead, do what we can up front to make sure things
will work out.  And if things do go wrong, optionally prevent a partial
result from reaching the disk.
  
    Why not snapshot the respective root (doesn't work if a transaction spans
    multiple file-systems, but this doesn't look like a real-world
    limitation), run the txn against that snapshot, and roll back on failure
    instead? Snapshots are writable, cheap, and this looks like a real
    transaction abort mechanism.
 
  Good question.  :)
 
  I hadn't looked into this before, but I think the snapshots could be used
  to achieve both atomicity and rollback.  If userspace uses an rw mutex to
  quiesce writes, it can make sure all transactions complete before creating
  a snapshot (commit).  The problem with this currently is the create
  snapshot ioctl is relatively slow... it calls commit_transaction, which
  blocks until everything reaches disk.  I think to perform well this
  approach would need a hook to start a commit and then return as soon as it
  can guarantee than any subsequent operation's start_transaction can't join
  in that commit.
 
  This may be a better way to go about this, though.  Does that sound
  reasonable, Chris?

 Yes, we could do this, but I don't think it will perform very well
 compared to your multi-operation ioctl.  It really does depend on how
 often you need to do atomic ops (my guess is very).

 The thing is, I'm not sure using snaps is that different from what I'm
 doing now.  Currently the ioctl transactions don't hit disk until each
 full commit (flushoncommit, no fsync).  Unless the presence of a snapshot
 adds additional overhead (to the commit, or to cleaning up the slightly
 longer-living snapped roots), the difference would be that starting
 transactions would need to be blocked by the application instead of
 wait_current_trans in start_transaction, and (currently at least) they
 would wait longer (the extra writes between blocked = 0 and commit_done =
 1 in commit_transaction).

 The key, as now, is keeping the full fs syncs infrequent.  And, if
 possible, reducing the duration of the blocked == 1 period during
 commit_transaction.

It took me some time to associate you with the Ceph project and to recall
what Ceph is, so my original snapshot suggestion was out of context.
When put into the Ceph context, it looks too heavy-weight and may turn
out to be overkill. Chris's write-ahead logging idea looks much more
realistic for your use case.



 Honestly you'll get better performance with a simple write-ahead log
 from userland:

 There actually is a log, but it's optional and not strictly write-ahead...
 it's only used to reduce the commit latency:

 1- apply operations to fs (grouped into atomic transactions)
 2- (optionally) write and flush log entry
 ...repeat...
 3- periodically sync the fs, then trim the log.  or sync early if a
 client explicitly requests it.

 But

 1- I don't want to make the log required.  Sometimes you're more concerned
 about total throughput, not latency, and the log halves your write bw
 unless you add more spindles.

The log-induced latency penalty is the price for transactional consistency
:). The traditional mitigation recipe involves a low-latency log device
(NVRAM and, recently, SLC flash). Since you specifically target
distributed systems, you also have a distributed in-memory logging option.

Regards,
Andrey


 2- I don't want it strictly write-ahead because (in the absence of atomic
 ops) it means you have to wait for the log to sync before applying the ops
 to the fs (to ensure the fs doesn't get a partial transaction ahead of the
 log).  This marries atomicity with your schedule for durability, which
 isn't necessarily what you want.  (e.g., Ceph makes a distinction between
 serialized and commited ops, allowing limited sharing of data before it
 hits disk.  That's the nice

Re: snapshot-removal - timeline ?

2009-08-05 Thread Andrey Kuzmin
On Wed, Aug 5, 2009 at 3:18 PM, Roy Sigurd Karlsbakk r...@karlsbakk.net wrote:

 On 4. aug.. 2009, at 20.33, Chris Mason wrote:

 It's strange that such a small thing should be delayed so much. If
 snapshot removal was working, I'm quite sure we might get more users
 and thereby more stable code faster.

 It's a small feature but it gets deep into the difficult parts of the
 dentry cache to do it right.  So, it definitely isn't easy.


 I'd say it's a pretty elementary feature to be able to remove something
 you have created.

Snapshots are somewhat counter-intuitive in many respects: for
instance, one snapshot-capable file-system performs writes to a
dataset with snapshots _faster_ than to the same dataset w/o
snapshots. Snapshot removal is no exception - it's a bit more complex
than one would think.

Regards,
Andrey

 I know, you can remove the files and so on, but still, having a bunch of
 old and empty snapshots lying around is no good.



 roy
 --
 Roy Sigurd Karlsbakk
 (+47) 97542685
 r...@karlsbakk.net
 http://blogg.karlsbakk.net/
 --
 In all pedagogy it is essential that the curriculum be presented
 intelligibly. It is an elementary imperative for all pedagogues to avoid
 excessive use of idioms of foreign origin. In most cases adequate and
 relevant synonyms exist in Norwegian.



Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Andrey Kuzmin
On Mon, May 4, 2009 at 10:06 PM, Jan-Frode Myklebust janfr...@tanso.net wrote:
 Looking at the website content, it also revealed that VMware will have a
 similiar feature for their workhorse ,,esx server'' in the upcoming
 release, however my point still stands. Ship out a service pack for
 windows and you 1.5 Gbyte of modified data that is not deduped.

 All desktops that are linked to a master image can be patched or updated
  simply by updating the master image, without affecting users’ settings,
  data or applications.

Writable clone support in the file-system, coupled with a hierarchical
settings/data/apps layout, and you have what's described above.

Regards,
Andrey



  -jf



Re: Btrfs development plans

2009-04-20 Thread Andrey Kuzmin
On Mon, Apr 20, 2009 at 8:10 PM, Ahmed Kamal
email.ahmedka...@googlemail.com wrote:
  But now Oracle can re-license Solaris and merge ZFS with btrfs.
 Just kidding, I don't think it would be technically feasible.


 May I suggest the name ZbtrFS :)
 Sorry, couldn't resist. On a more serious note though, are there any
 technical benefits that justify continuing to push money into btrfs

Personally, I don't see any. Porting zfs to Linux will cost (quite)
some time and effort, but this is peanuts compared to what's needed to
get btrfs (no offense meant) to maturity level/feature parity with
zfs. The only things that could prevent this are CDDL licensing issues
and patent claims from NTAP over zfs snapshots and other features;
btrfs is free from both.

Regards,
Andrey



Re: Btrfs development plans

2009-04-20 Thread Andrey Kuzmin
On Mon, Apr 20, 2009 at 9:08 PM, Gregory Maxwell gmaxw...@gmail.com wrote:
 On Mon, Apr 20, 2009 at 12:57 PM, Andrey Kuzmin
 andrey.v.kuz...@gmail.com wrote:
 On Mon, Apr 20, 2009 at 8:10 PM, Ahmed Kamal
 email.ahmedka...@googlemail.com wrote:
  But now Oracle can re-license Solaris and merge ZFS with btrfs.
 Just kidding, I don't think it would be technically feasible.


 May I suggest the name ZbtrFS :)
 Sorry couldn't resist. On a more serious note though, is there any
 technical benefits that justify continuing to push money in btrfs

 Personally, I don't see any. Porting zfs to Linux will cost (quite)
 some time and effort, but this is peanuts compared to what's needed to
 get btrfs (no offense meant) to maturity level/feature parity with
 zfs. The only things that could prevent this are CDDL licensing issues
 and patent claims from NTAP over zfs snapshots and other features;
 btrfs is free from both.

 I'm sure that people with far more experience than I will comment—
 But considering that BTRFS is in the Linux Kernel today, the histories
 of other imported FSes (XFS),

Imported file-systems (someone more experienced may correct me if I'm
wrong) have previously been give-aways. This one is different - zfs is
in active development, with highly welcomed features like
de-duplication coming.

 and the state of ZFS in FreeBSD this may not be strictly true.

This was a one-man effort (though a heroic one, definitely), hardly a
case to compare with.


Regards,
Andrey


Re: Btrfs and raw zvol-like partition

2009-04-12 Thread Andrey Kuzmin
The zvol interface does not just 'export a raw device' but rather
implements a volume abstraction and integrates volume management into
the file-system.

Regards,
Andrey



On Sun, Apr 12, 2009 at 11:26 AM, Sébastien Wacquiez s...@enix.org wrote:
 Hi,

 A nice feature in ZFS is the ZVOL layer, which permits you to export
 (directly) a raw device from your zfs pool of discs, with the benefit of
 powerful (growing!) snapshots and easy raid management from zfs. It's
 particularly useful when you use it with virtual servers, allowing you to
 centralize all your backup problematics, etc.

 Does btrfs plan to support this kind of feature? (Please don't tell me
 that lvm does; lvm just sucks when you make a snapshot of your disk, and
 lacks the growing, commit, rollback, diff-send features.)

 Thanks !


 Sébastien Wacquiez

 PS: see http://opensolaris.org/os/community/zfs/source/zfstour.png if you
 don't know what zvol does.


Re: [patch] error handling of ERR_PTR() returns

2009-04-07 Thread Andrey Kuzmin
Since both a NULL ptr and an IS_ERR(ptr) are treated as errors, why not
redefine IS_ERR to handle both, simplifying callers' lives?
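
For what it's worth, the kernel's include/linux/err.h provides
IS_ERR_OR_NULL() for exactly this combined test, so a caller could write:

	em = lookup_extent_mapping(em_tree, start, len);
	if (IS_ERR_OR_NULL(em))
		goto out;	/* NULL and ERR_PTR handled alike */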

Regards,
Andrey



On Tue, Apr 7, 2009 at 5:38 PM, Dan Carpenter erro...@gmail.com wrote:
 There are a couple functions which return ERR_PTR as well as NULL.  The
 caller needs to handle both.

 Smatch also complains about the handling of alloc_extent_map() but as far
 as I can see that doesn't actually return an ERR_PTR.

 Compile tested on 2.6.29.

 regards,
 dan carpenter

 --- orig/fs/btrfs/disk-io.c     2009-04-07 16:15:36.0 +0300
 +++ devel/fs/btrfs/disk-io.c    2009-04-07 16:23:33.0 +0300
 @@ -123,7 +123,7 @@

        spin_lock(&em_tree->lock);
        em = lookup_extent_mapping(em_tree, start, len);
 -       if (em) {
 +       if (!IS_ERR(em) && em) {
                em->bdev =
                        BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
                spin_unlock(&em_tree->lock);
 @@ -1216,8 +1216,8 @@
        int ret;

        root = btrfs_read_fs_root_no_name(fs_info, location);
 -       if (!root)
 -               return NULL;
 +       if (!root || IS_ERR(root))
 +               return root;

        if (root->in_sysfs)
                return root;
 @@ -1324,7 +1324,7 @@
        spin_lock(&em_tree->lock);
        em = lookup_extent_mapping(em_tree, offset, PAGE_CACHE_SIZE);
        spin_unlock(&em_tree->lock);
 -       if (!em) {
 +       if (!em || IS_ERR(em)) {
                __unplug_io_fn(bdi, page);
                return;
        }