Re: Create subvolume from a directory?

2012-03-28 Thread Liu Bo
On 03/29/2012 12:54 AM, Goffredo Baroncelli wrote:
>>
> 
> Could you elaborate which would be the issue ?
> "cp --reflink"-ing a file is not different than snapshotting a file. In
> any case I could mount a snapshot and not the source subvolume.
> 
> 
> 

We already have a debate about this "cross-link device":
http://comments.gmane.org/gmane.comp.file-systems.btrfs/9864

"cp --reflink" will use clone feature, which can share data among files, but 
metadata is preserved individually.

My case is that I can mount both a subvolume and a snapshot via "-o subvol=xxx" 
or "-o subvolid=xxx".

thanks,
liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: "not enough space" with "data raid0"

2012-03-28 Thread Duncan
Alex posted on Wed, 28 Mar 2012 22:11:01 + as excerpted:

> Phillip Susi  ubuntu.com> writes:
> 
> 
>> So currently btrfs's concept of raid0 is "stripe across as many disks
>> as possible, with a minimum of 2 disks".  Is there any reason for that
>> minimum?  I don't see why it can't allow only one if that's the best it
>> can manage.
>> 
> That's called "Single". http://btrfs.ipv5.de/index.php?title=Mkfs.btrfs

It's called "single" if you deliberately create it that way, yes.  

However, Phillip's point was, I believe, that if a striped set can start 
with say five devices of varying size and drop one at a time as the 
smallest devices run out of space, there's no reason it should stop at 
two devices.  If the goal is maximum performance, it should insist on 
using only the space available in parallel across all devices and should 
refuse to drop even a single one.  If OTOH, the goal is maximum 
utilization of available space, with a secondary goal of performance if 
at all possible, then it should drop devices all the way down to a single 
device, not stopping at two.

Since btrfs apparently is already content to drop from say five devices 
to two despite the performance drop on the way, there's no fundamental 
reason it should stop there; it should continue all the way down to a 
single device, even tho when it gets to that point it's no longer 
actually striped.

Of course, that's defining the behavior I'd expect of a mature btrfs.  
Right now, it's still experimental, with many missing features.  If at 
this point there's and arbitrary line drawn at two devices minimum, 
that's just the way it is.  But I'd not expect that line to be there when 
the filesystem is declared mature and fully ready for use.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 17/19] btrfs: Convert to new freezing mechanism

2012-03-28 Thread Jan Kara
We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
freeze protection to btrfs_page_mkwrite(). Checks in cleaner_kthread() and
transaction_kthread() can be safely removed since btrfs_freeze() will lock
the mutexes and thus block the threads (and they shouldn't have anything to
do anyway).

CC: linux-btrfs@vger.kernel.org
CC: Chris Mason 
Signed-off-by: Jan Kara 
---
 fs/btrfs/disk-io.c |3 ---
 fs/btrfs/file.c|3 ++-
 fs/btrfs/inode.c   |6 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 811d9f9..fc0f74c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1586,8 +1586,6 @@ static int cleaner_kthread(void *arg)
struct btrfs_root *root = arg;
 
do {
-   vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
-
if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
mutex_trylock(&root->fs_info->cleaner_mutex)) {
btrfs_run_delayed_iputs(root);
@@ -1618,7 +1616,6 @@ static int transaction_kthread(void *arg)
 
do {
delay = HZ * 30;
-   vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
mutex_lock(&root->fs_info->transaction_kthread_mutex);
 
spin_lock(&root->fs_info->trans_lock);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 859ba2d..1aac7ca 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1348,7 +1348,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
ssize_t err = 0;
size_t count, ocount;
 
-   vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+   sb_start_write(inode->i_sb);
 
mutex_lock(&inode->i_mutex);
 
@@ -1439,6 +1439,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
num_written = err;
}
 out:
+   sb_end_write(inode->i_sb);
current->backing_dev_info = NULL;
return num_written ? num_written : err;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 32214fe..63c9006 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6405,6 +6405,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf)
u64 page_start;
u64 page_end;
 
+   sb_start_pagefault(inode->i_sb);
ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
if (!ret) {
ret = btrfs_update_time(vma->vm_file);
@@ -6495,12 +6496,15 @@ again:
unlock_extent_cached(io_tree, page_start, page_end, &cached_state, 
GFP_NOFS);
 
 out_unlock:
-   if (!ret)
+   if (!ret) {
+   sb_end_pagefault(inode->i_sb);
return VM_FAULT_LOCKED;
+   }
unlock_page(page);
 out:
btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
 out_noreserve:
+   sb_end_pagefault(inode->i_sb);
return ret;
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/19 v4] Fix filesystem freezing deadlocks

2012-03-28 Thread Jan Kara
  Hello,

  here is the fourth iteration of my patches to improve filesystem freezing.
Filesystem freezing is currently racy and thus we can end up with dirty data on
frozen filesystem (see changelog patch 06 for detailed race description). This
patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved filesystem
freeze handling in mnt_want_write() / mnt_drop_write(). This however required
some code shuffling and changes to kern_path_create() (see patches 02-05). I
think the result is OK but opinions may differ ;). The advantage of this change
also is that all filesystems get freeze protection almost for free - even ext2
can handle freezing well now.

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs) so I'm
reasonably confident this could finally be the right solution.

And for people wanting to test - this patchset is based on patch series
"Push file_update_time() into .page_mkwrite" so you'll need to pull that one
in as well.

Changes since v3:
  * added third level of freezing for fs internal purposes - hooked some
filesystems to use it (XFS, nilfs2)
  * removed racy i_size check from filemap_mkwrite()

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
m_active_trans counter

Honza

CC: Alex Elder 
CC: Anton Altaparmakov 
CC: Ben Myers 
CC: Chris Mason 
CC: cluster-de...@redhat.com
CC: "David S. Miller" 
CC: fuse-de...@lists.sourceforge.net
CC: "J. Bruce Fields" 
CC: Joel Becker 
CC: KONISHI Ryusuke 
CC: linux-btrfs@vger.kernel.org
CC: linux-e...@vger.kernel.org
CC: linux-...@vger.kernel.org
CC: linux-ni...@vger.kernel.org
CC: linux-ntfs-...@lists.sourceforge.net
CC: Mark Fasheh 
CC: Miklos Szeredi 
CC: ocfs2-de...@oss.oracle.com
CC: OGAWA Hirofumi 
CC: Steven Whitehouse 
CC: "Theodore Ts'o" 
CC: x...@oss.sgi.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/19] btrfs: Push mnt_want_write() outside of i_mutex

2012-03-28 Thread Jan Kara
When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.

CC: Chris Mason 
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara 
---
 fs/btrfs/ioctl.c |   23 +++
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 03bb62a..c855e55 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -192,6 +192,10 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
if (!inode_owner_or_capable(inode))
return -EACCES;
 
+   ret = mnt_want_write_file(file);
+   if (ret)
+   return ret;
+
mutex_lock(&inode->i_mutex);
 
ip_oldflags = ip->flags;
@@ -206,10 +210,6 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
}
}
 
-   ret = mnt_want_write_file(file);
-   if (ret)
-   goto out_unlock;
-
if (flags & FS_SYNC_FL)
ip->flags |= BTRFS_INODE_SYNC;
else
@@ -271,9 +271,9 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
inode->i_flags = i_oldflags;
}
 
-   mnt_drop_write_file(file);
  out_unlock:
mutex_unlock(&inode->i_mutex);
+   mnt_drop_write_file(file);
return ret;
 }
 
@@ -624,6 +624,10 @@ static noinline int btrfs_mksubvol(struct path *parent,
struct dentry *dentry;
int error;
 
+   error = mnt_want_write(parent->mnt);
+   if (error)
+   return error;
+
mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);
 
dentry = lookup_one_len(name, parent->dentry, namelen);
@@ -635,13 +639,9 @@ static noinline int btrfs_mksubvol(struct path *parent,
if (dentry->d_inode)
goto out_dput;
 
-   error = mnt_want_write(parent->mnt);
-   if (error)
-   goto out_dput;
-
error = btrfs_may_create(dir, dentry);
if (error)
-   goto out_drop_write;
+   goto out_dput;
 
down_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
 
@@ -659,12 +659,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
fsnotify_mkdir(dir, dentry);
 out_up_read:
up_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
-out_drop_write:
-   mnt_drop_write(parent->mnt);
 out_dput:
dput(dentry);
 out_unlock:
mutex_unlock(&dir->i_mutex);
+   mnt_drop_write(parent->mnt);
return error;
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: "not enough space" with "data raid0"

2012-03-28 Thread Alex
Phillip Susi  ubuntu.com> writes:

> 
> So currently btrfs's concept of raid0 is "stripe across as many disks as 
> possible, with a minimum of 2 disks".  Is there any reason for that 
> minimum?  I don't see why it can't allow only one if that's the best it 
> can manage.
> 
That's called "Single". http://btrfs.ipv5.de/index.php?title=Mkfs.btrfs



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Niels de Carpentier
> I'd like to see how they do that.  The fact is you are still going to get
> random
> seeks since you have to binary search the blocks in an entire row since
> there is
> no way you can read a several thousand block row into memory to search it,
> so
> once your rows get pretty big you are doing just as much if not more
> random io
> as a btree.

Why? The rows are sequential on disk, so you're never doing more random
seeks than the number of rows as long as you can search faster than the
disk can provide data. If they indeed can limit the number of searches per
row to a constant, it shouldn't be an issue at all.

> I don't buy that its better in the insert case either for similar reasons,
> you
> are talking about having to rewrite entire rows to maintain the sequential
> nature of everything, so if you end up adding a giant row you have to go
> and
> write the whole thing out again, so you are talking about a lot more IO
> than
> with b-trees, albeit sequential.  So maybe it's a win since it's
> sequential but
> I wonder at what tree size that benefit no longer exists.

The worst case might be bad, but on average it should be good. I don't
doubt it is always better on rotating disks, probably not on ssd's.
>
> Of course all this analysis is based on a power point presentation and
> there may
> be some detail I'm missing, but that's mostly my point, claiming that
> b-trees
> are now obsolete because we can maybe do inserts faster is just idiotic.

It's a presentation from a company, so lots of marketing. Of course it
doesn't make btrees obsolete, it's optimized for specific cases. Like they
say they reduce random seeks at the cost of disk bandwidth and cpu usage.
It depends on the use case if that is useful or not. Probably not for
btrfs, since the future will be ssd's, and meta data will generally be
cached.

Niels

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Josef Bacik
On Wed, Mar 28, 2012 at 10:44:03PM +0200, Niels de Carpentier wrote:
> 
> >
> > You are still going to have to have at least 29 levels to accomodate 1
> > billion
> > objects, though they won't all be full (sorry I missed the must be full or
> > empty
> > bit).  So it looks like we'll have to actually search what 13 rows right?
> > So
> > still more rows than a b-tree, and again you are talking about binary
> > searching
> > each row's blocks and then following its forward pointer down to the next
> > one,
> > so maybe not 100's of times slower than a b-tree, but we're not talking
> > about a
> > 5-10% difference either, probably still measured in multiples.
> 
> They say that they can do the block pointers in a way that limits the
> number for searches per row to a constant, so it's not that bad. They do
> less random seeks, but bigger ones at the cost of more cpu usage.

I'd like to see how they do that.  The fact is you are still going to get random
seeks since you have to binary search the blocks in an entire row since there is
no way you can read a several thousand block row into memory to search it, so
once your rows get pretty big you are doing just as much if not more random io
as a btree.

> >
> > Hah my math was a little off I'll grant you but the basic idea still
> > stands,
> > once we start getting into larger and larger rows we're going to have to
> > binary
> > search just to find the right _block_ to use, which in my mind is much
> > more of a
> > problem than having to binary search inside of multiple blocks.  At the
> > very
> > least as the number of objects grows the time it takes to find something
> > in the
> > tree increases exponentially.
> 
> That's not what they say. A lot of info is missing though.
> >
> >> There's a *ton* of detail that they don't describe about how to actually
> >> translate the logical notion of power-of-two node sizes into coherent
> >> block IO that scales.  I don't doubt that it's possible.
> >
> > I imagine there is, but based on what little information they've shown I
> > don't
> > see how it's a hands down win against b-trees.  If anything we're talking
> > about
> > having to solve really complex problems in order to get any sort of good
> > performance out of this thing.  Thanks,
> 
> They don't claim to win hands down for the search case, they say they are
> similar for the search case, but much better at inserts.
> 
> I don't think it's good for the linux fs general use case, but it's not as
> bad as you think. But it will be very hard to implement anyway unless they
> release more details.
>

I don't buy that its better in the insert case either for similar reasons, you
are talking about having to rewrite entire rows to maintain the sequential
nature of everything, so if you end up adding a giant row you have to go and
write the whole thing out again, so you are talking about a lot more IO than
with b-trees, albeit sequential.  So maybe it's a win since it's sequential but
I wonder at what tree size that benefit no longer exists.

Of course all this analysis is based on a power point presentation and there may
be some detail I'm missing, but that's mostly my point, claiming that b-trees
are now obsolete because we can maybe do inserts faster is just idiotic.
Thanks,

Josef 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Niels de Carpentier

>
> You are still going to have to have at least 29 levels to accomodate 1
> billion
> objects, though they won't all be full (sorry I missed the must be full or
> empty
> bit).  So it looks like we'll have to actually search what 13 rows right?
> So
> still more rows than a b-tree, and again you are talking about binary
> searching
> each row's blocks and then following its forward pointer down to the next
> one,
> so maybe not 100's of times slower than a b-tree, but we're not talking
> about a
> 5-10% difference either, probably still measured in multiples.

They say that they can do the block pointers in a way that limits the
number for searches per row to a constant, so it's not that bad. They do
less random seeks, but bigger ones at the cost of more cpu usage.
>
> Hah my math was a little off I'll grant you but the basic idea still
> stands,
> once we start getting into larger and larger rows we're going to have to
> binary
> search just to find the right _block_ to use, which in my mind is much
> more of a
> problem than having to binary search inside of multiple blocks.  At the
> very
> least as the number of objects grows the time it takes to find something
> in the
> tree increases exponentially.

That's not what they say. A lot of info is missing though.
>
>> There's a *ton* of detail that they don't describe about how to actually
>> translate the logical notion of power-of-two node sizes into coherent
>> block IO that scales.  I don't doubt that it's possible.
>
> I imagine there is, but based on what little information they've shown I
> don't
> see how it's a hands down win against b-trees.  If anything we're talking
> about
> having to solve really complex problems in order to get any sort of good
> performance out of this thing.  Thanks,

They don't claim to win hands down for the search case, they say they are
similar for the search case, but much better at inserts.

I don't think it's good for the linux fs general use case, but it's not as
bad as you think. But it will be very hard to implement anyway unless they
release more details.

Niels

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Zach Brown



I imagine there is, but based on what little information they've shown
I don't see how it's a hands down win against b-trees.  If anything
we're talking about having to solve really complex problems in order
to get any sort of good performance out of this thing.


Oh, absolutely.  Tack on COW and online repair and the complexity goes
through the roof.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Josef Bacik
On Wed, Mar 28, 2012 at 03:50:07PM -0400, Zach Brown wrote:
> 
> >but lets say O(log N/2) where N is the number of elements in the row.
> >So in the situation I describe you are looking at having to do minimum
> >of 29 reads, one for each row,
> 
> Hmm.
> 
> Levels are powers of two and are either full or empty.  So the total
> item count tells you which levels are full or empty.
> 
> (gdb) print/t 10
> $3 = 1110111001101011001010
> 
> So no, definitely not reading 29 levels.
>

You are still going to have to have at least 29 levels to accomodate 1 billion
objects, though they won't all be full (sorry I missed the must be full or empty
bit).  So it looks like we'll have to actually search what 13 rows right?  So
still more rows than a b-tree, and again you are talking about binary searching
each row's blocks and then following its forward pointer down to the next one,
so maybe not 100's of times slower than a b-tree, but we're not talking about a
5-10% difference either, probably still measured in multiples.

> After realizing this, I have to be honest: I'm not finding your
> hand-wavey dismissal of the technology convincing at all. :)
> 

Hah my math was a little off I'll grant you but the basic idea still stands,
once we start getting into larger and larger rows we're going to have to binary
search just to find the right _block_ to use, which in my mind is much more of a
problem than having to binary search inside of multiple blocks.  At the very
least as the number of objects grows the time it takes to find something in the
tree increases exponentially.

> There's a *ton* of detail that they don't describe about how to actually
> translate the logical notion of power-of-two node sizes into coherent
> block IO that scales.  I don't doubt that it's possible.

I imagine there is, but based on what little information they've shown I don't
see how it's a hands down win against b-trees.  If anything we're talking about
having to solve really complex problems in order to get any sort of good
performance out of this thing.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Zach Brown



but lets say O(log N/2) where N is the number of elements in the row.
So in the situation I describe you are looking at having to do minimum
of 29 reads, one for each row,


Hmm.

Levels are powers of two and are either full or empty.  So the total
item count tells you which levels are full or empty.

(gdb) print/t 10
$3 = 1110111001101011001010

So no, definitely not reading 29 levels.

After realizing this, I have to be honest: I'm not finding your
hand-wavey dismissal of the technology convincing at all. :)

There's a *ton* of detail that they don't describe about how to actually
translate the logical notion of power-of-two node sizes into coherent
block IO that scales.  I don't doubt that it's possible.

I very much doubt that the required complexity and cpu cost in the
kernel would be worth it for generic file system users, though.  Hooo
boy, do I doubt it.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Josef Bacik
On Wed, Mar 28, 2012 at 02:42:04PM -0400, Josef Bacik wrote:
> On Wed, Mar 28, 2012 at 02:25:39PM +, Danny Piccirillo wrote:
> > The case has been made on Phoronix for F-Trees: They makes use hard
> > drive speeds, not (relatively slow) access times; beat SSD's; and scale 
> > perfectly across multiple cores with hundreds of millions of entries.
> > 
> > http://en.wikipedia.org/wiki/TokuDB#Fractal_tree_indexes
> > 
> > How TokuDB Fractal Tree Databases Work
> > 
> > Via: http://www.phoronix.com/scan.php?page=news_item&px=MTA3NjM
> > 
> > Time for someone to get started on ftrfs? Or can it be implemented 
> > in Btrfs? 
> > https://bugzilla.kernel.org/show_bug.cgi?id=43004
> > 
> 
> This is dumber shit than usual out of phoronix.  I did some just basic
> calculations for worst case scenarios, let's say we max out btrfs's 8 level
> limit, so we have 7 levels of nodes and then our leaves.  Lets just for
> simplicity sake say we can fit 1 billion objects into this kind of tree, and 
> for
> each node/leaf we can only hold 10 objects.  (Keep in mind these are just 
> random
> numbers I'm pulling out of my ass and may not work out right with such a large
> tree, just work with me here).
> 
> So worst case scenario doing a binary search for an object with 10 objects is 
> 4
> comparisons, so with a full 8 level btree we're doing 4 comparisons per level,
> so 32 comparisons worst possible case to find our object.
> 
> Now let's consider a fractal tree with 1 billion objects.  So that would have 
> a
> 29 row fractal tree (if I did my math right).  Now I have to make some
> assumptions about how you would implement a fractal tree here, but let's 
> assume
> that every level tells you it's min and max value to make it easier to tell
> which level you need to search in.  So worst case you're object is in the 29th
> level, so you have to read 29 blocks in and check the min/max levels to figure
> out which row to search.  Let's be fair and say these are in memory, we're 
> still
> doing 29 comparisons right out of the gate to find the right row.  Now we have
> the right row, but this is a file system and the rows are split up into 
> blocks,
> so we not only have to binary search each block, but the blocks themselves to
> find the right block with our object.  And we can't do this directed either
> because we have no way of indexing the individual blocks, so worst case 
> scenario
> we're having to read in 268435456 blocks to then do a normal binary search on
> _each_ block!  Now lets again say just to give fractal trees an added benefit
> that each block has it's own min/max setting, we only have to binary search 
> the
> one that will have our actual data, but we're still talking about reading in
> 33554432 times the number of blocks as compared to a btree!!
> 

Ok looks like I wasn't being completely fair here, part of the presentation they
show talks about using forward pointers to point to where the element would be
in the next row.  So if we assume we're using forward pointers and every row has
a min/max indicator you can walk down each row and do a binary search to get as
close as possible to what you want and then follow the forward pointer down to
the next level.  The problem with this is that in the worst case you end up
having to binary search on each row, which makes the SMP case even crappier
because you have to make sure nothing changes while you are walking down the
rows and following the forward pointers.  The math here is a little trickier
since I can't easily picture what the worst case scenario would be per level,
but lets say O(log N/2) where N is the number of elements in the row.  So in the
situation I describe you are looking at having to do minimum of 29 reads, one
for each row, and then add into that all the reads you need to binary search
from your forward pointer on to the end of the row, which is going to increase
as you go down the tree.  So probably not millions times more work than b-trees,
but at least 10s if not 100s of times more work in the worst case.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Jeff Mahoney
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 03/28/2012 02:42 PM, Josef Bacik wrote:
> On Wed, Mar 28, 2012 at 02:25:39PM +, Danny Piccirillo wrote:
>> The case has been made on Phoronix for F-Trees: They makes use
>> hard drive speeds, not (relatively slow) access times; beat
>> SSD's; and scale perfectly across multiple cores with hundreds of
>> millions of entries.

> So no I don't think it's time for someone to get started on ftrfs.
> Thanks,

Thanks for doing the research to back up the answer I was going to
respond with based on intuition.

+1

- -Jeff

- -- 
Jeff Mahoney
SUSE Labs
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.18 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBAgAGBQJPc1xgAAoJEB57S2MheeWydJIP/R7jHiloaPLfwffuFS3gobmy
h0DqX8YezGMs70kLBYTXeu3EVuFtX8IFGJDGoVp3pkM9nS6F6iK9LbRWDC7IDAy7
t1taQoUqy0HMmmrkXfZ8KWmXWv424/Zq5aHfnegL/oq4OrUwnHtjzJBbsx4fvezW
mnPrNlOxaLWgXSyUs+1hDjvcgfmnRjpgURHfsfGNgX3gTUE4xNKCFXD4zs0bs5YO
NcpLa/66R5UwINPLOazHt9QpC2jHxPb0j93YZBipwtbtXPj1W+od/0rOgPvc9gaK
hzi7kRiegt50V2LVOJnofdaBC0AlCfRlcXRUNOil1vgFe6s8sft1KOdl8MShQ/+t
JUTjb1j7Z3efds42nnahvNj8wV4T6z6qLlhSQiOulYbVarBsfrWoM3EJY2ObzIlM
uOjCIPwHr/fzMJ9wVyLgK2ksssZmYAVgQ3YyS9G1CHIPe9xgW9Ld570mhhXRoLRj
HG5QZkfkmFzlde+S/gGgrdrvTWDT1zDsUhe+IGYu3jtPjXVs1udB+BN3c7rTWjf+
gQUOdcS/6XHoTXWpgZ7p5DUb8qF7Nv+ABBcEWeiaIEPH7uUZ22/XZEdf2euSCTp2
+TRZfoj9PZ17cszkllG4n+kc5H4gdKMYRyvNQo9mQ2TvogsmVOgIY+0Fys2ZdOkF
CaZ6ti9ZLkofawzImOV5
=UaEh
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Josef Bacik
On Wed, Mar 28, 2012 at 02:25:39PM +, Danny Piccirillo wrote:
> The case has been made on Phoronix for F-Trees: They makes use hard
> drive speeds, not (relatively slow) access times; beat SSD's; and scale 
> perfectly across multiple cores with hundreds of millions of entries.
> 
> http://en.wikipedia.org/wiki/TokuDB#Fractal_tree_indexes
> 
> How TokuDB Fractal Tree Databases Work
> 
> Via: http://www.phoronix.com/scan.php?page=news_item&px=MTA3NjM
> 
> Time for someone to get started on ftrfs? Or can it be implemented 
> in Btrfs? 
> https://bugzilla.kernel.org/show_bug.cgi?id=43004
> 

This is dumber shit than usual out of phoronix.  I did some just basic
calculations for worst case scenarios, let's say we max out btrfs's 8 level
limit, so we have 7 levels of nodes and then our leaves.  Lets just for
simplicity sake say we can fit 1 billion objects into this kind of tree, and for
each node/leaf we can only hold 10 objects.  (Keep in mind these are just random
numbers I'm pulling out of my ass and may not work out right with such a large
tree, just work with me here).

So worst case scenario doing a binary search for an object with 10 objects is 4
comparisons, so with a full 8 level btree we're doing 4 comparisons per level,
so 32 comparisons worst possible case to find our object.

Now let's consider a fractal tree with 1 billion objects.  So that would have a
29 row fractal tree (if I did my math right).  Now I have to make some
assumptions about how you would implement a fractal tree here, but let's assume
that every level tells you it's min and max value to make it easier to tell
which level you need to search in.  So worst case you're object is in the 29th
level, so you have to read 29 blocks in and check the min/max levels to figure
out which row to search.  Let's be fair and say these are in memory, we're still
doing 29 comparisons right out of the gate to find the right row.  Now we have
the right row, but this is a file system and the rows are split up into blocks,
so we not only have to binary search each block, but the blocks themselves to
find the right block with our object.  And we can't do this directed either
because we have no way of indexing the individual blocks, so worst case scenario
we're having to read in 268435456 blocks to then do a normal binary search on
_each_ block!  Now lets again say just to give fractal trees an added benefit
that each block has it's own min/max setting, we only have to binary search the
one that will have our actual data, but we're still talking about reading in
33554432 times the number of blocks as compared to a btree!!

And this doesn't even begin to touch the ability to handle multi-threaded
workloads.  Imagine having to move all of these blocks around in memory because
your insert created a new row.  You'd have to lock each row as you move stuff
around (at the very least), so if you hit a particularly hot row your
multithreaded performance is going to look like you are running on an Atari
2600.

So no I don't think it's time for someone to get started on ftrfs.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: "not enough space" with "data raid0"

2012-03-28 Thread Phillip Susi

On 3/17/2012 8:19 AM, Hugo Mills wrote:

Where is the problem, how can I use the full space?


You can't. btrfs requires RAID-0 to be at least two devices wide
(otherwise it's not striped at all, which is the point of RAID-0). If
you want to use the full capacity of both disks and don't care about
the performance gain from striping, use -d single (which is the
default). If you do care about the performance gain from striping,
then you're going to have to lose some usable space.


So currently btrfs's concept of raid0 is "stripe across as many disks as 
possible, with a minimum of 2 disks".  Is there any reason for that 
minimum?  I don't see why it can't allow only one if that's the best it 
can manage.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs-progs: enforce block count on all devices in mkfs

2012-03-28 Thread Josef Bacik
I had a test that creates a 7gig raid1 device but it was ending up wonky
because the second device that gets added is the full size of the disk
instead of the limited size.  So enforce the limited size on all disks
passed in at mkfs time, otherwise our threshold calculations end up wonky
when doing chunk allocations.  Thanks,

Signed-off-by: Josef Bacik 
---
 mkfs.c  |1 +
 utils.c |2 ++
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mkfs.c b/mkfs.c
index c531ef2..6ae412f 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1404,6 +1404,7 @@ int main(int ac, char **av)
close(fd);
continue;
}
+   dev_block_count = block_count;
ret = btrfs_prepare_device(fd, file, zero_end,
   &dev_block_count, &mixed);
mixed = old_mixed;
diff --git a/utils.c b/utils.c
index ee7fa1b..552233b 100644
--- a/utils.c
+++ b/utils.c
@@ -555,6 +555,8 @@ int btrfs_prepare_device(int fd, char *file, int zero_end, 
u64 *block_count_ret,
fprintf(stderr, "unable to find %s size\n", file);
exit(1);
}
+   if (*block_count_ret)
+   block_count = min(block_count, *block_count_ret);
zero_end = 1;
 
if (block_count < 1024 * 1024 * 1024 && !(*mixed)) {
-- 
1.7.5.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Create subvolume from a directory?

2012-03-28 Thread Goffredo Baroncelli


Hello Liu

On 03/28/2012 04:18 AM, Liu Bo wrote:

On 03/28/2012 06:24 AM, Matthias G. Eckermann wrote:



# time cp -a --reflink /var/lib/lxc/installserver_tmp/rootfs 
/var/lib/lxc/installserver



This is too much weird.

AFAIK, clone between different subvolumes should be forbidden.
So this would get a "Invalid cross-device link", because an individual 
subvolume can be mounted directly.


Could you elaborate which would be the issue ?
"cp --reflink"-ing a file is not different than snapshotting a file. In 
any case I could mount a snapshot and not the source subvolume.






thanks,
liubo


real0m1.367s
user0m0.148s
sys 0m1.108s

## Now remove /var/lib/lxc/installserver_tmp (or not)

--<  snap>--

Just to compare this with a "mv":

--<  snip>--

## Go back to former state

# btrfs subvol delete /var/lib/lxc/installserver
Delete subvolume '/var/lib/lxc/installserver'

# btrfs subvol create /var/lib/lxc/installserver
Create subvolume '/var/lib/lxc/installserver'

# time mv /var/lib/lxc/installserver_tmp/rootfs /var/lib/lxc/installserver/

real0m12.917s
user0m0.208s
sys 0m2.508s

--<  snap>--

While the time measurement might be flawed due to the subvol
actions inbetween, caching etc.: I tried several times, and
"cp --reflinks" always is multiple times faster than "mv" in
my environment.

Or did I misunderstand your question?

so long -
MgE



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
.



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fractal Tree Indexing over B-Trees?

2012-03-28 Thread C Anthony Risinger
On Wed, Mar 28, 2012 at 9:25 AM, Danny Piccirillo
 wrote:
> The case has been made on Phoronix for F-Trees: They makes use hard
> drive speeds, not (relatively slow) access times; beat SSD's; and scale
> perfectly across multiple cores with hundreds of millions of entries.
>
> http://en.wikipedia.org/wiki/TokuDB#Fractal_tree_indexes
>
> How TokuDB Fractal Tree Databases Work
>
> Via: http://www.phoronix.com/scan.php?page=news_item&px=MTA3NjM
>
> Time for someone to get started on ftrfs? Or can it be implemented
> in Btrfs?
> https://bugzilla.kernel.org/show_bug.cgi?id=43004

whoa, very cool stuff.  fractals are awesome, cool to see them in use.

... 2010/11, surprised i never heard of it before now.  thanks for the
reference/links at the very least!

aside: i once described fractals to my grandmother (100% devout
catholic) as related to my own understanding of the universe --
specifically, i pointed out how their often simple mathematical
identity fragments into an infinitely self-similar pattern of
seemingly unbounded complexity -- she told me i was describing god ...
and, well, we agreed :-)

-- 

C Anthony
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Fractal Tree Indexing over B-Trees?

2012-03-28 Thread Danny Piccirillo
The case has been made on Phoronix for F-Trees: They makes use hard
drive speeds, not (relatively slow) access times; beat SSD's; and scale 
perfectly across multiple cores with hundreds of millions of entries.

http://en.wikipedia.org/wiki/TokuDB#Fractal_tree_indexes

How TokuDB Fractal Tree Databases Work

Via: http://www.phoronix.com/scan.php?page=news_item&px=MTA3NjM

Time for someone to get started on ftrfs? Or can it be implemented 
in Btrfs? 
https://bugzilla.kernel.org/show_bug.cgi?id=43004

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Create subvolume from a directory?

2012-03-28 Thread Alex
David Sterba  jikos.cz> writes:

> 
> On Tue, Mar 27, 2012 at 05:19:06PM +, Alex wrote:
> > Can I create/convert a existing (btrfs) directory into a subvolume?
> 
> For the reference there's an entry at the wiki:
> 
>
http://btrfs.ipv5.de/index.php?title=UseCases#Can_I_take_a_snapshot_of_a_directory.3F
> 

Thank you all for your replies.

Kind regards.
(happily experimenting with btrfs in a 3.3 kernel!)

Al.




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Honest timeline for btrfsck

2012-03-28 Thread Duncan
Danny Piccirillo posted on Wed, 28 Mar 2012 06:15:41 + as excerpted:

> Chris Mason  oracle.com> writes:
>> 
>> People have already started picking up #4, fujitsu had some patches in
>> this direction that we'll keep developing with.
>> 
>> I stepped back to add some directory metadata fixups as well to the
>> basic fsck tool.  I had thought I could easily do it all from the
>> kernel, but sometimes the userland side really is easier.
> 
> 
> Phoronix reported on your talk at the SCALE 10x conference here
> http://www.phoronix.com/scan.php?page=news_item&px=MTA0Njk
> 
> 
> What happened to the February 14th deadline? Is another rewrite
> underway? I think the biggest thing that can be done in lieu of the
> release is really good communication. There's no feed to get up-to-date
> info on what's being worked on, what's left, etc.

I too read the phoronix article, and in fact mentioned it here back about 
Feb 10 or so...

Actually, a somewhat-working repair-writing btrfsck does exist and is in 
fact "public"... for some value of "public", at least.  It's still sort 
of like the plans for the pan-galactic bypass in hitchhiker's guide to 
the galaxy, down several flights of stairs behind a door that says 
"beware of leopard", in that it's only available in Chris's 
"dangerdonteveruse" git branch, but it *IS* publicly available... as I 
said for /some/ value of "public" anyway.

Here's the issue, tho.  At present, it's pretty much a "well, it may fix 
some problems, but then again, it might wreak further havoc on an already 
damaged filesystem, destroying any reasonable chance of retrieving valid 
data off it, too!" type of situation.  That's why it's still stuck in 
"dangerdonteveruse".

But if you read the article well, that's all that was really promised 
anyway, at this point.  That was a deadline for Oracle getting it for 
testing and QA, etc.  It was *NOT* a "release quality" deadline, or even 
a "this matches the experimental state of the filesystem" deadline, in 
any shape or form.

So Oracle and others are doing some seriously focused QA and bug fixes on 
the thing at present.  The idea is that they can release it along with a 
kernel with their own patches, and support that.  It's worth noting, 
however, that if they're only supporting their own kernel, they can, if 
necessary, disable certain already available features of the filesystem 
in their kernel, AND in the btrfsck, so as to be able to support what's 
left.  After all, the filesystem itself is still officially experimental 
and in disk-format-flux, as well as lacking various targeted features 
ATM, and it's only a small stretch to go from there to disabling a 
feature or two already in mainline, if necessary to be able to be 
comfortable supporting the rest.


Bottom line:  Purely as a simple Linux user following this list for 
perhaps a couple months now, I'd expect btrfs to continue maturing and 
that it'll probably getting toward somewhat stable toward the end of the 
year... depending on how conservative you are, of course (there's many 
that are barely warming up to ext4 at this point).  Which would put it in 
line for the spring of next year (2013) distro releases.  Until then, I'd 
expect the current label, experimental, fit only for testing, not for 
storage of data you expect to be able to reliably retrieve, to continue 
to apply, tho to a lessor degree as it gradually matures.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Create subvolume from a directory?

2012-03-28 Thread David Sterba
On Tue, Mar 27, 2012 at 05:19:06PM +, Alex wrote:
> Can I create/convert a existing (btrfs) directory into a subvolume?

For the reference there's an entry at the wiki:

http://btrfs.ipv5.de/index.php?title=UseCases#Can_I_take_a_snapshot_of_a_directory.3F

as noted in the thread, this relies on the unmerged cross-subvolume
reflink patch.


david
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Create subvolume from a directory?

2012-03-28 Thread David Sterba
On Wed, Mar 28, 2012 at 08:46:13AM +0700, Fajar A. Nugraha wrote:
> On Wed, Mar 28, 2012 at 5:24 AM, Matthias G. Eckermann  wrote:
> > While the time measurement might be flawed due to the subvol
> > actions inbetween, caching etc.: I tried several times, and
> > "cp --reflinks" always is multiple times faster than "mv" in
> > my environment.
> 
> So this is cross-subvolume reflinks? I thought the code for that
> wasn't merged yet?

Matthias was using SLES kernel, the cross-subvolume reflink patch is
present there, he was not aware it's not merged upstream.


david
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html