Re: Create subvolume from a directory?
On 03/29/2012 12:54 AM, Goffredo Baroncelli wrote:
> Could you elaborate which would be the issue?
>
> "cp --reflink"-ing a file is no different than snapshotting a file. In
> any case I could mount a snapshot and not the source subvolume.

We already had a debate about this "cross-link device":
http://comments.gmane.org/gmane.comp.file-systems.btrfs/9864

"cp --reflink" will use the clone feature, which can share data among
files, but metadata is preserved individually for each file. My case is
that I can mount both a subvolume and a snapshot via "-o subvol=xxx" or
"-o subvolid=xxx".

thanks,
liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: "not enough space" with "data raid0"
Alex posted on Wed, 28 Mar 2012 22:11:01 + as excerpted:

> Phillip Susi ubuntu.com> writes:
>
>> So currently btrfs's concept of raid0 is "stripe across as many disks
>> as possible, with a minimum of 2 disks".  Is there any reason for that
>> minimum?  I don't see why it can't allow only one if that's the best
>> it can manage.
>
> That's called "Single". http://btrfs.ipv5.de/index.php?title=Mkfs.btrfs

It's called "single" if you deliberately create it that way, yes.

However, Phillip's point was, I believe, that if a striped set can start
with say five devices of varying size and drop one at a time as the
smallest devices run out of space, there's no reason it should stop at
two devices.

If the goal is maximum performance, it should insist on using only the
space available in parallel across all devices and should refuse to drop
even a single one.  If OTOH the goal is maximum utilization of available
space, with a secondary goal of performance if at all possible, then it
should drop devices all the way down to a single device, not stopping at
two.

Since btrfs apparently is already content to drop from say five devices
to two despite the performance drop on the way, there's no fundamental
reason it should stop there; it should continue all the way down to a
single device, even tho when it gets to that point it's no longer
actually striped.

Of course, that's defining the behavior I'd expect of a mature btrfs.
Right now it's still experimental, with many missing features.  If at
this point there's an arbitrary line drawn at two devices minimum,
that's just the way it is.  But I'd not expect that line to be there
when the filesystem is declared mature and fully ready for use.

--
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
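Duncan's argument - keep dropping the smallest device as it fills, all the way down to one - can be made concrete with a toy allocator model. This is a hypothetical Python sketch: `usable_stripe_capacity` and its round-by-round policy are illustrative only, not the actual btrfs chunk allocator.

```python
def usable_stripe_capacity(sizes, min_devices):
    """Toy model of a raid0-style allocator: each round stripes equally
    across every device with free space, and allocation stops once fewer
    than min_devices devices remain.  Returns total space allocated."""
    free = sorted(sizes)
    used = 0
    while len(free) >= min_devices:
        # the smallest remaining device limits this round of striping
        smallest = free[0]
        used += smallest * len(free)
        # smallest device is now full; shrink the rest accordingly
        free = [f - smallest for f in free[1:] if f - smallest > 0]
    return used

sizes = [100, 200, 300, 400, 1000]  # GiB, five devices of varying size
print(usable_stripe_capacity(sizes, 2))  # 1400 - stops at two devices
print(usable_stripe_capacity(sizes, 1))  # 2000 - degrades to "single", uses everything
```

With a minimum of two devices the last 600 GiB of the largest device is stranded; allowing the stripe width to fall to one uses the full 2000 GiB, which is exactly the trade-off Duncan describes.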
[PATCH 17/19] btrfs: Convert to new freezing mechanism
We convert btrfs_file_aio_write() to use the new freeze check. We also add
proper freeze protection to btrfs_page_mkwrite(). Checks in
cleaner_kthread() and transaction_kthread() can be safely removed since
btrfs_freeze() will lock the mutexes and thus block the threads (and they
shouldn't have anything to do anyway).

CC: linux-btrfs@vger.kernel.org
CC: Chris Mason
Signed-off-by: Jan Kara
---
 fs/btrfs/disk-io.c |    3 ---
 fs/btrfs/file.c    |    3 ++-
 fs/btrfs/inode.c   |    6 +++++-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 811d9f9..fc0f74c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1586,8 +1586,6 @@ static int cleaner_kthread(void *arg)
 	struct btrfs_root *root = arg;

 	do {
-		vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
-
 		if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
 		    mutex_trylock(&root->fs_info->cleaner_mutex)) {
 			btrfs_run_delayed_iputs(root);
@@ -1618,7 +1616,6 @@ static int transaction_kthread(void *arg)
 	do {
 		delay = HZ * 30;
-		vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
 		mutex_lock(&root->fs_info->transaction_kthread_mutex);

 		spin_lock(&root->fs_info->trans_lock);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 859ba2d..1aac7ca 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1348,7 +1348,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 	ssize_t err = 0;
 	size_t count, ocount;

-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	sb_start_write(inode->i_sb);

 	mutex_lock(&inode->i_mutex);
@@ -1439,6 +1439,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 		num_written = err;
 	}
 out:
+	sb_end_write(inode->i_sb);
 	current->backing_dev_info = NULL;
 	return num_written ? num_written : err;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 32214fe..63c9006 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6405,6 +6405,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	u64 page_start;
 	u64 page_end;

+	sb_start_pagefault(inode->i_sb);
 	ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
 	if (!ret) {
 		ret = btrfs_update_time(vma->vm_file);
@@ -6495,12 +6496,15 @@ again:
 	unlock_extent_cached(io_tree, page_start, page_end, &cached_state,
 			     GFP_NOFS);
 out_unlock:
-	if (!ret)
+	if (!ret) {
+		sb_end_pagefault(inode->i_sb);
 		return VM_FAULT_LOCKED;
+	}
 	unlock_page(page);
 out:
 	btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
 out_noreserve:
+	sb_end_pagefault(inode->i_sb);
 	return ret;
 }
--
1.7.1
[PATCH 00/19 v4] Fix filesystem freezing deadlocks
Hello,

here is the fourth iteration of my patches to improve filesystem
freezing.  Filesystem freezing is currently racy and thus we can end up
with dirty data on a frozen filesystem (see the changelog of patch 06
for a detailed race description).  This patch series aims at fixing
this.

To be able to block all places where inodes get dirtied, I've moved
filesystem freeze handling into mnt_want_write() / mnt_drop_write().
This however required some code shuffling and changes to
kern_path_create() (see patches 02-05).  I think the result is OK but
opinions may differ ;).  An advantage of this change is that all
filesystems get freeze protection almost for free - even ext2 can
handle freezing well now.

Another potential contention point might be patch 19.  In that patch we
make freeze_super() refuse to freeze the filesystem when there are open
but unlinked files, which may be impractical in some cases.  The main
reason for this is the problem with handling of file deletion from
fput() called with mmap_sem held (e.g. from munmap(2)), and then
there's the fact that we cannot really force such a filesystem into a
consistent state...  But if people think that freezing with open but
unlinked files should happen, then I have some possible solutions in
mind (maybe as a separate patchset, since this one is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on a
frozen filesystem despite beating it with fsstress and
bash-shared-mapping while freezing and unfreezing for several hours
(using ext4 and xfs), so I'm reasonably confident this could finally be
the right solution.

And for people wanting to test - this patchset is based on the patch
series "Push file_update_time() into .page_mkwrite", so you'll need to
pull that one in as well.
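The write-side protocol the series builds on - writers bracket modifications with start/end calls, and freeze blocks new writers while draining in-flight ones - can be sketched as a userspace toy model. `FreezeGate` is a hypothetical name, and this models only a single freeze level, not the kernel's percpu counters or multi-stage SB_FREEZE_WRITE/SB_FREEZE_TRANS transitions:

```python
import threading

class FreezeGate:
    """Toy model of sb_start_write()/sb_end_write()/freeze semantics:
    writers are counted in and out; freezing sets a flag so new writers
    block, then waits for in-flight writers to drain."""
    def __init__(self):
        self.cond = threading.Condition()
        self.writers = 0
        self.frozen = False

    def start_write(self):            # ~ sb_start_write()
        with self.cond:
            while self.frozen:        # new writes block while frozen
                self.cond.wait()
            self.writers += 1

    def end_write(self):              # ~ sb_end_write()
        with self.cond:
            self.writers -= 1
            self.cond.notify_all()    # maybe a freezer is waiting on us

    def freeze(self):                 # ~ one level of freeze_super()
        with self.cond:
            self.frozen = True
            while self.writers:       # drain in-flight writers
                self.cond.wait()

    def thaw(self):                   # ~ thaw_super()
        with self.cond:
            self.frozen = False
            self.cond.notify_all()    # release blocked writers
```

The key property this illustrates is the one the cover letter cares about: once freeze() returns, no write is in flight and none can start, so no new dirty data can appear on the frozen filesystem.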
Changes since v3:
  * added third level of freezing for fs internal purposes - hooked some
    filesystems to use it (XFS, nilfs2)
  * removed racy i_size check from filemap_mkwrite()

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for
    SB_FREEZE_WRITE and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
    m_active_trans counter

								Honza

CC: Alex Elder
CC: Anton Altaparmakov
CC: Ben Myers
CC: Chris Mason
CC: cluster-de...@redhat.com
CC: "David S. Miller"
CC: fuse-de...@lists.sourceforge.net
CC: "J. Bruce Fields"
CC: Joel Becker
CC: KONISHI Ryusuke
CC: linux-btrfs@vger.kernel.org
CC: linux-e...@vger.kernel.org
CC: linux-...@vger.kernel.org
CC: linux-ni...@vger.kernel.org
CC: linux-ntfs-...@lists.sourceforge.net
CC: Mark Fasheh
CC: Miklos Szeredi
CC: ocfs2-de...@oss.oracle.com
CC: OGAWA Hirofumi
CC: Steven Whitehouse
CC: "Theodore Ts'o"
CC: x...@oss.sgi.com
[PATCH 04/19] btrfs: Push mnt_want_write() outside of i_mutex
When mnt_want_write() starts to handle freezing it will get full lock
semantics requiring proper lock ordering. So push the mnt_want_write()
call consistently outside of i_mutex.

CC: Chris Mason
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara
---
 fs/btrfs/ioctl.c |   23 +++++++++++------------
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 03bb62a..c855e55 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -192,6 +192,10 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 	if (!inode_owner_or_capable(inode))
 		return -EACCES;

+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+
 	mutex_lock(&inode->i_mutex);

 	ip_oldflags = ip->flags;
@@ -206,10 +210,6 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 		}
 	}

-	ret = mnt_want_write_file(file);
-	if (ret)
-		goto out_unlock;
-
 	if (flags & FS_SYNC_FL)
 		ip->flags |= BTRFS_INODE_SYNC;
 	else
@@ -271,9 +271,9 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 		inode->i_flags = i_oldflags;
 	}

-	mnt_drop_write_file(file);
 out_unlock:
 	mutex_unlock(&inode->i_mutex);
+	mnt_drop_write_file(file);
 	return ret;
 }
@@ -624,6 +624,10 @@ static noinline int btrfs_mksubvol(struct path *parent,
 	struct dentry *dentry;
 	int error;

+	error = mnt_want_write(parent->mnt);
+	if (error)
+		return error;
+
 	mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);

 	dentry = lookup_one_len(name, parent->dentry, namelen);
@@ -635,13 +639,9 @@ static noinline int btrfs_mksubvol(struct path *parent,
 	if (dentry->d_inode)
 		goto out_dput;

-	error = mnt_want_write(parent->mnt);
-	if (error)
-		goto out_dput;
-
 	error = btrfs_may_create(dir, dentry);
 	if (error)
-		goto out_drop_write;
+		goto out_dput;

 	down_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
@@ -659,12 +659,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
 	fsnotify_mkdir(dir, dentry);
 out_up_read:
 	up_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
-out_drop_write:
-	mnt_drop_write(parent->mnt);
 out_dput:
 	dput(dentry);
 out_unlock:
 	mutex_unlock(&dir->i_mutex);
+	mnt_drop_write(parent->mnt);
 	return error;
 }
--
1.7.1
Re: "not enough space" with "data raid0"
Phillip Susi ubuntu.com> writes:
>
> So currently btrfs's concept of raid0 is "stripe across as many disks
> as possible, with a minimum of 2 disks".  Is there any reason for that
> minimum?  I don't see why it can't allow only one if that's the best
> it can manage.

That's called "Single". http://btrfs.ipv5.de/index.php?title=Mkfs.btrfs
Re: Fractal Tree Indexing over B-Trees?
> I'd like to see how they do that.  The fact is you are still going to
> get random seeks since you have to binary search the blocks in an
> entire row, since there is no way you can read a several thousand
> block row into memory to search it, so once your rows get pretty big
> you are doing just as much if not more random io as a btree.

Why?  The rows are sequential on disk, so you're never doing more
random seeks than the number of rows, as long as you can search faster
than the disk can provide data.  If they indeed can limit the number of
searches per row to a constant, it shouldn't be an issue at all.

> I don't buy that it's better in the insert case either, for similar
> reasons: you are talking about having to rewrite entire rows to
> maintain the sequential nature of everything, so if you end up adding
> a giant row you have to go and write the whole thing out again, so you
> are talking about a lot more IO than with b-trees, albeit sequential.
> So maybe it's a win since it's sequential, but I wonder at what tree
> size that benefit no longer exists.

The worst case might be bad, but on average it should be good.  I don't
doubt it is always better on rotating disks, but probably not on ssd's.

> Of course all this analysis is based on a power point presentation and
> there may be some detail I'm missing, but that's mostly my point:
> claiming that b-trees are now obsolete because we can maybe do inserts
> faster is just idiotic.

It's a presentation from a company, so lots of marketing.  Of course it
doesn't make btrees obsolete; it's optimized for specific cases.  Like
they say, they reduce random seeks at the cost of disk bandwidth and
cpu usage.  It depends on the use case whether that is useful or not.
Probably not for btrfs, since the future will be ssd's, and metadata
will generally be cached.

Niels
Re: Fractal Tree Indexing over B-Trees?
On Wed, Mar 28, 2012 at 10:44:03PM +0200, Niels de Carpentier wrote:
> > You are still going to have to have at least 29 levels to
> > accommodate 1 billion objects, though they won't all be full (sorry
> > I missed the must be full or empty bit).  So it looks like we'll
> > have to actually search what, 13 rows, right?  So still more rows
> > than a b-tree, and again you are talking about binary searching each
> > row's blocks and then following its forward pointer down to the next
> > one, so maybe not 100's of times slower than a b-tree, but we're not
> > talking about a 5-10% difference either, probably still measured in
> > multiples.
>
> They say that they can do the block pointers in a way that limits the
> number of searches per row to a constant, so it's not that bad.  They
> do less random seeks, but bigger ones, at the cost of more cpu usage.

I'd like to see how they do that.  The fact is you are still going to
get random seeks since you have to binary search the blocks in an
entire row, since there is no way you can read a several thousand block
row into memory to search it, so once your rows get pretty big you are
doing just as much if not more random io as a btree.

> > Hah, my math was a little off, I'll grant you, but the basic idea
> > still stands: once we start getting into larger and larger rows
> > we're going to have to binary search just to find the right _block_
> > to use, which in my mind is much more of a problem than having to
> > binary search inside of multiple blocks.  At the very least, as the
> > number of objects grows the time it takes to find something in the
> > tree increases exponentially.
>
> That's not what they say.  A lot of info is missing though.
>
> >> There's a *ton* of detail that they don't describe about how to
> >> actually translate the logical notion of power-of-two node sizes
> >> into coherent block IO that scales.  I don't doubt that it's
> >> possible.
> >
> > I imagine there is, but based on what little information they've
> > shown I don't see how it's a hands down win against b-trees.  If
> > anything we're talking about having to solve really complex problems
> > in order to get any sort of good performance out of this thing.
>
> They don't claim to win hands down for the search case; they say they
> are similar for the search case, but much better at inserts.
>
> I don't think it's good for the linux fs general use case, but it's
> not as bad as you think.  But it will be very hard to implement anyway
> unless they release more details.

I don't buy that it's better in the insert case either, for similar
reasons: you are talking about having to rewrite entire rows to
maintain the sequential nature of everything, so if you end up adding a
giant row you have to go and write the whole thing out again, so you
are talking about a lot more IO than with b-trees, albeit sequential.
So maybe it's a win since it's sequential, but I wonder at what tree
size that benefit no longer exists.

Of course all this analysis is based on a power point presentation and
there may be some detail I'm missing, but that's mostly my point:
claiming that b-trees are now obsolete because we can maybe do inserts
faster is just idiotic.  Thanks,

Josef
Re: Fractal Tree Indexing over B-Trees?
> You are still going to have to have at least 29 levels to accommodate
> 1 billion objects, though they won't all be full (sorry I missed the
> must be full or empty bit).  So it looks like we'll have to actually
> search what, 13 rows, right?  So still more rows than a b-tree, and
> again you are talking about binary searching each row's blocks and
> then following its forward pointer down to the next one, so maybe not
> 100's of times slower than a b-tree, but we're not talking about a
> 5-10% difference either, probably still measured in multiples.

They say that they can do the block pointers in a way that limits the
number of searches per row to a constant, so it's not that bad.  They
do less random seeks, but bigger ones, at the cost of more cpu usage.

> Hah, my math was a little off, I'll grant you, but the basic idea
> still stands: once we start getting into larger and larger rows we're
> going to have to binary search just to find the right _block_ to use,
> which in my mind is much more of a problem than having to binary
> search inside of multiple blocks.  At the very least, as the number of
> objects grows the time it takes to find something in the tree
> increases exponentially.

That's not what they say.  A lot of info is missing though.

>> There's a *ton* of detail that they don't describe about how to
>> actually translate the logical notion of power-of-two node sizes into
>> coherent block IO that scales.  I don't doubt that it's possible.
>
> I imagine there is, but based on what little information they've shown
> I don't see how it's a hands down win against b-trees.  If anything
> we're talking about having to solve really complex problems in order
> to get any sort of good performance out of this thing.  Thanks,

They don't claim to win hands down for the search case; they say they
are similar for the search case, but much better at inserts.

I don't think it's good for the linux fs general use case, but it's not
as bad as you think.  But it will be very hard to implement anyway
unless they release more details.

Niels
Re: Fractal Tree Indexing over B-Trees?
> I imagine there is, but based on what little information they've shown
> I don't see how it's a hands down win against b-trees.  If anything
> we're talking about having to solve really complex problems in order
> to get any sort of good performance out of this thing.

Oh, absolutely.  Tack on COW and online repair and the complexity goes
through the roof.

- z
Re: Fractal Tree Indexing over B-Trees?
On Wed, Mar 28, 2012 at 03:50:07PM -0400, Zach Brown wrote:
>
> > but lets say O(log N/2) where N is the number of elements in the
> > row.  So in the situation I describe you are looking at having to do
> > a minimum of 29 reads, one for each row,
>
> Hmm.
>
> Levels are powers of two and are either full or empty.  So the total
> item count tells you which levels are full or empty.
>
> (gdb) print/t 1000000000
> $3 = 111011100110101100101000000000
>
> So no, definitely not reading 29 levels.

You are still going to have to have at least 29 levels to accommodate 1
billion objects, though they won't all be full (sorry I missed the must
be full or empty bit).  So it looks like we'll have to actually search
what, 13 rows, right?  So still more rows than a b-tree, and again you
are talking about binary searching each row's blocks and then following
its forward pointer down to the next one, so maybe not 100's of times
slower than a b-tree, but we're not talking about a 5-10% difference
either, probably still measured in multiples.

> After realizing this, I have to be honest: I'm not finding your
> hand-wavey dismissal of the technology convincing at all. :)

Hah, my math was a little off, I'll grant you, but the basic idea still
stands: once we start getting into larger and larger rows we're going
to have to binary search just to find the right _block_ to use, which
in my mind is much more of a problem than having to binary search
inside of multiple blocks.  At the very least, as the number of objects
grows the time it takes to find something in the tree increases
exponentially.

> There's a *ton* of detail that they don't describe about how to
> actually translate the logical notion of power-of-two node sizes into
> coherent block IO that scales.  I don't doubt that it's possible.

I imagine there is, but based on what little information they've shown
I don't see how it's a hands down win against b-trees.  If anything
we're talking about having to solve really complex problems in order to
get any sort of good performance out of this thing.  Thanks,

Josef
Re: Fractal Tree Indexing over B-Trees?
> but lets say O(log N/2) where N is the number of elements in the row.
> So in the situation I describe you are looking at having to do a
> minimum of 29 reads, one for each row,

Hmm.

Levels are powers of two and are either full or empty.  So the total
item count tells you which levels are full or empty.

(gdb) print/t 1000000000
$3 = 111011100110101100101000000000

So no, definitely not reading 29 levels.

After realizing this, I have to be honest: I'm not finding your
hand-wavey dismissal of the technology convincing at all. :)

There's a *ton* of detail that they don't describe about how to
actually translate the logical notion of power-of-two node sizes into
coherent block IO that scales.  I don't doubt that it's possible.

I very much doubt that the required complexity and cpu cost in the
kernel would be worth it for generic file system users, though.  Hooo
boy, do I doubt it.

- z
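Zach's gdb check is easy to reproduce: in a structure where level i holds 2^i items and is either completely full or empty, the binary representation of the item count is exactly the occupancy map (a sketch of that observation, assuming that level convention, not TokuDB's actual layout):

```python
n = 1000 * 1000 * 1000        # one billion items
print(bin(n)[2:])             # 111011100110101100101000000000, as in the gdb output

# each 1-bit is a full level; only those levels need searching
occupied = bin(n).count("1")
print(occupied)               # 13 - Josef's "13 rows" figure

# total height is the bit length, not the number of occupied levels
print(n.bit_length())         # 30
```

So for a billion items only 13 of the ~30 levels are non-empty, which is why "reading 29 levels" overstates the search cost.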
Re: Fractal Tree Indexing over B-Trees?
On Wed, Mar 28, 2012 at 02:42:04PM -0400, Josef Bacik wrote:
> On Wed, Mar 28, 2012 at 02:25:39PM +, Danny Piccirillo wrote:
> > The case has been made on Phoronix for F-Trees: they make use of
> > hard drive speeds, not (relatively slow) access times; beat SSD's;
> > and scale perfectly across multiple cores with hundreds of millions
> > of entries.
> >
> > http://en.wikipedia.org/wiki/TokuDB#Fractal_tree_indexes
> >
> > How TokuDB Fractal Tree Databases Work
> >
> > Via: http://www.phoronix.com/scan.php?page=news_item&px=MTA3NjM
> >
> > Time for someone to get started on ftrfs?  Or can it be implemented
> > in Btrfs?
> > https://bugzilla.kernel.org/show_bug.cgi?id=43004
>
> This is dumber shit than usual out of phoronix.  I did some just basic
> calculations for worst case scenarios.  Let's say we max out btrfs's 8
> level limit, so we have 7 levels of nodes and then our leaves.  Let's
> just for simplicity's sake say we can fit 1 billion objects into this
> kind of tree, and for each node/leaf we can only hold 10 objects.
> (Keep in mind these are just random numbers I'm pulling out of my ass
> and may not work out right with such a large tree, just work with me
> here.)
>
> So the worst case scenario doing a binary search for an object among
> 10 objects is 4 comparisons, so with a full 8 level btree we're doing
> 4 comparisons per level, so 32 comparisons worst possible case to find
> our object.
>
> Now let's consider a fractal tree with 1 billion objects.  So that
> would have a 29 row fractal tree (if I did my math right).  Now I have
> to make some assumptions about how you would implement a fractal tree
> here, but let's assume that every level tells you its min and max
> value to make it easier to tell which level you need to search in.  So
> worst case your object is in the 29th level, so you have to read 29
> blocks in and check the min/max levels to figure out which row to
> search.  Let's be fair and say these are in memory; we're still doing
> 29 comparisons right out of the gate to find the right row.  Now we
> have the right row, but this is a file system and the rows are split
> up into blocks, so we not only have to binary search each block, but
> the blocks themselves to find the right block with our object.  And we
> can't do this directed either, because we have no way of indexing the
> individual blocks, so worst case scenario we're having to read in
> 268435456 blocks to then do a normal binary search on _each_ block!
> Now let's again say, just to give fractal trees an added benefit, that
> each block has its own min/max setting, so we only have to binary
> search the one that will have our actual data, but we're still talking
> about reading in 33554432 times the number of blocks as compared to a
> btree!!

OK, looks like I wasn't being completely fair here: part of the
presentation talks about using forward pointers to point to where the
element would be in the next row.  So if we assume we're using forward
pointers and every row has a min/max indicator, you can walk down each
row, do a binary search to get as close as possible to what you want,
and then follow the forward pointer down to the next level.  The
problem with this is that in the worst case you end up having to binary
search on each row, which makes the SMP case even crappier, because you
have to make sure nothing changes while you are walking down the rows
and following the forward pointers.

The math here is a little trickier since I can't easily picture what
the worst case scenario would be per level, but let's say O(log N/2),
where N is the number of elements in the row.  So in the situation I
describe you are looking at having to do a minimum of 29 reads, one for
each row, and then add to that all the reads you need to binary search
from your forward pointer to the end of the row, which is going to
increase as you go down the tree.  So probably not millions of times
more work than b-trees, but at least 10s if not 100s of times more work
in the worst case.  Thanks,

Josef
Re: Fractal Tree Indexing over B-Trees?
On 03/28/2012 02:42 PM, Josef Bacik wrote:
> On Wed, Mar 28, 2012 at 02:25:39PM +, Danny Piccirillo wrote:
>> The case has been made on Phoronix for F-Trees: they make use of hard
>> drive speeds, not (relatively slow) access times; beat SSD's; and
>> scale perfectly across multiple cores with hundreds of millions of
>> entries.
>
> So no, I don't think it's time for someone to get started on ftrfs.
> Thanks,

Thanks for doing the research to back up the answer I was going to
respond with based on intuition.

+1

-Jeff

--
Jeff Mahoney
SUSE Labs
Re: Fractal Tree Indexing over B-Trees?
On Wed, Mar 28, 2012 at 02:25:39PM +, Danny Piccirillo wrote:
> The case has been made on Phoronix for F-Trees: they make use of hard
> drive speeds, not (relatively slow) access times; beat SSD's; and
> scale perfectly across multiple cores with hundreds of millions of
> entries.
>
> http://en.wikipedia.org/wiki/TokuDB#Fractal_tree_indexes
>
> How TokuDB Fractal Tree Databases Work
>
> Via: http://www.phoronix.com/scan.php?page=news_item&px=MTA3NjM
>
> Time for someone to get started on ftrfs?  Or can it be implemented in
> Btrfs?
> https://bugzilla.kernel.org/show_bug.cgi?id=43004

This is dumber shit than usual out of phoronix.  I did some just basic
calculations for worst case scenarios.  Let's say we max out btrfs's 8
level limit, so we have 7 levels of nodes and then our leaves.  Let's
just for simplicity's sake say we can fit 1 billion objects into this
kind of tree, and for each node/leaf we can only hold 10 objects.
(Keep in mind these are just random numbers I'm pulling out of my ass
and may not work out right with such a large tree, just work with me
here.)

So the worst case scenario doing a binary search for an object among 10
objects is 4 comparisons, so with a full 8 level btree we're doing 4
comparisons per level, so 32 comparisons worst possible case to find
our object.

Now let's consider a fractal tree with 1 billion objects.  So that
would have a 29 row fractal tree (if I did my math right).  Now I have
to make some assumptions about how you would implement a fractal tree
here, but let's assume that every level tells you its min and max value
to make it easier to tell which level you need to search in.  So worst
case your object is in the 29th level, so you have to read 29 blocks in
and check the min/max levels to figure out which row to search.  Let's
be fair and say these are in memory; we're still doing 29 comparisons
right out of the gate to find the right row.

Now we have the right row, but this is a file system and the rows are
split up into blocks, so we not only have to binary search each block,
but the blocks themselves to find the right block with our object.  And
we can't do this directed either, because we have no way of indexing
the individual blocks, so worst case scenario we're having to read in
268435456 blocks to then do a normal binary search on _each_ block!
Now let's again say, just to give fractal trees an added benefit, that
each block has its own min/max setting, so we only have to binary
search the one that will have our actual data, but we're still talking
about reading in 33554432 times the number of blocks as compared to a
btree!!

And this doesn't even begin to touch the ability to handle
multi-threaded workloads.  Imagine having to move all of these blocks
around in memory because your insert created a new row.  You'd have to
lock each row as you move stuff around (at the very least), so if you
hit a particularly hot row your multithreaded performance is going to
look like you are running on an Atari 2600.

So no, I don't think it's time for someone to get started on ftrfs.
Thanks,

Josef
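Josef's arithmetic above is easy to reproduce. These use the same deliberately artificial parameters as the post (fanout 10, 8 levels, a billion objects) - they are illustrative, not real btrfs geometry:

```python
import math

# Btree worst case: 8 levels, fanout 10, binary search within each node
fanout = 10
levels = 8
per_node = math.ceil(math.log2(fanout))   # 4 comparisons to search 10 objects
print(levels * per_node)                  # 32 comparisons worst case

# Rows needed for a fractal tree holding a billion objects
n = 10**9
print(math.floor(math.log2(n)))           # 29 - the "29 row" figure

# The block counts in the post: the deepest row spans 2**28 blocks,
# and 2**28 blocks vs. 8 btree-level reads is a 2**25x difference
print(2**28)                              # 268435456
print(2**28 // 8)                         # 33554432
```

So the specific constants in the post (32, 29, 268435456, 33554432) all fall out of the same log2 bookkeeping.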
Re: "not enough space" with "data raid0"
On 3/17/2012 8:19 AM, Hugo Mills wrote:
>> Where is the problem, how can I use the full space?
>
> You can't. btrfs requires RAID-0 to be at least two devices wide
> (otherwise it's not striped at all, which is the point of RAID-0). If
> you want to use the full capacity of both disks and don't care about
> the performance gain from striping, use -d single (which is the
> default). If you do care about the performance gain from striping,
> then you're going to have to lose some usable space.

So currently btrfs's concept of raid0 is "stripe across as many disks as possible, with a minimum of 2 disks". Is there any reason for that minimum? I don't see why it can't allow only one if that's the best it can manage.
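Phillip's point about dropping devices as they fill can be made concrete with a toy allocator. This is a sketch, not btrfs's actual chunk allocator, and `raid0_usable` is a made-up helper name; it just greedily stripes across every device with free space and stops at a configurable device floor:

```python
def raid0_usable(sizes, min_stripe=2):
    """Greedy model: stripe across all devices that still have free
    space, narrowing the stripe as small devices fill up, and stop
    once fewer than `min_stripe` devices remain."""
    free = list(sizes)
    used = 0
    while True:
        avail = [f for f in free if f > 0]
        if len(avail) < min_stripe:
            return used
        chunk = min(avail)                 # smallest device bounds the stripe
        used += chunk * len(avail)         # one full-width run of stripes
        free = [f - chunk if f > 0 else 0 for f in free]

# A 500G + 1T pair: the two-device floor strands half the large disk,
# while a floor of one device degenerates to -d single and uses it all.
print(raid0_usable([500, 1000]))                # 1000
print(raid0_usable([500, 1000], min_stripe=1))  # 1500
```

With five devices of 100..500G the same model fills 1400 under the two-device floor and all 1500 with a floor of one, which is exactly the "drop all the way down to a single device" behavior being argued for.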
[PATCH] btrfs-progs: enforce block count on all devices in mkfs
I had a test that creates a 7gig raid1 device, but it was ending up wonky because the second device that gets added is the full size of the disk instead of the limited size. So enforce the limited size on all disks passed in at mkfs time; otherwise our threshold calculations end up wonky when doing chunk allocations. Thanks,

Signed-off-by: Josef Bacik
---
 mkfs.c  |    1 +
 utils.c |    2 ++
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mkfs.c b/mkfs.c
index c531ef2..6ae412f 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1404,6 +1404,7 @@ int main(int ac, char **av)
 			close(fd);
 			continue;
 		}
+		dev_block_count = block_count;
 		ret = btrfs_prepare_device(fd, file, zero_end, &dev_block_count,
 					   &mixed);
 		mixed = old_mixed;
diff --git a/utils.c b/utils.c
index ee7fa1b..552233b 100644
--- a/utils.c
+++ b/utils.c
@@ -555,6 +555,8 @@ int btrfs_prepare_device(int fd, char *file, int zero_end, u64 *block_count_ret,
 		fprintf(stderr, "unable to find %s size\n", file);
 		exit(1);
 	}
+	if (*block_count_ret)
+		block_count = min(block_count, *block_count_ret);
 	zero_end = 1;
 	if (block_count < 1024 * 1024 * 1024 && !(*mixed)) {
--
1.7.5.2
Re: Create subvolume from a directory?
Hello Liu

On 03/28/2012 04:18 AM, Liu Bo wrote:
> On 03/28/2012 06:24 AM, Matthias G. Eckermann wrote:
>> # time cp -a --reflink /var/lib/lxc/installserver_tmp/rootfs /var/lib/lxc/installserver
>
> This is very weird. AFAIK, clone between different subvolumes should be
> forbidden. So this would get an "Invalid cross-device link", because an
> individual subvolume can be mounted directly.

Could you elaborate which would be the issue? "cp --reflink"-ing a file is not different than snapshotting a file. In any case I could mount a snapshot and not the source subvolume.

> thanks,
> liubo

>> real	0m1.367s
>> user	0m0.148s
>> sys	0m1.108s
>>
>> ## Now remove /var/lib/lxc/installserver_tmp (or not)
>> --< snap>--
>>
>> Just to compare this with a "mv":
>>
>> --< snip>--
>> ## Go back to former state
>> # btrfs subvol delete /var/lib/lxc/installserver
>> Delete subvolume '/var/lib/lxc/installserver'
>> # btrfs subvol create /var/lib/lxc/installserver
>> Create subvolume '/var/lib/lxc/installserver'
>>
>> # time mv /var/lib/lxc/installserver_tmp/rootfs /var/lib/lxc/installserver/
>>
>> real	0m12.917s
>> user	0m0.208s
>> sys	0m2.508s
>> --< snap>--
>>
>> While the time measurement might be flawed due to the subvol
>> actions in between, caching etc.: I tried several times, and
>> "cp --reflink" always is multiple times faster than "mv" in
>> my environment.
>>
>> Or did I misunderstand your question?
>>
>> so long -
>> MgE
Re: Fractal Tree Indexing over B-Trees?
On Wed, Mar 28, 2012 at 9:25 AM, Danny Piccirillo wrote:
> The case has been made on Phoronix for F-Trees: They make use of hard
> drive speeds, not (relatively slow) access times; beat SSDs; and scale
> perfectly across multiple cores with hundreds of millions of entries.
>
> http://en.wikipedia.org/wiki/TokuDB#Fractal_tree_indexes
>
> How TokuDB Fractal Tree Databases Work
>
> Via: http://www.phoronix.com/scan.php?page=news_item&px=MTA3NjM
>
> Time for someone to get started on ftrfs? Or can it be implemented
> in Btrfs?
> https://bugzilla.kernel.org/show_bug.cgi?id=43004

whoa, very cool stuff. fractals are awesome, cool to see them in use. ... 2010/11, surprised i never heard of it before now. thanks for the reference/links at the very least!

aside: i once described fractals to my grandmother (100% devout catholic) as related to my own understanding of the universe -- specifically, i pointed out how their often simple mathematical identity fragments into an infinitely self-similar pattern of seemingly unbounded complexity -- she told me i was describing god ... and, well, we agreed :-)

--
C Anthony
Fractal Tree Indexing over B-Trees?
The case has been made on Phoronix for F-Trees: They make use of hard drive speeds, not (relatively slow) access times; beat SSDs; and scale perfectly across multiple cores with hundreds of millions of entries.

http://en.wikipedia.org/wiki/TokuDB#Fractal_tree_indexes

How TokuDB Fractal Tree Databases Work

Via: http://www.phoronix.com/scan.php?page=news_item&px=MTA3NjM

Time for someone to get started on ftrfs? Or can it be implemented in Btrfs?
https://bugzilla.kernel.org/show_bug.cgi?id=43004
Re: Create subvolume from a directory?
David Sterba jikos.cz> writes:
> On Tue, Mar 27, 2012 at 05:19:06PM +, Alex wrote:
> > Can I create/convert an existing (btrfs) directory into a subvolume?
>
> For the reference there's an entry at the wiki:
>
> http://btrfs.ipv5.de/index.php?title=UseCases#Can_I_take_a_snapshot_of_a_directory.3F

Thank you all for your replies. Kind regards.
(happily experimenting with btrfs in a 3.3 kernel!)

Al.
Re: Honest timeline for btrfsck
Danny Piccirillo posted on Wed, 28 Mar 2012 06:15:41 + as excerpted:

> Chris Mason oracle.com> writes:
>> People have already started picking up #4; fujitsu had some patches in
>> this direction that we'll keep developing with.
>>
>> I stepped back to add some directory metadata fixups as well to the
>> basic fsck tool. I had thought I could easily do it all from the
>> kernel, but sometimes the userland side really is easier.
>
> Phoronix reported on your talk at the SCALE 10x conference here:
> http://www.phoronix.com/scan.php?page=news_item&px=MTA0Njk
>
> What happened to the February 14th deadline? Is another rewrite
> underway? I think the biggest thing that can be done in lieu of the
> release is really good communication. There's no feed to get up-to-date
> info on what's being worked on, what's left, etc.

I too read the phoronix article, and in fact mentioned it here back about Feb 10 or so...

Actually, a somewhat-working repair-writing btrfsck does exist and is in fact "public"... for some value of "public", at least. It's still sort of like the plans for the pan-galactic bypass in Hitchhiker's Guide to the Galaxy, down several flights of stairs behind a door that says "beware of leopard", in that it's only available in Chris's "dangerdonteveruse" git branch, but it *IS* publicly available... as I said, for /some/ value of "public" anyway.

Here's the issue, tho. At present, it's pretty much a "well, it may fix some problems, but then again, it might wreak further havoc on an already damaged filesystem, destroying any reasonable chance of retrieving valid data off it, too!" type of situation. That's why it's still stuck in "dangerdonteveruse".

But if you read the article well, that's all that was really promised anyway, at this point. That was a deadline for Oracle getting it for testing and QA, etc. It was *NOT* a "release quality" deadline, or even a "this matches the experimental state of the filesystem" deadline, in any shape or form.
So Oracle and others are doing some seriously focused QA and bug fixes on the thing at present. The idea is that they can release it along with a kernel with their own patches, and support that. It's worth noting, however, that if they're only supporting their own kernel, they can, if necessary, disable certain already-available features of the filesystem in their kernel, AND in the btrfsck, so as to be able to support what's left. After all, the filesystem itself is still officially experimental and in disk-format-flux, as well as lacking various targeted features ATM, and it's only a small stretch to go from there to disabling a feature or two already in mainline, if necessary, to be able to be comfortable supporting the rest.

Bottom line: Purely as a simple Linux user following this list for perhaps a couple months now, I'd expect btrfs to continue maturing, and it'll probably be getting toward somewhat stable toward the end of the year... depending on how conservative you are, of course (there are many who are barely warming up to ext4 at this point). Which would put it in line for the spring of next year (2013) distro releases. Until then, I'd expect the current label -- experimental, fit only for testing, not for storage of data you expect to be able to reliably retrieve -- to continue to apply, tho to a lesser degree as it gradually matures.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: Create subvolume from a directory?
On Tue, Mar 27, 2012 at 05:19:06PM +, Alex wrote:
> Can I create/convert an existing (btrfs) directory into a subvolume?

For the reference there's an entry at the wiki:

http://btrfs.ipv5.de/index.php?title=UseCases#Can_I_take_a_snapshot_of_a_directory.3F

As noted in the thread, this relies on the unmerged cross-subvolume reflink patch.

david
Re: Create subvolume from a directory?
On Wed, Mar 28, 2012 at 08:46:13AM +0700, Fajar A. Nugraha wrote:
> On Wed, Mar 28, 2012 at 5:24 AM, Matthias G. Eckermann wrote:
> > While the time measurement might be flawed due to the subvol
> > actions in between, caching etc.: I tried several times, and
> > "cp --reflink" always is multiple times faster than "mv" in
> > my environment.
>
> So this is cross-subvolume reflinks? I thought the code for that
> wasn't merged yet?

Matthias was using a SLES kernel; the cross-subvolume reflink patch is present there, and he was not aware it's not merged upstream.

david