Re: SSD Optimizations
On Fri, 12 Mar 2010 02:07:40 +0100 Hubert Kario h...@qbs.com.pl wrote:
> [...]
> > If the FS were to be smart and know about the 256kb requirement, it
> > would do a read/modify/write cycle somewhere and then write the 4KB.
> If all the free blocks have been TRIMmed, the FS should pick a completely
> free erase-sized block and write those 4KiB of data. A correct
> implementation of wear leveling in the drive should notice that the write
> fits entirely inside a free block and perform just a write cycle, padding
> the supplied data with zeros.

Your assumption here is that your _addressed_ block layout is completely identical to the SSD's physical layout. Otherwise you cannot know where a free erase block is located or how to address it from the FS. I really wonder what this assumption is based on. You still seem to think an SSD is a true disk with linear addressing; I doubt that very much. Even on true spinning disks your assumption is wrong for relocated sectors, which basically means that disk controller firmware has been fiddling with the physical layout for decades.

Please accept that you cannot do a disk's job in the FS. The more advanced the technology gets, the more disks become black boxes with a defined software interface. Use this interface and drop the idea of having inside knowledge of such a device. That's other people's work. If you want to design smart SSD controllers, hire on at a company that builds them.

--
Regards,
Stephan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SSD Optimizations
On Friday 12 March 2010 10:15:28 Stephan von Krawczynski wrote:
> On Fri, 12 Mar 2010 02:07:40 +0100 Hubert Kario h...@qbs.com.pl wrote:
> > [...]
> > > If the FS were to be smart and know about the 256kb requirement, it
> > > would do a read/modify/write cycle somewhere and then write the 4KB.
> > If all the free blocks have been TRIMmed, the FS should pick a
> > completely free erase-sized block and write those 4KiB of data.
> > A correct implementation of wear leveling in the drive should notice
> > that the write fits entirely inside a free block and perform just a
> > write cycle, padding the supplied data with zeros.
> Your assumption here is that your _addressed_ block layout is completely
> identical to the SSD's physical layout. Otherwise you cannot know where a
> free erase block is located or how to address it from the FS. I really
> wonder what this assumption is based on. You still seem to think an SSD
> is a true disk with linear addressing. I doubt that very much.

I made no such assumption. I'm sure the linearity at the ATA LBA level isn't so linear at the device level, especially after wear leveling takes its toll, but I do assume that the smallest block of data the translation layer can address is erase-block sized and that all the erase blocks are equal in size. Otherwise the algorithm would be needlessly complicated, which would make it both slower and more error prone.

> Even on true spinning disks your assumption is wrong for relocated
> sectors.

Which we don't have to worry about: if the drive has fewer than 5 of them, the impact of hitting them is marginal, and if there are more, the user has much more pressing problems than the performance of the drive or FS.

> Which basically means that disk controller firmware has been fiddling
> with the physical layout for decades. Please accept that you cannot do a
> disk's job in the FS. The more advanced the technology gets, the more
> disks become black boxes with a defined software interface. Use this
> interface and drop the idea of having inside knowledge of such a device.
> That's other people's work. If you want to design smart SSD controllers,
> hire on at a company that builds them.

I don't think that doing the disk's job in the FS is a good idea either, but I do think we should be able to minimise the impact of the translation layer. The way to do this is to treat the device as a block device with sectors the size of erase blocks. That's nothing too fancy, don't you think?

--
Hubert Kario
QBS - Quality Business Software
ul. Ksawerów 30/85
02-656 Warszawa
POLAND
tel. +48 (22) 646-61-51, 646-74-24
fax +48 (22) 646-61-50
Re: [PATCH 1/2] Re: New btrfs command pushed into the btrfs-progs subvol branch - commands
On Friday 12 March 2010, Chris Mason wrote:
> On Thu, Mar 11, 2010 at 10:44:21PM +0100, Goffredo Baroncelli wrote:
> > Hi Chris, I updated my git repository. You can pull from [...]
> Wonderful. I've rebased this and put it into the subvol branch. I think
> I got all the commits and differences. The big reason for the rebase is
> to avoid the small string of one-liner fixup commits. If you have other
> work pending I'm happy to rebase it in, otherwise please try to work
> against my subvol branch.

Sorry; I am new to git, so some of the concepts are new to me. Next time I will rebase. Thanks again!

> -chris

--
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) kreij...@inwind.it
Key fingerprint = 4769 7E51 5293 D36C 814E C054 BF04 F161 3DC5 0512
Re: Saving and restoring btrfs snapshots
On Friday 12 March 2010, Pat Patterson wrote:
> Are there any plans to implement something akin to ZFS send/recv, to be
> able to create a stream representation of a snapshot and restore it
> later/somewhere else? I've spent some time trawling the mailing list and
> wiki, but I don't see anything there.

I spent a bit of time on this topic, trying to find an efficient way to back up the data incrementally.

AFAICT, zfs send and zfs recv do the same thing tar does: they transform a tree (or the difference between a tree and its snapshot) into a stream, and vice versa. Transforming a tree into a stream is not very interesting; the interesting part is how to compare a tree and its snapshot.

In fact, a snapshot of a tree should be a pointer to the original tree, and when a file is modified, a branch of the modified part (the extents of the file, the directories of the path) is created (yes, I know this is a big simplification of the process). The key is that the filesystem knows which parts of a snapshot are still equal to the source and which are not. If this kind of data were available to user space, comparing a tree and its snapshot would be very fast.

Reading the btrfs documentation, it seems that there is a version number associated with each transaction. With such a version number on a directory, we would be able to verify the equality of two trees by comparing only their roots, which would greatly speed up the comparison. But I was never able to get at this version number. There is the ioctl command FS_IOC_GETVERSION, which seems to return this number, but when a directory or one of its children is updated, the number doesn't change. I tried to hack the kernel code to test different version numbers: inode->i_generation, btrfs_inode->generation, btrfs_inode->sequence, and btrfs_inode->{last|last_sub|logged}_trans... but none of them was useful for my purpose.
Even though there is no clear conclusion, I hope this note may be useful to start a discussion on the matter.

Regards
Goffredo

> Cheers, Pat

--
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) kreijackATinwind.it
Key fingerprint = 4769 7E51 5293 D36C 814E C054 BF04 F161 3DC5 0512
[PATCH] Btrfs: fix small race with delalloc flushing waitqueue's
Every time we start a new flushing thread, we init the waitqueue if there isn't a flushing thread already running. The problem with this is that we check space_info->flushing, which we clear right before doing a wake_up on the flushing waitqueue, so we can end up initializing the waitqueue in the middle of clearing the flushing flag and calling wake_up. This is hard to hit, but the code is wrong anyway, so init the flushing/allocating waitqueues when creating the space info and leave them alone. I haven't seen the panic since I've been using this patch. Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
 fs/btrfs/extent-tree.c | 9 +++++----
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9c6fbd0..73ac69b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2678,6 +2678,8 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 	INIT_LIST_HEAD(&found->block_groups);
 	init_rwsem(&found->groups_sem);
+	init_waitqueue_head(&found->flush_wait);
+	init_waitqueue_head(&found->allocate_wait);
 	spin_lock_init(&found->lock);
 	found->flags = flags;
 	found->total_bytes = total_bytes;
@@ -2929,12 +2931,10 @@ static void flush_delalloc(struct btrfs_root *root,
 
 	spin_lock(&info->lock);
 
-	if (!info->flushing) {
+	if (!info->flushing)
 		info->flushing = 1;
-		init_waitqueue_head(&info->flush_wait);
-	} else {
+	else
 		wait = true;
-	}
 
 	spin_unlock(&info->lock);
@@ -2997,7 +2997,6 @@ static int maybe_allocate_chunk(struct btrfs_root *root,
 	if (!info->allocating_chunk) {
 		info->force_alloc = 1;
 		info->allocating_chunk = 1;
-		init_waitqueue_head(&info->allocate_wait);
 	} else {
 		wait = true;
 	}
--
1.6.6
[PATCH] Btrfs: fix ENOSPC accounting when max_extent is not maxed out
A user reported a bug a few weeks back: if he set max_extent=1m, did a dd, and then stopped it, we would panic. This is because I miscalculated how many extents would be needed for splits/merges. It turns out I didn't actually take max_extent into account properly, since we only ever add 1 extent for a write, which isn't quite right for the case where, say, max_extent is 4k and we do 8k writes; that would result in more than 1 extent. So this patch makes us properly figure out how many extents are needed for the amount of data being written, and deals with splitting and merging better. I've tested this ad nauseam and it works well. It depends on all of the other patches I've sent recently, including the per-cpu pools patch. Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
 fs/btrfs/ctree.h        |  8 ++++
 fs/btrfs/extent-tree.c  |  6 +-
 fs/btrfs/file.c         |  5 +-
 fs/btrfs/inode.c        | 99 ++++++++++++++++++++++++++++++---------
 fs/btrfs/ordered-data.c | 13 ++++--
 5 files changed, 90 insertions(+), 41 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f12fe00..2f5c01f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1965,6 +1965,14 @@ static inline struct dentry *fdentry(struct file *file)
 	return file->f_path.dentry;
 }
 
+static inline int calculate_extents(struct btrfs_root *root, u64 bytes)
+{
+	if (bytes <= root->fs_info->max_extent)
+		return 1;
+	return (int)div64_u64(bytes + root->fs_info->max_extent - 1,
+			      root->fs_info->max_extent);
+}
+
 /* extent-tree.c */
 void btrfs_put_block_group(struct btrfs_block_group_cache *cache);
 int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 73ac69b..0085dcb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2837,7 +2837,7 @@ int btrfs_unreserve_metadata_for_delalloc(struct btrfs_root *root,
 		spin_unlock(&BTRFS_I(inode)->accounting_lock);
 		return 0;
 	}
-	BTRFS_I(inode)->reserved_extents--;
+	BTRFS_I(inode)->reserved_extents -= num_items;
 	spin_unlock(&BTRFS_I(inode)->accounting_lock);
 
 	btrfs_unreserve_metadata_space(root, num_items);
@@ -3059,7 +3059,7 @@ again:
 	if (realloc_bytes >= num_bytes) {
 		pool->total_bytes += realloc_bytes;
 		spin_lock(&BTRFS_I(inode)->accounting_lock);
-		BTRFS_I(inode)->reserved_extents++;
+		BTRFS_I(inode)->reserved_extents += num_items;
 		spin_unlock(&BTRFS_I(inode)->accounting_lock);
 		spin_unlock(&pool->lock);
 		return 0;
@@ -3074,7 +3074,7 @@ again:
 	 */
 	if (pool->reserved_bytes + pool->used_bytes >= pool->total_bytes) {
 		spin_lock(&BTRFS_I(inode)->accounting_lock);
-		BTRFS_I(inode)->reserved_extents++;
+		BTRFS_I(inode)->reserved_extents += num_items;
 		spin_unlock(&BTRFS_I(inode)->accounting_lock);
 		spin_unlock(&pool->lock);
 		return 0;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d146dde..a457a94 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -838,6 +838,7 @@ static ssize_t btrfs_file_write(struct file *file, const char __user *buf,
 	unsigned long first_index;
 	unsigned long last_index;
 	int will_write;
+	int reserved = calculate_extents(root, count);
 
 	will_write = ((file->f_flags & O_SYNC) || IS_SYNC(inode) ||
 		      (file->f_flags & O_DIRECT));
@@ -855,7 +856,7 @@ static ssize_t btrfs_file_write(struct file *file, const char __user *buf,
 	/* do the reserve before the mutex lock in case we have to do some
 	 * flushing.  We wouldn't deadlock, but this is more polite.
 	 */
-	err = btrfs_reserve_metadata_for_delalloc(root, inode, 1);
+	err = btrfs_reserve_metadata_for_delalloc(root, inode, reserved);
 	if (err)
 		goto out_nolock;
@@ -975,7 +976,7 @@ out:
 	mutex_unlock(&inode->i_mutex);
 	if (ret)
 		err = ret;
-	btrfs_unreserve_metadata_for_delalloc(root, inode, 1);
+	btrfs_unreserve_metadata_for_delalloc(root, inode, reserved);
 out_nolock:
 	kfree(pages);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b4056ca..09f18b9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1231,19 +1231,37 @@ static int btrfs_split_extent_hook(struct inode *inode,
 	size = orig->end - orig->start + 1;
 	if (size > root->fs_info->max_extent) {
-		u64 num_extents;
-		u64 new_size;
+		u64 left_extents, right_extents;
+		u64 orig_extents;
+		u64 left_size, right_size;
 
-		new_size = orig->end - split + 1;
-		num_extents = div64_u64(size + root->fs_info->max_extent - 1,
[PATCH] Btrfs: force delalloc flushing when things get desperate
When testing with max_extents=4k, we enospc out really, really early. The reason for this is that we overwhelm the system with our worst-case calculation. When we try to flush delalloc, we don't want everybody to wait around forever, so we wake up the waiters once we've done some of the work, in the hope that it's enough to get everything they need done. The problem is that sometimes we don't wait long enough. So if we've already done a flush_delalloc and didn't find what we need, do it again, and this time wait for the flushing to completely finish before returning. This makes my ENOSPC test actually run to completion, instead of failing after about 20 seconds. Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
 fs/btrfs/extent-tree.c | 25 +++++++++++++++---------
 1 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0085dcb..aeef481 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2873,7 +2873,7 @@ static noinline void flush_delalloc_async(struct btrfs_work *work)
 	kfree(async);
 }
 
-static void wait_on_flush(struct btrfs_root *root, struct btrfs_space_info *info)
+static void wait_on_flush(struct btrfs_root *root, struct btrfs_space_info *info, int soft)
 {
 	DEFINE_WAIT(wait);
 	u64 num_bytes;
@@ -2895,6 +2895,12 @@ static void wait_on_flush(struct btrfs_root *root, struct btrfs_space_info *info
 			break;
 		}
 
+		if (!soft) {
+			spin_unlock(&info->lock);
+			schedule();
+			continue;
+		}
+
 		free = 0;
 		for_each_possible_cpu(i) {
 			struct btrfs_reserved_space_pool *pool;
@@ -2924,7 +2930,7 @@ static void wait_on_flush(struct btrfs_root *root, struct btrfs_space_info *info
 }
 
 static void flush_delalloc(struct btrfs_root *root,
-			   struct btrfs_space_info *info)
+			   struct btrfs_space_info *info, int soft)
 {
 	struct async_flush *async;
 	bool wait = false;
@@ -2939,7 +2945,7 @@ static void flush_delalloc(struct btrfs_root *root,
 	spin_unlock(&info->lock);
 
 	if (wait) {
-		wait_on_flush(root, info);
+		wait_on_flush(root, info, soft);
 		return;
 	}
@@ -2953,7 +2959,7 @@ static void flush_delalloc(struct btrfs_root *root,
 	btrfs_queue_worker(&root->fs_info->enospc_workers,
 			   &async->work);
 
-	wait_on_flush(root, info);
+	wait_on_flush(root, info, soft);
 	return;
 
 flush:
@@ -3146,14 +3152,17 @@ again:
 	if (!delalloc_flushed) {
 		delalloc_flushed = true;
-		flush_delalloc(root, meta_sinfo);
+		flush_delalloc(root, meta_sinfo, 1);
 		goto again;
 	}
 
 	if (!chunk_allocated) {
+		int ret;
+
 		chunk_allocated = true;
-		btrfs_wait_ordered_extents(root, 0, 0);
-		maybe_allocate_chunk(root, meta_sinfo);
+		ret = maybe_allocate_chunk(root, meta_sinfo);
+		if (!ret)
+			flush_delalloc(root, meta_sinfo, 0);
 		goto again;
 	}
@@ -3338,7 +3347,7 @@ again:
 	if (!delalloc_flushed) {
 		delalloc_flushed = true;
-		flush_delalloc(root, meta_sinfo);
+		flush_delalloc(root, meta_sinfo, 0);
 		goto again;
 	}
--
1.6.6
resize
Greetings,

I estimate it has been 10 hours since I issued the command "btrfsctl -r -4g /home" to attempt to free some space for a new partition, and it is still running. How long should this take? I am very concerned about the integrity of my system. Is it safe to interrupt the process? It seems increasingly likely that there is some problem.

My btrfs tools version is 0.19. The fs is mounted with the compress and ssd options. I am running kernel 2.6.32.

Any advice or suggestions are appreciated,
Brian