machine gets unresponsive during btrfs balance
Hi,

I am using a btrfs filesystem created with raid0 for data and metadata for (temporary) storage of TV recordings from my vdr. The filesystem was created under kernel version 2.6.34. An initial btrfs balance command succeeded. Since I upgraded to 2.6.35-rcX and then 2.6.35, btrfs balance no longer finishes but puts the machine in an unresponsive state. Unfortunately, I do not see any kernel oops or other debug information because even the display freezes. The last thing that happens is that these two lines are written to /var/log/messages:

Aug 11 21:42:23 thor kernel: btrfs: found 62911 extents
Aug 11 21:42:24 thor kernel: btrfs: relocating block group 1723913469952 flags 9

After that the machine immediately becomes unresponsive. As I did not see anything that might be related to my problem in the changelog for 2.6.35.1, I did not try again with that version.

Thanks,
Andreas
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: machine gets unresponsive during btrfs balance
On Thu, Aug 12, 2010 at 3:14 PM, Andreas Philipp philipp.andr...@gmail.com wrote:
> Hi,
>
> I am using a btrfs filesystem created with raid0 for data and metadata for (temporary) storage of TV recordings from my vdr. The filesystem was created under kernel version 2.6.34. An initial btrfs balance command succeeded. Since I upgraded to 2.6.35-rcX and then 2.6.35, btrfs balance no longer finishes but puts the machine in an unresponsive state. Unfortunately, I do not see any kernel oops or other debug information because even the display freezes. The last thing that happens is that these two lines are written to /var/log/messages:
>
> Aug 11 21:42:23 thor kernel: btrfs: found 62911 extents
> Aug 11 21:42:24 thor kernel: btrfs: relocating block group 1723913469952 flags 9
>
> After that the machine immediately becomes unresponsive. As I did not see anything that might be related to my problem in the changelog for 2.6.35.1, I did not try again with that version.

Do you have more than one machine? Would you please set up netconsole to see what happens?

Thanks
Yan, Zheng
Re: can't unmount
On Aug 12, 2010, at 10:58 AM, K. Richard Pixley r...@noir.com wrote:
> And should I be worried about what umount -l might be leaving behind? (e.g., any unfreed kernel resources) Or is that a reasonable way to deal with this situation on an ongoing basis?
>
> --rich
>
> On 8/12/10 08:55 , K. Richard Pixley wrote:
>> I'm running into a situation where I can't unmount a mounted snapshot. It shows busy even though neither lsof nor fuser show any open files. umount -f doesn't work although umount -l does.
>>
>> Is there anything else I can do to debug this scenario or to clear the busy status myself? Or am I down to rebooting each time?
>>
>> This is on stock ubuntu-10.04, x86.
>>
>> --rich

You are lazy unmounting; as I understand it, you are essentially just hiding from userspace the fact that the mount was busy. The mount will remain active in the kernel until you resolve whatever was stopping umount in the first place; the kernel will then silently unmount.

Does this affect all of your mounted snapshots, or only a particular one?

C Anthony [mobile]
Re: can't unmount
On 8/12/10 10:46 , C Anthony Risinger wrote:
> On Aug 12, 2010, at 10:58 AM, K. Richard Pixley r...@noir.com wrote:
>> And should I be worried about what umount -l might be leaving behind? (eg, any unfreed kernel resources) Or is that a reasonable way to deal with this situation on an ongoing basis?
>>
>> On 8/12/10 08:55 , K. Richard Pixley wrote:
>>> I'm running into a situation where I can't unmount a mounted snapshot. It shows busy even though neither lsof nor fuser show any open files. Umount -f doesn't work although umount -l does.
>>>
>>> Is there anything else I can do to debug this scenario or to clear the busy status myself? Or am I down to rebooting each time?
>>>
>>> This is on stock ubuntu-10.04, x86.
>
> You are lazy unmounting, as I understand, you are essentially just hiding the fact that the mount was busy to userspace... The mount will remain active in the kernel until you resolve whatever was stopping umount in the first place; kernel will then silently unmount.

Understood.

> Does this affect all of your mounted snapshots, or only a particular one?

I'm only mounting one at a time so I haven't noticed. Will check next time it occurs.

--rich
[RFC v2 PATCH 1/6] Btrfs: Add experimental hot data hash list index
From: Ben Chociej bchoc...@gmail.com

Adds a hash table structure to efficiently lookup the data temperature of a file. Also adds a function to calculate that temperature based on some metrics kept in custom frequency data structs (in the next patch).

Signed-off-by: Ben Chociej bchoc...@gmail.com
Signed-off-by: Matt Lupfer mlup...@gmail.com
Signed-off-by: Conor Scott consc...@vt.edu
Reviewed-by: Mingming Cao c...@us.ibm.com
---
 fs/btrfs/hotdata_hash.c |  338 +++
 fs/btrfs/hotdata_hash.h |  155 ++
 2 files changed, 493 insertions(+), 0 deletions(-)
 create mode 100644 fs/btrfs/hotdata_hash.c
 create mode 100644 fs/btrfs/hotdata_hash.h

diff --git a/fs/btrfs/hotdata_hash.c b/fs/btrfs/hotdata_hash.c
new file mode 100644
index 000..b789edd
--- /dev/null
+++ b/fs/btrfs/hotdata_hash.c
@@ -0,0 +1,338 @@
+/*
+ * fs/btrfs/hotdata_hash.c
+ *
+ * Copyright (C) 2010 International Business Machines Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/list.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/hardirq.h>
+#include <linux/hash.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+#include "hotdata_map.h"
+#include "hotdata_hash.h"
+#include "hotdata_relocate.h"
+#include "async-thread.h"
+#include "ctree.h"
+
+struct heat_hashlist_node *alloc_heat_hashlist_node(gfp_t mask)
+{
+	struct heat_hashlist_node *node;
+
+	node = kmalloc(sizeof(struct heat_hashlist_node), mask);
+	if (!node || IS_ERR(node))
+		return node;
+	INIT_HLIST_NODE(&node->hashnode);
+	node->freq_data = NULL;
+	node->hlist = NULL;
+	node->location = BTRFS_ON_ROTATING;
+	spin_lock_init(&node->lock);
+	spin_lock_init(&node->location_lock);
+	atomic_set(&node->refs, 1);
+
+	return node;
+}
+
+void free_heat_hashlists(struct btrfs_root *root)
+{
+	int i;
+
+	/* Free node/range heat hash lists */
+	for (i = 0; i < HEAT_HASH_SIZE; i++) {
+		struct hlist_node *pos = NULL, *pos2 = NULL;
+		struct heat_hashlist_node *heatnode = NULL;
+
+		hlist_for_each_safe(pos, pos2,
+			&root->heat_inode_hl[i].hashhead) {
+			heatnode = hlist_entry(pos, struct heat_hashlist_node,
+					hashnode);
+			hlist_del(pos);
+			kfree(heatnode);
+		}
+		hlist_for_each_safe(pos, pos2,
+			&root->heat_range_hl[i].hashhead) {
+			heatnode = hlist_entry(pos, struct heat_hashlist_node,
+					hashnode);
+			hlist_del(pos);
+			kfree(heatnode);
+		}
+	}
+}
+
+/*
+ * btrfs_get_temp is responsible for distilling the six heat criteria, which
+ * are described in detail in hotdata_hash.h) down into a single temperature
+ * value for the data, which is an integer between 0 and HEAT_MAX_VALUE.
+ *
+ * To accomplish this, the raw values from the btrfs_freq_data structure
+ * are shifted various ways in order to make the temperature calculation more
+ * or less sensitive to each value.
+ *
+ * Once this calibration has happened, we do some additional normalization and
+ * make sure that everything fits nicely in a u32. From there, we take a very
+ * rudimentary kind of average of each of the values, where the *_COEFF_POWER
+ * values act as weights for the average.
+ *
+ * Finally, we use the HEAT_HASH_BITS value, which determines the size of the
+ * heat hash list, to normalize the temperature to the proper granularity.
+ */
+int btrfs_get_temp(struct btrfs_freq_data *fdata)
+{
+	u32 result = 0;
+
+	struct timespec ckt = current_kernel_time();
+	u64 cur_time = timespec_to_ns(&ckt);
+
+	u32 nrr_heat = fdata->nr_reads << NRR_MULTIPLIER_POWER;
+	u32 nrw_heat = fdata->nr_writes << NRW_MULTIPLIER_POWER;
+
+	u64 ltr_heat = (cur_time - timespec_to_ns(&fdata->last_read_time))
+			>> LTR_DIVIDER_POWER;
+	u64 ltw_heat = (cur_time - timespec_to_ns(&fdata->last_write_time))
+			>> LTW_DIVIDER_POWER;
+
+	u64 avr_heat = (((u64) -1) - fdata->avg_delta_reads)
+			>> AVR_DIVIDER_POWER;
+	u64
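The comment above btrfs_get_temp describes a pipeline of shift, clamp to 32 bits, weighted average, then reduction to HEAT_HASH_BITS. Since the archived patch is cut off before the function body finishes, here is a rough userspace sketch of that idea using only two of the six criteria. All SKETCH_* constants, the helper names, and the exact combination rule are assumptions made for illustration; they are not taken from hotdata_hash.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tuning values; the real ones are defined in hotdata_hash.h. */
#define SKETCH_NRR_MULTIPLIER_POWER 20 /* amplify the read count */
#define SKETCH_LTR_DIVIDER_POWER    30 /* damp the ns elapsed since last read */
#define SKETCH_NRR_COEFF_POWER      1  /* each term is weighted by 2^-1 */
#define SKETCH_LTR_COEFF_POWER      1
#define SKETCH_HEAT_HASH_BITS       8  /* final temperature lies in 0..255 */

/* Saturate a 64-bit value so that it fits in 32 bits. */
uint32_t sketch_clamp_u32(uint64_t v)
{
	return v > UINT32_MAX ? UINT32_MAX : (uint32_t)v;
}

/* Combine a read count and the age of the last read into one temperature. */
uint32_t sketch_temp(uint32_t nr_reads, uint64_t ns_since_last_read)
{
	/* Calibration: shift each raw value to tune its sensitivity. */
	uint64_t nrr_raw = (uint64_t)nr_reads << SKETCH_NRR_MULTIPLIER_POWER;
	uint64_t ltr_raw = ns_since_last_read >> SKETCH_LTR_DIVIDER_POWER;

	/* Normalization: more reads means hotter, older access means colder. */
	uint32_t nrr_heat = sketch_clamp_u32(nrr_raw);
	uint32_t ltr_heat = UINT32_MAX - sketch_clamp_u32(ltr_raw);

	/* Rudimentary weighted average: the coefficient powers act as weights. */
	uint64_t sum = ((uint64_t)nrr_heat >> SKETCH_NRR_COEFF_POWER) +
		       ((uint64_t)ltr_heat >> SKETCH_LTR_COEFF_POWER);

	/* Keep only the top bits so the result indexes a 2^BITS-entry table. */
	return sketch_clamp_u32(sum) >> (32 - SKETCH_HEAT_HASH_BITS);
}
```

With these made-up constants, a file that was never read and was last touched very long ago comes out at temperature 0, and a heavily read, just-accessed file comes out at 255.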
[RFC v2 PATCH 2/6] Btrfs: Add data structures for hot data tracking
From: Ben Chociej bchoc...@gmail.com

Adds hot_inode_tree and hot_range_tree structs to keep track of frequently accessed files and ranges within files. Trees contain hot_{inode,range}_items representing those files and ranges, each of which contains a btrfs_freq_data struct with its frequency of access metrics (number of {reads, writes}, last {read,write} time, frequency of {reads,writes}).

Having these trees means that Btrfs can quickly determine the temperature of some data by doing some calculations on the btrfs_freq_data struct that hangs off of the tree item.

Also, since it isn't entirely obvious, the frequency of reads or writes is determined by taking a kind of generalized average of the last few (2^N for some tunable N) reads or writes.

Signed-off-by: Ben Chociej bchoc...@gmail.com
Signed-off-by: Matt Lupfer mlup...@gmail.com
Signed-off-by: Conor Scott consc...@vt.edu
Reviewed-by: Mingming Cao c...@us.ibm.com
---
 fs/btrfs/hotdata_map.c |  804 
 fs/btrfs/hotdata_map.h |  167 ++
 2 files changed, 971 insertions(+), 0 deletions(-)
 create mode 100644 fs/btrfs/hotdata_map.c
 create mode 100644 fs/btrfs/hotdata_map.h

diff --git a/fs/btrfs/hotdata_map.c b/fs/btrfs/hotdata_map.c
new file mode 100644
index 000..ddae0c4
--- /dev/null
+++ b/fs/btrfs/hotdata_map.c
@@ -0,0 +1,804 @@
+/*
+ * fs/btrfs/hotdata_map.c
+ *
+ * Copyright (C) 2010 International Business Machines Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/hardirq.h>
+#include <linux/blkdev.h>
+#include "ctree.h"
+#include "hotdata_map.h"
+#include "hotdata_hash.h"
+#include "btrfs_inode.h"
+#include "volumes.h"
+
+/* kmem_cache pointers for slab caches */
+static struct kmem_cache *hot_inode_item_cache;
+static struct kmem_cache *hot_range_item_cache;
+
+static struct hot_inode_item *btrfs_update_inode_freq(struct btrfs_inode *inode,
+						int create);
+
+static int btrfs_update_range_freq(struct hot_inode_item *he,
+				u64 off, u64 len, int create,
+				struct btrfs_root *root);
+
+/* init hot_inode_item kmem cache */
+int __init hot_inode_item_init(void)
+{
+	hot_inode_item_cache = kmem_cache_create("hot_inode_item",
+			sizeof(struct hot_inode_item), 0,
+			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
+	if (!hot_inode_item_cache)
+		return -ENOMEM;
+	return 0;
+}
+
+/* init hot_range_item kmem cache */
+int __init hot_range_item_init(void)
+{
+	hot_range_item_cache = kmem_cache_create("hot_range_item",
+			sizeof(struct hot_range_item), 0,
+			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
+	if (!hot_range_item_cache)
+		return -ENOMEM;
+	return 0;
+}
+
+void hot_inode_item_exit(void)
+{
+	if (hot_inode_item_cache)
+		kmem_cache_destroy(hot_inode_item_cache);
+}
+
+void hot_range_item_exit(void)
+{
+	if (hot_range_item_cache)
+		kmem_cache_destroy(hot_range_item_cache);
+}
+
+/*
+ * Initialize the inode tree. Should be called for each new inode
+ * access or other user of the hot_inode interface.
+ */
+void hot_inode_tree_init(struct hot_inode_tree *tree)
+{
+	tree->map = RB_ROOT;
+	rwlock_init(&tree->lock);
+}
+
+/*
+ * Initialize the hot range tree. Should be called for each new inode
+ * access or other user of the hot_range interface.
+ */
+void hot_range_tree_init(struct hot_range_tree *tree)
+{
+	tree->map = RB_ROOT;
+	rwlock_init(&tree->lock);
+}
+
+/*
+ * Allocate a new hot_inode_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using free_inode_item()
+ */
+struct hot_inode_item *alloc_hot_inode_item(unsigned long ino)
+{
+	struct hot_inode_item *he;
+	he = kmem_cache_alloc(hot_inode_item_cache, GFP_KERNEL | GFP_NOFS);
+	if (!he || IS_ERR(he))
+		return he;
+
+	atomic_set(&he->refs, 1);
+	he->in_tree = 0;
+	he->i_ino = ino;
+	he->heat_node = alloc_heat_hashlist_node(GFP_KERNEL | GFP_NOFS);
+	he->heat_node->freq_data = &he->freq_data;
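The commit message says the read/write frequency is "a kind of generalized average of the last few (2^N for some tunable N)" accesses. The part of the patch that computes it is not visible in this archive, so the following is only one plausible reading: a shift-based running average in which each new inter-access delta gets weight 1/2^N. The name SKETCH_FREQ_POWER and the function are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical N: the average remembers roughly the last 2^2 = 4 samples. */
#define SKETCH_FREQ_POWER 2

/*
 * Fold one new time delta (time since the previous access) into the
 * running average: avg' = (avg * (2^N - 1) + delta) / 2^N.
 * Uses only shifts and adds. Assumes avg is small enough that
 * avg << N does not overflow 64 bits.
 */
uint64_t sketch_update_avg_delta(uint64_t avg, uint64_t delta)
{
	return ((avg << SKETCH_FREQ_POWER) - avg + delta) >> SKETCH_FREQ_POWER;
}
```

A smaller average delta means more frequent access. With N = 2, an average of 100 followed by a single delta of 500 moves to 200, so one slow access only partially cools the data.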
[RFC v2 PATCH 5/6] Btrfs: 3 new ioctls related to hot data features
From: Ben Chociej bchoc...@gmail.com

BTRFS_IOC_GET_HEAT_INFO: return a struct containing the various metrics collected in btrfs_freq_data structs, and also return a calculated data temperature based on those metrics. Optionally, retrieve the temperature from the hot data hash list instead of recalculating it.

BTRFS_IOC_GET_HEAT_OPTS: return an integer representing the current state of hot data tracking and migration:

0 = do nothing
1 = track frequency of access
2 = migrate data to fast media based on temperature (not implemented)

BTRFS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and migration, as described above.

Signed-off-by: Ben Chociej bchoc...@gmail.com
Signed-off-by: Matt Lupfer mlup...@gmail.com
Signed-off-by: Conor Scott consc...@vt.edu
Reviewed-by: Mingming Cao c...@us.ibm.com
---
 fs/btrfs/ioctl.c |  142 +-
 fs/btrfs/ioctl.h |   23 +
 2 files changed, 164 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4dbaf89..88cd0e7 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -49,6 +49,8 @@
 #include "print-tree.h"
 #include "volumes.h"
 #include "locking.h"
+#include "hotdata_map.h"
+#include "hotdata_hash.h"
 
 /* Mask out flags that are inappropriate for the given type of inode. */
 static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
@@ -1869,7 +1871,7 @@ static long btrfs_ioctl_default_subvol(struct file *file, void __user *argp)
 	return 0;
 }
 
-long btrfs_ioctl_space_info(struct btrfs_root *root, void __user *arg)
+static long btrfs_ioctl_space_info(struct btrfs_root *root, void __user *arg)
 {
 	struct btrfs_ioctl_space_args space_args;
 	struct btrfs_ioctl_space_info space;
@@ -1974,6 +1976,138 @@ long btrfs_ioctl_trans_end(struct file *file)
 	return 0;
 }
 
+/*
+ * Retrieve information about access frequency for the given file. Return it in
+ * a userspace-friendly struct for btrfsctl (or another tool) to parse.
+ *
+ * The temperature that is returned can be live -- that is, recalculated when
+ * the ioctl is called -- or it can be returned from the hashtable, reflecting
+ * the (possibly old) value that the system will use when considering files
+ * for migration. This behavior is determined by heat_info->live.
+ */
+static long btrfs_ioctl_heat_info(struct file *file, void __user *argp)
+{
+	struct inode *mnt_inode = fdentry(file)->d_inode;
+	struct inode *file_inode;
+	struct file *file_filp;
+	struct btrfs_root *root = BTRFS_I(mnt_inode)->root;
+	struct btrfs_ioctl_heat_info *heat_info;
+	struct hot_inode_tree *hitree;
+	struct hot_inode_item *he;
+	int ret;
+
+	heat_info = kmalloc(sizeof(struct btrfs_ioctl_heat_info),
+			GFP_KERNEL | GFP_NOFS);
+
+	if (copy_from_user((void *) heat_info,
+			argp,
+			sizeof(struct btrfs_ioctl_heat_info)) != 0) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	file_filp = filp_open(heat_info->filename, O_RDONLY, 0);
+	file_inode = file_filp->f_dentry->d_inode;
+	filp_close(file_filp, NULL);
+
+	hitree = &root->hot_inode_tree;
+	read_lock(&hitree->lock);
+	he = lookup_hot_inode_item(hitree, file_inode->i_ino);
+	read_unlock(&hitree->lock);
+
+	if (!he || IS_ERR(he)) {
+		/* we don't have any info on this file yet */
+		ret = -ENODATA;
+		goto err;
+	}
+
+	spin_lock(&he->lock);
+
+	heat_info->avg_delta_reads =
+		(__u64) he->freq_data.avg_delta_reads;
+	heat_info->avg_delta_writes =
+		(__u64) he->freq_data.avg_delta_writes;
+	heat_info->last_read_time =
+		(__u64) timespec_to_ns(&he->freq_data.last_read_time);
+	heat_info->last_write_time =
+		(__u64) timespec_to_ns(&he->freq_data.last_write_time);
+	heat_info->num_reads =
+		(__u32) he->freq_data.nr_reads;
+	heat_info->num_writes =
+		(__u32) he->freq_data.nr_writes;
+
+	if (heat_info->live > 0) {
+		/* got a request for live temperature,
+		 * call btrfs_get_temp to recalculate */
+		heat_info->temperature = btrfs_get_temp(&he->freq_data);
+	} else {
+		/* not live temperature, get it from the hashlist */
+		read_lock(&he->heat_node->hlist->rwlock);
+		heat_info->temperature = he->heat_node->hlist->temperature;
+		read_unlock(&he->heat_node->hlist->rwlock);
+	}
+
+	spin_unlock(&he->lock);
+	free_hot_inode_item(he);
+
+	if (copy_to_user(argp, (void *) heat_info,
+			 sizeof(struct btrfs_ioctl_heat_info))) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	kfree(heat_info);
+	return 0;
+
+err:
+	kfree(heat_info);
+	return ret;
+}
+
+static long
[RFC v2 PATCH 4/6] Btrfs: Add debugfs interface for hot data stats
From: Ben Chociej bchoc...@gmail.com

Add a /sys/kernel/debug/btrfs_data/<device_name>/ directory for each volume that contains two files. The first, `inode_data', contains the heat information for inodes that have been brought into the hot data map structures. The second, `range_data', contains similar information for subfile ranges.

Signed-off-by: Matt Lupfer mlup...@gmail.com
Signed-off-by: Conor Scott consc...@vt.edu
Signed-off-by: Ben Chociej bchoc...@gmail.com
Reviewed-by: Mingming Cao c...@us.ibm.com
---
 fs/btrfs/debugfs.c |  532 
 fs/btrfs/debugfs.h |   89 +
 2 files changed, 621 insertions(+), 0 deletions(-)
 create mode 100644 fs/btrfs/debugfs.c
 create mode 100644 fs/btrfs/debugfs.h

diff --git a/fs/btrfs/debugfs.c b/fs/btrfs/debugfs.c
new file mode 100644
index 000..c11c0b6
--- /dev/null
+++ b/fs/btrfs/debugfs.c
@@ -0,0 +1,532 @@
+/*
+ * fs/btrfs/debugfs.c
+ *
+ * This file contains the code to interface with the btrfs debugfs.
+ * The debugfs outputs range- and file-level access frequency
+ * statistics for each mounted volume.
+ *
+ * Copyright (C) 2010 International Business Machines Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/debugfs.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/vmalloc.h>
+#include <linux/limits.h>
+#include "ctree.h"
+#include "hotdata_map.h"
+#include "hotdata_hash.h"
+#include "hotdata_relocate.h"
+#include "debugfs.h"
+
+static int copy_msg_to_log(struct debugfs_vol_data *data, char *msg, int len)
+{
+	struct lstring *debugfs_log = data->debugfs_log;
+	uint new_log_alloc_size;
+	char *new_log;
+
+	if (len >= data->log_alloc_size - debugfs_log->len) {
+		/* Not enough room in the log buffer for the new message. */
+		/* Allocate a bigger buffer. */
+		new_log_alloc_size = data->log_alloc_size + LOG_PAGE_SIZE;
+		new_log = vmalloc(new_log_alloc_size);
+
+		if (new_log) {
+			memcpy(new_log, debugfs_log->str,
+				debugfs_log->len);
+			memset(new_log + debugfs_log->len, 0,
+				new_log_alloc_size - debugfs_log->len);
+			vfree(debugfs_log->str);
+			debugfs_log->str = new_log;
+			data->log_alloc_size = new_log_alloc_size;
+		} else {
+			WARN_ON(1);
+			if (data->log_alloc_size - debugfs_log->len) {
+				#define err_msg "No more memory!\n"
+				strlcpy(debugfs_log->str +
+					debugfs_log->len,
+					err_msg, data->log_alloc_size -
+					debugfs_log->len);
+				debugfs_log->len +=
+					min((typeof(debugfs_log->len))
+						sizeof(err_msg),
+						((typeof(debugfs_log->len))
+						data->log_alloc_size -
+						debugfs_log->len));
+			}
+			return 0;
+		}
+	}
+
+	memcpy(debugfs_log->str + debugfs_log->len,
+		data->log_work_buff, len);
+	debugfs_log->len += (unsigned long) len;
+
+	return len;
+}
+
+/* Returns the number of bytes written to the log. */
+static int debugfs_log(struct debugfs_vol_data *data, const char *fmt, ...)
+{
+	struct lstring *debugfs_log = data->debugfs_log;
+	va_list args;
+	int len;
+
+	if (debugfs_log->str == NULL)
+		return -1;
+
+	spin_lock(&data->log_lock);
+
+	va_start(args, fmt);
+	len = vsnprintf(data->log_work_buff, sizeof(data->log_work_buff), fmt,
+			args);
+	va_end(args);
+
+	if (len >= sizeof(data->log_work_buff)) {
+		#define truncate_msg "The next message has been truncated.\n"
+		copy_msg_to_log(data, truncate_msg, sizeof(truncate_msg));
+	}
+
+	len = copy_msg_to_log(data, data->log_work_buff, len);
+	spin_unlock(&data->log_lock);
+
+	return len;
+}
+
+/* initialize a log corresponding to a btrfs
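For readers following copy_msg_to_log, the grow-then-append pattern it uses can be sketched in userspace as below. This is not the patch's code: it uses realloc instead of vmalloc/vfree, the struct and SKETCH_LOG_PAGE_SIZE are invented names, and it loops until the message fits, whereas the quoted kernel code grows the buffer by a single LOG_PAGE_SIZE per call (whether that difference matters depends on the size of log_work_buff, which is not visible here).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_LOG_PAGE_SIZE 64 /* hypothetical growth step, in bytes */

/* Hypothetical stand-in for the lstring plus log_alloc_size bookkeeping. */
struct sketch_log {
	char *str;    /* buffer; always NUL-terminated after an append */
	size_t len;   /* bytes used */
	size_t alloc; /* bytes allocated */
};

/*
 * Append len bytes of msg. Grows the buffer in fixed steps until the
 * message plus a trailing NUL fits. Returns the number of bytes appended,
 * or 0 if memory could not be allocated (the log is left untouched).
 */
size_t sketch_log_append(struct sketch_log *log, const char *msg, size_t len)
{
	if (len >= log->alloc - log->len) {
		size_t new_alloc = log->alloc;
		char *new_str;

		while (len >= new_alloc - log->len)
			new_alloc += SKETCH_LOG_PAGE_SIZE;

		new_str = realloc(log->str, new_alloc);
		if (!new_str)
			return 0;

		/* Zero the unused tail so the log stays NUL-terminated. */
		memset(new_str + log->len, 0, new_alloc - log->len);
		log->str = new_str;
		log->alloc = new_alloc;
	}

	memcpy(log->str + log->len, msg, len);
	log->len += len;
	return len;
}
```

Starting from a zeroed struct sketch_log, the first append allocates one page; a later 100-byte append grows the buffer to 128 bytes in a single call.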
[RFC v2 PATCH 3/6] Btrfs: Add hot data relocation facilities
From: Ben Chociej bchoc...@gmail.com

The relocation code operates on the heat hash lists to identify hot or cold data logical file ranges that are candidates for relocation. The triggering mechanism for relocation is controlled by a global heat threshold integer value (fs_root->heat_threshold). Ranges are queued for relocation by the periodically-executing relocate kthread, which updates the global heat threshold and responds to space pressure on the SSDs.

The heat hash lists index logical ranges by heat and provide a constant-time access path to hot or cold range items. The relocation kthread uses this path to find hot or cold items to move to/from SSD. To ensure that the relocation kthread has a chance to sleep, and to prevent thrashing between SSD and HDD, there is a configurable limit to how many ranges are moved per iteration of the kthread. This limit may be overrun in the case where space pressure requires that items be aggressively moved from SSD back to HDD.

This needs still more resistance to thrashing and stronger (read: actual) guarantees that relocation operations won't -ENOSPC.

The relocation code has introduced two new btrfs block group types: BTRFS_BLOCK_GROUP_DATA_SSD and BTRFS_BLOCK_GROUP_METADATA_SSD. The latter is not currently implemented; to wit, this implementation does not move any metadata, including inlined extents, to SSD.

When mkfs'ing a volume with the hot data relocation option, initial block groups are allocated to the proper disks. Runtime block group allocation only allocates BTRFS_BLOCK_GROUP_DATA, BTRFS_BLOCK_GROUP_METADATA and BTRFS_BLOCK_GROUP_SYSTEM to HDD, and likewise only allocates BTRFS_BLOCK_GROUP_DATA_SSD and BTRFS_BLOCK_GROUP_METADATA_SSD to SSD (assuming, critically, the HOTDATAMOVE option is set at mount time).

Signed-off-by: Ben Chociej bchoc...@gmail.com
Signed-off-by: Matt Lupfer mlup...@gmail.com
Signed-off-by: Conor Scott consc...@vt.edu
Reviewed-by: Mingming Cao c...@us.ibm.com
---
 fs/btrfs/hotdata_relocate.c |  783 +++
 fs/btrfs/hotdata_relocate.h |   73 
 2 files changed, 856 insertions(+), 0 deletions(-)
 create mode 100644 fs/btrfs/hotdata_relocate.c
 create mode 100644 fs/btrfs/hotdata_relocate.h

diff --git a/fs/btrfs/hotdata_relocate.c b/fs/btrfs/hotdata_relocate.c
new file mode 100644
index 000..c5060c4
--- /dev/null
+++ b/fs/btrfs/hotdata_relocate.c
@@ -0,0 +1,783 @@
+/*
+ * fs/btrfs/hotdata_relocate.c
+ *
+ * Copyright (C) 2010 International Business Machines Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/kthread.h>
+#include <linux/list.h>
+#include <linux/freezer.h>
+#include <linux/spinlock.h>
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+#include <linux/slab.h>
+#include "hotdata_map.h"
+#include "hotdata_relocate.h"
+#include "btrfs_inode.h"
+#include "ctree.h"
+#include "volumes.h"
+
+/*
+ * Hot data relocation strategy:
+ *
+ * The relocation code below operates on the heat hash lists to identify
+ * hot or cold data logical file ranges that are candidates for relocation.
+ * The triggering mechanism for relocation is controlled by a global heat
+ * threshold integer value (fs_root->heat_threshold). Ranges are queued
+ * for relocation by the periodically executing relocate kthread, which
+ * updates the global heat threshold and responds to space pressure on the
+ * SSDs.
+ *
+ * The heat hash lists index logical ranges by heat and provide a constant-time
+ * access path to hot or cold range items. The relocation kthread uses this
+ * path to find hot or cold items to move to/from SSD. To ensure that the
+ * relocation kthread has a chance to sleep, and to prevent thrashing between
+ * SSD and HDD, there is a configurable limit to how many ranges are moved per
+ * iteration of the kthread. This limit may be overrun in the case where space
+ * pressure requires that items be aggressively moved from SSD back to HDD.
+ *
+ * This needs still more resistance to thrashing and stronger (read: actual)
+ * guarantees that relocation operations won't -ENOSPC.
+ *
+ * The relocation code has introduced two new btrfs block group types:
+ * BTRFS_BLOCK_GROUP_DATA_SSD and BTRFS_BLOCK_GROUP_METADATA_SSD. The latter is
+ * not currently implemented; to wit, this implementation does not move any
+ * metadata *including inlined extents* to SSD.
+ *
+ * When mkfs'ing a volume with the hot
[PATCH 0/2] Btrfs-progs: Add support for hot data migration
This patch set introduces functionality into btrfsctl and mkfs.btrfs to support the kernel patches for hot data tracking and migration to SSD with Btrfs. New functionality includes a -h option to mkfs.btrfs to preallocate appropriate block group types for SSD data migration, and also includes additional options for btrfsctl to interact with the new ioctls introduced by the kernel patches.

DIFFSTAT:

 btrfsctl.c    |  111 +++-
 ctree.h       |    2 +
 extent-tree.c |    2 +-
 ioctl-test.c  |    3 +
 ioctl.h       |   24 +
 mkfs.c        |  131 ---
 utils.c       |    1 +
 volumes.c     |   73 +-
 volumes.h     |    3 +-
 9 files changed, 326 insertions(+), 24 deletions(-)

Signed-off-by: Ben Chociej bchoc...@gmail.com
Signed-off-by: Matt Lupfer mlup...@gmail.com
Tested-by: Conor Scott consc...@vt.edu
[PATCH 2/2] Btrfs-progs: Add hot data support in mkfs
From: Ben Chociej bchoc...@gmail.com

Modified mkfs.btrfs to add hot data relocation option (-h) which preallocates BTRFS_BLOCK_GROUP_DATA_SSD and BTRFS_BLOCK_GROUP_METADATA_SSD at mkfs time for future use by hot data relocation code. Also added a userspace function to detect whether a block device is an SSD by reading the sysfs block queue rotational flag.

Signed-off-by: Ben Chociej bchoc...@gmail.com
Signed-off-by: Matt Lupfer mlup...@gmail.com
Tested-by: Conor Scott consc...@vt.edu
---
 ctree.h       |    2 +
 extent-tree.c |    2 +-
 mkfs.c        |  131 +
 utils.c       |    1 +
 volumes.c     |   73 +++-
 volumes.h     |    3 +-
 6 files changed, 190 insertions(+), 22 deletions(-)

diff --git a/ctree.h b/ctree.h
index 64ecf12..8c29122 100644
--- a/ctree.h
+++ b/ctree.h
@@ -640,6 +640,8 @@ struct btrfs_csum_item {
 #define BTRFS_BLOCK_GROUP_RAID1    (1 << 4)
 #define BTRFS_BLOCK_GROUP_DUP	   (1 << 5)
 #define BTRFS_BLOCK_GROUP_RAID10   (1 << 6)
+#define BTRFS_BLOCK_GROUP_DATA_SSD (1 << 7)
+#define BTRFS_BLOCK_GROUP_METADATA_SSD (1 << 8)
 
 struct btrfs_block_group_item {
 	__le64 used;
diff --git a/extent-tree.c b/extent-tree.c
index b2f9bb2..a6b2beb 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1812,7 +1812,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
 	    thresh)
 		return 0;
 
-	ret = btrfs_alloc_chunk(trans, extent_root, &start, &num_bytes, flags);
+	ret = btrfs_alloc_chunk(trans, extent_root, &start, &num_bytes, flags, 0);
 	if (ret == -ENOSPC) {
 		space_info->full = 1;
 		return 0;
diff --git a/mkfs.c b/mkfs.c
index 2e99b95..f45cfc3 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -69,7 +69,61 @@ static u64 parse_size(char *s)
 	return atol(s) * mult;
 }
 
-static int make_root_dir(struct btrfs_root *root)
+static int make_root_dir2(struct btrfs_root *root, int hotdata)
+{
+	struct btrfs_trans_handle *trans;
+	u64 chunk_start = 0;
+	u64 chunk_size = 0;
+	int ret;
+
+	trans = btrfs_start_transaction(root, 1);
+
+	/*
+	 * If hotdata option is set, preallocate a metadata SSD block group
+	 * (not currently used)
+	 */
+	if (hotdata) {
+		ret = btrfs_alloc_chunk(trans, root->fs_info->extent_root,
+					&chunk_start, &chunk_size,
+					BTRFS_BLOCK_GROUP_METADATA_SSD, hotdata);
+		BUG_ON(ret);
+		ret = btrfs_make_block_group(trans, root, 0,
+					     BTRFS_BLOCK_GROUP_METADATA_SSD,
+					     BTRFS_FIRST_CHUNK_TREE_OBJECTID,
+					     chunk_start, chunk_size);
+		BUG_ON(ret);
+	}
+
+	ret = btrfs_alloc_chunk(trans, root->fs_info->extent_root,
+				&chunk_start, &chunk_size,
+				BTRFS_BLOCK_GROUP_DATA, hotdata);
+	BUG_ON(ret);
+	ret = btrfs_make_block_group(trans, root, 0,
+				     BTRFS_BLOCK_GROUP_DATA,
+				     BTRFS_FIRST_CHUNK_TREE_OBJECTID,
+				     chunk_start, chunk_size);
+	BUG_ON(ret);
+
+	/*
+	 * If hotdata option is set, preallocate a data SSD block group
+	 */
+	if (hotdata) {
+		ret = btrfs_alloc_chunk(trans, root->fs_info->extent_root,
+					&chunk_start, &chunk_size,
+					BTRFS_BLOCK_GROUP_DATA_SSD, hotdata);
+		BUG_ON(ret);
+		ret = btrfs_make_block_group(trans, root, 0,
+					     BTRFS_BLOCK_GROUP_DATA_SSD,
+					     BTRFS_FIRST_CHUNK_TREE_OBJECTID,
+					     chunk_start, chunk_size);
+		BUG_ON(ret);
+	}
+
+	btrfs_commit_transaction(trans, root);
+	return ret;
+}
+
+static int make_root_dir(struct btrfs_root *root, int hotdata)
 {
 	struct btrfs_trans_handle *trans;
 	struct btrfs_key location;
@@ -90,7 +144,7 @@ static int make_root_dir(struct btrfs_root *root)
 	ret = btrfs_alloc_chunk(trans, root->fs_info->extent_root,
 				&chunk_start, &chunk_size,
-				BTRFS_BLOCK_GROUP_METADATA);
+				BTRFS_BLOCK_GROUP_METADATA, hotdata);
 	BUG_ON(ret);
 	ret = btrfs_make_block_group(trans, root, 0,
 				     BTRFS_BLOCK_GROUP_METADATA,
@@ -103,16 +157,6 @@ static int make_root_dir(struct btrfs_root *root)
 	trans = btrfs_start_transaction(root, 1);
 	BUG_ON(!trans);
 
-	ret = btrfs_alloc_chunk(trans, root->fs_info->extent_root,
-				&chunk_start, &chunk_size,
-				BTRFS_BLOCK_GROUP_DATA);
-	BUG_ON(ret);
-
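The commit message mentions a userspace function that detects an SSD "by reading the sysfs block queue rotational flag", but that function (presumably in the volumes.c or utils.c hunks) is not included in the archived portion. The following is a minimal sketch of how such a check might look, not the patch's implementation. The function names are hypothetical; the sysfs file /sys/block/<dev>/queue/rotational contains 0 for non-rotating devices and 1 for rotating ones.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a "rotational" flag file.
 * Returns 1 if the device is non-rotating (flag is 0, likely an SSD),
 * 0 if it is rotating, and -1 if the file cannot be read or parsed.
 */
int sketch_is_ssd_from_file(const char *rotational_path)
{
	FILE *f = fopen(rotational_path, "r");
	int rotational;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &rotational) != 1) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return rotational == 0;
}

/* dev_name is a whole-disk name such as "sda" (not a partition, not a path). */
int sketch_is_ssd(const char *dev_name)
{
	char path[256];
	int n = snprintf(path, sizeof(path),
			 "/sys/block/%s/queue/rotational", dev_name);

	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	return sketch_is_ssd_from_file(path);
}
```

Keeping the file-reading helper separate from the path construction makes the logic testable without real block devices. Note that the flag reflects what the device reports, so virtual disks and some RAID controllers may report a value that does not match the underlying media.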
[PATCH] btrfs: avoid duplications by moving the static int array from header to c file
Commit 607d432d introduced a static int array in ctree.h and a static inline function (btrfs_super_csum_size) that uses it. The obvious problem is that every C file using that function gets its own local copy of the array, so several callers result in several copies:

$ nm fs/btrfs/btrfs.ko | grep btrfs_csum_sizes
010c r btrfs_csum_sizes
0114 r btrfs_csum_sizes
01c0 r btrfs_csum_sizes
05a0 r btrfs_csum_sizes

The original commit had 4 C files calling this static inline function, and today the same 4 files still call it, so there are 4 copies of btrfs_csum_sizes; future code may call it from more C files, resulting in more copies.

 fs/btrfs/ctree.h     |   19 -
 fs/btrfs/disk-io.c   |   25 +
 fs/btrfs/file-item.c |   56 -
 fs/btrfs/ioctl.c     |    9 ---
 fs/btrfs/tree-log.c  |   10 +---
 5 files changed, 81 insertions(+), 38 deletions(-)

Multiple copies just waste memory; moving the array to a C file avoids the duplication. Since the inline function refers to ARRAY_SIZE of that array, it must know the array size at compile time, so it has to move with the array and can no longer be inlined. The cost is that a formerly inlined call becomes an external function call.

Signed-off-by: Cheng Renquan crq...@gmail.com
---
 fs/btrfs/ctree.c |    9 +
 fs/btrfs/ctree.h |    9 +
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index c3df14c..3a89207 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -24,6 +24,15 @@
 #include "print-tree.h"
 #include "locking.h"
 
+int btrfs_super_csum_size(struct btrfs_super_block *s)
+{
+	static const int btrfs_csum_sizes[] = { 4, 0 };
+
+	int t = btrfs_super_csum_type(s);
+	BUG_ON(t >= ARRAY_SIZE(btrfs_csum_sizes));
+	return btrfs_csum_sizes[t];
+}
+
 static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
 		      *root, struct btrfs_path *path, int level);
 static int split_leaf(struct btrfs_trans_handle *trans, struct btrfs_root
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e9bf864..99220ee 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -132,8 +132,6 @@ struct btrfs_ordered_sum;
 /* csum types */
 #define BTRFS_CSUM_TYPE_CRC32	0
 
-static int btrfs_csum_sizes[] = { 4, 0 };
-
 /* four bytes for CRC32 */
 #define BTRFS_EMPTY_DIR_SIZE 0
 
@@ -1877,12 +1875,7 @@ BTRFS_SETGET_STACK_FUNCS(super_incompat_flags, struct btrfs_super_block,
 BTRFS_SETGET_STACK_FUNCS(super_csum_type, struct btrfs_super_block,
 			 csum_type, 16);
 
-static inline int btrfs_super_csum_size(struct btrfs_super_block *s)
-{
-	int t = btrfs_super_csum_type(s);
-	BUG_ON(t >= ARRAY_SIZE(btrfs_csum_sizes));
-	return btrfs_csum_sizes[t];
-}
+int btrfs_super_csum_size(struct btrfs_super_block *s);
 
 static inline unsigned long btrfs_leaf_data(struct extent_buffer *l)
 {
-- 
1.7.0.4
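The pattern this patch applies can be shown outside the kernel in a few lines. This is a simplified sketch, not the patch itself: the function takes the checksum type directly instead of a superblock, returns -1 where the kernel code uses BUG_ON, and all sketch_* names are made up. The point is that a table defined inside one function in one .c file exists exactly once in the final binary, whereas a `static` array in a header is instantiated in every translation unit that references it.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Lives in a single .c file; the header only carries the prototype
 *     int sketch_csum_size(unsigned int csum_type);
 * so there is one copy of the table no matter how many files call it.
 */
int sketch_csum_size(unsigned int csum_type)
{
	/* Function-local, read-only lookup table: index is the csum type. */
	static const int sizes[] = { 4, 0 };

	if (csum_type >= SKETCH_ARRAY_SIZE(sizes))
		return -1; /* the kernel version treats this as a bug */
	return sizes[csum_type];
}
```

The trade-off is the one the commit message names: callers now pay for a real function call instead of an inlined table lookup.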