[RFC PATCH v2 5/5] BTRFS hot reloc: add hot relocation support

2013-06-21 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add one new mount option '-o hot_move' for hot
relocation support. When hot relocation is enabled,
hot tracking will be enabled automatically.
  Its usage looks like:
mount -o hot_move
mount -o nouser,hot_move
mount -o nouser,hot_move,loop
mount -o hot_move,nouser
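
The rule that hot_move implies hot tracking can be modelled in user space.
What follows is only a minimal sketch, not the kernel code: the two bit
values mirror BTRFS_MOUNT_HOT_TRACK and BTRFS_MOUNT_HOT_MOVE from this
series, while parse_hot_opts() and tracking_needed() are hypothetical
helpers invented for illustration.

```c
#include <assert.h>
#include <string.h>

/* Bit positions used by this series in fs/btrfs/ctree.h. */
#define MOUNT_HOT_TRACK (1UL << 23)
#define MOUNT_HOT_MOVE  (1UL << 24)

/*
 * Hypothetical helper: scan a comma-separated option string and
 * return the hot_* flags found in it. Other options are ignored.
 */
static unsigned long parse_hot_opts(const char *options)
{
	unsigned long flags = 0;
	const char *p = options;

	assert(options != NULL);
	while (*p) {
		size_t len = strcspn(p, ",");

		if (len == strlen("hot_track") && !strncmp(p, "hot_track", len))
			flags |= MOUNT_HOT_TRACK;
		else if (len == strlen("hot_move") && !strncmp(p, "hot_move", len))
			flags |= MOUNT_HOT_MOVE;

		p += len;
		if (*p == ',')
			p++;
	}
	return flags;
}

/* Hot tracking has to be set up when either option is present. */
static int tracking_needed(unsigned long flags)
{
	return (flags & (MOUNT_HOT_MOVE | MOUNT_HOT_TRACK)) != 0;
}
```

With this model, tracking_needed(parse_hot_opts("nouser,hot_move,loop"))
is nonzero even though hot_track was never given explicitly.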

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/btrfs/super.c | 26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 9ee751f..31e1d17 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -309,8 +309,13 @@ static void btrfs_put_super(struct super_block *sb)
 * process...  Whom would you report that to?
 */
 
+	/* Hot data relocation */
+	if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_MOVE))
+		hot_relocate_exit(btrfs_sb(sb));
+
 	/* Hot data tracking */
-	if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
+	if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_MOVE)
+		|| btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
 		hot_track_exit(sb);
 }
 
@@ -325,7 +330,7 @@ enum {
Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
Opt_check_integrity, Opt_check_integrity_including_extent_data,
Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_hot_track,
-   Opt_err,
+   Opt_hot_move, Opt_err,
 };
 
 static match_table_t tokens = {
@@ -366,6 +371,7 @@ static match_table_t tokens = {
 	{Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
 	{Opt_fatal_errors, "fatal_errors=%s"},
 	{Opt_hot_track, "hot_track"},
+	{Opt_hot_move, "hot_move"},
 	{Opt_err, NULL},
 };
 
@@ -634,6 +640,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 		case Opt_hot_track:
 			btrfs_set_opt(info->mount_opt, HOT_TRACK);
 			break;
+		case Opt_hot_move:
+			btrfs_set_opt(info->mount_opt, HOT_MOVE);
+			break;
 		case Opt_err:
 			printk(KERN_INFO "btrfs: unrecognized mount option "
 			       "'%s'\n", p);
@@ -853,17 +862,26 @@ static int btrfs_fill_super(struct super_block *sb,
goto fail_close;
}
 
-	if (btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
+	if (btrfs_test_opt(fs_info->tree_root, HOT_MOVE)
+		|| btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
 		err = hot_track_init(sb);
 		if (err)
 			goto fail_hot;
 	}
 
+	if (btrfs_test_opt(fs_info->tree_root, HOT_MOVE)) {
+		err = hot_relocate_init(fs_info);
+		if (err)
+			goto fail_reloc;
+	}
+
 	save_mount_options(sb, data);
 	cleancache_init_fs(sb);
 	sb->s_flags |= MS_ACTIVE;
 	return 0;
 
+fail_reloc:
+	hot_track_exit(sb);
 fail_hot:
 	dput(sb->s_root);
 	sb->s_root = NULL;
@@ -964,6 +982,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",fatal_errors=panic");
 	if (btrfs_test_opt(root, HOT_TRACK))
 		seq_puts(seq, ",hot_track");
+	if (btrfs_test_opt(root, HOT_MOVE))
+		seq_puts(seq, ",hot_move");
return 0;
 }
 
-- 
1.7.11.7

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v2 0/5] BTRFS hot relocation support

2013-06-21 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  This patchset can work well with the patchset v3 for VFS
hot tracking in RAID single mode now.

  The patchset as RFC is sent out mainly to see if its design
goes in the correct development direction.

  While working on this feature, I have tried to change as little
of the existing btrfs code as possible. After v0 was sent out,
I carefully reviewed the speed profile patchset and do not think
it is meaningful for BTRFS hot relocation. Instead, introducing
one new block group for nonrotating disks looks like a simple and
effective way to tell whether block space is reserved from a
rotating disk or a nonrotating disk. I would very much appreciate
it if developers could double-check whether this design is
appropriate for BTRFS hot relocation.

  The patchset introduces hot relocation support for BTRFS. In a
hybrid storage environment, when data on a rotating disk gets
hot, BTRFS hot relocation automatically moves it to a
nonrotating disk. If the usage ratio of the nonrotating disk
exceeds its upper threshold, data that has gone cold is looked
up and relocated back to the rotating disk first to free space
on the nonrotating disk, and then the hot data is relocated to
the nonrotating disk automatically.

  To relocate hot data, BTRFS hot relocation first reserves block
space on the nonrotating disk, loads the hot data from the
rotating disk into the page cache, allocates block space on the
nonrotating disk, and finally writes the data to it.

Below is its TODO list:

  - BTRFS RAID full support. [Martin Steigerwald, Zhiyong]
  - Mark files as hot via ioctl. [Martin Steigerwald]
  - Easier setup. With BTRFS flexibility I would expect that an SSD used as a
hot data cache can be added and removed on the fly while the filesystem is
mounted. This only seems to be supported at mkfs time as I read the patch
docs, but from my basic technical understanding of BTRFS it can be extended
to work on the fly with a mounted FS as well. [Martin Steigerwald]

  If you'd like to play with it, please pull the patchset from
my git tree on github:
  https://github.com/wuzhy/kernel.git hot_reloc

For how to use it, please refer to the example below:

root@debian-i386:~# echo 0 > /sys/block/vdc/queue/rotational
^^^ The above command makes /dev/vdc look like an SSD
root@debian-i386:~# echo 99 > /proc/sys/fs/hot-age-interval
root@debian-i386:~# echo 10 > /proc/sys/fs/hot-update-interval
root@debian-i386:~# echo 10 > /proc/sys/fs/hot-reloc-interval
root@debian-i386:~# mkfs.btrfs -d single -m single -h /dev/vdb /dev/vdc -f

WARNING! - Btrfs v0.20-rc1-254-gb0136aa-dirty IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

[ 140.279011] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 1 transid 16 /dev/vdb
[ 140.283650] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
[ 140.550759] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb
[ 140.552473] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
adding device /dev/vdc id 2
[ 140.636215] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 2 transid 3 /dev/vdc
fs created label (null) on /dev/vdb
nodesize 4096 leafsize 4096 sectorsize 4096 size 14.65GB
Btrfs v0.20-rc1-254-gb0136aa-dirty
root@debian-i386:~# mount -o hot_move /dev/vdb /data2
[ 144.855471] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 6 /dev/vdb
[ 144.870444] btrfs: disk space caching is enabled
[ 144.904214] VFS: Turning on hot data tracking
root@debian-i386:~# dd if=/dev/zero of=/data2/test1 bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 23.4948 s, 91.4 MB/s
root@debian-i386:~# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 16G 13G 2.2G 86% /
tmpfs 4.8G 0 4.8G 0% /lib/init/rw
udev 10M 176K 9.9M 2% /dev
tmpfs 4.8G 0 4.8G 0% /dev/shm
/dev/vdb 15G 2.0G 13G 14% /data2
root@debian-i386:~# btrfs fi df /data2
Data: total=3.01GB, used=2.00GB
System: total=4.00MB, used=4.00KB
Metadata: total=8.00MB, used=2.19MB
Data_SSD: total=8.00MB, used=0.00
root@debian-i386:~# echo 108 > /proc/sys/fs/hot-reloc-threshold
^^^ The above command starts HOT RELOCATE, because the data temperature is currently 109
root@debian-i386:~# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 16G 13G 2.2G 86% /
tmpfs 4.8G 0 4.8G 0% /lib/init/rw
udev 10M 176K 9.9M 2% /dev
tmpfs 4.8G 0 4.8G 0% /dev/shm
/dev/vdb 15G 2.1G 13G 14% /data2
root@debian-i386:~# btrfs fi df /data2
Data: total=3.01GB, used=6.25MB
System: total=4.00MB, used=4.00KB
Metadata: total=8.00MB, used=2.26MB
Data_SSD: total=2.01GB, used=2.00GB
root@debian-i386:~#

Changelog from v1:
 - Fixed up one nospc bug which is introduced by this feature.

v1:
 - Refactor introducing one new block group.


Zhi Yong Wu (5):
  BTRFS hot reloc, vfs: add one list_head field
  BTRFS hot reloc: add one new block group
  BTRFS hot reloc: add one hot reloc thread
  BTRFS hot reloc, procfs: add three proc interfaces
  

[RFC PATCH v2 1/5] BTRFS hot reloc, vfs: add one list_head field

2013-06-21 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add one list_head field 'reloc_list' to accommodate
hot relocation support.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c| 1 +
 include/linux/hot_tracking.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index dbc90d4..f013182 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -44,6 +44,7 @@ static void hot_comm_item_init(struct hot_comm_item *ci, int type)
 	clear_bit(HOT_IN_LIST, &ci->delete_flag);
 	clear_bit(HOT_DELETING, &ci->delete_flag);
 	INIT_LIST_HEAD(&ci->track_list);
+	INIT_LIST_HEAD(&ci->reloc_list);
 	memset(&ci->hot_freq_data, 0, sizeof(struct hot_freq_data));
 	ci->hot_freq_data.avg_delta_reads = (u64) -1;
 	ci->hot_freq_data.avg_delta_writes = (u64) -1;
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 1009377..98bb092 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -75,6 +75,7 @@ struct hot_comm_item {
unsigned long delete_flag;
struct rcu_head c_rcu;
struct list_head track_list;/* link to *_map[] */
+   struct list_head reloc_list;/* used in hot relocation*/
 };
 
 /* An item representing an inode and its access frequency */
-- 
1.7.11.7



[RFC PATCH v2 3/5] BTRFS hot reloc: add one hot reloc thread

2013-06-21 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add one private thread for hot relocation. On each run it first
checks whether there are extents hotter than the threshold and
queues them; if there are none, it returns and waits for its
next turn. Otherwise it checks whether the nonrotating disk usage
ratio is beyond its usage threshold. If not, it directly
relocates the hot extents from the rotating disk to the
nonrotating disk. If so, it finds the extents with low
temperature, queues them, relocates them to the rotating disk,
and finally relocates the hot extents from the rotating disk to
the nonrotating disk.
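
The decision made on each pass of the thread can be sketched in plain C.
This is an illustration under stated assumptions rather than the code in
hot_relocate.c: HIGH_WATER_LEVEL is the value from hot_relocate.h, but
enum reloc_action, nonrot_ratio() and plan_pass() are hypothetical names,
and treating a ratio equal to the water level as "beyond" it is an
assumption.

```c
#include <assert.h>

#define HIGH_WATER_LEVEL 75	/* percent, as defined in hot_relocate.h */

/* Hypothetical summary of what one pass of the thread will do. */
enum reloc_action {
	RELOC_NOTHING,		/* no extent is hotter than the threshold */
	RELOC_HOT_ONLY,		/* move hot extents to the nonrotating disk */
	RELOC_COLD_THEN_HOT,	/* move cold extents out first, then hot ones in */
};

/* Percentage of the nonrotating space that is in use. */
static int nonrot_ratio(unsigned long long used, unsigned long long total)
{
	assert(total > 0);
	return (int)(used * 100 / total);
}

/* nr_hot: number of extents queued because they exceed the heat threshold. */
static enum reloc_action plan_pass(int nr_hot, int ratio)
{
	if (nr_hot == 0)
		return RELOC_NOTHING;
	if (ratio < HIGH_WATER_LEVEL)
		return RELOC_HOT_ONLY;
	return RELOC_COLD_THEN_HOT;
}
```

For example, with 2 of 8 units used on the nonrotating disk the ratio is 25,
so a pass with queued hot extents relocates them directly without first
moving cold data back.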

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/btrfs/Makefile   |   3 +-
 fs/btrfs/ctree.h|   2 +
 fs/btrfs/hot_relocate.c | 713 
 fs/btrfs/hot_relocate.h |  43 +++
 4 files changed, 760 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/hot_relocate.c
 create mode 100644 fs/btrfs/hot_relocate.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 3932224..94f1ea5 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -8,7 +8,8 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
-  reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o
+  reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
+  hot_relocate.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1c11be1..956115d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1636,6 +1636,8 @@ struct btrfs_fs_info {
struct btrfs_dev_replace dev_replace;
 
atomic_t mutually_exclusive_operation_running;
+
+   void *hot_reloc;
 };
 
 /*
diff --git a/fs/btrfs/hot_relocate.c b/fs/btrfs/hot_relocate.c
new file mode 100644
index 000..ae28b86
--- /dev/null
+++ b/fs/btrfs/hot_relocate.c
@@ -0,0 +1,713 @@
+/*
+ * fs/btrfs/hot_relocate.c
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu wu...@linux.vnet.ibm.com
+ *Ben Chociej bchoc...@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/writeback.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+#include <linux/module.h>
+#include "hot_relocate.h"
+
+/*
+ * Hot relocation strategy:
+ *
+ * The relocation code below operates on the heat map lists to identify
+ * hot or cold data logical file ranges that are candidates for relocation.
+ * The triggering mechanism for relocation is controlled by a global heat
+ * threshold integer value (HOT_RELOC_THRESHOLD). Ranges are
+ * queued for relocation by the periodically executing relocate kthread,
+ * which updates the global heat threshold and responds to space pressure
+ * on the nonrotating disks.
+ *
+ * The heat map lists index logical ranges by heat and provide a constant-time
+ * access path to hot or cold range items. The relocation kthread uses this
+ * path to find hot or cold items to move to/from nonrotating disks. To ensure
+ * that the relocation kthread has a chance to sleep, and to prevent thrashing
+ * between nonrotating disks and HDD, there is a configurable limit to how many
+ * ranges are moved per iteration of the kthread. This limit may be overrun in
+ * the case where space pressure requires that items be aggressively moved from
+ * nonrotating disks back to HDD.
+ *
+ * This needs still more resistance to thrashing and stronger (read: actual)
+ * guarantees that relocation operations won't -ENOSPC.
+ *
+ * The relocation code has introduced one new btrfs block group type:
+ * BTRFS_BLOCK_GROUP_DATA_NONROT.
+ *
+ * When mkfs'ing a volume with the hot data relocation option, initial block
+ * groups are allocated to the proper disks. Runtime block group allocation
+ * only allocates BTRFS_BLOCK_GROUP_DATA BTRFS_BLOCK_GROUP_METADATA and
+ * BTRFS_BLOCK_GROUP_SYSTEM to HDD, and likewise only allocates
+ * BTRFS_BLOCK_GROUP_DATA_NONROT to nonrotating disks.
+ * (assuming, critically, the HOT_MOVE option is set at mount time).
+ */
+
+/*
+ * Returns the ratio of nonrotating disks that are full.
+ * If no nonrotating disk is found, returns THRESH_MAX_VALUE + 1.
+ */
+static int hot_calc_nonrot_ratio(struct hot_reloc *hot_reloc)
+{
+	struct btrfs_space_info *info;
+	struct btrfs_device *device, *next;
+	struct btrfs_fs_info *fs_info = hot_reloc->fs_info;
+	u64 total_bytes = 0, bytes_used = 0;
+
+   /*
+* 

[RFC PATCH v2 2/5] BTRFS hot reloc: add one new block group

2013-06-21 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Introduce one new block group BTRFS_BLOCK_GROUP_DATA_NONROT,
which is used to differentiate if the block space is reserved
and allocated from one rotating disk or nonrotating disk.
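
How such a type bit is typically tested can be shown with a small
user-space sketch. The bit values follow fs/btrfs/ctree.h as modified by
this patch; the shortened macro names and the helpers
block_group_is_nonrot() and block_group_type() are hypothetical and exist
only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Existing block group type bits. */
#define BLOCK_GROUP_DATA		(1ULL << 0)
#define BLOCK_GROUP_SYSTEM		(1ULL << 1)
#define BLOCK_GROUP_METADATA		(1ULL << 2)
/* One existing profile bit, included only to show that it is masked out. */
#define BLOCK_GROUP_RAID1		(1ULL << 4)
/* New type bit introduced by this patch. */
#define BLOCK_GROUP_DATA_NONROT		(1ULL << 9)

/* Type mask extended with the new bit, as the patch does. */
#define BLOCK_GROUP_TYPE_MASK	(BLOCK_GROUP_DATA |	\
				 BLOCK_GROUP_SYSTEM |	\
				 BLOCK_GROUP_METADATA |	\
				 BLOCK_GROUP_DATA_NONROT)

/* Hypothetical helper: is this block group meant for a nonrotating disk? */
static int block_group_is_nonrot(uint64_t flags)
{
	return (flags & BLOCK_GROUP_DATA_NONROT) != 0;
}

/* Hypothetical helper: drop profile bits and keep only the type bits. */
static uint64_t block_group_type(uint64_t flags)
{
	return flags & BLOCK_GROUP_TYPE_MASK;
}
```

Because the new bit is part of the type mask, a block group flagged
DATA_NONROT | RAID1 still reports DATA_NONROT as its type.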

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/btrfs/ctree.h|  33 +++---
 fs/btrfs/extent-tree.c  |  99 +
 fs/btrfs/extent_io.c|  51 -
 fs/btrfs/extent_io.h|   7 +++
 fs/btrfs/file.c |  27 +++
 fs/btrfs/free-space-cache.c |   2 +-
 fs/btrfs/inode-map.c|   7 +--
 fs/btrfs/inode.c| 106 +++-
 fs/btrfs/ioctl.c|  17 ---
 fs/btrfs/relocation.c   |   6 ++-
 fs/btrfs/super.c|   4 +-
 fs/btrfs/volumes.c  |  29 +++-
 12 files changed, 318 insertions(+), 70 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 745cac4..1c11be1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -963,6 +963,12 @@ struct btrfs_dev_replace_item {
 #define BTRFS_BLOCK_GROUP_RAID10   (1ULL << 6)
 #define BTRFS_BLOCK_GROUP_RAID5    (1 << 7)
 #define BTRFS_BLOCK_GROUP_RAID6    (1 << 8)
+/*
+ * New block groups for use with BTRFS hot relocation feature.
+ * When BTRFS hot relocation is enabled, *_NONROT block group is
+ * forced to nonrotating drives.
+ */
+#define BTRFS_BLOCK_GROUP_DATA_NONROT  (1ULL << 9)
 #define BTRFS_BLOCK_GROUP_RESERVED BTRFS_AVAIL_ALLOC_BIT_SINGLE
 
 enum btrfs_raid_types {
@@ -978,7 +984,8 @@ enum btrfs_raid_types {
 
 #define BTRFS_BLOCK_GROUP_TYPE_MASK(BTRFS_BLOCK_GROUP_DATA |\
 BTRFS_BLOCK_GROUP_SYSTEM |  \
-BTRFS_BLOCK_GROUP_METADATA)
+BTRFS_BLOCK_GROUP_METADATA | \
+BTRFS_BLOCK_GROUP_DATA_NONROT)
 
 #define BTRFS_BLOCK_GROUP_PROFILE_MASK (BTRFS_BLOCK_GROUP_RAID0 |   \
 BTRFS_BLOCK_GROUP_RAID1 |   \
@@ -1521,6 +1528,7 @@ struct btrfs_fs_info {
struct list_head space_info;
 
struct btrfs_space_info *data_sinfo;
+   struct btrfs_space_info *nonrot_data_sinfo;
 
struct reloc_control *reloc_ctl;
 
@@ -1545,6 +1553,7 @@ struct btrfs_fs_info {
u64 avail_data_alloc_bits;
u64 avail_metadata_alloc_bits;
u64 avail_system_alloc_bits;
+   u64 avail_data_nonrot_alloc_bits;
 
/* restriper state */
spinlock_t balance_lock;
@@ -1557,6 +1566,7 @@ struct btrfs_fs_info {
 
unsigned data_chunk_allocations;
unsigned metadata_ratio;
+   unsigned data_nonrot_chunk_allocations;
 
void *bdev_holder;
 
@@ -1928,6 +1938,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR   (1 << 22)
 #define BTRFS_MOUNT_HOT_TRACK  (1 << 23)
+#define BTRFS_MOUNT_HOT_MOVE   (1 << 24)
 
 #define btrfs_clear_opt(o, opt)    ((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)      ((o) |= BTRFS_MOUNT_##opt)
@@ -3043,6 +3054,8 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_root *root,
 int btrfs_cross_ref_exist(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
  u64 objectid, u64 offset, u64 bytenr);
+struct btrfs_block_group_cache *btrfs_lookup_first_block_group(
+   struct btrfs_fs_info *info, u64 bytenr);
 struct btrfs_block_group_cache *btrfs_lookup_block_group(
 struct btrfs_fs_info *info,
 u64 bytenr);
@@ -3093,6 +3106,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 struct btrfs_root *root,
 u64 bytenr, u64 num_bytes, u64 parent,
 u64 root_objectid, u64 owner, u64 offset, int for_cow);
+struct btrfs_block_group_cache *next_block_group(struct btrfs_root *root,
+struct btrfs_block_group_cache *cache);
 
 int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
@@ -3122,8 +3137,14 @@ enum btrfs_reserve_flush_enum {
BTRFS_RESERVE_FLUSH_ALL,
 };
 
-int btrfs_check_data_free_space(struct inode *inode, u64 bytes);
-void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes);
+enum {
+   TYPE_ROT = 0,   /* rot - rotating */
+   TYPE_NONROT,/* nonrot - nonrotating */
+   MAX_RELOC_TYPES,
+};
+
+int btrfs_check_data_free_space(struct inode *inode, u64 bytes, int *flag);
+void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes, int flag);
 void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,

[RFC PATCH v2 4/5] BTRFS hot reloc, procfs: add three proc interfaces

2013-06-21 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add three proc interfaces, hot-reloc-interval, hot-reloc-threshold,
and hot-reloc-max-items, under /proc/sys/fs/ in order to make
HOT_RELOC_INTERVAL, HOT_RELOC_THRESHOLD, and HOT_RELOC_MAX_ITEMS
tunable.
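
From user space the new knobs are ordinary files, so they can be set
programmatically as well as with echo. The sketch below assumes only the
three file names given above; hot_reloc_sysctl_path() and write_int() are
hypothetical helpers, and writing the files requires root on a kernel
carrying this series.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "/proc/sys/fs/<name>" into buf.
 * Returns 0 on success, -1 if the result does not fit.
 */
static int hot_reloc_sysctl_path(const char *name, char *buf, size_t size)
{
	int n;

	assert(name != NULL && buf != NULL);
	n = snprintf(buf, size, "/proc/sys/fs/%s", name);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Write one integer followed by a newline to path; 0 on success. */
static int write_int(const char *path, int value)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%d\n", value) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```

A caller would build the path for "hot-reloc-threshold" and then call
write_int(path, 108), checking both return values.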

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/btrfs/hot_relocate.c | 26 +-
 fs/btrfs/hot_relocate.h |  5 -
 include/linux/btrfs.h   |  4 
 kernel/sysctl.c | 22 ++
 4 files changed, 43 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/hot_relocate.c b/fs/btrfs/hot_relocate.c
index ae28b86..debf580 100644
--- a/fs/btrfs/hot_relocate.c
+++ b/fs/btrfs/hot_relocate.c
@@ -25,7 +25,7 @@
  * The relocation code below operates on the heat map lists to identify
  * hot or cold data logical file ranges that are candidates for relocation.
  * The triggering mechanism for relocation is controlled by a global heat
- * threshold integer value (HOT_RELOC_THRESHOLD). Ranges are
+ * threshold integer value (sysctl_hot_reloc_thresh). Ranges are
  * queued for relocation by the periodically executing relocate kthread,
  * which updates the global heat threshold and responds to space pressure
  * on the nonrotating disks.
@@ -53,6 +53,15 @@
  * (assuming, critically, the HOT_MOVE option is set at mount time).
  */
 
+int sysctl_hot_reloc_thresh = 150;
+EXPORT_SYMBOL_GPL(sysctl_hot_reloc_thresh);
+
+int sysctl_hot_reloc_interval __read_mostly = 120;
+EXPORT_SYMBOL_GPL(sysctl_hot_reloc_interval);
+
+int sysctl_hot_reloc_max_items __read_mostly = 250;
+EXPORT_SYMBOL_GPL(sysctl_hot_reloc_max_items);
+
 /*
  * Returns the ratio of nonrotating disks that are full.
  * If no nonrotating disk is found, returns THRESH_MAX_VALUE + 1.
@@ -103,7 +112,7 @@ static int hot_calc_nonrot_ratio(struct hot_reloc *hot_reloc)
 static int hot_update_threshold(struct hot_reloc *hot_reloc,
 				int update)
 {
-	int thresh = hot_reloc->thresh;
+	int thresh = sysctl_hot_reloc_thresh;
int ratio = hot_calc_nonrot_ratio(hot_reloc);
 
/* Sometimes update global threshold, others not */
@@ -127,7 +136,7 @@ static int hot_update_threshold(struct hot_reloc *hot_reloc,
thresh = 0;
}
 
-	hot_reloc->thresh = thresh;
+	sysctl_hot_reloc_thresh = thresh;
return ratio;
 }
 
@@ -215,7 +224,7 @@ static int hot_queue_extent(struct hot_reloc *hot_reloc,
*counter = *counter + 1;
}
 
-		if (*counter >= HOT_RELOC_MAX_ITEMS)
+		if (*counter >= sysctl_hot_reloc_max_items)
break;
 
if (kthread_should_stop()) {
@@ -293,7 +302,7 @@ again:
while (1) {
lock_extent(tree, page_start, page_end);
ordered = btrfs_lookup_ordered_extent(inode,
-   page_start);
+ page_start);
unlock_extent(tree, page_start, page_end);
if (!ordered)
break;
@@ -559,7 +568,7 @@ void hot_do_relocate(struct hot_reloc *hot_reloc)
 
run++;
ratio = hot_update_threshold(hot_reloc, !(run % 15));
-	thresh = hot_reloc->thresh;
+	thresh = sysctl_hot_reloc_thresh;
 
 	INIT_LIST_HEAD(&hot_reloc->hot_relocq[TYPE_NONROT]);
 
@@ -569,7 +578,7 @@ void hot_do_relocate(struct hot_reloc *hot_reloc)
if (count_to_hot == 0)
return;
 
-   count_to_cold = HOT_RELOC_MAX_ITEMS;
+   count_to_cold = sysctl_hot_reloc_max_items;
 
 	/* Don't move cold data to HDD unless there's space pressure */
 	if (ratio < HIGH_WATER_LEVEL)
@@ -653,7 +662,7 @@ static int hot_relocate_kthread(void *arg)
unsigned long delay;
 
do {
-		delay = HZ * HOT_RELOC_INTERVAL;
+		delay = HZ * sysctl_hot_reloc_interval;
 		if (mutex_trylock(&hot_reloc->hot_reloc_mutex)) {
 			hot_do_relocate(hot_reloc);
 			mutex_unlock(&hot_reloc->hot_reloc_mutex);
@@ -685,7 +694,6 @@ int hot_relocate_init(struct btrfs_fs_info *fs_info)
 
 	fs_info->hot_reloc = hot_reloc;
 	hot_reloc->fs_info = fs_info;
-	hot_reloc->thresh = HOT_RELOC_THRESHOLD;
 	for (i = 0; i < MAX_RELOC_TYPES; i++)
 		INIT_LIST_HEAD(&hot_reloc->hot_relocq[i]);
 	mutex_init(&hot_reloc->hot_reloc_mutex);
diff --git a/fs/btrfs/hot_relocate.h b/fs/btrfs/hot_relocate.h
index 1b1cfb5..94defe6 100644
--- a/fs/btrfs/hot_relocate.h
+++ b/fs/btrfs/hot_relocate.h
@@ -18,10 +18,6 @@
 #include "btrfs_inode.h"
 #include "volumes.h"
 
-#define HOT_RELOC_INTERVAL  120
-#define HOT_RELOC_THRESHOLD 150
-#define HOT_RELOC_MAX_ITEMS 250
-
 #define HEAT_MAX_VALUE(MAP_SIZE - 1)
 #define HIGH_WATER_LEVEL  75 /* when to raise the threshold */
 

[RFC PATCH v1 1/5] BTRFS hot reloc, vfs: add one list_head field

2013-05-20 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add one list_head field 'reloc_list' to accommodate
hot relocation support.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c| 1 +
 include/linux/hot_tracking.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 46d2f7d..2a59b09 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -41,6 +41,7 @@ static void hot_comm_item_init(struct hot_comm_item *ci, int type)
 	clear_bit(HOT_IN_LIST, &ci->delete_flag);
 	clear_bit(HOT_DELETING, &ci->delete_flag);
 	INIT_LIST_HEAD(&ci->track_list);
+	INIT_LIST_HEAD(&ci->reloc_list);
 	memset(&ci->hot_freq_data, 0, sizeof(struct hot_freq_data));
 	ci->hot_freq_data.avg_delta_reads = (u64) -1;
 	ci->hot_freq_data.avg_delta_writes = (u64) -1;
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 008a5c1..faf1acc 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -74,6 +74,7 @@ struct hot_comm_item {
unsigned long delete_flag;
struct rcu_head c_rcu;
struct list_head track_list;/* link to *_map[] */
+   struct list_head reloc_list;/* used in hot relocation*/
 };
 
 /* An item representing an inode and its access frequency */
-- 
1.7.11.7



[RFC PATCH v1 2/5] BTRFS hot reloc: add one new block group

2013-05-20 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Introduce one new block group BTRFS_BLOCK_GROUP_DATA_NONROT,
which is used to differentiate if the block space is reserved
and allocated from one rotating disk or nonrotating disk.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/btrfs/ctree.h| 33 ---
 fs/btrfs/extent-tree.c  | 99 -
 fs/btrfs/extent_io.c| 59 ++-
 fs/btrfs/extent_io.h|  7 
 fs/btrfs/file.c | 24 +++
 fs/btrfs/free-space-cache.c |  2 +-
 fs/btrfs/inode-map.c|  7 ++--
 fs/btrfs/inode.c| 94 ++
 fs/btrfs/ioctl.c| 17 +---
 fs/btrfs/relocation.c   |  6 ++-
 fs/btrfs/super.c|  4 +-
 fs/btrfs/volumes.c  | 29 -
 12 files changed, 316 insertions(+), 65 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 133a6ed..f7a3170 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -963,6 +963,12 @@ struct btrfs_dev_replace_item {
 #define BTRFS_BLOCK_GROUP_RAID10   (1ULL << 6)
 #define BTRFS_BLOCK_GROUP_RAID5    (1 << 7)
 #define BTRFS_BLOCK_GROUP_RAID6    (1 << 8)
+/*
+ * New block groups for use with BTRFS hot relocation feature.
+ * When BTRFS hot relocation is enabled, *_NONROT block group is
+ * forced to nonrotating drives.
+ */
+#define BTRFS_BLOCK_GROUP_DATA_NONROT  (1ULL << 9)
 #define BTRFS_BLOCK_GROUP_RESERVED BTRFS_AVAIL_ALLOC_BIT_SINGLE
 
 enum btrfs_raid_types {
@@ -978,7 +984,8 @@ enum btrfs_raid_types {
 
 #define BTRFS_BLOCK_GROUP_TYPE_MASK(BTRFS_BLOCK_GROUP_DATA |\
 BTRFS_BLOCK_GROUP_SYSTEM |  \
-BTRFS_BLOCK_GROUP_METADATA)
+BTRFS_BLOCK_GROUP_METADATA | \
+BTRFS_BLOCK_GROUP_DATA_NONROT)
 
 #define BTRFS_BLOCK_GROUP_PROFILE_MASK (BTRFS_BLOCK_GROUP_RAID0 |   \
 BTRFS_BLOCK_GROUP_RAID1 |   \
@@ -1521,6 +1528,7 @@ struct btrfs_fs_info {
struct list_head space_info;
 
struct btrfs_space_info *data_sinfo;
+   struct btrfs_space_info *nonrot_data_sinfo;
 
struct reloc_control *reloc_ctl;
 
@@ -1545,6 +1553,7 @@ struct btrfs_fs_info {
u64 avail_data_alloc_bits;
u64 avail_metadata_alloc_bits;
u64 avail_system_alloc_bits;
+   u64 avail_data_nonrot_alloc_bits;
 
/* restriper state */
spinlock_t balance_lock;
@@ -1557,6 +1566,7 @@ struct btrfs_fs_info {
 
unsigned data_chunk_allocations;
unsigned metadata_ratio;
+   unsigned data_nonrot_chunk_allocations;
 
void *bdev_holder;
 
@@ -1928,6 +1938,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR   (1 << 22)
 #define BTRFS_MOUNT_HOT_TRACK  (1 << 23)
+#define BTRFS_MOUNT_HOT_MOVE   (1 << 24)
 
 #define btrfs_clear_opt(o, opt)    ((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)      ((o) |= BTRFS_MOUNT_##opt)
@@ -3043,6 +3054,8 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_root *root,
 int btrfs_cross_ref_exist(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
  u64 objectid, u64 offset, u64 bytenr);
+struct btrfs_block_group_cache *btrfs_lookup_first_block_group(
+   struct btrfs_fs_info *info, u64 bytenr);
 struct btrfs_block_group_cache *btrfs_lookup_block_group(
 struct btrfs_fs_info *info,
 u64 bytenr);
@@ -3093,6 +3106,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 struct btrfs_root *root,
 u64 bytenr, u64 num_bytes, u64 parent,
 u64 root_objectid, u64 owner, u64 offset, int for_cow);
+struct btrfs_block_group_cache *next_block_group(struct btrfs_root *root,
+struct btrfs_block_group_cache *cache);
 
 int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
@@ -3122,8 +3137,14 @@ enum btrfs_reserve_flush_enum {
BTRFS_RESERVE_FLUSH_ALL,
 };
 
-int btrfs_check_data_free_space(struct inode *inode, u64 bytes);
-void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes);
+enum {
+   TYPE_ROT,   /* rot - rotating */
+   TYPE_NONROT,/* nonrot - nonrotating */
+   MAX_RELOC_TYPES,
+};
+
+int btrfs_check_data_free_space(struct inode *inode, u64 bytes, int *flag);
+void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes, int flag);
 void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,

[RFC PATCH v1 5/5] BTRFS hot reloc: add hot relocation support

2013-05-20 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add one new mount option '-o hot_move' for hot
relocation support. When hot relocation is enabled,
hot tracking will be enabled automatically.
  Its usage looks like:
mount -o hot_move
mount -o nouser,hot_move
mount -o nouser,hot_move,loop
mount -o hot_move,nouser

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/btrfs/super.c | 26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index c10477b..1377551 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -309,8 +309,13 @@ static void btrfs_put_super(struct super_block *sb)
 * process...  Whom would you report that to?
 */
 
+	/* Hot data relocation */
+	if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_MOVE))
+		hot_relocate_exit(btrfs_sb(sb));
+
 	/* Hot data tracking */
-	if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
+	if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_MOVE)
+		|| btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
 		hot_track_exit(sb);
 }
 
@@ -325,7 +330,7 @@ enum {
Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
Opt_check_integrity, Opt_check_integrity_including_extent_data,
Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_hot_track,
-   Opt_err,
+   Opt_hot_move, Opt_err,
 };
 
 static match_table_t tokens = {
@@ -366,6 +371,7 @@ static match_table_t tokens = {
 	{Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
 	{Opt_fatal_errors, "fatal_errors=%s"},
 	{Opt_hot_track, "hot_track"},
+	{Opt_hot_move, "hot_move"},
 	{Opt_err, NULL},
 };
 
@@ -634,6 +640,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 		case Opt_hot_track:
 			btrfs_set_opt(info->mount_opt, HOT_TRACK);
 			break;
+		case Opt_hot_move:
+			btrfs_set_opt(info->mount_opt, HOT_MOVE);
+			break;
 		case Opt_err:
 			printk(KERN_INFO "btrfs: unrecognized mount option "
 			       "'%s'\n", p);
@@ -853,17 +862,26 @@ static int btrfs_fill_super(struct super_block *sb,
goto fail_close;
}
 
-	if (btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
+	if (btrfs_test_opt(fs_info->tree_root, HOT_MOVE)
+		|| btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
 		err = hot_track_init(sb);
 		if (err)
 			goto fail_hot;
 	}
 
+	if (btrfs_test_opt(fs_info->tree_root, HOT_MOVE)) {
+		err = hot_relocate_init(fs_info);
+		if (err)
+			goto fail_reloc;
+	}
+
 	save_mount_options(sb, data);
 	cleancache_init_fs(sb);
 	sb->s_flags |= MS_ACTIVE;
 	return 0;
 
+fail_reloc:
+	hot_track_exit(sb);
 fail_hot:
 	dput(sb->s_root);
 	sb->s_root = NULL;
@@ -964,6 +982,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",fatal_errors=panic");
 	if (btrfs_test_opt(root, HOT_TRACK))
 		seq_puts(seq, ",hot_track");
+	if (btrfs_test_opt(root, HOT_MOVE))
+		seq_puts(seq, ",hot_move");
return 0;
 }
 
-- 
1.7.11.7



[RFC PATCH v1 4/5] BTRFS hot reloc, procfs: add three proc interfaces

2013-05-20 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add three proc interfaces, hot-reloc-interval, hot-reloc-threshold,
and hot-reloc-max-items, under /proc/sys/fs/ so that
HOT_RELOC_INTERVAL, HOT_RELOC_THRESHOLD, and HOT_RELOC_MAX_ITEMS
become tunable at runtime.
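
  As a rough userspace illustration of how such a tunable could be
driven, the sketch below writes an integer to a sysctl-style file and
reads it back. The helper names write_tunable() and read_tunable() are
made up for this example, and the /proc/sys/fs/hot-reloc-* files are
only expected to exist on a kernel carrying this RFC series.

```c
#include <stdio.h>

/* Write an integer tunable to 'path'; returns 0 on success, -1 on failure. */
int write_tunable(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Read an integer tunable from 'path' into *value; 0 on success, -1 on failure. */
int read_tunable(const char *path, int *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%d", value) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

  On a patched kernel, write_tunable("/proc/sys/fs/hot-reloc-interval", 10)
would be the programmatic equivalent of echoing 10 into that file as root.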

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/btrfs/hot_relocate.c | 26 +-
 fs/btrfs/hot_relocate.h |  5 -
 include/linux/btrfs.h   |  4 
 kernel/sysctl.c | 22 ++
 4 files changed, 43 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/hot_relocate.c b/fs/btrfs/hot_relocate.c
index ae28b86..3a18555 100644
--- a/fs/btrfs/hot_relocate.c
+++ b/fs/btrfs/hot_relocate.c
@@ -25,7 +25,7 @@
  * The relocation code below operates on the heat map lists to identify
  * hot or cold data logical file ranges that are candidates for relocation.
  * The triggering mechanism for relocation is controlled by a global heat
- * threshold integer value (HOT_RELOC_THRESHOLD). Ranges are
+ * threshold integer value (sysctl_hot_reloc_threshold). Ranges are
  * queued for relocation by the periodically executing relocate kthread,
  * which updates the global heat threshold and responds to space pressure
  * on the nonrotating disks.
@@ -53,6 +53,15 @@
  * (assuming, critically, the HOT_MOVE option is set at mount time).
  */
 
+int sysctl_hot_reloc_threshold = 150;
+EXPORT_SYMBOL_GPL(sysctl_hot_reloc_threshold);
+
+int sysctl_hot_reloc_interval __read_mostly = 120;
+EXPORT_SYMBOL_GPL(sysctl_hot_reloc_interval);
+
+int sysctl_hot_reloc_max_items __read_mostly = 250;
+EXPORT_SYMBOL_GPL(sysctl_hot_reloc_max_items);
+
 /*
  * Returns the ratio of nonrotating disks that are full.
  * If no nonrotating disk is found, returns THRESH_MAX_VALUE + 1.
@@ -103,7 +112,7 @@ static int hot_calc_nonrot_ratio(struct hot_reloc *hot_reloc)
 static int hot_update_threshold(struct hot_reloc *hot_reloc,
 				int update)
 {
-	int thresh = hot_reloc->thresh;
+	int thresh = sysctl_hot_reloc_threshold;
 	int ratio = hot_calc_nonrot_ratio(hot_reloc);
 
/* Sometimes update global threshold, others not */
@@ -127,7 +136,7 @@ static int hot_update_threshold(struct hot_reloc *hot_reloc,
 			thresh = 0;
 	}
 
-	hot_reloc->thresh = thresh;
+	sysctl_hot_reloc_threshold = thresh;
 	return ratio;
 }
 
@@ -215,7 +224,7 @@ static int hot_queue_extent(struct hot_reloc *hot_reloc,
 			*counter = *counter + 1;
 		}
 
-		if (*counter >= HOT_RELOC_MAX_ITEMS)
+		if (*counter >= sysctl_hot_reloc_max_items)
 			break;
 
 		if (kthread_should_stop()) {
@@ -293,7 +302,7 @@ again:
while (1) {
lock_extent(tree, page_start, page_end);
ordered = btrfs_lookup_ordered_extent(inode,
-   page_start);
+ page_start);
unlock_extent(tree, page_start, page_end);
if (!ordered)
break;
@@ -559,7 +568,7 @@ void hot_do_relocate(struct hot_reloc *hot_reloc)
 
 	run++;
 	ratio = hot_update_threshold(hot_reloc, !(run % 15));
-	thresh = hot_reloc->thresh;
+	thresh = sysctl_hot_reloc_threshold;
 
 	INIT_LIST_HEAD(&hot_reloc->hot_relocq[TYPE_NONROT]);
 
@@ -569,7 +578,7 @@ void hot_do_relocate(struct hot_reloc *hot_reloc)
 	if (count_to_hot == 0)
 		return;
 
-	count_to_cold = HOT_RELOC_MAX_ITEMS;
+	count_to_cold = sysctl_hot_reloc_max_items;
 
 	/* Don't move cold data to HDD unless there's space pressure */
 	if (ratio < HIGH_WATER_LEVEL)
@@ -653,7 +662,7 @@ static int hot_relocate_kthread(void *arg)
 	unsigned long delay;
 
 	do {
-		delay = HZ * HOT_RELOC_INTERVAL;
+		delay = HZ * sysctl_hot_reloc_interval;
 		if (mutex_trylock(&hot_reloc->hot_reloc_mutex)) {
 			hot_do_relocate(hot_reloc);
 			mutex_unlock(&hot_reloc->hot_reloc_mutex);
@@ -685,7 +694,6 @@ int hot_relocate_init(struct btrfs_fs_info *fs_info)
 
 	fs_info->hot_reloc = hot_reloc;
 	hot_reloc->fs_info = fs_info;
-	hot_reloc->thresh = HOT_RELOC_THRESHOLD;
 	for (i = 0; i < MAX_RELOC_TYPES; i++)
 		INIT_LIST_HEAD(&hot_reloc->hot_relocq[i]);
 	mutex_init(&hot_reloc->hot_reloc_mutex);
diff --git a/fs/btrfs/hot_relocate.h b/fs/btrfs/hot_relocate.h
index 1b1cfb5..94defe6 100644
--- a/fs/btrfs/hot_relocate.h
+++ b/fs/btrfs/hot_relocate.h
@@ -18,10 +18,6 @@
 #include "btrfs_inode.h"
 #include "volumes.h"
 
-#define HOT_RELOC_INTERVAL  120
-#define HOT_RELOC_THRESHOLD 150
-#define HOT_RELOC_MAX_ITEMS 250
-
 #define HEAT_MAX_VALUE		(MAP_SIZE - 1)
 #define HIGH_WATER_LEVEL  75 /* when to raise 

[RFC PATCH v1 0/5] BTRFS hot relocation support

2013-05-20 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  This patchset is sent out as an RFC, mainly to check whether its
design is heading in the right direction.

  While working on this feature, I have tried to change as little
of the existing btrfs code as possible. After V0 was sent out, I
carefully checked the speed profile patchset and do not think it is
meaningful for BTRFS hot relocation. Instead, I think a simple and
effective approach is to introduce one new block group for
nonrotating disks, which distinguishes whether block space is
reserved from a rotating disk or a nonrotating disk. I would very
much appreciate it if the developers could double-check whether this
design is appropriate for BTRFS hot relocation.

  The patchset introduces hot relocation support for BTRFS. In a
hybrid storage environment, when data on a rotating disk gets hot,
BTRFS hot relocation moves it to a nonrotating disk automatically.
If the usage ratio of the nonrotating disk exceeds its upper
threshold, data that has gone cold is first looked up and relocated
to a rotating disk to free space on the nonrotating disk, and then
the hot data is relocated to the nonrotating disk automatically.

  To relocate hot data, BTRFS first reserves block space on the
nonrotating disk, loads the hot data from the rotating disk into the
page cache, allocates block space on the nonrotating disk, and
finally writes the data to it.

  If you'd like to play with it, please pull the patchset from
my git tree on github:
  https://github.com/wuzhy/kernel.git hot_reloc

For usage, please refer to the example below:

root@debian-i386:~# echo 0 > /sys/block/vdc/queue/rotational
^^^ The above command makes /dev/vdc look like an SSD
root@debian-i386:~# echo 99 > /proc/sys/fs/hot-age-interval
root@debian-i386:~# echo 10 > /proc/sys/fs/hot-update-interval
root@debian-i386:~# echo 10 > /proc/sys/fs/hot-reloc-interval
root@debian-i386:~# mkfs.btrfs -d single -m single -h /dev/vdb /dev/vdc -f
 
WARNING! - Btrfs v0.20-rc1-254-gb0136aa-dirty IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using
 
[ 140.279011] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 1 transid 16 /dev/vdb
[ 140.283650] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
[ 140.550759] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb
[ 140.552473] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
adding device /dev/vdc id 2
[ 140.636215] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 2 transid 3 /dev/vdc
fs created label (null) on /dev/vdb
nodesize 4096 leafsize 4096 sectorsize 4096 size 14.65GB
Btrfs v0.20-rc1-254-gb0136aa-dirty
root@debian-i386:~# mount -o hot_move /dev/vdb /data2
[ 144.855471] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 6 /dev/vdb
[ 144.870444] btrfs: disk space caching is enabled
[ 144.904214] VFS: Turning on hot data tracking
root@debian-i386:~# dd if=/dev/zero of=/data2/test1 bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 23.4948 s, 91.4 MB/s
root@debian-i386:~# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 16G 13G 2.2G 86% /
tmpfs 4.8G 0 4.8G 0% /lib/init/rw
udev 10M 176K 9.9M 2% /dev
tmpfs 4.8G 0 4.8G 0% /dev/shm
/dev/vdb 15G 2.0G 13G 14% /data2
root@debian-i386:~# btrfs fi df /data2
Data: total=3.01GB, used=2.00GB
System: total=4.00MB, used=4.00KB
Metadata: total=8.00MB, used=2.19MB
Data_SSD: total=8.00MB, used=0.00
root@debian-i386:~# echo 108 > /proc/sys/fs/hot-reloc-threshold
^^^ The above command starts hot relocation, because the data temperature is currently 109
root@debian-i386:~# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 16G 13G 2.2G 86% /
tmpfs 4.8G 0 4.8G 0% /lib/init/rw
udev 10M 176K 9.9M 2% /dev
tmpfs 4.8G 0 4.8G 0% /dev/shm
/dev/vdb 15G 2.1G 13G 14% /data2
root@debian-i386:~# btrfs fi df /data2
Data: total=3.01GB, used=6.25MB
System: total=4.00MB, used=4.00KB
Metadata: total=8.00MB, used=2.26MB
Data_SSD: total=2.01GB, used=2.00GB
root@debian-i386:~# 

Changelog from v0:
 1.) Refactor introducing one new block group.

Zhi Yong Wu (5):
  BTRFS hot reloc, vfs: add one list_head field
  BTRFS hot reloc: add one new block group
  BTRFS hot reloc: add one hot reloc thread
  BTRFS hot reloc, procfs: add three proc interfaces
  BTRFS hot reloc: add hot relocation support

 fs/btrfs/Makefile|   3 +-
 fs/btrfs/ctree.h |  35 ++-
 fs/btrfs/extent-tree.c   |  99 --
 fs/btrfs/extent_io.c |  59 +++-
 fs/btrfs/extent_io.h |   7 +
 fs/btrfs/file.c  |  24 +-
 fs/btrfs/free-space-cache.c  |   2 +-
 fs/btrfs/hot_relocate.c  | 721 +++
 fs/btrfs/hot_relocate.h  |  38 +++
 fs/btrfs/inode-map.c |   7 +-
 fs/btrfs/inode.c |  94 +-
 fs/btrfs/ioctl.c |  17 +-
 fs/btrfs/relocation.c|   6 +-
 fs/btrfs/super.c |  30 +-
 

[RFC PATCH v1 3/5] BTRFS hot reloc: add one hot reloc thread

2013-05-20 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add one private thread for hot relocation. It first checks
whether any extents are hotter than the threshold and queues
them; if there are none, it returns and waits for its next
turn. Otherwise, it checks whether the nonrotating disk usage
ratio is beyond its threshold. If not, it directly relocates
the hot extents from rotating disk to nonrotating disk;
otherwise, it finds the extents with low temperature, queues
them, relocates them from nonrotating disk to rotating disk,
and finally relocates the hot extents from rotating disk to
nonrotating disk.
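
  The decision flow above can be sketched in plain userspace C as
follows. This is only an illustration under stated assumptions:
struct reloc_plan and plan_relocation() are hypothetical names that do
not appear in the patch, and the value 75 for HIGH_WATER_LEVEL is the
one defined in hot_relocate.h.

```c
#include <stdbool.h>

#define HIGH_WATER_LEVEL 75	/* percent, as defined in hot_relocate.h */

/* Hypothetical summary of what one pass of the thread would do. */
struct reloc_plan {
	bool run;		/* false: nothing is hot, wait for the next turn */
	int count_to_hot;	/* hot extents to move to the nonrotating disk */
	int count_to_cold;	/* budget of cold extents to move to rotating disk */
};

struct reloc_plan plan_relocation(int queued_hot, int nonrot_ratio,
				  int max_items)
{
	struct reloc_plan plan = { false, 0, 0 };

	/* No extent is hotter than the threshold: do nothing this turn. */
	if (queued_hot == 0)
		return plan;

	plan.run = true;
	plan.count_to_hot = queued_hot;

	/* Only evict cold data when the nonrotating disk is under pressure. */
	if (nonrot_ratio >= HIGH_WATER_LEVEL)
		plan.count_to_cold = max_items;

	return plan;
}
```

  The cold extents are handled before the hot ones so that space is
freed on the nonrotating disk before new data is moved onto it.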

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/btrfs/Makefile   |   3 +-
 fs/btrfs/ctree.h|   2 +
 fs/btrfs/hot_relocate.c | 713 
 fs/btrfs/hot_relocate.h |  43 +++
 4 files changed, 760 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/hot_relocate.c
 create mode 100644 fs/btrfs/hot_relocate.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 3932224..94f1ea5 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -8,7 +8,8 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
-  reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o
+  reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
+  hot_relocate.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f7a3170..6c547ca 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1636,6 +1636,8 @@ struct btrfs_fs_info {
struct btrfs_dev_replace dev_replace;
 
atomic_t mutually_exclusive_operation_running;
+
+   void *hot_reloc;
 };
 
 /*
diff --git a/fs/btrfs/hot_relocate.c b/fs/btrfs/hot_relocate.c
new file mode 100644
index 000..ae28b86
--- /dev/null
+++ b/fs/btrfs/hot_relocate.c
@@ -0,0 +1,713 @@
+/*
+ * fs/btrfs/hot_relocate.c
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu wu...@linux.vnet.ibm.com
+ *Ben Chociej bchoc...@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/writeback.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+#include <linux/module.h>
+#include "hot_relocate.h"
+
+/*
+ * Hot relocation strategy:
+ *
+ * The relocation code below operates on the heat map lists to identify
+ * hot or cold data logical file ranges that are candidates for relocation.
+ * The triggering mechanism for relocation is controlled by a global heat
+ * threshold integer value (HOT_RELOC_THRESHOLD). Ranges are
+ * queued for relocation by the periodically executing relocate kthread,
+ * which updates the global heat threshold and responds to space pressure
+ * on the nonrotating disks.
+ *
+ * The heat map lists index logical ranges by heat and provide a constant-time
+ * access path to hot or cold range items. The relocation kthread uses this
+ * path to find hot or cold items to move to/from nonrotating disks. To ensure
+ * that the relocation kthread has a chance to sleep, and to prevent thrashing
+ * between nonrotating disks and HDD, there is a configurable limit to how many
+ * ranges are moved per iteration of the kthread. This limit may be overrun in
+ * the case where space pressure requires that items be aggressively moved from
+ * nonrotating disks back to HDD.
+ *
+ * This needs still more resistance to thrashing and stronger (read: actual)
+ * guarantees that relocation operations won't -ENOSPC.
+ *
+ * The relocation code has introduced one new btrfs block group type:
+ * BTRFS_BLOCK_GROUP_DATA_NONROT.
+ *
+ * When mkfs'ing a volume with the hot data relocation option, initial block
+ * groups are allocated to the proper disks. Runtime block group allocation
+ * only allocates BTRFS_BLOCK_GROUP_DATA BTRFS_BLOCK_GROUP_METADATA and
+ * BTRFS_BLOCK_GROUP_SYSTEM to HDD, and likewise only allocates
+ * BTRFS_BLOCK_GROUP_DATA_NONROT to nonrotating disks.
+ * (assuming, critically, the HOT_MOVE option is set at mount time).
+ */
+
+/*
+ * Returns the ratio of nonrotating disks that are full.
+ * If no nonrotating disk is found, returns THRESH_MAX_VALUE + 1.
+ */
+static int hot_calc_nonrot_ratio(struct hot_reloc *hot_reloc)
+{
+   struct btrfs_space_info *info;
+   struct btrfs_device *device, *next;
+   struct btrfs_fs_info *fs_info = hot_reloc->fs_info;
+   u64 total_bytes = 0, bytes_used = 0;
+
+   /*
+* 

[PATCH v2 10/12] VFS hot tracking, btrfs: add hot tracking support

2013-05-13 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Introduce one new mount option '-o hot_track',
and add its parsing support.
  Its usage looks like:
   mount -o hot_track
   mount -o nouser,hot_track
   mount -o nouser,hot_track,loop
   mount -o hot_track,nouser
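
  To illustrate why all four forms above are equivalent, here is a
minimal userspace sketch that scans a comma-separated option string
for the hot_track token. It is a stand-in for the kernel's token
table, not the btrfs parser itself; parse_hot_opts() and
MOUNT_HOT_TRACK are names chosen for this example, with the bit value
mirroring BTRFS_MOUNT_HOT_TRACK from this patch.

```c
#include <string.h>

#define MOUNT_HOT_TRACK (1u << 23)	/* mirrors BTRFS_MOUNT_HOT_TRACK */

/* Return the flag bits for the hot_track token found in 'options'. */
unsigned int parse_hot_opts(const char *options)
{
	unsigned int flags = 0;
	const char *p = options;

	while (p && *p) {
		/* Length of the current token, up to the next comma or end. */
		size_t len = strcspn(p, ",");

		if (len == strlen("hot_track") &&
		    strncmp(p, "hot_track", len) == 0)
			flags |= MOUNT_HOT_TRACK;
		/* Other tokens (nouser, loop, ...) are ignored in this sketch. */

		p += len;
		if (*p == ',')
			p++;
	}
	return flags;
}
```

  The position of hot_track in the list does not matter, and a token
that merely starts with hot_track is not accepted.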

Reviewed-by:   David Sterba dste...@suse.cz
Signed-off-by: Chandra Seetharaman sekha...@us.ibm.com
Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/super.c | 22 +-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 63c328a..133a6ed 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1927,6 +1927,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY	(1 << 20)
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR	(1 << 22)
+#define BTRFS_MOUNT_HOT_TRACK		(1 << 23)
 
 #define btrfs_clear_opt(o, opt)		((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)		((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index a4807ce..09fb9d2 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -42,6 +42,7 @@
 #include <linux/cleancache.h>
 #include <linux/ratelimit.h>
 #include <linux/btrfs.h>
+#include <linux/hot_tracking.h>
 #include "compat.h"
 #include "delayed-inode.h"
 #include "ctree.h"
@@ -306,6 +307,10 @@ static void btrfs_put_super(struct super_block *sb)
 * last process that kept it busy.  Or segfault in the aforementioned
 * process...  Whom would you report that to?
 */
+
+	/* Hot data tracking */
+	if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
+		hot_track_exit(sb);
 }
 
 enum {
@@ -318,7 +323,7 @@ enum {
Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
Opt_check_integrity, Opt_check_integrity_including_extent_data,
-   Opt_check_integrity_print_mask, Opt_fatal_errors,
+   Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_hot_track,
Opt_err,
 };
 
@@ -359,6 +364,7 @@ static match_table_t tokens = {
 	{Opt_check_integrity_including_extent_data, "check_int_data"},
 	{Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
 	{Opt_fatal_errors, "fatal_errors=%s"},
+	{Opt_hot_track, "hot_track"},
 	{Opt_err, NULL},
 };
 
@@ -624,6 +630,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 				goto out;
 			}
 			break;
+		case Opt_hot_track:
+			btrfs_set_opt(info->mount_opt, HOT_TRACK);
+			break;
 		case Opt_err:
 			printk(KERN_INFO "btrfs: unrecognized mount option "
 			       "'%s'\n", p);
@@ -843,11 +852,20 @@ static int btrfs_fill_super(struct super_block *sb,
 		goto fail_close;
 	}
 
+	if (btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
+		err = hot_track_init(sb);
+		if (err)
+			goto fail_hot;
+	}
+
 	save_mount_options(sb, data);
 	cleancache_init_fs(sb);
 	sb->s_flags |= MS_ACTIVE;
 	return 0;
 
+fail_hot:
+	dput(sb->s_root);
+	sb->s_root = NULL;
 fail_close:
 	close_ctree(fs_info->tree_root);
 	return err;
@@ -943,6 +961,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",skip_balance");
 	if (btrfs_test_opt(root, PANIC_ON_FATAL_ERROR))
 		seq_puts(seq, ",fatal_errors=panic");
+	if (btrfs_test_opt(root, HOT_TRACK))
+		seq_puts(seq, ",hot_track");
 	return 0;
 }
 
-- 
1.7.11.7



[PATCH v2 11/12] VFS hot tracking: add documentation

2013-05-13 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add Documentation for VFS hot tracking feature

Signed-off-by: Chandra Seetharaman sekha...@us.ibm.com
Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 Documentation/filesystems/00-INDEX |   2 +
 Documentation/filesystems/hot_tracking.txt | 256 +
 2 files changed, 258 insertions(+)
 create mode 100644 Documentation/filesystems/hot_tracking.txt

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 8042050..2454472 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -122,3 +122,5 @@ xfs.txt
- info and mount options for the XFS filesystem.
 xip.txt
- info on execute-in-place for file mappings.
+hot_tracking.txt
+   - info on hot data tracking in VFS layer
diff --git a/Documentation/filesystems/hot_tracking.txt b/Documentation/filesystems/hot_tracking.txt
new file mode 100644
index 000..9ea3fa8
--- /dev/null
+++ b/Documentation/filesystems/hot_tracking.txt
@@ -0,0 +1,256 @@
+Hot Data Tracking
+
+April, 2013Zhi Yong Wu wu...@linux.vnet.ibm.com
+
+CONTENTS
+
+1. Introduction
+2. Motivation
+3. The Design
+4. How to Calc Frequency of Reads/Writes & Temperature
+5. Git Development Tree
+6. Usage Example
+
+
+1. Introduction
+
+  The feature adds the  support for tracking data temperature
+information in VFS layer.  Essentially, this means maintaining some key
+stats(like number of reads/writes, last read/write time, frequency of
+reads/writes), then distilling those numbers down to a single
+temperature value that reflects what data is hot, and filesystem
+can use this information to move hot data from slow devices to fast
+devices.
+
+  The long-term goal of the feature is to allow some FSs,
+e.g. Btrfs to intelligently utilize SSDs in a heterogenous volume.
+Incidentally, this project has been motivated by
+the Project Ideas page on the Btrfs wiki.
+
+
+2. Motivation
+
+  This is essentially the traditional cache argument: SSD is fast and
+expensive; HDD is cheap but slow. ZFS, for example, can already take
+advantage of SSD caching. Btrfs should also be able to take advantage of
+hybrid storage without many broad, sweeping changes to existing code.
+
+  The overall goal of enabling hot data relocation to SSD has been
+motivated by the Project Ideas page on the Btrfs wiki at
+https://btrfs.wiki.kernel.org/index.php/Project_ideas.
+It will divide into two parts. VFS provide hot data tracking function
+while specific FS will provide hot data relocation function.
+So as the first step of this goal, this feature provides the first part
+of the functionality.
+
+
+3. The Design
+
+These include the following parts:
+
+* Hooks in existing vfs functions to track data access frequency
+
+* New rb-trees for tracking access frequency of inodes and sub-file
+ranges
+The relationship between super_block and rb-trees is as below:
+hot_info.hot_inode_tree
+Each FS instance can find hot tracking info s_hot_root.
+hot_info has hot_inode_tree and it has inode's hot information,
+and it has hot_range_tree, which has range's hot information.
+
+* A list of hot inodes and hot ranges by its temperature
+
+* A debugfs interface for dumping data from the rb-trees
+
+* A work queue for updating inode heat info
+
+* Mount options for enabling temperature tracking(-o hot_track,
+default mean disabled)
+* An ioctl to retrieve the frequency information collected for a certain
+file
+* Ioctls to enable/disable frequency tracking per inode.
+
+Let us see their relationship as below:
+
+* hot_info.hot_inode_tree indexes hot_inode_items, one per inode
+
+* hot_inode_item contains access frequency data for that inode
+
+* hot_inode_item holds a heat list node to link the access frequency
+data for that inode
+
+* hot_inode_item.hot_range_tree indexes hot_range_items for that inode
+
+* hot_range_item contains access frequency data for that range
+
+* hot_range_item holds a heat list node to index the access
+frequency data for that range
+
+* hot_info.heat_inode_map indexes per-inode heat list nodes
+
+* hot_info.heat_range_map indexes per-range heat list nodes
+
+  How about some ascii art? :) Just looking at the hot inode item case
+(the range item case is the same pattern, though), we have:
+
+  super_block
+  |
+  V
+   hot_info
+  |
++-++
+| ||
+| ||
+V VV
+heat_inode_map   hot_inode_tree heat_range_map
+|   

[PATCH v2 12/12] VFS hot tracking: add fs hot type support

2013-05-13 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Introduce the ability for a specific FS to register
its own hot tracking functions.
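
  The fallback idea can be sketched in userspace C as below: a
per-filesystem descriptor carries optional hooks, and anything left
unset is replaced by a default, in the spirit of the hot_tree_init()
hunk in this patch. The struct layout, hot_type_set_defaults(),
range_start(), and default_temp_calc() are simplified assumptions for
illustration and do not match the kernel definitions; RANGE_BITS is
assumed to be 20 here.

```c
#include <stddef.h>
#include <stdint.h>

#define RANGE_BITS 20	/* assumed default: ranges of 2^20 bytes */

/* Per-filesystem hooks; NULL or zero means "use the default". */
struct hot_type {
	uint32_t (*temp_calc)(uint64_t nr_reads, uint64_t nr_writes);
	unsigned int range_bits;
};

/* Stand-in default, NOT the kernel's temperature formula. */
static uint32_t default_temp_calc(uint64_t nr_reads, uint64_t nr_writes)
{
	return (uint32_t)(nr_reads + nr_writes);
}

/* Fill in whatever the filesystem did not register. */
void hot_type_set_defaults(struct hot_type *t)
{
	if (!t->temp_calc)
		t->temp_calc = default_temp_calc;
	if (t->range_bits == 0)
		t->range_bits = RANGE_BITS;
}

/* Align a byte offset down to the start of its tracked range. */
uint64_t range_start(const struct hot_type *t, uint64_t offset)
{
	return (offset >> t->range_bits) << t->range_bits;
}
```

  A filesystem that sets range_bits to 12, for example, would track
4096-byte ranges while still using the default temperature hook.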

Signed-off-by: Chandra Seetharaman sekha...@us.ibm.com
Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c| 32 ++--
 fs/hot_tracking.h| 19 +++
 fs/ioctl.c   |  2 +-
 include/linux/fs.h   |  1 +
 include/linux/hot_tracking.h | 22 +-
 5 files changed, 64 insertions(+), 12 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 088e9aa..4eee33c 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -51,7 +51,7 @@ static void hot_range_item_init(struct hot_range_item *hr,
 			struct hot_inode_item *he, loff_t start)
 {
 	hr->start = start;
-	hr->len = hot_shift(1, RANGE_BITS, true);
+	hr->len = hot_shift(1, he->hot_root->hot_type->range_bits, true);
 	hr->hot_inode = he;
 	hr->storage_type = -1;
 	hot_comm_item_init(&hr->hot_range, TYPE_RANGE);
@@ -259,10 +259,11 @@ struct hot_range_item
 {
 	struct rb_node **p;
 	struct rb_node *parent = NULL;
+	struct hot_info *root = he->hot_root;
 	struct hot_comm_item *ci;
 	struct hot_range_item *hr, *hr_new = NULL;
 
-	start = hot_shift(start, RANGE_BITS, true);
+	start = hot_shift(start, root->hot_type->range_bits, true);
 
/* walk tree to find insertion point */
 redo:
@@ -350,13 +351,13 @@ static void hot_freq_update(struct hot_info *root,
 
 	if (write) {
 		freq_data->nr_writes += 1;
-		hot_freq_calc(freq_data->last_write_time,
+		HOT_FREQ_CALC(root, freq_data->last_write_time,
 				cur_time,
 				&freq_data->avg_delta_writes);
 		freq_data->last_write_time = cur_time;
 	} else {
 		freq_data->nr_reads += 1;
-		hot_freq_calc(freq_data->last_read_time,
+		HOT_FREQ_CALC(root, freq_data->last_read_time,
 				cur_time,
 				&freq_data->avg_delta_reads);
 		freq_data->last_read_time = cur_time;
@@ -381,7 +382,7 @@ static void hot_freq_update(struct hot_info *root,
  * the *_COEFF_POWER values and combined to a single temperature
  * value.
  */
-u32 hot_temp_calc(struct hot_comm_item *ci)
+static u32 hot_temp_calc(struct hot_comm_item *ci)
 {
u32 result = 0;
 	struct hot_freq_data *freq_data = &ci->hot_freq_data;
@@ -501,7 +502,7 @@ static void hot_comm_item_link_cb(struct rcu_head *head)
 static int hot_map_update(struct hot_info *root,
struct hot_comm_item *ci)
 {
-   u32 temp = hot_temp_calc(ci);
+   u32 temp = HOT_TEMP_CALC(root, ci);
u8 cur_temp, prev_temp;
int flag = false;
 
@@ -564,7 +565,7 @@ static void hot_range_update(struct hot_inode_item *he,
hot_map_update(root, ci)) {
continue;
}
-   obsolete = hot_is_obsolete(ci);
+   obsolete = HOT_IS_OBSOLETE(root, ci);
if (obsolete)
hot_comm_item_unlink(root, ci);
}
@@ -1167,10 +1168,10 @@ void hot_update_freqs(struct inode *inode, loff_t start,
 	 * Align ranges on range size boundary
 	 * to prevent proliferation of range structs
 	 */
-	range_size  = hot_shift(1, RANGE_BITS, true);
+	range_size  = hot_shift(1, root->hot_type->range_bits, true);
 	end = hot_shift((start + len + range_size - 1),
-			RANGE_BITS, false);
-	cur = hot_shift(start, RANGE_BITS, false);
+			root->hot_type->range_bits, false);
+	cur = hot_shift(start, root->hot_type->range_bits, false);
 	for (; cur < end; cur++) {
 		hr = hot_range_item_lookup(he, cur, 1);
 		if (IS_ERR(hr)) {
@@ -1211,6 +1212,17 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 			INIT_LIST_HEAD(&root->hot_map[j][i]);
 	}
 
+	/* Get hot type for specific FS */
+	root->hot_type = &sb->s_type->hot_type;
+	if (!HOT_FREQ_FN_EXIST(root))
+		SET_HOT_FREQ_FN(root, hot_freq_calc);
+	if (!HOT_TEMP_FN_EXIST(root))
+		SET_HOT_TEMP_FN(root, hot_temp_calc);
+	if (!HOT_OBSOLETE_FN_EXIST(root))
+		SET_HOT_OBSOLETE_FN(root, hot_is_obsolete);
+	if (root->hot_type->range_bits == 0)
+		root->hot_type->range_bits = RANGE_BITS;
+
 	root->update_wq = alloc_workqueue(
 		"hot_update_wq", WQ_NON_REENTRANT, 0);
 	if (!root->update_wq) {
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index d1ab48b..4756fc3 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -40,6 +40,25 @@
 #define AVW_DIVIDER_POWER 40 /* AVW - average delta between recent 

[PATCH v2 00/12] VFS hot tracking

2013-05-13 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  The patchset introduces a hot tracking function in the
VFS layer, which keeps track of real disk I/O in memory.
With it, you can easily learn more details about disk I/O and
then detect where the disk I/O hot spots are. A specific FS
can also make use of it for accurate defragmentation, hot
relocation support, etc.

  After V1 was sent out, Chandra Seetharaman reviewed it and
made a lot of comments; many thanks to him. Now it's time to
send out V2 for external review. Any comments or ideas are
appreciated, thanks.

NOTE:

  The patchset can be obtained via my kernel dev git on github:
git://github.com/wuzhy/kernel.git hot_tracking
  If you're interested, you can also review them via
https://github.com/wuzhy/kernel/commits/hot_tracking

  For usage, further information, and a performance report,
please check hot_tracking.txt in Documentation and the following
links:
  1.) http://lwn.net/Articles/525651/
  2.) https://lkml.org/lkml/2012/12/20/199

Changelog from v1:
 - Refactored to be under RCU [Chandra Seetharaman]
 - Merged some code changes [Chandra Seetharaman]
 - Fixed some issues [Chandra Seetharaman]

v1:
 - Solved 64 bits inode number issue. [David Sterba]
 - Embed struct hot_type in struct file_system_type [Darrick J. Wong]
 - Cleaned up some issues [David Sterba]
 - Use a static hot debugfs root [Greg KH]

rfcv4:
 - Introduce hot func registering framework [Zhiyong]
 - Remove global variable for hot tracking [Zhiyong]
 - Add btrfs hot tracking support [Zhiyong]

rfcv3:
 1.) Rewrote debugfs support based on seq_file operations. [Dave Chinner]
 2.) Refactored workqueue support. [Dave Chinner]
 3.) Made some macros tunable [Zhiyong, Liu Zheng]
   TIME_TO_KICK, and HEAT_UPDATE_DELAY
 4.) Cleaned up a lot of other issues [Dave Chinner]

rfcv2:
 1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner]
 2.) Added memory shrinker [Dave Chinner]
 3.) Converted to one workqueue to update map info periodically [Dave Chinner]
 4.) Cleaned up a lot of other issues [Dave Chinner]

rfcv1:
 1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
 2.) The first three patches can probably just be flattened into one.
   [Marco Stornelli, Dave Chinner]

Zhi Yong Wu (12):
  VFS hot tracking: introduce some data structures
  VFS hot tracking: add i/o freq tracking hooks
  VFS hot tracking: add one workqueue to update hot map
  VFS hot tracking: register one shrinker
  VFS hot tracking, rcu: introduce one rcu macro for list
  VFS hot tracking, seq_file: introduce one set of rcu seq_list
interfaces
  VFS hot tracking: add debugfs support
  VFS hot tracking: add one ioctl interface
  VFS hot tracking, procfs: add two proc interfaces
  VFS hot tracking, btrfs: add hot tracking support
  VFS hot tracking: add documentation
  VFS hot tracking: add fs hot type support

 Documentation/filesystems/00-INDEX |2 +
 Documentation/filesystems/hot_tracking.txt |  256 ++
 fs/Makefile|2 +-
 fs/btrfs/ctree.h   |1 +
 fs/btrfs/super.c   |   22 +-
 fs/compat_ioctl.c  |5 +
 fs/dcache.c|2 +
 fs/direct-io.c |5 +
 fs/hot_tracking.c  | 1320 
 fs/hot_tracking.h  |   67 ++
 fs/ioctl.c |   70 ++
 fs/namei.c |2 +
 fs/seq_file.c  |   37 +
 include/linux/fs.h |5 +
 include/linux/hot_tracking.h   |  175 
 include/linux/rculist.h|5 +
 include/linux/seq_file.h   |7 +
 kernel/sysctl.c|   14 +
 mm/filemap.c   |6 +
 mm/page-writeback.c|   12 +
 mm/readahead.c |6 +
 21 files changed, 2019 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/filesystems/hot_tracking.txt
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h

-- 
1.7.11.7



[PATCH v2 01/12] VFS hot tracking: introduce some data structures

2013-05-13 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Define one root structure, hot_info, which is hooked
up in super_block and holds the radix tree root, the hash
list root, and other information.
  Add the hot_inode_tree struct to keep track of frequently
accessed files, keyed by {inode, offset}. Trees contain
hot_inode_items representing those files and ranges.
  Having these trees means that the VFS can quickly determine
the temperature of some data by doing some calculations on the
hot_freq_data struct that hangs off of the tree item.
  Define two items, hot_inode_item and hot_range_item. The
former represents one tracked file and keeps track of its
access frequency and the tree of ranges in this file, while
the latter represents a file range of one inode.
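
  The relationship between the two items can be pictured with the
simplified userspace sketch below. Field names loosely follow the
ones visible in this series (nr_reads, nr_writes, last_read_time,
last_write_time, hot_inode), but the layouts are assumptions made for
illustration: times are plain integers, and the per-inode tree of
ranges, locking, and reference counting are all left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Access statistics that hang off every tracked item. */
struct hot_freq_data {
	uint64_t nr_reads;
	uint64_t nr_writes;
	uint64_t last_read_time;	/* plain integer timestamp in this sketch */
	uint64_t last_write_time;
};

/* One tracked file. */
struct hot_inode_item {
	uint64_t ino;
	struct hot_freq_data freq;
	/* the kernel also keeps the tree of ranges here; omitted */
};

/* One file range belonging to a tracked file. */
struct hot_range_item {
	uint64_t start;
	uint64_t len;
	struct hot_freq_data freq;
	struct hot_inode_item *hot_inode;	/* owning file */
};

/* Record one access at time 'now' in the given statistics. */
void hot_freq_record(struct hot_freq_data *f, bool write, uint64_t now)
{
	if (write) {
		f->nr_writes += 1;
		f->last_write_time = now;
	} else {
		f->nr_reads += 1;
		f->last_read_time = now;
	}
}
```

  An access to a range would typically update both the range's
statistics and those of the inode it points back to.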

Signed-off-by: Chandra Seetharaman sekha...@us.ibm.com
Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/Makefile  |   2 +-
 fs/dcache.c  |   2 +
 fs/hot_tracking.c| 209 +++
 fs/hot_tracking.h|  17 
 include/linux/fs.h   |   4 +
 include/linux/hot_tracking.h | 103 +
 6 files changed, 336 insertions(+), 1 deletion(-)
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h

diff --git a/fs/Makefile b/fs/Makefile
index 199c880..d0fc704 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -13,7 +13,7 @@ obj-y :=  open.o read_write.o file_table.o super.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o splice.o sync.o utimes.o \
-   stack.o fs_struct.o statfs.o
+   stack.o fs_struct.o statfs.o hot_tracking.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=   buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/dcache.c b/fs/dcache.c
index f09b908..9d7c2af 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -37,6 +37,7 @@
 #include <linux/rculist_bl.h>
 #include <linux/prefetch.h>
 #include <linux/ratelimit.h>
+#include <linux/hot_tracking.h>
 #include "internal.h"
 #include "mount.h"
 
@@ -3094,4 +3095,5 @@ void __init vfs_caches_init(unsigned long mempages)
mnt_init();
bdev_cache_init();
chrdev_init();
+   hot_cache_init();
 }
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
new file mode 100644
index 000..6bf4229
--- /dev/null
+++ b/fs/hot_tracking.c
@@ -0,0 +1,209 @@
+/*
+ * fs/hot_tracking.c
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu wu...@linux.vnet.ibm.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#include <linux/list.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/types.h>
+#include <linux/list_sort.h>
+#include <linux/limits.h>
+#include "hot_tracking.h"
+
+/* kmem_cache pointers for slab caches */
+static struct kmem_cache *hot_inode_item_cachep __read_mostly;
+static struct kmem_cache *hot_range_item_cachep __read_mostly;
+
+static void hot_inode_item_free(struct kref *kref);
+
+static void hot_comm_item_free_cb(struct rcu_head *head)
+{
+   struct hot_comm_item *ci = container_of(head,
+   struct hot_comm_item, c_rcu);
+
+   if (ci->hot_freq_data.flags == TYPE_RANGE) {
+   struct hot_range_item *hr = container_of(ci,
+   struct hot_range_item, hot_range);
+   kmem_cache_free(hot_range_item_cachep, hr);
+   } else {
+   struct hot_inode_item *he = container_of(ci,
+   struct hot_inode_item, hot_inode);
+   kmem_cache_free(hot_inode_item_cachep, he);
+   }
+}
+
+static void hot_range_item_free(struct kref *kref)
+{
+   struct hot_comm_item *ci = container_of(kref,
+   struct hot_comm_item, refs);
+   struct hot_range_item *hr = container_of(ci,
+   struct hot_range_item, hot_range);
+
+   hr->hot_inode = NULL;
+
+   call_rcu(&hr->hot_range.c_rcu, hot_comm_item_free_cb);
+}
+
+/*
+ * Drops the reference out on hot_comm_item by one
+ * and free the structure if the reference count hits zero
+ */
+void hot_comm_item_put(struct hot_comm_item *ci)
+{
+   kref_put(&ci->refs, (ci->hot_freq_data.flags == TYPE_RANGE) ?
+   hot_range_item_free : hot_inode_item_free);
+}
+EXPORT_SYMBOL_GPL(hot_comm_item_put);
+
+static void hot_comm_item_unlink(struct hot_info *root,
+   struct hot_comm_item *ci)
+{
+   if (!test_and_set_bit(HOT_DELETING, &ci->delete_flag)) {
+   hot_comm_item_put(ci);
+   }
+}
+
+/*
+ * Frees the entire hot_range_tree.
+ */
+static void 

[PATCH v2 02/12] VFS hot tracking: add i/o freq tracking hooks

2013-05-13 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add i/o freq tracking hooks in the real read/write code paths,
which include read_pages(), do_writepages(), do_generic_file_read(),
and __blockdev_direct_IO().
  Currently the whole FS has one RB tree to track i/o freqs for
all inodes which had real disk i/o, while every inode has its
own RB tree to track i/o freqs for all of its extents.
  When real disk i/o for an inode is done, its own i/o freq is
created or updated in the per-FS RB tree, and the i/o freqs for
all of its extents are likewise created or updated in the per-inode RB tree.
  Each of the two structures hot_inode_item and hot_range_item
contains a hot_freq_data struct with its access frequency metrics
(number of {reads, writes}, last {read, write} time, frequency of
{reads, writes}).
  Also, each hot_inode_item contains one hot_range_tree
struct which is keyed by {inode, offset, length}
and used to keep track of all the ranges in this file.

Signed-off-by: Chandra Seetharaman sekha...@us.ibm.com
Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/direct-io.c   |   5 +
 fs/hot_tracking.c| 284 +++
 fs/hot_tracking.h|   4 +
 fs/namei.c   |   2 +
 include/linux/hot_tracking.h |  17 +++
 mm/filemap.c |   6 +
 mm/page-writeback.c  |  12 ++
 mm/readahead.c   |   6 +
 8 files changed, 336 insertions(+)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 7ab90f5..6cb0598 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -38,6 +38,7 @@
 #include <linux/atomic.h>
 #include <linux/prefetch.h>
 #include <linux/aio.h>
+#include "hot_tracking.h"
 
 /*
  * How many user pages to map in one call to get_user_pages().  This determines
@@ -1295,6 +1296,10 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
prefetch(bdev->bd_queue);
prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);
 
+   /* Hot data tracking */
+   hot_update_freqs(inode, offset, iov_length(iov, nr_segs),
+   rw & WRITE);
+
return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
 nr_segs, get_block, end_io,
 submit_io, flags);
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 6bf4229..cc899f4 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -26,6 +26,26 @@ static struct kmem_cache *hot_range_item_cachep __read_mostly;
 
 static void hot_inode_item_free(struct kref *kref);
 
+static void hot_comm_item_init(struct hot_comm_item *ci, int type)
+{
+   kref_init(&ci->refs);
+   clear_bit(HOT_DELETING, &ci->delete_flag);
+   memset(&ci->hot_freq_data, 0, sizeof(struct hot_freq_data));
+   ci->hot_freq_data.avg_delta_reads = (u64) -1;
+   ci->hot_freq_data.avg_delta_writes = (u64) -1;
+   ci->hot_freq_data.flags = type;
+}
+
+static void hot_range_item_init(struct hot_range_item *hr,
+   struct hot_inode_item *he, loff_t start)
+{
+   hr->start = start;
+   hr->len = hot_shift(1, RANGE_BITS, true);
+   hr->hot_inode = he;
+   hr->storage_type = -1;
+   hot_comm_item_init(&hr->hot_range, TYPE_RANGE);
+}
+
 static void hot_comm_item_free_cb(struct rcu_head *head)
 {
struct hot_comm_item *ci = container_of(head,
@@ -65,10 +85,27 @@ void hot_comm_item_put(struct hot_comm_item *ci)
 }
 EXPORT_SYMBOL_GPL(hot_comm_item_put);
 
+/*
+ * root->t_lock or he->i_lock is acquired in this function
+ */
 static void hot_comm_item_unlink(struct hot_info *root,
struct hot_comm_item *ci)
 {
if (!test_and_set_bit(HOT_DELETING, &ci->delete_flag)) {
+   if (ci->hot_freq_data.flags == TYPE_RANGE) {
+   struct hot_range_item *hr = container_of(ci,
+   struct hot_range_item, hot_range);
+   struct hot_inode_item *he = hr->hot_inode;
+
+   spin_lock(&he->i_lock);
+   rb_erase(&ci->rb_node, &he->hot_range_tree);
+   spin_unlock(&he->i_lock);
+   } else {
+   spin_lock(&root->t_lock);
+   rb_erase(&ci->rb_node, &root->hot_inode_tree);
+   spin_unlock(&root->t_lock);
+   }
+
hot_comm_item_put(ci);
}
 }
@@ -94,6 +131,15 @@ static void hot_range_tree_free(struct hot_inode_item *he)
 
 }
 
+static void hot_inode_item_init(struct hot_inode_item *he,
+   struct hot_info *hot_root, u64 ino)
+{
+   he->i_ino = ino;
+   he->hot_root = hot_root;
+   spin_lock_init(&he->i_lock);
+   hot_comm_item_init(&he->hot_inode, TYPE_INODE);
+}
+
 static void hot_inode_item_free(struct kref *kref)
 {
struct hot_comm_item *ci = container_of(kref,
@@ -107,6 +153,195 @@ static void hot_inode_item_free(struct kref *kref)
call_rcu(he-hot_inode.c_rcu, 
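
The commit message of this patch says that every inode tracks i/o
freqs for its own extents, and hot_range_item_init() above gives each
range a fixed length of 1 << RANGE_BITS. The following is a minimal
userspace sketch of how one I/O request could be mapped onto such
ranges. It is not code from this series: range_start() and
ranges_touched() are hypothetical helper names, and the value 20 for
RANGE_BITS is taken from fs/hot_tracking.h as shown later in the series.

```c
#include <stddef.h>
#include <stdint.h>

#define RANGE_BITS 20 /* 1 MiB ranges; value taken from fs/hot_tracking.h */

/* Start offset of the fixed-size range that contains byte offset off. */
static uint64_t range_start(uint64_t off)
{
    return (off >> RANGE_BITS) << RANGE_BITS;
}

/*
 * Hypothetical helper: store the start offsets of all ranges touched by
 * the I/O [off, off + len) into starts (at most max entries) and return
 * the total number of ranges touched. Overflow of off + len is ignored.
 */
static size_t ranges_touched(uint64_t off, uint64_t len,
                             uint64_t *starts, size_t max)
{
    size_t n = 0;
    uint64_t cur, last;

    if (len == 0)
        return 0;

    cur = range_start(off);
    last = range_start(off + len - 1);
    for (;;) {
        if (n < max)
            starts[n] = cur;
        n++;
        if (cur == last)
            break;
        cur += (uint64_t)1 << RANGE_BITS;
    }
    return n;
}
```

For example, a 2-byte I/O that starts on the last byte of the first
range touches two ranges, so under this sketch two range items would
have their frequency data updated.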

[PATCH v2 03/12] VFS hot tracking: add one workqueue to update hot map

2013-05-13 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add a workqueue per superblock and a delayed_work
to run periodic work that updates map info on each superblock.
  Two arrays of map lists are defined, one for hot inode
items and the other for hot extent items.
  Each hot item in the RB-tree is first distilled
into one temperature in the range [0, 255]. If the item is old,
it is either left unlinked or aged out; otherwise it is
linked to its corresponding array of map lists, which uses
the temperature as its index.

Signed-off-by: Chandra Seetharaman sekha...@us.ibm.com
Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c| 298 ++-
 fs/hot_tracking.h|  25 
 include/linux/hot_tracking.h |   4 +
 3 files changed, 326 insertions(+), 1 deletion(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index cc899f4..2742d9e 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -29,7 +29,9 @@ static void hot_inode_item_free(struct kref *kref);
 static void hot_comm_item_init(struct hot_comm_item *ci, int type)
 {
kref_init(&ci->refs);
+   clear_bit(HOT_IN_LIST, &ci->delete_flag);
clear_bit(HOT_DELETING, &ci->delete_flag);
+   INIT_LIST_HEAD(&ci->track_list);
memset(&ci->hot_freq_data, 0, sizeof(struct hot_freq_data));
ci->hot_freq_data.avg_delta_reads = (u64) -1;
ci->hot_freq_data.avg_delta_writes = (u64) -1;
@@ -86,12 +88,21 @@ void hot_comm_item_put(struct hot_comm_item *ci)
 EXPORT_SYMBOL_GPL(hot_comm_item_put);
 
 /*
- * root->t_lock or he->i_lock is acquired in this function
+ * root->t_lock or he->i_lock, and root->m_lock
+ * are acquired in this function
  */
 static void hot_comm_item_unlink(struct hot_info *root,
struct hot_comm_item *ci)
 {
if (!test_and_set_bit(HOT_DELETING, &ci->delete_flag)) {
+   if (test_and_clear_bit(HOT_IN_LIST, &ci->delete_flag)) {
+   spin_lock(&root->m_lock);
+   list_del_rcu(&ci->track_list);
+   spin_unlock(&root->m_lock);
+
+   hot_comm_item_put(ci);
+   }
+
if (ci->hot_freq_data.flags == TYPE_RANGE) {
struct hot_range_item *hr = container_of(ci,
struct hot_range_item, hot_range);
@@ -343,6 +354,274 @@ static void hot_freq_update(struct hot_info *root,
 }
 
 /*
+ * hot_temp_calc() is responsible for distilling the six heat
+ * criteria down into a single temperature value for the data,
+ * which is an integer between 0 and HEAT_MAX_VALUE.
+ *
+ * With the six values, we first do some very rudimentary
+ * normalizations to each metric such that they affect the
+ * final temperature calculation exactly the right way. It's
+ * important to note that we still weren't really sure that
+ * these six adjustments were exactly right.
+ * They could definitely use more tweaking and adjustment,
+ * especially in terms of the memory footprint they consume.
+ *
+ * Next, we take the adjusted values and shift them down to
+ * a manageable size, whereafter they are weighted using
+ * the *_COEFF_POWER values and combined to a single temperature
+ * value.
+ */
+static u32 hot_temp_calc(struct hot_comm_item *ci)
+{
+   u32 result = 0;
+   struct hot_freq_data *freq_data = &ci->hot_freq_data;
+
+   struct timespec ckt = current_kernel_time();
+   u64 cur_time = timespec_to_ns(&ckt);
+   u32 nrr_heat, nrw_heat;
+   u64 ltr_heat, ltw_heat, avr_heat, avw_heat;
+
+   nrr_heat = (u32)hot_shift((u64)freq_data->nr_reads,
+   NRR_MULTIPLIER_POWER, true);
+   nrw_heat = (u32)hot_shift((u64)freq_data->nr_writes,
+   NRW_MULTIPLIER_POWER, true);
+
+   ltr_heat =
+   hot_shift((cur_time - timespec_to_ns(&freq_data->last_read_time)),
+   LTR_DIVIDER_POWER, false);
+   ltw_heat =
+   hot_shift((cur_time - timespec_to_ns(&freq_data->last_write_time)),
+   LTW_DIVIDER_POWER, false);
+
+   avr_heat =
+   hot_shift((((u64) -1) - freq_data->avg_delta_reads),
+   AVR_DIVIDER_POWER, false);
+   avw_heat =
+   hot_shift((((u64) -1) - freq_data->avg_delta_writes),
+   AVW_DIVIDER_POWER, false);
+
+   /* ltr_heat is now guaranteed to be u32 safe */
+   if (ltr_heat >= hot_shift((u64) 1, 32, true))
+   ltr_heat = 0;
+   else
+   ltr_heat = hot_shift((u64) 1, 32, true) - ltr_heat;
+
+   /* ltw_heat is now guaranteed to be u32 safe */
+   if (ltw_heat >= hot_shift((u64) 1, 32, true))
+   ltw_heat = 0;
+   else
+   ltw_heat = hot_shift((u64) 1, 32, true) - ltw_heat;
+
+   /* avr_heat is now guaranteed to be u32 safe */
+   if (avr_heat >= hot_shift((u64) 1, 32, true))
+   avr_heat = (u32) -1;
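
The commit message of this patch says that each item is distilled
into one temperature in the range [0, 255] which then indexes the map
list. The following is a minimal userspace sketch of that last step,
reducing a 32-bit temperature to a map index, modeled on the
hot_shift(..., 32 - MAP_BITS, false) expression used by the debugfs
patch of this series. It is an illustration only: the left/right
meaning of the third hot_shift() argument is inferred from how the
patches call it, MAP_BITS = 8 is an assumption that matches the
[0, 255] range, and temp_to_map_index() is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAP_BITS 8 /* assumption: 2^8 = 256 map lists, indices 0..255 */

/* Shift left when left is true, otherwise shift right (inferred semantics). */
static uint64_t hot_shift(uint64_t counter, uint32_t bits, bool left)
{
    return left ? counter << bits : counter >> bits;
}

/* Hypothetical helper: keep only the top MAP_BITS bits of the temperature. */
static uint8_t temp_to_map_index(uint32_t temp)
{
    return (uint8_t)hot_shift((uint64_t)temp, 32 - MAP_BITS, false);
}
```

Under these assumptions, all temperatures that share the same top
eight bits land in the same map list, so the list index is a coarse
bucket of the full 32-bit value.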

[PATCH v2 05/12] VFS hot tracking, rcu: introduce one rcu macro for list

2013-05-13 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  This rcu macro for list will be used in seq_list
rcu interfaces.

Signed-off-by: Chandra Seetharaman sekha...@us.ibm.com
Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 include/linux/rculist.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 8089e35..a3fa055 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -218,6 +218,11 @@ static inline void list_splice_init_rcu(struct list_head *list,
at->prev = last;
 }
 
+#define __list_for_each_rcu(pos, head) \
+   for (pos = rcu_dereference(list_next_rcu(head));\
+pos != head;   \
+pos = rcu_dereference(list_next_rcu(pos)))
+
 /**
  * list_entry_rcu - get the struct for this entry
  * @ptr:the struct list_head pointer.
-- 
1.7.11.7



[PATCH v2 06/12] VFS hot tracking, seq_file: introduce one set of rcu seq_list interfaces

2013-05-13 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  This patch introduces a set of RCU interfaces for seq_list.

Signed-off-by: Chandra Seetharaman sekha...@us.ibm.com
Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/seq_file.c| 37 +
 include/linux/seq_file.h |  7 +++
 2 files changed, 44 insertions(+)

diff --git a/fs/seq_file.c b/fs/seq_file.c
index 774c1eb..301caa7 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -795,6 +795,43 @@ struct list_head *seq_list_next(void *v, struct list_head *head, loff_t *ppos)
 }
 EXPORT_SYMBOL(seq_list_next);
 
+struct list_head *seq_list_start_rcu(struct list_head *head, loff_t pos)
+{
+   struct list_head *lh;
+
+   __list_for_each_rcu(lh, head)
+   if (pos-- == 0)
+   return lh;
+
+   return NULL;
+}
+EXPORT_SYMBOL(seq_list_start_rcu);
+
+struct list_head *seq_list_start_head_rcu(struct list_head *head, loff_t pos)
+{
+   if (!pos)
+   return head;
+
+   return seq_list_start_rcu(head, pos - 1);
+}
+EXPORT_SYMBOL(seq_list_start_head_rcu);
+
+struct list_head *seq_list_next_rcu(void *v, struct list_head *head,
+   loff_t *ppos)
+{
+   struct list_head *lh;
+
+   ++*ppos;
+   rcu_read_lock();
+   lh = rcu_dereference(((struct list_head *)v)->next);
+   if (lh == head)
+   lh = NULL;
+   rcu_read_unlock();
+
+   return lh;
+}
+EXPORT_SYMBOL(seq_list_next_rcu);
+
 /**
  * seq_hlist_start - start an iteration of a hlist
  * @head: the head of the hlist
diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
index 2da29ac..7e391c9 100644
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -155,6 +155,13 @@ extern struct list_head *seq_list_start_head(struct list_head *head,
 extern struct list_head *seq_list_next(void *v, struct list_head *head,
loff_t *ppos);
 
+extern struct list_head *seq_list_start_rcu(struct list_head *head,
+   loff_t pos);
+extern struct list_head *seq_list_start_head_rcu(struct list_head *head,
+   loff_t pos);
+extern struct list_head *seq_list_next_rcu(void *v, struct list_head *head,
+   loff_t *ppos);
+
 /*
  * Helpers for iteration over hlist_head-s in seq_files
  */
-- 
1.7.11.7
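
To make the positional semantics of the three helpers in this patch
easier to follow, here is a minimal userspace sketch over a plain
circular doubly linked list. It deliberately leaves out RCU, so it
only mirrors the traversal logic, not the locking. All demo_* names
are hypothetical and are not part of the kernel API.

```c
#include <stddef.h>

/* Userspace stand-in for a circular doubly linked list node. */
struct demo_list {
    struct demo_list *next, *prev;
};

static void demo_list_init(struct demo_list *head)
{
    head->next = head;
    head->prev = head;
}

static void demo_list_add_tail(struct demo_list *n, struct demo_list *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Return the pos-th entry (0-based), or NULL when pos is past the end. */
static struct demo_list *demo_seq_list_start(struct demo_list *head,
                                             long long pos)
{
    struct demo_list *lh;

    for (lh = head->next; lh != head; lh = lh->next)
        if (pos-- == 0)
            return lh;

    return NULL;
}

/* Position 0 is the head itself; entries start at position 1. */
static struct demo_list *demo_seq_list_start_head(struct demo_list *head,
                                                  long long pos)
{
    if (!pos)
        return head;

    return demo_seq_list_start(head, pos - 1);
}

/* Advance *ppos and return the next entry, or NULL at the end of the list. */
static struct demo_list *demo_seq_list_next(struct demo_list *v,
                                            struct demo_list *head,
                                            long long *ppos)
{
    ++*ppos;
    return v->next == head ? NULL : v->next;
}
```

The start_head variant is what a seq_file would use when it wants to
print a header line for position 0 before the first real entry.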



[PATCH v2 07/12] VFS hot tracking: add debugfs support

2013-05-13 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add a directory 'dev_name' in /sys/kernel/debug/hot_track/
for each volume that contains four files which are 'inode_stat',
'extent_stat', 'inode_spot', and 'extent_spot'.

Signed-off-by: Chandra Seetharaman sekha...@us.ibm.com
Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c| 455 +++
 fs/hot_tracking.h|   5 +
 include/linux/hot_tracking.h |   1 +
 3 files changed, 461 insertions(+)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index af4498c..cea3675 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -17,9 +17,12 @@
 #include <linux/fs.h>
 #include <linux/types.h>
 #include <linux/list_sort.h>
+#include <linux/debugfs.h>
 #include <linux/limits.h>
 #include "hot_tracking.h"
 
+static struct dentry *hot_debugfs_root;
+
 /* kmem_cache pointers for slab caches */
 static struct kmem_cache *hot_inode_item_cachep __read_mostly;
 static struct kmem_cache *hot_range_item_cachep __read_mostly;
@@ -623,6 +626,444 @@ static void hot_update_worker(struct work_struct *work)
msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
 }
 
+static void *hot_range_seq_start(struct seq_file *seq, loff_t *pos)
+   __acquires(rcu)
+{
+   struct hot_info *root = seq->private;
+   struct rb_node *node_he, *node_hr;
+   struct hot_comm_item *ci_he, *ci_hr;
+   struct hot_inode_item *he;
+   struct hot_range_item *hr;
+   loff_t l = *pos;
+
+   rcu_read_lock();
+   node_he = rb_first(&root->hot_inode_tree);
+   while (node_he) {
+   ci_he = rb_entry(node_he, struct hot_comm_item, rb_node);
+   he = container_of(ci_he, struct hot_inode_item, hot_inode);
+   node_hr = rb_first(&he->hot_range_tree);
+   while (node_hr) {
+   if (!l--) {
+   ci_hr = rb_entry(node_hr,
+   struct hot_comm_item, rb_node);
+   hr = container_of(ci_hr,
+   struct hot_range_item, hot_range);
+   return hr;
+   }
+   node_hr = rb_next(node_hr);
+   }
+   node_he = rb_next(node_he);
+   }
+
+   return NULL;
+}
+
+static void *hot_range_seq_next(struct seq_file *seq,
+   void *v, loff_t *pos)
+{
+   struct rb_node *node_he, *node_hr;
+   struct hot_comm_item *ci_he, *ci_hr;
+   struct hot_range_item *hr_next = NULL, *hr = v;
+   struct hot_inode_item *he_next;
+
+   (*pos)++;
+   node_hr = rb_next(&hr->hot_range.rb_node);
+   if (node_hr) {
+   ci_hr = rb_entry(node_hr, struct hot_comm_item, rb_node);
+   hr_next = container_of(ci_hr, struct hot_range_item, hot_range);
+
+   return hr_next;
+   }
+
+   node_he = rb_next(&hr->hot_inode->hot_inode.rb_node);
+loop_he:
+   if (node_he) {
+   ci_he = rb_entry(node_he, struct hot_comm_item, rb_node);
+   he_next = container_of(ci_he, struct hot_inode_item, hot_inode);
+   node_hr = rb_first(&he_next->hot_range_tree);
+   if (node_hr) {
+   ci_hr = rb_entry(node_hr,
+   struct hot_comm_item, rb_node);
+   hr_next = container_of(ci_hr,
+   struct hot_range_item, hot_range);
+   } else {
+   node_he = rb_next(node_he);
+   goto loop_he;
+   }
+   }
+
+   return hr_next;
+}
+
+static void hot_range_seq_stop(struct seq_file *seq, void *v)
+   __releases(rcu)
+{
+   rcu_read_unlock();
+}
+
+static int hot_range_seq_show(struct seq_file *seq, void *v)
+{
+   struct hot_range_item *hr = v;
+   struct hot_inode_item *he = hr->hot_inode;
+   struct hot_freq_data *freq_data;
+
+   freq_data = &hr->hot_range.hot_freq_data;
+   seq_printf(seq, "inode %llu, extent %llu+%llu, " \
+   "reads %u, writes %u, temp %u, " \
+   "storage type %s\n",
+   he->i_ino, (unsigned long long)hr->start,
+   (unsigned long long)hr->len,
+   freq_data->nr_reads,
+   freq_data->nr_writes,
+   (u8)hot_shift((u64)freq_data->last_temp,
+   (32 - MAP_BITS), false),
+   (hr->storage_type == 1) ? "nonrot" : "rot");
+
+   return 0;
+}
+
+static void *hot_inode_seq_start(struct seq_file *seq, loff_t *pos)
+   __acquires(rcu)
+{
+   struct hot_info *root = seq->private;
+   struct rb_node *node;
+   struct hot_comm_item *ci;
+   struct hot_inode_item *he = NULL;
+   loff_t l = *pos;
+
+   rcu_read_lock();
+   node = 

[PATCH v2 08/12] VFS hot tracking: add one ioctl interface

2013-05-13 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  FS_IOC_GET_HEAT_INFO: return a struct containing the various
metrics collected in hot_freq_data structs, and also return a
calculated data temperature based on those metrics. Optionally,
retrieve the temperature from the hot data hash list instead of
recalculating it.

Signed-off-by: Chandra Seetharaman sekha...@us.ibm.com
Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/compat_ioctl.c|  5 
 fs/hot_tracking.c|  2 +-
 fs/ioctl.c   | 70 
 include/linux/hot_tracking.h | 21 +
 4 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index 996cdc5..97bf972 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -57,6 +57,7 @@
 #include <linux/i2c-dev.h>
 #include <linux/atalk.h>
 #include <linux/gfp.h>
+#include <linux/hot_tracking.h>
 
 #include <net/bluetooth/bluetooth.h>
 #include <net/bluetooth/hci.h>
@@ -1402,6 +1403,9 @@ COMPATIBLE_IOCTL(TIOCSTART)
 COMPATIBLE_IOCTL(TIOCSTOP)
 #endif
 
+/*Hot data tracking*/
+COMPATIBLE_IOCTL(FS_IOC_GET_HEAT_INFO)
+
 /* fat 'r' ioctls. These are handled by fat with ->compat_ioctl,
but we don't want warnings on other file systems. So declare
them as compatible here. */
@@ -1581,6 +1585,7 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd,
case FIBMAP:
case FIGETBSZ:
case FIONREAD:
+   case FS_IOC_GET_HEAT_INFO:
if (S_ISREG(file_inode(f.file)->i_mode))
break;
/*FALL THROUGH*/
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index cea3675..1618f21 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -375,7 +375,7 @@ static void hot_freq_update(struct hot_info *root,
  * the *_COEFF_POWER values and combined to a single temperature
  * value.
  */
-static u32 hot_temp_calc(struct hot_comm_item *ci)
+u32 hot_temp_calc(struct hot_comm_item *ci)
 {
u32 result = 0;
struct hot_freq_data *freq_data = &ci->hot_freq_data;
diff --git a/fs/ioctl.c b/fs/ioctl.c
index fd507fb..f9f3497 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -15,6 +15,7 @@
 #include <linux/writeback.h>
 #include <linux/buffer_head.h>
 #include <linux/falloc.h>
+#include <linux/hot_tracking.h>
 
 #include <asm/ioctls.h>
 
@@ -537,6 +538,72 @@ static int ioctl_fsthaw(struct file *filp)
 }
 
 /*
+ * Retrieve information about access frequency for the given file. Return it in
+ * a userspace-friendly struct for btrfsctl (or another tool) to parse.
+ *
+ * The temperature that is returned can be live -- that is, recalculated when
+ * the ioctl is called -- or it can be returned from the map list, reflecting
+ * the (possibly old) value that the system will use when considering files
+ * for migration. This behavior is determined by hot_heat_info->live.
+ */
+static int ioctl_heat_info(struct file *file, void __user *argp)
+{
+   struct inode *inode = file->f_dentry->d_inode;
+   struct hot_heat_info heat_info;
+   struct hot_inode_item *he;
+   int ret = 0;
+
+   if (copy_from_user((void *)&heat_info,
+   argp,
+   sizeof(struct hot_heat_info)) != 0) {
+   ret = -EFAULT;
+   goto err;
+   }
+
+   he = hot_inode_item_lookup(inode->i_sb->s_hot_root, inode->i_ino, 0);
+   if (IS_ERR(he)) {
+   /* we don't have any info on this file yet */
+   ret = -ENODATA;
+   goto err;
+   }
+
+   heat_info.avg_delta_reads =
+   (__u64) he->hot_inode.hot_freq_data.avg_delta_reads;
+   heat_info.avg_delta_writes =
+   (__u64) he->hot_inode.hot_freq_data.avg_delta_writes;
+   heat_info.last_read_time =
+   (__u64) timespec_to_ns(&he->hot_inode.hot_freq_data.last_read_time);
+   heat_info.last_write_time =
+   (__u64) timespec_to_ns(&he->hot_inode.hot_freq_data.last_write_time);
+   heat_info.num_reads =
+   (__u32) he->hot_inode.hot_freq_data.nr_reads;
+   heat_info.num_writes =
+   (__u32) he->hot_inode.hot_freq_data.nr_writes;
+
+   if (heat_info.live > 0) {
+   /*
+* got a request for live temperature,
+* call hot_temp_calc() to recalculate
+*/
+   heat_info.temp = hot_temp_calc(&he->hot_inode);
+   } else {
+   /* not live temperature, get it from the map list */
+   heat_info.temp = he->hot_inode.hot_freq_data.last_temp;
+   }
+
+   hot_comm_item_put(&he->hot_inode);
+
+   if (copy_to_user(argp, (void *)&heat_info,
+   sizeof(struct hot_heat_info))) {
+   ret = -EFAULT;
+   goto err;
+   }
+
+err:
+   return ret;
+}
+
+/*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
  *
@@ -591,6 +658,9 @@ int 

[PATCH v2 09/12] VFS hot tracking, procfs: add two proc interfaces

2013-05-13 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add two proc interfaces, hot-age-interval and hot-update-interval,
under the directory /proc/sys/fs/ in order to make HOT_AGE_INTERVAL and
HOT_UPDATE_INTERVAL tunable.

Signed-off-by: Chandra Seetharaman sekha...@us.ibm.com
Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c| 12 +---
 fs/hot_tracking.h|  3 ---
 include/linux/hot_tracking.h |  7 +++
 kernel/sysctl.c  | 14 ++
 4 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 1618f21..088e9aa 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -23,6 +23,12 @@
 
 static struct dentry *hot_debugfs_root;
 
+int sysctl_hot_age_interval __read_mostly = 300;
+EXPORT_SYMBOL_GPL(sysctl_hot_age_interval);
+
+int sysctl_hot_update_interval __read_mostly = 300;
+EXPORT_SYMBOL_GPL(sysctl_hot_update_interval);
+
 /* kmem_cache pointers for slab caches */
 static struct kmem_cache *hot_inode_item_cachep __read_mostly;
 static struct kmem_cache *hot_range_item_cachep __read_mostly;
@@ -446,7 +452,7 @@ static bool hot_is_obsolete(struct hot_comm_item *ci)
struct hot_freq_data *freq_data = &ci->hot_freq_data;
u64 last_read_ns, last_write_ns;
u64 cur_time = timespec_to_ns(&ckt);
-   u64 kick_ns =  HOT_AGE_INTERVAL * NSEC_PER_SEC;
+   u64 kick_ns =  sysctl_hot_age_interval * NSEC_PER_SEC;
 
last_read_ns =
(cur_time - timespec_to_ns(&freq_data->last_read_time));
@@ -623,7 +629,7 @@ static void hot_update_worker(struct work_struct *work)
 
/* Instert next delayed work */
queue_delayed_work(root->update_wq, &root->update_work,
-   msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
+   msecs_to_jiffies(sysctl_hot_update_interval * MSEC_PER_SEC));
 }
 
 static void *hot_range_seq_start(struct seq_file *seq, loff_t *pos)
@@ -1217,7 +1223,7 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
/* Initialize hot tracking wq and arm one delayed work */
INIT_DELAYED_WORK(&root->update_work, hot_update_worker);
queue_delayed_work(root->update_wq, &root->update_work,
-   msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
+   msecs_to_jiffies(sysctl_hot_update_interval * MSEC_PER_SEC));
 
/* Register a shrinker callback */
root->hot_shrink.shrink = hot_track_prune;
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index fcc60ac..d1ab48b 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -15,9 +15,6 @@
 #include <linux/workqueue.h>
 #include <linux/hot_tracking.h>
 
-#define HOT_UPDATE_INTERVAL 150
-#define HOT_AGE_INTERVAL 300
-
 /* size of sub-file ranges */
 #define RANGE_BITS 20
 #define FREQ_POWER 4
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 263a15e..6de7153 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -110,6 +110,13 @@ struct hot_info {
 };
 
 /*
+ * Two variables have meanings as below:
+ * 1. time to quit keeping track of tracking data (seconds)
+ * 2. set how often to update temperatures (seconds)
+ */
+extern int sysctl_hot_age_interval, sysctl_hot_update_interval;
+
+/*
  * Hot data tracking ioctls:
  *
  * HOT_INFO - retrieve info on frequency of access
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9edcf45..6ee4338 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1616,6 +1616,20 @@ static struct ctl_table fs_table[] = {
.proc_handler   = pipe_proc_fn,
.extra1 = &pipe_min_size,
},
+   {
+   .procname   = "hot-age-interval",
+   .data   = &sysctl_hot_age_interval,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
+   {
+   .procname   = "hot-update-interval",
+   .data   = &sysctl_hot_update_interval,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
{ }
 };
 
-- 
1.7.11.7
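
As an illustration of what the hot-age-interval tunable controls, the
following is a minimal userspace sketch of an aging check in the
spirit of hot_is_obsolete(). Only the computation of kick_ns and of
the time since the last read is visible in the hunk above, so the
rule used here, that an item is obsolete only when both its last read
and its last write are older than the interval, is an assumption.
demo_is_obsolete() is a hypothetical name, and the explicit 64-bit
cast on the interval is a choice made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL

/*
 * Hypothetical aging check: all times are nanosecond timestamps and
 * age_interval_sec plays the role of /proc/sys/fs/hot-age-interval.
 */
static bool demo_is_obsolete(uint64_t now_ns, uint64_t last_read_ns,
                             uint64_t last_write_ns, int age_interval_sec)
{
    /* Widen before multiplying so the product cannot overflow an int. */
    uint64_t kick_ns = (uint64_t)age_interval_sec * DEMO_NSEC_PER_SEC;
    uint64_t read_age = now_ns - last_read_ns;
    uint64_t write_age = now_ns - last_write_ns;

    /* Assumption: both kinds of access must be stale to age the item out. */
    return read_age > kick_ns && write_age > kick_ns;
}
```

With the default interval of 300 seconds, an item last read 900
seconds ago and last written 800 seconds ago would be aged out under
this rule, while one written 100 seconds ago would be kept.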



[PATCH v2 04/12] VFS hot tracking: register one shrinker

2013-05-13 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Register a shrinker to control the amount of memory that
is used in tracking hot regions. If we are throwing inodes
out of memory due to memory pressure, we most definitely are
going to need to reduce the amount of memory the tracking
code is using, even if it means losing useful information.
That is, the shrinker accelerates the aging process.

Signed-off-by: Chandra Seetharaman sekha...@us.ibm.com
Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c| 58 
 include/linux/hot_tracking.h |  2 ++
 2 files changed, 60 insertions(+)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 2742d9e..af4498c 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -100,6 +100,7 @@ static void hot_comm_item_unlink(struct hot_info *root,
list_del_rcu(&ci->track_list);
spin_unlock(&root->m_lock);
 
+   atomic_dec(&root->hot_map_nr);
hot_comm_item_put(ci);
}
 
@@ -517,6 +518,7 @@ static int hot_map_update(struct hot_info *root,
else {
u32 flags = ci->hot_freq_data.flags;
 
+   atomic_inc(&root->hot_map_nr);
hot_comm_item_get(ci);
 
spin_lock(&root->m_lock);
@@ -642,6 +644,55 @@ void __init hot_cache_init(void)
 }
 EXPORT_SYMBOL_GPL(hot_cache_init);
 
+static void hot_prune_map(struct hot_info *root, long nr)
+{
+   int i;
+
+   for (i = 0; i < MAP_SIZE; i++) {
+   struct hot_comm_item *ci;
+   unsigned long prev_nr;
+
+   rcu_read_lock();
+   if (list_empty(&root->hot_map[TYPE_INODE][i])) {
+   rcu_read_unlock();
+   continue;
+   }
+
+   list_for_each_entry_rcu(ci, &root->hot_map[TYPE_INODE][i],
+   track_list) {
+   prev_nr = atomic_read(&root->hot_map_nr);
+   hot_comm_item_unlink(root, ci);
+   nr -= (prev_nr - atomic_read(&root->hot_map_nr));
+   if (nr <= 0)
+   break;
+   }
+   rcu_read_unlock();
+
+   if (nr <= 0)
+   break;
+   }
+
+   return;
+}
+
+/* The shrinker callback function */
+static int hot_track_prune(struct shrinker *shrink,
+   struct shrink_control *sc)
+{
+   struct hot_info *root =
+   container_of(shrink, struct hot_info, hot_shrink);
+
+   if (sc->nr_to_scan == 0)
+   return atomic_read(&root->hot_map_nr) / 2;
+
+   if (!(sc->gfp_mask & __GFP_FS))
+   return -1;
+
+   hot_prune_map(root, sc->nr_to_scan);
+
+   return atomic_read(&root->hot_map_nr);
+}
+
 /*
  * Main function to update i/o access frequencies, and it will be called
  * from read/writepages() hooks, which are read_pages(), do_writepages(),
@@ -706,6 +757,7 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
root->hot_inode_tree = RB_ROOT;
spin_lock_init(&root->t_lock);
spin_lock_init(&root->m_lock);
+   atomic_set(&root->hot_map_nr, 0);
 
for (i = 0; i < MAP_SIZE; i++) {
for (j = 0; j < MAX_TYPES; j++)
@@ -726,6 +778,11 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
queue_delayed_work(root->update_wq, &root->update_work,
msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
 
+   /* Register a shrinker callback */
+   root->hot_shrink.shrink = hot_track_prune;
+   root->hot_shrink.seeks = DEFAULT_SEEKS;
+   register_shrinker(&root->hot_shrink);
+
return root;
 }
 
@@ -737,6 +794,7 @@ static void hot_tree_exit(struct hot_info *root)
struct rb_node *node;
struct hot_comm_item *ci;
 
+   unregister_shrinker(&root->hot_shrink);
cancel_delayed_work_sync(&root->update_work);
destroy_workqueue(root->update_wq);
 
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index c32197b..a78b4fc 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -90,8 +90,10 @@ struct hot_info {
spinlock_t t_lock;  /* protect above tree */
struct list_head hot_map[MAX_TYPES][MAP_SIZE];  /* map of inode temp */
spinlock_t m_lock;
+   atomic_t hot_map_nr;
struct workqueue_struct *update_wq;
struct delayed_work update_work;
+   struct shrinker hot_shrink;
 };
 
 extern void __init hot_cache_init(void);
-- 
1.7.11.7
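
The return-value contract that hot_track_prune() follows can be
summarized with a small userspace model. This is a sketch only: the
demo_* names are hypothetical, the tracker is reduced to a single
counter, and demo_prune() simply drops up to nr items instead of
walking the map lists as hot_prune_map() does. The fs_allowed flag
stands in for the __GFP_FS test on sc->gfp_mask.

```c
#include <stdbool.h>

/* Stand-in for hot_info: only the number of items on the map lists. */
struct demo_tracker {
    long nr_items;
};

/* Simplified stand-in for hot_prune_map(): drop up to nr items. */
static void demo_prune(struct demo_tracker *t, long nr)
{
    t->nr_items = (nr >= t->nr_items) ? 0 : t->nr_items - nr;
}

/*
 * Model of the shrinker callback:
 *  - nr_to_scan == 0 is a query and reports half of the tracked items;
 *  - without filesystem-safe allocation context nothing is freed, return -1;
 *  - otherwise prune and report how many items remain.
 */
static long demo_shrink(struct demo_tracker *t, long nr_to_scan,
                        bool fs_allowed)
{
    if (nr_to_scan == 0)
        return t->nr_items / 2;

    if (!fs_allowed)
        return -1;

    demo_prune(t, nr_to_scan);

    return t->nr_items;
}
```

In this model a query on 100 tracked items reports 50, which tells
the caller that at most half of the tracking data is offered up for
reclaim in one pass.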



[RFC 0/5] BTRFS hot relocation support

2013-05-06 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  The patchset as RFC is sent out mainly to see if it goes in the
correct development direction.

  The patchset introduces hot relocation support for BTRFS. In a hybrid
storage environment, data on an HDD that becomes hot can be relocated
to an SSD automatically. If SSD usage exceeds its upper threshold,
data that has gone cold is first looked up and relocated back to the
HDD to free space on the SSD, and then the hot data is relocated to
the SSD automatically.
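
  As a rough illustration of the policy described above, the following
userspace C sketch decides which way one tracked range should move. It is
not code from the patchset: the names sketch_pick_action and
SKETCH_HIGH_WATER_PCT, the 75% figure, and the strict '>' comparison
against the threshold are all assumptions made for the example.

```c
#include <stdbool.h>

/* Assumed SSD usage (percent) at which cold data starts being evicted.
 * The real water level is defined by the patchset, not by this sketch. */
#define SKETCH_HIGH_WATER_PCT 75

enum sketch_action {
	SKETCH_STAY,	/* leave the range where it is */
	SKETCH_TO_SSD,	/* hot data on the HDD: move it to the SSD */
	SKETCH_TO_HDD	/* cold data on the SSD while space is tight */
};

/* Hypothetical helper: pick a relocation direction for one range. */
static enum sketch_action sketch_pick_action(bool on_ssd, int temp,
					     int thresh, int ssd_used_pct)
{
	bool hot = temp > thresh;	/* assumption: strictly above */

	if (!on_ssd && hot)
		return SKETCH_TO_SSD;
	/* Cold data only leaves the SSD when there is space pressure. */
	if (on_ssd && !hot && ssd_used_pct >= SKETCH_HIGH_WATER_PCT)
		return SKETCH_TO_HDD;
	return SKETCH_STAY;
}
```

  With a threshold of 108 and a data temperature of 109, a range that
still lives on the HDD would be picked for relocation to the SSD.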

  BTRFS hot relocation works in four steps: it reserves block space on
the SSD, loads the hot data from the HDD into the page cache, allocates
block space on the SSD, and finally writes the data to the SSD.

  If you'd like to play with it, please pull the patchset from
my git tree on github:
  https://github.com/wuzhy/kernel.git hot_reloc

For how to use it, please refer to the example below:

root@debian-i386:~# echo 0 > /sys/block/vdc/queue/rotational
^^^ The above command makes /dev/vdc appear to be an SSD
root@debian-i386:~# echo 99 > /proc/sys/fs/hot-age-interval
root@debian-i386:~# echo 10 > /proc/sys/fs/hot-update-interval
root@debian-i386:~# echo 10 > /proc/sys/fs/hot-reloc-interval
root@debian-i386:~# mkfs.btrfs -d single -m single -h /dev/vdb /dev/vdc -f
 
WARNING! - Btrfs v0.20-rc1-254-gb0136aa-dirty IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using
 
[ 140.279011] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 1 transid 16 /dev/vdb
[ 140.283650] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
[ 140.517089] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb
[ 140.550759] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb
[ 140.552473] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
adding device /dev/vdc id 2
[ 140.636215] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 2 transid 3 /dev/vdc
fs created label (null) on /dev/vdb
nodesize 4096 leafsize 4096 sectorsize 4096 size 14.65GB
Btrfs v0.20-rc1-254-gb0136aa-dirty
root@debian-i386:~# mount -o hot_move /dev/vdb /data2
[ 144.855471] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 6 /dev/vdb
[ 144.870444] btrfs: disk space caching is enabled
[ 144.904214] VFS: Turning on hot data tracking
root@debian-i386:~# dd if=/dev/zero of=/data2/test1 bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 23.4948 s, 91.4 MB/s
root@debian-i386:~# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 16G 13G 2.2G 86% /
tmpfs 4.8G 0 4.8G 0% /lib/init/rw
udev 10M 176K 9.9M 2% /dev
tmpfs 4.8G 0 4.8G 0% /dev/shm
/dev/vdb 15G 2.0G 13G 14% /data2
root@debian-i386:~# btrfs fi df /data2
Data: total=3.01GB, used=2.00GB
System: total=4.00MB, used=4.00KB
Metadata: total=8.00MB, used=2.19MB
Data_SSD: total=8.00MB, used=0.00
root@debian-i386:~# echo 108 > /proc/sys/fs/hot-reloc-threshold
^^^ The above command triggers hot relocation, because the data temperature is
currently 109
root@debian-i386:~# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 16G 13G 2.2G 86% /
tmpfs 4.8G 0 4.8G 0% /lib/init/rw
udev 10M 176K 9.9M 2% /dev
tmpfs 4.8G 0 4.8G 0% /dev/shm
/dev/vdb 15G 2.1G 13G 14% /data2
root@debian-i386:~# btrfs fi df /data2
Data: total=3.01GB, used=6.25MB
System: total=4.00MB, used=4.00KB
Metadata: total=8.00MB, used=2.26MB
Data_SSD: total=2.01GB, used=2.00GB
root@debian-i386:~# 

Zhi Yong Wu (5):
  vfs: add one list_head field
  btrfs: add one new block group
  btrfs: add one hot relocation kthread
  procfs: add three proc interfaces
  btrfs: add hot relocation support

 fs/btrfs/Makefile|   3 +-
 fs/btrfs/ctree.h |  26 +-
 fs/btrfs/extent-tree.c   | 107 +-
 fs/btrfs/extent_io.c |  31 +-
 fs/btrfs/extent_io.h |   4 +
 fs/btrfs/file.c  |  36 +-
 fs/btrfs/hot_relocate.c  | 802 +++
 fs/btrfs/hot_relocate.h  |  48 +++
 fs/btrfs/inode-map.c |  13 +-
 fs/btrfs/inode.c |  92 -
 fs/btrfs/ioctl.c |  23 +-
 fs/btrfs/relocation.c|  14 +-
 fs/btrfs/super.c |  30 +-
 fs/btrfs/volumes.c   |  28 +-
 fs/hot_tracking.c|   1 +
 include/linux/btrfs.h|   4 +
 include/linux/hot_tracking.h |   1 +
 kernel/sysctl.c  |  22 ++
 18 files changed, 1234 insertions(+), 51 deletions(-)
 create mode 100644 fs/btrfs/hot_relocate.c
 create mode 100644 fs/btrfs/hot_relocate.h

-- 
1.7.11.7



[RFC 1/5] vfs: add one list_head field

2013-05-06 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add one list_head field 'reloc_list' to accommodate
hot relocation support.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c| 1 +
 include/linux/hot_tracking.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 3b0002c..7071ac8 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -41,6 +41,7 @@ static void hot_comm_item_init(struct hot_comm_item *ci, int type)
clear_bit(HOT_IN_LIST, &ci->delete_flag);
clear_bit(HOT_DELETING, &ci->delete_flag);
INIT_LIST_HEAD(&ci->track_list);
+   INIT_LIST_HEAD(&ci->reloc_list);
memset(&ci->hot_freq_data, 0, sizeof(struct hot_freq_data));
ci->hot_freq_data.avg_delta_reads = (u64) -1;
ci->hot_freq_data.avg_delta_writes = (u64) -1;
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 2272975..49f901c 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -68,6 +68,7 @@ struct hot_comm_item {
struct rb_node rb_node; /* rbtree index */
unsigned long delete_flag;
struct list_head track_list;/* link to *_map[] */
+   struct list_head reloc_list;/* used in hot relocation*/
 };
 
 /* An item representing an inode and its access frequency */
-- 
1.7.11.7



[RFC 2/5] btrfs: add one new block group

2013-05-06 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Introduce one new block group type, BTRFS_BLOCK_GROUP_DATA_SSD,
which distinguishes block space reserved and allocated on an SSD
from block space on an HDD.
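
  A minimal userspace sketch of how such a type bit can steer placement,
following the comment this patch adds to ctree.h (*_SSD block groups go
to nonrotating drives, plain DATA and METADATA to rotating ones). The
helper sketch_device_allowed is hypothetical and is not a function from
the patch; leaving other block group types unrestricted is an assumption
of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define BTRFS_BLOCK_GROUP_DATA		(1ULL << 0)
#define BTRFS_BLOCK_GROUP_METADATA	(1ULL << 2)
#define BTRFS_BLOCK_GROUP_DATA_SSD	(1ULL << 9)	/* bit added by this patch */

/* Hypothetical helper: may a block group with these flags be placed
 * on a device with the given rotational property? */
static bool sketch_device_allowed(uint64_t bg_flags, bool dev_rotational)
{
	if (bg_flags & BTRFS_BLOCK_GROUP_DATA_SSD)
		return !dev_rotational;
	if (bg_flags & (BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA))
		return dev_rotational;
	return true;	/* assumption: other types are not restricted here */
}
```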

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/btrfs/Makefile   |   3 +-
 fs/btrfs/ctree.h|  24 ++-
 fs/btrfs/extent-tree.c  | 107 +++-
 fs/btrfs/extent_io.c|  31 --
 fs/btrfs/extent_io.h|   4 ++
 fs/btrfs/file.c |  36 +---
 fs/btrfs/hot_relocate.c |  78 +++
 fs/btrfs/hot_relocate.h |  31 ++
 fs/btrfs/inode-map.c|  13 +-
 fs/btrfs/inode.c|  92 +
 fs/btrfs/ioctl.c|  23 +--
 fs/btrfs/relocation.c   |  14 ++-
 fs/btrfs/super.c|   3 +-
 fs/btrfs/volumes.c  |  28 -
 14 files changed, 439 insertions(+), 48 deletions(-)
 create mode 100644 fs/btrfs/hot_relocate.c
 create mode 100644 fs/btrfs/hot_relocate.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 3932224..94f1ea5 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -8,7 +8,8 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
-  reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o
+  reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
+  hot_relocate.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 701dec5..f4c4419 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -961,6 +961,16 @@ struct btrfs_dev_replace_item {
 #define BTRFS_BLOCK_GROUP_RAID10   (1ULL << 6)
 #define BTRFS_BLOCK_GROUP_RAID5    (1 << 7)
 #define BTRFS_BLOCK_GROUP_RAID6    (1 << 8)
+/*
+ * New block groups for use with hot data relocation feature. When hot data
+ * relocation is on, *_SSD block groups are forced to nonrotating drives and
+ * the plain DATA and METADATA block groups are forced to rotating drives.
+ *
+ * This should be further optimized, i.e. force metadata to SSD or relocate
+ * inode metadata to SSD when any of its subfile ranges are relocated to SSD
+ * so that reads and writes aren't delayed by HDD seeks.
+ */
+#define BTRFS_BLOCK_GROUP_DATA_SSD (1ULL << 9)
 #define BTRFS_BLOCK_GROUP_RESERVED BTRFS_AVAIL_ALLOC_BIT_SINGLE
 
 enum btrfs_raid_types {
@@ -976,7 +986,8 @@ enum btrfs_raid_types {
 
 #define BTRFS_BLOCK_GROUP_TYPE_MASK(BTRFS_BLOCK_GROUP_DATA |\
 BTRFS_BLOCK_GROUP_SYSTEM |  \
-BTRFS_BLOCK_GROUP_METADATA)
+BTRFS_BLOCK_GROUP_METADATA | \
+BTRFS_BLOCK_GROUP_DATA_SSD)
 
 #define BTRFS_BLOCK_GROUP_PROFILE_MASK (BTRFS_BLOCK_GROUP_RAID0 |   \
 BTRFS_BLOCK_GROUP_RAID1 |   \
@@ -1508,6 +1519,7 @@ struct btrfs_fs_info {
struct list_head space_info;
 
struct btrfs_space_info *data_sinfo;
+   struct btrfs_space_info *hot_data_sinfo;
 
struct reloc_control *reloc_ctl;
 
@@ -1532,6 +1544,7 @@ struct btrfs_fs_info {
u64 avail_data_alloc_bits;
u64 avail_metadata_alloc_bits;
u64 avail_system_alloc_bits;
+   u64 avail_data_ssd_alloc_bits;
 
/* restriper state */
spinlock_t balance_lock;
@@ -1544,6 +1557,7 @@ struct btrfs_fs_info {
 
unsigned data_chunk_allocations;
unsigned metadata_ratio;
+   unsigned data_ssd_chunk_allocations;
 
void *bdev_holder;
 
@@ -1901,6 +1915,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR   (1 << 22)
 #define BTRFS_MOUNT_HOT_TRACK  (1 << 23)
+#define BTRFS_MOUNT_HOT_MOVE   (1 << 24)
 
 #define btrfs_clear_opt(o, opt)    ((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)  ((o) |= BTRFS_MOUNT_##opt)
@@ -1922,6 +1937,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_INODE_NOATIME    (1 << 9)
 #define BTRFS_INODE_DIRSYNC    (1 << 10)
 #define BTRFS_INODE_COMPRESS   (1 << 11)
+#define BTRFS_INODE_HOT    (1 << 12)
 
 #define BTRFS_INODE_ROOT_ITEM_INIT (1 << 31)
 
@@ -3014,6 +3030,8 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_root 
*root,
 int btrfs_cross_ref_exist(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
  u64 objectid, u64 offset, u64 bytenr);
+struct btrfs_block_group_cache 

[RFC 4/5] procfs: add three proc interfaces

2013-05-06 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add three proc interfaces, hot-reloc-interval, hot-reloc-threshold,
and hot-reloc-max-items, under /proc/sys/fs/ in order to make
HOT_RELOC_INTERVAL, HOT_RELOC_THRESHOLD, and HOT_RELOC_MAX_ITEMS
tunable.
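
  For illustration only, a small userspace C sketch of setting one of
these tunables from a program, equivalent to echoing a number into the
proc file. The helper name sketch_write_tunable is hypothetical; the
proc paths exist only on a kernel carrying this patchset, and writing
them normally requires root.

```c
#include <stdio.h>

/* Hypothetical helper: write an integer value to a procfs tunable.
 * Returns 0 on success and -1 on any failure. */
static int sketch_write_tunable(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", value) > 0;
	/* A failed fclose() can mean the kernel rejected the value. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

  For example, sketch_write_tunable("/proc/sys/fs/hot-reloc-threshold", 108)
would set the relocation threshold to 108.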

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/btrfs/hot_relocate.c | 26 +-
 fs/btrfs/hot_relocate.h |  4 
 include/linux/btrfs.h   |  4 
 kernel/sysctl.c | 22 ++
 4 files changed, 43 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/hot_relocate.c b/fs/btrfs/hot_relocate.c
index 683e154..aa8c9f0 100644
--- a/fs/btrfs/hot_relocate.c
+++ b/fs/btrfs/hot_relocate.c
@@ -25,7 +25,7 @@
  * The relocation code below operates on the heat map lists to identify
  * hot or cold data logical file ranges that are candidates for relocation.
  * The triggering mechanism for relocation is controlled by a global heat
- * threshold integer value (HOT_RELOC_THRESHOLD). Ranges are
+ * threshold integer value (sysctl_hot_reloc_threshold). Ranges are
  * queued for relocation by the periodically executing relocate kthread,
  * which updates the global heat threshold and responds to space pressure
  * on the SSDs.
@@ -52,6 +52,15 @@
  * (assuming, critically, the HOT_MOVE option is set at mount time).
  */
 
+int sysctl_hot_reloc_threshold = 150;
+EXPORT_SYMBOL_GPL(sysctl_hot_reloc_threshold);
+
+int sysctl_hot_reloc_interval __read_mostly = 120;
+EXPORT_SYMBOL_GPL(sysctl_hot_reloc_interval);
+
+int sysctl_hot_reloc_max_items __read_mostly = 250;
+EXPORT_SYMBOL_GPL(sysctl_hot_reloc_max_items);
+
 static void hot_set_extent_bits(struct extent_io_tree *tree, u64 start,
u64 end, struct extent_state **cached_state,
gfp_t mask, int storage_type, int flag)
@@ -165,7 +174,7 @@ static int hot_calc_ssd_ratio(struct hot_reloc *hot_reloc)
 static int hot_update_threshold(struct hot_reloc *hot_reloc,
int update)
 {
-   int thresh = hot_reloc->thresh;
+   int thresh = sysctl_hot_reloc_threshold;
int ratio = hot_calc_ssd_ratio(hot_reloc);
 
/* Sometimes update global threshold, others not */
@@ -189,7 +198,7 @@ static int hot_update_threshold(struct hot_reloc *hot_reloc,
thresh = 0;
}
 
-   hot_reloc->thresh = thresh;
+   sysctl_hot_reloc_threshold = thresh;
return ratio;
 }
 
@@ -280,7 +289,7 @@ static int hot_queue_extent(struct hot_reloc *hot_reloc,
hot_comm_item_put(ci);
spin_unlock(&he->i_lock);
 
-   if (*counter >= HOT_RELOC_MAX_ITEMS)
+   if (*counter >= sysctl_hot_reloc_max_items)
break;
 
if (kthread_should_stop()) {
@@ -361,7 +370,7 @@ again:
while (1) {
lock_extent(tree, page_start, page_end);
ordered = btrfs_lookup_ordered_extent(inode,
-   page_start);
+ page_start);
unlock_extent(tree, page_start, page_end);
if (!ordered)
break;
@@ -642,7 +651,7 @@ void hot_do_relocate(struct hot_reloc *hot_reloc)
 
run++;
ratio = hot_update_threshold(hot_reloc, !(run % 15));
-   thresh = hot_reloc->thresh;
+   thresh = sysctl_hot_reloc_threshold;
 
INIT_LIST_HEAD(&hot_reloc->hot_relocq[TYPE_NONROT]);
 
@@ -652,7 +661,7 @@ void hot_do_relocate(struct hot_reloc *hot_reloc)
if (count_to_hot == 0)
return;
 
-   count_to_cold = HOT_RELOC_MAX_ITEMS;
+   count_to_cold = sysctl_hot_reloc_max_items;
 
/* Don't move cold data to HDD unless there's space pressure */
if (ratio < HIGH_WATER_LEVEL)
@@ -734,7 +743,7 @@ static int hot_relocate_kthread(void *arg)
unsigned long delay;
 
do {
-   delay = HZ * HOT_RELOC_INTERVAL;
+   delay = HZ * sysctl_hot_reloc_interval;
if (mutex_trylock(&hot_reloc->hot_reloc_mutex)) {
hot_do_relocate(hot_reloc);
mutex_unlock(&hot_reloc->hot_reloc_mutex);
@@ -766,7 +775,6 @@ int hot_relocate_init(struct btrfs_fs_info *fs_info)
 
fs_info->hot_reloc = hot_reloc;
hot_reloc->fs_info = fs_info;
-   hot_reloc->thresh = HOT_RELOC_THRESHOLD;
for (i = 0; i < MAX_RELOC_TYPES; i++)
INIT_LIST_HEAD(&hot_reloc->hot_relocq[i]);
mutex_init(&hot_reloc->hot_reloc_mutex);
diff --git a/fs/btrfs/hot_relocate.h b/fs/btrfs/hot_relocate.h
index 077d9b3..ca30944 100644
--- a/fs/btrfs/hot_relocate.h
+++ b/fs/btrfs/hot_relocate.h
@@ -24,9 +24,6 @@ enum {
MAX_RELOC_TYPES
 };
 
-#define HOT_RELOC_INTERVAL  120
-#define HOT_RELOC_THRESHOLD 150
-#define HOT_RELOC_MAX_ITEMS 250
 
 #define HEAT_MAX_VALUE (MAP_SIZE - 1)
 

[RFC 5/5] btrfs: add hot relocation support

2013-05-06 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add one new mount option '-o hot_move' for hot
relocation support. When hot relocation is enabled,
hot tracking will be enabled automatically.
  Its usage looks like:
mount -o hot_move
mount -o nouser,hot_move
mount -o nouser,hot_move,loop
mount -o hot_move,nouser
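
  The rule that hot_move implies hot tracking can be sketched in plain
userspace C as below. The two flag values are the ones this series
defines in ctree.h; the helpers sketch_parse_options,
sketch_needs_tracking and sketch_needs_relocation are hypothetical
stand-ins for the kernel's match_token()-based parser and
btrfs_test_opt(), and every option other than the two hot_* ones is
simply ignored here.

```c
#include <stdbool.h>
#include <string.h>

#define BTRFS_MOUNT_HOT_TRACK	(1UL << 23)
#define BTRFS_MOUNT_HOT_MOVE	(1UL << 24)

/* True if the len-byte token at p is exactly the option called name. */
static bool sketch_opt_is(const char *p, size_t len, const char *name)
{
	return strlen(name) == len && strncmp(p, name, len) == 0;
}

/* Hypothetical parser: scan a comma-separated option string and
 * collect the two hot_* flags. */
static unsigned long sketch_parse_options(const char *options)
{
	unsigned long opt = 0;
	const char *p = options;

	while (*p) {
		size_t len = strcspn(p, ",");

		if (sketch_opt_is(p, len, "hot_track"))
			opt |= BTRFS_MOUNT_HOT_TRACK;
		else if (sketch_opt_is(p, len, "hot_move"))
			opt |= BTRFS_MOUNT_HOT_MOVE;
		p += len;
		if (*p == ',')
			p++;
	}
	return opt;
}

/* Tracking is needed for hot_track and, implicitly, for hot_move. */
static bool sketch_needs_tracking(unsigned long opt)
{
	return (opt & (BTRFS_MOUNT_HOT_TRACK | BTRFS_MOUNT_HOT_MOVE)) != 0;
}

/* Relocation is only started when hot_move itself was given. */
static bool sketch_needs_relocation(unsigned long opt)
{
	return (opt & BTRFS_MOUNT_HOT_MOVE) != 0;
}
```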

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/btrfs/super.c | 26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4cbd0de..b342f6f 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -311,8 +311,13 @@ static void btrfs_put_super(struct super_block *sb)
 * process...  Whom would you report that to?
 */
 
+   /* Hot data relocation */
+   if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_MOVE))
+   hot_relocate_exit(btrfs_sb(sb));
+
/* Hot data tracking */
-   if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
+   if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_MOVE)
+   || btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
hot_track_exit(sb);
 }
 
@@ -327,7 +332,7 @@ enum {
Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
Opt_check_integrity, Opt_check_integrity_including_extent_data,
Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_hot_track,
-   Opt_err,
+   Opt_hot_move, Opt_err,
 };
 
 static match_table_t tokens = {
@@ -368,6 +373,7 @@ static match_table_t tokens = {
{Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
{Opt_fatal_errors, "fatal_errors=%s"},
{Opt_hot_track, "hot_track"},
+   {Opt_hot_move, "hot_move"},
{Opt_err, NULL},
 };
 
@@ -636,6 +642,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
case Opt_hot_track:
btrfs_set_opt(info->mount_opt, HOT_TRACK);
break;
+   case Opt_hot_move:
+   btrfs_set_opt(info->mount_opt, HOT_MOVE);
+   break;
case Opt_err:
printk(KERN_INFO "btrfs: unrecognized mount option "
   "'%s'\n", p);
@@ -863,17 +872,26 @@ static int btrfs_fill_super(struct super_block *sb,
goto fail_close;
}
 
-   if (btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
+   if (btrfs_test_opt(fs_info->tree_root, HOT_MOVE)
+   || btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
err = hot_track_init(sb);
if (err)
goto fail_hot;
}
 
+   if (btrfs_test_opt(fs_info->tree_root, HOT_MOVE)) {
+   err = hot_relocate_init(fs_info);
+   if (err)
+   goto fail_reloc;
+   }
+
save_mount_options(sb, data);
cleancache_init_fs(sb);
sb->s_flags |= MS_ACTIVE;
return 0;
 
+fail_reloc:
+   hot_track_exit(sb);
 fail_hot:
dput(sb->s_root);
sb->s_root = NULL;
@@ -974,6 +992,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
seq_puts(seq, ",fatal_errors=panic");
if (btrfs_test_opt(root, HOT_TRACK))
seq_puts(seq, ",hot_track");
+   if (btrfs_test_opt(root, HOT_MOVE))
+   seq_puts(seq, ",hot_move");
return 0;
 }
 
-- 
1.7.11.7



[PATCH] btrfs-progs: fix one bracket issue in mkfs.btrfs manpage

2013-03-21 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  In [ \fB\-f\fP\fI ], the stray \fI causes the closing bracket ] of
[ -f ] to be rendered in a different font from the opening bracket [.
Run man mkfs.btrfs to see the difference.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 man/mkfs.btrfs.8.in | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/mkfs.btrfs.8.in b/man/mkfs.btrfs.8.in
index cdccd6a..a3f1503 100644
--- a/man/mkfs.btrfs.8.in
+++ b/man/mkfs.btrfs.8.in
@@ -6,7 +6,7 @@ mkfs.btrfs \- create a btrfs filesystem
 [ \fB\-A\fP\fI alloc-start\fP ]
 [ \fB\-b\fP\fI byte-count\fP ]
 [ \fB\-d\fP\fI data-profile\fP ]
-[ \fB\-f\fP\fI ]
+[ \fB\-f\fP ]
 [ \fB\-l\fP\fI leafsize\fP ]
 [ \fB\-L\fP\fI label\fP ]
 [ \fB\-m\fP\fI metadata profile\fP ]
-- 
1.7.11.7



[PATCH] btrfs-progs: add missing qgroup synopsis in btrfs

2013-03-21 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 man/btrfs.8.in | 37 +
 1 file changed, 37 insertions(+)

diff --git a/man/btrfs.8.in b/man/btrfs.8.in
index 94f4ffe..54de60e 100644
--- a/man/btrfs.8.in
+++ b/man/btrfs.8.in
@@ -60,6 +60,18 @@ btrfs \- control a btrfs filesystem
 \fBbtrfs\fP \fBinspect-internal logical-resolve\fP
 [-Pv] [-s size] \fIlogical\fP \fIpath\fP
 .PP
+\fBbtrfs\fP \fBqgroup assign\fP \fIsrc\fP \fIdst\fP \fIpath\fP
+.PP
+\fBbtrfs\fP \fBqgroup remove\fP \fIsrc\fP \fIdst\fP \fIpath\fP
+.PP
+\fBbtrfs\fP \fBqgroup create\fP \fIqgroupid\fP \fIpath\fP
+.PP
+\fBbtrfs\fP \fBqgroup destroy\fP \fIqgroupid\fP \fIpath\fP
+.PP
+\fBbtrfs\fP \fBqgroup show\fP \fIpath\fP
+.PP
+\fBbtrfs\fP \fBqgroup limit\fP [options] \fIsize\fP|\fBnone\fP [\fIqgroupid\fP] \fIpath\fP
+.PP
 \fBbtrfs\fP \fBhelp|\-\-help \fP\fI\fP
 .PP
 \fBbtrfs\fP \fBcommand \-\-help \fP\fI\fP
@@ -434,6 +446,31 @@ verbose mode. print count of returned paths and all 
ioctl() return values
 set inode container's size. This is used to increase inode container's size in 
case it is
 not enough to read all the resolved results. The max value one can set is 64k.
 .RE
+.TP
+
+\fBbtrfs qgroup assign\fP \fIsrc\fP \fIdst\fP \fIpath\fP
+Enable subvolume qgroup support for a filesystem.
+.TP
+
+\fBbtrfs qgroup remove\fP \fIsrc\fP \fIdst\fP \fIpath\fP
+Remove a subvol from a quota group.
+.TP
+
+\fBbtrfs qgroup create\fP \fIqgroupid\fP \fIpath\fP
+Create a subvolume quota group.
+.TP
+
+\fBbtrfs qgroup destroy\fP \fIqgroupid\fP \fIpath\fP
+Destroy a subvolume quota group.
+.TP
+
+\fBbtrfs qgroup show\fP \fIpath\fP
+Show all subvolume quota groups.
+.TP
+
+\fBbtrfs\fP \fBqgroup limit\fP [options] \fIsize\fP|\fBnone\fP [\fIqgroupid\fP] \fIpath\fP
+Limit the size of a subvolume quota group.
+.RE
 
 .SH EXIT STATUS
 \fBbtrfs\fR returns a zero exist status if it succeeds. Non zero is returned in
-- 
1.7.11.7



[PATCH 2/2] btrfs: Cleanup some redundant codes in btrfs_log_inode()

2013-03-18 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/btrfs/tree-log.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 451fad9..83d4e1d 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3614,8 +3614,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
bool fast_search = false;
u64 ino = btrfs_ino(inode);
 
-   log = root->log_root;
-
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
-- 
1.7.11.7



[PATCH 1/2] btrfs: Cleanup some redundant codes in btrfs_lookup_csums_range()

2013-03-18 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/btrfs/file-item.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index ec16020..1ba85b4 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -356,11 +356,8 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 
btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
if (key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
-   key.type != BTRFS_EXTENT_CSUM_KEY)
-   break;
-
-   btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
-   if (key.offset > end)
+   key.type != BTRFS_EXTENT_CSUM_KEY ||
+   key.offset > end)
break;
 
if (key.offset > start)
-- 
1.7.11.7



[PATCH] btrfs-progs: update mkfs.btrfs help info for raid5/6

2013-03-03 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Since raid5/6 support was introduced, we should update mkfs.btrfs help info.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 mkfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mkfs.c b/mkfs.c
index 5ece186..f9f26a5 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -326,7 +326,7 @@ static void print_usage(void)
fprintf(stderr, "options:\n");
fprintf(stderr, "\t -A --alloc-start the offset to start the FS\n");
fprintf(stderr, "\t -b --byte-count total number of bytes in the FS\n");
-   fprintf(stderr, "\t -d --data data profile, raid0, raid1, raid10, dup or single\n");
+   fprintf(stderr, "\t -d --data data profile, raid0, raid1, raid5, raid6, raid10, dup or single\n");
fprintf(stderr, "\t -l --leafsize size of btree leaves\n");
fprintf(stderr, "\t -L --label set a label\n");
fprintf(stderr, "\t -m --metadata metadata profile, values like data profile\n");
-- 
1.7.11.7



[RFC v4+ hot_track 16/19] btrfs: add hot tracking support

2012-10-28 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Introduce one new mount option '-o hot_track',
and add its parsing support.
  Its usage looks like:
   mount -o hot_track
   mount -o nouser,hot_track
   mount -o nouser,hot_track,loop
   mount -o hot_track,nouser

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/btrfs/ctree.h |1 +
 fs/btrfs/super.c |   22 +-
 2 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c72ead8..4703178 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1756,6 +1756,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY    (1 << 20)
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR   (1 << 22)
+#define BTRFS_MOUNT_HOT_TRACK  (1 << 23)
 
 #define btrfs_clear_opt(o, opt)    ((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)  ((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 915ac14..0bcc62b 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -41,6 +41,7 @@
 #include <linux/slab.h>
 #include <linux/cleancache.h>
 #include <linux/ratelimit.h>
+#include <linux/hot_tracking.h>
 #include "compat.h"
 #include "delayed-inode.h"
 #include "ctree.h"
@@ -299,6 +300,10 @@ static void btrfs_put_super(struct super_block *sb)
 * last process that kept it busy.  Or segfault in the aforementioned
 * process...  Whom would you report that to?
 */
+
+   /* Hot data tracking */
+   if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
+   hot_track_exit(sb);
 }
 
 enum {
@@ -311,7 +316,7 @@ enum {
Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
Opt_check_integrity, Opt_check_integrity_including_extent_data,
-   Opt_check_integrity_print_mask, Opt_fatal_errors,
+   Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_hot_track,
Opt_err,
 };
 
@@ -352,6 +357,7 @@ static match_table_t tokens = {
{Opt_check_integrity_including_extent_data, "check_int_data"},
{Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
{Opt_fatal_errors, "fatal_errors=%s"},
+   {Opt_hot_track, "hot_track"},
{Opt_err, NULL},
 };
 
@@ -614,6 +620,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
goto out;
}
break;
+   case Opt_hot_track:
+   btrfs_set_opt(info->mount_opt, HOT_TRACK);
+   break;
case Opt_err:
printk(KERN_INFO "btrfs: unrecognized mount option "
   "'%s'\n", p);
@@ -841,11 +850,20 @@ static int btrfs_fill_super(struct super_block *sb,
goto fail_close;
}
 
+   if (btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
+   err = hot_track_init(sb);
+   if (err)
+   goto fail_hot;
+   }
+
save_mount_options(sb, data);
cleancache_init_fs(sb);
sb->s_flags |= MS_ACTIVE;
return 0;
 
+fail_hot:
+   dput(sb->s_root);
+   sb->s_root = NULL;
 fail_close:
close_ctree(fs_info->tree_root);
return err;
@@ -941,6 +959,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
seq_puts(seq, ",skip_balance");
if (btrfs_test_opt(root, PANIC_ON_FATAL_ERROR))
seq_puts(seq, ",fatal_errors=panic");
+   if (btrfs_test_opt(root, HOT_TRACK))
+   seq_puts(seq, ",hot_track");
return 0;
 }
 
-- 
1.7.6.5



[RFC v4+ hot_track 18/19] ext4: add hot tracking support

2012-10-28 Thread zwu . kernel
From: Zheng Liu wenqing...@taobao.com

  Define a new mount option to add VFS hot
tracking support in order to use it in ext4.

CC: Zhi Yong Wu zwu.ker...@gmail.com
Signed-off-by: Zheng Liu wenqing...@taobao.com
---
 fs/ext4/ext4.h  |3 +++
 fs/ext4/super.c |   13 -
 2 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3c20de1..f6cff1e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1298,6 +1298,9 @@ struct ext4_sb_info {
 
/* Precomputed FS UUID checksum for seeding other checksums */
__u32 s_csum_seed;
+
+   /* Enable hot tracking or not */
+   int s_hottrack_enable;
 };
 
 static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 80928f7..ba9f376 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -864,6 +864,8 @@ static void ext4_put_super(struct super_block *sb)
ext4_ext_release(sb);
ext4_xattr_put_super(sb);
 
+   if (sbi->s_hottrack_enable)
+   hot_track_exit(sb);
if (!(sb->s_flags & MS_RDONLY)) {
EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
es->s_state = cpu_to_le16(sbi->s_mount_state);
@@ -1222,7 +1224,7 @@ enum {
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
-   Opt_max_dir_size_kb,
+   Opt_max_dir_size_kb, Opt_hottrack,
 };
 
 static const match_table_t tokens = {
@@ -1297,6 +1299,7 @@ static const match_table_t tokens = {
{Opt_init_itable, "init_itable"},
{Opt_noinit_itable, "noinit_itable"},
{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
+   {Opt_hottrack, "hot_track"},
{Opt_removed, "check=none"}, /* mount option from ext2/3 */
{Opt_removed, "nocheck"},   /* mount option from ext2/3 */
{Opt_removed, "reservation"},   /* mount option from ext2/3 */
@@ -1595,6 +1598,14 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
sbi->s_li_wait_mult = arg;
} else if (token == Opt_max_dir_size_kb) {
sbi->s_max_dir_size_kb = arg;
+   } else if (token == Opt_hottrack) {
+   if (hot_track_init(sb)) {
+   ext4_msg(sb, KERN_ERR,
+   "EXT4-fs: hot tracking initialization"
+   "failed");
+   return -1;
+   }
+   sbi->s_hottrack_enable = 1;
} else if (token == Opt_stripe) {
sbi->s_stripe = arg;
} else if (m->flags & MOPT_DATAJ) {
-- 
1.7.6.5



[RFC v4+ hot_track 19/19] vfs: add documentation

2012-10-28 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add one doc for VFS hot tracking feature

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 Documentation/filesystems/00-INDEX |2 +
 Documentation/filesystems/hot_tracking.txt |  262 
 2 files changed, 264 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/hot_tracking.txt

diff --git a/Documentation/filesystems/00-INDEX 
b/Documentation/filesystems/00-INDEX
index 8c624a1..b68bdff 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -118,3 +118,5 @@ xfs.txt
- info and mount options for the XFS filesystem.
 xip.txt
- info on execute-in-place for file mappings.
+hot_tracking.txt
+   - info on hot data tracking in VFS layer
diff --git a/Documentation/filesystems/hot_tracking.txt 
b/Documentation/filesystems/hot_tracking.txt
new file mode 100644
index 000..a39a96d
--- /dev/null
+++ b/Documentation/filesystems/hot_tracking.txt
@@ -0,0 +1,262 @@
+Hot Data Tracking
+
+September, 2012    Zhi Yong Wu wu...@linux.vnet.ibm.com
+
+CONTENTS
+
+1. Introduction
+2. Motivation
+3. The Design
+4. How to Calc Frequency of Reads/Writes & Temperature
+5. Git Development Tree
+6. Usage Example
+
+
+1. Introduction
+
+  The feature adds experimental support for tracking data temperature
+information in the VFS layer.  Essentially, this means maintaining some key
+stats (like number of reads/writes, last read/write time, frequency of
+reads/writes), then distilling those numbers down to a single
+temperature value that reflects what data is hot, and using that
+temperature to move data to SSDs.
+
+  The long-term goal of the feature is to allow some FSs,
+e.g. Btrfs, to intelligently utilize SSDs in a heterogeneous volume.
+Incidentally, this project has been motivated by
+the Project Ideas page on the Btrfs wiki.
+
+  Of course, users are warned not to run this code outside of development
+environments. These patches are EXPERIMENTAL, and as such they might eat
+your data and/or memory. That said, the code should be relatively safe
+when the hot_track mount option is disabled.
+
+2. Motivation
+
+  The overall goal of enabling hot data relocation to SSD has been
+motivated by the Project Ideas page on the Btrfs wiki at
+https://btrfs.wiki.kernel.org/index.php/Project_ideas.
+The work is divided into two steps: the VFS provides the hot data tracking
+function, while a specific FS provides the hot data relocation function.
+So, as the first step toward this goal, it is hoped that the patchset
+for hot data tracking will eventually mature into the VFS.
+
+  This is essentially the traditional cache argument: SSD is fast and
+expensive; HDD is cheap but slow. ZFS, for example, can already take
+advantage of SSD caching. Btrfs should also be able to take advantage of
+hybrid storage without many broad, sweeping changes to existing code.
+
+
+3. The Design
+
+The design includes the following parts:
+
+* Hooks in existing vfs functions to track data access frequency
+
+* New radix-trees for tracking access frequency of inodes and sub-file
+ranges
+The relationship between super_block and radix-tree is as below:
+hot_info.hot_inode_tree
+Each FS instance can find its hot tracking info through s_hotinfo.
+This hot_info stores the hot tracking state, such as hot_inode_tree
+and the inode and range lists.
+
+* A list for indexing data by its temperature
+
+* A debugfs interface for dumping data from the radix-trees
+
+* A background kthread for updating inode heat info
+
+* Mount options for enabling temperature tracking (-o hot_track,
+disabled by default)
+* An ioctl to retrieve the frequency information collected for a certain
+file
+* Ioctls to enable/disable frequency tracking per inode.
+
+Let us see their relationship as below:
+
+* hot_info.hot_inode_tree indexes hot_inode_items, one per inode
+
+* hot_inode_item contains access frequency data for that inode
+
+* hot_inode_item holds a heat list node to index the access
+frequency data for that inode
+
+* hot_inode_item.hot_range_tree indexes hot_range_items for that inode
+
+* hot_range_item contains access frequency data for that range
+
+* hot_range_item holds a heat list node to index the access
+frequency data for that range
+
+* hot_info.heat_inode_map indexes per-inode heat list nodes
+
+* hot_info.heat_range_map indexes per-range heat list nodes
+
+  How about some ascii art? :) Just looking at the hot inode item case
+(the range item case is the same pattern, though), we have:
+
+heat_inode_map   hot_inode_tree
+| |
+| V
+|   +---hot_comm_item+
+|   |   frequency data   |
++---+   |list_head   |
+|   V^ | V
+| ...--hot_comm_item--...  | |  ...--hot_comm_item--...
+|   frequency data

[RFC v4+ hot_track 15/19] sysfs: add two hot_track proc files

2012-10-28 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add two proc files, hot-kick-time and hot-update-delay,
under the directory /proc/sys/fs/ in order to make
TIME_TO_KICK and HEAT_UPDATE_DELAY tunable.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c|   12 +---
 fs/hot_tracking.h|9 -
 include/linux/hot_tracking.h |7 +++
 kernel/sysctl.c  |   14 ++
 4 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 376d7fb..02ac4a2 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -28,6 +28,12 @@
 static DEFINE_SPINLOCK(hot_func_list_lock);
 static LIST_HEAD(hot_func_list);
 
+int sysctl_hot_kick_time __read_mostly = 300;
+EXPORT_SYMBOL_GPL(sysctl_hot_kick_time);
+
+int sysctl_hot_update_delay __read_mostly = 300;
+EXPORT_SYMBOL_GPL(sysctl_hot_update_delay);
+
 /* kmem_cache pointers for slab caches */
 static struct kmem_cache *hot_inode_item_cachep __read_mostly;
 static struct kmem_cache *hot_range_item_cachep __read_mostly;
@@ -417,7 +423,7 @@ static bool hot_is_obsolete(struct hot_freq_data *freq_data)
 		(cur_time - timespec_to_ns(&freq_data->last_read_time));
 	u64 last_write_ns =
 		(cur_time - timespec_to_ns(&freq_data->last_write_time));
-	u64 kick_ns =  TIME_TO_KICK * NSEC_PER_SEC;
+	u64 kick_ns =  sysctl_hot_kick_time * NSEC_PER_SEC;
 
 	if ((last_read_ns > kick_ns) && (last_write_ns > kick_ns))
 		ret = 1;
@@ -625,7 +631,7 @@ static void hot_update_worker(struct work_struct *work)
 
/* Instert next delayed work */
 	queue_delayed_work(root->update_wq, &root->update_work,
-		msecs_to_jiffies(HEAT_UPDATE_DELAY * MSEC_PER_SEC));
+		msecs_to_jiffies(sysctl_hot_update_delay * MSEC_PER_SEC));
 }
 
 /*
@@ -1316,7 +1322,7 @@ int hot_track_init(struct super_block *sb)
/* Initialize hot tracking wq and arm one delayed work */
 	INIT_DELAYED_WORK(&root->update_work, hot_update_worker);
 	queue_delayed_work(root->update_wq, &root->update_work,
-		msecs_to_jiffies(HEAT_UPDATE_DELAY * MSEC_PER_SEC));
+		msecs_to_jiffies(sysctl_hot_update_delay * MSEC_PER_SEC));
 
 	/* Register a shrinker callback */
 	root->hot_shrink.shrink = hot_track_prune;
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index f5ba2d6..095eab0 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -26,15 +26,6 @@
 
 #define FREQ_POWER 4
 
-/*
- * time to quit keeping track of
- * tracking data (seconds)
- */
-#define TIME_TO_KICK 300
-
-/* set how often to update temperatures (seconds) */
-#define HEAT_UPDATE_DELAY 300
-
 /* NRR/NRW heat unit = 2^X accesses */
 #define NRR_MULTIPLIER_POWER 20 /* NRR - number of reads since mount */
 #define NRR_COEFF_POWER 0
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index a16217f..416c988 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -123,6 +123,13 @@ struct hot_info {
 };
 
 /*
+ * Two variables have meanings as below:
+ * 1. time to quit keeping track of tracking data (seconds)
+ * 2. set how often to update temperatures (seconds)
+ */
+extern int sysctl_hot_kick_time, sysctl_hot_update_delay;
+
+/*
  * Hot data tracking ioctls:
  *
  * HOT_INFO - retrieve info on frequency of access
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 26f65ea..37624fb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1545,6 +1545,20 @@ static struct ctl_table fs_table[] = {
 		.proc_handler	= &pipe_proc_fn,
 		.extra1		= &pipe_min_size,
 	},
+	{
+		.procname	= "hot-kick-time",
+		.data		= &sysctl_hot_kick_time,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "hot-update-delay",
+		.data		= &sysctl_hot_update_delay,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
{ }
 };
 
-- 
1.7.6.5
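
With the patch above applied, the two tunables would show up as plain
integer files under /proc/sys/fs/. The following is a minimal user-space
sketch for reading or setting one of them, assuming a kernel that carries
this RFC series. read_tunable() and write_tunable() are hypothetical helper
names used only for illustration; they are not part of the patchset.

```c
#include <stdio.h>

/* Paths taken from the patch description; they exist only on a kernel
 * that carries this RFC series. */
#define HOT_KICK_TIME_PATH    "/proc/sys/fs/hot-kick-time"
#define HOT_UPDATE_DELAY_PATH "/proc/sys/fs/hot-update-delay"

/* Hypothetical helper: read one integer tunable. Returns 0 on success. */
static int read_tunable(const char *path, int *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%d", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Hypothetical helper: write one integer tunable. Returns 0 on success.
 * Writing below /proc/sys normally requires root. */
static int write_tunable(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = (fprintf(f, "%d\n", value) > 0);
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

For example, write_tunable(HOT_KICK_TIME_PATH, 600) would double the default
of 300 seconds. Both entries use proc_dointvec, which applies no range
check, so the handler as posted would also accept zero or negative values.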



[RFC v4+ hot_track 13/19] debugfs: introduce one function

2012-10-28 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  The debugfs function is used to get expected dentry.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/debugfs/inode.c  |   26 ++
 include/linux/debugfs.h |9 +
 2 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index b607d92..c6291bc 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -354,6 +354,32 @@ exit:
return dentry;
 }
 
+struct dentry *debugfs_get_dentry(const char *name,
+   struct dentry *parent, int len)
+{
+   struct dentry *dentry = NULL;
+   int error = 0;
+
+	error = simple_pin_fs(&debug_fs_type, &debugfs_mount,
+				&debugfs_mount_count);
+	if (error)
+		return NULL;
+
+	if (!parent)
+		parent = debugfs_mount->mnt_root;
+
+	mutex_lock(&parent->d_inode->i_mutex);
+	dentry = lookup_one_len(name, parent, strlen(name));
+	if (!IS_ERR(dentry)) {
+		mutex_unlock(&parent->d_inode->i_mutex);
+		return dentry;
+	}
+	mutex_unlock(&parent->d_inode->i_mutex);
+
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(debugfs_get_dentry);
+
 /**
  * debugfs_create_file - create a file in the debugfs filesystem
  * @name: a pointer to a string containing the name of the file to create.
diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
index 66c434f..8913a4d 100644
--- a/include/linux/debugfs.h
+++ b/include/linux/debugfs.h
@@ -46,6 +46,9 @@ extern struct dentry *arch_debugfs_dir;
 extern const struct file_operations debugfs_file_operations;
 extern const struct inode_operations debugfs_link_operations;
 
+struct dentry *debugfs_get_dentry(const char *name,
+   struct dentry *parent, int len);
+
 struct dentry *debugfs_create_file(const char *name, umode_t mode,
   struct dentry *parent, void *data,
   const struct file_operations *fops);
@@ -103,6 +106,12 @@ bool debugfs_initialized(void);
 
 #include <linux/err.h>
 
+static inline struct dentry *debugfs_get_dentry(const char *name,
+   struct dentry *parent, int len)
+{
+   return ERR_PTR(-ENODEV);
+}
+
 /* 
  * We do not return NULL from these functions if CONFIG_DEBUG_FS is not enabled
  * so users have a chance to detect if there was a real error or not.  We don't
-- 
1.7.6.5



[RFC v4+ hot_track 17/19] xfs: add hot tracking support

2012-10-28 Thread zwu . kernel
From: Dave Chinner dchin...@redhat.com

  Connect up the VFS hot tracking support
so XFS filesystems can make use of it.

Signed-off-by: Dave Chinner dchin...@redhat.com
---
 fs/xfs/xfs_mount.h |1 +
 fs/xfs/xfs_super.c |   16 
 2 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index deee09e..96d93c2 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -217,6 +217,7 @@ typedef struct xfs_mount {
 #define XFS_MOUNT_WSYNC		(1ULL << 0)	/* for nfs - all metadata ops
 						   must be synchronous except
 						   for space allocations */
+#define XFS_MOUNT_HOTTRACK	(1ULL << 1)	/* hot inode tracking */
 #define XFS_MOUNT_WAS_CLEAN	(1ULL << 3)
 #define XFS_MOUNT_FS_SHUTDOWN	(1ULL << 4)	/* atomic stop of all filesystem
   operations, typically for
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 26a09bd..48b3bed 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -61,6 +61,7 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/parser.h>
+#include <linux/hot_tracking.h>
 
 static const struct super_operations xfs_super_operations;
 static kmem_zone_t *xfs_ioend_zone;
@@ -114,6 +115,7 @@ mempool_t *xfs_ioend_pool;
 #define MNTOPT_NODELAYLOG  "nodelaylog"	/* Delayed logging disabled */
 #define MNTOPT_DISCARD	   "discard"	/* Discard unused blocks */
 #define MNTOPT_NODISCARD   "nodiscard"	/* Do not discard unused blocks */
+#define MNTOPT_HOTTRACK	   "hot_track"	/* hot inode tracking */
 
 /*
  * Table driven mount option parser.
@@ -371,6 +373,8 @@ xfs_parseargs(
 			mp->m_flags |= XFS_MOUNT_DISCARD;
 		} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
 			mp->m_flags &= ~XFS_MOUNT_DISCARD;
+		} else if (!strcmp(this_char, MNTOPT_HOTTRACK)) {
+			mp->m_flags |= XFS_MOUNT_HOTTRACK;
 		} else if (!strcmp(this_char, "ihashsize")) {
 			xfs_warn(mp,
 	"ihashsize no longer used, option is deprecated.");
@@ -1005,6 +1009,9 @@ xfs_fs_put_super(
 {
struct xfs_mount*mp = XFS_M(sb);
 
+	if (mp->m_flags & XFS_MOUNT_HOTTRACK)
+		hot_track_exit(sb);
+
 	xfs_filestream_unmount(mp);
 	cancel_delayed_work_sync(&mp->m_sync_work);
xfs_unmountfs(mp);
@@ -1407,7 +1414,16 @@ xfs_fs_fill_super(
goto out_unmount;
}
 
+	if (mp->m_flags & XFS_MOUNT_HOTTRACK) {
+		error = hot_track_init(sb);
+		if (error)
+			goto out_free_root;
+	}
+
 	return 0;
+ out_free_root:
+	dput(sb->s_root);
+	sb->s_root = NULL;
  out_syncd_stop:
xfs_syncd_stop(mp);
  out_filestream_unmount:
-- 
1.7.6.5



[RFC v4+ hot_track 14/19] vfs: add debugfs support

2012-10-28 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add a /sys/kernel/debug/hot_track/device_name/ directory for each
volume that contains two files. The first, `inode_stats', contains the
heat information for inodes that have been brought into the hot data map
structures. The second, `range_stats', contains similar information for
subfile ranges.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c|  484 ++
 fs/hot_tracking.h|5 +
 include/linux/hot_tracking.h |1 +
 3 files changed, 490 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 54a8208..376d7fb 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -21,6 +21,7 @@
 #include <linux/blkdev.h>
 #include <linux/types.h>
 #include <linux/list_sort.h>
+#include <linux/debugfs.h>
 #include <linux/limits.h>
 #include "hot_tracking.h"
 
@@ -628,6 +629,477 @@ static void hot_update_worker(struct work_struct *work)
 }
 
 /*
+ * take the inode, find ranges associated with inode
+ * and print each range data struct
+ */
+static struct hot_range_item
+*hot_range_tree_walk(struct hot_inode_item *he,
+		loff_t *pos, u32 start, bool flag)
+{
+	struct hot_range_item *hr_nodes[8];
+	loff_t l = *pos;
+	int i, n;
+
+	/* Walk the hot_range_tree for inode */
+	while (1) {
+		spin_lock(&he->lock);
+		n = radix_tree_gang_lookup(&he->hot_range_tree,
+					   (void **)hr_nodes, start,
+					   ARRAY_SIZE(hr_nodes));
+		if (!n) {
+			spin_unlock(&he->lock);
+			break;
+		}
+		spin_unlock(&he->lock);
+
+		start = hr_nodes[n - 1]->start + 1;
+		for (i = 0; i < n; i++) {
+			if ((!flag && !l--) || (flag)) {
+				if (flag)
+					(*pos)++;
+				kref_get(&hr_nodes[i]->hot_range.refs);
+				return hr_nodes[i];
+			}
+		}
+	}
+
+	return NULL;
+}
+
+static void
+*hot_inode_tree_walk(struct seq_file *seq, loff_t *pos,
+   u64 ino, bool type, bool flag)
+{
+	struct hot_info *root = seq->private;
+	struct hot_inode_item *hi_nodes[8];
+	struct hot_range_item *hr;
+	loff_t l = *pos;
+	int i, n;
+
+	while (1) {
+		spin_lock(&root->lock);
+		n = radix_tree_gang_lookup(&root->hot_inode_tree,
+					(void **)hi_nodes, ino,
+					ARRAY_SIZE(hi_nodes));
+		if (!n) {
+			spin_unlock(&root->lock);
+			break;
+		}
+		spin_unlock(&root->lock);
+
+		ino = hi_nodes[n - 1]->i_ino + 1;
+		for (i = 0; i < n; i++) {
+			if (!type) {
+				hr = hot_range_tree_walk(hi_nodes[i],
+						pos, 0, flag);
+				if (hr)
+					return hr;
+			} else {
+				if ((!flag && !l--) || (flag)) {
+					if (flag)
+						(*pos)++;
+					kref_get(&hi_nodes[i]->hot_inode.refs);
+					return hi_nodes[i];
+				}
+			}
+		}
+	}
+
+	return NULL;
+}
+
+static void *hot_range_seq_start(struct seq_file *seq, loff_t *pos)
+{
+   return hot_inode_tree_walk(seq, pos, 0, false, false);
+}
+
+static void *hot_range_seq_next(struct seq_file *seq,
+   void *v, loff_t *pos)
+{
+	struct hot_range_item *hr_next, *hr = v;
+	u32 start = hr->start + 1;
+
+	/* Walk the hot_range_tree for inode */
+	hr_next = hot_range_tree_walk(hr->hot_inode, pos, start, true);
+	if (hr_next)
+		return hr_next;
+
+	return hot_inode_tree_walk(seq, pos,
+			hr->hot_inode->i_ino + 1, false, true);
+}
+
+static void hot_range_seq_stop(struct seq_file *seq, void *v)
+{
+   struct hot_range_item *hr = v;
+
+   if (hr)
+   hot_range_item_put(hr);
+}
+
+static int hot_range_seq_show(struct seq_file *seq, void *v)
+{
+	struct hot_range_item *hr = v;
+	struct hot_inode_item *he = hr->hot_inode;
+	struct hot_freq_data *freq_data = &hr->hot_range.hot_freq_data;
+
+	/* Always lock hot_inode_item first */
+	spin_lock(&he->hot_inode.lock);
+	spin_lock(&hr->hot_range.lock);
+   seq_printf(seq, inode #%llu, range start  \
+   %llu (range len %u) reads %u, 

[RFC v4+ hot_track 04/19] vfs: add two map arrays

2012-10-28 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add two map arrays, each of which holds a set of lists
and is used to efficiently look up the data
temperature of a file or its ranges.
  In each list of the map arrays, the list nodes
keep track of temperature info.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c|   60 ++
 include/linux/hot_tracking.h |   16 +++
 2 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 0a7d9a3..0a603a1 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -58,6 +58,7 @@ static void hot_range_item_init(struct hot_range_item *hr, u32 start,
 	hr->hot_inode = he;
 	kref_init(&hr->hot_range.refs);
 	spin_lock_init(&hr->hot_range.lock);
+	INIT_LIST_HEAD(&hr->hot_range.n_list);
 	hr->hot_range.hot_freq_data.avg_delta_reads = (u64) -1;
 	hr->hot_range.hot_freq_data.avg_delta_writes = (u64) -1;
 	hr->hot_range.hot_freq_data.flags = FREQ_DATA_TYPE_RANGE;
@@ -88,6 +89,16 @@ static void hot_range_item_free(struct kref *kref)
struct hot_comm_item, refs);
struct hot_range_item *hr = container_of(comm_item,
struct hot_range_item, hot_range);
+	struct hot_info *root = container_of(
+		hr->hot_inode->hot_inode_tree,
+		struct hot_info, hot_inode_tree);
+
+	spin_lock(&hr->hot_range.lock);
+	if (!list_empty(&hr->hot_range.n_list)) {
+		list_del_init(&hr->hot_range.n_list);
+		root->hot_map_nr--;
+	}
+	spin_unlock(&hr->hot_range.lock);
 
 	radix_tree_delete(&hr->hot_inode->hot_range_tree, hr->start);
kmem_cache_free(hot_range_item_cachep, hr);
@@ -132,6 +143,15 @@ static void hot_inode_item_free(struct kref *kref)
struct hot_comm_item, refs);
struct hot_inode_item *he = container_of(comm_item,
struct hot_inode_item, hot_inode);
+	struct hot_info *root = container_of(he->hot_inode_tree,
+		struct hot_info, hot_inode_tree);
+
+	spin_lock(&he->hot_inode.lock);
+	if (!list_empty(&he->hot_inode.n_list)) {
+		list_del_init(&he->hot_inode.n_list);
+		root->hot_map_nr--;
+	}
+	spin_unlock(&he->hot_inode.lock);
 
 	hot_range_tree_free(he);
 	radix_tree_delete(he->hot_inode_tree, he->i_ino);
@@ -304,6 +324,44 @@ static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write)
 }
 
 /*
+ * Initialize inode and range map arrays.
+ */
+static void hot_map_array_init(struct hot_info *root)
+{
+	int i;
+	for (i = 0; i < HEAT_MAP_SIZE; i++) {
+		INIT_LIST_HEAD(&root->heat_inode_map[i].node_list);
+		INIT_LIST_HEAD(&root->heat_range_map[i].node_list);
+		root->heat_inode_map[i].temp = i;
+		root->heat_range_map[i].temp = i;
+	}
+}
+
+static void hot_map_list_free(struct list_head *node_list,
+				struct hot_info *root)
+{
+	struct list_head *pos, *next;
+	struct hot_comm_item *node;
+
+	list_for_each_safe(pos, next, node_list) {
+		node = list_entry(pos, struct hot_comm_item, n_list);
+		list_del_init(&node->n_list);
+		root->hot_map_nr--;
+	}
+
+}
+
+/* Free inode and range map arrays */
+static void hot_map_array_exit(struct hot_info *root)
+{
+	int i;
+	for (i = 0; i < HEAT_MAP_SIZE; i++) {
+		hot_map_list_free(&root->heat_inode_map[i].node_list, root);
+		hot_map_list_free(&root->heat_range_map[i].node_list, root);
+	}
+}
+
+/*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
 void __init hot_cache_init(void)
@@ -394,6 +452,7 @@ int hot_track_init(struct super_block *sb)
 
 	sb->s_hot_root = root;
 	hot_inode_tree_init(root);
+	hot_map_array_init(root);
 
 	printk(KERN_INFO "VFS: Turning on hot data tracking\n");
 
@@ -405,6 +464,7 @@ void hot_track_exit(struct super_block *sb)
 {
 	struct hot_info *root = sb->s_hot_root;
 
+	hot_map_array_exit(root);
 	hot_inode_tree_exit(root);
 	kfree(root);
 }
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index e2d6028..4f92947 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -20,6 +20,9 @@
 #include <linux/kref.h>
 #include <linux/fs.h>
 
+#define HEAT_MAP_BITS 8
+#define HEAT_MAP_SIZE (1 << HEAT_MAP_BITS)
+
 /*
  * A frequency data struct holds values that are used to
  * determine temperature of files and file ranges. These structs
@@ -36,11 +39,18 @@ struct hot_freq_data {
u32 last_temp;
 };
 
+/* List heads in hot map array */
+struct hot_map_head {
+   struct list_head node_list;
+   u8 temp;
+};
+
 /* The common info for both following structures */
 struct hot_comm_item {
struct hot_freq_data hot_freq_data;  

[RFC v4+ hot_track 06/19] vfs: add temp calculation function

2012-10-28 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c |   74 +
 1 files changed, 74 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 0a603a1..83e590c 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -323,6 +323,80 @@ static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write)
 	}
 }
 
+static u64 hot_raw_shift(u64 counter, u32 bits, bool dir)
+{
+	if (dir)
+		return counter << bits;
+	else
+		return counter >> bits;
+}
+
+/*
+ * hot_temp_calc() is responsible for distilling the six heat
+ * criteria down into a single temperature value for the data,
+ * which is an integer between 0 and HEAT_MAX_VALUE.
+ */
+static u32 hot_temp_calc(struct hot_freq_data *freq_data)
+{
+	u32 result = 0;
+
+	struct timespec ckt = current_kernel_time();
+	u64 cur_time = timespec_to_ns(&ckt);
+
+	u32 nrr_heat = (u32)hot_raw_shift((u64)freq_data->nr_reads,
+					NRR_MULTIPLIER_POWER, true);
+	u32 nrw_heat = (u32)hot_raw_shift((u64)freq_data->nr_writes,
+					NRW_MULTIPLIER_POWER, true);
+
+	u64 ltr_heat =
+	hot_raw_shift((cur_time - timespec_to_ns(&freq_data->last_read_time)),
+					LTR_DIVIDER_POWER, false);
+	u64 ltw_heat =
+	hot_raw_shift((cur_time - timespec_to_ns(&freq_data->last_write_time)),
+					LTW_DIVIDER_POWER, false);
+
+	u64 avr_heat =
+		hot_raw_shift((((u64) -1) - freq_data->avg_delta_reads),
+					AVR_DIVIDER_POWER, false);
+	u64 avw_heat =
+		hot_raw_shift((((u64) -1) - freq_data->avg_delta_writes),
+					AVW_DIVIDER_POWER, false);
+
+	/* ltr_heat is now guaranteed to be u32 safe */
+	if (ltr_heat >= hot_raw_shift((u64) 1, 32, true))
+		ltr_heat = 0;
+	else
+		ltr_heat = hot_raw_shift((u64) 1, 32, true) - ltr_heat;
+
+	/* ltw_heat is now guaranteed to be u32 safe */
+	if (ltw_heat >= hot_raw_shift((u64) 1, 32, true))
+		ltw_heat = 0;
+	else
+		ltw_heat = hot_raw_shift((u64) 1, 32, true) - ltw_heat;
+
+	/* avr_heat is now guaranteed to be u32 safe */
+	if (avr_heat >= hot_raw_shift((u64) 1, 32, true))
+		avr_heat = (u32) -1;
+
+	/* avw_heat is now guaranteed to be u32 safe */
+	if (avw_heat >= hot_raw_shift((u64) 1, 32, true))
+		avw_heat = (u32) -1;
+
+	nrr_heat = (u32)hot_raw_shift((u64)nrr_heat,
+				(3 - NRR_COEFF_POWER), false);
+	nrw_heat = (u32)hot_raw_shift((u64)nrw_heat,
+				(3 - NRW_COEFF_POWER), false);
+	ltr_heat = hot_raw_shift(ltr_heat, (3 - LTR_COEFF_POWER), false);
+	ltw_heat = hot_raw_shift(ltw_heat, (3 - LTW_COEFF_POWER), false);
+	avr_heat = hot_raw_shift(avr_heat, (3 - AVR_COEFF_POWER), false);
+	avw_heat = hot_raw_shift(avw_heat, (3 - AVW_COEFF_POWER), false);
+
+	result = nrr_heat + nrw_heat + (u32) ltr_heat +
+		(u32) ltw_heat + (u32) avr_heat + (u32) avw_heat;
+
+	return result;
+}
+
 /*
  * Initialize inode and range map arrays.
  */
-- 
1.7.6.5
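
To make the arithmetic in hot_temp_calc() easier to follow, here is a
user-space sketch of the same six-term calculation. It is an illustration
under stated assumptions, not the kernel code: struct freq_sample and the
helpers count_heat(), age_heat(), delta_heat() and temp_calc() are
hypothetical names, the ages are passed in directly rather than derived
from current_kernel_time(), and the shift powers are copied from the
fs/hot_tracking.h hunk posted in patch 07/19.

```c
#include <stdint.h>

/* Shift powers as posted in the fs/hot_tracking.h hunk of patch 07/19. */
#define NRR_MULTIPLIER_POWER 20
#define NRR_COEFF_POWER 0
#define NRW_MULTIPLIER_POWER 20
#define NRW_COEFF_POWER 0
#define LTR_DIVIDER_POWER 30
#define LTR_COEFF_POWER 1
#define LTW_DIVIDER_POWER 30
#define LTW_COEFF_POWER 1
#define AVR_DIVIDER_POWER 40
#define AVR_COEFF_POWER 0
#define AVW_DIVIDER_POWER 40
#define AVW_COEFF_POWER 0

/* Hypothetical stand-in for struct hot_freq_data. */
struct freq_sample {
	uint32_t nr_reads, nr_writes;
	uint64_t read_age_ns, write_age_ns;      /* now - last access time */
	uint64_t avg_delta_reads, avg_delta_writes;
};

/* Access count term: scale up, truncate to 32 bits, then weight. */
static uint32_t count_heat(uint32_t count, unsigned mult, unsigned coeff)
{
	uint32_t scaled = (uint32_t)((uint64_t)count << mult);

	return scaled >> (3 - coeff);
}

/* Recency term: the older the last access, the lower the heat. */
static uint32_t age_heat(uint64_t age_ns, unsigned div, unsigned coeff)
{
	uint64_t units = age_ns >> div;
	uint64_t heat = units >= (1ULL << 32) ? 0 : (1ULL << 32) - units;

	return (uint32_t)(heat >> (3 - coeff));
}

/* Frequency term: the smaller the average delta, the higher the heat. */
static uint32_t delta_heat(uint64_t avg_delta_ns, unsigned div, unsigned coeff)
{
	uint64_t heat = (UINT64_MAX - avg_delta_ns) >> div;

	if (heat >= (1ULL << 32))
		heat = UINT32_MAX;
	return (uint32_t)(heat >> (3 - coeff));
}

static uint32_t temp_calc(const struct freq_sample *s)
{
	return count_heat(s->nr_reads, NRR_MULTIPLIER_POWER, NRR_COEFF_POWER) +
	       count_heat(s->nr_writes, NRW_MULTIPLIER_POWER, NRW_COEFF_POWER) +
	       age_heat(s->read_age_ns, LTR_DIVIDER_POWER, LTR_COEFF_POWER) +
	       age_heat(s->write_age_ns, LTW_DIVIDER_POWER, LTW_COEFF_POWER) +
	       delta_heat(s->avg_delta_reads, AVR_DIVIDER_POWER, AVR_COEFF_POWER) +
	       delta_heat(s->avg_delta_writes, AVW_DIVIDER_POWER, AVW_COEFF_POWER);
}
```

With these powers, one read contributes 2^17 while a just-touched item
contributes 2^30 per recency term, so recency dominates the result. One
thing the sketch makes visible: because of the 32-bit truncation after the
left shift, the count term wraps back to zero once a counter reaches 4096
(2^12 << 20 == 2^32).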



[RFC v4+ hot_track 07/19] vfs: add map info update function

2012-10-28 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c |   66 +
 fs/hot_tracking.h |   21 +
 2 files changed, 87 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 83e590c..9245dd3 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -398,6 +398,72 @@ static u32 hot_temp_calc(struct hot_freq_data *freq_data)
 }
 
 /*
+ * Calculate a new temperature and, if necessary,
+ * move the list_head corresponding to this inode or range
+ * to the proper list with the new temperature
+ */
+static void hot_map_array_update(struct hot_freq_data *freq_data,
+   struct hot_info *root)
+{
+   struct hot_map_head *buckets, *cur_bucket;
+   struct hot_comm_item *comm_item;
+   struct hot_inode_item *he;
+   struct hot_range_item *hr;
+	u32 temp = hot_temp_calc(freq_data);
+	u8 a_temp = temp >> (32 - HEAT_MAP_BITS);
+	u8 b_temp = freq_data->last_temp >> (32 - HEAT_MAP_BITS);
+
+	comm_item = container_of(freq_data,
+			struct hot_comm_item, hot_freq_data);
+
+	if (freq_data->flags & FREQ_DATA_TYPE_INODE) {
+		he = container_of(comm_item,
+			struct hot_inode_item, hot_inode);
+		buckets = root->heat_inode_map;
+
+		if (he == NULL)
+			return;
+
+		spin_lock(&he->hot_inode.lock);
+		if (list_empty(&he->hot_inode.n_list) || (a_temp != b_temp)) {
+			if (!list_empty(&he->hot_inode.n_list)) {
+				list_del_init(&he->hot_inode.n_list);
+				root->hot_map_nr--;
+			}
+
+			cur_bucket = buckets + a_temp;
+			list_add_tail(&he->hot_inode.n_list,
+					&cur_bucket->node_list);
+			root->hot_map_nr++;
+			freq_data->last_temp = temp;
+		}
+		spin_unlock(&he->hot_inode.lock);
+	} else if (freq_data->flags & FREQ_DATA_TYPE_RANGE) {
+		hr = container_of(comm_item,
+			struct hot_range_item, hot_range);
+		buckets = root->heat_range_map;
+
+		if (hr == NULL)
+			return;
+
+		spin_lock(&hr->hot_range.lock);
+		if (list_empty(&hr->hot_range.n_list) || (a_temp != b_temp)) {
+			if (!list_empty(&hr->hot_range.n_list)) {
+				list_del_init(&hr->hot_range.n_list);
+				root->hot_map_nr--;
+			}
+
+			cur_bucket = buckets + a_temp;
+			list_add_tail(&hr->hot_range.n_list,
+					&cur_bucket->node_list);
+			root->hot_map_nr++;
+			freq_data->last_temp = temp;
+		}
+		spin_unlock(&hr->hot_range.lock);
+	}
+}
+
+/*
  * Initialize inode and range map arrays.
  */
 static void hot_map_array_init(struct hot_info *root)
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index cc4666e..196b894 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -26,6 +26,27 @@
 
 #define FREQ_POWER 4
 
+/* NRR/NRW heat unit = 2^X accesses */
+#define NRR_MULTIPLIER_POWER 20 /* NRR - number of reads since mount */
+#define NRR_COEFF_POWER 0
+#define NRW_MULTIPLIER_POWER 20 /* NRW - number of writes since mount */
+#define NRW_COEFF_POWER 0
+
+/* LTR/LTW heat unit = 2^X ns of age */
+#define LTR_DIVIDER_POWER 30 /* LTR - time elapsed since last read(ns) */
+#define LTR_COEFF_POWER 1
+#define LTW_DIVIDER_POWER 30 /* LTW - time elapsed since last write(ns) */
+#define LTW_COEFF_POWER 1
+
+/*
+ * AVR/AVW cold unit = 2^X ns of average delta
+ * AVR/AVW heat unit = HEAT_MAX_VALUE - cold unit
+ */
+#define AVR_DIVIDER_POWER 40 /* AVR - average delta between recent reads(ns) */
+#define AVR_COEFF_POWER 0
+#define AVW_DIVIDER_POWER 40 /* AVW - average delta between recent writes(ns) */
+#define AVW_COEFF_POWER 0
+
 void hot_inode_item_put(struct hot_inode_item *he);
 struct hot_inode_item *hot_inode_item_find(struct hot_info *root, u64 ino);
 
-- 
1.7.6.5
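
The bucket selection in hot_map_array_update() boils down to taking the top
HEAT_MAP_BITS bits of the 32-bit temperature. A small user-space sketch of
that mapping follows, assuming HEAT_MAP_BITS is 8 as posted in patch 04/19;
heat_bucket() and bucket_changed() are hypothetical helper names used only
for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define HEAT_MAP_BITS 8
#define HEAT_MAP_SIZE (1 << HEAT_MAP_BITS)

/* The top HEAT_MAP_BITS bits of the temperature select one of the
 * HEAT_MAP_SIZE buckets (a_temp / b_temp in the patch). */
static uint8_t heat_bucket(uint32_t temp)
{
	return (uint8_t)(temp >> (32 - HEAT_MAP_BITS));
}

/* Mirrors the "a_temp != b_temp" test: an item already on a list is
 * moved only when its new temperature lands in a different bucket. */
static bool bucket_changed(uint32_t last_temp, uint32_t new_temp)
{
	return heat_bucket(last_temp) != heat_bucket(new_temp);
}
```

Each bucket therefore spans 2^24 temperature values. Note that in the patch
as posted, last_temp is refreshed only when the item is (re)inserted into a
list, so it records the temperature at the last move rather than the most
recently calculated value.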



[RFC v4+ hot_track 12/19] vfs: add one ioctl interface

2012-10-28 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  FS_IOC_GET_HEAT_INFO: return a struct containing the various
metrics collected in hot_freq_data structs, and also return a
calculated data temperature based on those metrics. Optionally, retrieve
the temperature from the hot data hash list instead of recalculating it.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/compat_ioctl.c|5 +++
 fs/ioctl.c   |   78 ++
 include/linux/hot_tracking.h |   19 ++
 3 files changed, 102 insertions(+), 0 deletions(-)

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index 4c6285f..ad1d603 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -57,6 +57,7 @@
 #include <linux/i2c-dev.h>
 #include <linux/atalk.h>
 #include <linux/gfp.h>
+#include <linux/hot_tracking.h>
 
 #include <net/bluetooth/bluetooth.h>
 #include <net/bluetooth/hci.h>
@@ -1400,6 +1401,9 @@ COMPATIBLE_IOCTL(TIOCSTART)
 COMPATIBLE_IOCTL(TIOCSTOP)
 #endif
 
+/*Hot data tracking*/
+COMPATIBLE_IOCTL(FS_IOC_GET_HEAT_INFO)
+
 /* fat 'r' ioctls. These are handled by fat with -compat_ioctl,
but we don't want warnings on other file systems. So declare
them as compatible here. */
@@ -1579,6 +1583,7 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd,
 	case FIBMAP:
 	case FIGETBSZ:
 	case FIONREAD:
+	case FS_IOC_GET_HEAT_INFO:
 		if (S_ISREG(f.file->f_path.dentry->d_inode->i_mode))
 			break;
 		/*FALL THROUGH*/
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 3bdad6d..f0e225e 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -15,6 +15,7 @@
 #include <linux/writeback.h>
 #include <linux/buffer_head.h>
 #include <linux/falloc.h>
+#include "hot_tracking.h"
 
 #include <asm/ioctls.h>
 
@@ -537,6 +538,80 @@ static int ioctl_fsthaw(struct file *filp)
 }
 
 /*
+ * Retrieve information about access frequency for the given file. Return it in
+ * a userspace-friendly struct for btrfsctl (or another tool) to parse.
+ *
+ * The temperature that is returned can be live -- that is, recalculated when
+ * the ioctl is called -- or it can be returned from the hashtable, reflecting
+ * the (possibly old) value that the system will use when considering files
+ * for migration. This behavior is determined by hot_heat_info-live.
+ */
+static int ioctl_heat_info(struct file *file, void __user *argp)
+{
+	struct inode *inode = file->f_dentry->d_inode;
+	struct hot_heat_info *heat_info;
+	struct hot_inode_item *he;
+	int ret = 0;
+
+	heat_info = kmalloc(sizeof(struct hot_heat_info),
+			GFP_KERNEL | GFP_NOFS);
+
+	if (copy_from_user((void *) heat_info,
+			argp,
+			sizeof(struct hot_heat_info)) != 0) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	he = hot_inode_item_find(inode->i_sb->s_hot_root, inode->i_ino);
+	if (!he) {
+		/* we don't have any info on this file yet */
+		ret = -ENODATA;
+		goto err;
+	}
+
+	spin_lock(&he->hot_inode.lock);
+	heat_info->avg_delta_reads =
+		(__u64) he->hot_inode.hot_freq_data.avg_delta_reads;
+	heat_info->avg_delta_writes =
+		(__u64) he->hot_inode.hot_freq_data.avg_delta_writes;
+	heat_info->last_read_time =
+		(__u64) timespec_to_ns(&he->hot_inode.hot_freq_data.last_read_time);
+	heat_info->last_write_time =
+		(__u64) timespec_to_ns(&he->hot_inode.hot_freq_data.last_write_time);
+	heat_info->num_reads =
+		(__u32) he->hot_inode.hot_freq_data.nr_reads;
+	heat_info->num_writes =
+		(__u32) he->hot_inode.hot_freq_data.nr_writes;
+
+	if (heat_info->live > 0) {
+		/*
+		 * got a request for live temperature,
+		 * call hot_hash_calc_temperature to recalculate
+		 */
+		heat_info->temp =
+			inode->i_sb->s_hot_root->hot_func_type->ops.hot_temp_calc_fn(
+				&he->hot_inode.hot_freq_data);
+	} else {
+		/* not live temperature, get it from the hashlist */
+		heat_info->temp = he->hot_inode.hot_freq_data.last_temp;
+	}
+	spin_unlock(&he->hot_inode.lock);
+
+	hot_inode_item_put(he);
+
+	if (copy_to_user(argp, (void *) heat_info,
+			sizeof(struct hot_heat_info))) {
+		ret = -EFAULT;
+		goto err;
+	}
+   }
+
+err:
+   kfree(heat_info);
+   return ret;
+}
+
+/*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
  *
@@ -591,6 +666,9 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, 
unsigned int cmd,
case FIGETBSZ:
return put_user(inode-i_sb-s_blocksize, argp);
 
+   case FS_IOC_GET_HEAT_INFO:
+   return ioctl_heat_info(filp, 

[RFC v4+ hot_track 09/19] vfs: add one work queue

2012-10-28 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add a per-superblock workqueue and a delayed_work
to run periodic work to update map info on each superblock.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c|   85 ++
 fs/hot_tracking.h|3 +
 include/linux/hot_tracking.h |3 +
 3 files changed, 91 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index fff0038..0ef9cad 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -15,9 +15,12 @@
 #include <linux/module.h>
 #include <linux/spinlock.h>
 #include <linux/hardirq.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
 #include <linux/fs.h>
 #include <linux/blkdev.h>
 #include <linux/types.h>
+#include <linux/list_sort.h>
 #include <linux/limits.h>
 #include "hot_tracking.h"
 
@@ -557,6 +560,67 @@ static void hot_map_array_exit(struct hot_info *root)
}
 }
 
+/* Temperature compare function*/
+static int hot_temp_cmp(void *priv, struct list_head *a,
+   struct list_head *b)
+{
+   struct hot_comm_item *ap =
+   container_of(a, struct hot_comm_item, n_list);
+   struct hot_comm_item *bp =
+   container_of(b, struct hot_comm_item, n_list);
+
+   int diff = ap-hot_freq_data.last_temp
+   - bp-hot_freq_data.last_temp;
+   if (diff  0)
+   return -1;
+   if (diff  0)
+   return 1;
+   return 0;
+}
+
+/*
+ * Every sync period we update temperatures for
+ * each hot inode item and hot range item for aging
+ * purposes.
+ */
+static void hot_update_worker(struct work_struct *work)
+{
+   struct hot_info *root = container_of(to_delayed_work(work),
+   struct hot_info, update_work);
+   struct hot_inode_item *hi_nodes[8];
+   u64 ino = 0;
+   int i, n;
+
+   while (1) {
+   n = radix_tree_gang_lookup(&root->hot_inode_tree,
+  (void **)hi_nodes, ino,
+  ARRAY_SIZE(hi_nodes));
+   if (!n)
+   break;
+
+   ino = hi_nodes[n - 1]->i_ino + 1;
+   for (i = 0; i < n; i++) {
+   kref_get(&hi_nodes[i]->hot_inode.refs);
+   hot_map_array_update(
+   &hi_nodes[i]->hot_inode.hot_freq_data, root);
+   hot_range_update(hi_nodes[i], root);
+   hot_inode_item_put(hi_nodes[i]);
+   }
+   }
+
+   /* Sort temperature map info */
+   for (i = 0; i < HEAT_MAP_SIZE; i++) {
+   list_sort(NULL, &root->heat_inode_map[i].node_list,
+   hot_temp_cmp);
+   list_sort(NULL, &root->heat_range_map[i].node_list,
+   hot_temp_cmp);
+   }
+
+   /* Insert next delayed work */
+   queue_delayed_work(root->update_wq, &root->update_work,
+   msecs_to_jiffies(HEAT_UPDATE_DELAY * MSEC_PER_SEC));
+}
+
 /*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
@@ -650,9 +714,28 @@ int hot_track_init(struct super_block *sb)
hot_inode_tree_init(root);
hot_map_array_init(root);
 
+   root->update_wq = alloc_workqueue(
+   "hot_update_wq", WQ_NON_REENTRANT, 0);
+   if (!root->update_wq) {
+   printk(KERN_ERR "%s: Failed to create "
+   "hot update workqueue\n", __func__);
+   goto failed_wq;
+   }
+
+   /* Initialize hot tracking wq and arm one delayed work */
+   INIT_DELAYED_WORK(&root->update_work, hot_update_worker);
+   queue_delayed_work(root->update_wq, &root->update_work,
+   msecs_to_jiffies(HEAT_UPDATE_DELAY * MSEC_PER_SEC));
+
printk(KERN_INFO "VFS: Turning on hot data tracking\n");
 
return 0;
+
+failed_wq:
+   hot_map_array_exit(root);
+   hot_inode_tree_exit(root);
+   kfree(root);
+   return ret;
 }
 EXPORT_SYMBOL_GPL(hot_track_init);
 
@@ -660,6 +743,8 @@ void hot_track_exit(struct super_block *sb)
 {
struct hot_info *root = sb->s_hot_root;
 
+   cancel_delayed_work_sync(&root->update_work);
+   destroy_workqueue(root->update_wq);
hot_map_array_exit(root);
hot_inode_tree_exit(root);
kfree(root);
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index f5ec05a..92e31fb 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -32,6 +32,9 @@
  */
 #define TIME_TO_KICK 300
 
+/* set how often to update temperatures (seconds) */
+#define HEAT_UPDATE_DELAY 300
+
 /* NRR/NRW heat unit = 2^X accesses */
 #define NRR_MULTIPLIER_POWER 20 /* NRR - number of reads since mount */
 #define NRR_COEFF_POWER 0
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 4f92947..2ee0d02 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ 

[RFC v4+ hot_track 10/19] vfs: introduce hot func register framework

2012-10-28 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Introduce a framework that lets a specific FS
register its own hot tracking functions.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
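For illustration only: the diff below keeps registered function sets on a name-keyed list, rejects duplicate names with -EBUSY, and falls back to a built-in default when a lookup finds no match. A minimal userspace sketch of that registry pattern follows. `func_type`, `func_register` and `func_get` are hypothetical names, and the spinlock that guards the list in the patch is omitted here, so this version is not thread-safe.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a registered function set. */
struct func_type {
    const char *name;
    struct func_type *next;
};

static struct func_type default_type = { "hot_type_def", NULL };
static struct func_type *registry; /* head of a singly linked list */

/* Register t; refuse a name that is already present. */
static int func_register(struct func_type *t)
{
    assert(t != NULL && t->name != NULL);
    for (struct func_type *f = registry; f; f = f->next)
        if (strcmp(f->name, t->name) == 0)
            return -EBUSY;
    t->next = registry;
    registry = t;
    return 0;
}

/* Look up by name; return the default when nothing matches. */
static struct func_type *func_get(const char *name)
{
    for (struct func_type *f = registry; f; f = f->next)
        if (strcmp(f->name, name) == 0)
            return f;
    return &default_type;
}
```

Returning the default from the lookup means callers never have to handle a missing entry, which matches how the patched call sites invoke the ops table unconditionally.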
 fs/hot_tracking.c|   78 ++
 include/linux/hot_tracking.h |   25 +
 2 files changed, 96 insertions(+), 7 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 0ef9cad..c6c6138 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -24,6 +24,9 @@
 #include linux/limits.h
 #include hot_tracking.h
 
+static DEFINE_SPINLOCK(hot_func_list_lock);
+static LIST_HEAD(hot_func_list);
+
 /* kmem_cache pointers for slab caches */
 static struct kmem_cache *hot_inode_item_cachep __read_mostly;
 static struct kmem_cache *hot_range_item_cachep __read_mostly;
@@ -305,20 +308,23 @@ static u64 hot_average_update(struct timespec old_atime,
return new_avg;
 }
 
-static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write)
+static void hot_freq_data_update(struct hot_info *root,
+   struct hot_freq_data *freq_data, bool write)
 {
struct timespec cur_time = current_kernel_time();
 
if (write) {
freq_data->nr_writes += 1;
-   freq_data->avg_delta_writes = hot_average_update(
+   freq_data->avg_delta_writes =
+   root->hot_func_type->ops.hot_rw_freq_calc_fn(
freq_data->last_write_time,
cur_time,
freq_data->avg_delta_writes);
freq_data->last_write_time = cur_time;
} else {
freq_data->nr_reads += 1;
-   freq_data->avg_delta_reads = hot_average_update(
+   freq_data->avg_delta_reads =
+   root->hot_func_type->ops.hot_rw_freq_calc_fn(
freq_data->last_read_time,
cur_time,
freq_data->avg_delta_reads);
@@ -430,7 +436,7 @@ static void hot_map_array_update(struct hot_freq_data 
*freq_data,
struct hot_comm_item *comm_item;
struct hot_inode_item *he;
struct hot_range_item *hr;
-   u32 temp = hot_temp_calc(freq_data);
+   u32 temp = root-hot_func_type-ops.hot_temp_calc_fn(freq_data);
u8 a_temp = temp  (32 - HEAT_MAP_BITS);
u8 b_temp = freq_data-last_temp  (32 - HEAT_MAP_BITS);
 
@@ -511,7 +517,7 @@ static void hot_range_update(struct hot_inode_item *he,
hr_nodes[i]-hot_range.hot_freq_data, root);
 
spin_lock(hr_nodes[i]-hot_range.lock);
-   obsolete = hot_is_obsolete(
+   obsolete = root-hot_func_type-ops.hot_is_obsolete_fn(
hr_nodes[i]-hot_range.hot_freq_data);
spin_unlock(hr_nodes[i]-hot_range.lock);
 
@@ -668,7 +674,7 @@ void hot_update_freqs(struct inode *inode, u64 start,
}
 
spin_lock(he-hot_inode.lock);
-   hot_freq_data_update(he-hot_inode.hot_freq_data, rw);
+   hot_freq_data_update(root, he-hot_inode.hot_freq_data, rw);
spin_unlock(he-hot_inode.lock);
 
/*
@@ -685,7 +691,7 @@ void hot_update_freqs(struct inode *inode, u64 start,
}
 
spin_lock(hr-hot_range.lock);
-   hot_freq_data_update(hr-hot_range.hot_freq_data, rw);
+   hot_freq_data_update(root, hr-hot_range.hot_freq_data, rw);
spin_unlock(hr-hot_range.lock);
 
hot_range_item_put(hr);
@@ -695,6 +701,61 @@ void hot_update_freqs(struct inode *inode, u64 start,
 }
 EXPORT_SYMBOL_GPL(hot_update_freqs);
 
+static struct hot_func_type hot_func_def = {
+   .hot_func_name = "hot_type_def",
+   .ops = {
+   .hot_rw_freq_calc_fn = hot_average_update,
+   .hot_temp_calc_fn= hot_temp_calc,
+   .hot_is_obsolete_fn  = hot_is_obsolete,
+   },
+};
+
+static struct hot_func_type *hot_func_get(const char *name)
+{
+   struct hot_func_type *f, *h = &hot_func_def;
+
+   spin_lock(&hot_func_list_lock);
+   list_for_each_entry(f, &hot_func_list, list) {
+   if (!strcmp(f->hot_func_name, name))
+   h = f;
+   }
+   spin_unlock(&hot_func_list_lock);
+
+   return h;
+}
+
+int hot_func_register(struct hot_func_type *h)
+{
+   struct hot_func_type *f, *t = NULL;
+
+   /* register, don't allow duplicate names */
+   spin_lock(&hot_func_list_lock);
+   list_for_each_entry(f, &hot_func_list, list) {
+   if (!strcmp(f->hot_func_name, h->hot_func_name))
+   t = f;
+   }
+
+   if (t) {
+   spin_unlock(&hot_func_list_lock);
+   return -EBUSY;
+   }
+
+   list_add_tail(&h->list, &hot_func_list);
+   spin_unlock(&hot_func_list_lock);
+
+   return 0;
+}

[RFC v4+ hot_track 05/19] vfs: add hooks to enable hot tracking

2012-10-28 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add hooks to the direct I/O, page read, readahead and
writeback paths so that each access updates the hot data
frequencies, and generally make the hot data functions a
bit more friendly.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
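For illustration only: the hooks in the diff below pass a byte offset and a byte length to hot_update_freqs(), derived from page indexes and from the writeback budget consumed. A minimal sketch of that arithmetic follows, assuming 4 KiB pages. The `SKETCH_` constants and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Byte offset of the first byte of a page, given its index. */
static uint64_t page_to_offset(uint64_t index)
{
    return index << SKETCH_PAGE_SHIFT;
}

/*
 * Bytes written in one writeback call: the number of pages taken
 * from the budget (before minus after) times the page size.
 */
static uint64_t bytes_written(long budget_before, long budget_after)
{
    assert(budget_before >= budget_after);
    return (uint64_t)(budget_before - budget_after) * SKETCH_PAGE_SIZE;
}
```

For example, a writeback call that starts with a budget of 1024 pages and returns with 1000 left is accounted as 24 pages, or 98304 bytes, of write activity.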
 fs/direct-io.c  |6 ++
 mm/filemap.c|6 ++
 mm/page-writeback.c |   12 
 mm/readahead.c  |6 ++
 4 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index f86c720..1d23631 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -37,6 +37,7 @@
 #include <linux/uio.h>
 #include <linux/atomic.h>
 #include <linux/prefetch.h>
+#include "hot_tracking.h"
 
 /*
  * How many user pages to map in one call to get_user_pages().  This determines
@@ -1297,6 +1298,11 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct 
inode *inode,
prefetch(&bdev->bd_queue);
prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);
 
+   /* Hot data tracking */
+   hot_update_freqs(inode, (u64)offset,
+   (u64)iov_length(iov, nr_segs),
+   rw & WRITE);
+
return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
 nr_segs, get_block, end_io,
 submit_io, flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..51b2c48 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
+#include <linux/hot_tracking.h>
 #include "internal.h"
 
 /*
@@ -1224,6 +1225,11 @@ readpage:
 * PG_error will be set again if readpage fails.
 */
ClearPageError(page);
+
+   /* Hot data tracking */
+   hot_update_freqs(inode, (u64)page->index << PAGE_CACHE_SHIFT,
+   PAGE_CACHE_SIZE, 0);
+
/* Start the actual read. The read will unlock the page. */
error = mapping->a_ops->readpage(filp, page);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 830893b..5220040 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -35,6 +35,7 @@
 #include <linux/buffer_head.h> /* __set_page_dirty_buffers */
 #include <linux/pagevec.h>
 #include <linux/timer.h>
+#include <linux/hot_tracking.h>
 #include <trace/events/writeback.h>
 
 /*
@@ -1903,13 +1904,24 @@ EXPORT_SYMBOL(generic_writepages);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
int ret;
+   pgoff_t start = 0;
+   u64 count = 0;
 
if (wbc->nr_to_write <= 0)
return 0;
+
+   start = mapping->writeback_index << PAGE_CACHE_SHIFT;
+   count = (u64)wbc->nr_to_write;
+
if (mapping->a_ops->writepages)
ret = mapping->a_ops->writepages(mapping, wbc);
else
ret = generic_writepages(mapping, wbc);
+
+   /* Hot data tracking */
+   hot_update_freqs(mapping->host, (u64)start,
+   (count - (u64)wbc->nr_to_write) * PAGE_CACHE_SIZE, 1);
+
return ret;
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 7963f23..8a24f1e 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -19,6 +19,7 @@
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
 #include <linux/file.h>
+#include <linux/hot_tracking.h>
 
 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
@@ -138,6 +139,11 @@ static int read_pages(struct address_space *mapping, 
struct file *filp,
 out:
blk_finish_plug(&plug);
 
+   /* Hot data tracking */
+   hot_update_freqs(mapping->host, (u64)(list_entry(pages->prev,\
+   struct page, lru)->index) << PAGE_CACHE_SHIFT,
+   (u64)nr_pages * PAGE_CACHE_SIZE, 0);
+
return ret;
 }
 
-- 
1.7.6.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC v4+ hot_track 03/19] vfs: add I/O frequency update function

2012-10-28 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add some util helpers to update access frequencies
for one file or its range.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
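For illustration only: hot_average_update() in the diff below keeps a running average of the time between accesses using shifts by FREQ_POWER, but the shift operators are lost in this archive copy. The following is therefore a generic shift-based exponential moving average, not a reconstruction of the patch's exact arithmetic. `ema_update` and its `shift` parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Generic integer EMA: the old average keeps a weight of
 * (1 - 1/2^shift) and the new sample gets a weight of 1/2^shift.
 */
static uint64_t ema_update(uint64_t avg, uint64_t sample, unsigned shift)
{
    assert(shift < 64);
    return avg - (avg >> shift) + (sample >> shift);
}
```

With a shift of 4, an average of 1600 ns and a new delta of 3200 ns give 1600 - 100 + 200 = 1700 ns, so a single slow access moves the average only a sixteenth of the way toward the new sample.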
 fs/hot_tracking.c|  179 ++
 fs/hot_tracking.h|7 ++
 include/linux/hot_tracking.h |2 +
 3 files changed, 188 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 68591f0..0a7d9a3 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -172,6 +172,137 @@ static void hot_inode_tree_exit(struct hot_info *root)
}
 }
 
+struct hot_inode_item
+*hot_inode_item_find(struct hot_info *root, u64 ino)
+{
+   struct hot_inode_item *he;
+   int ret;
+
+again:
+   spin_lock(root-lock);
+   he = radix_tree_lookup(root-hot_inode_tree, ino);
+   if (he) {
+   kref_get(he-hot_inode.refs);
+   spin_unlock(root-lock);
+   return he;
+   }
+   spin_unlock(root-lock);
+
+   he = kmem_cache_zalloc(hot_inode_item_cachep,
+   GFP_KERNEL | GFP_NOFS);
+   if (!he)
+   return ERR_PTR(-ENOMEM);
+
+   hot_inode_item_init(he, ino, root-hot_inode_tree);
+
+   ret = radix_tree_preload(GFP_NOFS  ~__GFP_HIGHMEM);
+   if (ret) {
+   kmem_cache_free(hot_inode_item_cachep, he);
+   return ERR_PTR(ret);
+   }
+
+   spin_lock(root-lock);
+   ret = radix_tree_insert(root-hot_inode_tree, ino, he);
+   if (ret == -EEXIST) {
+   kmem_cache_free(hot_inode_item_cachep, he);
+   spin_unlock(root-lock);
+   radix_tree_preload_end();
+   goto again;
+   }
+   spin_unlock(root-lock);
+   radix_tree_preload_end();
+
+   kref_get(he-hot_inode.refs);
+   return he;
+}
+EXPORT_SYMBOL_GPL(hot_inode_item_find);
+
+static struct hot_range_item
+*hot_range_item_find(struct hot_inode_item *he,
+   u32 start)
+{
+   struct hot_range_item *hr;
+   int ret;
+
+again:
+   spin_lock(he-lock);
+   hr = radix_tree_lookup(he-hot_range_tree, start);
+   if (hr) {
+   kref_get(hr-hot_range.refs);
+   spin_unlock(he-lock);
+   return hr;
+   }
+   spin_unlock(he-lock);
+
+   hr = kmem_cache_zalloc(hot_range_item_cachep,
+   GFP_KERNEL | GFP_NOFS);
+   if (!hr)
+   return ERR_PTR(-ENOMEM);
+
+   hot_range_item_init(hr, start, he);
+
+   ret = radix_tree_preload(GFP_NOFS  ~__GFP_HIGHMEM);
+   if (ret) {
+   kmem_cache_free(hot_range_item_cachep, hr);
+   return ERR_PTR(ret);
+   }
+
+   spin_lock(he-lock);
+   ret = radix_tree_insert(he-hot_range_tree, start, hr);
+   if (ret == -EEXIST) {
+   kmem_cache_free(hot_range_item_cachep, hr);
+   spin_unlock(he-lock);
+   radix_tree_preload_end();
+   goto again;
+   }
+   spin_unlock(he-lock);
+   radix_tree_preload_end();
+
+   kref_get(hr-hot_range.refs);
+   return hr;
+}
+
+/*
+ * This function does the actual work of updating
+ * the frequency numbers, whatever they turn out to be.
+ */
+static u64 hot_average_update(struct timespec old_atime,
+   struct timespec cur_time, u64 old_avg)
+{
+   struct timespec delta_ts;
+   u64 new_avg;
+   u64 new_delta;
+
+   delta_ts = timespec_sub(cur_time, old_atime);
+   new_delta = timespec_to_ns(delta_ts)  FREQ_POWER;
+
+   new_avg = (old_avg  FREQ_POWER) - old_avg + new_delta;
+   new_avg = new_avg  FREQ_POWER;
+
+   return new_avg;
+}
+
+static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write)
+{
+   struct timespec cur_time = current_kernel_time();
+
+   if (write) {
+   freq_data-nr_writes += 1;
+   freq_data-avg_delta_writes = hot_average_update(
+   freq_data-last_write_time,
+   cur_time,
+   freq_data-avg_delta_writes);
+   freq_data-last_write_time = cur_time;
+   } else {
+   freq_data-nr_reads += 1;
+   freq_data-avg_delta_reads = hot_average_update(
+   freq_data-last_read_time,
+   cur_time,
+   freq_data-avg_delta_reads);
+   freq_data-last_read_time = cur_time;
+   }
+}
+
 /*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
@@ -199,6 +330,54 @@ err:
 EXPORT_SYMBOL_GPL(hot_cache_init);
 
 /*
+ * Main function to update access frequency from read/writepage(s) hooks
+ */
+void hot_update_freqs(struct inode *inode, u64 start,
+   u64 len, int rw)
+{
+   struct hot_info *root = inode-i_sb-s_hot_root;
+   struct hot_inode_item *he;
+   struct 

[RFC v4+ hot_track 00/19] vfs: hot data tracking

2012-10-28 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

NOTE:

  The patchset can be obtained via my kernel dev git on github:
g...@github.com:wuzhy/kernel.git hot_tracking
  If you're interested, you can also review them via
https://github.com/wuzhy/kernel/commits/hot_tracking

  For more info, please check hot_tracking.txt in Documentation

TODO List:

 1.) Need to do scalability or performance tests.
 2.) Need a simpler but still effective temperature calculation function.
 3.) How to save file temperatures across umount so that they
 are preserved after reboot.

Ben Chociej, Matt Lupfer and Conor Scott originally wrote this code to
 be very btrfs-specific.  I've taken their code and attempted to
make it more generic and integrate it at the VFS level.

Changelog from v3:
 1.) Rewrote debugfs support on top of seq_file operations. [Dave Chinner]
 2.) Refactored workqueue support. [Dave Chinner]
 3.) Made some macros tunable: TIME_TO_KICK and
     HEAT_UPDATE_DELAY [Zhiyong, Zheng Liu]
 4.) Introduced hot func registering framework [Zhiyong]
 5.) Removed global variable for hot tracking [Zhiyong]
 6.) Added xfs hot tracking support [Dave Chinner]
 7.) Added ext4 hot tracking support [Zheng Liu]
 8.) Cleaned up a lot of other issues [Dave Chinner]

v3:
 1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner]
 2.) Added memory shrinker [Dave Chinner]

v2:
 1.) Converted to one workqueue to update map info periodically [Dave Chinner]
 2.) Cleaned up a lot of other issues [Dave Chinner]

v1:
 1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
 2.) Add btrfs hot tracking support [Zhiyong]
 3.) The first three patches can probably just be flattened into one.
[Marco Stornelli, Dave Chinner]

Dave Chinner (1):
  xfs: add hot tracking support

Zheng Liu (1):
  ext4: add hot tracking support

Zhi Yong Wu (17):
  vfs: introduce private radix tree structures
  vfs: initialize and free data structures
  vfs: add I/O frequency update function
  vfs: add two map arrays
  vfs: add hooks to enable hot tracking
  vfs: add temp calculation function
  vfs: add map info update function
  vfs: add aging function
  vfs: add one work queue
  vfs: introduce hot func register framework
  vfs: register one shrinker
  vfs: add one ioctl interface
  debugfs: introduce one function
  vfs: add debugfs support
  sysfs: add two hot_track proc files
  btrfs: add hot tracking support
  vfs: add documentation

 Documentation/filesystems/00-INDEX |2 +
 Documentation/filesystems/hot_tracking.txt |  262 ++
 fs/Makefile|2 +-
 fs/btrfs/ctree.h   |1 +
 fs/btrfs/super.c   |   22 +-
 fs/compat_ioctl.c  |5 +
 fs/dcache.c|2 +
 fs/debugfs/inode.c |   26 +
 fs/direct-io.c |6 +
 fs/ext4/ext4.h |3 +
 fs/ext4/super.c|   13 +-
 fs/hot_tracking.c  | 1367 
 fs/hot_tracking.h  |   58 ++
 fs/ioctl.c |   78 ++
 fs/xfs/xfs_mount.h |1 +
 fs/xfs/xfs_super.c |   16 +
 include/linux/debugfs.h|9 +
 include/linux/fs.h |4 +
 include/linux/hot_tracking.h   |  149 +++
 kernel/sysctl.c|   14 +
 mm/filemap.c   |6 +
 mm/page-writeback.c|   12 +
 mm/readahead.c |6 +
 23 files changed, 2061 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/filesystems/hot_tracking.txt
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h

-- 
1.7.6.5



[RFC v4+ hot_track 01/19] vfs: introduce private radix tree structures

2012-10-28 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Define one root structure, hot_info, which is hooked
up in super_block and holds the radix tree root, the hash
list root and some other information.
  Add the hot_inode_tree struct to keep track of
frequently accessed files, keyed by {inode, offset}.
Trees contain hot_inode_items representing those files
and ranges.
  Having these trees means that vfs can quickly determine the
temperature of some data by doing some calculations on the
hot_freq_data struct that hangs off of the tree item.
  Define two items, hot_inode_item and hot_range_item.
The former represents one tracked file and keeps track
of its access frequency and the tree of ranges in this
file, while the latter represents a file range of one inode.
  Each of the two structures contains a hot_freq_data
struct with its access frequency metrics (number of
{reads, writes}, last {read, write} time, frequency of
{reads, writes}).
  Also, each hot_inode_item contains one hot_range_tree
struct, which is keyed by {inode, offset, length}
and used to keep track of all the ranges in this file.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
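For illustration only: the message above describes per-file range items of a fixed length (RANGE_SIZE in the diff below). A minimal sketch of how a byte range could map onto such items follows, assuming ranges are keyed by their index and are 1 MiB each. `SKETCH_RANGE_BITS`, `range_span` and `ranges_for` are hypothetical, and the range size is an assumption, not a value taken from this post.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_RANGE_BITS 20 /* assumption: 1 MiB per range item */

/* First and last range index touched by an access. */
struct range_span {
    uint32_t first;
    uint32_t last;
};

/* Map the byte range [start, start + len) to range indexes. */
static struct range_span ranges_for(uint64_t start, uint64_t len)
{
    struct range_span s;

    assert(len > 0);
    s.first = (uint32_t)(start >> SKETCH_RANGE_BITS);
    s.last = (uint32_t)((start + len - 1) >> SKETCH_RANGE_BITS);
    return s;
}
```

An access that straddles a range boundary touches two items, so a caller would update the frequency data of every index from `first` to `last`.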
 fs/Makefile  |2 +-
 fs/dcache.c  |2 +
 fs/hot_tracking.c|  108 ++
 fs/hot_tracking.h|   23 +
 include/linux/hot_tracking.h |   73 
 5 files changed, 207 insertions(+), 1 deletions(-)
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h

diff --git a/fs/Makefile b/fs/Makefile
index 1d7af79..f966dea 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=  open.o read_write.o file_table.o super.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o drop_caches.o splice.o sync.o utimes.o \
-   stack.o fs_struct.o statfs.o
+   stack.o fs_struct.o statfs.o hot_tracking.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=   buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/dcache.c b/fs/dcache.c
index 3a463d0..7d5be16 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -37,6 +37,7 @@
 #include linux/rculist_bl.h
 #include linux/prefetch.h
 #include linux/ratelimit.h
+#include linux/hot_tracking.h
 #include internal.h
 #include mount.h
 
@@ -3172,4 +3173,5 @@ void __init vfs_caches_init(unsigned long mempages)
mnt_init();
bdev_cache_init();
chrdev_init();
+   hot_cache_init();
 }
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
new file mode 100644
index 000..badf47d
--- /dev/null
+++ b/fs/hot_tracking.c
@@ -0,0 +1,108 @@
+/*
+ * fs/hot_tracking.c
+ *
+ * Copyright (C) 2012 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu wu...@linux.vnet.ibm.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#include linux/list.h
+#include linux/err.h
+#include linux/slab.h
+#include linux/module.h
+#include linux/spinlock.h
+#include linux/hardirq.h
+#include linux/fs.h
+#include linux/blkdev.h
+#include linux/types.h
+#include linux/limits.h
+#include hot_tracking.h
+
+/* kmem_cache pointers for slab caches */
+static struct kmem_cache *hot_inode_item_cachep __read_mostly;
+static struct kmem_cache *hot_range_item_cachep __read_mostly;
+
+/*
+ * Initialize the inode tree. Should be called for each new inode
+ * access or other user of the hot_inode interface.
+ */
+static void hot_inode_tree_init(struct hot_info *root)
+{
+   INIT_RADIX_TREE(root-hot_inode_tree, GFP_ATOMIC);
+   spin_lock_init(root-lock);
+}
+
+/*
+ * Initialize the hot range tree. Should be called for each new inode
+ * access or other user of the hot_range interface.
+ */
+void hot_range_tree_init(struct hot_inode_item *he)
+{
+   INIT_RADIX_TREE(he-hot_range_tree, GFP_ATOMIC);
+   spin_lock_init(he-lock);
+}
+
+/*
+ * Initialize a new hot_range_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using free_range_item()
+ */
+static void hot_range_item_init(struct hot_range_item *hr, u32 start,
+   struct hot_inode_item *he)
+{
+   hr-start = start;
+   hr-len = RANGE_SIZE;
+   hr-hot_inode = he;
+   kref_init(hr-hot_range.refs);
+   spin_lock_init(hr-hot_range.lock);
+   hr-hot_range.hot_freq_data.avg_delta_reads = (u64) -1;
+   hr-hot_range.hot_freq_data.avg_delta_writes = (u64) -1;
+   hr-hot_range.hot_freq_data.flags = FREQ_DATA_TYPE_RANGE;
+}
+
+/*
+ * Initialize a new hot_inode_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using hot_free_inode_item()
+ */
+static void 

[RFC v4+ hot_track 02/19] vfs: initialize and free data structures

2012-10-28 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add an initialization function to create some
key data structures when hot tracking is enabled,
and clean them up when hot tracking is disabled.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
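For illustration only: the items in the diff below are reference counted with kref, and the release callback recovers the enclosing item from the embedded counter before freeing it. A minimal userspace sketch of that get/put pattern follows. `ref`, `item` and the helper names are hypothetical, and the counter here is a plain int, so unlike kref it is not safe for concurrent use.

```c
#include <assert.h>
#include <stddef.h>

struct ref {
    int count;
};

/* Hypothetical item embedding its reference counter. */
struct item {
    struct ref refs;
    int released;
};

/* A new object starts with one reference held by its creator. */
static void ref_init(struct ref *r)
{
    r->count = 1;
}

static void ref_get(struct ref *r)
{
    assert(r->count > 0);
    r->count++;
}

/* Drop one reference; run release() when the count hits zero. */
static int ref_put(struct ref *r, void (*release)(struct ref *))
{
    assert(r->count > 0);
    if (--r->count == 0) {
        release(r);
        return 1;
    }
    return 0;
}

/* Recover the enclosing item from the embedded counter. */
static void item_release(struct ref *r)
{
    struct item *it =
        (struct item *)((char *)r - offsetof(struct item, refs));

    it->released = 1; /* a real release would free the item here */
}
```

Every lookup that hands out a pointer takes a reference, and every user drops it with a put, so the item is released only after its last user is done with it.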
 fs/hot_tracking.c|  124 ++
 fs/hot_tracking.h|2 +
 include/linux/fs.h   |4 ++
 include/linux/hot_tracking.h |2 +
 4 files changed, 132 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index badf47d..68591f0 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -75,12 +75,103 @@ static void hot_inode_item_init(struct hot_inode_item *he, 
u64 ino,
he-hot_inode_tree = hot_inode_tree;
kref_init(he-hot_inode.refs);
spin_lock_init(he-hot_inode.lock);
+   INIT_LIST_HEAD(he-hot_inode.n_list);
he-hot_inode.hot_freq_data.avg_delta_reads = (u64) -1;
he-hot_inode.hot_freq_data.avg_delta_writes = (u64) -1;
he-hot_inode.hot_freq_data.flags = FREQ_DATA_TYPE_INODE;
hot_range_tree_init(he);
 }
 
+static void hot_range_item_free(struct kref *kref)
+{
+   struct hot_comm_item *comm_item = container_of(kref,
+   struct hot_comm_item, refs);
+   struct hot_range_item *hr = container_of(comm_item,
+   struct hot_range_item, hot_range);
+
+   radix_tree_delete(hr-hot_inode-hot_range_tree, hr-start);
+   kmem_cache_free(hot_range_item_cachep, hr);
+}
+
+/*
+ * Drops the reference out on hot_range_item by one
+ * and free the structure if the reference count hits zero
+ */
+static void hot_range_item_put(struct hot_range_item *hr)
+{
+   kref_put(hr-hot_range.refs, hot_range_item_free);
+}
+
+/* Frees the entire hot_range_tree. */
+static void hot_range_tree_free(struct hot_inode_item *he)
+{
+   struct hot_range_item *hr_nodes[8];
+   u32 start = 0;
+   int i, n;
+
+   while (1) {
+   spin_lock(he-lock);
+   n = radix_tree_gang_lookup(he-hot_range_tree,
+   (void **)hr_nodes, start,
+   ARRAY_SIZE(hr_nodes));
+   if (!n) {
+   spin_unlock(he-lock);
+   break;
+   }
+
+   start = hr_nodes[n - 1]-start + 1;
+   for (i = 0; i  n; i++)
+   hot_range_item_put(hr_nodes[i]);
+   spin_unlock(he-lock);
+   }
+}
+
+static void hot_inode_item_free(struct kref *kref)
+{
+   struct hot_comm_item *comm_item = container_of(kref,
+   struct hot_comm_item, refs);
+   struct hot_inode_item *he = container_of(comm_item,
+   struct hot_inode_item, hot_inode);
+
+   hot_range_tree_free(he);
+   radix_tree_delete(he-hot_inode_tree, he-i_ino);
+   kmem_cache_free(hot_inode_item_cachep, he);
+}
+
+/*
+ * Drops the reference out on hot_inode_item by one
+ * and free the structure if the reference count hits zero
+ */
+void hot_inode_item_put(struct hot_inode_item *he)
+{
+   kref_put(he-hot_inode.refs, hot_inode_item_free);
+}
+EXPORT_SYMBOL_GPL(hot_inode_item_put);
+
+/* Frees the entire hot_inode_tree. */
+static void hot_inode_tree_exit(struct hot_info *root)
+{
+   struct hot_inode_item *hi_nodes[8];
+   u64 ino = 0;
+   int i, n;
+
+   while (1) {
+   spin_lock(root-lock);
+   n = radix_tree_gang_lookup(root-hot_inode_tree,
+  (void **)hi_nodes, ino,
+  ARRAY_SIZE(hi_nodes));
+   if (!n) {
+   spin_unlock(root-lock);
+   break;
+   }
+
+   ino = hi_nodes[n - 1]-i_ino + 1;
+   for (i = 0; i  n; i++)
+   hot_inode_item_put(hi_nodes[i]);
+   spin_unlock(root-lock);
+   }
+}
+
 /*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
@@ -106,3 +197,36 @@ err:
kmem_cache_destroy(hot_inode_item_cachep);
 }
 EXPORT_SYMBOL_GPL(hot_cache_init);
+
+/*
+ * Initialize the data structures for hot data tracking.
+ */
+int hot_track_init(struct super_block *sb)
+{
+   struct hot_info *root;
+   int ret = -ENOMEM;
+
+   root = kzalloc(sizeof(struct hot_info), GFP_NOFS);
+   if (!root) {
+   printk(KERN_ERR %s: Failed to malloc memory for 
+   hot_info\n, __func__);
+   return ret;
+   }
+
+   sb-s_hot_root = root;
+   hot_inode_tree_init(root);
+
+   printk(KERN_INFO VFS: Turning on hot data tracking\n);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(hot_track_init);
+
+void hot_track_exit(struct super_block *sb)
+{
+   struct hot_info *root = sb-s_hot_root;
+
+   hot_inode_tree_exit(root);
+   kfree(root);
+}
+EXPORT_SYMBOL_GPL(hot_track_exit);
diff --git 

[RFC v4 00/15] vfs: hot data tracking

2012-10-25 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

NOTE:

  The patchset can be obtained via my kernel dev git on github:
g...@github.com:wuzhy/kernel.git hot_tracking
  If you're interested, you can also review them via
https://github.com/wuzhy/kernel/commits/hot_tracking

  For more information, please check hot_tracking.txt in Documentation

TODO List:

 1.) Need to do scalability or performance tests.
 2.) How to save file temperatures across umount so that they
 are preserved after reboot.

Ben Chociej, Matt Lupfer and Conor Scott originally wrote this code to
 be very btrfs-specific.  I've taken their code and attempted to
make it more generic and integrate it at the VFS level.

Changelog from v3:
 1.) Rewrote debugfs support on top of seq_file operations. [Dave Chinner]
 2.) Refactored workqueue support. [Dave Chinner]
 3.) Made some macros tunable: TIME_TO_KICK and
     HEAT_UPDATE_DELAY [Zhiyong, Liu Zheng]
 4.) Cleaned up a lot of other issues [Dave Chinner]

v2:
 1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner]
 2.) Added memory shrinker [Dave Chinner]
 3.) Converted to one workqueue to update map info periodically [Dave Chinner]
 4.) Cleanedup a lot of other issues [Dave Chinner]

v1:
 1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
 2.) The first three patches can probably just be flattened into one.
[Marco Stornelli, Dave Chinner]

Dave Chinner (1):
  xfs: add hot tracking support

Zhi Yong Wu (14):
  vfs,hot_track: introduce private radix tree structures
  vfs,hot_track: initialize and free key data structures
  vfs,hot_track: add the function for collecting I/O frequency
  vfs,hot_track: add two map arrays
  vfs,hot_track: add hooks to enable hot data tracking
  vfs,hot_track: add the function for updating map arrays
  vfs,hot_track: add the aging function
  vfs,hot_track: add one work queue
  vfs,hot_track: register one memory shrinker
  vfs,hot_track: add one new ioctl interface
  vfs,hot_track: add debugfs support
  vfs,hot_track: turn some Micro into be tunable
  btrfs: add hot tracking support
  vfs,hot_track: add the documentation

 Documentation/filesystems/00-INDEX |2 +
 Documentation/filesystems/hot_tracking.txt |  164 
 fs/Makefile|2 +-
 fs/btrfs/ctree.h   |1 +
 fs/btrfs/super.c   |   22 +-
 fs/compat_ioctl.c  |5 +
 fs/dcache.c|2 +
 fs/debugfs/inode.c |   26 +
 fs/direct-io.c |6 +
 fs/hot_tracking.c  | 1308 
 fs/hot_tracking.h  |   91 ++
 fs/ioctl.c |   76 ++
 fs/xfs/xfs_mount.h |1 +
 fs/xfs/xfs_super.c |   16 +
 include/linux/debugfs.h|9 +
 include/linux/fs.h |4 +
 include/linux/hot_tracking.h   |  125 +++
 kernel/sysctl.c|   14 +
 mm/filemap.c   |6 +
 mm/page-writeback.c|   12 +
 mm/readahead.c |6 +
 21 files changed, 1896 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/filesystems/hot_tracking.txt
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h

-- 
1.7.6.5



[RFC v4 07/15] vfs,hot_track: add the aging function

2012-10-25 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
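For illustration only: the aging check in the diff below compares the time since the last read and the last write against TIME_TO_KICK seconds. The comparison operators are lost in this archive copy, so the following sketch assumes the natural reading: an item is obsolete only when both its reads and its writes have been idle for longer than the threshold. The `SKETCH_` constants and `is_obsolete` are hypothetical, and the 300 second value is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_TIME_TO_KICK 300 /* seconds; assumed threshold */
#define SKETCH_NSEC_PER_SEC UINT64_C(1000000000)

/* All arguments are timestamps in nanoseconds. */
static bool is_obsolete(uint64_t now_ns, uint64_t last_read_ns,
                        uint64_t last_write_ns)
{
    const uint64_t kick_ns = SKETCH_TIME_TO_KICK * SKETCH_NSEC_PER_SEC;

    assert(now_ns >= last_read_ns && now_ns >= last_write_ns);
    return (now_ns - last_read_ns > kick_ns) &&
           (now_ns - last_write_ns > kick_ns);
}
```

Requiring both conditions keeps an item alive as long as either kind of access is still recent, so a read-only hot file is not aged out just because nothing writes to it.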
 fs/hot_tracking.c |   56 +
 fs/hot_tracking.h |6 +
 2 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 05624ad..575cd3a 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -331,6 +331,24 @@ static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write)
         }
 }
 
+static bool hot_freq_data_is_obsolete(struct hot_freq_data *freq_data)
+{
+        int ret = 0;
+        struct timespec ckt = current_kernel_time();
+
+        u64 cur_time = timespec_to_ns(&ckt);
+        u64 last_read_ns =
+                (cur_time - timespec_to_ns(&freq_data->last_read_time));
+        u64 last_write_ns =
+                (cur_time - timespec_to_ns(&freq_data->last_write_time));
+        u64 kick_ns =  TIME_TO_KICK * NSEC_PER_SEC;
+
+        if ((last_read_ns > kick_ns) && (last_write_ns > kick_ns))
+                ret = 1;
+
+        return ret;
+}
+
 static u64 hot_raw_shift(u64 counter, u32 bits, bool dir)
 {
         if (dir)
@@ -495,6 +513,44 @@ static void hot_map_array_update(struct hot_freq_data *freq_data,
         }
 }
 
+/* Update temperatures for each range item for aging purposes */
+static void hot_range_update(struct hot_inode_item *he,
+                                struct hot_info *root)
+{
+        struct hot_range_item *hr_nodes[8];
+        u32 start = 0;
+        bool obsolete;
+        int i, n;
+
+        while (1) {
+                spin_lock(&he->lock);
+                n = radix_tree_gang_lookup(&he->hot_range_tree,
+                                (void **)hr_nodes, start,
+                                ARRAY_SIZE(hr_nodes));
+                if (!n) {
+                        spin_unlock(&he->lock);
+                        break;
+                }
+                spin_unlock(&he->lock);
+
+                start = hr_nodes[n - 1]->start + 1;
+                for (i = 0; i < n; i++) {
+                        kref_get(&hr_nodes[i]->hot_range.refs);
+                        hot_map_array_update(
+                                &hr_nodes[i]->hot_range.hot_freq_data, root);
+
+                        spin_lock(&hr_nodes[i]->hot_range.lock);
+                        obsolete = hot_freq_data_is_obsolete(
+                                &hr_nodes[i]->hot_range.hot_freq_data);
+                        spin_unlock(&hr_nodes[i]->hot_range.lock);
+
+                        hot_range_item_put(hr_nodes[i]);
+                        if (obsolete)
+                                hot_range_item_put(hr_nodes[i]);
+                }
+        }
+}
+
 /*
  * Initialize inode and range map arrays.
  */
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index be2365c..67c6fb6 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -26,6 +26,12 @@
 #define FREQ_POWER 4
 
 /*
+ * time to quit keeping track of
+ * tracking data (seconds)
+ */
+#define TIME_TO_KICK 300
+
+/*
  * The following comments explain what exactly comprises a unit of heat.
  *
  * Each of six values of heat are calculated and combined in order to form an
-- 
1.7.6.5
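
The aging test in this patch can be tried outside the kernel. The following is a minimal userspace sketch, not the patch's code: struct freq_times and freq_is_obsolete() are hypothetical stand-ins for hot_freq_data and hot_freq_data_is_obsolete(), and timestamps are passed as plain nanosecond counts instead of struct timespec. It keeps the rule that an item is obsolete only when both its last read and its last write are older than TIME_TO_KICK.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
#define TIME_TO_KICK 300        /* seconds, the value used in the patch */

/* Hypothetical stand-in for the two timestamps kept in hot_freq_data. */
struct freq_times {
        uint64_t last_read_ns;
        uint64_t last_write_ns;
};

/*
 * An item is obsolete only if BOTH the last read and the last write
 * happened more than TIME_TO_KICK seconds before now_ns.
 */
static bool freq_is_obsolete(const struct freq_times *t, uint64_t now_ns)
{
        uint64_t kick_ns = (uint64_t)TIME_TO_KICK * NSEC_PER_SEC;
        uint64_t read_age = now_ns - t->last_read_ns;
        uint64_t write_age = now_ns - t->last_write_ns;

        return read_age > kick_ns && write_age > kick_ns;
}
```

A single recent write is enough to keep an item alive under this rule, even if the data has not been read for a long time.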



[RFC v4 06/15] vfs,hot_track: add the function for updating map arrays

2012-10-25 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c |  164 +
 fs/hot_tracking.h |   54 +
 2 files changed, 218 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index b5568bc..05624ad 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -331,6 +331,170 @@ static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write)
         }
 }
 
+static u64 hot_raw_shift(u64 counter, u32 bits, bool dir)
+{
+        if (dir)
+                return counter << bits;
+        else
+                return counter >> bits;
+}
+
+/*
+ * hot_temp_calc() is responsible for distilling the six heat
+ * criteria (which are described in detail in hot_tracking.h) down into a
+ * single temperature value for the data, which is an integer between 0
+ * and HEAT_MAX_VALUE.
+ *
+ * To accomplish this, the raw values from the hot_freq_data structure
+ * are shifted various ways in order to make the temperature calculation more
+ * or less sensitive to each value.
+ *
+ * Once this calibration has happened, we do some additional normalization and
+ * make sure that everything fits nicely in a u32. From there, we take a very
+ * rudimentary kind of average of each of the values, where the *_COEFF_POWER
+ * values act as weights for the average.
+ *
+ * Finally, we use the HEAT_HASH_BITS value, which determines the size of the
+ * heat list array, to normalize the temperature to the proper granularity.
+ */
+u32 hot_temp_calc(struct hot_freq_data *freq_data)
+{
+        u32 result = 0;
+
+        struct timespec ckt = current_kernel_time();
+        u64 cur_time = timespec_to_ns(&ckt);
+
+        u32 nrr_heat = (u32)hot_raw_shift((u64)freq_data->nr_reads,
+                                        NRR_MULTIPLIER_POWER, true);
+        u32 nrw_heat = (u32)hot_raw_shift((u64)freq_data->nr_writes,
+                                        NRW_MULTIPLIER_POWER, true);
+
+        u64 ltr_heat =
+        hot_raw_shift((cur_time - timespec_to_ns(&freq_data->last_read_time)),
+                        LTR_DIVIDER_POWER, false);
+        u64 ltw_heat =
+        hot_raw_shift((cur_time - timespec_to_ns(&freq_data->last_write_time)),
+                        LTW_DIVIDER_POWER, false);
+
+        u64 avr_heat =
+                hot_raw_shift((((u64) -1) - freq_data->avg_delta_reads),
+                        AVR_DIVIDER_POWER, false);
+        u64 avw_heat =
+                hot_raw_shift((((u64) -1) - freq_data->avg_delta_writes),
+                        AVW_DIVIDER_POWER, false);
+
+        /* ltr_heat is now guaranteed to be u32 safe */
+        if (ltr_heat >= hot_raw_shift((u64) 1, 32, true))
+                ltr_heat = 0;
+        else
+                ltr_heat = hot_raw_shift((u64) 1, 32, true) - ltr_heat;
+
+        /* ltw_heat is now guaranteed to be u32 safe */
+        if (ltw_heat >= hot_raw_shift((u64) 1, 32, true))
+                ltw_heat = 0;
+        else
+                ltw_heat = hot_raw_shift((u64) 1, 32, true) - ltw_heat;
+
+        /* avr_heat is now guaranteed to be u32 safe */
+        if (avr_heat >= hot_raw_shift((u64) 1, 32, true))
+                avr_heat = (u32) -1;
+
+        /* avw_heat is now guaranteed to be u32 safe */
+        if (avw_heat >= hot_raw_shift((u64) 1, 32, true))
+                avw_heat = (u32) -1;
+
+        nrr_heat = (u32)hot_raw_shift((u64)nrr_heat,
+                (3 - NRR_COEFF_POWER), false);
+        nrw_heat = (u32)hot_raw_shift((u64)nrw_heat,
+                (3 - NRW_COEFF_POWER), false);
+        ltr_heat = hot_raw_shift(ltr_heat, (3 - LTR_COEFF_POWER), false);
+        ltw_heat = hot_raw_shift(ltw_heat, (3 - LTW_COEFF_POWER), false);
+        avr_heat = hot_raw_shift(avr_heat, (3 - AVR_COEFF_POWER), false);
+        avw_heat = hot_raw_shift(avw_heat, (3 - AVW_COEFF_POWER), false);
+
+        result = nrr_heat + nrw_heat + (u32) ltr_heat +
+                (u32) ltw_heat + (u32) avr_heat + (u32) avw_heat;
+
+        return result;
+}
+
+/*
+ * Calculate a new temperature and, if necessary,
+ * move the list_head corresponding to this inode or range
+ * to the proper list with the new temperature
+ */
+static void hot_map_array_update(struct hot_freq_data *freq_data,
+                                struct hot_info *root)
+{
+        struct hot_map_head *buckets, *cur_bucket;
+        struct hot_comm_item *comm_item;
+        struct hot_inode_item *he;
+        struct hot_range_item *hr;
+        u8 a_temp, b_temp;
+        u32 temp = 0;
+
+        comm_item = container_of(freq_data,
+                        struct hot_comm_item, hot_freq_data);
+
+        if (freq_data->flags & FREQ_DATA_TYPE_INODE) {
+                he = container_of(comm_item,
+                        struct hot_inode_item, hot_inode);
+                buckets = root->heat_inode_map;
+
+                spin_lock(&he->hot_inode.lock);
+                temp = hot_temp_calc(freq_data);
+                spin_unlock(&he->hot_inode.lock);
+
+   
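
The "time since last access" terms in hot_temp_calc() are scaled, clamped, and then inverted so that a more recent access yields a larger value. The following userspace sketch illustrates that one step under stated assumptions: raw_shift() mirrors hot_raw_shift(), while recency_heat() and the divider power passed to it are hypothetical, since the real LTR_DIVIDER_POWER value is defined in fs/hot_tracking.h and is not shown here. The result is kept in 64 bits because an age that scales to zero produces exactly 2^32 before any later weighting shift.

```c
#include <stdbool.h>
#include <stdint.h>

/* Shift left when dir is true (multiply), right otherwise (divide). */
static uint64_t raw_shift(uint64_t counter, uint32_t bits, bool dir)
{
        return dir ? counter << bits : counter >> bits;
}

/*
 * Scale an access age down by divider_power bits, then invert it so
 * that recent accesses score high. Ages that still need more than
 * 32 bits after scaling contribute no heat at all.
 */
static uint64_t recency_heat(uint64_t age_ns, uint32_t divider_power)
{
        uint64_t limit = raw_shift(1, 32, true);        /* 2^32 */
        uint64_t scaled = raw_shift(age_ns, divider_power, false);

        if (scaled >= limit)
                return 0;
        return limit - scaled;
}
```

With a hypothetical divider power of 30, one unit of scaled age corresponds to roughly one second (2^30 ns), so the heat drops by one for about every second since the last access.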

[RFC v4 04/15] vfs,hot_track: add two map arrays

2012-10-25 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add two map arrays, one for inodes and one for ranges. Each
array entry heads a list, and the arrays are used to efficiently
look up the data temperature of a file or its ranges.
  Each list in a map array collects the items that currently
share one temperature value.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c|   55 ++
 include/linux/hot_tracking.h |   16 
 2 files changed, 71 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 201598b..b5568bc 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -60,6 +60,7 @@ static void hot_range_item_init(struct hot_range_item *hr, u32 start,
         hr->hot_inode = he;
         kref_init(&hr->hot_range.refs);
         spin_lock_init(&hr->hot_range.lock);
+        INIT_LIST_HEAD(&hr->hot_range.n_list);
         hr->hot_range.hot_freq_data.avg_delta_reads = (u64) -1;
         hr->hot_range.hot_freq_data.avg_delta_writes = (u64) -1;
         hr->hot_range.hot_freq_data.flags = FREQ_DATA_TYPE_RANGE;
@@ -91,6 +92,13 @@ static void hot_range_item_free(struct kref *kref)
         struct hot_range_item *hr = container_of(comm_item,
                         struct hot_range_item, hot_range);
 
+        spin_lock(&hr->hot_range.lock);
+        if (!list_empty(&hr->hot_range.n_list)) {
+                list_del_init(&hr->hot_range.n_list);
+                hot_root->hot_map_nr--;
+        }
+        spin_unlock(&hr->hot_range.lock);
+
         radix_tree_delete(&hr->hot_inode->hot_range_tree, hr->start);
         kmem_cache_free(hot_range_item_cachep, hr);
 }
@@ -135,6 +143,13 @@ static void hot_inode_item_free(struct kref *kref)
         struct hot_inode_item *he = container_of(comm_item,
                         struct hot_inode_item, hot_inode);
 
+        spin_lock(&he->hot_inode.lock);
+        if (!list_empty(&he->hot_inode.n_list)) {
+                list_del_init(&he->hot_inode.n_list);
+                hot_root->hot_map_nr--;
+        }
+        spin_unlock(&he->hot_inode.lock);
+
         hot_range_tree_free(he);
         radix_tree_delete(he->hot_inode_tree, he->i_ino);
         kmem_cache_free(hot_inode_item_cachep, he);
@@ -317,6 +332,44 @@ static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write)
 }
 
 /*
+ * Initialize inode and range map arrays.
+ */
+static void hot_map_array_init(struct hot_info *root)
+{
+        int i;
+        for (i = 0; i < HEAT_MAP_SIZE; i++) {
+                INIT_LIST_HEAD(&root->heat_inode_map[i].node_list);
+                INIT_LIST_HEAD(&root->heat_range_map[i].node_list);
+                root->heat_inode_map[i].temp = i;
+                root->heat_range_map[i].temp = i;
+        }
+}
+
+static void hot_map_list_free(struct list_head *node_list,
+                                struct hot_info *root)
+{
+        struct list_head *pos, *next;
+        struct hot_comm_item *node;
+
+        list_for_each_safe(pos, next, node_list) {
+                node = list_entry(pos, struct hot_comm_item, n_list);
+                list_del_init(&node->n_list);
+                root->hot_map_nr--;
+        }
+
+}
+
+/* Free inode and range map arrays */
+static void hot_map_array_exit(struct hot_info *root)
+{
+        int i;
+        for (i = 0; i < HEAT_MAP_SIZE; i++) {
+                hot_map_list_free(&root->heat_inode_map[i].node_list, root);
+                hot_map_list_free(&root->heat_range_map[i].node_list, root);
+        }
+}
+/*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
 void __init hot_cache_init(void)
@@ -406,6 +459,7 @@ int hot_track_init(struct super_block *sb)
 
         sb->s_hot_root = hot_root = root;
         hot_inode_tree_init(root);
+        hot_map_array_init(root);
 
         printk(KERN_INFO "VFS: Turning on hot data tracking\n");
 
@@ -417,6 +471,7 @@ void hot_track_exit(struct super_block *sb)
 {
         struct hot_info *root = sb->s_hot_root;
 
+        hot_map_array_exit(root);
         hot_inode_tree_exit(root);
         kfree(root);
 }
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index de68f66..0ce2ef8 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -20,6 +20,9 @@
 #include <linux/kref.h>
 #include <linux/fs.h>
 
+#define HEAT_MAP_BITS 8
+#define HEAT_MAP_SIZE (1 << HEAT_MAP_BITS)
+
 /*
  * A frequency data struct holds values that are used to
  * determine temperature of files and file ranges. These structs
@@ -36,11 +39,18 @@ struct hot_freq_data {
u32 last_temp;
 };
 
+/* List heads in hot map array */
+struct hot_map_head {
+   struct list_head node_list;
+   u8 temp;
+};
+
 /* The common info for both following structures */
 struct hot_comm_item {
struct hot_freq_data hot_freq_data;  /* frequency data */
spinlock_t lock; /* protects object data */
struct kref refs;  /* prevents kfree */
+   struct list_head n_list; /* list node index */
 };
 
 /* An item representing an inode and its access frequency */
@@ -66,6 +76,12 @@ struct hot_range_item 
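
The map arrays added by this patch are arrays of list heads indexed by temperature. Below is a small userspace sketch of that idea, not the kernel code: struct node and its helpers are a hand-rolled stand-in for the kernel's list_head API, and struct item, heat_map_init() and heat_map_move() are hypothetical names. Only HEAT_MAP_BITS and HEAT_MAP_SIZE are taken from the patch.

```c
#include <stdint.h>

#define HEAT_MAP_BITS 8
#define HEAT_MAP_SIZE (1 << HEAT_MAP_BITS)

/* Minimal circular doubly linked list, standing in for list_head. */
struct node {
        struct node *prev, *next;
};

/* One bucket per temperature value, as in struct hot_map_head. */
struct map_head {
        struct node list;
        uint8_t temp;
};

/* Hypothetical tracked item; n_list links it into one bucket. */
struct item {
        struct node n_list;
        uint32_t ino;
};

static void list_init(struct node *h)
{
        h->prev = h->next = h;
}

static int list_is_empty(const struct node *h)
{
        return h->next == h;
}

static void list_unlink(struct node *n)
{
        n->prev->next = n->next;
        n->next->prev = n->prev;
        list_init(n);
}

static void list_append(struct node *n, struct node *head)
{
        n->prev = head->prev;
        n->next = head;
        head->prev->next = n;
        head->prev = n;
}

/* Give every bucket an empty list and its own temperature. */
static void heat_map_init(struct map_head *map)
{
        int i;

        for (i = 0; i < HEAT_MAP_SIZE; i++) {
                list_init(&map[i].list);
                map[i].temp = (uint8_t)i;
        }
}

/* Move an item to the bucket that matches its new temperature. */
static void heat_map_move(struct map_head *map, struct item *it, uint8_t temp)
{
        list_unlink(&it->n_list);       /* harmless if not yet on a list */
        list_append(&it->n_list, &map[temp].list);
}
```

Finding all items at a given temperature is then a walk over one bucket, which is what makes the lookup cheap compared with scanning the radix trees.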

[RFC v4 10/15] vfs,hot_track: add one new ioctl interface

2012-10-25 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  FS_IOC_GET_HEAT_INFO: return a struct containing the various
metrics collected in hot_freq_data structs, and also return a
calculated data temperature based on those metrics. Optionally, retrieve
the temperature from the hot data hash list instead of recalculating it.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/compat_ioctl.c|5 +++
 fs/ioctl.c   |   76 ++
 include/linux/hot_tracking.h |   19 ++
 3 files changed, 100 insertions(+), 0 deletions(-)

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index f505402..b346324 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -57,6 +57,7 @@
 #include <linux/i2c-dev.h>
 #include <linux/atalk.h>
 #include <linux/gfp.h>
+#include <linux/hot_tracking.h>
 
 #include <net/bluetooth/bluetooth.h>
 #include <net/bluetooth/hci.h>
@@ -1398,6 +1399,9 @@ COMPATIBLE_IOCTL(TIOCSTART)
 COMPATIBLE_IOCTL(TIOCSTOP)
 #endif
 
+/*Hot data tracking*/
+COMPATIBLE_IOCTL(FS_IOC_GET_HEAT_INFO)
+
 /* fat 'r' ioctls. These are handled by fat with ->compat_ioctl,
    but we don't want warnings on other file systems. So declare
    them as compatible here. */
@@ -1577,6 +1581,7 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd,
         case FIBMAP:
         case FIGETBSZ:
         case FIONREAD:
+        case FS_IOC_GET_HEAT_INFO:
                 if (S_ISREG(f.file->f_path.dentry->d_inode->i_mode))
                         break;
                 /*FALL THROUGH*/
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 3bdad6d..fbad2be 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -15,6 +15,7 @@
 #include <linux/writeback.h>
 #include <linux/buffer_head.h>
 #include <linux/falloc.h>
+#include "hot_tracking.h"
 
 #include <asm/ioctls.h>
 
@@ -537,6 +538,78 @@ static int ioctl_fsthaw(struct file *filp)
 }
 
 /*
+ * Retrieve information about access frequency for the given file. Return it in
+ * a userspace-friendly struct for btrfsctl (or another tool) to parse.
+ *
+ * The temperature that is returned can be live -- that is, recalculated when
+ * the ioctl is called -- or it can be returned from the hashtable, reflecting
+ * the (possibly old) value that the system will use when considering files
+ * for migration. This behavior is determined by hot_heat_info->live.
+ */
+static int ioctl_heat_info(struct file *file, void __user *argp)
+{
+        struct inode *inode = file->f_dentry->d_inode;
+        struct hot_heat_info *heat_info;
+        struct hot_inode_item *he;
+        int ret = 0;
+
+        heat_info = kmalloc(sizeof(struct hot_heat_info),
+                        GFP_KERNEL | GFP_NOFS);
+
+        if (copy_from_user((void *) heat_info,
+                        argp,
+                        sizeof(struct hot_heat_info)) != 0) {
+                ret = -EFAULT;
+                goto err;
+        }
+
+        he = hot_inode_item_find(inode->i_sb->s_hot_root, inode->i_ino);
+        if (!he) {
+                /* we don't have any info on this file yet */
+                ret = -ENODATA;
+                goto err;
+        }
+
+        spin_lock(&he->hot_inode.lock);
+        heat_info->avg_delta_reads =
+                (__u64) he->hot_inode.hot_freq_data.avg_delta_reads;
+        heat_info->avg_delta_writes =
+                (__u64) he->hot_inode.hot_freq_data.avg_delta_writes;
+        heat_info->last_read_time =
+        (__u64) timespec_to_ns(&he->hot_inode.hot_freq_data.last_read_time);
+        heat_info->last_write_time =
+        (__u64) timespec_to_ns(&he->hot_inode.hot_freq_data.last_write_time);
+        heat_info->num_reads =
+                (__u32) he->hot_inode.hot_freq_data.nr_reads;
+        heat_info->num_writes =
+                (__u32) he->hot_inode.hot_freq_data.nr_writes;
+
+        if (heat_info->live > 0) {
+                /*
+                 * got a request for live temperature,
+                 * call hot_temp_calc() to recalculate
+                 */
+                heat_info->temp = hot_temp_calc(&he->hot_inode.hot_freq_data);
+        } else {
+                /* not live temperature, get it from the hashlist */
+                heat_info->temp = he->hot_inode.hot_freq_data.last_temp;
+        }
+        spin_unlock(&he->hot_inode.lock);
+
+        hot_inode_item_put(he);
+
+        if (copy_to_user(argp, (void *) heat_info,
+                        sizeof(struct hot_heat_info))) {
+                ret = -EFAULT;
+                goto err;
+        }
+
+err:
+        kfree(heat_info);
+        return ret;
+}
+
+/*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
  *
@@ -591,6 +664,9 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
         case FIGETBSZ:
                 return put_user(inode->i_sb->s_blocksize, argp);
 
+        case FS_IOC_GET_HEAT_INFO:
+                return ioctl_heat_info(filp, argp);
+
         default:
                 if (S_ISREG(inode->i_mode))
error = 
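
The choice between a live and a cached temperature in ioctl_heat_info() can be modelled in a few lines. The sketch below is a userspace illustration only: the two structs are cut-down, hypothetical versions of hot_freq_data and hot_heat_info (the real layouts live in include/linux/hot_tracking.h and are not shown in this message), and temp_calc_stub() is a placeholder that does not implement the real hot_temp_calc() formula.

```c
#include <stdint.h>

/* Cut-down, hypothetical version of the kernel-side counters. */
struct freq_data {
        uint32_t nr_reads;
        uint32_t nr_writes;
        uint32_t last_temp;     /* temperature stored at the last update */
};

/* Cut-down, hypothetical version of the struct copied to userspace. */
struct heat_info {
        uint32_t num_reads;
        uint32_t num_writes;
        uint32_t temp;
        uint8_t live;           /* nonzero: recalculate on request */
};

/* Placeholder only; NOT the real hot_temp_calc() formula. */
static uint32_t temp_calc_stub(const struct freq_data *fd)
{
        return fd->nr_reads + fd->nr_writes;
}

/* Copy the counters, then pick a live or a cached temperature. */
static void fill_heat_info(const struct freq_data *fd, struct heat_info *out)
{
        out->num_reads = fd->nr_reads;
        out->num_writes = fd->nr_writes;

        if (out->live > 0)
                out->temp = temp_calc_stub(fd);
        else
                out->temp = fd->last_temp;
}
```

The caller sets live before the call, which matches the patch copying the struct in from userspace first and only then filling in the remaining fields.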

[RFC v4 15/15] vfs,hot_track: add the documentation

2012-10-25 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 Documentation/filesystems/00-INDEX |2 +
 Documentation/filesystems/hot_tracking.txt |  164 
 2 files changed, 166 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/hot_tracking.txt

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 8c624a1..b68bdff 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -118,3 +118,5 @@ xfs.txt
- info and mount options for the XFS filesystem.
 xip.txt
- info on execute-in-place for file mappings.
+hot_tracking.txt
+   - info on hot data tracking in VFS layer
diff --git a/Documentation/filesystems/hot_tracking.txt b/Documentation/filesystems/hot_tracking.txt
new file mode 100644
index 000..ccb367d
--- /dev/null
+++ b/Documentation/filesystems/hot_tracking.txt
@@ -0,0 +1,164 @@
+Hot Data Tracking
+
+September, 2012                Zhi Yong Wu wu...@linux.vnet.ibm.com
+
+CONTENTS
+
+1. Introduction
+2. Motivation
+3. The Design
+4. Git Development Tree
+5. Usage Example
+
+
+1. Introduction
+
+  The feature adds experimental support for tracking data temperature
+information in the VFS layer.  Essentially, this means maintaining some key
+stats (like number of reads/writes, last read/write time, frequency of
+reads/writes), then distilling those numbers down to a single
+temperature value that reflects what data is hot, and using that
+temperature to move data to SSDs.
+
+  The long-term goal of the feature is to allow some filesystems,
+e.g. Btrfs, to intelligently utilize SSDs in a heterogeneous volume.
+Incidentally, this project has been motivated by
+the Project Ideas page on the Btrfs wiki.
+
+  Of course, users are warned not to run this code outside of development
+environments. These patches are EXPERIMENTAL, and as such they might eat
+your data and/or memory. That said, the code should be relatively safe
+when the hot_track mount option is disabled.
+
+2. Motivation
+
+  The overall goal of enabling hot data relocation to SSD has been
+motivated by the Project Ideas page on the Btrfs wiki at
+https://btrfs.wiki.kernel.org/index.php/Project_ideas.
+The work is divided into two steps: the VFS provides the hot data
+tracking function, while a specific FS provides the hot data
+relocation function. So as the first step toward this goal, it is hoped
+that the patchset for hot data tracking will eventually mature into VFS.
+
+  This is essentially the traditional cache argument: SSD is fast and
+expensive; HDD is cheap but slow. ZFS, for example, can already take
+advantage of SSD caching. Btrfs should also be able to take advantage of
+hybrid storage without many broad, sweeping changes to existing code.
+
+
+3. The Design
+
+The design includes the following parts:
+
+* Hooks in existing vfs functions to track data access frequency
+
+* New radix-trees for tracking access frequency of inodes and sub-file
+ranges
+The relationship between super_block and radix-tree is as below:
+hot_info.hot_inode_tree
+Each FS instance can find its hot tracking info through s_hot_root.
+The hot_info structure stores the hot tracking data, such as
+hot_inode_tree and the inode and range lists.
+
+* A list for indexing data by its temperature
+
+* A debugfs interface for dumping data from the radix-trees
+
+* A background kthread for updating inode heat info
+
+* Mount options for enabling temperature tracking (-o hot_track,
+disabled by default)
+
+* An ioctl to retrieve the frequency information collected for a certain
+file
+
+* Ioctls to enable/disable frequency tracking per inode.
+
+The relationship between these structures is as follows:
+
+* hot_info.hot_inode_tree indexes hot_inode_items, one per inode
+
+* hot_inode_item contains access frequency data for that inode
+
+* hot_inode_item holds a heat list node to index the access
+frequency data for that inode
+
+* hot_inode_item.hot_range_tree indexes hot_range_items for that inode
+
+* hot_range_item contains access frequency data for that range
+
+* hot_range_item holds a heat list node to index the access
+frequency data for that range
+
+* hot_info.heat_inode_map indexes per-inode heat list nodes
+
+* hot_info.heat_range_map indexes per-range heat list nodes
+
+  How about some ascii art? :) Just looking at the hot inode item case
+(the range item case is the same pattern, though), we have:
+
+heat_inode_map           hot_inode_tree
+    |                         |
+    |                         V
+    |           +-------hot_comm_item--------+
+    |           |       frequency data       |
++---+           |        list_head           |
+|               V            ^ |             V
+| ...<--hot_comm_item-->...  | |  ...<--hot_comm_item-->...
+|       frequency data       | |        frequency data
++-------->list_head<---------+ +--------->list_head<---.
+   
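
The relationships listed in the design section rely on embedding: the frequency data sits inside a common item, which in turn sits inside the inode or range item, so a pointer to the inner part is enough to recover the outer one. The following userspace sketch shows that with container_of. The struct layouts are simplified and hypothetical (field names such as lock_placeholder and freq are not from the patches), and container_of is redefined locally with offsetof instead of using the kernel macro.

```c
#include <stddef.h>
#include <stdint.h>

/* Local equivalent of the kernel's container_of(), via offsetof. */
#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified, hypothetical layouts mirroring the nesting in the text. */
struct freq_data {
        uint32_t nr_reads;
        uint32_t nr_writes;
};

struct comm_item {
        int lock_placeholder;           /* stands in for lock and refcount */
        struct freq_data freq;          /* frequency data */
};

struct inode_item {
        uint64_t i_ino;                 /* inode number */
        struct comm_item hot_inode;     /* common part */
};

/* Walk outwards: frequency data -> common item -> inode item. */
static struct inode_item *inode_item_of(struct freq_data *fd)
{
        struct comm_item *ci = container_of(fd, struct comm_item, freq);

        return container_of(ci, struct inode_item, hot_inode);
}
```

This is why the heat lists and the radix trees can both index the same object without storing back-pointers in the embedded parts.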

[RFC v4 14/15] xfs: add hot tracking support

2012-10-25 Thread zwu . kernel
From: Dave Chinner dchin...@redhat.com

  Connect up the VFS hot tracking support
so XFS filesystems can make use of it.

Signed-off-by: Dave Chinner dchin...@redhat.com
---
 fs/xfs/xfs_mount.h |1 +
 fs/xfs/xfs_super.c |   16 
 2 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index deee09e..96d93c2 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -217,6 +217,7 @@ typedef struct xfs_mount {
 #define XFS_MOUNT_WSYNC         (1ULL << 0)     /* for nfs - all metadata ops
                                                    must be synchronous except
                                                    for space allocations */
+#define XFS_MOUNT_HOTTRACK      (1ULL << 1)     /* hot inode tracking */
 #define XFS_MOUNT_WAS_CLEAN     (1ULL << 3)
 #define XFS_MOUNT_FS_SHUTDOWN   (1ULL << 4)     /* atomic stop of all filesystem
                                                    operations, typically for
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 26a09bd..48b3bed 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -61,6 +61,7 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/parser.h>
+#include <linux/hot_tracking.h>
 
 static const struct super_operations xfs_super_operations;
 static kmem_zone_t *xfs_ioend_zone;
@@ -114,6 +115,7 @@ mempool_t *xfs_ioend_pool;
 #define MNTOPT_NODELAYLOG  "nodelaylog" /* Delayed logging disabled */
 #define MNTOPT_DISCARD     "discard"    /* Discard unused blocks */
 #define MNTOPT_NODISCARD   "nodiscard"  /* Do not discard unused blocks */
+#define MNTOPT_HOTTRACK    "hot_track"  /* hot inode tracking */
 
 /*
  * Table driven mount option parser.
@@ -371,6 +373,8 @@ xfs_parseargs(
                         mp->m_flags |= XFS_MOUNT_DISCARD;
                 } else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
                         mp->m_flags &= ~XFS_MOUNT_DISCARD;
+                } else if (!strcmp(this_char, MNTOPT_HOTTRACK)) {
+                        mp->m_flags |= XFS_MOUNT_HOTTRACK;
                 } else if (!strcmp(this_char, "ihashsize")) {
                         xfs_warn(mp,
         "ihashsize no longer used, option is deprecated.");
@@ -1005,6 +1009,9 @@ xfs_fs_put_super(
 {
         struct xfs_mount        *mp = XFS_M(sb);
 
+        if (mp->m_flags & XFS_MOUNT_HOTTRACK)
+                hot_track_exit(sb);
+
         xfs_filestream_unmount(mp);
         cancel_delayed_work_sync(&mp->m_sync_work);
         xfs_unmountfs(mp);
@@ -1407,7 +1414,16 @@ xfs_fs_fill_super(
                 goto out_unmount;
         }
 
+        if (mp->m_flags & XFS_MOUNT_HOTTRACK) {
+                error = hot_track_init(sb);
+                if (error)
+                        goto out_free_root;
+        }
+
         return 0;
+ out_free_root:
+        dput(sb->s_root);
+        sb->s_root = NULL;
  out_syncd_stop:
         xfs_syncd_stop(mp);
  out_filestream_unmount:
-- 
1.7.6.5
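
The XFS hookup is a string compare per mount option followed by a flag bit that later code tests. A minimal userspace sketch of that pattern follows. It is not the XFS parser: parse_mount_flags() is a hypothetical helper that walks a comma-separated option string, and the flag and option names are local copies of the values shown in the patch.

```c
#include <stdint.h>
#include <string.h>

#define MOUNT_HOTTRACK  (1ULL << 1)     /* same bit as XFS_MOUNT_HOTTRACK */
#define MNTOPT_HOTTRACK "hot_track"

/*
 * Walk a comma-separated option string and set the hot tracking flag
 * when an option matches MNTOPT_HOTTRACK exactly.
 */
static uint64_t parse_mount_flags(const char *options)
{
        uint64_t flags = 0;
        const char *p = options;
        size_t want = strlen(MNTOPT_HOTTRACK);

        while (*p) {
                size_t len = strcspn(p, ",");   /* length of this option */

                if (len == want && !strncmp(p, MNTOPT_HOTTRACK, len))
                        flags |= MOUNT_HOTTRACK;
                p += len;
                if (*p == ',')
                        p++;
        }
        return flags;
}
```

The position of hot_track in the option string does not matter, and an option that merely starts with the same characters is not accepted.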



[RFC v4 13/15] btrfs: add hot tracking support

2012-10-25 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Introduce one new mount option '-o hot_track',
and add its parsing support.
  Its usage looks like:
   mount -o hot_track
   mount -o nouser,hot_track
   mount -o nouser,hot_track,loop
   mount -o hot_track,nouser

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/btrfs/ctree.h |1 +
 fs/btrfs/super.c |   22 +-
 2 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 926c9ff..c3d28f0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1756,6 +1756,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY     (1 << 20)
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR        (1 << 22)
+#define BTRFS_MOUNT_HOT_TRACK           (1 << 23)
 
 #define btrfs_clear_opt(o, opt)         ((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)           ((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 915ac14..0bcc62b 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -41,6 +41,7 @@
 #include <linux/slab.h>
 #include <linux/cleancache.h>
 #include <linux/ratelimit.h>
+#include <linux/hot_tracking.h>
 #include "compat.h"
 #include "delayed-inode.h"
 #include "ctree.h"
@@ -299,6 +300,10 @@ static void btrfs_put_super(struct super_block *sb)
          * last process that kept it busy.  Or segfault in the aforementioned
          * process...  Whom would you report that to?
          */
+
+        /* Hot data tracking */
+        if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
+                hot_track_exit(sb);
 }
 
 enum {
@@ -311,7 +316,7 @@ enum {
         Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
         Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
         Opt_check_integrity, Opt_check_integrity_including_extent_data,
-        Opt_check_integrity_print_mask, Opt_fatal_errors,
+        Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_hot_track,
         Opt_err,
 };
 
@@ -352,6 +357,7 @@ static match_table_t tokens = {
         {Opt_check_integrity_including_extent_data, "check_int_data"},
         {Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
         {Opt_fatal_errors, "fatal_errors=%s"},
+        {Opt_hot_track, "hot_track"},
         {Opt_err, NULL},
 };
 
@@ -614,6 +620,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
                                 goto out;
                         }
                         break;
+                case Opt_hot_track:
+                        btrfs_set_opt(info->mount_opt, HOT_TRACK);
+                        break;
                 case Opt_err:
                         printk(KERN_INFO "btrfs: unrecognized mount option "
                                "'%s'\n", p);
@@ -841,11 +850,20 @@ static int btrfs_fill_super(struct super_block *sb,
                 goto fail_close;
         }
 
+        if (btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
+                err = hot_track_init(sb);
+                if (err)
+                        goto fail_hot;
+        }
+
         save_mount_options(sb, data);
         cleancache_init_fs(sb);
         sb->s_flags |= MS_ACTIVE;
         return 0;
 
+fail_hot:
+        dput(sb->s_root);
+        sb->s_root = NULL;
 fail_close:
         close_ctree(fs_info->tree_root);
         return err;
@@ -941,6 +959,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
                 seq_puts(seq, ",skip_balance");
         if (btrfs_test_opt(root, PANIC_ON_FATAL_ERROR))
                 seq_puts(seq, ",fatal_errors=panic");
+        if (btrfs_test_opt(root, HOT_TRACK))
+                seq_puts(seq, ",hot_track");
         return 0;
 }
 
-- 
1.7.6.5
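
The Btrfs mount option helpers build the flag name by token pasting, which is why the patch only needs to add BTRFS_MOUNT_HOT_TRACK and can then write HOT_TRACK at the call sites. The sketch below reproduces the pattern in userspace with hypothetical DEMO_ and demo_ names. One simplification to note: demo_test_opt() takes the option word directly, whereas btrfs_test_opt() in the patch is handed a root.

```c
#include <stdbool.h>

/* Hypothetical local copy of the flag bit added by the patch. */
#define DEMO_MOUNT_HOT_TRACK    (1UL << 23)

/* The short option name is pasted onto the DEMO_MOUNT_ prefix. */
#define demo_set_opt(o, opt)    ((o) |= DEMO_MOUNT_##opt)
#define demo_clear_opt(o, opt)  ((o) &= ~DEMO_MOUNT_##opt)
#define demo_test_opt(o, opt)   ((o) & DEMO_MOUNT_##opt)

/* True when the hot tracking bit is set in a mount option word. */
static bool hot_track_enabled(unsigned long mount_opt)
{
        return demo_test_opt(mount_opt, HOT_TRACK) != 0;
}
```

Adding a further option in this scheme means defining one more DEMO_MOUNT_ constant; the three helper macros stay unchanged.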



[RFC v4 12/15] vfs,hot_track: turn some Micro into be tunable

2012-10-25 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Make TIME_TO_KICK and HEAT_UPDATE_DELAY
tunable via /proc/sys/fs/hot-kick-time and
/proc/sys/fs/hot-update-delay.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c|   12 +---
 fs/hot_tracking.h|9 -
 include/linux/hot_tracking.h |7 +++
 kernel/sysctl.c  |   14 ++
 4 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 18a64ee..15ed407 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -25,6 +25,12 @@
 #include <linux/limits.h>
 #include "hot_tracking.h"
 
+int sysctl_hot_kick_time __read_mostly = 300;
+EXPORT_SYMBOL_GPL(sysctl_hot_kick_time);
+
+int sysctl_hot_update_delay __read_mostly = 300;
+EXPORT_SYMBOL_GPL(sysctl_hot_update_delay);
+
 static struct hot_info *hot_root;
 
 /* kmem_cache pointers for slab caches */
@@ -345,7 +351,7 @@ static bool hot_freq_data_is_obsolete(struct hot_freq_data *freq_data)
                 (cur_time - timespec_to_ns(&freq_data->last_read_time));
         u64 last_write_ns =
                 (cur_time - timespec_to_ns(&freq_data->last_write_time));
-        u64 kick_ns =  TIME_TO_KICK * NSEC_PER_SEC;
+        u64 kick_ns =  sysctl_hot_kick_time * NSEC_PER_SEC;
 
         if ((last_read_ns > kick_ns) && (last_write_ns > kick_ns))
                 ret = 1;
@@ -651,7 +657,7 @@ static void hot_update_worker(struct work_struct *work)
 
         /* Instert next delayed work */
         queue_delayed_work(root->update_wq, &root->update_work,
-                msecs_to_jiffies(HEAT_UPDATE_DELAY * MSEC_PER_SEC));
+                msecs_to_jiffies(sysctl_hot_update_delay * MSEC_PER_SEC));
 }
 
 /*
@@ -1257,7 +1263,7 @@ int hot_track_init(struct super_block *sb)
         /* Initialize hot tracking wq and arm one delayed work */
         INIT_DELAYED_WORK(&root->update_work, hot_update_worker);
         queue_delayed_work(root->update_wq, &root->update_work,
-                msecs_to_jiffies(HEAT_UPDATE_DELAY * MSEC_PER_SEC));
+                msecs_to_jiffies(sysctl_hot_update_delay * MSEC_PER_SEC));
 
         /* Register a shrinker callback */
         root->hot_shrink.shrink = hot_track_prune;
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 6eb024f..d6d9161 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -26,15 +26,6 @@
 #define FREQ_POWER 4
 
 /*
- * time to quit keeping track of
- * tracking data (seconds)
- */
-#define TIME_TO_KICK 300
-
-/* set how often to update temperatures (seconds) */
-#define HEAT_UPDATE_DELAY 300
-
-/*
  * The following comments explain what exactly comprises a unit of heat.
  *
  * Each of six values of heat are calculated and combined in order to form an
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 7107cfa..ea59f33 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -101,6 +101,13 @@ struct hot_info {
 };
 
 /*
+ * Two variables have meanings as below:
+ * 1. time to quit keeping track of tracking data (seconds)
+ * 2. set how often to update temperatures (seconds)
+ */
+extern int sysctl_hot_kick_time, sysctl_hot_update_delay;
+
+/*
  * Hot data tracking ioctls:
  *
  * HOT_INFO - retrieve info on frequency of access
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 26f65ea..37624fb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1545,6 +1545,20 @@ static struct ctl_table fs_table[] = {
                 .proc_handler   = &pipe_proc_fn,
                 .extra1         = &pipe_min_size,
         },
+        {
+                .procname       = "hot-kick-time",
+                .data           = &sysctl_hot_kick_time,
+                .maxlen         = sizeof(int),
+                .mode           = 0644,
+                .proc_handler   = proc_dointvec,
+        },
+        {
+                .procname       = "hot-update-delay",
+                .data           = &sysctl_hot_update_delay,
+                .maxlen         = sizeof(int),
+                .mode           = 0644,
+                .proc_handler   = proc_dointvec,
+        },
         { }
 };
 
-- 
1.7.6.5
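
Once the kick time is a signed int that userspace can write, the conversion to nanoseconds is worth a second look: proc_dointvec accepts negative values, and where long is 32 bits wide the product of an int and NSEC_PER_SEC (1000000000L) is computed in 32 bits before it is stored in a u64. The following is a hedged userspace sketch of one way to do the conversion defensively. kick_time_ns() is a hypothetical helper, not part of the patch, and treating negative input as zero is an assumption made for this example.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000L

/*
 * Convert a tunable given in seconds to nanoseconds. The value is
 * widened to 64 bits BEFORE the multiplication so the product cannot
 * wrap in a 32-bit long, and negative input is treated as zero.
 */
static uint64_t kick_time_ns(int seconds)
{
        if (seconds < 0)
                seconds = 0;
        return (uint64_t)seconds * (uint64_t)NSEC_PER_SEC;
}
```

With the default of 300 seconds this yields 300000000000 ns, which already exceeds what a 32-bit long can hold.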



[RFC v4 03/15] vfs,hot_track: add the function for collecting I/O frequency

2012-10-25 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add some utils helpers to update access frequencies
for one file or its range.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c|  191 ++
 fs/hot_tracking.h|9 ++
 include/linux/hot_tracking.h |2 +
 3 files changed, 202 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 5fef7e5..201598b 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -173,6 +173,149 @@ static void hot_inode_tree_exit(struct hot_info *root)
}
 }
 
+struct hot_inode_item
+*hot_inode_item_find(struct hot_info *root, u64 ino)
+{
+   struct hot_inode_item *he;
+   int ret;
+
+again:
+   spin_lock(&root->lock);
+   he = radix_tree_lookup(&root->hot_inode_tree, ino);
+   if (he) {
+   kref_get(&he->hot_inode.refs);
+   spin_unlock(&root->lock);
+   return he;
+   }
+   spin_unlock(&root->lock);
+
+   he = kmem_cache_zalloc(hot_inode_item_cachep,
+   GFP_KERNEL | GFP_NOFS);
+   if (!he)
+   return ERR_PTR(-ENOMEM);
+
+   hot_inode_item_init(he, ino, &root->hot_inode_tree);
+
+   ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
+   if (ret) {
+   kmem_cache_free(hot_inode_item_cachep, he);
+   return ERR_PTR(ret);
+   }
+
+   spin_lock(&root->lock);
+   ret = radix_tree_insert(&root->hot_inode_tree, ino, he);
+   if (ret == -EEXIST) {
+   kmem_cache_free(hot_inode_item_cachep, he);
+   spin_unlock(&root->lock);
+   radix_tree_preload_end();
+   goto again;
+   }
+   spin_unlock(&root->lock);
+   radix_tree_preload_end();
+
+   kref_get(&he->hot_inode.refs);
+   return he;
+}
+
+static struct hot_range_item
+*hot_range_item_find(struct hot_inode_item *he,
+   u32 start)
+{
+   struct hot_range_item *hr;
+   int ret;
+
+again:
+   spin_lock(&he->lock);
+   hr = radix_tree_lookup(&he->hot_range_tree, start);
+   if (hr) {
+   kref_get(&hr->hot_range.refs);
+   spin_unlock(&he->lock);
+   return hr;
+   }
+   spin_unlock(&he->lock);
+
+   hr = kmem_cache_zalloc(hot_range_item_cachep,
+   GFP_KERNEL | GFP_NOFS);
+   if (!hr)
+   return ERR_PTR(-ENOMEM);
+
+   hot_range_item_init(hr, start, he);
+
+   ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
+   if (ret) {
+   kmem_cache_free(hot_range_item_cachep, hr);
+   return ERR_PTR(ret);
+   }
+
+   spin_lock(&he->lock);
+   ret = radix_tree_insert(&he->hot_range_tree, start, hr);
+   if (ret == -EEXIST) {
+   kmem_cache_free(hot_range_item_cachep, hr);
+   spin_unlock(&he->lock);
+   radix_tree_preload_end();
+   goto again;
+   }
+   spin_unlock(&he->lock);
+   radix_tree_preload_end();
+
+   kref_get(&hr->hot_range.refs);
+   return hr;
+}
+
+/*
+ * This function does the actual work of updating the frequency numbers,
+ * whatever they turn out to be. FREQ_POWER determines how many atime
+ * deltas we keep track of (as a power of 2). So, setting it to anything above
+ * 16ish is probably overkill. Also, the higher the power, the more bits get
+ * right shifted out of the timestamp, reducing precision, so take note of that
+ * as well.
+ *
+ * The caller should have already locked freq_data's parent's spinlock.
+ *
+ * FREQ_POWER, defined immediately below, determines how heavily to weight
+ * the current frequency numbers against the newest access. For example, a value
+ * of 4 means that the new access information will be weighted 1/16th (ie 2^-4)
+ * as heavily as the existing frequency info. In essence, this is a kludged-
+ * together version of a weighted average, since we can't afford to keep all of
+ * the information that it would take to get a _real_ weighted average.
+ */
+static u64 hot_average_update(struct timespec old_atime,
+   struct timespec cur_time, u64 old_avg)
+{
+   struct timespec delta_ts;
+   u64 new_avg;
+   u64 new_delta;
+
+   delta_ts = timespec_sub(cur_time, old_atime);
+   new_delta = timespec_to_ns(&delta_ts) >> FREQ_POWER;
+
+   new_avg = (old_avg << FREQ_POWER) - old_avg + new_delta;
+   new_avg = new_avg >> FREQ_POWER;
+
+   return new_avg;
+}
+
+static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write)
+{
+   struct timespec cur_time = current_kernel_time();
+
+   if (write) {
+   freq_data->nr_writes += 1;
+   freq_data->avg_delta_writes = hot_average_update(
+   freq_data->last_write_time,
+   cur_time,
+   freq_data->avg_delta_writes);
+   
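
The averaging step described in the comment above hot_average_update()
can be tried in user space. The following is only a sketch under stated
assumptions: the shift operators were lost in this archive copy and are
assumed to be >>, << and >>; FREQ_POWER is assumed to be 4, as in the
comment's example; timestamps are plain nanosecond counts rather than
struct timespec; and avg_update_sketch() is a hypothetical name, not part
of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value, taken from the "value of 4" example in the comment. */
#define FREQ_POWER 4

/*
 * User-space model of hot_average_update(): times are nanosecond counts.
 * The old average keeps (2^FREQ_POWER - 1) / 2^FREQ_POWER of its weight,
 * and the newest delta is shifted down before being mixed in.
 */
static uint64_t avg_update_sketch(uint64_t old_atime_ns, uint64_t cur_time_ns,
				  uint64_t old_avg)
{
	uint64_t new_delta, new_avg;

	assert(cur_time_ns >= old_atime_ns);	/* time must not run backwards */

	new_delta = (cur_time_ns - old_atime_ns) >> FREQ_POWER;
	new_avg = (old_avg << FREQ_POWER) - old_avg + new_delta;
	return new_avg >> FREQ_POWER;
}
```

With these assumed operators, a stream of accesses at a fixed interval d
leaves the stored value at d >> FREQ_POWER, and an access with a zero delta
lowers the stored value by one sixteenth.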

[RFC v4 01/15] vfs,hot_track: introduce private radix tree structures

2012-10-25 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Define one root structure, hot_info, which is hooked
up in super_block and holds the radix tree root, the hash
list root and some other information.
  Add the hot_inode_tree struct to keep track of
frequently accessed files; it is keyed by {inode, offset}.
The trees contain hot_inode_items representing those files
and ranges.
  Having these trees means that the VFS can quickly determine the
temperature of some data by doing some calculations on the
hot_freq_data struct that hangs off of the tree item.
  Define two items, hot_inode_item and hot_range_item.
The former represents one tracked file and keeps track of
its access frequency and the tree of ranges in this file,
while the latter represents a file range of one inode.
  Each of the two structures contains a hot_freq_data
struct with its access frequency metrics (number of
{reads,writes}, last {read,write} time, frequency of
{reads,writes}).
  Also, each hot_inode_item contains one hot_range_tree
struct, which is keyed by {inode, offset, length}
and used to keep track of all the ranges in this file.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/Makefile  |2 +-
 fs/dcache.c  |2 +
 fs/hot_tracking.c|  107 ++
 fs/hot_tracking.h|   23 +
 include/linux/hot_tracking.h |   73 
 5 files changed, 206 insertions(+), 1 deletions(-)
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h

diff --git a/fs/Makefile b/fs/Makefile
index 1d7af79..f966dea 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=  open.o read_write.o file_table.o super.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o drop_caches.o splice.o sync.o utimes.o \
-   stack.o fs_struct.o statfs.o
+   stack.o fs_struct.o statfs.o hot_tracking.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=   buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/dcache.c b/fs/dcache.c
index 3a463d0..7d5be16 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -37,6 +37,7 @@
 #include <linux/rculist_bl.h>
 #include <linux/prefetch.h>
 #include <linux/ratelimit.h>
+#include <linux/hot_tracking.h>
 #include "internal.h"
 #include "mount.h"
 
@@ -3172,4 +3173,5 @@ void __init vfs_caches_init(unsigned long mempages)
mnt_init();
bdev_cache_init();
chrdev_init();
+   hot_cache_init();
 }
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
new file mode 100644
index 0000000..6a0f2a3
--- /dev/null
+++ b/fs/hot_tracking.c
@@ -0,0 +1,107 @@
+/*
+ * fs/hot_tracking.c
+ *
+ * Copyright (C) 2012 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu wu...@linux.vnet.ibm.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#include <linux/list.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/hardirq.h>
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/types.h>
+#include <linux/limits.h>
+#include "hot_tracking.h"
+
+/* kmem_cache pointers for slab caches */
+static struct kmem_cache *hot_inode_item_cachep __read_mostly;
+static struct kmem_cache *hot_range_item_cachep __read_mostly;
+
+/*
+ * Initialize the inode tree. Should be called for each new inode
+ * access or other user of the hot_inode interface.
+ */
+static void hot_inode_tree_init(struct hot_info *root)
+{
+   INIT_RADIX_TREE(&root->hot_inode_tree, GFP_ATOMIC);
+   spin_lock_init(&root->lock);
+}
+
+/*
+ * Initialize the hot range tree. Should be called for each new inode
+ * access or other user of the hot_range interface.
+ */
+void hot_range_tree_init(struct hot_inode_item *he)
+{
+   INIT_RADIX_TREE(&he->hot_range_tree, GFP_ATOMIC);
+   spin_lock_init(&he->lock);
+}
+
+/*
+ * Initialize a new hot_range_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using free_range_item()
+ */
+static void hot_range_item_init(struct hot_range_item *hr, u32 start,
+   struct hot_inode_item *he)
+{
+   hr->start = start;
+   hr->len = RANGE_SIZE;
+   hr->hot_inode = he;
+   kref_init(&hr->hot_range.refs);
+   spin_lock_init(&hr->hot_range.lock);
+   hr->hot_range.hot_freq_data.avg_delta_reads = (u64) -1;
+   hr->hot_range.hot_freq_data.avg_delta_writes = (u64) -1;
+   hr->hot_range.hot_freq_data.flags = FREQ_DATA_TYPE_RANGE;
+}
+
+/*
+ * Initialize a new hot_inode_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using hot_free_inode_item()
+ */
+static void 

[RFC v4 09/15] vfs,hot_track: register one memory shrinker

2012-10-25 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Register a shrinker to control the amount of
memory that is used in tracking hot regions - if we are throwing
inodes out of memory due to memory pressure, we most definitely are
going to need to reduce the amount of memory the tracking code is
using, even if it means losing useful information (i.e. the shrinker
accelerates the aging process).

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c|   61 ++
 include/linux/hot_tracking.h |1 +
 2 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index d931083..7d2c53d 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -678,6 +678,61 @@ err:
kmem_cache_destroy(hot_inode_item_cachep);
 }
 
+static int hot_track_prune_map(struct hot_map_head *map_head,
+   bool type, int nr)
+{
+   struct hot_comm_item *node;
+   int i;
+
+   for (i = 0; i < HEAT_MAP_SIZE; i++) {
+   while (!list_empty(&(map_head + i)->node_list)) {
+   if (nr-- <= 0)
+   break;
+
+   node = list_first_entry(&(map_head + i)->node_list,
+   struct hot_comm_item, n_list);
+   if (type) {
+   struct hot_inode_item *hot_inode =
+   container_of(node,
+   struct hot_inode_item, hot_inode);
+   hot_inode_item_put(hot_inode);
+   } else {
+   struct hot_range_item *hot_range =
+   container_of(node,
+   struct hot_range_item, hot_range);
+   hot_range_item_put(hot_range);
+   }
+   }
+   }
+
+   return nr;
+}
+
+/* The shrinker callback function */
+static int hot_track_prune(struct shrinker *shrink,
+   struct shrink_control *sc)
+{
+   struct hot_info *root =
+   container_of(shrink, struct hot_info, hot_shrink);
+   int ret;
+
+   if (sc->nr_to_scan == 0)
+   return root->hot_map_nr;
+
+   if (!(sc->gfp_mask & __GFP_FS))
+   return -1;
+
+   ret = hot_track_prune_map(root->heat_range_map,
+   false, sc->nr_to_scan);
+   if (ret > 0)
+   ret = hot_track_prune_map(root->heat_inode_map,
+   true, ret);
+   if (ret > 0)
+   root->hot_map_nr -= (sc->nr_to_scan - ret);
+
+   return root->hot_map_nr;
+}
+
 /*
  * Main function to update access frequency from read/writepage(s) hooks
  */
@@ -758,6 +813,11 @@ int hot_track_init(struct super_block *sb)
queue_delayed_work(root->update_wq, &root->update_work,
msecs_to_jiffies(HEAT_UPDATE_DELAY * MSEC_PER_SEC));
 
+   /* Register a shrinker callback */
+   root->hot_shrink.shrink = hot_track_prune;
+   root->hot_shrink.seeks = DEFAULT_SEEKS;
+   register_shrinker(&root->hot_shrink);
+
printk(KERN_INFO "VFS: Turning on hot data tracking\n");
 
return 0;
@@ -774,6 +834,7 @@ void hot_track_exit(struct super_block *sb)
 {
struct hot_info *root = sb->s_hot_root;
 
+   unregister_shrinker(&root->hot_shrink);
cancel_delayed_work_sync(&root->update_work);
destroy_workqueue(root->update_wq);
hot_map_array_exit(root);
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 4c5f5f3..446c99e 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -85,6 +85,7 @@ struct hot_info {
 
struct workqueue_struct *update_wq;
struct delayed_work update_work;
+   struct shrinker hot_shrink;
 };
 
 void __init hot_cache_init(void);
-- 
1.7.6.5
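
The pruning order used by the shrinker callback in this patch (range
entries first, inode entries only if some of the scan budget is left) can
be modelled in user space. This is a simplified sketch, not the patch's
code: buckets are plain counters instead of list heads, MAP_SIZE_SKETCH is
a hypothetical stand-in for HEAT_MAP_SIZE, the budget is never allowed to
go negative, and all *_sketch names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAP_SIZE_SKETCH 4	/* hypothetical; stands in for HEAT_MAP_SIZE */

/*
 * Drop up to nr entries, walking the buckets from index 0 upward.
 * bucket[i] holds the number of items in bucket i.
 * Returns the part of the budget that was not used.
 */
static int prune_map_sketch(int bucket[MAP_SIZE_SKETCH], int nr)
{
	size_t i;

	assert(nr >= 0);
	for (i = 0; i < MAP_SIZE_SKETCH && nr > 0; i++) {
		while (bucket[i] > 0 && nr > 0) {
			bucket[i]--;	/* models one *_item_put() call */
			nr--;
		}
	}
	return nr;
}

/* Range entries go first; inode entries only if budget is left over. */
static int prune_sketch(int ranges[MAP_SIZE_SKETCH],
			int inodes[MAP_SIZE_SKETCH], int nr_to_scan)
{
	int left = prune_map_sketch(ranges, nr_to_scan);

	if (left > 0)
		left = prune_map_sketch(inodes, left);
	return nr_to_scan - left;	/* number of entries dropped */
}
```

Pruning ranges before inodes keeps the coarser per-file records alive for
as long as possible, at the cost of losing per-range detail first.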



[RFC v4 11/15] vfs,hot_track: add debugfs support

2012-10-25 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add a /sys/kernel/debug/hot_track/device_name/ directory for each
volume that contains two files. The first, `inode_stats', contains the
heat information for inodes that have been brought into the hot data map
structures. The second, `range_stats', contains similar information for
subfile ranges.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/debugfs/inode.c   |   26 +++
 fs/hot_tracking.c|  458 ++
 fs/hot_tracking.h|5 +
 include/linux/debugfs.h  |9 +
 include/linux/hot_tracking.h |1 +
 5 files changed, 499 insertions(+), 0 deletions(-)

diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index b607d92..c6291bc 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -354,6 +354,32 @@ exit:
return dentry;
 }
 
+struct dentry *debugfs_get_dentry(const char *name,
+   struct dentry *parent, int len)
+{
+   struct dentry *dentry = NULL;
+   int error = 0;
+
+   error = simple_pin_fs(&debug_fs_type, &debugfs_mount,
+   &debugfs_mount_count);
+   if (error)
+   return NULL;
+
+   if (!parent)
+   parent = debugfs_mount->mnt_root;
+
+   mutex_lock(&parent->d_inode->i_mutex);
+   dentry = lookup_one_len(name, parent, strlen(name));
+   if (!IS_ERR(dentry)) {
+   mutex_unlock(&parent->d_inode->i_mutex);
+   return dentry;
+   }
+   mutex_unlock(&parent->d_inode->i_mutex);
+
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(debugfs_get_dentry);
+
 /**
  * debugfs_create_file - create a file in the debugfs filesystem
  * @name: a pointer to a string containing the name of the file to create.
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 7d2c53d..18a64ee 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -21,6 +21,7 @@
 #include <linux/blkdev.h>
 #include <linux/types.h>
 #include <linux/list_sort.h>
+#include <linux/debugfs.h>
 #include <linux/limits.h>
 #include "hot_tracking.h"
 
@@ -654,6 +655,451 @@ static void hot_update_worker(struct work_struct *work)
 }
 
 /*
+ * take the inode, find ranges associated with inode
+ * and print each range data struct
+ */
+static struct hot_range_item
+*hot_range_tree_walk(struct hot_inode_item *he,
+   loff_t *pos, u32 start, bool flag)
+{
+   struct hot_range_item *hr_nodes[8];
+   loff_t l = *pos;
+   int i, n;
+
+   /* Walk the hot_range_tree for inode */
+   while (1) {
+   spin_lock(&he->lock);
+   n = radix_tree_gang_lookup(&he->hot_range_tree,
+  (void **)hr_nodes, start,
+  ARRAY_SIZE(hr_nodes));
+   if (!n) {
+   spin_unlock(&he->lock);
+   break;
+   }
+   spin_unlock(&he->lock);
+
+   start = hr_nodes[n - 1]->start + 1;
+   for (i = 0; i < n; i++) {
+   if ((!flag && !l--) || (flag)) {
+   if (flag)
+   (*pos)++;
+   kref_get(&hr_nodes[i]->hot_range.refs);
+   return hr_nodes[i];
+   }
+   }
+   }
+
+   return NULL;
+}
+
+static void
+*hot_inode_tree_walk(u64 ino, loff_t *pos, bool type, bool flag)
+{
+   struct hot_inode_item *hi_nodes[8];
+   struct hot_range_item *hr;
+   loff_t l = *pos;
+   int i, n;
+
+   while (1) {
+   spin_lock(&hot_root->lock);
+   n = radix_tree_gang_lookup(&hot_root->hot_inode_tree,
+   (void **)hi_nodes, ino,
+   ARRAY_SIZE(hi_nodes));
+   if (!n) {
+   spin_unlock(&hot_root->lock);
+   break;
+   }
+   spin_unlock(&hot_root->lock);
+
+   ino = hi_nodes[n - 1]->i_ino + 1;
+   for (i = 0; i < n; i++) {
+   if (!type) {
+   hr = hot_range_tree_walk(hi_nodes[i],
+   pos, 0, flag);
+   if (hr)
+   return hr;
+   } else {
+   if ((!flag && !l--) || (flag)) {
+   if (flag)
+   (*pos)++;
+   kref_get(&hi_nodes[i]->hot_inode.refs);
+   return hi_nodes[i];
+   }
+   }
+   }
+   }
+
+   return NULL;
+}
+
+static void *hot_range_seq_start(struct seq_file *seq, loff_t *pos)
+{
+   return hot_inode_tree_walk(0, pos, false, false);
+}
+
+static void 
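
The tree walkers in this patch share one pattern: look up a batch of up to
eight items starting at an index, process them, then resume from the last
index seen plus one. A user-space sketch of that loop follows. It is only
an illustration: gang_lookup_sketch() is a hypothetical stand-in for
radix_tree_gang_lookup() that scans a sorted array, and the guard against
index wrap-around is an addition that the patch does not contain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH 8

/*
 * Hypothetical stand-in for radix_tree_gang_lookup(): copy up to max keys
 * that are >= first from a sorted array, return how many were copied.
 */
static size_t gang_lookup_sketch(const uint64_t *keys, size_t nkeys,
				 uint64_t first, uint64_t *out, size_t max)
{
	size_t i, n = 0;

	for (i = 0; i < nkeys && n < max; i++)
		if (keys[i] >= first)
			out[n++] = keys[i];
	return n;
}

/* Visit every key in batches of BATCH; returns the number visited. */
static size_t walk_all_sketch(const uint64_t *keys, size_t nkeys)
{
	uint64_t batch[BATCH];
	uint64_t next = 0;
	size_t n, visited = 0;

	assert(keys != NULL || nkeys == 0);
	while ((n = gang_lookup_sketch(keys, nkeys, next, batch, BATCH)) > 0) {
		visited += n;
		if (batch[n - 1] == UINT64_MAX)
			break;		/* no index left to resume from */
		next = batch[n - 1] + 1;	/* resume just past the last key */
	}
	return visited;
}
```

Batching lets the real code hold the spinlock only for the duration of one
lookup instead of the whole walk.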

[RFC v4 05/15] vfs,hot_track: add hooks to enable hot data tracking

2012-10-25 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Miscellaneous features that implement hot data tracking
and generally make the hot data functions a bit more friendly.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/direct-io.c  |6 ++
 mm/filemap.c|6 ++
 mm/page-writeback.c |   12 
 mm/readahead.c  |6 ++
 4 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index f86c720..1d23631 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -37,6 +37,7 @@
 #include <linux/uio.h>
 #include <linux/atomic.h>
 #include <linux/prefetch.h>
+#include "hot_tracking.h"
 
 /*
  * How many user pages to map in one call to get_user_pages().  This determines
@@ -1297,6 +1298,11 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
prefetch(bdev->bd_queue);
prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);
 
+   /* Hot data tracking */
+   hot_update_freqs(inode, (u64)offset,
+   (u64)iov_length(iov, nr_segs),
+   rw & WRITE);
+
return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
 nr_segs, get_block, end_io,
 submit_io, flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..51b2c48 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
+#include <linux/hot_tracking.h>
 #include "internal.h"
 
 /*
@@ -1224,6 +1225,11 @@ readpage:
 * PG_error will be set again if readpage fails.
 */
ClearPageError(page);
+
+   /* Hot data tracking */
+   hot_update_freqs(inode, (u64)page->index << PAGE_CACHE_SHIFT,
+   PAGE_CACHE_SIZE, 0);
+
/* Start the actual read. The read will unlock the page. */
error = mapping->a_ops->readpage(filp, page);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 830893b..5220040 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -35,6 +35,7 @@
 #include <linux/buffer_head.h> /* __set_page_dirty_buffers */
 #include <linux/pagevec.h>
 #include <linux/timer.h>
+#include <linux/hot_tracking.h>
 #include <trace/events/writeback.h>
 
 /*
@@ -1903,13 +1904,24 @@ EXPORT_SYMBOL(generic_writepages);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
int ret;
+   pgoff_t start = 0;
+   u64 count = 0;
 
if (wbc->nr_to_write <= 0)
return 0;
+
+   start = mapping->writeback_index << PAGE_CACHE_SHIFT;
+   count = (u64)wbc->nr_to_write;
+
if (mapping->a_ops->writepages)
ret = mapping->a_ops->writepages(mapping, wbc);
else
ret = generic_writepages(mapping, wbc);
+
+   /* Hot data tracking */
+   hot_update_freqs(mapping->host, (u64)start,
+   (count - (u64)wbc->nr_to_write) * PAGE_CACHE_SIZE, 1);
+
return ret;
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 7963f23..8a24f1e 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -19,6 +19,7 @@
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
 #include <linux/file.h>
+#include <linux/hot_tracking.h>
 
 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
@@ -138,6 +139,11 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 out:
blk_finish_plug(&plug);
 
+   /* Hot data tracking */
+   hot_update_freqs(mapping->host, (u64)(list_entry(pages->prev,\
+   struct page, lru)->index) << PAGE_CACHE_SHIFT,
+   (u64)nr_pages * PAGE_CACHE_SIZE, 0);
+
return ret;
 }
 
-- 
1.7.6.5
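
The do_writepages() hook in this patch derives the number of bytes written
from how far wbc->nr_to_write dropped during the call. A small user-space
sketch of that arithmetic follows; PAGE_SIZE_SKETCH (4096) and
written_bytes_sketch() are assumptions made for illustration and are not
names from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH 4096ULL	/* assumed page size, for illustration */

/*
 * nr_before: nr_to_write sampled before the writepages call
 * nr_after:  nr_to_write sampled after it
 * Returns the number of bytes the call is taken to have written.
 */
static uint64_t written_bytes_sketch(long nr_before, long nr_after)
{
	assert(nr_before >= nr_after);
	return (uint64_t)(nr_before - nr_after) * PAGE_SIZE_SKETCH;
}
```

For example, a budget that drops from 1024 to 1000 pages is reported as
24 pages, that is 98304 bytes with the assumed page size.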



[RFC v4 02/15] vfs,hot_track: initialize and free key data structures

2012-10-25 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add an initialization function to create the
key data structures when hot tracking is enabled,
and clean them up when hot tracking is disabled.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c|  125 ++
 include/linux/fs.h   |4 +
 include/linux/hot_tracking.h |3 +
 3 files changed, 132 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 6a0f2a3..5fef7e5 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -21,6 +21,8 @@
 #include <linux/limits.h>
 #include "hot_tracking.h"
 
+static struct hot_info *hot_root;
+
 /* kmem_cache pointers for slab caches */
 static struct kmem_cache *hot_inode_item_cachep __read_mostly;
 static struct kmem_cache *hot_range_item_cachep __read_mostly;
@@ -75,12 +77,102 @@ static void hot_inode_item_init(struct hot_inode_item *he, u64 ino,
he->hot_inode_tree = hot_inode_tree;
kref_init(&he->hot_inode.refs);
spin_lock_init(&he->hot_inode.lock);
+   INIT_LIST_HEAD(&he->hot_inode.n_list);
he->hot_inode.hot_freq_data.avg_delta_reads = (u64) -1;
he->hot_inode.hot_freq_data.avg_delta_writes = (u64) -1;
he->hot_inode.hot_freq_data.flags = FREQ_DATA_TYPE_INODE;
hot_range_tree_init(he);
 }
 
+static void hot_range_item_free(struct kref *kref)
+{
+   struct hot_comm_item *comm_item = container_of(kref,
+   struct hot_comm_item, refs);
+   struct hot_range_item *hr = container_of(comm_item,
+   struct hot_range_item, hot_range);
+
+   radix_tree_delete(&hr->hot_inode->hot_range_tree, hr->start);
+   kmem_cache_free(hot_range_item_cachep, hr);
+}
+
+/*
+ * Drops one reference on the hot_range_item
+ * and frees the structure if the reference count hits zero
+ */
+static void hot_range_item_put(struct hot_range_item *hr)
+{
+   kref_put(&hr->hot_range.refs, hot_range_item_free);
+}
+
+/* Frees the entire hot_range_tree. */
+static void hot_range_tree_free(struct hot_inode_item *he)
+{
+   struct hot_range_item *hr_nodes[8];
+   u32 start = 0;
+   int i, n;
+
+   while (1) {
+   spin_lock(&he->lock);
+   n = radix_tree_gang_lookup(&he->hot_range_tree,
+   (void **)hr_nodes, start,
+   ARRAY_SIZE(hr_nodes));
+   if (!n) {
+   spin_unlock(&he->lock);
+   break;
+   }
+
+   start = hr_nodes[n - 1]->start + 1;
+   for (i = 0; i < n; i++)
+   hot_range_item_put(hr_nodes[i]);
+   spin_unlock(&he->lock);
+   }
+}
+
+static void hot_inode_item_free(struct kref *kref)
+{
+   struct hot_comm_item *comm_item = container_of(kref,
+   struct hot_comm_item, refs);
+   struct hot_inode_item *he = container_of(comm_item,
+   struct hot_inode_item, hot_inode);
+
+   hot_range_tree_free(he);
+   radix_tree_delete(he->hot_inode_tree, he->i_ino);
+   kmem_cache_free(hot_inode_item_cachep, he);
+}
+
+/*
+ * Drops one reference on the hot_inode_item
+ * and frees the structure if the reference count hits zero
+ */
+void hot_inode_item_put(struct hot_inode_item *he)
+{
+   kref_put(&he->hot_inode.refs, hot_inode_item_free);
+}
+
+/* Frees the entire hot_inode_tree. */
+static void hot_inode_tree_exit(struct hot_info *root)
+{
+   struct hot_inode_item *hi_nodes[8];
+   u64 ino = 0;
+   int i, n;
+
+   while (1) {
+   spin_lock(&root->lock);
+   n = radix_tree_gang_lookup(&root->hot_inode_tree,
+  (void **)hi_nodes, ino,
+  ARRAY_SIZE(hi_nodes));
+   if (!n) {
+   spin_unlock(&root->lock);
+   break;
+   }
+
+   ino = hi_nodes[n - 1]->i_ino + 1;
+   for (i = 0; i < n; i++)
+   hot_inode_item_put(hi_nodes[i]);
+   spin_unlock(&root->lock);
+   }
+}
+
 /*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
@@ -105,3 +197,36 @@ void __init hot_cache_init(void)
 err:
kmem_cache_destroy(hot_inode_item_cachep);
 }
+
+/*
+ * Initialize the data structures for hot data tracking.
+ */
+int hot_track_init(struct super_block *sb)
+{
+   struct hot_info *root;
+   int ret = -ENOMEM;
+
+   root = kzalloc(sizeof(struct hot_info), GFP_NOFS);
+   if (!root) {
+   printk(KERN_ERR "%s: Failed to malloc memory for "
+   "hot_info\n", __func__);
+   return ret;
+   }
+
+   sb->s_hot_root = hot_root = root;
+   hot_inode_tree_init(root);
+
+   printk(KERN_INFO "VFS: Turning on hot data tracking\n");
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(hot_track_init);
+
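
The release callbacks in this patch recover the outer item from the
embedded reference count in two container_of() steps: kref to
hot_comm_item, then hot_comm_item to hot_range_item. The sketch below
models that in user space. It is an illustration only: every *_sketch name
is hypothetical, the counter is a plain int rather than an atomic kref,
and the release function sets a flag where the patch deletes the item from
the radix tree and frees it.

```c
#include <assert.h>
#include <stddef.h>

#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct ref_sketch { int count; };		/* stands in for struct kref */

struct comm_item_sketch {			/* stands in for hot_comm_item */
	struct ref_sketch refs;
};

struct range_item_sketch {			/* stands in for hot_range_item */
	unsigned int start;
	struct comm_item_sketch hot_range;
	int freed;				/* set by the release callback */
};

/* Two hops: refs -> comm item -> range item. */
static void range_release_sketch(struct ref_sketch *ref)
{
	struct comm_item_sketch *ci =
		container_of_sketch(ref, struct comm_item_sketch, refs);
	struct range_item_sketch *hr =
		container_of_sketch(ci, struct range_item_sketch, hot_range);

	hr->freed = 1;	/* the patch removes the item from its tree here */
}

/* Drop one reference; run release() when the count reaches zero. */
static int ref_put_sketch(struct ref_sketch *ref,
			  void (*release)(struct ref_sketch *))
{
	assert(ref->count > 0);
	if (--ref->count == 0) {
		release(ref);
		return 1;
	}
	return 0;
}
```

Embedding the shared part (refs, lock, frequency data) in both item types
is what lets one list and one comparator handle inodes and ranges alike.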

[RFC v4 08/15] vfs,hot_track: add one work queue

2012-10-25 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add a per-superblock workqueue and a work_struct
to run periodic work to update map info on each superblock.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c|   85 ++
 fs/hot_tracking.h|3 +
 include/linux/hot_tracking.h |3 +
 3 files changed, 91 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 575cd3a..d931083 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -15,9 +15,12 @@
 #include <linux/module.h>
 #include <linux/spinlock.h>
 #include <linux/hardirq.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
 #include <linux/fs.h>
 #include <linux/blkdev.h>
 #include <linux/types.h>
+#include <linux/list_sort.h>
 #include <linux/limits.h>
 #include "hot_tracking.h"
 
@@ -589,6 +592,67 @@ static void hot_map_array_exit(struct hot_info *root)
}
 }
 
+/* Temperature compare function*/
+static int hot_temp_cmp(void *priv, struct list_head *a,
+   struct list_head *b)
+{
+   struct hot_comm_item *ap =
+   container_of(a, struct hot_comm_item, n_list);
+   struct hot_comm_item *bp =
+   container_of(b, struct hot_comm_item, n_list);
+
+   int diff = ap->hot_freq_data.last_temp
+   - bp->hot_freq_data.last_temp;
+   if (diff > 0)
+   return -1;
+   if (diff < 0)
+   return 1;
+   return 0;
+}
+
+/*
+ * Every sync period we update temperatures for
+ * each hot inode item and hot range item for aging
+ * purposes.
+ */
+static void hot_update_worker(struct work_struct *work)
+{
+   struct hot_info *root = container_of(to_delayed_work(work),
+   struct hot_info, update_work);
+   struct hot_inode_item *hi_nodes[8];
+   u64 ino = 0;
+   int i, n;
+
+   while (1) {
+   n = radix_tree_gang_lookup(&root->hot_inode_tree,
+  (void **)hi_nodes, ino,
+  ARRAY_SIZE(hi_nodes));
+   if (!n)
+   break;
+
+   ino = hi_nodes[n - 1]->i_ino + 1;
+   for (i = 0; i < n; i++) {
+   kref_get(&hi_nodes[i]->hot_inode.refs);
+   hot_map_array_update(
+   &hi_nodes[i]->hot_inode.hot_freq_data, root);
+   hot_range_update(hi_nodes[i], root);
+   hot_inode_item_put(hi_nodes[i]);
+   }
+   }
+
+   /* Sort temperature map info */
+   for (i = 0; i < HEAT_MAP_SIZE; i++) {
+   list_sort(NULL, &root->heat_inode_map[i].node_list,
+   hot_temp_cmp);
+   list_sort(NULL, &root->heat_range_map[i].node_list,
+   hot_temp_cmp);
+   }
+
+   /* Insert next delayed work */
+   queue_delayed_work(root->update_wq, &root->update_work,
+   msecs_to_jiffies(HEAT_UPDATE_DELAY * MSEC_PER_SEC));
+}
+
 /*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
@@ -681,9 +745,28 @@ int hot_track_init(struct super_block *sb)
hot_inode_tree_init(root);
hot_map_array_init(root);
 
+   root->update_wq = alloc_workqueue(
+   "hot_update_wq", WQ_NON_REENTRANT, 0);
+   if (!root->update_wq) {
+   printk(KERN_ERR "%s: Failed to create "
+   "hot update workqueue\n", __func__);
+   goto failed_wq;
+   }
+
+   /* Initialize hot tracking wq and arm one delayed work */
+   INIT_DELAYED_WORK(&root->update_work, hot_update_worker);
+   queue_delayed_work(root->update_wq, &root->update_work,
+   msecs_to_jiffies(HEAT_UPDATE_DELAY * MSEC_PER_SEC));
+
printk(KERN_INFO "VFS: Turning on hot data tracking\n");
 
return 0;
+
+failed_wq:
+   hot_map_array_exit(root);
+   hot_inode_tree_exit(root);
+   kfree(root);
+   return ret;
 }
 EXPORT_SYMBOL_GPL(hot_track_init);
 
@@ -691,6 +774,8 @@ void hot_track_exit(struct super_block *sb)
 {
struct hot_info *root = sb->s_hot_root;
 
+   cancel_delayed_work_sync(&root->update_work);
+   destroy_workqueue(root->update_wq);
hot_map_array_exit(root);
hot_inode_tree_exit(root);
kfree(root);
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 67c6fb6..b9d9717 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -31,6 +31,9 @@
  */
 #define TIME_TO_KICK 300
 
+/* set how often to update temperatures (seconds) */
+#define HEAT_UPDATE_DELAY 300
+
 /*
  * The following comments explain what exactly comprises a unit of heat.
  *
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 0ce2ef8..4c5f5f3 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -82,6 +82,9 @@ struct hot_info {
/* map of range 

[RFC v3 01/13] btrfs: add one new mount option '-o hot_track'

2012-10-10 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Introduce one new mount option '-o hot_track',
and add its parsing support.
  Its usage looks like:
   mount -o hot_track
   mount -o nouser,hot_track
   mount -o nouser,hot_track,loop
   mount -o hot_track,nouser

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/btrfs/ctree.h |1 +
 fs/btrfs/super.c |7 ++-
 2 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 9821b67..094bec6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1726,6 +1726,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY (1 << 20)
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR   (1 << 22)
+#define BTRFS_MOUNT_HOT_TRACK  (1 << 23)
 
 #define btrfs_clear_opt(o, opt) ((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)  ((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 83d6f9f..00be9e3 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -41,6 +41,7 @@
 #include <linux/slab.h>
 #include <linux/cleancache.h>
 #include <linux/ratelimit.h>
+#include <linux/hot_tracking.h>
 #include "compat.h"
 #include "delayed-inode.h"
 #include "ctree.h"
@@ -303,7 +304,7 @@ enum {
Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
-   Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
+   Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track,
Opt_check_integrity, Opt_check_integrity_including_extent_data,
Opt_check_integrity_print_mask, Opt_fatal_errors,
Opt_err,
@@ -342,6 +343,7 @@ static match_table_t tokens = {
{Opt_no_space_cache, "nospace_cache"},
{Opt_recovery, "recovery"},
{Opt_skip_balance, "skip_balance"},
+   {Opt_hot_track, "hot_track"},
{Opt_check_integrity, "check_int"},
{Opt_check_integrity_including_extent_data, "check_int_data"},
{Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
@@ -553,6 +555,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
case Opt_skip_balance:
btrfs_set_opt(info->mount_opt, SKIP_BALANCE);
break;
+   case Opt_hot_track:
+   btrfs_set_opt(info->mount_opt, HOT_TRACK);
+   break;
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
case Opt_check_integrity_including_extent_data:
printk(KERN_INFO "btrfs: enabling check integrity"
-- 
1.7.6.5
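
The mount option added by this patch is a single bit in the mount_opt
word, set through the token-pasting macros shown in the ctree.h hunk. The
user-space sketch below exercises those macros. test_opt_sketch() and
parse_hot_track_sketch() are hypothetical helpers for illustration; in the
patches, btrfs_test_opt() is called with a root pointer rather than the
raw option word.

```c
#include <assert.h>

#define BTRFS_MOUNT_HOT_TRACK	(1 << 23)

#define btrfs_clear_opt(o, opt)	((o) &= ~BTRFS_MOUNT_##opt)
#define btrfs_set_opt(o, opt)	((o) |= BTRFS_MOUNT_##opt)
/* Simplified test macro that works on the raw option word. */
#define test_opt_sketch(o, opt)	(((o) & BTRFS_MOUNT_##opt) != 0)

/*
 * Models the Opt_hot_track case of the option parser: set the bit
 * when the "hot_track" token was seen, leave the word alone otherwise.
 */
static unsigned long parse_hot_track_sketch(unsigned long mount_opt, int seen)
{
	assert(seen == 0 || seen == 1);
	if (seen)
		btrfs_set_opt(mount_opt, HOT_TRACK);
	return mount_opt;
}
```

Because each option owns one bit, the position of hot_track within the
comma-separated option string does not matter, which matches the usage
examples in the commit message.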



[RFC v3 06/13] vfs: add hooks to enable hot data tracking

2012-10-10 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Miscellaneous features that implement hot data tracking
and generally make the hot data functions a bit more friendly.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/direct-io.c  |8 
 fs/hot_tracking.h   |5 +
 mm/filemap.c|7 +++
 mm/page-writeback.c |   13 +
 mm/readahead.c  |7 +++
 5 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index f86c720..8960024 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -37,6 +37,7 @@
 #include <linux/uio.h>
 #include <linux/atomic.h>
 #include <linux/prefetch.h>
+#include "hot_tracking.h"
 
 /*
  * How many user pages to map in one call to get_user_pages().  This determines
@@ -1297,6 +1298,13 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
prefetch(bdev->bd_queue);
prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);
 
+   /* Hot data tracking */
+   hot_update_freqs(global_hot_tracking_info,
+   iocb->ki_filp->f_mapping->host,
+   (u64)offset,
+   (u64)iov_length(iov, nr_segs),
+   rw & WRITE);
+
return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
 nr_segs, get_block, end_io,
 submit_io, flags);
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 37f69ee..42e0273 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -16,6 +16,11 @@
 #include <linux/workqueue.h>
 #include <linux/hot_tracking.h>
 
+/* Hot data tracking -- guard macros */
+#define TRACK_THIS_INODE(inode) \
+((inode->i_sb->hot_flags & MS_HOT_TRACKING) && \
+!(inode->i_flags & S_NOHOTDATATRACK))
+
 /* values for hot_freq_data flags */
 #define FREQ_DATA_TYPE_INODE (1 << 0)
 #define FREQ_DATA_TYPE_RANGE (1 << 1)
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..6b63b77 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
+#include <linux/hot_tracking.h>
 #include "internal.h"
 
 /*
@@ -1224,6 +1225,12 @@ readpage:
 * PG_error will be set again if readpage fails.
 */
ClearPageError(page);
+
+   /* Hot data tracking */
+   hot_update_freqs(global_hot_tracking_info, inode,
+   (u64)page->index << PAGE_CACHE_SHIFT,
+   PAGE_CACHE_SIZE, 0);
+
/* Start the actual read. The read will unlock the page. */
error = mapping->a_ops->readpage(filp, page);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5ad5ce2..cf5a1c8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -35,6 +35,7 @@
 #include <linux/buffer_head.h> /* __set_page_dirty_buffers */
 #include <linux/pagevec.h>
 #include <linux/timer.h>
+#include <linux/hot_tracking.h>
 #include <trace/events/writeback.h>
 
 /*
@@ -1895,13 +1896,25 @@ EXPORT_SYMBOL(generic_writepages);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
int ret;
+   pgoff_t start = 0;
+   u64 count = 0;
 
if (wbc->nr_to_write <= 0)
return 0;
+
+   start = mapping->writeback_index << PAGE_CACHE_SHIFT;
+   count = (u64)wbc->nr_to_write;
+
if (mapping->a_ops->writepages)
ret = mapping->a_ops->writepages(mapping, wbc);
else
ret = generic_writepages(mapping, wbc);
+
+   /* Hot data tracking */
+   hot_update_freqs(global_hot_tracking_info,
+   mapping->host, (u64)start,
+   (count - (u64)wbc->nr_to_write) * PAGE_CACHE_SIZE, 1);
+
return ret;
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 7963f23..b62f1bb 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -19,6 +19,7 @@
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
 #include <linux/file.h>
+#include <linux/hot_tracking.h>
 
 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
@@ -138,6 +139,12 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 out:
blk_finish_plug(&plug);
 
+   /* Hot data tracking */
+   hot_update_freqs(global_hot_tracking_info,
+   mapping->host, (u64)(list_entry(pages->prev,\
+   struct page, lru)->index) << PAGE_CACHE_SHIFT,
+   (u64)nr_pages * PAGE_CACHE_SIZE, 0);
+
return ret;
 }
 
-- 
1.7.6.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC v3 08/13] vfs: add aging function for old map info

2012-10-10 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
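The aging test added by this patch can be read as a simple predicate: an
item is aging only when both its last read and its last write are older
than TIME_TO_KICK seconds. The following userspace sketch restates that
check with plain nanosecond counters. The sketch_* names are
hypothetical, and the strict "greater than" comparison is an assumption
about the operators in hot_freq_data_is_aging().

```c
#include <stdbool.h>
#include <stdint.h>

#define TIME_TO_KICK 400ULL                 /* seconds, as defined in this patch */
#define SKETCH_NSEC_PER_SEC 1000000000ULL   /* hypothetical name for 10^9 */

/*
 * Returns true when both the last read and the last write happened
 * more than TIME_TO_KICK seconds before now_ns (all values in ns).
 */
static bool sketch_is_aging(uint64_t now_ns, uint64_t last_read_ns,
                            uint64_t last_write_ns)
{
    uint64_t kick_ns = TIME_TO_KICK * SKETCH_NSEC_PER_SEC;

    return (now_ns - last_read_ns > kick_ns) &&
           (now_ns - last_write_ns > kick_ns);
}
```

A single recent read or write is enough to keep an item from aging,
which is why the two conditions are combined with a logical AND.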
 fs/hot_tracking.c |   57 +
 fs/hot_tracking.h |6 +
 2 files changed, 63 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 717faa7..a8dc599 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -376,6 +376,24 @@ inline void hot_update_freqs(struct hot_info *root,
hot_inode_item_put(he);
 }
 
+static bool hot_freq_data_is_aging(struct hot_freq_data *freq_data)
+{
+int ret = 0;
+struct timespec ckt = current_kernel_time();
+
+u64 cur_time = timespec_to_ns(&ckt);
+u64 last_read_ns =
+(cur_time - timespec_to_ns(&freq_data->last_read_time));
+u64 last_write_ns =
+(cur_time - timespec_to_ns(&freq_data->last_write_time));
+u64 kick_ns = TIME_TO_KICK * (u64)1000000000;
+
+if ((last_read_ns > kick_ns) && (last_write_ns > kick_ns))
+ret = 1;
+
+return ret;
+}
+
 static u64 hot_raw_shift(u64 counter, u32 bits, bool dir)
 {
if (dir)
@@ -529,6 +547,45 @@ static void hot_map_array_update(struct hot_freq_data *freq_data,
}
 }
 
+/* Update temperatures for each range item for aging purposes */
+static void hot_range_update(struct hot_inode_item *he,
+   struct hot_info *root)
+{
+   struct hot_range_item *hr_nodes[8];
+   u32 start = 0;
+   bool range_is_aging;
+   int i, n;
+
+   while (1) {
+   spin_lock(&he->lock);
+   n = radix_tree_gang_lookup(&he->hot_range_tree,
+   (void **)hr_nodes, start,
+   ARRAY_SIZE(hr_nodes));
+   if (!n) {
+   spin_unlock(&he->lock);
+   break;
+   }
+
+   start = hr_nodes[n - 1]->start + 1;
+   for (i = 0; i < n; i++) {
+   kref_get(&hr_nodes[i]->hot_range.refs);
+   hot_map_array_update(
+   &hr_nodes[i]->hot_range.hot_freq_data, root);
+
+   spin_lock(&hr_nodes[i]->hot_range.lock);
+   range_is_aging = hot_freq_data_is_aging(
+   &hr_nodes[i]->hot_range.hot_freq_data);
+   spin_unlock(&hr_nodes[i]->hot_range.lock);
+
+   hot_range_item_put(hr_nodes[i]);
+   if (range_is_aging) {
+   hot_range_item_put(hr_nodes[i]);
+   }
+   }
+   spin_unlock(&he->lock);
+   }
+}
+
 /*
  * Initialize inode and range map arrays.
  */
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 5a9517b..d19e64a 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -31,6 +31,12 @@
 #define FREQ_POWER 4
 
 /*
+ * time to quit keeping track of
+ * tracking data (seconds)
+ */
+#define TIME_TO_KICK 400
+
+/*
  * The following comments explain what exactly comprises a unit of heat.
  *
  * Each of six values of heat are calculated and combined in order to form an
-- 
1.7.6.5



[RFC v3 10/13] vfs: register one memory shrinker

2012-10-10 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Register a shrinker to control the amount of
memory that is used in tracking hot regions - if we are throwing
inodes out of memory due to memory pressure, we most definitely are
going to need to reduce the amount of memory the tracking code is
using, even if it means losing useful information (i.e. the shrinker
accelerates the aging process).

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
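The pruning walk behind the shrinker can be pictured as spending a
budget of nr items across the heat buckets in index order. Below is a
minimal userspace sketch of that idea, in which each bucket is reduced
to an item count. The sketch_* names and the bucket count are
hypothetical, and it assumes that each put actually releases the item,
which the patch itself does not guarantee while other references exist.

```c
#include <stddef.h>

#define SKETCH_MAP_SIZE 4 /* hypothetical; stands in for HEAT_MAP_SIZE */

/*
 * Drop up to nr items, walking buckets from index 0 upward.
 * bucket_len[i] is the number of items currently in bucket i.
 * Returns the unused part of the budget, like hot_track_comm_prune().
 */
static unsigned long sketch_prune(unsigned long bucket_len[SKETCH_MAP_SIZE],
                                  unsigned long nr)
{
    for (size_t i = 0; i < SKETCH_MAP_SIZE && nr > 0; i++) {
        unsigned long take = bucket_len[i] < nr ? bucket_len[i] : nr;

        bucket_len[i] -= take;
        nr -= take;
    }
    return nr;
}
```

In hot_track_prune() the same budget is first spent on the range map,
and whatever is left over is then spent on the inode map.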
 fs/hot_tracking.c|   59 ++
 include/linux/hot_tracking.h |1 +
 2 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index f333c47..fcde55e 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -742,6 +742,59 @@ static inline void hot_cache_exit(void)
kmem_cache_destroy(hot_inode_item_cachep);
 }
 
+static int hot_track_comm_prune(struct hot_map_head *map_head,
+   bool type, unsigned long nr)
+{
+   struct list_head *pos, *next;
+   struct hot_comm_item *node;
+   int i;
+
+   for (i = 0; i < HEAT_MAP_SIZE; i++) {
+   list_for_each_safe(pos, next, &(map_head + i)->node_list) {
+   if (nr == 0)
+   break;
+   nr--;
+   node = list_entry(pos, struct hot_comm_item, n_list);
+   if (type) {
+   struct hot_inode_item *hot_inode =
+   container_of(node,
+   struct hot_inode_item, hot_inode);
+   hot_inode_item_put(hot_inode);
+   } else {
+   struct hot_range_item *hot_range =
+   container_of(node,
+   struct hot_range_item, hot_range);
+   hot_range_item_put(hot_range);
+   }
+   }
+
+   if (nr == 0)
+   break;
+   }
+
+   return nr;
+}
+
+/* The shrinker callback function */
+static int hot_track_prune(struct shrinker *shrink,
+   struct shrink_control *sc)
+{
+   struct hot_info *root =
+   container_of(shrink, struct hot_info, hot_shrink);
+   int ret;
+
+   if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL)
+   return (sc->nr_to_scan == 0) ? 0 : -1;
+
+   ret = hot_track_comm_prune(root->heat_range_map,
+   false, sc->nr_to_scan);
+   if (ret > 0)
+   ret = hot_track_comm_prune(root->heat_inode_map,
+   true, ret);
+
+   return ret;
+}
+
 /*
  * Initialize the data structures for hot data tracking.
  */
@@ -774,6 +827,11 @@ void hot_track_init(struct super_block *sb)
if (err)
goto failed_wq;
 
+   /* Register a shrinker callback */
+   root->hot_shrink.shrink = hot_track_prune;
+   root->hot_shrink.seeks = DEFAULT_SEEKS;
+   register_shrinker(&root->hot_shrink);
+
printk(KERN_INFO "vfs: turning on hot data tracking\n");
 
return;
@@ -791,6 +849,7 @@ void hot_track_exit(struct super_block *sb)
 {
struct hot_info *root = global_hot_tracking_info;
 
+   unregister_shrinker(&root->hot_shrink);
hot_wq_exit(root->update_wq);
hot_map_array_exit(root);
hot_inode_tree_exit(root);
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index b37e0f8..6f31090 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -86,6 +86,7 @@ struct hot_info {
struct hot_map_head heat_range_map[HEAT_MAP_SIZE];
 
struct workqueue_struct *update_wq;
+   struct shrinker hot_shrink;
 };
 
 extern struct hot_info *global_hot_tracking_info;
-- 
1.7.6.5



[RFC v3 12/13] vfs: add debugfs support

2012-10-10 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add a /sys/kernel/debug/hot_track/<device_name>/ directory for each
volume that contains two files. The first, `inode_data', contains the
heat information for inodes that have been brought into the hot data map
structures. The second, `range_data', contains similar information for
subfile ranges.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
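The log buffer behind the debugfs files grows on demand when a new
message does not fit. A minimal userspace sketch of that pattern
follows, using realloc() in place of the vmalloc()/memcpy()/vfree()
sequence in hot_debugfs_copy(). The sketch_* names and the page size
are hypothetical. Unlike the patch, which grows the buffer once per
call, this sketch keeps growing until the message fits.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_LOG_PAGE 4096 /* hypothetical; stands in for LOG_PAGE_SIZE */

/* Hypothetical counterpart of the lstring plus log_alloc_size pair. */
struct sketch_log {
    char *str;    /* buffer, zero-filled beyond len */
    size_t len;   /* bytes used */
    size_t alloc; /* bytes allocated */
};

/*
 * Append len bytes of msg, growing the buffer one page at a time.
 * Returns the number of bytes appended, or 0 if allocation fails.
 */
static size_t sketch_log_append(struct sketch_log *log, const char *msg,
                                size_t len)
{
    while (len >= log->alloc - log->len) {
        size_t new_alloc = log->alloc + SKETCH_LOG_PAGE;
        char *grown = realloc(log->str, new_alloc);

        if (!grown)
            return 0;
        /* Keep the unused tail zeroed so the log stays NUL-terminated. */
        memset(grown + log->len, 0, new_alloc - log->len);
        log->str = grown;
        log->alloc = new_alloc;
    }
    memcpy(log->str + log->len, msg, len);
    log->len += len;
    return len;
}
```

A zero-initialized struct sketch_log is a valid empty log; the caller
frees log.str when done.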
 fs/hot_tracking.c |  462 +
 fs/hot_tracking.h |   43 +
 2 files changed, 505 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index fcde55e..60e93e6 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -20,6 +20,8 @@
 #include <linux/fs.h>
 #include <linux/blkdev.h>
 #include <linux/types.h>
+#include <linux/debugfs.h>
+#include <linux/vmalloc.h>
 #include <linux/limits.h>
 #include "hot_tracking.h"
 
@@ -29,6 +31,13 @@ struct hot_info *global_hot_tracking_info;
 static struct kmem_cache *hot_inode_item_cachep;
 static struct kmem_cache *hot_range_item_cachep;
 
+/* list to keep track of each mounted volumes debugfs_vol_data */
+static struct list_head hot_debugfs_vol_data_list;
+/* lock for debugfs_vol_data_list */
+static spinlock_t hot_debugfs_data_list_lock;
+/* pointer to top level debugfs dentry */
+static struct dentry *hot_debugfs_root_dentry;
+
 /*
  * Initialize the inode tree. Should be called for each new inode
  * access or other user of the hot_inode interface.
@@ -706,6 +715,451 @@ static void hot_wq_exit(struct workqueue_struct *wq)
destroy_workqueue(wq);
 }
 
+static int hot_debugfs_copy(struct debugfs_vol_data *data, char *msg, int len)
+{
+   struct lstring *debugfs_log = data->debugfs_log;
+   uint new_log_alloc_size;
+   char *new_log;
+   static char err_msg[] = "No more memory!\n";
+
+   if (len >= data->log_alloc_size - debugfs_log->len) {
+   /*
+* Not enough room in the log buffer for the new message.
+* Allocate a bigger buffer.
+*/
+   new_log_alloc_size = data->log_alloc_size + LOG_PAGE_SIZE;
+   new_log = vmalloc(new_log_alloc_size);
+
+   if (new_log) {
+   memcpy(new_log, debugfs_log->str, debugfs_log->len);
+   memset(new_log + debugfs_log->len, 0,
+   new_log_alloc_size - debugfs_log->len);
+   vfree(debugfs_log->str);
+   debugfs_log->str = new_log;
+   data->log_alloc_size = new_log_alloc_size;
+   } else {
+   WARN_ON(1);
+   if (data->log_alloc_size - debugfs_log->len) {
+   strlcpy(debugfs_log->str +
+   debugfs_log->len,
+   err_msg,
+   data->log_alloc_size - debugfs_log->len);
+   debugfs_log->len +=
+   min((typeof(debugfs_log->len))
+   sizeof(err_msg),
+   ((typeof(debugfs_log->len))
+   data->log_alloc_size - debugfs_log->len));
+   }
+   return 0;
+   }
+   }
+
+   memcpy(debugfs_log->str + debugfs_log->len, data->log_work_buff, len);
+   debugfs_log->len += (unsigned long) len;
+
+   return len;
+}
+
+/* Returns the number of bytes written to the log. */
+static int hot_debugfs_log(struct debugfs_vol_data *data, const char *fmt, ...)
+{
+   struct lstring *debugfs_log = data->debugfs_log;
+   va_list args;
+   int len;
+   static char trunc_msg[] =
+   "The next message has been truncated.\n";
+
+   if (debugfs_log->str == NULL)
+   return -1;
+
+   spin_lock(&data->log_lock);
+
+   va_start(args, fmt);
+   len = vsnprintf(data->log_work_buff,
+   sizeof(data->log_work_buff), fmt, args);
+   va_end(args);
+
+   if (len >= sizeof(data->log_work_buff)) {
+   hot_debugfs_copy(data, trunc_msg, sizeof(trunc_msg));
+   }
+
+   len = hot_debugfs_copy(data, data->log_work_buff, len);
+   spin_unlock(&data->log_lock);
+
+   return len;
+}
+
+/* initialize a log corresponding to a fs volume */
+static int hot_debugfs_log_init(struct debugfs_vol_data *data)
+{
+   int err = 0;
+   struct lstring *debugfs_log = data->debugfs_log;
+
+   spin_lock(&data->log_lock);
+   debugfs_log->str = vmalloc(INIT_LOG_ALLOC_SIZE);
+   if (debugfs_log->str) {
+   memset(debugfs_log->str, 0, INIT_LOG_ALLOC_SIZE);
+   data->log_alloc_size = INIT_LOG_ALLOC_SIZE;
+   } else {
+   err = -ENOMEM;
+   }
+   spin_unlock(&data->log_lock);
+
+   return err;
+}
+
+/* free a log corresponding to a fs volume */
+static void hot_debugfs_log_exit(struct debugfs_vol_data *data)

[RFC v3 13/13] vfs: add documentation

2012-10-10 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 Documentation/filesystems/00-INDEX |2 +
 Documentation/filesystems/hot_tracking.txt |  165 
 2 files changed, 167 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/hot_tracking.txt

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 8c624a1..b68bdff 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -118,3 +118,5 @@ xfs.txt
- info and mount options for the XFS filesystem.
 xip.txt
- info on execute-in-place for file mappings.
+hot_tracking.txt
+   - info on hot data tracking in VFS layer
diff --git a/Documentation/filesystems/hot_tracking.txt b/Documentation/filesystems/hot_tracking.txt
new file mode 100644
index 0000000..34dc232
--- /dev/null
+++ b/Documentation/filesystems/hot_tracking.txt
@@ -0,0 +1,165 @@
+Hot Data Tracking
+
+September, 2012                Zhi Yong Wu wu...@linux.vnet.ibm.com
+
+CONTENTS
+
+1. Introduction
+2. Motivation
+3. The Design
+4. Git Development Tree
+5. Usage Example
+
+
+1. Introduction
+
+  The feature adds experimental support for tracking data temperature
+information in the VFS layer.  Essentially, this means maintaining some key
+stats (like number of reads/writes, last read/write time, frequency of
+reads/writes), then distilling those numbers down to a single
+temperature value that reflects what data is hot, and using that
+temperature to move data to SSDs.
+
+  The long-term goal of the feature is to allow some FSs,
+e.g. Btrfs, to intelligently utilize SSDs in a heterogeneous volume.
+Incidentally, this project has been motivated by
+the Project Ideas page on the Btrfs wiki.
+
+  Of course, users are warned not to run this code outside of development
+environments. These patches are EXPERIMENTAL, and as such they might eat
+your data and/or memory. That said, the code should be relatively safe
+when the hottrack mount option is disabled.
+
+2. Motivation
+
+  The overall goal of enabling hot data relocation to SSD has been
+motivated by the Project Ideas page on the Btrfs wiki at
+https://btrfs.wiki.kernel.org/index.php/Project_ideas.
+The work is divided into two steps: VFS provides the hot data tracking
+function, while a specific FS provides the hot data relocation function.
+So as the first step of this goal, it is hoped that the patchset
+for hot data tracking will eventually mature into VFS.
+
+  This is essentially the traditional cache argument: SSD is fast and
+expensive; HDD is cheap but slow. ZFS, for example, can already take
+advantage of SSD caching. Btrfs should also be able to take advantage of
+hybrid storage without many broad, sweeping changes to existing code.
+
+
+3. The Design
+
+These include the following parts:
+
+* Hooks in existing vfs functions to track data access frequency
+
+* New rbtrees for tracking access frequency of inodes and sub-file
+ranges (hot_rb.c)
+The relationship between super_block and rbtree is as below:
+super_block->s_hotinfo.hot_inode_tree
+In include/linux/fs.h, one struct hot_info s_hotinfo is added to
+super_block struct. Each FS instance can find hot tracking info
+s_hotinfo via its super_block. This hot_info stores a lot of hot
+tracking info such as hot_inode_tree, inode and range hash list, etc.
+
+* A hash list for indexing data by its temperature (hot_hash.c)
+
+* A debugfs interface for dumping data from the rbtrees (hot_debugfs.c)
+
+* A background kthread for updating inode heat info
+
+* Mount options for enabling temperature tracking (-o hottrack,
+disabled by default) (hot_track.c)
+* An ioctl to retrieve the frequency information collected for a certain
+file
+* Ioctls to enable/disable frequency tracking per inode.
+
+Let us see their relationship as below:
+
+* hot_info.hot_inode_tree indexes hot_inode_items, one per inode
+
+* hot_inode_item contains access frequency data for that inode
+
+* hot_inode_item holds a heat hash node to index the access
+frequency data for that inode
+
+* hot_inode_item.hot_range_tree indexes hot_range_items for that inode
+
+* hot_range_item contains access frequency data for that range
+
+* hot_range_item holds a heat hash node to index the access
+frequency data for that range
+
+* hot_info.heat_inode_map indexes per-inode heat hash nodes
+
+* hot_info.heat_range_map indexes per-range heat hash nodes
+
+  How about some ascii art? :) Just looking at the hot inode item case
+(the range item case is the same pattern, though), we have:
+
+heat_inode_map   hot_inode_tree
+| |
+| V
+|   +---hot_comm_item+
+|   |   frequency data   |
++---+   |list_head   |
+|   V^ | V
+| 

[RFC v3 07/13] vfs: add function for updating map arrays

2012-10-10 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
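As a reading aid for hot_temperature_calculate(), the sketch below
isolates three building blocks in userspace C: the direction-controlled
shift, the clamp-and-weight step applied to each 64-bit criterion, and
the final normalization of the 32-bit sum to a bucket index. The
sketch_* names are hypothetical, and SKETCH_HEAT_MAP_BITS is an assumed
value because the real HEAT_MAP_BITS lives in a header not shown in
this mail. The clamp to the maximum u32 value follows the avr/avw
branches only.

```c
#include <stdint.h>

#define SKETCH_HEAT_MAP_BITS 8 /* assumed; the real constant is HEAT_MAP_BITS */

/* Mirror of hot_raw_shift(): a nonzero dir shifts left, zero shifts right. */
static uint64_t sketch_raw_shift(uint64_t counter, uint32_t bits, int dir)
{
    return dir ? counter << bits : counter >> bits;
}

/*
 * Clamp a 64-bit criterion to 32 bits, then scale it by
 * 2^-(3 - coeff_power). Assumes coeff_power <= 3.
 */
static uint32_t sketch_weight(uint64_t heat, uint32_t coeff_power)
{
    if (heat >= sketch_raw_shift(1, 32, 1))
        heat = UINT32_MAX;
    return (uint32_t)sketch_raw_shift(heat, 3 - coeff_power, 0);
}

/* Map a 32-bit heat sum to a bucket index below 2^SKETCH_HEAT_MAP_BITS. */
static uint32_t sketch_bucket(uint64_t heat_sum)
{
    return (uint32_t)(heat_sum >> (32 - SKETCH_HEAT_MAP_BITS));
}
```

With 8 map bits, only the top byte of the 32-bit sum selects the
bucket, so small changes in the raw counters rarely move an item
between lists.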
 fs/hot_tracking.c |  153 +
 fs/hot_tracking.h |   60 +
 2 files changed, 213 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 08c42c5..717faa7 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -376,6 +376,159 @@ inline void hot_update_freqs(struct hot_info *root,
hot_inode_item_put(he);
 }
 
+static u64 hot_raw_shift(u64 counter, u32 bits, bool dir)
+{
+   if (dir)
+   return counter << bits;
+   else
+   return counter >> bits;
+}
+
+/*
+ * hot_temperature_calculate() is responsible for distilling the six heat
+ * criteria (which are described in detail in hot_tracking.h) down into a
+ * single temperature value for the data, which is an integer between 0
+ * and HEAT_MAX_VALUE.
+ *
+ * To accomplish this, the raw values from the hot_freq_data structure
+ * are shifted various ways in order to make the temperature calculation more
+ * or less sensitive to each value.
+ *
+ * Once this calibration has happened, we do some additional normalization and
+ * make sure that everything fits nicely in a u32. From there, we take a very
+ * rudimentary kind of average of each of the values, where the *_COEFF_POWER
+ * values act as weights for the average.
+ *
+ * Finally, we use the HEAT_HASH_BITS value, which determines the size of the
+ * heat list array, to normalize the temperature to the proper granularity.
+ */
+int hot_temperature_calculate(struct hot_freq_data *freq_data)
+{
+   u64 result = 0;
+
+   struct timespec ckt = current_kernel_time();
+   u64 cur_time = timespec_to_ns(&ckt);
+
+   u32 nrr_heat = (u32)hot_raw_shift((u64)freq_data->nr_reads,
+   NRR_MULTIPLIER_POWER, true);
+   u32 nrw_heat = (u32)hot_raw_shift((u64)freq_data->nr_writes,
+   NRW_MULTIPLIER_POWER, true);
+
+   u64 ltr_heat =
+   hot_raw_shift((cur_time - timespec_to_ns(&freq_data->last_read_time)),
+   LTR_DIVIDER_POWER, false);
+   u64 ltw_heat =
+   hot_raw_shift((cur_time - timespec_to_ns(&freq_data->last_write_time)),
+   LTW_DIVIDER_POWER, false);
+
+   u64 avr_heat =
+   hot_raw_shift((((u64) -1) - freq_data->avg_delta_reads),
+   AVR_DIVIDER_POWER, false);
+   u64 avw_heat =
+   hot_raw_shift((((u64) -1) - freq_data->avg_delta_writes),
+   AVW_DIVIDER_POWER, false);
+
+   /* ltr_heat is now guaranteed to be u32 safe */
+   if (ltr_heat >= hot_raw_shift((u64) 1, 32, true))
+   ltr_heat = 0;
+   else
+   ltr_heat = hot_raw_shift((u64) 1, 32, true) - ltr_heat;
+
+   /* ltw_heat is now guaranteed to be u32 safe */
+   if (ltw_heat >= hot_raw_shift((u64) 1, 32, true))
+   ltw_heat = 0;
+   else
+   ltw_heat = hot_raw_shift((u64) 1, 32, true) - ltw_heat;
+
+   /* avr_heat is now guaranteed to be u32 safe */
+   if (avr_heat >= hot_raw_shift((u64) 1, 32, true))
+   avr_heat = (u32) -1;
+
+   /* avw_heat is now guaranteed to be u32 safe */
+   if (avw_heat >= hot_raw_shift((u64) 1, 32, true))
+   avw_heat = (u32) -1;
+
+   nrr_heat = (u32)hot_raw_shift((u64)nrr_heat,
+   (3 - NRR_COEFF_POWER), false);
+   nrw_heat = (u32)hot_raw_shift((u64)nrw_heat,
+   (3 - NRW_COEFF_POWER), false);
+   ltr_heat = hot_raw_shift(ltr_heat, (3 - LTR_COEFF_POWER), false);
+   ltw_heat = hot_raw_shift(ltw_heat, (3 - LTW_COEFF_POWER), false);
+   avr_heat = hot_raw_shift(avr_heat, (3 - AVR_COEFF_POWER), false);
+   avw_heat = hot_raw_shift(avw_heat, (3 - AVW_COEFF_POWER), false);
+
+   result = nrr_heat + nrw_heat + (u32) ltr_heat +
+   (u32) ltw_heat + (u32) avr_heat + (u32) avw_heat;
+
+   return result >> (32 - HEAT_MAP_BITS);
+}
+
+/*
+ * Calculate a new temperature and, if necessary,
+ * move the list_head corresponding to this inode or range
+ * to the proper list with the new temperature
+ */
+static void hot_map_array_update(struct hot_freq_data *freq_data,
+   struct hot_info *root)
+{
+   struct hot_map_head *buckets, *cur_bucket;
+   struct hot_comm_item *comm_item;
+   struct hot_inode_item *he;
+   struct hot_range_item *hr;
+   u32 temperature = 0;
+
+   comm_item = container_of(freq_data,
+   struct hot_comm_item, hot_freq_data);
+
+   if (freq_data->flags & FREQ_DATA_TYPE_INODE) {
+   he = container_of(comm_item,
+   struct hot_inode_item, hot_inode);
+   buckets = root->heat_inode_map;
+
+   spin_lock(&he->hot_inode.lock);
+   temperature = hot_temperature_calculate(freq_data);
+ 

[RFC v3 09/13] vfs: add one wq to update map info periodically

2012-10-10 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add a per-superblock workqueue and a work_struct
 to run periodic work to update map info on each superblock.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
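The periodic work in this patch walks the inode tree in small batches
and resumes each pass at the last key seen plus one. The userspace
sketch below shows that cursor pattern over a sorted array. The
sketch_gang_lookup() helper is a hypothetical stand-in for
radix_tree_gang_lookup(), not the kernel function, and the batch size
of 8 matches the hi_nodes[8] array in the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BATCH 8 /* same batch size as hi_nodes[8] */

/*
 * Stand-in for a gang lookup over a sorted key array: copy up to max
 * keys that are >= first into out and return how many were copied.
 */
static size_t sketch_gang_lookup(const uint64_t *keys, size_t nkeys,
                                 uint64_t first, uint64_t *out, size_t max)
{
    size_t n = 0;

    for (size_t i = 0; i < nkeys && n < max; i++)
        if (keys[i] >= first)
            out[n++] = keys[i];
    return n;
}

/* Visit every key in batches, resuming each pass at last key + 1. */
static size_t sketch_walk(const uint64_t *keys, size_t nkeys)
{
    uint64_t batch[SKETCH_BATCH];
    uint64_t next = 0;
    size_t visited = 0, n;

    while ((n = sketch_gang_lookup(keys, nkeys, next, batch,
                                   SKETCH_BATCH)) != 0) {
        visited += n;
        next = batch[n - 1] + 1; /* cursor for the next pass */
    }
    return visited;
}
```

The cursor arithmetic assumes the largest key is below the maximum
value of its type; otherwise the increment wraps to zero and the walk
would start over.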
 fs/hot_tracking.c|   94 ++
 fs/hot_tracking.h|3 +
 include/linux/hot_tracking.h |2 +
 3 files changed, 99 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index a8dc599..f333c47 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -15,6 +15,8 @@
 #include <linux/module.h>
 #include <linux/spinlock.h>
 #include <linux/hardirq.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
 #include <linux/fs.h>
 #include <linux/blkdev.h>
 #include <linux/types.h>
@@ -623,6 +625,88 @@ static void hot_map_array_exit(struct hot_info *root)
 }
 
 /*
+ * Update temperatures for each hot inode item and
+ * hot range item for aging purposes
+ */
+static void hot_temperature_update_work(struct work_struct *work)
+{
+   struct hot_update_work *hot_work =
+   container_of(work, struct hot_update_work, work);
+   struct hot_info *root = hot_work->hot_info;
+   struct hot_inode_item *hi_nodes[8];
+   unsigned long delay = HZ * HEAT_UPDATE_DELAY;
+   u64 ino = 0;
+   int i, n;
+
+   do {
+   while (1) {
+   spin_lock(&root->lock);
+   n = radix_tree_gang_lookup(&root->hot_inode_tree,
+  (void **)hi_nodes, ino,
+  ARRAY_SIZE(hi_nodes));
+   if (!n) {
+   spin_unlock(&root->lock);
+   break;
+   }
+
+   ino = hi_nodes[n - 1]->i_ino + 1;
+   for (i = 0; i < n; i++) {
+   kref_get(&hi_nodes[i]->hot_inode.refs);
+   hot_map_array_update(
+   &hi_nodes[i]->hot_inode.hot_freq_data, root);
+   hot_range_update(hi_nodes[i], root);
+   hot_inode_item_put(hi_nodes[i]);
+   }
+   spin_unlock(&root->lock);
+   }
+
+   if (unlikely(freezing(current))) {
+   __refrigerator(true);
+   } else {
+   set_current_state(TASK_INTERRUPTIBLE);
+   if (!kthread_should_stop()) {
+   schedule_timeout(delay);
+   }
+   __set_current_state(TASK_RUNNING);
+   }
+   } while (!kthread_should_stop());
+}
+
+static int hot_wq_init(struct hot_info *root)
+{
+   struct hot_update_work *hot_work;
+   int ret = 0;
+
+   root->update_wq = alloc_workqueue(
+   "hot_temperature_update", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+   if (!root->update_wq) {
+   printk(KERN_ERR "%s: failed to create "
+   "temperature update workqueue\n",
+   __func__);
+   return 1;
+   }
+
+   hot_work = kmalloc(sizeof(*hot_work), GFP_NOFS);
+   if (hot_work) {
+   hot_work->hot_info = root;
+   INIT_WORK(&hot_work->work, hot_temperature_update_work);
+   queue_work(root->update_wq, &hot_work->work);
+   } else {
+   printk(KERN_ERR "%s: failed to create update work\n",
+   __func__);
+   ret = 1;
+   }
+
+   return ret;
+}
+
+static void hot_wq_exit(struct workqueue_struct *wq)
+{
+   flush_workqueue(wq);
+   destroy_workqueue(wq);
+}
+
+/*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
 static int __init hot_cache_init(void)
@@ -686,10 +770,19 @@ void hot_track_init(struct super_block *sb)
hot_inode_tree_init(root);
hot_map_array_init(root);
 
+   err = hot_wq_init(root);
+   if (err)
+   goto failed_wq;
+
printk(KERN_INFO "vfs: turning on hot data tracking\n");
 
return;
 
+failed_wq:
+   hot_map_array_exit(root);
+   hot_inode_tree_exit(root);
+   sb->hot_flags &= ~MS_HOT_TRACKING;
+   kfree(root);
 failed_root:
hot_cache_exit();
 }
@@ -698,6 +791,7 @@ void hot_track_exit(struct super_block *sb)
 {
struct hot_info *root = global_hot_tracking_info;
 
+   hot_wq_exit(root->update_wq);
hot_map_array_exit(root);
hot_inode_tree_exit(root);
sb->hot_flags &= ~MS_HOT_TRACKING;
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index d19e64a..7a79a6d 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -36,6 +36,9 @@
  */
 #define TIME_TO_KICK 400
 
+/* set how often to update temperatures (seconds) */
+#define HEAT_UPDATE_DELAY 400
+
 /*
  * The following comments explain what exactly comprises a unit of heat.
  

[RFC v3 04/13] vfs: add function for collecting raw access info

2012-10-10 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add some utils helpers to update access frequencies
for one file or its range.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
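The running average kept by hot_average_update() can be tried out in
userspace with plain integers. The sketch below follows the arithmetic
as reconstructed from the patch and its comment: the new delta is
right-shifted by FREQ_POWER, folded into the old average scaled by
(2^FREQ_POWER - 1), and the sum is right-shifted again. The function
name is hypothetical, and the shift directions are an assumption based
on the comment about bits being right-shifted out of the timestamp.

```c
#include <stdint.h>

#define FREQ_POWER 4 /* value used in fs/hot_tracking.h in this series */

/*
 * old_avg: previous average delta between accesses (ns).
 * delta_ns: time since the previous access (ns).
 * Returns the updated average, truncated by the integer shifts.
 */
static uint64_t sketch_average_update(uint64_t old_avg, uint64_t delta_ns)
{
    uint64_t new_delta = delta_ns >> FREQ_POWER;
    uint64_t new_avg = (old_avg << FREQ_POWER) - old_avg + new_delta;

    return new_avg >> FREQ_POWER;
}
```

For instance, an old average of 1600 ns combined with a new delta of
1600 ns yields (25600 - 1600 + 100) >> 4 = 1506 ns, so the result is
dominated by the old average and loses some precision to the shifts.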
 fs/hot_tracking.c|  190 ++
 fs/hot_tracking.h|   12 +++
 include/linux/hot_tracking.h |4 +
 3 files changed, 206 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 5fd993e..86c87c7 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -174,6 +174,196 @@ static void hot_inode_tree_exit(struct hot_info *root)
}
 }
 
+struct hot_inode_item
+*hot_inode_item_find(struct hot_info *root, u64 ino)
+{
+   struct hot_inode_item *he;
+   int ret;
+
+again:
+   spin_lock(&root->lock);
+   he = radix_tree_lookup(&root->hot_inode_tree, ino);
+   if (he) {
+   kref_get(&he->hot_inode.refs);
+   spin_unlock(&root->lock);
+   return he;
+   }
+   spin_unlock(&root->lock);
+
+   he = kmem_cache_zalloc(hot_inode_item_cachep,
+   GFP_KERNEL | GFP_NOFS);
+   if (!he)
+   return ERR_PTR(-ENOMEM);
+
+   hot_inode_item_init(he, ino, &root->hot_inode_tree);
+
+   ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
+   if (ret) {
+   kmem_cache_free(hot_inode_item_cachep, he);
+   return ERR_PTR(ret);
+   }
+
+   spin_lock(&root->lock);
+   ret = radix_tree_insert(&root->hot_inode_tree, ino, he);
+   if (ret == -EEXIST) {
+   kmem_cache_free(hot_inode_item_cachep, he);
+   spin_unlock(&root->lock);
+   radix_tree_preload_end();
+   goto again;
+   }
+   spin_unlock(&root->lock);
+   radix_tree_preload_end();
+
+   kref_get(&he->hot_inode.refs);
+   return he;
+}
+
+static struct hot_range_item
+*hot_range_item_find(struct hot_inode_item *he,
+   u32 start)
+{
+   struct hot_range_item *hr;
+   int ret;
+
+again:
+   spin_lock(&he->lock);
+   hr = radix_tree_lookup(&he->hot_range_tree, start);
+   if (hr) {
+   kref_get(&hr->hot_range.refs);
+   spin_unlock(&he->lock);
+   return hr;
+   }
+   spin_unlock(&he->lock);
+
+   hr = kmem_cache_zalloc(hot_range_item_cachep,
+   GFP_KERNEL | GFP_NOFS);
+   if (!hr)
+   return ERR_PTR(-ENOMEM);
+
+   hot_range_item_init(hr, start, he);
+
+   ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
+   if (ret) {
+   kmem_cache_free(hot_range_item_cachep, hr);
+   return ERR_PTR(ret);
+   }
+
+   spin_lock(&he->lock);
+   ret = radix_tree_insert(&he->hot_range_tree, start, hr);
+   if (ret == -EEXIST) {
+   kmem_cache_free(hot_range_item_cachep, hr);
+   spin_unlock(&he->lock);
+   radix_tree_preload_end();
+   goto again;
+   }
+   spin_unlock(&he->lock);
+   radix_tree_preload_end();
+
+   kref_get(&hr->hot_range.refs);
+   return hr;
+}
+
+/*
+ * This function does the actual work of updating the frequency numbers,
+ * whatever they turn out to be. FREQ_POWER determines how many atime
+ * deltas we keep track of (as a power of 2). So, setting it to anything above
+ * 16ish is probably overkill. Also, the higher the power, the more bits get
+ * right shifted out of the timestamp, reducing precision, so take note of that
+ * as well.
+ *
+ * The caller should have already locked freq_data's parent's spinlock.
+ *
+ * FREQ_POWER, defined immediately below, determines how heavily to weight
+ * the current frequency numbers against the newest access. For example, a
+ * value of 4 means that the new access information will be weighted 1/16th
+ * (ie 2^-4) as heavily as the existing frequency info. In essence, this is
+ * a kludged-together version of a weighted average, since we can't afford
+ * to keep all of the information that it would take to get a _real_
+ * weighted average.
+ */
+static u64 hot_average_update(struct timespec old_atime,
+   struct timespec cur_time, u64 old_avg)
+{
+   struct timespec delta_ts;
+   u64 new_avg;
+   u64 new_delta;
+
+   delta_ts = timespec_sub(cur_time, old_atime);
+   new_delta = timespec_to_ns(&delta_ts) >> FREQ_POWER;
+
+   new_avg = (old_avg << FREQ_POWER) - old_avg + new_delta;
+   new_avg = new_avg >> FREQ_POWER;
+
+   return new_avg;
+}
+
+static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write)
+{
+   struct timespec cur_time = current_kernel_time();
+
+   if (write) {
+   freq_data->nr_writes += 1;
+   freq_data->avg_delta_writes = hot_average_update(
+   freq_data->last_write_time,
+   cur_time,
+   freq_data->avg_delta_writes);
+   

[RFC v3 02/13] vfs: introduce private radix tree structures

2012-10-10 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  One root structure, hot_info, is defined and hooked
up in super_block. It holds the radix tree root, the
hash list root, and some other information.
  Add a hot_inode_tree struct to keep track of
frequently accessed files, keyed by {inode, offset}.
Trees contain hot_inode_items representing those files
and ranges.
  Having these trees means that vfs can quickly determine the
temperature of some data by doing some calculations on the
hot_freq_data struct that hangs off of the tree item.
  Define two items, hot_inode_item and hot_range_item.
The former represents one tracked file and keeps track
of its access frequency and the tree of ranges in this
file, while the latter represents a file range of one
inode.
  Each of the two structures contains a hot_freq_data
struct with its frequency of access metrics (number of
{reads, writes}, last {read,write} time, frequency of
{reads,writes}).
  Also, each hot_inode_item contains one hot_range_tree
struct which is keyed by {inode, offset, length}
and used to keep track of all the ranges in this file.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
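The nesting described above (frequency data inside a common item,
inside an inode or range item) is what lets the code recover the outer
item from a pointer to its frequency data. The userspace sketch below
illustrates that layout with a local container_of-style macro. All
sketch_* types and their field lists are simplified assumptions; the
real definitions are in include/linux/hot_tracking.h, which is not
shown in full here.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's container_of(). */
#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified frequency metrics (field list is an assumption). */
struct sketch_freq_data {
    uint64_t nr_reads, nr_writes;
    uint64_t avg_delta_reads, avg_delta_writes;
};

/* Part shared by inode items and range items. */
struct sketch_comm_item {
    struct sketch_freq_data hot_freq_data;
    int refs;
};

/* One tracked file. */
struct sketch_inode_item {
    struct sketch_comm_item hot_inode;
    uint64_t i_ino;
};

/* One tracked range inside a file. */
struct sketch_range_item {
    struct sketch_comm_item hot_range;
    struct sketch_inode_item *hot_inode; /* owning file */
    uint32_t start, len;
};

/* Walk outward from frequency data to the inode item that embeds it. */
static struct sketch_inode_item *sketch_inode_of(struct sketch_freq_data *fd)
{
    struct sketch_comm_item *ci =
        sketch_container_of(fd, struct sketch_comm_item, hot_freq_data);

    return sketch_container_of(ci, struct sketch_inode_item, hot_inode);
}
```

The same two-step walk applies to range items, with hot_range taking
the place of hot_inode in the second step.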
 fs/Makefile  |2 +-
 fs/hot_tracking.c|  138 ++
 fs/hot_tracking.h|   26 
 include/linux/hot_tracking.h |   74 ++
 4 files changed, 239 insertions(+), 1 deletions(-)
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h

diff --git a/fs/Makefile b/fs/Makefile
index 1d7af79..f966dea 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=  open.o read_write.o file_table.o super.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o drop_caches.o splice.o sync.o utimes.o \
-   stack.o fs_struct.o statfs.o
+   stack.o fs_struct.o statfs.o hot_tracking.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=   buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
new file mode 100644
index 000..634ec03
--- /dev/null
+++ b/fs/hot_tracking.c
@@ -0,0 +1,138 @@
+/*
+ * fs/hot_tracking.c
+ *
+ * Copyright (C) 2012 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu wu...@linux.vnet.ibm.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#include <linux/list.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/hardirq.h>
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/types.h>
+#include <linux/limits.h>
+#include "hot_tracking.h"
+
+/* kmem_cache pointers for slab caches */
+static struct kmem_cache *hot_inode_item_cachep;
+static struct kmem_cache *hot_range_item_cachep;
+
+/*
+ * Initialize the inode tree. Should be called for each new inode
+ * access or other user of the hot_inode interface.
+ */
+static void hot_inode_tree_init(struct hot_info *root)
+{
+   INIT_RADIX_TREE(&root->hot_inode_tree, GFP_ATOMIC);
+   spin_lock_init(&root->lock);
+}
+
+/*
+ * Initialize the hot range tree. Should be called for each new inode
+ * access or other user of the hot_range interface.
+ */
+void hot_range_tree_init(struct hot_inode_item *he)
+{
+   INIT_RADIX_TREE(&he->hot_range_tree, GFP_ATOMIC);
+   spin_lock_init(&he->lock);
+}
+
+/*
+ * Initialize a new hot_range_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using free_range_item()
+ */
+static void hot_range_item_init(struct hot_range_item *hr, u32 start,
+   struct hot_inode_item *he)
+{
+   hr->start = start;
+   hr->len = RANGE_SIZE;
+   hr->hot_inode = he;
+   kref_init(&hr->hot_range.refs);
+   spin_lock_init(&hr->hot_range.lock);
+   hr->hot_range.hot_freq_data.avg_delta_reads = (u64) -1;
+   hr->hot_range.hot_freq_data.avg_delta_writes = (u64) -1;
+   hr->hot_range.hot_freq_data.flags = FREQ_DATA_TYPE_RANGE;
+}
+
+/*
+ * Initialize a new hot_inode_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using hot_free_inode_item()
+ */
+static void hot_inode_item_init(struct hot_inode_item *he, u64 ino,
+   struct radix_tree_root *hot_inode_tree)
+{
+   he->i_ino = ino;
+   he->hot_inode_tree = hot_inode_tree;
+   kref_init(&he->hot_inode.refs);
+   spin_lock_init(&he->hot_inode.lock);
+   he->hot_inode.hot_freq_data.avg_delta_reads = (u64) -1;
+   he->hot_inode.hot_freq_data.avg_delta_writes = (u64) -1;
+   he->hot_inode.hot_freq_data.flags = FREQ_DATA_TYPE_INODE;
+   hot_range_tree_init(he);
+}
+
+/*
+ * 

[RFC v3 03/13] vfs: Initialize and free main data structures

2012-10-10 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add an initialization function to create some
key data structures when hot tracking is enabled,
and clean them up when hot tracking is disabled.
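
  The cleanup side relies on reference counting: an item is freed by
the release callback only when the last reference is dropped. The
following is a minimal userspace sketch of that put/release pattern,
not the kernel kref API. All *_sketch names are hypothetical, and the
counter is a plain int (not atomic), so it is only valid
single-threaded.

```c
#include <stddef.h>
#include <stdlib.h>

struct ref_sketch {
	int count;   /* plain int: illustration only, not thread-safe */
};

struct item_sketch {
	int id;
	struct ref_sketch refs;
	int *freed;  /* set to 1 by the release callback, for demonstration */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void item_sketch_release(struct ref_sketch *r)
{
	struct item_sketch *it =
		container_of_sketch(r, struct item_sketch, refs);

	if (it->freed)
		*it->freed = 1;
	free(it);
}

/* A new item starts with a reference count of one. */
static struct item_sketch *item_sketch_new(int id, int *freed)
{
	struct item_sketch *it = malloc(sizeof(*it));

	if (!it)
		return NULL;
	it->id = id;
	it->refs.count = 1;
	it->freed = freed;
	return it;
}

static void item_sketch_get(struct item_sketch *it)
{
	it->refs.count++;
}

/* Returns 1 if this put dropped the last reference and freed the item. */
static int item_sketch_put(struct item_sketch *it)
{
	if (--it->refs.count == 0) {
		item_sketch_release(&it->refs);
		return 1;
	}
	return 0;
}
```

  The container_of step mirrors how the release callbacks in this patch
walk from the embedded kref back to the enclosing item before freeing
it.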

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/btrfs/super.c |8 +++
 fs/hot_tracking.c|  118 ++
 include/linux/fs.h   |3 +
 include/linux/hot_tracking.h |4 ++
 4 files changed, 133 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 00be9e3..da4438f 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -294,6 +294,10 @@ static void btrfs_put_super(struct super_block *sb)
 * last process that kept it busy.  Or segfault in the aforementioned
 * process...  Whom would you report that to?
 */
+
+   /* Hot data tracking */
+   if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
+   hot_track_exit(sb);
 }
 
 enum {
@@ -828,6 +832,10 @@ static int btrfs_fill_super(struct super_block *sb,
goto fail_close;
}
 
+   if (btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
+   hot_track_init(sb);
+   }
+
save_mount_options(sb, data);
cleancache_init_fs(sb);
sb->s_flags |= MS_ACTIVE;
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 634ec03..5fd993e 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -21,6 +21,8 @@
 #include <linux/limits.h>
 #include "hot_tracking.h"
 
+struct hot_info *global_hot_tracking_info;
+
 /* kmem_cache pointers for slab caches */
 static struct kmem_cache *hot_inode_item_cachep;
 static struct kmem_cache *hot_range_item_cachep;
@@ -81,6 +83,97 @@ static void hot_inode_item_init(struct hot_inode_item *he, u64 ino,
hot_range_tree_init(he);
 }
 
+static void hot_range_item_free(struct kref *kref)
+{
+   struct hot_comm_item *comm_item = container_of(kref,
+   struct hot_comm_item, refs);
+   struct hot_range_item *hr = container_of(comm_item,
+   struct hot_range_item, hot_range);
+
+   radix_tree_delete(&hr->hot_inode->hot_range_tree, hr->start);
+   kmem_cache_free(hot_range_item_cachep, hr);
+}
+
+/*
+ * Drops the reference out on hot_range_item by one
+ * and free the structure
+ * if the reference count hits zero
+ */
+static void hot_range_item_put(struct hot_range_item *hr)
+{
+   kref_put(&hr->hot_range.refs, hot_range_item_free);
+}
+
+/* Frees the entire hot_range_tree. */
+static void hot_range_tree_free(struct hot_inode_item *he)
+{
+   struct hot_range_item *hr_nodes[8];
+   u32 start = 0;
+   int i, n;
+
+   while (1) {
+   spin_lock(&he->lock);
+   n = radix_tree_gang_lookup(&he->hot_range_tree,
+   (void **)hr_nodes, start,
+   ARRAY_SIZE(hr_nodes));
+   if (!n) {
+   spin_unlock(&he->lock);
+   break;
+   }
+
+   start = hr_nodes[n - 1]->start + 1;
+   for (i = 0; i < n; i++)
+   hot_range_item_put(hr_nodes[i]);
+   spin_unlock(&he->lock);
+   }
+}
+
+static void hot_inode_item_free(struct kref *kref)
+{
+   struct hot_comm_item *comm_item = container_of(kref,
+   struct hot_comm_item, refs);
+   struct hot_inode_item *he = container_of(comm_item,
+   struct hot_inode_item, hot_inode);
+
+   hot_range_tree_free(he);
+   radix_tree_delete(he->hot_inode_tree, he->i_ino);
+   kmem_cache_free(hot_inode_item_cachep, he);
+}
+
+/*
+ * Drops the reference out on hot_inode_item by one
+ * and free the structure
+ * if the reference count hits zero
+ */
+void hot_inode_item_put(struct hot_inode_item *he)
+{
+   kref_put(&he->hot_inode.refs, hot_inode_item_free);
+}
+
+/* Frees the entire hot_inode_tree. */
+static void hot_inode_tree_exit(struct hot_info *root)
+{
+   struct hot_inode_item *hi_nodes[8];
+   u64 ino = 0;
+   int i, n;
+
+   while (1) {
+   spin_lock(&root->lock);
+   n = radix_tree_gang_lookup(&root->hot_inode_tree,
+  (void **)hi_nodes, ino,
+  ARRAY_SIZE(hi_nodes));
+   if (!n) {
+   spin_unlock(&root->lock);
+   break;
+   }
+
+   ino = hi_nodes[n - 1]->i_ino + 1;
+   for (i = 0; i < n; i++)
+   hot_inode_item_put(hi_nodes[i]);
+   spin_unlock(&root->lock);
+   }
+}
+
 /*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
@@ -122,6 +215,7 @@ static inline void hot_cache_exit(void)
  */
 void hot_track_init(struct super_block *sb)
 {
+   struct hot_info *root;
int err;
 
err = hot_cache_init();
@@ -130,9 +224,33 @@ void hot_track_init(struct 

[RFC v3 05/13] vfs: add two map arrays

2012-10-10 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add two map arrays, each of which contains
a number of lists and is used to efficiently
look up the data temperature of a file or its
ranges.
  In each list of the map arrays, the array node
keeps track of temperature info.
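
  As a rough illustration of the idea, the sketch below models one map
array in userspace: 2^8 list heads, one per temperature bucket, with
an item moved between buckets as its temperature changes. The
*_sketch names are hypothetical, and the bucket formula (the top 8
bits of a 32-bit temperature) is an assumption that this part of the
patch does not show.

```c
#include <stdint.h>

#define SKETCH_MAP_BITS 8
#define SKETCH_MAP_SIZE (1 << SKETCH_MAP_BITS)

/* Minimal circular doubly linked list node. */
struct node_sketch {
	struct node_sketch *prev, *next;
};

struct map_head_sketch {
	struct node_sketch list;
	uint32_t temperature;   /* bucket index, as in hot_map_array_init() */
};

static void sketch_list_init(struct node_sketch *h)
{
	h->prev = h->next = h;
}

static int sketch_list_empty(const struct node_sketch *h)
{
	return h->next == h;
}

/* Unlink and re-initialize; harmless on a node that is not on any list. */
static void sketch_list_del_init(struct node_sketch *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	sketch_list_init(n);
}

static void sketch_list_add_tail(struct node_sketch *n, struct node_sketch *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void sketch_map_init(struct map_head_sketch *map)
{
	int i;

	for (i = 0; i < SKETCH_MAP_SIZE; i++) {
		sketch_list_init(&map[i].list);
		map[i].temperature = (uint32_t)i;
	}
}

/* Assumption: the bucket is the top SKETCH_MAP_BITS bits of the value. */
static uint32_t sketch_map_bucket(uint32_t temperature)
{
	return temperature >> (32 - SKETCH_MAP_BITS);
}

/* Move an (initialized) item onto the list matching its temperature. */
static void sketch_map_place(struct map_head_sketch *map,
			     struct node_sketch *item, uint32_t temperature)
{
	sketch_list_del_init(item);
	sketch_list_add_tail(item, &map[sketch_map_bucket(temperature)].list);
}
```

  With this layout, finding the hottest items is a walk from the
highest bucket downwards, not a scan of every tracked item.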

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c|   50 ++
 include/linux/hot_tracking.h |   16 +
 2 files changed, 66 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 86c87c7..08c42c5 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -60,6 +60,7 @@ static void hot_range_item_init(struct hot_range_item *hr, u32 start,
hr->hot_inode = he;
kref_init(&hr->hot_range.refs);
spin_lock_init(&hr->hot_range.lock);
+   INIT_LIST_HEAD(&hr->hot_range.n_list);
hr->hot_range.hot_freq_data.avg_delta_reads = (u64) -1;
hr->hot_range.hot_freq_data.avg_delta_writes = (u64) -1;
hr->hot_range.hot_freq_data.flags = FREQ_DATA_TYPE_RANGE;
@@ -77,6 +78,7 @@ static void hot_inode_item_init(struct hot_inode_item *he, u64 ino,
he->hot_inode_tree = hot_inode_tree;
kref_init(&he->hot_inode.refs);
spin_lock_init(&he->hot_inode.lock);
+   INIT_LIST_HEAD(&he->hot_inode.n_list);
he->hot_inode.hot_freq_data.avg_delta_reads = (u64) -1;
he->hot_inode.hot_freq_data.avg_delta_writes = (u64) -1;
he->hot_inode.hot_freq_data.flags = FREQ_DATA_TYPE_INODE;
@@ -90,6 +92,11 @@ static void hot_range_item_free(struct kref *kref)
struct hot_range_item *hr = container_of(comm_item,
struct hot_range_item, hot_range);
 
+   spin_lock(&hr->hot_range.lock);
+   if (!list_empty(&hr->hot_range.n_list))
+   list_del_init(&hr->hot_range.n_list);
+   spin_unlock(&hr->hot_range.lock);
+
radix_tree_delete(&hr->hot_inode->hot_range_tree, hr->start);
kmem_cache_free(hot_range_item_cachep, hr);
 }
@@ -135,6 +142,11 @@ static void hot_inode_item_free(struct kref *kref)
struct hot_inode_item *he = container_of(comm_item,
struct hot_inode_item, hot_inode);
 
+   spin_lock(&he->hot_inode.lock);
+   if (!list_empty(&he->hot_inode.n_list))
+   list_del_init(&he->hot_inode.n_list);
+   spin_unlock(&he->hot_inode.lock);
+
hot_range_tree_free(he);
radix_tree_delete(he->hot_inode_tree, he->i_ino);
kmem_cache_free(hot_inode_item_cachep, he);
@@ -365,6 +377,42 @@ inline void hot_update_freqs(struct hot_info *root,
 }
 
 /*
+ * Initialize inode and range map arrays.
+ */
+static void hot_map_array_init(struct hot_info *root)
+{
+   int i;
+   for (i = 0; i < HEAT_MAP_SIZE; i++) {
+   INIT_LIST_HEAD(&root->heat_inode_map[i].node_list);
+   INIT_LIST_HEAD(&root->heat_range_map[i].node_list);
+   root->heat_inode_map[i].temperature = i;
+   root->heat_range_map[i].temperature = i;
+   }
+}
+
+static void hot_map_list_free(struct list_head *node_list)
+{
+   struct list_head *pos, *next;
+   struct hot_comm_item *node;
+
+   list_for_each_safe(pos, next, node_list) {
+   node = list_entry(pos, struct hot_comm_item, n_list);
+   list_del_init(&node->n_list);
+   }
+
+}
+
+/* Free inode and range map arrays */
+static void hot_map_array_exit(struct hot_info *root)
+{
+   int i;
+   for (i = 0; i < HEAT_MAP_SIZE; i++) {
+   hot_map_list_free(&root->heat_inode_map[i].node_list);
+   hot_map_list_free(&root->heat_range_map[i].node_list);
+   }
+}
+
+/*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
 static int __init hot_cache_init(void)
@@ -426,6 +474,7 @@ void hot_track_init(struct super_block *sb)
global_hot_tracking_info = root;
sb->hot_flags |= MS_HOT_TRACKING;
hot_inode_tree_init(root);
+   hot_map_array_init(root);
 
printk(KERN_INFO "vfs: turning on hot data tracking\n");
 
@@ -439,6 +488,7 @@ void hot_track_exit(struct super_block *sb)
 {
struct hot_info *root = global_hot_tracking_info;
 
+   hot_map_array_exit(root);
hot_inode_tree_exit(root);
sb->hot_flags &= ~MS_HOT_TRACKING;
hot_cache_exit();
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 1e0aed5..7114179 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -22,6 +22,9 @@
 
 #define MS_HOT_TRACKING (1<<0)
 
+#define HEAT_MAP_BITS 8
+#define HEAT_MAP_SIZE (1 << HEAT_MAP_BITS)
+
 /*
  * A frequency data struct holds values that are used to
  * determine temperature of files and file ranges. These structs
@@ -38,11 +41,18 @@ struct hot_freq_data {
u32 last_temperature;
 };
 
+/* List heads in hot map array */
+struct hot_map_head {
+   struct list_head node_list;
+   u32 temperature;
+};
+
 /* The common info for both following structures */
 struct 

[RFC v3 00/13] vfs: hot data tracking

2012-10-10 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

NOTE:

  This patchset is being posted mainly to make sure
it is going in the right direction, and in the hope of getting
some helpful comments from others.
  For more information, please check hot_tracking.txt in Documentation.

TODO List:

 1.) Need to do scalability or performance tests.
 2.) Turn some macros into tunables:
   TIME_TO_KICK and HEAT_UPDATE_DELAY
 3.) Refactor hot_hash_is_aging()
   If you just made the timeout value a timespec and compared
 the _timespecs_, you would be doing a lot fewer conversions.
 4.) Clean up some unnecessary lock protection
 5.) Add more comments to explain how to calculate temperature
   How to read the avg read/write time (nanoseconds,
 microseconds, jiffies??)
 6.) Make updating temperature more parallel
 7.) How to save the file temperature across umount so that
 it is preserved after reboot
 8.) Add one new ioctl interface to set the temperature value.
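
  To make TODO item 3 concrete, here is a minimal userspace sketch of
an aging check that compares timespecs directly, with no conversion
to nanoseconds. It is not the current hot_hash_is_aging(); the
function names and the "last access + timeout < now" rule are
assumptions made for illustration.

```c
#include <time.h>

/* Returns <0, 0 or >0 when a is earlier than, equal to or later than b. */
static int ts_compare_sketch(const struct timespec *a,
			     const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec ? -1 : 1;
	if (a->tv_nsec != b->tv_nsec)
		return a->tv_nsec < b->tv_nsec ? -1 : 1;
	return 0;
}

/*
 * Returns 1 when more than 'timeout' has elapsed since 'last'.
 * Both tv_nsec inputs are assumed to be normalized (0..999999999).
 */
static int is_aging_sketch(const struct timespec *last,
			   const struct timespec *now,
			   const struct timespec *timeout)
{
	struct timespec deadline;

	deadline.tv_sec = last->tv_sec + timeout->tv_sec;
	deadline.tv_nsec = last->tv_nsec + timeout->tv_nsec;
	if (deadline.tv_nsec >= 1000000000L) {
		/* carry the nanosecond overflow into seconds */
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}
	return ts_compare_sketch(now, &deadline) > 0;
}
```

  The only arithmetic left is one addition with a carry; the comparison
itself works field by field.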

Ben Chociej, Matt Lupfer and Conor Scott originally wrote this code to
 be very btrfs-specific.  I've taken their code and attempted to
make it more generic and integrate it at the VFS level.

Changelog from v2:
 1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner]
 2.) Added memory shrinker [Dave Chinner]
 3.) Converted to one workqueue to update map info periodically [Dave Chinner]
 4.) Cleaned up a lot of other issues [Dave Chinner]

v1:
 1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
 2.) The first three patches can probably just be flattened into one.
[Marco Stornelli , Dave Chinner]

Zhi Yong Wu (13):
  btrfs: add one new mount option '-o hot_track'
  vfs: introduce private radix tree structures
  vfs: Initialize and free main data structures
  vfs: add function for collecting raw access info
  vfs: add two map arrays
  vfs: add hooks to enable hot data tracking
  vfs: add function for updating map arrays
  vfs: add aging function for old map info
  vfs: add one wq to update map info periodically
  vfs: register one memory shrinker
  vfs: add 3 new ioctl interfaces
  vfs: add debugfs support
  vfs: add documentation

 Documentation/filesystems/00-INDEX |2 +
 Documentation/filesystems/hot_tracking.txt |  165 
 fs/Makefile|2 +-
 fs/btrfs/ctree.h   |1 +
 fs/btrfs/super.c   |   15 +-
 fs/compat_ioctl.c  |9 +
 fs/direct-io.c |8 +
 fs/hot_tracking.c  | 1321 
 fs/hot_tracking.h  |  155 
 fs/ioctl.c |  122 +++
 include/linux/fs.h |4 +
 include/linux/hot_tracking.h   |  123 +++
 mm/filemap.c   |7 +
 mm/page-writeback.c|   13 +
 mm/readahead.c |7 +
 15 files changed, 1952 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/filesystems/hot_tracking.txt
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h

-- 
1.7.6.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/2] btrfs-progs: Fix up memory leakage

2012-09-24 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Some code paths forget to free memory on exit.

Changelog from v1:
  Fix a variable that was used incorrectly. [Ram Pai]
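
  The realloc() case is the subtle one: when realloc() fails it returns
NULL and leaves the original block allocated, so overwriting the only
pointer with its result leaks that block. The patch keeps the original
pointer in sargs_orig for this reason. Below is a minimal standalone
sketch of the same idea; the helper name grow_or_free_sketch() is
hypothetical and is not part of btrfs-progs.

```c
#include <stdlib.h>

/*
 * Grow *bufp to new_size bytes.
 * On success *bufp points to the resized block and 0 is returned.
 * On failure the original block is freed, *bufp is set to NULL and
 * -1 is returned, so the caller has nothing left to clean up.
 */
static int grow_or_free_sketch(void **bufp, size_t new_size)
{
	void *orig = *bufp;
	void *grown;

	/* realloc(p, 0) is implementation-defined; treat it as an error. */
	if (new_size == 0) {
		free(orig);
		*bufp = NULL;
		return -1;
	}

	grown = realloc(orig, new_size);
	if (!grown) {
		free(orig);   /* realloc failed: the old block is still ours */
		*bufp = NULL;
		return -1;
	}

	*bufp = grown;
	return 0;
}
```

  Folding the free into the helper means that no error path in the
caller can forget it.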

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 cmds-filesystem.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index e62c4fd..9c43d35 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -47,7 +47,7 @@ static const char * const cmd_df_usage[] = {
 
 static int cmd_df(int argc, char **argv)
 {
-   struct btrfs_ioctl_space_args *sargs;
+   struct btrfs_ioctl_space_args *sargs, *sargs_orig;
u64 count = 0, i;
int ret;
int fd;
@@ -65,7 +65,7 @@ static int cmd_df(int argc, char **argv)
return 12;
}
 
-   sargs = malloc(sizeof(struct btrfs_ioctl_space_args));
+   sargs_orig = sargs = malloc(sizeof(struct btrfs_ioctl_space_args));
if (!sargs)
return -ENOMEM;
 
@@ -83,6 +83,7 @@ static int cmd_df(int argc, char **argv)
}
if (!sargs->total_spaces) {
close(fd);
+   free(sargs);
return 0;
}
 
@@ -92,6 +93,7 @@ static int cmd_df(int argc, char **argv)
(count * sizeof(struct btrfs_ioctl_space_info)));
if (!sargs) {
close(fd);
+   free(sargs_orig);
return -ENOMEM;
}
 
-- 
1.7.6.5



[resend][PATCH v2 0/2] btrfs-progs: some bugfixes

2012-09-24 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Some miscellaneous bugs were found while I was working on other tasks.
I am sending out the fixes for review, thanks.

Zhi Yong Wu (2):
  btrfs-progs: Close file descriptor on exit
  btrfs-progs: Fix up memory leakage

 cmds-filesystem.c |   16 
 1 files changed, 12 insertions(+), 4 deletions(-)

-- 
1.7.6.5



[PATCH v2 1/2] btrfs-progs: Close file descriptor on exit

2012-09-24 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Need to close fd on exit.
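
  One common way to avoid this class of leak is to route every return
after a successful open() through a single point that closes the
descriptor. The sketch below shows that shape in a standalone helper.
The function name and return codes are hypothetical, and this is not
how cmd_df() is structured in this patch, which adds a close() call on
each exit path instead.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Read the first byte of 'path' into *out.
 * Returns 0 on success, -1 if open() fails, -2 if no byte could be read.
 */
static int read_first_byte_sketch(const char *path, char *out)
{
	int ret = 0;
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;   /* nothing to clean up yet */

	n = read(fd, out, 1);
	if (n != 1)
		ret = -2;    /* record the error, fall through to cleanup */

	close(fd);           /* single exit: every path after open() closes fd */
	return ret;
}
```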

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 cmds-filesystem.c |   10 --
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index b1457de..e62c4fd 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -77,18 +77,23 @@ static int cmd_df(int argc, char **argv)
if (ret) {
fprintf(stderr, "ERROR: couldn't get space info on '%s' - %s\n",
path, strerror(e));
+   close(fd);
free(sargs);
return ret;
}
-   if (!sargs->total_spaces)
+   if (!sargs->total_spaces) {
+   close(fd);
return 0;
+   }
 
count = sargs->total_spaces;
 
sargs = realloc(sargs, sizeof(struct btrfs_ioctl_space_args) +
(count * sizeof(struct btrfs_ioctl_space_info)));
-   if (!sargs)
+   if (!sargs) {
+   close(fd);
return -ENOMEM;
+   }
 
sargs->space_slots = count;
sargs->total_spaces = 0;
@@ -148,6 +153,7 @@ static int cmd_df(int argc, char **argv)
printf("%s: total=%s, used=%s\n", description, total_bytes,
   used_bytes);
}
+   close(fd);
free(sargs);
 
return 0;
-- 
1.7.6.5



[RFC v2 00/10] vfs: hot data tracking

2012-09-23 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

NOTE:

  This patchset is being posted mainly to make sure
it is going in the right direction, and in the hope of getting
some helpful comments from others.
  For more information, please check hot_tracking.txt in Documentation.

TODO List:

 1.) Need to do scalability or performance tests.
 2.) Turn some macros into tunables:
   TIME_TO_KICK and HEAT_UPDATE_DELAY
 3.) Refactor hot_hash_is_aging()
   If you just made the timeout value a timespec and compared
 the _timespecs_, you would be doing a lot fewer conversions.
 4.) Clean up some unnecessary lock protection
 5.) Add more comments to explain how to calculate temperature
   How to read the avg read/write time (nanoseconds,
 microseconds, jiffies??)
 6.) Make updating temperature more parallel
 7.) How to save the file temperature across umount so that
 it is preserved after reboot
 8.) Add one new ioctl interface to set the temperature value.

Ben Chociej, Matt Lupfer and Conor Scott originally wrote this code to
 be very btrfs-specific.  I've taken their code and attempted to
make it more generic and integrate it at the VFS level.

Changelog from v1:
 1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
 2.) The first three patches can probably just be flattened into one.
[Marco Stornelli , Dave Chinner]

Zhi Yong Wu (10):
  vfs: introduce private rb structures
  vfs: add support for updating access frequency
  vfs: add one new mount option '-o hottrack'
  vfs: add init and exit support
  vfs: introduce one hash table
  vfs: enable hot data tracking
  vfs: fork one kthread to update data temperature
  vfs: add 3 new ioctl interfaces
  vfs: add debugfs support
  vfs: add documentation

 Documentation/filesystems/hot_tracking.txt |  106 ++
 fs/Makefile|2 +-
 fs/compat_ioctl.c  |8 +
 fs/dcache.c|2 +
 fs/direct-io.c |   10 +
 fs/hot_tracking.c  | 1563 
 fs/hot_tracking.h  |  163 +++
 fs/ioctl.c |  130 +++
 fs/namespace.c |   10 +
 fs/super.c |   11 +
 include/linux/fs.h |   15 +
 include/linux/hot_tracking.h   |  164 +++
 mm/filemap.c   |8 +
 mm/page-writeback.c|   21 +
 mm/readahead.c |9 +
 15 files changed, 2221 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/filesystems/hot_tracking.txt
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h

-- 
1.7.6.5



[RFC v2 01/10] vfs: introduce private rb structures

2012-09-23 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Define one root structure, hot_info, which is hooked
up in super_block and holds the rb tree roots, the
hash list root and some other information.
  Add a hot_inode_tree struct to keep track of
frequently accessed files, keyed by {inode, offset}.
The trees contain hot_inode_items representing those files
and ranges.
  Having these trees means that the VFS can quickly determine the
temperature of some data by doing some calculations on the
hot_freq_data struct that hangs off of the tree item.
  Define two items, hot_inode_item and hot_range_item.
The former represents one tracked file and keeps
track of its access frequency and the tree of
ranges in this file, while the latter represents
a file range of one inode.
  Each of the two structures contains a hot_freq_data
struct with its access frequency metrics (number of
{reads, writes}, last {read, write} time, frequency of
{reads, writes}).
  Also, each hot_inode_item contains one hot_range_tree
struct, which is keyed by {inode, offset, length}
and used to keep track of all the ranges in this file.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/Makefile  |2 +-
 fs/dcache.c  |2 +
 fs/hot_tracking.c|  116 ++
 fs/hot_tracking.h|   27 ++
 include/linux/fs.h   |4 ++
 include/linux/hot_tracking.h |   96 ++
 6 files changed, 246 insertions(+), 1 deletions(-)
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h

diff --git a/fs/Makefile b/fs/Makefile
index 2fb9779..9d29618 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=  open.o read_write.o file_table.o super.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o drop_caches.o splice.o sync.o utimes.o \
-   stack.o fs_struct.o statfs.o
+   stack.o fs_struct.o statfs.o hot_tracking.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=   buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/dcache.c b/fs/dcache.c
index 8086636..92470a1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -37,6 +37,7 @@
 #include <linux/rculist_bl.h>
 #include <linux/prefetch.h>
 #include <linux/ratelimit.h>
+#include "hot_tracking.h"
 #include "internal.h"
 #include "mount.h"
 
@@ -3164,6 +3165,7 @@ void __init vfs_caches_init(unsigned long mempages)
inode_init();
files_init(mempages);
mnt_init();
+   hot_track_cache_init();
bdev_cache_init();
chrdev_init();
 }
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
new file mode 100644
index 000..173054b
--- /dev/null
+++ b/fs/hot_tracking.c
@@ -0,0 +1,116 @@
+/*
+ * fs/hot_tracking.c
+ *
+ * Copyright (C) 2012 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu wu...@linux.vnet.ibm.com
+ *Ben Chociej bchoc...@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#include <linux/list.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/hardirq.h>
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/types.h>
+#include "hot_tracking.h"
+
+/* kmem_cache pointers for slab caches */
+static struct kmem_cache *hot_inode_item_cache;
+static struct kmem_cache *hot_range_item_cache;
+
+/*
+ * Initialize the inode tree. Should be called for each new inode
+ * access or other user of the hot_inode interface.
+ */
+static void hot_rb_inode_tree_init(struct hot_inode_tree *tree)
+{
+   tree->map = RB_ROOT;
+   rwlock_init(&tree->lock);
+}
+
+/*
+ * Initialize the hot range tree. Should be called for each new inode
+ * access or other user of the hot_range interface.
+ */
+void hot_rb_range_tree_init(struct hot_range_tree *tree)
+{
+   tree->map = RB_ROOT;
+   rwlock_init(&tree->lock);
+}
+
+/*
+ * Initialize a new hot_inode_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using free_inode_item()
+ */
+void hot_rb_inode_item_init(void *_item)
+{
+   struct hot_inode_item *he = _item;
+
+   memset(he, 0, sizeof(*he));
+   kref_init(&he->refs);
+   spin_lock_init(&he->lock);
+   he->hot_freq_data.avg_delta_reads = (u64) -1;
+   he->hot_freq_data.avg_delta_writes = (u64) -1;
+   he->hot_freq_data.flags = FREQ_DATA_TYPE_INODE;
+   hot_rb_range_tree_init(&he->hot_range_tree);
+}
+
+/*
+ * Initialize a new hot_range_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using free_range_item()
+ */
+static void hot_rb_range_item_init(void *_item)
+{
+   struct 

[RFC v2 08/10] vfs: add 3 new ioctl interfaces

2012-09-23 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  FS_IOC_GET_HEAT_INFO: return a struct containing the various
metrics collected in btrfs_freq_data structs, and also return a
calculated data temperature based on those metrics. Optionally, retrieve
the temperature from the hot data hash list instead of recalculating it.

  FS_IOC_GET_HEAT_OPTS: return an integer representing the current
state of hot data tracking and migration:

0 = do nothing
1 = track frequency of access

  FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and
migration, as described above.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/compat_ioctl.c|8 +++
 fs/ioctl.c   |  130 ++
 include/linux/fs.h   |   11 
 include/linux/hot_tracking.h |   12 
 4 files changed, 161 insertions(+), 0 deletions(-)

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index debdfe0..a88c7de 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -1390,6 +1390,11 @@ COMPATIBLE_IOCTL(TIOCSTART)
 COMPATIBLE_IOCTL(TIOCSTOP)
 #endif
 
+/*Hot data tracking*/
+COMPATIBLE_IOCTL(FS_IOC_GET_HEAT_INFO)
+COMPATIBLE_IOCTL(FS_IOC_SET_HEAT_OPTS)
+COMPATIBLE_IOCTL(FS_IOC_GET_HEAT_OPTS)
+
 /* fat 'r' ioctls. These are handled by fat with ->compat_ioctl,
but we don't want warnings on other file systems. So declare
them as compatible here. */
@@ -1572,6 +1577,9 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd,
case FIBMAP:
case FIGETBSZ:
case FIONREAD:
+   case FS_IOC_GET_HEAT_INFO:
+   case FS_IOC_SET_HEAT_OPTS:
+   case FS_IOC_GET_HEAT_OPTS:
if (S_ISREG(filp->f_path.dentry->d_inode->i_mode))
break;
/*FALL THROUGH*/
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 29167be..394975e 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -15,6 +15,7 @@
 #include <linux/writeback.h>
 #include <linux/buffer_head.h>
 #include <linux/falloc.h>
+#include "hot_tracking.h"
 
 #include <asm/ioctls.h>
 
@@ -537,6 +538,126 @@ static int ioctl_fsthaw(struct file *filp)
 }
 
 /*
+ * Retrieve information about access frequency for the given file. Return it in
+ * a userspace-friendly struct for btrfsctl (or another tool) to parse.
+ *
+ * The temperature that is returned can be live -- that is, recalculated when
+ * the ioctl is called -- or it can be returned from the hashtable, reflecting
+ * the (possibly old) value that the system will use when considering files
+ * for migration. This behavior is determined by hot_heat_info->live.
+ */
+static int ioctl_heat_info(struct file *file, void __user *argp)
+{
+   struct inode *mnt_inode = file->f_path.dentry->d_inode;
+   struct inode *file_inode;
+   struct file *file_filp;
+   struct hot_info *root = &(mnt_inode->i_sb->s_hotinfo);
+   struct hot_heat_info *heat_info;
+   struct hot_inode_tree *hitree;
+   struct hot_inode_item *he;
+   int ret;
+
+   heat_info = kmalloc(sizeof(struct hot_heat_info),
+   GFP_KERNEL | GFP_NOFS);
+
+   if (copy_from_user((void *) heat_info,
+   argp,
+   sizeof(struct hot_heat_info)) != 0) {
+   ret = -EFAULT;
+   goto err;
+   }
+
+   file_filp = filp_open(heat_info->filename, O_RDONLY, 0);
+   file_inode = file_filp->f_dentry->d_inode;
+   filp_close(file_filp, NULL);
+
+   hitree = &root->hot_inode_tree;
+   read_lock(&hitree->lock);
+   he = hot_rb_lookup_hot_inode_item(hitree, file_inode->i_ino);
+   read_unlock(&hitree->lock);
+   if (!he) {
+   /* we don't have any info on this file yet */
+   ret = -ENODATA;
+   goto err;
+   }
+
+   spin_lock(&he->lock);
+   heat_info->avg_delta_reads =
+   (__u64) he->hot_freq_data.avg_delta_reads;
+   heat_info->avg_delta_writes =
+   (__u64) he->hot_freq_data.avg_delta_writes;
+   heat_info->last_read_time =
+   (__u64) timespec_to_ns(&he->hot_freq_data.last_read_time);
+   heat_info->last_write_time =
+   (__u64) timespec_to_ns(&he->hot_freq_data.last_write_time);
+   heat_info->num_reads =
+   (__u32) he->hot_freq_data.nr_reads;
+   heat_info->num_writes =
+   (__u32) he->hot_freq_data.nr_writes;
+
+   if (heat_info->live > 0) {
+   /* got a request for live temperature,
+* call hot_hash_calc_temperature to recalculate
+*/
+   heat_info->temperature =
+   hot_hash_calc_temperature(&he->hot_freq_data);
+   } else {
+   /* not live temperature, get it from the hashlist */
+   read_lock(&he->heat_node->hlist->rwlock);
+   heat_info->temperature = he->heat_node->hlist->temperature;
+   read_unlock(&he->heat_node->hlist->rwlock);
+  

[RFC v2 07/10] vfs: fork one kthread to update data temperature

2012-09-23 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Fork and run one kernel thread (kthread) to calculate
the data temperature based on some metrics kept
in custom frequency data structs, and store
the result in the hash table.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c|  467 +-
 fs/hot_tracking.h|   78 +++
 include/linux/hot_tracking.h |3 +
 3 files changed, 542 insertions(+), 6 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 5f96442..fd11695 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -17,6 +17,8 @@
 #include <linux/spinlock.h>
 #include <linux/hardirq.h>
 #include <linux/hash.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
 #include <linux/fs.h>
 #include <linux/blkdev.h>
 #include <linux/types.h>
@@ -27,7 +29,12 @@ static struct kmem_cache *hot_inode_item_cache;
 static struct kmem_cache *hot_range_item_cache;
 static struct kmem_cache *hot_hash_node_cache;
 
+static struct task_struct *hot_track_temperature_update_kthread;
+
 static void hot_hash_node_init(void *_node);
+static int hot_hash_is_aging(struct hot_freq_data *freq_data);
+static void hot_hash_update_hash_table(struct hot_freq_data *freq_data,
+struct hot_info *root);
 
 /*
  * Initialize the inode tree. Should be called for each new inode
@@ -456,9 +463,13 @@ static struct hot_inode_item *hot_rb_update_inode_freq(struct inode *inode,
write_unlock(&hitree->lock);
}
 
-   spin_lock(&he->lock);
-   hot_rb_update_freq(&he->hot_freq_data, rw);
-   spin_unlock(&he->lock);
+   if (!hot_track_temperature_update_kthread
+   || hot_track_temperature_update_kthread->pid != current->pid) {
+   spin_lock(&he->lock);
+   hot_rb_update_freq(&he->hot_freq_data, rw);
+   spin_unlock(&he->lock);
+   hot_hash_update_hash_table(&he->hot_freq_data, root);
+   }
 
 out:
return he;
@@ -505,9 +516,14 @@ static bool hot_rb_update_range_freq(struct hot_inode_item *he,
write_unlock(&hrtree->lock);
}
 
-   spin_lock(&hr->lock);
-   hot_rb_update_freq(&hr->hot_freq_data, rw);
-   spin_unlock(&hr->lock);
+   if (!hot_track_temperature_update_kthread
+   || hot_track_temperature_update_kthread->pid != current->pid) {
+   spin_lock(&hr->lock);
+   hot_rb_update_freq(&hr->hot_freq_data, rw);
+   spin_unlock(&hr->lock);
+   hot_hash_update_hash_table(&hr->hot_freq_data, root);
+   }
+
hot_rb_free_hot_range_item(hr);
}
 
@@ -515,6 +531,58 @@ out:
return ret;
 }
 
+/* Walk the hot_inode_tree, locking as necessary */
+static struct hot_inode_item
+*hot_rb_find_next_hot_inode(struct hot_info *root,
+   u64 objectid)
+{
+   struct rb_node *node;
+   struct rb_node *prev;
+   struct hot_inode_item *entry;
+
+   read_lock(&root->hot_inode_tree.lock);
+
+   node = root->hot_inode_tree.map.rb_node;
+   prev = NULL;
+   while (node) {
+   prev = node;
+   entry = rb_entry(node, struct hot_inode_item, rb_node);
+
+   if (objectid < entry->i_ino)
+   node = node->rb_left;
+   else if (objectid > entry->i_ino)
+   node = node->rb_right;
+   else
+   break;
+   }
+
+   if (!node) {
+   while (prev) {
+   entry = rb_entry(prev, struct hot_inode_item, rb_node);
+   if (objectid <= entry->i_ino) {
+   node = prev;
+   break;
+   }
+   prev = rb_next(prev);
+   }
+   }
+
+   if (node) {
+   entry = rb_entry(node, struct hot_inode_item, rb_node);
+   /*
+ * increase reference count to prevent pruning while
+ * caller is using the hot_inode_item
+ */
+   kref_get(&entry->refs);
+
+   read_unlock(&root->hot_inode_tree.lock);
+   return entry;
+   }
+
+   read_unlock(&root->hot_inode_tree.lock);
+   return NULL;
+}
+
 /* main function to update access frequency from read/writepage(s) hooks */
 void hot_rb_update_freqs(struct inode *inode, u64 start,
u64 len, int rw)
@@ -534,6 +602,65 @@ void hot_rb_update_freqs(struct inode *inode, u64 start,
 }
 
 /*
+ * Take a hot range that is now cold, remove it from the indexes and clean
+ * up any associated memory. This involves removing the hot range from the
+ * rb tree and the heat hash lists, and freeing all of its memory.
+ */
+static void hot_rb_remove_range_data(struct hot_inode_item *hot_inode,
+   struct hot_range_item *hr,
+   

[RFC v2 09/10] vfs: add debugfs support

2012-09-23 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add a /sys/kernel/debug/hot_track/<device_name>/ directory for each
volume. It contains two files. The first, `inode_data', holds the
heat information for inodes that have been brought into the hot data map
structures. The second, `range_data', holds similar information for
sub-file ranges.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c |  466 +
 fs/hot_tracking.h |   40 +
 fs/namespace.c|6 +
 3 files changed, 512 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index fd11695..6aeabad 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -22,6 +22,9 @@
 #include <linux/fs.h>
 #include <linux/blkdev.h>
 #include <linux/types.h>
+#include <linux/debugfs.h>
+#include <linux/vmalloc.h>
+#include <linux/limits.h>
 #include "hot_tracking.h"
 
 /* kmem_cache pointers for slab caches */
@@ -29,6 +32,13 @@ static struct kmem_cache *hot_inode_item_cache;
 static struct kmem_cache *hot_range_item_cache;
 static struct kmem_cache *hot_hash_node_cache;
 
+/* list to keep track of each mounted volume's debugfs_vol_data */
+static struct list_head hot_debugfs_vol_data_list;
+/* lock for debugfs_vol_data_list */
+static spinlock_t hot_debugfs_data_list_lock;
+/* pointer to top level debugfs dentry */
+static struct dentry *hot_debugfs_root_dentry;
+
 static struct task_struct *hot_track_temperature_update_kthread;
 
 static void hot_hash_node_init(void *_node);
@@ -1004,6 +1014,460 @@ static int hot_hash_temperature_update_kthread(void *arg)
return 0;
 }
 
+static int hot_debugfs_copy(struct debugfs_vol_data *data, char *msg, int len)
+{
+   struct lstring *debugfs_log = data->debugfs_log;
+   uint new_log_alloc_size;
+   char *new_log;
+   static char err_msg[] = "No more memory!\n";
+
+   if (len >= data->log_alloc_size - debugfs_log->len) {
+   /* Not enough room in the log buffer for the new message. */
+   /* Allocate a bigger buffer. */
+   new_log_alloc_size = data->log_alloc_size + LOG_PAGE_SIZE;
+   new_log = vmalloc(new_log_alloc_size);
+
+   if (new_log) {
+   memcpy(new_log, debugfs_log->str, debugfs_log->len);
+   memset(new_log + debugfs_log->len, 0,
+   new_log_alloc_size - debugfs_log->len);
+   vfree(debugfs_log->str);
+   debugfs_log->str = new_log;
+   data->log_alloc_size = new_log_alloc_size;
+   } else {
+   WARN_ON(1);
+   if (data->log_alloc_size - debugfs_log->len) {
+   strlcpy(debugfs_log->str +
+   debugfs_log->len,
+   err_msg,
+   data->log_alloc_size - debugfs_log->len);
+   debugfs_log->len +=
+   min((typeof(debugfs_log->len))
+   sizeof(err_msg),
+   ((typeof(debugfs_log->len))
+   data->log_alloc_size - debugfs_log->len));
+   }
+   return 0;
+   }
+   }
+
+   memcpy(debugfs_log->str + debugfs_log->len, data->log_work_buff, len);
+   debugfs_log->len += (unsigned long) len;
+
+   return len;
+}
+
+/* Returns the number of bytes written to the log. */
+static int hot_debugfs_log(struct debugfs_vol_data *data, const char *fmt, ...)
+{
+   struct lstring *debugfs_log = data->debugfs_log;
+   va_list args;
+   int len;
+   static char trunc_msg[] =
+   "The next message has been truncated.\n";
+
+   if (debugfs_log->str == NULL)
+   return -1;
+
+   spin_lock(&data->log_lock);
+
+   va_start(args, fmt);
+   len = vsnprintf(data->log_work_buff,
+   sizeof(data->log_work_buff), fmt, args);
+   va_end(args);
+
+   if (len >= sizeof(data->log_work_buff)) {
+   hot_debugfs_copy(data, trunc_msg, sizeof(trunc_msg));
+   }
+
+   len = hot_debugfs_copy(data, data->log_work_buff, len);
+   spin_unlock(&data->log_lock);
+
+   return len;
+}
+
+/* initialize a log corresponding to a fs volume */
+static int hot_debugfs_log_init(struct debugfs_vol_data *data)
+{
+   int err = 0;
+   struct lstring *debugfs_log = data->debugfs_log;
+
+   spin_lock(&data->log_lock);
+   debugfs_log->str = vmalloc(INIT_LOG_ALLOC_SIZE);
+   if (debugfs_log->str) {
+   memset(debugfs_log->str, 0, INIT_LOG_ALLOC_SIZE);
+   data->log_alloc_size = INIT_LOG_ALLOC_SIZE;
+   } else {
+   err = -ENOMEM;
+   }
+   spin_unlock(&data->log_lock);
+
+   return err;
+}
+
+/* free a log corresponding to a fs volume */
+static void hot_debugfs_log_exit(struct 

[RFC v2 06/10] vfs: enable hot data tracking

2012-09-23 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Hook hot data tracking into the VFS and mm I/O paths (direct I/O,
page reads, writeback and readahead), and add guard macros that make
the hot data functions a bit more friendly to call.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/direct-io.c   |   10 ++
 include/linux/hot_tracking.h |   11 +++
 mm/filemap.c |8 
 mm/page-writeback.c  |   21 +
 mm/readahead.c   |9 +
 5 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index f86c720..3773f44 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -37,6 +37,7 @@
 #include <linux/uio.h>
 #include <linux/atomic.h>
 #include <linux/prefetch.h>
+#include "hot_tracking.h"
 
 /*
  * How many user pages to map in one call to get_user_pages().  This determines
@@ -1297,6 +1298,15 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
prefetch(bdev->bd_queue);
prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);
 
+   /* Hot data tracking */
+   if (TRACK_THIS_INODE(iocb->ki_filp->f_mapping->host)
+   && iov_length(iov, nr_segs) > 0) {
+   hot_rb_update_freqs(iocb->ki_filp->f_mapping->host,
+   (u64)offset,
+   (u64)iov_length(iov, nr_segs),
+   rw & WRITE);
+   }
+
return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
 nr_segs, get_block, end_io,
 submit_io, flags);
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 635ffb6..bc41f94 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -28,6 +28,14 @@
  */
 #define HOT_MOUNT_HOT_TRACK (1 << 0)
 
+/* Hot data tracking -- guard macros */
+#define TRACKING_HOT_TRACK(root) \
+   (root->s_hotinfo.mount_opt & HOT_MOUNT_HOT_TRACK)
+
+#define TRACK_THIS_INODE(inode) \
+   ((TRACKING_HOT_TRACK(inode->i_sb)) && \
+   !(inode->i_flags & S_NOHOTDATATRACK))
+
 /* A tree that sits on the hot_info */
 struct hot_inode_tree {
struct rb_root map;
@@ -135,4 +143,7 @@ struct hot_info {
struct hot_hash_head heat_range_hl[HEAT_HASH_SIZE];
 };
 
+extern void hot_rb_update_freqs(struct inode *inode,
+   u64 start, u64 len, int rw);
+
 #endif  /* _LINUX_HOTTRACK_H */
diff --git a/mm/filemap.c b/mm/filemap.c
index 3843445..8b1ecff 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
+#include <linux/hot_tracking.h>
 #include "internal.h"
 
 /*
@@ -1224,6 +1225,13 @@ readpage:
 * PG_error will be set again if readpage fails.
 */
ClearPageError(page);
+
+   /* Hot data tracking */
+   if (TRACK_THIS_INODE(filp->f_mapping->host))
+   hot_rb_update_freqs(filp->f_mapping->host,
+   (u64)page->index << PAGE_CACHE_SHIFT,
+   PAGE_CACHE_SIZE, 0);
+
/* Start the actual read. The read will unlock the page. */
error = mapping->a_ops->readpage(filp, page);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5ad5ce2..552c861 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -35,6 +35,7 @@
 #include <linux/buffer_head.h> /* __set_page_dirty_buffers */
 #include <linux/pagevec.h>
 #include <linux/timer.h>
+#include <linux/hot_tracking.h>
 #include <trace/events/writeback.h>
 
 /*
@@ -1895,13 +1896,33 @@ EXPORT_SYMBOL(generic_writepages);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
int ret;
+   pgoff_t start = 0;
+   u64 prev_count = 0, count = 0;
 
if (wbc->nr_to_write <= 0)
return 0;
+
+   /* Hot data tracking */
+   if (TRACK_THIS_INODE(mapping->host)
+   && wbc->range_cyclic) {
+   start = mapping->writeback_index << PAGE_CACHE_SHIFT;
+   prev_count = (u64)wbc->nr_to_write;
+   }
+
if (mapping->a_ops->writepages)
ret = mapping->a_ops->writepages(mapping, wbc);
else
ret = generic_writepages(mapping, wbc);
+
+   /* Hot data tracking */
+   if (TRACK_THIS_INODE(mapping->host)
+   && wbc->range_cyclic) {
+   count = prev_count - (u64)wbc->nr_to_write;
+   if (count)
+   hot_rb_update_freqs(mapping->host, (u64)start,
+   count * PAGE_CACHE_SIZE, 1);
+   }
+
return ret;
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index ea8f8fa..7010fc4 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -19,6 +19,7 @@
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
 #include <linux/file.h>

[RFC v2 05/10] vfs: introduce one hash table

2012-09-23 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add a hash table structure that contains many
hash lists and is used to efficiently look up
the data temperature of a file or its ranges.
  In each hash list of the hash table, the hash
nodes keep track of temperature info.
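
  As an illustration only, the idea of indexing items by temperature can
be sketched in user-space C as below. This is not the code in this patch:
the names heat_node, heat_table, heat_set_temperature() and heat_hottest()
are hypothetical, the bucket count of 256 is an assumption (the value of
HEAT_HASH_SIZE is not shown in this excerpt), and locking and reference
counting are left out. Each bucket holds the items that currently have
that temperature, so changing an item's temperature means unlinking it
and relinking it in another bucket.

```c
#include <assert.h>
#include <stddef.h>

#define HEAT_BUCKETS 256	/* assumed; stands in for HEAT_HASH_SIZE */

/* Hypothetical node; the kernel code would use hlist_node instead. */
struct heat_node {
	struct heat_node *next;
	struct heat_node **pprev;	/* address of the pointer that points at us */
	unsigned int temperature;
	int hashed;			/* non-zero while linked into a bucket */
};

/* One list head per temperature value. */
struct heat_table {
	struct heat_node *bucket[HEAT_BUCKETS];
};

/* Unlink a node from whatever bucket it is on (no-op if unhashed). */
static void heat_unlink(struct heat_node *n)
{
	if (!n->hashed)
		return;
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	n->next = NULL;
	n->pprev = NULL;
	n->hashed = 0;
}

/* Move a node to the bucket that matches its new temperature. */
static void heat_set_temperature(struct heat_table *t, struct heat_node *n,
				 unsigned int temp)
{
	struct heat_node **head;

	assert(temp < HEAT_BUCKETS);
	heat_unlink(n);

	head = &t->bucket[temp];
	n->next = *head;
	if (*head)
		(*head)->pprev = &n->next;
	n->pprev = head;
	*head = n;
	n->temperature = temp;
	n->hashed = 1;
}

/* Return some node from the hottest non-empty bucket, or NULL. */
static struct heat_node *heat_hottest(const struct heat_table *t)
{
	int i;

	for (i = HEAT_BUCKETS - 1; i >= 0; i--)
		if (t->bucket[i])
			return t->bucket[i];
	return NULL;
}
```

  With this layout, finding hot data is a scan from the highest bucket
downwards rather than a walk over every tracked inode or range.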

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c|   77 -
 include/linux/hot_tracking.h |   35 +++
 2 files changed, 110 insertions(+), 2 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index fa89f70..5f96442 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -16,6 +16,7 @@
 #include <linux/module.h>
 #include <linux/spinlock.h>
 #include <linux/hardirq.h>
+#include <linux/hash.h>
 #include <linux/fs.h>
 #include <linux/blkdev.h>
 #include <linux/types.h>
@@ -24,6 +25,9 @@
 /* kmem_cache pointers for slab caches */
 static struct kmem_cache *hot_inode_item_cache;
 static struct kmem_cache *hot_range_item_cache;
+static struct kmem_cache *hot_hash_node_cache;
+
+static void hot_hash_node_init(void *_node);
 
 /*
  * Initialize the inode tree. Should be called for each new inode
@@ -57,6 +61,10 @@ void hot_rb_inode_item_init(void *_item)
memset(he, 0, sizeof(*he));
kref_init(&he->refs);
spin_lock_init(&he->lock);
+   he->heat_node = kmem_cache_alloc(hot_hash_node_cache,
+   GFP_KERNEL | GFP_NOFS);
+   hot_hash_node_init(he->heat_node);
+   he->heat_node->hot_freq_data = &he->hot_freq_data;
he->hot_freq_data.avg_delta_reads = (u64) -1;
he->hot_freq_data.avg_delta_writes = (u64) -1;
he->hot_freq_data.flags = FREQ_DATA_TYPE_INODE;
@@ -75,6 +83,10 @@ static void hot_rb_range_item_init(void *_item)
memset(hr, 0, sizeof(*hr));
kref_init(&hr->refs);
spin_lock_init(&hr->lock);
+   hr->heat_node = kmem_cache_alloc(hot_hash_node_cache,
+   GFP_KERNEL | GFP_NOFS);
+   hot_hash_node_init(hr->heat_node);
+   hr->heat_node->hot_freq_data = &hr->hot_freq_data;
hr->hot_freq_data.avg_delta_reads = (u64) -1;
hr->hot_freq_data.avg_delta_writes = (u64) -1;
hr->hot_freq_data.flags = FREQ_DATA_TYPE_RANGE;
@@ -105,6 +117,18 @@ inode_err:
return -ENOMEM;
 }
 
+static void hot_rb_inode_item_exit(void)
+{
+   if (hot_inode_item_cache)
+   kmem_cache_destroy(hot_inode_item_cache);
+}
+
+static void hot_rb_range_item_exit(void)
+{
+   if (hot_range_item_cache)
+   kmem_cache_destroy(hot_range_item_cache);
+}
+
 /*
  * Drops the reference out on hot_inode_item by one and free the structure
  * if the reference count hits zero
@@ -510,6 +534,48 @@ void hot_rb_update_freqs(struct inode *inode, u64 start,
 }
 
 /*
+ * Initialize hash node.
+ */
+static void hot_hash_node_init(void *_node)
+{
+   struct hot_hash_node *node = _node;
+
+   memset(node, 0, sizeof(*node));
+   INIT_HLIST_NODE(&node->hashnode);
+   node->hot_freq_data = NULL;
+   node->hlist = NULL;
+   spin_lock_init(&node->lock);
+   kref_init(&node->refs);
+}
+
+static int __init hot_hash_node_cache_init(void)
+{
+   hot_hash_node_cache = kmem_cache_create("hot_hash_node",
+   sizeof(struct hot_hash_node),
+   0,
+   SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+   hot_hash_node_init);
+   if (!hot_hash_node_cache)
+   return -ENOMEM;
+
+   return 0;
+}
+
+/*
+ * Initialize inode/range hash lists.
+ */
+static void hot_hash_table_init(struct hot_info *root)
+{
+   int i;
+   for (i = 0; i < HEAT_HASH_SIZE; i++) {
+   root->heat_inode_hl[i].temperature = i;
+   root->heat_range_hl[i].temperature = i;
+   rwlock_init(&root->heat_inode_hl[i].rwlock);
+   rwlock_init(&root->heat_range_hl[i].rwlock);
+   }
+}
+
+/*
  * Regular mount options parser for -hottrack option.
  * return false if no -hottrack is specified;
  * otherwise return true. And the -hottrack will be
@@ -544,13 +610,18 @@ bool hot_track_parse_options(char *options)
 }
 
 /*
- * Initialize kmem cache for hot_inode_item
- * and hot_range_item
+ * Initialize kmem cache for hot_inode_item,
+ * hot_range_item and hot_hash_node
  */
 void __init hot_track_cache_init(void)
 {
if (hot_rb_item_cache_init())
return;
+
+   if (hot_hash_node_cache_init()) {
+   hot_rb_inode_item_exit();
+   hot_rb_range_item_exit();
+   }
 }
 
 /*
@@ -560,10 +631,12 @@ void hot_track_init(struct super_block *sb, const char *name)
 {
sb->s_hotinfo.mount_opt |= HOT_MOUNT_HOT_TRACK;
hot_rb_inode_tree_init(&sb->s_hotinfo.hot_inode_tree);
+   hot_hash_table_init(&sb->s_hotinfo);
 }
 
 void hot_track_exit(struct super_block *sb)
 {
sb->s_hotinfo.mount_opt &= ~HOT_MOUNT_HOT_TRACK;
+   

[RFC v2 03/10] vfs: add one new mount option '-o hottrack'

2012-09-23 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Introduce a new mount option, '-o hottrack',
and add parsing support for it.
  Its usage looks like:
   mount -o hottrack
   mount -o nouser,hottrack
   mount -o nouser,hottrack,loop
   mount -o hottrack,nouser
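
  For illustration, removing one option from a comma-separated mount
option string can be sketched in user-space C as below. This is a sketch
under stated assumptions, not the parser added by this patch:
strip_mount_option() is a hypothetical helper, and it matches whole
comma-separated tokens only, so a longer option name that merely contains
"hottrack" is left alone and the remaining options are preserved in every
position shown above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): remove every token that
 * is exactly `opt` from the comma-separated list `options`, in place.
 * Returns true if at least one token was removed.
 */
static bool strip_mount_option(char *options, const char *opt)
{
	size_t opt_len;
	char *src, *dst;
	bool found = false;
	bool first = true;	/* no token has been kept yet */

	assert(opt != NULL);
	if (!options)
		return false;

	opt_len = strlen(opt);
	src = options;
	dst = options;

	while (*src) {
		/* Length of the current token, up to ',' or end of string. */
		size_t tok_len = strcspn(src, ",");
		char *next = src + tok_len;

		if (*next == ',')
			next++;

		if (tok_len == opt_len && strncmp(src, opt, opt_len) == 0) {
			found = true;		/* drop this token */
		} else {
			if (!first)
				*dst++ = ',';	/* separator between kept tokens */
			memmove(dst, src, tok_len);
			dst += tok_len;
			first = false;
		}
		src = next;
	}
	*dst = '\0';
	return found;
}
```

  Applied to the four usage lines above with opt set to "hottrack", the
remaining strings would be "", "nouser", "nouser,loop" and "nouser".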

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c|   34 ++
 fs/hot_tracking.h|1 +
 fs/super.c   |5 +
 include/linux/hot_tracking.h |7 +++
 4 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 52ed926..f97e8a6 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -465,6 +465,40 @@ void hot_rb_update_freqs(struct inode *inode, u64 start,
 }
 
 /*
+ * Regular mount options parser for -hottrack option.
+ * return false if no -hottrack is specified;
+ * otherwise return true. And the -hottrack will be
+ * removed from options.
+ */
+bool hot_track_parse_options(char *options)
+{
+   long len;
+   char *p;
+   static char opts_hot[] = "hottrack";
+
+   if (!options)
+   return false;
+
+   p = strstr(options, opts_hot);
+   if (!p)
+   return false;
+
+   while (p) {
+   len = options + strlen(options) - (p + strlen(opts_hot));
+   if (len == 0) {
+   options[0] = '\0';
+   break;
+   }
+
+   memmove(p, p + strlen(opts_hot) + 1, len);
+   p = strstr(options, opts_hot);
+   }
+
+   printk(KERN_INFO "vfs: turning on hot data tracking\n");
+   return true;
+}
+
+/*
  * Initialize kmem cache for hot_inode_item
  * and hot_range_item
  */
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 2ba29e4..6bd09eb 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -37,6 +37,7 @@ void hot_rb_free_hot_inode_item(struct hot_inode_item *he);
 void hot_rb_update_freqs(struct inode *inode, u64 start, u64 len,
int rw);
 
+bool hot_track_parse_options(char *options);
 void __init hot_track_cache_init(void);
 
 #endif /* __HOT_TRACKING__ */
diff --git a/fs/super.c b/fs/super.c
index 0902cfa..7eb3b0c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -34,6 +34,7 @@
 #include <linux/cleancache.h>
 #include <linux/fsnotify.h>
 #include <linux/lockdep.h>
+#include "hot_tracking.h"
 #include "internal.h"
 
 
@@ -1125,6 +1126,7 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
struct dentry *root;
struct super_block *sb;
char *secdata = NULL;
+   bool hottrack = false;
int error = -ENOMEM;
 
if (data && !(type->fs_flags & FS_BINARY_MOUNTDATA)) {
@@ -1137,6 +1139,9 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
goto out_free_secdata;
}
 
+   if (data && hot_track_parse_options(data))
+   hottrack = true;
+
root = type->mount(type, flags, name, data);
if (IS_ERR(root)) {
error = PTR_ERR(root);
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index a566f91..bb2a41c 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -20,6 +20,11 @@
 #include <linux/rbtree.h>
 #include <linux/kref.h>
 
+/*
+ * Flags for hot data tracking mount options.
+ */
+#define HOT_MOUNT_HOT_TRACK (1 << 0)
+
 /* A tree that sits on the hot_info */
 struct hot_inode_tree {
struct rb_root map;
@@ -89,6 +94,8 @@ struct hot_range_item {
 };
 
 struct hot_info {
+   unsigned long mount_opt;
+
/* red-black tree that keeps track of fs-wide hot data */
struct hot_inode_tree hot_inode_tree;
 };
-- 
1.7.6.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC v2 04/10] vfs: add init and exit support

2012-09-23 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Add an initialization function that creates the
key data structures when hot tracking is enabled, and
an exit function that cleans them up when it is disabled.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c |   60 +
 fs/hot_tracking.h |2 +
 fs/namespace.c|4 +++
 fs/super.c|6 +
 4 files changed, 72 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index f97e8a6..fa89f70 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -135,6 +135,51 @@ static void hot_rb_free_hot_range_item(struct hot_range_item *hr)
}
 }
 
+static int hot_rb_remove_hot_inode_item(struct hot_inode_tree *tree,
+struct hot_inode_item *he)
+{
+int ret = 0;
+rb_erase(&he->rb_node, &tree->map);
+he->in_tree = 0;
+return ret;
+}
+
+static int hot_rb_remove_hot_range_item(struct hot_range_tree *tree,
+struct hot_range_item *hr)
+{
+int ret = 0;
+rb_erase(&hr->rb_node, &tree->map);
+hr->in_tree = 0;
+return ret;
+}
+
+/* Frees the entire hot_inode_tree. */
+static void hot_rb_inode_tree_free(struct hot_info *root)
+{
+   struct rb_node *node, *node2;
+   struct hot_inode_item *he;
+   struct hot_range_item *hr;
+
+   /* Free hot inode and range trees on fs root */
+   node = rb_first(&root->hot_inode_tree.map);
+
+   while (node) {
+   he = rb_entry(node, struct hot_inode_item, rb_node);
+
+   node2 = rb_first(&he->hot_range_tree.map);
+   while (node2) {
+   hr = rb_entry(node2, struct hot_range_item, rb_node);
+   hot_rb_remove_hot_range_item(&he->hot_range_tree, hr);
+   hot_rb_free_hot_range_item(hr);
+   node2 = rb_first(&he->hot_range_tree.map);
+   }
+
+   hot_rb_remove_hot_inode_item(&root->hot_inode_tree, he);
+   hot_rb_free_hot_inode_item(he);
+   node = rb_first(&root->hot_inode_tree.map);
+   }
+}
+
 static struct rb_node *hot_rb_insert_hot_inode_item(struct rb_root *root,
unsigned long inode_num,
struct rb_node *node)
@@ -507,3 +552,18 @@ void __init hot_track_cache_init(void)
if (hot_rb_item_cache_init())
return;
 }
+
+/*
+ * Initialize the data structures for hot data tracking.
+ */
+void hot_track_init(struct super_block *sb, const char *name)
+{
+   sb->s_hotinfo.mount_opt |= HOT_MOUNT_HOT_TRACK;
+   hot_rb_inode_tree_init(&sb->s_hotinfo.hot_inode_tree);
+}
+
+void hot_track_exit(struct super_block *sb)
+{
+   sb->s_hotinfo.mount_opt &= ~HOT_MOUNT_HOT_TRACK;
+   hot_rb_inode_tree_free(&sb->s_hotinfo);
+}
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 6bd09eb..3a8d398 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -39,5 +39,7 @@ void hot_rb_update_freqs(struct inode *inode, u64 start, u64 len,
 
 bool hot_track_parse_options(char *options);
 void __init hot_track_cache_init(void);
+void hot_track_init(struct super_block *sb, const char *name);
+void hot_track_exit(struct super_block *sb);
 
 #endif /* __HOT_TRACKING__ */
diff --git a/fs/namespace.c b/fs/namespace.c
index 4d31f73..55006c8 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -20,6 +20,7 @@
 #include <linux/fs_struct.h>   /* get_fs_root et.al. */
 #include <linux/fsnotify.h>    /* fsnotify_vfsmount_delete */
 #include <linux/uaccess.h>
+#include "hot_tracking.h"
 #include "pnode.h"
 #include "internal.h"
 
@@ -1215,6 +1216,9 @@ static int do_umount(struct mount *mnt, int flags)
return retval;
}
 
+   if (sb->s_hotinfo.mount_opt & HOT_MOUNT_HOT_TRACK)
+   hot_track_exit(sb);
+
down_write(&namespace_sem);
br_write_lock(&vfsmount_lock);
event++;
diff --git a/fs/super.c b/fs/super.c
index 7eb3b0c..0999d5c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1153,6 +1153,9 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
WARN_ON(sb->s_bdi == &default_backing_dev_info);
sb->s_flags |= MS_BORN;
 
+   if (hottrack)
+   hot_track_init(sb, name);
+
error = security_sb_kern_mount(sb, flags, secdata);
if (error)
goto out_sb;
@@ -1170,6 +1173,9 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
free_secdata(secdata);
return root;
 out_sb:
+   if (hottrack)
+   hot_track_exit(sb);
+
dput(root);
deactivate_locked_super(sb);
 out_free_secdata:
-- 
1.7.6.5



[RFC v2 10/10] vfs: add documentation

2012-09-23 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 Documentation/filesystems/hot_tracking.txt |  106 
 1 files changed, 106 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/hot_tracking.txt

diff --git a/Documentation/filesystems/hot_tracking.txt b/Documentation/filesystems/hot_tracking.txt
new file mode 100644
index 0000000..340df45
--- /dev/null
+++ b/Documentation/filesystems/hot_tracking.txt
@@ -0,0 +1,106 @@
+Hot Data Tracking
+
+Introduction
+---
+
+  The feature adds experimental support for tracking data temperature
+information in the VFS layer.  Essentially, this means maintaining some key
+stats (like number of reads/writes, last read/write time, frequency of
+reads/writes), then distilling those numbers down to a single
+temperature value that reflects what data is hot, and using that
+temperature to move data to SSDs.
+
+  The long-term goal of the feature is to allow some FSs,
+e.g. Btrfs, to intelligently utilize SSDs in a heterogeneous volume.
+Incidentally, this project has been motivated by
+the Project Ideas page on the Btrfs wiki.
+
+  Of course, users are warned not to run this code outside of development
+environments. These patches are EXPERIMENTAL, and as such they might eat
+your data and/or memory. That said, the code should be relatively safe
+when the hottrack mount option is disabled.
+
+Motivation
+---
+
+  The overall goal of enabling hot data relocation to SSD has been
+motivated by the Project Ideas page on the Btrfs wiki at
+https://btrfs.wiki.kernel.org/index.php/Project_ideas.
+The work is divided into two steps: the VFS provides the hot data tracking
+function, while each specific FS provides the hot data relocation function.
+So as the first step of this goal, it is hoped that the patchset
+for hot data tracking will eventually mature into the VFS.
+
+  This is essentially the traditional cache argument: SSD is fast and
+expensive; HDD is cheap but slow. ZFS, for example, can already take
+advantage of SSD caching. Btrfs should also be able to take advantage of
+hybrid storage without many broad, sweeping changes to existing code.
+
+Main Parts Description
+---
+
+These include the following parts:
+* Hooks in existing vfs functions to track data access frequency
+* New rbtrees for tracking access frequency of inodes and sub-file
+ranges (hot_rb.c)
+The relationship between super_block and rbtree is as below:
+super_block->s_hotinfo.hot_inode_tree
+In include/linux/fs.h, one struct hot_info s_hotinfo is added to the
+super_block struct. Each FS instance can find its hot tracking info
+s_hotinfo via its super_block. This hot_info stores the hot tracking
+info, such as hot_inode_tree and the inode and range hash lists.
+* A hash list for indexing data by its temperature (hot_hash.c)
+* A debugfs interface for dumping data from the rbtrees (hot_debugfs.c)
+* A background kthread for updating inode heat info
+* Mount options for enabling temperature tracking (-o hottrack,
+disabled by default) (hot_track.c)
+* An ioctl to retrieve the frequency information collected for a certain
+file
+* Ioctls to enable/disable frequency tracking per inode.
+
+Git Development Tree
+---
+
+  The feature is still under development and review, so if you're interested,
+you can pull from the git repository at the following location:
+  https://github.com/wuzhy/kernel.git hot_tracking
+  git://github.com/wuzhy/kernel.git hot_tracking
+
+Usage Example
+---
+To use hot tracking, you should mount like this:
+
+$ mount -o hottrack /dev/sdb /mnt
+[ 1505.894078] device label test devid 1 transid 29 /dev/sdb
+[ 1505.952977] btrfs: disk space caching is enabled
+[ 1506.069678] vfs: turning on hot data tracking
+
+Mount debugfs first:
+
+$ mount -t debugfs none /sys/kernel/debug
+$ ls -l /sys/kernel/debug/vfs_hotdata/
+total 0
+drwxr-xr-x 2 root root 0 Aug  8 04:40 sdb
+$ ls -l /sys/kernel/debug/vfs_hotdata/sdb
+total 0
+-rw-r--r-- 1 root root 0 Aug  8 04:40 inode_data
+-rw-r--r-- 1 root root 0 Aug  8 04:40 range_data
+
+View information about hot tracking from debugfs:
+
+$ echo "hot tracking test" > /mnt/file
+$ cat /sys/kernel/debug/hot_track/sdb/inode_data
+inode #279, reads 0, writes 1, avg read time 18446744073709551615,
+avg write time 5251566408153596, temp 109
+$ cat /sys/kernel/debug/hot_track/sdb/range_data
+inode #279, range start 0 (range len 1048576) reads 0, writes 1,
+avg read time 18446744073709551615, avg write time 1128690176623144209, temp 64
+
+$ echo "hot data tracking test" > /mnt/file
+$ cat 

[PATCH] btrfs-progs: rework the code logic

2012-07-05 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 extent-cache.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/extent-cache.c b/extent-cache.c
index 3dd6434..a5084ee 100644
--- a/extent-cache.c
+++ b/extent-cache.c
@@ -136,10 +136,10 @@ struct cache_extent *find_first_cache_extent(struct cache_tree *tree,
struct cache_extent *entry;
 
ret = __tree_search(&tree->root, start, 1, &prev);
-   if (!ret)
+   if (!ret) {
ret = prev;
-   if (!ret)
return NULL;
+}
entry = rb_entry(ret, struct cache_extent, rb_node);
return entry;
 }
-- 
1.7.6
