Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?

2010-04-26 Thread KOSAKI Motohiro
Hi Ted


> I happened to be going through the source code for write_cache_pages(),
> and I came across a reference to AOP_WRITEPAGE_ACTIVATE.  I was curious
> what the heck that was, so I did search for it, and found this in
> Documentation/filesystems/vfs.txt:
> 
>   If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
>   try too hard if there are problems, and may choose to write out
>   other pages from the mapping if that is easier (e.g. due to
>   internal dependencies).  If it chooses not to start writeout, it
>   should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
>   calling ->writepage on that page.
> 
>   See the file "Locking" for more details.
> 
> No filesystems are currently returning AOP_WRITEPAGE_ACTIVATE when it
> chooses not to writeout page and call redirty_page_for_writeback()
> instead.
> 
> Is this a change we should make, for example when btrfs refuses a
> writepage() when PF_MEMALLOC is set, or when ext4 refuses a writepage()
> if the page involved hasn't been allocated an on-disk block yet (i.e.,
> delayed allocation)?  The change seems to be that we should call
> redirty_page_for_writeback() as before, but then _not_ unlock the page,
> and return AOP_WRITEPAGE_ACTIVATE.  Is this a good and useful thing for
> us to do?

Sorry, no.

AOP_WRITEPAGE_ACTIVATE was introduced for ramdisk and tmpfs
(and rd later chose another way).
So it assumes that writepage refusals do not happen for the majority of pages.
IOW, the VM assumes that many other pages can be written out even though this
page can't. That is why the VM only activates the page when
AOP_WRITEPAGE_ACTIVATE is returned.
But now ext4 and btrfs refuse all writepage() calls. (Right?)

IOW, I don't think that documentation anticipated the delayed allocation issue ;)

The point is, our dirty page accounting only tracks the system-wide memory
dirty ratio and per-task dirty pages. It doesn't track a per-NUMA-node
or per-zone dirty ratio. So refusing writepage while abusing fake NUMA
can easily confuse our VM. If _all_ pages in a VM LRU list (which is
per-zone) are refused, page activation doesn't help. It can also lead to OOM.

And I'm sorry, but I have to say that, as far as VM developers are concerned,
fake NUMA is not production-level quality yet. AFAIK, nobody has seriously
tested our VM code in such an environment. (linux/arch/x86/Kconfig says "This
is only useful for debugging.")

--
config NUMA_EMU
	bool "NUMA emulation"
	depends on X86_64 && NUMA
	---help---
	  Enable NUMA emulation. A flat machine will be split
	  into virtual nodes when booted with "numa=fake=N", where N is the
	  number of nodes. This is only useful for debugging.


 
> Right now, the only writepage() function which is returning
> AOP_WRITEPAGE_ACTIVATE is shmem_writepage(), and very curiously it's not
> using redirty_page_for_writeback().  Should it, out of consistency's
> sake if not to keep various zone accounting straight?

Umm, I don't know the reason. I've CCed Hugh instead :)


> There are some longer-term issues, including the fact that ext4 and
> btrfs are violating some of the rules laid out in
> Documentation/vfs/Locking regarding what writepage() is supposed to do
> under direct reclaim -- something which isn't going to be practical for
> us to change on the file-system side, at least not without doing some
> pretty nasty and serious rework, for both ext4 and I suspect btrfs.  But
> if returning AOP_WRITEPAGE_ACTIVATE will help the VM deal more
> gracefully with the fact that ext4 and btrfs will be refusing
> writepage() calls under certain conditions, maybe we should make this
> change?

I'm sorry again. I'm pretty sure our VM also needs to change if we are
to solve your company's fake NUMA use case. I think our VM is still unfriendly
to delayed allocation; we hadn't noticed the ext4 delayed allocation issue ;-)

So, I have two questions:
 - I really want to understand the ext4 delayed allocation issue. Can you please
   point me to a URL that explains ext4's high-level design and behavior
   regarding delayed allocation?
 - If my understanding is correct, creating a very large number of fake NUMA
   nodes and running a simple dd can reproduce your issue. Right?

Right now I'm guessing that a fairly small VM patch can solve this issue (that's
only a guess; maybe yes, maybe no). But a correct understanding and a correct
way of testing are really necessary. Please help.




--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V2 01/12] Btrfs: Link block groups of different raid types in the same space_info

2010-04-26 Thread Yan, Zheng
The size of reserved space is stored in space_info. If block groups
of different raid types are linked to separate space_info, changing
allocation profile will corrupt reserved space accounting.

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h
--- 2/fs/btrfs/ctree.h  2010-04-26 17:23:52.921839641 +0800
+++ 3/fs/btrfs/ctree.h  2010-04-26 17:23:52.926830638 +0800
@@ -662,6 +662,7 @@ struct btrfs_csum_item {
 #define BTRFS_BLOCK_GROUP_RAID1    (1 << 4)
 #define BTRFS_BLOCK_GROUP_DUP      (1 << 5)
 #define BTRFS_BLOCK_GROUP_RAID10   (1 << 6)
+#define BTRFS_NR_RAID_TYPES   5
 
 struct btrfs_block_group_item {
__le64 used;
@@ -673,7 +674,8 @@ struct btrfs_space_info {
u64 flags;
 
u64 total_bytes;/* total bytes in the space */
-   u64 bytes_used; /* total bytes used on disk */
+   u64 bytes_used; /* total bytes used,
+  this does't take mirrors into account */
u64 bytes_pinned;   /* total bytes pinned, will be freed when the
   transaction finishes */
u64 bytes_reserved; /* total bytes the allocator has reserved for
@@ -686,6 +688,7 @@ struct btrfs_space_info {
   delalloc/allocations */
u64 bytes_delalloc; /* number of bytes currently reserved for
   delayed allocation */
+   u64 disk_used;  /* total bytes used on disk */
 
int full;   /* indicates that we cannot allocate any more
   chunks for this space */
@@ -703,7 +706,7 @@ struct btrfs_space_info {
int flushing;
 
/* for block groups in our same type */
-   struct list_head block_groups;
+   struct list_head block_groups[BTRFS_NR_RAID_TYPES];
spinlock_t lock;
struct rw_semaphore groups_sem;
atomic_t caching_threads;
diff -urp 2/fs/btrfs/extent-tree.c 3/fs/btrfs/extent-tree.c
--- 2/fs/btrfs/extent-tree.c2010-04-26 17:23:52.922840061 +0800
+++ 3/fs/btrfs/extent-tree.c2010-04-26 17:23:52.929829246 +0800
@@ -506,6 +506,9 @@ static struct btrfs_space_info *__find_s
struct list_head *head = &info->space_info;
struct btrfs_space_info *found;
 
+   flags &= BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_SYSTEM |
+BTRFS_BLOCK_GROUP_METADATA;
+
rcu_read_lock();
list_for_each_entry_rcu(found, head, list) {
if (found->flags == flags) {
@@ -2659,12 +2662,21 @@ static int update_space_info(struct btrf
 struct btrfs_space_info **space_info)
 {
struct btrfs_space_info *found;
+   int i;
+   int factor;
+
+   if (flags & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
+BTRFS_BLOCK_GROUP_RAID10))
+   factor = 2;
+   else
+   factor = 1;
 
found = __find_space_info(info, flags);
if (found) {
spin_lock(&found->lock);
found->total_bytes += total_bytes;
found->bytes_used += bytes_used;
+   found->disk_used += bytes_used * factor;
found->full = 0;
spin_unlock(&found->lock);
*space_info = found;
@@ -2674,14 +2686,18 @@ static int update_space_info(struct btrf
if (!found)
return -ENOMEM;
 
-   INIT_LIST_HEAD(&found->block_groups);
+   for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
+   INIT_LIST_HEAD(&found->block_groups[i]);
init_rwsem(&found->groups_sem);
init_waitqueue_head(&found->flush_wait);
init_waitqueue_head(&found->allocate_wait);
spin_lock_init(&found->lock);
-   found->flags = flags;
+   found->flags = flags & (BTRFS_BLOCK_GROUP_DATA |
+   BTRFS_BLOCK_GROUP_SYSTEM |
+   BTRFS_BLOCK_GROUP_METADATA);
found->total_bytes = total_bytes;
found->bytes_used = bytes_used;
+   found->disk_used = bytes_used * factor;
found->bytes_pinned = 0;
found->bytes_reserved = 0;
found->bytes_readonly = 0;
@@ -2751,26 +2767,32 @@ u64 btrfs_reduce_alloc_profile(struct bt
return flags;
 }
 
-static u64 btrfs_get_alloc_profile(struct btrfs_root *root, u64 data)
+static u64 get_alloc_profile(struct btrfs_root *root, u64 flags)
 {
-   struct btrfs_fs_info *info = root->fs_info;
-   u64 alloc_profile;
+   if (flags & BTRFS_BLOCK_GROUP_DATA)
+   flags |= root->fs_info->avail_data_alloc_bits &
+root->fs_info->data_alloc_profile;
+   else if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
+   flags |= root->fs_info->avail_system_alloc_bits &
+root->fs_info->system_alloc_profile;
+   else if (flags & BTRFS_BLOCK_GROUP_METADATA)
+   flags |= root->fs_info->avail_metadata_alloc_bits &
+

[PATCH V2 02/12] Btrfs: Kill allocate_wait in space_info

2010-04-26 Thread Yan, Zheng
We already have fs_info->chunk_mutex to avoid concurrent
chunk creation.

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h
--- 2/fs/btrfs/ctree.h  2010-04-26 17:24:10.436081649 +0800
+++ 3/fs/btrfs/ctree.h  2010-04-26 17:24:10.441079491 +0800
@@ -700,9 +700,7 @@ struct btrfs_space_info {
struct list_head list;
 
/* for controlling how we free up space for allocations */
-   wait_queue_head_t allocate_wait;
wait_queue_head_t flush_wait;
-   int allocating_chunk;
int flushing;
 
/* for block groups in our same type */
diff -urp 2/fs/btrfs/extent-tree.c 3/fs/btrfs/extent-tree.c
--- 2/fs/btrfs/extent-tree.c2010-04-26 17:24:10.437084933 +0800
+++ 3/fs/btrfs/extent-tree.c2010-04-26 17:24:10.444079704 +0800
@@ -70,6 +70,9 @@ static int find_next_key(struct btrfs_pa
 struct btrfs_key *key);
 static void dump_space_info(struct btrfs_space_info *info, u64 bytes,
int dump_block_groups);
+static int maybe_allocate_chunk(struct btrfs_trans_handle *trans,
+   struct btrfs_root *root,
+   struct btrfs_space_info *sinfo, u64 num_bytes);
 
 static noinline int
 block_group_cache_done(struct btrfs_block_group_cache *cache)
@@ -2690,7 +2693,6 @@ static int update_space_info(struct btrf
INIT_LIST_HEAD(&found->block_groups[i]);
init_rwsem(&found->groups_sem);
init_waitqueue_head(&found->flush_wait);
-   init_waitqueue_head(&found->allocate_wait);
spin_lock_init(&found->lock);
found->flags = flags & (BTRFS_BLOCK_GROUP_DATA |
BTRFS_BLOCK_GROUP_SYSTEM |
@@ -3003,71 +3005,6 @@ flush:
wake_up(&info->flush_wait);
 }
 
-static int maybe_allocate_chunk(struct btrfs_root *root,
-struct btrfs_space_info *info)
-{
-   struct btrfs_super_block *disk_super = &root->fs_info->super_copy;
-   struct btrfs_trans_handle *trans;
-   bool wait = false;
-   int ret = 0;
-   u64 min_metadata;
-   u64 free_space;
-
-   free_space = btrfs_super_total_bytes(disk_super);
-   /*
-* we allow the metadata to grow to a max of either 10gb or 5% of the
-* space in the volume.
-*/
-   min_metadata = min((u64)10 * 1024 * 1024 * 1024,
-div64_u64(free_space * 5, 100));
-   if (info->total_bytes >= min_metadata) {
-   spin_unlock(&info->lock);
-   return 0;
-   }
-
-   if (info->full) {
-   spin_unlock(&info->lock);
-   return 0;
-   }
-
-   if (!info->allocating_chunk) {
-   info->force_alloc = 1;
-   info->allocating_chunk = 1;
-   } else {
-   wait = true;
-   }
-
-   spin_unlock(&info->lock);
-
-   if (wait) {
-   wait_event(info->allocate_wait,
-  !info->allocating_chunk);
-   return 1;
-   }
-
-   trans = btrfs_start_transaction(root, 1);
-   if (!trans) {
-   ret = -ENOMEM;
-   goto out;
-   }
-
-   ret = do_chunk_alloc(trans, root->fs_info->extent_root,
-4096 + 2 * 1024 * 1024,
-info->flags, 0);
-   btrfs_end_transaction(trans, root);
-   if (ret)
-   goto out;
-out:
-   spin_lock(&info->lock);
-   info->allocating_chunk = 0;
-   spin_unlock(&info->lock);
-   wake_up(&info->allocate_wait);
-
-   if (ret)
-   return 0;
-   return 1;
-}
-
 /*
  * Reserve metadata space for delalloc.
  */
@@ -3108,7 +3045,8 @@ again:
flushed++;
 
if (flushed == 1) {
-   if (maybe_allocate_chunk(root, meta_sinfo))
+   if (maybe_allocate_chunk(NULL, root, meta_sinfo,
+num_bytes))
goto again;
flushed++;
} else {
@@ -3223,7 +3161,8 @@ again:
if (used > meta_sinfo->total_bytes) {
retries++;
if (retries == 1) {
-   if (maybe_allocate_chunk(root, meta_sinfo))
+   if (maybe_allocate_chunk(NULL, root, meta_sinfo,
+num_bytes))
goto again;
retries++;
} else {
@@ -3420,13 +3359,28 @@ static void force_metadata_allocation(st
rcu_read_unlock();
 }
 
+static int should_alloc_chunk(struct btrfs_space_info *sinfo,
+ u64 alloc_bytes)
+{
+   u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
+
+   if (sinfo->bytes_used + sinfo->bytes_reserved +
+   alloc_bytes + 256 * 1024 * 1024 < num_bytes)
+   return 0;
+
+   if (sinfo->bytes_used + 

[PATCH V2 04/12] Btrfs: Kill init_btrfs_i()

2010-04-26 Thread Yan, Zheng
All code in init_btrfs_i can be moved into btrfs_alloc_inode()


Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 2/fs/btrfs/inode.c 3/fs/btrfs/inode.c
--- 2/fs/btrfs/inode.c  2010-04-26 17:24:41.254078880 +0800
+++ 3/fs/btrfs/inode.c  2010-04-26 17:24:41.270103836 +0800
@@ -3595,40 +3595,10 @@ again:
return 0;
 }
 
-static noinline void init_btrfs_i(struct inode *inode)
-{
-   struct btrfs_inode *bi = BTRFS_I(inode);
-
-   bi->generation = 0;
-   bi->sequence = 0;
-   bi->last_trans = 0;
-   bi->last_sub_trans = 0;
-   bi->logged_trans = 0;
-   bi->delalloc_bytes = 0;
-   bi->reserved_bytes = 0;
-   bi->disk_i_size = 0;
-   bi->flags = 0;
-   bi->index_cnt = (u64)-1;
-   bi->last_unlink_trans = 0;
-   bi->ordered_data_close = 0;
-   bi->force_compress = 0;
-   extent_map_tree_init(&BTRFS_I(inode)->extent_tree, GFP_NOFS);
-   extent_io_tree_init(&BTRFS_I(inode)->io_tree,
-inode->i_mapping, GFP_NOFS);
-   extent_io_tree_init(&BTRFS_I(inode)->io_failure_tree,
-inode->i_mapping, GFP_NOFS);
-   INIT_LIST_HEAD(&BTRFS_I(inode)->delalloc_inodes);
-   INIT_LIST_HEAD(&BTRFS_I(inode)->ordered_operations);
-   RB_CLEAR_NODE(&BTRFS_I(inode)->rb_node);
-   btrfs_ordered_inode_tree_init(&BTRFS_I(inode)->ordered_tree);
-   mutex_init(&BTRFS_I(inode)->log_mutex);
-}
-
 static int btrfs_init_locked_inode(struct inode *inode, void *p)
 {
struct btrfs_iget_args *args = p;
inode->i_ino = args->ino;
-   init_btrfs_i(inode);
BTRFS_I(inode)->root = args->root;
btrfs_set_inode_space_info(args->root, inode);
return 0;
@@ -3691,8 +3661,6 @@ static struct inode *new_simple_dir(stru
if (!inode)
return ERR_PTR(-ENOMEM);
 
-   init_btrfs_i(inode);
-
BTRFS_I(inode)->root = root;
memcpy(&BTRFS_I(inode)->location, key, sizeof(*key));
BTRFS_I(inode)->dummy_inode = 1;
@@ -4091,7 +4059,6 @@ static struct inode *btrfs_new_inode(str
 * btrfs_get_inode_index_count has an explanation for the magic
 * number
 */
-   init_btrfs_i(inode);
BTRFS_I(inode)->index_cnt = 2;
BTRFS_I(inode)->root = root;
BTRFS_I(inode)->generation = trans->transid;
@@ -5262,21 +5229,46 @@ unsigned long btrfs_force_ra(struct addr
 struct inode *btrfs_alloc_inode(struct super_block *sb)
 {
struct btrfs_inode *ei;
+   struct inode *inode;
 
ei = kmem_cache_alloc(btrfs_inode_cachep, GFP_NOFS);
if (!ei)
return NULL;
+
+   ei->root = NULL;
+   ei->space_info = NULL;
+   ei->generation = 0;
+   ei->sequence = 0;
ei->last_trans = 0;
ei->last_sub_trans = 0;
ei->logged_trans = 0;
+   ei->delalloc_bytes = 0;
+   ei->reserved_bytes = 0;
+   ei->disk_i_size = 0;
+   ei->flags = 0;
+   ei->index_cnt = (u64)-1;
+   ei->last_unlink_trans = 0;
+
+   spin_lock_init(&ei->accounting_lock);
ei->outstanding_extents = 0;
ei->reserved_extents = 0;
-   ei->root = NULL;
-   spin_lock_init(&ei->accounting_lock);
+
+   ei->ordered_data_close = 0;
+   ei->dummy_inode = 0;
+   ei->force_compress = 0;
+
+   inode = &ei->vfs_inode;
+   extent_map_tree_init(&ei->extent_tree, GFP_NOFS);
+   extent_io_tree_init(&ei->io_tree, &inode->i_data, GFP_NOFS);
+   extent_io_tree_init(&ei->io_failure_tree, &inode->i_data, GFP_NOFS);
+   mutex_init(&ei->log_mutex);
btrfs_ordered_inode_tree_init(&ei->ordered_tree);
INIT_LIST_HEAD(&ei->i_orphan);
+   INIT_LIST_HEAD(&ei->delalloc_inodes);
INIT_LIST_HEAD(&ei->ordered_operations);
-   return &ei->vfs_inode;
+   RB_CLEAR_NODE(&ei->rb_node);
+
+   return inode;
 }
 
 void btrfs_destroy_inode(struct inode *inode)


[PATCH V2 08/12] Btrfs: Introduce global metadata reservation

2010-04-26 Thread Yan, Zheng
Reserve metadata space for extent tree, checksum tree and root tree

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h
--- 2/fs/btrfs/ctree.h  2010-04-26 17:27:31.644829469 +0800
+++ 3/fs/btrfs/ctree.h  2010-04-26 17:27:31.648830941 +0800
@@ -682,21 +682,15 @@ struct btrfs_space_info {
u64 bytes_reserved; /* total bytes the allocator has reserved for
   current allocations */
u64 bytes_readonly; /* total bytes that are read only */
-   u64 bytes_super;/* total bytes reserved for the super blocks */
-   u64 bytes_root; /* the number of bytes needed to commit a
-  transaction */
+
u64 bytes_may_use;  /* number of bytes that may be used for
   delalloc/allocations */
-   u64 bytes_delalloc; /* number of bytes currently reserved for
-  delayed allocation */
u64 disk_used;  /* total bytes used on disk */
 
int full;   /* indicates that we cannot allocate any more
   chunks for this space */
int force_alloc;/* set if we need to force a chunk alloc for
   this space */
-   int force_delalloc; /* make people start doing filemap_flush until
-  we're under a threshold */
 
struct list_head list;
 
diff -urp 2/fs/btrfs/disk-io.c 3/fs/btrfs/disk-io.c
--- 2/fs/btrfs/disk-io.c2010-04-26 17:27:31.638850832 +0800
+++ 3/fs/btrfs/disk-io.c2010-04-26 17:27:31.649830174 +0800
@@ -1472,10 +1472,6 @@ static int cleaner_kthread(void *arg)
struct btrfs_root *root = arg;
 
do {
-   smp_mb();
-   if (root->fs_info->closing)
-   break;
-
vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
 
if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
@@ -1488,11 +1484,9 @@ static int cleaner_kthread(void *arg)
if (freezing(current)) {
refrigerator();
} else {
-   smp_mb();
-   if (root->fs_info->closing)
-   break;
set_current_state(TASK_INTERRUPTIBLE);
-   schedule();
+   if (!kthread_should_stop())
+   schedule();
__set_current_state(TASK_RUNNING);
}
} while (!kthread_should_stop());
@@ -1504,36 +1498,39 @@ static int transaction_kthread(void *arg
struct btrfs_root *root = arg;
struct btrfs_trans_handle *trans;
struct btrfs_transaction *cur;
+   u64 transid;
unsigned long now;
unsigned long delay;
int ret;
 
do {
-   smp_mb();
-   if (root->fs_info->closing)
-   break;
-
delay = HZ * 30;
vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
-   mutex_lock(&root->fs_info->transaction_kthread_mutex);
 
-   mutex_lock(&root->fs_info->trans_mutex);
+   spin_lock(&root->fs_info->new_trans_lock);
cur = root->fs_info->running_transaction;
if (!cur) {
-   mutex_unlock(&root->fs_info->trans_mutex);
+   spin_unlock(&root->fs_info->new_trans_lock);
goto sleep;
}
 
now = get_seconds();
-   if (now < cur->start_time || now - cur->start_time < 30) {
-   mutex_unlock(&root->fs_info->trans_mutex);
+   if (!cur->blocked &&
+   (now < cur->start_time || now - cur->start_time < 30)) {
+   spin_unlock(&root->fs_info->new_trans_lock);
delay = HZ * 5;
goto sleep;
}
-   mutex_unlock(&root->fs_info->trans_mutex);
-   trans = btrfs_join_transaction(root, 1);
-   ret = btrfs_commit_transaction(trans, root);
+   transid = cur->transid;
+   spin_unlock(&root->fs_info->new_trans_lock);
 
+   trans = btrfs_join_transaction(root, 1);
+   if (transid == trans->transid) {
+   ret = btrfs_commit_transaction(trans, root);
+   BUG_ON(ret);
+   } else {
+   btrfs_end_transaction(trans, root);
+   }
 sleep:
wake_up_process(root->fs_info->cleaner_kthread);
mutex_unlock(&root->fs_info->transaction_kthread_mutex);
@@ -1541,10 +1538,10 @@ sleep:
if (freezing(current)) {
refrigerator();
} else {
-   if (root->fs_info->closing)
-   break;
 

[PATCH V2 07/12] Btrfs: Update metadata reservation for delayed allocation

2010-04-26 Thread Yan, Zheng
Introduce metadata reservation context for delayed allocation
and update various related functions.

This patch also introduces EXTENT_FIRST_DELALLOC control bit for
set/clear_extent_bit. It tells set/clear_bit_hook whether they
are processing the first extent_state with EXTENT_DELALLOC bit
set. This change is important if set/clear_extent_bit involves
multiple extent_state.

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 2/fs/btrfs/btrfs_inode.h 3/fs/btrfs/btrfs_inode.h
--- 2/fs/btrfs/btrfs_inode.h2010-04-26 17:26:55.450105767 +0800
+++ 3/fs/btrfs/btrfs_inode.h2010-04-26 17:26:55.456080004 +0800
@@ -137,8 +137,8 @@ struct btrfs_inode {
 * of extent items we've reserved metadata for.
 */
spinlock_t accounting_lock;
+   atomic_t outstanding_extents;
int reserved_extents;
-   int outstanding_extents;
 
/*
 * ordered_data_close is set by truncate when a file that used
diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h
--- 2/fs/btrfs/ctree.h  2010-04-26 17:26:55.451104861 +0800
+++ 3/fs/btrfs/ctree.h  2010-04-26 17:26:55.457079656 +0800
@@ -2078,19 +2078,8 @@ int btrfs_remove_block_group(struct btrf
 u64 btrfs_reduce_alloc_profile(struct btrfs_root *root, u64 flags);
 void btrfs_set_inode_space_info(struct btrfs_root *root, struct inode *ionde);
 void btrfs_clear_space_info_full(struct btrfs_fs_info *info);
-
-int btrfs_unreserve_metadata_for_delalloc(struct btrfs_root *root,
- struct inode *inode, int num_items);
-int btrfs_reserve_metadata_for_delalloc(struct btrfs_root *root,
-   struct inode *inode, int num_items);
-int btrfs_check_data_free_space(struct btrfs_root *root, struct inode *inode,
-   u64 bytes);
-void btrfs_free_reserved_data_space(struct btrfs_root *root,
-   struct inode *inode, u64 bytes);
-void btrfs_delalloc_reserve_space(struct btrfs_root *root, struct inode *inode,
-u64 bytes);
-void btrfs_delalloc_free_space(struct btrfs_root *root, struct inode *inode,
- u64 bytes);
+int btrfs_check_data_free_space(struct inode *inode, u64 bytes);
+void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes);
 int btrfs_trans_reserve_metadata(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
int num_items, int *retries);
@@ -2098,6 +2087,10 @@ void btrfs_trans_release_metadata(struct
struct btrfs_root *root);
 int btrfs_snap_reserve_metadata(struct btrfs_trans_handle *trans,
struct btrfs_pending_snapshot *pending);
+int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
+void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
+int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes);
+void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root);
 void btrfs_free_block_rsv(struct btrfs_root *root,
diff -urp 2/fs/btrfs/extent_io.c 3/fs/btrfs/extent_io.c
--- 2/fs/btrfs/extent_io.c  2010-04-26 17:26:55.447090049 +0800
+++ 3/fs/btrfs/extent_io.c  2010-04-26 17:26:55.458079658 +0800
@@ -336,21 +336,18 @@ static int merge_state(struct extent_io_
 }
 
 static int set_state_cb(struct extent_io_tree *tree,
-struct extent_state *state,
-unsigned long bits)
+struct extent_state *state, int *bits)
 {
if (tree->ops && tree->ops->set_bit_hook) {
return tree->ops->set_bit_hook(tree->mapping->host,
-  state->start, state->end,
-  state->state, bits);
+  state, bits);
}
 
return 0;
 }
 
 static void clear_state_cb(struct extent_io_tree *tree,
-  struct extent_state *state,
-  unsigned long bits)
+  struct extent_state *state, int *bits)
 {
if (tree->ops && tree->ops->clear_bit_hook)
tree->ops->clear_bit_hook(tree->mapping->host, state, bits);
@@ -368,9 +365,10 @@ static void clear_state_cb(struct extent
  */
 static int insert_state(struct extent_io_tree *tree,
struct extent_state *state, u64 start, u64 end,
-   int bits)
+   int *bits)
 {
struct rb_node *node;
+   int bits_to_set = *bits & ~EXTENT_CTLBITS;
int ret;
 
if (end < start) {
@@ -385,9 +383,9 @@ static int insert_state(struct extent_io
if (ret)
return ret;
 
-   if (bits & EXTENT_DIRTY)
+   if (bits_to_set & EXTENT_DIRTY)
   

[PATCH V2 09/12] Btrfs: Metadata reservation for orphan inodes

2010-04-26 Thread Yan, Zheng
reserve metadata space for handling orphan inodes

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 2/fs/btrfs/btrfs_inode.h 3/fs/btrfs/btrfs_inode.h
--- 2/fs/btrfs/btrfs_inode.h2010-04-26 17:27:52.113080051 +0800
+++ 3/fs/btrfs/btrfs_inode.h2010-04-26 17:27:52.118079430 +0800
@@ -151,6 +151,7 @@ struct btrfs_inode {
 * of these.
 */
unsigned ordered_data_close:1;
+   unsigned orphan_meta_reserved:1;
unsigned dummy_inode:1;
 
/*
diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h
--- 2/fs/btrfs/ctree.h  2010-04-26 17:27:52.114079844 +0800
+++ 3/fs/btrfs/ctree.h  2010-04-26 17:27:52.119079920 +0800
@@ -1068,7 +1068,6 @@ struct btrfs_root {
int ref_cows;
int track_dirty;
int in_radix;
-   int clean_orphans;
 
u64 defrag_trans_start;
struct btrfs_key defrag_progress;
@@ -1082,8 +1081,11 @@ struct btrfs_root {
 
struct list_head root_list;
 
-   spinlock_t list_lock;
+   spinlock_t orphan_lock;
struct list_head orphan_list;
+   struct btrfs_block_rsv *orphan_block_rsv;
+   int orphan_item_inserted;
+   int orphan_cleanup_state;
 
spinlock_t inode_lock;
/* red-black tree that keeps track of in-memory inodes */
@@ -2079,6 +2081,9 @@ int btrfs_trans_reserve_metadata(struct 
int num_items, int *retries);
 void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
+int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans,
+ struct inode *inode);
+void btrfs_orphan_release_metadata(struct inode *inode);
 int btrfs_snap_reserve_metadata(struct btrfs_trans_handle *trans,
struct btrfs_pending_snapshot *pending);
 int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
@@ -2403,6 +2408,13 @@ int btrfs_update_inode(struct btrfs_tran
 int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode);
 int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode);
 void btrfs_orphan_cleanup(struct btrfs_root *root);
+void btrfs_orphan_pre_snapshot(struct btrfs_trans_handle *trans,
+   struct btrfs_pending_snapshot *pending,
+   u64 *bytes_to_reserve);
+void btrfs_orphan_post_snapshot(struct btrfs_trans_handle *trans,
+   struct btrfs_pending_snapshot *pending);
+void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root);
 int btrfs_cont_expand(struct inode *inode, loff_t size);
 int btrfs_invalidate_inodes(struct btrfs_root *root);
 void btrfs_add_delayed_iput(struct inode *inode);
diff -urp 2/fs/btrfs/disk-io.c 3/fs/btrfs/disk-io.c
--- 2/fs/btrfs/disk-io.c2010-04-26 17:27:52.105081158 +0800
+++ 3/fs/btrfs/disk-io.c2010-04-26 17:27:52.120080690 +0800
@@ -895,7 +895,8 @@ static int __setup_root(u32 nodesize, u3
root->ref_cows = 0;
root->track_dirty = 0;
root->in_radix = 0;
-   root->clean_orphans = 0;
+   root->orphan_item_inserted = 0;
+   root->orphan_cleanup_state = 0;
 
root->fs_info = fs_info;
root->objectid = objectid;
@@ -905,12 +906,13 @@ static int __setup_root(u32 nodesize, u3
root->in_sysfs = 0;
root->inode_tree = RB_ROOT;
root->block_rsv = NULL;
+   root->orphan_block_rsv = NULL;
 
INIT_LIST_HEAD(&root->dirty_list);
INIT_LIST_HEAD(&root->orphan_list);
INIT_LIST_HEAD(&root->root_list);
spin_lock_init(&root->node_lock);
-   spin_lock_init(&root->list_lock);
+   spin_lock_init(&root->orphan_lock);
spin_lock_init(&root->inode_lock);
spin_lock_init(&root->accounting_lock);
mutex_init(&root->objectid_mutex);
@@ -1194,19 +1196,23 @@ again:
if (root)
return root;
 
-   ret = btrfs_find_orphan_item(fs_info->tree_root, location->objectid);
-   if (ret == 0)
-   ret = -ENOENT;
-   if (ret < 0)
-   return ERR_PTR(ret);
-
root = btrfs_read_fs_root_no_radix(fs_info->tree_root, location);
if (IS_ERR(root))
return root;
 
-   WARN_ON(btrfs_root_refs(&root->root_item) == 0);
set_anon_super(&root->anon_super, NULL);
 
+   if (btrfs_root_refs(&root->root_item) == 0) {
+   ret = -ENOENT;
+   goto fail;
+   }
+
+   ret = btrfs_find_orphan_item(fs_info->tree_root, location->objectid);
+   if (ret < 0)
+   goto fail;
+   if (ret == 0)
+   root->orphan_item_inserted = 1;
+
ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
if (ret)
goto fail;
@@ -1215,10 +1221,9 @@ again:
ret = radix_tree_insert(&fs_info->fs_roots_radix,
(unsigned long)root->root_key.objectid,

[PATCH V2 10/12] Btrfs: Metadata ENOSPC handling for tree log

2010-04-26 Thread Yan, Zheng
Previous patches make the allocator return -ENOSPC if there is no
unreserved free metadata space. This patch updates the tree log code
and various other places to propagate/handle the ENOSPC error.

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 2/fs/btrfs/disk-io.c 3/fs/btrfs/disk-io.c
--- 2/fs/btrfs/disk-io.c2010-04-26 17:28:05.496079922 +0800
+++ 3/fs/btrfs/disk-io.c2010-04-26 17:28:05.506079726 +0800
@@ -973,42 +973,6 @@ static int find_and_setup_root(struct bt
return 0;
 }
 
-int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
-struct btrfs_fs_info *fs_info)
-{
-   struct extent_buffer *eb;
-   struct btrfs_root *log_root_tree = fs_info->log_root_tree;
-   u64 start = 0;
-   u64 end = 0;
-   int ret;
-
-   if (!log_root_tree)
-   return 0;
-
-   while (1) {
-   ret = find_first_extent_bit(&log_root_tree->dirty_log_pages,
-   0, &start, &end, EXTENT_DIRTY | EXTENT_NEW);
-   if (ret)
-   break;
-
-   clear_extent_bits(&log_root_tree->dirty_log_pages, start, end,
- EXTENT_DIRTY | EXTENT_NEW, GFP_NOFS);
-   }
-   eb = fs_info->log_root_tree->node;
-
-   WARN_ON(btrfs_header_level(eb) != 0);
-   WARN_ON(btrfs_header_nritems(eb) != 0);
-
-   ret = btrfs_free_reserved_extent(fs_info->tree_root,
-   eb->start, eb->len);
-   BUG_ON(ret);
-
-   free_extent_buffer(eb);
-   kfree(fs_info->log_root_tree);
-   fs_info->log_root_tree = NULL;
-   return 0;
-}
-
 static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info)
 {
diff -urp 2/fs/btrfs/disk-io.h 3/fs/btrfs/disk-io.h
--- 2/fs/btrfs/disk-io.h2010-04-26 17:28:05.495079921 +0800
+++ 3/fs/btrfs/disk-io.h2010-04-26 17:28:05.507080566 +0800
@@ -95,8 +95,6 @@ int btrfs_congested_async(struct btrfs_f
 unsigned long btrfs_async_submit_limit(struct btrfs_fs_info *info);
 int btrfs_write_tree_block(struct extent_buffer *buf);
 int btrfs_wait_tree_block_writeback(struct extent_buffer *buf);
-int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
-struct btrfs_fs_info *fs_info);
 int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info);
 int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
diff -urp 2/fs/btrfs/file-item.c 3/fs/btrfs/file-item.c
--- 2/fs/btrfs/file-item.c  2010-04-26 17:28:05.503100326 +0800
+++ 3/fs/btrfs/file-item.c  2010-04-26 17:28:05.507080566 +0800
@@ -656,6 +656,9 @@ again:
goto found;
}
ret = PTR_ERR(item);
+   if (ret != -EFBIG && ret != -ENOENT)
+   goto fail_unlock;
+
if (ret == -EFBIG) {
u32 item_size;
/* we found one, but it isn't big enough yet */
diff -urp 2/fs/btrfs/tree-log.c 3/fs/btrfs/tree-log.c
--- 2/fs/btrfs/tree-log.c   2010-04-26 17:28:05.498105836 +0800
+++ 3/fs/btrfs/tree-log.c   2010-04-26 17:28:05.509079730 +0800
@@ -134,6 +134,7 @@ static int start_log_trans(struct btrfs_
   struct btrfs_root *root)
 {
int ret;
+   int err = 0;
 
mutex_lock(&root->log_mutex);
if (root->log_root) {
@@ -154,17 +155,19 @@ static int start_log_trans(struct btrfs_
mutex_lock(&root->fs_info->tree_log_mutex);
if (!root->fs_info->log_root_tree) {
ret = btrfs_init_log_root_tree(trans, root->fs_info);
-   BUG_ON(ret);
+   if (ret)
+   err = ret;
}
-   if (!root->log_root) {
+   if (err == 0 && !root->log_root) {
ret = btrfs_add_log_tree(trans, root);
-   BUG_ON(ret);
+   if (ret)
+   err = ret;
}
mutex_unlock(&root->fs_info->tree_log_mutex);
root->log_batch++;
atomic_inc(&root->log_writers);
mutex_unlock(&root->log_mutex);
-   return 0;
+   return err;
 }
 
 /*
@@ -375,7 +378,7 @@ insert:
BUG_ON(ret);
}
} else if (ret) {
-   BUG();
+   return ret;
}
dst_ptr = btrfs_item_ptr_offset(path->nodes[0],
path->slots[0]);
@@ -1698,9 +1701,9 @@ static noinline int walk_down_log_tree(s
 
next = btrfs_find_create_tree_block(root, bytenr, blocksize);
 
-   wc->process_func(root, next, wc, ptr_gen);
-
if (*level == 1) {
+   wc->process_func(root, next, wc, ptr_gen);
+
path->slots[*level]++;
if (wc->free) {
btrfs_read_buffer(next, ptr_gen);
@@ -1733,35 +1736,7 @@ static noinline int 

[PATCH V2 11/12] Btrfs: Pre-allocate space for data relocation

2010-04-26 Thread Yan, Zheng
Pre-allocate space for data relocation. This can detect ENOSPC
conditions caused by fragmentation of free space.

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h
--- 2/fs/btrfs/ctree.h  2010-04-26 17:28:20.493839748 +0800
+++ 3/fs/btrfs/ctree.h  2010-04-26 17:28:20.498830465 +0800
@@ -2419,6 +2419,9 @@ int btrfs_cont_expand(struct inode *inod
 int btrfs_invalidate_inodes(struct btrfs_root *root);
 void btrfs_add_delayed_iput(struct inode *inode);
 void btrfs_run_delayed_iputs(struct btrfs_root *root);
+int btrfs_prealloc_file_range(struct inode *inode, int mode,
+ u64 start, u64 num_bytes, u64 min_size,
+ loff_t actual_len, u64 *alloc_hint);
 extern const struct dentry_operations btrfs_dentry_operations;
 
 /* ioctl.c */
diff -urp 2/fs/btrfs/inode.c 3/fs/btrfs/inode.c
--- 2/fs/btrfs/inode.c  2010-04-26 17:28:20.489839672 +0800
+++ 3/fs/btrfs/inode.c  2010-04-26 17:28:20.500829420 +0800
@@ -1174,6 +1174,13 @@ out_check:
   num_bytes, num_bytes, type);
BUG_ON(ret);
 
+   if (root->root_key.objectid ==
+   BTRFS_DATA_RELOC_TREE_OBJECTID) {
+   ret = btrfs_reloc_clone_csums(inode, cur_offset,
+ num_bytes);
+   BUG_ON(ret);
+   }
+
extent_clear_unlock_delalloc(inode, &BTRFS_I(inode)->io_tree,
cur_offset, cur_offset + num_bytes - 1,
locked_page, EXTENT_CLEAR_UNLOCK_PAGE |
@@ -6079,16 +6086,15 @@ out_unlock:
return err;
 }
 
-static int prealloc_file_range(struct inode *inode, u64 start, u64 end,
-   u64 alloc_hint, int mode, loff_t actual_len)
+int btrfs_prealloc_file_range(struct inode *inode, int mode,
+ u64 start, u64 num_bytes, u64 min_size,
+ loff_t actual_len, u64 *alloc_hint)
 {
struct btrfs_trans_handle *trans;
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_key ins;
u64 cur_offset = start;
-   u64 num_bytes = end - start;
int ret = 0;
-   u64 i_size;
 
while (num_bytes > 0) {
trans = btrfs_start_transaction(root, 3);
@@ -6097,9 +6103,8 @@ static int prealloc_file_range(struct in
break;
}
 
-   ret = btrfs_reserve_extent(trans, root, num_bytes,
-  root->sectorsize, 0, alloc_hint,
-  (u64)-1, &ins, 1);
+   ret = btrfs_reserve_extent(trans, root, num_bytes, min_size,
+  0, *alloc_hint, (u64)-1, &ins, 1);
if (ret) {
btrfs_end_transaction(trans, root);
break;
@@ -6116,20 +6121,19 @@ static int prealloc_file_range(struct in
 
num_bytes -= ins.offset;
cur_offset += ins.offset;
-   alloc_hint = ins.objectid + ins.offset;
+   *alloc_hint = ins.objectid + ins.offset;
 
inode->i_ctime = CURRENT_TIME;
BTRFS_I(inode)->flags |= BTRFS_INODE_PREALLOC;
if (!(mode & FALLOC_FL_KEEP_SIZE) &&
-   (actual_len > inode->i_size) &&
-   (cur_offset > inode->i_size)) {
-
+   (actual_len > inode->i_size) &&
+   (cur_offset > inode->i_size)) {
if (cur_offset > actual_len)
-   i_size  = actual_len;
+   i_size_write(inode, actual_len);
else
-   i_size = cur_offset;
-   i_size_write(inode, i_size);
-   btrfs_ordered_update_i_size(inode, i_size, NULL);
+   i_size_write(inode, cur_offset);
+   i_size_write(inode, cur_offset);
+   btrfs_ordered_update_i_size(inode, cur_offset, NULL);
}
 
ret = btrfs_update_inode(trans, root, inode);
@@ -6215,16 +6219,16 @@ static long btrfs_fallocate(struct inode
if (em->block_start == EXTENT_MAP_HOLE ||
(cur_offset >= inode->i_size &&
 !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
-   ret = prealloc_file_range(inode,
- cur_offset, last_byte,
-   alloc_hint, mode, offset+len);
+   ret = btrfs_prealloc_file_range(inode, 0, cur_offset,
+   last_byte - cur_offset,
+   1 << inode->i_blkbits,
+

Re: NFS mount attempts hangs with btrfs on server side

2010-04-26 Thread Jan Engelhardt

On Wednesday 2010-04-21 12:37, Manio wrote:
 On 2010-04-21 12:29, Jan Engelhardt wrote:
 I'd rather have a real description rather than stuff like that.
 Since I am not on debian, old nfs-utils would not happen -
 rpcbind-0.1.6+git20080930 and nfs-kernel-server-1.1.3
 should be very much recent enough.
 Especially since exportability of filesystems is usually not
 so much a userspace thing.

 Unfortunatelly i can't tell you which package exactly was causing the problem.

So I bisected it then.

commit 3a340251597a5b0c579c31d8caf9aa3b53a77016
Author: David Woodhouse david.woodho...@intel.com
Date:   Thu Aug 28 11:05:17 2008 -0400

Use fsid from statfs for UUID if blkid can't cope (or not used)

Signed-off-by: David Woodhouse david.woodho...@intel.com
Signed-off-by: Steve Dickson ste...@redhat.com

It is the commit enabling btrfs mounts. The message hints to blkid
being unable to get the uuid or something, however, the standalone
/sbin/blkid tool does get a uuid:

/dev/loop2: UUID=e19fe89b-cde3-4ccc-bc70-b759a57bd1c9
UUID_SUB=f29c6218-d040-4546-a227-4dd2d2142817 TYPE=btrfs 

So what went wrong here?
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?

2010-04-26 Thread Theodore Tso

On Apr 26, 2010, at 6:18 AM, KOSAK
 AOP_WRITEPAGE_ACTIVATE was introduced for ramdisk and tmpfs thing
 (and later rd choosed to use another way).
 Then, It assume writepage refusing aren't happen on majority pages.
 IOW, the VM assume other many pages can writeout although the page can't.
 Then, the VM only make page activation if AOP_WRITEPAGE_ACTIVATE is returned.
 but now ext4 and btrfs refuse all writepage(). (right?)

No, not exactly.   Btrfs refuses the writepage() in the direct reclaim cases 
(i.e., if PF_MEMALLOC is set), but will do writepage() in the case of zone 
scanning.  I don't want to speak for Chris, but I assume it's due to stack 
depth concerns --- if it was just due to worrying about fs recursion issues, I
assume all of the btrfs allocations could be done GFP_NOFS.

Ext4 is slightly different; it refuses writepages() if the inode blocks for the 
page haven't yet been allocated.  (Regardless of whether it's happening for 
direct reclaim or zone scanning.)  However, if the on-disk block has been 
assigned (i.e., this isn't a delalloc case), ext4 will honor the writepage().   
(i.e., if this is an mmap of an already existing file, or if the space has been 
pre-allocated using fallocate()).  The reason for ext4's concern is lock
ordering, although I'm investigating whether I can fix this.   If we call 
set_page_writeback() to set PG_writeback (plus set the various bits of magic fs 
accounting), and then drop the page_lock, does that protect us from random 
changes happening to the page (i.e., from vmtruncate, etc.)?

 
 IOW, I don't think such documentation suppose delayed allocation issue ;)
 
 The point is, Our dirty page accounting only account per-system-memory
 dirty ratio and per-task dirty pages. but It doesn't account per-numa-node
 nor per-zone dirty ratio. and then, to refuse write page and fake numa
 abusing can make confusing our vm easily. if _all_ pages in our VM LRU
 list (it's per-zone), page activation doesn't help. It also lead to OOM.
 
 And I'm sorry. I have to say now all vm developers fake numa is not
 production level quority yet. afaik, nobody have seriously tested our
 vm code on such environment. (linux/arch/x86/Kconfig says This is only 
 useful for debugging.)

So I'm sorry I mentioned the fake numa bit, since I think this is a bit of a 
red herring.   That code is in production here, and we've made all sorts of 
changes so it can be used for more than just debugging.  So please ignore it,
it's our local hack, and if it breaks that's our problem.  More importantly,
just two weeks ago I talked to someone in the financial sector, who was testing
out ext4 on an upstream kernel, and not using our hacks that force 128MB zones,
and he ran into the ext4/OOM problem while using an upstream kernel.  It
involved Oracle pinning down 3G worth of pages, and him trying to do a huge
streaming backup (which of course wasn't using fallocate or direct I/O) under
ext4, and he had the same issue --- an OOM, that I'm pretty sure was caused by
the fact that ext4_writepage() was refusing the writepage() and most of the
pages that weren't nailed down by Oracle were delalloc.  The same test scenario
using ext3 worked just fine, of course.

Under normal cases it's not a problem since statistically there should be 
enough other pages in the system compared to the number of pages that are 
subject to delalloc, such that pages can usually get pushed out until the 
writeback code can get around to writing out the pages.   But in cases where 
the zones have been made artificially small, or you have a big program like 
Oracle pinning down a large number of pages, then of course we have problems. 

I'm trying to fix things from the file system side, which means trying to 
understand magic flags like AOP_WRITEPAGE_ACTIVATE, which is described in 
Documentation/filesystems/Locking as something which MUST be used if 
writepage() is going to refuse a page.  And then I discovered no one is actually
using it.   So that's why I was asking whether the Locking
documentation file was out of date, or whether all of the file systems are 
doing it wrong.
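
The protocol described in the documentation can be modelled in user space. The following is only a rough sketch: every type and function in it (mock_page, mock_writepage, mock_reclaim_one) is a hypothetical stand-in rather than kernel code, and AOP_WRITEPAGE_ACTIVATE is redefined locally as an illustrative constant. It shows a writepage that refuses delalloc pages by redirtying them, leaving them locked, and returning AOP_WRITEPAGE_ACTIVATE, and a caller that reacts by activating and unlocking the page instead of retrying it.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the kernel constant of the same name. */
#define AOP_WRITEPAGE_ACTIVATE 0x80000

/* Hypothetical user-space model of a page; not the kernel's struct page. */
struct mock_page {
    bool dirty;
    bool locked;
    bool active;    /* moved to the active list by the "VM" */
    bool delalloc;  /* no on-disk block assigned yet */
};

/* Filesystem side: refuse delalloc pages, "write" everything else. */
static int mock_writepage(struct mock_page *page)
{
    assert(page->locked);           /* entered with the page locked */
    if (page->delalloc) {
        page->dirty = true;         /* models redirtying the page */
        return AOP_WRITEPAGE_ACTIVATE;  /* page is left locked */
    }
    page->dirty = false;            /* pretend the I/O completed */
    page->locked = false;
    return 0;
}

/* VM side: returns true if the page was cleaned, false if it was refused. */
static bool mock_reclaim_one(struct mock_page *page)
{
    int ret;

    page->locked = true;
    ret = mock_writepage(page);
    if (ret == AOP_WRITEPAGE_ACTIVATE) {
        page->active = true;        /* stop calling writepage on this page */
        page->locked = false;       /* the caller drops the lock */
        return false;
    }
    return ret == 0;
}
```

Note that this model only helps when refused pages are a minority; if every page on the list is refused, activating them frees nothing.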

As a related example of how file system code isn't necessarily following what
is required/recommended by the Locking documentation, ext2 and ext3 are both 
NOT using set_page_writeback()/end_page_writeback(), but are rather keeping the 
page locked until after they call block_write_full_page(), because of concerns 
of truncate coming in and screwing things up.   But now looking at Locking, it 
appears that set_page_writeback() is as good as page_lock() for preventing the 
truncate code from coming in and screwing everything up?   It's not clear to me 
exactly what locking guarantees are provided against truncate by 
set_page_writeback().   And suppose we are writing out a whole cluster of 
pages, say 4MB worth of pages; do we need to call set_page_writeback() on every 
single page in the cluster before we do the I/O to make sure 

Re: list subvolumes with new btrfs command

2010-04-26 Thread C Anthony Risinger
 I am using ubuntu-10.04-rc with kernel compiled from the almost
 lastest source , the btrfs-progs is latest too.

 You can modify line

  fprintf(stderr, "ERROR: can't perform the search\n");
 to
  fprintf(stderr, "ERROR: can't perform the search: %s\n", strerror(errno));

 to see what happened on earth.

nice:

$ sudo btrfs subvolume list /
ERROR: can't perform the search: Inappropriate ioctl for device

i'm not really familiar with C, or anything this low level, does this
help you diagnose my problem?

thanks again for the help thus far,
C Anthony


Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?

2010-04-26 Thread Chris Mason
On Mon, Apr 26, 2010 at 10:50:45AM -0400, Theodore Tso wrote:
 
 On Apr 26, 2010, at 6:18 AM, KOSAK
  AOP_WRITEPAGE_ACTIVATE was introduced for ramdisk and tmpfs thing
  (and later rd choosed to use another way).
  Then, It assume writepage refusing aren't happen on majority pages.
  IOW, the VM assume other many pages can writeout although the page can't.
  Then, the VM only make page activation if AOP_WRITEPAGE_ACTIVATE is 
  returned.
  but now ext4 and btrfs refuse all writepage(). (right?)
 
 No, not exactly.   Btrfs refuses the writepage() in the direct reclaim
 cases (i.e., if PF_MEMALLOC is set), but will do writepage() in the
 case of zone scanning.  I don't want to speak for Chris, but I assume
 it's due to stack depth concerns --- if it was just due to worrying
 about fs recursion issues, I assume all of the btrfs allocations could
 be done GFP_NOFS.
 

Btrfs refuses all PF_MEMALLOC writepage.  It will go ahead and process a
regular writepage but in practice that never happens...everyone else
except a few internal btrfs callers use writepages.

I wish I had thought of stack depth back then, but really this was to
keep kswapd out of the heavy work done by delalloc.  From a locking
point of view we're properly GFP_NOFS, so it's safe, but it just isn't a
great way to use precious PF_MEMALLOC cycles.
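
As a rough user-space illustration of the behaviour described above (refuse the write when the caller is in reclaim context, otherwise process it), consider the sketch below. All names are hypothetical stand-ins: MOCK_PF_MEMALLOC models the PF_MEMALLOC task flag, and mock_btrfs_writepage is not the real btrfs function.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_PF_MEMALLOC 0x1u   /* hypothetical stand-in for PF_MEMALLOC */

struct mock_task { unsigned int flags; };

/* Hypothetical model of a page; not the kernel's struct page. */
struct demo_page {
    bool dirty;
    bool locked;
    int writes;     /* how many times the page was actually written */
};

/*
 * Model of a writepage that refuses reclaim-context callers: the page is
 * redirtied and unlocked, and 0 is returned without doing any I/O.
 */
static int mock_btrfs_writepage(const struct mock_task *task,
                                struct demo_page *page)
{
    assert(page->locked);
    if (task->flags & MOCK_PF_MEMALLOC) {
        page->dirty = true;     /* models redirtying the page */
        page->locked = false;
        return 0;
    }
    page->writes++;             /* the heavy delalloc work would go here */
    page->dirty = false;
    page->locked = false;
    return 0;
}
```

In this model the reclaim caller sees success but the page stays dirty, which matches the concern raised earlier in the thread: the VM has no signal that the write was skipped.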

 Ext4 is slightly different; it refuses writepages() if the inode
 blocks for the page haven't yet been allocated.  (Regardless of
 whether it's happening for direct reclaim or zone scanning.)  However,
 if the on-disk block has been assigned (i.e., this isn't a delalloc
 case), ext4 will honor the writepage().   (i.e., if this is an mmap of
 an already existing file, or if the space has been pre-allocated using
 fallocate()).  The reason for ext4's concern is lock ordering,
 although I'm investigating whether I can fix this.   If we call
 set_page_writeback() to set PG_writeback (plus set the various bits of
 magic fs accounting), and then drop the page_lock, does that protect
 us from random changes happening to the page (i.e., from vmtruncate,
 etc.)?

PG_writeback will protect you from vmtruncate, but you may also want to
have page_mkwrite wait for pages in flight.

 
  
  IOW, I don't think such documentation suppose delayed allocation issue ;)
  
  The point is, Our dirty page accounting only account per-system-memory
  dirty ratio and per-task dirty pages. but It doesn't account per-numa-node
  nor per-zone dirty ratio. and then, to refuse write page and fake numa
  abusing can make confusing our vm easily. if _all_ pages in our VM LRU
  list (it's per-zone), page activation doesn't help. It also lead to OOM.
  
  And I'm sorry. I have to say now all vm developers fake numa is not
  production level quority yet. afaik, nobody have seriously tested our
  vm code on such environment. (linux/arch/x86/Kconfig says This is only 
  useful for debugging.)
 

 So I'm sorry I mentioned the fake numa bit, since I think this is a
 bit of a red herring.   That code is in production here, and we've
 made all sorts of changes so it can be used for more than just
 debugging.  So please ignore it, it's our local hack, and if it breaks
 that's our problem.  More importantly, just two weeks ago I talked
 to someone in the financial sector, who was testing out ext4 on an
 upstream kernel, and not using our hacks that force 128MB zones, and
 he ran into the ext4/OOM problem while using an upstream kernel.  It
 involved Oracle pinning down 3G worth of pages, and him trying to do a
 huge streaming backup (which of course wasn't using fallocate or
 direct I/O) under ext4, and he had the same issue --- an OOM, that I'm
 pretty sure was caused by the fact that ext4_writepage() was refusing
 the writepage() and most of the pages that weren't nailed down by Oracle
 were delalloc.  The same test scenario using ext3 worked just fine,
 of course.
 
 Under normal cases it's not a problem since statistically there should
 be enough other pages in the system compared to the number of pages
 that are subject to delalloc, such that pages can usually get pushed
 out until the writeback code can get around to writing out the pages.
 But in cases where the zones have been made artificially small, or you
 have a big program like Oracle pinning down a large number of pages,
 then of course we have problems. 
 
 I'm trying to fix things from the file system side, which means trying
 to understand magic flags like AOP_WRITEPAGE_ACTIVATE, which is
 described in Documentation/filesystems/Locking as something which MUST
 be used if writepage() is going to refuse a page.  And then I discovered
 no one is actually using it.   So that's why I was asking
 whether the Locking documentation file was out of date, or whether all
 of the file systems are doing it wrong.
 

 As a related example of how file system code isn't necessarily
 following what is required/recommended by the Locking documentation,
 ext2 and ext3 are both NOT using
 

Re: list subvolumes with new btrfs command

2010-04-26 Thread Hubert Kario
On Monday 26 April 2010 19:23:21 C Anthony Risinger wrote:
  I am using ubuntu-10.04-rc with kernel compiled from the almost
  lastest source , the btrfs-progs is latest too.
 
  You can modify line
 
   fprintf(stderr, ERROR: can't perform the search\n);
  to
   fprintf(stderr, ERROR: can't perform the search: %s\n,
  strerror(errno));
 
  to see what happened on earth.
 
 nice:
 
 $ sudo btrfs subvolume list /
 ERROR: can't perform the search: Inappropriate ioctl for device
 
 i'm not really familiar with C, or anything this low level, does this
 help you diagnose my problem?

Have you tried to run it on the device with the btrfs, not the mount point?

It looks like the ioctl was made too restrictive about its arguments.

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl

System Zarządzania Jakością
zgodny z normą ISO 9001:2000


Re: list subvolumes with new btrfs command

2010-04-26 Thread C Anthony Risinger
On Mon, Apr 26, 2010 at 12:58 PM, Hubert Kario h...@qbs.com.pl wrote:
 On Monday 26 April 2010 19:23:21 C Anthony Risinger wrote:
  I am using ubuntu-10.04-rc with kernel compiled from the almost
  lastest source , the btrfs-progs is latest too.
 
  You can modify line
 
   fprintf(stderr, ERROR: can't perform the search\n);
  to
   fprintf(stderr, ERROR: can't perform the search: %s\n,
  strerror(errno));
 
  to see what happened on earth.

 nice:

 $ sudo btrfs subvolume list /
 ERROR: can't perform the search: Inappropriate ioctl for device

 i'm not really familiar with C, or anything this low level, does this
 help you diagnose my problem?

 Have you tried to run it on the device with the btrfs, not the mount point?

 It looks like the ioctl was made too restrictive about its arguments.

ah yes i missed mentioning that too, tried that:

$ sudo btrfs sub list /dev/sda2
ERROR: '/dev/sda2' is not a subvolume

no dice :(


Re: list subvolumes with new btrfs command

2010-04-26 Thread C Anthony Risinger
On Mon, Apr 26, 2010 at 2:10 PM, C Anthony Risinger anth...@extof.me wrote:
 On Mon, Apr 26, 2010 at 12:58 PM, Hubert Kario h...@qbs.com.pl wrote:
 On Monday 26 April 2010 19:23:21 C Anthony Risinger wrote:
  I am using ubuntu-10.04-rc with kernel compiled from the almost
  lastest source , the btrfs-progs is latest too.
 
  You can modify line
 
   fprintf(stderr, ERROR: can't perform the search\n);
  to
   fprintf(stderr, ERROR: can't perform the search: %s\n,
  strerror(errno));
 
  to see what happened on earth.

 nice:

 $ sudo btrfs subvolume list /
 ERROR: can't perform the search: Inappropriate ioctl for device

 i'm not really familiar with C, or anything this low level, does this
 help you diagnose my problem?

 Have you tried to run it on the device with the btrfs, not the mount point?

 It looks like the ioctl was made too restrictive about its arguments.

 ah yes i missed mentioning that too, tried that:

 $ sudo btrfs sub list /dev/sda2
 ERROR: '/dev/sda2' is not a subvolume

 no dice :(

i tried setting up loopback with a newly formatted btrfs image +
mounting, same result: Inappropriate ioctl for device.  same error
whether i point the command at the default subvolume or a snapshot.
is there anything (missing) i should check in regards to my kernel
(module/progs mismatch)?


Re: list subvolumes with new btrfs command

2010-04-26 Thread C Anthony Risinger
On Mon, Apr 26, 2010 at 3:51 PM, C Anthony Risinger anth...@extof.me wrote:
 On Mon, Apr 26, 2010 at 2:10 PM, C Anthony Risinger anth...@extof.me wrote:
 On Mon, Apr 26, 2010 at 12:58 PM, Hubert Kario h...@qbs.com.pl wrote:
 On Monday 26 April 2010 19:23:21 C Anthony Risinger wrote:
  I am using ubuntu-10.04-rc with kernel compiled from the almost
  lastest source , the btrfs-progs is latest too.
 
  You can modify line
 
   fprintf(stderr, ERROR: can't perform the search\n);
  to
   fprintf(stderr, ERROR: can't perform the search: %s\n,
  strerror(errno));
 
  to see what happened on earth.

 nice:

 $ sudo btrfs subvolume list /
 ERROR: can't perform the search: Inappropriate ioctl for device

 i'm not really familiar with C, or anything this low level, does this
 help you diagnose my problem?

 Have you tried to run it on the device with the btrfs, not the mount point?

 It looks like the ioctl was made too restrictive about its arguments.

 ah yes i missed mentioning that too, tried that:

 $ sudo btrfs sub list /dev/sda2
 ERROR: '/dev/sda2' is not a subvolume

 no dice :(

 i tried setting up loopback with a newly formatted btrfs image +
 mounting, same result: Inappropriate ioctl for device.  same error
 whether i point the command at the default subvolume or a snapshot.
 is there anything (missing) i should check in regards to my kernel
 (module/progs mismatch)?

bleh, looks like my kernel didn't have what it needed; i thought
2.6.33/stock Arch kernel was recent enough.  i booted a 2.6.34rc5
kernel and everything works now:

$ sudo btrfs sub list /
ID 259 top level 5 path vps/var/lib/vps-lxc/tpl/arch-nano
ID 260 top level 5 path vps/var/lib/vps-lxc/dom/dom1

heh, i forgot about those snapshots :-).  i will compensate for this
possibility in my initrd hook.

apologies for the noise.  on a parting note, the strerror(errno) was
a nice change, and might be a useful addition for others, as it also
pointed me in the right direction for permission problems (without
sudo/non-super):

$ btrfs sub list /
ERROR: can't perform the search: Operation not permitted
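
for anyone wanting to try the same change elsewhere, here is a minimal sketch of the pattern. the helper name format_search_error is made up for illustration and is not part of btrfs-progs; it only shows how strerror() turns an errno value into the readable text seen above (on glibc, ENOTTY reads "Inappropriate ioctl for device" and EPERM reads "Operation not permitted").

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the error line with the errno text appended.
 * Returns the value of snprintf(), i.e. the length the full message needs.
 */
static int format_search_error(char *buf, size_t len, int err)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "ERROR: can't perform the search: %s",
                    strerror(err));
}
```

a caller would pass the errno value saved right after the failing ioctl(), since later library calls may overwrite errno.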

other than that, thanks for the assistance; the new btrfs tool is nice.

C Anthony


[For review] [PATCH] Check all kmalloc(), etc, functions for success

2010-04-26 Thread Chris Samuel

Hi Chris, et. al,

I was bored on the long flight from Melbourne to LA (and kept
awake by crying babies) so I thought I'd dip my toe into kernel
programming and thought I'd see if any results from kmalloc()
were being used without being checked for success first.

Turns out there were quite a few that I found by hand with a
simple git grep -A2 kmalloc fs/btrfs and so I've gone through
and either BUG_ON()'d them or made them return -ENOMEM in those
cases where the return value is checked.

I then installed Coccinelle (packaged in Ubuntu 10.04) and
used the kmtest.cocci file to pick up other cases of memory
allocations that need to be tested:

http://coccinelle.lip6.fr/impact/kmtest.html

There was one odd case in fs/btrfs/inode.c where the kzalloc()
was preceded by a WARN_ON(pages); which would always be true
as the only prior reference was its declaration and initialisation
to NULL, so I took the liberty of moving that after the allocation
and changing it to a BUG_ON().

As I'm new to this I'm only using BUG_ON() as that seems to be
used elsewhere for this case in btrfs but the kernel itself
(include/asm-generic/bug.h) seems to indicate that you should
only use BUG_ON() as a last resort.
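
To make the two options concrete, here is a user-space analogue of the "return -ENOMEM" variant, using malloc() in place of kmalloc(). It is only a sketch: demo_extent and demo_add_extent are hypothetical names and do not correspond to the btrfs structures touched by the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical record, loosely modelled on a small allocated struct. */
struct demo_extent {
    unsigned long long start;
    unsigned long long ram_size;
};

/*
 * Allocate and fill a record.  On allocation failure, report -ENOMEM to
 * the caller instead of dereferencing a NULL pointer or aborting.
 */
static int demo_add_extent(struct demo_extent **out,
                           unsigned long long start,
                           unsigned long long ram_size)
{
    struct demo_extent *e;

    assert(out != NULL);
    e = malloc(sizeof(*e));
    if (!e)
        return -ENOMEM;     /* caller must check and unwind */
    e->start = start;
    e->ram_size = ram_size;
    *out = e;
    return 0;
}
```

This variant only works where every caller checks the return value, which is why the patch falls back to BUG_ON() in the places where it does not.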

Please review the patch and let me know whether I'm on the
right track or not!  Just be gentle with me, I'm jetlagged. :-)

Patch is included inline and also attached as a MIME attachment
to give a better alternative in case of wordwrap breakage in
the inline version.

All the best,
Chris

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 396039b..eb6e785 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -351,6 +351,7 @@ int btrfs_submit_compressed_write(struct inode 
*inode, u64 start,


WARN_ON(start & ((u64)PAGE_CACHE_SIZE - 1));
cb = kmalloc(compressed_bio_size(root, compressed_len), GFP_NOFS);
+   BUG_ON(!cb);
atomic_set(&cb->pending_bios, 0);
cb->errors = 0;
cb->inode = inode;
@@ -588,6 +589,7 @@ int btrfs_submit_compressed_read(struct inode 
*inode, struct bio *bio,


compressed_len = em->block_len;
cb = kmalloc(compressed_bio_size(root, compressed_len), GFP_NOFS);
+   BUG_ON(!cb);
atomic_set(&cb->pending_bios, 0);
cb->errors = 0;
cb->inode = inode;
@@ -609,6 +611,7 @@ int btrfs_submit_compressed_read(struct inode 
*inode, struct bio *bio,

 PAGE_CACHE_SIZE;
cb->compressed_pages = kmalloc(sizeof(struct page *) * nr_pages,
   GFP_NOFS);
+   BUG_ON(!cb->compressed_pages);
bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;

for (page_index = 0; page_index < nr_pages; page_index++) {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e7b8f2c..3e5f0ff 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1967,6 +1967,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,

log_tree_root = kzalloc(sizeof(struct btrfs_root),
  GFP_NOFS);
+   BUG_ON(!log_tree_root);

__setup_root(nodesize, leafsize, sectorsize, stripesize,
 log_tree_root, fs_info, 
BTRFS_TREE_LOG_OBJECTID);

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b34d32f..6e20c54 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7161,6 +7161,8 @@ static noinline int relocate_one_extent(struct 
btrfs_root *extent_root,

u64 group_start = group->key.objectid;
new_extents = kmalloc(sizeof(*new_extents),
  GFP_NOFS);
+   if(!new_extents)
+   goto out;
nr_extents = 1;
ret = get_new_locations(reloc_inode,
extent_key,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 29ff749..59765bc 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -877,6 +877,7 @@ static ssize_t btrfs_file_write(struct file *file, 
const char __user *buf,

file_update_time(file);

pages = kmalloc(nrptrs * sizeof(struct page *), GFP_KERNEL);
+   BUG_ON(!pages);

/* generic_write_checks can change our pos */
start_pos = pos;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2bfdc64..d1216ba 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -284,6 +284,7 @@ static noinline int add_async_extent(struct 
async_cow *cow,

struct async_extent *async_extent;

async_extent = kmalloc(sizeof(*async_extent), GFP_NOFS);
+   BUG_ON(!async_extent);
async_extent->start = start;
async_extent->ram_size = ram_size;
async_extent->compressed_size = compressed_size;
@@ -382,8 +383,8 @@ again:
if (!(BTRFS_I(inode)->flags &
