[RFC][PATCH] Btrfs: Fix the write error into prealloc file

2011-09-28 Thread WuBo
reproduce:
dd if=/dev/zero of=prealloc_test bs=4K count=1
fallocate -n -o 4K -l 1M prealloc_test
dd if=/dev/zero of=tmpfile1 bs=1M
dd if=/dev/zero of=tmpfile2 bs=4K
dd if=/dev/zero of=prealloc_test seek=1 bs=4K count=2 conv=notrunc

Although the prealloc_test file still has preallocated space, the last
write fails, because the reservation code thinks the space is full (the
two tmpfile writes have consumed all remaining free space). Before
reserving data space for a write, check whether the inode has a prealloc
extent covering the range; if it does, we don't need to reserve the
space. An extra bit, EXTENT_PREALLOC, records the extent_state ranges
that need no space reservation.

There is another danger: after we skip the reservation for a prealloc
range, a writeback thread may convert that prealloc range into a regular
range, and the space accounting would then be wrong. What we do is wait
for all the dirty pages in the prealloc range to be written back.

I marked this as an RFC because the patch causes some performance
degradation as well as increased disk fragmentation.

Signed-off-by: Wu Bo 
---
 fs/btrfs/ctree.h   |6 ++-
 fs/btrfs/extent-tree.c |   30 ++--
 fs/btrfs/extent_io.c   |   17 +++
 fs/btrfs/extent_io.h   |6 +++
 fs/btrfs/file.c|2 +-
 fs/btrfs/inode.c   |  114 ---
 fs/btrfs/ioctl.c   |2 +-
 7 files changed, 161 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8b99c79..030cd28 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2234,7 +2234,8 @@ int btrfs_snap_reserve_metadata(struct btrfs_trans_handle *trans,
struct btrfs_pending_snapshot *pending);
 int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
 void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
-int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes);
+int btrfs_delalloc_reserve_space(struct inode *inode, u64 start,
+u64 num_bytes);
 void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root);
@@ -2595,6 +2596,9 @@ int btrfs_prealloc_file_range_trans(struct inode *inode,
u64 start, u64 num_bytes, u64 min_size,
loff_t actual_len, u64 *alloc_hint);
 extern const struct dentry_operations btrfs_dentry_operations;
+extern int btrfs_search_prealloc_file_range(struct inode *inode,
+   u64 start, u64 len,
+   u64 *need_reserve);
 
 /* ioctl.c */
 long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 80d6148..2a37571 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4054,17 +4054,37 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes)
to_free);
 }
 
-int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes)
+int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 num_bytes)
 {
int ret;
+   u64 need_reserve = num_bytes;
 
-   ret = btrfs_check_data_free_space(inode, num_bytes);
-   if (ret)
-   return ret;
+   if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC) {
+   struct extent_state *cached_state = NULL;
+
+   lock_extent_bits(&BTRFS_I(inode)->io_tree, start,
+start + num_bytes - 1, 0,
+&cached_state, GFP_NOFS);
+
+   ret = btrfs_search_prealloc_file_range(inode,
+   start, num_bytes, &need_reserve);
+
+   unlock_extent_cached(&BTRFS_I(inode)->io_tree, start,
+start + num_bytes - 1,
+&cached_state, GFP_NOFS);
+   if (ret)
+   return ret;
+   }
+
+   if (need_reserve != 0) {
+   ret = btrfs_check_data_free_space(inode, need_reserve);
+   if (ret)
+   return ret;
+   }
 
ret = btrfs_delalloc_reserve_metadata(inode, num_bytes);
if (ret) {
-   btrfs_free_reserved_data_space(inode, num_bytes);
+   btrfs_free_reserved_data_space(inode, need_reserve);
return ret;
}
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8491712..b872a04 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -953,6 +953,20 @@ static int clear_extent_uptodate(struct extent_io_tree *tree, u64 start,
cached_state, mask);
 }
 
+int set_extent_prealloc(struct extent_io_tree *tree, u64 start, u64 end,
+   gfp_t mask)
+{
+   return set_extent_bit(tree, start, end, EXTENT_PREALLOC, 0, NULL,
+			      NULL, mask);
+}
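
The body of btrfs_search_prealloc_file_range() is truncated above. As a
purely hypothetical sketch of the range check the cover text describes
-- walk the extent maps over the write range and subtract preallocated
bytes from the amount that still needs a data space reservation --
assuming the 3.1-era btrfs_get_extent() and extent map helpers:

	static int prealloc_covered_bytes(struct inode *inode, u64 start,
					  u64 len, u64 *need_reserve)
	{
		u64 cur = start;
		u64 end = start + len;

		*need_reserve = len;
		while (cur < end) {
			struct extent_map *em;
			u64 em_end;

			/* btrfs_get_extent() also returns hole extents,
			 * so it always covers 'cur' and we make progress */
			em = btrfs_get_extent(inode, NULL, 0, cur,
					      end - cur, 0);
			if (IS_ERR_OR_NULL(em))
				return em ? PTR_ERR(em) : -EIO;
			em_end = min(extent_map_end(em), end);
			if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
				*need_reserve -= em_end - cur;
			cur = em_end;
			free_extent_map(em);
		}
		return 0;
	}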

Re: [PATCH] Btrfs: fix tree corruption after multi-thread snapshots and inode cache flush

2011-09-28 Thread Yan, Zheng
On 09/29/2011 02:47 PM, Miao Xie wrote:
> On thu, 29 Sep 2011 12:25:56 +0800, Yan, Zheng wrote:
>> On 09/29/2011 10:00 AM, Liu Bo wrote:
>>> The btrfs snapshotting code requires that once a root has been
>>> snapshotted, we don't change it during a commit.
>>>
>>> But there are two cases that lead to tree corruption:
>>>
>>> 1) multi-thread snapshots can commit several snapshots in a transaction,
>>>    and this may change the src root when processing the following pending
>>>    snapshots, which corrupts the earlier snapshots;
>>>
>>> 2) the free inode cache was changing the roots when it wrote out the cache,
>>>    which leads to corruption.
>>>
>> For case 2, the free inode cache of the newly created snapshot is invalid.
>> So it's better to avoid modifying snapshotted trees.
> 
> I think this feature, that the inode cache is written out after creating a
> snapshot, was implemented on purpose. Some inode IDs are freed only after
> their tree is committed, so the newly created snapshot must cache those
> inode IDs again to guarantee that its inode cache is right, even though we
> write out the inode cache of the trees before they are snapshotted. So it
> is unnecessary to write out the inode cache before creating a snapshot.
> 

When opening the newly created snapshot, orphan cleanup will find these
freed-after-commit inodes and update the inode cache. So technically, a
rescan is not required.

> Li, am I right?
> 
> Thanks
> Miao
> 
>>
>>> This fixes things by making sure we force COW the block after we create a
>>> snapshot while committing a transaction; then any changes to the roots
>>> will result in COW, and all the fs roots and snapshot roots remain
>>> consistent.
>>>
>>> Signed-off-by: Liu Bo 
>>> Signed-off-by: Miao Xie 
>>> ---
>>>  fs/btrfs/ctree.c   |   17 -
>>>  fs/btrfs/ctree.h   |2 ++
>>>  fs/btrfs/transaction.c |8 
>>>  3 files changed, 26 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
>>> index 011cab3..49dad7d 100644
>>> --- a/fs/btrfs/ctree.c
>>> +++ b/fs/btrfs/ctree.c
>>> @@ -514,10 +514,25 @@ static inline int should_cow_block(struct btrfs_trans_handle *trans,
>>>struct btrfs_root *root,
>>>struct extent_buffer *buf)
>>>  {
>>> +   /* ensure we can see the force_cow */
>>> +   smp_rmb();
>>> +
>>> +   /*
>>> +* We do not need to cow a block if
>>> +* 1) this block is not created or changed in this transaction;
>>> +* 2) this block does not belong to TREE_RELOC tree;
>>> +* 3) the root is not forced COW.
>>> +*
>>> +* What is forced COW:
>>> +*when we create a snapshot while committing the transaction,
>>> +*after we've finished copying the src root, we must COW the shared
>>> +*block to ensure metadata consistency.
>>> +*/
>>> if (btrfs_header_generation(buf) == trans->transid &&
>>> !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
>>> !(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
>>> - btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
>>> + btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) &&
>>> +   !root->force_cow)
>>> return 0;
>>> return 1;
>>>  }
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 03912c5..bece0df 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -1225,6 +1225,8 @@ struct btrfs_root {
>>>  * for stat.  It may be used for more later
>>>  */
>>> dev_t anon_dev;
>>> +
>>> +   int force_cow;
>>>  };
>>>  
>>>  struct btrfs_ioctl_defrag_range_args {
>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>> index 7dc36fa..bf6e2b3 100644
>>> --- a/fs/btrfs/transaction.c
>>> +++ b/fs/btrfs/transaction.c
>>> @@ -816,6 +816,10 @@ static noinline int commit_fs_roots(struct btrfs_trans_handle *trans,
>>>  
>>> btrfs_save_ino_cache(root, trans);
>>>  
>>> +   /* see comments in should_cow_block() */
>>> +   root->force_cow = 0;
>>> +   smp_wmb();
>>> +
>>> if (root->commit_root != root->node) {
>>> mutex_lock(&root->fs_commit_mutex);
>>> switch_commit_root(root);
>>> @@ -976,6 +980,10 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>>> btrfs_tree_unlock(old);
>>> free_extent_buffer(old);
>>>  
>>> +   /* see comments in should_cow_block() */
>>> +   root->force_cow = 1;
>>> +   smp_wmb();
>>> +
>>> btrfs_set_root_node(new_root_item, tmp);
>>> /* record when the snapshot was created in key.offset */
>>> key.offset = trans->transid;

Re: [PATCH] Btrfs: fix tree corruption after multi-thread snapshots and inode cache flush

2011-09-28 Thread Miao Xie
On thu, 29 Sep 2011 12:25:56 +0800, Yan, Zheng wrote:
> On 09/29/2011 10:00 AM, Liu Bo wrote:
>> The btrfs snapshotting code requires that once a root has been
>> snapshotted, we don't change it during a commit.
>>
>> But there are two cases that lead to tree corruption:
>>
>> 1) multi-thread snapshots can commit several snapshots in a transaction,
>>    and this may change the src root when processing the following pending
>>    snapshots, which corrupts the earlier snapshots;
>>
>> 2) the free inode cache was changing the roots when it wrote out the cache,
>>    which leads to corruption.
>>
> For case 2, the free inode cache of the newly created snapshot is invalid.
> So it's better to avoid modifying snapshotted trees.

I think this feature, that the inode cache is written out after creating a
snapshot, was implemented on purpose. Some inode IDs are freed only after
their tree is committed, so the newly created snapshot must cache those
inode IDs again to guarantee that its inode cache is right, even though we
write out the inode cache of the trees before they are snapshotted. So it
is unnecessary to write out the inode cache before creating a snapshot.

Li, am I right?

Thanks
Miao

> 
>> This fixes things by making sure we force COW the block after we create a
>> snapshot while committing a transaction; then any changes to the roots
>> will result in COW, and all the fs roots and snapshot roots remain
>> consistent.
>>
>> Signed-off-by: Liu Bo 
>> Signed-off-by: Miao Xie 
>> ---
>>  fs/btrfs/ctree.c   |   17 -
>>  fs/btrfs/ctree.h   |2 ++
>>  fs/btrfs/transaction.c |8 
>>  3 files changed, 26 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
>> index 011cab3..49dad7d 100644
>> --- a/fs/btrfs/ctree.c
>> +++ b/fs/btrfs/ctree.c
>> @@ -514,10 +514,25 @@ static inline int should_cow_block(struct btrfs_trans_handle *trans,
>> struct btrfs_root *root,
>> struct extent_buffer *buf)
>>  {
>> +/* ensure we can see the force_cow */
>> +smp_rmb();
>> +
>> +/*
>> + * We do not need to cow a block if
>> + * 1) this block is not created or changed in this transaction;
>> + * 2) this block does not belong to TREE_RELOC tree;
>> + * 3) the root is not forced COW.
>> + *
>> + * What is forced COW:
>> + *when we create a snapshot while committing the transaction,
>> + *after we've finished copying the src root, we must COW the shared
>> + *block to ensure metadata consistency.
>> + */
>>  if (btrfs_header_generation(buf) == trans->transid &&
>>  !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
>>  !(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
>> -  btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
>> +  btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) &&
>> +!root->force_cow)
>>  return 0;
>>  return 1;
>>  }
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 03912c5..bece0df 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1225,6 +1225,8 @@ struct btrfs_root {
>>   * for stat.  It may be used for more later
>>   */
>>  dev_t anon_dev;
>> +
>> +int force_cow;
>>  };
>>  
>>  struct btrfs_ioctl_defrag_range_args {
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>> index 7dc36fa..bf6e2b3 100644
>> --- a/fs/btrfs/transaction.c
>> +++ b/fs/btrfs/transaction.c
>> @@ -816,6 +816,10 @@ static noinline int commit_fs_roots(struct btrfs_trans_handle *trans,
>>  
>>  btrfs_save_ino_cache(root, trans);
>>  
>> +/* see comments in should_cow_block() */
>> +root->force_cow = 0;
>> +smp_wmb();
>> +
>>  if (root->commit_root != root->node) {
>>  mutex_lock(&root->fs_commit_mutex);
>>  switch_commit_root(root);
>> @@ -976,6 +980,10 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>>  btrfs_tree_unlock(old);
>>  free_extent_buffer(old);
>>  
>> +/* see comments in should_cow_block() */
>> +root->force_cow = 1;
>> +smp_wmb();
>> +
>>  btrfs_set_root_node(new_root_item, tmp);
>>  /* record when the snapshot was created in key.offset */
>>  key.offset = trans->transid;



Re: [PATCH] Btrfs: fix tree corruption after multi-thread snapshots and inode cache flush

2011-09-28 Thread Yan, Zheng
On 09/29/2011 10:00 AM, Liu Bo wrote:
> The btrfs snapshotting code requires that once a root has been
> snapshotted, we don't change it during a commit.
> 
> But there are two cases that lead to tree corruption:
> 
> 1) multi-thread snapshots can commit several snapshots in a transaction,
>    and this may change the src root when processing the following pending
>    snapshots, which corrupts the earlier snapshots;
> 
> 2) the free inode cache was changing the roots when it wrote out the cache,
>    which leads to corruption.
> 
For case 2, the free inode cache of the newly created snapshot is invalid.
So it's better to avoid modifying snapshotted trees.

> This fixes things by making sure we force COW the block after we create a
> snapshot while committing a transaction; then any changes to the roots
> will result in COW, and all the fs roots and snapshot roots remain
> consistent.
> 
> Signed-off-by: Liu Bo 
> Signed-off-by: Miao Xie 
> ---
>  fs/btrfs/ctree.c   |   17 -
>  fs/btrfs/ctree.h   |2 ++
>  fs/btrfs/transaction.c |8 
>  3 files changed, 26 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 011cab3..49dad7d 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -514,10 +514,25 @@ static inline int should_cow_block(struct btrfs_trans_handle *trans,
>  struct btrfs_root *root,
>  struct extent_buffer *buf)
>  {
> + /* ensure we can see the force_cow */
> + smp_rmb();
> +
> + /*
> +  * We do not need to cow a block if
> +  * 1) this block is not created or changed in this transaction;
> +  * 2) this block does not belong to TREE_RELOC tree;
> +  * 3) the root is not forced COW.
> +  *
> +  * What is forced COW:
> +  *when we create a snapshot while committing the transaction,
> +  *after we've finished copying the src root, we must COW the shared
> +  *block to ensure metadata consistency.
> +  */
>   if (btrfs_header_generation(buf) == trans->transid &&
>   !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
>   !(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
> -   btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
> +   btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) &&
> + !root->force_cow)
>   return 0;
>   return 1;
>  }
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 03912c5..bece0df 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1225,6 +1225,8 @@ struct btrfs_root {
>* for stat.  It may be used for more later
>*/
>   dev_t anon_dev;
> +
> + int force_cow;
>  };
>  
>  struct btrfs_ioctl_defrag_range_args {
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 7dc36fa..bf6e2b3 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -816,6 +816,10 @@ static noinline int commit_fs_roots(struct btrfs_trans_handle *trans,
>  
>   btrfs_save_ino_cache(root, trans);
>  
> + /* see comments in should_cow_block() */
> + root->force_cow = 0;
> + smp_wmb();
> +
>   if (root->commit_root != root->node) {
>   mutex_lock(&root->fs_commit_mutex);
>   switch_commit_root(root);
> @@ -976,6 +980,10 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>   btrfs_tree_unlock(old);
>   free_extent_buffer(old);
>  
> + /* see comments in should_cow_block() */
> + root->force_cow = 1;
> + smp_wmb();
> +
>   btrfs_set_root_node(new_root_item, tmp);
>   /* record when the snapshot was created in key.offset */
>   key.offset = trans->transid;



[PATCH] Btrfs: fix tree corruption after multi-thread snapshots and inode cache flush

2011-09-28 Thread Liu Bo
The btrfs snapshotting code requires that once a root has been
snapshotted, we don't change it during a commit.

But there are two cases that lead to tree corruption:

1) multi-thread snapshots can commit several snapshots in a transaction,
   and this may change the src root when processing the following pending
   snapshots, which corrupts the earlier snapshots;

2) the free inode cache was changing the roots when it wrote out the cache,
   which leads to corruption.

This fixes things by making sure we force COW the block after we create a
snapshot while committing a transaction; then any changes to the roots
will result in COW, and all the fs roots and snapshot roots remain
consistent.

Signed-off-by: Liu Bo 
Signed-off-by: Miao Xie 
---
 fs/btrfs/ctree.c   |   17 -
 fs/btrfs/ctree.h   |2 ++
 fs/btrfs/transaction.c |8 
 3 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 011cab3..49dad7d 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -514,10 +514,25 @@ static inline int should_cow_block(struct btrfs_trans_handle *trans,
   struct btrfs_root *root,
   struct extent_buffer *buf)
 {
+   /* ensure we can see the force_cow */
+   smp_rmb();
+
+   /*
+* We do not need to cow a block if
+* 1) this block is not created or changed in this transaction;
+* 2) this block does not belong to TREE_RELOC tree;
+* 3) the root is not forced COW.
+*
+* What is forced COW:
+*when we create a snapshot while committing the transaction,
+*after we've finished copying the src root, we must COW the shared
+*block to ensure metadata consistency.
+*/
if (btrfs_header_generation(buf) == trans->transid &&
!btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
!(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
- btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
+ btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) &&
+   !root->force_cow)
return 0;
return 1;
 }
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 03912c5..bece0df 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1225,6 +1225,8 @@ struct btrfs_root {
 * for stat.  It may be used for more later
 */
dev_t anon_dev;
+
+   int force_cow;
 };
 
 struct btrfs_ioctl_defrag_range_args {
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 7dc36fa..bf6e2b3 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -816,6 +816,10 @@ static noinline int commit_fs_roots(struct btrfs_trans_handle *trans,
 
btrfs_save_ino_cache(root, trans);
 
+   /* see comments in should_cow_block() */
+   root->force_cow = 0;
+   smp_wmb();
+
if (root->commit_root != root->node) {
mutex_lock(&root->fs_commit_mutex);
switch_commit_root(root);
@@ -976,6 +980,10 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
btrfs_tree_unlock(old);
free_extent_buffer(old);
 
+   /* see comments in should_cow_block() */
+   root->force_cow = 1;
+   smp_wmb();
+
btrfs_set_root_node(new_root_item, tmp);
/* record when the snapshot was created in key.offset */
key.offset = trans->transid;
-- 
1.6.5.2
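
The force_cow publication above relies on the classic smp_wmb()/smp_rmb()
pairing. A minimal standalone illustration of that pattern (illustrative
only, not btrfs code; ACCESS_ONCE()/locking elided for brevity):

	int data, ready;

	void writer(void)
	{
		data = 42;
		smp_wmb();	/* order the data store before the flag store */
		ready = 1;
	}

	int reader(void)
	{
		if (ready) {
			smp_rmb();	/* pairs with the writer's smp_wmb() */
			return data;	/* guaranteed to see 42 */
		}
		return -1;	/* flag not seen; data may not be visible yet */
	}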



Re: [PATCH] Btrfs: fix missing clear_extent_bit

2011-09-28 Thread Liu Bo
On 09/28/2011 09:44 PM, Chris Mason wrote:
> Excerpts from Josef Bacik's message of 2011-09-28 08:34:03 -0400:
>> On 09/28/2011 06:00 AM, Liu Bo wrote:
>>> We forget to clear the inode's dirty_bytes and EXTENT_DIRTY at the end of a write.
>>>
>> We don't set EXTENT_DIRTY unless we failed to read a block and that's to
>> keep track of the area we are re-reading, unless I'm missing something?
>>  Thanks,
> 
> Josef and I have been talking about this one on IRC.
> We do set EXTENT_DIRTY during set_extent_delalloc, but as far as I can
> tell we no longer need to.  Can you please experiment with just not
> setting the dirty bit during delalloc instead?
> 

Sure.  So this EXTENT_DIRTY is only for METADATA use.

thanks,
liubo

> -chris



Re: File compression control, again.

2011-09-28 Thread David Sterba
On Wed, Sep 28, 2011 at 02:50:13PM +, Artem wrote:
> Li Zefan  cn.fujitsu.com> writes:
> > See this "Per file/directory controls for COW and compression":
> > 
> > http://marc.info/?l=linux-btrfs&m=130078867208491&w=2
> 
> Thanks again!
> I wrote a program to see if ioctl compression control works
> ( https://gist.github.com/1248085 )

You have to call FS_IOC_GETFLAGS first and then do

	flags |= FS_NOCOMP_FL;

instead of what your program does now:

	int flags = compression ? FS_COMPR_FL : FS_NOCOMP_FL;
	int ret = ioctl (fd, FS_IOC_SETFLAGS, &flags);

otherwise all other flags will be dropped.
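
Spelled out as a read-modify-write helper (a minimal sketch; the flag
names are from <linux/fs.h>, the function name is illustrative):

	#include <sys/ioctl.h>
	#include <linux/fs.h>

	static int set_compression(int fd, int compression)
	{
		int flags;

		/* fetch the current flags so the other bits survive */
		if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
			return -1;
		if (compression) {
			flags |= FS_COMPR_FL;
			flags &= ~FS_NOCOMP_FL;
		} else {
			flags |= FS_NOCOMP_FL;
			flags &= ~FS_COMPR_FL;
		}
		return ioctl(fd, FS_IOC_SETFLAGS, &flags);
	}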


david



Re: [patch 1/4 v2] mm: exclude reserved pages from dirtyable memory

2011-09-28 Thread Minchan Kim
On Wed, Sep 28, 2011 at 09:50:54AM +0200, Johannes Weiner wrote:
> On Wed, Sep 28, 2011 at 01:55:51PM +0900, Minchan Kim wrote:
> > Hi Hannes,
> > 
> > On Fri, Sep 23, 2011 at 04:38:17PM +0200, Johannes Weiner wrote:
> > > The amount of dirtyable pages should not include the full number of
> > > free pages: there is a number of reserved pages that the page
> > > allocator and kswapd always try to keep free.
> > > 
> > > The closer (reclaimable pages - dirty pages) is to the number of
> > > reserved pages, the more likely it becomes for reclaim to run into
> > > dirty pages:
> > > 
> > >    +----------+ ---
> > >    |   anon   |  |
> > >    +----------+  |
> > >    |          |  |
> > >    |          |  -- dirty limit new    -- flusher new
> > >    |   file   |  |                     |
> > >    |          |  |                     |
> > >    |          |  -- dirty limit old    -- flusher old
> > >    |          |                        |
> > >    +----------+                       --- reclaim
> > >    | reserved |
> > >    +----------+
> > >    |  kernel  |
> > >    +----------+
> > > 
> > > This patch introduces a per-zone dirty reserve that takes both the
> > > lowmem reserve as well as the high watermark of the zone into account,
> > > and a global sum of those per-zone values that is subtracted from the
> > > global amount of dirtyable pages.  The lowmem reserve is unavailable
> > > to page cache allocations and kswapd tries to keep the high watermark
> > > free.  We don't want to end up in a situation where reclaim has to
> > > clean pages in order to balance zones.
> > > 
> > > Not treating reserved pages as dirtyable on a global level is only a
> > > conceptual fix.  In reality, dirty pages are not distributed equally
> > > across zones and reclaim runs into dirty pages on a regular basis.
> > > 
> > > But it is important to get this right before tackling the problem on a
> > > per-zone level, where the distance between reclaim and the dirty pages
> > > is mostly much smaller in absolute numbers.
> > > 
> > > Signed-off-by: Johannes Weiner 
> > > ---
> > >  include/linux/mmzone.h |6 ++
> > >  include/linux/swap.h   |1 +
> > >  mm/page-writeback.c|6 --
> > >  mm/page_alloc.c|   19 +++
> > >  4 files changed, 30 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 1ed4116..37a61e7 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -317,6 +317,12 @@ struct zone {
> > >*/
> > >   unsigned long   lowmem_reserve[MAX_NR_ZONES];
> > >  
> > > + /*
> > > +  * This is a per-zone reserve of pages that should not be
> > > +  * considered dirtyable memory.
> > > +  */
> > > + unsigned long   dirty_balance_reserve;
> > > +
> > >  #ifdef CONFIG_NUMA
> > >   int node;
> > >   /*
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index b156e80..9021453 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -209,6 +209,7 @@ struct swap_list_t {
> > >  /* linux/mm/page_alloc.c */
> > >  extern unsigned long totalram_pages;
> > >  extern unsigned long totalreserve_pages;
> > > +extern unsigned long dirty_balance_reserve;
> > >  extern unsigned int nr_free_buffer_pages(void);
> > >  extern unsigned int nr_free_pagecache_pages(void);
> > >  
> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > index da6d263..c8acf8a 100644
> > > --- a/mm/page-writeback.c
> > > +++ b/mm/page-writeback.c
> > > @@ -170,7 +170,8 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
> > >   &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
> > >  
> > >   x += zone_page_state(z, NR_FREE_PAGES) +
> > > -  zone_reclaimable_pages(z);
> > > +  zone_reclaimable_pages(z) -
> > > +  zone->dirty_balance_reserve;
> > >   }
> > >   /*
> > >* Make sure that the number of highmem pages is never larger
> > > @@ -194,7 +195,8 @@ static unsigned long determine_dirtyable_memory(void)
> > >  {
> > >   unsigned long x;
> > >  
> > > - x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> > > + x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages() -
> > > + dirty_balance_reserve;
> > >  
> > >   if (!vm_highmem_is_dirtyable)
> > >   x -= highmem_dirtyable_memory(x);
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 1dba05e..f8cba89 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -96,6 +96,14 @@ EXPORT_SYMBOL(node_states);
> > >  
> > >  unsigned long totalram_pages __read_mostly;
> > >  unsigned long totalreserve_pages __read_mostly;
> > > +/*
> > > + * When calculating the number of globally allowed dirty pages, there
> > > + * is a certain number of per-zone reserves that should not be
> > > + * considered dirtyable memory.  This is the sum of those reserves
> > > + * over all existing zones that contribute dirtyable memory.
> > > + */
> > > +unsigned long dirty_balance_reserve __read_mostly;

Re: [patch 2/2/4] mm: try to distribute dirty pages fairly across zones

2011-09-28 Thread Minchan Kim
On Wed, Sep 28, 2011 at 09:11:54AM +0200, Johannes Weiner wrote:
> On Wed, Sep 28, 2011 at 02:56:40PM +0900, Minchan Kim wrote:
> > On Fri, Sep 23, 2011 at 04:42:48PM +0200, Johannes Weiner wrote:
> > > The maximum number of dirty pages that exist in the system at any time
> > > is determined by a number of pages considered dirtyable and a
> > > user-configured percentage of those, or an absolute number in bytes.
> > 
> > It's an explanation of the old approach.
> 
> What do you mean?  This does not change with this patch.  We still
> have a number of dirtyable pages and a limit that is applied
> relatively to this number.
> 
> > > This number of dirtyable pages is the sum of memory provided by all
> > > the zones in the system minus their lowmem reserves and high
> > > watermarks, so that the system can retain a healthy number of free
> > > pages without having to reclaim dirty pages.
> > 
> > It's an explanation of the new approach.
> 
> Same here, this aspect is also not changed with this patch!
> 
> > > But there is a flaw in that we have a zoned page allocator which does
> > > not care about the global state but rather the state of individual
> > > memory zones.  And right now there is nothing that prevents one zone
> > > from filling up with dirty pages while other zones are spared, which
> > > frequently leads to situations where kswapd, in order to restore the
> > > watermark of free pages, does indeed have to write pages from that
> > > zone's LRU list.  This can interfere so badly with IO from the flusher
> > > threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
> > > requests from reclaim already, taking away the VM's only possibility
> > > to keep such a zone balanced, aside from hoping the flushers will soon
> > > clean pages from that zone.
> > 
> > It's an explanation of the old approach, again!
> > Shouldn't we move the above phrase about the new approach below?
> 
> Everything above describes the current behaviour (at the point of this
> patch, so respecting lowmem_reserve e.g. is part of the current
> behaviour by now) and its problems.  And below follows a description
> of how the patch tries to fix it.

It seems that it's not a good choice to use "old" and "new" terms.
Hannes, please ignore, it's not a biggie.

-- 
Kind regards,
Minchan Kim


Re: File compression control, again.

2011-09-28 Thread Artem
Li Zefan  cn.fujitsu.com> writes:
> See this "Per file/directory controls for COW and compression":
> 
>   http://marc.info/?l=linux-btrfs&m=130078867208491&w=2

Thanks again!
I wrote a program to see if ioctl compression control works
( https://gist.github.com/1248085 )
and it does!  : )



Re: [PATCH] Btrfs: only inherit btrfs specific flags when creating files

2011-09-28 Thread Christoph Hellwig
On Wed, Sep 28, 2011 at 08:26:09AM -0400, Josef Bacik wrote:
> > 
> > It shows EXT[3,4]_APPEND_FL should be inherited from their parent, is this 
> > the standard?
> > 
> 
> I have no idea actually, it was just failing on xfstest 79 and when I
> took out the inheritance thing it passed so I took the test to be the
> standard, maybe we should open this up to a wider audience.  Thanks,

We had a little discussion on this when Stefan Behrens made this
test generic, and the conclusion was that the other filesystems should
adopt the xfs behaviour.



Re: [PATCH] Btrfs: fix missing clear_extent_bit

2011-09-28 Thread Chris Mason
Excerpts from Josef Bacik's message of 2011-09-28 08:34:03 -0400:
> On 09/28/2011 06:00 AM, Liu Bo wrote:
> > We forget to clear the inode's dirty_bytes and EXTENT_DIRTY at the end of a write.
> > 
> 
> We don't set EXTENT_DIRTY unless we failed to read a block and that's to
> keep track of the area we are re-reading, unless I'm missing something?
>  Thanks,

Josef and I have been talking about this one on IRC.
We do set EXTENT_DIRTY during set_extent_delalloc, but as far as I can
tell we no longer need to.  Can you please experiment with just not
setting the dirty bit during delalloc instead?

-chris
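
A minimal sketch of that experiment, assuming the 3.1-era
set_extent_delalloc() in fs/btrfs/extent_io.c (a suggestion to test,
not a merged change):

	int set_extent_delalloc(struct extent_io_tree *tree, u64 start, u64 end,
				struct extent_state **cached_state, gfp_t mask)
	{
		/* was: EXTENT_DELALLOC | EXTENT_DIRTY | EXTENT_UPTODATE */
		return set_extent_bit(tree, start, end,
				      EXTENT_DELALLOC | EXTENT_UPTODATE,
				      0, NULL, cached_state, mask);
	}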


BTRFS data structures integrity

2011-09-28 Thread Kasatkin, Dmitry
Hello,

I have a question about BTRFS data structure integrity.

On an ext3 file system I was able to modify the inode block mapping
offline in such a way that two inodes pointed to the same data blocks,
so modifying one file affected the other file. FSCK detects such
problems and creates duplicated blocks, so that the inode contents will
not overlap.

Does ext4 suffer from the same problem?

Can anyone please tell me whether BTRFS is resistant to such attacks, so
that running fsck is not needed?

Thanks,
Dmitry


Re: File compression control, again.

2011-09-28 Thread Artem
Li Zefan  cn.fujitsu.com> writes:
> See this "Per file/directory controls for COW and compression":
> 
>   http://marc.info/?l=linux-btrfs&m=130078867208491&w=2
> 
> And the user tool patch (which got no reply):
> 
>   http://marc.info/?l=linux-btrfs&m=130311215721242&w=2
> 
> So you can create a directory, and set the no-compress flag for it, and
> then any file created in that dir will inherit the flag.

Thanks, Li, but how do I set the no-compress flag?
The patched chattr you mention can only set the FS_COMPR_FL.
The 'C' argument is now used for FS_NOCOW_FL.

Could we use another flag for copy-on-write control in chattr?



Re: [GIT PULL] scrub updates for 3.2

2011-09-28 Thread Arne Jansen
On 28.09.2011 15:17, Arne Jansen wrote:
> Hi Chris,
> 
> I rebased my readahead-patches for scrub to your current
> integration-test branch (83f4e90fd11) and pushed it to:
> 
> g...@github.com:sensille/linux.git for-chris

git://github.com/sensille/linux.git for-chris

of course...

> 
> It just contains the readahead patch, which gives a significant
> performance improvement for scrub. Currently scrub is the only
> consumer.
> 
> Thanks,
> Arne
> 
> Arne Jansen (7):
>   btrfs: add an extra wait mode to read_extent_buffer_pages
>   btrfs: add READAHEAD extent buffer flag
>   btrfs: state information for readahead
>   btrfs: initial readahead code and prototypes
>   btrfs: hooks for readahead
>   btrfs: test ioctl for readahead
>   btrfs: use readahead API for scrub
> 
>  fs/btrfs/Makefile|3 +-
>  fs/btrfs/ctree.h |   21 ++
>  fs/btrfs/disk-io.c   |   85 +-
>  fs/btrfs/disk-io.h   |2 +
>  fs/btrfs/extent_io.c |9 +-
>  fs/btrfs/extent_io.h |4 +
>  fs/btrfs/ioctl.c |   93 +-
>  fs/btrfs/ioctl.h |   16 +
>  fs/btrfs/reada.c |  949 ++
>  fs/btrfs/scrub.c |  116 +++
>  fs/btrfs/volumes.c   |8 +
>  fs/btrfs/volumes.h   |8 +
>  12 files changed, 1239 insertions(+), 75 deletions(-)
>  create mode 100644 fs/btrfs/reada.c



[GIT PULL] scrub updates for 3.2

2011-09-28 Thread Arne Jansen
Hi Chris,

I rebased my readahead-patches for scrub to your current
integration-test branch (83f4e90fd11) and pushed it to:

g...@github.com:sensille/linux.git for-chris

It just contains the readahead patch, which gives a significant
performance improvement for scrub. Currently scrub is the only
consumer.

Thanks,
Arne

Arne Jansen (7):
  btrfs: add an extra wait mode to read_extent_buffer_pages
  btrfs: add READAHEAD extent buffer flag
  btrfs: state information for readahead
  btrfs: initial readahead code and prototypes
  btrfs: hooks for readahead
  btrfs: test ioctl for readahead
  btrfs: use readahead API for scrub

 fs/btrfs/Makefile|3 +-
 fs/btrfs/ctree.h |   21 ++
 fs/btrfs/disk-io.c   |   85 +-
 fs/btrfs/disk-io.h   |2 +
 fs/btrfs/extent_io.c |9 +-
 fs/btrfs/extent_io.h |4 +
 fs/btrfs/ioctl.c |   93 +-
 fs/btrfs/ioctl.h |   16 +
 fs/btrfs/reada.c |  949 ++
 fs/btrfs/scrub.c |  116 +++
 fs/btrfs/volumes.c   |8 +
 fs/btrfs/volumes.h   |8 +
 12 files changed, 1239 insertions(+), 75 deletions(-)
 create mode 100644 fs/btrfs/reada.c


Re: [PATCH] Btrfs: fix missing clear_extent_bit

2011-09-28 Thread Josef Bacik
On 09/28/2011 06:00 AM, Liu Bo wrote:
> We forget to clear the inode's dirty_bytes and EXTENT_DIRTY at the end of a write.
> 

We don't set EXTENT_DIRTY unless we failed to read a block and that's to
keep track of the area we are re-reading, unless I'm missing something?
 Thanks,

Josef


Re: [PATCH] Btrfs: only inherit btrfs specific flags when creating files

2011-09-28 Thread Josef Bacik
On 09/27/2011 08:59 PM, Liu Bo wrote:
> On 09/27/2011 11:02 PM, Josef Bacik wrote:
>> Xfstests 79 was failing because we were inheriting the S_APPEND flag when we
>> weren't supposed to.  There isn't any specific documentation on this so I'm
>> taking the test as the standard of how things work, and having S_APPEND set 
>> on a
>> directory doesn't mean that S_APPEND gets inherited by its children 
>> according to
>> this test.  So only inherit btrfs specific things.  This will let us set
>> compress/nocompress on specific directories and everything in the directories
>> will inherit this flag, same with nodatacow.  With this patch test 79 passes.
>> Thanks,
>>
> 
> I've checked ext3&4, they have such comments:
> 
> /* Flags that should be inherited by new inodes from their parent. */
> #define EXT3_FL_INHERITED (EXT3_SECRM_FL | EXT3_UNRM_FL | EXT3_COMPR_FL |\
>			   EXT3_SYNC_FL | EXT3_IMMUTABLE_FL | EXT3_APPEND_FL |\
>			   EXT3_NODUMP_FL | EXT3_NOATIME_FL | EXT3_COMPRBLK_FL |\
>			   EXT3_NOCOMPR_FL | EXT3_JOURNAL_DATA_FL |\
>			   EXT3_NOTAIL_FL | EXT3_DIRSYNC_FL)
> 
> It shows EXT[3,4]_APPEND_FL should be inherited from their parent, is this 
> the standard?
> 

I have no idea actually, it was just failing on xfstest 79 and when I
took out the inheritance thing it passed so I took the test to be the
standard, maybe we should open this up to a wider audience.  Thanks,

Josef


Re: [GIT PULL] ENOSPC rework and random fixes for next merge window

2011-09-28 Thread Chris Mason
Excerpts from Josef Bacik's message of 2011-09-26 17:36:32 -0400:
> Hello,
> 
> Chris can you pull from
> 
> git://github.com/josefbacik/linux.git for-chris

I've pulled this into a new integration-test branch where I'm starting
to pile things in for the next merge window.  Thanks Josef!

-chris


[PATCH] Btrfs: fix missing clear_extent_bit

2011-09-28 Thread Liu Bo
We forget to clear the inode's dirty_bytes and EXTENT_DIRTY at the end of a write.

Signed-off-by: Liu Bo 
---
 fs/btrfs/file.c  |1 -
 fs/btrfs/inode.c |5 -
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e7872e4..3f3b4a8 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1150,7 +1150,6 @@ fail:
faili--;
}
return err;
-
 }
 
 static noinline ssize_t __btrfs_buffered_write(struct file *file,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0ccc743..d42bea4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -882,7 +882,7 @@ static noinline int cow_file_range(struct inode *inode,
 */
op = unlock ? EXTENT_CLEAR_UNLOCK_PAGE : 0;
op |= EXTENT_CLEAR_UNLOCK | EXTENT_CLEAR_DELALLOC |
-   EXTENT_SET_PRIVATE2;
+ EXTENT_SET_PRIVATE2;
 
extent_clear_unlock_delalloc(inode, &BTRFS_I(inode)->io_tree,
 start, start + ram_size - 1,
@@ -1778,6 +1778,9 @@ static int btrfs_finish_ordered_io(struct inode *inode, u64 start, u64 end)
   ordered_extent->len);
BUG_ON(ret);
}
+   clear_extent_bit(io_tree, ordered_extent->file_offset,
+ordered_extent->file_offset + ordered_extent->len - 1,
+EXTENT_DIRTY, 0, 0, &cached_state, GFP_NOFS);
unlock_extent_cached(io_tree, ordered_extent->file_offset,
 ordered_extent->file_offset +
 ordered_extent->len - 1, &cached_state, GFP_NOFS);
-- 
1.6.5.2



Re: [patch 2/2/4] mm: try to distribute dirty pages fairly across zones

2011-09-28 Thread Mel Gorman
On Fri, Sep 23, 2011 at 04:42:48PM +0200, Johannes Weiner wrote:
> The maximum number of dirty pages that exist in the system at any time
> is determined by a number of pages considered dirtyable and a
> user-configured percentage of those, or an absolute number in bytes.
> 
> This number of dirtyable pages is the sum of memory provided by all
> the zones in the system minus their lowmem reserves and high
> watermarks, so that the system can retain a healthy number of free
> pages without having to reclaim dirty pages.
> 
> But there is a flaw in that we have a zoned page allocator which does
> not care about the global state but rather the state of individual
> memory zones.  And right now there is nothing that prevents one zone
> from filling up with dirty pages while other zones are spared, which
> frequently leads to situations where kswapd, in order to restore the
> watermark of free pages, does indeed have to write pages from that
> zone's LRU list.  This can interfere so badly with IO from the flusher
> threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
> requests from reclaim already, taking away the VM's only possibility
> to keep such a zone balanced, aside from hoping the flushers will soon
> clean pages from that zone.
> 
> Enter per-zone dirty limits.  They are to a zone's dirtyable memory
> what the global limit is to the global amount of dirtyable memory, and
> try to make sure that no single zone receives more than its fair share
> of the globally allowed dirty pages in the first place.  As the number
> of pages considered dirtyable exclude the zones' lowmem reserves and
> high watermarks, the maximum number of dirty pages in a zone is such
> that the zone can always be balanced without requiring page cleaning.
> 
> As this is a placement decision in the page allocator and pages are
> dirtied only after the allocation, this patch allows allocators to
> pass __GFP_WRITE when they know in advance that the page will be
> written to and become dirty soon.  The page allocator will then
> attempt to allocate from the first zone of the zonelist - which on
> NUMA is determined by the task's NUMA memory policy - that has not
> exceeded its dirty limit.
> 
> At first glance, it would appear that the diversion to lower zones can
> increase pressure on them, but this is not the case.  With a full high
> zone, allocations will be diverted to lower zones eventually, so it is
> more of a shift in timing of the lower zone allocations.  Workloads
> that previously could fit their dirty pages completely in the higher
> zone may be forced to allocate from lower zones, but the amount of
> pages that 'spill over' are limited themselves by the lower zones'
> dirty constraints, and thus unlikely to become a problem.
> 
> For now, the problem of unfair dirty page distribution remains for
> NUMA configurations where the zones allowed for allocation are in sum
> not big enough to trigger the global dirty limits, wake up the flusher
> threads and remedy the situation.  Because of this, an allocation that
> could not succeed on any of the considered zones is allowed to ignore
> the dirty limits before going into direct reclaim or even failing the
> allocation, until a future patch changes the global dirty throttling
> and flusher thread activation so that they take individual zone states
> into account.
> 
> Signed-off-by: Johannes Weiner 

Acked-by: Mel Gorman 

-- 
Mel Gorman
SUSE Labs
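
A hedged sketch of how an allocation site opts into the per-zone dirty
limits; the exact call site (e.g. grab_cache_page_write_begin()) is an
assumption, not stated in this mail:

	/* The caller knows this page cache page will be dirtied right
	 * away, so __GFP_WRITE lets the allocator skip zones that are
	 * already over their dirty limit. */
	struct page *grab_page_for_write(struct address_space *mapping,
					 pgoff_t index)
	{
		return find_or_create_page(mapping, index,
					   mapping_gfp_mask(mapping) |
					   __GFP_WRITE);
	}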


Re: [patch 1/2/4] mm: writeback: cleanups in preparation for per-zone dirty limits

2011-09-28 Thread Mel Gorman
On Fri, Sep 23, 2011 at 04:41:07PM +0200, Johannes Weiner wrote:
> On Thu, Sep 22, 2011 at 10:52:42AM +0200, Johannes Weiner wrote:
> > On Wed, Sep 21, 2011 at 04:02:26PM -0700, Andrew Morton wrote:
> > > Should we rename determine_dirtyable_memory() to
> > > global_dirtyable_memory(), to get some sense of its relationship with
> > > zone_dirtyable_memory()?
> > 
> > Sounds good.
> 
> ---
> 
> The next patch will introduce per-zone dirty limiting functions in
> addition to the traditional global dirty limiting.
> 
> Rename determine_dirtyable_memory() to global_dirtyable_memory()
> before adding the zone-specific version, and fix up its documentation.
> 
> Also, move the functions to determine the dirtyable memory and the
> function to calculate the dirty limit based on that together so that
> their relationship is more apparent and that they can be commented on
> as a group.
> 
> Signed-off-by: Johannes Weiner 

Acked-by: Mel Gorman 

-- 
Mel Gorman
SUSE Labs


Re: [patch 1/4 v2] mm: exclude reserved pages from dirtyable memory

2011-09-28 Thread Johannes Weiner
On Wed, Sep 28, 2011 at 01:55:51PM +0900, Minchan Kim wrote:
> Hi Hannes,
> 
> On Fri, Sep 23, 2011 at 04:38:17PM +0200, Johannes Weiner wrote:
> > The amount of dirtyable pages should not include the full number of
> > free pages: there is a number of reserved pages that the page
> > allocator and kswapd always try to keep free.
> > 
> > The closer (reclaimable pages - dirty pages) is to the number of
> > reserved pages, the more likely it becomes for reclaim to run into
> > dirty pages:
> > 
> >    +----------+ ---
> >    |   anon   |  |
> >    +----------+  |
> >    |          |  |
> >    |          |  -- dirty limit new    -- flusher new
> >    |   file   |  |                     |
> >    |          |  |                     |
> >    |          |  -- dirty limit old    -- flusher old
> >    |          |                        |
> >    +----------+                       --- reclaim
> >    | reserved |
> >    +----------+
> >    |  kernel  |
> >    +----------+
> > 
> > This patch introduces a per-zone dirty reserve that takes both the
> > lowmem reserve as well as the high watermark of the zone into account,
> > and a global sum of those per-zone values that is subtracted from the
> > global amount of dirtyable pages.  The lowmem reserve is unavailable
> > to page cache allocations and kswapd tries to keep the high watermark
> > free.  We don't want to end up in a situation where reclaim has to
> > clean pages in order to balance zones.
> > 
> > Not treating reserved pages as dirtyable on a global level is only a
> > conceptual fix.  In reality, dirty pages are not distributed equally
> > across zones and reclaim runs into dirty pages on a regular basis.
> > 
> > But it is important to get this right before tackling the problem on a
> > per-zone level, where the distance between reclaim and the dirty pages
> > is mostly much smaller in absolute numbers.
> > 
> > Signed-off-by: Johannes Weiner 
> > ---
> >  include/linux/mmzone.h |6 ++
> >  include/linux/swap.h   |1 +
> >  mm/page-writeback.c|6 --
> >  mm/page_alloc.c|   19 +++
> >  4 files changed, 30 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 1ed4116..37a61e7 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -317,6 +317,12 @@ struct zone {
> >  */
> > unsigned long   lowmem_reserve[MAX_NR_ZONES];
> >  
> > +   /*
> > +* This is a per-zone reserve of pages that should not be
> > +* considered dirtyable memory.
> > +*/
> > +   unsigned long   dirty_balance_reserve;
> > +
> >  #ifdef CONFIG_NUMA
> > int node;
> > /*
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index b156e80..9021453 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -209,6 +209,7 @@ struct swap_list_t {
> >  /* linux/mm/page_alloc.c */
> >  extern unsigned long totalram_pages;
> >  extern unsigned long totalreserve_pages;
> > +extern unsigned long dirty_balance_reserve;
> >  extern unsigned int nr_free_buffer_pages(void);
> >  extern unsigned int nr_free_pagecache_pages(void);
> >  
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index da6d263..c8acf8a 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -170,7 +170,8 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
> > &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
> >  
> > x += zone_page_state(z, NR_FREE_PAGES) +
> > -zone_reclaimable_pages(z);
> > +zone_reclaimable_pages(z) -
> > +zone->dirty_balance_reserve;
> > }
> > /*
> >  * Make sure that the number of highmem pages is never larger
> > @@ -194,7 +195,8 @@ static unsigned long determine_dirtyable_memory(void)
> >  {
> > unsigned long x;
> >  
> > -   x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> > +   x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages() -
> > +   dirty_balance_reserve;
> >  
> > if (!vm_highmem_is_dirtyable)
> > x -= highmem_dirtyable_memory(x);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 1dba05e..f8cba89 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -96,6 +96,14 @@ EXPORT_SYMBOL(node_states);
> >  
> >  unsigned long totalram_pages __read_mostly;
> >  unsigned long totalreserve_pages __read_mostly;
> > +/*
> > + * When calculating the number of globally allowed dirty pages, there
> > + * is a certain number of per-zone reserves that should not be
> > + * considered dirtyable memory.  This is the sum of those reserves
> > + * over all existing zones that contribute dirtyable memory.
> > + */
> > +unsigned long dirty_balance_reserve __read_mostly;
> > +
> >  int percpu_pagelist_fraction;
> > gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;

Re: [patch 2/2/4] mm: try to distribute dirty pages fairly across zones

2011-09-28 Thread Johannes Weiner
On Wed, Sep 28, 2011 at 02:56:40PM +0900, Minchan Kim wrote:
> On Fri, Sep 23, 2011 at 04:42:48PM +0200, Johannes Weiner wrote:
> > The maximum number of dirty pages that exist in the system at any time
> > is determined by a number of pages considered dirtyable and a
> > user-configured percentage of those, or an absolute number in bytes.
> 
> It's an explanation of the old approach.

What do you mean?  This does not change with this patch.  We still
have a number of dirtyable pages and a limit that is applied
relatively to this number.

> > This number of dirtyable pages is the sum of memory provided by all
> > the zones in the system minus their lowmem reserves and high
> > watermarks, so that the system can retain a healthy number of free
> > pages without having to reclaim dirty pages.
> 
> It's an explanation of the new approach.

Same here, this aspect is also not changed with this patch!

> > But there is a flaw in that we have a zoned page allocator which does
> > not care about the global state but rather the state of individual
> > memory zones.  And right now there is nothing that prevents one zone
> > from filling up with dirty pages while other zones are spared, which
> > frequently leads to situations where kswapd, in order to restore the
> > watermark of free pages, does indeed have to write pages from that
> > zone's LRU list.  This can interfere so badly with IO from the flusher
> > threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
> > requests from reclaim already, taking away the VM's only possibility
> > to keep such a zone balanced, aside from hoping the flushers will soon
> > clean pages from that zone.
> 
> It's an explanation of the old approach, again!
> Shouldn't we move the above phrase about the new approach below?

Everything above describes the current behaviour (at the point of this
patch, so respecting lowmem_reserve e.g. is part of the current
behaviour by now) and its problems.  And below follows a description
of how the patch tries to fix it.

> > Enter per-zone dirty limits.  They are to a zone's dirtyable memory
> > what the global limit is to the global amount of dirtyable memory, and
> > try to make sure that no single zone receives more than its fair share
> > of the globally allowed dirty pages in the first place.  As the number
> > of pages considered dirtyable exclude the zones' lowmem reserves and
> > high watermarks, the maximum number of dirty pages in a zone is such
> > that the zone can always be balanced without requiring page cleaning.
> > 
> > As this is a placement decision in the page allocator and pages are
> > dirtied only after the allocation, this patch allows allocators to
> > pass __GFP_WRITE when they know in advance that the page will be
> > written to and become dirty soon.  The page allocator will then
> > attempt to allocate from the first zone of the zonelist - which on
> > NUMA is determined by the task's NUMA memory policy - that has not
> > exceeded its dirty limit.
> > 
> > At first glance, it would appear that the diversion to lower zones can
> > increase pressure on them, but this is not the case.  With a full high
> > zone, allocations will be diverted to lower zones eventually, so it is
> > more of a shift in timing of the lower zone allocations.  Workloads
> > that previously could fit their dirty pages completely in the higher
> > zone may be forced to allocate from lower zones, but the amount of
> > pages that 'spill over' are limited themselves by the lower zones'
> > dirty constraints, and thus unlikely to become a problem.
> 
> That's a good justification.
> 
> > For now, the problem of unfair dirty page distribution remains for
> > NUMA configurations where the zones allowed for allocation are in sum
> > not big enough to trigger the global dirty limits, wake up the flusher
> > threads and remedy the situation.  Because of this, an allocation that
> > could not succeed on any of the considered zones is allowed to ignore
> > the dirty limits before going into direct reclaim or even failing the
> > allocation, until a future patch changes the global dirty throttling
> > and flusher thread activation so that they take individual zone states
> > into account.
> > 
> > Signed-off-by: Johannes Weiner 
> 
> Otherwise, looks good to me.
> Reviewed-by: Minchan Kim 

Thanks!