Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault

2017-12-14 Thread Yan, Zheng
On Fri, Dec 15, 2017 at 12:53 AM, Jan Kara <j...@suse.cz> wrote:
> On Thu 14-12-17 22:30:26, Yan, Zheng wrote:
>> On Thu, Dec 14, 2017 at 9:43 PM, Jan Kara <j...@suse.cz> wrote:
>> > On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
>> >> We recently got an Oops report:
>> >>
>> >> BUG: unable to handle kernel NULL pointer dereference at (null)
>> >> IP: jbd2__journal_start+0x38/0x1a2
>> >> [...]
>> >> Call Trace:
>> >>   ext4_page_mkwrite+0x307/0x52b
>> >>   _ext4_get_block+0xd8/0xd8
>> >>   do_page_mkwrite+0x6e/0xd8
>> >>   handle_mm_fault+0x686/0xf9b
>> >>   mntput_no_expire+0x1f/0x21e
>> >>   __do_page_fault+0x21d/0x465
>> >>   dput+0x4a/0x2f7
>> >>   page_fault+0x22/0x30
>> >>   copy_user_generic_string+0x2c/0x40
>> >>   copy_page_to_iter+0x8c/0x2b8
>> >>   generic_file_read_iter+0x26e/0x845
>> >>   timerqueue_del+0x31/0x90
>> >>   ceph_read_iter+0x697/0xa33 [ceph]
>> >>   hrtimer_cancel+0x23/0x41
>> >>   futex_wait+0x1c8/0x24d
>> >>   get_futex_key+0x32c/0x39a
>> >>   __vfs_read+0xe0/0x130
>> >>   vfs_read.part.1+0x6c/0x123
>> >>   handle_mm_fault+0x831/0xf9b
>> >>   __fget+0x7e/0xbf
>> >>   SyS_read+0x4d/0xb5
>> >>
>> >> ceph_read_iter() uses current->journal_info to pass context info to
>> >> ceph_readpages(). Because ceph_readpages() needs to know if its caller
>> >> has already gotten capability of using page cache (distinguish read
>> >> from readahead/fadvise). ceph_read_iter() set current->journal_info,
>> >> then calls generic_file_read_iter().
>> >>
>> >> In above Oops, page fault happened when copying data to userspace.
>> >> Page fault handler called ext4_page_mkwrite(). Ext4 code read
>> >> current->journal_info and assumed it is journal handle.
>> >>
>> >> I checked other filesystems, btrfs probably suffers similar problem
>> >> for its readpage. (page fault happens when write() copies data from
>> >> userspace memory and the memory is mapped to a file in btrfs.
>> >> verify_parent_transid() can be called during readpage)
>> >>
>> >> Cc: sta...@vger.kernel.org
>> >> Signed-off-by: "Yan, Zheng" <z...@redhat.com>
>> >
>> > I agree with the analysis but the patch is too ugly to live. Ceph just
>> > should not be abusing current->journal_info for passing information between
>> > two random functions or when it does a hackery like this, it should just
>> > make sure the pieces hold together. Polluting generic code to accommodate
>> > this hack in Ceph is not acceptable. Also bear in mind there are likely
>> > other code paths (e.g. memory reclaim) which could recurse into another
>> > filesystem confusing it with non-NULL current->journal_info in the same
>> > way.
>>
>> But ...
>>
>> Some filesystems set journal_info in write_begin(), then clear it
>> in write_end(). If the buffer for the write is mapped to another
>> filesystem, current->journal_info can leak to the latter filesystem's
>> page_readpage(). The latter filesystem may read current->journal_info
>> and treat it as its own journal handle. Besides, most filesystems' vm
>> fault handler is filemap_fault(), which may also trigger memory reclaim.
>
> Did you really observe this? Because write path uses
> iov_iter_copy_from_user_atomic() which does not allow page faults to
> happen. All page faulting happens in iov_iter_fault_in_readable() before
> ->write_begin() is called. And the recursion problems like you mention
> above are exactly the reason why things are done in a more complicated way
> like this.

I think you are right.

>
>> >
>> > In this particular case I'm not sure why does ceph pass 'filp' into
>> > readpage() / readpages() handler when it already gets that pointer as part
>> > of arguments...
>>
>> It's actually a flag which tells ceph_readpages() whether its caller is
>> ceph_read_iter() or readahead/fadvise/madvise, because when multiple
>> clients read/write a file at the same time, the page cache should
>> be disabled.
>
> I'm not sure I understand the reasoning properly but from what you say
> above it rather seems the 'hint' should be stored in the inode (or possibly
> struct file)?
>

The capability to use the page cache is held by the process that got it.
ceph_read_iter() first gets the capability, calls
generic_file_read_iter(), then releases the capability. The capability
cannot easily be stored in the inode or file because the server can
revoke it at any time if the caller does not hold it.

Regards
Yan, Zheng


> Honza
> --
> Jan Kara <j...@suse.com>
> SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault

2017-12-14 Thread Yan, Zheng
On Thu, Dec 14, 2017 at 9:43 PM, Jan Kara <j...@suse.cz> wrote:
> On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
>> We recently got an Oops report:
>>
>> BUG: unable to handle kernel NULL pointer dereference at (null)
>> IP: jbd2__journal_start+0x38/0x1a2
>> [...]
>> Call Trace:
>>   ext4_page_mkwrite+0x307/0x52b
>>   _ext4_get_block+0xd8/0xd8
>>   do_page_mkwrite+0x6e/0xd8
>>   handle_mm_fault+0x686/0xf9b
>>   mntput_no_expire+0x1f/0x21e
>>   __do_page_fault+0x21d/0x465
>>   dput+0x4a/0x2f7
>>   page_fault+0x22/0x30
>>   copy_user_generic_string+0x2c/0x40
>>   copy_page_to_iter+0x8c/0x2b8
>>   generic_file_read_iter+0x26e/0x845
>>   timerqueue_del+0x31/0x90
>>   ceph_read_iter+0x697/0xa33 [ceph]
>>   hrtimer_cancel+0x23/0x41
>>   futex_wait+0x1c8/0x24d
>>   get_futex_key+0x32c/0x39a
>>   __vfs_read+0xe0/0x130
>>   vfs_read.part.1+0x6c/0x123
>>   handle_mm_fault+0x831/0xf9b
>>   __fget+0x7e/0xbf
>>   SyS_read+0x4d/0xb5
>>
>> ceph_read_iter() uses current->journal_info to pass context info to
>> ceph_readpages(). Because ceph_readpages() needs to know if its caller
>> has already gotten capability of using page cache (distinguish read
>> from readahead/fadvise). ceph_read_iter() set current->journal_info,
>> then calls generic_file_read_iter().
>>
>> In above Oops, page fault happened when copying data to userspace.
>> Page fault handler called ext4_page_mkwrite(). Ext4 code read
>> current->journal_info and assumed it is journal handle.
>>
>> I checked other filesystems, btrfs probably suffers similar problem
>> for its readpage. (page fault happens when write() copies data from
>> userspace memory and the memory is mapped to a file in btrfs.
>> verify_parent_transid() can be called during readpage)
>>
>> Cc: sta...@vger.kernel.org
>> Signed-off-by: "Yan, Zheng" <z...@redhat.com>
>
> I agree with the analysis but the patch is too ugly to live. Ceph just
> should not be abusing current->journal_info for passing information between
> two random functions or when it does a hackery like this, it should just
> make sure the pieces hold together. Polluting generic code to accommodate
> this hack in Ceph is not acceptable. Also bear in mind there are likely
> other code paths (e.g. memory reclaim) which could recurse into another
> filesystem confusing it with non-NULL current->journal_info in the same
> way.

But ...

Some filesystems set journal_info in write_begin(), then clear it
in write_end(). If the buffer for the write is mapped to another
filesystem, current->journal_info can leak to the latter filesystem's
page_readpage(). The latter filesystem may read current->journal_info
and treat it as its own journal handle. Besides, most filesystems' vm
fault handler is filemap_fault(), which may also trigger memory reclaim.

>
> In this particular case I'm not sure why does ceph pass 'filp' into
> readpage() / readpages() handler when it already gets that pointer as part
> of arguments...

It's actually a flag which tells ceph_readpages() whether its caller is
ceph_read_iter() or readahead/fadvise/madvise, because when multiple
clients read/write a file at the same time, the page cache should
be disabled.

Regards
Yan, Zheng

>
> Honza
>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index a728bed16c20..db2a50233c49 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma, 
>> unsigned long address,
>>   unsigned int flags)
>>  {
>>   int ret;
>> + void *old_journal_info;
>>
>>   __set_current_state(TASK_RUNNING);
>>
>> @@ -4065,11 +4066,24 @@ int handle_mm_fault(struct vm_area_struct *vma, 
>> unsigned long address,
>>   if (flags & FAULT_FLAG_USER)
>>   mem_cgroup_oom_enable();
>>
>> + /*
>> +  * Fault can happen when filesystem A's read_iter()/write_iter()
>> +  * copies data to/from userspace. Filesystem A may have set
>> +  * current->journal_info. If the userspace memory is MAP_SHARED
>> +  * mapped to a file in filesystem B, we later may call filesystem
>> +  * B's vm operation. Filesystem B may also want to read/set
>> +  * current->journal_info.
>> +  */
>> + old_journal_info = current->journal_info;
>> + current->journal_info = NULL;
>> +
>>   if (unlikely(is_vm_hugetlb_page(vma)))
>>   ret = hugetlb_fault(vma->vm_mm, vma, address, flags);

[PATCH] mm: save/restore current->journal_info in handle_mm_fault

2017-12-14 Thread Yan, Zheng
We recently got an Oops report:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: jbd2__journal_start+0x38/0x1a2
[...]
Call Trace:
  ext4_page_mkwrite+0x307/0x52b
  _ext4_get_block+0xd8/0xd8
  do_page_mkwrite+0x6e/0xd8
  handle_mm_fault+0x686/0xf9b
  mntput_no_expire+0x1f/0x21e
  __do_page_fault+0x21d/0x465
  dput+0x4a/0x2f7
  page_fault+0x22/0x30
  copy_user_generic_string+0x2c/0x40
  copy_page_to_iter+0x8c/0x2b8
  generic_file_read_iter+0x26e/0x845
  timerqueue_del+0x31/0x90
  ceph_read_iter+0x697/0xa33 [ceph]
  hrtimer_cancel+0x23/0x41
  futex_wait+0x1c8/0x24d
  get_futex_key+0x32c/0x39a
  __vfs_read+0xe0/0x130
  vfs_read.part.1+0x6c/0x123
  handle_mm_fault+0x831/0xf9b
  __fget+0x7e/0xbf
  SyS_read+0x4d/0xb5

ceph_read_iter() uses current->journal_info to pass context info to
ceph_readpages(), because ceph_readpages() needs to know whether its
caller has already got the capability to use the page cache (to
distinguish read from readahead/fadvise). ceph_read_iter() sets
current->journal_info, then calls generic_file_read_iter().

In the above Oops, a page fault happened when copying data to userspace.
The page fault handler called ext4_page_mkwrite(). The ext4 code read
current->journal_info and assumed it was a journal handle.

I checked other filesystems; btrfs probably suffers from a similar problem
in its readpage. (A page fault happens when write() copies data from
userspace memory and the memory is mapped to a file in btrfs.
verify_parent_transid() can be called during readpage.)

Cc: sta...@vger.kernel.org
Signed-off-by: "Yan, Zheng" <z...@redhat.com>
---
 mm/memory.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index a728bed16c20..db2a50233c49 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		unsigned int flags)
 {
 	int ret;
+	void *old_journal_info;
 
 	__set_current_state(TASK_RUNNING);
 
@@ -4065,11 +4066,24 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	if (flags & FAULT_FLAG_USER)
 		mem_cgroup_oom_enable();
 
+	/*
+	 * Fault can happen when filesystem A's read_iter()/write_iter()
+	 * copies data to/from userspace. Filesystem A may have set
+	 * current->journal_info. If the userspace memory is MAP_SHARED
+	 * mapped to a file in filesystem B, we later may call filesystem
+	 * B's vm operation. Filesystem B may also want to read/set
+	 * current->journal_info.
+	 */
+	old_journal_info = current->journal_info;
+	current->journal_info = NULL;
+
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
 	else
 		ret = __handle_mm_fault(vma, address, flags);
 
+	current->journal_info = old_journal_info;
+
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_oom_disable();
 		/*
-- 
2.13.6
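
For readers following the thread, the leak and the save/restore idea in the
patch can be modelled in userspace. This is only a sketch under stated
assumptions: struct task, current_task, fs_a_read(), fs_b_page_mkwrite() and
the two fault_*() helpers are hypothetical stand-ins, not kernel code. Only
the save, clear and restore steps mirror the hunk above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the per-task field; the real one is current->journal_info. */
struct task {
	void *journal_info;
};
static struct task current_task;

/* What "filesystem B" found in journal_info when its fault hook ran. */
static void *seen_by_fs_b;

/*
 * Hypothetical filesystem B fault hook. It records journal_info, which a
 * journalling filesystem would treat as its own handle if non-NULL.
 */
static void fs_b_page_mkwrite(void)
{
	seen_by_fs_b = current_task.journal_info;
}

/* Fault path without the patch: the caller's value leaks through. */
static void fault_unpatched(void)
{
	fs_b_page_mkwrite();
}

/* Fault path with the patch: hide the caller's value, then restore it. */
static void fault_patched(void)
{
	void *old_journal_info = current_task.journal_info;

	current_task.journal_info = NULL;
	fs_b_page_mkwrite();
	current_task.journal_info = old_journal_info;
}

/*
 * Hypothetical filesystem A read path: set a private marker, take a fault
 * while copying to userspace, then clear the marker. Returns what
 * filesystem B observed during the fault.
 */
static void *fs_a_read(void (*fault)(void), void *marker)
{
	current_task.journal_info = marker;
	fault();
	assert(current_task.journal_info == marker); /* A's value survives */
	current_task.journal_info = NULL;
	return seen_by_fs_b;
}
```

Calling fs_a_read(fault_unpatched, &ctx) returns &ctx, i.e. filesystem B saw
filesystem A's private pointer. With fault_patched it returns NULL, while
filesystem A still finds its own value intact after the fault.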



Re: [PATCH] fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at

2017-12-13 Thread Yan, Zheng


> On 13 Dec 2017, at 13:38, Adam Borowski <kilob...@angband.pl> wrote:
> 
> This link is replicated in most filesystems' config stanzas.  Referring
> to an archived version of that site is pointless as it mostly deals with
> patches; user documentation is available elsewhere.
> 
> Signed-off-by: Adam Borowski <kilob...@angband.pl>
> ---
> Sending this as one piece; if you guys would instead prefer this chopped
> into tiny per-filesystem bits, please say so.
> 
> 
> Documentation/filesystems/ext2.txt |  2 --
> Documentation/filesystems/ext4.txt |  7 +++
> fs/9p/Kconfig                      |  3 ---
> fs/Kconfig                         |  6 +-
> fs/btrfs/Kconfig                   |  3 ---
> fs/ceph/Kconfig                    |  3 ---

Ceph bits look good.

Acked-by: "Yan, Zheng" <z...@redhat.com>



Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()

2017-06-04 Thread Yan, Zheng
On Fri, Jun 2, 2017 at 10:18 PM, Arnd Bergmann <a...@arndb.de> wrote:
> On Fri, Jun 2, 2017 at 2:18 PM, Yan, Zheng <uker...@gmail.com> wrote:
>> On Fri, Jun 2, 2017 at 7:33 PM, Arnd Bergmann <a...@arndb.de> wrote:
>>> On Fri, Jun 2, 2017 at 1:18 PM, Yan, Zheng <uker...@gmail.com> wrote:
>>> What I meant is another related problem in ceph_mkdir() where the
>>> i_ctime field of the parent inode is different between the persistent
>>> representation in the mds and the in-memory representation.
>>>
>>
>> I don't see any problem in the mkdir case. The parent inode's i_ctime in
>> the mds is set to r_stamp. When the client receives the request reply, it
>> sets its in-memory inode's ctime to the same time stamp.
>
> Ok, I see it now, thanks for the clarification. Most other file systems
> do this the other way round and update all fields in the in-memory inode
> structure first and then write that to persistent storage, so I was
> getting confused about the order of events here.
>
> If I understand it all right, we have three different behaviors in ceph now,
> though the differences are very minor and probably don't ever matter:
>
> - in setattr(), we update ctime in the in-memory inode first and then send
>   the same time to the mds, and expect to set it again when the reply comes.
>
> - in ceph_write_iter write() and mmap/page_mkwrite(), we call
>   file_update_time() to set i_mtime and i_ctime to the same
>   timestamp first once a write is observed by the fs and then take
>   two other timestamps that we send to the mds, and update the
>   in-memory inode a second time when the reply comes. ctime
>   is never older than mtime here, as far as I can tell, but it may
>   be newer when the timer interrupt happens between taking the
>   two stamps.

We don't use a request to send i_mtime/i_ctime to the mds in this case.
Instead, we use a cap flush message, with i_mtime/i_ctime encoded
directly in it. When the mds receives the cap flush message, it writes
i_mtime/i_ctime to persistent storage and sends a cap flush ack message
to the client. (When the client receives the cap flush ack message, it
does not update i_mtime/i_ctime.) So the issue you described does not
occur.

>
> - in all other calls, we only update the inode (and/or parent inode)
>   after the reply arrives.

There are two cases:
1. The client updates the in-memory inode's ctime and sends the new
   ctime to the mds through a cap flush message.
2. The client sets the mds request's r_stamp and sends the request to
   the mds. The mds updates the relevant inodes' ctime and sends a reply
   to the client. The client updates the in-memory inodes' ctime
   according to the reply.
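
To make the two cases concrete, here is a rough userspace model. It is a
sketch only: struct client_inode, struct mds_inode and both functions are
hypothetical names, timestamps are plain integers instead of struct
timespec, and the message exchange is reduced to assignments.

```c
#include <assert.h>

/* Client-side in-memory copy and mds-side persistent copy of ctime. */
struct client_inode {
	long long ctime;
};
struct mds_inode {
	long long ctime;
};

/*
 * Case 1: the client stamps its in-memory inode first, then encodes that
 * same ctime in a cap flush message. The ack carries no timestamp back.
 */
static void update_via_cap_flush(struct client_inode *ci,
				 struct mds_inode *mi, long long now)
{
	ci->ctime = now;       /* client updates in-memory ctime */
	mi->ctime = ci->ctime; /* mds persists the flushed value */
	/* cap flush ack: the client leaves ci->ctime untouched */
	assert(ci->ctime == mi->ctime);
}

/*
 * Case 2: the client only stamps the request (r_stamp). The mds applies
 * it, and the reply brings the resulting ctime back to the client.
 */
static void update_via_request(struct client_inode *ci,
			       struct mds_inode *mi, long long r_stamp)
{
	long long reply_ctime;

	mi->ctime = r_stamp;     /* mds sets ctime from r_stamp */
	reply_ctime = mi->ctime; /* reply carries the new ctime */
	ci->ctime = reply_ctime; /* client updates on reply */
	assert(ci->ctime == mi->ctime);
}
```

In both paths a single timestamp is taken once and then copied, so the
client and the mds end up with the same ctime. A mismatch could only come
from taking a second, independent timestamp for the same operation.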

Regards
Yan, Zheng

>
>Arnd


Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()

2017-06-02 Thread Yan, Zheng
On Fri, Jun 2, 2017 at 7:33 PM, Arnd Bergmann <a...@arndb.de> wrote:
> On Fri, Jun 2, 2017 at 1:18 PM, Yan, Zheng <uker...@gmail.com> wrote:
>> On Fri, Jun 2, 2017 at 6:51 PM, Arnd Bergmann <a...@arndb.de> wrote:
>>> On Fri, Jun 2, 2017 at 12:10 PM, Yan, Zheng <uker...@gmail.com> wrote:
>>>> On Fri, Jun 2, 2017 at 5:45 PM, Arnd Bergmann <a...@arndb.de> wrote:
>>>>> On Fri, Jun 2, 2017 at 4:09 AM, Yan, Zheng <uker...@gmail.com> wrote:
>>>>>> On Fri, Jun 2, 2017 at 8:57 AM, Deepa Dinamani <deepa.ker...@gmail.com> 
>>>>>> wrote:
>>>>>>> On Thu, Jun 1, 2017 at 5:36 PM, John Stultz <john.stu...@linaro.org> 
>>>>>>> wrote:
>>>>>>>> On Thu, Jun 1, 2017 at 5:26 PM, Yan, Zheng <uker...@gmail.com> wrote:
>>>>>
>>>>> I believe the bug you see is the result of the two timestamps
>>>>> currently being almost guaranteed to be different in the latest
>>>>> kernels.
>>>>> Changing r_stamp to use current_kernel_time() will make it the
>>>>> same value most of the time (as it was before Deepa's patch),
>>>>> but when the timer interrupt happens between the timestamps,
>>>>> the two are still different, it's just much harder to hit.
>>>>>
>>>>> I think the proper solution should be to change __ceph_setattr()
>>>>> in a way that has req->r_stamp always synchronized with i_ctime.
>>>>> If we copy i_ctime to r_stamp, that will also take care of the
>>>>> future issues with the planned changes to current_time().
>>>>>
>>>> I already have a patch
>>>> https://github.com/ceph/ceph-client/commit/24f54cd18e195a002ee3d2ab50dbc952fd9f82af
>>>
>>> Looks good to me. In case anyone cares:
>>> Acked-by: Arnd Bergmann <a...@arndb.de>
>>>
>>>>> The part I don't understand is what else r_stamp (i.e. the time
>>>>> stamp in ceph_msg_data with type==
>>>>> CEPH_MSG_CLIENT_REQUEST) is used for, other than setting
>>>>> ctime in CEPH_MDS_OP_SETATTR.
>>>>>
>>>>> Will this be used to update the stored i_ctime for other operations
>>>>> too? If so, we would need to synchronize it with the in-memory
>>>>> i_ctime for all operations that do this.
>>>>>
>>>>
>>>> yes,  mds uses it to update ctime of modified inodes. For example,
>>>> when handling mkdir, mds set ctime of both parent inode and new inode
>>>> to r_stamp.
>>>
>>> I see, so we may have a variation of that problem there as well: From
>>> my reading of the code, the child inode is not in memory yet, so
>>> that seems fine, but I could not find where the parent in-memory inode
>>> i_ctime is updated in ceph, but it is most likely not the same as
>>> req->r_stamp (assuming it gets updated at all).
>>
>> i_ctime is updated by ceph_fill_file_time() when handling the request
>> reply. __ceph_setattr() can update the in-memory inode's ctime after the
>> request reply is received. The difference between ktime_get_real_ts() and
>> current_time() can be larger than the round-trip time of a request, so
>> it's still possible for __ceph_setattr() to make ctime go backwards.
>
> But the __ceph_setattr() problem should be fixed by your patch, right?
>
> What I meant is another related problem in ceph_mkdir() where the
> i_ctime field of the parent inode is different between the persistent
> representation in the mds and the in-memory representation.
>

I don't see any problem in the mkdir case. The parent inode's i_ctime in
the mds is set to r_stamp. When the client receives the request reply, it
sets its in-memory inode's ctime to the same time stamp.

Regards
Yan, Zheng

> Arnd
>
>>> Would it make sense require all callers of ceph_mdsc_do_request()
>>> to update r_stamp at the same time as i_ctime to keep them in sync?


Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()

2017-06-02 Thread Yan, Zheng
On Fri, Jun 2, 2017 at 6:51 PM, Arnd Bergmann <a...@arndb.de> wrote:
> On Fri, Jun 2, 2017 at 12:10 PM, Yan, Zheng <uker...@gmail.com> wrote:
>> On Fri, Jun 2, 2017 at 5:45 PM, Arnd Bergmann <a...@arndb.de> wrote:
>>> On Fri, Jun 2, 2017 at 4:09 AM, Yan, Zheng <uker...@gmail.com> wrote:
>>>> On Fri, Jun 2, 2017 at 8:57 AM, Deepa Dinamani <deepa.ker...@gmail.com> 
>>>> wrote:
>>>>> On Thu, Jun 1, 2017 at 5:36 PM, John Stultz <john.stu...@linaro.org> 
>>>>> wrote:
>>>>>> On Thu, Jun 1, 2017 at 5:26 PM, Yan, Zheng <uker...@gmail.com> wrote:
>>>
>>> I believe the bug you see is the result of the two timestamps
>>> currently being almost guaranteed to be different in the latest
>>> kernels.
>>> Changing r_stamp to use current_kernel_time() will make it the
>>> same value most of the time (as it was before Deepa's patch),
>>> but when the timer interrupt happens between the timestamps,
>>> the two are still different, it's just much harder to hit.
>>>
>>> I think the proper solution should be to change __ceph_setattr()
>>> in a way that has req->r_stamp always synchronized with i_ctime.
>>> If we copy i_ctime to r_stamp, that will also take care of the
>>> future issues with the planned changes to current_time().
>>>
>> I already have a patch
>> https://github.com/ceph/ceph-client/commit/24f54cd18e195a002ee3d2ab50dbc952fd9f82af
>
> Looks good to me. In case anyone cares:
> Acked-by: Arnd Bergmann <a...@arndb.de>
>
>>> The part I don't understand is what else r_stamp (i.e. the time
>>> stamp in ceph_msg_data with type==
>>> CEPH_MSG_CLIENT_REQUEST) is used for, other than setting
>>> ctime in CEPH_MDS_OP_SETATTR.
>>>
>>> Will this be used to update the stored i_ctime for other operations
>>> too? If so, we would need to synchronize it with the in-memory
>>> i_ctime for all operations that do this.
>>>
>>
>> yes,  mds uses it to update ctime of modified inodes. For example,
>> when handling mkdir, mds set ctime of both parent inode and new inode
>> to r_stamp.
>
> I see, so we may have a variation of that problem there as well: From
> my reading of the code, the child inode is not in memory yet, so
> that seems fine, but I could not find where the parent in-memory inode
> i_ctime is updated in ceph, but it is most likely not the same as
> req->r_stamp (assuming it gets updated at all).

i_ctime is updated by ceph_fill_file_time() when handling the request
reply. __ceph_setattr() can update the in-memory inode's ctime after the
request reply is received. The difference between ktime_get_real_ts() and
current_time() can be larger than the round-trip time of a request, so
it's still possible for __ceph_setattr() to make ctime go backwards.

Regards
Yan, Zheng


>
> Would it make sense require all callers of ceph_mdsc_do_request()
> to update r_stamp at the same time as i_ctime to keep them in sync?
>
> Arnd


Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()

2017-06-02 Thread Yan, Zheng
On Fri, Jun 2, 2017 at 5:45 PM, Arnd Bergmann <a...@arndb.de> wrote:
> On Fri, Jun 2, 2017 at 4:09 AM, Yan, Zheng <uker...@gmail.com> wrote:
>> On Fri, Jun 2, 2017 at 8:57 AM, Deepa Dinamani <deepa.ker...@gmail.com> 
>> wrote:
>>> On Thu, Jun 1, 2017 at 5:36 PM, John Stultz <john.stu...@linaro.org> wrote:
>>>> On Thu, Jun 1, 2017 at 5:26 PM, Yan, Zheng <uker...@gmail.com> wrote:
>>>>> On Thu, Jun 1, 2017 at 6:22 PM, Arnd Bergmann <a...@arndb.de> wrote:
>>>>>> On Thu, Jun 1, 2017 at 11:56 AM, Yan, Zheng <uker...@gmail.com> wrote:
>>>>>>> On Sat, Apr 8, 2017 at 8:57 AM, Deepa Dinamani <deepa.ker...@gmail.com> 
>>>>>>> wrote:
>>>>>>
>>>>>>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>>>>>>>> index 517838b..77204da 100644
>>>>>>>> --- a/drivers/block/rbd.c
>>>>>>>> +++ b/drivers/block/rbd.c
>>>>>>>> @@ -1922,7 +1922,7 @@ static void rbd_osd_req_format_write(struct 
>>>>>>>> rbd_obj_request *obj_request)
>>>>>>>>  {
>>>>>>>> struct ceph_osd_request *osd_req = obj_request->osd_req;
>>>>>>>>
>>>>>>>> -   osd_req->r_mtime = CURRENT_TIME;
>>>>>>>> +   ktime_get_real_ts(&osd_req->r_mtime);
>>>>>>>> osd_req->r_data_offset = obj_request->offset;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>>>>> index c681762..1d3fa90 100644
>>>>>>>> --- a/fs/ceph/mds_client.c
>>>>>>>> +++ b/fs/ceph/mds_client.c
>>>>>>>> @@ -1666,6 +1666,7 @@ struct ceph_mds_request *
>>>>>>>>  ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int 
>>>>>>>> mode)
>>>>>>>>  {
>>>>>>>> struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS);
>>>>>>>> +   struct timespec ts;
>>>>>>>>
>>>>>>>> if (!req)
>>>>>>>> return ERR_PTR(-ENOMEM);
>>>>>>>> @@ -1684,7 +1685,8 @@ ceph_mdsc_create_request(struct ceph_mds_client 
>>>>>>>> *mdsc, int op, int mode)
>>>>>>>> init_completion(&req->r_safe_completion);
>>>>>>>> INIT_LIST_HEAD(&req->r_unsafe_item);
>>>>>>>>
>>>>>>>> -   req->r_stamp = current_fs_time(mdsc->fsc->sb);
>>>>>>>> +   ktime_get_real_ts(&ts);
>>>>>>>> +   req->r_stamp = timespec_trunc(ts, mdsc->fsc->sb->s_time_gran);
>>>>>>>
>>>>>>> This change causes our kernel_untar_tar test case to fail (inode's
>>>>>>> ctime goes back). The reason is that there is time drift between the
>>>>>>> time stamps got by  ktime_get_real_ts() and current_time(). We need to
>>>>>>> revert this change until current_time() uses ktime_get_real_ts()
>>>>>>> internally.
>>>>>>
>>>>>> Hmm, the change was not supposed to have a user-visible effect, so
>>>>>> something has gone wrong, but I don't immediately see how it
>>>>>> relates to what you observe.
>>>>>>
>>>>>> ktime_get_real_ts() and current_time() use the same time base, there
>>>>>> is no drift, but there is a difference in resolution, as the latter uses
>>>>>> the time stamp of the last jiffies update, which may be up to one jiffy
>>>>>> (10ms) behind the exact time we put in the request stamps here.
>>>>>>
>>>>>> Do you still see problems if you use current_kernel_time() instead of
>>>>>> ktime_get_real_ts()?
>>>>>
>>>>> The problem disappears after using current_kernel_time().
>>>>>
>>>>> https://github.com/ceph/ceph-client/commit/2e0f648da23167034a3cf1500bc90ec60aef2417
>>>>
>>>> From the commit above:
>>>> "It seems there is time drift between ktime_get_real_ts() and
>>>> current_kernel_time()"
>>>>
>>>> Its more of a granularity difference. current_ker

Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()

2017-06-01 Thread Yan, Zheng
On Fri, Jun 2, 2017 at 8:57 AM, Deepa Dinamani <deepa.ker...@gmail.com> wrote:
> On Thu, Jun 1, 2017 at 5:36 PM, John Stultz <john.stu...@linaro.org> wrote:
>> On Thu, Jun 1, 2017 at 5:26 PM, Yan, Zheng <uker...@gmail.com> wrote:
>>> On Thu, Jun 1, 2017 at 6:22 PM, Arnd Bergmann <a...@arndb.de> wrote:
>>>> On Thu, Jun 1, 2017 at 11:56 AM, Yan, Zheng <uker...@gmail.com> wrote:
>>>>> On Sat, Apr 8, 2017 at 8:57 AM, Deepa Dinamani <deepa.ker...@gmail.com> 
>>>>> wrote:
>>>>
>>>>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>>>>>> index 517838b..77204da 100644
>>>>>> --- a/drivers/block/rbd.c
>>>>>> +++ b/drivers/block/rbd.c
>>>>>> @@ -1922,7 +1922,7 @@ static void rbd_osd_req_format_write(struct 
>>>>>> rbd_obj_request *obj_request)
>>>>>>  {
>>>>>> struct ceph_osd_request *osd_req = obj_request->osd_req;
>>>>>>
>>>>>> -   osd_req->r_mtime = CURRENT_TIME;
>>>>>> +   ktime_get_real_ts(&osd_req->r_mtime);
>>>>>> osd_req->r_data_offset = obj_request->offset;
>>>>>>  }
>>>>>>
>>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>>> index c681762..1d3fa90 100644
>>>>>> --- a/fs/ceph/mds_client.c
>>>>>> +++ b/fs/ceph/mds_client.c
>>>>>> @@ -1666,6 +1666,7 @@ struct ceph_mds_request *
>>>>>>  ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode)
>>>>>>  {
>>>>>> struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS);
>>>>>> +   struct timespec ts;
>>>>>>
>>>>>> if (!req)
>>>>>> return ERR_PTR(-ENOMEM);
>>>>>> @@ -1684,7 +1685,8 @@ ceph_mdsc_create_request(struct ceph_mds_client 
>>>>>> *mdsc, int op, int mode)
>>>>>> init_completion(&req->r_safe_completion);
>>>>>> INIT_LIST_HEAD(&req->r_unsafe_item);
>>>>>>
>>>>>> -   req->r_stamp = current_fs_time(mdsc->fsc->sb);
>>>>>> +   ktime_get_real_ts(&ts);
>>>>>> +   req->r_stamp = timespec_trunc(ts, mdsc->fsc->sb->s_time_gran);
>>>>>
>>>>> This change causes our kernel_untar_tar test case to fail (inode's
>>>>> ctime goes back). The reason is that there is time drift between the
>>>>> time stamps got by  ktime_get_real_ts() and current_time(). We need to
>>>>> revert this change until current_time() uses ktime_get_real_ts()
>>>>> internally.
>>>>
>>>> Hmm, the change was not supposed to have a user-visible effect, so
>>>> something has gone wrong, but I don't immediately see how it
>>>> relates to what you observe.
>>>>
>>>> ktime_get_real_ts() and current_time() use the same time base, there
>>>> is no drift, but there is a difference in resolution, as the latter uses
>>>> the time stamp of the last jiffies update, which may be up to one jiffy
>>>> (10ms) behind the exact time we put in the request stamps here.
>>>>
>>>> Do you still see problems if you use current_kernel_time() instead of
>>>> ktime_get_real_ts()?
>>>
>>> The problem disappears after using current_kernel_time().
>>>
>>> https://github.com/ceph/ceph-client/commit/2e0f648da23167034a3cf1500bc90ec60aef2417
>>
>> From the commit above:
>> "It seems there is time drift between ktime_get_real_ts() and
>> current_kernel_time()"
>>
>> Its more of a granularity difference. current_kernel_time() returns
>> the cached time at the last tick, where as ktime_get_real_ts() reads
>> the clocksource hardware and returns the immediate time.
>>
>> Filesystems usually use the cached time (similar to
>> CLOCK_REALTIME_COARSE), for performance reasons, as touching the
>> clocksource takes time.
>
> Alternatively, it would be best for this code also to use current_time().
> I had suggested this in one of the previous versions of the patch.
> The implementation of current_time() will change when we switch vfs to
> use 64 bit time. This will prevent such errors from happening again.
> But, this also means there is more code reordering for these modules
> to get a reference to inode.
>

I took a look. It's quite inconvenient to use current_time(). I
prefer to temporarily use current_kernel_time().

Regards
Yan, Zheng



> -Deepa


Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()

2017-06-01 Thread Yan, Zheng
On Thu, Jun 1, 2017 at 6:22 PM, Arnd Bergmann <a...@arndb.de> wrote:
> On Thu, Jun 1, 2017 at 11:56 AM, Yan, Zheng <uker...@gmail.com> wrote:
>> On Sat, Apr 8, 2017 at 8:57 AM, Deepa Dinamani <deepa.ker...@gmail.com> 
>> wrote:
>
>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>>> index 517838b..77204da 100644
>>> --- a/drivers/block/rbd.c
>>> +++ b/drivers/block/rbd.c
>>> @@ -1922,7 +1922,7 @@ static void rbd_osd_req_format_write(struct 
>>> rbd_obj_request *obj_request)
>>>  {
>>> struct ceph_osd_request *osd_req = obj_request->osd_req;
>>>
>>> -   osd_req->r_mtime = CURRENT_TIME;
>>> +   ktime_get_real_ts(&osd_req->r_mtime);
>>> osd_req->r_data_offset = obj_request->offset;
>>>  }
>>>
>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>> index c681762..1d3fa90 100644
>>> --- a/fs/ceph/mds_client.c
>>> +++ b/fs/ceph/mds_client.c
>>> @@ -1666,6 +1666,7 @@ struct ceph_mds_request *
>>>  ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode)
>>>  {
>>> struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS);
>>> +   struct timespec ts;
>>>
>>> if (!req)
>>> return ERR_PTR(-ENOMEM);
>>> @@ -1684,7 +1685,8 @@ ceph_mdsc_create_request(struct ceph_mds_client 
>>> *mdsc, int op, int mode)
>>> init_completion(&req->r_safe_completion);
>>> INIT_LIST_HEAD(&req->r_unsafe_item);
>>>
>>> -   req->r_stamp = current_fs_time(mdsc->fsc->sb);
>>> +   ktime_get_real_ts(&ts);
>>> +   req->r_stamp = timespec_trunc(ts, mdsc->fsc->sb->s_time_gran);
>>
>> This change causes our kernel_untar_tar test case to fail (inode's
>> ctime goes back). The reason is that there is time drift between the
>> time stamps got by  ktime_get_real_ts() and current_time(). We need to
>> revert this change until current_time() uses ktime_get_real_ts()
>> internally.
>
> Hmm, the change was not supposed to have a user-visible effect, so
> something has gone wrong, but I don't immediately see how it
> relates to what you observe.
>
> ktime_get_real_ts() and current_time() use the same time base, there
> is no drift, but there is a difference in resolution, as the latter uses
> the time stamp of the last jiffies update, which may be up to one jiffy
> (10ms) behind the exact time we put in the request stamps here.
>
It happens in the following sequence of events:

1. Create a new file; the inode's ctime is set to ktime_get_real_ts().
2. chmod the new file; the inode's ctime is set to current_time().

The inode's ctime goes backwards when current_time() is behind ktime_get_real_ts().
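
The sequence above can be modelled in userspace C. This is only a sketch,
assuming the coarse clock is the fine clock rounded down to the last tick
update; TICK_NS and all helper names are hypothetical and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed tick length: 10 ms, the jiffy size mentioned in the thread. */
#define TICK_NS 10000000ULL

/* Model of a fine-grained clock read (ktime_get_real_ts() in the thread):
 * the exact time, unchanged. */
static uint64_t fine_stamp(uint64_t now_ns)
{
	return now_ns;
}

/* Model of a coarse clock read (current_time() in the thread): the time of
 * the last tick update, i.e. "now" rounded down to a tick boundary. */
static uint64_t coarse_stamp(uint64_t now_ns)
{
	return now_ns - (now_ns % TICK_NS);
}

/* Returns 1 if the ctime written by a create at create_ns (fine clock) is
 * newer than the ctime written by a later chmod at chmod_ns (coarse clock). */
static int ctime_goes_back(uint64_t create_ns, uint64_t chmod_ns)
{
	assert(create_ns <= chmod_ns);	/* the chmod happens after the create */
	return coarse_stamp(chmod_ns) < fine_stamp(create_ns);
}
```

In this model, a create 4 ms into a tick followed by a chmod 1 ms later
yields a chmod stamp on the tick boundary, which is older than the create
stamp. Once the chmod lands in a later tick, the order is preserved.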

Regards
Yan, Zheng

> Do you still see problems if you use current_kernel_time() instead of
> ktime_get_real_ts()?
>
> Arnd


Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()

2017-06-01 Thread Yan, Zheng
On Thu, Jun 1, 2017 at 6:22 PM, Arnd Bergmann <a...@arndb.de> wrote:
> On Thu, Jun 1, 2017 at 11:56 AM, Yan, Zheng <uker...@gmail.com> wrote:
>> On Sat, Apr 8, 2017 at 8:57 AM, Deepa Dinamani <deepa.ker...@gmail.com> 
>> wrote:
>
>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>>> index 517838b..77204da 100644
>>> --- a/drivers/block/rbd.c
>>> +++ b/drivers/block/rbd.c
>>> @@ -1922,7 +1922,7 @@ static void rbd_osd_req_format_write(struct 
>>> rbd_obj_request *obj_request)
>>>  {
>>> struct ceph_osd_request *osd_req = obj_request->osd_req;
>>>
>>> -   osd_req->r_mtime = CURRENT_TIME;
>>> +   ktime_get_real_ts(&osd_req->r_mtime);
>>> osd_req->r_data_offset = obj_request->offset;
>>>  }
>>>
>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>> index c681762..1d3fa90 100644
>>> --- a/fs/ceph/mds_client.c
>>> +++ b/fs/ceph/mds_client.c
>>> @@ -1666,6 +1666,7 @@ struct ceph_mds_request *
>>>  ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode)
>>>  {
>>> struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS);
>>> +   struct timespec ts;
>>>
>>> if (!req)
>>> return ERR_PTR(-ENOMEM);
>>> @@ -1684,7 +1685,8 @@ ceph_mdsc_create_request(struct ceph_mds_client 
>>> *mdsc, int op, int mode)
>>> init_completion(&req->r_safe_completion);
>>> INIT_LIST_HEAD(&req->r_unsafe_item);
>>>
>>> -   req->r_stamp = current_fs_time(mdsc->fsc->sb);
>>> +   ktime_get_real_ts(&ts);
>>> +   req->r_stamp = timespec_trunc(ts, mdsc->fsc->sb->s_time_gran);
>>
>> This change causes our kernel_untar_tar test case to fail (inode's
>> ctime goes back). The reason is that there is time drift between the
>> time stamps got by  ktime_get_real_ts() and current_time(). We need to
>> revert this change until current_time() uses ktime_get_real_ts()
>> internally.
>
> Hmm, the change was not supposed to have a user-visible effect, so
> something has gone wrong, but I don't immediately see how it
> relates to what you observe.
>
> ktime_get_real_ts() and current_time() use the same time base, there
> is no drift, but there is a difference in resolution, as the latter uses
> the time stamp of the last jiffies update, which may be up to one jiffy
> (10ms) behind the exact time we put in the request stamps here.
>
> Do you still see problems if you use current_kernel_time() instead of
> ktime_get_real_ts()?

The problem disappears after using current_kernel_time().

https://github.com/ceph/ceph-client/commit/2e0f648da23167034a3cf1500bc90ec60aef2417


Regards
Yan, Zheng
>
> Arnd


Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()

2017-06-01 Thread Yan, Zheng
On Sat, Apr 8, 2017 at 8:57 AM, Deepa Dinamani <deepa.ker...@gmail.com> wrote:
> CURRENT_TIME is not y2038 safe.
> The macro will be deleted and all the references to it
> will be replaced by ktime_get_* apis.
>
> struct timespec is also not y2038 safe.
> Retain timespec for timestamp representation here as ceph
> uses it internally everywhere.
> These references will be changed to use struct timespec64
> in a separate patch.
>
> The current_fs_time() api is being changed to use vfs
> struct inode* as an argument instead of struct super_block*.
>
> Set the new mds client request r_stamp field using
> ktime_get_real_ts() instead of using current_fs_time().
>
> Also, since r_stamp is used as mtime on the server, use
> timespec_trunc() to truncate the timestamp, using the right
> granularity from the superblock.
>
> This api will be transitioned to be y2038 safe along
> with vfs.
>
> Signed-off-by: Deepa Dinamani <deepa.ker...@gmail.com>
> Reviewed-by: Arnd Bergmann <a...@arndb.de>
> ---
>  drivers/block/rbd.c   | 2 +-
>  fs/ceph/mds_client.c  | 4 +++-
>  net/ceph/messenger.c  | 6 ++++--
>  net/ceph/osd_client.c | 4 ++--
>  4 files changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 517838b..77204da 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -1922,7 +1922,7 @@ static void rbd_osd_req_format_write(struct 
> rbd_obj_request *obj_request)
>  {
> struct ceph_osd_request *osd_req = obj_request->osd_req;
>
> -   osd_req->r_mtime = CURRENT_TIME;
> +   ktime_get_real_ts(&osd_req->r_mtime);
> osd_req->r_data_offset = obj_request->offset;
>  }
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index c681762..1d3fa90 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -1666,6 +1666,7 @@ struct ceph_mds_request *
>  ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode)
>  {
> struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS);
> +   struct timespec ts;
>
> if (!req)
> return ERR_PTR(-ENOMEM);
> @@ -1684,7 +1685,8 @@ ceph_mdsc_create_request(struct ceph_mds_client *mdsc, 
> int op, int mode)
> init_completion(&req->r_safe_completion);
> INIT_LIST_HEAD(&req->r_unsafe_item);
>
> -   req->r_stamp = current_fs_time(mdsc->fsc->sb);
> +   ktime_get_real_ts(&ts);
> +   req->r_stamp = timespec_trunc(ts, mdsc->fsc->sb->s_time_gran);

This change causes our kernel_untar_tar test case to fail (the inode's
ctime goes backwards). The reason is that there is time drift between the
time stamps returned by ktime_get_real_ts() and current_time(). We need to
revert this change until current_time() uses ktime_get_real_ts()
internally.

Regards
Yan, Zheng


>
> req->r_op = op;
> req->r_direct_mode = mode;
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index f76bb33..5766a6c 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -1386,8 +1386,9 @@ static void prepare_write_keepalive(struct 
> ceph_connection *con)
> dout("prepare_write_keepalive %p\n", con);
> con_out_kvec_reset(con);
> if (con->peer_features & CEPH_FEATURE_MSGR_KEEPALIVE2) {
> -   struct timespec now = CURRENT_TIME;
> +   struct timespec now;
>
> +   ktime_get_real_ts(&now);
> con_out_kvec_add(con, sizeof(tag_keepalive2),
> &tag_keepalive2);
> ceph_encode_timespec(&con->out_temp_keepalive2, &now);
> con_out_kvec_add(con, sizeof(con->out_temp_keepalive2),
> @@ -3176,8 +3177,9 @@ bool ceph_con_keepalive_expired(struct ceph_connection 
> *con,
>  {
> if (interval > 0 &&
> (con->peer_features & CEPH_FEATURE_MSGR_KEEPALIVE2)) {
> -   struct timespec now = CURRENT_TIME;
> +   struct timespec now;
> struct timespec ts;
> +   ktime_get_real_ts(&now);
> jiffies_to_timespec(interval, &ts);
> ts = timespec_add(con->last_keepalive_ack, ts);
> return timespec_compare(&now, &ts) >= 0;
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index e15ea9e..242d7c0 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -3574,7 +3574,7 @@ ceph_osdc_watch(struct ceph_osd_client *osdc,
> ceph_oid_copy(&lreq->t.base_oid, oid);
> ceph_oloc_copy(&lreq->t.base_oloc, oloc);
> lreq->t.flags = CEPH_OSD_FLAG_WRITE;
> -   lreq->mtime = CURRENT_TIME;
> +   ktime_get_real_ts(&lreq->mtim

Re: bad unlock balance in btrfs_commit_transaction_async

2013-08-27 Thread Yan, Zheng
On Wed, Aug 28, 2013 at 8:56 AM, Dan Mick dan.m...@inktank.com wrote:
 Another developer just noticed this in testing; anyone have any ideas?


btrfs_ioctl_start_sync() calls btrfs_attach_transaction_barrier(), which in
turn calls start_transaction() with type == TRANS_ATTACH. In start_transaction(),
sb_start_intwrite() is called only when (type & __TRANS_FREEZABLE) is true, but
(TRANS_ATTACH & __TRANS_FREEZABLE) is false. The freeze lock is therefore never
taken on this path, yet it is still released later, so we see the bad unlock
balance bug.
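
A small userspace model of the imbalance described above. The flag values,
the hold counter and every sk_* name are hypothetical stand-ins, not the
btrfs definitions, and the "balanced" variant is only one possible way to
pair the calls, not necessarily the fix that was applied.

```c
#include <assert.h>

/* Illustrative flag values; the real ones live in the btrfs headers. */
#define SK_TRANS_FREEZABLE (1U << 0)
#define SK_TRANS_START     ((1U << 9) | SK_TRANS_FREEZABLE)
#define SK_TRANS_ATTACH    (1U << 10)	/* note: no FREEZABLE bit */

/* Stand-in for the sb_internal freeze lock: a simple hold count. */
struct freeze_lock {
	int depth;
};

/* Models start_transaction(): the lock is taken only for freezable types. */
static void sk_start_transaction(struct freeze_lock *l, unsigned int type)
{
	if (type & SK_TRANS_FREEZABLE)
		l->depth++;		/* models sb_start_intwrite() */
}

/* Models a release path that does not check the type. */
static void sk_end_unconditional(struct freeze_lock *l)
{
	l->depth--;			/* models sb_end_intwrite() */
}

/* Models a release path that mirrors the condition used on acquire. */
static void sk_end_balanced(struct freeze_lock *l, unsigned int type)
{
	if (type & SK_TRANS_FREEZABLE) {
		assert(l->depth > 0);	/* never release more than was taken */
		l->depth--;
	}
}
```

Starting with SK_TRANS_ATTACH and then calling sk_end_unconditional() drives
the count to -1, which is the situation lockdep reports as "no more locks to
release".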

Yan, Zheng

 On 08/22/2013 05:40 PM, Sage Weil wrote:

 I just noticed that there is a locking imbalance warning with sb_internal
 in the transaction commit code.  I believe this has only started appearing
 recently (after I merged -rc5 into my testing tree), but I'm working on
 confirming that. The error is

 4[27034.835134] =
 4[27034.839854] [ BUG: bad unlock balance detected! ]
 4[27034.844576] 3.11.0-rc5-ceph-00061-g546140d #1 Not tainted
 4[27034.849992] -
 4[27034.854713] ceph-osd/30797 is trying to release lock (sb_internal)
 at:
 4[27034.861304] [a0148fd8]
 btrfs_commit_transaction_async+0x1c8/0x2c0 [btrfs]
 4[27034.868994] but there are no more locks to release!
 4[27034.873887]
 4[27034.873887] other info that might help us debug this:
 4[27034.880448] no locks held by ceph-osd/30797.
 4[27034.884733]
 4[27034.884733] stack backtrace:
 4[27034.889123] CPU: 0 PID: 30797 Comm: ceph-osd Not tainted
 3.11.0-rc5-ceph-00061-g546140d #1
 4[27034.897421] Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS
 1.6.3 02/07/2011
 4[27034.904938]  a0148fd8 88020baf9c68 81642d85
 0007
 4[27034.912411]  88021b32deb0 88020baf9c98 810ab89e
 88020cff8000
 4[27034.919883]  0246 88020aaeddd0 a0148fd8
 88020baf9ce8
 4[27034.927358] Call Trace:
 4[27034.929836]  [a0148fd8] ?
 btrfs_commit_transaction_async+0x1c8/0x2c0 [btrfs]
 4[27034.937790]  [81642d85] dump_stack+0x46/0x58
 4[27034.942951]  [810ab89e]
 print_unlock_imbalance_bug+0xfe/0x110
 4[27034.949599]  [a0148fd8] ?
 btrfs_commit_transaction_async+0x1c8/0x2c0 [btrfs]
 4[27034.957552]  [810aeafe] lock_release+0x15e/0x220
 4[27034.963069]  [a0148fff]
 btrfs_commit_transaction_async+0x1ef/0x2c0 [btrfs]
 4[27034.970850]  [810af6f5] ?
 trace_hardirqs_on_caller+0x105/0x1d0
 4[27034.977587]  [a0177907] btrfs_ioctl_start_sync+0x47/0xc0
 [btrfs]
 4[27034.984499]  [a017c575] btrfs_ioctl+0xe55/0x1af0 [btrfs]
 4[27034.990700]  [8164ab2b] ? _raw_spin_unlock+0x2b/0x40
 4[27034.996552]  [810b40b5] ? do_futex+0xa45/0xbb0
 4[27035.001885]  [8119cf0c] ? fget_light+0x3c/0x130
 4[27035.007302]  [811922a6] do_vfs_ioctl+0x96/0x560
 4[27035.012720]  [8119cf6e] ? fget_light+0x9e/0x130
 4[27035.018137]  [8119cf0c] ? fget_light+0x3c/0x130
 4[27035.023556]  [81192801] SyS_ioctl+0x91/0xb0
 4[27035.028628]  [813338fe] ?
 trace_hardirqs_on_thunk+0x3a/0x3f
 4[27035.035089]  [81653782] system_call_fastpath+0x16/0x1b

 This is presumably some breakage in the freeze locking dance that goes on
 with async commits (btrfs_commit_transaction_async caller takes the freeze
 semaphore, do_async_commit releases it), but it's not obvious to me what
 broke yet.  Unless this rings any bells for anyone, I'll go ahead and
 bisect.

 Thanks!
 sage


Re: [BUG] kernel BUG at fs/btrfs/relocation.c:2502!

2011-09-21 Thread Yan, Zheng
On Wed, Sep 21, 2011 at 6:06 PM, Liu Bo liubo2...@cn.fujitsu.com wrote:

 Hi,

 While running my tool (attached), I hit the BUG_ON, and I FAILED
 to find what went wrong :(

 The tool runs with the inode_cache option and mainly does three things:
 a. run Chris's synctest in BACKGROUND
 b. create 100 snapshots
 c. after b, run btrfs fi balance


 You can follow these tips to reproduce the bug:

 1) untar the attachment,
 2) prepare 4 partitions, the mount point is default to /mnt,
 3) $ ./2_while.sh /dev/sdb1 /dev/sdb2 /dev/sdb3 /dev/sdb4
 4) then just wait several minutes and you will get the bug.

 NOTE:
 You MAY hit a warning and I've fixed it with this patch:
 http://marc.info/?l=linux-btrfs&m=131547325515336&w=2

 ===
 kernel BUG at fs/btrfs/relocation.c:2502!
 [...]
 Call Trace:
  [a03d7a6b] ? block_rsv_add_bytes+0x2b/0x70 [btrfs]
  [a043292f] relocate_tree_blocks+0x60f/0x6d0 [btrfs]
  [a0433498] ? add_data_references+0x248/0x260 [btrfs]
  [a0433722] relocate_block_group+0x272/0x620 [btrfs]
  [a0433c83] btrfs_relocate_block_group+0x1b3/0x2d0 [btrfs]
  [a0413163] btrfs_relocate_chunk+0x93/0x6a0 [btrfs]
  [8103bfb3] ? __wake_up+0x53/0x70
  [a041db42] ? btrfs_tree_read_unlock_blocking+0x42/0x70 [btrfs]
  [a04144d2] btrfs_balance+0x212/0x2a0 [btrfs]
  [81152459] ? path_openat+0x109/0x3e0
  [a041d398] btrfs_ioctl+0x798/0xd20 [btrfs]
  [81112ec3] ? handle_mm_fault+0x143/0x260
  [81474504] ? do_page_fault+0x1d4/0x440
  [8115562a] do_vfs_ioctl+0x9a/0x540
  [81155b71] sys_ioctl+0xa1/0xb0
  [8147896b] system_call_fastpath+0x16/0x1b
 Code: 00 00 00 00 00 eb f6 0f 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 0f 
 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 0f 0b eb fe 0f 0b 0f 1f 84 00 
 00 00 00 00 eb f6 48 83 7a 68 00 0f 84 eb fa
 RIP  [a0430c26] do_relocation+0x546/0x570 [btrfs]
  RSP 88003c0859a8
 ---[ end trace 6a4328153ff7ff17 ]---


Calling btrfs_save_ino_cache() in commit_fs_roots() is completely wrong.
Modification of fs trees is not allowed after create_pending_snapshots().


Re: [BUG] kernel BUG at fs/btrfs/relocation.c:2502!

2011-09-21 Thread Yan, Zheng
On Thu, Sep 22, 2011 at 7:42 AM, David Sterba d...@jikos.cz wrote:
 On Wed, Sep 21, 2011 at 06:57:56PM +0800, Yan, Zheng wrote:
 modification to fs trees is not allowed after create_pending_snapshots()

 Do you have an idea whether there is a reasonable way to catch this in
 code? (even if only under a config option).


no idea


[PATCH] btrfs: check file extent backref offset underflow

2011-08-28 Thread Yan, Zheng
The offset field in a data extent backref can underflow if the clone range
ioctl is used. We can reliably detect the underflow because the max file size
is limited to 2^63 and the max data extent size is limited by the block group
size.

Signed-off-by: Zheng Yan  zheng.z@intel.com
---
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 59bb176..107c9cf 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3323,8 +3323,11 @@ static int find_data_references(struct reloc_control *rc,
}
 
key.objectid = ref_objectid;
-   key.offset = ref_offset;
key.type = BTRFS_EXTENT_DATA_KEY;
+   if (ref_offset > ((u64)-1 << 32))
+   key.offset = 0;
+   else
+   key.offset = ref_offset;
 
path->search_commit_root = 1;
path->skip_locking = 1;
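
The check in the patch can be tried out in userspace with the sketch below.
It assumes only the comparison against (u64)-1 << 32 shown in the diff; the
macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets above this bound are treated as wrapped-around negative values,
 * mirroring the (u64)-1 << 32 comparison in the patch. */
#define UNDERFLOW_BOUND (UINT64_MAX << 32)

/* Returns 1 if the stored backref offset looks like an underflowed value. */
static int backref_offset_underflowed(uint64_t ref_offset)
{
	return ref_offset > UNDERFLOW_BOUND;
}

/* Start offset for the file extent search: 0 when the stored offset wrapped,
 * otherwise the stored offset itself. */
static uint64_t search_start_offset(uint64_t ref_offset)
{
	uint64_t start = backref_offset_underflowed(ref_offset) ? 0 : ref_offset;

	/* Every start offset returned here lies at or below the bound. */
	assert(start <= UNDERFLOW_BOUND);
	return start;
}
```

For example, 18446744073709543424 (that is, -8192 cast to u64) is above the
bound and maps to a search start of 0, while an ordinary offset such as 4096
is passed through unchanged.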


Re: [RFC] Btrfs design defect in extent backref ?

2011-08-25 Thread Yan, Zheng
On Thu, Aug 25, 2011 at 3:56 PM, Li Zefan l...@cn.fujitsu.com wrote:
 We have an offset in file extent to indicate its position in the
 corresponding extent item in extent tree. We also have an offset in
 extent item to indicate the start position of the file extent that
 uses this item.

 The math is:

    extent_item.extent_data_ref.offset = file_pos - file_extent.extent_offset.

                       e1
 disk extents:    |--|
                 ^
                 |                  e2
                 |          |-|
                 |          |   ^
                 |          |   |
                 v          v   |
 file extents:    |- f1 -|- f2 -|

 So it looks like e2.offset points to f1 not f2. Therefore given an extent 
 item,
 we'll have to search through all the file extents in an inode to find the
 relative file extent in the worst case, which makes this field somewhat 
 useless.


The reason for this is to reduce the number of file extent backref items.
We don't have to search all the file extents, because the file extent size
is limited and we have extent_data_ref.count.
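
A minimal sketch of why the subtraction lets several file extents share one
backref item. The struct is a hypothetical, simplified view of a file extent,
and the example values are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified view of a file extent item. */
struct file_extent {
	uint64_t file_pos;	/* where the file extent starts in the file */
	uint64_t extent_offset;	/* offset into the on-disk extent */
};

/* extent_data_ref.offset = file_pos - file_extent.extent_offset,
 * computed in u64, so a "negative" result wraps around. */
static uint64_t backref_offset(const struct file_extent *fe)
{
	return fe->file_pos - fe->extent_offset;
}

/* Counts how many of the n file extents produce the same backref offset as
 * fe[0]; one backref item with this count can stand for all of them. */
static unsigned int shared_backref_count(const struct file_extent *fe,
					 unsigned int n)
{
	unsigned int i, count = 0;

	assert(n > 0);
	for (i = 0; i < n; i++)
		if (backref_offset(&fe[i]) == backref_offset(&fe[0]))
			count++;
	return count;
}
```

If a random write overwrites the middle of a 12K extent, the two remaining
pieces (file_pos 0 with extent_offset 0, and file_pos 8192 with extent_offset
8192) both yield backref offset 0, so one item with count 2 covers them. A
file extent at file_pos 0 with extent_offset 8192 wraps to
18446744073709543424.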

 What makes things worse is that the above formula can make the offset a negative
 value (cast to u64):

    # touch /mnt/dst
    # clone_range -s 8192 -d 4096 /mnt/src /mnt/dst
    # umount /mnt
    # btrfs-debug-tree /dev/sda7
    ...
        item 2 key (12582912 EXTENT_ITEM 49152) itemoff 3865 itemsize 82
                extent refs 2 gen 8 flags 1
                extent data backref root 5 objectid 258 offset 
 18446744073709543424 count 1
                extent data backref root 5 objectid 257 offset 0 count 1
    ...

 and relocation won't work in this case:

    # mount /dev/sda7 /mnt
    # rm /mnt/src
    # sync
    # btrfs fi bal /mnt
    (kernel warning !!)
    (hung up !!)

 I don't see the necessity or benefit of the subtraction in the formula,
 and I think the correct one is:

    extent_item.extent_data_ref.offset = file_pos

 (As a side effect thereafter we don't need extent_data_ref.count)

 That's what this patch does. Unfortunately it is an incompatible change
 in disk format.

 So I think we have to live with this defect, just fix relocation for
 the negative offset case ?

I prefer fixing relocation.


 Signed-off-by: Li Zefan l...@cn.fujitsu.com
 ---
  fs/btrfs/extent-tree.c |    1 -
  fs/btrfs/file.c        |   11 +--
  fs/btrfs/inode.c       |    7 +++
  fs/btrfs/ioctl.c       |    2 +-
  fs/btrfs/relocation.c  |    1 -
  5 files changed, 9 insertions(+), 13 deletions(-)

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index f5be06a..3924e03 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -2578,7 +2578,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
 *trans,
                                continue;

                        num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi);
 -                       key.offset -= btrfs_file_extent_offset(buf, fi);
                        ret = process_func(trans, root, bytenr, num_bytes,
                                           parent, ref_root, key.objectid,
                                           key.offset);
 diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
 index e7872e4..7f65a27 100644
 --- a/fs/btrfs/file.c
 +++ b/fs/btrfs/file.c
 @@ -678,7 +678,7 @@ next_slot:
                                                disk_bytenr, num_bytes, 0,
                                                 root->root_key.objectid,
                                                new_key.objectid,
 -                                               start - extent_offset);
 +                                               start);
                                BUG_ON(ret);
                                *hint_byte = disk_bytenr;
                        }
 @@ -752,8 +752,7 @@ next_slot:
                                ret = btrfs_free_extent(trans, root,
                                                disk_bytenr, num_bytes, 0,
                                                 root->root_key.objectid,
 -                                               key.objectid, key.offset -
 -                                               extent_offset);
 +                                               key.objectid, key.offset);
                                BUG_ON(ret);
                                inode_sub_bytes(inode,
                                                extent_end - key.offset);
 @@ -962,7 +961,7 @@ again:

                ret = btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, 0,
                                            root->root_key.objectid,
 -                                          ino, orig_offset);
 +                                          ino, split);
                BUG_ON(ret);

                if (split == start) {
 @@ -989,7 +988,7 @@ again:
                del_nr++;
                ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
                                        

Re: [RFC] Btrfs design defect in extent backref ?

2011-08-25 Thread Yan, Zheng
On Fri, Aug 26, 2011 at 10:00 AM, Li Zefan l...@cn.fujitsu.com wrote:
 Yan, Zheng wrote:
 On Thu, Aug 25, 2011 at 3:56 PM, Li Zefan l...@cn.fujitsu.com wrote:
 We have an offset in file extent to indicate its position in the
 corresponding extent item in extent tree. We also have an offset in
 extent item to indicate the start position of the file extent that
 uses this item.

 The math is:

    extent_item.extent_data_ref.offset = file_pos - 
 file_extent.extent_offset.

                       e1
 disk extents:    |--|
                 ^
                 |                  e2
                 |          |-|
                 |          |   ^
                 |          |   |
                 v          v   |
 file extents:    |- f1 -|- f2 -|

 So it looks like e2.offset points to f1 not f2. Therefore given an extent 
 item,
 we'll have to search through all the file extents in an inode to find the
 relative file extent in the worst case, which makes this field somewhat 
 useless.


 The reason for this is to reduce the number of file extent backref items.

 It seems to me a rare case, which isn't worth the complexity and inconvenience
 it brings, and it requires an extra field (.count).

Random write workload isn't a rare case.

 we don't have to search all the file extents because the file extent size
 is limited and we have extent_data_ref.count.

 Yes we have to, and for a big file with many small file extents, the extent
 number is not trivial.

The max file extent size is 128M, so we only need to scan a 128M range in the
worst case.
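
The bound can be derived from the formula quoted earlier in the thread. The
sketch below assumes that extent_offset is always smaller than the maximum
extent size; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_EXTENT_SIZE (128ULL * 1024 * 1024)	/* 128M, per the thread */

/* Half-open range [start, end) of file positions to scan. */
struct scan_range {
	uint64_t start;
	uint64_t end;
};

/* Since backref_offset = file_pos - extent_offset and
 * 0 <= extent_offset < MAX_EXTENT_SIZE (assumed), any matching file extent
 * starts in [backref_offset, backref_offset + MAX_EXTENT_SIZE). */
static struct scan_range backref_scan_range(uint64_t backref_offset)
{
	struct scan_range r;

	/* Sketch only: wrapped (underflowed) offsets are not handled here. */
	assert(backref_offset <= UINT64_MAX - MAX_EXTENT_SIZE);
	r.start = backref_offset;
	r.end = backref_offset + MAX_EXTENT_SIZE;
	return r;
}
```

The range depends only on the backref offset, not on the file size, which is
why a large file with many small extents does not need a full scan under this
assumption.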


 What makes things worse is that the above formula can make the offset a negative
 value (cast to u64):

    # touch /mnt/dst
    # clone_range -s 8192 -d 4096 /mnt/src /mnt/dst
    # umount /mnt
    # btrfs-debug-tree /dev/sda7
    ...
        item 2 key (12582912 EXTENT_ITEM 49152) itemoff 3865 itemsize 82
                extent refs 2 gen 8 flags 1
                extent data backref root 5 objectid 258 offset 
 18446744073709543424 count 1
                extent data backref root 5 objectid 257 offset 0 count 1
    ...

 and relocation won't work in this case:

    # mount /dev/sda7 /mnt
    # rm /mnt/src
    # sync
    # btrfs fi bal /mnt
    (kernel warning !!)
    (hung up !!)

 I don't see the necessity or benefit of the subtraction in the formula,
 and I think the correct one is:

    extent_item.extent_data_ref.offset = file_pos

 (As a side effect thereafter we don't need extent_data_ref.count)

 That's what this patch does. Unfortunately it is an incompatible change
 in disk format.

 So I think we have to live with this defect, just fix relocation for
 the negative offset case ?

 I prefer fixing relocation.


 Sure, though I would prefer the alternative if not for the stability of
 disk format.



Re: bug caused by removal of trans_mutex? (Was: Re: kernel BUG at fs/btrfs/extent-tree.c:6164!)

2011-06-14 Thread Yan, Zheng
I found another bug. There is code (btrfs_save_ino_cache) that modifies
fs trees after create_pending_snapshots() is called. This can corrupt
your fs.

On Mon, Jun 13, 2011 at 3:13 PM, Li Zefan l...@cn.fujitsu.com wrote:
 Cc: Josef

 I encountered following panic using 'btrfs-unstable + for-linus'
 kernel.

 I ran btrfs fi bal /test5 command, and mount option of /test5
 is as follows:

  /dev/sdc3 on /test5 type btrfs 
 (rw,space_cache,compress=lzo,inode_cache)

 So, just a btrfs fi bal would lead to the bug?
 I think so.

 It should be specific to the inode caching code.  The balancing code is
 finding the inode map cache extents, but it doesn't know how to relocate
 them.

 However, the panic has occurred even if inode_cache is turned off.
 Is this another problem?


 I don't think the free inode cache is the cause of the bug here (even if
 inode_cache is turned on).

 What I have found out is:

 1. git checkout a4abeea41adfa3c143c289045f4625dfaeba2212

 So the top commit is the removal of trans_mutex and no delayed_inode patch
 or free inode cache patchset in the tree, and bug can be triggered.

 2. git checkout 2a1eb4614d984d5cd4c928784e9afcf5c07f93be

 So the top commit is the one before trans_mutex removal, and no bug triggered.

 3. test linus' tree

 bug triggered.

 4. revert that suspicious commit manually from linus' tree

 no bug.

 so either that commit is buggy or it reveals some bugs covered by the 
 trans_mutex.



Re: bug caused by removal of trans_mutex? (Was: Re: kernel BUG at fs/btrfs/extent-tree.c:6164!)

2011-06-13 Thread Yan, Zheng
Add a mutex to btrfs_init_reloc_root() to prevent the reloc tree from
being created concurrently.
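
The idea can be sketched in userspace with a pthread mutex and a single exit
path. All struct and function names below are hypothetical stand-ins for the
btrfs ones, and the creation step is reduced to setting a pointer.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the btrfs structures named in the thread. */
struct sk_reloc_root {
	int id;
};

struct sk_root {
	pthread_mutex_t reloc_mutex;
	struct sk_reloc_root *reloc_root;	/* NULL until created */
	struct sk_reloc_root storage;		/* backing storage for the sketch */
	int creations;				/* how often creation actually ran */
};

static void sk_root_init(struct sk_root *r)
{
	pthread_mutex_init(&r->reloc_mutex, NULL);
	r->reloc_root = NULL;
	r->creations = 0;
}

/* Creates the reloc root at most once. The check and the creation happen
 * under the same mutex, and every path leaves through the unlock label. */
static int sk_init_reloc_root(struct sk_root *r)
{
	assert(r != NULL);
	pthread_mutex_lock(&r->reloc_mutex);
	if (r->reloc_root)
		goto unlock;		/* already created by an earlier caller */

	r->storage.id = 1;
	r->reloc_root = &r->storage;
	r->creations++;
unlock:
	pthread_mutex_unlock(&r->reloc_mutex);
	return 0;
}
```

Calling sk_init_reloc_root() twice on the same root runs the creation step
only once. Using goto for the early exits keeps the lock and unlock calls
paired on every path.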

On Mon, Jun 13, 2011 at 3:13 PM, Li Zefan l...@cn.fujitsu.com wrote:
 Cc: Josef

 I encountered following panic using 'btrfs-unstable + for-linus'
 kernel.

 I ran btrfs fi bal /test5 command, and mount option of /test5
 is as follows:

  /dev/sdc3 on /test5 type btrfs 
 (rw,space_cache,compress=lzo,inode_cache)

 So, just a btrfs fi bal would lead to the bug?
 I think so.

 It should be specific to the inode caching code.  The balancing code is
 finding the inode map cache extents, but it doesn't know how to relocate
 them.

 However, the panic has occurred even if inode_cache is turned off.
 Is this another problem?


 I don't think the free inode cache is the cause of the bug here (even if
 inode_cache is turned on).

 What I have found out is:

 1. git checkout a4abeea41adfa3c143c289045f4625dfaeba2212

 So the top commit is the removal of trans_mutex and no delayed_inode patch
 or free inode cache patchset in the tree, and bug can be triggered.

 2. git checkout 2a1eb4614d984d5cd4c928784e9afcf5c07f93be

 So the top commit is the one before trans_mutex removal, and no bug triggered.

 3. test linus' tree

 bug triggered.

 4. revert that suspicious commit manually from linus' tree

 no bug.

 so either that commit is buggy or it reveals some bugs covered by the 
 trans_mutex.



Re: bug caused by removal of trans_mutex? (Was: Re: kernel BUG at fs/btrfs/extent-tree.c:6164!)

2011-06-13 Thread Yan, Zheng
The usage of trans_mutex in the relocation code is subtle. It controls the
interaction of relocation with transaction start, transaction commit and
snapshot creation. Simply replacing trans_mutex with trans_lock is wrong.


On Mon, Jun 13, 2011 at 3:13 PM, Li Zefan l...@cn.fujitsu.com wrote:
 Cc: Josef

 I encountered following panic using 'btrfs-unstable + for-linus'
 kernel.

 I ran btrfs fi bal /test5 command, and mount option of /test5
 is as follows:

  /dev/sdc3 on /test5 type btrfs 
 (rw,space_cache,compress=lzo,inode_cache)

 So, just a btrfs fi bal would lead to the bug?
 I think so.

 It should be specific to the inode caching code.  The balancing code is
 finding the inode map cache extents, but it doesn't know how to relocate
 them.

 However, the panic has occurred even if inode_cache is turned off.
 Is this another problem?


 I don't think the free inode cache is the cause of the bug here (even if
 inode_cache is turned on).

 What I have found out is:

 1. git checkout a4abeea41adfa3c143c289045f4625dfaeba2212

 So the top commit is the removal of trans_mutex and no delayed_inode patch
 or free inode cache patchset in the tree, and bug can be triggered.

 2. git checkout 2a1eb4614d984d5cd4c928784e9afcf5c07f93be

 So the top commit is the one before trans_mutex removal, and no bug triggered.

 3. test linus' tree

 bug triggered.

 4. revert that suspicious commit manually from linus' tree

 no bug.

 so either that commit is buggy or it reveals some bugs covered by the 
 trans_mutex.



Re: bug caused by removal of trans_mutex? (Was: Re: kernel BUG at fs/btrfs/extent-tree.c:6164!)

2011-06-13 Thread Yan, Zheng
On Tue, Jun 14, 2011 at 3:55 AM, Chris Mason chris.ma...@oracle.com wrote:
 Excerpts from Yan, Zheng's message of 2011-06-13 10:58:35 -0400:
 The usage of trans_mutex in relocation code is subtle. It controls
 interaction of relocation
 with transaction start, transaction commit and snapshot creation.
 Simple replacing
 trans_mutex with trans_lock is wrong.

 So, I've got a mutex around the reloc_root here and that was almost but
 not quite enough.  It looks like the biggest problem is that we need to
 wait in btrfs_record_root_in_trans for anyone inside merge_reloc_roots.

 I'm surviving much longer with a patch in place that synchronizes
 btrfs_record_root_in_trans better.

 Zheng if you have other comments on the locking please let me know.


The following untested patch may help.
---
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 378b5b4..0b20dda 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -951,6 +951,7 @@ struct btrfs_fs_info {
struct mutex cleaner_mutex;
struct mutex chunk_mutex;
struct mutex volume_mutex;
+   struct mutex reloc_mutex;
/*
 * this protects the ordered operations list only while we are
 * processing all of the entries on it.  This way we make
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9f68c68..28f8b11 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1714,6 +1714,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,
mutex_init(&fs_info->transaction_kthread_mutex);
mutex_init(&fs_info->cleaner_mutex);
mutex_init(&fs_info->volume_mutex);
+   mutex_init(&fs_info->reloc_mutex);
init_rwsem(&fs_info->extent_commit_sem);
init_rwsem(&fs_info->cleanup_work_sem);
init_rwsem(&fs_info->subvol_sem);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index b1ef27c..620e4af 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1330,18 +1330,20 @@ int btrfs_init_reloc_root(struct btrfs_trans_handle *trans,
 			  struct btrfs_root *root)
 {
 	struct btrfs_root *reloc_root;
-	struct reloc_control *rc = root->fs_info->reloc_ctl;
+	struct reloc_control *rc;
 	int clear_rsv = 0;

+	mutex_lock(&root->fs_info->reloc_mutex);
 	if (root->reloc_root) {
 		reloc_root = root->reloc_root;
 		reloc_root->last_trans = trans->transid;
-		return 0;
+		goto unlock;
 	}

+	rc = root->fs_info->reloc_ctl;
 	if (!rc || !rc->create_reloc_tree ||
 	    root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID)
-		return 0;
+		goto unlock;

 	if (!trans->block_rsv) {
 		trans->block_rsv = rc->block_rsv;
@@ -1353,6 +1355,8 @@ int btrfs_init_reloc_root(struct btrfs_trans_handle *trans,

 	__add_reloc_root(reloc_root);
 	root->reloc_root = reloc_root;
+unlock:
+	mutex_unlock(&root->fs_info->reloc_mutex);
 	return 0;
 }

@@ -1367,8 +1371,9 @@ int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
 	int del = 0;
 	int ret;

+	mutex_lock(&root->fs_info->reloc_mutex);
 	if (!root->reloc_root)
-		return 0;
+		goto unlock;

 	reloc_root = root->reloc_root;
 	root_item = &reloc_root->root_item;
@@ -1390,6 +1395,8 @@ int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
 	ret = btrfs_update_root(trans, root->fs_info->tree_root,
 				&reloc_root->root_key, root_item);
 	BUG_ON(ret);
+unlock:
+	mutex_unlock(&root->fs_info->reloc_mutex);
 	return 0;
 }

@@ -2142,10 +2149,10 @@ int prepare_to_merge(struct reloc_control *rc, int err)
 	u64 num_bytes = 0;
 	int ret;

-	spin_lock(&root->fs_info->trans_lock);
+	mutex_lock(&root->fs_info->reloc_mutex);
 	rc->merging_rsv_size += root->nodesize * (BTRFS_MAX_LEVEL - 1) * 2;
 	rc->merging_rsv_size += rc->nodes_relocated * 2;
-	spin_unlock(&root->fs_info->trans_lock);
+	mutex_unlock(&root->fs_info->reloc_mutex);
 again:
 	if (!err) {
 		num_bytes = rc->merging_rsv_size;
@@ -2214,9 +2221,9 @@ int merge_reloc_roots(struct reloc_control *rc)
 	int ret;
 again:
 	root = rc->extent_root;
-	spin_lock(&root->fs_info->trans_lock);
+	mutex_lock(&root->fs_info->reloc_mutex);
 	list_splice_init(&rc->reloc_roots, &reloc_roots);
-	spin_unlock(&root->fs_info->trans_lock);
+	mutex_unlock(&root->fs_info->reloc_mutex);

 	while (!list_empty(&reloc_roots)) {
 		found = 1;
@@ -3590,17 +3597,17 @@ next:
 static void set_reloc_control(struct reloc_control *rc)
 {
 	struct btrfs_fs_info *fs_info = rc->extent_root->fs_info;
-	spin_lock(&fs_info->trans_lock);
+	mutex_lock(&fs_info->reloc_mutex);
 	fs_info->reloc_ctl = rc;
-	spin_unlock(&fs_info->trans_lock);
+	mutex_unlock(&fs_info->reloc_mutex);
 }

 static void unset_reloc_control(struct reloc_control *rc)
 {
 

Re: [PATCH] Btrfs: check root_key's offset instead

2011-06-08 Thread Yan, Zheng
On Wed, Jun 8, 2011 at 5:46 PM, liubo liubo2...@cn.fujitsu.com wrote:
 When we use reloc root to cow or copy a tree block, we do not set the block's
 owner, instead we set its header's flag with BTRFS_HEADER_FLAG_RELOC.

 So here we should check for root_key's offset.

 Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
 ---
  fs/btrfs/extent-tree.c |    2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 5b9b6b6..0bda273 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -6160,7 +6160,7 @@ static noinline int walk_up_proc(struct btrfs_trans_handle *trans,
                if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF)
                        parent = path->nodes[level + 1]->start;
                else
 -                       BUG_ON(root->root_key.objectid !=
 +                       BUG_ON(root->root_key.offset !=
                               btrfs_header_owner(path->nodes[level + 1]));
        }


This is wrong; all blocks with the BTRFS_HEADER_FLAG_RELOC flag set should
use full back references.


Re: btrfs-convert crashes

2011-04-27 Thread Yan, Zheng
On Wed, Apr 27, 2011 at 1:20 PM, Brian Parma freec...@cox.net wrote:
 I have a 1.5 TB (1,475,720,773,632) partition that I wanted to convert from
 ext4 to btrfs.  It is currently used as / for ubuntu 10.10.

 I booted into 11.04 beta2 and tried a 'btrfs-convert /dev/sdc1', but after
 about 20 minutes it segfaulted.

 I performed a:

 fsck.ext4 -cDfty -C 0 /dev/sdc1


 After everything was clean, I downloaded the debugging symbols for
 btrfs-convert and tried again.  Below is the 'bt full' output.  I don't have
 enough free space to copy all the data off, create a fresh btrfs partition,
 and copy everything back on (I have backups of important stuff).  Is there
 something else I can try to get this to work?

 Brian


The crash was caused by the hard links per directory limit in btrfs.
In short, your ext4 is not convertible.
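
The numbers in the quoted backtrace are consistent with that explanation. As a rough sketch only, assuming a 4096-byte leaf, a 101-byte struct btrfs_header and a 25-byte struct btrfs_item (none of these sizes are stated in this thread, and the helper names are hypothetical), the arithmetic behind the abort in btrfs_extend_item() would look like this:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed on-disk sizes; they are not taken from this thread. */
enum {
	ASSUMED_LEAF_SIZE   = 4096, /* leaf size used by the conversion */
	ASSUMED_HEADER_SIZE = 101,  /* sizeof(struct btrfs_header) */
	ASSUMED_ITEM_SIZE   = 25,   /* sizeof(struct btrfs_item) */
};

/* Bytes still free in a leaf holding nritems items with data_bytes of payload. */
static uint32_t leaf_free_space(uint32_t nritems, uint32_t data_bytes)
{
	uint32_t used = ASSUMED_HEADER_SIZE + nritems * ASSUMED_ITEM_SIZE + data_bytes;

	return used >= ASSUMED_LEAF_SIZE ? 0 : ASSUMED_LEAF_SIZE - used;
}

/* True if an existing item can grow by extra bytes without leaving the leaf. */
static bool can_extend_item(uint32_t nritems, uint32_t data_bytes, uint32_t extra)
{
	return leaf_free_space(nritems, data_bytes) >= extra;
}
```

With the values from the trace (nritems = 1, old_size = 3945, ins_len = 27), leaf_free_space(1, 3945) is 25 under these assumptions, so can_extend_item(1, 3945, 27) is false: one more name for the same inode in the same directory no longer fits in the single inode-ref item.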

 at: http://pastebin.com/NEwJNzuP


 #0  0x77444d05 in raise () from /lib/x86_64-linux-gnu/libc.so.6
 No symbol table info available.
 #1  0x77448ab6 in abort () from /lib/x86_64-linux-gnu/libc.so.6
 No symbol table info available.
 #2  0x0040502c in btrfs_extend_item (trans=value optimized out,
 root=0x633920, path=value optimized out, data_size=27) at ctree.c:2525
        slot =value optimized out
        slot_orig =value optimized out
        leaf = 0x1955250
        nritems = 1
        data_end =value optimized out
        old_data =value optimized out
        i =value optimized out
        __PRETTY_FUNCTION__ = btrfs_extend_item
 #3  0x0040e32d in btrfs_insert_inode_ref (trans=0xc9ef10,
 root=0x633920, name=0xcfa314 gtfntf.f.svn-base, name_len=17,
    inode_objectid=value optimized out, ref_objectid=value optimized out,
 index=150) at inode-item.c:135
        old_size = 3945
        path = 0x1639aa0
        key = {objectid = 37361107, type = 12 'f', offset = 37359706}
        ref =value optimized out
        ptr =value optimized out
        ret =value optimized out
        ins_len = 27
        __PRETTY_FUNCTION__ = btrfs_insert_inode_ref
 #4  0x00413fff in dir_iterate_proc (dir=value optimized out,
 entry=value optimized out, old=0xcfa30c, offset=value optimized out,
    blocksize=value optimized out, buf=value optimized out,
 priv_data=0x7fffe370) at convert.c:289
        ret =value optimized out
        file_type =value optimized out
        objectid = 37361107
        dotdot = ..
        location = {objectid = 37361107, type = 1 '', offset = 0}
        dirent = 0xcfa30c
        idata = 0x7fffe370
        __PRETTY_FUNCTION__ = dir_iterate_proc
 #5  0x77bbdc13 in ext2fs_process_dir_block () from
 /lib/x86_64-linux-gnu/libext2fs.so.2
 No symbol table info available.
 #6  0x77bbac02 in ext2fs_block_iterate2 () from
 /lib/x86_64-linux-gnu/libext2fs.so.2
 No symbol table info available.
 #7  0x77bbdfb8 in ext2fs_dir_iterate2 () from
 /lib/x86_64-linux-gnu/libext2fs.so.2
 No symbol table info available.
 #8  0x0041689d in create_dir_entries (devname=0x7fffe897
 /dev/sdc1, datacsum=1, packing=1, noxattr=0) at convert.c:322
        err =value optimized out
        data = {trans = 0xc9ef10, root = 0x633920, inode = 0x7fffe1c0,
 objectid = 37359706, index_cnt = 150, parent = 37359705, errcode = 0}
        ret =value optimized out
 #9  copy_single_inode (devname=0x7fffe897 /dev/sdc1, datacsum=1,
 packing=1, noxattr=0) at convert.c:1072
        ret =value optimized out
        btrfs_inode = {generation = 1, transid = 140737354044640, size =
 4994, nbytes = 0, block_group = 0, nlink = 1, uid = 1000, gid = 1000, mode =
 16877,
          rdev = 0, flags = 0, sequence = 140737351933932, reserved = {0,
 140737354040256, 140733193388033, 0}, atime = {sec = 1303466526, nsec = 0},
 ctime = {
            sec = 1296464377, nsec = 0}, mtime = {sec = 1296464377, nsec =
 0}, otime = {sec = 0, nsec = 0}}
 #10 copy_inodes (devname=0x7fffe897 /dev/sdc1, datacsum=1, packing=1,
 noxattr=0) at convert.c:1154
        ret =value optimized out
        err =value optimized out
        ext2_scan = 0xce2300
        ext2_ino = 37359452
        objectid = 37359706
        ext2_inode = {i_mode = 16877, i_uid = 1000, i_size = 16384, i_atime =
 1303466526, i_ctime = 1296464377, i_mtime = 1296464377, i_dtime = 0, i_gid =
 1000,
          i_links_count = 2, i_blocks = 32, i_flags = 528384, osd1 = {linux1
 = {l_i_version = 1981}, hurd1 = {h_i_translator = 1981}}, i_block = {193290,
 4, 0,
            0, 1, 149430439, 1, 3, 149430464, 0, 0, 0, 0, 0, 0}, i_generation
 = 2854948622, i_file_acl = 0, i_dir_acl = 0, i_faddr = 0, osd2 = {linux2 = {
              l_i_blocks_hi = 0, l_i_file_acl_high = 0, l_i_uid_high = 0,
 l_i_gid_high = 0, l_i_reserved2 = 0}, hurd2 = {h_i_frag = 0 '�',
              h_i_fsize = 0 '�', h_i_mode_high = 0, h_i_uid_high = 0,
 h_i_gid_high = 0, h_i_author = 0}}}
        trans = 0xc9ef10
 #11 do_convert (devname=0x7fffe897 /dev/sdc1, datacsum=1, packing=1,
 noxattr=0) at convert.c:2411
        i =value 

Re: [PATCH] Btrfs: do not release delalloc space until after we end the transaction

2011-04-13 Thread Yan, Zheng
On Thu, Apr 14, 2011 at 2:54 AM, Josef Bacik jo...@redhat.com wrote:
 There have been many sporadic reports of the following panic

 [ cut here ]
 kernel BUG at fs/btrfs/extent-tree.c:5498!
 invalid opcode:  [#1] PREEMPT SMP
 last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
 CPU 7
 Modules linked in: btrfs zlib_deflate libcrc32c netconsole configfs 
 ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand 
 acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 
 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath kvm uinput 
 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq 
 snd_seq_device snd_pcm snd_timer snd hp_wmi i5400_edac sparse_keymap iTCO_wdt 
 rfkill edac_core tg3 shpchp iTCO_vendor_support soundcore wmi floppy 
 snd_page_alloc pcspkr i5k_amb [last unloaded: btrfs]

 Pid: 28504, comm: btrfs-endio-wri Tainted: G        W   2.6.39-rc2+ #35 
 Hewlett-Packard HP xw6600 Workstation/0A9Ch
 RIP: 0010:[a044ec34]  [a044ec34] 
 alloc_reserved_file_extent+0x9a/0x1e5 [btrfs]
 RSP: 0018:88000b4319f0  EFLAGS: 00010286
 RAX: ffe4 RBX: 880009fdc438 RCX: 880020c216d0
 RDX: 88000b4318c0 RSI: 00d5 RDI: 
 RBP: 88000b431a70 R08: ffe4 R09: 880020c216d0
 R10: 0001 R11: 88000b431b10 R12: 88000b431b10
 R13: 00b2 R14:  R15: 88002225f2f8
 FS:  () GS:88003e40() knlGS:
 CS:  0010 DS:  ES:  CR0: 8005003b
 CR2: 003738ca6940 CR3: 2a39a000 CR4: 06e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: 0ff0 DR7: 0400
 Process btrfs-endio-wri (pid: 28504, threadinfo 88000b43, task 
 880032278000)
 Stack:
  0001 88002a92 881d 038d
   0005 88003aa38000 81481012
  88000c3bb480 8800241d01c8 88000b431a60 880031a040a8
 Call Trace:
  [81481012] ? sub_preempt_count+0x97/0xaa
  [a044f92e] run_clustered_refs+0x61b/0x700 [btrfs]
  [81480f89] ? sub_preempt_count+0xe/0xaa
  [a0446ee9] ? spin_lock+0xe/0x10 [btrfs]
  [a044fae4] btrfs_run_delayed_refs+0xd1/0x1ab [btrfs]
  [8147dc1c] ? _raw_spin_unlock+0x4a/0x57
  [a045af1b] __btrfs_end_transaction+0x89/0x1ed [btrfs]
  [a045b0c2] btrfs_end_transaction+0x15/0x17 [btrfs]
  [a0466932] btrfs_finish_ordered_io+0x29c/0x2bf [btrfs]
  [a04669d6] btrfs_writepage_end_io_hook+0x81/0x8d [btrfs]
  [a0477fd5] end_bio_extent_writepage+0xae/0x159 [btrfs]
  [811457e3] bio_endio+0x2d/0x2f
  [a0456c44] end_workqueue_fn+0x111/0x120 [btrfs]
  [a0480a0e] worker_loop+0x192/0x4d1 [btrfs]
  [a048087c] ? btrfs_queue_worker+0x22c/0x22c [btrfs]
  [81068a69] kthread+0xa0/0xa8
  [8107a847] ? trace_hardirqs_on_caller+0x111/0x135
  [81485364] kernel_thread_helper+0x4/0x10
  [8147e398] ? retint_restore_args+0x13/0x13
  [810689c9] ? __init_kthread_worker+0x5b/0x5b
  [81485360] ? gs_change+0x13/0x13
 Code: 44 8b 45 90 0f 84 58 01 00 00 80 88 88 00 00 00 08 41 83 c0 18 4c 89 e1 
 48 8b 72 20 4c 89 ff 48 89 c2 e8 1f b4 ff ff 85 c0 74 04 0f 0b eb fe 48 8b 
 03 48 89 45 c8 8b 73 40 48 89 c7 e8 bc 98 ff
 RIP  [a044ec34] alloc_reserved_file_extent+0x9a/0x1e5 [btrfs]
  RSP 88000b4319f0
 ---[ end trace 81d1c68cb00af83e ]---

 This is because we have been releasing the delalloc bytes before ending the
 transaction.  However the way we make allocations, any updates to the
 extent_tree are delayed and then run when the transaction runs, so we still 
 have
 plenty of space that we need to use.  So instead release the delalloc bytes
 _after_ we end the transaction so that we don't get this false ENOSPC.  
 Thanks,


This is wrong, because btrfs_run_delayed_refs uses the global block reservation.


 Signed-off-by: Josef Bacik jo...@redhat.com
 ---
  fs/btrfs/inode.c |    8 ++--
  1 files changed, 6 insertions(+), 2 deletions(-)

 diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
 index ade00e7..b1e5b11 100644
 --- a/fs/btrfs/inode.c
 +++ b/fs/btrfs/inode.c
 @@ -1783,9 +1783,13 @@ out:
                if (trans)
                        btrfs_end_transaction_nolock(trans, root);
        } else {
 -               btrfs_delalloc_release_metadata(inode, ordered_extent->len);
                if (trans)
                        btrfs_end_transaction(trans, root);
 +               /*
 +                * Release after the transaction ends so it covers the delayed
 +                * ref updates
 +                */
 +               btrfs_delalloc_release_metadata(inode, ordered_extent->len);
        }

        /* once for us */
 @@ -5897,8 +5901,8 @@ out_unlock:
    

Re: [PATCH] Btrfs: fix infinite loop in btrfs_shrink_device()

2011-02-25 Thread Yan, Zheng
On Sat, Feb 26, 2011 at 7:43 AM, Ilya Dryomov idryo...@gmail.com wrote:
 In case of an ENOSPC error from btrfs_relocate_chunk() (line 2202) while
 relocating a block group with offset 0 we end up endlessly looping.
 This happens because key.offset -= 1 statement then unconditionally
 brings us back to the beginning of the loop (key.offset == (u64)-1).

 Signed-off-by: Ilya Dryomov idryo...@gmail.com
 ---
  fs/btrfs/volumes.c |    3 ++-
  1 files changed, 2 insertions(+), 1 deletions(-)

 diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
 index dd13eb8..0cb94ce 100644
 --- a/fs/btrfs/volumes.c
 +++ b/fs/btrfs/volumes.c
 @@ -2212,7 +2212,8 @@ again:
                        goto done;
                if (ret == -ENOSPC)
                        failed++;
 -               key.offset -= 1;
 +               if (--key.offset == -1)
 +                       break;
        }

It should be:  if (--key.offset == (u64)-1)


        if (failed && !retried) {
 --
 1.7.2.3
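
As a small illustration of the wraparound discussed above, here is a userspace sketch (the function names are hypothetical, and uint64_t stands in for the kernel's u64). Decrementing an unsigned 64-bit offset of 0 wraps to the maximum value. Under C's usual arithmetic conversions a bare -1 is converted to the unsigned type before the comparison, so both forms detect the wrap; the explicit cast mainly documents the intent.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decrement the offset as the loop does and report whether it wrapped. */
static bool wrapped_with_cast(uint64_t offset)
{
	return --offset == (uint64_t)-1;
}

/* Same test without the cast; -1 is converted to uint64_t before comparing. */
static bool wrapped_without_cast(uint64_t offset)
{
	return --offset == -1;
}
```

Both helpers return true only when called with an offset of 0, which is the "block group with offset 0" case from the patch description.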



[PATCH] Btrfs: Fix balance panic

2011-01-26 Thread Yan, Zheng
Mark the cloned backref_node as checked in clone_backref_node()

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 045c9c2..bef9c22 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1157,6 +1157,7 @@ static int clone_backref_node(struct btrfs_trans_handle *trans,
 	new_node->bytenr = dest->node->start;
 	new_node->level = node->level;
 	new_node->lowest = node->lowest;
+	new_node->checked = 1;
 	new_node->root = dest;

 	if (!node->lowest) {


Re: Kernel error during btrfs balance

2011-01-21 Thread Yan, Zheng
Please try the patch attached below. Thanks.

---
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index b37d723..49d6b13 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1158,6 +1158,7 @@ static int clone_backref_node(struct btrfs_trans_handle *trans,
 	new_node->bytenr = dest->node->start;
 	new_node->level = node->level;
 	new_node->lowest = node->lowest;
+	new_node->checked = 1;
 	new_node->root = dest;

 	if (!node->lowest) {
---


On Fri, Jan 21, 2011 at 4:50 PM, Erik Logtenberg e...@logtenberg.eu wrote:
 Hi,

 I hit the same bug again I think:

 [291835.724344] [ cut here ]
 [291835.724376] kernel BUG at fs/btrfs/relocation.c:836!
 [291835.724401] invalid opcode:  [#1] SMP
 [291835.724424] last sysfs file:
 /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
 [291835.724461] CPU 0
 [291835.724472] Modules linked in: uvcvideo snd_usb_audio
 snd_usbmidi_lib videodev v4l1_compat snd_rawmidi v4l2_compat_ioctl32
 btrfs zlib_deflate libcrc32c sha256_generic cryptd aes_x86_64
 aes_generic cbc dm_crypt tun ebtable_nat ebtables ipt_MASQUERADE
 iptable_nat nf_nat bridge stp llc nfsd lockd nfs_acl auth_rpcgss
 exportfs nls_utf8 cifs fscache sunrpc cpufreq_ondemand acpi_cpufreq
 freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6
 ip6table_filter ip6_tables ipv6 kvm_intel kvm dummy uinput
 snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep snd_seq
 snd_seq_device e1000e snd_pcm snd_timer i2c_i801 snd shpchp iTCO_wdt
 iTCO_vendor_support soundcore dell_wmi sparse_keymap snd_page_alloc
 serio_raw joydev wmi dcdbas microcode usb_storage uas raid1 pata_acpi
 ata_generic radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last
 unloaded: scsi_wait_scan]
 [291835.725002]
 [291835.725013] Pid: 27386, comm: btrfs Tainted: G          I
 2.6.37-2.fc15.x86_64 #1
 [291835.725062] RIP: 0010:[a0565237]  [a0565237]
 build_backref_tree+0x473/0xd6d [btrfs]
 [291835.725126] RSP: 0018:8800373bf9c8  EFLAGS: 00010246
 [291835.725152] RAX: 8801367d5100 RBX: 88020b110880 RCX:
 0040
 [291835.725186] RDX: 0030 RSI: 006dd08d3000 RDI:
 880100069820
 [291835.725219] RBP: 8800373bfaf8 R08: 8050 R09:
 8800373bf980
 [291835.725253] R10: 8800373bf918 R11: 88020b110880 R12:
 8801367d5100
 [291835.725254] R13: 88012c0a24c0 R14: 88021e2013f0 R15:
 88021e201cf0
 [291835.725254] FS:  7fcb1a6cc760() GS:8800bfa0()
 knlGS:
 [291835.725254] CS:  0010 DS:  ES:  CR0: 8005003b
 [291835.725254] CR2: 02feeeb8 CR3: 0001c2943000 CR4:
 000426e0
 [291835.725254] DR0:  DR1:  DR2:
 
 [291835.725254] DR3:  DR6: 0ff0 DR7:
 0400
 [291835.725254] Process btrfs (pid: 27386, threadinfo 8800373be000,
 task 88022452ae40)
 [291835.725254] Stack:
 [291835.725254]  ea0004b5a470 ea00 8800373bf9f8
 8800373bfaa8
 [291835.725254]   88005faafbb0 880100069808
 880100069d78
 [291835.725254]  88012c0a2aa0 880100069820 88020b1108c0
 880100069d80
 [291835.725254] Call Trace:
 [291835.725254]  [a0565c91] relocate_tree_blocks+0x160/0x478
 [btrfs]
 [291835.725254]  [a056463d] ? add_tree_block+0x11e/0x13e [btrfs]
 [291835.725254]  [a0566b45] relocate_block_group+0x1e3/0x490
 [btrfs]
 [291835.725254]  [8103edb9] ? should_resched+0xe/0x2e
 [291835.725254]  [a0566f39]
 btrfs_relocate_block_group+0x147/0x28a [btrfs]
 [291835.725254]  [a054e52a]
 btrfs_relocate_chunk.clone.40+0x61/0x4ab [btrfs]
 [291835.725254]  [a05152d4] ? btrfs_item_key+0x1e/0x20 [btrfs]
 [291835.725254]  [a05152f0] ? btrfs_item_key_to_cpu+0x1a/0x36
 [btrfs]
 [291835.725254]  [a054c2a8] ? read_extent_buffer+0xc3/0xe3 [btrfs]
 [291835.725254]  [a05154e6] ?
 btrfs_header_nritems.clone.12+0x17/0x1c [btrfs]
 [291835.725254]  [a054cff6] ? btrfs_item_key_to_cpu+0x2a/0x46
 [btrfs]
 [291835.725254]  [a055045e] btrfs_balance+0x1a3/0x1f0 [btrfs]
 [291835.725254]  [8112bce5] ? do_filp_open+0x226/0x5c8
 [291835.725254]  [a0556773] btrfs_ioctl+0x641/0x846 [btrfs]
 [291835.725254]  [811f3ed1] ? file_has_perm+0xa5/0xc7
 [291835.725254]  [8112e091] do_vfs_ioctl+0x4b1/0x4f2
 [291835.725254]  [8112e128] sys_ioctl+0x56/0x7a
 [291835.725254]  [8100acc2] system_call_fastpath+0x16/0x1b
 [291835.725254] Code: 48 8b 45 89 49 8d 7d 10 48 8d 75 b0 49 89 44 24 18
 8a 43 70 ff c0 41 88 44 24 70 e8 f7 c3 ff ff eb 17 f6 40 71 10 49 89 c4
 75 02 0f 0b 49 8d 45 10 49 89 45 10 49 89 45 18 48 8b b5 20 ff ff ff
 [291835.725254] RIP  [a0565237] build_backref_tree+0x473/0xd6d
 [btrfs]
 [291835.725254]  RSP 8800373bf9c8
 [291835.738971] ---[ end trace a7919e7f17c0a727 

Re: v0.19-35-g1b444cd btrfsck says snapshots have errors

2011-01-20 Thread Yan, Zheng
On Fri, Jan 21, 2011 at 6:52 AM, Ian! D. Allen idal...@idallen.ca wrote:
 Still getting btrfsck errors with this:

 git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs-unstable.git

 # ./btrfstest.sh
 Using /mnt/sdb1 /dev/sdb1 on /dev/sdb
 + mkfs.btrfs -L BTRFStest /dev/sdb1

 WARNING! - Btrfs v0.19-35-g1b444cd IS EXPERIMENTAL
 WARNING! - see http://btrfs.wiki.kernel.org before using

 fs created label BTRFStest on /dev/sdb1
        nodesize 4096 leafsize 4096 sectorsize 4096 size 2.00GB
 Btrfs v0.19-35-g1b444cd
 + mount -o noatime /dev/sdb1 /mnt/sdb1
 + btrfs subvolume snapshot /mnt/sdb1 /mnt/sdb1/snap1
 Create a snapshot of '/mnt/sdb1' in '/mnt/sdb1/snap1'
 + btrfs subvolume snapshot /mnt/sdb1/snap1 /mnt/sdb1/snap2
 Create a snapshot of '/mnt/sdb1/snap1' in '/mnt/sdb1/snap2'
 + btrfs subvolume snapshot /mnt/sdb1/snap2 /mnt/sdb1/snap3
 Create a snapshot of '/mnt/sdb1/snap2' in '/mnt/sdb1/snap3'
 + btrfs subvolume snapshot /mnt/sdb1/snap3 /mnt/sdb1/snap4
 Create a snapshot of '/mnt/sdb1/snap3' in '/mnt/sdb1/snap4'
 + btrfs subvolume snapshot /mnt/sdb1/snap4 /mnt/sdb1/snap5
 Create a snapshot of '/mnt/sdb1/snap4' in '/mnt/sdb1/snap5'
 + umount /dev/sdb1
 + btrfsck /dev/sdb1
 fs tree 256 refs 6
        unresolved ref root 256 dir 256 index 2 namelen 5 name snap1 error 600
        unresolved ref root 257 dir 256 index 2 namelen 5 name snap1 error 600
        unresolved ref root 258 dir 256 index 2 namelen 5 name snap1 error 600
        unresolved ref root 259 dir 256 index 2 namelen 5 name snap1 error 600
        unresolved ref root 260 dir 256 index 2 namelen 5 name snap1 error 600
 found 49152 bytes used err is 1
 total csum bytes: 0
 total tree bytes: 49152
 total fs tree bytes: 28672
 btree space waste bytes: 39360
 file data blocks allocated: 0
  referenced 0
 Btrfs v0.19-35-g1b444cd



These are caused by a design flaw; you can safely ignore them.


Re: Kernel error during btrfs balance

2011-01-18 Thread Yan, Zheng
On Tue, Jan 18, 2011 at 9:22 PM, Erik Logtenberg e...@logtenberg.eu wrote:
 On 01/18/2011 01:54 AM, Yan, Zheng wrote:
 On Mon, Jan 17, 2011 at 10:14 PM, Erik Logtenberg e...@logtenberg.eu wrote:
 Hi,

 btrfs balance results in:

 http://pastebin.com/v5j0809M

 My system: fully up-to-date Fedora 14 with rawhide kernel to make btrfs
 balance do useful stuff to my free space:

 kernel-2.6.37-2.fc15.x86_64
 btrfs-progs-0.19-12.fc14.x86_64

 Filesystem had 0 bytes free, should be 45G, so on darkling's advice I ran
 btrfs balance on the fs, while doing heavy I/O (re-running 5 backup jobs
 that had failed due to ENOSPC).
 Up until the crash, btrfs balance did retrieve a couple of Gigs free
 space though, so that part of the plan worked just fine.


 Please try 2.6.36 kernel.

 Thanks for your (short) advice. Could you please elaborate. I was in
 fact using a 2.6.35.10-74.fc14.x86_64 kernel before, but darkling
 advised me to switch to a newer kernel to reclaim free space by
 balancing -- the idea was that newer kernels have better balancing
 implementation, more effective at reclaiming free space.

 Now your advice is to take a small step back again, from 2.6.37 to
 2.6.36 (which is still higher than the 2.6.35 I was using before). Is
 that because you think that 2.6.37 may have introduced the bug that I
 ran into? Do you think that 2.6.36 is still recent enough to have the
 effective balancing so that I will in fact be able to reclaim some free
 space? Or is it just a shot in the dark with no reasoning whatsoever ;)

 Please don't feel offended, but from your 4-word sentence I really can't
 tell.


I am just trying to narrow down the bug, because I have never seen a bug like this before.


Re: Offline Deduplication for Btrfs

2011-01-06 Thread Yan, Zheng
On Thu, Jan 6, 2011 at 12:36 AM, Josef Bacik jo...@redhat.com wrote:
 Here are patches to do offline deduplication for Btrfs.  It works well for the
 cases it's expected to, I'm looking for feedback on the ioctl interface and
 such, I'm well aware there are missing features for the userspace app (like
 being able to set a different blocksize).  If this interface is acceptable I
 will flesh out the userspace app a little more, but I believe the kernel side 
 is
 ready to go.

 Basically I think online dedup is a huge waste of time and completely useless.
 You are going to want to do different things with different data.  For 
 example,
 for a mailserver you are going to want to have very small blocksizes, but for
 say a virtualization image store you are going to want much larger blocksizes.
 And let's not get into heterogeneous environments, those just get much too
 complicated.  So my solution is batched dedup, where a user just runs this
 command and it dedups everything at this point.  This avoids the very costly
 overhead of having to hash and lookup for duplicate extents online and lets us
 be _much_ more flexible about what we want to deduplicate and how we want to 
 do
 it.

 For the userspace app it only does 64k blocks, or whatever the largest area it
 can read out of a file.  I'm going to extend this to do the following things 
 in
 the near future

 1) Take the blocksize as an argument so we can have bigger/smaller blocks
 2) Have an option to _only_ honor the blocksize, don't try and dedup smaller
 blocks
 3) Use fiemap to try and dedup extents as a whole and just ignore specific
 blocksizes
 4) Use fiemap to determine what would be the most optimal blocksize for the 
 data
 you want to dedup.

 I've tested this out on my setup and it seems to work well.  I appreciate any
 feedback you may have.  Thanks,


FYI: the clone ioctl can do the same thing (except that reading the data and
computing the hashes has to be done in user space).

Yan, Zheng
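
For readers unfamiliar with the clone ioctl mentioned above, a minimal userspace sketch follows. It assumes kernel headers that provide FICLONERANGE and struct file_clone_range in <linux/fs.h>, the generic form of the btrfs clone-range ioctl; clone_range is a hypothetical wrapper name, and finding duplicate ranges (reading and hashing) is left to the caller.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/*
 * Share len bytes starting at src_off in src_fd with dst_fd at dst_off.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int clone_range(int src_fd, int dst_fd,
		       uint64_t src_off, uint64_t len, uint64_t dst_off)
{
	struct file_clone_range args = {
		.src_fd = src_fd,
		.src_offset = src_off,
		.src_length = len,
		.dest_offset = dst_off,
	};

	/* The ioctl is issued on the destination file descriptor. */
	return ioctl(dst_fd, FICLONERANGE, &args);
}
```

A userspace dedup tool built this way would hash blocks itself and call the wrapper only for ranges it has verified to be identical, since the clone replaces the destination range without comparing contents.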


[PATCH] x86: hpet: Fix HPET timer + NMI watchdog panic

2010-12-28 Thread Yan, Zheng
HPET no longer uses timer_interrupt() as its interrupt handler, so HPET
interrupts are not counted in per_cpu(irq_stat, cpu).irq0_irqs. This
confuses the NMI watchdog when HPET is used as the tick device.

Signed-off-by: Yan, Zheng zheng.z@intel.com

---
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 55e4de6..cca94f2 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -12,6 +12,9 @@ typedef struct {
 	unsigned int apic_timer_irqs;	/* arch dependent */
 	unsigned int irq_spurious_count;
 #endif
+#ifdef CONFIG_HPET_TIMER
+	unsigned int hpet_timer_irqs;
+#endif
 	unsigned int x86_platform_ipis;	/* arch dependent */
 	unsigned int apic_perf_irqs;
 	unsigned int apic_irq_work_irqs;
diff --git a/arch/x86/kernel/apic/nmi.c b/arch/x86/kernel/apic/nmi.c
index c90041c..cdb38d9 100644
--- a/arch/x86/kernel/apic/nmi.c
+++ b/arch/x86/kernel/apic/nmi.c
@@ -80,8 +80,15 @@ static inline int mce_in_progress(void)
  */
 static inline unsigned int get_timer_irqs(int cpu)
 {
-	return per_cpu(irq_stat, cpu).apic_timer_irqs +
+	unsigned int irqs;
+#ifdef CONFIG_HPET_TIMER
+	irqs = per_cpu(irq_stat, cpu).hpet_timer_irqs;
+#else
+	irqs = 0;
+#endif
+	irqs += per_cpu(irq_stat, cpu).apic_timer_irqs +
 		per_cpu(irq_stat, cpu).irq0_irqs;
+	return irqs;
 }
 
 #ifdef CONFIG_SMP
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 4ff5968..b536474c 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -517,6 +517,7 @@ static irqreturn_t hpet_interrupt_handler(int irq, void *data)
 	struct hpet_dev *dev = (struct hpet_dev *)data;
 	struct clock_event_device *hevt = &dev->evt;

+	inc_irq_stat(hpet_timer_irqs);
 	if (!hevt->event_handler) {
 		printk(KERN_INFO "Spurious HPET timer interrupt on HPET timer %d\n",
 				dev->num);
diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c
index fb5cc5e1..2098c56 100644
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -56,7 +56,7 @@ unsigned long profile_pc(struct pt_regs *regs)
 EXPORT_SYMBOL(profile_pc);
 
 /*
- * Default timer interrupt handler for PIT/HPET
+ * Default timer interrupt handler for PIT
  */
 static irqreturn_t timer_interrupt(int irq, void *dev_id)
 {
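
To illustrate why the missing count matters, here is a simplified userspace model, not the kernel code. It assumes the watchdog treats a CPU as stuck when the summed timer-interrupt count has not moved between two checks; the struct and function names are hypothetical, loosely modelled on the fields touched by the patch.

```c
#include <stdbool.h>

/* Simplified per-CPU counters, loosely modelled on the fields in the patch. */
struct timer_counts {
	unsigned int apic_timer_irqs;
	unsigned int irq0_irqs;
	unsigned int hpet_timer_irqs;
};

/* Sum the counters; include_hpet models the behaviour with the patch applied. */
static unsigned int timer_irq_sum(const struct timer_counts *c, bool include_hpet)
{
	unsigned int sum = c->apic_timer_irqs + c->irq0_irqs;

	if (include_hpet)
		sum += c->hpet_timer_irqs;
	return sum;
}

/* A CPU looks stuck when the sum did not move between two watchdog checks. */
static bool looks_stuck(unsigned int last_sum, unsigned int sum)
{
	return last_sum == sum;
}
```

In this model, a CPU whose only tick source is HPET keeps a constant sum unless hpet_timer_irqs is included, so it is reported as stuck even though timer interrupts are still arriving.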



Re: [PATCH] x86: hpet: Fix HPET timer + NMI watchdog panic

2010-12-28 Thread Yan, Zheng
I sent this mail to the wrong list; sorry for the interruption.

Yan, Zheng

On Wed, Dec 29, 2010 at 9:57 AM, Yan, Zheng zheng.z@linux.intel.com wrote:
 HPET no longer uses timer_interrupt() as its interrupt handler, so HPET
 interrupts are not counted in per_cpu(irq_stat, cpu).irq0_irqs. This
 confuses the NMI watchdog when HPET is used as the tick device.

 Signed-off-by: Yan, Zheng zheng.z@intel.com

 ---
 diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
 index 55e4de6..cca94f2 100644
 --- a/arch/x86/include/asm/hardirq.h
 +++ b/arch/x86/include/asm/hardirq.h
 @@ -12,6 +12,9 @@ typedef struct {
        unsigned int apic_timer_irqs;   /* arch dependent */
        unsigned int irq_spurious_count;
  #endif
 +#ifdef CONFIG_HPET_TIMER
 +       unsigned int hpet_timer_irqs;
 +#endif
        unsigned int x86_platform_ipis; /* arch dependent */
        unsigned int apic_perf_irqs;
        unsigned int apic_irq_work_irqs;
 diff --git a/arch/x86/kernel/apic/nmi.c b/arch/x86/kernel/apic/nmi.c
 index c90041c..cdb38d9 100644
 --- a/arch/x86/kernel/apic/nmi.c
 +++ b/arch/x86/kernel/apic/nmi.c
 @@ -80,8 +80,15 @@ static inline int mce_in_progress(void)
  */
  static inline unsigned int get_timer_irqs(int cpu)
  {
 -       return per_cpu(irq_stat, cpu).apic_timer_irqs +
 +       unsigned int irqs;
 +#ifdef CONFIG_HPET_TIMER
 +       irqs = per_cpu(irq_stat, cpu).hpet_timer_irqs;
 +#else
 +       irqs = 0;
 +#endif
 +       irqs += per_cpu(irq_stat, cpu).apic_timer_irqs +
                per_cpu(irq_stat, cpu).irq0_irqs;
 +       return irqs;
  }

  #ifdef CONFIG_SMP
 diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
 index 4ff5968..b536474c 100644
 --- a/arch/x86/kernel/hpet.c
 +++ b/arch/x86/kernel/hpet.c
 @@ -517,6 +517,7 @@ static irqreturn_t hpet_interrupt_handler(int irq, void *data)
        struct hpet_dev *dev = (struct hpet_dev *)data;
        struct clock_event_device *hevt = &dev->evt;

 +       inc_irq_stat(hpet_timer_irqs);
        if (!hevt->event_handler) {
                printk(KERN_INFO "Spurious HPET timer interrupt on HPET timer %d\n",
                                dev->num);
 diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c
 index fb5cc5e1..2098c56 100644
 --- a/arch/x86/kernel/time.c
 +++ b/arch/x86/kernel/time.c
 @@ -56,7 +56,7 @@ unsigned long profile_pc(struct pt_regs *regs)
  EXPORT_SYMBOL(profile_pc);

  /*
 - * Default timer interrupt handler for PIT/HPET
 + * Default timer interrupt handler for PIT
  */
  static irqreturn_t timer_interrupt(int irq, void *dev_id)
  {

 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [BUG] can not allocate space for caching data

2010-12-20 Thread Yan, Zheng
On Mon, Dec 20, 2010 at 11:41 PM, Chris Mason <chris.ma...@oracle.com> wrote:
 Excerpts from Miao Xie's message of 2010-12-20 08:13:14 -0500:
 On Mon, 20 Dec 2010 07:44:06 -0500, Chris Mason wrote:
  Excerpts from Miao Xie's message of 2010-12-20 07:25:10 -0500:
  Hi, Chris
 
  There is something wrong with this patch:
 
  commit 83a50de97fe96aca82389e061862ed760ece2283
  Author: Chris Mason <chris.ma...@oracle.com>
  Date:   Mon Dec 13 15:06:46 2010 -0500
 
       Btrfs: prevent RAID level downgrades when space is low
 
       The extent allocator has code that allows us to fill
       allocations from any available block group, even if it doesn't
       match the raid level we've requested.
 
  Btrfs has added the space of single chunks and raid0 chunks into the space
  information, so when we use btrfs_check_data_free_space() to check if there
  is some space for storing file data, this function may return true. So we
  write the data into the cache successfully. But, the extent allocator can
  not allocate any space to store that cached data, and then the file system
  panics.
 
  I think we should subtract that space from the space information, or split the
  space information into two types, one is used to manage the chunks with
  duplication, the other manages the other chunks.
 
  Ok, do you have a test case that triggers this?  I'll work out a patch.
  Yan Zheng's original idea of 'the chunks should be readonly' should help
  us deduct them from the total.

 # mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
 # mount /dev/sda9 /mnt
 # dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=99
    (fill the file system)
 # umount /mnt
 # mount /dev/sda9 /mnt
 # dd if=/dev/zero of=/mnt/tmpfile1 bs=4K count=1000
 # sync

 Looks like we've got an off by one bug in set_block_group_ro, which is
 why our block group isn't getting set to ro.  With this patch, we're
 properly setting the block group ro, and the enospc accounting is done
 correctly.

 It should also be able to replace my commit above.  Please take a look,
 Zheng does this look correct to you?

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 227e581..6f7d758 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -7970,13 +7970,14 @@ static int set_block_group_ro(struct btrfs_block_group_cache *cache)

        if (sinfo->bytes_used + sinfo->bytes_reserved + sinfo->bytes_pinned +
            sinfo->bytes_may_use + sinfo->bytes_readonly +
 -           cache->reserved_pinned + num_bytes < sinfo->total_bytes) {
 +           cache->reserved_pinned + num_bytes <= sinfo->total_bytes) {
                sinfo->bytes_readonly += num_bytes;
                sinfo->bytes_reserved += cache->reserved_pinned;
                cache->reserved_pinned = 0;
                cache->ro = 1;
                ret = 0;
        }
 +
        spin_unlock(&cache->lock);
        spin_unlock(&sinfo->lock);
        return ret;


Looks good to me,

Yan, Zheng
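
For readers following the off-by-one discussed in this message, here is a
minimal userspace sketch. The struct and function names are hypothetical
stand-ins, not the kernel's, and the counters are reduced to three fields.
It only shows how `<` and `<=` differ when the block group being marked
read-only exactly fills the remaining space.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the space_info counters. */
struct space_counters {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_readonly;
};

/* Old check: refuses when the sum exactly equals total_bytes. */
static bool can_set_ro_old(const struct space_counters *s, uint64_t num_bytes)
{
	return s->bytes_used + s->bytes_readonly + num_bytes < s->total_bytes;
}

/* Fixed check: an exact fit is still allowed. */
static bool can_set_ro_fixed(const struct space_counters *s, uint64_t num_bytes)
{
	return s->bytes_used + s->bytes_readonly + num_bytes <= s->total_bytes;
}
```

With total_bytes = 100 and bytes_used = 60, a 40-byte block group is
rejected by the old check and accepted by the fixed one.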


Re: [RFC PATCH 2/5 v3] Btrfs: avoid transaction stuff when btrfs is readonly

2010-12-15 Thread Yan, Zheng
On Fri, Dec 3, 2010 at 4:16 PM, liubo <liubo2...@cn.fujitsu.com> wrote:
 When the filesystem is readonly, avoid transaction stuff by checking MS_RDONLY
 at start transaction time.

 Signed-off-by: Liu Bo <liubo2...@cn.fujitsu.com>
 ---
  fs/btrfs/transaction.c |    3 +++
  1 files changed, 3 insertions(+), 0 deletions(-)

 diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
 index 1fffbc0..14a597d 100644
 --- a/fs/btrfs/transaction.c
 +++ b/fs/btrfs/transaction.c
 @@ -181,6 +181,9 @@ static struct btrfs_trans_handle *start_transaction(struct btrfs_root *root,
        struct btrfs_trans_handle *h;
        struct btrfs_transaction *cur_trans;
        int ret;
 +
 +       if (root->fs_info->sb->s_flags & MS_RDONLY)
 +               return ERR_PTR(-EROFS);
  again:
        h = kmem_cache_alloc(btrfs_trans_handle_cachep, GFP_NOFS);
        if (!h)

There are cases where we need to start a transaction while the MS_RDONLY flag
is set, for example when remounting the FS read-only and during log replay.
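
To illustrate why a bare flag test is too strict, here is a minimal
userspace sketch. Everything in it is hypothetical: `SKETCH_RDONLY` stands
in for MS_RDONLY, and the two booleans are invented markers for the remount
and log-replay cases named above. It is not how btrfs tracks these states,
only a picture of the exceptions a read-only check would have to allow.

```c
#include <errno.h>
#include <stdbool.h>

#define SKETCH_RDONLY 0x1u	/* hypothetical stand-in for MS_RDONLY */

/* Hypothetical, reduced view of the filesystem state. */
struct sketch_fs {
	unsigned int flags;
	bool log_replay_in_progress;
	bool remount_in_progress;
};

/* Return 0 if a transaction may start, -EROFS otherwise. */
static int sketch_may_start_transaction(const struct sketch_fs *fs)
{
	if (!(fs->flags & SKETCH_RDONLY))
		return 0;
	/* Read-only, but these paths still need a transaction. */
	if (fs->log_replay_in_progress || fs->remount_in_progress)
		return 0;
	return -EROFS;
}
```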


Re: [RFC 4/5] implement metadata_ra in btrfs

2010-12-13 Thread Yan, Zheng
On Mon, Dec 13, 2010 at 3:22 PM, Shaohua Li <shaohua...@intel.com> wrote:
 Implement btrfs .metadata_readahead. In btrfs, all metadata pages are in a
 special btree_inode. We do readahead in it.

 Signed-off-by: Shaohua Li <shaohua...@intel.com>

 ---
  fs/btrfs/disk-io.c |   10 ++
  fs/btrfs/super.c   |   13 +
  mm/readahead.c     |    1 +
  3 files changed, 24 insertions(+)

 Index: linux/fs/btrfs/disk-io.c
 ===
 --- linux.orig/fs/btrfs/disk-io.c       2010-12-07 13:32:24.0 +0800
 +++ linux/fs/btrfs/disk-io.c    2010-12-07 13:33:08.0 +0800
 @@ -776,6 +776,15 @@ static int btree_readpage(struct file *f
        return extent_read_full_page(tree, page, btree_get_extent);
  }

 +static int btree_readpages(struct file *file, struct address_space *mapping,
 +               struct list_head *pages, unsigned nr_pages)
 +{
 +       struct extent_io_tree *tree;
 +       tree = &BTRFS_I(mapping->host)->io_tree;
 +       return extent_readpages(tree, mapping, pages, nr_pages,
 +                       btree_get_extent);
 +}
 +
  static int btree_releasepage(struct page *page, gfp_t gfp_flags)
  {
        struct extent_io_tree *tree;
 @@ -819,6 +828,7 @@ static void btree_invalidatepage(struct

  static const struct address_space_operations btree_aops = {
        .readpage       = btree_readpage,
 +       .readpages      = btree_readpages,
        .writepage      = btree_writepage,
        .writepages     = btree_writepages,
        .releasepage    = btree_releasepage,
 Index: linux/fs/btrfs/super.c
 ===
 --- linux.orig/fs/btrfs/super.c 2010-12-07 13:32:24.0 +0800
 +++ linux/fs/btrfs/super.c      2010-12-07 13:33:08.0 +0800
 @@ -892,6 +892,18 @@ out:
                return -ENOENT;
  }

 +static int btrfs_metadata_readahead(struct super_block *sb, loff_t offset,
 +       ssize_t size)
 +{
 +       struct btrfs_root *tree_root = btrfs_sb(sb);
 +       struct inode *btree_inode = tree_root->fs_info->btree_inode;
 +       struct address_space *mapping = btree_inode->i_mapping;
 +
 +       force_page_cache_readahead(mapping, NULL, offset >> PAGE_CACHE_SHIFT,
 +               size >> PAGE_CACHE_SHIFT);
 +       return 0;
 +}
 +
  static const struct super_operations btrfs_super_ops = {
        .drop_inode     = btrfs_drop_inode,
        .evict_inode    = btrfs_evict_inode,
 @@ -907,6 +919,7 @@ static const struct super_operations btr
        .freeze_fs      = btrfs_freeze,
        .unfreeze_fs    = btrfs_unfreeze,
        .metadata_incore = btrfs_metadata_incore,
 +       .metadata_readahead = btrfs_metadata_readahead,
  };

  static const struct file_operations btrfs_ctl_fops = {
 Index: linux/mm/readahead.c
 ===
 --- linux.orig/mm/readahead.c   2010-12-07 13:32:24.0 +0800
 +++ linux/mm/readahead.c        2010-12-07 13:33:08.0 +0800
 @@ -228,6 +228,7 @@ int force_page_cache_readahead(struct ad
        }
        return ret;
  }
 +EXPORT_SYMBOL_GPL(force_page_cache_readahead);

  /*
  * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a



btrfs will crash if the read-ahead range falls into an unallocated chunk.
The code needs to check the validity of the user input.

Yan, Zheng
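
As a rough picture of the kind of input check meant here, consider the
userspace sketch below. `struct alloc_range` and `ra_range_valid` are
hypothetical names, and the flat array of ranges is an assumption made for
illustration; btrfs would have to consult its chunk mapping instead. The
sketch only accepts a request that lies entirely inside one allocated range.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one allocated region, [start, start + len). */
struct alloc_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Return true only if [offset, offset + size) lies entirely inside a single
 * allocated range. Empty requests and wrapping arithmetic are rejected.
 */
static bool ra_range_valid(const struct alloc_range *ranges, size_t nr,
			   uint64_t offset, uint64_t size)
{
	if (size == 0 || offset + size < offset)
		return false;

	for (size_t i = 0; i < nr; i++) {
		uint64_t end = ranges[i].start + ranges[i].len;

		if (offset >= ranges[i].start && offset + size <= end)
			return true;
	}
	return false;
}
```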


Re: [PATCH] Btrfs: do not loop through raid types when looking for free extent

2010-12-13 Thread Yan, Zheng
On Tue, Dec 14, 2010 at 9:33 AM, Chris Mason <chris.ma...@oracle.com> wrote:
 Excerpts from Yan, Zheng's message of 2010-11-16 20:38:23 -0500:
 On Wed, Nov 17, 2010 at 5:22 AM, Josef Bacik <jo...@redhat.com> wrote:
  There is a bug in find_free_extent where if we don't find a free extent in the
  raid type we are looking for, we loop through to the next raid type.  This is
  not ok since we need to make sure we honor the raid types we are given.  So
  instead kill this check and get the proper index for the raid type we want from
  the allocator.  Thanks,
 

 Loop through raid types is for handling failure in the middle of raid type
 conversion.

 The problem is that nothing prevents it from looping back to a raid0
 chunk when we really want raid1 or raid10.  And mkfs leaves behind a
 small raid0 chunk (4MB) that it uses as it assembles all the devices.

Check the code at the end of btrfs_read_block_groups; it prevents allocating
from raid0 when there are mirrored block groups.


 I confirmed that we often use the small raid0 chunk even in raid1 or
 raid10.

 Please take a look at this commit:

 http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable.git;a=commit;h=83a50de97fe96aca82389e061862ed760ece2283

 -chris


[PATCH] Btrfs: Fix page leak in compressed writeback path

2010-12-05 Thread Yan, Zheng
start + num_bytes >= actual_end can happen when compressed page writeback races
with file truncation. In that case we need to unlock and release pages past the
end of file.

Signed-off-by: Yan, Zheng <zheng.z@intel.com>
---
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8039390..2ea98d8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -495,7 +495,7 @@ again:
 		add_async_extent(async_cow, start, num_bytes,
 				 total_compressed, pages, nr_pages_ret);
 
-		if (start + num_bytes < end && start + num_bytes < actual_end) {
+		if (start + num_bytes < end) {
 			start += num_bytes;
 			pages = NULL;
 			cond_resched();


Re: [RFC PATCH 2/4 v2] Btrfs: avoid transaction stuff when readonly

2010-12-01 Thread Yan, Zheng
On Thu, Dec 2, 2010 at 11:42 AM, liubo <liubo2...@cn.fujitsu.com> wrote:
 On 12/01/2010 06:20 PM, liubo wrote:
 When the filesystem is readonly, avoid transaction stuff by checking MS_RDONLY at
 start transaction time.


 This patch may lead btrfs panic.

 Since btrfs allows transaction under readonly fs state, which is a bit weird, btrfs
 does not even check the returned transaction from start_transaction, although it may
 return -ENOMEM.

btrfs may do log replay even when mounted read-only.


Re: [PATCH] Btrfs: do not loop through raid types when looking for free extent

2010-11-16 Thread Yan, Zheng
On Wed, Nov 17, 2010 at 5:22 AM, Josef Bacik <jo...@redhat.com> wrote:
 There is a bug in find_free_extent where if we don't find a free extent in the
 raid type we are looking for, we loop through to the next raid type.  This is
 not ok since we need to make sure we honor the raid types we are given.  So
 instead kill this check and get the proper index for the raid type we want from
 the allocator.  Thanks,


Looping through raid types is for handling a failure in the middle of a raid
type conversion.

Yan, Zheng

Re: btrfs-convert fails

2010-10-31 Thread Yan, Zheng
On Sun, Oct 31, 2010 at 11:55 PM, Helmut Hullen <hul...@t-online.de> wrote:
 Hallo, linux-btrfs,

 I've tried to convert a 12 GByte ext2 partition (nearly full, 280 MByte
 free) with btrfs-convert.

 After about 15 minutes (700-MHz-CPU) the system tells

  ...
  creating ext2fs image file
  cleaning up system chunk
  btrfs-convert: extent-tree.c:2529: btrfs_reserve_extent: Assertion
     `!(ret)' failed
  Abgebrochen

Try btrfs-convert -r /dev/xxx, hopefully it will recover your ext2.

Yan, Zheng


Re: 2.6.35.4 fumble-spolision

2010-09-10 Thread Yan, Zheng
        item 17 key (1923814207488 EXTENT_ITEM 4096) itemoff 2978 itemsize 51
                extent refs 1 gen 127137 flags 2
                tree block key (9962449 1 0) level 0
                tree block backref root 877
        item 18 key (1923814211584 EXTENT_ITEM 4096) itemoff 2927 itemsize 51
                extent refs 1 gen 119662 flags 2
                tree block key (9722750 54 1542337137) level 0
                tree block backref root 848
        item 19 key (1923814215680 EXTENT_ITEM 4096) itemoff 2876 itemsize 51
                extent refs 1 gen 119662 flags 2
                tree block key (9722750 54 1633150225) level 0
                tree block backref root 848
        item 20 key (1923814219776 EXTENT_ITEM 4096) itemoff 2825 itemsize 51
                extent refs 1 gen 116478 flags 258
                tree block key (2769869 54 1517502465) level 0
                shared block backref parent 1923805945856
        item 21 key (1923814223872 EXTENT_ITEM 4096) itemoff 2693 itemsize 132
                extent refs 10 gen 118289 flags 2
                tree block key (10471699 1 0) level 1
                tree block backref root 879
                tree block backref root 873
                tree block backref root 867
                tree block backref root 864
                tree block backref root 861
                tree block backref root 855
                tree block backref root 852
                tree block backref root 849
                tree block backref root 846
                tree block backref root 843
        item 22 key (1923814240256 EXTENT_ITEM 4096) itemoff 2642 itemsize 51
                extent refs 1 gen 127137 flags 2
                tree block key (9962452 1 0) level 0
                tree block backref root 877
        item 23 key (1923814244352 EXTENT_ITEM 4096) itemoff 2591 itemsize 51
                extent refs 1 gen 123524 flags 2
                tree block key (4962881 54 2250929569) level 0
                tree block backref root 862
        item 24 key (1923814248448 EXTENT_ITEM 4096) itemoff 2540 itemsize 51
                extent refs 1 gen 123524 flags 2
                tree block key (4962881 54 2530040045) level 0
                tree block backref root 862
        item 25 key (1923814252544 EXTENT_ITEM 4096) itemoff 2489 itemsize 51
                extent refs 1 gen 123524 flags 2
                tree block key (4962881 54 693460895) level 0
                tree block backref root 862
        item 26 key (1923814256640 EXTENT_ITEM 4096) itemoff 2438 itemsize 51
                extent refs 1 gen 123524 flags 2
                tree block key (4962881 54 4250039336) level 0
                tree block backref root 862
        item 27 key (1923814264832 EXTENT_ITEM 4096) itemoff 2387 itemsize 51
                extent refs 1 gen 125542 flags 2
                tree block key (7531027 54 716511755) level 0
                tree block backref root 870
        item 28 key (1923814268928 EXTENT_ITEM 4096) itemoff 2336 itemsize 51
                extent refs 1 gen 123524 flags 2
                tree block key (4962881 54 583696754) level 0
                tree block backref root 862
        item 29 key (1923814273024 EXTENT_ITEM 4096) itemoff 2285 itemsize 51
                extent refs 1 gen 123524 flags 2
                tree block key (4962881 54 846280235) level 0
                tree block backref root 862
        item 30 key (1923814277120 EXTENT_ITEM 4096) itemoff 2234 itemsize 51
                extent refs 1 gen 123524 flags 2
                tree block key (4962881 54 108099388) level 0
                tree block backref root 862
        item 31 key (1923814281216 EXTENT_ITEM 4096) itemoff 2183 itemsize 51
                extent refs 1 gen 116704 flags 258
                tree block key (9759696 60 2) level 0
                tree block backref root 839
        item 32 key (1923814285312 EXTENT_ITEM 4096) itemoff 2105 itemsize 78
                extent refs 4 gen 123524 flags 2
                tree block key (4962881 60 1452615) level 0
                tree block backref root 892
                tree block backref root 868
                tree block backref root 865
                tree block backref root 862
        item 33 key (1923814293504 EXTENT_ITEM 4096) itemoff 2054 itemsize 51
                extent refs 1 gen 121426 flags 2
                tree block key (5643880 60 602) level 0
                tree block backref root 853
 failed to find block number 1923814297600
 Abort

How large is the FS? Is it possible to run btrfs-image and send the
output file to us?

Regards
Yan, Zheng


Re: machine gets unresponsive during btrfs balance

2010-08-12 Thread Yan, Zheng
On Thu, Aug 12, 2010 at 3:14 PM, Andreas Philipp
<philipp.andr...@gmail.com> wrote:
 Hi,

 I am using a btrfs filesystem created with raid0 for data and metadata
 for (temporary) storage of tv recordings from my vdr. The filesystem was
 created under kernel version 2.6.34. An initial btrfs balance command
 succeeded. Since I upgraded to 2.6.35-rcX and 2.6.35 btrfs balance no
 longer finishes but puts the machine in some unresponsive state.
 Unfortunately, I do not see any kernel oops or other debug information
 because even the display freezes. The last thing that happens are that
 those two lines are written to /var/log/messages:
 Aug 11 21:42:23 thor kernel: btrfs: found 62911 extents
 Aug 11 21:42:24 thor kernel: btrfs: relocating block group 1723913469952
 flags 9
 After that the machine becomes immediately unresponsive.

 As I did not see anything that might be related to my problem in the
 changelog for 2.6.35.1 I did not try again with this version.


Do you have more than one machine? Would you please set up netconsole
to see what happens?

Thanks
Yan, Zheng


Re: Code bug or data bug?

2010-08-10 Thread Yan, Zheng
On Tue, Aug 10, 2010 at 6:20 AM, K. Richard Pixley <r...@noir.com> wrote:
  I've just gotten:

 r...@diamonds:~$ time sudo /sbin/btrfsck /dev/sda7
 btrfsck: btrfsck.c:585: splice_shared_node: Assertion `!(src ==
 &src_node->root_cache)' failed.
 Aborted

 Does this indicate a coding error in btrfsck or a data error in my file
 system?

 --rich

 r...@diamonds:~$ dpkg -l | grep btrfs
 ii  btrfs-tools                                0.19+20100601-3
                   Checksumming Copy on Write Filesystem utilit
 r...@diamonds:~$ uname -a
 Linux diamonds 2.6.32-24-generic-pae #38-Ubuntu SMP Mon Jul 5 10:54:21 UTC
 2010 i686 GNU/Linux
 r...@diamonds:~$ lsb_release -a
 No LSB modules are available.
 Distributor ID: Ubuntu
 Description:    Ubuntu 10.04.1 LTS
 Release:        10.04
 Codename:       lucid


This is a bug in btrfsck.


Re: [PATCH 7/7] btrfs: fix a wrong error check in add_ra_bio_pages()

2010-07-29 Thread Yan, Zheng
2010/7/29 Miao Xie <mi...@cn.fujitsu.com>:
 From: Liu Bo <liubo2...@cn.fujitsu.com>

 Only when a page is not found by page_index, we'll go to the error check.

 Signed-off-by: Liu Bo <liubo2...@cn.fujitsu.com>
 Signed-off-by: Miao Xie <mi...@cn.fujitsu.com>
 ---
  fs/btrfs/compression.c |    2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

 diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
 index cb3877c..8458840 100644
 --- a/fs/btrfs/compression.c
 +++ b/fs/btrfs/compression.c
 @@ -467,7 +467,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
                rcu_read_lock();
                page = radix_tree_lookup(&mapping->page_tree, page_index);
                rcu_read_unlock();
 -               if (page) {
 +               if (!page) {
  check_misses:
                        misses++;
                        if (misses > 4)

This patch is wrong. The word "miss" here means a miss for read-ahead, because
the page is already in the page cache.

Yan, Zheng
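
To make that meaning of "miss" concrete, here is a minimal userspace sketch
of the loop shape. It is not the kernel code: `in_cache` is a hypothetical
array standing in for the page-cache lookup, and `MAX_MISSES` mirrors the
`misses > 4` cutoff in the quoted hunk. A page that is already cached counts
as a read-ahead miss, since there is nothing to add for it, and the scan
gives up after too many of them.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_MISSES 4

/*
 * Count how many pages read-ahead would add, given which page indexes are
 * already cached. A cached page is a "miss" for read-ahead; once more than
 * MAX_MISSES of them have been seen, the scan stops.
 */
static size_t ra_pages_to_add(const bool *in_cache, size_t nr_pages)
{
	size_t added = 0;
	int misses = 0;

	for (size_t i = 0; i < nr_pages; i++) {
		if (in_cache[i]) {
			misses++;
			if (misses > MAX_MISSES)
				break;
			continue;
		}
		added++;
	}
	return added;
}
```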


Re: Intermittent no space errors

2010-07-27 Thread Yan, Zheng
On Tue, Jul 27, 2010 at 5:09 AM, Dave Cundiff <syshack...@gmail.com> wrote:
 Hello,

 On 2.6.35-rc5 I'm seeing some weird behavior under heavy IO loads. I
 have a backup process that fires up several rsync processes. These
 mirror several dozen servers to individual sub-volumes. Everyday I
 snapshot each sub-volume and rsync over it.

 The problem I'm seeing is my rsync processes are failing randomly with
 "No space left on device". This is a 6 Terabyte volume with plenty of
 free space.

 Mount options:
 /dev/sdb on /backups type btrfs (rw,max_inline=0,compress)

 [r...@rsync1 ~]# btrfs filesystem df /backups/
 Data: total=1.88TB, used=1.88TB
 Metadata: total=43.38GB, used=32.06GB
 System: total=12.00MB, used=260.00KB

 [r...@rsync1 ~]# df /dev/sdb
 Filesystem           1K-blocks      Used Available Use% Mounted on
 /dev/sdb             5781249024 2087273084 3693975940  37% /backups

 They don't all fail at once. Normally I have 4-5 running at a time and
 1 or 2 will drop out with a no space error. The rest continue on. I've
 noticed it will generally occur on ones that are in the middle of
 transferring a very large file. If I lighten the load to one rsync at
 a time it appears to happen less frequently.

 Any known issues I should be aware of?


Thank you for reporting this. I will dig in.

Yan, Zheng


Re: btrfs: unlinked X orphans messages

2010-07-19 Thread Yan, Zheng
On Mon, Jul 19, 2010 at 5:01 PM, Xavier Nicollet <nicol...@jeru.org> wrote:
 Hi,

 I am using btrfs for remote backups (via rsync), with daily and weekly
 snapshots.

 I see these messages in kern.log:

 Jul 18 07:09:43 backup1 kernel: [3437126.458374] btrfs: unlinked 9 orphans
 Jul 18 12:01:01 backup1 kernel: [3454604.905856] btrfs: unlinked 1 orphans
 Jul 18 13:01:51 backup1 kernel: [3458254.990199] btrfs: unlinked 1 orphans
 Jul 19 04:01:41 backup1 kernel: [3512244.236347] btrfs: unlinked 1 orphans

 Is this something I have to be afraid of ?

 Linux debian lenny, pure btrfs partition with no raid, vanilla kernel:
 2.6.34.


Nothing to worry about.

Yan, Zheng


Re: [BUG] btrfs hangup when we run the sync command

2010-07-15 Thread Yan, Zheng
2010/7/15 Miao Xie <mi...@cn.fujitsu.com>:
 Hi, everyone

 I found btrfs will hangup when we run the sync command on my
 x86_64 box.

 The reproduce steps is following:
 # mkfs.btrfs -s 8192 -l 8192 -n 8192 /dev/sda1
 # mount /dev/sda1 /mnt
 # echo 1234567 > /mnt/aaa
 # sync
 (btrfs hangs up)

 It seems that the btrfs doesn't support the sectorsize which is
 greater than the page size just like ext2/3/4, though we can use
 mkfs.btrfs to make a filesystem with a big sectorsize. Am I right?

 If yes, we must do more check in the mkfs.btrfs.


Yes, btrfs doesn't support sectorsize > PAGE_SIZE.
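
A check of the kind suggested for mkfs.btrfs could look roughly like the
sketch below. `sectorsize_supported` is a hypothetical helper, not a
function from btrfs-progs; it only compares a requested sector size against
the page size reported by sysconf(_SC_PAGESIZE), on the assumption that the
tool runs on the same kind of machine that will mount the filesystem.

```c
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: reject sector sizes larger than the system page size. */
static bool sectorsize_supported(uint32_t sectorsize)
{
	long page_size = sysconf(_SC_PAGESIZE);

	if (page_size <= 0 || sectorsize == 0)
		return false;
	return (uint64_t)sectorsize <= (uint64_t)page_size;
}
```

On a machine with 4096-byte pages, the 8192 from the reproducer above would
be refused before the filesystem is created.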


Re: btrfsck segmentation fault + trace

2010-06-16 Thread Yan, Zheng
On Thu, Jun 17, 2010 at 4:25 AM, John Wyzer <john.wy...@gmx.de> wrote:
 On 16/06/10 19:29, Chris Mason wrote:
 On Wed, Jun 16, 2010 at 06:35:25PM +0200, John Wyzer wrote:
 Is 2.6.34 working normally?

 Yes. I can boot with 2.6.34.y and everything works fine. (Actually,
 before trying 2.6.35-rc3, I had an uptime of two weeks on this laptop.
 Now, I'm writing this email on 2.6.34.y.)

 But every time you boot 2.6.35 you get errors?  Would it be possible to
 save the console output (netconsole works well)

 [...]
 device fsid 3247922091b53feb-dcb02f0506fbdc8b devid 1 transid 155748
 /dev/mapper/root
 EXT3-fs: barriers not enabled
 kjournald starting.  Commit interval 5 seconds
 EXT3-fs (sda4): warning: maximal mount count reached, running e2fsck is
 recommended
 EXT3-fs (sda4): using internal journal
 EXT3-fs (sda4): recovery complete
 EXT3-fs (sda4): mounted filesystem with writeback data mode
 btrfs: fail to dirty  inode 9959493 error -28
 btrfs: fail to dirty  inode 2987803 error -28
 btrfs: fail to dirty  inode 2987803 error -28
 btrfs: fail to dirty  inode 8873620 error -28
 btrfs: fail to dirty  inode 8873620 error -28
 btrfs: fail to dirty  inode 803894 error -28
 btrfs: fail to dirty  inode 2988335 error -28
 btrfs: fail to dirty  inode 2987971 error -28
 btrfs: fail to dirty  inode 2987972 error -28
 btrfs: fail to dirty  inode 2988336 error -28
 btrfs: fail to dirty  inode 6631 error -28
 btrfs: fail to dirty  inode 803896 error -28
 btrfs: fail to dirty  inode 6632 error -28
 btrfs: fail to dirty  inode 6633 error -28
 btrfs: fail to dirty  inode 6634 error -28
 [...] (nothing new coming, only error -28)

 Apart from that, I had messages that there was no space left on /, but
 those were from userspace and not logged via netconsole.


It looks like the FS has run out of metadata space. btrfs in 2.6.35 reserves
more metadata space for system use than btrfs in 2.6.34. That's why
these error messages only appear on 2.6.35.

Yan, Zheng


Re: Still ENOSPC problems with 2.6.35-rc3

2010-06-16 Thread Yan, Zheng
On Thu, Jun 17, 2010 at 1:48 AM, Johannes Hirte
<johannes.hi...@fem.tu-ilmenau.de> wrote:
 With kernel-2.6.34 I run into the ENOSPC problems that were reported on this
 list recently. The filesystem was somewhat over 90% full and most operations on
 it caused an Oops. I was able to delete files by trial and error and freed up
 half of the filesystem space. Operation on the other files still caused an Oops.

 For 2.6.35 there went some patches in, that addressed this problem. Sadly they
 don't fix it but only avoid the Oops. A simple 'ls' on this filesystem results
 in

To avoid ENOSPC oopses, btrfs in 2.6.35 reserves more metadata space for
system use than older btrfs. If the FS has already run out of metadata space,
using btrfs in 2.6.35 doesn't help.

Yan, Zheng


 [ cut here ]
 WARNING: at fs/btrfs/extent-tree.c:3441 btrfs_block_rsv_check+0x10c/0x13e()
 Hardware name: To Be Filled By O.E.M.
 Modules linked in: snd_seq_midi snd_emu10k1_synth snd_emux_synth
 snd_seq_virmidi snd_seq_midi_emul snd_seq_oss snd_seq_midi_event snd_seq
 snd_pcm_oss snd_mixer_oss radeon ttm drm_kms_helper drm i2c_algo_bit
 snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_pcm snd_seq_device
 snd_timer snd_page_alloc snd_util_mem snd_hwdep snd amd64_edac_mod sata_sil sg
 sr_mod uhci_hcd ohci_hcd edac_core edac_mce_amd k8temp i2c_amd8111 i2c_amd756
 hwmon
 Pid: 26973, comm: ls Not tainted 2.6.35-rc3 #1
 Call Trace:
  [81031044] ? warn_slowpath_common+0x78/0x8c
  [81147fdf] ? btrfs_block_rsv_check+0x10c/0x13e
  [81155857] ? __btrfs_end_transaction+0x9f/0x1b1
  [8115aaa2] ? btrfs_dirty_inode+0x58/0xf9
  [810b07ba] ? __mark_inode_dirty+0x25/0x149
  [810a809a] ? touch_atime+0xfc/0x125
  [810a3a32] ? filldir+0x0/0xc3
  [810a3c1c] ? vfs_readdir+0x76/0x9c
  [810a3d7e] ? sys_getdents+0x7d/0xcd
  [81364f1f] ? page_fault+0x1f/0x30
  [81001e2b] ? system_call_fastpath+0x16/0x1b
 ---[ end trace 4aa882f64f792d16 ]---
 block_rsv size 654311424 reserved 67809280 freed 0 0
 [ cut here ]
 WARNING: at fs/btrfs/extent-tree.c:3441 btrfs_block_rsv_check+0x10c/0x13e()
 Hardware name: To Be Filled By O.E.M.
 Modules linked in: snd_seq_midi snd_emu10k1_synth snd_emux_synth
 snd_seq_virmidi snd_seq_midi_emul snd_seq_oss snd_seq_midi_event snd_seq
 snd_pcm_oss snd_mixer_oss radeon ttm drm_kms_helper drm i2c_algo_bit
 snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_pcm snd_seq_device
 snd_timer snd_page_alloc snd_util_mem snd_hwdep snd amd64_edac_mod sata_sil sg
 sr_mod uhci_hcd ohci_hcd edac_core edac_mce_amd k8temp i2c_amd8111 i2c_amd756
 hwmon
 Pid: 26970, comm: btrfs-transacti Tainted: G        W   2.6.35-rc3 #1
 Call Trace:
  [81031044] ? warn_slowpath_common+0x78/0x8c
  [81147fdf] ? btrfs_block_rsv_check+0x10c/0x13e
  [81155857] ? __btrfs_end_transaction+0x9f/0x1b1
  [81155a7a] ? btrfs_commit_transaction+0xf4/0x5fd
  [8102c39f] ? enqueue_task+0x39/0x47
  [81363dbb] ? mutex_lock+0xd/0x31
  [81043979] ? autoremove_wake_function+0x0/0x2a
  [81151b5b] ? transaction_kthread+0x16d/0x213
  [811519ee] ? transaction_kthread+0x0/0x213
  [810435ad] ? kthread+0x75/0x7d
  [81002b54] ? kernel_thread_helper+0x4/0x10
  [81043538] ? kthread+0x0/0x7d
  [81002b50] ? kernel_thread_helper+0x0/0x10
 ---[ end trace 4aa882f64f792d17 ]---
 block_rsv size 654311424 reserved 67809280 freed 0 0
 [ cut here ]
 WARNING: at fs/btrfs/extent-tree.c:3441 btrfs_block_rsv_check+0x10c/0x13e()
 Hardware name: To Be Filled By O.E.M.
 Modules linked in: snd_seq_midi snd_emu10k1_synth snd_emux_synth
 snd_seq_virmidi snd_seq_midi_emul snd_seq_oss snd_seq_midi_event snd_seq
 snd_pcm_oss snd_mixer_oss radeon ttm drm_kms_helper drm i2c_algo_bit
 snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_pcm snd_seq_device
 snd_timer snd_page_alloc snd_util_mem snd_hwdep snd amd64_edac_mod sata_sil sg
 sr_mod uhci_hcd ohci_hcd edac_core edac_mce_amd k8temp i2c_amd8111 i2c_amd756
 hwmon
 Pid: 26973, comm: ls Tainted: G        W   2.6.35-rc3 #1
 Call Trace:
  [81031044] ? warn_slowpath_common+0x78/0x8c
  [81147fdf] ? btrfs_block_rsv_check+0x10c/0x13e
  [81155857] ? __btrfs_end_transaction+0x9f/0x1b1
  [811562fb] ? start_transaction+0x15f/0x1c4
  [8115aaaf] ? btrfs_dirty_inode+0x65/0xf9
  [810b07ba] ? __mark_inode_dirty+0x25/0x149
  [810a809a] ? touch_atime+0xfc/0x125
  [810a3a32] ? filldir+0x0/0xc3
  [810a3c1c] ? vfs_readdir+0x76/0x9c
  [810a3d7e] ? sys_getdents+0x7d/0xcd
  [81364f1f] ? page_fault+0x1f/0x30
  [81001e2b] ? system_call_fastpath+0x16/0x1b
 ---[ end trace 4aa882f64f792d18 ]---
 block_rsv size 654311424 reserved 67809280 freed 0 0
 btrfs: fail to dirty  inode 256 error -28
 [ cut here ]
 WARNING: at fs/btrfs

Re: btrfsck segmentation fault + trace

2010-06-16 Thread Yan, Zheng
On Thu, Jun 17, 2010 at 7:26 AM, John Wyzer john.wy...@gmx.de wrote:
 On 17/06/10 00:45, Yan, Zheng wrote:
 On Thu, Jun 17, 2010 at 4:25 AM, John Wyzer john.wy...@gmx.de wrote:
 btrfs: fail to dirty  inode 6634 error -28
 [...] (nothing new coming, only error -28)

 Apart from that, I had messages that there was no space left on /, but
 those were from userspace and not logged via netconsole.


 Looks like the fs is running out of metadata space. btrfs in 2.6.35 reserves
 more metadata space for system use than btrfs in 2.6.34. That's why
 these error messages only appear on 2.6.35.

 Hmm. I've formatted with -m single because I wanted to avoid running out
 of space for metadata. So if there's more than 10% of space left, which
 is about 40GB, that seems like quite a bit of waste... ;-)

 I'll stay with 2.6.34 then for the time being.


Space for data and space for metadata are kept separate even if the FS was
formatted with -m single.
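
As a rough illustration of why error -28 (ENOSPC) can show up while the
filesystem as a whole still looks far from full, here is a minimal C sketch
with two independent pools. The names toy_fs, toy_reserve and toy_free_total
are invented for this example and are not btrfs functions; the real
accounting is far more involved.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model: data and metadata are accounted in separate pools. */
enum pool_type { POOL_DATA, POOL_METADATA, POOL_COUNT };

struct space_pool {
	uint64_t total;	/* bytes assigned to this pool */
	uint64_t used;	/* bytes already reserved from it */
};

struct toy_fs {
	struct space_pool pools[POOL_COUNT];
};

/* Reserve bytes from one pool only; free space in the other pool is never
 * borrowed. Returns 0 on success or -ENOSPC if this pool is exhausted. */
static int toy_reserve(struct toy_fs *fs, enum pool_type type, uint64_t bytes)
{
	struct space_pool *p = &fs->pools[type];

	assert(p->used <= p->total);
	if (bytes > p->total - p->used)
		return -ENOSPC;
	p->used += bytes;
	return 0;
}

/* Free space summed over all pools, similar to what a df-like tool shows. */
static uint64_t toy_free_total(const struct toy_fs *fs)
{
	uint64_t sum = 0;

	for (int i = 0; i < POOL_COUNT; i++)
		sum += fs->pools[i].total - fs->pools[i].used;
	return sum;
}
```

With a full metadata pool, toy_reserve() on POOL_METADATA fails with -ENOSPC
even though toy_free_total() still reports free bytes, all of which sit in
the data pool.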

Yan, Zheng
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Regression in 2.6.35 RC1 (Ubuntu 2.6.35-1-generic)

2010-06-11 Thread Yan, Zheng
/0x210 [btrfs]
 Jun 10 21:23:17 bradf-x301 kernel: [  209.858189]  [a01694e0]
 btrfs_commit_transaction+0x80/0x710 [btrfs]
 Jun 10 21:23:17 bradf-x301 kernel: [  209.858198]  [81582a9e] ?
 mutex_lock+0x1e/0x50
 Jun 10 21:23:17 bradf-x301 kernel: [  209.858227]  [a0169f8b] ?
 start_transaction+0x1ab/0x230 [btrfs]
 Jun 10 21:23:17 bradf-x301 kernel: [  209.858238]  [8107d610] ?
 autoremove_wake_function+0x0/0x40
 Jun 10 21:23:17 bradf-x301 kernel: [  209.858265]  [a0163d53]
 transaction_kthread+0x283/0x290 [btrfs]
 Jun 10 21:23:17 bradf-x301 kernel: [  209.858293]  [a0163ad0] ?
 transaction_kthread+0x0/0x290 [btrfs]
 Jun 10 21:23:17 bradf-x301 kernel: [  209.858302]  [8107d0b6]
 kthread+0x96/0xa0
 Jun 10 21:23:17 bradf-x301 kernel: [  209.858311]  [8100aee4]
 kernel_thread_helper+0x4/0x10
 Jun 10 21:23:17 bradf-x301 kernel: [  209.858320]  [8107d020] ?
 kthread+0x0/0xa0
 Jun 10 21:23:17 bradf-x301 kernel: [  209.858327]  [8100aee0] ?
 kernel_thread_helper+0x0/0x10
 Jun 10 21:23:17 bradf-x301 kernel: [  209.858441]  RSP 88012821b820
 Jun 10 21:23:17 bradf-x301 kernel: [  209.858448] ---[ end trace
 32d3e1002acaefc5 ]---


I have already sent a patch for this.
http://www.spinics.net/lists/linux-btrfs/msg05150.html

Yan, Zheng


[PATCH] btrfs: Fix null dereference in relocation.c

2010-05-31 Thread Yan, Zheng
Fix a potential null dereference in relocation.c

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 1/fs/btrfs/relocation.c 2/fs/btrfs/relocation.c
--- 1/fs/btrfs/relocation.c 2010-05-26 00:13:07.227605825 +0800
+++ 2/fs/btrfs/relocation.c 2010-05-31 16:35:23.489829633 +0800
@@ -784,16 +784,17 @@ again:
 			struct btrfs_extent_ref_v0 *ref0;
 			ref0 = btrfs_item_ptr(eb, path1->slots[0],
 					struct btrfs_extent_ref_v0);
-			root = find_tree_root(rc, eb, ref0);
-			if (!root->ref_cows)
-				cur->cowonly = 1;
 			if (key.objectid == key.offset) {
+				root = find_tree_root(rc, eb, ref0);
 				if (root && !should_ignore_root(root))
 					cur->root = root;
 				else
 					list_add(&cur->list, &useless);
 				break;
 			}
+			if (is_cowonly_root(btrfs_ref_root_v0(eb,
+							      ref0)))
+				cur->cowonly = 1;
 		}
 #else
 		BUG_ON(key.type == BTRFS_EXTENT_REF_V0_KEY);


[PATCH] btrfs: Fix BUG_ON for fs converted from extN

2010-05-31 Thread Yan, Zheng
Tree blocks can live in data block groups in FS converted from extN.
So it's easy to trigger the BUG_ON.

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 1/fs/btrfs/extent-tree.c 3/fs/btrfs/extent-tree.c
--- 1/fs/btrfs/extent-tree.c	2010-05-26 23:55:46.610378078 +0800
+++ 3/fs/btrfs/extent-tree.c	2010-05-31 16:36:51.907580723 +0800
@@ -4360,7 +4360,8 @@ void btrfs_free_tree_block(struct btrfs_
 
 	block_rsv = get_block_rsv(trans, root);
 	cache = btrfs_lookup_block_group(root->fs_info, buf->start);
-	BUG_ON(block_rsv->space_info != cache->space_info);
+	if (block_rsv->space_info != cache->space_info)
+		goto out;
 
 	if (btrfs_header_generation(buf) == trans->transid) {
 		if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {


Re: Disk space accounting and subvolume delete

2010-05-31 Thread Yan, Zheng
On Tue, Jun 1, 2010 at 3:01 AM, Bruce Guenter br...@untroubled.org wrote:
 On Wed, May 12, 2010 at 01:02:07PM +0800, Yan, Zheng  wrote:
 Dropping a tree can be lengthy. It's not good to let sync wait for hours.
 For most Linux filesystems, 'sync' just forces a transaction/journal commit.
 I don't think they wait for large operations that can span multiple
 transactions to complete.

 What happens to the consistency of the filesystem if a crash happens
 during this process?


This does not break the consistency of the filesystem. The next mount will
find the partially dropped tree and restart the dropping process.
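
To make the restart idea concrete, here is a small C sketch of a drop that
records its progress so it can be resumed after a crash. Everything in it
(drop_record, drop_step, resume_drop_on_mount, the block counters) is a
made-up model for illustration, not the btrfs implementation; it assumes the
progress record is only updated together with a committed transaction and
that the per-step budget is greater than zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical persistent record: it survives a crash because it is only
 * changed as part of a committed transaction. */
struct drop_record {
	bool   in_progress;	/* a tree drop was started and not finished */
	size_t next_block;	/* first block that has not been freed yet */
	size_t total_blocks;	/* size of the tree being dropped */
};

/* Free up to `budget` blocks, then "commit" the new progress.
 * Returns true once the whole tree has been dropped. */
static bool drop_step(struct drop_record *rec, size_t budget)
{
	assert(rec->in_progress);
	while (budget > 0 && rec->next_block < rec->total_blocks) {
		rec->next_block++;	/* stands in for freeing one block */
		budget--;
	}
	if (rec->next_block == rec->total_blocks)
		rec->in_progress = false;	/* nothing left to resume */
	return !rec->in_progress;
}

/* What a mount would do in this model: continue any interrupted drop from
 * the recorded position. Returns the number of steps it needed. */
static size_t resume_drop_on_mount(struct drop_record *rec, size_t budget)
{
	size_t steps = 0;

	while (rec->in_progress) {
		drop_step(rec, budget);
		steps++;
	}
	return steps;
}
```

A crash between two steps loses no committed progress: the record still
says where to continue, so the remaining blocks are freed after the next
mount while the already committed state stays consistent.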

Yan, Zheng


Re: [PATCH 2/4] btrfs-convert: Add extent iteration functions.

2010-05-18 Thread Yan, Zheng
On Sat, Mar 20, 2010 at 12:26 PM, Sean Bartell
wingedtachik...@gmail.com wrote:
 A filesystem can have disk extents in arbitrary places on the disk, as
 well as extents that must be read into memory because they have
 compression or encryption btrfs doesn't support. These extents can be
 passed to the new extent iteration functions, which will handle all the
 details of alignment, allocation, etc.
 ---
  convert.c |  604 
 -
  1 files changed, 401 insertions(+), 203 deletions(-)

 diff --git a/convert.c b/convert.c
 index c48f8ba..bd91990 100644
 --- a/convert.c
 +++ b/convert.c
 @@ -357,7 +357,7 @@ error:
  }

  static int read_disk_extent(struct btrfs_root *root, u64 bytenr,
 -                           u32 num_bytes, char *buffer)
 +                           u64 num_bytes, char *buffer)
  {
        int ret;
         struct btrfs_fs_devices *fs_devs = root->fs_info->fs_devices;
 @@ -371,6 +371,23 @@ fail:
                ret = -1;
        return ret;
  }
 +
 +static int write_disk_extent(struct btrfs_root *root, u64 bytenr,
 +                            u64 num_bytes, const char *buffer)
 +{
 +       int ret;
 +       struct btrfs_fs_devices *fs_devs = root->fs_info->fs_devices;
 +
 +       ret = pwrite(fs_devs->latest_bdev, buffer, num_bytes, bytenr);
 +       if (ret != num_bytes)
 +               goto fail;
 +       ret = 0;
 +fail:
 +       if (ret > 0)
 +               ret = -1;
 +       return ret;
 +}
 +
  /*
  * Record a file extent. Do all the required works, such as inserting
  * file extent item, inserting extent item and backref item into extent
 @@ -378,8 +395,7 @@ fail:
  */
  static int record_file_extent(struct btrfs_trans_handle *trans,
                              struct btrfs_root *root, u64 objectid,
 -                             struct btrfs_inode_item *inode,
 -                             u64 file_pos, u64 disk_bytenr,
 +                             u64 *inode_nbytes, u64 file_pos, u64 
 disk_bytenr,
                              u64 num_bytes, int checksum)
  {
        int ret;
 @@ -391,7 +407,6 @@ static int record_file_extent(struct btrfs_trans_handle 
 *trans,
        struct btrfs_path path;
        struct btrfs_extent_item *ei;
         u32 blocksize = root->sectorsize;
 -       u64 nbytes;

        if (disk_bytenr == 0) {
                ret = btrfs_insert_file_extent(trans, root, objectid,
 @@ -450,8 +465,7 @@ static int record_file_extent(struct btrfs_trans_handle 
 *trans,
        btrfs_set_file_extent_other_encoding(leaf, fi, 0);
        btrfs_mark_buffer_dirty(leaf);

 -       nbytes = btrfs_stack_inode_nbytes(inode) + num_bytes;
 -       btrfs_set_stack_inode_nbytes(inode, nbytes);
 +       *inode_nbytes += num_bytes;

         btrfs_release_path(root, &path);

 @@ -492,95 +506,355 @@ fail:
        return ret;
  }

 -static int record_file_blocks(struct btrfs_trans_handle *trans,
 -                             struct btrfs_root *root, u64 objectid,
 -                             struct btrfs_inode_item *inode,
 -                             u64 file_block, u64 disk_block,
 -                             u64 num_blocks, int checksum)
 -{
 -       u64 file_pos = file_block * root->sectorsize;
 -       u64 disk_bytenr = disk_block * root->sectorsize;
 -       u64 num_bytes = num_blocks * root->sectorsize;
 -       return record_file_extent(trans, root, objectid, inode, file_pos,
 -                                 disk_bytenr, num_bytes, checksum);
 -}
 -
 -struct blk_iterate_data {
 +struct extent_iterate_data {
        struct btrfs_trans_handle *trans;
        struct btrfs_root *root;
 -       struct btrfs_inode_item *inode;
 +       u64 *inode_nbytes;
        u64 objectid;
 -       u64 first_block;
 -       u64 disk_block;
 -       u64 num_blocks;
 -       u64 boundary;
 -       int checksum;
 -       int errcode;
 +       int checksum, packing;
 +       u64 last_file_off;
 +       u64 total_size;
 +       enum {EXTENT_ITERATE_TYPE_NONE, EXTENT_ITERATE_TYPE_MEM,
 +             EXTENT_ITERATE_TYPE_DISK} type;
 +       u64 size;
 +       u64 file_off; /* always aligned to sectorsize */
 +       char *data; /* for mem */
 +       u64 disk_off; /* for disk */
  };

 -static int block_iterate_proc(ext2_filsys ext2_fs,
 -                             u64 disk_block, u64 file_block,
 -                             struct blk_iterate_data *idata)
 +static u64 extent_boundary(struct btrfs_root *root, u64 extent_start)
  {
 -       int ret;
 -       int sb_region;
 -       int do_barrier;
 -       struct btrfs_root *root = idata->root;
 -       struct btrfs_trans_handle *trans = idata->trans;
 -       struct btrfs_block_group_cache *cache;
 -       u64 bytenr = disk_block * root->sectorsize;
 -
 -       sb_region = intersect_with_sb(bytenr, root->sectorsize);
 -       do_barrier = sb_region || disk_block >= idata->boundary;
 -       if ((idata->num_blocks > 0 && do_barrier) ||
 -           (file_block > idata->first_block + idata->num_blocks) ||
 

Re: [PATCH 1/4] btrfs-convert: make more use of cache_free_extents

2010-05-18 Thread Yan, Zheng
On Sat, Mar 20, 2010 at 12:24 PM, Sean Bartell
wingedtachik...@gmail.com wrote:
 An extent_io_tree is used for all free space information. This allows
 removal of ext2_alloc_block and ext2_free_block, and makes
 create_ext2_image less ext2-specific.
 ---
  convert.c |  154 
 +++--
  1 files changed, 99 insertions(+), 55 deletions(-)

 diff --git a/convert.c b/convert.c
 index d037c98..c48f8ba 100644
 --- a/convert.c
 +++ b/convert.c
 @@ -95,29 +95,10 @@ static int close_ext2fs(ext2_filsys fs)
        return 0;
  }

 -static int ext2_alloc_block(ext2_filsys fs, u64 goal, u64 *block_ret)
 +static int ext2_cache_free_extents(ext2_filsys ext2_fs,
 +                                  struct extent_io_tree *free_tree)
  {
 -       blk_t block;
 -
 -       if (!ext2fs_new_block(fs, goal, NULL, &block)) {
 -               ext2fs_fast_mark_block_bitmap(fs->block_map, block);
 -               *block_ret = block;
 -               return 0;
 -       }
 -       return -ENOSPC;
 -}
 -
 -static int ext2_free_block(ext2_filsys fs, u64 block)
 -{
 -       BUG_ON(block != (blk_t)block);
 -       ext2fs_fast_unmark_block_bitmap(fs->block_map, block);
 -       return 0;
 -}
 -
 -static int cache_free_extents(struct btrfs_root *root, ext2_filsys ext2_fs)
 -
 -{
 -       int i, ret = 0;
 +       int ret = 0;
        blk_t block;
        u64 bytenr;
         u64 blocksize = ext2_fs->blocksize;
 @@ -127,29 +108,68 @@ static int cache_free_extents(struct btrfs_root *root, 
 ext2_filsys ext2_fs)
                 if (ext2fs_fast_test_block_bitmap(ext2_fs->block_map, block))
                         continue;
                 bytenr = block * blocksize;
 -               ret = set_extent_dirty(&root->fs_info->free_space_cache,
 -                                      bytenr, bytenr + blocksize - 1, 0);
 +               ret = set_extent_dirty(free_tree, bytenr,
 +                                      bytenr + blocksize - 1, 0);
                BUG_ON(ret);
        }

 +       return 0;
 +}
 +
 +/* mark btrfs-reserved blocks as used */
 +static void adjust_free_extents(ext2_filsys ext2_fs,
 +                               struct extent_io_tree *free_tree)
 +{
 +       int i;
 +       u64 bytenr;
 +       u64 blocksize = ext2_fs->blocksize;
 +
 +       clear_extent_dirty(free_tree, 0, BTRFS_SUPER_INFO_OFFSET - 1, 0);
 +
         for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
                 bytenr = btrfs_sb_offset(i);
                 bytenr &= ~((u64)STRIPE_LEN - 1);
                 if (bytenr >= blocksize * ext2_fs->super->s_blocks_count)
                         break;
 -               clear_extent_dirty(&root->fs_info->free_space_cache, bytenr,
 -                                  bytenr + STRIPE_LEN - 1, 0);
 +               clear_extent_dirty(free_tree, bytenr, bytenr + STRIPE_LEN - 1,
 +                                  0);
        }
 +}

 -       clear_extent_dirty(&root->fs_info->free_space_cache,
 -                          0, BTRFS_SUPER_INFO_OFFSET - 1, 0);
 -
 +static int alloc_blocks(struct extent_io_tree *free_tree,
 +                       u64 *blocks, int num, u64 blocksize)
 +{
 +       u64 start;
 +       u64 end;
 +       u64 last = 0;
 +       u64 mask = blocksize - 1;
 +       int ret;
 +       while(num) {
 +               ret = find_first_extent_bit(free_tree, last, &start, &end,
 +                                           EXTENT_DIRTY);
 +               if (ret)
 +                       goto fail;
 +               last = end + 1;
 +               if (start & mask)
 +                       start = (start  mask) + blocksize;
 +               if (last - start < blocksize)
 +                       continue;
 +               *blocks++ = start;
 +               num--;
 +               last = start + blocksize;
 +               clear_extent_dirty(free_tree, start, last - 1, 0);
 +       }
        return 0;
 +fail:
 +       fprintf(stderr, "not enough free space\n");
 +       return -ENOSPC;
  }

  static int custom_alloc_extent(struct btrfs_root *root, u64 num_bytes,
                               u64 hint_byte, struct btrfs_key *ins)
  {
 +       u64 blocksize = root->sectorsize;
 +       u64 mask = blocksize - 1;
        u64 start;
        u64 end;
        u64 last = hint_byte;
 @@ -171,6 +191,8 @@ static int custom_alloc_extent(struct btrfs_root *root, 
 u64 num_bytes,

                start = max(last, start);
                last = end + 1;
 +               if (start & mask)
 +                       start = (start  mask) + blocksize;
                 if (last - start < num_bytes)
                        continue;

 @@ -1186,9 +1208,9 @@ static int create_image_file_range(struct 
 btrfs_trans_handle *trans,
                                   struct btrfs_root *root, u64 objectid,
                                   struct btrfs_inode_item *inode,
                                   u64 start_byte, u64 end_byte,
 -                                  ext2_filsys ext2_fs)
 +                               

Re: Bug when resizing FS

2010-05-14 Thread Yan, Zheng
On Sat, May 15, 2010 at 12:23 AM, Martin Bueger
mbuer...@edu.uni-klu.ac.at wrote:
 Hello,

 when I try to resize the FS with btrfsctl -r it works using + or -, hence,
 extending or shrinking the FS but when I want to set it to a certain size I
 always hit the following bug:

 [ cut here ]

 invalid opcode:  [#1] SMP
 last sysfs file:
 /sys/devices/pci:00/:00:10.0/host2/target2:0:1/2:0:1:0/block/sdb/size
 Modules linked in: sr_mod i2c_piix4 cdrom processor container thermal ac
 button i2c_core

 Pid: 4044, comm: btrfs-delalloc- Not tainted 2.6.33-zen2 #1 440BX Desktop
 Reference Platform/VMware Virtual Platform
 EIP: 0060:[c12545a8] EFLAGS: 00010286 CPU: 0
 EIP is at cow_file_range+0x638/0x650
 EAX: ffe4 EBX: 1b5cb000 ECX: 3d3d EDX: 0001
 ESI:  EDI:  EBP: cd22a034 ESP: caa4de40
  DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068
 Process btrfs-delalloc- (pid: 4044, ti=caa4c000 task=de0d0b10
 task.ti=caa4c000)
 Stack:
  1000  1000    1b5cb000 
 0   caa4debf 0001  c1a4d1a0 cd22a12c 1000
 0 cd22a038 de032800 cb25a180 1000  caa4dea8  1000
 Call Trace:
  [c12bbe1a] ? __prop_inc_single+0x3a/0x50
  [c1255580] ? submit_compressed_extents+0x260/0x4f0
  [c127b10e] ? run_ordered_completions+0x5e/0xb0
  [c127b85b] ? worker_loop+0x12b/0x410
  [c127b730] ? worker_loop+0x0/0x410
  [c103e994] ? kthread+0x74/0x80
  [c103e920] ? kthread+0x0/0x80
  [c10030b6] ? kernel_thread_helper+0x6/0x10
 Code: 8b 94 24 b8 00 00 00 83 d6 00 0f ac f3 0c 01 1a 8b 84 24 b4 00 00 00 c7
 00 01 00 00 00 e9 23 fe ff ff 0f 0b eb fe 90 8d 74 26 00 0f 0b eb fe 8d 74 
 26
 00 31 db 31 f6 e9 a1 fb ff ff 0f 0b eb fe
 EIP: [c12545a8] cow_file_range+0x638/0x650 SS:ESP 0068:caa4de40
 ---[ end trace 31b4672bb84c5cec ]---

 The command I ran: btrfsctl -r 1g /mnt/point



Looks like an ENOSPC Oops, this will be improved in 2.6.35


[PATCH 1/5] btrfs: pass buffer extent to btrfs_free_tree_block

2010-05-11 Thread Yan, Zheng
prepare for the log code

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 1/fs/btrfs/ctree.c 2/fs/btrfs/ctree.c
--- 1/fs/btrfs/ctree.c  2010-04-14 14:49:56.342950744 +0800
+++ 2/fs/btrfs/ctree.c  2010-05-11 14:00:04.122357838 +0800
@@ -279,7 +279,8 @@ int btrfs_block_can_be_shared(struct btr
 static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
   struct btrfs_root *root,
   struct extent_buffer *buf,
-  struct extent_buffer *cow)
+  struct extent_buffer *cow,
+  int *last_ref)
 {
u64 refs;
u64 owner;
@@ -365,6 +366,7 @@ static noinline int update_ref_for_cow(s
BUG_ON(ret);
}
clean_tree_block(trans, root, buf);
+   *last_ref = 1;
}
return 0;
 }
@@ -392,6 +394,7 @@ static noinline int __btrfs_cow_block(st
struct extent_buffer *cow;
int level;
int unlock_orig = 0;
+   int last_ref = 0;
u64 parent_start;
 
if (*cow_ret == buf)
@@ -441,7 +444,7 @@ static noinline int __btrfs_cow_block(st
(unsigned long)btrfs_header_fsid(cow),
BTRFS_FSID_SIZE);
 
-   update_ref_for_cow(trans, root, buf, cow);
+   update_ref_for_cow(trans, root, buf, cow, &last_ref);
 
if (buf == root->node) {
WARN_ON(parent && parent != buf);
@@ -456,8 +459,8 @@ static noinline int __btrfs_cow_block(st
extent_buffer_get(cow);
spin_unlock(&root->node_lock);
 
-   btrfs_free_tree_block(trans, root, buf->start, buf->len,
-   parent_start, root->root_key.objectid, level);
+   btrfs_free_tree_block(trans, root, buf, parent_start,
+ last_ref);
free_extent_buffer(buf);
add_root_to_dirty_list(root);
} else {
@@ -472,8 +475,8 @@ static noinline int __btrfs_cow_block(st
btrfs_set_node_ptr_generation(parent, parent_slot,
  trans->transid);
btrfs_mark_buffer_dirty(parent);
-   btrfs_free_tree_block(trans, root, buf->start, buf->len,
-   parent_start, root->root_key.objectid, level);
+   btrfs_free_tree_block(trans, root, buf, parent_start,
+ last_ref);
}
if (unlock_orig)
btrfs_tree_unlock(buf);
@@ -948,6 +951,22 @@ int btrfs_bin_search(struct extent_buffe
return bin_search(eb, key, level, slot);
 }
 
+static void root_add_used(struct btrfs_root *root, u32 size)
+{
+   spin_lock(&root->node_lock);
+   btrfs_set_root_used(&root->root_item,
+   btrfs_root_used(&root->root_item) + size);
+   spin_unlock(&root->node_lock);
+}
+
+static void root_sub_used(struct btrfs_root *root, u32 size)
+{
+   spin_lock(&root->node_lock);
+   btrfs_set_root_used(&root->root_item,
+   btrfs_root_used(&root->root_item) - size);
+   spin_unlock(&root->node_lock);
+}
+
 /* given a node and slot number, this reads the blocks it points to.  The
  * extent buffer is returned with a reference taken (but unlocked).
  * NULL is returned on error.
@@ -1018,7 +1037,11 @@ static noinline int balance_level(struct
btrfs_tree_lock(child);
btrfs_set_lock_blocking(child);
ret = btrfs_cow_block(trans, root, child, mid, 0, &child);
-   BUG_ON(ret);
+   if (ret) {
+   btrfs_tree_unlock(child);
+   free_extent_buffer(child);
+   goto enospc;
+   }
 
spin_lock(&root->node_lock);
root->node = child;
@@ -1033,11 +1056,12 @@ static noinline int balance_level(struct
btrfs_tree_unlock(mid);
/* once for the path */
free_extent_buffer(mid);
-   ret = btrfs_free_tree_block(trans, root, mid->start, mid->len,
-   0, root->root_key.objectid, level);
+
+   root_sub_used(root, mid->len);
+   btrfs_free_tree_block(trans, root, mid, 0, 1);
/* once for the root ptr */
free_extent_buffer(mid);
-   return ret;
+   return 0;
}
if (btrfs_header_nritems(mid) >
BTRFS_NODEPTRS_PER_BLOCK(root) / 4)
@@ -1087,23 +,16 @@ static noinline int balance_level(struct
if (wret < 0 && wret != -ENOSPC)
ret = wret;
if (btrfs_header_nritems(right) == 0) {
-   u64 bytenr = right->start;
-   u32 blocksize = right->len

[PATCH 3/5] btrfs: split btrfs_alloc_free_block()

2010-05-11 Thread Yan, Zheng
Split btrfs_alloc_free_block() into btrfs_reserve_tree_block()
and btrfs_alloc_reserved_tree_block().

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 3/fs/btrfs/ctree.h 4/fs/btrfs/ctree.h
--- 3/fs/btrfs/ctree.h  2010-05-11 14:09:45.052107958 +0800
+++ 4/fs/btrfs/ctree.h  2010-05-11 13:15:47.060357000 +0800
@@ -1978,6 +1978,15 @@ struct btrfs_block_group_cache *btrfs_lo
 void btrfs_put_block_group(struct btrfs_block_group_cache *cache);
 u64 btrfs_find_block_group(struct btrfs_root *root,
   u64 search_start, u64 search_hint, int owner);
+struct extent_buffer *btrfs_reserve_tree_block(struct btrfs_trans_handle 
*trans,
+  struct btrfs_root *root,
+  u32 blocksize, int level,
+  u64 hint, u64 empty_size);
+int btrfs_alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
+   struct btrfs_root *root,
+   struct extent_buffer *buf,
+   u64 parent, u64 root_objectid,
+   struct btrfs_disk_key *key, int level);
 struct extent_buffer *btrfs_alloc_free_block(struct btrfs_trans_handle *trans,
struct btrfs_root *root, u32 blocksize,
u64 parent, u64 root_objectid,
diff -urp 3/fs/btrfs/extent-tree.c 4/fs/btrfs/extent-tree.c
--- 3/fs/btrfs/extent-tree.c  2010-05-11 14:12:00.044357180 +0800
+++ 4/fs/btrfs/extent-tree.c  2010-05-11 13:26:38.036107000 +0800
@@ -4956,64 +4998,6 @@ int btrfs_alloc_logged_file_extent(struc
return ret;
 }
 
-/*
- * finds a free extent and does all the dirty work required for allocation
- * returns the key for the extent through ins, and a tree buffer for
- * the first block of the extent through buf.
- *
- * returns 0 if everything worked, non-zero otherwise.
- */
-static int alloc_tree_block(struct btrfs_trans_handle *trans,
-   struct btrfs_root *root,
-   u64 num_bytes, u64 parent, u64 root_objectid,
-   struct btrfs_disk_key *key, int level,
-   u64 empty_size, u64 hint_byte, u64 search_end,
-   struct btrfs_key *ins)
-{
-   int ret;
-   u64 flags = 0;
-
-   ret = btrfs_reserve_extent(trans, root, num_bytes, num_bytes,
-  empty_size, hint_byte, search_end,
-  ins, 0);
-   if (ret)
-   return ret;
-
-   if (root_objectid == BTRFS_TREE_RELOC_OBJECTID) {
-   if (parent == 0)
-   parent = ins->objectid;
-   flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF;
-   } else
-   BUG_ON(parent > 0);
-
-   if (root_objectid != BTRFS_TREE_LOG_OBJECTID) {
-   struct btrfs_delayed_extent_op *extent_op;
-   extent_op = kzalloc(sizeof(*extent_op), GFP_NOFS);
-   BUG_ON(!extent_op);
-   if (key)
-   memcpy(&extent_op->key, key, sizeof(extent_op->key));
-   extent_op->flags_to_set = flags;
-   extent_op->update_key = 1;
-   extent_op->update_gen = 1;
-   extent_op->update_flags = 1;
-
-   ret = btrfs_add_delayed_tree_ref(trans, ins->objectid,
-   ins->offset, parent, root_objectid,
-   level, BTRFS_ADD_DELAYED_EXTENT,
-   extent_op);
-   BUG_ON(ret);
-   }
-
-   if (root_objectid == root->root_key.objectid) {
-   u64 used;
-   spin_lock(&root->node_lock);
-   used = btrfs_root_used(&root->root_item) + num_bytes;
-   btrfs_set_root_used(&root->root_item, used);
-   spin_unlock(&root->node_lock);
-   }
-   return ret;
-}
-
 struct extent_buffer *btrfs_init_new_buffer(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
u64 bytenr, u32 blocksize,
@@ -5052,8 +5036,68 @@ struct extent_buffer *btrfs_init_new_buf
return buf;
 }
 
+struct extent_buffer *btrfs_reserve_tree_block(struct btrfs_trans_handle 
*trans,
+  struct btrfs_root *root,
+  u32 blocksize, int level,
+  u64 hint, u64 empty_size)
+{
+
+   struct btrfs_key ins;
+   struct extent_buffer *buf;
+   int ret;
+
+   ret = btrfs_reserve_extent(trans, root, blocksize, blocksize,
+  empty_size, hint, (u64)-1, &ins, 0);
+   if (ret)
+   return ERR_PTR(ret);
+
+   buf = btrfs_init_new_buffer(trans, root

[PATCH 4/5] btrfs: don't cache empty block groups during mount

2010-05-11 Thread Yan, Zheng
The tree log recovery code expects that no free space is cached before it
executes.

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 4/fs/btrfs/extent-tree.c 8/fs/btrfs/extent-tree.c
--- 4/fs/btrfs/extent-tree.c	2010-05-11 14:15:29.174108554 +0800
+++ 8/fs/btrfs/extent-tree.c	2010-05-11 13:26:38.036107000 +0800
@@ -316,11 +329,6 @@ static int caching_kthread(void *data)
 	if (!path)
 		return -ENOMEM;
 
-	exclude_super_stripes(extent_root, block_group);
-	spin_lock(&block_group->space_info->lock);
-	block_group->space_info->bytes_super += block_group->bytes_super;
-	spin_unlock(&block_group->space_info->lock);
-
 	last = max_t(u64, block_group->key.objectid, BTRFS_SUPER_INFO_OFFSET);
 
 	/*
@@ -7499,6 +7541,7 @@ int btrfs_free_block_groups(struct btrfs
 		if (block_group->cached == BTRFS_CACHE_STARTED)
 			wait_block_group_cache_done(block_group);
 
+		free_excluded_extents(info->extent_root, block_group);
 		btrfs_remove_free_space_cache(block_group);
 		btrfs_put_block_group(block_group);
 
@@ -7586,26 +7629,12 @@ int btrfs_read_block_groups(struct btrfs
 		cache->flags = btrfs_block_group_flags(&cache->item);
 		cache->sectorsize = root->sectorsize;
 
-		/*
-		 * check for two cases, either we are full, and therefore
-		 * don't need to bother with the caching work since we won't
-		 * find any space, or we are empty, and we can just add all
-		 * the space in and be done with it.  This saves us _alot_ of
-		 * time, particularly in the full case.
-		 */
-		if (found_key.offset == btrfs_block_group_used(&cache->item)) {
-			exclude_super_stripes(root, cache);
-			cache->last_byte_to_unpin = (u64)-1;
-			cache->cached = BTRFS_CACHE_FINISHED;
-			free_excluded_extents(root, cache);
-		} else if (btrfs_block_group_used(&cache->item) == 0) {
-			exclude_super_stripes(root, cache);
+		exclude_super_stripes(root, cache);
+		/* check for the case that block group is full */
+		if (found_key.offset == cache->bytes_super +
+		    btrfs_block_group_used(&cache->item)) {
 			cache->last_byte_to_unpin = (u64)-1;
 			cache->cached = BTRFS_CACHE_FINISHED;
-			add_new_free_space(cache, root->fs_info,
-					   found_key.objectid,
-					   found_key.objectid +
-					   found_key.offset);
 			free_excluded_extents(root, cache);
 		}
 
 


Re: Disk space accounting and subvolume delete

2010-05-11 Thread Yan, Zheng
On Tue, May 11, 2010 at 11:45 PM, Bruce Guenter br...@untroubled.org wrote:
 On Tue, May 11, 2010 at 08:10:38AM +0800, Yan, Zheng  wrote:
 This is because the snapshot-deleting ioctl only removes a link.

 Right, I understand that.  That part is not unexpected, as it works just
 like unlink would.  However...

 The corresponding tree is dropped in the background by a kernel thread.

 The surprise is that 'sync', in any form I was able to try, does not
 wait until all or even most of the I/O is completed.  Apparently the
 standards spec for sync(2) says it is not required to wait for I/O to
 complete, but AFAIK all other Linux FS do wait (the man page for sync(2)
 implies as much, as does the info page for sync in glibc).

 The only way I've found so far to force this behavior is to unmount, and
 that's rather intrusive to other users of the FS.

 We could probably add another ioctl that waits until the tree has been
 completely dropped.

 Since the expected behavior for sync is to wait until all pending I/O
 has been completed, I would argue this should be the default action for
 sync.  Am I misunderstanding something?


Dropping a tree can be lengthy. It's not good to let sync wait for hours.
For most Linux filesystems, 'sync' just forces a transaction/journal commit.
I don't think they wait for large operations that can span multiple
transactions to complete.

Yan, Zheng


[PATCH] btrfs: Fix block generation verification race

2010-05-02 Thread Yan, Zheng
After the path is released, the generation number obtained from the block
pointer is no longer valid. The race may cause disk corruption, because
verify_parent_transid() calls clear_extent_buffer_uptodate() when the
generation numbers do not match.
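
A minimal C sketch of the race, under simplifying assumptions: toy_block and
toy_verify are invented names that only model the behaviour described
above, where an expected generation of 0 means "do not verify" and a
mismatch clears the up-to-date flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a cached tree block. */
struct toy_block {
	uint64_t generation;	/* generation the block was last written in */
	bool     uptodate;	/* cached contents are considered valid */
};

/* Model of the check: an expected generation of 0 skips verification;
 * otherwise a mismatch marks the cached block as not up to date. */
static bool toy_verify(struct toy_block *b, uint64_t expected)
{
	if (expected == 0 || b->generation == expected)
		return true;
	b->uptodate = false;	/* a valid, newer block gets invalidated */
	return false;
}

/* Stand-in for another task rewriting the block after the path (and with
 * it the locks) has been released. */
static void toy_rewrite(struct toy_block *b, uint64_t new_generation)
{
	assert(new_generation > b->generation);
	b->generation = new_generation;
	b->uptodate = true;
}
```

If the expected generation is sampled before the path is released and the
block is rewritten afterwards, toy_verify() with the stale value wrongly
invalidates a good block. Passing 0, as the patch does for this call site,
skips the stale comparison.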

Signed-off-by: Yan Zheng zheng@oracle.com
---
diff -urp 1/fs/btrfs/ctree.c 2/fs/btrfs/ctree.c
--- 1/fs/btrfs/ctree.c  2010-04-14 14:49:56.342950744 +0800
+++ 2/fs/btrfs/ctree.c  2010-05-03 09:44:24.426642447 +0800
@@ -1589,7 +1589,7 @@ read_block_for_search(struct btrfs_trans
btrfs_release_path(NULL, p);
 
ret = -EAGAIN;
-   tmp = read_tree_block(root, blocknr, blocksize, gen);
+   tmp = read_tree_block(root, blocknr, blocksize, 0);
if (tmp) {
/*
 * If the read above didn't mark this buffer up to date,


Re: bug when removing device

2010-04-30 Thread Yan, Zheng
 [  298.659939]  [<c02024c3>] ? __mem_cgroup_try_charge+0x53/0x330
 [  298.659956]  [<f85e1b9d>] ? btrfs_ioctl+0x79d/0x9c0 [btrfs]
 [  298.659964]  [<c01dd0d8>] ? memdup_user+0x38/0x70
 [  298.659980]  [<f85e1bbc>] ? btrfs_ioctl+0x7bc/0x9c0 [btrfs]
 [  298.659986]  [<c01e28b8>] ? __do_fault+0x3e8/0x560
 [  298.659993]  [<c01e47e5>] ? handle_mm_fault+0x145/0xaa0
 [  298.66]  [<c0215532>] ? vfs_ioctl+0x32/0xb0
 [  298.660016]  [<f85e1400>] ? btrfs_ioctl+0x0/0x9c0 [btrfs]
 [  298.660022]  [<c0215c92>] ? do_vfs_ioctl+0x72/0x5c0
 [  298.660029]  [<c05a1abd>] ? do_page_fault+0x1cd/0x440
 [  298.660035]  [<c0210e3b>] ? putname+0x2b/0x40
 [  298.660041]  [<c0205e9a>] ? do_sys_open+0xfa/0x120
 [  298.660047]  [<c0216247>] ? sys_ioctl+0x67/0x80
 [  298.660053]  [<c0102fe3>] ? sysenter_do_call+0x12/0x28
 [  298.660057] Code: 85 ff 75 db e9 49 fb ff ff 0f 0b 66 90 eb fc ba
 29 09 00 00 b8 c8 60 5f f8 e8 af 5c b5 c7 0f b6 7e 25 e9 c8 fc ff ff
 0f 0b eb fe 0f 0b eb fe 8b 45 d4 e8 d6 1b fa ff c7 45 c4 f4 ff ff ff
 e9 02
 [  298.660124] EIP: [<f85f3b7e>] relocate_tree_blocks+0x52e/0x590
 [btrfs] SS:ESP 0068:cd333c14
 [  298.660150] ---[ end trace fb3e62da0e52a0bd ]---


I have sent a set of patches that address bugs like this.

Yan, Zheng


[PATCH V2 01/12] Btrfs: Link block groups of different raid types in the same space_info

2010-04-26 Thread Yan, Zheng
The size of the reserved space is stored in space_info. If block groups
of different raid types are linked to separate space_info structures,
changing the allocation profile will corrupt the reserved space accounting.
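
The accounting this patch sets up can be sketched in user space. This is a
hedged model, not the kernel code: the model_* names are hypothetical, and
the values of the DATA, SYSTEM and METADATA bits are assumptions (only the
RAID1, DUP and RAID10 bits appear in the diff below). It shows the two
ideas of the patch: one space_info per type regardless of raid profile,
and disk_used counting mirrored bytes twice.

```c
#include <stdint.h>

/* Type bits (values assumed) and profile bits (as in the ctree.h hunk). */
#define BG_DATA      (1ULL << 0)
#define BG_SYSTEM    (1ULL << 1)
#define BG_METADATA  (1ULL << 2)
#define BG_RAID1     (1ULL << 4)
#define BG_DUP       (1ULL << 5)
#define BG_RAID10    (1ULL << 6)

#define BG_TYPE_MASK (BG_DATA | BG_SYSTEM | BG_METADATA)

struct model_space_info {
    uint64_t flags;       /* type bits only, no raid profile bits */
    uint64_t total_bytes;
    uint64_t bytes_used;  /* logical bytes, mirrors not counted */
    uint64_t disk_used;   /* raw bytes consumed on disk */
};

/* Profiles that keep two copies of every byte cost twice the disk space. */
static uint64_t model_raid_factor(uint64_t flags)
{
    return (flags & (BG_DUP | BG_RAID1 | BG_RAID10)) ? 2 : 1;
}

/* Fold one block group into the space_info shared by its type. */
static void model_update_space_info(struct model_space_info *si,
                                    uint64_t flags,
                                    uint64_t total_bytes,
                                    uint64_t bytes_used)
{
    si->flags = flags & BG_TYPE_MASK;
    si->total_bytes += total_bytes;
    si->bytes_used += bytes_used;
    si->disk_used += bytes_used * model_raid_factor(flags);
}
```

Adding a RAID1 data block group with 100 bytes used and a single-profile
data block group with 50 bytes used to the same space_info gives
bytes_used of 150 and disk_used of 250.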

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h
--- 2/fs/btrfs/ctree.h  2010-04-26 17:23:52.921839641 +0800
+++ 3/fs/btrfs/ctree.h  2010-04-26 17:23:52.926830638 +0800
@@ -662,6 +662,7 @@ struct btrfs_csum_item {
 #define BTRFS_BLOCK_GROUP_RAID1    (1 << 4)
 #define BTRFS_BLOCK_GROUP_DUP     (1 << 5)
 #define BTRFS_BLOCK_GROUP_RAID10   (1 << 6)
+#define BTRFS_NR_RAID_TYPES   5
 
 struct btrfs_block_group_item {
__le64 used;
@@ -673,7 +674,8 @@ struct btrfs_space_info {
u64 flags;
 
u64 total_bytes;/* total bytes in the space */
-   u64 bytes_used; /* total bytes used on disk */
+   u64 bytes_used; /* total bytes used,
+  this doesn't take mirrors into account */
u64 bytes_pinned;   /* total bytes pinned, will be freed when the
   transaction finishes */
u64 bytes_reserved; /* total bytes the allocator has reserved for
@@ -686,6 +688,7 @@ struct btrfs_space_info {
   delalloc/allocations */
u64 bytes_delalloc; /* number of bytes currently reserved for
   delayed allocation */
+   u64 disk_used;  /* total bytes used on disk */
 
int full;   /* indicates that we cannot allocate any more
   chunks for this space */
@@ -703,7 +706,7 @@ struct btrfs_space_info {
int flushing;
 
/* for block groups in our same type */
-   struct list_head block_groups;
+   struct list_head block_groups[BTRFS_NR_RAID_TYPES];
spinlock_t lock;
struct rw_semaphore groups_sem;
atomic_t caching_threads;
diff -urp 2/fs/btrfs/extent-tree.c 3/fs/btrfs/extent-tree.c
--- 2/fs/btrfs/extent-tree.c2010-04-26 17:23:52.922840061 +0800
+++ 3/fs/btrfs/extent-tree.c2010-04-26 17:23:52.929829246 +0800
@@ -506,6 +506,9 @@ static struct btrfs_space_info *__find_s
struct list_head *head = &info->space_info;
struct btrfs_space_info *found;
 
+   flags &= BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_SYSTEM |
+BTRFS_BLOCK_GROUP_METADATA;
+
rcu_read_lock();
list_for_each_entry_rcu(found, head, list) {
if (found->flags == flags) {
@@ -2659,12 +2662,21 @@ static int update_space_info(struct btrf
 struct btrfs_space_info **space_info)
 {
struct btrfs_space_info *found;
+   int i;
+   int factor;
+
+   if (flags & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
+BTRFS_BLOCK_GROUP_RAID10))
+   factor = 2;
+   else
+   factor = 1;
 
found = __find_space_info(info, flags);
if (found) {
spin_lock(&found->lock);
found->total_bytes += total_bytes;
found->bytes_used += bytes_used;
+   found->disk_used += bytes_used * factor;
found->full = 0;
spin_unlock(&found->lock);
*space_info = found;
@@ -2674,14 +2686,18 @@ static int update_space_info(struct btrf
if (!found)
return -ENOMEM;
 
-   INIT_LIST_HEAD(&found->block_groups);
+   for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
+   INIT_LIST_HEAD(&found->block_groups[i]);
init_rwsem(&found->groups_sem);
init_waitqueue_head(&found->flush_wait);
init_waitqueue_head(&found->allocate_wait);
spin_lock_init(&found->lock);
-   found->flags = flags;
+   found->flags = flags & (BTRFS_BLOCK_GROUP_DATA |
+   BTRFS_BLOCK_GROUP_SYSTEM |
+   BTRFS_BLOCK_GROUP_METADATA);
found->total_bytes = total_bytes;
found->bytes_used = bytes_used;
+   found->disk_used = bytes_used * factor;
found->bytes_pinned = 0;
found->bytes_reserved = 0;
found->bytes_readonly = 0;
@@ -2751,26 +2767,32 @@ u64 btrfs_reduce_alloc_profile(struct bt
return flags;
 }
 
-static u64 btrfs_get_alloc_profile(struct btrfs_root *root, u64 data)
+static u64 get_alloc_profile(struct btrfs_root *root, u64 flags)
 {
-   struct btrfs_fs_info *info = root->fs_info;
-   u64 alloc_profile;
+   if (flags & BTRFS_BLOCK_GROUP_DATA)
+   flags |= root->fs_info->avail_data_alloc_bits &
+root->fs_info->data_alloc_profile;
+   else if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
+   flags |= root->fs_info->avail_system_alloc_bits &
+root->fs_info->system_alloc_profile;
+   else if (flags & BTRFS_BLOCK_GROUP_METADATA)
+   flags |= root->fs_info->avail_metadata_alloc_bits &
+root

[PATCH V2 02/12] Btrfs: Kill allocate_wait in space_info

2010-04-26 Thread Yan, Zheng
We already have fs_info->chunk_mutex to avoid concurrent
chunk creation.

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h
--- 2/fs/btrfs/ctree.h  2010-04-26 17:24:10.436081649 +0800
+++ 3/fs/btrfs/ctree.h  2010-04-26 17:24:10.441079491 +0800
@@ -700,9 +700,7 @@ struct btrfs_space_info {
struct list_head list;
 
/* for controlling how we free up space for allocations */
-   wait_queue_head_t allocate_wait;
wait_queue_head_t flush_wait;
-   int allocating_chunk;
int flushing;
 
/* for block groups in our same type */
diff -urp 2/fs/btrfs/extent-tree.c 3/fs/btrfs/extent-tree.c
--- 2/fs/btrfs/extent-tree.c2010-04-26 17:24:10.437084933 +0800
+++ 3/fs/btrfs/extent-tree.c2010-04-26 17:24:10.444079704 +0800
@@ -70,6 +70,9 @@ static int find_next_key(struct btrfs_pa
 struct btrfs_key *key);
 static void dump_space_info(struct btrfs_space_info *info, u64 bytes,
int dump_block_groups);
+static int maybe_allocate_chunk(struct btrfs_trans_handle *trans,
+   struct btrfs_root *root,
+   struct btrfs_space_info *sinfo, u64 num_bytes);
 
 static noinline int
 block_group_cache_done(struct btrfs_block_group_cache *cache)
@@ -2690,7 +2693,6 @@ static int update_space_info(struct btrf
INIT_LIST_HEAD(&found->block_groups[i]);
init_rwsem(&found->groups_sem);
init_waitqueue_head(&found->flush_wait);
-   init_waitqueue_head(&found->allocate_wait);
spin_lock_init(&found->lock);
found->flags = flags & (BTRFS_BLOCK_GROUP_DATA |
BTRFS_BLOCK_GROUP_SYSTEM |
@@ -3003,71 +3005,6 @@ flush:
wake_up(&info->flush_wait);
 }
 
-static int maybe_allocate_chunk(struct btrfs_root *root,
-struct btrfs_space_info *info)
-{
-   struct btrfs_super_block *disk_super = &root->fs_info->super_copy;
-   struct btrfs_trans_handle *trans;
-   bool wait = false;
-   int ret = 0;
-   u64 min_metadata;
-   u64 free_space;
-
-   free_space = btrfs_super_total_bytes(disk_super);
-   /*
-* we allow the metadata to grow to a max of either 10gb or 5% of the
-* space in the volume.
-*/
-   min_metadata = min((u64)10 * 1024 * 1024 * 1024,
-div64_u64(free_space * 5, 100));
-   if (info->total_bytes >= min_metadata) {
-   spin_unlock(&info->lock);
-   return 0;
-   }
-
-   if (info->full) {
-   spin_unlock(&info->lock);
-   return 0;
-   }
-
-   if (!info->allocating_chunk) {
-   info->force_alloc = 1;
-   info->allocating_chunk = 1;
-   } else {
-   wait = true;
-   }
-
-   spin_unlock(&info->lock);
-
-   if (wait) {
-   wait_event(info->allocate_wait,
-  !info->allocating_chunk);
-   return 1;
-   }
-
-   trans = btrfs_start_transaction(root, 1);
-   if (!trans) {
-   ret = -ENOMEM;
-   goto out;
-   }
-
-   ret = do_chunk_alloc(trans, root->fs_info->extent_root,
-4096 + 2 * 1024 * 1024,
-info->flags, 0);
-   btrfs_end_transaction(trans, root);
-   if (ret)
-   goto out;
-out:
-   spin_lock(&info->lock);
-   info->allocating_chunk = 0;
-   spin_unlock(&info->lock);
-   wake_up(&info->allocate_wait);
-
-   if (ret)
-   return 0;
-   return 1;
-}
-
 /*
  * Reserve metadata space for delalloc.
  */
@@ -3108,7 +3045,8 @@ again:
flushed++;
 
if (flushed == 1) {
-   if (maybe_allocate_chunk(root, meta_sinfo))
+   if (maybe_allocate_chunk(NULL, root, meta_sinfo,
+num_bytes))
goto again;
flushed++;
} else {
@@ -3223,7 +3161,8 @@ again:
if (used > meta_sinfo->total_bytes) {
retries++;
if (retries == 1) {
-   if (maybe_allocate_chunk(root, meta_sinfo))
+   if (maybe_allocate_chunk(NULL, root, meta_sinfo,
+num_bytes))
goto again;
retries++;
} else {
@@ -3420,13 +3359,28 @@ static void force_metadata_allocation(st
rcu_read_unlock();
 }
 
+static int should_alloc_chunk(struct btrfs_space_info *sinfo,
+ u64 alloc_bytes)
+{
+   u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
+
+   if (sinfo->bytes_used + sinfo->bytes_reserved +
+   alloc_bytes + 256 * 1024 * 1024 < num_bytes)
+   return 0;
+
+   if (sinfo->bytes_used + sinfo

[PATCH V2 04/12] Btrfs: Kill init_btrfs_i()

2010-04-26 Thread Yan, Zheng
All code in init_btrfs_i() can be moved into btrfs_alloc_inode().

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 2/fs/btrfs/inode.c 3/fs/btrfs/inode.c
--- 2/fs/btrfs/inode.c  2010-04-26 17:24:41.254078880 +0800
+++ 3/fs/btrfs/inode.c  2010-04-26 17:24:41.270103836 +0800
@@ -3595,40 +3595,10 @@ again:
return 0;
 }
 
-static noinline void init_btrfs_i(struct inode *inode)
-{
-   struct btrfs_inode *bi = BTRFS_I(inode);
-
-   bi->generation = 0;
-   bi->sequence = 0;
-   bi->last_trans = 0;
-   bi->last_sub_trans = 0;
-   bi->logged_trans = 0;
-   bi->delalloc_bytes = 0;
-   bi->reserved_bytes = 0;
-   bi->disk_i_size = 0;
-   bi->flags = 0;
-   bi->index_cnt = (u64)-1;
-   bi->last_unlink_trans = 0;
-   bi->ordered_data_close = 0;
-   bi->force_compress = 0;
-   extent_map_tree_init(&BTRFS_I(inode)->extent_tree, GFP_NOFS);
-   extent_io_tree_init(&BTRFS_I(inode)->io_tree,
-inode->i_mapping, GFP_NOFS);
-   extent_io_tree_init(&BTRFS_I(inode)->io_failure_tree,
-inode->i_mapping, GFP_NOFS);
-   INIT_LIST_HEAD(&BTRFS_I(inode)->delalloc_inodes);
-   INIT_LIST_HEAD(&BTRFS_I(inode)->ordered_operations);
-   RB_CLEAR_NODE(&BTRFS_I(inode)->rb_node);
-   btrfs_ordered_inode_tree_init(&BTRFS_I(inode)->ordered_tree);
-   mutex_init(&BTRFS_I(inode)->log_mutex);
-}
-
 static int btrfs_init_locked_inode(struct inode *inode, void *p)
 {
struct btrfs_iget_args *args = p;
inode->i_ino = args->ino;
-   init_btrfs_i(inode);
BTRFS_I(inode)->root = args->root;
btrfs_set_inode_space_info(args->root, inode);
return 0;
@@ -3691,8 +3661,6 @@ static struct inode *new_simple_dir(stru
if (!inode)
return ERR_PTR(-ENOMEM);
 
-   init_btrfs_i(inode);
-
BTRFS_I(inode)->root = root;
memcpy(&BTRFS_I(inode)->location, key, sizeof(*key));
BTRFS_I(inode)->dummy_inode = 1;
@@ -4091,7 +4059,6 @@ static struct inode *btrfs_new_inode(str
 * btrfs_get_inode_index_count has an explanation for the magic
 * number
 */
-   init_btrfs_i(inode);
BTRFS_I(inode)->index_cnt = 2;
BTRFS_I(inode)->root = root;
BTRFS_I(inode)->generation = trans->transid;
@@ -5262,21 +5229,46 @@ unsigned long btrfs_force_ra(struct addr
 struct inode *btrfs_alloc_inode(struct super_block *sb)
 {
struct btrfs_inode *ei;
+   struct inode *inode;
 
ei = kmem_cache_alloc(btrfs_inode_cachep, GFP_NOFS);
if (!ei)
return NULL;
+
+   ei->root = NULL;
+   ei->space_info = NULL;
+   ei->generation = 0;
+   ei->sequence = 0;
ei->last_trans = 0;
ei->last_sub_trans = 0;
ei->logged_trans = 0;
+   ei->delalloc_bytes = 0;
+   ei->reserved_bytes = 0;
+   ei->disk_i_size = 0;
+   ei->flags = 0;
+   ei->index_cnt = (u64)-1;
+   ei->last_unlink_trans = 0;
+
+   spin_lock_init(&ei->accounting_lock);
ei->outstanding_extents = 0;
ei->reserved_extents = 0;
-   ei->root = NULL;
-   spin_lock_init(&ei->accounting_lock);
+
+   ei->ordered_data_close = 0;
+   ei->dummy_inode = 0;
+   ei->force_compress = 0;
+
+   inode = &ei->vfs_inode;
+   extent_map_tree_init(&ei->extent_tree, GFP_NOFS);
+   extent_io_tree_init(&ei->io_tree, &inode->i_data, GFP_NOFS);
+   extent_io_tree_init(&ei->io_failure_tree, &inode->i_data, GFP_NOFS);
+   mutex_init(&ei->log_mutex);
btrfs_ordered_inode_tree_init(&ei->ordered_tree);
INIT_LIST_HEAD(&ei->i_orphan);
+   INIT_LIST_HEAD(&ei->delalloc_inodes);
INIT_LIST_HEAD(&ei->ordered_operations);
-   return &ei->vfs_inode;
+   RB_CLEAR_NODE(&ei->rb_node);
+
+   return inode;
 }
 
 void btrfs_destroy_inode(struct inode *inode)


[PATCH V2 08/12] Btrfs: Introduce global metadata reservation

2010-04-26 Thread Yan, Zheng
Reserve metadata space for the extent tree, the checksum tree and the root tree.

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h
--- 2/fs/btrfs/ctree.h  2010-04-26 17:27:31.644829469 +0800
+++ 3/fs/btrfs/ctree.h  2010-04-26 17:27:31.648830941 +0800
@@ -682,21 +682,15 @@ struct btrfs_space_info {
u64 bytes_reserved; /* total bytes the allocator has reserved for
   current allocations */
u64 bytes_readonly; /* total bytes that are read only */
-   u64 bytes_super;/* total bytes reserved for the super blocks */
-   u64 bytes_root; /* the number of bytes needed to commit a
-  transaction */
+
u64 bytes_may_use;  /* number of bytes that may be used for
   delalloc/allocations */
-   u64 bytes_delalloc; /* number of bytes currently reserved for
-  delayed allocation */
u64 disk_used;  /* total bytes used on disk */
 
int full;   /* indicates that we cannot allocate any more
   chunks for this space */
int force_alloc;/* set if we need to force a chunk alloc for
   this space */
-   int force_delalloc; /* make people start doing filemap_flush until
-  we're under a threshold */
 
struct list_head list;
 
diff -urp 2/fs/btrfs/disk-io.c 3/fs/btrfs/disk-io.c
--- 2/fs/btrfs/disk-io.c2010-04-26 17:27:31.638850832 +0800
+++ 3/fs/btrfs/disk-io.c2010-04-26 17:27:31.649830174 +0800
@@ -1472,10 +1472,6 @@ static int cleaner_kthread(void *arg)
struct btrfs_root *root = arg;
 
do {
-   smp_mb();
-   if (root->fs_info->closing)
-   break;
-
vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
 
if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
@@ -1488,11 +1484,9 @@ static int cleaner_kthread(void *arg)
if (freezing(current)) {
refrigerator();
} else {
-   smp_mb();
-   if (root->fs_info->closing)
-   break;
set_current_state(TASK_INTERRUPTIBLE);
-   schedule();
+   if (!kthread_should_stop())
+   schedule();
__set_current_state(TASK_RUNNING);
}
} while (!kthread_should_stop());
@@ -1504,36 +1498,39 @@ static int transaction_kthread(void *arg
struct btrfs_root *root = arg;
struct btrfs_trans_handle *trans;
struct btrfs_transaction *cur;
+   u64 transid;
unsigned long now;
unsigned long delay;
int ret;
 
do {
-   smp_mb();
-   if (root->fs_info->closing)
-   break;
-
delay = HZ * 30;
vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
-   mutex_lock(&root->fs_info->transaction_kthread_mutex);
 
-   mutex_lock(&root->fs_info->trans_mutex);
+   spin_lock(&root->fs_info->new_trans_lock);
cur = root->fs_info->running_transaction;
if (!cur) {
-   mutex_unlock(&root->fs_info->trans_mutex);
+   spin_unlock(&root->fs_info->new_trans_lock);
goto sleep;
}
 
now = get_seconds();
-   if (now < cur->start_time || now - cur->start_time < 30) {
-   mutex_unlock(&root->fs_info->trans_mutex);
+   if (!cur->blocked &&
+   (now < cur->start_time || now - cur->start_time < 30)) {
+   spin_unlock(&root->fs_info->new_trans_lock);
delay = HZ * 5;
goto sleep;
}
-   mutex_unlock(&root->fs_info->trans_mutex);
-   trans = btrfs_join_transaction(root, 1);
-   ret = btrfs_commit_transaction(trans, root);
+   transid = cur->transid;
+   spin_unlock(&root->fs_info->new_trans_lock);
 
+   trans = btrfs_join_transaction(root, 1);
+   if (transid == trans->transid) {
+   ret = btrfs_commit_transaction(trans, root);
+   BUG_ON(ret);
+   } else {
+   btrfs_end_transaction(trans, root);
+   }
 sleep:
wake_up_process(root->fs_info->cleaner_kthread);
mutex_unlock(&root->fs_info->transaction_kthread_mutex);
@@ -1541,10 +1538,10 @@ sleep:
if (freezing(current)) {
refrigerator();
} else {
-   if (root->fs_info->closing)
-   break

[PATCH V2 07/12] Btrfs: Update metadata reservation for delayed allocation

2010-04-26 Thread Yan, Zheng
Introduce a metadata reservation context for delayed allocation
and update the related functions.

This patch also introduces the EXTENT_FIRST_DELALLOC control bit for
set/clear_extent_bit. It tells set/clear_bit_hook whether they
are processing the first extent_state with the EXTENT_DELALLOC bit
set. This matters when a set/clear_extent_bit call spans
multiple extent_state structures.
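
As a rough user-space illustration of the control bit: the caller passes
the bits by pointer, the hook consumes EXTENT_FIRST_DELALLOC on the first
extent_state it sees, and control bits are masked out before anything is
stored in the state. The bit values and all model_* names are
hypothetical. What the hook does with the information (here, counting one
more outstanding extent for every state after the first) is an assumption
of this sketch, not something the description above spells out.

```c
/* Hypothetical bit values; only the names come from the patch. */
#define EXTENT_DELALLOC        (1 << 5)
#define EXTENT_FIRST_DELALLOC  (1 << 12)
#define EXTENT_CTLBITS         (EXTENT_FIRST_DELALLOC)

struct model_state {
    int state; /* bits currently set on this extent_state */
};

struct model_inode {
    int outstanding_extents;
};

/*
 * Hook run once per extent_state. The control bit marks the first state
 * of the range; the hook clears it so later states can be told apart.
 */
static void model_set_bit_hook(struct model_inode *inode,
                               struct model_state *st, int *bits)
{
    if (!(st->state & EXTENT_DELALLOC) && (*bits & EXTENT_DELALLOC)) {
        if (*bits & EXTENT_FIRST_DELALLOC)
            *bits &= ~EXTENT_FIRST_DELALLOC; /* assumed: caller counted it */
        else
            inode->outstanding_extents++;
    }
}

/* Apply bits to nr consecutive states, as one set_extent_bit call might. */
static void model_set_extent_bit(struct model_inode *inode,
                                 struct model_state *states, int nr,
                                 int bits)
{
    int i;

    for (i = 0; i < nr; i++) {
        int bits_to_set = bits & ~EXTENT_CTLBITS; /* never store ctl bits */

        model_set_bit_hook(inode, &states[i], &bits);
        states[i].state |= bits_to_set;
    }
}
```

With an inode that already has one outstanding extent accounted and a call
that touches three states, the counter ends at three and no state carries
the control bit.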

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 2/fs/btrfs/btrfs_inode.h 3/fs/btrfs/btrfs_inode.h
--- 2/fs/btrfs/btrfs_inode.h2010-04-26 17:26:55.450105767 +0800
+++ 3/fs/btrfs/btrfs_inode.h2010-04-26 17:26:55.456080004 +0800
@@ -137,8 +137,8 @@ struct btrfs_inode {
 * of extent items we've reserved metadata for.
 */
spinlock_t accounting_lock;
+   atomic_t outstanding_extents;
int reserved_extents;
-   int outstanding_extents;
 
/*
 * ordered_data_close is set by truncate when a file that used
diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h
--- 2/fs/btrfs/ctree.h  2010-04-26 17:26:55.451104861 +0800
+++ 3/fs/btrfs/ctree.h  2010-04-26 17:26:55.457079656 +0800
@@ -2078,19 +2078,8 @@ int btrfs_remove_block_group(struct btrf
 u64 btrfs_reduce_alloc_profile(struct btrfs_root *root, u64 flags);
 void btrfs_set_inode_space_info(struct btrfs_root *root, struct inode *ionde);
 void btrfs_clear_space_info_full(struct btrfs_fs_info *info);
-
-int btrfs_unreserve_metadata_for_delalloc(struct btrfs_root *root,
- struct inode *inode, int num_items);
-int btrfs_reserve_metadata_for_delalloc(struct btrfs_root *root,
-   struct inode *inode, int num_items);
-int btrfs_check_data_free_space(struct btrfs_root *root, struct inode *inode,
-   u64 bytes);
-void btrfs_free_reserved_data_space(struct btrfs_root *root,
-   struct inode *inode, u64 bytes);
-void btrfs_delalloc_reserve_space(struct btrfs_root *root, struct inode *inode,
-u64 bytes);
-void btrfs_delalloc_free_space(struct btrfs_root *root, struct inode *inode,
- u64 bytes);
+int btrfs_check_data_free_space(struct inode *inode, u64 bytes);
+void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes);
 int btrfs_trans_reserve_metadata(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
int num_items, int *retries);
@@ -2098,6 +2087,10 @@ void btrfs_trans_release_metadata(struct
struct btrfs_root *root);
 int btrfs_snap_reserve_metadata(struct btrfs_trans_handle *trans,
struct btrfs_pending_snapshot *pending);
+int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
+void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
+int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes);
+void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root);
 void btrfs_free_block_rsv(struct btrfs_root *root,
diff -urp 2/fs/btrfs/extent_io.c 3/fs/btrfs/extent_io.c
--- 2/fs/btrfs/extent_io.c  2010-04-26 17:26:55.447090049 +0800
+++ 3/fs/btrfs/extent_io.c  2010-04-26 17:26:55.458079658 +0800
@@ -336,21 +336,18 @@ static int merge_state(struct extent_io_
 }
 
 static int set_state_cb(struct extent_io_tree *tree,
-struct extent_state *state,
-unsigned long bits)
+struct extent_state *state, int *bits)
 {
if (tree->ops && tree->ops->set_bit_hook) {
return tree->ops->set_bit_hook(tree->mapping->host,
-  state->start, state->end,
-  state->state, bits);
+  state, bits);
}
 
return 0;
 }
 
 static void clear_state_cb(struct extent_io_tree *tree,
-  struct extent_state *state,
-  unsigned long bits)
+  struct extent_state *state, int *bits)
 {
if (tree->ops && tree->ops->clear_bit_hook)
tree->ops->clear_bit_hook(tree->mapping->host, state, bits);
@@ -368,9 +365,10 @@ static void clear_state_cb(struct extent
  */
 static int insert_state(struct extent_io_tree *tree,
struct extent_state *state, u64 start, u64 end,
-   int bits)
+   int *bits)
 {
struct rb_node *node;
+   int bits_to_set = *bits & ~EXTENT_CTLBITS;
int ret;
 
if (end < start) {
@@ -385,9 +383,9 @@ static int insert_state(struct extent_io
if (ret)
return ret;
 
-   if (bits & EXTENT_DIRTY)
+   if (bits_to_set & EXTENT_DIRTY

[PATCH V2 09/12] Btrfs: Metadata reservation for orphan inodes

2010-04-26 Thread Yan, Zheng
Reserve metadata space for handling orphan inodes.

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 2/fs/btrfs/btrfs_inode.h 3/fs/btrfs/btrfs_inode.h
--- 2/fs/btrfs/btrfs_inode.h2010-04-26 17:27:52.113080051 +0800
+++ 3/fs/btrfs/btrfs_inode.h2010-04-26 17:27:52.118079430 +0800
@@ -151,6 +151,7 @@ struct btrfs_inode {
 * of these.
 */
unsigned ordered_data_close:1;
+   unsigned orphan_meta_reserved:1;
unsigned dummy_inode:1;
 
/*
diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h
--- 2/fs/btrfs/ctree.h  2010-04-26 17:27:52.114079844 +0800
+++ 3/fs/btrfs/ctree.h  2010-04-26 17:27:52.119079920 +0800
@@ -1068,7 +1068,6 @@ struct btrfs_root {
int ref_cows;
int track_dirty;
int in_radix;
-   int clean_orphans;
 
u64 defrag_trans_start;
struct btrfs_key defrag_progress;
@@ -1082,8 +1081,11 @@ struct btrfs_root {
 
struct list_head root_list;
 
-   spinlock_t list_lock;
+   spinlock_t orphan_lock;
struct list_head orphan_list;
+   struct btrfs_block_rsv *orphan_block_rsv;
+   int orphan_item_inserted;
+   int orphan_cleanup_state;
 
spinlock_t inode_lock;
/* red-black tree that keeps track of in-memory inodes */
@@ -2079,6 +2081,9 @@ int btrfs_trans_reserve_metadata(struct 
int num_items, int *retries);
 void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
+int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans,
+ struct inode *inode);
+void btrfs_orphan_release_metadata(struct inode *inode);
 int btrfs_snap_reserve_metadata(struct btrfs_trans_handle *trans,
struct btrfs_pending_snapshot *pending);
 int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
@@ -2403,6 +2408,13 @@ int btrfs_update_inode(struct btrfs_tran
 int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode);
 int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode);
 void btrfs_orphan_cleanup(struct btrfs_root *root);
+void btrfs_orphan_pre_snapshot(struct btrfs_trans_handle *trans,
+   struct btrfs_pending_snapshot *pending,
+   u64 *bytes_to_reserve);
+void btrfs_orphan_post_snapshot(struct btrfs_trans_handle *trans,
+   struct btrfs_pending_snapshot *pending);
+void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root);
 int btrfs_cont_expand(struct inode *inode, loff_t size);
 int btrfs_invalidate_inodes(struct btrfs_root *root);
 void btrfs_add_delayed_iput(struct inode *inode);
diff -urp 2/fs/btrfs/disk-io.c 3/fs/btrfs/disk-io.c
--- 2/fs/btrfs/disk-io.c2010-04-26 17:27:52.105081158 +0800
+++ 3/fs/btrfs/disk-io.c2010-04-26 17:27:52.120080690 +0800
@@ -895,7 +895,8 @@ static int __setup_root(u32 nodesize, u3
root->ref_cows = 0;
root->track_dirty = 0;
root->in_radix = 0;
-   root->clean_orphans = 0;
+   root->orphan_item_inserted = 0;
+   root->orphan_cleanup_state = 0;
 
root->fs_info = fs_info;
root->objectid = objectid;
@@ -905,12 +906,13 @@ static int __setup_root(u32 nodesize, u3
root->in_sysfs = 0;
root->inode_tree = RB_ROOT;
root->block_rsv = NULL;
+   root->orphan_block_rsv = NULL;
 
INIT_LIST_HEAD(&root->dirty_list);
INIT_LIST_HEAD(&root->orphan_list);
INIT_LIST_HEAD(&root->root_list);
spin_lock_init(&root->node_lock);
-   spin_lock_init(&root->list_lock);
+   spin_lock_init(&root->orphan_lock);
spin_lock_init(&root->inode_lock);
spin_lock_init(&root->accounting_lock);
mutex_init(&root->objectid_mutex);
@@ -1194,19 +1196,23 @@ again:
if (root)
return root;
 
-   ret = btrfs_find_orphan_item(fs_info->tree_root, location->objectid);
-   if (ret == 0)
-   ret = -ENOENT;
-   if (ret < 0)
-   return ERR_PTR(ret);
-
root = btrfs_read_fs_root_no_radix(fs_info->tree_root, location);
if (IS_ERR(root))
return root;
 
-   WARN_ON(btrfs_root_refs(&root->root_item) == 0);
set_anon_super(&root->anon_super, NULL);
 
+   if (btrfs_root_refs(&root->root_item) == 0) {
+   ret = -ENOENT;
+   goto fail;
+   }
+
+   ret = btrfs_find_orphan_item(fs_info->tree_root, location->objectid);
+   if (ret < 0)
+   goto fail;
+   if (ret == 0)
+   root->orphan_item_inserted = 1;
+
ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
if (ret)
goto fail;
@@ -1215,10 +1221,9 @@ again:
ret = radix_tree_insert(&fs_info->fs_roots_radix,
(unsigned long)root->root_key.objectid

[PATCH V2 10/12] Btrfs: Metadata ENOSPC handling for tree log

2010-04-26 Thread Yan, Zheng
Previous patches make the allocator return -ENOSPC if there is no
unreserved free metadata space. This patch updates the tree log code
and various other places to propagate/handle the ENOSPC error.
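
The shape of the change in start_log_trans() (BUG_ON replaced by an error
that is carried to the caller) can be sketched in user space. This is only
a model under stated assumptions: struct model_root and
model_alloc_block() are hypothetical stand-ins for the two setup steps and
for a metadata allocator that can run dry.

```c
#include <errno.h>

/* Hypothetical stand-in for the state start_log_trans() sets up. */
struct model_root {
    int has_log_root_tree;
    int has_log_root;
    int free_metadata_blocks; /* each setup step consumes one block */
};

/* Hypothetical allocator: fails with -ENOSPC instead of crashing. */
static int model_alloc_block(struct model_root *root)
{
    if (root->free_metadata_blocks == 0)
        return -ENOSPC;
    root->free_metadata_blocks--;
    return 0;
}

/* Same control flow as the patched function: remember the first error. */
static int model_start_log_trans(struct model_root *root)
{
    int ret;
    int err = 0;

    if (!root->has_log_root_tree) {
        ret = model_alloc_block(root);
        if (ret)
            err = ret;
        else
            root->has_log_root_tree = 1;
    }
    if (err == 0 && !root->has_log_root) {
        ret = model_alloc_block(root);
        if (ret)
            err = ret;
        else
            root->has_log_root = 1;
    }
    return err;
}
```

With a single free block the first call sets up the log root tree and then
returns -ENOSPC; once another block is available, a second call finishes
the setup and returns 0.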

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 2/fs/btrfs/disk-io.c 3/fs/btrfs/disk-io.c
--- 2/fs/btrfs/disk-io.c2010-04-26 17:28:05.496079922 +0800
+++ 3/fs/btrfs/disk-io.c2010-04-26 17:28:05.506079726 +0800
@@ -973,42 +973,6 @@ static int find_and_setup_root(struct bt
return 0;
 }
 
-int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
-struct btrfs_fs_info *fs_info)
-{
-   struct extent_buffer *eb;
-   struct btrfs_root *log_root_tree = fs_info->log_root_tree;
-   u64 start = 0;
-   u64 end = 0;
-   int ret;
-
-   if (!log_root_tree)
-   return 0;
-
-   while (1) {
-   ret = find_first_extent_bit(&log_root_tree->dirty_log_pages,
-   0, &start, &end, EXTENT_DIRTY | EXTENT_NEW);
-   if (ret)
-   break;
-
-   clear_extent_bits(&log_root_tree->dirty_log_pages, start, end,
- EXTENT_DIRTY | EXTENT_NEW, GFP_NOFS);
-   }
-   eb = fs_info->log_root_tree->node;
-
-   WARN_ON(btrfs_header_level(eb) != 0);
-   WARN_ON(btrfs_header_nritems(eb) != 0);
-
-   ret = btrfs_free_reserved_extent(fs_info->tree_root,
-   eb->start, eb->len);
-   BUG_ON(ret);
-
-   free_extent_buffer(eb);
-   kfree(fs_info->log_root_tree);
-   fs_info->log_root_tree = NULL;
-   return 0;
-}
-
 static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info)
 {
diff -urp 2/fs/btrfs/disk-io.h 3/fs/btrfs/disk-io.h
--- 2/fs/btrfs/disk-io.h2010-04-26 17:28:05.495079921 +0800
+++ 3/fs/btrfs/disk-io.h2010-04-26 17:28:05.507080566 +0800
@@ -95,8 +95,6 @@ int btrfs_congested_async(struct btrfs_f
 unsigned long btrfs_async_submit_limit(struct btrfs_fs_info *info);
 int btrfs_write_tree_block(struct extent_buffer *buf);
 int btrfs_wait_tree_block_writeback(struct extent_buffer *buf);
-int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
-struct btrfs_fs_info *fs_info);
 int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info);
 int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
diff -urp 2/fs/btrfs/file-item.c 3/fs/btrfs/file-item.c
--- 2/fs/btrfs/file-item.c  2010-04-26 17:28:05.503100326 +0800
+++ 3/fs/btrfs/file-item.c  2010-04-26 17:28:05.507080566 +0800
@@ -656,6 +656,9 @@ again:
goto found;
}
ret = PTR_ERR(item);
+   if (ret != -EFBIG && ret != -ENOENT)
+   goto fail_unlock;
+
if (ret == -EFBIG) {
u32 item_size;
/* we found one, but it isn't big enough yet */
diff -urp 2/fs/btrfs/tree-log.c 3/fs/btrfs/tree-log.c
--- 2/fs/btrfs/tree-log.c   2010-04-26 17:28:05.498105836 +0800
+++ 3/fs/btrfs/tree-log.c   2010-04-26 17:28:05.509079730 +0800
@@ -134,6 +134,7 @@ static int start_log_trans(struct btrfs_
   struct btrfs_root *root)
 {
int ret;
+   int err = 0;
 
mutex_lock(&root->log_mutex);
if (root->log_root) {
@@ -154,17 +155,19 @@ static int start_log_trans(struct btrfs_
mutex_lock(&root->fs_info->tree_log_mutex);
if (!root->fs_info->log_root_tree) {
ret = btrfs_init_log_root_tree(trans, root->fs_info);
-   BUG_ON(ret);
+   if (ret)
+   err = ret;
}
-   if (!root->log_root) {
+   if (err == 0 && !root->log_root) {
ret = btrfs_add_log_tree(trans, root);
-   BUG_ON(ret);
+   if (ret)
+   err = ret;
}
mutex_unlock(&root->fs_info->tree_log_mutex);
root->log_batch++;
atomic_inc(&root->log_writers);
mutex_unlock(&root->log_mutex);
-   return 0;
+   return err;
 }
 
 /*
@@ -375,7 +378,7 @@ insert:
BUG_ON(ret);
}
} else if (ret) {
-   BUG();
+   return ret;
}
dst_ptr = btrfs_item_ptr_offset(path->nodes[0],
path->slots[0]);
@@ -1698,9 +1701,9 @@ static noinline int walk_down_log_tree(s
 
next = btrfs_find_create_tree_block(root, bytenr, blocksize);
 
-   wc->process_func(root, next, wc, ptr_gen);
-
if (*level == 1) {
+   wc->process_func(root, next, wc, ptr_gen);
+
path->slots[*level]++;
if (wc->free) {
btrfs_read_buffer(next, ptr_gen);
@@ -1733,35 +1736,7 @@ static noinline int

[PATCH V2 11/12] Btrfs: Pre-allocate space for data relocation

2010-04-26 Thread Yan, Zheng
Pre-allocate space for data relocation. This can detect an ENOSPC
condition caused by fragmentation of free space.
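
To make the fragmentation case concrete, here is a small user-space
sketch. It is not the btrfs allocator: the model_* names and the
first-fit policy are hypothetical, and how the relocation code picks
min_size is not shown in this excerpt, so the example simply assumes a
caller that insists on extents of a given minimum size.

```c
#include <errno.h>
#include <stdint.h>

/* One contiguous run of free space. */
struct model_free_extent {
    uint64_t start;
    uint64_t len;
};

/* First fit: carve up to num_bytes out of a fragment >= min_size. */
static int model_reserve_extent(struct model_free_extent *frags, int nr,
                                uint64_t num_bytes, uint64_t min_size,
                                uint64_t *out_start, uint64_t *out_len)
{
    int i;

    for (i = 0; i < nr; i++) {
        if (frags[i].len > 0 && frags[i].len >= min_size) {
            uint64_t len = frags[i].len < num_bytes ?
                           frags[i].len : num_bytes;

            *out_start = frags[i].start;
            *out_len = len;
            frags[i].start += len;
            frags[i].len -= len;
            return 0;
        }
    }
    return -ENOSPC;
}

/*
 * Keep reserving until num_bytes are covered, as the loop in the patch
 * does. Earlier reservations are not rolled back on failure here.
 */
static int model_prealloc_range(struct model_free_extent *frags, int nr,
                                uint64_t num_bytes, uint64_t min_size)
{
    while (num_bytes > 0) {
        uint64_t start, len;
        int ret = model_reserve_extent(frags, nr, num_bytes, min_size,
                                       &start, &len);

        if (ret)
            return ret;
        num_bytes -= len;
    }
    return 0;
}
```

With three free fragments of 4096 bytes each, asking for 8192 bytes in
pieces of at least 4096 succeeds, while asking for one 8192-byte piece
fails with -ENOSPC even though 12288 bytes are free in total.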

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h
--- 2/fs/btrfs/ctree.h  2010-04-26 17:28:20.493839748 +0800
+++ 3/fs/btrfs/ctree.h  2010-04-26 17:28:20.498830465 +0800
@@ -2419,6 +2419,9 @@ int btrfs_cont_expand(struct inode *inod
 int btrfs_invalidate_inodes(struct btrfs_root *root);
 void btrfs_add_delayed_iput(struct inode *inode);
 void btrfs_run_delayed_iputs(struct btrfs_root *root);
+int btrfs_prealloc_file_range(struct inode *inode, int mode,
+ u64 start, u64 num_bytes, u64 min_size,
+ loff_t actual_len, u64 *alloc_hint);
 extern const struct dentry_operations btrfs_dentry_operations;
 
 /* ioctl.c */
diff -urp 2/fs/btrfs/inode.c 3/fs/btrfs/inode.c
--- 2/fs/btrfs/inode.c  2010-04-26 17:28:20.489839672 +0800
+++ 3/fs/btrfs/inode.c  2010-04-26 17:28:20.500829420 +0800
@@ -1174,6 +1174,13 @@ out_check:
   num_bytes, num_bytes, type);
BUG_ON(ret);
 
+   if (root->root_key.objectid ==
+   BTRFS_DATA_RELOC_TREE_OBJECTID) {
+   ret = btrfs_reloc_clone_csums(inode, cur_offset,
+ num_bytes);
+   BUG_ON(ret);
+   }
+
extent_clear_unlock_delalloc(inode, &BTRFS_I(inode)->io_tree,
cur_offset, cur_offset + num_bytes - 1,
locked_page, EXTENT_CLEAR_UNLOCK_PAGE |
@@ -6079,16 +6086,15 @@ out_unlock:
return err;
 }
 
-static int prealloc_file_range(struct inode *inode, u64 start, u64 end,
-   u64 alloc_hint, int mode, loff_t actual_len)
+int btrfs_prealloc_file_range(struct inode *inode, int mode,
+ u64 start, u64 num_bytes, u64 min_size,
+ loff_t actual_len, u64 *alloc_hint)
 {
struct btrfs_trans_handle *trans;
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_key ins;
u64 cur_offset = start;
-   u64 num_bytes = end - start;
int ret = 0;
-   u64 i_size;
 
while (num_bytes > 0) {
trans = btrfs_start_transaction(root, 3);
@@ -6097,9 +6103,8 @@ static int prealloc_file_range(struct in
break;
}
 
-   ret = btrfs_reserve_extent(trans, root, num_bytes,
-  root->sectorsize, 0, alloc_hint,
-  (u64)-1, &ins, 1);
+   ret = btrfs_reserve_extent(trans, root, num_bytes, min_size,
+  0, *alloc_hint, (u64)-1, &ins, 1);
if (ret) {
btrfs_end_transaction(trans, root);
break;
@@ -6116,20 +6121,19 @@ static int prealloc_file_range(struct in
 
num_bytes -= ins.offset;
cur_offset += ins.offset;
-   alloc_hint = ins.objectid + ins.offset;
+   *alloc_hint = ins.objectid + ins.offset;
 
inode->i_ctime = CURRENT_TIME;
BTRFS_I(inode)->flags |= BTRFS_INODE_PREALLOC;
if (!(mode & FALLOC_FL_KEEP_SIZE) &&
-   (actual_len > inode->i_size) &&
-   (cur_offset > inode->i_size)) {
-
+   (actual_len > inode->i_size) &&
+   (cur_offset > inode->i_size)) {
if (cur_offset > actual_len)
-   i_size  = actual_len;
+   i_size_write(inode, actual_len);
else
-   i_size = cur_offset;
-   i_size_write(inode, i_size);
-   btrfs_ordered_update_i_size(inode, i_size, NULL);
+   i_size_write(inode, cur_offset);
+   i_size_write(inode, cur_offset);
+   btrfs_ordered_update_i_size(inode, cur_offset, NULL);
}
 
ret = btrfs_update_inode(trans, root, inode);
@@ -6215,16 +6219,16 @@ static long btrfs_fallocate(struct inode
if (em->block_start == EXTENT_MAP_HOLE ||
(cur_offset >= inode->i_size &&
 !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
-   ret = prealloc_file_range(inode,
- cur_offset, last_byte,
-   alloc_hint, mode, offset+len);
+   ret = btrfs_prealloc_file_range(inode, 0, cur_offset,
+   last_byte - cur_offset,
+   1 << inode->i_blkbits

[PATCH 10/12] Btrfs: Metadata ENOSPC handling for tree log

2010-04-19 Thread Yan, Zheng
Previous patches make the allocator return -ENOSPC if there is
no unreserved free metadata space. This patch updates the tree log
code and various other places to propagate/handle the ENOSPC error.

Signed-off-by: Yan Zheng <zheng@oracle.com>

---
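The error-propagation pattern this patch applies to start_log_trans()
(record the first failure in err, skip the dependent step, and return err
instead of hitting BUG_ON) can be sketched in user space as below. This is
only an illustration: struct log_state and the two helper functions are
hypothetical stand-ins for btrfs_init_log_root_tree() and
btrfs_add_log_tree(), and the "free_blocks" counter is a toy substitute for
the real metadata reservation.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical toy state; not a kernel structure. */
struct log_state {
	int have_root_tree;	/* 1 once the log root tree exists */
	int have_log_root;	/* 1 once this root's log tree exists */
	int free_blocks;	/* toy count of unreserved metadata blocks */
};

/* Stand-in for btrfs_init_log_root_tree(): needs one free block. */
static int init_log_root_tree(struct log_state *s)
{
	if (s->free_blocks == 0)
		return -ENOSPC;
	s->free_blocks--;
	s->have_root_tree = 1;
	return 0;
}

/* Stand-in for btrfs_add_log_tree(): needs one free block. */
static int add_log_tree(struct log_state *s)
{
	if (s->free_blocks == 0)
		return -ENOSPC;
	s->free_blocks--;
	s->have_log_root = 1;
	return 0;
}

/* Same control flow as the patched start_log_trans(), minus locking. */
static int start_log_trans_sketch(struct log_state *s)
{
	int ret;
	int err = 0;

	assert(s);
	if (!s->have_root_tree) {
		ret = init_log_root_tree(s);
		if (ret)
			err = ret;
	}
	/* Only attempt the second step if the first one succeeded. */
	if (err == 0 && !s->have_log_root) {
		ret = add_log_tree(s);
		if (ret)
			err = ret;
	}
	return err;
}
```

The point of keeping err separate from ret is that the cleanup after both
steps (the mutex unlocks and counter updates in the real function) still
runs on every path, and the caller sees the first error rather than a crash.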
diff -urp 10/fs/btrfs/disk-io.c 11/fs/btrfs/disk-io.c
--- 10/fs/btrfs/disk-io.c   2010-04-18 10:51:06.294978596 +0800
+++ 11/fs/btrfs/disk-io.c   2010-04-18 10:51:28.685949058 +0800
@@ -973,42 +973,6 @@ static int find_and_setup_root(struct bt
return 0;
 }
 
-int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
-struct btrfs_fs_info *fs_info)
-{
-   struct extent_buffer *eb;
-   struct btrfs_root *log_root_tree = fs_info->log_root_tree;
-   u64 start = 0;
-   u64 end = 0;
-   int ret;
-
-   if (!log_root_tree)
-   return 0;
-
-   while (1) {
-   ret = find_first_extent_bit(&log_root_tree->dirty_log_pages,
-   0, &start, &end, EXTENT_DIRTY | EXTENT_NEW);
-   if (ret)
-   break;
-
-   clear_extent_bits(&log_root_tree->dirty_log_pages, start, end,
- EXTENT_DIRTY | EXTENT_NEW, GFP_NOFS);
-   }
-   eb = fs_info->log_root_tree->node;
-
-   WARN_ON(btrfs_header_level(eb) != 0);
-   WARN_ON(btrfs_header_nritems(eb) != 0);
-
-   ret = btrfs_free_reserved_extent(fs_info->tree_root,
-   eb->start, eb->len);
-   BUG_ON(ret);
-
-   free_extent_buffer(eb);
-   kfree(fs_info->log_root_tree);
-   fs_info->log_root_tree = NULL;
-   return 0;
-}
-
 static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info)
 {
diff -urp 10/fs/btrfs/disk-io.h 11/fs/btrfs/disk-io.h
--- 10/fs/btrfs/disk-io.h   2010-04-18 10:47:31.057968000 +0800
+++ 11/fs/btrfs/disk-io.h   2010-04-18 10:51:28.686949137 +0800
@@ -95,8 +95,6 @@ int btrfs_congested_async(struct btrfs_f
 unsigned long btrfs_async_submit_limit(struct btrfs_fs_info *info);
 int btrfs_write_tree_block(struct extent_buffer *buf);
 int btrfs_wait_tree_block_writeback(struct extent_buffer *buf);
-int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
-struct btrfs_fs_info *fs_info);
 int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info);
 int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
diff -urp 10/fs/btrfs/file-item.c 11/fs/btrfs/file-item.c
--- 10/fs/btrfs/file-item.c 2010-04-18 10:47:31.057968000 +0800
+++ 11/fs/btrfs/file-item.c 2010-04-18 10:51:28.687948517 +0800
@@ -656,6 +656,9 @@ again:
goto found;
}
ret = PTR_ERR(item);
+   if (ret != -EFBIG && ret != -ENOENT)
+   goto fail_unlock;
+
if (ret == -EFBIG) {
u32 item_size;
/* we found one, but it isn't big enough yet */
diff -urp 10/fs/btrfs/tree-log.c 11/fs/btrfs/tree-log.c
--- 10/fs/btrfs/tree-log.c  2010-04-18 10:47:31.058957000 +0800
+++ 11/fs/btrfs/tree-log.c  2010-04-18 10:51:28.689947836 +0800
@@ -134,6 +134,7 @@ static int start_log_trans(struct btrfs_
   struct btrfs_root *root)
 {
int ret;
+   int err = 0;
 
mutex_lock(&root->log_mutex);
if (root->log_root) {
@@ -154,17 +155,19 @@ static int start_log_trans(struct btrfs_
mutex_lock(&root->fs_info->tree_log_mutex);
if (!root->fs_info->log_root_tree) {
ret = btrfs_init_log_root_tree(trans, root->fs_info);
-   BUG_ON(ret);
+   if (ret)
+   err = ret;
}
-   if (!root->log_root) {
+   if (err == 0 && !root->log_root) {
ret = btrfs_add_log_tree(trans, root);
-   BUG_ON(ret);
+   if (ret)
+   err = ret;
}
mutex_unlock(&root->fs_info->tree_log_mutex);
root->log_batch++;
atomic_inc(&root->log_writers);
mutex_unlock(&root->log_mutex);
-   return 0;
+   return err;
 }
 
 /*
@@ -375,7 +378,7 @@ insert:
BUG_ON(ret);
}
} else if (ret) {
-   BUG();
+   return ret;
}
dst_ptr = btrfs_item_ptr_offset(path->nodes[0],
path->slots[0]);
@@ -1698,9 +1701,9 @@ static noinline int walk_down_log_tree(s
 
next = btrfs_find_create_tree_block(root, bytenr, blocksize);
 
-   wc->process_func(root, next, wc, ptr_gen);
-
if (*level == 1) {
+   wc->process_func(root, next, wc, ptr_gen);
+
path->slots[*level]++;
if (wc->free) {
btrfs_read_buffer(next, ptr_gen);
@@ -1733,35 +1736,7 @@ static noinline int

[PATCH 09/12] Btrfs: Metadata reservation for orphan inodes

2010-04-19 Thread Yan, Zheng
Reserve metadata space for handling orphan inodes.

Signed-off-by: Yan Zheng <zheng@oracle.com>

---
diff -urp 9/fs/btrfs/btrfs_inode.h 10/fs/btrfs/btrfs_inode.h
--- 9/fs/btrfs/btrfs_inode.h    2010-04-18 10:26:38.326701000 +0800
+++ 10/fs/btrfs/btrfs_inode.h   2010-04-18 10:50:26.564697845 +0800
@@ -151,6 +151,7 @@ struct btrfs_inode {
 * of these.
 */
unsigned ordered_data_close:1;
+   unsigned orphan_meta_reserved:1;
unsigned dummy_inode:1;
 
/*
diff -urp 9/fs/btrfs/ctree.h 10/fs/btrfs/ctree.h
--- 9/fs/btrfs/ctree.h  2010-04-18 10:30:01.883697869 +0800
+++ 10/fs/btrfs/ctree.h 2010-04-18 10:50:26.565702253 +0800
@@ -1066,7 +1066,6 @@ struct btrfs_root {
int ref_cows;
int track_dirty;
int in_radix;
-   int clean_orphans;
 
u64 defrag_trans_start;
struct btrfs_key defrag_progress;
@@ -1080,8 +1079,11 @@ struct btrfs_root {
 
struct list_head root_list;
 
-   spinlock_t list_lock;
+   spinlock_t orphan_lock;
struct list_head orphan_list;
+   struct btrfs_block_rsv *orphan_block_rsv;
+   int orphan_item_inserted;
+   int orphan_cleanup_state;
 
spinlock_t inode_lock;
/* red-black tree that keeps track of in-memory inodes */
@@ -2074,6 +2076,9 @@ int btrfs_trans_reserve_metadata(struct 
int num_items, int *retries);
 void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
+int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans,
+ struct inode *inode);
+void btrfs_orphan_release_metadata(struct inode *inode);
 int btrfs_snap_reserve_metadata(struct btrfs_trans_handle *trans,
struct btrfs_pending_snapshot *pending);
 int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
@@ -2390,6 +2395,13 @@ int btrfs_update_inode(struct btrfs_tran
 int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode);
 int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode);
 void btrfs_orphan_cleanup(struct btrfs_root *root);
+void btrfs_orphan_pre_snapshot(struct btrfs_trans_handle *trans,
+   struct btrfs_pending_snapshot *pending,
+   u64 *bytes_to_reserve);
+void btrfs_orphan_post_snapshot(struct btrfs_trans_handle *trans,
+   struct btrfs_pending_snapshot *pending);
+void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root);
 int btrfs_cont_expand(struct inode *inode, loff_t size);
 int btrfs_invalidate_inodes(struct btrfs_root *root);
 void btrfs_add_delayed_iput(struct inode *inode);
diff -urp 9/fs/btrfs/disk-io.c 10/fs/btrfs/disk-io.c
--- 9/fs/btrfs/disk-io.c    2010-04-18 10:47:31.056726210 +0800
+++ 10/fs/btrfs/disk-io.c   2010-04-18 10:51:06.294978596 +0800
@@ -895,7 +895,8 @@ static int __setup_root(u32 nodesize, u3
root->ref_cows = 0;
root->track_dirty = 0;
root->in_radix = 0;
-   root->clean_orphans = 0;
+   root->orphan_item_inserted = 0;
+   root->orphan_cleanup_state = 0;
 
root->fs_info = fs_info;
root->objectid = objectid;
@@ -905,12 +906,13 @@ static int __setup_root(u32 nodesize, u3
root->in_sysfs = 0;
root->inode_tree = RB_ROOT;
root->block_rsv = NULL;
+   root->orphan_block_rsv = NULL;
 
INIT_LIST_HEAD(&root->dirty_list);
INIT_LIST_HEAD(&root->orphan_list);
INIT_LIST_HEAD(&root->root_list);
spin_lock_init(&root->node_lock);
-   spin_lock_init(&root->list_lock);
+   spin_lock_init(&root->orphan_lock);
spin_lock_init(&root->inode_lock);
spin_lock_init(&root->accounting_lock);
mutex_init(&root->objectid_mutex);
@@ -1194,19 +1196,23 @@ again:
if (root)
return root;
 
-   ret = btrfs_find_orphan_item(fs_info->tree_root, location->objectid);
-   if (ret == 0)
-   ret = -ENOENT;
-   if (ret < 0)
-   return ERR_PTR(ret);
-
root = btrfs_read_fs_root_no_radix(fs_info->tree_root, location);
if (IS_ERR(root))
return root;
 
-   WARN_ON(btrfs_root_refs(&root->root_item) == 0);
set_anon_super(&root->anon_super, NULL);
 
+   if (btrfs_root_refs(&root->root_item) == 0) {
+   ret = -ENOENT;
+   goto fail;
+   }
+
+   ret = btrfs_find_orphan_item(fs_info->tree_root, location->objectid);
+   if (ret < 0)
+   goto fail;
+   if (ret == 0)
+   root->orphan_item_inserted = 1;
+
ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
if (ret)
goto fail;
@@ -1215,10 +1221,9 @@ again:
ret = radix_tree_insert(&fs_info->fs_roots_radix,
(unsigned long)root->root_key.objectid

[PATCH 07/12] Btrfs: Update metadata reservation for delayed allocation

2010-04-19 Thread Yan, Zheng
Introduce metadata reservation context for delayed allocation and
update various related functions.

This patch also introduces EXTENT_FIRST_DELALLOC control bit for
set/clear_extent_bit. It tells set/clear_bit_hook whether they
are processing the first extent_state with EXTENT_DELALLOC bit
set. This change is important if set/clear_extent_bit involves
multiple extent_state.

Signed-off-by: Yan Zheng <zheng@oracle.com>

---
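A rough user-space sketch of how a control bit such as
EXTENT_FIRST_DELALLOC might be consumed: the caller passes the bits by
pointer, control bits are masked out before anything is stored in the
extent_state, and the hook clears the control bit once it has handled the
first state so that later states in the same call are treated differently.
The bit values, struct extent_state_sketch, and the global counter below
are assumptions made for illustration, not the kernel definitions.

```c
#include <assert.h>

/* Illustrative bit values only; the real ones live in extent_io.h. */
#define EXTENT_DIRTY		(1 << 0)
#define EXTENT_DELALLOC		(1 << 1)
#define EXTENT_FIRST_DELALLOC	(1 << 2)	/* control bit, never stored */
#define EXTENT_CTLBITS		(EXTENT_FIRST_DELALLOC)

/* Hypothetical, trimmed-down extent_state. */
struct extent_state_sketch {
	int state;
};

/* Toy stand-in for the per-inode outstanding_extents counter. */
static int outstanding_extents;

/*
 * Hook called once per extent_state. The first state that gains
 * EXTENT_DELALLOC is assumed to be accounted for already, so the hook
 * just consumes the control bit; every further state bumps the counter.
 */
static void set_bit_hook_sketch(struct extent_state_sketch *st, int *bits)
{
	if (!(st->state & EXTENT_DELALLOC) && (*bits & EXTENT_DELALLOC)) {
		if (*bits & EXTENT_FIRST_DELALLOC)
			*bits &= ~EXTENT_FIRST_DELALLOC;
		else
			outstanding_extents++;
	}
}

/* Apply bits to one state; control bits never reach st->state. */
static void set_state_bits_sketch(struct extent_state_sketch *st, int *bits)
{
	int bits_to_set = *bits & ~EXTENT_CTLBITS;

	assert(st && bits);
	set_bit_hook_sketch(st, bits);
	st->state |= bits_to_set;
}
```

Passing bits as int * rather than by value is what lets the hook's change
to the control bit survive across the several extent_state structures that
a single set/clear_extent_bit call may touch.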
diff -urp 7/fs/btrfs/btrfs_inode.h 8/fs/btrfs/btrfs_inode.h
--- 7/fs/btrfs/btrfs_inode.h    2010-04-13 15:44:56.104814000 +0800
+++ 8/fs/btrfs/btrfs_inode.h    2010-04-18 10:26:38.326701375 +0800
@@ -137,8 +137,8 @@ struct btrfs_inode {
 * of extent items we've reserved metadata for.
 */
spinlock_t accounting_lock;
+   atomic_t outstanding_extents;
int reserved_extents;
-   int outstanding_extents;
 
/*
 * ordered_data_close is set by truncate when a file that used
diff -urp 7/fs/btrfs/ctree.h 8/fs/btrfs/ctree.h
--- 7/fs/btrfs/ctree.h  2010-04-18 10:24:51.285697715 +0800
+++ 8/fs/btrfs/ctree.h  2010-04-18 10:26:38.327697818 +0800
@@ -2073,19 +2073,8 @@ int btrfs_remove_block_group(struct btrf
 u64 btrfs_reduce_alloc_profile(struct btrfs_root *root, u64 flags);
 void btrfs_set_inode_space_info(struct btrfs_root *root, struct inode *ionde);
 void btrfs_clear_space_info_full(struct btrfs_fs_info *info);
-
-int btrfs_unreserve_metadata_for_delalloc(struct btrfs_root *root,
- struct inode *inode, int num_items);
-int btrfs_reserve_metadata_for_delalloc(struct btrfs_root *root,
-   struct inode *inode, int num_items);
-int btrfs_check_data_free_space(struct btrfs_root *root, struct inode *inode,
-   u64 bytes);
-void btrfs_free_reserved_data_space(struct btrfs_root *root,
-   struct inode *inode, u64 bytes);
-void btrfs_delalloc_reserve_space(struct btrfs_root *root, struct inode *inode,
-u64 bytes);
-void btrfs_delalloc_free_space(struct btrfs_root *root, struct inode *inode,
- u64 bytes);
+int btrfs_check_data_free_space(struct inode *inode, u64 bytes);
+void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes);
 int btrfs_trans_reserve_metadata(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
int num_items, int *retries);
@@ -2093,6 +2082,10 @@ void btrfs_trans_release_metadata(struct
struct btrfs_root *root);
 int btrfs_snap_reserve_metadata(struct btrfs_trans_handle *trans,
struct btrfs_pending_snapshot *pending);
+int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
+void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
+int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes);
+void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root);
 void btrfs_free_block_rsv(struct btrfs_block_rsv *rsv);
diff -urp 7/fs/btrfs/extent_io.c 8/fs/btrfs/extent_io.c
--- 7/fs/btrfs/extent_io.c  2010-04-14 14:49:57.000937000 +0800
+++ 8/fs/btrfs/extent_io.c  2010-04-18 10:26:38.329697898 +0800
@@ -336,21 +336,18 @@ static int merge_state(struct extent_io_
 }
 
 static int set_state_cb(struct extent_io_tree *tree,
-struct extent_state *state,
-unsigned long bits)
+struct extent_state *state, int *bits)
 {
if (tree->ops && tree->ops->set_bit_hook) {
return tree->ops->set_bit_hook(tree->mapping->host,
-  state->start, state->end,
-  state->state, bits);
+  state, bits);
}
 
return 0;
 }
 
 static void clear_state_cb(struct extent_io_tree *tree,
-  struct extent_state *state,
-  unsigned long bits)
+  struct extent_state *state, int *bits)
 {
if (tree->ops && tree->ops->clear_bit_hook)
tree->ops->clear_bit_hook(tree->mapping->host, state, bits);
@@ -368,9 +365,10 @@ static void clear_state_cb(struct extent
  */
 static int insert_state(struct extent_io_tree *tree,
struct extent_state *state, u64 start, u64 end,
-   int bits)
+   int *bits)
 {
struct rb_node *node;
+   int bits_to_set = *bits & ~EXTENT_CTLBITS;
int ret;
 
if (end < start) {
@@ -385,9 +383,9 @@ static int insert_state(struct extent_io
if (ret)
return ret;
 
-   if (bits & EXTENT_DIRTY)
+   if (bits_to_set & EXTENT_DIRTY

[PATCH 03/12] Btrfs: Shrink delay allocated space in a synchronized way

2010-04-19 Thread Yan, Zheng
Shrinking delayed-allocation space synchronously is more
controllable than flushing all delayed-allocation space in an
async thread.

Signed-off-by: Yan Zheng <zheng@oracle.com>

---
diff -urp 3/fs/btrfs/ctree.h 4/fs/btrfs/ctree.h
--- 3/fs/btrfs/ctree.h  2010-04-18 08:13:15.457699211 +0800
+++ 4/fs/btrfs/ctree.h  2010-04-18 08:13:51.602699293 +0800
@@ -699,10 +699,6 @@ struct btrfs_space_info {
 
struct list_head list;
 
-   /* for controlling how we free up space for allocations */
-   wait_queue_head_t flush_wait;
-   int flushing;
-
/* for block groups in our same type */
struct list_head block_groups[BTRFS_NR_RAID_TYPES];
spinlock_t lock;
@@ -927,7 +923,6 @@ struct btrfs_fs_info {
struct btrfs_workers endio_meta_write_workers;
struct btrfs_workers endio_write_workers;
struct btrfs_workers submit_workers;
-   struct btrfs_workers enospc_workers;
/*
 * fixup workers take dirty pages that didn't properly go through
 * the cow mechanism and make them safe to write.  It happens
@@ -2311,6 +2306,7 @@ int btrfs_truncate_inode_items(struct bt
   u32 min_type);
 
 int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput);
+int btrfs_start_one_delalloc_inode(struct btrfs_root *root, int delay_iput);
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
  struct extent_state **cached_state);
 int btrfs_writepages(struct address_space *mapping,
diff -urp 3/fs/btrfs/disk-io.c 4/fs/btrfs/disk-io.c
--- 3/fs/btrfs/disk-io.c    2010-04-14 14:49:56.559944000 +0800
+++ 4/fs/btrfs/disk-io.c    2010-04-18 08:13:51.60461 +0800
@@ -1768,9 +1768,6 @@ struct btrfs_root *open_ctree(struct sup
   min_t(u64, fs_devices->num_devices,
   fs_info->thread_pool_size),
   &fs_info->generic_worker);
-   btrfs_init_workers(&fs_info->enospc_workers, "enospc",
-  fs_info->thread_pool_size,
-  &fs_info->generic_worker);
 
/* a higher idle thresh on the submit workers makes it much more
 * likely that bios will be send down in a sane order to the
@@ -1818,7 +1815,6 @@ struct btrfs_root *open_ctree(struct sup
btrfs_start_workers(&fs_info->endio_meta_workers, 1);
btrfs_start_workers(&fs_info->endio_meta_write_workers, 1);
btrfs_start_workers(&fs_info->endio_write_workers, 1);
-   btrfs_start_workers(&fs_info->enospc_workers, 1);
 
fs_info->bdi.ra_pages *= btrfs_super_num_devices(disk_super);
fs_info->bdi.ra_pages = max(fs_info->bdi.ra_pages,
@@ -2049,7 +2045,6 @@ fail_sb_buffer:
btrfs_stop_workers(&fs_info->endio_meta_write_workers);
btrfs_stop_workers(&fs_info->endio_write_workers);
btrfs_stop_workers(&fs_info->submit_workers);
-   btrfs_stop_workers(&fs_info->enospc_workers);
 fail_iput:
invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
iput(fs_info->btree_inode);
@@ -2482,7 +2477,6 @@ int close_ctree(struct btrfs_root *root)
btrfs_stop_workers(&fs_info->endio_meta_write_workers);
btrfs_stop_workers(&fs_info->endio_write_workers);
btrfs_stop_workers(&fs_info->submit_workers);
-   btrfs_stop_workers(&fs_info->enospc_workers);
 
btrfs_close_devices(fs_info->fs_devices);
btrfs_mapping_tree_free(&fs_info->mapping_tree);
diff -urp 3/fs/btrfs/extent-tree.c 4/fs/btrfs/extent-tree.c
--- 3/fs/btrfs/extent-tree.c    2010-04-18 08:13:15.463699138 +0800
+++ 4/fs/btrfs/extent-tree.c    2010-04-18 08:13:51.608702224 +0800
@@ -73,6 +73,9 @@ static void dump_space_info(struct btrfs
 static int maybe_allocate_chunk(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct btrfs_space_info *sinfo, u64 num_bytes);
+static int shrink_delalloc(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
+  struct btrfs_space_info *sinfo, u64 to_reclaim);
 
 static noinline int
 block_group_cache_done(struct btrfs_block_group_cache *cache)
@@ -2689,7 +2692,6 @@ static int update_space_info(struct btrf
for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
INIT_LIST_HEAD(&found->block_groups[i]);
init_rwsem(&found->groups_sem);
-   init_waitqueue_head(&found->flush_wait);
spin_lock_init(&found->lock);
found->flags = flags & (BTRFS_BLOCK_GROUP_DATA |
BTRFS_BLOCK_GROUP_SYSTEM |
@@ -2903,105 +2905,6 @@ static void check_force_delalloc(struct 
meta_sinfo-force_delalloc = 0;
 }
 
-struct async_flush {
-   struct btrfs_root *root;
-   struct btrfs_space_info *info;
-   struct btrfs_work work;
-};
-
-static noinline void flush_delalloc_async(struct btrfs_work *work)
-{
-   struct async_flush *async;
-   struct btrfs_root *root

[PATCH 01/12] Btrfs: Link block groups of different raid types in the same space_info

2010-04-19 Thread Yan, Zheng
The size of reserved space is stored in space_info. If block groups
of different raid types are linked to separate space_info, changing
allocation profile will corrupt reserved space accounting.

Signed-off-by: Yan Zheng <zheng@oracle.com>

---
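As a rough illustration of the accounting change, the sketch below folds
block groups of different RAID types into one space_info keyed only by the
DATA/SYSTEM/METADATA type bits, and tracks disk_used separately from
bytes_used by applying a mirror factor of 2 for DUP, RAID1 and RAID10. The
flag values follow the BTRFS_BLOCK_GROUP_* definitions; struct
space_info_sketch, BLOCK_GROUP_TYPE_MASK and both function names are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

#define BTRFS_BLOCK_GROUP_DATA		(1ULL << 0)
#define BTRFS_BLOCK_GROUP_SYSTEM	(1ULL << 1)
#define BTRFS_BLOCK_GROUP_METADATA	(1ULL << 2)
#define BTRFS_BLOCK_GROUP_RAID0		(1ULL << 3)
#define BTRFS_BLOCK_GROUP_RAID1		(1ULL << 4)
#define BTRFS_BLOCK_GROUP_DUP		(1ULL << 5)
#define BTRFS_BLOCK_GROUP_RAID10	(1ULL << 6)

/* Hypothetical name for the type-only mask used in the patch. */
#define BLOCK_GROUP_TYPE_MASK	(BTRFS_BLOCK_GROUP_DATA | \
				 BTRFS_BLOCK_GROUP_SYSTEM | \
				 BTRFS_BLOCK_GROUP_METADATA)

/* Hypothetical, trimmed-down space_info. */
struct space_info_sketch {
	uint64_t flags;		/* type bits only, no RAID profile */
	uint64_t total_bytes;
	uint64_t bytes_used;	/* logical bytes, mirrors not counted */
	uint64_t disk_used;	/* bytes actually consumed on disk */
};

/* Profiles that keep two copies of every byte use twice the disk space. */
static int mirror_factor(uint64_t flags)
{
	if (flags & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
		     BTRFS_BLOCK_GROUP_RAID10))
		return 2;
	return 1;
}

/* Account one block group into the shared space_info. */
static void update_space_info_sketch(struct space_info_sketch *si,
				     uint64_t flags, uint64_t total_bytes,
				     uint64_t bytes_used)
{
	assert(si);
	si->flags = flags & BLOCK_GROUP_TYPE_MASK;
	si->total_bytes += total_bytes;
	si->bytes_used += bytes_used;
	si->disk_used += bytes_used * mirror_factor(flags);
}
```

Because the RAID bits are stripped before the lookup key is formed, a
RAID1 data block group and a single-copy data block group land in the same
space_info, so reserved-space counters stay in one place when the
allocation profile changes.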
diff -urp 1/fs/btrfs/ctree.h 2/fs/btrfs/ctree.h
--- 1/fs/btrfs/ctree.h  2010-04-14 14:49:56.399956135 +0800
+++ 2/fs/btrfs/ctree.h  2010-04-18 08:12:22.086699485 +0800
@@ -662,6 +662,7 @@ struct btrfs_csum_item {
 #define BTRFS_BLOCK_GROUP_RAID1    (1 << 4)
 #define BTRFS_BLOCK_GROUP_DUP      (1 << 5)
 #define BTRFS_BLOCK_GROUP_RAID10   (1 << 6)
+#define BTRFS_NR_RAID_TYPES        5
 
 struct btrfs_block_group_item {
__le64 used;
@@ -673,7 +674,8 @@ struct btrfs_space_info {
u64 flags;
 
u64 total_bytes;/* total bytes in the space */
-   u64 bytes_used; /* total bytes used on disk */
+   u64 bytes_used; /* total bytes used,
+  this does't take mirrors into account */
u64 bytes_pinned;   /* total bytes pinned, will be freed when the
   transaction finishes */
u64 bytes_reserved; /* total bytes the allocator has reserved for
@@ -686,6 +688,7 @@ struct btrfs_space_info {
   delalloc/allocations */
u64 bytes_delalloc; /* number of bytes currently reserved for
   delayed allocation */
+   u64 disk_used;  /* total bytes used on disk */
 
int full;   /* indicates that we cannot allocate any more
   chunks for this space */
@@ -703,7 +706,7 @@ struct btrfs_space_info {
int flushing;
 
/* for block groups in our same type */
-   struct list_head block_groups;
+   struct list_head block_groups[BTRFS_NR_RAID_TYPES];
spinlock_t lock;
struct rw_semaphore groups_sem;
atomic_t caching_threads;
diff -urp 1/fs/btrfs/extent-tree.c 2/fs/btrfs/extent-tree.c
--- 1/fs/btrfs/extent-tree.c    2010-04-14 14:49:56.932956992 +0800
+++ 2/fs/btrfs/extent-tree.c    2010-04-18 08:12:22.092698714 +0800
@@ -512,6 +509,9 @@ static struct btrfs_space_info *__find_s
struct list_head *head = &info->space_info;
struct btrfs_space_info *found;
 
+   flags &= BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_SYSTEM |
+BTRFS_BLOCK_GROUP_METADATA;
+
rcu_read_lock();
list_for_each_entry_rcu(found, head, list) {
if (found->flags == flags) {
@@ -2659,12 +2659,21 @@ static int update_space_info(struct btrf
 struct btrfs_space_info **space_info)
 {
struct btrfs_space_info *found;
+   int i;
+   int factor;
+
+   if (flags & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
+BTRFS_BLOCK_GROUP_RAID10))
+   factor = 2;
+   else
+   factor = 1;
 
found = __find_space_info(info, flags);
if (found) {
spin_lock(&found->lock);
found->total_bytes += total_bytes;
found->bytes_used += bytes_used;
+   found->disk_used += bytes_used * factor;
found->full = 0;
spin_unlock(&found->lock);
*space_info = found;
@@ -2674,14 +2683,18 @@ static int update_space_info(struct btrf
if (!found)
return -ENOMEM;
 
-   INIT_LIST_HEAD(&found->block_groups);
+   for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
+   INIT_LIST_HEAD(&found->block_groups[i]);
init_rwsem(&found->groups_sem);
init_waitqueue_head(&found->flush_wait);
init_waitqueue_head(&found->allocate_wait);
spin_lock_init(&found->lock);
-   found->flags = flags;
+   found->flags = flags & (BTRFS_BLOCK_GROUP_DATA |
+   BTRFS_BLOCK_GROUP_SYSTEM |
+   BTRFS_BLOCK_GROUP_METADATA);
found->total_bytes = total_bytes;
found->bytes_used = bytes_used;
+   found->disk_used = bytes_used * factor;
found->bytes_pinned = 0;
found->bytes_reserved = 0;
found->bytes_readonly = 0;
@@ -2751,26 +2764,32 @@ u64 btrfs_reduce_alloc_profile(struct bt
return flags;
 }
 
-static u64 btrfs_get_alloc_profile(struct btrfs_root *root, u64 data)
+static u64 get_alloc_profile(struct btrfs_root *root, u64 flags)
 {
-   struct btrfs_fs_info *info = root->fs_info;
-   u64 alloc_profile;
+   if (flags & BTRFS_BLOCK_GROUP_DATA)
+   flags |= root->fs_info->avail_data_alloc_bits &
+root->fs_info->data_alloc_profile;
+   else if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
+   flags |= root->fs_info->avail_system_alloc_bits &
+root->fs_info->system_alloc_profile;
+   else if (flags & BTRFS_BLOCK_GROUP_METADATA)
+   flags |= root->fs_info->avail_metadata_alloc_bits &
+root

[PATCH 02/12] Btrfs: Kill allocate_wait in space_info

2010-04-19 Thread Yan, Zheng
We already have fs_info->chunk_mutex to avoid concurrent
chunk creation.

Signed-off-by: Yan Zheng <zheng@oracle.com>

---
diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h
--- 2/fs/btrfs/ctree.h  2010-04-18 08:12:22.086699485 +0800
+++ 3/fs/btrfs/ctree.h  2010-04-18 08:13:15.457699211 +0800
@@ -700,9 +700,7 @@ struct btrfs_space_info {
struct list_head list;
 
/* for controlling how we free up space for allocations */
-   wait_queue_head_t allocate_wait;
wait_queue_head_t flush_wait;
-   int allocating_chunk;
int flushing;
 
/* for block groups in our same type */
diff -urp 2/fs/btrfs/extent-tree.c 3/fs/btrfs/extent-tree.c
--- 2/fs/btrfs/extent-tree.c    2010-04-18 08:12:22.092698714 +0800
+++ 3/fs/btrfs/extent-tree.c    2010-04-18 08:13:15.463699138 +0800
@@ -70,6 +70,9 @@ static int find_next_key(struct btrfs_pa
 struct btrfs_key *key);
 static void dump_space_info(struct btrfs_space_info *info, u64 bytes,
int dump_block_groups);
+static int maybe_allocate_chunk(struct btrfs_trans_handle *trans,
+   struct btrfs_root *root,
+   struct btrfs_space_info *sinfo, u64 num_bytes);
 
 static noinline int
 block_group_cache_done(struct btrfs_block_group_cache *cache)
@@ -2687,7 +2690,6 @@ static int update_space_info(struct btrf
INIT_LIST_HEAD(&found->block_groups[i]);
init_rwsem(&found->groups_sem);
init_waitqueue_head(&found->flush_wait);
-   init_waitqueue_head(&found->allocate_wait);
spin_lock_init(&found->lock);
found->flags = flags & (BTRFS_BLOCK_GROUP_DATA |
BTRFS_BLOCK_GROUP_SYSTEM |
@@ -3000,71 +3002,6 @@ flush:
wake_up(&info->flush_wait);
 }
 
-static int maybe_allocate_chunk(struct btrfs_root *root,
-struct btrfs_space_info *info)
-{
-   struct btrfs_super_block *disk_super = &root->fs_info->super_copy;
-   struct btrfs_trans_handle *trans;
-   bool wait = false;
-   int ret = 0;
-   u64 min_metadata;
-   u64 free_space;
-
-   free_space = btrfs_super_total_bytes(disk_super);
-   /*
-* we allow the metadata to grow to a max of either 10gb or 5% of the
-* space in the volume.
-*/
-   min_metadata = min((u64)10 * 1024 * 1024 * 1024,
-div64_u64(free_space * 5, 100));
-   if (info->total_bytes >= min_metadata) {
-   spin_unlock(&info->lock);
-   return 0;
-   }
-
-   if (info->full) {
-   spin_unlock(&info->lock);
-   return 0;
-   }
-
-   if (!info->allocating_chunk) {
-   info->force_alloc = 1;
-   info->allocating_chunk = 1;
-   } else {
-   wait = true;
-   }
-
-   spin_unlock(&info->lock);
-
-   if (wait) {
-   wait_event(info->allocate_wait,
-  !info->allocating_chunk);
-   return 1;
-   }
-
-   trans = btrfs_start_transaction(root, 1);
-   if (!trans) {
-   ret = -ENOMEM;
-   goto out;
-   }
-
-   ret = do_chunk_alloc(trans, root->fs_info->extent_root,
-4096 + 2 * 1024 * 1024,
-info->flags, 0);
-   btrfs_end_transaction(trans, root);
-   if (ret)
-   goto out;
-out:
-   spin_lock(&info->lock);
-   info->allocating_chunk = 0;
-   spin_unlock(&info->lock);
-   wake_up(&info->allocate_wait);
-
-   if (ret)
-   return 0;
-   return 1;
-}
-
 /*
  * Reserve metadata space for delalloc.
  */
@@ -3105,7 +3042,8 @@ again:
flushed++;
 
if (flushed == 1) {
-   if (maybe_allocate_chunk(root, meta_sinfo))
+   if (maybe_allocate_chunk(NULL, root, meta_sinfo,
+num_bytes))
goto again;
flushed++;
} else {
@@ -3220,7 +3158,8 @@ again:
if (used > meta_sinfo->total_bytes) {
retries++;
if (retries == 1) {
-   if (maybe_allocate_chunk(root, meta_sinfo))
+   if (maybe_allocate_chunk(NULL, root, meta_sinfo,
+num_bytes))
goto again;
retries++;
} else {
@@ -3417,13 +3356,28 @@ static void force_metadata_allocation(st
rcu_read_unlock();
 }
 
+static int should_alloc_chunk(struct btrfs_space_info *sinfo,
+ u64 alloc_bytes)
+{
+   u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
+
+   if (sinfo->bytes_used + sinfo->bytes_reserved +
+   alloc_bytes + 256 * 1024 * 1024 < num_bytes)
+   return 0;
+
+   if (sinfo->bytes_used + sinfo

[PATCH 05/12] Btrfs: Introduce contexts for metadata reservation

2010-04-19 Thread Yan, Zheng
Introducing contexts for metadata reservation has two major
advantages. First, it makes metadata reservation more traceable.
Second, a context can reclaim freed space and re-add it to itself
after the transaction is committed.

Signed-off-by: Yan Zheng <zheng@oracle.com>

---
diff -urp 5/fs/btrfs/ctree.c 6/fs/btrfs/ctree.c
--- 5/fs/btrfs/ctree.c  2010-04-14 14:49:56.34295 +0800
+++ 6/fs/btrfs/ctree.c  2010-04-18 10:22:08.215948795 +0800
@@ -279,7 +279,8 @@ int btrfs_block_can_be_shared(struct btr
 static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
   struct btrfs_root *root,
   struct extent_buffer *buf,
-  struct extent_buffer *cow)
+  struct extent_buffer *cow,
+  int *last_ref)
 {
u64 refs;
u64 owner;
@@ -365,6 +366,7 @@ static noinline int update_ref_for_cow(s
BUG_ON(ret);
}
clean_tree_block(trans, root, buf);
+   *last_ref = 1;
}
return 0;
 }
@@ -391,6 +393,7 @@ static noinline int __btrfs_cow_block(st
struct btrfs_disk_key disk_key;
struct extent_buffer *cow;
int level;
+   int last_ref = 0;
int unlock_orig = 0;
u64 parent_start;
 
@@ -441,7 +444,7 @@ static noinline int __btrfs_cow_block(st
(unsigned long)btrfs_header_fsid(cow),
BTRFS_FSID_SIZE);
 
-   update_ref_for_cow(trans, root, buf, cow);
+   update_ref_for_cow(trans, root, buf, cow, &last_ref);
 
if (buf == root->node) {
WARN_ON(parent && parent != buf);
@@ -456,8 +459,8 @@ static noinline int __btrfs_cow_block(st
extent_buffer_get(cow);
spin_unlock(&root->node_lock);
 
-   btrfs_free_tree_block(trans, root, buf->start, buf->len,
-   parent_start, root->root_key.objectid, level);
+   btrfs_free_tree_block(trans, root, buf, parent_start,
+ last_ref);
free_extent_buffer(buf);
add_root_to_dirty_list(root);
} else {
@@ -472,8 +475,8 @@ static noinline int __btrfs_cow_block(st
btrfs_set_node_ptr_generation(parent, parent_slot,
  trans->transid);
btrfs_mark_buffer_dirty(parent);
-   btrfs_free_tree_block(trans, root, buf->start, buf->len,
-   parent_start, root->root_key.objectid, level);
+   btrfs_free_tree_block(trans, root, buf, parent_start,
+ last_ref);
}
if (unlock_orig)
btrfs_tree_unlock(buf);
@@ -948,6 +951,22 @@ int btrfs_bin_search(struct extent_buffe
return bin_search(eb, key, level, slot);
 }
 
+static void root_add_used(struct btrfs_root *root, u32 size)
+{
+   spin_lock(&root->accounting_lock);
+   btrfs_set_root_used(&root->root_item,
+   btrfs_root_used(&root->root_item) + size);
+   spin_unlock(&root->accounting_lock);
+}
+
+static void root_sub_used(struct btrfs_root *root, u32 size)
+{
+   spin_lock(&root->accounting_lock);
+   btrfs_set_root_used(&root->root_item,
+   btrfs_root_used(&root->root_item) - size);
+   spin_unlock(&root->accounting_lock);
+}
+
 /* given a node and slot number, this reads the blocks it points to.  The
  * extent buffer is returned with a reference taken (but unlocked).
  * NULL is returned on error.
@@ -1018,7 +1037,11 @@ static noinline int balance_level(struct
btrfs_tree_lock(child);
btrfs_set_lock_blocking(child);
ret = btrfs_cow_block(trans, root, child, mid, 0, child);
-   BUG_ON(ret);
+   if (ret) {
+   btrfs_tree_unlock(child);
+   free_extent_buffer(child);
+   goto enospc;
+   }
 
spin_lock(&root->node_lock);
root->node = child;
@@ -1033,11 +1056,12 @@ static noinline int balance_level(struct
btrfs_tree_unlock(mid);
/* once for the path */
free_extent_buffer(mid);
-   ret = btrfs_free_tree_block(trans, root, mid->start, mid->len,
-   0, root->root_key.objectid, level);
+
+   root_sub_used(root, mid->len);
+   btrfs_free_tree_block(trans, root, mid, 0, 1);
/* once for the root ptr */
free_extent_buffer(mid);
-   return ret;
+   return 0;
}
if (btrfs_header_nritems(mid) >
BTRFS_NODEPTRS_PER_BLOCK(root) / 4)
@@ -1087,23 +,16 @@ static noinline int balance_level(struct
if (wret < 0

[PATCH 04/12] Btrfs: Kill init_btrfs_i()

2010-04-19 Thread Yan, Zheng
All code in init_btrfs_i() can be moved into btrfs_alloc_inode().

Signed-off-by: Yan Zheng <zheng@oracle.com>

---
diff -urp 4/fs/btrfs/inode.c 5/fs/btrfs/inode.c
--- 4/fs/btrfs/inode.c  2010-04-18 08:13:48.183698829 +0800
+++ 5/fs/btrfs/inode.c  2010-04-18 10:59:07.534719917 +0800
@@ -3595,40 +3595,10 @@ again:
return 0;
 }
 
-static noinline void init_btrfs_i(struct inode *inode)
-{
-   struct btrfs_inode *bi = BTRFS_I(inode);
-
-   bi->generation = 0;
-   bi->sequence = 0;
-   bi->last_trans = 0;
-   bi->last_sub_trans = 0;
-   bi->logged_trans = 0;
-   bi->delalloc_bytes = 0;
-   bi->reserved_bytes = 0;
-   bi->disk_i_size = 0;
-   bi->flags = 0;
-   bi->index_cnt = (u64)-1;
-   bi->last_unlink_trans = 0;
-   bi->ordered_data_close = 0;
-   bi->force_compress = 0;
-   extent_map_tree_init(&BTRFS_I(inode)->extent_tree, GFP_NOFS);
-   extent_io_tree_init(&BTRFS_I(inode)->io_tree,
-inode->i_mapping, GFP_NOFS);
-   extent_io_tree_init(&BTRFS_I(inode)->io_failure_tree,
-inode->i_mapping, GFP_NOFS);
-   INIT_LIST_HEAD(&BTRFS_I(inode)->delalloc_inodes);
-   INIT_LIST_HEAD(&BTRFS_I(inode)->ordered_operations);
-   RB_CLEAR_NODE(&BTRFS_I(inode)->rb_node);
-   btrfs_ordered_inode_tree_init(&BTRFS_I(inode)->ordered_tree);
-   mutex_init(&BTRFS_I(inode)->log_mutex);
-}
-
 static int btrfs_init_locked_inode(struct inode *inode, void *p)
 {
struct btrfs_iget_args *args = p;
inode->i_ino = args->ino;
-   init_btrfs_i(inode);
BTRFS_I(inode)->root = args->root;
btrfs_set_inode_space_info(args->root, inode);
return 0;
@@ -3691,8 +3661,6 @@ static struct inode *new_simple_dir(stru
if (!inode)
return ERR_PTR(-ENOMEM);
 
-   init_btrfs_i(inode);
-
BTRFS_I(inode)->root = root;
memcpy(&BTRFS_I(inode)->location, key, sizeof(*key));
BTRFS_I(inode)->dummy_inode = 1;
@@ -4091,7 +4059,6 @@ static struct inode *btrfs_new_inode(str
 * btrfs_get_inode_index_count has an explanation for the magic
 * number
 */
-   init_btrfs_i(inode);
BTRFS_I(inode)->index_cnt = 2;
BTRFS_I(inode)->root = root;
BTRFS_I(inode)->generation = trans->transid;
@@ -5262,21 +5229,46 @@ unsigned long btrfs_force_ra(struct addr
 struct inode *btrfs_alloc_inode(struct super_block *sb)
 {
struct btrfs_inode *ei;
+   struct inode *inode;
 
ei = kmem_cache_alloc(btrfs_inode_cachep, GFP_NOFS);
if (!ei)
return NULL;
+
+   ei->root = NULL;
+   ei->space_info = NULL;
+   ei->generation = 0;
+   ei->sequence = 0;
ei->last_trans = 0;
ei->last_sub_trans = 0;
ei->logged_trans = 0;
+   ei->delalloc_bytes = 0;
+   ei->reserved_bytes = 0;
+   ei->disk_i_size = 0;
+   ei->flags = 0;
+   ei->index_cnt = (u64)-1;
+   ei->last_unlink_trans = 0;
+
+   spin_lock_init(&ei->accounting_lock);
ei->outstanding_extents = 0;
ei->reserved_extents = 0;
-   ei->root = NULL;
-   spin_lock_init(&ei->accounting_lock);
+
+   ei->ordered_data_close = 0;
+   ei->dummy_inode = 0;
+   ei->force_compress = 0;
+
+   inode = &ei->vfs_inode;
+   extent_map_tree_init(&ei->extent_tree, GFP_NOFS);
+   extent_io_tree_init(&ei->io_tree, &inode->i_data, GFP_NOFS);
+   extent_io_tree_init(&ei->io_failure_tree, &inode->i_data, GFP_NOFS);
+   mutex_init(&ei->log_mutex);
btrfs_ordered_inode_tree_init(&ei->ordered_tree);
INIT_LIST_HEAD(&ei->i_orphan);
+   INIT_LIST_HEAD(&ei->delalloc_inodes);
INIT_LIST_HEAD(&ei->ordered_operations);
-   return &ei->vfs_inode;
+   RB_CLEAR_NODE(&ei->rb_node);
+
+   return inode;
 }
 
 void btrfs_destroy_inode(struct inode *inode)
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/12] Btrfs: Kill allocate_wait in space_info

2010-04-19 Thread Yan, Zheng
On Mon, Apr 19, 2010 at 9:57 PM, Josef Bacik <jo...@redhat.com> wrote:
> The purpose of maybe_allocate_chunk was that there is no way to know if some
> other CPU is currently trying to allocate a chunk for the given space info.
> We could have two CPUs come into do_chunk_alloc at relatively the same time
> and end up allocating twice the amount of space, which is why I did the
> waitqueue thing.  It seems like this is still a possibility with your patch.
> Thanks,

This is impossible because the very first thing do_chunk_alloc does is
lock the chunk_mutex.

Yan, Zheng


Re: [PATCH 02/12] Btrfs: Kill allocate_wait in space_info

2010-04-19 Thread Yan, Zheng
On Mon, Apr 19, 2010 at 10:48 PM, Josef Bacik <jo...@redhat.com> wrote:
> On Mon, Apr 19, 2010 at 10:46:12PM +0800, Yan, Zheng wrote:
>> On Mon, Apr 19, 2010 at 9:57 PM, Josef Bacik <jo...@redhat.com> wrote:
>> > The purpose of maybe_allocate_chunk was that there is no way to know if
>> > some other CPU is currently trying to allocate a chunk for the given
>> > space info.  We could have two CPUs come into do_chunk_alloc at
>> > relatively the same time and end up allocating twice the amount of
>> > space, which is why I did the waitqueue thing.  It seems like this is
>> > still a possibility with your patch.  Thanks,
>>
>> This is impossible because the very first thing do_chunk_alloc does is
>> lock the chunk_mutex.
>>
>
> Sure, that just means we don't get two things creating chunks at the same
> time, but not from creating them one right after another.  So CPU 0 and 1
> come in to the check free space stuff, realize they need to allocate a
> chunk, and race to call do_chunk_alloc.  One of them wins, and the other
> blocks on the chunk_mutex lock.  When the first finishes the other one is
> able to continue and do what it was originally going to do, and then you
> get two chunks when you really only wanted one.  Thanks,


There is a check in do_chunk_alloc, so the later caller will do nothing
if the first call adds enough space.
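
The "check after taking the lock" argument above can be sketched in user-space
C. This is only an illustration under stated assumptions: struct space_pool,
pool_chunk_alloc() and CHUNK_SIZE are hypothetical names, a pthread mutex
stands in for chunk_mutex, and none of it is the actual btrfs code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Illustrative chunk size; the real allocator picks sizes differently. */
#define CHUNK_SIZE (64ULL << 20)

/* Hypothetical stand-in for a space_info; not the btrfs structure. */
struct space_pool {
	pthread_mutex_t chunk_mutex;
	uint64_t total_bytes;
	uint64_t used_bytes;
};

/*
 * Returns 1 if a chunk was added, 0 if the recheck found enough space.
 * The free-space test runs only after the mutex is held, so a caller
 * that lost the race sees the winner's chunk and does nothing.
 */
static int pool_chunk_alloc(struct space_pool *pool, uint64_t needed)
{
	int allocated = 0;

	pthread_mutex_lock(&pool->chunk_mutex);
	assert(pool->used_bytes <= pool->total_bytes);
	if (pool->total_bytes - pool->used_bytes < needed) {
		pool->total_bytes += CHUNK_SIZE;
		allocated = 1;
	}
	pthread_mutex_unlock(&pool->chunk_mutex);
	return allocated;
}
```

Two callers that both decided they need space end up serialized on the mutex;
the second one re-evaluates the condition and returns without allocating.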

Yan Zheng


Re: kernel BUG at /home/mafra/linux-2.6/fs/btrfs/volumes.c:2828!

2010-04-08 Thread Yan, Zheng
On Fri, Apr 9, 2010 at 6:23 AM, Carlos R. Mafra crmaf...@gmail.com wrote:
 I've just got this bug in the latest 2.6.34-rc3-00388 kernel.

 I wasn't doing anything fancy, just doing some 'git log' in a small wmaker
 repo I have here. After I got this bug the cpu went to 100% and I had to
 reboot with the button because the laptop was not responding to any commands
 (but the mouse was moving and X seemed ok)

 I have the full dmesg (64 KB) which I can upload somewhere if needed. But
 for now I just would like to know if this bug is known to people in the list.

 [ 8150.547096] [ cut here ]
 [ 8150.547101] kernel BUG at /home/mafra/linux-2.6/fs/btrfs/volumes.c:2828!
 [ 8150.547103] invalid opcode:  [#1] SMP
 [ 8150.547107] last sysfs file: 
 /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
 [ 8150.547109] CPU 1
 [ 8150.547111] Modules linked in: ppp_deflate ppp_async ppp_generic slhc 
 option usbserial snd_seq snd_seq_device uvcvideo snd_hda_codec_idt snd_hda_in 
                  tel iwlagn sky2 snd_hda_codec ehci_hcd snd_hwdep i2c_i801 
 evdev
 [ 8150.547125]
 [ 8150.547128] Pid: 1606, comm: btrfs-delalloc- Not tainted 
 2.6.34-rc3-00388-gf5284e7 #48 VAIO/VGN-FZ240E
 [ 8150.547131] RIP: 0010:[811b39a9]  [811b39a9] 
 btrfs_rmap_block+0x2c9/0x2d0
 [ 8150.547139] RSP: 0018:88007aeab7f0  EFLAGS: 00010246
 [ 8150.547142] RAX:  RBX: 0001 RCX: 
 
 [ 8150.547144] RDX: 001e821d RSI: 001e821d RDI: 
 88007ef1fa28
 [ 8150.547147] RBP: 88007aeab870 R08:  R09: 
 88007aeab8a8
 [ 8150.547149] R10:  R11: 88005ae73580 R12: 
 
 [ 8150.547151] R13: 88007ae9c0e8 R14: 88007aeab8ac R15: 
 88007aeab8a8
 [ 8150.547154] FS:  () GS:880001b0() 
 knlGS:
 [ 8150.547157] CS:  0010 DS:  ES:  CR0: 8005003b
 [ 8150.547159] CR2: 7f4adf3d9000 CR3: 01812000 CR4: 
 06e0
 [ 8150.547161] DR0:  DR1:  DR2: 
 
 [ 8150.547164] DR3:  DR6: 0ff0 DR7: 
 0400
 [ 8150.547166] Process btrfs-delalloc- (pid: 1606, threadinfo 
 88007aeaa000, task 88007f330280)
 [ 8150.547168] Stack:
 [ 8150.547170]  09b8 8801 88007acf7000 
 00ff811a0eb5
 [ 8150.547173] 0 0001 220240cc 88007aeaa000 
 88007aeab8a8
 [ 8150.547178] 0 88007aeab8a0 001e821d 88007aeab860 
 88007ae6fd80
 [ 8150.547182] Call Trace:
 [ 8150.547188]  [8117bdb1] exclude_super_stripes+0x61/0x110
 [ 8150.547193]  [8119dc30] ? __tree_search+0x90/0x120
 [ 8150.547196]  [8117c906] btrfs_make_block_group+0x136/0x320
 [ 8150.547200]  [811b6fd6] __btrfs_alloc_chunk+0x616/0x740
 [ 8150.547203]  [811b716f] btrfs_alloc_chunk+0x6f/0xa0
 [ 8150.547207]  [8117d546] do_chunk_alloc+0x166/0x230
 [ 8150.547210]  [8117f7d9] find_free_extent+0x939/0x9f0
 [ 8150.547214]  [8117fab5] btrfs_reserve_extent+0xc5/0x1b0
 [ 8150.547218]  [814f3559] ? mutex_lock+0x19/0x50
 [ 8150.547227]  [8119572d] submit_compressed_extents+0x10d/0x3f0
 [ 8150.547231]  [81195a8f] async_cow_submit+0x7f/0x90
 [ 8150.547234]  [811b80c6] run_ordered_completions+0x76/0xe0
 [ 8150.547237]  [811b898a] worker_loop+0x15a/0x580
 [ 8150.547240]  [811b8830] ? worker_loop+0x0/0x580
 [ 8150.547243]  [8105b6ce] kthread+0x8e/0xa0
 [ 8150.547248]  [81003ad4] kernel_thread_helper+0x4/0x10
 [ 8150.547250]  [8105b640] ? kthread+0x0/0xa0
 [ 8150.547259]  [81003ad0] ? kernel_thread_helper+0x0/0x10
 [ 8150.547261] Code: 00 00 48 c7 c7 68 ce 75 81 e8 34 ab e8 ff 4c 8b 5d 90 4c 
 8b 55 98 4c 8b 4d b0 48 8b 4d a0 48 8b 45 a8 e9 af fe ff ff 0f 0b eb fe       
             0f 0b eb fe 0f 1f 00 55 48 89 e5 41 55 49 89 d5 41 54 49 89 f4
 [ 8150.547292] RIP  [811b39a9] btrfs_rmap_block+0x2c9/0x2d0
 [ 8150.547296]  RSP 88007aeab7f0
 [ 8150.547310] ---[ end trace 6c5e0e4f829c2aeb ]---

This has been fixed by commit
http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable.git;a=commit;h=9f680ce04ea19dabbbafe01b57b61930a9b70741

Yan, Zheng


Re: [PATCH] Btrfs: fix chunk allocate size calculation

2010-04-06 Thread Yan, Zheng
On Thu, Mar 18, 2010 at 4:45 AM, Josef Bacik <jo...@redhat.com> wrote:
> If the amount of free space left in a device is less than what we think
> should be the minimum size, just ignore the minimum size and use the amount
> we have.  I ran into this running tests on a 600mb volume, the chunk
> allocator wouldn't let me allocate the last 52mb of the disk for data
> because we want to have at least 64mb chunks for data.  This patch fixes
> that problem.  Thanks,
>
> Signed-off-by: Josef Bacik <jo...@redhat.com>
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9df8e3f..1c5b5ba 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2244,8 +2244,10 @@ again:
>                 do_div(calc_size, stripe_len);
>                 calc_size *= stripe_len;
>         }
> +
>         /* we don't want tiny stripes */
> -       calc_size = max_t(u64, min_stripe_size, calc_size);
> +       if (!looped)
> +               calc_size = max_t(u64, min_stripe_size, calc_size);
>
>         do_div(calc_size, stripe_len);
>         calc_size *= stripe_len;

I encountered an Oops caused by 'calc_size == 0'. It's likely introduced
by this change. (calc_size can be zero after calling do_div)
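
A small sketch of why calc_size can reach zero: the do_div()/multiply pair in
the quoted hunk rounds calc_size down to a whole number of stripes, so any
value smaller than stripe_len becomes 0 once the max_t() floor is skipped.
round_down_to_stripe() is a hypothetical helper written for this
illustration, and the 64 KiB stripe length in the usage note is an assumed
value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round calc_size down to a multiple of stripe_len, which is the net
 * effect of "do_div(calc_size, stripe_len); calc_size *= stripe_len;".
 */
static uint64_t round_down_to_stripe(uint64_t calc_size, uint64_t stripe_len)
{
	assert(stripe_len != 0);
	return (calc_size / stripe_len) * stripe_len;
}
```

With a 64 KiB stripe, round_down_to_stripe(60000, 65536) yields 0, while a
52 MiB request is unchanged because it is already a multiple of the stripe.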


Re: Oops on btrfs filesystem balance

2010-03-25 Thread Yan, Zheng
On Thu, Mar 25, 2010 at 9:06 PM, Kirill A. Shutemov
kir...@shutemov.name wrote:
 On lastest Linus' git.

 [ 4005.426805] BUG: unable to handle kernel NULL pointer dereference at 
 0021
 [ 4005.426818] IP: [c109a130] page_cache_sync_readahead+0x18/0x3e
 [ 4005.426837] *pde = 
 [ 4005.426844] Oops:  [#1] PREEMPT SMP
 [ 4005.426854] last sysfs file:
 /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:00/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/energy_full
 [ 4005.426864] Modules linked in: btrfs zlib_deflate crc32c libcrc32c
 loop coretemp ext2 arc4 ecb iwlagn iwlcore snd_hda_codec_conexant
 snd_hda_intel mac80211 snd_hda_codec snd_hwdep snd_pcm snd_timer snd
 uvcvideo e1000e rtc_cmos rtc_core cdc_ether videodev uhci_hcd usbnet
 sg snd_page_alloc video thinkpad_acpi cdc_acm rtc_lib v4l1_compat mii
 output ext3 jbd usbhid sd_mod sha256_generic cbc ata_piix ehci_hcd
 aes_i586 aes_generic libata dm_crypt usbcore scsi_mod nls_base dm_mod
 [ 4005.426971]
 [ 4005.426979] Pid: 25838, comm: btrfs Not tainted 2.6.34-rc2 #67
 2767BC8/2767BC8
 [ 4005.426987] EIP: 0060:[c109a130] EFLAGS: 00010206 CPU: 0
 [ 4005.426996] EIP is at page_cache_sync_readahead+0x18/0x3e
 [ 4005.427002] EAX: f58dcb84 EBX:  ECX:  EDX: f45efe40
 [ 4005.427009] ESI: 00033b43 EDI: f58dcad4 EBP: f4b61ce0 ESP: f4b61cd8
 [ 4005.427010]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
 [ 4005.427010] Process btrfs (pid: 25838, ti=f4b6 task=f6680a60
 task.ti=f4b6)
 [ 4005.427010] Stack:
 [ 4005.427010]  41c1 0001 f4b61d50 f9443902  00033b43
 f93fc3dc f6bf4d80
 [ 4005.427010] 0 f4cc74d0 41c1 0001 f58dcb4c 00033b42
 f58dc9e0 f72e7600 f4b61d2c
 [ 4005.427010] 0 f45efe40   00033b43 41c0
 0001  
 [ 4005.427010] Call Trace:
 [ 4005.427010]  [f9443902] ? relocate_file_extent_cluster+0x195/0x3bd 
 [btrfs]
 [ 4005.427010]  [f93fc3dc] ? btrfs_release_path+0x39/0x4a [btrfs]
 [ 4005.427010]  [f9444bd2] ? relocate_block_group+0x2be/0x32a [btrfs]
 [ 4005.427010]  [f9411dd3] ? btrfs_clean_old_snapshots+0x66/0xd9 [btrfs]
 [ 4005.427010]  [f9444d87] ? btrfs_relocate_block_group+0x149/0x2e3 [btrfs]
 [ 4005.427010]  [f942eecc] ? btrfs_relocate_chunk+0x5c/0x423 [btrfs]
 [ 4005.427010]  [c10217cc] ? kmap_atomic+0x13/0x15
 [ 4005.427010]  [f9428f32] ? map_private_extent_buffer+0x94/0xb6 [btrfs]
 [ 4005.427010]  [f9428fa3] ? map_extent_buffer+0x4f/0x7f [btrfs]
 [ 4005.427010]  [c10216d3] ? kunmap_atomic+0x6c/0x83
 [ 4005.427010]  [f9428aca] ? unmap_extent_buffer+0x11/0x13 [btrfs]
 [ 4005.427010]  [f94206dd] ? btrfs_item_offset+0x98/0xa2 [btrfs]
 [ 4005.427010]  [f942f856] ? btrfs_balance+0x20f/0x265 [btrfs]
 [ 4005.427010]  [f9436ab9] ? btrfs_ioctl+0x6ad/0x824 [btrfs]
 [ 4005.427010]  [c10bf8e1] ? __memcg_event_check+0x50/0x72
 [ 4005.427010]  [c11461e2] ? file_has_perm+0x8c/0xa6
 [ 4005.427010]  [c10cf310] ? vfs_ioctl+0x2c/0x96
 [ 4005.427010]  [f943640c] ? btrfs_ioctl+0x0/0x824 [btrfs]
 [ 4005.427010]  [c10cf8ac] ? do_vfs_ioctl+0x48e/0x4cc
 [ 4005.427010]  [c11463ca] ? selinux_file_ioctl+0x43/0x46
 [ 4005.427010]  [c10cf930] ? sys_ioctl+0x46/0x66
 [ 4005.427010]  [c132ae88] ? syscall_call+0x7/0xb
 [ 4005.427010] Code: 8b 48 24 85 c9 74 04 31 d2 ff d1 8d 65 f4 5b 5e
 5f c9 c3 55 89 e5 56 53 0f 1f 44 00 00 89 cb 8b 75 0c 8b 4d 08 83 7a
 0c 00 74 1f f6 43 21 10 74 0b 89 da 56 e8 f5 fc ff ff 5b eb 0e 56 51
 89 d9
 [ 4005.427010] EIP: [c109a130] page_cache_sync_readahead+0x18/0x3e
 SS:ESP 0068:f4b61cd8
 [ 4005.427010] CR2: 0021
 [ 4005.427898] ---[ end trace 0e53ab674cd5bfb9 ]---


The 'filp' parameter for page_cache_sync_readahead is NULL in this case.
Commit 0141450f66c3c12a3aaa869748caa64241885cdf added code that
dereferences 'filp'.

Fengguang, would you please fix this?

Regards
Yan, Zheng


Re: [PATCH] Btrfs: fix ENOSPC accounting when max_extent is not maxed out V2

2010-03-19 Thread Yan, Zheng
On Fri, Mar 19, 2010 at 9:59 PM, Josef Bacik <jo...@redhat.com> wrote:
> On Fri, Mar 19, 2010 at 11:09:25AM +0800, Yan, Zheng wrote:
>> On Thu, Mar 18, 2010 at 11:47 PM, Josef Bacik <jo...@redhat.com> wrote:
>> > A user reported a bug a few weeks back where if he set max_extent=1m
>> > and then did a dd and then stopped it, we would panic.  This is because
>> > I miscalculated how many extents would be needed for splits/merges.
>> > Turns out I didn't actually take max_extent into account properly,
>> > since we only ever add 1 extent for a write, which isn't quite right
>> > for the case that say max_extent is 4k and we do 8k writes.  That would
>> > result in more than 1 extent.  So this patch makes us properly figure
>> > out how many extents are needed for the amount of data that is being
>> > written, and deals with splitting and merging better.  I've tested this
>> > ad nauseam and it works well.  This version doesn't depend on my
>> > per-cpu stuff.  Thanks,
>>
>> Why not remove the max_extent check? The maximum length of a file extent
>> is also affected by the fragmentation level of free space. It doesn't
>> make sense to introduce complex code to address one factor while losing
>> sight of another. I think reserving one unit of metadata for each
>> delalloc extent in the extent IO tree should be OK, because even if a
>> delalloc extent ends up with multiple file extents, these file extents
>> are adjacent in the b-tree.
>>
>
> Do you mean remove the option for max_extent altogether, or just remove all
> of my code for taking it into account?  Thanks,


all of the code for taking max_extent into account


Re: [PATCH] btrfs: btrfs_mark_extent_written writes correctly slot

2010-02-10 Thread Yan, Zheng
On 02/11/2010 03:43 PM, Shaohua Li wrote:
> My test do: fallocate a big file and do write. The file is 512M, but
> after file write is done btrfs-debug-tree shows:
> item 6 key (257 EXTENT_DATA 0) itemoff 3516 itemsize 53
> extent data disk byte 1103101952 nr 536870912
> extent data offset 0 nr 399634432 ram 536870912
> extent compression 0
> Looks like a regression introduced by
> 6c7d54ac87f338c479d9729e8392eca3f76e11e1, where we set wrong slot.
>
> Signed-off-by: Shaohua Li <shaohua...@intel.com>
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 9d08096..6ed434a 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -720,13 +720,15 @@ again:
>  						inode->i_ino, orig_offset);
>  			BUG_ON(ret);
>  		}
> -		fi = btrfs_item_ptr(leaf, path->slots[0],
> -				   struct btrfs_file_extent_item);
>  		if (del_nr == 0) {
> +			fi = btrfs_item_ptr(leaf, path->slots[0],
> +					   struct btrfs_file_extent_item);
>  			btrfs_set_file_extent_type(leaf, fi,
>  						   BTRFS_FILE_EXTENT_REG);
>  			btrfs_mark_buffer_dirty(leaf);
>  		} else {
> +			fi = btrfs_item_ptr(leaf, del_slot - 1,
> +					   struct btrfs_file_extent_item);
>  			btrfs_set_file_extent_type(leaf, fi,
>  						   BTRFS_FILE_EXTENT_REG);
>  			btrfs_set_file_extent_num_bytes(leaf, fi,

Acked-by: Yan Zheng zheng@oracle.com


[PATCH] btrfs: fix race between allocate and release extent buffer.

2010-02-04 Thread Yan, Zheng
Increase extent buffer's reference count while holding the lock.
Otherwise it can race with try_release_extent_buffer.
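
The ordering issue can be sketched in user-space C. This is a minimal model
under stated assumptions, not the btrfs implementation: struct buf, struct
cache, cache_insert() and cache_try_release() are hypothetical names, a
pthread mutex stands in for the buffer_lock spinlock, and a plain int stands
in for the atomic reference count.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical buffer: refs starts at 1 for the allocating caller. */
struct buf {
	int refs;
};

/* Hypothetical one-slot cache standing in for the buffer tree. */
struct cache {
	pthread_mutex_t lock;
	struct buf *slot;
};

/*
 * Publish the buffer and take the cache's reference in the same critical
 * section, so the release path never sees a published buffer that is
 * missing the cache's reference.
 */
static struct buf *cache_insert(struct cache *c, struct buf *b)
{
	pthread_mutex_lock(&c->lock);
	assert(b->refs > 0);
	c->slot = b;
	b->refs++;		/* reference held on behalf of the cache */
	pthread_mutex_unlock(&c->lock);
	return b;
}

/*
 * Release path: drops the buffer only when the cache holds the last
 * reference. Returns 1 if the buffer was released, 0 otherwise.
 */
static int cache_try_release(struct cache *c)
{
	int released = 0;

	pthread_mutex_lock(&c->lock);
	if (c->slot != NULL && c->slot->refs == 1) {
		c->slot = NULL;
		released = 1;
	}
	pthread_mutex_unlock(&c->lock);
	return released;
}
```

If the increment happened after the unlock instead, cache_try_release() could
run in the gap, observe refs == 1, and release a buffer the inserting caller
is still using.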

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 1/fs/btrfs/extent_io.c 2/fs/btrfs/extent_io.c
--- 1/fs/btrfs/extent_io.c  2010-01-17 15:48:16.770302026 +0800
+++ 2/fs/btrfs/extent_io.c  2010-02-04 16:37:45.704800682 +0800
@@ -3165,10 +3165,9 @@ struct extent_buffer *alloc_extent_buffe
 		spin_unlock(&tree->buffer_lock);
 		goto free_eb;
 	}
-	spin_unlock(&tree->buffer_lock);
-
 	/* add one reference for the tree */
 	atomic_inc(&eb->refs);
+	spin_unlock(&tree->buffer_lock);
 	return eb;
 
 free_eb:


Re: [PATCH] btrfs: fix race between allocate and release extent buffer.

2010-02-04 Thread Yan, Zheng
On 02/04/2010 04:46 PM, Yan, Zheng wrote:
> Increase extent buffer's reference count while holding the lock.
> Otherwise it can race with try_release_extent_buffer.
>
> Signed-off-by: Yan Zheng zheng@oracle.com
>
> ---
> diff -urp 1/fs/btrfs/extent_io.c 2/fs/btrfs/extent_io.c
> --- 1/fs/btrfs/extent_io.c	2010-01-17 15:48:16.770302026 +0800
> +++ 2/fs/btrfs/extent_io.c	2010-02-04 16:37:45.704800682 +0800
> @@ -3165,10 +3165,9 @@ struct extent_buffer *alloc_extent_buffe
>  		spin_unlock(&tree->buffer_lock);
>  		goto free_eb;
>  	}
> -	spin_unlock(&tree->buffer_lock);
> -
>  	/* add one reference for the tree */
>  	atomic_inc(&eb->refs);
> +	spin_unlock(&tree->buffer_lock);
>  	return eb;
>
>  free_eb:

The Oops caused by this bug is attached below.

Modules linked in: btrfs ipt_MASQUERADE iptable_nat nf_nat bridge stp 
zlib_deflate libcrc32c llc sunrpc xt_physdev ip6t_REJECT nf_conntrack_ipv6 
ip6table_filter ip6_tables ipv6 p4_clockmod freq_table speedstep_lib 
dm_multipath kvm uinput snd_hda_codec_analog snd_hda_intel snd_hda_codec 
snd_hwdep snd_seq snd_seq_device snd_pcm ppdev parport_pc parport dcdbas 
serio_raw i2c_i801 pcspkr snd_timer snd soundcore iTCO_wdt iTCO_vendor_support 
snd_page_alloc e1000e ata_generic pata_acpi i915 drm_kms_helper drm 
i2c_algo_bit i2c_core video output [last unloaded: freq_table]
Pid: 3302, comm: flush-btrfs-1 Tainted: GW  2.6.32 #1 OptiPlex 755  
   
RIP: 0010:[a0396718]  [a0396718] 
btrfs_set_buffer_uptodate+0x14/0x25 [btrfs]
RSP: 0018:880077e47480  EFLAGS: 00010202
RAX:  RBX: 88003d8a4000 RCX: 
RDX: 0001 RSI: 88003d8a4000 RDI: 88003d8a4000
RBP: 880077e47480 R08: 880001c555c0 R09: 
R10: 880001c55630 R11: 880001c555c0 R12: 88007910eb80
R13: 88007a39c800 R14: 0022 R15: 88007910eb80
FS:  () GS:880001c4() knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2:  CR3: 0a991000 CR4: 06e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process flush-btrfs-1 (pid: 3302, threadinfo 880077e46000, task 
8800796a2e60)
Stack:
880077e474b0 a038c334 88007a39c800 88007a39c9e0
0 1000  880077e47550 a039237b
0 0003 8800288935c0  814627da
Call Trace:
[a038c334] btrfs_init_new_buffer+0x78/0xe9 [btrfs]
[a039237b] btrfs_alloc_free_block+0x1ef/0x1f4 [btrfs]
[814627da] ? sub_preempt_count+0x9/0x83
[a038708e] split_leaf+0x243/0x449 [btrfs]
[814600d2] ? _spin_unlock+0x2a/0x35
[a038826a] btrfs_search_slot+0x45c/0x518 [btrfs]
[a0388e0b] btrfs_insert_empty_items+0x6a/0xbc [btrfs]
[8146285d] ? add_preempt_count+0x9/0x83
[a039effe] insert_inline_extent+0xc0/0x251 [btrfs]
[a03b4eeb] ? extent_clear_unlock_delalloc+0x1c7/0x1e4 [btrfs]
[a039f2a5] cow_file_range_inline+0x116/0x159 [btrfs]
[a039bb6e] ? start_transaction+0x1b8/0x1ea [btrfs]
[a039f384] cow_file_range+0x9c/0x354 [btrfs]
[a03b3dae] ? set_extent_bit+0x390/0x3e8 [btrfs]
[a039fc67] run_delalloc_range+0xb4/0x364 [btrfs]
[a03b6198] ? find_lock_delalloc_range+0x186/0x1a6 [btrfs]
[a03b6343] __extent_writepage+0x18b/0x584 [btrfs]
[811156e5] ? mem_cgroup_add_lru_list+0x81/0x8a
[a03b6b73] extent_write_cache_pages.clone.0+0x155/0x2b1 [btrfs]
[8145e6ab] ? thread_return+0xa8/0xd0
[8104ad22] ? finish_task_switch+0x85/0xa8
[8103fe77] ? need_resched+0x23/0x2d
[a03b6dda] extent_writepages+0x44/0x5a [btrfs]
[a039e608] ? btrfs_get_extent+0x0/0x753 [btrfs]
[81076de8] ? bit_waitqueue+0x17/0xa9
[a039e4da] btrfs_writepages+0x27/0x29 [btrfs]
[810dd8d5] do_writepages+0x21/0x2a
[8113a5e2] writeback_single_inode+0xd1/0x1f6
[8113ade1] writeback_inodes_wb+0x388/0x423
[8113afa4] wb_writeback+0x128/0x1ac
[810b0ded] ? call_rcu_sched+0x15/0x17
[810b0dfd] ? call_rcu+0xe/0x10
[8113b147] wb_do_writeback+0x6e/0x166
[8113b27e] bdi_writeback_task+0x3f/0xaf
[810ecf94] ? bdi_start_fn+0x0/0xd4
[810ed00a] bdi_start_fn+0x76/0xd4
[810ecf94] ? bdi_start_fn+0x0/0xd4
[81076b9c] kthread+0x7f/0x87
[81012dda] child_rip+0xa/0x20
[81076b1d] ? kthread+0x0/0x87
[81012dd0] ? child_rip+0x0/0x20
Code: 00 00 48 81 c7 d0 20 00 00 e8 ad 99 0c e1 5b 41 5c 41 5d 41 5e c9 c3 55 
48 89 e5 0f 1f 44 00 00 48 8b 47 30 48 89 fe 48 8b 40 18 48 8b 38 48 81 ef 78 
01 00 00 e8 0a d7 01 00 c9 c3 55 48 89 e5 
RIP  [a0396718] btrfs_set_buffer_uptodate+0x14/0x25 [btrfs]
RSP 880077e47480
CR2: 

Modules linked

Re: btrfs-vol -b endless loop

2010-02-03 Thread Yan, Zheng
On Thu, Feb 4, 2010 at 6:25 AM, Pär Andersson pa...@lysator.liu.se wrote:
 Hi,

 I have a 11G btrfs file system where btrfs-vol -b get stuck in a
 loop. The found 7 extents messages and heavy disk io continues until I
 reboot:

 [26223.544037] btrfs: relocating block group 42079027200 flags 36
 [26228.082740] btrfs: found 2457 extents
 [26228.104515] btrfs: relocating block group 41407938560 flags 1
 [26228.423981] btrfs: found 124 extents
 [26229.514583] btrfs: found 124 extents
 [26229.544160] btrfs: found 7 extents
 [26229.812494] btrfs: found 7 extents
 [26230.453720] btrfs: found 7 extents
 [26230.733233] btrfs: found 7 extents
 [26231.383321] btrfs: found 7 extents
 [26231.652815] btrfs: found 7 extents
 ...

 The error is repeatable. btrfsck does not report any errors:

 r...@faran:~# btrfsck /dev/sda5
 found 9035055104 bytes used err is 0
 total csum bytes: 8293152
 total tree bytes: 542867456
 total fs tree bytes: 503087104
 btree space waste bytes: 108089491
 file data blocks allocated: 8492187648
  referenced 10314125312
 Btrfs Btrfs v0.19

 The kernel is Linus' master from two days ago, and seems to have all the
 latest btrfs code merged.


The loop is caused by fragmented free space. I'm working on making
'btrfs-vol -b' return -ENOSPC in this case.

Yan, Zheng


[PATCH] btrfs: Fix oopsen when dropping empty tree.

2010-01-31 Thread Yan, Zheng
When dropping an empty tree, walk_down_tree() skips checking
extent information for the tree root. This triggers a
BUG_ON in walk_up_proc().

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 1/fs/btrfs/extent-tree.c 2/fs/btrfs/extent-tree.c
--- 1/fs/btrfs/extent-tree.c	2010-01-22 12:16:34.203525744 +0800
+++ 2/fs/btrfs/extent-tree.c	2010-02-01 10:26:19.865562007 +0800
@@ -5402,10 +5402,6 @@ static noinline int walk_down_tree(struc
 	int ret;
 
 	while (level >= 0) {
-		if (path->slots[level] >=
-		    btrfs_header_nritems(path->nodes[level]))
-			break;
-
 		ret = walk_down_proc(trans, root, path, wc, lookup_info);
 		if (ret > 0)
 			break;
@@ -5413,6 +5409,10 @@ static noinline int walk_down_tree(struc
 		if (level == 0)
 			break;
 
+		if (path->slots[level] >=
+		    btrfs_header_nritems(path->nodes[level]))
+			break;
+
 		ret = do_walk_down(trans, root, path, wc, lookup_info);
 		if (ret > 0) {
 			path->slots[level]++;


[PATCH] btrfsck: Remove superfluous WARN_ON

2010-01-31 Thread Yan, Zheng
Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp btrfs-progs-unstable/btrfsck.c btrfs-progs-2/btrfsck.c
--- btrfs-progs-unstable/btrfsck.c  2009-09-28 15:54:55.980479398 +0800
+++ btrfs-progs-2/btrfsck.c 2010-01-31 09:46:24.645485459 +0800
@@ -581,7 +581,6 @@ again:
 		}
 		ret = insert_existing_cache_extent(dst, &ins->cache);
 		if (ret == -EEXIST) {
-			WARN_ON(src == &src_node->root_cache);
 			conflict = get_inode_rec(dst, rec->ino, 1);
 			merge_inode_recs(rec, conflict, dst);
 			if (rec->checked) {


Re: panic during rebalance, and now upon mount

2010-01-31 Thread Yan, Zheng
On Mon, Feb 1, 2010 at 3:33 AM, Troy Ablan tab...@gmail.com wrote:
 Yan, Zheng wrote:
 Please try the patch attached below. It should solve the bug during
 mounting
 that fs. But I don't know why there are so many link count errors in that fs.
 How old is that fs? what was that fs used for?

 Thank you very much.
 Yan, Zheng


 Good, so far.  Thanks!

 The filesystem is less than 2 weeks old, created and managed exclusively
 with the unstable tools Btrfs v0.19-4-gab8fb4c-dirty

 I created the filesystem -d raid1 -m raid1.

 There are 14 dm-crypt mappings corresponding to 14 partitions on 14
 drives.  There's one filesystem made up from these devices with about 14
 TB of space (a mixture of devices ranging from 500GB to 2TB)

 The filesystem is used for incremental backup from remote computers
 using rsync.

 The filesystem tree is as follows

 /
 /machine1 - normal directory
 /machine1/machine1 - a subvolume
 /machine1/machine1-20100120-1220 - a snapshot of the subvolume above
 
 /machine1/machine1-20100131-1220 - more snapshots of the subvolume above
 /machine2 - normal directory
 /machine2/machine1 - a subvolume
 /machine2/machine2-20100120-1020 - a snapshot of the subvolume above
 
 /machine2/machine2-20100131-1020 - more snapshots of the subvolume above
 

 The files are backed up with `rsync -aH --inplace` onto the subvolume
 for each machine.

 The only oddness I can think of is that during initial testing of this
 filesystem, I yanked a drive physically from the machine while it was
 writing.  btrfs seemed to continue to try to write to the inaccessible
 device, and indeed, btrfs-show showed the used space on the missing
 drive increasing over time.  Also, I was unable to remove the drive from
 the volume (ioctl returned -1), so it was in this state until I rebooted
 a couple hours later.   I then did a btrfs-vol -r missing on the drive,
 and then added it back in as a new device.  I did btrfs-vol -b which
 succeeded once.   After adding more drives, I did btrfs-vol -b again,
 and that left me in the state where this thread began.

 As I understand it, a btrfs-vol -b is currently one of the only ways to
 reduplicate unmirrored chunks after a drive failure. (aside from
 rewriting the data or removing and readding devices).  Is my
 understanding correct?

Yes,

Thanks again for helping debug.

Yan, Zheng


Re: panic during rebalance, and now upon mount

2010-01-30 Thread Yan, Zheng
On Sat, Jan 30, 2010 at 2:05 PM, Troy Ablan tab...@gmail.com wrote:
 Hi folks,

 During a very lengthy btrfs-vol -b (3.5 days in), btrfs BUGged out.
 Upon rebooting and trying to mount that fs, the exact same bug (with the
 exact same call trace) happens.  I moved up to 2.6.33-rc6 from
 gentoo-maintained 2.6.32-r2 to see what would happen, and it appears to
 panic at the equivalent line of the same source file as before.

 Let me know if I can do anything to assist.  I won't do anything to the
 disks for the next few days in case some forensics will be useful.

 [  154.899692] device label bk0 devid 14 transid 34 /dev/mapper/btrn
 [  154.958264] btrfs: use compression
 [  202.394048] [ cut here ]
 [  202.394136] kernel BUG at fs/btrfs/extent-tree.c:5377!
 [  202.394220] invalid opcode:  [#1] SMP
 [  202.394372] last sysfs file:
 /sys/devices/virtual/block/md1/md/metadata_version
 [  202.394500] CPU 5
 [  202.394655] Pid: 5838, comm: btrfs-relocate- Tainted: G        W
 2.6.33-rc6 #1 P55M-GD45 (MS-7588) /MS-7588
 [  202.394787] RIP: 0010:[8129e5ad]  [8129e5ad]
 walk_up_proc+0x37d/0x3c0
 [  202.394955] RSP: 0018:880139729ca0  EFLAGS: 00010282
 [  202.395039] RAX: 0218 RBX: 88013c460300 RCX:
 880139728000
 [  202.395127] RDX: 8800 RSI: fff8 RDI:
 880138ac08e0
 [  202.395214] RBP: 880139729d00 R08: 0008 R09:
 0001
 [  202.395301] R10: 0001 R11: 0001 R12:
 880138ab8880
 [  202.395389] R13:  R14: 88013f72f880 R15:
 88013b646800
 [  202.395476] FS:  () GS:88002834()
 knlGS:
 [  202.395606] CS:  0010 DS:  ES:  CR0: 8005003b
 [  202.395691] CR2: 00425f40 CR3: 018d3000 CR4:
 06e0
 [  202.395778] DR0:  DR1:  DR2:
 
 [  202.395865] DR3:  DR6: 0ff0 DR7:
 0400
 [  202.395953] Process btrfs-relocate- (pid: 5838, threadinfo
 880139728000, task 88013f0e28f0)
 [  202.396083] Stack:
 [  202.396162]  880139729cf0 0002 88013f72f880
 0206
 [  202.397142] 0 880139729d30 880138ac08e0 
 
 [  202.397444] 0 88013c460300 88013f72f880 
 880139728000
 [  202.397856] Call Trace:
 [  202.397937]  [8129e72f] walk_up_tree+0x13f/0x1c0
 [  202.398023]  [8129f99c] btrfs_drop_snapshot+0x21c/0x600
 [  202.398110]  [812a9dd0] ? __btrfs_end_transaction+0x100/0x170
 [  202.398198]  [812e7d7d] merge_func+0x7d/0xc0
 [  202.398284]  [812d25aa] worker_loop+0x17a/0x540
 [  202.398379]  [812d2430] ? worker_loop+0x0/0x540
 [  202.398487]  [812d2430] ? worker_loop+0x0/0x540
 [  202.398611]  [81095936] kthread+0x96/0xa0
 [  202.398697]  [81034bd4] kernel_thread_helper+0x4/0x10
 [  202.398784]  [816ac869] ? restore_args+0x0/0x30
 [  202.398869]  [810958a0] ? kthread+0x0/0xa0
 [  202.398953]  [81034bd0] ? kernel_thread_helper+0x0/0x10
 [  202.399039] Code: 6d db b6 6d 48 c1 f8 03 48 0f af c2 48 ba 00 00 00
 00 00 88 ff ff 48 c1 e0 0c 48 8b 44 10 58 ff 49 1c 48 39 c6 0f 84 ab fd
 ff ff 0f 0b eb fe 0f 1f 80 00 00 00 00 47 8b 4c ae 60 45 85 c9 0f 85
 [  202.401551] RIP  [8129e5ad] walk_up_proc+0x37d/0x3c0
 [  202.401671]  RSP 880139729ca0
 [  202.401796] ---[ end trace 4c085bcc2bd215f6 ]---


Thank you for reporting this. Would you please run btrfsck and mount
that fs again with the debug patch attached below?

Regards
Yan, Zheng

---
diff -urp 1/fs/btrfs/extent-tree.c 2/fs/btrfs/extent-tree.c
--- 1/fs/btrfs/extent-tree.c	2010-01-22 12:16:34.203525744 +0800
+++ 2/fs/btrfs/extent-tree.c	2010-01-30 20:03:23.609292953 +0800
@@ -5373,8 +5373,18 @@ static noinline int walk_up_proc(struct
 			if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF)
 				parent = eb->start;
 			else
-				BUG_ON(root->root_key.objectid !=
-				       btrfs_header_owner(eb));
+				if (root->root_key.objectid !=
+				    btrfs_header_owner(eb)) {
+					printk("root %llu %llu\n",
+					       root->root_key.objectid,
+					       root->root_key.offset);
+					printk("node %llu refs %llu flags %llu owner %llu reloc %d\n",
+					       eb->start, wc->refs[level], wc->flags[level],
+					       btrfs_header_owner(eb),
+					       btrfs_header_flag(eb, BTRFS_HEADER_FLAG_RELOC));
+
+					BUG();
+				}
 		} else {
 			if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF

Re: [PATCH]btrfs: avoid comparing with NULL pointer

2010-01-27 Thread Yan, Zheng
2010/1/27 Liuwenyi <qingshen...@gmail.com>:
> In this patch, I adjust the sequence of if-conditions.
> It will assess the page->private situation.
> First, we make sure the page->private is not null.
> And then, we can do some with this page->private.
>
> ---
> Signed-off-by: Liuwenyi <qingshen...@gmail.com>
> Cc: Chris Mason <chris.ma...@oracle.com>
> Cc: Yan Zheng <zheng@oracle.com>
> Cc: Josef Bacik <jba...@redhat.com>
> Cc: Jens Axboe <jens.ax...@oracle.com>
> Cc: linux-btrfs@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
>
> ---
> fs/btrfs/disk-io.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 009e3bd..a300dca 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1407,11 +1407,11 @@ static int bio_ready_for_csum(struct bio *bio)
>
> bio_for_each_segment(bvec, bio, i) {
> page = bvec->bv_page;
> - if (page->private == EXTENT_PAGE_PRIVATE) {
> + if (!page->private) {
> length += bvec->bv_len;
> continue;
> }
> - if (!page->private) {
> + if (page->private == EXTENT_PAGE_PRIVATE) {
> length += bvec->bv_len;
> continue;
> }
> --

Why do you want to do this? The code is perfectly safe even if
page->private is NULL. Furthermore, your patch is malformed.
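
To illustrate why the order of the two tests makes no difference, here is
a minimal userspace sketch. It assumes that page->private is an integer
field that is only compared by value and never dereferenced. The struct
page_sketch type, the value chosen for EXTENT_PAGE_PRIVATE and the helper
names are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration; not the kernel definitions. */
#define EXTENT_PAGE_PRIVATE 1UL

struct page_sketch {
	unsigned long private;	/* compared by value, never dereferenced */
};

/* Test order before the proposed patch. */
static int skip_old_order(const struct page_sketch *page)
{
	if (page->private == EXTENT_PAGE_PRIVATE)
		return 1;
	if (!page->private)
		return 1;
	return 0;
}

/* Test order after the proposed patch. */
static int skip_new_order(const struct page_sketch *page)
{
	if (!page->private)
		return 1;
	if (page->private == EXTENT_PAGE_PRIVATE)
		return 1;
	return 0;
}

/* Both orders agree for zero, the marker value, and any other value. */
static void check_orders_agree(void)
{
	unsigned long values[] = { 0UL, EXTENT_PAGE_PRIVATE, 0x1000UL };
	unsigned int i;

	for (i = 0; i < sizeof(values) / sizeof(values[0]); i++) {
		struct page_sketch page = { values[i] };

		assert(skip_old_order(&page) == skip_new_order(&page));
	}
}
```

Since both branches only read the field and have the same body, swapping
them should not change the result for any value of the field.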

Yan, Zheng
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs fsck doesn't modify a thing, and btrfs can not balance any data on a new device

2010-01-27 Thread Yan, Zheng
 in the btree ?


The short answer is that repairing errors isn't implemented yet. I'm afraid
the only way to save your data is to try mounting the FS in read-only mode
and copying the data out.

Yan, Zheng


Re: [PATCH] Btrfs: fix another orphan cleanup problem

2010-01-25 Thread Yan, Zheng
On Tue, Jan 26, 2010 at 5:01 AM, Josef Bacik <jo...@redhat.com> wrote:
> Because orphan cleanup now happens well after the fs is all initialized and
> such, we can run into this problem where we find orphan entries that were just
> added to the fs, not ones that were added previously during a crash.  This does
> not bode well for the system, and results in a couple of odd things happening,
> like truncate being run on non-regular files.  In order to fix this we just
> check and see if the inode has been added to the in-memory orphan list, and if
> it has, set the key to its inode number - 1 so we don't find this orphan entry
> again, and continue searching.
>
> This problem kept popping up while running xfs tests, and was 100%
> reproducible.  With this patch the problem no longer happens.  Thanks,
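
The skip described in the quoted text can be modelled in userspace roughly
as follows. This is only a sketch under stated assumptions: the on-disk
orphan items are represented as a sorted array of inode numbers, the
in-memory orphan list as a plain array, and every function name here is
hypothetical rather than taken from btrfs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the largest item <= key, or -1 if there is none. */
static int find_prev_orphan(const uint64_t *items, size_t n, uint64_t key)
{
	int i;

	for (i = (int)n - 1; i >= 0; i--)
		if (items[i] <= key)
			return i;
	return -1;
}

/* Return 1 if ino is on the (modelled) in-memory orphan list. */
static int on_memory_list(const uint64_t *list, size_t n, uint64_t ino)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (list[i] == ino)
			return 1;
	return 0;
}

/*
 * Walk the orphan items from the highest inode number down and collect
 * the ones that are not on the in-memory list. Entries that are on the
 * list were added at runtime, so they are skipped by moving the search
 * key to (inode number - 1). In this model collected entries are stepped
 * over in the same way instead of being deleted.
 */
static size_t collect_stale_orphans(const uint64_t *items, size_t n_items,
				    const uint64_t *in_memory, size_t n_mem,
				    uint64_t *out)
{
	uint64_t key = UINT64_MAX;
	size_t found = 0;

	for (;;) {
		int idx = find_prev_orphan(items, n_items, key);
		uint64_t ino;

		if (idx < 0)
			break;
		ino = items[idx];
		if (!on_memory_list(in_memory, n_mem, ino))
			out[found++] = ino;	/* stale: needs cleanup */
		if (ino == 0)
			break;			/* avoid wrapping the key */
		key = ino - 1;			/* do not find this entry again */
	}
	return found;
}
```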


Hi Josef,

I think this problem was introduced by your previous orphan cleanup fix.
Before I introduced the orphan cleanup regression, orphan cleanup on
a subvol was triggered by the first access. Your previous orphan cleanup
fix broke this rule: orphan cleanup is now triggered the first time
btrfs_lookup finds a valid inode. I think it's better to keep the old rule:
revert your previous changes and add code to open_ctree() to do orphan
cleanup for the default subvol.

Regards
Yan, Zheng


Re: [RFC] make btrfs-image work

2010-01-19 Thread Yan, Zheng
On Wed, Jan 20, 2010 at 12:04 AM, Josef Bacik <jo...@redhat.com> wrote:
> Hello,
>
> btrfs-image would be very helpful for debugging some users' problems that we
> can't reproduce ourselves, but every image that I try to re-create with
> btrfs-image makes btrfs panic.  This is because we zero out the superblock's
> chunk array and re-create our uuid.  This means that we end up not being able
> to read the chunk tree on mount, and then even if we could, the uuids of the
> metadata we read back wouldn't match the uuid of the device.  The way I've
> fixed this is to just spit the metadata back onto the disk exactly the way we
> got it.  The caveat to this I think is that if we try to image a multi-device
> setup, it won't work right unless we have a multi-device setup to restore the
> image onto.  I'm not sure if that's the goal or not.  This patch makes the
> single disk case work fine for me.  Let me know what you think.  Thanks,


The goal of btrfs-image is to create an image that can be examined by
btrfsck and btrfs-debug-tree. btrfs-image creates a metadata image of btrfs'
logical address space. So your patch only works for the uncommon case where
a btrfs logical address is mapped to the same offset on the device.
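
A simplified sketch of the distinction, assuming a single chunk that maps
one contiguous logical range onto one device. The struct and function
names are hypothetical and leave out striping, multiple devices and the
real chunk tree layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-stripe chunk: one logical range on one device. */
struct chunk_sketch {
	uint64_t logical_start;	/* start in the logical address space */
	uint64_t length;	/* length of the chunk in bytes */
	uint64_t dev_offset;	/* where the chunk begins on the device */
};

/* Translate a logical address to a device offset; returns 0 on success. */
static int map_logical(const struct chunk_sketch *c, uint64_t logical,
		       uint64_t *physical)
{
	if (logical < c->logical_start ||
	    logical - c->logical_start >= c->length)
		return -1;	/* address not covered by this chunk */
	*physical = c->dev_offset + (logical - c->logical_start);
	return 0;
}

/* Only for such a chunk does a logical address equal the device offset. */
static int is_identity_mapped(const struct chunk_sketch *c)
{
	return c->logical_start == c->dev_offset;
}
```

In this model, writing a block at its logical address only puts it where
a mount would look for it when is_identity_mapped() holds for the chunk
that covers it.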

Yan, Zheng


[PATCH] btrfs: Fix race in btrfs_mark_extent_written

2010-01-15 Thread Yan, Zheng
Fix a bug reported by Johannes Hirte. The cause of the bug is that
btrfs_del_items is called after btrfs_duplicate_item, and
btrfs_del_items triggers a tree balance. The fix is to check for that
case and call btrfs_search_slot when needed.
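
The patch also passes orig_offset to extent_mergeable(). One way to read
that test is sketched below in userspace C. The flattened struct and the
helper names are hypothetical; only the relation between the key offset,
the extent offset and orig_offset is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, flattened view of one file extent item and its key. */
struct extent_sketch {
	uint64_t key_offset;	/* file offset the item starts at */
	uint64_t disk_bytenr;	/* start of the on-disk extent it points into */
	uint64_t extent_offset;	/* offset of the item's data in that extent */
};

/* File offset that corresponds to byte 0 of the on-disk extent. */
static uint64_t orig_offset_of(const struct extent_sketch *e)
{
	return e->key_offset - e->extent_offset;
}

/*
 * Simplified form of the added test: a neighbouring item is a merge
 * candidate only if it points into the same on-disk extent and that
 * extent is anchored at the same file offset.
 */
static int same_extent_and_anchor(const struct extent_sketch *e,
				  uint64_t bytenr, uint64_t orig_offset)
{
	if (e->disk_bytenr != bytenr)
		return 0;
	if (e->extent_offset != e->key_offset - orig_offset)
		return 0;
	return 1;
}
```

The remaining conditions in extent_mergeable() (extent type, compression,
encryption, other encodings) are left out of this sketch.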

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 1/fs/btrfs/file.c 2/fs/btrfs/file.c
--- 1/fs/btrfs/file.c	2009-12-28 12:23:42.081546961 +0800
+++ 2/fs/btrfs/file.c	2010-01-11 13:02:08.082735125 +0800
@@ -506,7 +506,8 @@ next_slot:
 }
 
 static int extent_mergeable(struct extent_buffer *leaf, int slot,
-			    u64 objectid, u64 bytenr, u64 *start, u64 *end)
+			    u64 objectid, u64 bytenr, u64 orig_offset,
+			    u64 *start, u64 *end)
 {
 	struct btrfs_file_extent_item *fi;
 	struct btrfs_key key;
@@ -522,6 +523,7 @@ static int extent_mergeable(struct exten
 	fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
 	if (btrfs_file_extent_type(leaf, fi) != BTRFS_FILE_EXTENT_REG ||
 	    btrfs_file_extent_disk_bytenr(leaf, fi) != bytenr ||
+	    btrfs_file_extent_offset(leaf, fi) != key.offset - orig_offset ||
 	    btrfs_file_extent_compression(leaf, fi) ||
 	    btrfs_file_extent_encryption(leaf, fi) ||
 	    btrfs_file_extent_other_encoding(leaf, fi))
@@ -561,6 +563,7 @@ int btrfs_mark_extent_written(struct btr
 	u64 split;
 	int del_nr = 0;
 	int del_slot = 0;
+	int recow;
 	int ret;
 
 	btrfs_drop_extent_cache(inode, start, end - 1, 0);
@@ -568,6 +571,7 @@ int btrfs_mark_extent_written(struct btr
 	path = btrfs_alloc_path();
 	BUG_ON(!path);
 again:
+	recow = 0;
 	split = start;
 	key.objectid = inode->i_ino;
 	key.type = BTRFS_EXTENT_DATA_KEY;
@@ -591,12 +595,60 @@ again:
 	bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
 	num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
 	orig_offset = key.offset - btrfs_file_extent_offset(leaf, fi);
+	memcpy(&new_key, &key, sizeof(new_key));
+
+	if (start == key.offset && end < extent_end) {
+		other_start = 0;
+		other_end = start;
+		if (extent_mergeable(leaf, path->slots[0] - 1,
+				     inode->i_ino, bytenr, orig_offset,
+				     &other_start, &other_end)) {
+			new_key.offset = end;
+			btrfs_set_item_key_safe(trans, root, path, &new_key);
+			fi = btrfs_item_ptr(leaf, path->slots[0],
+					    struct btrfs_file_extent_item);
+			btrfs_set_file_extent_num_bytes(leaf, fi,
+							extent_end - end);
+			btrfs_set_file_extent_offset(leaf, fi,
+						     end - orig_offset);
+			fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+					    struct btrfs_file_extent_item);
+			btrfs_set_file_extent_num_bytes(leaf, fi,
+							end - other_start);
+			btrfs_mark_buffer_dirty(leaf);
+			goto out;
+		}
+	}
+
+	if (start > key.offset && end == extent_end) {
+		other_start = end;
+		other_end = 0;
+		if (extent_mergeable(leaf, path->slots[0] + 1,
+				     inode->i_ino, bytenr, orig_offset,
+				     &other_start, &other_end)) {
+			fi = btrfs_item_ptr(leaf, path->slots[0],
+					    struct btrfs_file_extent_item);
+			btrfs_set_file_extent_num_bytes(leaf, fi,
+							start - key.offset);
+			path->slots[0]++;
+			new_key.offset = start;
+			btrfs_set_item_key_safe(trans, root, path, &new_key);
+
+			fi = btrfs_item_ptr(leaf, path->slots[0],
+					    struct btrfs_file_extent_item);
+			btrfs_set_file_extent_num_bytes(leaf, fi,
+							other_end - start);
+			btrfs_set_file_extent_offset(leaf, fi,
+						     start - orig_offset);
+			btrfs_mark_buffer_dirty(leaf);
+			goto out;
+		}
+	}
 
 	while (start > key.offset || end < extent_end) {
 		if (key.offset == start)
 			split = end;
 
-		memcpy(&new_key, &key, sizeof(new_key));
 		new_key.offset = split;
 		ret = btrfs_duplicate_item(trans, root, path, &new_key);
 		if (ret == -EAGAIN
