Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault
On Fri, Dec 15, 2017 at 12:53 AM, Jan Kara <j...@suse.cz> wrote:
> On Thu 14-12-17 22:30:26, Yan, Zheng wrote:
>> On Thu, Dec 14, 2017 at 9:43 PM, Jan Kara <j...@suse.cz> wrote:
>> > On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
>> >> We recently got an Oops report:
>> >>
>> >> BUG: unable to handle kernel NULL pointer dereference at (null)
>> >> IP: jbd2__journal_start+0x38/0x1a2
>> >> [...]
>> >> Call Trace:
>> >>  ext4_page_mkwrite+0x307/0x52b
>> >>  _ext4_get_block+0xd8/0xd8
>> >>  do_page_mkwrite+0x6e/0xd8
>> >>  handle_mm_fault+0x686/0xf9b
>> >>  mntput_no_expire+0x1f/0x21e
>> >>  __do_page_fault+0x21d/0x465
>> >>  dput+0x4a/0x2f7
>> >>  page_fault+0x22/0x30
>> >>  copy_user_generic_string+0x2c/0x40
>> >>  copy_page_to_iter+0x8c/0x2b8
>> >>  generic_file_read_iter+0x26e/0x845
>> >>  timerqueue_del+0x31/0x90
>> >>  ceph_read_iter+0x697/0xa33 [ceph]
>> >>  hrtimer_cancel+0x23/0x41
>> >>  futex_wait+0x1c8/0x24d
>> >>  get_futex_key+0x32c/0x39a
>> >>  __vfs_read+0xe0/0x130
>> >>  vfs_read.part.1+0x6c/0x123
>> >>  handle_mm_fault+0x831/0xf9b
>> >>  __fget+0x7e/0xbf
>> >>  SyS_read+0x4d/0xb5
>> >>
>> >> ceph_read_iter() uses current->journal_info to pass context info to
>> >> ceph_readpages(). Because ceph_readpages() needs to know if its caller
>> >> has already gotten capability of using page cache (distinguish read
>> >> from readahead/fadvise). ceph_read_iter() set current->journal_info,
>> >> then calls generic_file_read_iter().
>> >>
>> >> In above Oops, page fault happened when copying data to userspace.
>> >> Page fault handler called ext4_page_mkwrite(). Ext4 code read
>> >> current->journal_info and assumed it is journal handle.
>> >>
>> >> I checked other filesystems, btrfs probably suffers similar problem
>> >> for its readpage. (page fault happens when write() copies data from
>> >> userspace memory and the memory is mapped to a file in btrfs.
>> >> verify_parent_transid() can be called during readpage)
>> >>
>> >> Cc: sta...@vger.kernel.org
>> >> Signed-off-by: "Yan, Zheng" <z...@redhat.com>
>> >
>> > I agree with the analysis but the patch is too ugly to live. Ceph just
>> > should not be abusing current->journal_info for passing information between
>> > two random functions or when it does a hackery like this, it should just
>> > make sure the pieces hold together. Polluting generic code to accommodate
>> > this hack in Ceph is not acceptable. Also bear in mind there are likely
>> > other code paths (e.g. memory reclaim) which could recurse into another
>> > filesystem confusing it with non-NULL current->journal_info in the same
>> > way.
>>
>> But ...
>>
>> some filesystems set journal_info in write_begin(), then clear it
>> in write_end(). If the buffer for the write is mapped to another filesystem,
>> current->journal_info can leak to the latter filesystem's page_readpage().
>> The latter filesystem may read current->journal_info and treat it as its own
>> journal handle. Besides, most filesystems' vm fault handler is
>> filemap_fault(), and filemap may also trigger memory reclaim.
>
> Did you really observe this? Because write path uses
> iov_iter_copy_from_user_atomic() which does not allow page faults to
> happen. All page faulting happens in iov_iter_fault_in_readable() before
> ->write_begin() is called. And the recursion problems like you mention
> above are exactly the reason why things are done in a more complicated way
> like this.

I think you are right.

>
>> >
>> > In this particular case I'm not sure why does ceph pass 'filp' into
>> > readpage() / readpages() handler when it already gets that pointer as part
>> > of arguments...
>>
>> It's actually a flag which tells ceph_readpages() if its caller is
>> ceph_read_iter or readahead/fadvise/madvise, because when there are
>> multiple clients reading/writing a file at the same time, page cache should
>> be disabled.
>
> I'm not sure I understand the reasoning properly but from what you say
> above it rather seems the 'hint' should be stored in the inode (or possibly
> struct file)?
>

The capability of using page cache is held by the process that got it.
ceph_read_iter() first gets the capability, calls
generic_file_read_iter(), then releases the capability. The capability
cannot easily be stored in the inode or file because it can be revoked by
the server at any time if the caller does not hold it.

Regards
Yan, Zheng

> Honza
> --
> Jan Kara <j...@suse.com>
> SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault
On Thu, Dec 14, 2017 at 9:43 PM, Jan Kara <j...@suse.cz> wrote:
> On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
>> We recently got an Oops report:
>>
>> BUG: unable to handle kernel NULL pointer dereference at (null)
>> IP: jbd2__journal_start+0x38/0x1a2
>> [...]
>> Call Trace:
>>  ext4_page_mkwrite+0x307/0x52b
>>  _ext4_get_block+0xd8/0xd8
>>  do_page_mkwrite+0x6e/0xd8
>>  handle_mm_fault+0x686/0xf9b
>>  mntput_no_expire+0x1f/0x21e
>>  __do_page_fault+0x21d/0x465
>>  dput+0x4a/0x2f7
>>  page_fault+0x22/0x30
>>  copy_user_generic_string+0x2c/0x40
>>  copy_page_to_iter+0x8c/0x2b8
>>  generic_file_read_iter+0x26e/0x845
>>  timerqueue_del+0x31/0x90
>>  ceph_read_iter+0x697/0xa33 [ceph]
>>  hrtimer_cancel+0x23/0x41
>>  futex_wait+0x1c8/0x24d
>>  get_futex_key+0x32c/0x39a
>>  __vfs_read+0xe0/0x130
>>  vfs_read.part.1+0x6c/0x123
>>  handle_mm_fault+0x831/0xf9b
>>  __fget+0x7e/0xbf
>>  SyS_read+0x4d/0xb5
>>
>> ceph_read_iter() uses current->journal_info to pass context info to
>> ceph_readpages(). Because ceph_readpages() needs to know if its caller
>> has already gotten capability of using page cache (distinguish read
>> from readahead/fadvise). ceph_read_iter() set current->journal_info,
>> then calls generic_file_read_iter().
>>
>> In above Oops, page fault happened when copying data to userspace.
>> Page fault handler called ext4_page_mkwrite(). Ext4 code read
>> current->journal_info and assumed it is journal handle.
>>
>> I checked other filesystems, btrfs probably suffers similar problem
>> for its readpage. (page fault happens when write() copies data from
>> userspace memory and the memory is mapped to a file in btrfs.
>> verify_parent_transid() can be called during readpage)
>>
>> Cc: sta...@vger.kernel.org
>> Signed-off-by: "Yan, Zheng" <z...@redhat.com>
>
> I agree with the analysis but the patch is too ugly to live. Ceph just
> should not be abusing current->journal_info for passing information between
> two random functions or when it does a hackery like this, it should just
> make sure the pieces hold together. Polluting generic code to accommodate
> this hack in Ceph is not acceptable. Also bear in mind there are likely
> other code paths (e.g. memory reclaim) which could recurse into another
> filesystem confusing it with non-NULL current->journal_info in the same
> way.

But ...

some filesystems set journal_info in write_begin(), then clear it
in write_end(). If the buffer for the write is mapped to another filesystem,
current->journal_info can leak to the latter filesystem's page_readpage().
The latter filesystem may read current->journal_info and treat it as its own
journal handle. Besides, most filesystems' vm fault handler is
filemap_fault(), and filemap may also trigger memory reclaim.

>
> In this particular case I'm not sure why does ceph pass 'filp' into
> readpage() / readpages() handler when it already gets that pointer as part
> of arguments...

It's actually a flag which tells ceph_readpages() if its caller is
ceph_read_iter or readahead/fadvise/madvise, because when there are
multiple clients reading/writing a file at the same time, page cache should
be disabled.

Regards
Yan, Zheng

>
> Honza
>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index a728bed16c20..db2a50233c49 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>  		unsigned int flags)
>>  {
>>  	int ret;
>> +	void *old_journal_info;
>>
>>  	__set_current_state(TASK_RUNNING);
>>
>> @@ -4065,11 +4066,24 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>  	if (flags & FAULT_FLAG_USER)
>>  		mem_cgroup_oom_enable();
>>
>> +	/*
>> +	 * Fault can happen when filesystem A's read_iter()/write_iter()
>> +	 * copies data to/from userspace. Filesystem A may have set
>> +	 * current->journal_info. If the userspace memory is MAP_SHARED
>> +	 * mapped to a file in filesystem B, we later may call filesystem
>> +	 * B's vm operation. Filesystem B may also want to read/set
>> +	 * current->journal_info.
>> +	 */
>> +	old_journal_info = current->journal_info;
>> +	current->journal_info = NULL;
>> +
>>  	if (unlikely(is_vm_hugetlb_page(vma)))
>>  		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
[PATCH] mm: save/restore current->journal_info in handle_mm_fault
We recently got an Oops report:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: jbd2__journal_start+0x38/0x1a2
[...]
Call Trace:
 ext4_page_mkwrite+0x307/0x52b
 _ext4_get_block+0xd8/0xd8
 do_page_mkwrite+0x6e/0xd8
 handle_mm_fault+0x686/0xf9b
 mntput_no_expire+0x1f/0x21e
 __do_page_fault+0x21d/0x465
 dput+0x4a/0x2f7
 page_fault+0x22/0x30
 copy_user_generic_string+0x2c/0x40
 copy_page_to_iter+0x8c/0x2b8
 generic_file_read_iter+0x26e/0x845
 timerqueue_del+0x31/0x90
 ceph_read_iter+0x697/0xa33 [ceph]
 hrtimer_cancel+0x23/0x41
 futex_wait+0x1c8/0x24d
 get_futex_key+0x32c/0x39a
 __vfs_read+0xe0/0x130
 vfs_read.part.1+0x6c/0x123
 handle_mm_fault+0x831/0xf9b
 __fget+0x7e/0xbf
 SyS_read+0x4d/0xb5

ceph_read_iter() uses current->journal_info to pass context info to
ceph_readpages(). Because ceph_readpages() needs to know if its caller
has already gotten capability of using page cache (distinguish read
from readahead/fadvise). ceph_read_iter() set current->journal_info,
then calls generic_file_read_iter().

In above Oops, page fault happened when copying data to userspace.
Page fault handler called ext4_page_mkwrite(). Ext4 code read
current->journal_info and assumed it is journal handle.

I checked other filesystems, btrfs probably suffers similar problem
for its readpage. (page fault happens when write() copies data from
userspace memory and the memory is mapped to a file in btrfs.
verify_parent_transid() can be called during readpage)

Cc: sta...@vger.kernel.org
Signed-off-by: "Yan, Zheng" <z...@redhat.com>
---
 mm/memory.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index a728bed16c20..db2a50233c49 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		unsigned int flags)
 {
 	int ret;
+	void *old_journal_info;

 	__set_current_state(TASK_RUNNING);

@@ -4065,11 +4066,24 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	if (flags & FAULT_FLAG_USER)
 		mem_cgroup_oom_enable();

+	/*
+	 * Fault can happen when filesystem A's read_iter()/write_iter()
+	 * copies data to/from userspace. Filesystem A may have set
+	 * current->journal_info. If the userspace memory is MAP_SHARED
+	 * mapped to a file in filesystem B, we later may call filesystem
+	 * B's vm operation. Filesystem B may also want to read/set
+	 * current->journal_info.
+	 */
+	old_journal_info = current->journal_info;
+	current->journal_info = NULL;
+
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
 	else
 		ret = __handle_mm_fault(vma, address, flags);

+	current->journal_info = old_journal_info;
+
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_oom_disable();
 		/*
--
2.13.6
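For readers following the thread, here is a rough userspace sketch of the save/NULL/restore idea in the hunk above. It is not kernel code: fake_task, fake_current, fs_a_read() and the other helpers are made-up names for illustration, and "filesystem B" is reduced to a check for a non-NULL pointer. Only the shape of the save and restore sequence is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-task slot; not the kernel's task_struct. */
struct fake_task {
	void *journal_info;
};

static struct fake_task fake_current;

/*
 * Stand-in for filesystem B's fault path: it treats any non-NULL
 * journal_info as its own handle, which is the confusion behind the Oops.
 * Returns 1 if it would have picked up a foreign value.
 */
static int fs_b_fault_sees_foreign_handle(void)
{
	return fake_current.journal_info != NULL;
}

/* Fault handling without save/restore: B sees whatever A left behind. */
static int fault_without_save(void)
{
	return fs_b_fault_sees_foreign_handle();
}

/* Fault handling with save/restore, same shape as the proposed hunk. */
static int fault_with_save(void)
{
	void *old_journal_info = fake_current.journal_info;
	int ret;

	fake_current.journal_info = NULL;
	ret = fs_b_fault_sees_foreign_handle();
	fake_current.journal_info = old_journal_info;
	return ret;
}

/* Filesystem A sets its marker, then "faults" while copying to userspace. */
static int fs_a_read(int use_save)
{
	int marker = 1;	/* any non-NULL address works as the context flag */
	int ret;

	fake_current.journal_info = &marker;
	ret = use_save ? fault_with_save() : fault_without_save();
	/* A's marker must still be in place once the fault returns. */
	assert(fake_current.journal_info == &marker);
	fake_current.journal_info = NULL;
	return ret;
}
```

Calling fs_a_read(0) models the reported situation, and fs_a_read(1) models the patched path. The sketch says nothing about whether generic mm code is the right place for this, which is the objection raised in the replies.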
Re: [PATCH] fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at
> On 13 Dec 2017, at 13:38, Adam Borowski <kilob...@angband.pl> wrote:
>
> This link is replicated in most filesystems' config stanzas. Referring
> to an archived version of that site is pointless as it mostly deals with
> patches; user documentation is available elsewhere.
>
> Signed-off-by: Adam Borowski <kilob...@angband.pl>
> ---
> Sending this as one piece; if you guys would instead prefer this chopped
> into tiny per-filesystem bits, please say so.
>
>
> Documentation/filesystems/ext2.txt | 2 --
> Documentation/filesystems/ext4.txt | 7 +++
> fs/9p/Kconfig | 3 ---
> fs/Kconfig | 6 +-
> fs/btrfs/Kconfig | 3 ---
> fs/ceph/Kconfig | 3 ---

Ceph bits look good.

Acked-by: "Yan, Zheng" <z...@redhat.com>
Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()
On Fri, Jun 2, 2017 at 10:18 PM, Arnd Bergmann <a...@arndb.de> wrote:
> On Fri, Jun 2, 2017 at 2:18 PM, Yan, Zheng <uker...@gmail.com> wrote:
>> On Fri, Jun 2, 2017 at 7:33 PM, Arnd Bergmann <a...@arndb.de> wrote:
>>> On Fri, Jun 2, 2017 at 1:18 PM, Yan, Zheng <uker...@gmail.com> wrote:
>>> What I meant is another related problem in ceph_mkdir() where the
>>> i_ctime field of the parent inode is different between the persistent
>>> representation in the mds and the in-memory representation.
>>>
>>
>> I don't see any problem in mkdir case. Parent inode's i_ctime in mds is
>> set to r_stamp. When client receives request reply, it sets its in-memory
>> inode's ctime to the same time stamp.
>
> Ok, I see it now, thanks for the clarification. Most other file systems do
> this the other way round and update all fields in the in-memory inode
> structure first and then write that to persistent storage, so I was getting
> confused about the order of events here.
>
> If I understand it all right, we have three different behaviors in ceph now,
> though the differences are very minor and probably don't ever matter:
>
> - in setattr(), we update ctime in the in-memory inode first and then send
>   the same time to the mds, and expect to set it again when the reply comes.
>
> - in ceph_write_iter write() and mmap/page_mkwrite(), we call
>   file_update_time() to set i_mtime and i_ctime to the same
>   timestamp first once a write is observed by the fs and then take
>   two other timestamps that we send to the mds, and update the
>   in-memory inode a second time when the reply comes. ctime
>   is never older than mtime here, as far as I can tell, but it may
>   be newer when the timer interrupt happens between taking the
>   two stamps.

We don't use a request to send i_mtime/i_ctime to the mds in this case.
Instead, we use a cap flush message. i_mtime/i_ctime are directly encoded
in the cap flush message. When the mds receives the cap flush message, it
writes i_mtime/i_ctime to persistent storage and sends a cap flush ack
message to the client. (When the client receives the cap flush ack
message, it does not update i_mtime/i_ctime.) So the issue you describe
does not occur.

>
> - in all other calls, we only update the inode (and/or parent inode)
>   after the reply arrives.

There are two cases.
1. Client updates the in-memory inode's ctime and sends the new ctime to
   the mds through a cap flush message.
2. Client sets the mds request's r_stamp and sends the request to the mds.
   The MDS updates the relevant inodes' ctime and sends a reply to the
   client. The client updates the in-memory inodes' ctime according to
   the reply.

Regards
Yan, Zheng

>
>       Arnd
Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()
On Fri, Jun 2, 2017 at 7:33 PM, Arnd Bergmann <a...@arndb.de> wrote:
> On Fri, Jun 2, 2017 at 1:18 PM, Yan, Zheng <uker...@gmail.com> wrote:
>> On Fri, Jun 2, 2017 at 6:51 PM, Arnd Bergmann <a...@arndb.de> wrote:
>>> On Fri, Jun 2, 2017 at 12:10 PM, Yan, Zheng <uker...@gmail.com> wrote:
>>>> On Fri, Jun 2, 2017 at 5:45 PM, Arnd Bergmann <a...@arndb.de> wrote:
>>>>> On Fri, Jun 2, 2017 at 4:09 AM, Yan, Zheng <uker...@gmail.com> wrote:
>>>>>> On Fri, Jun 2, 2017 at 8:57 AM, Deepa Dinamani <deepa.ker...@gmail.com> wrote:
>>>>>>> On Thu, Jun 1, 2017 at 5:36 PM, John Stultz <john.stu...@linaro.org> wrote:
>>>>>>>> On Thu, Jun 1, 2017 at 5:26 PM, Yan, Zheng <uker...@gmail.com> wrote:
>>>>>
>>>>> I believe the bug you see is the result of the two timestamps
>>>>> currently being almost guaranteed to be different in the latest
>>>>> kernels.
>>>>> Changing r_stamp to use current_kernel_time() will make it the
>>>>> same value most of the time (as it was before Deepa's patch),
>>>>> but when the timer interrupt happens between the timestamps,
>>>>> the two are still different, it's just much harder to hit.
>>>>>
>>>>> I think the proper solution should be to change __ceph_setattr()
>>>>> in a way that has req->r_stamp always synchronized with i_ctime.
>>>>> If we copy i_ctime to r_stamp, that will also take care of the
>>>>> future issues with the planned changes to current_time().
>>>>>
>>>> I already have a patch
>>>> https://github.com/ceph/ceph-client/commit/24f54cd18e195a002ee3d2ab50dbc952fd9f82af
>>>
>>> Looks good to me. In case anyone cares:
>>> Acked-by: Arnd Bergmann <a...@arndb.de>
>>>
>>>>> The part I don't understand is what else r_stamp (i.e. the time
>>>>> stamp in ceph_msg_data with type==
>>>>> CEPH_MSG_CLIENT_REQUEST) is used for, other than setting
>>>>> ctime in CEPH_MDS_OP_SETATTR.
>>>>>
>>>>> Will this be used to update the stored i_ctime for other operations
>>>>> too? If so, we would need to synchronize it with the in-memory
>>>>> i_ctime for all operations that do this.
>>>>>
>>>>
>>>> yes, mds uses it to update ctime of modified inodes. For example,
>>>> when handling mkdir, mds set ctime of both parent inode and new inode
>>>> to r_stamp.
>>>
>>> I see, so we may have a variation of that problem there as well: From
>>> my reading of the code, the child inode is not in memory yet, so
>>> that seems fine, but I could not find where the parent in-memory inode
>>> i_ctime is updated in ceph, but it is most likely not the same as
>>> req->r_stamp (assuming it gets updated at all).
>>
>> i_ctime is updated when handling request reply, by ceph_fill_file_time().
>> __ceph_setattr() can update the in-memory inode's ctime after request
>> reply is received. The difference between ktime_get_real_ts() and
>> current_time() can be larger than round-trip time of request. So it's
>> still possible that __ceph_setattr() make ctime go back.
>
> But the __ceph_setattr() problem should be fixed by your patch, right?
>
> What I meant is another related problem in ceph_mkdir() where the
> i_ctime field of the parent inode is different between the persistent
> representation in the mds and the in-memory representation.
>

I don't see any problem in the mkdir case. The parent inode's i_ctime in
the mds is set to r_stamp. When the client receives the request reply, it
sets its in-memory inode's ctime to the same time stamp.

Regards
Yan, Zheng

> Arnd
>
>>> Would it make sense to require all callers of ceph_mdsc_do_request()
>>> to update r_stamp at the same time as i_ctime to keep them in sync?
Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()
On Fri, Jun 2, 2017 at 6:51 PM, Arnd Bergmann <a...@arndb.de> wrote:
> On Fri, Jun 2, 2017 at 12:10 PM, Yan, Zheng <uker...@gmail.com> wrote:
>> On Fri, Jun 2, 2017 at 5:45 PM, Arnd Bergmann <a...@arndb.de> wrote:
>>> On Fri, Jun 2, 2017 at 4:09 AM, Yan, Zheng <uker...@gmail.com> wrote:
>>>> On Fri, Jun 2, 2017 at 8:57 AM, Deepa Dinamani <deepa.ker...@gmail.com> wrote:
>>>>> On Thu, Jun 1, 2017 at 5:36 PM, John Stultz <john.stu...@linaro.org> wrote:
>>>>>> On Thu, Jun 1, 2017 at 5:26 PM, Yan, Zheng <uker...@gmail.com> wrote:
>>>
>>> I believe the bug you see is the result of the two timestamps
>>> currently being almost guaranteed to be different in the latest
>>> kernels.
>>> Changing r_stamp to use current_kernel_time() will make it the
>>> same value most of the time (as it was before Deepa's patch),
>>> but when the timer interrupt happens between the timestamps,
>>> the two are still different, it's just much harder to hit.
>>>
>>> I think the proper solution should be to change __ceph_setattr()
>>> in a way that has req->r_stamp always synchronized with i_ctime.
>>> If we copy i_ctime to r_stamp, that will also take care of the
>>> future issues with the planned changes to current_time().
>>>
>> I already have a patch
>> https://github.com/ceph/ceph-client/commit/24f54cd18e195a002ee3d2ab50dbc952fd9f82af
>
> Looks good to me. In case anyone cares:
> Acked-by: Arnd Bergmann <a...@arndb.de>
>
>>> The part I don't understand is what else r_stamp (i.e. the time
>>> stamp in ceph_msg_data with type==
>>> CEPH_MSG_CLIENT_REQUEST) is used for, other than setting
>>> ctime in CEPH_MDS_OP_SETATTR.
>>>
>>> Will this be used to update the stored i_ctime for other operations
>>> too? If so, we would need to synchronize it with the in-memory
>>> i_ctime for all operations that do this.
>>>
>>
>> yes, mds uses it to update ctime of modified inodes. For example,
>> when handling mkdir, mds set ctime of both parent inode and new inode
>> to r_stamp.
>
> I see, so we may have a variation of that problem there as well: From
> my reading of the code, the child inode is not in memory yet, so
> that seems fine, but I could not find where the parent in-memory inode
> i_ctime is updated in ceph, but it is most likely not the same as
> req->r_stamp (assuming it gets updated at all).

i_ctime is updated when handling the request reply, by
ceph_fill_file_time(). __ceph_setattr() can update the in-memory inode's
ctime after the request reply is received. The difference between
ktime_get_real_ts() and current_time() can be larger than the round-trip
time of a request. So it's still possible that __ceph_setattr() makes
ctime go back.

Regards
Yan, Zheng

>
> Would it make sense to require all callers of ceph_mdsc_do_request()
> to update r_stamp at the same time as i_ctime to keep them in sync?
>
> Arnd
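A minimal sketch of the "copy i_ctime to r_stamp" idea discussed in this message, written as plain userspace C. The fake_inode and fake_request structs and the fake_setattr() and fake_handle_reply() helpers are hypothetical stand-ins, not the ceph client's real types or functions, and this is not the content of the commit linked above. The point is only that a single timestamp is taken once and reused, so the in-memory ctime and the stamp sent to the server cannot disagree.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical, heavily reduced stand-ins for the inode and the MDS request. */
struct fake_inode {
	struct timespec i_ctime;
};

struct fake_request {
	struct timespec r_stamp;
};

static int ts_equal(const struct timespec *a, const struct timespec *b)
{
	return a->tv_sec == b->tv_sec && a->tv_nsec == b->tv_nsec;
}

/*
 * setattr-like step: one timestamp is taken by the caller and used for
 * the in-memory ctime, then copied into the request stamp rather than
 * being read from a second clock.
 */
static void fake_setattr(struct fake_inode *inode, struct fake_request *req,
			 struct timespec now)
{
	inode->i_ctime = now;
	req->r_stamp = inode->i_ctime;
}

/*
 * Reply step: the server is assumed to have stored r_stamp as ctime, so
 * applying the reply leaves the in-memory ctime where it already was.
 */
static void fake_handle_reply(struct fake_inode *inode,
			      const struct fake_request *req)
{
	assert(ts_equal(&inode->i_ctime, &req->r_stamp));
	inode->i_ctime = req->r_stamp;
}
```

With two independent clock reads instead of the copy, the assertion in fake_handle_reply() could fail, which corresponds to the ctime moving when the reply arrives.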
Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()
On Fri, Jun 2, 2017 at 5:45 PM, Arnd Bergmann <a...@arndb.de> wrote:
> On Fri, Jun 2, 2017 at 4:09 AM, Yan, Zheng <uker...@gmail.com> wrote:
>> On Fri, Jun 2, 2017 at 8:57 AM, Deepa Dinamani <deepa.ker...@gmail.com> wrote:
>>> On Thu, Jun 1, 2017 at 5:36 PM, John Stultz <john.stu...@linaro.org> wrote:
>>>> On Thu, Jun 1, 2017 at 5:26 PM, Yan, Zheng <uker...@gmail.com> wrote:
>>>>> On Thu, Jun 1, 2017 at 6:22 PM, Arnd Bergmann <a...@arndb.de> wrote:
>>>>>> On Thu, Jun 1, 2017 at 11:56 AM, Yan, Zheng <uker...@gmail.com> wrote:
>>>>>>> On Sat, Apr 8, 2017 at 8:57 AM, Deepa Dinamani <deepa.ker...@gmail.com> wrote:
>>>>>>
>>>>>>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>>>>>>>> index 517838b..77204da 100644
>>>>>>>> --- a/drivers/block/rbd.c
>>>>>>>> +++ b/drivers/block/rbd.c
>>>>>>>> @@ -1922,7 +1922,7 @@ static void rbd_osd_req_format_write(struct rbd_obj_request *obj_request)
>>>>>>>>  {
>>>>>>>>  	struct ceph_osd_request *osd_req = obj_request->osd_req;
>>>>>>>>
>>>>>>>> -	osd_req->r_mtime = CURRENT_TIME;
>>>>>>>> +	ktime_get_real_ts(&osd_req->r_mtime);
>>>>>>>>  	osd_req->r_data_offset = obj_request->offset;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>>>>> index c681762..1d3fa90 100644
>>>>>>>> --- a/fs/ceph/mds_client.c
>>>>>>>> +++ b/fs/ceph/mds_client.c
>>>>>>>> @@ -1666,6 +1666,7 @@ struct ceph_mds_request *
>>>>>>>>  ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode)
>>>>>>>>  {
>>>>>>>>  	struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS);
>>>>>>>> +	struct timespec ts;
>>>>>>>>
>>>>>>>>  	if (!req)
>>>>>>>>  		return ERR_PTR(-ENOMEM);
>>>>>>>> @@ -1684,7 +1685,8 @@ ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode)
>>>>>>>>  	init_completion(&req->r_safe_completion);
>>>>>>>>  	INIT_LIST_HEAD(&req->r_unsafe_item);
>>>>>>>>
>>>>>>>> -	req->r_stamp = current_fs_time(mdsc->fsc->sb);
>>>>>>>> +	ktime_get_real_ts(&ts);
>>>>>>>> +	req->r_stamp = timespec_trunc(ts, mdsc->fsc->sb->s_time_gran);
>>>>>>>
>>>>>>> This change causes our kernel_untar_tar test case to fail (inode's
>>>>>>> ctime goes back). The reason is that there is time drift between the
>>>>>>> time stamps got by ktime_get_real_ts() and current_time(). We need to
>>>>>>> revert this change until current_time() uses ktime_get_real_ts()
>>>>>>> internally.
>>>>>>
>>>>>> Hmm, the change was not supposed to have a user-visible effect, so
>>>>>> something has gone wrong, but I don't immediately see how it
>>>>>> relates to what you observe.
>>>>>>
>>>>>> ktime_get_real_ts() and current_time() use the same time base, there
>>>>>> is no drift, but there is a difference in resolution, as the latter uses
>>>>>> the time stamp of the last jiffies update, which may be up to one jiffy
>>>>>> (10ms) behind the exact time we put in the request stamps here.
>>>>>>
>>>>>> Do you still see problems if you use current_kernel_time() instead of
>>>>>> ktime_get_real_ts()?
>>>>>
>>>>> The problem disappears after using current_kernel_time().
>>>>>
>>>>> https://github.com/ceph/ceph-client/commit/2e0f648da23167034a3cf1500bc90ec60aef2417
>>>>
>>>> From the commit above:
>>>> "It seems there is time drift between ktime_get_real_ts() and
>>>> current_kernel_time()"
>>>>
>>>> It's more of a granularity difference. current_ker
Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()
On Fri, Jun 2, 2017 at 8:57 AM, Deepa Dinamani <deepa.ker...@gmail.com> wrote:
> On Thu, Jun 1, 2017 at 5:36 PM, John Stultz <john.stu...@linaro.org> wrote:
>> On Thu, Jun 1, 2017 at 5:26 PM, Yan, Zheng <uker...@gmail.com> wrote:
>>> On Thu, Jun 1, 2017 at 6:22 PM, Arnd Bergmann <a...@arndb.de> wrote:
>>>> On Thu, Jun 1, 2017 at 11:56 AM, Yan, Zheng <uker...@gmail.com> wrote:
>>>>> On Sat, Apr 8, 2017 at 8:57 AM, Deepa Dinamani <deepa.ker...@gmail.com> wrote:
>>>>
>>>>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>>>>>> index 517838b..77204da 100644
>>>>>> --- a/drivers/block/rbd.c
>>>>>> +++ b/drivers/block/rbd.c
>>>>>> @@ -1922,7 +1922,7 @@ static void rbd_osd_req_format_write(struct rbd_obj_request *obj_request)
>>>>>>  {
>>>>>>  	struct ceph_osd_request *osd_req = obj_request->osd_req;
>>>>>>
>>>>>> -	osd_req->r_mtime = CURRENT_TIME;
>>>>>> +	ktime_get_real_ts(&osd_req->r_mtime);
>>>>>>  	osd_req->r_data_offset = obj_request->offset;
>>>>>>  }
>>>>>>
>>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>>> index c681762..1d3fa90 100644
>>>>>> --- a/fs/ceph/mds_client.c
>>>>>> +++ b/fs/ceph/mds_client.c
>>>>>> @@ -1666,6 +1666,7 @@ struct ceph_mds_request *
>>>>>>  ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode)
>>>>>>  {
>>>>>>  	struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS);
>>>>>> +	struct timespec ts;
>>>>>>
>>>>>>  	if (!req)
>>>>>>  		return ERR_PTR(-ENOMEM);
>>>>>> @@ -1684,7 +1685,8 @@ ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode)
>>>>>>  	init_completion(&req->r_safe_completion);
>>>>>>  	INIT_LIST_HEAD(&req->r_unsafe_item);
>>>>>>
>>>>>> -	req->r_stamp = current_fs_time(mdsc->fsc->sb);
>>>>>> +	ktime_get_real_ts(&ts);
>>>>>> +	req->r_stamp = timespec_trunc(ts, mdsc->fsc->sb->s_time_gran);
>>>>>
>>>>> This change causes our kernel_untar_tar test case to fail (inode's
>>>>> ctime goes back). The reason is that there is time drift between the
>>>>> time stamps got by ktime_get_real_ts() and current_time(). We need to
>>>>> revert this change until current_time() uses ktime_get_real_ts()
>>>>> internally.
>>>>
>>>> Hmm, the change was not supposed to have a user-visible effect, so
>>>> something has gone wrong, but I don't immediately see how it
>>>> relates to what you observe.
>>>>
>>>> ktime_get_real_ts() and current_time() use the same time base, there
>>>> is no drift, but there is a difference in resolution, as the latter uses
>>>> the time stamp of the last jiffies update, which may be up to one jiffy
>>>> (10ms) behind the exact time we put in the request stamps here.
>>>>
>>>> Do you still see problems if you use current_kernel_time() instead of
>>>> ktime_get_real_ts()?
>>>
>>> The problem disappears after using current_kernel_time().
>>>
>>> https://github.com/ceph/ceph-client/commit/2e0f648da23167034a3cf1500bc90ec60aef2417
>>
>> From the commit above:
>> "It seems there is time drift between ktime_get_real_ts() and
>> current_kernel_time()"
>>
>> It's more of a granularity difference. current_kernel_time() returns
>> the cached time at the last tick, whereas ktime_get_real_ts() reads
>> the clocksource hardware and returns the immediate time.
>>
>> Filesystems usually use the cached time (similar to
>> CLOCK_REALTIME_COARSE), for performance reasons, as touching the
>> clocksource takes time.
>
> Alternatively, it would be best for this code also to use current_time().
> I had suggested this in one of the previous versions of the patch.
> The implementation of current_time() will change when we switch vfs to
> use 64 bit time. This will prevent such errors from happening again.
> But, this also means there is more code reordering for these modules
> to get a reference to inode.
>

I took a look. It's quite inconvenient to use current_time(). I prefer
to temporarily use current_kernel_time().

Regards
Yan, Zheng

> -Deepa
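The granularity difference described above can be observed from userspace with the two POSIX/Linux clocks that the quoted text compares it to. This is only an analogy sketched under assumptions: CLOCK_REALTIME stands in for ktime_get_real_ts() and CLOCK_REALTIME_COARSE for the cached tick time, and the helper names ts_to_ns() and fine_minus_coarse_ns() are invented for this example. CLOCK_REALTIME_COARSE is Linux-specific.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Convert a timespec to a single nanosecond count for easy comparison. */
static int64_t ts_to_ns(struct timespec ts)
{
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Read the precise clock first and the coarse (tick-cached) clock second,
 * and return precise minus coarse in nanoseconds. A positive result means
 * the coarse value, although read later, lies behind the precise one.
 */
static int64_t fine_minus_coarse_ns(void)
{
	struct timespec fine, coarse;
	int rc;

	rc = clock_gettime(CLOCK_REALTIME, &fine);
	assert(rc == 0);
	rc = clock_gettime(CLOCK_REALTIME_COARSE, &coarse);
	assert(rc == 0);
	(void)rc;

	return ts_to_ns(fine) - ts_to_ns(coarse);
}
```

Calling fine_minus_coarse_ns() repeatedly on a Linux system will typically return values between zero and roughly one tick period, though the exact range depends on the kernel configuration and is not asserted here.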
Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()
On Thu, Jun 1, 2017 at 6:22 PM, Arnd Bergmann <a...@arndb.de> wrote: > On Thu, Jun 1, 2017 at 11:56 AM, Yan, Zheng <uker...@gmail.com> wrote: >> On Sat, Apr 8, 2017 at 8:57 AM, Deepa Dinamani <deepa.ker...@gmail.com> >> wrote: > >>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c >>> index 517838b..77204da 100644 >>> --- a/drivers/block/rbd.c >>> +++ b/drivers/block/rbd.c >>> @@ -1922,7 +1922,7 @@ static void rbd_osd_req_format_write(struct >>> rbd_obj_request *obj_request) >>> { >>> struct ceph_osd_request *osd_req = obj_request->osd_req; >>> >>> - osd_req->r_mtime = CURRENT_TIME; >>> + ktime_get_real_ts(_req->r_mtime); >>> osd_req->r_data_offset = obj_request->offset; >>> } >>> >>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c >>> index c681762..1d3fa90 100644 >>> --- a/fs/ceph/mds_client.c >>> +++ b/fs/ceph/mds_client.c >>> @@ -1666,6 +1666,7 @@ struct ceph_mds_request * >>> ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode) >>> { >>> struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS); >>> + struct timespec ts; >>> >>> if (!req) >>> return ERR_PTR(-ENOMEM); >>> @@ -1684,7 +1685,8 @@ ceph_mdsc_create_request(struct ceph_mds_client >>> *mdsc, int op, int mode) >>> init_completion(>r_safe_completion); >>> INIT_LIST_HEAD(>r_unsafe_item); >>> >>> - req->r_stamp = current_fs_time(mdsc->fsc->sb); >>> + ktime_get_real_ts(); >>> + req->r_stamp = timespec_trunc(ts, mdsc->fsc->sb->s_time_gran); >> >> This change causes our kernel_untar_tar test case to fail (inode's >> ctime goes back). The reason is that there is time drift between the >> time stamps got by ktime_get_real_ts() and current_time(). We need to >> revert this change until current_time() uses ktime_get_real_ts() >> internally. > > Hmm, the change was not supposed to have a user-visible effect, so > something has gone wrong, but I don't immediately see how it > relates to what you observe. 
> > ktime_get_real_ts() and current_time() use the same time base, there > is no drift, but there is a difference in resolution, as the latter uses > the time stamp of the last jiffies update, which may be up to one jiffy > (10ms) behind the exact time we put in the request stamps here. > It happens in following sequence of events 1. create a new file, the inode's ctime is set to ktime_get_real_ts() 2. chmod the new file, the inode's ctime is set to current_time(). Inode's ctime goes back when current_time() is behind ktime_get_real_ts(). Regards Yan, Zheng > Do you still see problems if you use current_kernel_time() instead of > ktime_get_real_ts()? > > Arnd -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
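The sequence of events described in this message can be modelled outside the kernel. The following is a minimal sketch, not kernel code: `fine_clock_ns()` and `coarse_clock_ns()` are hypothetical stand-ins for ktime_get_real_ts() and current_time(), and the 10 ms tick is the figure quoted in the thread (it assumes HZ=100).

```c
#include <stdint.h>

/* Hypothetical model, not kernel code: one jiffy is assumed to be 10 ms. */
#define TICK_NS 10000000ULL

/* Stand-in for ktime_get_real_ts(): returns the exact time. */
static uint64_t fine_clock_ns(uint64_t now_ns)
{
	return now_ns;
}

/* Stand-in for current_time(): the time of the last tick update. */
static uint64_t coarse_clock_ns(uint64_t now_ns)
{
	return now_ns - (now_ns % TICK_NS);
}

/* Returns 1 if a chmod at chmod_ns would move the ctime backwards. */
static int ctime_goes_back(uint64_t create_ns, uint64_t chmod_ns)
{
	uint64_t ctime_after_create = fine_clock_ns(create_ns);
	uint64_t ctime_after_chmod = coarse_clock_ns(chmod_ns);

	return ctime_after_chmod < ctime_after_create;
}
```

In this model a chmod that lands in the same tick as the create gets a stamp rounded down to the tick boundary, which is earlier than the create stamp; once a full tick has passed the order is correct again.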
Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()
On Thu, Jun 1, 2017 at 6:22 PM, Arnd Bergmann <a...@arndb.de> wrote: > On Thu, Jun 1, 2017 at 11:56 AM, Yan, Zheng <uker...@gmail.com> wrote: >> On Sat, Apr 8, 2017 at 8:57 AM, Deepa Dinamani <deepa.ker...@gmail.com> >> wrote: > >>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c >>> index 517838b..77204da 100644 >>> --- a/drivers/block/rbd.c >>> +++ b/drivers/block/rbd.c >>> @@ -1922,7 +1922,7 @@ static void rbd_osd_req_format_write(struct >>> rbd_obj_request *obj_request) >>> { >>> struct ceph_osd_request *osd_req = obj_request->osd_req; >>> >>> - osd_req->r_mtime = CURRENT_TIME; >>> + ktime_get_real_ts(&osd_req->r_mtime); >>> osd_req->r_data_offset = obj_request->offset; >>> } >>> >>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c >>> index c681762..1d3fa90 100644 >>> --- a/fs/ceph/mds_client.c >>> +++ b/fs/ceph/mds_client.c >>> @@ -1666,6 +1666,7 @@ struct ceph_mds_request * >>> ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode) >>> { >>> struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS); >>> + struct timespec ts; >>> >>> if (!req) >>> return ERR_PTR(-ENOMEM); >>> @@ -1684,7 +1685,8 @@ ceph_mdsc_create_request(struct ceph_mds_client >>> *mdsc, int op, int mode) >>> init_completion(&req->r_safe_completion); >>> INIT_LIST_HEAD(&req->r_unsafe_item); >>> >>> - req->r_stamp = current_fs_time(mdsc->fsc->sb); >>> + ktime_get_real_ts(&ts); >>> + req->r_stamp = timespec_trunc(ts, mdsc->fsc->sb->s_time_gran); >> >> This change causes our kernel_untar_tar test case to fail (inode's >> ctime goes back). The reason is that there is time drift between the >> time stamps got by ktime_get_real_ts() and current_time(). We need to >> revert this change until current_time() uses ktime_get_real_ts() >> internally. > > Hmm, the change was not supposed to have a user-visible effect, so > something has gone wrong, but I don't immediately see how it > relates to what you observe. 
> > ktime_get_real_ts() and current_time() use the same time base, there > is no drift, but there is a difference in resolution, as the latter uses > the time stamp of the last jiffies update, which may be up to one jiffy > (10ms) behind the exact time we put in the request stamps here. > > Do you still see problems if you use current_kernel_time() instead of > ktime_get_real_ts()? The problem disappears after using current_kernel_time(). https://github.com/ceph/ceph-client/commit/2e0f648da23167034a3cf1500bc90ec60aef2417 Regards Yan, Zheng > > Arnd -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()
On Sat, Apr 8, 2017 at 8:57 AM, Deepa Dinamani <deepa.ker...@gmail.com> wrote: > CURRENT_TIME is not y2038 safe. > The macro will be deleted and all the references to it > will be replaced by ktime_get_* apis. > > struct timespec is also not y2038 safe. > Retain timespec for timestamp representation here as ceph > uses it internally everywhere. > These references will be changed to use struct timespec64 > in a separate patch. > > The current_fs_time() api is being changed to use vfs > struct inode* as an argument instead of struct super_block*. > > Set the new mds client request r_stamp field using > ktime_get_real_ts() instead of using current_fs_time(). > > Also, since r_stamp is used as mtime on the server, use > timespec_trunc() to truncate the timestamp, using the right > granularity from the superblock. > > This api will be transitioned to be y2038 safe along > with vfs. > > Signed-off-by: Deepa Dinamani <deepa.ker...@gmail.com> > Reviewed-by: Arnd Bergmann <a...@arndb.de> > --- > drivers/block/rbd.c | 2 +- > fs/ceph/mds_client.c | 4 +++- > net/ceph/messenger.c | 6 ++++-- > net/ceph/osd_client.c | 4 ++-- > 4 files changed, 10 insertions(+), 6 deletions(-) > > diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c > index 517838b..77204da 100644 > --- a/drivers/block/rbd.c > +++ b/drivers/block/rbd.c > @@ -1922,7 +1922,7 @@ static void rbd_osd_req_format_write(struct > rbd_obj_request *obj_request) > { > struct ceph_osd_request *osd_req = obj_request->osd_req; > > - osd_req->r_mtime = CURRENT_TIME; > + ktime_get_real_ts(&osd_req->r_mtime); > osd_req->r_data_offset = obj_request->offset; > } > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c > index c681762..1d3fa90 100644 > --- a/fs/ceph/mds_client.c > +++ b/fs/ceph/mds_client.c > @@ -1666,6 +1666,7 @@ struct ceph_mds_request * > ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode) > { > struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS); > + struct timespec ts; > > if (!req) 
> return ERR_PTR(-ENOMEM); > @@ -1684,7 +1685,8 @@ ceph_mdsc_create_request(struct ceph_mds_client *mdsc, > int op, int mode) > init_completion(&req->r_safe_completion); > INIT_LIST_HEAD(&req->r_unsafe_item); > > - req->r_stamp = current_fs_time(mdsc->fsc->sb); > + ktime_get_real_ts(&ts); > + req->r_stamp = timespec_trunc(ts, mdsc->fsc->sb->s_time_gran); This change causes our kernel_untar_tar test case to fail (inode's ctime goes back). The reason is that there is time drift between the time stamps got by ktime_get_real_ts() and current_time(). We need to revert this change until current_time() uses ktime_get_real_ts() internally. Regards Yan, Zheng > > req->r_op = op; > req->r_direct_mode = mode; > diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c > index f76bb33..5766a6c 100644 > --- a/net/ceph/messenger.c > +++ b/net/ceph/messenger.c > @@ -1386,8 +1386,9 @@ static void prepare_write_keepalive(struct > ceph_connection *con) > dout("prepare_write_keepalive %p\n", con); > con_out_kvec_reset(con); > if (con->peer_features & CEPH_FEATURE_MSGR_KEEPALIVE2) { > - struct timespec now = CURRENT_TIME; > + struct timespec now; > > + ktime_get_real_ts(&now); > con_out_kvec_add(con, sizeof(tag_keepalive2), > &tag_keepalive2); > ceph_encode_timespec(&con->out_temp_keepalive2, &now); > con_out_kvec_add(con, sizeof(con->out_temp_keepalive2), > @@ -3176,8 +3177,9 @@ bool ceph_con_keepalive_expired(struct ceph_connection > *con, > { > if (interval > 0 && > (con->peer_features & CEPH_FEATURE_MSGR_KEEPALIVE2)) { > - struct timespec now = CURRENT_TIME; > + struct timespec now; > struct timespec ts; > + ktime_get_real_ts(&now); > jiffies_to_timespec(interval, &ts); > ts = timespec_add(con->last_keepalive_ack, ts); > return timespec_compare(&now, &ts) >= 0; > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c > index e15ea9e..242d7c0 100644 > --- a/net/ceph/osd_client.c > +++ b/net/ceph/osd_client.c > @@ -3574,7 +3574,7 @@ ceph_osdc_watch(struct ceph_osd_client *osdc, > ceph_oid_copy(&lreq->t.base_oid, oid); > 
ceph_oloc_copy(&lreq->t.base_oloc, oloc); > lreq->t.flags = CEPH_OSD_FLAG_WRITE; > - lreq->mtime = CURRENT_TIME; > + ktime_get_real_ts(&lreq->mtim
Re: bad unlock balance in btrfs_commit_transaction_async
On Wed, Aug 28, 2013 at 8:56 AM, Dan Mick dan.m...@inktank.com wrote: Another developer just noticed this in testing; anyone have any ideas? btrfs_ioctl_start_sync() calls btrfs_attach_transaction_barrier() which further calls start_transaction() with type == TRANS_ATTACH. In start_transaction(), sb_start_intwrite() is called when (type & __TRANS_FREEZABLE) is true, but (TRANS_ATTACH & __TRANS_FREEZABLE) is false. So we see the bad unlock balance bug. Yan, Zheng On 08/22/2013 05:40 PM, Sage Weil wrote: I just noticed that there is a locking imbalance warning with sb_internal in the transaction commit code. I believe this has only started appearing recently (after I merged -rc5 into my testing tree), but I'm working on confirming that. The error is 4[27034.835134] = 4[27034.839854] [ BUG: bad unlock balance detected! ] 4[27034.844576] 3.11.0-rc5-ceph-00061-g546140d #1 Not tainted 4[27034.849992] - 4[27034.854713] ceph-osd/30797 is trying to release lock (sb_internal) at: 4[27034.861304] [a0148fd8] btrfs_commit_transaction_async+0x1c8/0x2c0 [btrfs] 4[27034.868994] but there are no more locks to release! 4[27034.873887] 4[27034.873887] other info that might help us debug this: 4[27034.880448] no locks held by ceph-osd/30797. 4[27034.884733] 4[27034.884733] stack backtrace: 4[27034.889123] CPU: 0 PID: 30797 Comm: ceph-osd Not tainted 3.11.0-rc5-ceph-00061-g546140d #1 4[27034.897421] Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011 4[27034.904938] a0148fd8 88020baf9c68 81642d85 0007 4[27034.912411] 88021b32deb0 88020baf9c98 810ab89e 88020cff8000 4[27034.919883] 0246 88020aaeddd0 a0148fd8 88020baf9ce8 4[27034.927358] Call Trace: 4[27034.929836] [a0148fd8] ? btrfs_commit_transaction_async+0x1c8/0x2c0 [btrfs] 4[27034.937790] [81642d85] dump_stack+0x46/0x58 4[27034.942951] [810ab89e] print_unlock_imbalance_bug+0xfe/0x110 4[27034.949599] [a0148fd8] ? 
btrfs_commit_transaction_async+0x1c8/0x2c0 [btrfs] 4[27034.957552] [810aeafe] lock_release+0x15e/0x220 4[27034.963069] [a0148fff] btrfs_commit_transaction_async+0x1ef/0x2c0 [btrfs] 4[27034.970850] [810af6f5] ? trace_hardirqs_on_caller+0x105/0x1d0 4[27034.977587] [a0177907] btrfs_ioctl_start_sync+0x47/0xc0 [btrfs] 4[27034.984499] [a017c575] btrfs_ioctl+0xe55/0x1af0 [btrfs] 4[27034.990700] [8164ab2b] ? _raw_spin_unlock+0x2b/0x40 4[27034.996552] [810b40b5] ? do_futex+0xa45/0xbb0 4[27035.001885] [8119cf0c] ? fget_light+0x3c/0x130 4[27035.007302] [811922a6] do_vfs_ioctl+0x96/0x560 4[27035.012720] [8119cf6e] ? fget_light+0x9e/0x130 4[27035.018137] [8119cf0c] ? fget_light+0x3c/0x130 4[27035.023556] [81192801] SyS_ioctl+0x91/0xb0 4[27035.028628] [813338fe] ? trace_hardirqs_on_thunk+0x3a/0x3f 4[27035.035089] [81653782] system_call_fastpath+0x16/0x1b This is presumably some breakage in the freeze locking dance that goes on with async commits (btrfs_commit_transaction_async caller takes the freeze semaphore, do_async_commit releases it), but it's not obvious to me what broke yet. Unless this rings any bells for anyone, I'll go ahead and bisect. Thanks! sage
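The flag test in Yan's explanation can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the `TYPE_*` constants and the `sb_model` counter are hypothetical (the real flag values live in fs/btrfs/transaction.h), and it assumes, as the lockdep report suggests, that the async commit path releases sb_internal regardless of the handle type.

```c
/* Illustrative flag values only; not copied from the kernel headers. */
#define TYPE_FREEZABLE (1U << 0)
#define TYPE_START     (1U << 9)
#define TYPE_ATTACH    (1U << 10)

/* Hypothetical mirrors of TRANS_START and TRANS_ATTACH. */
#define MODEL_TRANS_START  (TYPE_START | TYPE_FREEZABLE)
#define MODEL_TRANS_ATTACH (TYPE_ATTACH)

/* Hypothetical model of the sb_internal hold count. */
struct sb_model {
	int intwrite_held;
};

/* Like start_transaction(): take the lock only for freezable types. */
static void model_start(struct sb_model *sb, unsigned int type)
{
	if (type & TYPE_FREEZABLE)
		sb->intwrite_held++;
}

/* Assumed behaviour of the async commit path: release unconditionally. */
static void model_commit_async(struct sb_model *sb)
{
	sb->intwrite_held--;
}

/* Hold count after one start/commit pair; a negative value is an imbalance. */
static int model_balance(unsigned int type)
{
	struct sb_model sb = { 0 };

	model_start(&sb, type);
	model_commit_async(&sb);
	return sb.intwrite_held;
}
```

With the freezable type the count returns to zero; with the attach type the release happens without a matching acquire, which is the situation lockdep reports as "no more locks to release".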
Re: [BUG] kernel BUG at fs/btrfs/relocation.c:2502!
On Wed, Sep 21, 2011 at 6:06 PM, Liu Bo liubo2...@cn.fujitsu.com wrote: Hi, While running my tool(attachment), I would encounter the BUG_ON, and I FAILED to find where went wrong :( The tool is with inode_cache option, and mainly do three things: a. run Chris's synctest in BACKGROUND b. create 100 snapshots c. after b, run btrfs fi balance You can follow these tips to reproduce the bug: 1) untar the attachment, 2) prepare 4 partitions, the mount point is default to /mnt, 3) $ ./2_while.sh /dev/sdb1 /dev/sdb2 /dev/sdb3 /dev/sdb4 4) then just wait several minutes and you will get the bug. NOTE: You MAY hit a warning and I've fixed it with this patch: http://marc.info/?l=linux-btrfs&m=131547325515336&w=2 === kernel BUG at fs/btrfs/relocation.c:2502! [...] Call Trace: [a03d7a6b] ? block_rsv_add_bytes+0x2b/0x70 [btrfs] [a043292f] relocate_tree_blocks+0x60f/0x6d0 [btrfs] [a0433498] ? add_data_references+0x248/0x260 [btrfs] [a0433722] relocate_block_group+0x272/0x620 [btrfs] [a0433c83] btrfs_relocate_block_group+0x1b3/0x2d0 [btrfs] [a0413163] btrfs_relocate_chunk+0x93/0x6a0 [btrfs] [8103bfb3] ? __wake_up+0x53/0x70 [a041db42] ? btrfs_tree_read_unlock_blocking+0x42/0x70 [btrfs] [a04144d2] btrfs_balance+0x212/0x2a0 [btrfs] [81152459] ? path_openat+0x109/0x3e0 [a041d398] btrfs_ioctl+0x798/0xd20 [btrfs] [81112ec3] ? handle_mm_fault+0x143/0x260 [81474504] ? do_page_fault+0x1d4/0x440 [8115562a] do_vfs_ioctl+0x9a/0x540 [81155b71] sys_ioctl+0xa1/0xb0 [8147896b] system_call_fastpath+0x16/0x1b Code: 00 00 00 00 00 eb f6 0f 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 0f 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 0f 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 48 83 7a 68 00 0f 84 eb fa RIP [a0430c26] do_relocation+0x546/0x570 [btrfs] RSP 88003c0859a8 ---[ end trace 6a4328153ff7ff17 ]--- Calling btrfs_save_ino_cache() in commit_fs_roots() is completely wrong: modification to fs trees is not allowed after create_pending_snapshots().
Re: [BUG] kernel BUG at fs/btrfs/relocation.c:2502!
On Thu, Sep 22, 2011 at 7:42 AM, David Sterba d...@jikos.cz wrote: On Wed, Sep 21, 2011 at 06:57:56PM +0800, Yan, Zheng wrote: modification to fs trees is not allowed after create_pending_snapshots() Do you have an idea whether there is a reasonable way to catch this in code? (even if only under a config option). No idea.
[PATCH] btrfs: check file extent backref offset underflow
The offset field in a data extent backref can underflow if the clone range ioctl is used. We can reliably detect the underflow because max file size is limited to 2^63 and max data extent size is limited by block group size.

Signed-off-by: Zheng Yan zheng.z@intel.com
---
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 59bb176..107c9cf 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3323,8 +3323,11 @@ static int find_data_references(struct reloc_control *rc,
 	}
 
 	key.objectid = ref_objectid;
-	key.offset = ref_offset;
 	key.type = BTRFS_EXTENT_DATA_KEY;
+	if (ref_offset > ((u64)-1 << 32))
+		key.offset = 0;
+	else
+		key.offset = ref_offset;
 
 	path->search_commit_root = 1;
 	path->skip_locking = 1;
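The check in this patch can be tried in userspace. The sketch below is only an illustration: the helper names are hypothetical, the input values are made up, and it assumes the backref offset is the file position minus the extent offset, computed in unsigned 64-bit arithmetic.

```c
#include <stdint.h>

typedef uint64_t u64;

/* Assumed formula for the backref offset: file_pos - extent_offset.
 * In u64 arithmetic a negative result wraps around modulo 2^64. */
static u64 backref_offset(u64 file_pos, u64 extent_offset)
{
	return file_pos - extent_offset;
}

/* Same comparison as the patch: values above 0xffffffff00000000
 * cannot be real file offsets, so they are treated as underflow. */
static int backref_offset_underflowed(u64 ref_offset)
{
	return ref_offset > ((u64)-1 << 32);
}

/* Search key offset as chosen by the patch: start from 0 on underflow. */
static u64 search_key_offset(u64 ref_offset)
{
	return backref_offset_underflowed(ref_offset) ? 0 : ref_offset;
}
```

For example, `backref_offset(0, 8192)` wraps to 18446744073709543424, which is above the threshold, so the search would start at offset 0; an ordinary offset such as 4096 is passed through unchanged.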
Re: [RFC] Btrfs design defect in extent backref ?
On Thu, Aug 25, 2011 at 3:56 PM, Li Zefan l...@cn.fujitsu.com wrote: We have an offset in file extent to indicate its position in the corresponding extent item in extent tree. We also have an offset in extent item to indicate the start position of the file extent that uses this item. The math is: extent_item.extent_data_ref.offset = file_pos - file_extent.extent_offset. e1 disk extents: |--| ^ | e2 | |-| | | ^ | | | v v | file extents: |- f1 -|- f2 -| So it looks like e2.offset points to f1 not f2. Therefore given an extent item, we'll have to search through all the file extents in an inode to find the relative file extent in the worst case, which makes this field somewhat useless. The reason for this is reducing the number of file extent backref items. We don't have to search all the file extents because the file extent size is limited and we have extent_data_ref.count. What makes things worse is the above formula can make the offset a negative value (cast to u64): # touch /mnt/dst # clone_range -s 8192 -d 4096 /mnt/src /mnt/dst # umount /mnt # btrfs-debug-tree /dev/sda7 ... item 2 key (12582912 EXTENT_ITEM 49152) itemoff 3865 itemsize 82 extent refs 2 gen 8 flags 1 extent data backref root 5 objectid 258 offset 18446744073709543424 count 1 extent data backref root 5 objectid 257 offset 0 count 1 ... and relocation won't work in this case: # mount /dev/sda7 /mnt # rm /mnt/src # sync # btrfs fi bal /mnt (kernel warning !!) (hung up !!) I don't see the necessity or benefit of the subtraction in the formula, and I think the correct one is: extent_item.extent_data_ref.offset = file_pos (As a side effect thereafter we don't need extent_data_ref.count) That's what this patch does. Unfortunately it is an incompatible change in disk format. So I think we have to live with this defect, just fix relocation for the negative offset case ? I prefer fixing relocation. 
Signed-off-by: Li Zefan l...@cn.fujitsu.com --- fs/btrfs/extent-tree.c | 1 - fs/btrfs/file.c | 11 +-- fs/btrfs/inode.c | 7 +++ fs/btrfs/ioctl.c | 2 +- fs/btrfs/relocation.c | 1 - 5 files changed, 9 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index f5be06a..3924e03 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2578,7 +2578,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans, continue; num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi); - key.offset -= btrfs_file_extent_offset(buf, fi); ret = process_func(trans, root, bytenr, num_bytes, parent, ref_root, key.objectid, key.offset); diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index e7872e4..7f65a27 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -678,7 +678,7 @@ next_slot: disk_bytenr, num_bytes, 0, root->root_key.objectid, new_key.objectid, - start - extent_offset); + start); BUG_ON(ret); *hint_byte = disk_bytenr; } @@ -752,8 +752,7 @@ next_slot: ret = btrfs_free_extent(trans, root, disk_bytenr, num_bytes, 0, root->root_key.objectid, - key.objectid, key.offset - - extent_offset); + key.objectid, key.offset); BUG_ON(ret); inode_sub_bytes(inode, extent_end - key.offset); @@ -962,7 +961,7 @@ again: ret = btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, 0, root->root_key.objectid, - ino, orig_offset); + ino, split); BUG_ON(ret); if (split == start) { @@ -989,7 +988,7 @@ again: del_nr++; ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
Re: [RFC] Btrfs design defect in extent backref ?
On Fri, Aug 26, 2011 at 10:00 AM, Li Zefan l...@cn.fujitsu.com wrote: Yan, Zheng wrote: On Thu, Aug 25, 2011 at 3:56 PM, Li Zefan l...@cn.fujitsu.com wrote: We have an offset in file extent to indicate its position in the corresponding extent item in extent tree. We also have an offset in extent item to indicate the start position of the file extent that uses this item. The math is: extent_item.extent_data_ref.offset = file_pos - file_extent.extent_offset. e1 disk extents: |--| ^ | e2 | |-| | | ^ | | | v v | file extents: |- f1 -|- f2 -| So it looks like e2.offset points to f1 not f2. Therefore given an extent item, we'll have to search through all the file extents in an inode to find the relative file extent in the worst case, which makes this field somewhat useless. The reason for this is reducing the number of file extent backref items. It seems to me a rare case, which isn't worth the complexity and inconvenience it brings, and it requires an extra field (.count). Random write workload isn't a rare case. We don't have to search all the file extents because the file extent size is limited and we have extent_data_ref.count. Yes we have to, and for a big file with many small file extents, the extent number is not trivial. Max file extent size is 128M, so we only need to scan a 128M range in the worst case. What makes things worse is the above formula can make the offset a negative value (cast to u64): # touch /mnt/dst # clone_range -s 8192 -d 4096 /mnt/src /mnt/dst # umount /mnt # btrfs-debug-tree /dev/sda7 ... item 2 key (12582912 EXTENT_ITEM 49152) itemoff 3865 itemsize 82 extent refs 2 gen 8 flags 1 extent data backref root 5 objectid 258 offset 18446744073709543424 count 1 extent data backref root 5 objectid 257 offset 0 count 1 ... and relocation won't work in this case: # mount /dev/sda7 /mnt # rm /mnt/src # sync # btrfs fi bal /mnt (kernel warning !!) (hung up !!) 
I don't see the necessity or benefit of the subtraction in the formula, and I think the correct one is: extent_item.extent_data_ref.offset = file_pos (As a side effect thereafter we don't need extent_data_ref.count) That's what this patch does. Unfortunately it is an incompatible change in disk format. So I think we have to live with this defect, just fix relocation for the negative offset case ? I prefer fixing relocation. Sure, though I would prefer the alternative if not for the stability of disk format.
Re: bug caused by removal of trans_mutex? (Was: Re: kernel BUG at fs/btrfs/extent-tree.c:6164!)
I found another bug. There is code (btrfs_save_ino_cache) that modifies fs trees after create_pending_snapshots is called. This can corrupt your fs. On Mon, Jun 13, 2011 at 3:13 PM, Li Zefan l...@cn.fujitsu.com wrote: Cc: Josef I encountered following panic using 'btrfs-unstable + for-linus' kernel. I ran btrfs fi bal /test5 command, and mount option of /test5 is as follows: /dev/sdc3 on /test5 type btrfs (rw,space_cache,compress=lzo,inode_cache) So, just a btrfs fi bal would lead to the bug? I think so. It should be specific to the inode caching code. The balancing code is finding the inode map cache extents, but it doesn't know how to relocate them. However, the panic has occurred even if inode_cache is turned off. Is this another problem? I don't think free inode cache is the cause of the bug here (even if inode_cache is turned on). What I have found out is: 1. git checkout a4abeea41adfa3c143c289045f4625dfaeba2212 So the top commit is the removal of trans_mutex and no delayed_inode patch or free inode cache patchset in the tree, and bug can be triggered. 2. git checkout 2a1eb4614d984d5cd4c928784e9afcf5c07f93be So the top commit is the one before trans_mutex removal, and no bug triggered. 3. test linus' tree bug triggered. 4. revert that suspicious commit manually from linus' tree no bug. so either that commit is buggy or it reveals some bugs covered by the trans_mutex.
Re: bug caused by removal of trans_mutex? (Was: Re: kernel BUG at fs/btrfs/extent-tree.c:6164!)
Add a mutex to btrfs_init_reloc_root() to prevent the reloc tree creation from concurrent execution. On Mon, Jun 13, 2011 at 3:13 PM, Li Zefan l...@cn.fujitsu.com wrote: Cc: Josef I encountered following panic using 'btrfs-unstable + for-linus' kernel. I ran btrfs fi bal /test5 command, and mount option of /test5 is as follows: /dev/sdc3 on /test5 type btrfs (rw,space_cache,compress=lzo,inode_cache) So, just a btrfs fi bal would lead to the bug? I think so. It should be specific to the inode caching code. The balancing code is finding the inode map cache extents, but it doesn't know how to relocate them. However, the panic has occurred even if inode_cache is turned off. Is this another problem? I don't think free inode cache is the cause of the bug here (even if inode_cache is turned on). What I have found out is: 1. git checkout a4abeea41adfa3c143c289045f4625dfaeba2212 So the top commit is the removal of trans_mutex and no delayed_inode patch or free inode cache patchset in the tree, and bug can be triggered. 2. git checkout 2a1eb4614d984d5cd4c928784e9afcf5c07f93be So the top commit is the one before trans_mutex removal, and no bug triggered. 3. test linus' tree bug triggered. 4. revert that suspicious commit manually from linus' tree no bug. so either that commit is buggy or it reveals some bugs covered by the trans_mutex.
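The suggestion at the top of this message, serializing reloc tree creation with a mutex, can be sketched in userspace. This is a hypothetical model, not the kernel change: `model_root` and `model_init_reloc_root()` are invented names, and a pthread mutex stands in for a kernel mutex. It shows the usual single-exit shape where every return path goes through one unlock label.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a root with a lazily created reloc root. */
struct model_root {
	int *reloc_root;    /* NULL until the reloc root has been created */
	int reloc_storage;  /* backing storage for the model's reloc root */
	int create_count;   /* how many times creation actually ran */
};

static pthread_mutex_t reloc_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Create the reloc root at most once; all exits pass through "unlock". */
static int model_init_reloc_root(struct model_root *root)
{
	pthread_mutex_lock(&reloc_mutex);
	if (root->reloc_root)
		goto unlock; /* an earlier caller already created it */

	root->reloc_storage = 1;
	root->reloc_root = &root->reloc_storage;
	root->create_count++;
unlock:
	pthread_mutex_unlock(&reloc_mutex);
	return 0;
}
```

Because the existence check and the creation both happen under the same lock, two callers cannot both see a NULL reloc root and create it twice.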
Re: bug caused by removal of trans_mutex? (Was: Re: kernel BUG at fs/btrfs/extent-tree.c:6164!)
The usage of trans_mutex in relocation code is subtle. It controls interaction of relocation with transaction start, transaction commit and snapshot creation. Simply replacing trans_mutex with trans_lock is wrong. On Mon, Jun 13, 2011 at 3:13 PM, Li Zefan l...@cn.fujitsu.com wrote: Cc: Josef I encountered following panic using 'btrfs-unstable + for-linus' kernel. I ran btrfs fi bal /test5 command, and mount option of /test5 is as follows: /dev/sdc3 on /test5 type btrfs (rw,space_cache,compress=lzo,inode_cache) So, just a btrfs fi bal would lead to the bug? I think so. It should be specific to the inode caching code. The balancing code is finding the inode map cache extents, but it doesn't know how to relocate them. However, the panic has occurred even if inode_cache is turned off. Is this another problem? I don't think free inode cache is the cause of the bug here (even if inode_cache is turned on). What I have found out is: 1. git checkout a4abeea41adfa3c143c289045f4625dfaeba2212 So the top commit is the removal of trans_mutex and no delayed_inode patch or free inode cache patchset in the tree, and bug can be triggered. 2. git checkout 2a1eb4614d984d5cd4c928784e9afcf5c07f93be So the top commit is the one before trans_mutex removal, and no bug triggered. 3. test linus' tree bug triggered. 4. revert that suspicious commit manually from linus' tree no bug. so either that commit is buggy or it reveals some bugs covered by the trans_mutex.
Re: bug caused by removal of trans_mutex? (Was: Re: kernel BUG at fs/btrfs/extent-tree.c:6164!)
On Tue, Jun 14, 2011 at 3:55 AM, Chris Mason chris.ma...@oracle.com wrote: Excerpts from Yan, Zheng's message of 2011-06-13 10:58:35 -0400: The usage of trans_mutex in relocation code is subtle. It controls interaction of relocation with transaction start, transaction commit and snapshot creation. Simple replacing trans_mutex with trans_lock is wrong. So, I've got a mutex around the reloc_root here and that was almost but not quite enough. It looks like the biggest problem is that we need to wait in btrfs_record_root_in_trans for anyone inside merge_reloc_roots. I'm surviving much longer with a patch in place that synchronizes btrfs_record_root_in_trans better. Zheng if you have other comments on the locking please let me know. The following untested patch may help. --- diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 378b5b4..0b20dda 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -951,6 +951,7 @@ struct btrfs_fs_info { struct mutex cleaner_mutex; struct mutex chunk_mutex; struct mutex volume_mutex; + struct mutex reloc_mutex; /* * this protects the ordered operations list only while we are * processing all of the entries on it. 
This way we make diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 9f68c68..28f8b11 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1714,6 +1714,7 @@ struct btrfs_root *open_ctree(struct super_block *sb, mutex_init(&fs_info->transaction_kthread_mutex); mutex_init(&fs_info->cleaner_mutex); mutex_init(&fs_info->volume_mutex); + mutex_init(&fs_info->reloc_mutex); init_rwsem(&fs_info->extent_commit_sem); init_rwsem(&fs_info->cleanup_work_sem); init_rwsem(&fs_info->subvol_sem); diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index b1ef27c..620e4af 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -1330,18 +1330,20 @@ int btrfs_init_reloc_root(struct btrfs_trans_handle *trans, struct btrfs_root *root) { struct btrfs_root *reloc_root; - struct reloc_control *rc = root->fs_info->reloc_ctl; + struct reloc_control *rc; int clear_rsv = 0; + mutex_lock(&root->fs_info->reloc_mutex); if (root->reloc_root) { reloc_root = root->reloc_root; reloc_root->last_trans = trans->transid; - return 0; + goto unlock; } + rc = root->fs_info->reloc_ctl; if (!rc || !rc->create_reloc_tree || root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) - return 0; + goto unlock; if (!trans->block_rsv) { trans->block_rsv = rc->block_rsv; @@ -1353,6 +1355,8 @@ int btrfs_init_reloc_root(struct btrfs_trans_handle *trans, __add_reloc_root(reloc_root); root->reloc_root = reloc_root; +unlock: + mutex_unlock(&root->fs_info->reloc_mutex); return 0; } @@ -1367,8 +1371,9 @@ int btrfs_update_reloc_root(struct btrfs_trans_handle *trans, int del = 0; int ret; + mutex_lock(&root->fs_info->reloc_mutex); if (!root->reloc_root) - return 0; + goto unlock; reloc_root = root->reloc_root; root_item = &reloc_root->root_item; @@ -1390,6 +1395,8 @@ int btrfs_update_reloc_root(struct btrfs_trans_handle *trans, ret = btrfs_update_root(trans, root->fs_info->tree_root, &reloc_root->root_key, root_item); BUG_ON(ret); +unlock: + mutex_unlock(&root->fs_info->reloc_mutex); return 0; } @@ -2142,10 +2149,10 @@ int 
prepare_to_merge(struct reloc_control *rc, int err) u64 num_bytes = 0; int ret; - spin_lock(&root->fs_info->trans_lock); + mutex_lock(&root->fs_info->reloc_mutex); rc->merging_rsv_size += root->nodesize * (BTRFS_MAX_LEVEL - 1) * 2; rc->merging_rsv_size += rc->nodes_relocated * 2; - spin_unlock(&root->fs_info->trans_lock); + mutex_unlock(&root->fs_info->reloc_mutex); again: if (!err) { num_bytes = rc->merging_rsv_size; @@ -2214,9 +2221,9 @@ int merge_reloc_roots(struct reloc_control *rc) int ret; again: root = rc->extent_root; - spin_lock(&root->fs_info->trans_lock); + mutex_lock(&root->fs_info->reloc_mutex); list_splice_init(&rc->reloc_roots, &reloc_roots); - spin_unlock(&root->fs_info->trans_lock); + mutex_unlock(&root->fs_info->reloc_mutex); while (!list_empty(&reloc_roots)) { found = 1; @@ -3590,17 +3597,17 @@ next: static void set_reloc_control(struct reloc_control *rc) { struct btrfs_fs_info *fs_info = rc->extent_root->fs_info; - spin_lock(&fs_info->trans_lock); + mutex_lock(&fs_info->reloc_mutex); fs_info->reloc_ctl = rc; - spin_unlock(&fs_info->trans_lock); + mutex_unlock(&fs_info->reloc_mutex); } static void unset_reloc_control(struct reloc_control *rc) {
Re: [PATCH] Btrfs: check root_key's offset instead
On Wed, Jun 8, 2011 at 5:46 PM, liubo liubo2...@cn.fujitsu.com wrote: When we use reloc root to cow or copy a tree block, we do not set the block's owner, instead we set its header's flag with BTRFS_HEADER_FLAG_RELOC. So here we should check for root_key's offset. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/extent-tree.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 5b9b6b6..0bda273 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -6160,7 +6160,7 @@ static noinline int walk_up_proc(struct btrfs_trans_handle *trans, if (wc->flags[level + 1] & BTRFS_BLOCK_FLAG_FULL_BACKREF) parent = path->nodes[level + 1]->start; else - BUG_ON(root->root_key.objectid != + BUG_ON(root->root_key.offset != btrfs_header_owner(path->nodes[level + 1])); } This is wrong: all blocks with the BTRFS_HEADER_FLAG_RELOC flag set should use full back references.
Re: btrfs-convert crashes
On Wed, Apr 27, 2011 at 1:20 PM, Brian Parma <freec...@cox.net> wrote:
> I have a 1.5 TB (1,475,720,773,632) partition that I wanted to convert
> from ext4 to btrfs. It is currently used as / for ubuntu 10.10. I booted
> into 11.04 beta2 and tried a 'btrfs-convert /dev/sdc1', but after about
> 20 minutes it segfaulted.
>
> I performed a:
>
> fsck.ext4 -cDfty -C 0 /dev/sdc1
>
> After everything was clean, I downloaded the debugging symbols for
> btrfs-convert and tried again. Below is the 'bt full' output.
>
> I don't have enough free space to copy all the data off, create a fresh
> btrfs partition, and copy everything back on (I have backups of important
> stuff). Is there something else I can try to get this to work?
>
> Brian

The crash was caused by the limit on hard links per directory in btrfs. In
short, your ext4 filesystem is not convertible.

> at: http://pastebin.com/NEwJNzuP
>
> #0 0x77444d05 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> No symbol table info available.
> #1 0x77448ab6 in abort () from /lib/x86_64-linux-gnu/libc.so.6
> No symbol table info available.
> #2 0x0040502c in btrfs_extend_item (trans=<value optimized out>, root=0x633920, path=<value optimized out>, data_size=27) at ctree.c:2525
>         slot = <value optimized out>
>         slot_orig = <value optimized out>
>         leaf = 0x1955250
>         nritems = 1
>         data_end = <value optimized out>
>         old_data = <value optimized out>
>         i = <value optimized out>
>         __PRETTY_FUNCTION__ = "btrfs_extend_item"
> #3 0x0040e32d in btrfs_insert_inode_ref (trans=0xc9ef10, root=0x633920, name=0xcfa314 "gtfntf.f.svn-base", name_len=17, inode_objectid=<value optimized out>, ref_objectid=<value optimized out>, index=150) at inode-item.c:135
>         old_size = 3945
>         path = 0x1639aa0
>         key = {objectid = 37361107, type = 12 '\f', offset = 37359706}
>         ref = <value optimized out>
>         ptr = <value optimized out>
>         ret = <value optimized out>
>         ins_len = 27
>         __PRETTY_FUNCTION__ = "btrfs_insert_inode_ref"
> #4 0x00413fff in dir_iterate_proc (dir=<value optimized out>, entry=<value optimized out>, old=0xcfa30c, offset=<value optimized out>, blocksize=<value optimized out>, buf=<value optimized out>, priv_data=0x7fffe370) at convert.c:289
>         ret = <value optimized out>
>         file_type = <value optimized out>
>         objectid = 37361107
>         dotdot = ".."
>         location = {objectid = 37361107, type = 1 '\001', offset = 0}
>         dirent = 0xcfa30c
>         idata = 0x7fffe370
>         __PRETTY_FUNCTION__ = "dir_iterate_proc"
> #5 0x77bbdc13 in ext2fs_process_dir_block () from /lib/x86_64-linux-gnu/libext2fs.so.2
> No symbol table info available.
> #6 0x77bbac02 in ext2fs_block_iterate2 () from /lib/x86_64-linux-gnu/libext2fs.so.2
> No symbol table info available.
> #7 0x77bbdfb8 in ext2fs_dir_iterate2 () from /lib/x86_64-linux-gnu/libext2fs.so.2
> No symbol table info available.
> #8 0x0041689d in create_dir_entries (devname=0x7fffe897 "/dev/sdc1", datacsum=1, packing=1, noxattr=0) at convert.c:322
>         err = <value optimized out>
>         data = {trans = 0xc9ef10, root = 0x633920, inode = 0x7fffe1c0, objectid = 37359706, index_cnt = 150, parent = 37359705, errcode = 0}
>         ret = <value optimized out>
> #9 copy_single_inode (devname=0x7fffe897 "/dev/sdc1", datacsum=1, packing=1, noxattr=0) at convert.c:1072
>         ret = <value optimized out>
>         btrfs_inode = {generation = 1, transid = 140737354044640, size = 4994, nbytes = 0, block_group = 0, nlink = 1, uid = 1000, gid = 1000, mode = 16877, rdev = 0, flags = 0, sequence = 140737351933932, reserved = {0, 140737354040256, 140733193388033, 0}, atime = {sec = 1303466526, nsec = 0}, ctime = {sec = 1296464377, nsec = 0}, mtime = {sec = 1296464377, nsec = 0}, otime = {sec = 0, nsec = 0}}
> #10 copy_inodes (devname=0x7fffe897 "/dev/sdc1", datacsum=1, packing=1, noxattr=0) at convert.c:1154
>         ret = <value optimized out>
>         err = <value optimized out>
>         ext2_scan = 0xce2300
>         ext2_ino = 37359452
>         objectid = 37359706
>         ext2_inode = {i_mode = 16877, i_uid = 1000, i_size = 16384, i_atime = 1303466526, i_ctime = 1296464377, i_mtime = 1296464377, i_dtime = 0, i_gid = 1000, i_links_count = 2, i_blocks = 32, i_flags = 528384, osd1 = {linux1 = {l_i_version = 1981}, hurd1 = {h_i_translator = 1981}}, i_block = {193290, 4, 0, 0, 1, 149430439, 1, 3, 149430464, 0, 0, 0, 0, 0, 0}, i_generation = 2854948622, i_file_acl = 0, i_dir_acl = 0, i_faddr = 0, osd2 = {linux2 = {l_i_blocks_hi = 0, l_i_file_acl_high = 0, l_i_uid_high = 0, l_i_gid_high = 0, l_i_reserved2 = 0}, hurd2 = {h_i_frag = 0 '\000', h_i_fsize = 0 '\000', h_i_mode_high = 0, h_i_uid_high = 0, h_i_gid_high = 0, h_i_author = 0}}}
>         trans = 0xc9ef10
> #11 do_convert (devname=0x7fffe897 "/dev/sdc1", datacsum=1, packing=1, noxattr=0) at convert.c:2411
>         i = <value
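The numbers in frame #3 of the backtrace fit the hard-link explanation. The following is only a rough sketch of the arithmetic, not code from btrfs-progs. It assumes a 4096-byte leaf, a 101-byte leaf header, a 25-byte item header and a 10-byte inode ref header; none of these sizes appear in the thread, and the function names are made up for illustration. Under those assumptions, all names that link one inode into one directory share a single item, and that item must fit in a single leaf:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed on-disk sizes in bytes; these are NOT taken from the thread. */
enum {
	LEAF_HEADER_SIZE = 101, /* assumed sizeof(struct btrfs_header) */
	ITEM_HEADER_SIZE = 25,  /* assumed sizeof(struct btrfs_item) */
	INODE_REF_SIZE   = 10,  /* assumed sizeof(struct btrfs_inode_ref) */
};

/* Bytes of a leaf available for item headers plus item data. */
uint32_t leaf_data_size(uint32_t nodesize)
{
	return nodesize - LEAF_HEADER_SIZE;
}

/*
 * Hypothetical helper: returns 1 if one more name of name_len bytes can be
 * appended to an inode ref item that already holds old_size bytes, assuming
 * the item is the only one in its leaf (nritems = 1 in frame #2).
 */
int inode_ref_fits(uint32_t nodesize, uint32_t old_size, uint32_t name_len)
{
	uint32_t new_size = old_size + INODE_REF_SIZE + name_len;

	return ITEM_HEADER_SIZE + new_size <= leaf_data_size(nodesize);
}
```

With the values from the trace (old_size = 3945, name_len = 17, so ins_len = 27), `inode_ref_fits(4096, 3945, 17)` is 0 under these assumptions: the item would need 3997 bytes where 3995 are available, which matches the abort in btrfs_extend_item.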
Re: [PATCH] Btrfs: do not release delalloc space until after we end the transaction
On Thu, Apr 14, 2011 at 2:54 AM, Josef Bacik jo...@redhat.com wrote: There have been many sporadic reports of the following panic [ cut here ] kernel BUG at fs/btrfs/extent-tree.c:5498! invalid opcode: [#1] PREEMPT SMP last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map CPU 7 Modules linked in: btrfs zlib_deflate libcrc32c netconsole configfs ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath kvm uinput snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd hp_wmi i5400_edac sparse_keymap iTCO_wdt rfkill edac_core tg3 shpchp iTCO_vendor_support soundcore wmi floppy snd_page_alloc pcspkr i5k_amb [last unloaded: btrfs] Pid: 28504, comm: btrfs-endio-wri Tainted: G W 2.6.39-rc2+ #35 Hewlett-Packard HP xw6600 Workstation/0A9Ch RIP: 0010:[a044ec34] [a044ec34] alloc_reserved_file_extent+0x9a/0x1e5 [btrfs] RSP: 0018:88000b4319f0 EFLAGS: 00010286 RAX: ffe4 RBX: 880009fdc438 RCX: 880020c216d0 RDX: 88000b4318c0 RSI: 00d5 RDI: RBP: 88000b431a70 R08: ffe4 R09: 880020c216d0 R10: 0001 R11: 88000b431b10 R12: 88000b431b10 R13: 00b2 R14: R15: 88002225f2f8 FS: () GS:88003e40() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 003738ca6940 CR3: 2a39a000 CR4: 06e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process btrfs-endio-wri (pid: 28504, threadinfo 88000b43, task 880032278000) Stack: 0001 88002a92 881d 038d 0005 88003aa38000 81481012 88000c3bb480 8800241d01c8 88000b431a60 880031a040a8 Call Trace: [81481012] ? sub_preempt_count+0x97/0xaa [a044f92e] run_clustered_refs+0x61b/0x700 [btrfs] [81480f89] ? sub_preempt_count+0xe/0xaa [a0446ee9] ? spin_lock+0xe/0x10 [btrfs] [a044fae4] btrfs_run_delayed_refs+0xd1/0x1ab [btrfs] [8147dc1c] ? 
_raw_spin_unlock+0x4a/0x57 [a045af1b] __btrfs_end_transaction+0x89/0x1ed [btrfs] [a045b0c2] btrfs_end_transaction+0x15/0x17 [btrfs] [a0466932] btrfs_finish_ordered_io+0x29c/0x2bf [btrfs] [a04669d6] btrfs_writepage_end_io_hook+0x81/0x8d [btrfs] [a0477fd5] end_bio_extent_writepage+0xae/0x159 [btrfs] [811457e3] bio_endio+0x2d/0x2f [a0456c44] end_workqueue_fn+0x111/0x120 [btrfs] [a0480a0e] worker_loop+0x192/0x4d1 [btrfs] [a048087c] ? btrfs_queue_worker+0x22c/0x22c [btrfs] [81068a69] kthread+0xa0/0xa8 [8107a847] ? trace_hardirqs_on_caller+0x111/0x135 [81485364] kernel_thread_helper+0x4/0x10 [8147e398] ? retint_restore_args+0x13/0x13 [810689c9] ? __init_kthread_worker+0x5b/0x5b [81485360] ? gs_change+0x13/0x13 Code: 44 8b 45 90 0f 84 58 01 00 00 80 88 88 00 00 00 08 41 83 c0 18 4c 89 e1 48 8b 72 20 4c 89 ff 48 89 c2 e8 1f b4 ff ff 85 c0 74 04 0f 0b eb fe 48 8b 03 48 89 45 c8 8b 73 40 48 89 c7 e8 bc 98 ff RIP [a044ec34] alloc_reserved_file_extent+0x9a/0x1e5 [btrfs] RSP 88000b4319f0 ---[ end trace 81d1c68cb00af83e ]--- This is because we have been releasing the delalloc bytes before ending the transaction. However the way we make allocations, any updates to the extent_tree are delayed and then run when the transaction runs, so we still have plenty of space that we need to use. So instead release the delalloc bytes _after_ we end the transaction so that we don't get this false ENOSPC. Thanks, This is wrong, because btrfs_run_delayed_refs uses global block reservation. 
Signed-off-by: Josef Bacik <jo...@redhat.com>
---
 fs/btrfs/inode.c | 8 ++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ade00e7..b1e5b11 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1783,9 +1783,13 @@ out:
 		if (trans)
 			btrfs_end_transaction_nolock(trans, root);
 	} else {
-		btrfs_delalloc_release_metadata(inode, ordered_extent->len);
 		if (trans)
 			btrfs_end_transaction(trans, root);
+		/*
+		 * Release after the transaction ends so it covers the delayed
+		 * ref updates
+		 */
+		btrfs_delalloc_release_metadata(inode, ordered_extent->len);
 	}
 
 	/* once for us */
@@ -5897,8 +5901,8 @@ out_unlock:
Re: [PATCH] Btrfs: fix infinite loop in btrfs_shrink_device()
On Sat, Feb 26, 2011 at 7:43 AM, Ilya Dryomov <idryo...@gmail.com> wrote:
> In case of an ENOSPC error from btrfs_relocate_chunk() (line 2202) while
> relocating a block group with offset 0 we end up endlessly looping. This
> happens because key.offset -= 1 statement then unconditionally brings us
> back to the beginning of the loop (key.offset == (u64)-1).
>
> Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
> ---
>  fs/btrfs/volumes.c | 3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index dd13eb8..0cb94ce 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2212,7 +2212,8 @@ again:
>  			goto done;
>  		if (ret == -ENOSPC)
>  			failed++;
> -		key.offset -= 1;
> +		if (--key.offset == -1)
> +			break;
>  	}

It should be

	if (--key.offset == (u64)-1)

>
>  	if (failed && !retried) {
> --
> 1.7.2.3
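For readers following the wrap-around argument, here is a small standalone sketch in plain C. It is not kernel code; `walk_down` and `wrap_checks_agree` are names invented for this note. It shows that decrementing an unsigned 64-bit zero yields the maximum value, and that under C's usual arithmetic conversions a comparison against the int constant -1 gives the same result as a comparison against (u64)-1, so the explicit cast mainly documents the intent:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Steps an offset downward the way the patched loop steps key.offset and
 * returns the number of iterations that ran before the wrap from 0 to
 * UINT64_MAX was detected. max_iter is only a safety cap for the sketch.
 */
uint64_t walk_down(uint64_t start, uint64_t max_iter)
{
	uint64_t offset = start;
	uint64_t iterations = 0;

	while (iterations < max_iter) {
		iterations++;
		/* Decrementing 0 wraps to UINT64_MAX, i.e. (uint64_t)-1. */
		if (--offset == (uint64_t)-1)
			break;
	}
	return iterations;
}

/*
 * The int constant -1 is converted to uint64_t before the comparison,
 * so both spellings of the wrap test agree for every value of v.
 */
int wrap_checks_agree(uint64_t v)
{
	return (v == -1) == (v == (uint64_t)-1);
}
```

Starting from offset 0 the loop stops after a single iteration; starting from 3 it visits 2, 1 and 0 and stops on the fourth. Some compilers warn about the signed/unsigned comparison in the uncast form, which is one practical reason to prefer the cast.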
[PATCH] Btrfs: Fix balance panic
Mark the cloned backref_node as checked in clone_backref_node()

Signed-off-by: Yan, Zheng <zheng.z@intel.com>
---
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 045c9c2..bef9c22 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1157,6 +1157,7 @@ static int clone_backref_node(struct btrfs_trans_handle *trans,
 	new_node->bytenr = dest->node->start;
 	new_node->level = node->level;
 	new_node->lowest = node->lowest;
+	new_node->checked = 1;
 	new_node->root = dest;
 
 	if (!node->lowest) {
Re: Kernel error during btrfs balance
please try patch attached below, Thanks. --- diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index b37d723..49d6b13 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -1158,6 +1158,7 @@ static int clone_backref_node(struct btrfs_trans_handle *trans, new_node-bytenr = dest-node-start; new_node-level = node-level; new_node-lowest = node-lowest; + new_node-checked = 1; new_node-root = dest; if (!node-lowest) { --- On Fri, Jan 21, 2011 at 4:50 PM, Erik Logtenberg e...@logtenberg.eu wrote: Hi, I hit the same bug again I think: [291835.724344] [ cut here ] [291835.724376] kernel BUG at fs/btrfs/relocation.c:836! [291835.724401] invalid opcode: [#1] SMP [291835.724424] last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map [291835.724461] CPU 0 [291835.724472] Modules linked in: uvcvideo snd_usb_audio snd_usbmidi_lib videodev v4l1_compat snd_rawmidi v4l2_compat_ioctl32 btrfs zlib_deflate libcrc32c sha256_generic cryptd aes_x86_64 aes_generic cbc dm_crypt tun ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat bridge stp llc nfsd lockd nfs_acl auth_rpcgss exportfs nls_utf8 cifs fscache sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 kvm_intel kvm dummy uinput snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device e1000e snd_pcm snd_timer i2c_i801 snd shpchp iTCO_wdt iTCO_vendor_support soundcore dell_wmi sparse_keymap snd_page_alloc serio_raw joydev wmi dcdbas microcode usb_storage uas raid1 pata_acpi ata_generic radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan] [291835.725002] [291835.725013] Pid: 27386, comm: btrfs Tainted: G I 2.6.37-2.fc15.x86_64 #1 [291835.725062] RIP: 0010:[a0565237] [a0565237] build_backref_tree+0x473/0xd6d [btrfs] [291835.725126] RSP: 0018:8800373bf9c8 EFLAGS: 00010246 [291835.725152] RAX: 8801367d5100 RBX: 88020b110880 RCX: 0040 [291835.725186] RDX: 0030 RSI: 
006dd08d3000 RDI: 880100069820 [291835.725219] RBP: 8800373bfaf8 R08: 8050 R09: 8800373bf980 [291835.725253] R10: 8800373bf918 R11: 88020b110880 R12: 8801367d5100 [291835.725254] R13: 88012c0a24c0 R14: 88021e2013f0 R15: 88021e201cf0 [291835.725254] FS: 7fcb1a6cc760() GS:8800bfa0() knlGS: [291835.725254] CS: 0010 DS: ES: CR0: 8005003b [291835.725254] CR2: 02feeeb8 CR3: 0001c2943000 CR4: 000426e0 [291835.725254] DR0: DR1: DR2: [291835.725254] DR3: DR6: 0ff0 DR7: 0400 [291835.725254] Process btrfs (pid: 27386, threadinfo 8800373be000, task 88022452ae40) [291835.725254] Stack: [291835.725254] ea0004b5a470 ea00 8800373bf9f8 8800373bfaa8 [291835.725254] 88005faafbb0 880100069808 880100069d78 [291835.725254] 88012c0a2aa0 880100069820 88020b1108c0 880100069d80 [291835.725254] Call Trace: [291835.725254] [a0565c91] relocate_tree_blocks+0x160/0x478 [btrfs] [291835.725254] [a056463d] ? add_tree_block+0x11e/0x13e [btrfs] [291835.725254] [a0566b45] relocate_block_group+0x1e3/0x490 [btrfs] [291835.725254] [8103edb9] ? should_resched+0xe/0x2e [291835.725254] [a0566f39] btrfs_relocate_block_group+0x147/0x28a [btrfs] [291835.725254] [a054e52a] btrfs_relocate_chunk.clone.40+0x61/0x4ab [btrfs] [291835.725254] [a05152d4] ? btrfs_item_key+0x1e/0x20 [btrfs] [291835.725254] [a05152f0] ? btrfs_item_key_to_cpu+0x1a/0x36 [btrfs] [291835.725254] [a054c2a8] ? read_extent_buffer+0xc3/0xe3 [btrfs] [291835.725254] [a05154e6] ? btrfs_header_nritems.clone.12+0x17/0x1c [btrfs] [291835.725254] [a054cff6] ? btrfs_item_key_to_cpu+0x2a/0x46 [btrfs] [291835.725254] [a055045e] btrfs_balance+0x1a3/0x1f0 [btrfs] [291835.725254] [8112bce5] ? do_filp_open+0x226/0x5c8 [291835.725254] [a0556773] btrfs_ioctl+0x641/0x846 [btrfs] [291835.725254] [811f3ed1] ? 
file_has_perm+0xa5/0xc7 [291835.725254] [8112e091] do_vfs_ioctl+0x4b1/0x4f2 [291835.725254] [8112e128] sys_ioctl+0x56/0x7a [291835.725254] [8100acc2] system_call_fastpath+0x16/0x1b [291835.725254] Code: 48 8b 45 89 49 8d 7d 10 48 8d 75 b0 49 89 44 24 18 8a 43 70 ff c0 41 88 44 24 70 e8 f7 c3 ff ff eb 17 f6 40 71 10 49 89 c4 75 02 0f 0b 49 8d 45 10 49 89 45 10 49 89 45 18 48 8b b5 20 ff ff ff [291835.725254] RIP [a0565237] build_backref_tree+0x473/0xd6d [btrfs] [291835.725254] RSP 8800373bf9c8 [291835.738971] ---[ end trace a7919e7f17c0a727
Re: v0.19-35-g1b444cd btrfsck says snapshots have errors
On Fri, Jan 21, 2011 at 6:52 AM, Ian! D. Allen <idal...@idallen.ca> wrote:
> Still getting btrfsck errors with this:
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs-unstable.git
>
> # ./btrfstest.sh
> Using /mnt/sdb1 /dev/sdb1 on /dev/sdb
> + mkfs.btrfs -L BTRFStest /dev/sdb1
>
> WARNING! - Btrfs v0.19-35-g1b444cd IS EXPERIMENTAL
> WARNING! - see http://btrfs.wiki.kernel.org before using
>
> fs created label BTRFStest on /dev/sdb1
>         nodesize 4096 leafsize 4096 sectorsize 4096 size 2.00GB
> Btrfs v0.19-35-g1b444cd
> + mount -o noatime /dev/sdb1 /mnt/sdb1
> + btrfs subvolume snapshot /mnt/sdb1 /mnt/sdb1/snap1
> Create a snapshot of '/mnt/sdb1' in '/mnt/sdb1/snap1'
> + btrfs subvolume snapshot /mnt/sdb1/snap1 /mnt/sdb1/snap2
> Create a snapshot of '/mnt/sdb1/snap1' in '/mnt/sdb1/snap2'
> + btrfs subvolume snapshot /mnt/sdb1/snap2 /mnt/sdb1/snap3
> Create a snapshot of '/mnt/sdb1/snap2' in '/mnt/sdb1/snap3'
> + btrfs subvolume snapshot /mnt/sdb1/snap3 /mnt/sdb1/snap4
> Create a snapshot of '/mnt/sdb1/snap3' in '/mnt/sdb1/snap4'
> + btrfs subvolume snapshot /mnt/sdb1/snap4 /mnt/sdb1/snap5
> Create a snapshot of '/mnt/sdb1/snap4' in '/mnt/sdb1/snap5'
> + umount /dev/sdb1
> + btrfsck /dev/sdb1
> fs tree 256 refs 6
>         unresolved ref root 256 dir 256 index 2 namelen 5 name snap1 error 600
>         unresolved ref root 257 dir 256 index 2 namelen 5 name snap1 error 600
>         unresolved ref root 258 dir 256 index 2 namelen 5 name snap1 error 600
>         unresolved ref root 259 dir 256 index 2 namelen 5 name snap1 error 600
>         unresolved ref root 260 dir 256 index 2 namelen 5 name snap1 error 600
> found 49152 bytes used err is 1
> total csum bytes: 0
> total tree bytes: 49152
> total fs tree bytes: 28672
> btree space waste bytes: 39360
> file data blocks allocated: 0
>  referenced 0
> Btrfs v0.19-35-g1b444cd

These are caused by a design flaw; you can safely ignore them.
Re: Kernel error during btrfs balance
On Tue, Jan 18, 2011 at 9:22 PM, Erik Logtenberg <e...@logtenberg.eu> wrote:
> On 01/18/2011 01:54 AM, Yan, Zheng wrote:
>> On Mon, Jan 17, 2011 at 10:14 PM, Erik Logtenberg <e...@logtenberg.eu> wrote:
>>> Hi,
>>>
>>> btrfs balance results in:
>>>
>>> http://pastebin.com/v5j0809M
>>>
>>> My system: fully up-to-date Fedora 14 with rawhide kernel to make btrfs
>>> balance do useful stuff to my free space:
>>>
>>> kernel-2.6.37-2.fc15.x86_64
>>> btrfs-progs-0.19-12.fc14.x86_64
>>>
>>> Filesystem had 0 bytes free, should be 45G, so on darkling's advice I
>>> ran btrfs balance on the fs, while doing heavy I/O (re-running 5 backup
>>> jobs that had failed due to ENOSPC).
>>> Up until the crash, btrfs balance did retrieve a couple of Gigs free
>>> space though, so that part of the plan worked just fine.
>>
>> Please try 2.6.36 kernel.
>
> Thanks for your (short) advice. Could you please elaborate. I was in fact
> using a 2.6.35.10-74.fc14.x86_64 kernel before, but darkling advised me to
> switch to a newer kernel to reclaim free space by balancing -- the idea
> was that newer kernels have better balancing implementation, more
> effective at reclaiming free space.
>
> Now your advice is to take a small step back again, from 2.6.37 to 2.6.36
> (which is still higher than the 2.6.35 I was using before). Is that
> because you think that 2.6.37 may have introduced the bug that I ran into?
> Do you think that 2.6.36 is still recent enough to have the effective
> balancing so that I will in fact be able to reclaim some free space?
> Or is it just a shot in the dark with no reasoning whatsoever ;)
>
> Please don't feel offended, but from your 4-word sentence I really can't
> tell.

I am just trying to narrow down the bug, because I have never seen a bug
like this before.
Re: Offline Deduplication for Btrfs
On Thu, Jan 6, 2011 at 12:36 AM, Josef Bacik <jo...@redhat.com> wrote:
> Here are patches to do offline deduplication for Btrfs. It works well for
> the cases it's expected to, I'm looking for feedback on the ioctl
> interface and such, I'm well aware there are missing features for the
> userspace app (like being able to set a different blocksize). If this
> interface is acceptable I will flesh out the userspace app a little more,
> but I believe the kernel side is ready to go.
>
> Basically I think online dedup is a huge waste of time and completely
> useless. You are going to want to do different things with different
> data. For example, for a mailserver you are going to want to have very
> small blocksizes, but for say a virtualization image store you are going
> to want much larger blocksizes. And let's not get into heterogeneous
> environments, those just get much too complicated. So my solution is
> batched dedup, where a user just runs this command and it dedups
> everything at this point. This avoids the very costly overhead of having
> to hash and lookup for duplicate extents online and lets us be _much_
> more flexible about what we want to deduplicate and how we want to do it.
>
> For the userspace app it only does 64k blocks, or whatever the largest
> area it can read out of a file. I'm going to extend this to do the
> following things in the near future
>
> 1) Take the blocksize as an argument so we can have bigger/smaller blocks
> 2) Have an option to _only_ honor the blocksize, don't try and dedup
>    smaller blocks
> 3) Use fiemap to try and dedup extents as a whole and just ignore
>    specific blocksizes
> 4) Use fiemap to determine what would be the most optimal blocksize for
>    the data you want to dedup.
>
> I've tested this out on my setup and it seems to work well. I appreciate
> any feedback you may have. Thanks,

FYI: Using the clone ioctl can do the same thing (except for reading the
data and computing the hash in user space).

Yan, Zheng
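Both messages leave the "read the data and compute a hash in user space" step implicit. The sketch below is one way that step could look; it is not Josef's userspace app and does not call any btrfs ioctl. The FNV-1a hash, the in-memory buffer and the function names are all assumptions made for this note. It splits a buffer into fixed-size blocks, hashes each one, and confirms candidate matches byte for byte before counting them as duplicates:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-bit FNV-1a; picked for brevity, not for collision resistance. */
uint64_t fnv1a64(const unsigned char *p, size_t len)
{
	uint64_t h = 14695981039346656037ULL; /* FNV offset basis */

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 1099511628211ULL; /* FNV prime */
	}
	return h;
}

/*
 * Counts blocks in buf whose content repeats an earlier block. A trailing
 * partial block is ignored. The scan is quadratic to keep the sketch short;
 * a real tool would index the hashes instead.
 */
size_t count_duplicate_blocks(const unsigned char *buf, size_t len,
			      size_t blocksize)
{
	size_t nblocks = blocksize ? len / blocksize : 0;
	size_t dups = 0;

	for (size_t i = 1; i < nblocks; i++) {
		const unsigned char *cur = buf + i * blocksize;
		uint64_t h = fnv1a64(cur, blocksize);

		for (size_t j = 0; j < i; j++) {
			const unsigned char *prev = buf + j * blocksize;

			/* Hash match is only a hint; memcmp confirms it. */
			if (fnv1a64(prev, blocksize) == h &&
			    memcmp(prev, cur, blocksize) == 0) {
				dups++;
				break;
			}
		}
	}
	return dups;
}
```

The block size is a parameter on purpose: the same 20-byte buffer "AAAABBBBAAAACCCCBBBB" has two duplicate blocks at a block size of 4 and none at a block size of 8, which is the sensitivity Josef describes when he argues that different data wants different block sizes.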
[PATCH] x86: hpet: Fix HPET timer + NMI watchdog panic
HPET no longer uses timer_interrupt() as its interrupt handler, so HPET
interrupts are not counted in per_cpu(irq_stat, cpu).irq0_irqs. This
confuses the NMI watchdog when HPET is used as the tick device.

Signed-off-by: Yan, Zheng <zheng.z@intel.com>
---
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 55e4de6..cca94f2 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -12,6 +12,9 @@ typedef struct {
 	unsigned int apic_timer_irqs;	/* arch dependent */
 	unsigned int irq_spurious_count;
 #endif
+#ifdef CONFIG_HPET_TIMER
+	unsigned int hpet_timer_irqs;
+#endif
 	unsigned int x86_platform_ipis;	/* arch dependent */
 	unsigned int apic_perf_irqs;
 	unsigned int apic_irq_work_irqs;
diff --git a/arch/x86/kernel/apic/nmi.c b/arch/x86/kernel/apic/nmi.c
index c90041c..cdb38d9 100644
--- a/arch/x86/kernel/apic/nmi.c
+++ b/arch/x86/kernel/apic/nmi.c
@@ -80,8 +80,15 @@ static inline int mce_in_progress(void)
  */
 static inline unsigned int get_timer_irqs(int cpu)
 {
-	return per_cpu(irq_stat, cpu).apic_timer_irqs +
+	unsigned int irqs;
+#ifdef CONFIG_HPET_TIMER
+	irqs = per_cpu(irq_stat, cpu).hpet_timer_irqs;
+#else
+	irqs = 0;
+#endif
+	irqs += per_cpu(irq_stat, cpu).apic_timer_irqs +
 		per_cpu(irq_stat, cpu).irq0_irqs;
+	return irqs;
 }
 
 #ifdef CONFIG_SMP
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 4ff5968..b536474c 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -517,6 +517,7 @@ static irqreturn_t hpet_interrupt_handler(int irq, void *data)
 	struct hpet_dev *dev = (struct hpet_dev *)data;
 	struct clock_event_device *hevt = &dev->evt;
 
+	inc_irq_stat(hpet_timer_irqs);
 	if (!hevt->event_handler) {
 		printk(KERN_INFO "Spurious HPET timer interrupt on HPET timer %d\n",
 				dev->num);
diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c
index fb5cc5e1..2098c56 100644
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -56,7 +56,7 @@ unsigned long profile_pc(struct pt_regs *regs)
 EXPORT_SYMBOL(profile_pc);
 
 /*
- * Default timer interrupt handler for PIT/HPET
+ * Default timer interrupt handler for PIT
  */
 static irqreturn_t timer_interrupt(int irq, void *dev_id)
 {
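The patch description says the missing count "confuses" the NMI watchdog without spelling out how. The following is a simplified userspace model, not the kernel's watchdog code: it assumes the watchdog compares a per-CPU sum of timer interrupt counters between successive NMI ticks and reports a lockup when the sum stops moving. The struct layouts, the threshold parameter and the function names are illustrative assumptions.

```c
#include <assert.h>

/* Per-CPU timer interrupt counters, modeled on the fields in the patch. */
struct cpu_irq_stat {
	unsigned int apic_timer_irqs;
	unsigned int irq0_irqs;
	unsigned int hpet_timer_irqs;
};

/* State the modeled watchdog keeps between ticks. */
struct watchdog_state {
	unsigned int last_sum;
	unsigned int stalls;
};

/* Sum of timer interrupts; count_hpet selects pre- or post-patch behavior. */
unsigned int timer_irq_sum(const struct cpu_irq_stat *s, int count_hpet)
{
	unsigned int sum = s->apic_timer_irqs + s->irq0_irqs;

	if (count_hpet)
		sum += s->hpet_timer_irqs;
	return sum;
}

/*
 * One watchdog tick. Returns 1 once the sum has stayed unchanged for
 * `threshold` consecutive ticks, which the model treats as a lockup.
 */
int watchdog_tick(struct watchdog_state *w, unsigned int sum,
		  unsigned int threshold)
{
	if (sum == w->last_sum) {
		w->stalls++;
		return w->stalls >= threshold;
	}
	w->last_sum = sum;
	w->stalls = 0;
	return 0;
}
```

In this model, a CPU whose only tick source is HPET keeps incrementing hpet_timer_irqs. If that counter is left out of the sum, the sum never changes and the watchdog eventually reports a lockup on a healthy CPU; with the counter included, it never does.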
Re: [PATCH] x86: hpet: Fix HPET timer + NMI watchdog panic
Sent this mail to wrong list, sorry for interrupting.

Yan, Zheng

On Wed, Dec 29, 2010 at 9:57 AM, Yan, Zheng <zheng.z@linux.intel.com> wrote:
> HPET doesn't use timer_interrupt() as interrupt handler now. So count of
> HPET interrupt isn't recorded in per_cpu(irq_stat, cpu).irq0_irqs. This
> confuses NMI watchdog when using HPET as tick device.
>
> [...]
Re: [BUG] can not allocate space for caching data
On Mon, Dec 20, 2010 at 11:41 PM, Chris Mason chris.ma...@oracle.com wrote: Excerpts from Miao Xie's message of 2010-12-20 08:13:14 -0500: On Mon, 20 Dec 2010 07:44:06 -0500, Chris Mason wrote: Excerpts from Miao Xie's message of 2010-12-20 07:25:10 -0500: Hi, Chris There is something wrong with this patch: commit 83a50de97fe96aca82389e061862ed760ece2283 Author: Chris Masonchris.ma...@oracle.com Date: Mon Dec 13 15:06:46 2010 -0500 Btrfs: prevent RAID level downgrades when space is low The extent allocator has code that allows us to fill allocations from any available block group, even if it doesn't match the raid level we've requested. Btrfs has added the space of single chunks and raid0 chunks into the space information, so when we use btrfs_check_data_free_space() to check if there is some space for storing file data, this function may return true. So we write the data into the cache successfully. But, the extent allocator can not allocate any space to store that cached data, and then the file system panic. I think we subtract that space from the space information, or split the space information into two types, one is used to manage the chunks with duplication, the other manages the other chunks. Ok, do you have a test case that triggers this? I'll work out a patch. Yan Zheng's original idea of 'the chunks should be readonly' should help us deduct them from the total. # mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10 # mount /dev/sda9 /mnt # dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=99 (fill the file system) # umount /mnt # mount /dev/sda9 /mnt # dd if=/dev/zero of=/mnt/tmpfile1 bs=4K count=1000 # sync Looks like we've got an off by one bug in set_block_group_ro, which is why our block group isn't getting set to ro. With this patch, we're properly setting the block group ro, and the enospc accounting is done correctly. It should also be able to replace my commit above. Please take a look, Zheng does this look correct to you? 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 227e581..6f7d758 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7970,13 +7970,14 @@ static int set_block_group_ro(struct btrfs_block_group_cache *cache)
 
 	if (sinfo->bytes_used + sinfo->bytes_reserved + sinfo->bytes_pinned +
 	    sinfo->bytes_may_use + sinfo->bytes_readonly +
-	    cache->reserved_pinned + num_bytes < sinfo->total_bytes) {
+	    cache->reserved_pinned + num_bytes <= sinfo->total_bytes) {
 		sinfo->bytes_readonly += num_bytes;
 		sinfo->bytes_reserved += cache->reserved_pinned;
 		cache->reserved_pinned = 0;
 		cache->ro = 1;
 		ret = 0;
 	}
+
 	spin_unlock(&cache->lock);
 	spin_unlock(&sinfo->lock);
 	return ret;

Looks good to me.

Yan, Zheng
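The off-by-one that Chris describes only shows up at the exact boundary. Below is a standalone sketch of the capacity comparison, reading the fix as a change from a strict "less than" to "less than or equal". The struct, the field subset and the `strict` switch are simplifications for illustration, not the kernel's data structures.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the space accounting fields used in the check. */
struct space_info {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_reserved;
	uint64_t bytes_pinned;
	uint64_t bytes_may_use;
	uint64_t bytes_readonly;
};

/*
 * Returns 1 if num_bytes more can be accounted as read-only without the
 * accounted total exceeding total_bytes. strict = 1 models the old '<'
 * comparison, strict = 0 models the '<=' comparison.
 */
int can_set_ro(const struct space_info *s, uint64_t reserved_pinned,
	       uint64_t num_bytes, int strict)
{
	uint64_t needed = s->bytes_used + s->bytes_reserved + s->bytes_pinned +
			  s->bytes_may_use + s->bytes_readonly +
			  reserved_pinned + num_bytes;

	return strict ? needed < s->total_bytes : needed <= s->total_bytes;
}
```

When the accounted bytes plus the block group's free bytes add up to exactly total_bytes, the strict form refuses to mark the group read-only even though everything fits; that is the case where, per the message above, the block group was not being set read-only.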
Re: [RFC PATCH 2/5 v3] Btrfs: avoid transaction stuff when btrfs is readonly
On Fri, Dec 3, 2010 at 4:16 PM, liubo <liubo2...@cn.fujitsu.com> wrote:
> When the filesystem is readonly, avoid transaction stuff by checking
> MS_RDONLY at start transaction time.
>
> Signed-off-by: Liu Bo <liubo2...@cn.fujitsu.com>
> ---
>  fs/btrfs/transaction.c | 3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 1fffbc0..14a597d 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -181,6 +181,9 @@ static struct btrfs_trans_handle *start_transaction(struct btrfs_root *root,
>  	struct btrfs_trans_handle *h;
>  	struct btrfs_transaction *cur_trans;
>  	int ret;
> +
> +	if (root->fs_info->sb->s_flags & MS_RDONLY)
> +		return ERR_PTR(-EROFS);
>  again:
>  	h = kmem_cache_alloc(btrfs_trans_handle_cachep, GFP_NOFS);
>  	if (!h)

There are cases where we need to start a transaction while the MS_RDONLY
flag is set, for example when remounting the filesystem read-only and during
log replay.
Re: [RFC 4/5] implement metadata_ra in btrfs
On Mon, Dec 13, 2010 at 3:22 PM, Shaohua Li shaohua...@intel.com wrote:
> Implement btrfs .metadata_readahead. In btrfs, all metadata pages are in
> a special btree_inode. We do readahead in it.
>
> Signed-off-by: Shaohua Li shaohua...@intel.com
> ---
>  fs/btrfs/disk-io.c | 10 ++
>  fs/btrfs/super.c   | 13 +
>  mm/readahead.c     |  1 +
>  3 files changed, 24 insertions(+)
>
> Index: linux/fs/btrfs/disk-io.c
> ===
> --- linux.orig/fs/btrfs/disk-io.c	2010-12-07 13:32:24.0 +0800
> +++ linux/fs/btrfs/disk-io.c	2010-12-07 13:33:08.0 +0800
> @@ -776,6 +776,15 @@ static int btree_readpage(struct file *f
>  	return extent_read_full_page(tree, page, btree_get_extent);
>  }
>
> +static int btree_readpages(struct file *file, struct address_space *mapping,
> +			struct list_head *pages, unsigned nr_pages)
> +{
> +	struct extent_io_tree *tree;
> +	tree = &BTRFS_I(mapping->host)->io_tree;
> +	return extent_readpages(tree, mapping, pages, nr_pages,
> +				btree_get_extent);
> +}
> +
>  static int btree_releasepage(struct page *page, gfp_t gfp_flags)
>  {
>  	struct extent_io_tree *tree;
> @@ -819,6 +828,7 @@ static void btree_invalidatepage(struct
>
>  static const struct address_space_operations btree_aops = {
>  	.readpage	= btree_readpage,
> +	.readpages	= btree_readpages,
>  	.writepage	= btree_writepage,
>  	.writepages	= btree_writepages,
>  	.releasepage	= btree_releasepage,
> Index: linux/fs/btrfs/super.c
> ===
> --- linux.orig/fs/btrfs/super.c	2010-12-07 13:32:24.0 +0800
> +++ linux/fs/btrfs/super.c	2010-12-07 13:33:08.0 +0800
> @@ -892,6 +892,18 @@ out:
>  	return -ENOENT;
>  }
>
> +static int btrfs_metadata_readahead(struct super_block *sb, loff_t offset,
> +	ssize_t size)
> +{
> +	struct btrfs_root *tree_root = btrfs_sb(sb);
> +	struct inode *btree_inode = tree_root->fs_info->btree_inode;
> +	struct address_space *mapping = btree_inode->i_mapping;
> +
> +	force_page_cache_readahead(mapping, NULL, offset >> PAGE_CACHE_SHIFT,
> +		size >> PAGE_CACHE_SHIFT);
> +	return 0;
> +}
> +
>  static const struct super_operations btrfs_super_ops = {
>  	.drop_inode	= btrfs_drop_inode,
>  	.evict_inode	= btrfs_evict_inode,
> @@ -907,6 +919,7 @@ static const struct super_operations btr
>  	.freeze_fs	= btrfs_freeze,
>  	.unfreeze_fs	= btrfs_unfreeze,
>  	.metadata_incore = btrfs_metadata_incore,
> +	.metadata_readahead = btrfs_metadata_readahead,
>  };
>
>  static const struct file_operations btrfs_ctl_fops = {
> Index: linux/mm/readahead.c
> ===
> --- linux.orig/mm/readahead.c	2010-12-07 13:32:24.0 +0800
> +++ linux/mm/readahead.c	2010-12-07 13:33:08.0 +0800
> @@ -228,6 +228,7 @@ int force_page_cache_readahead(struct ad
>  	}
>  	return ret;
>  }
> +EXPORT_SYMBOL_GPL(force_page_cache_readahead);
>
>  /*
>   * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a

btrfs will crash if the read-ahead range falls into an unallocated chunk.
We need code to check the validity of the user input.

Yan, Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
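A minimal userspace sketch of the kind of input check asked for in the reply above: before starting read-ahead, confirm that the requested byte range is fully backed by allocated chunks. The struct chunk_range type and the range_is_allocated() helper are hypothetical names for illustration; they are not btrfs APIs, and the real check would have to consult the chunk mapping inside the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one allocated chunk: [start, start + len). */
struct chunk_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Return true only if [offset, offset + size) is completely covered by
 * the sorted, non-overlapping ranges in chunks[]. An empty request or a
 * request that overflows u64 is rejected.
 */
static bool range_is_allocated(const struct chunk_range *chunks, size_t nr,
			       uint64_t offset, uint64_t size)
{
	uint64_t pos = offset;
	uint64_t end;
	size_t i;

	if (size == 0 || offset + size < offset)
		return false;
	end = offset + size;

	for (i = 0; i < nr && pos < end; i++) {
		uint64_t c_start = chunks[i].start;
		uint64_t c_end = c_start + chunks[i].len;

		if (c_end <= pos)
			continue;	/* chunk lies entirely before pos */
		if (c_start > pos)
			return false;	/* hole before the next chunk */
		pos = c_end;		/* covered up to the end of this chunk */
	}
	return pos >= end;
}
```

A caller in the spirit of btrfs_metadata_readahead() would return an error such as -EINVAL instead of starting read-ahead when this check fails; that error code is likewise an assumption.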
Re: [PATCH] Btrfs: do not loop through raid types when looking for free extent
On Tue, Dec 14, 2010 at 9:33 AM, Chris Mason chris.ma...@oracle.com wrote:
Excerpts from Yan, Zheng's message of 2010-11-16 20:38:23 -0500:
On Wed, Nov 17, 2010 at 5:22 AM, Josef Bacik jo...@redhat.com wrote:

There is a bug in find_free_extent where if we don't find a free extent in
the raid type we are looking for, we loop through to the next raid type.
This is not ok since we need to make sure we honor the raid types we are
given. So instead kill this check and get the proper index for the raid
type we want from the allocator. Thanks,

Loop through raid types is for handling failure in the middle of raid type
conversion.

The problem is that nothing prevents it from looping back to a raid0 chunk
when we really want raid1 or raid10. And mkfs leaves behind a small raid0
chunk (4MB) that it uses as it assembles all the devices.

check code at end of btrfs_read_block_groups, it prevents allocating from
raid0 when there are mirrored block groups.

I confirmed that we often use the small raid0 chunk even in raid1 or raid10.

Please take a look at this commit:
http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable.git;a=commit;h=83a50de97fe96aca82389e061862ed760ece2283

-chris
[PATCH] Btrfs: Fix page leak in compressed writeback path
start + num_bytes >= actual_end can happen when compressed page writeback
races with file truncation. In that case we need to unlock and release the
pages past the end of file.

Signed-off-by: Yan, Zheng zheng.z@intel.com

---
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8039390..2ea98d8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -495,7 +495,7 @@ again:
 		add_async_extent(async_cow, start, num_bytes,
 				 total_compressed, pages, nr_pages_ret);

-		if (start + num_bytes < end && start + num_bytes < actual_end) {
+		if (start + num_bytes < end) {
 			start += num_bytes;
 			pages = NULL;
 			cond_resched();
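The effect of the one-line change can be sketched in userspace by comparing the two loop conditions. The function names below are hypothetical; only the two boolean expressions come from the patch. With the old condition the loop stops as soon as the next position is at or past actual_end, so the rest of the range up to end is never revisited; with the new condition the loop keeps going until the whole range has been handled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Condition before the patch: also stops once past the current end of file. */
static bool continue_before_patch(uint64_t start, uint64_t num_bytes,
				  uint64_t end, uint64_t actual_end)
{
	return start + num_bytes < end && start + num_bytes < actual_end;
}

/* Condition after the patch: continue until the whole range is handled. */
static bool continue_after_patch(uint64_t start, uint64_t num_bytes,
				 uint64_t end)
{
	return start + num_bytes < end;
}
```

For example, with start = 0, num_bytes = 8192, end = 16384 and a file truncated so that actual_end = 4096, the old condition is false while the new one is true.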
Re: [RFC PATCH 2/4 v2] Btrfs: avoid transaction stuff when readonly
On Thu, Dec 2, 2010 at 11:42 AM, liubo liubo2...@cn.fujitsu.com wrote:
> On 12/01/2010 06:20 PM, liubo wrote:
>> When the filesystem is readonly, avoid transaction stuff by checking
>> MS_RDONLY at start transaction time.
>
> This patch may lead to a btrfs panic. Since btrfs allows transactions under
> readonly fs state, which is a bit weird, btrfs does not even check the
> returned transaction from start_transaction, although it may return -ENOMEM.

btrfs may do log replay even when it is mounted read-only.
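The point about unchecked return values can be illustrated with a userspace sketch of the kernel's error-pointer convention. ERR_PTR, PTR_ERR and IS_ERR are re-implemented here for illustration only; start_trans(), do_update(), struct trans_handle and the choice of -EROFS for a read-only filesystem are all assumptions, not what the RFC patch actually does.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Userspace re-implementations of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical transaction handle. */
struct trans_handle {
	int unused;
};

/* Hypothetical: refuse to start a transaction on a read-only filesystem. */
static struct trans_handle *start_trans(int readonly)
{
	struct trans_handle *h;

	if (readonly)
		return ERR_PTR(-EROFS);
	h = malloc(sizeof(*h));
	if (!h)
		return ERR_PTR(-ENOMEM);
	return h;
}

/* A caller that checks the handle before using it. */
static int do_update(int readonly)
{
	struct trans_handle *trans = start_trans(readonly);

	if (IS_ERR(trans))
		return (int)PTR_ERR(trans);
	/* ... the work done under the transaction would go here ... */
	free(trans);
	return 0;
}
```

A caller that skipped the IS_ERR() test would dereference a value such as (void *)-30, which is the kind of crash the follow-up warns about.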
Re: [PATCH] Btrfs: do not loop through raid types when looking for free extent
On Wed, Nov 17, 2010 at 5:22 AM, Josef Bacik jo...@redhat.com wrote:
> There is a bug in find_free_extent where if we don't find a free extent in
> the raid type we are looking for, we loop through to the next raid type.
> This is not ok since we need to make sure we honor the raid types we are
> given. So instead kill this check and get the proper index for the raid
> type we want from the allocator. Thanks,

Looping through the raid types is for handling failure in the middle of a
raid type conversion.

Yan, Zheng
Re: btrfs-convert fails
On Sun, Oct 31, 2010 at 11:55 PM, Helmut Hullen hul...@t-online.de wrote:
> Hello, linux-btrfs,
>
> I've tried to convert a 12 GByte ext2 partition (nearly full, 280 MByte
> free) with btrfs-convert.
>
> After about 15 minutes (700-MHz-CPU) the system tells
>
> ...
> creating ext2fs image file
> cleaning up system chunk
> btrfs-convert: extent-tree.c:2529: btrfs_reserve_extent: Assertion `!(ret)' failed
> Aborted

Try btrfs-convert -r /dev/xxx; hopefully it will recover your ext2.

Yan, Zheng
Re: 2.6.35.4 fumble-spolision
> item 17 key (1923814207488 EXTENT_ITEM 4096) itemoff 2978 itemsize 51
> 	extent refs 1 gen 127137 flags 2
> 	tree block key (9962449 1 0) level 0
> 	tree block backref root 877
> item 18 key (1923814211584 EXTENT_ITEM 4096) itemoff 2927 itemsize 51
> 	extent refs 1 gen 119662 flags 2
> 	tree block key (9722750 54 1542337137) level 0
> 	tree block backref root 848
> item 19 key (1923814215680 EXTENT_ITEM 4096) itemoff 2876 itemsize 51
> 	extent refs 1 gen 119662 flags 2
> 	tree block key (9722750 54 1633150225) level 0
> 	tree block backref root 848
> item 20 key (1923814219776 EXTENT_ITEM 4096) itemoff 2825 itemsize 51
> 	extent refs 1 gen 116478 flags 258
> 	tree block key (2769869 54 1517502465) level 0
> 	shared block backref parent 1923805945856
> item 21 key (1923814223872 EXTENT_ITEM 4096) itemoff 2693 itemsize 132
> 	extent refs 10 gen 118289 flags 2
> 	tree block key (10471699 1 0) level 1
> 	tree block backref root 879
> 	tree block backref root 873
> 	tree block backref root 867
> 	tree block backref root 864
> 	tree block backref root 861
> 	tree block backref root 855
> 	tree block backref root 852
> 	tree block backref root 849
> 	tree block backref root 846
> 	tree block backref root 843
> item 22 key (1923814240256 EXTENT_ITEM 4096) itemoff 2642 itemsize 51
> 	extent refs 1 gen 127137 flags 2
> 	tree block key (9962452 1 0) level 0
> 	tree block backref root 877
> item 23 key (1923814244352 EXTENT_ITEM 4096) itemoff 2591 itemsize 51
> 	extent refs 1 gen 123524 flags 2
> 	tree block key (4962881 54 2250929569) level 0
> 	tree block backref root 862
> item 24 key (1923814248448 EXTENT_ITEM 4096) itemoff 2540 itemsize 51
> 	extent refs 1 gen 123524 flags 2
> 	tree block key (4962881 54 2530040045) level 0
> 	tree block backref root 862
> item 25 key (1923814252544 EXTENT_ITEM 4096) itemoff 2489 itemsize 51
> 	extent refs 1 gen 123524 flags 2
> 	tree block key (4962881 54 693460895) level 0
> 	tree block backref root 862
> item 26 key (1923814256640 EXTENT_ITEM 4096) itemoff 2438 itemsize 51
> 	extent refs 1 gen 123524 flags 2
> 	tree block key (4962881 54 4250039336) level 0
> 	tree block backref root 862
> item 27 key (1923814264832 EXTENT_ITEM 4096) itemoff 2387 itemsize 51
> 	extent refs 1 gen 125542 flags 2
> 	tree block key (7531027 54 716511755) level 0
> 	tree block backref root 870
> item 28 key (1923814268928 EXTENT_ITEM 4096) itemoff 2336 itemsize 51
> 	extent refs 1 gen 123524 flags 2
> 	tree block key (4962881 54 583696754) level 0
> 	tree block backref root 862
> item 29 key (1923814273024 EXTENT_ITEM 4096) itemoff 2285 itemsize 51
> 	extent refs 1 gen 123524 flags 2
> 	tree block key (4962881 54 846280235) level 0
> 	tree block backref root 862
> item 30 key (1923814277120 EXTENT_ITEM 4096) itemoff 2234 itemsize 51
> 	extent refs 1 gen 123524 flags 2
> 	tree block key (4962881 54 108099388) level 0
> 	tree block backref root 862
> item 31 key (1923814281216 EXTENT_ITEM 4096) itemoff 2183 itemsize 51
> 	extent refs 1 gen 116704 flags 258
> 	tree block key (9759696 60 2) level 0
> 	tree block backref root 839
> item 32 key (1923814285312 EXTENT_ITEM 4096) itemoff 2105 itemsize 78
> 	extent refs 4 gen 123524 flags 2
> 	tree block key (4962881 60 1452615) level 0
> 	tree block backref root 892
> 	tree block backref root 868
> 	tree block backref root 865
> 	tree block backref root 862
> item 33 key (1923814293504 EXTENT_ITEM 4096) itemoff 2054 itemsize 51
> 	extent refs 1 gen 121426 flags 2
> 	tree block key (5643880 60 602) level 0
> 	tree block backref root 853
> failed to find block number 1923814297600
> Abort

How large is the FS? Is it possible to run btrfs-image and send the output
file to us?

Regards
Yan, Zheng
Re: machine gets unresponsive during btrfs balance
On Thu, Aug 12, 2010 at 3:14 PM, Andreas Philipp philipp.andr...@gmail.com wrote:
> Hi,
>
> I am using a btrfs filesystem created with raid0 for data and metadata for
> (temporary) storage of tv recordings from my vdr. The filesystem was
> created under kernel version 2.6.34. An initial btrfs balance command
> succeeded. Since I upgraded to 2.6.35-rcX and 2.6.35 btrfs balance no
> longer finishes but puts the machine in some unresponsive state.
> Unfortunately, I do not see any kernel oops or other debug information
> because even the display freezes. The last thing that happens is that
> those two lines are written to /var/log/messages:
>
> Aug 11 21:42:23 thor kernel: btrfs: found 62911 extents
> Aug 11 21:42:24 thor kernel: btrfs: relocating block group 1723913469952 flags 9
>
> After that the machine becomes immediately unresponsive. As I did not see
> anything that might be related to my problem in the changelog for 2.6.35.1
> I did not try again with this version.

Do you have more than one machine? Would you please set up netconsole to
see what happens?

Thanks
Yan, Zheng
Re: Code bug or data bug?
On Tue, Aug 10, 2010 at 6:20 AM, K. Richard Pixley r...@noir.com wrote:
> I've just gotten:
>
> r...@diamonds:~$ time sudo /sbin/btrfsck /dev/sda7
> btrfsck: btrfsck.c:585: splice_shared_node: Assertion `!(src == &src_node->root_cache)' failed.
> Aborted
>
> Does this indicate a coding error in btrfsck or a data error in my file
> system?
>
> --rich
>
> r...@diamonds:~$ dpkg -l | grep btrfs
> ii btrfs-tools 0.19+20100601-3 Checksumming Copy on Write Filesystem utilit
> r...@diamonds:~$ uname -a
> Linux diamonds 2.6.32-24-generic-pae #38-Ubuntu SMP Mon Jul 5 10:54:21 UTC 2010 i686 GNU/Linux
> r...@diamonds:~$ lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description: Ubuntu 10.04.1 LTS
> Release: 10.04
> Codename: lucid

This is a bug in btrfsck.
Re: [PATCH 7/7] btrfs: fix a wrong error check in add_ra_bio_pages()
2010/7/29 Miao Xie mi...@cn.fujitsu.com:
> From: Liu Bo liubo2...@cn.fujitsu.com
>
> Only when a page is not found by page_index, we'll go to the error check.
>
> Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
> Signed-off-by: Miao Xie mi...@cn.fujitsu.com
> ---
>  fs/btrfs/compression.c | 2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index cb3877c..8458840 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -467,7 +467,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>  		rcu_read_lock();
>  		page = radix_tree_lookup(&mapping->page_tree, page_index);
>  		rcu_read_unlock();
> -		if (page) {
> +		if (!page) {
>  check_misses:
>  			misses++;
>  			if (misses > 4)

This patch is wrong. The word "miss" here means a miss for read-ahead,
because the page is already in the page cache.

Yan, Zheng
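The meaning of "miss" in the reply above can be sketched in userspace: a page that is already cached is a missed read-ahead opportunity, and the scan gives up after too many of them. The array of booleans stands in for the page-cache lookup, and count_ra_candidates() and MAX_RA_MISSES are hypothetical names; the cutoff of 4 is taken from the quoted hunk, the rest of the loop is an assumption about its surroundings.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_RA_MISSES 4	/* mirrors the "misses > 4" cutoff in the quoted hunk */

/*
 * cached[i] is true when page i is already in the page cache.
 * Returns how many pages would be added for read-ahead before the
 * scan stops.
 */
static size_t count_ra_candidates(const bool *cached, size_t nr_pages)
{
	size_t added = 0;
	unsigned int misses = 0;
	size_t i;

	for (i = 0; i < nr_pages; i++) {
		if (cached[i]) {
			/* already cached: nothing to read ahead here */
			misses++;
			if (misses > MAX_RA_MISSES)
				break;
			continue;
		}
		added++;	/* not cached: a read-ahead candidate */
	}
	return added;
}
```

Inverting the test, as the quoted patch does, would count the uncached pages as misses and skip exactly the pages that read-ahead is meant to bring in.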
Re: Intermittent no space errors
On Tue, Jul 27, 2010 at 5:09 AM, Dave Cundiff syshack...@gmail.com wrote:
> Hello,
>
> On 2.6.35-rc5 I'm seeing some weird behavior under heavy IO loads.
>
> I have a backup process that fires up several rsync processes. These
> mirror several dozen servers to individual sub-volumes. Everyday I
> snapshot each sub-volume and rsync over it.
>
> The problem I'm seeing is my rsync processes are failing randomly with
> "No space left on device". This is a 6 Terabyte volume with plenty of
> free space.
>
> Mount options:
> /dev/sdb on /backups type btrfs (rw,max_inline=0,compress)
>
> [r...@rsync1 ~]# btrfs filesystem df /backups/
> Data: total=1.88TB, used=1.88TB
> Metadata: total=43.38GB, used=32.06GB
> System: total=12.00MB, used=260.00KB
>
> [r...@rsync1 ~]# df /dev/sdb
> Filesystem     1K-blocks       Used  Available Use% Mounted on
> /dev/sdb      5781249024 2087273084 3693975940  37% /backups
>
> They don't all fail at once. Normally I have 4-5 running at a time and
> 1 or 2 will drop out with a no space error. The rest continue on. I've
> noticed it will generally occur on ones that are in the middle of
> transferring a very large file. If I lighten the load to one rsync at a
> time it appears to happen less frequently.
>
> Any known issues I should be aware of?

Thank you for reporting this. I will dig in.

Yan, Zheng
Re: btrfs: unlinked X orphans messages
On Mon, Jul 19, 2010 at 5:01 PM, Xavier Nicollet nicol...@jeru.org wrote:
> Hi,
>
> I am using btrfs for remote backups (via rsync), with daily and weekly
> snapshots.
>
> I see these messages in kern.log:
> Jul 18 07:09:43 backup1 kernel: [3437126.458374] btrfs: unlinked 9 orphans
> Jul 18 12:01:01 backup1 kernel: [3454604.905856] btrfs: unlinked 1 orphans
> Jul 18 13:01:51 backup1 kernel: [3458254.990199] btrfs: unlinked 1 orphans
> Jul 19 04:01:41 backup1 kernel: [3512244.236347] btrfs: unlinked 1 orphans
>
> Is this something I have to be afraid of ?
>
> Linux debian lenny, pure btrfs partition with no raid, vanilla kernel: 2.6.34.

Nothing to worry about.

Yan, Zheng
Re: [BUG] btrfs hangup when we run the sync command
2010/7/15 Miao Xie mi...@cn.fujitsu.com:
> Hi, everyone
>
> I found btrfs will hang up when we run the sync command on my x86_64 box.
>
> The steps to reproduce are as follows:
> # mkfs.btrfs -s 8192 -l 8192 -n 8192 /dev/sda1
> # mount /dev/sda1 /mnt
> # echo 1234567 > /mnt/aaa
> # sync
> (btrfs hangs up)
>
> It seems that btrfs doesn't support a sectorsize which is greater than
> the page size, just like ext2/3/4, though we can use mkfs.btrfs to make a
> filesystem with a big sectorsize. Am I right? If yes, we must do more
> checking in mkfs.btrfs.

Yes, btrfs doesn't support sectorsize > PAGE_SIZE.
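The extra check suggested for mkfs.btrfs could look roughly like the sketch below, which rejects a sector size larger than the page size of the running system. Both helper names are hypothetical and this is not the mkfs.btrfs code; sysconf(_SC_PAGESIZE) is the standard POSIX way to query the page size from userspace.

```c
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* True when sectorsize does not exceed the given page size. */
static bool sectorsize_supported(uint32_t sectorsize, long page_size)
{
	return page_size > 0 && sectorsize <= (uint64_t)page_size;
}

/* Same check against the page size of the machine running mkfs. */
static bool sectorsize_supported_here(uint32_t sectorsize)
{
	return sectorsize_supported(sectorsize, sysconf(_SC_PAGESIZE));
}
```

On the x86_64 box from the report (4096-byte pages), the -s 8192 argument would be rejected by such a check. Note that this only compares against the machine that runs mkfs; a filesystem created on a machine with larger pages could still be moved to one with smaller pages.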
Re: btrfsck segmentation fault + trace
On Thu, Jun 17, 2010 at 4:25 AM, John Wyzer john.wy...@gmx.de wrote:
> On 16/06/10 19:29, Chris Mason wrote:
>> On Wed, Jun 16, 2010 at 06:35:25PM +0200, John Wyzer wrote:
>> Is 2.6.34 working normally?
>
> Yes. I can boot with 2.6.34.y and everything works fine.
> (Actually, before trying 2.6.45-rc3, I had an uptime of two weeks on this
> laptop. Now, I'm writing this email on 2.6.34.y.)
>
>> But every time you boot 2.6.35 you get errors? Would it be possible to
>> save the console output (netconsole works well)
>
> [...]
> device fsid 3247922091b53feb-dcb02f0506fbdc8b devid 1 transid 155748 /dev/mapper/root
> EXT3-fs: barriers not enabled
> kjournald starting. Commit interval 5 seconds
> EXT3-fs (sda4): warning: maximal mount count reached, running e2fsck is recommended
> EXT3-fs (sda4): using internal journal
> EXT3-fs (sda4): recovery complete
> EXT3-fs (sda4): mounted filesystem with writeback data mode
> btrfs: fail to dirty inode 9959493 error -28
> btrfs: fail to dirty inode 2987803 error -28
> btrfs: fail to dirty inode 2987803 error -28
> btrfs: fail to dirty inode 8873620 error -28
> btrfs: fail to dirty inode 8873620 error -28
> btrfs: fail to dirty inode 803894 error -28
> btrfs: fail to dirty inode 2988335 error -28
> btrfs: fail to dirty inode 2987971 error -28
> btrfs: fail to dirty inode 2987972 error -28
> btrfs: fail to dirty inode 2988336 error -28
> btrfs: fail to dirty inode 6631 error -28
> btrfs: fail to dirty inode 803896 error -28
> btrfs: fail to dirty inode 6632 error -28
> btrfs: fail to dirty inode 6633 error -28
> btrfs: fail to dirty inode 6634 error -28
> [...] (nothing new coming, only error -28)
>
> Apart from that, I had messages that there was no space left on /, but
> those were from userspace and not logged via netconsole.

Looks like the fs has run out of metadata space. btrfs in 2.6.35 reserves
more metadata space for system use than btrfs in 2.6.34. That's why these
error messages only appear on 2.6.35.

Yan, Zheng
Re: Still ENOSPC problems with 2.6.35-rc3
On Thu, Jun 17, 2010 at 1:48 AM, Johannes Hirte johannes.hi...@fem.tu-ilmenau.de wrote:
> With kernel-2.6.34 I run into the ENOSPC problems that were reported on
> this list recently. The filesystem was somewhat over 90% full and most
> operations on it caused an Oops. I was able to delete files by trial and
> error and freed up half of the filesystem space. Operation on the other
> files still caused an Oops.
> For 2.6.35 there went some patches in, that addressed this problem. Sadly
> they don't fix it but only avoid the Oops. A simple 'ls' on this
> filesystem results in

To avoid the ENOSPC oops, btrfs in 2.6.35 reserves more metadata space for
system use than older btrfs. If the FS has already run out of metadata
space, using btrfs in 2.6.35 doesn't help.

Yan, Zheng

> ------------[ cut here ]------------
> WARNING: at fs/btrfs/extent-tree.c:3441 btrfs_block_rsv_check+0x10c/0x13e()
> Hardware name: To Be Filled By O.E.M.
> Modules linked in: snd_seq_midi snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_emul snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss radeon ttm drm_kms_helper drm i2c_algo_bit snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_pcm snd_seq_device snd_timer snd_page_alloc snd_util_mem snd_hwdep snd amd64_edac_mod sata_sil sg sr_mod uhci_hcd ohci_hcd edac_core edac_mce_amd k8temp i2c_amd8111 i2c_amd756 hwmon
> Pid: 26973, comm: ls Not tainted 2.6.35-rc3 #1
> Call Trace:
>  [81031044] ? warn_slowpath_common+0x78/0x8c
>  [81147fdf] ? btrfs_block_rsv_check+0x10c/0x13e
>  [81155857] ? __btrfs_end_transaction+0x9f/0x1b1
>  [8115aaa2] ? btrfs_dirty_inode+0x58/0xf9
>  [810b07ba] ? __mark_inode_dirty+0x25/0x149
>  [810a809a] ? touch_atime+0xfc/0x125
>  [810a3a32] ? filldir+0x0/0xc3
>  [810a3c1c] ? vfs_readdir+0x76/0x9c
>  [810a3d7e] ? sys_getdents+0x7d/0xcd
>  [81364f1f] ? page_fault+0x1f/0x30
>  [81001e2b] ? system_call_fastpath+0x16/0x1b
> ---[ end trace 4aa882f64f792d16 ]---
> block_rsv size 654311424 reserved 67809280 freed 0 0
> ------------[ cut here ]------------
> WARNING: at fs/btrfs/extent-tree.c:3441 btrfs_block_rsv_check+0x10c/0x13e()
> Hardware name: To Be Filled By O.E.M.
> Modules linked in: snd_seq_midi snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_emul snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss radeon ttm drm_kms_helper drm i2c_algo_bit snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_pcm snd_seq_device snd_timer snd_page_alloc snd_util_mem snd_hwdep snd amd64_edac_mod sata_sil sg sr_mod uhci_hcd ohci_hcd edac_core edac_mce_amd k8temp i2c_amd8111 i2c_amd756 hwmon
> Pid: 26970, comm: btrfs-transacti Tainted: G W 2.6.35-rc3 #1
> Call Trace:
>  [81031044] ? warn_slowpath_common+0x78/0x8c
>  [81147fdf] ? btrfs_block_rsv_check+0x10c/0x13e
>  [81155857] ? __btrfs_end_transaction+0x9f/0x1b1
>  [81155a7a] ? btrfs_commit_transaction+0xf4/0x5fd
>  [8102c39f] ? enqueue_task+0x39/0x47
>  [81363dbb] ? mutex_lock+0xd/0x31
>  [81043979] ? autoremove_wake_function+0x0/0x2a
>  [81151b5b] ? transaction_kthread+0x16d/0x213
>  [811519ee] ? transaction_kthread+0x0/0x213
>  [810435ad] ? kthread+0x75/0x7d
>  [81002b54] ? kernel_thread_helper+0x4/0x10
>  [81043538] ? kthread+0x0/0x7d
>  [81002b50] ? kernel_thread_helper+0x0/0x10
> ---[ end trace 4aa882f64f792d17 ]---
> block_rsv size 654311424 reserved 67809280 freed 0 0
> ------------[ cut here ]------------
> WARNING: at fs/btrfs/extent-tree.c:3441 btrfs_block_rsv_check+0x10c/0x13e()
> Hardware name: To Be Filled By O.E.M.
> Modules linked in: snd_seq_midi snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_emul snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss radeon ttm drm_kms_helper drm i2c_algo_bit snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_pcm snd_seq_device snd_timer snd_page_alloc snd_util_mem snd_hwdep snd amd64_edac_mod sata_sil sg sr_mod uhci_hcd ohci_hcd edac_core edac_mce_amd k8temp i2c_amd8111 i2c_amd756 hwmon
> Pid: 26973, comm: ls Tainted: G W 2.6.35-rc3 #1
> Call Trace:
>  [81031044] ? warn_slowpath_common+0x78/0x8c
>  [81147fdf] ? btrfs_block_rsv_check+0x10c/0x13e
>  [81155857] ? __btrfs_end_transaction+0x9f/0x1b1
>  [811562fb] ? start_transaction+0x15f/0x1c4
>  [8115aaaf] ? btrfs_dirty_inode+0x65/0xf9
>  [810b07ba] ? __mark_inode_dirty+0x25/0x149
>  [810a809a] ? touch_atime+0xfc/0x125
>  [810a3a32] ? filldir+0x0/0xc3
>  [810a3c1c] ? vfs_readdir+0x76/0x9c
>  [810a3d7e] ? sys_getdents+0x7d/0xcd
>  [81364f1f] ? page_fault+0x1f/0x30
>  [81001e2b] ? system_call_fastpath+0x16/0x1b
> ---[ end trace 4aa882f64f792d18 ]---
> block_rsv size 654311424 reserved 67809280 freed 0 0
> btrfs: fail to dirty inode 256 error -28
> ------------[ cut here ]------------
> WARNING: at fs/btrfs
Re: btrfsck segmentation fault + trace
On Thu, Jun 17, 2010 at 7:26 AM, John Wyzer john.wy...@gmx.de wrote:
> On 17/06/10 00:45, Yan, Zheng wrote:
>> On Thu, Jun 17, 2010 at 4:25 AM, John Wyzer john.wy...@gmx.de wrote:
>>> btrfs: fail to dirty inode 6634 error -28
>>> [...] (nothing new coming, only error -28)
>>>
>>> Apart from that, I had messages that there was no space left on /, but
>>> those were from userspace and not logged via netconsole.
>>
>> looks like the fs runs out of metadata space. btrfs in 2.6.35 reserves
>> more metadata space for system use than btrfs in 2.6.34. That's why
>> these error message only appear on 2.6.35.
>
> Hmm. I've formatted with -m single because I wanted to avoid running out
> of space for metadata. So if there's more than 10% of space left, which is
> about 40GB, that seems like quite a bit of waste... ;-)
> I'll stay with 2.6.34 then for the time being.

Space for data and space for metadata are kept separate even if the FS was
formatted with -m single.

Yan, Zheng
Re: Regression in 2.6.35 RC1 (Ubuntu 2.6.35-1-generic)
> /0x210 [btrfs]
> Jun 10 21:23:17 bradf-x301 kernel: [ 209.858189] [a01694e0] btrfs_commit_transaction+0x80/0x710 [btrfs]
> Jun 10 21:23:17 bradf-x301 kernel: [ 209.858198] [81582a9e] ? mutex_lock+0x1e/0x50
> Jun 10 21:23:17 bradf-x301 kernel: [ 209.858227] [a0169f8b] ? start_transaction+0x1ab/0x230 [btrfs]
> Jun 10 21:23:17 bradf-x301 kernel: [ 209.858238] [8107d610] ? autoremove_wake_function+0x0/0x40
> Jun 10 21:23:17 bradf-x301 kernel: [ 209.858265] [a0163d53] transaction_kthread+0x283/0x290 [btrfs]
> Jun 10 21:23:17 bradf-x301 kernel: [ 209.858293] [a0163ad0] ? transaction_kthread+0x0/0x290 [btrfs]
> Jun 10 21:23:17 bradf-x301 kernel: [ 209.858302] [8107d0b6] kthread+0x96/0xa0
> Jun 10 21:23:17 bradf-x301 kernel: [ 209.858311] [8100aee4] kernel_thread_helper+0x4/0x10
> Jun 10 21:23:17 bradf-x301 kernel: [ 209.858320] [8107d020] ? kthread+0x0/0xa0
> Jun 10 21:23:17 bradf-x301 kernel: [ 209.858327] [8100aee0] ? kernel_thread_helper+0x0/0x10
> Jun 10 21:23:17 bradf-x301 kernel: [ 209.858441] RSP 88012821b820
> Jun 10 21:23:17 bradf-x301 kernel: [ 209.858448] ---[ end trace 32d3e1002acaefc5 ]---

I have already sent a patch for this.
http://www.spinics.net/lists/linux-btrfs/msg05150.html

Yan, Zheng
[PATCH] btrfs: Fix null dereference in relocation.c
Fix a potential null dereference in relocation.c

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 1/fs/btrfs/relocation.c 2/fs/btrfs/relocation.c
--- 1/fs/btrfs/relocation.c	2010-05-26 00:13:07.227605825 +0800
+++ 2/fs/btrfs/relocation.c	2010-05-31 16:35:23.489829633 +0800
@@ -784,16 +784,17 @@ again:
 			struct btrfs_extent_ref_v0 *ref0;
 			ref0 = btrfs_item_ptr(eb, path1->slots[0],
 					struct btrfs_extent_ref_v0);
-			root = find_tree_root(rc, eb, ref0);
-			if (!root->ref_cows)
-				cur->cowonly = 1;
 			if (key.objectid == key.offset) {
+				root = find_tree_root(rc, eb, ref0);
 				if (root && !should_ignore_root(root))
 					cur->root = root;
 				else
 					list_add(&cur->list, &useless);
 				break;
 			}
+			if (is_cowonly_root(btrfs_ref_root_v0(eb,
+							      ref0)))
+				cur->cowonly = 1;
 		}
 #else
 		BUG_ON(key.type == BTRFS_EXTENT_REF_V0_KEY);
[PATCH] btrfs: Fix BUG_ON for fs converted from extN
Tree blocks can live in data block groups in FS converted from extN. So
it's easy to trigger the BUG_ON.

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 1/fs/btrfs/extent-tree.c 3/fs/btrfs/extent-tree.c
--- 1/fs/btrfs/extent-tree.c	2010-05-26 23:55:46.610378078 +0800
+++ 3/fs/btrfs/extent-tree.c	2010-05-31 16:36:51.907580723 +0800
@@ -4360,7 +4360,8 @@ void btrfs_free_tree_block(struct btrfs_

 	block_rsv = get_block_rsv(trans, root);
 	cache = btrfs_lookup_block_group(root->fs_info, buf->start);
-	BUG_ON(block_rsv->space_info != cache->space_info);
+	if (block_rsv->space_info != cache->space_info)
+		goto out;

 	if (btrfs_header_generation(buf) == trans->transid) {
 		if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
Re: Disk space accounting and subvolume delete
On Tue, Jun 1, 2010 at 3:01 AM, Bruce Guenter br...@untroubled.org wrote:
> On Wed, May 12, 2010 at 01:02:07PM +0800, Yan, Zheng wrote:
>> Dropping a tree can be lengthy. It's not good to let sync wait for hours.
>> For most linux FS, 'sync' just force an transaction/journal commit. I
>> don't think they wait for large operations that can span multiple
>> transactions to complete.
>
> What happens to the consistency of the filesystem if a crash happens
> during this process?

This does not break the consistency of the filesystem. The next mount will
find the partially dropped tree and restart the dropping process.

Yan, Zheng
Re: [PATCH 2/4] btrfs-convert: Add extent iteration functions.
On Sat, Mar 20, 2010 at 12:26 PM, Sean Bartell wingedtachik...@gmail.com wrote:
> A filesystem can have disk extents in arbitrary places on the disk, as
> well as extents that must be read into memory because they have
> compression or encryption btrfs doesn't support. These extents can be
> passed to the new extent iteration functions, which will handle all the
> details of alignment, allocation, etc.
> ---
>  convert.c | 604 -
>  1 files changed, 401 insertions(+), 203 deletions(-)
>
> diff --git a/convert.c b/convert.c
> index c48f8ba..bd91990 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -357,7 +357,7 @@ error:
>  }
>
>  static int read_disk_extent(struct btrfs_root *root, u64 bytenr,
> -			    u32 num_bytes, char *buffer)
> +			    u64 num_bytes, char *buffer)
>  {
>  	int ret;
>  	struct btrfs_fs_devices *fs_devs = root->fs_info->fs_devices;
> @@ -371,6 +371,23 @@ fail:
>  		ret = -1;
>  	return ret;
>  }
> +
> +static int write_disk_extent(struct btrfs_root *root, u64 bytenr,
> +			     u64 num_bytes, const char *buffer)
> +{
> +	int ret;
> +	struct btrfs_fs_devices *fs_devs = root->fs_info->fs_devices;
> +
> +	ret = pwrite(fs_devs->latest_bdev, buffer, num_bytes, bytenr);
> +	if (ret != num_bytes)
> +		goto fail;
> +	ret = 0;
> +fail:
> +	if (ret > 0)
> +		ret = -1;
> +	return ret;
> +}
> +
>  /*
>   * Record a file extent. Do all the required works, such as inserting
>   * file extent item, inserting extent item and backref item into extent
> @@ -378,8 +395,7 @@ fail:
>   */
>  static int record_file_extent(struct btrfs_trans_handle *trans,
>  			      struct btrfs_root *root, u64 objectid,
> -			      struct btrfs_inode_item *inode,
> -			      u64 file_pos, u64 disk_bytenr,
> +			      u64 *inode_nbytes, u64 file_pos, u64 disk_bytenr,
>  			      u64 num_bytes, int checksum)
>  {
>  	int ret;
> @@ -391,7 +407,6 @@ static int record_file_extent(struct btrfs_trans_handle *trans,
>  	struct btrfs_path path;
>  	struct btrfs_extent_item *ei;
>  	u32 blocksize = root->sectorsize;
> -	u64 nbytes;
>
>  	if (disk_bytenr == 0) {
>  		ret = btrfs_insert_file_extent(trans, root, objectid,
> @@ -450,8 +465,7 @@ static int record_file_extent(struct btrfs_trans_handle *trans,
>  	btrfs_set_file_extent_other_encoding(leaf, fi, 0);
>  	btrfs_mark_buffer_dirty(leaf);
>
> -	nbytes = btrfs_stack_inode_nbytes(inode) + num_bytes;
> -	btrfs_set_stack_inode_nbytes(inode, nbytes);
> +	*inode_nbytes += num_bytes;
>
>  	btrfs_release_path(root, &path);
>
> @@ -492,95 +506,355 @@ fail:
>  	return ret;
>  }
>
> -static int record_file_blocks(struct btrfs_trans_handle *trans,
> -			      struct btrfs_root *root, u64 objectid,
> -			      struct btrfs_inode_item *inode,
> -			      u64 file_block, u64 disk_block,
> -			      u64 num_blocks, int checksum)
> -{
> -	u64 file_pos = file_block * root->sectorsize;
> -	u64 disk_bytenr = disk_block * root->sectorsize;
> -	u64 num_bytes = num_blocks * root->sectorsize;
> -	return record_file_extent(trans, root, objectid, inode, file_pos,
> -				  disk_bytenr, num_bytes, checksum);
> -}
> -
> -struct blk_iterate_data {
> +struct extent_iterate_data {
>  	struct btrfs_trans_handle *trans;
>  	struct btrfs_root *root;
> -	struct btrfs_inode_item *inode;
> +	u64 *inode_nbytes;
>  	u64 objectid;
> -	u64 first_block;
> -	u64 disk_block;
> -	u64 num_blocks;
> -	u64 boundary;
> -	int checksum;
> -	int errcode;
> +	int checksum, packing;
> +	u64 last_file_off;
> +	u64 total_size;
> +	enum {EXTENT_ITERATE_TYPE_NONE, EXTENT_ITERATE_TYPE_MEM,
> +	      EXTENT_ITERATE_TYPE_DISK} type;
> +	u64 size;
> +	u64 file_off; /* always aligned to sectorsize */
> +	char *data; /* for mem */
> +	u64 disk_off; /* for disk */
>  };
>
> -static int block_iterate_proc(ext2_filsys ext2_fs,
> -			       u64 disk_block, u64 file_block,
> -			       struct blk_iterate_data *idata)
> +static u64 extent_boundary(struct btrfs_root *root, u64 extent_start)
>  {
> -	int ret;
> -	int sb_region;
> -	int do_barrier;
> -	struct btrfs_root *root = idata->root;
> -	struct btrfs_trans_handle *trans = idata->trans;
> -	struct btrfs_block_group_cache *cache;
> -	u64 bytenr = disk_block * root->sectorsize;
> -
> -	sb_region = intersect_with_sb(bytenr, root->sectorsize);
> -	do_barrier = sb_region || disk_block >= idata->boundary;
> -	if ((idata->num_blocks > 0 && do_barrier) ||
> -	    (file_block > idata->first_block + idata->num_blocks) ||
Re: [PATCH 1/4] btrfs-convert: make more use of cache_free_extents
On Sat, Mar 20, 2010 at 12:24 PM, Sean Bartell wingedtachik...@gmail.com wrote: An extent_io_tree is used for all free space information. This allows removal of ext2_alloc_block and ext2_free_block, and makes create_ext2_image less ext2-specific. --- convert.c | 154 +++-- 1 files changed, 99 insertions(+), 55 deletions(-) diff --git a/convert.c b/convert.c index d037c98..c48f8ba 100644 --- a/convert.c +++ b/convert.c @@ -95,29 +95,10 @@ static int close_ext2fs(ext2_filsys fs) return 0; } -static int ext2_alloc_block(ext2_filsys fs, u64 goal, u64 *block_ret) +static int ext2_cache_free_extents(ext2_filsys ext2_fs, + struct extent_io_tree *free_tree) { - blk_t block; - - if (!ext2fs_new_block(fs, goal, NULL, block)) { - ext2fs_fast_mark_block_bitmap(fs-block_map, block); - *block_ret = block; - return 0; - } - return -ENOSPC; -} - -static int ext2_free_block(ext2_filsys fs, u64 block) -{ - BUG_ON(block != (blk_t)block); - ext2fs_fast_unmark_block_bitmap(fs-block_map, block); - return 0; -} - -static int cache_free_extents(struct btrfs_root *root, ext2_filsys ext2_fs) - -{ - int i, ret = 0; + int ret = 0; blk_t block; u64 bytenr; u64 blocksize = ext2_fs-blocksize; @@ -127,29 +108,68 @@ static int cache_free_extents(struct btrfs_root *root, ext2_filsys ext2_fs) if (ext2fs_fast_test_block_bitmap(ext2_fs-block_map, block)) continue; bytenr = block * blocksize; - ret = set_extent_dirty(root-fs_info-free_space_cache, - bytenr, bytenr + blocksize - 1, 0); + ret = set_extent_dirty(free_tree, bytenr, + bytenr + blocksize - 1, 0); BUG_ON(ret); } + return 0; +} + +/* mark btrfs-reserved blocks as used */ +static void adjust_free_extents(ext2_filsys ext2_fs, + struct extent_io_tree *free_tree) +{ + int i; + u64 bytenr; + u64 blocksize = ext2_fs-blocksize; + + clear_extent_dirty(free_tree, 0, BTRFS_SUPER_INFO_OFFSET - 1, 0); + for (i = 0; i BTRFS_SUPER_MIRROR_MAX; i++) { bytenr = btrfs_sb_offset(i); bytenr = ~((u64)STRIPE_LEN - 1); if (bytenr = blocksize * 
ext2_fs-super-s_blocks_count) break; - clear_extent_dirty(root-fs_info-free_space_cache, bytenr, - bytenr + STRIPE_LEN - 1, 0); + clear_extent_dirty(free_tree, bytenr, bytenr + STRIPE_LEN - 1, + 0); } +} - clear_extent_dirty(root-fs_info-free_space_cache, - 0, BTRFS_SUPER_INFO_OFFSET - 1, 0); - +static int alloc_blocks(struct extent_io_tree *free_tree, + u64 *blocks, int num, u64 blocksize) +{ + u64 start; + u64 end; + u64 last = 0; + u64 mask = blocksize - 1; + int ret; + while(num) { + ret = find_first_extent_bit(free_tree, last, start, end, + EXTENT_DIRTY); + if (ret) + goto fail; + last = end + 1; + if (start mask) + start = (start mask) + blocksize; + if (last - start blocksize) + continue; + *blocks++ = start; + num--; + last = start + blocksize; + clear_extent_dirty(free_tree, start, last - 1, 0); + } return 0; +fail: + fprintf(stderr, not enough free space\n); + return -ENOSPC; } static int custom_alloc_extent(struct btrfs_root *root, u64 num_bytes, u64 hint_byte, struct btrfs_key *ins) { + u64 blocksize = root-sectorsize; + u64 mask = blocksize - 1; u64 start; u64 end; u64 last = hint_byte; @@ -171,6 +191,8 @@ static int custom_alloc_extent(struct btrfs_root *root, u64 num_bytes, start = max(last, start); last = end + 1; + if (start mask) + start = (start mask) + blocksize; if (last - start num_bytes) continue; @@ -1186,9 +1208,9 @@ static int create_image_file_range(struct btrfs_trans_handle *trans, struct btrfs_root *root, u64 objectid, struct btrfs_inode_item *inode, u64 start_byte, u64 end_byte, - ext2_filsys ext2_fs) +
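The alignment step this patch adds to alloc_blocks() and custom_alloc_extent() lost its operators in the archive; it presumably read `if (start & mask) start = (start & ~mask) + blocksize;`. A minimal standalone sketch of that round-up, assuming blocksize is a power of two. The helper name align_up_to_block is hypothetical and does not appear in convert.c:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of convert.c: round 'start' up to the
 * next multiple of 'blocksize'. The mask trick is only valid when
 * blocksize is a power of two, so that is checked first.
 */
static uint64_t align_up_to_block(uint64_t start, uint64_t blocksize)
{
	uint64_t mask = blocksize - 1;

	assert(blocksize != 0 && (blocksize & mask) == 0);

	/* Already aligned offsets are returned unchanged. */
	if (start & mask)
		start = (start & ~mask) + blocksize;
	return start;
}
```

With a 4096-byte block size, an extent starting at 4097 would be pushed to 8192, which is why the following `last - start` size check has to be repeated after alignment.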
Re: Bug when resizing FS
On Sat, May 15, 2010 at 12:23 AM, Martin Bueger mbuer...@edu.uni-klu.ac.at wrote:
> Hello,
>
> when I try to resize the FS with btrfsctl -r it works using + or -,
> hence, extending or shrinking the FS but when I want to set it to a
> certain size I always hit the following bug:
>
> [ cut here ]
> invalid opcode: [#1] SMP
> last sysfs file: /sys/devices/pci:00/:00:10.0/host2/target2:0:1/2:0:1:0/block/sdb/size
> Modules linked in: sr_mod i2c_piix4 cdrom processor container thermal ac button i2c_core
>
> Pid: 4044, comm: btrfs-delalloc- Not tainted 2.6.33-zen2 #1 440BX Desktop Reference Platform/VMware Virtual Platform
> EIP: 0060:[c12545a8] EFLAGS: 00010286 CPU: 0
> EIP is at cow_file_range+0x638/0x650
> EAX: ffe4 EBX: 1b5cb000 ECX: 3d3d EDX: 0001
> ESI: EDI: EBP: cd22a034 ESP: caa4de40
> DS: 007b ES: 007b FS: 00d8 GS: SS: 0068
> Process btrfs-delalloc- (pid: 4044, ti=caa4c000 task=de0d0b10 task.ti=caa4c000)
> Stack: 1000 1000 1b5cb000 0 caa4debf 0001 c1a4d1a0 cd22a12c 1000 0 cd22a038 de032800 cb25a180 1000 caa4dea8 1000
> Call Trace:
> [c12bbe1a] ? __prop_inc_single+0x3a/0x50
> [c1255580] ? submit_compressed_extents+0x260/0x4f0
> [c127b10e] ? run_ordered_completions+0x5e/0xb0
> [c127b85b] ? worker_loop+0x12b/0x410
> [c127b730] ? worker_loop+0x0/0x410
> [c103e994] ? kthread+0x74/0x80
> [c103e920] ? kthread+0x0/0x80
> [c10030b6] ? kernel_thread_helper+0x6/0x10
> Code: 8b 94 24 b8 00 00 00 83 d6 00 0f ac f3 0c 01 1a 8b 84 24 b4 00 00 00 c7 00 01 00 00 00 e9 23 fe ff ff 0f 0b eb fe 90 8d 74 26 00 0f 0b eb fe 8d 74 26 00 31 db 31 f6 e9 a1 fb ff ff 0f 0b eb fe
> EIP: [c12545a8] cow_file_range+0x638/0x650 SS:ESP 0068:caa4de40
> ---[ end trace 31b4672bb84c5cec ]---
>
> The command I ran: btrfsctl -r 1g /mnt/point

Looks like an ENOSPC Oops; this will be improved in 2.6.35.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/5] btrfs: pass buffer extent to btrfs_free_tree_block
prepare for the log code Signed-off-by: Yan Zheng zheng@oracle.com --- diff -urp 1/fs/btrfs/ctree.c 2/fs/btrfs/ctree.c --- 1/fs/btrfs/ctree.c 2010-04-14 14:49:56.342950744 +0800 +++ 2/fs/btrfs/ctree.c 2010-05-11 14:00:04.122357838 +0800 @@ -279,7 +279,8 @@ int btrfs_block_can_be_shared(struct btr static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf, - struct extent_buffer *cow) + struct extent_buffer *cow, + int *last_ref) { u64 refs; u64 owner; @@ -365,6 +366,7 @@ static noinline int update_ref_for_cow(s BUG_ON(ret); } clean_tree_block(trans, root, buf); + *last_ref = 1; } return 0; } @@ -392,6 +394,7 @@ static noinline int __btrfs_cow_block(st struct extent_buffer *cow; int level; int unlock_orig = 0; + int last_ref = 0; u64 parent_start; if (*cow_ret == buf) @@ -441,7 +444,7 @@ static noinline int __btrfs_cow_block(st (unsigned long)btrfs_header_fsid(cow), BTRFS_FSID_SIZE); - update_ref_for_cow(trans, root, buf, cow); + update_ref_for_cow(trans, root, buf, cow, last_ref); if (buf == root-node) { WARN_ON(parent parent != buf); @@ -456,8 +459,8 @@ static noinline int __btrfs_cow_block(st extent_buffer_get(cow); spin_unlock(root-node_lock); - btrfs_free_tree_block(trans, root, buf-start, buf-len, - parent_start, root-root_key.objectid, level); + btrfs_free_tree_block(trans, root, buf, parent_start, + last_ref); free_extent_buffer(buf); add_root_to_dirty_list(root); } else { @@ -472,8 +475,8 @@ static noinline int __btrfs_cow_block(st btrfs_set_node_ptr_generation(parent, parent_slot, trans-transid); btrfs_mark_buffer_dirty(parent); - btrfs_free_tree_block(trans, root, buf-start, buf-len, - parent_start, root-root_key.objectid, level); + btrfs_free_tree_block(trans, root, buf, parent_start, + last_ref); } if (unlock_orig) btrfs_tree_unlock(buf); @@ -948,6 +951,22 @@ int btrfs_bin_search(struct extent_buffe return bin_search(eb, key, level, slot); } +static void root_add_used(struct btrfs_root 
*root, u32 size) +{ + spin_lock(root-node_lock); + btrfs_set_root_used(root-root_item, + btrfs_root_used(root-root_item) + size); + spin_unlock(root-node_lock); +} + +static void root_sub_used(struct btrfs_root *root, u32 size) +{ + spin_lock(root-node_lock); + btrfs_set_root_used(root-root_item, + btrfs_root_used(root-root_item) - size); + spin_unlock(root-node_lock); +} + /* given a node and slot number, this reads the blocks it points to. The * extent buffer is returned with a reference taken (but unlocked). * NULL is returned on error. @@ -1018,7 +1037,11 @@ static noinline int balance_level(struct btrfs_tree_lock(child); btrfs_set_lock_blocking(child); ret = btrfs_cow_block(trans, root, child, mid, 0, child); - BUG_ON(ret); + if (ret) { + btrfs_tree_unlock(child); + free_extent_buffer(child); + goto enospc; + } spin_lock(root-node_lock); root-node = child; @@ -1033,11 +1056,12 @@ static noinline int balance_level(struct btrfs_tree_unlock(mid); /* once for the path */ free_extent_buffer(mid); - ret = btrfs_free_tree_block(trans, root, mid-start, mid-len, - 0, root-root_key.objectid, level); + + root_sub_used(root, mid-len); + btrfs_free_tree_block(trans, root, mid, 0, 1); /* once for the root ptr */ free_extent_buffer(mid); - return ret; + return 0; } if (btrfs_header_nritems(mid) BTRFS_NODEPTRS_PER_BLOCK(root) / 4) @@ -1087,23 +,16 @@ static noinline int balance_level(struct if (wret 0 wret != -ENOSPC) ret = wret; if (btrfs_header_nritems(right) == 0) { - u64 bytenr = right-start; - u32 blocksize = right-len
[PATCH 3/5] btrfs: split btrfs_alloc_free_block()
split btrfs_alloc_free_block() into btrfs_reserved_tree_block() and btrfs_alloc_reserved_tree_block(). Signed-off-by: Yan Zheng zheng@oracle.com --- diff -urp 3/fs/btrfs/ctree.h 4/fs/btrfs/ctree.h --- 3/fs/btrfs/ctree.h 2010-05-11 14:09:45.052107958 +0800 +++ 4/fs/btrfs/ctree.h 2010-05-11 13:15:47.060357000 +0800 @@ -1978,6 +1978,15 @@ struct btrfs_block_group_cache *btrfs_lo void btrfs_put_block_group(struct btrfs_block_group_cache *cache); u64 btrfs_find_block_group(struct btrfs_root *root, u64 search_start, u64 search_hint, int owner); +struct extent_buffer *btrfs_reserve_tree_block(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + u32 blocksize, int level, + u64 hint, u64 empty_size); +int btrfs_alloc_reserved_tree_block(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + struct extent_buffer *buf, + u64 parent, u64 root_objectid, + struct btrfs_disk_key *key, int level); struct extent_buffer *btrfs_alloc_free_block(struct btrfs_trans_handle *trans, struct btrfs_root *root, u32 blocksize, u64 parent, u64 root_objectid, diff -urp 3/fs/btrfs/extent-tree.c 4/fs/btrfs/extent-tree.c --- 3/fs/btrfs/extent-tree.c2010-05-11 14:12:00.044357180 +0800 +++ 4/fs/btrfs/extent-tree.c2010-05-11 13:26:38.036107000 +0800 @@ -4956,64 +4998,6 @@ int btrfs_alloc_logged_file_extent(struc return ret; } -/* - * finds a free extent and does all the dirty work required for allocation - * returns the key for the extent through ins, and a tree buffer for - * the first block of the extent through buf. - * - * returns 0 if everything worked, non-zero otherwise. 
- */ -static int alloc_tree_block(struct btrfs_trans_handle *trans, - struct btrfs_root *root, - u64 num_bytes, u64 parent, u64 root_objectid, - struct btrfs_disk_key *key, int level, - u64 empty_size, u64 hint_byte, u64 search_end, - struct btrfs_key *ins) -{ - int ret; - u64 flags = 0; - - ret = btrfs_reserve_extent(trans, root, num_bytes, num_bytes, - empty_size, hint_byte, search_end, - ins, 0); - if (ret) - return ret; - - if (root_objectid == BTRFS_TREE_RELOC_OBJECTID) { - if (parent == 0) - parent = ins-objectid; - flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF; - } else - BUG_ON(parent 0); - - if (root_objectid != BTRFS_TREE_LOG_OBJECTID) { - struct btrfs_delayed_extent_op *extent_op; - extent_op = kzalloc(sizeof(*extent_op), GFP_NOFS); - BUG_ON(!extent_op); - if (key) - memcpy(extent_op-key, key, sizeof(extent_op-key)); - extent_op-flags_to_set = flags; - extent_op-update_key = 1; - extent_op-update_gen = 1; - extent_op-update_flags = 1; - - ret = btrfs_add_delayed_tree_ref(trans, ins-objectid, - ins-offset, parent, root_objectid, - level, BTRFS_ADD_DELAYED_EXTENT, - extent_op); - BUG_ON(ret); - } - - if (root_objectid == root-root_key.objectid) { - u64 used; - spin_lock(root-node_lock); - used = btrfs_root_used(root-root_item) + num_bytes; - btrfs_set_root_used(root-root_item, used); - spin_unlock(root-node_lock); - } - return ret; -} - struct extent_buffer *btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root, u64 bytenr, u32 blocksize, @@ -5052,8 +5036,68 @@ struct extent_buffer *btrfs_init_new_buf return buf; } +struct extent_buffer *btrfs_reserve_tree_block(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + u32 blocksize, int level, + u64 hint, u64 empty_size) +{ + + struct btrfs_key ins; + struct extent_buffer *buf; + int ret; + + ret = btrfs_reserve_extent(trans, root, blocksize, blocksize, + empty_size, hint, (u64)-1, ins, 0); + if (ret) + return ERR_PTR(ret); + + buf = btrfs_init_new_buffer(trans, root
[PATCH 4/5] btrfs: don't cache empty block groups during mount
The tree log recovery code expects no free space to be cached before it
executes.

Signed-off-by: Yan Zheng zheng@oracle.com
---
diff -urp 4/fs/btrfs/extent-tree.c 8/fs/btrfs/extent-tree.c
--- 4/fs/btrfs/extent-tree.c	2010-05-11 14:15:29.174108554 +0800
+++ 8/fs/btrfs/extent-tree.c	2010-05-11 13:26:38.036107000 +0800
@@ -316,11 +329,6 @@ static int caching_kthread(void *data)
 	if (!path)
 		return -ENOMEM;
 
-	exclude_super_stripes(extent_root, block_group);
-	spin_lock(&block_group->space_info->lock);
-	block_group->space_info->bytes_super += block_group->bytes_super;
-	spin_unlock(&block_group->space_info->lock);
-
 	last = max_t(u64, block_group->key.objectid, BTRFS_SUPER_INFO_OFFSET);
 
 	/*
@@ -7499,6 +7541,7 @@ int btrfs_free_block_groups(struct btrfs
 		if (block_group->cached == BTRFS_CACHE_STARTED)
 			wait_block_group_cache_done(block_group);
 
+		free_excluded_extents(info->extent_root, block_group);
 		btrfs_remove_free_space_cache(block_group);
 		btrfs_put_block_group(block_group);
 
@@ -7586,26 +7629,12 @@ int btrfs_read_block_groups(struct btrfs
 		cache->flags = btrfs_block_group_flags(&cache->item);
 		cache->sectorsize = root->sectorsize;
 
-		/*
-		 * check for two cases, either we are full, and therefore
-		 * don't need to bother with the caching work since we won't
-		 * find any space, or we are empty, and we can just add all
-		 * the space in and be done with it. This saves us _alot_ of
-		 * time, particularly in the full case.
-		 */
-		if (found_key.offset == btrfs_block_group_used(&cache->item)) {
-			exclude_super_stripes(root, cache);
-			cache->last_byte_to_unpin = (u64)-1;
-			cache->cached = BTRFS_CACHE_FINISHED;
-			free_excluded_extents(root, cache);
-		} else if (btrfs_block_group_used(&cache->item) == 0) {
-			exclude_super_stripes(root, cache);
+		exclude_super_stripes(root, cache);
+		/* check for the case that block group is full */
+		if (found_key.offset == cache->bytes_super +
+		    btrfs_block_group_used(&cache->item)) {
 			cache->last_byte_to_unpin = (u64)-1;
 			cache->cached = BTRFS_CACHE_FINISHED;
-			add_new_free_space(cache, root->fs_info,
-					   found_key.objectid,
-					   found_key.objectid +
-					   found_key.offset);
 			free_excluded_extents(root, cache);
 		}
 
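The new "block group is full" test in btrfs_read_block_groups() counts the bytes excluded for superblock mirrors as unavailable, so a group is full once used bytes plus excluded bytes equal its length. A minimal sketch of that comparison outside the kernel; struct bg_sketch and bg_is_full() are hypothetical stand-ins for the fields the patch reads, not kernel definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the few block group fields the check reads. */
struct bg_sketch {
	uint64_t length;      /* found_key.offset: size of the block group */
	uint64_t bytes_super; /* bytes excluded for superblock mirrors */
	uint64_t bytes_used;  /* what btrfs_block_group_used() would report */
};

/* Returns 1 when nothing is left to cache, i.e. the group is full. */
static int bg_is_full(const struct bg_sketch *bg)
{
	/* Used plus excluded bytes can never exceed the group itself. */
	assert(bg->bytes_super + bg->bytes_used <= bg->length);
	return bg->length == bg->bytes_super + bg->bytes_used;
}
```

An empty group (bytes_used == 0) is no longer special-cased: it falls through to normal caching, which is what keeps free space out of the cache until log recovery has run.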
Re: Disk space accounting and subvolume delete
On Tue, May 11, 2010 at 11:45 PM, Bruce Guenter br...@untroubled.org wrote:
> On Tue, May 11, 2010 at 08:10:38AM +0800, Yan, Zheng wrote:
>> This is because the snapshot deleting ioctl only removes the link.
>
> Right, I understand that. That part is not unexpected, as it works just
> like unlink would. However...
>
>> The corresponding tree is dropped in the background by a kernel thread.
>
> The surprise is that 'sync', in any form I was able to try, does not wait
> until all or even most of the I/O is completed. Apparently the standards
> spec for sync(2) says it is not required to wait for I/O to complete, but
> AFAIK all other Linux FS do wait (the man page for sync(2) implies as
> much, as does the info page for sync in glibc).
>
> The only way I've found so far to force this behavior is to unmount, and
> that's rather intrusive to other users of the FS.
>
>> We could probably add another ioctl that waits until the tree has been
>> completely dropped.
>
> Since the expected behavior for sync is to wait until all pending I/O has
> been completed, I would argue this should be the default action for
> sync. Am I misunderstanding something?

Dropping a tree can be lengthy. It's not good to let sync wait for hours.
For most Linux filesystems, 'sync' just forces a transaction/journal commit.
I don't think they wait for large operations that can span multiple
transactions to complete.

Yan, Zheng
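Until something like the proposed ioctl exists, one user-space workaround is to watch the reported free space and wait for it to stop changing. The following is only a heuristic sketch under that assumption: it uses POSIX statvfs(), the function names are made up for illustration, and a stable reading does not prove the background drop has finished.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/statvfs.h>
#include <unistd.h>

/*
 * Free space, in bytes, available to unprivileged users on the
 * filesystem containing 'path'. Returns 0 on success, -1 on error.
 */
static int free_bytes(const char *path, uint64_t *out)
{
	struct statvfs st;

	assert(path != NULL && out != NULL);
	if (statvfs(path, &st) != 0)
		return -1;
	*out = (uint64_t)st.f_bavail * st.f_frsize;
	return 0;
}

/*
 * Poll every 'interval_sec' seconds until two consecutive samples agree
 * or 'max_polls' samples have been taken. The last sample is stored in
 * '*final'. Returns 0 on success, -1 if statvfs() fails.
 */
static int wait_for_stable_free_space(const char *path,
				      unsigned int interval_sec,
				      unsigned int max_polls,
				      uint64_t *final)
{
	uint64_t prev, cur;
	unsigned int i;

	if (free_bytes(path, &prev) != 0)
		return -1;
	for (i = 0; i < max_polls; i++) {
		sleep(interval_sec);
		if (free_bytes(path, &cur) != 0)
			return -1;
		if (cur == prev)
			break;
		prev = cur;
	}
	*final = prev;
	return 0;
}
```

A caller might run `wait_for_stable_free_space("/mnt/point", 5, 120, &bytes)` after deleting a snapshot; on a busy filesystem other writers will also move the number, so this is no substitute for a real wait primitive.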
[PATCH] btrfs: Fix block generation verification race
After the path is released, the generation number obtained from the block
pointer is no longer valid. The race may cause disk corruption, because
verify_parent_transid() calls clear_extent_buffer_uptodate() when the
generation numbers mismatch.

Signed-off-by: Yan Zheng zheng@oracle.com
---
diff -urp 1/fs/btrfs/ctree.c 2/fs/btrfs/ctree.c
--- 1/fs/btrfs/ctree.c	2010-04-14 14:49:56.342950744 +0800
+++ 2/fs/btrfs/ctree.c	2010-05-03 09:44:24.426642447 +0800
@@ -1589,7 +1589,7 @@ read_block_for_search(struct btrfs_trans
 	btrfs_release_path(NULL, p);
 
 	ret = -EAGAIN;
-	tmp = read_tree_block(root, blocknr, blocksize, gen);
+	tmp = read_tree_block(root, blocknr, blocksize, 0);
 	if (tmp) {
 		/*
 		 * If the read above didn't mark this buffer up to date,
Re: bug when removing device
[ 298.659939] [c02024c3] ? __mem_cgroup_try_charge+0x53/0x330
[ 298.659956] [f85e1b9d] ? btrfs_ioctl+0x79d/0x9c0 [btrfs]
[ 298.659964] [c01dd0d8] ? memdup_user+0x38/0x70
[ 298.659980] [f85e1bbc] ? btrfs_ioctl+0x7bc/0x9c0 [btrfs]
[ 298.659986] [c01e28b8] ? __do_fault+0x3e8/0x560
[ 298.659993] [c01e47e5] ? handle_mm_fault+0x145/0xaa0
[ 298.66] [c0215532] ? vfs_ioctl+0x32/0xb0
[ 298.660016] [f85e1400] ? btrfs_ioctl+0x0/0x9c0 [btrfs]
[ 298.660022] [c0215c92] ? do_vfs_ioctl+0x72/0x5c0
[ 298.660029] [c05a1abd] ? do_page_fault+0x1cd/0x440
[ 298.660035] [c0210e3b] ? putname+0x2b/0x40
[ 298.660041] [c0205e9a] ? do_sys_open+0xfa/0x120
[ 298.660047] [c0216247] ? sys_ioctl+0x67/0x80
[ 298.660053] [c0102fe3] ? sysenter_do_call+0x12/0x28
[ 298.660057] Code: 85 ff 75 db e9 49 fb ff ff 0f 0b 66 90 eb fc ba 29 09 00 00 b8 c8 60 5f f8 e8 af 5c b5 c7 0f b6 7e 25 e9 c8 fc ff ff 0f 0b eb fe 0f 0b eb fe 8b 45 d4 e8 d6 1b fa ff c7 45 c4 f4 ff ff ff e9 02
[ 298.660124] EIP: [f85f3b7e] relocate_tree_blocks+0x52e/0x590 [btrfs] SS:ESP 0068:cd333c14
[ 298.660150] ---[ end trace fb3e62da0e52a0bd ]---

I have sent a set of patches that address bugs like this.

Yan, Zheng
[PATCH V2 01/12] Btrfs: Link block groups of different raid types in the same space_info
The size of reserved space is stored in space_info. If block groups of different raid types are linked to separate space_info, changing allocation profile will corrupt reserved space accounting. Signed-off-by: Yan Zheng zheng@oracle.com --- diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h --- 2/fs/btrfs/ctree.h 2010-04-26 17:23:52.921839641 +0800 +++ 3/fs/btrfs/ctree.h 2010-04-26 17:23:52.926830638 +0800 @@ -662,6 +662,7 @@ struct btrfs_csum_item { #define BTRFS_BLOCK_GROUP_RAID1(1 4) #define BTRFS_BLOCK_GROUP_DUP (1 5) #define BTRFS_BLOCK_GROUP_RAID10 (1 6) +#define BTRFS_NR_RAID_TYPES 5 struct btrfs_block_group_item { __le64 used; @@ -673,7 +674,8 @@ struct btrfs_space_info { u64 flags; u64 total_bytes;/* total bytes in the space */ - u64 bytes_used; /* total bytes used on disk */ + u64 bytes_used; /* total bytes used, + this does't take mirrors into account */ u64 bytes_pinned; /* total bytes pinned, will be freed when the transaction finishes */ u64 bytes_reserved; /* total bytes the allocator has reserved for @@ -686,6 +688,7 @@ struct btrfs_space_info { delalloc/allocations */ u64 bytes_delalloc; /* number of bytes currently reserved for delayed allocation */ + u64 disk_used; /* total bytes used on disk */ int full; /* indicates that we cannot allocate any more chunks for this space */ @@ -703,7 +706,7 @@ struct btrfs_space_info { int flushing; /* for block groups in our same type */ - struct list_head block_groups; + struct list_head block_groups[BTRFS_NR_RAID_TYPES]; spinlock_t lock; struct rw_semaphore groups_sem; atomic_t caching_threads; diff -urp 2/fs/btrfs/extent-tree.c 3/fs/btrfs/extent-tree.c --- 2/fs/btrfs/extent-tree.c2010-04-26 17:23:52.922840061 +0800 +++ 3/fs/btrfs/extent-tree.c2010-04-26 17:23:52.929829246 +0800 @@ -506,6 +506,9 @@ static struct btrfs_space_info *__find_s struct list_head *head = info-space_info; struct btrfs_space_info *found; + flags = BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_SYSTEM | +BTRFS_BLOCK_GROUP_METADATA; + 
rcu_read_lock(); list_for_each_entry_rcu(found, head, list) { if (found-flags == flags) { @@ -2659,12 +2662,21 @@ static int update_space_info(struct btrf struct btrfs_space_info **space_info) { struct btrfs_space_info *found; + int i; + int factor; + + if (flags (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 | +BTRFS_BLOCK_GROUP_RAID10)) + factor = 2; + else + factor = 1; found = __find_space_info(info, flags); if (found) { spin_lock(found-lock); found-total_bytes += total_bytes; found-bytes_used += bytes_used; + found-disk_used += bytes_used * factor; found-full = 0; spin_unlock(found-lock); *space_info = found; @@ -2674,14 +2686,18 @@ static int update_space_info(struct btrf if (!found) return -ENOMEM; - INIT_LIST_HEAD(found-block_groups); + for (i = 0; i BTRFS_NR_RAID_TYPES; i++) + INIT_LIST_HEAD(found-block_groups[i]); init_rwsem(found-groups_sem); init_waitqueue_head(found-flush_wait); init_waitqueue_head(found-allocate_wait); spin_lock_init(found-lock); - found-flags = flags; + found-flags = flags (BTRFS_BLOCK_GROUP_DATA | + BTRFS_BLOCK_GROUP_SYSTEM | + BTRFS_BLOCK_GROUP_METADATA); found-total_bytes = total_bytes; found-bytes_used = bytes_used; + found-disk_used = bytes_used * factor; found-bytes_pinned = 0; found-bytes_reserved = 0; found-bytes_readonly = 0; @@ -2751,26 +2767,32 @@ u64 btrfs_reduce_alloc_profile(struct bt return flags; } -static u64 btrfs_get_alloc_profile(struct btrfs_root *root, u64 data) +static u64 get_alloc_profile(struct btrfs_root *root, u64 flags) { - struct btrfs_fs_info *info = root-fs_info; - u64 alloc_profile; + if (flags BTRFS_BLOCK_GROUP_DATA) + flags |= root-fs_info-avail_data_alloc_bits +root-fs_info-data_alloc_profile; + else if (flags BTRFS_BLOCK_GROUP_SYSTEM) + flags |= root-fs_info-avail_system_alloc_bits +root-fs_info-system_alloc_profile; + else if (flags BTRFS_BLOCK_GROUP_METADATA) + flags |= root-fs_info-avail_metadata_alloc_bits +root
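The accounting change in update_space_info() can be summarised as: space_info is keyed by block group type only, and disk_used is bytes_used multiplied by 2 for mirrored profiles. A minimal sketch of that arithmetic follows. The shifts in the quoted ctree.h hunk were stripped by the archive, so the flag values below are assumptions (DATA, SYSTEM, METADATA in bits 0 to 2; RAID1, DUP, RAID10 in bits 4 to 6), and all SK_ names and helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flag values; the shifts are garbled in the quoted diff. */
#define SK_BG_DATA     (1ULL << 0)
#define SK_BG_SYSTEM   (1ULL << 1)
#define SK_BG_METADATA (1ULL << 2)
#define SK_BG_RAID1    (1ULL << 4)
#define SK_BG_DUP      (1ULL << 5)
#define SK_BG_RAID10   (1ULL << 6)

#define SK_BG_TYPE_MASK (SK_BG_DATA | SK_BG_SYSTEM | SK_BG_METADATA)

/* Key used to look up a space_info: the type bits, profile bits dropped. */
static uint64_t space_info_key(uint64_t flags)
{
	assert(flags & SK_BG_TYPE_MASK); /* every block group has a type */
	return flags & SK_BG_TYPE_MASK;
}

/* Profiles that store two copies of every byte cost twice the disk space. */
static int mirror_factor(uint64_t flags)
{
	if (flags & (SK_BG_DUP | SK_BG_RAID1 | SK_BG_RAID10))
		return 2;
	return 1;
}

/* Raw disk bytes consumed by 'bytes_used' logical bytes. */
static uint64_t disk_used(uint64_t bytes_used, uint64_t flags)
{
	return bytes_used * (uint64_t)mirror_factor(flags);
}
```

Under these assumptions a RAID1 data group and a single-copy data group map to the same key, so switching the allocation profile no longer moves reservations to a different space_info.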
[PATCH V2 02/12] Btrfs: Kill allocate_wait in space_info
We already have fs_info-chunk_mutex to avoid concurrent chunk creation. Signed-off-by: Yan Zheng zheng@oracle.com --- diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h --- 2/fs/btrfs/ctree.h 2010-04-26 17:24:10.436081649 +0800 +++ 3/fs/btrfs/ctree.h 2010-04-26 17:24:10.441079491 +0800 @@ -700,9 +700,7 @@ struct btrfs_space_info { struct list_head list; /* for controlling how we free up space for allocations */ - wait_queue_head_t allocate_wait; wait_queue_head_t flush_wait; - int allocating_chunk; int flushing; /* for block groups in our same type */ diff -urp 2/fs/btrfs/extent-tree.c 3/fs/btrfs/extent-tree.c --- 2/fs/btrfs/extent-tree.c2010-04-26 17:24:10.437084933 +0800 +++ 3/fs/btrfs/extent-tree.c2010-04-26 17:24:10.444079704 +0800 @@ -70,6 +70,9 @@ static int find_next_key(struct btrfs_pa struct btrfs_key *key); static void dump_space_info(struct btrfs_space_info *info, u64 bytes, int dump_block_groups); +static int maybe_allocate_chunk(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + struct btrfs_space_info *sinfo, u64 num_bytes); static noinline int block_group_cache_done(struct btrfs_block_group_cache *cache) @@ -2690,7 +2693,6 @@ static int update_space_info(struct btrf INIT_LIST_HEAD(found-block_groups[i]); init_rwsem(found-groups_sem); init_waitqueue_head(found-flush_wait); - init_waitqueue_head(found-allocate_wait); spin_lock_init(found-lock); found-flags = flags (BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_SYSTEM | @@ -3003,71 +3005,6 @@ flush: wake_up(info-flush_wait); } -static int maybe_allocate_chunk(struct btrfs_root *root, -struct btrfs_space_info *info) -{ - struct btrfs_super_block *disk_super = root-fs_info-super_copy; - struct btrfs_trans_handle *trans; - bool wait = false; - int ret = 0; - u64 min_metadata; - u64 free_space; - - free_space = btrfs_super_total_bytes(disk_super); - /* -* we allow the metadata to grow to a max of either 10gb or 5% of the -* space in the volume. 
-*/ - min_metadata = min((u64)10 * 1024 * 1024 * 1024, -div64_u64(free_space * 5, 100)); - if (info-total_bytes = min_metadata) { - spin_unlock(info-lock); - return 0; - } - - if (info-full) { - spin_unlock(info-lock); - return 0; - } - - if (!info-allocating_chunk) { - info-force_alloc = 1; - info-allocating_chunk = 1; - } else { - wait = true; - } - - spin_unlock(info-lock); - - if (wait) { - wait_event(info-allocate_wait, - !info-allocating_chunk); - return 1; - } - - trans = btrfs_start_transaction(root, 1); - if (!trans) { - ret = -ENOMEM; - goto out; - } - - ret = do_chunk_alloc(trans, root-fs_info-extent_root, -4096 + 2 * 1024 * 1024, -info-flags, 0); - btrfs_end_transaction(trans, root); - if (ret) - goto out; -out: - spin_lock(info-lock); - info-allocating_chunk = 0; - spin_unlock(info-lock); - wake_up(info-allocate_wait); - - if (ret) - return 0; - return 1; -} - /* * Reserve metadata space for delalloc. */ @@ -3108,7 +3045,8 @@ again: flushed++; if (flushed == 1) { - if (maybe_allocate_chunk(root, meta_sinfo)) + if (maybe_allocate_chunk(NULL, root, meta_sinfo, +num_bytes)) goto again; flushed++; } else { @@ -3223,7 +3161,8 @@ again: if (used meta_sinfo-total_bytes) { retries++; if (retries == 1) { - if (maybe_allocate_chunk(root, meta_sinfo)) + if (maybe_allocate_chunk(NULL, root, meta_sinfo, +num_bytes)) goto again; retries++; } else { @@ -3420,13 +3359,28 @@ static void force_metadata_allocation(st rcu_read_unlock(); } +static int should_alloc_chunk(struct btrfs_space_info *sinfo, + u64 alloc_bytes) +{ + u64 num_bytes = sinfo-total_bytes - sinfo-bytes_readonly; + + if (sinfo-bytes_used + sinfo-bytes_reserved + + alloc_bytes + 256 * 1024 * 1024 num_bytes) + return 0; + + if (sinfo-bytes_used + sinfo
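The body of the new should_alloc_chunk() is cut off in the archive and its comparison operators were stripped. The following sketches only the first visible check, assuming the lost operator was `<`: when used plus reserved plus requested bytes, with 256 MiB of headroom added, still fits in the writable space, no new chunk is needed. The struct and function names are hypothetical, and the second (truncated) condition is deliberately not reproduced.

```c
#include <assert.h>
#include <stdint.h>

#define SK_HEADROOM (256ULL * 1024 * 1024)

/* Hypothetical subset of struct btrfs_space_info used by the check. */
struct space_info_sketch {
	uint64_t total_bytes;
	uint64_t bytes_readonly;
	uint64_t bytes_used;
	uint64_t bytes_reserved;
};

/*
 * Returns 1 when the request fits with 256 MiB to spare, which is the
 * case where the (assumed) first check makes should_alloc_chunk()
 * return 0.
 */
static int has_headroom(const struct space_info_sketch *s,
			uint64_t alloc_bytes)
{
	uint64_t num_bytes;

	assert(s->bytes_readonly <= s->total_bytes);
	num_bytes = s->total_bytes - s->bytes_readonly;

	return s->bytes_used + s->bytes_reserved + alloc_bytes +
	       SK_HEADROOM < num_bytes;
}
```

For a 1 GiB space_info with 100 MiB used, a 1 MiB request leaves plenty of room; at 800 MiB used the same request crosses the headroom line and the remaining, truncated checks would decide.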
[PATCH V2 04/12] Btrfs: Kill init_btrfs_i()
All code in init_btrfs_i can be moved into btrfs_alloc_inode()

Signed-off-by: Yan Zheng zheng@oracle.com
---
diff -urp 2/fs/btrfs/inode.c 3/fs/btrfs/inode.c
--- 2/fs/btrfs/inode.c	2010-04-26 17:24:41.254078880 +0800
+++ 3/fs/btrfs/inode.c	2010-04-26 17:24:41.270103836 +0800
@@ -3595,40 +3595,10 @@ again:
 	return 0;
 }
 
-static noinline void init_btrfs_i(struct inode *inode)
-{
-	struct btrfs_inode *bi = BTRFS_I(inode);
-
-	bi->generation = 0;
-	bi->sequence = 0;
-	bi->last_trans = 0;
-	bi->last_sub_trans = 0;
-	bi->logged_trans = 0;
-	bi->delalloc_bytes = 0;
-	bi->reserved_bytes = 0;
-	bi->disk_i_size = 0;
-	bi->flags = 0;
-	bi->index_cnt = (u64)-1;
-	bi->last_unlink_trans = 0;
-	bi->ordered_data_close = 0;
-	bi->force_compress = 0;
-	extent_map_tree_init(&BTRFS_I(inode)->extent_tree, GFP_NOFS);
-	extent_io_tree_init(&BTRFS_I(inode)->io_tree,
-			     inode->i_mapping, GFP_NOFS);
-	extent_io_tree_init(&BTRFS_I(inode)->io_failure_tree,
-			     inode->i_mapping, GFP_NOFS);
-	INIT_LIST_HEAD(&BTRFS_I(inode)->delalloc_inodes);
-	INIT_LIST_HEAD(&BTRFS_I(inode)->ordered_operations);
-	RB_CLEAR_NODE(&BTRFS_I(inode)->rb_node);
-	btrfs_ordered_inode_tree_init(&BTRFS_I(inode)->ordered_tree);
-	mutex_init(&BTRFS_I(inode)->log_mutex);
-}
-
 static int btrfs_init_locked_inode(struct inode *inode, void *p)
 {
 	struct btrfs_iget_args *args = p;
 	inode->i_ino = args->ino;
-	init_btrfs_i(inode);
 	BTRFS_I(inode)->root = args->root;
 	btrfs_set_inode_space_info(args->root, inode);
 	return 0;
@@ -3691,8 +3661,6 @@ static struct inode *new_simple_dir(stru
 	if (!inode)
 		return ERR_PTR(-ENOMEM);
 
-	init_btrfs_i(inode);
-
 	BTRFS_I(inode)->root = root;
 	memcpy(&BTRFS_I(inode)->location, key, sizeof(*key));
 	BTRFS_I(inode)->dummy_inode = 1;
@@ -4091,7 +4059,6 @@ static struct inode *btrfs_new_inode(str
 	 * btrfs_get_inode_index_count has an explanation for the magic
 	 * number
 	 */
-	init_btrfs_i(inode);
 	BTRFS_I(inode)->index_cnt = 2;
 	BTRFS_I(inode)->root = root;
 	BTRFS_I(inode)->generation = trans->transid;
@@ -5262,21 +5229,46 @@ unsigned long btrfs_force_ra(struct addr
 struct inode *btrfs_alloc_inode(struct super_block *sb)
 {
 	struct btrfs_inode *ei;
+	struct inode *inode;
 
 	ei = kmem_cache_alloc(btrfs_inode_cachep, GFP_NOFS);
 	if (!ei)
 		return NULL;
+
+	ei->root = NULL;
+	ei->space_info = NULL;
+	ei->generation = 0;
+	ei->sequence = 0;
 	ei->last_trans = 0;
 	ei->last_sub_trans = 0;
 	ei->logged_trans = 0;
+	ei->delalloc_bytes = 0;
+	ei->reserved_bytes = 0;
+	ei->disk_i_size = 0;
+	ei->flags = 0;
+	ei->index_cnt = (u64)-1;
+	ei->last_unlink_trans = 0;
+
+	spin_lock_init(&ei->accounting_lock);
 	ei->outstanding_extents = 0;
 	ei->reserved_extents = 0;
-	ei->root = NULL;
-	spin_lock_init(&ei->accounting_lock);
+
+	ei->ordered_data_close = 0;
+	ei->dummy_inode = 0;
+	ei->force_compress = 0;
+
+	inode = &ei->vfs_inode;
+	extent_map_tree_init(&ei->extent_tree, GFP_NOFS);
+	extent_io_tree_init(&ei->io_tree, &inode->i_data, GFP_NOFS);
+	extent_io_tree_init(&ei->io_failure_tree, &inode->i_data, GFP_NOFS);
+	mutex_init(&ei->log_mutex);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
 	INIT_LIST_HEAD(&ei->i_orphan);
+	INIT_LIST_HEAD(&ei->delalloc_inodes);
 	INIT_LIST_HEAD(&ei->ordered_operations);
-	return &ei->vfs_inode;
+	RB_CLEAR_NODE(&ei->rb_node);
+
+	return inode;
 }
 
 void btrfs_destroy_inode(struct inode *inode)
[PATCH V2 08/12] Btrfs: Introduce global metadata reservation
Reserve metadata space for extent tree, checksum tree and root tree Signed-off-by: Yan Zheng zheng@oracle.com --- diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h --- 2/fs/btrfs/ctree.h 2010-04-26 17:27:31.644829469 +0800 +++ 3/fs/btrfs/ctree.h 2010-04-26 17:27:31.648830941 +0800 @@ -682,21 +682,15 @@ struct btrfs_space_info { u64 bytes_reserved; /* total bytes the allocator has reserved for current allocations */ u64 bytes_readonly; /* total bytes that are read only */ - u64 bytes_super;/* total bytes reserved for the super blocks */ - u64 bytes_root; /* the number of bytes needed to commit a - transaction */ + u64 bytes_may_use; /* number of bytes that may be used for delalloc/allocations */ - u64 bytes_delalloc; /* number of bytes currently reserved for - delayed allocation */ u64 disk_used; /* total bytes used on disk */ int full; /* indicates that we cannot allocate any more chunks for this space */ int force_alloc;/* set if we need to force a chunk alloc for this space */ - int force_delalloc; /* make people start doing filemap_flush until - we're under a threshold */ struct list_head list; diff -urp 2/fs/btrfs/disk-io.c 3/fs/btrfs/disk-io.c --- 2/fs/btrfs/disk-io.c2010-04-26 17:27:31.638850832 +0800 +++ 3/fs/btrfs/disk-io.c2010-04-26 17:27:31.649830174 +0800 @@ -1472,10 +1472,6 @@ static int cleaner_kthread(void *arg) struct btrfs_root *root = arg; do { - smp_mb(); - if (root-fs_info-closing) - break; - vfs_check_frozen(root-fs_info-sb, SB_FREEZE_WRITE); if (!(root-fs_info-sb-s_flags MS_RDONLY) @@ -1488,11 +1484,9 @@ static int cleaner_kthread(void *arg) if (freezing(current)) { refrigerator(); } else { - smp_mb(); - if (root-fs_info-closing) - break; set_current_state(TASK_INTERRUPTIBLE); - schedule(); + if (!kthread_should_stop()) + schedule(); __set_current_state(TASK_RUNNING); } } while (!kthread_should_stop()); @@ -1504,36 +1498,39 @@ static int transaction_kthread(void *arg struct btrfs_root *root = arg; struct btrfs_trans_handle *trans; struct 
btrfs_transaction *cur; + u64 transid; unsigned long now; unsigned long delay; int ret; do { - smp_mb(); - if (root-fs_info-closing) - break; - delay = HZ * 30; vfs_check_frozen(root-fs_info-sb, SB_FREEZE_WRITE); - mutex_lock(root-fs_info-transaction_kthread_mutex); - mutex_lock(root-fs_info-trans_mutex); + spin_lock(root-fs_info-new_trans_lock); cur = root-fs_info-running_transaction; if (!cur) { - mutex_unlock(root-fs_info-trans_mutex); + spin_unlock(root-fs_info-new_trans_lock); goto sleep; } now = get_seconds(); - if (now cur-start_time || now - cur-start_time 30) { - mutex_unlock(root-fs_info-trans_mutex); + if (!cur-blocked + (now cur-start_time || now - cur-start_time 30)) { + spin_unlock(root-fs_info-new_trans_lock); delay = HZ * 5; goto sleep; } - mutex_unlock(root-fs_info-trans_mutex); - trans = btrfs_join_transaction(root, 1); - ret = btrfs_commit_transaction(trans, root); + transid = cur-transid; + spin_unlock(root-fs_info-new_trans_lock); + trans = btrfs_join_transaction(root, 1); + if (transid == trans-transid) { + ret = btrfs_commit_transaction(trans, root); + BUG_ON(ret); + } else { + btrfs_end_transaction(trans, root); + } sleep: wake_up_process(root-fs_info-cleaner_kthread); mutex_unlock(root-fs_info-transaction_kthread_mutex); @@ -1541,10 +1538,10 @@ sleep: if (freezing(current)) { refrigerator(); } else { - if (root-fs_info-closing) - break
[PATCH V2 07/12] Btrfs: Update metadata reservation for delayed allocation
Introduce metadata reservation context for delayed allocation and update various related functions. This patch also introduces EXTENT_FIRST_DELALLOC control bit for set/clear_extent_bit. It tells set/clear_bit_hook whether they are processing the first extent_state with EXTENT_DELALLOC bit set. This change is important if set/clear_extent_bit involves multiple extent_state. Signed-off-by: Yan Zheng zheng@oracle.com --- diff -urp 2/fs/btrfs/btrfs_inode.h 3/fs/btrfs/btrfs_inode.h --- 2/fs/btrfs/btrfs_inode.h2010-04-26 17:26:55.450105767 +0800 +++ 3/fs/btrfs/btrfs_inode.h2010-04-26 17:26:55.456080004 +0800 @@ -137,8 +137,8 @@ struct btrfs_inode { * of extent items we've reserved metadata for. */ spinlock_t accounting_lock; + atomic_t outstanding_extents; int reserved_extents; - int outstanding_extents; /* * ordered_data_close is set by truncate when a file that used diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h --- 2/fs/btrfs/ctree.h 2010-04-26 17:26:55.451104861 +0800 +++ 3/fs/btrfs/ctree.h 2010-04-26 17:26:55.457079656 +0800 @@ -2078,19 +2078,8 @@ int btrfs_remove_block_group(struct btrf u64 btrfs_reduce_alloc_profile(struct btrfs_root *root, u64 flags); void btrfs_set_inode_space_info(struct btrfs_root *root, struct inode *ionde); void btrfs_clear_space_info_full(struct btrfs_fs_info *info); - -int btrfs_unreserve_metadata_for_delalloc(struct btrfs_root *root, - struct inode *inode, int num_items); -int btrfs_reserve_metadata_for_delalloc(struct btrfs_root *root, - struct inode *inode, int num_items); -int btrfs_check_data_free_space(struct btrfs_root *root, struct inode *inode, - u64 bytes); -void btrfs_free_reserved_data_space(struct btrfs_root *root, - struct inode *inode, u64 bytes); -void btrfs_delalloc_reserve_space(struct btrfs_root *root, struct inode *inode, -u64 bytes); -void btrfs_delalloc_free_space(struct btrfs_root *root, struct inode *inode, - u64 bytes); +int btrfs_check_data_free_space(struct inode *inode, u64 bytes); +void 
btrfs_free_reserved_data_space(struct inode *inode, u64 bytes); int btrfs_trans_reserve_metadata(struct btrfs_trans_handle *trans, struct btrfs_root *root, int num_items, int *retries); @@ -2098,6 +2087,10 @@ void btrfs_trans_release_metadata(struct struct btrfs_root *root); int btrfs_snap_reserve_metadata(struct btrfs_trans_handle *trans, struct btrfs_pending_snapshot *pending); +int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes); +void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes); +int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes); +void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes); void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv); struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root); void btrfs_free_block_rsv(struct btrfs_root *root, diff -urp 2/fs/btrfs/extent_io.c 3/fs/btrfs/extent_io.c --- 2/fs/btrfs/extent_io.c 2010-04-26 17:26:55.447090049 +0800 +++ 3/fs/btrfs/extent_io.c 2010-04-26 17:26:55.458079658 +0800 @@ -336,21 +336,18 @@ static int merge_state(struct extent_io_ } static int set_state_cb(struct extent_io_tree *tree, -struct extent_state *state, -unsigned long bits) +struct extent_state *state, int *bits) { if (tree-ops tree-ops-set_bit_hook) { return tree-ops-set_bit_hook(tree-mapping-host, - state-start, state-end, - state-state, bits); + state, bits); } return 0; } static void clear_state_cb(struct extent_io_tree *tree, - struct extent_state *state, - unsigned long bits) + struct extent_state *state, int *bits) { if (tree-ops tree-ops-clear_bit_hook) tree-ops-clear_bit_hook(tree-mapping-host, state, bits); @@ -368,9 +365,10 @@ static void clear_state_cb(struct extent */ static int insert_state(struct extent_io_tree *tree, struct extent_state *state, u64 start, u64 end, - int bits) + int *bits) { struct rb_node *node; + int bits_to_set = *bits ~EXTENT_CTLBITS; int ret; if (end start) { @@ -385,9 +383,9 @@ static int insert_state(struct 
extent_io if (ret) return ret; - if (bits EXTENT_DIRTY) + if (bits_to_set EXTENT_DIRTY
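For readers following the description of patch 07/12, here is a minimal sketch of how a control bit can travel through the `int *bits` argument without ever being stored on an extent_state. The bit values, the counter and the helper `set_bits_on_states()` are assumptions made for this illustration only; the hook body is a guess at the idea, not the code from the patch.

```c
/* Hypothetical bit values, chosen only for this sketch. */
#define EXTENT_DIRTY          (1 << 0)
#define EXTENT_DELALLOC       (1 << 1)
#define EXTENT_FIRST_DELALLOC (1 << 2)  /* control bit, never stored */
#define EXTENT_CTLBITS        (EXTENT_FIRST_DELALLOC)

struct extent_state {
	int state;  /* status bits stored on this range */
};

/* Counts how often the hook saw the "first delalloc state" flag. */
static int first_delalloc_seen;

/*
 * Stand-in for set_bit_hook. Because bits is passed by pointer, the hook
 * can clear the control bit after the first state, so later states in the
 * same call are not treated as "first".
 */
static void set_bit_hook(struct extent_state *state, int *bits)
{
	(void)state;
	if ((*bits & EXTENT_DELALLOC) && (*bits & EXTENT_FIRST_DELALLOC)) {
		first_delalloc_seen++;
		*bits &= ~EXTENT_FIRST_DELALLOC;
	}
}

/* Apply bits to several states, as one set_extent_bit call might. */
static void set_bits_on_states(struct extent_state *states, int n, int bits)
{
	for (int i = 0; i < n; i++) {
		/* Control bits are masked out before anything is stored. */
		int bits_to_set = bits & ~EXTENT_CTLBITS;

		set_bit_hook(&states[i], &bits);
		states[i].state |= bits_to_set;
	}
}
```

Passing the bits by pointer is what lets the hook consume the flag; with a by-value `int bits` every extent_state in a multi-state operation would look like the first one.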
[PATCH V2 09/12] Btrfs: Metadata reservation for orphan inodes
reserve metadata space for handling orphan inodes Signed-off-by: Yan Zheng zheng@oracle.com --- diff -urp 2/fs/btrfs/btrfs_inode.h 3/fs/btrfs/btrfs_inode.h --- 2/fs/btrfs/btrfs_inode.h2010-04-26 17:27:52.113080051 +0800 +++ 3/fs/btrfs/btrfs_inode.h2010-04-26 17:27:52.118079430 +0800 @@ -151,6 +151,7 @@ struct btrfs_inode { * of these. */ unsigned ordered_data_close:1; + unsigned orphan_meta_reserved:1; unsigned dummy_inode:1; /* diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h --- 2/fs/btrfs/ctree.h 2010-04-26 17:27:52.114079844 +0800 +++ 3/fs/btrfs/ctree.h 2010-04-26 17:27:52.119079920 +0800 @@ -1068,7 +1068,6 @@ struct btrfs_root { int ref_cows; int track_dirty; int in_radix; - int clean_orphans; u64 defrag_trans_start; struct btrfs_key defrag_progress; @@ -1082,8 +1081,11 @@ struct btrfs_root { struct list_head root_list; - spinlock_t list_lock; + spinlock_t orphan_lock; struct list_head orphan_list; + struct btrfs_block_rsv *orphan_block_rsv; + int orphan_item_inserted; + int orphan_cleanup_state; spinlock_t inode_lock; /* red-black tree that keeps track of in-memory inodes */ @@ -2079,6 +2081,9 @@ int btrfs_trans_reserve_metadata(struct int num_items, int *retries); void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans, struct btrfs_root *root); +int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans, + struct inode *inode); +void btrfs_orphan_release_metadata(struct inode *inode); int btrfs_snap_reserve_metadata(struct btrfs_trans_handle *trans, struct btrfs_pending_snapshot *pending); int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes); @@ -2403,6 +2408,13 @@ int btrfs_update_inode(struct btrfs_tran int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode); int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode); void btrfs_orphan_cleanup(struct btrfs_root *root); +void btrfs_orphan_pre_snapshot(struct btrfs_trans_handle *trans, + struct btrfs_pending_snapshot *pending, + u64 
*bytes_to_reserve); +void btrfs_orphan_post_snapshot(struct btrfs_trans_handle *trans, + struct btrfs_pending_snapshot *pending); +void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans, + struct btrfs_root *root); int btrfs_cont_expand(struct inode *inode, loff_t size); int btrfs_invalidate_inodes(struct btrfs_root *root); void btrfs_add_delayed_iput(struct inode *inode); diff -urp 2/fs/btrfs/disk-io.c 3/fs/btrfs/disk-io.c --- 2/fs/btrfs/disk-io.c2010-04-26 17:27:52.105081158 +0800 +++ 3/fs/btrfs/disk-io.c2010-04-26 17:27:52.120080690 +0800 @@ -895,7 +895,8 @@ static int __setup_root(u32 nodesize, u3 root-ref_cows = 0; root-track_dirty = 0; root-in_radix = 0; - root-clean_orphans = 0; + root-orphan_item_inserted = 0; + root-orphan_cleanup_state = 0; root-fs_info = fs_info; root-objectid = objectid; @@ -905,12 +906,13 @@ static int __setup_root(u32 nodesize, u3 root-in_sysfs = 0; root-inode_tree = RB_ROOT; root-block_rsv = NULL; + root-orphan_block_rsv = NULL; INIT_LIST_HEAD(root-dirty_list); INIT_LIST_HEAD(root-orphan_list); INIT_LIST_HEAD(root-root_list); spin_lock_init(root-node_lock); - spin_lock_init(root-list_lock); + spin_lock_init(root-orphan_lock); spin_lock_init(root-inode_lock); spin_lock_init(root-accounting_lock); mutex_init(root-objectid_mutex); @@ -1194,19 +1196,23 @@ again: if (root) return root; - ret = btrfs_find_orphan_item(fs_info-tree_root, location-objectid); - if (ret == 0) - ret = -ENOENT; - if (ret 0) - return ERR_PTR(ret); - root = btrfs_read_fs_root_no_radix(fs_info-tree_root, location); if (IS_ERR(root)) return root; - WARN_ON(btrfs_root_refs(root-root_item) == 0); set_anon_super(root-anon_super, NULL); + if (btrfs_root_refs(root-root_item) == 0) { + ret = -ENOENT; + goto fail; + } + + ret = btrfs_find_orphan_item(fs_info-tree_root, location-objectid); + if (ret 0) + goto fail; + if (ret == 0) + root-orphan_item_inserted = 1; + ret = radix_tree_preload(GFP_NOFS ~__GFP_HIGHMEM); if (ret) goto fail; @@ -1215,10 +1221,9 @@ again: 
ret = radix_tree_insert(fs_info-fs_roots_radix, (unsigned long)root-root_key.objectid
[PATCH V2 10/12] Btrfs: Metadata ENOSPC handling for tree log
Previous patches make the allocator return -ENOSPC if there is no unreserved free metadata space. This patch updates tree log code and various other places to propagate/handle the ENOSPC error.

Signed-off-by: Yan Zheng zheng@oracle.com
---
diff -urp 2/fs/btrfs/disk-io.c 3/fs/btrfs/disk-io.c
--- 2/fs/btrfs/disk-io.c 2010-04-26 17:28:05.496079922 +0800
+++ 3/fs/btrfs/disk-io.c 2010-04-26 17:28:05.506079726 +0800
@@ -973,42 +973,6 @@ static int find_and_setup_root(struct bt return 0; } -int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans, -struct btrfs_fs_info *fs_info) -{ - struct extent_buffer *eb; - struct btrfs_root *log_root_tree = fs_info-log_root_tree; - u64 start = 0; - u64 end = 0; - int ret; - - if (!log_root_tree) - return 0; - - while (1) { - ret = find_first_extent_bit(log_root_tree-dirty_log_pages, - 0, start, end, EXTENT_DIRTY | EXTENT_NEW); - if (ret) - break; - - clear_extent_bits(log_root_tree-dirty_log_pages, start, end, - EXTENT_DIRTY | EXTENT_NEW, GFP_NOFS); - } - eb = fs_info-log_root_tree-node; - - WARN_ON(btrfs_header_level(eb) != 0); - WARN_ON(btrfs_header_nritems(eb) != 0); - - ret = btrfs_free_reserved_extent(fs_info-tree_root, - eb-start, eb-len); - BUG_ON(ret); - - free_extent_buffer(eb); - kfree(fs_info-log_root_tree); - fs_info-log_root_tree = NULL; - return 0; -} - static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info) {
diff -urp 2/fs/btrfs/disk-io.h 3/fs/btrfs/disk-io.h
--- 2/fs/btrfs/disk-io.h 2010-04-26 17:28:05.495079921 +0800
+++ 3/fs/btrfs/disk-io.h 2010-04-26 17:28:05.507080566 +0800
@@ -95,8 +95,6 @@ int btrfs_congested_async(struct btrfs_f unsigned long btrfs_async_submit_limit(struct btrfs_fs_info *info); int btrfs_write_tree_block(struct extent_buffer *buf); int btrfs_wait_tree_block_writeback(struct extent_buffer *buf); -int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans, -struct btrfs_fs_info *fs_info); int btrfs_init_log_root_tree(struct
btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info); int btrfs_add_log_tree(struct btrfs_trans_handle *trans, diff -urp 2/fs/btrfs/file-item.c 3/fs/btrfs/file-item.c --- 2/fs/btrfs/file-item.c 2010-04-26 17:28:05.503100326 +0800 +++ 3/fs/btrfs/file-item.c 2010-04-26 17:28:05.507080566 +0800 @@ -656,6 +656,9 @@ again: goto found; } ret = PTR_ERR(item); + if (ret != -EFBIG ret != -ENOENT) + goto fail_unlock; + if (ret == -EFBIG) { u32 item_size; /* we found one, but it isn't big enough yet */ diff -urp 2/fs/btrfs/tree-log.c 3/fs/btrfs/tree-log.c --- 2/fs/btrfs/tree-log.c 2010-04-26 17:28:05.498105836 +0800 +++ 3/fs/btrfs/tree-log.c 2010-04-26 17:28:05.509079730 +0800 @@ -134,6 +134,7 @@ static int start_log_trans(struct btrfs_ struct btrfs_root *root) { int ret; + int err = 0; mutex_lock(root-log_mutex); if (root-log_root) { @@ -154,17 +155,19 @@ static int start_log_trans(struct btrfs_ mutex_lock(root-fs_info-tree_log_mutex); if (!root-fs_info-log_root_tree) { ret = btrfs_init_log_root_tree(trans, root-fs_info); - BUG_ON(ret); + if (ret) + err = ret; } - if (!root-log_root) { + if (err == 0 !root-log_root) { ret = btrfs_add_log_tree(trans, root); - BUG_ON(ret); + if (ret) + err = ret; } mutex_unlock(root-fs_info-tree_log_mutex); root-log_batch++; atomic_inc(root-log_writers); mutex_unlock(root-log_mutex); - return 0; + return err; } /* @@ -375,7 +378,7 @@ insert: BUG_ON(ret); } } else if (ret) { - BUG(); + return ret; } dst_ptr = btrfs_item_ptr_offset(path-nodes[0], path-slots[0]); @@ -1698,9 +1701,9 @@ static noinline int walk_down_log_tree(s next = btrfs_find_create_tree_block(root, bytenr, blocksize); - wc-process_func(root, next, wc, ptr_gen); - if (*level == 1) { + wc-process_func(root, next, wc, ptr_gen); + path-slots[*level]++; if (wc-free) { btrfs_read_buffer(next, ptr_gen); @@ -1733,35 +1736,7 @@ static noinline int
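The change to start_log_trans() in patch 10/12 swaps BUG_ON(ret) for an `err` variable that is returned after the cleanup path. A minimal userspace sketch of that pattern follows; `init_log_root_tree()` and `add_log_tree()` are hypothetical stand-ins for the two btrfs helpers, and the globals exist only so the behaviour can be exercised.

```c
#include <errno.h>

/* Hypothetical stand-ins: each returns 0 or a negative errno. */
static int init_root_result;
static int add_tree_result;
static int add_tree_calls;

static int init_log_root_tree(void)
{
	return init_root_result;
}

static int add_log_tree(void)
{
	add_tree_calls++;
	return add_tree_result;
}

/*
 * Record the first failure in err, skip the dependent step once an error
 * is pending, and still fall through to the common exit so that unlocking
 * and bookkeeping run on both the success and the failure path.
 */
static int start_log_trans_sketch(int have_root_tree, int have_log_root)
{
	int ret;
	int err = 0;

	if (!have_root_tree) {
		ret = init_log_root_tree();
		if (ret)
			err = ret;
	}
	if (err == 0 && !have_log_root) {
		ret = add_log_tree();
		if (ret)
			err = ret;
	}
	/* unlock and counter updates would go here */
	return err;
}
```

The caller now sees -ENOSPC instead of a crash, which is only useful if every caller up the chain checks the return value as well.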
[PATCH V2 11/12] Btrfs: Pre-allocate space for data relocation
Pre-allocate space for data relocation. This can detect the ENOSPC condition caused by fragmentation of free space.

Signed-off-by: Yan Zheng zheng@oracle.com
---
diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h
--- 2/fs/btrfs/ctree.h 2010-04-26 17:28:20.493839748 +0800
+++ 3/fs/btrfs/ctree.h 2010-04-26 17:28:20.498830465 +0800
@@ -2419,6 +2419,9 @@ int btrfs_cont_expand(struct inode *inod int btrfs_invalidate_inodes(struct btrfs_root *root); void btrfs_add_delayed_iput(struct inode *inode); void btrfs_run_delayed_iputs(struct btrfs_root *root); +int btrfs_prealloc_file_range(struct inode *inode, int mode, + u64 start, u64 num_bytes, u64 min_size, + loff_t actual_len, u64 *alloc_hint); extern const struct dentry_operations btrfs_dentry_operations; /* ioctl.c */
diff -urp 2/fs/btrfs/inode.c 3/fs/btrfs/inode.c
--- 2/fs/btrfs/inode.c 2010-04-26 17:28:20.489839672 +0800
+++ 3/fs/btrfs/inode.c 2010-04-26 17:28:20.500829420 +0800
@@ -1174,6 +1174,13 @@ out_check: num_bytes, num_bytes, type); BUG_ON(ret); + if (root-root_key.objectid == + BTRFS_DATA_RELOC_TREE_OBJECTID) { + ret = btrfs_reloc_clone_csums(inode, cur_offset, + num_bytes); + BUG_ON(ret); + } + extent_clear_unlock_delalloc(inode, BTRFS_I(inode)-io_tree, cur_offset, cur_offset + num_bytes - 1, locked_page, EXTENT_CLEAR_UNLOCK_PAGE |
@@ -6079,16 +6086,15 @@ out_unlock: return err; } -static int prealloc_file_range(struct inode *inode, u64 start, u64 end, - u64 alloc_hint, int mode, loff_t actual_len) +int btrfs_prealloc_file_range(struct inode *inode, int mode, + u64 start, u64 num_bytes, u64 min_size, + loff_t actual_len, u64 *alloc_hint) { struct btrfs_trans_handle *trans; struct btrfs_root *root = BTRFS_I(inode)-root; struct btrfs_key ins; u64 cur_offset = start; - u64 num_bytes = end - start; int ret = 0; - u64 i_size; while (num_bytes 0) { trans = btrfs_start_transaction(root, 3);
@@ -6097,9 +6103,8 @@ static int prealloc_file_range(struct in break; } - ret = btrfs_reserve_extent(trans, root, num_bytes, -
root-sectorsize, 0, alloc_hint, - (u64)-1, ins, 1); + ret = btrfs_reserve_extent(trans, root, num_bytes, min_size, + 0, *alloc_hint, (u64)-1, ins, 1); if (ret) { btrfs_end_transaction(trans, root); break; @@ -6116,20 +6121,19 @@ static int prealloc_file_range(struct in num_bytes -= ins.offset; cur_offset += ins.offset; - alloc_hint = ins.objectid + ins.offset; + *alloc_hint = ins.objectid + ins.offset; inode-i_ctime = CURRENT_TIME; BTRFS_I(inode)-flags |= BTRFS_INODE_PREALLOC; if (!(mode FALLOC_FL_KEEP_SIZE) - (actual_len inode-i_size) - (cur_offset inode-i_size)) { - + (actual_len inode-i_size) + (cur_offset inode-i_size)) { if (cur_offset actual_len) - i_size = actual_len; + i_size_write(inode, actual_len); else - i_size = cur_offset; - i_size_write(inode, i_size); - btrfs_ordered_update_i_size(inode, i_size, NULL); + i_size_write(inode, cur_offset); + i_size_write(inode, cur_offset); + btrfs_ordered_update_i_size(inode, cur_offset, NULL); } ret = btrfs_update_inode(trans, root, inode); @@ -6215,16 +6219,16 @@ static long btrfs_fallocate(struct inode if (em-block_start == EXTENT_MAP_HOLE || (cur_offset = inode-i_size !test_bit(EXTENT_FLAG_PREALLOC, em-flags))) { - ret = prealloc_file_range(inode, - cur_offset, last_byte, - alloc_hint, mode, offset+len); + ret = btrfs_prealloc_file_range(inode, 0, cur_offset, + last_byte - cur_offset, + 1 inode-i_blkbits
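To make the "ENOSPC caused by fragmentation" remark in the 11/12 description concrete, here is a toy model, not btrfs code: free space is an array of extent sizes, and a reservation must come from a single extent of at least `min_size` bytes. `reserve_extent()` and `prealloc_range()` are hypothetical names that only loosely mirror btrfs_reserve_extent() and btrfs_prealloc_file_range().

```c
#include <errno.h>
#include <stdint.h>

#define NR_FREE 4

/* Sizes of the free extents in this toy allocator. */
static uint64_t free_extents[NR_FREE];

/*
 * Hand out up to num_bytes from the largest free extent. Fail with -ENOSPC
 * if no single extent can satisfy the minimum size, even when the total
 * amount of free space would be enough.
 */
static int reserve_extent(uint64_t num_bytes, uint64_t min_size, uint64_t *got)
{
	uint64_t need = min_size < num_bytes ? min_size : num_bytes;
	int best = 0;

	for (int i = 1; i < NR_FREE; i++)
		if (free_extents[i] > free_extents[best])
			best = i;

	if (free_extents[best] == 0 || free_extents[best] < need)
		return -ENOSPC;

	*got = free_extents[best] < num_bytes ? free_extents[best] : num_bytes;
	free_extents[best] -= *got;
	return 0;
}

/* Reserve the whole range up front, possibly in several pieces. */
static int prealloc_range(uint64_t num_bytes, uint64_t min_size)
{
	while (num_bytes > 0) {
		uint64_t got;
		int ret = reserve_extent(num_bytes, min_size, &got);

		if (ret)
			return ret;
		num_bytes -= got;
	}
	return 0;
}
```

With four free 4 KiB extents, asking for 8 KiB with an 8 KiB minimum fails although 16 KiB is free in total, while the same request with a 4 KiB minimum succeeds in two pieces. Reserving before relocation starts surfaces that failure early rather than midway through moving data.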
[PATCH 10/12] Btrfs: Metadata ENOSPC handling for tree log
Previous patches make the allocator return -ENOSPC if there is no unreserved free meta space. This patch updates tree log code and various other places to propagate/handle the ENOSPC error.

Signed-off-by: Yan Zheng zheng@oracle.com
---
diff -urp 10/fs/btrfs/disk-io.c 11/fs/btrfs/disk-io.c
--- 10/fs/btrfs/disk-io.c 2010-04-18 10:51:06.294978596 +0800
+++ 11/fs/btrfs/disk-io.c 2010-04-18 10:51:28.685949058 +0800
@@ -973,42 +973,6 @@ static int find_and_setup_root(struct bt return 0; } -int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans, -struct btrfs_fs_info *fs_info) -{ - struct extent_buffer *eb; - struct btrfs_root *log_root_tree = fs_info-log_root_tree; - u64 start = 0; - u64 end = 0; - int ret; - - if (!log_root_tree) - return 0; - - while (1) { - ret = find_first_extent_bit(log_root_tree-dirty_log_pages, - 0, start, end, EXTENT_DIRTY | EXTENT_NEW); - if (ret) - break; - - clear_extent_bits(log_root_tree-dirty_log_pages, start, end, - EXTENT_DIRTY | EXTENT_NEW, GFP_NOFS); - } - eb = fs_info-log_root_tree-node; - - WARN_ON(btrfs_header_level(eb) != 0); - WARN_ON(btrfs_header_nritems(eb) != 0); - - ret = btrfs_free_reserved_extent(fs_info-tree_root, - eb-start, eb-len); - BUG_ON(ret); - - free_extent_buffer(eb); - kfree(fs_info-log_root_tree); - fs_info-log_root_tree = NULL; - return 0; -} - static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info) {
diff -urp 10/fs/btrfs/disk-io.h 11/fs/btrfs/disk-io.h
--- 10/fs/btrfs/disk-io.h 2010-04-18 10:47:31.057968000 +0800
+++ 11/fs/btrfs/disk-io.h 2010-04-18 10:51:28.686949137 +0800
@@ -95,8 +95,6 @@ int btrfs_congested_async(struct btrfs_f unsigned long btrfs_async_submit_limit(struct btrfs_fs_info *info); int btrfs_write_tree_block(struct extent_buffer *buf); int btrfs_wait_tree_block_writeback(struct extent_buffer *buf); -int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans, -struct btrfs_fs_info *fs_info); int btrfs_init_log_root_tree(struct
btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info); int btrfs_add_log_tree(struct btrfs_trans_handle *trans, diff -urp 10/fs/btrfs/file-item.c 11/fs/btrfs/file-item.c --- 10/fs/btrfs/file-item.c 2010-04-18 10:47:31.057968000 +0800 +++ 11/fs/btrfs/file-item.c 2010-04-18 10:51:28.687948517 +0800 @@ -656,6 +656,9 @@ again: goto found; } ret = PTR_ERR(item); + if (ret != -EFBIG ret != -ENOENT) + goto fail_unlock; + if (ret == -EFBIG) { u32 item_size; /* we found one, but it isn't big enough yet */ diff -urp 10/fs/btrfs/tree-log.c 11/fs/btrfs/tree-log.c --- 10/fs/btrfs/tree-log.c 2010-04-18 10:47:31.058957000 +0800 +++ 11/fs/btrfs/tree-log.c 2010-04-18 10:51:28.689947836 +0800 @@ -134,6 +134,7 @@ static int start_log_trans(struct btrfs_ struct btrfs_root *root) { int ret; + int err = 0; mutex_lock(root-log_mutex); if (root-log_root) { @@ -154,17 +155,19 @@ static int start_log_trans(struct btrfs_ mutex_lock(root-fs_info-tree_log_mutex); if (!root-fs_info-log_root_tree) { ret = btrfs_init_log_root_tree(trans, root-fs_info); - BUG_ON(ret); + if (ret) + err = ret; } - if (!root-log_root) { + if (err == 0 !root-log_root) { ret = btrfs_add_log_tree(trans, root); - BUG_ON(ret); + if (ret) + err = ret; } mutex_unlock(root-fs_info-tree_log_mutex); root-log_batch++; atomic_inc(root-log_writers); mutex_unlock(root-log_mutex); - return 0; + return err; } /* @@ -375,7 +378,7 @@ insert: BUG_ON(ret); } } else if (ret) { - BUG(); + return ret; } dst_ptr = btrfs_item_ptr_offset(path-nodes[0], path-slots[0]); @@ -1698,9 +1701,9 @@ static noinline int walk_down_log_tree(s next = btrfs_find_create_tree_block(root, bytenr, blocksize); - wc-process_func(root, next, wc, ptr_gen); - if (*level == 1) { + wc-process_func(root, next, wc, ptr_gen); + path-slots[*level]++; if (wc-free) { btrfs_read_buffer(next, ptr_gen); @@ -1733,35 +1736,7 @@ static noinline int
[PATCH 09/12] Btrfs: Metadata reservation for orphan inodes
reserve metadata space for handling orphan inodes Signed-off-by: Yan Zheng zheng@oracle.com --- diff -urp 9/fs/btrfs/btrfs_inode.h 10/fs/btrfs/btrfs_inode.h --- 9/fs/btrfs/btrfs_inode.h2010-04-18 10:26:38.326701000 +0800 +++ 10/fs/btrfs/btrfs_inode.h 2010-04-18 10:50:26.564697845 +0800 @@ -151,6 +151,7 @@ struct btrfs_inode { * of these. */ unsigned ordered_data_close:1; + unsigned orphan_meta_reserved:1; unsigned dummy_inode:1; /* diff -urp 9/fs/btrfs/ctree.h 10/fs/btrfs/ctree.h --- 9/fs/btrfs/ctree.h 2010-04-18 10:30:01.883697869 +0800 +++ 10/fs/btrfs/ctree.h 2010-04-18 10:50:26.565702253 +0800 @@ -1066,7 +1066,6 @@ struct btrfs_root { int ref_cows; int track_dirty; int in_radix; - int clean_orphans; u64 defrag_trans_start; struct btrfs_key defrag_progress; @@ -1080,8 +1079,11 @@ struct btrfs_root { struct list_head root_list; - spinlock_t list_lock; + spinlock_t orphan_lock; struct list_head orphan_list; + struct btrfs_block_rsv *orphan_block_rsv; + int orphan_item_inserted; + int orphan_cleanup_state; spinlock_t inode_lock; /* red-black tree that keeps track of in-memory inodes */ @@ -2074,6 +2076,9 @@ int btrfs_trans_reserve_metadata(struct int num_items, int *retries); void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans, struct btrfs_root *root); +int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans, + struct inode *inode); +void btrfs_orphan_release_metadata(struct inode *inode); int btrfs_snap_reserve_metadata(struct btrfs_trans_handle *trans, struct btrfs_pending_snapshot *pending); int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes); @@ -2390,6 +2395,13 @@ int btrfs_update_inode(struct btrfs_tran int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode); int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode); void btrfs_orphan_cleanup(struct btrfs_root *root); +void btrfs_orphan_pre_snapshot(struct btrfs_trans_handle *trans, + struct btrfs_pending_snapshot *pending, + 
u64 *bytes_to_reserve); +void btrfs_orphan_post_snapshot(struct btrfs_trans_handle *trans, + struct btrfs_pending_snapshot *pending); +void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans, + struct btrfs_root *root); int btrfs_cont_expand(struct inode *inode, loff_t size); int btrfs_invalidate_inodes(struct btrfs_root *root); void btrfs_add_delayed_iput(struct inode *inode); diff -urp 9/fs/btrfs/disk-io.c 10/fs/btrfs/disk-io.c --- 9/fs/btrfs/disk-io.c2010-04-18 10:47:31.056726210 +0800 +++ 10/fs/btrfs/disk-io.c 2010-04-18 10:51:06.294978596 +0800 @@ -895,7 +895,8 @@ static int __setup_root(u32 nodesize, u3 root-ref_cows = 0; root-track_dirty = 0; root-in_radix = 0; - root-clean_orphans = 0; + root-orphan_item_inserted = 0; + root-orphan_cleanup_state = 0; root-fs_info = fs_info; root-objectid = objectid; @@ -905,12 +906,13 @@ static int __setup_root(u32 nodesize, u3 root-in_sysfs = 0; root-inode_tree = RB_ROOT; root-block_rsv = NULL; + root-orphan_block_rsv = NULL; INIT_LIST_HEAD(root-dirty_list); INIT_LIST_HEAD(root-orphan_list); INIT_LIST_HEAD(root-root_list); spin_lock_init(root-node_lock); - spin_lock_init(root-list_lock); + spin_lock_init(root-orphan_lock); spin_lock_init(root-inode_lock); spin_lock_init(root-accounting_lock); mutex_init(root-objectid_mutex); @@ -1194,19 +1196,23 @@ again: if (root) return root; - ret = btrfs_find_orphan_item(fs_info-tree_root, location-objectid); - if (ret == 0) - ret = -ENOENT; - if (ret 0) - return ERR_PTR(ret); - root = btrfs_read_fs_root_no_radix(fs_info-tree_root, location); if (IS_ERR(root)) return root; - WARN_ON(btrfs_root_refs(root-root_item) == 0); set_anon_super(root-anon_super, NULL); + if (btrfs_root_refs(root-root_item) == 0) { + ret = -ENOENT; + goto fail; + } + + ret = btrfs_find_orphan_item(fs_info-tree_root, location-objectid); + if (ret 0) + goto fail; + if (ret == 0) + root-orphan_item_inserted = 1; + ret = radix_tree_preload(GFP_NOFS ~__GFP_HIGHMEM); if (ret) goto fail; @@ -1215,10 +1221,9 @@ 
again: ret = radix_tree_insert(fs_info-fs_roots_radix, (unsigned long)root-root_key.objectid
[PATCH 07/12] Btrfs: Update metadata reservation for delayed allocation
Introduce metadata reservation context for delayed allocation and update various related functions. This patch also introduces EXTENT_FIRST_DELALLOC control bit for set/clear_extent_bit. It tells set/clear_bit_hook whether they are processing the first extent_state with EXTENT_DELALLOC bit set. This change is important if set/clear_extent_bit involves multiple extent_state. Signed-off-by: Yan Zheng zheng@oracle.com --- diff -urp 7/fs/btrfs/btrfs_inode.h 8/fs/btrfs/btrfs_inode.h --- 7/fs/btrfs/btrfs_inode.h2010-04-13 15:44:56.104814000 +0800 +++ 8/fs/btrfs/btrfs_inode.h2010-04-18 10:26:38.326701375 +0800 @@ -137,8 +137,8 @@ struct btrfs_inode { * of extent items we've reserved metadata for. */ spinlock_t accounting_lock; + atomic_t outstanding_extents; int reserved_extents; - int outstanding_extents; /* * ordered_data_close is set by truncate when a file that used diff -urp 7/fs/btrfs/ctree.h 8/fs/btrfs/ctree.h --- 7/fs/btrfs/ctree.h 2010-04-18 10:24:51.285697715 +0800 +++ 8/fs/btrfs/ctree.h 2010-04-18 10:26:38.327697818 +0800 @@ -2073,19 +2073,8 @@ int btrfs_remove_block_group(struct btrf u64 btrfs_reduce_alloc_profile(struct btrfs_root *root, u64 flags); void btrfs_set_inode_space_info(struct btrfs_root *root, struct inode *ionde); void btrfs_clear_space_info_full(struct btrfs_fs_info *info); - -int btrfs_unreserve_metadata_for_delalloc(struct btrfs_root *root, - struct inode *inode, int num_items); -int btrfs_reserve_metadata_for_delalloc(struct btrfs_root *root, - struct inode *inode, int num_items); -int btrfs_check_data_free_space(struct btrfs_root *root, struct inode *inode, - u64 bytes); -void btrfs_free_reserved_data_space(struct btrfs_root *root, - struct inode *inode, u64 bytes); -void btrfs_delalloc_reserve_space(struct btrfs_root *root, struct inode *inode, -u64 bytes); -void btrfs_delalloc_free_space(struct btrfs_root *root, struct inode *inode, - u64 bytes); +int btrfs_check_data_free_space(struct inode *inode, u64 bytes); +void 
btrfs_free_reserved_data_space(struct inode *inode, u64 bytes); int btrfs_trans_reserve_metadata(struct btrfs_trans_handle *trans, struct btrfs_root *root, int num_items, int *retries); @@ -2093,6 +2082,10 @@ void btrfs_trans_release_metadata(struct struct btrfs_root *root); int btrfs_snap_reserve_metadata(struct btrfs_trans_handle *trans, struct btrfs_pending_snapshot *pending); +int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes); +void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes); +int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes); +void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes); void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv); struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root); void btrfs_free_block_rsv(struct btrfs_block_rsv *rsv); diff -urp 7/fs/btrfs/extent_io.c 8/fs/btrfs/extent_io.c --- 7/fs/btrfs/extent_io.c 2010-04-14 14:49:57.000937000 +0800 +++ 8/fs/btrfs/extent_io.c 2010-04-18 10:26:38.329697898 +0800 @@ -336,21 +336,18 @@ static int merge_state(struct extent_io_ } static int set_state_cb(struct extent_io_tree *tree, -struct extent_state *state, -unsigned long bits) +struct extent_state *state, int *bits) { if (tree-ops tree-ops-set_bit_hook) { return tree-ops-set_bit_hook(tree-mapping-host, - state-start, state-end, - state-state, bits); + state, bits); } return 0; } static void clear_state_cb(struct extent_io_tree *tree, - struct extent_state *state, - unsigned long bits) + struct extent_state *state, int *bits) { if (tree-ops tree-ops-clear_bit_hook) tree-ops-clear_bit_hook(tree-mapping-host, state, bits); @@ -368,9 +365,10 @@ static void clear_state_cb(struct extent */ static int insert_state(struct extent_io_tree *tree, struct extent_state *state, u64 start, u64 end, - int bits) + int *bits) { struct rb_node *node; + int bits_to_set = *bits ~EXTENT_CTLBITS; int ret; if (end start) { @@ -385,9 +383,9 @@ static int insert_state(struct 
extent_io if (ret) return ret; - if (bits EXTENT_DIRTY) + if (bits_to_set EXTENT_DIRTY
[PATCH 03/12] Btrfs: Shrink delay allocated space in a synchronized way
Shrinking delay allocated space synchronously is more controllable than flushing all delay allocated space in an async thread.

Signed-off-by: Yan Zheng zheng@oracle.com
---
diff -urp 3/fs/btrfs/ctree.h 4/fs/btrfs/ctree.h
--- 3/fs/btrfs/ctree.h 2010-04-18 08:13:15.457699211 +0800
+++ 4/fs/btrfs/ctree.h 2010-04-18 08:13:51.602699293 +0800
@@ -699,10 +699,6 @@ struct btrfs_space_info { struct list_head list; - /* for controlling how we free up space for allocations */ - wait_queue_head_t flush_wait; - int flushing; - /* for block groups in our same type */ struct list_head block_groups[BTRFS_NR_RAID_TYPES]; spinlock_t lock;
@@ -927,7 +923,6 @@ struct btrfs_fs_info { struct btrfs_workers endio_meta_write_workers; struct btrfs_workers endio_write_workers; struct btrfs_workers submit_workers; - struct btrfs_workers enospc_workers; /* * fixup workers take dirty pages that didn't properly go through * the cow mechanism and make them safe to write. It happens
@@ -2311,6 +2306,7 @@ int btrfs_truncate_inode_items(struct bt u32 min_type); int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput); +int btrfs_start_one_delalloc_inode(struct btrfs_root *root, int delay_iput); int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end, struct extent_state **cached_state); int btrfs_writepages(struct address_space *mapping,
diff -urp 3/fs/btrfs/disk-io.c 4/fs/btrfs/disk-io.c
--- 3/fs/btrfs/disk-io.c 2010-04-14 14:49:56.559944000 +0800
+++ 4/fs/btrfs/disk-io.c 2010-04-18 08:13:51.60461 +0800
@@ -1768,9 +1768,6 @@ struct btrfs_root *open_ctree(struct sup min_t(u64, fs_devices-num_devices, fs_info-thread_pool_size), fs_info-generic_worker); - btrfs_init_workers(fs_info-enospc_workers, enospc, - fs_info-thread_pool_size, - fs_info-generic_worker); /* a higher idle thresh on the submit workers makes it much more * likely that bios will be send down in a sane order to the
@@ -1818,7 +1815,6 @@ struct btrfs_root *open_ctree(struct sup
btrfs_start_workers(fs_info-endio_meta_workers, 1); btrfs_start_workers(fs_info-endio_meta_write_workers, 1); btrfs_start_workers(fs_info-endio_write_workers, 1); - btrfs_start_workers(fs_info-enospc_workers, 1); fs_info-bdi.ra_pages *= btrfs_super_num_devices(disk_super); fs_info-bdi.ra_pages = max(fs_info-bdi.ra_pages, @@ -2049,7 +2045,6 @@ fail_sb_buffer: btrfs_stop_workers(fs_info-endio_meta_write_workers); btrfs_stop_workers(fs_info-endio_write_workers); btrfs_stop_workers(fs_info-submit_workers); - btrfs_stop_workers(fs_info-enospc_workers); fail_iput: invalidate_inode_pages2(fs_info-btree_inode-i_mapping); iput(fs_info-btree_inode); @@ -2482,7 +2477,6 @@ int close_ctree(struct btrfs_root *root) btrfs_stop_workers(fs_info-endio_meta_write_workers); btrfs_stop_workers(fs_info-endio_write_workers); btrfs_stop_workers(fs_info-submit_workers); - btrfs_stop_workers(fs_info-enospc_workers); btrfs_close_devices(fs_info-fs_devices); btrfs_mapping_tree_free(fs_info-mapping_tree); diff -urp 3/fs/btrfs/extent-tree.c 4/fs/btrfs/extent-tree.c --- 3/fs/btrfs/extent-tree.c2010-04-18 08:13:15.463699138 +0800 +++ 4/fs/btrfs/extent-tree.c2010-04-18 08:13:51.608702224 +0800 @@ -73,6 +73,9 @@ static void dump_space_info(struct btrfs static int maybe_allocate_chunk(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct btrfs_space_info *sinfo, u64 num_bytes); +static int shrink_delalloc(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + struct btrfs_space_info *sinfo, u64 to_reclaim); static noinline int block_group_cache_done(struct btrfs_block_group_cache *cache) @@ -2689,7 +2692,6 @@ static int update_space_info(struct btrf for (i = 0; i BTRFS_NR_RAID_TYPES; i++) INIT_LIST_HEAD(found-block_groups[i]); init_rwsem(found-groups_sem); - init_waitqueue_head(found-flush_wait); spin_lock_init(found-lock); found-flags = flags (BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_SYSTEM | @@ -2903,105 +2905,6 @@ static void check_force_delalloc(struct 
meta_sinfo-force_delalloc = 0; } -struct async_flush { - struct btrfs_root *root; - struct btrfs_space_info *info; - struct btrfs_work work; -}; - -static noinline void flush_delalloc_async(struct btrfs_work *work) -{ - struct async_flush *async; - struct btrfs_root *root
[PATCH 01/12] Btrfs: Link block groups of different raid types in the same space_info
The size of reserved space is stored in space_info. If block groups of different raid types are linked to separate space_info, changing allocation profile will corrupt reserved space accounting. Signed-off-by: Yan Zheng zheng@oracle.com --- diff -urp 1/fs/btrfs/ctree.h 2/fs/btrfs/ctree.h --- 1/fs/btrfs/ctree.h 2010-04-14 14:49:56.399956135 +0800 +++ 2/fs/btrfs/ctree.h 2010-04-18 08:12:22.086699485 +0800 @@ -662,6 +662,7 @@ struct btrfs_csum_item { #define BTRFS_BLOCK_GROUP_RAID1(1 4) #define BTRFS_BLOCK_GROUP_DUP (1 5) #define BTRFS_BLOCK_GROUP_RAID10 (1 6) +#define BTRFS_NR_RAID_TYPES 5 struct btrfs_block_group_item { __le64 used; @@ -673,7 +674,8 @@ struct btrfs_space_info { u64 flags; u64 total_bytes;/* total bytes in the space */ - u64 bytes_used; /* total bytes used on disk */ + u64 bytes_used; /* total bytes used, + this does't take mirrors into account */ u64 bytes_pinned; /* total bytes pinned, will be freed when the transaction finishes */ u64 bytes_reserved; /* total bytes the allocator has reserved for @@ -686,6 +688,7 @@ struct btrfs_space_info { delalloc/allocations */ u64 bytes_delalloc; /* number of bytes currently reserved for delayed allocation */ + u64 disk_used; /* total bytes used on disk */ int full; /* indicates that we cannot allocate any more chunks for this space */ @@ -703,7 +706,7 @@ struct btrfs_space_info { int flushing; /* for block groups in our same type */ - struct list_head block_groups; + struct list_head block_groups[BTRFS_NR_RAID_TYPES]; spinlock_t lock; struct rw_semaphore groups_sem; atomic_t caching_threads; diff -urp 1/fs/btrfs/extent-tree.c 2/fs/btrfs/extent-tree.c --- 1/fs/btrfs/extent-tree.c2010-04-14 14:49:56.932956992 +0800 +++ 2/fs/btrfs/extent-tree.c2010-04-18 08:12:22.092698714 +0800 @@ -512,6 +509,9 @@ static struct btrfs_space_info *__find_s struct list_head *head = info-space_info; struct btrfs_space_info *found; + flags = BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_SYSTEM | +BTRFS_BLOCK_GROUP_METADATA; + 
rcu_read_lock(); list_for_each_entry_rcu(found, head, list) { if (found-flags == flags) { @@ -2659,12 +2659,21 @@ static int update_space_info(struct btrf struct btrfs_space_info **space_info) { struct btrfs_space_info *found; + int i; + int factor; + + if (flags (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 | +BTRFS_BLOCK_GROUP_RAID10)) + factor = 2; + else + factor = 1; found = __find_space_info(info, flags); if (found) { spin_lock(found-lock); found-total_bytes += total_bytes; found-bytes_used += bytes_used; + found-disk_used += bytes_used * factor; found-full = 0; spin_unlock(found-lock); *space_info = found; @@ -2674,14 +2683,18 @@ static int update_space_info(struct btrf if (!found) return -ENOMEM; - INIT_LIST_HEAD(found-block_groups); + for (i = 0; i BTRFS_NR_RAID_TYPES; i++) + INIT_LIST_HEAD(found-block_groups[i]); init_rwsem(found-groups_sem); init_waitqueue_head(found-flush_wait); init_waitqueue_head(found-allocate_wait); spin_lock_init(found-lock); - found-flags = flags; + found-flags = flags (BTRFS_BLOCK_GROUP_DATA | + BTRFS_BLOCK_GROUP_SYSTEM | + BTRFS_BLOCK_GROUP_METADATA); found-total_bytes = total_bytes; found-bytes_used = bytes_used; + found-disk_used = bytes_used * factor; found-bytes_pinned = 0; found-bytes_reserved = 0; found-bytes_readonly = 0; @@ -2751,26 +2764,32 @@ u64 btrfs_reduce_alloc_profile(struct bt return flags; } -static u64 btrfs_get_alloc_profile(struct btrfs_root *root, u64 data) +static u64 get_alloc_profile(struct btrfs_root *root, u64 flags) { - struct btrfs_fs_info *info = root-fs_info; - u64 alloc_profile; + if (flags BTRFS_BLOCK_GROUP_DATA) + flags |= root-fs_info-avail_data_alloc_bits +root-fs_info-data_alloc_profile; + else if (flags BTRFS_BLOCK_GROUP_SYSTEM) + flags |= root-fs_info-avail_system_alloc_bits +root-fs_info-system_alloc_profile; + else if (flags BTRFS_BLOCK_GROUP_METADATA) + flags |= root-fs_info-avail_metadata_alloc_bits +root
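The `factor` logic in update_space_info() above can be sketched in userspace C. This is only an illustration: the `SKETCH_*` flag names and `sketch_disk_used()` are hypothetical, and the bit positions are read from the quoted `#define` lines on the assumption that the archive dropped the `<<` operator.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the BTRFS_BLOCK_GROUP_* profile bits. */
#define SKETCH_RAID1	(1ULL << 4)
#define SKETCH_DUP	(1ULL << 5)
#define SKETCH_RAID10	(1ULL << 6)

/*
 * Mirrored profiles store every byte twice, so the on-disk usage is
 * twice the logical usage; every other profile counts once.
 */
static uint64_t sketch_disk_used(uint64_t flags, uint64_t bytes_used)
{
	uint64_t factor;

	if (flags & (SKETCH_DUP | SKETCH_RAID1 | SKETCH_RAID10))
		factor = 2;
	else
		factor = 1;

	return bytes_used * factor;
}
```

Keeping `bytes_used` and `disk_used` as separate counters is what lets block groups of different raid types share one space_info without distorting the reservation accounting.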
[PATCH 02/12] Btrfs: Kill allocate_wait in space_info
We already have fs_info-chunk_mutex to avoid concurrent chunk creation. Signed-off-by: Yan Zheng zheng@oracle.com --- diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h --- 2/fs/btrfs/ctree.h 2010-04-18 08:12:22.086699485 +0800 +++ 3/fs/btrfs/ctree.h 2010-04-18 08:13:15.457699211 +0800 @@ -700,9 +700,7 @@ struct btrfs_space_info { struct list_head list; /* for controlling how we free up space for allocations */ - wait_queue_head_t allocate_wait; wait_queue_head_t flush_wait; - int allocating_chunk; int flushing; /* for block groups in our same type */ diff -urp 2/fs/btrfs/extent-tree.c 3/fs/btrfs/extent-tree.c --- 2/fs/btrfs/extent-tree.c2010-04-18 08:12:22.092698714 +0800 +++ 3/fs/btrfs/extent-tree.c2010-04-18 08:13:15.463699138 +0800 @@ -70,6 +70,9 @@ static int find_next_key(struct btrfs_pa struct btrfs_key *key); static void dump_space_info(struct btrfs_space_info *info, u64 bytes, int dump_block_groups); +static int maybe_allocate_chunk(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + struct btrfs_space_info *sinfo, u64 num_bytes); static noinline int block_group_cache_done(struct btrfs_block_group_cache *cache) @@ -2687,7 +2690,6 @@ static int update_space_info(struct btrf INIT_LIST_HEAD(found-block_groups[i]); init_rwsem(found-groups_sem); init_waitqueue_head(found-flush_wait); - init_waitqueue_head(found-allocate_wait); spin_lock_init(found-lock); found-flags = flags (BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_SYSTEM | @@ -3000,71 +3002,6 @@ flush: wake_up(info-flush_wait); } -static int maybe_allocate_chunk(struct btrfs_root *root, -struct btrfs_space_info *info) -{ - struct btrfs_super_block *disk_super = root-fs_info-super_copy; - struct btrfs_trans_handle *trans; - bool wait = false; - int ret = 0; - u64 min_metadata; - u64 free_space; - - free_space = btrfs_super_total_bytes(disk_super); - /* -* we allow the metadata to grow to a max of either 10gb or 5% of the -* space in the volume. 
-*/ - min_metadata = min((u64)10 * 1024 * 1024 * 1024, -div64_u64(free_space * 5, 100)); - if (info-total_bytes = min_metadata) { - spin_unlock(info-lock); - return 0; - } - - if (info-full) { - spin_unlock(info-lock); - return 0; - } - - if (!info-allocating_chunk) { - info-force_alloc = 1; - info-allocating_chunk = 1; - } else { - wait = true; - } - - spin_unlock(info-lock); - - if (wait) { - wait_event(info-allocate_wait, - !info-allocating_chunk); - return 1; - } - - trans = btrfs_start_transaction(root, 1); - if (!trans) { - ret = -ENOMEM; - goto out; - } - - ret = do_chunk_alloc(trans, root-fs_info-extent_root, -4096 + 2 * 1024 * 1024, -info-flags, 0); - btrfs_end_transaction(trans, root); - if (ret) - goto out; -out: - spin_lock(info-lock); - info-allocating_chunk = 0; - spin_unlock(info-lock); - wake_up(info-allocate_wait); - - if (ret) - return 0; - return 1; -} - /* * Reserve metadata space for delalloc. */ @@ -3105,7 +3042,8 @@ again: flushed++; if (flushed == 1) { - if (maybe_allocate_chunk(root, meta_sinfo)) + if (maybe_allocate_chunk(NULL, root, meta_sinfo, +num_bytes)) goto again; flushed++; } else { @@ -3220,7 +3158,8 @@ again: if (used meta_sinfo-total_bytes) { retries++; if (retries == 1) { - if (maybe_allocate_chunk(root, meta_sinfo)) + if (maybe_allocate_chunk(NULL, root, meta_sinfo, +num_bytes)) goto again; retries++; } else { @@ -3417,13 +3356,28 @@ static void force_metadata_allocation(st rcu_read_unlock(); } +static int should_alloc_chunk(struct btrfs_space_info *sinfo, + u64 alloc_bytes) +{ + u64 num_bytes = sinfo-total_bytes - sinfo-bytes_readonly; + + if (sinfo-bytes_used + sinfo-bytes_reserved + + alloc_bytes + 256 * 1024 * 1024 num_bytes) + return 0; + + if (sinfo-bytes_used + sinfo
[PATCH 05/12] Btrfs: Introduce contexts for metadata reservation
Introducing contexts for metadata reseravtion has two major advantages. First, it makes metadata reseravtion more traceable. Second, it can reclaim freed space and re-add them to the itself after transaction committed. Signed-off-by: Yan Zheng zheng@oracle.com --- diff -urp 5/fs/btrfs/ctree.c 6/fs/btrfs/ctree.c --- 5/fs/btrfs/ctree.c 2010-04-14 14:49:56.34295 +0800 +++ 6/fs/btrfs/ctree.c 2010-04-18 10:22:08.215948795 +0800 @@ -279,7 +279,8 @@ int btrfs_block_can_be_shared(struct btr static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf, - struct extent_buffer *cow) + struct extent_buffer *cow, + int *last_ref) { u64 refs; u64 owner; @@ -365,6 +366,7 @@ static noinline int update_ref_for_cow(s BUG_ON(ret); } clean_tree_block(trans, root, buf); + *last_ref = 1; } return 0; } @@ -391,6 +393,7 @@ static noinline int __btrfs_cow_block(st struct btrfs_disk_key disk_key; struct extent_buffer *cow; int level; + int last_ref = 0; int unlock_orig = 0; u64 parent_start; @@ -441,7 +444,7 @@ static noinline int __btrfs_cow_block(st (unsigned long)btrfs_header_fsid(cow), BTRFS_FSID_SIZE); - update_ref_for_cow(trans, root, buf, cow); + update_ref_for_cow(trans, root, buf, cow, last_ref); if (buf == root-node) { WARN_ON(parent parent != buf); @@ -456,8 +459,8 @@ static noinline int __btrfs_cow_block(st extent_buffer_get(cow); spin_unlock(root-node_lock); - btrfs_free_tree_block(trans, root, buf-start, buf-len, - parent_start, root-root_key.objectid, level); + btrfs_free_tree_block(trans, root, buf, parent_start, + last_ref); free_extent_buffer(buf); add_root_to_dirty_list(root); } else { @@ -472,8 +475,8 @@ static noinline int __btrfs_cow_block(st btrfs_set_node_ptr_generation(parent, parent_slot, trans-transid); btrfs_mark_buffer_dirty(parent); - btrfs_free_tree_block(trans, root, buf-start, buf-len, - parent_start, root-root_key.objectid, level); + btrfs_free_tree_block(trans, root, buf, parent_start, + 
last_ref); } if (unlock_orig) btrfs_tree_unlock(buf); @@ -948,6 +951,22 @@ int btrfs_bin_search(struct extent_buffe return bin_search(eb, key, level, slot); } +static void root_add_used(struct btrfs_root *root, u32 size) +{ + spin_lock(root-accounting_lock); + btrfs_set_root_used(root-root_item, + btrfs_root_used(root-root_item) + size); + spin_unlock(root-accounting_lock); +} + +static void root_sub_used(struct btrfs_root *root, u32 size) +{ + spin_lock(root-accounting_lock); + btrfs_set_root_used(root-root_item, + btrfs_root_used(root-root_item) - size); + spin_unlock(root-accounting_lock); +} + /* given a node and slot number, this reads the blocks it points to. The * extent buffer is returned with a reference taken (but unlocked). * NULL is returned on error. @@ -1018,7 +1037,11 @@ static noinline int balance_level(struct btrfs_tree_lock(child); btrfs_set_lock_blocking(child); ret = btrfs_cow_block(trans, root, child, mid, 0, child); - BUG_ON(ret); + if (ret) { + btrfs_tree_unlock(child); + free_extent_buffer(child); + goto enospc; + } spin_lock(root-node_lock); root-node = child; @@ -1033,11 +1056,12 @@ static noinline int balance_level(struct btrfs_tree_unlock(mid); /* once for the path */ free_extent_buffer(mid); - ret = btrfs_free_tree_block(trans, root, mid-start, mid-len, - 0, root-root_key.objectid, level); + + root_sub_used(root, mid-len); + btrfs_free_tree_block(trans, root, mid, 0, 1); /* once for the root ptr */ free_extent_buffer(mid); - return ret; + return 0; } if (btrfs_header_nritems(mid) BTRFS_NODEPTRS_PER_BLOCK(root) / 4) @@ -1087,23 +,16 @@ static noinline int balance_level(struct if (wret 0
[PATCH 04/12] Btrfs: Kill init_btrfs_i()
All code in init_btrfs_i can be moved into btrfs_alloc_inode().

Signed-off-by: Yan Zheng <zheng@oracle.com>

---
diff -urp 4/fs/btrfs/inode.c 5/fs/btrfs/inode.c
--- 4/fs/btrfs/inode.c	2010-04-18 08:13:48.183698829 +0800
+++ 5/fs/btrfs/inode.c	2010-04-18 10:59:07.534719917 +0800
@@ -3595,40 +3595,10 @@ again:
 	return 0;
 }
 
-static noinline void init_btrfs_i(struct inode *inode)
-{
-	struct btrfs_inode *bi = BTRFS_I(inode);
-
-	bi->generation = 0;
-	bi->sequence = 0;
-	bi->last_trans = 0;
-	bi->last_sub_trans = 0;
-	bi->logged_trans = 0;
-	bi->delalloc_bytes = 0;
-	bi->reserved_bytes = 0;
-	bi->disk_i_size = 0;
-	bi->flags = 0;
-	bi->index_cnt = (u64)-1;
-	bi->last_unlink_trans = 0;
-	bi->ordered_data_close = 0;
-	bi->force_compress = 0;
-	extent_map_tree_init(&BTRFS_I(inode)->extent_tree, GFP_NOFS);
-	extent_io_tree_init(&BTRFS_I(inode)->io_tree,
-			     inode->i_mapping, GFP_NOFS);
-	extent_io_tree_init(&BTRFS_I(inode)->io_failure_tree,
-			     inode->i_mapping, GFP_NOFS);
-	INIT_LIST_HEAD(&BTRFS_I(inode)->delalloc_inodes);
-	INIT_LIST_HEAD(&BTRFS_I(inode)->ordered_operations);
-	RB_CLEAR_NODE(&BTRFS_I(inode)->rb_node);
-	btrfs_ordered_inode_tree_init(&BTRFS_I(inode)->ordered_tree);
-	mutex_init(&BTRFS_I(inode)->log_mutex);
-}
-
 static int btrfs_init_locked_inode(struct inode *inode, void *p)
 {
 	struct btrfs_iget_args *args = p;
 	inode->i_ino = args->ino;
-	init_btrfs_i(inode);
 	BTRFS_I(inode)->root = args->root;
 	btrfs_set_inode_space_info(args->root, inode);
 	return 0;
@@ -3691,8 +3661,6 @@ static struct inode *new_simple_dir(stru
 	if (!inode)
 		return ERR_PTR(-ENOMEM);
 
-	init_btrfs_i(inode);
-
 	BTRFS_I(inode)->root = root;
 	memcpy(&BTRFS_I(inode)->location, key, sizeof(*key));
 	BTRFS_I(inode)->dummy_inode = 1;
@@ -4091,7 +4059,6 @@ static struct inode *btrfs_new_inode(str
 	 * btrfs_get_inode_index_count has an explanation for the magic
 	 * number
 	 */
-	init_btrfs_i(inode);
 	BTRFS_I(inode)->index_cnt = 2;
 	BTRFS_I(inode)->root = root;
 	BTRFS_I(inode)->generation = trans->transid;
@@ -5262,21 +5229,46 @@ unsigned long btrfs_force_ra(struct addr
 struct inode *btrfs_alloc_inode(struct super_block *sb)
 {
 	struct btrfs_inode *ei;
+	struct inode *inode;
 
 	ei = kmem_cache_alloc(btrfs_inode_cachep, GFP_NOFS);
 	if (!ei)
 		return NULL;
+
+	ei->root = NULL;
+	ei->space_info = NULL;
+	ei->generation = 0;
+	ei->sequence = 0;
 	ei->last_trans = 0;
 	ei->last_sub_trans = 0;
 	ei->logged_trans = 0;
+	ei->delalloc_bytes = 0;
+	ei->reserved_bytes = 0;
+	ei->disk_i_size = 0;
+	ei->flags = 0;
+	ei->index_cnt = (u64)-1;
+	ei->last_unlink_trans = 0;
+
+	spin_lock_init(&ei->accounting_lock);
 	ei->outstanding_extents = 0;
 	ei->reserved_extents = 0;
-	ei->root = NULL;
-	spin_lock_init(&ei->accounting_lock);
+
+	ei->ordered_data_close = 0;
+	ei->dummy_inode = 0;
+	ei->force_compress = 0;
+
+	inode = &ei->vfs_inode;
+	extent_map_tree_init(&ei->extent_tree, GFP_NOFS);
+	extent_io_tree_init(&ei->io_tree, &inode->i_data, GFP_NOFS);
+	extent_io_tree_init(&ei->io_failure_tree, &inode->i_data, GFP_NOFS);
+	mutex_init(&ei->log_mutex);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
 	INIT_LIST_HEAD(&ei->i_orphan);
+	INIT_LIST_HEAD(&ei->delalloc_inodes);
 	INIT_LIST_HEAD(&ei->ordered_operations);
-	return &ei->vfs_inode;
+	RB_CLEAR_NODE(&ei->rb_node);
+
+	return inode;
 }
 
 void btrfs_destroy_inode(struct inode *inode)
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 02/12] Btrfs: Kill allocate_wait in space_info
On Mon, Apr 19, 2010 at 9:57 PM, Josef Bacik <jo...@redhat.com> wrote:
> The purpose of maybe_allocate_chunk was that there is no way to know if
> some other CPU is currently trying to allocate a chunk for the given
> space info. We could have two CPUs come into do_chunk_alloc at
> relatively the same time and end up allocating twice the amount of
> space, which is why I did the waitqueue thing. It seems like this is
> still a possibility with your patch. Thanks,

This is impossible because the very first thing do_chunk_alloc does is
lock the chunk_mutex.

Yan, Zheng
Re: [PATCH 02/12] Btrfs: Kill allocate_wait in space_info
On Mon, Apr 19, 2010 at 10:48 PM, Josef Bacik <jo...@redhat.com> wrote:
> On Mon, Apr 19, 2010 at 10:46:12PM +0800, Yan, Zheng wrote:
>> On Mon, Apr 19, 2010 at 9:57 PM, Josef Bacik <jo...@redhat.com> wrote:
>> > The purpose of maybe_allocate_chunk was that there is no way to know if
>> > some other CPU is currently trying to allocate a chunk for the given
>> > space info. We could have two CPUs come into do_chunk_alloc at
>> > relatively the same time and end up allocating twice the amount of
>> > space, which is why I did the waitqueue thing. It seems like this is
>> > still a possibility with your patch. Thanks,
>>
>> This is impossible because the very first thing do_chunk_alloc does is
>> lock the chunk_mutex.
>
> Sure, that just means we don't get two things creating chunks at the same
> time, but not from creating them one right after another. So CPU 0 and 1
> come in to the check free space stuff, realize they need to allocate a
> chunk, and race to call do_chunk_alloc. One of them wins, and the other
> blocks on the chunk_mutex lock. When the first finishes the other one is
> able to continue and do what it was originally going to do, and then you
> get two chunks when you really only wanted one. Thanks,

There is a check in do_chunk_alloc, so the later one will do nothing if
the first call adds enough space.

Yan Zheng
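The "lock, then check again" argument in this exchange can be sketched in userspace C. This is a minimal illustration and not the btrfs code: `struct space_sketch`, `sketch_chunk_alloc()` and `SKETCH_CHUNK_SIZE` are hypothetical names, and a pthread mutex stands in for chunk_mutex.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for a space_info; not the btrfs structure. */
struct space_sketch {
	pthread_mutex_t chunk_mutex;	/* plays the role of chunk_mutex */
	uint64_t total_bytes;
	uint64_t used_bytes;
	unsigned int chunks_allocated;
};

#define SKETCH_CHUNK_SIZE (64ULL * 1024 * 1024)

/* Returns 1 if a chunk was added, 0 if the re-check found enough space. */
static int sketch_chunk_alloc(struct space_sketch *s, uint64_t need)
{
	int allocated = 0;

	pthread_mutex_lock(&s->chunk_mutex);
	/* Re-check under the lock: an earlier caller may have added space. */
	if (s->used_bytes + need > s->total_bytes) {
		s->total_bytes += SKETCH_CHUNK_SIZE;
		s->chunks_allocated++;
		allocated = 1;
	}
	pthread_mutex_unlock(&s->chunk_mutex);

	return allocated;
}
```

A caller that lost the race blocks on the mutex, then finds the condition false once it gets in, so only one chunk is created. The mutex alone serializes the two callers; it is the check inside the critical section that prevents the second allocation.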
Re: kernel BUG at /home/mafra/linux-2.6/fs/btrfs/volumes.c:2828!
On Fri, Apr 9, 2010 at 6:23 AM, Carlos R. Mafra crmaf...@gmail.com wrote: I've just got this bug in the latest 2.6.34-rc3-00388 kernel. I wasn't doing anything fancy, just doing some 'git log' in a small wmaker repo I have here. After I got this bug the cpu went to 100% and I had to reboot with the button because the laptop was not responding to any commands (but the mouse was moving and X seemed ok) I have the full dmesg (64 KB) which I can upload somewhere if needed. But for now I just would like to know if this bug is known to people in the list. [ 8150.547096] [ cut here ] [ 8150.547101] kernel BUG at /home/mafra/linux-2.6/fs/btrfs/volumes.c:2828! [ 8150.547103] invalid opcode: [#1] SMP [ 8150.547107] last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq [ 8150.547109] CPU 1 [ 8150.547111] Modules linked in: ppp_deflate ppp_async ppp_generic slhc option usbserial snd_seq snd_seq_device uvcvideo snd_hda_codec_idt snd_hda_in tel iwlagn sky2 snd_hda_codec ehci_hcd snd_hwdep i2c_i801 evdev [ 8150.547125] [ 8150.547128] Pid: 1606, comm: btrfs-delalloc- Not tainted 2.6.34-rc3-00388-gf5284e7 #48 VAIO/VGN-FZ240E [ 8150.547131] RIP: 0010:[811b39a9] [811b39a9] btrfs_rmap_block+0x2c9/0x2d0 [ 8150.547139] RSP: 0018:88007aeab7f0 EFLAGS: 00010246 [ 8150.547142] RAX: RBX: 0001 RCX: [ 8150.547144] RDX: 001e821d RSI: 001e821d RDI: 88007ef1fa28 [ 8150.547147] RBP: 88007aeab870 R08: R09: 88007aeab8a8 [ 8150.547149] R10: R11: 88005ae73580 R12: [ 8150.547151] R13: 88007ae9c0e8 R14: 88007aeab8ac R15: 88007aeab8a8 [ 8150.547154] FS: () GS:880001b0() knlGS: [ 8150.547157] CS: 0010 DS: ES: CR0: 8005003b [ 8150.547159] CR2: 7f4adf3d9000 CR3: 01812000 CR4: 06e0 [ 8150.547161] DR0: DR1: DR2: [ 8150.547164] DR3: DR6: 0ff0 DR7: 0400 [ 8150.547166] Process btrfs-delalloc- (pid: 1606, threadinfo 88007aeaa000, task 88007f330280) [ 8150.547168] Stack: [ 8150.547170] 09b8 8801 88007acf7000 00ff811a0eb5 [ 8150.547173] 0 0001 220240cc 88007aeaa000 88007aeab8a8 [ 8150.547178] 0 
88007aeab8a0 001e821d 88007aeab860 88007ae6fd80 [ 8150.547182] Call Trace: [ 8150.547188] [8117bdb1] exclude_super_stripes+0x61/0x110 [ 8150.547193] [8119dc30] ? __tree_search+0x90/0x120 [ 8150.547196] [8117c906] btrfs_make_block_group+0x136/0x320 [ 8150.547200] [811b6fd6] __btrfs_alloc_chunk+0x616/0x740 [ 8150.547203] [811b716f] btrfs_alloc_chunk+0x6f/0xa0 [ 8150.547207] [8117d546] do_chunk_alloc+0x166/0x230 [ 8150.547210] [8117f7d9] find_free_extent+0x939/0x9f0 [ 8150.547214] [8117fab5] btrfs_reserve_extent+0xc5/0x1b0 [ 8150.547218] [814f3559] ? mutex_lock+0x19/0x50 [ 8150.547227] [8119572d] submit_compressed_extents+0x10d/0x3f0 [ 8150.547231] [81195a8f] async_cow_submit+0x7f/0x90 [ 8150.547234] [811b80c6] run_ordered_completions+0x76/0xe0 [ 8150.547237] [811b898a] worker_loop+0x15a/0x580 [ 8150.547240] [811b8830] ? worker_loop+0x0/0x580 [ 8150.547243] [8105b6ce] kthread+0x8e/0xa0 [ 8150.547248] [81003ad4] kernel_thread_helper+0x4/0x10 [ 8150.547250] [8105b640] ? kthread+0x0/0xa0 [ 8150.547259] [81003ad0] ? kernel_thread_helper+0x0/0x10 [ 8150.547261] Code: 00 00 48 c7 c7 68 ce 75 81 e8 34 ab e8 ff 4c 8b 5d 90 4c 8b 55 98 4c 8b 4d b0 48 8b 4d a0 48 8b 45 a8 e9 af fe ff ff 0f 0b eb fe 0f 0b eb fe 0f 1f 00 55 48 89 e5 41 55 49 89 d5 41 54 49 89 f4 [ 8150.547292] RIP [811b39a9] btrfs_rmap_block+0x2c9/0x2d0 [ 8150.547296] RSP 88007aeab7f0 [ 8150.547310] ---[ end trace 6c5e0e4f829c2aeb ]--- This has been fixed by commit http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable.git;a=commit;h=9f680ce04ea19dabbbafe01b57b61930a9b70741 Yan, Zheng -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix chunk allocate size calculation
On Thu, Mar 18, 2010 at 4:45 AM, Josef Bacik <jo...@redhat.com> wrote:
> If the amount of free space left in a device is less than what we think
> should be the minimum size, just ignore the minimum size and use the
> amount we have. I ran into this running tests on a 600mb volume, the
> chunk allocator wouldn't let me allocate the last 52mb of the disk for
> data because we want to have at least 64mb chunks for data. This patch
> fixes that problem. Thanks,
>
> Signed-off-by: Josef Bacik <jo...@redhat.com>
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9df8e3f..1c5b5ba 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2244,8 +2244,10 @@ again:
>  		do_div(calc_size, stripe_len);
>  		calc_size *= stripe_len;
>  	}
> +
>  	/* we don't want tiny stripes */
> -	calc_size = max_t(u64, min_stripe_size, calc_size);
> +	if (!looped)
> +		calc_size = max_t(u64, min_stripe_size, calc_size);
>
>  	do_div(calc_size, stripe_len);
>  	calc_size *= stripe_len;

I encountered an Oops caused by 'calc_size == 0'. It was likely introduced
by this change: once the min_stripe_size clamp is skipped, calc_size can be
smaller than stripe_len, and then it is zero after the do_div() and
multiply.
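The rounding step behind that report can be reproduced in userspace. This is a sketch under assumptions: `round_down_to_stripe()` is a hypothetical helper, plain 64-bit division stands in for the kernel's do_div() (which leaves the quotient in its first argument), and the 64 KiB stripe length is only an example value.

```c
#include <stdint.h>

/*
 * Round calc_size down to a multiple of stripe_len, mirroring the
 * do_div()/multiply pair in the quoted hunk. Userspace stand-in only.
 */
static uint64_t round_down_to_stripe(uint64_t calc_size, uint64_t stripe_len)
{
	calc_size /= stripe_len;	/* do_div(calc_size, stripe_len) */
	calc_size *= stripe_len;
	return calc_size;
}
```

With an example stripe_len of 65536, a calc_size of 52 MiB is unchanged, but any calc_size below 65536 rounds down to 0. The unconditional max_t() against min_stripe_size used to rule that case out.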
Re: Oops on btrfs filesystem balance
On Thu, Mar 25, 2010 at 9:06 PM, Kirill A. Shutemov kir...@shutemov.name wrote: On lastest Linus' git. [ 4005.426805] BUG: unable to handle kernel NULL pointer dereference at 0021 [ 4005.426818] IP: [c109a130] page_cache_sync_readahead+0x18/0x3e [ 4005.426837] *pde = [ 4005.426844] Oops: [#1] PREEMPT SMP [ 4005.426854] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:00/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/energy_full [ 4005.426864] Modules linked in: btrfs zlib_deflate crc32c libcrc32c loop coretemp ext2 arc4 ecb iwlagn iwlcore snd_hda_codec_conexant snd_hda_intel mac80211 snd_hda_codec snd_hwdep snd_pcm snd_timer snd uvcvideo e1000e rtc_cmos rtc_core cdc_ether videodev uhci_hcd usbnet sg snd_page_alloc video thinkpad_acpi cdc_acm rtc_lib v4l1_compat mii output ext3 jbd usbhid sd_mod sha256_generic cbc ata_piix ehci_hcd aes_i586 aes_generic libata dm_crypt usbcore scsi_mod nls_base dm_mod [ 4005.426971] [ 4005.426979] Pid: 25838, comm: btrfs Not tainted 2.6.34-rc2 #67 2767BC8/2767BC8 [ 4005.426987] EIP: 0060:[c109a130] EFLAGS: 00010206 CPU: 0 [ 4005.426996] EIP is at page_cache_sync_readahead+0x18/0x3e [ 4005.427002] EAX: f58dcb84 EBX: ECX: EDX: f45efe40 [ 4005.427009] ESI: 00033b43 EDI: f58dcad4 EBP: f4b61ce0 ESP: f4b61cd8 [ 4005.427010] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 [ 4005.427010] Process btrfs (pid: 25838, ti=f4b6 task=f6680a60 task.ti=f4b6) [ 4005.427010] Stack: [ 4005.427010] 41c1 0001 f4b61d50 f9443902 00033b43 f93fc3dc f6bf4d80 [ 4005.427010] 0 f4cc74d0 41c1 0001 f58dcb4c 00033b42 f58dc9e0 f72e7600 f4b61d2c [ 4005.427010] 0 f45efe40 00033b43 41c0 0001 [ 4005.427010] Call Trace: [ 4005.427010] [f9443902] ? relocate_file_extent_cluster+0x195/0x3bd [btrfs] [ 4005.427010] [f93fc3dc] ? btrfs_release_path+0x39/0x4a [btrfs] [ 4005.427010] [f9444bd2] ? relocate_block_group+0x2be/0x32a [btrfs] [ 4005.427010] [f9411dd3] ? btrfs_clean_old_snapshots+0x66/0xd9 [btrfs] [ 4005.427010] [f9444d87] ? 
btrfs_relocate_block_group+0x149/0x2e3 [btrfs] [ 4005.427010] [f942eecc] ? btrfs_relocate_chunk+0x5c/0x423 [btrfs] [ 4005.427010] [c10217cc] ? kmap_atomic+0x13/0x15 [ 4005.427010] [f9428f32] ? map_private_extent_buffer+0x94/0xb6 [btrfs] [ 4005.427010] [f9428fa3] ? map_extent_buffer+0x4f/0x7f [btrfs] [ 4005.427010] [c10216d3] ? kunmap_atomic+0x6c/0x83 [ 4005.427010] [f9428aca] ? unmap_extent_buffer+0x11/0x13 [btrfs] [ 4005.427010] [f94206dd] ? btrfs_item_offset+0x98/0xa2 [btrfs] [ 4005.427010] [f942f856] ? btrfs_balance+0x20f/0x265 [btrfs] [ 4005.427010] [f9436ab9] ? btrfs_ioctl+0x6ad/0x824 [btrfs] [ 4005.427010] [c10bf8e1] ? __memcg_event_check+0x50/0x72 [ 4005.427010] [c11461e2] ? file_has_perm+0x8c/0xa6 [ 4005.427010] [c10cf310] ? vfs_ioctl+0x2c/0x96 [ 4005.427010] [f943640c] ? btrfs_ioctl+0x0/0x824 [btrfs] [ 4005.427010] [c10cf8ac] ? do_vfs_ioctl+0x48e/0x4cc [ 4005.427010] [c11463ca] ? selinux_file_ioctl+0x43/0x46 [ 4005.427010] [c10cf930] ? sys_ioctl+0x46/0x66 [ 4005.427010] [c132ae88] ? syscall_call+0x7/0xb [ 4005.427010] Code: 8b 48 24 85 c9 74 04 31 d2 ff d1 8d 65 f4 5b 5e 5f c9 c3 55 89 e5 56 53 0f 1f 44 00 00 89 cb 8b 75 0c 8b 4d 08 83 7a 0c 00 74 1f f6 43 21 10 74 0b 89 da 56 e8 f5 fc ff ff 5b eb 0e 56 51 89 d9 [ 4005.427010] EIP: [c109a130] page_cache_sync_readahead+0x18/0x3e SS:ESP 0068:f4b61cd8 [ 4005.427010] CR2: 0021 [ 4005.427898] ---[ end trace 0e53ab674cd5bfb9 ]--- The 'filp' parameter for page_cache_sync_readahead is NULL in this case. Commit 0141450f66c3c12a3aaa869748caa64241885cdf added code that dereference 'filp'. Fengguang, would you please fix this. Regards Yan, Zheng -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix ENOSPC accounting when max_extent is not maxed out V2
On Fri, Mar 19, 2010 at 9:59 PM, Josef Bacik <jo...@redhat.com> wrote:
> On Fri, Mar 19, 2010 at 11:09:25AM +0800, Yan, Zheng wrote:
>> On Thu, Mar 18, 2010 at 11:47 PM, Josef Bacik <jo...@redhat.com> wrote:
>> > A user reported a bug a few weeks back where if he set max_extent=1m and
>> > then did a dd and then stopped it, we would panic. This is because I
>> > miscalculated how many extents would be needed for splits/merges. Turns
>> > out I didn't actually take max_extent into account properly, since we
>> > only ever add 1 extent for a write, which isn't quite right for the case
>> > that say max_extent is 4k and we do 8k writes. That would result in more
>> > than 1 extent. So this patch makes us properly figure out how many
>> > extents are needed for the amount of data that is being written, and
>> > deals with splitting and merging better. I've tested this ad nauseam and
>> > it works well. This version doesn't depend on my per-cpu stuff. Thanks,
>>
>> Why not remove the max_extent check? The max length of a file extent is
>> also affected by the fragmentation level of free space. It doesn't make
>> sense to introduce complex code to address one factor while losing sight
>> of another factor. I think reserving one unit of metadata for each
>> delalloc extent in the extent IO tree should be OK, because even if a
>> delalloc extent ends up as multiple file extents, these file extents are
>> adjacent in the b-tree.
>
> Do you mean remove the option for max_extent altogether, or just remove
> all of my code for taking it into account? Thanks,

All of the code for taking max_extent into account.
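The 4k/8k example from the quoted description amounts to a ceiling division. The helper below is a hypothetical sketch of that arithmetic only. It is not the calculation from the patch under discussion, which is not shown in this thread.

```c
#include <stdint.h>

/*
 * Minimum number of file extents needed to cover num_bytes when no
 * extent may exceed max_extent bytes. Illustrative helper only.
 */
static uint64_t extents_for_write(uint64_t num_bytes, uint64_t max_extent)
{
	return (num_bytes + max_extent - 1) / max_extent;
}
```

An 8k write with max_extent set to 4k needs at least 2 extents. This is only a lower bound: as the reply points out, free-space fragmentation can split a delalloc range further, which is the argument for reserving per delalloc extent instead of modelling max_extent.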
Re: [PATCH] btrfs: btrfs_mark_extent_written writes correctly slot
On 02/11/2010 03:43 PM, Shaohua Li wrote:
> My test does this: fallocate a big file and write to it. The file is
> 512M, but after the file write is done btrfs-debug-tree shows:
> item 6 key (257 EXTENT_DATA 0) itemoff 3516 itemsize 53
>         extent data disk byte 1103101952 nr 536870912
>         extent data offset 0 nr 399634432 ram 536870912
>         extent compression 0
>
> Looks like a regression introduced by
> 6c7d54ac87f338c479d9729e8392eca3f76e11e1, where we set the wrong slot.
>
> Signed-off-by: Shaohua Li <shaohua...@intel.com>
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 9d08096..6ed434a 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -720,13 +720,15 @@ again:
>  					     inode->i_ino, orig_offset);
>  		BUG_ON(ret);
>  	}
> -	fi = btrfs_item_ptr(leaf, path->slots[0],
> -			   struct btrfs_file_extent_item);
>  	if (del_nr == 0) {
> +		fi = btrfs_item_ptr(leaf, path->slots[0],
> +			   struct btrfs_file_extent_item);
>  		btrfs_set_file_extent_type(leaf, fi,
>  					   BTRFS_FILE_EXTENT_REG);
>  		btrfs_mark_buffer_dirty(leaf);
>  	} else {
> +		fi = btrfs_item_ptr(leaf, del_slot - 1,
> +			   struct btrfs_file_extent_item);
>  		btrfs_set_file_extent_type(leaf, fi,
>  					   BTRFS_FILE_EXTENT_REG);
>  		btrfs_set_file_extent_num_bytes(leaf, fi,

Acked-by: Yan Zheng <zheng@oracle.com>
[PATCH] btrfs: fix race between allocate and release extent buffer.
Increase extent buffer's reference count while holding the lock.
Otherwise it can race with try_release_extent_buffer.

Signed-off-by: Yan Zheng <zheng@oracle.com>

---
diff -urp 1/fs/btrfs/extent_io.c 2/fs/btrfs/extent_io.c
--- 1/fs/btrfs/extent_io.c	2010-01-17 15:48:16.770302026 +0800
+++ 2/fs/btrfs/extent_io.c	2010-02-04 16:37:45.704800682 +0800
@@ -3165,10 +3165,9 @@ struct extent_buffer *alloc_extent_buffe
 		spin_unlock(&tree->buffer_lock);
 		goto free_eb;
 	}
-	spin_unlock(&tree->buffer_lock);
-
 	/* add one reference for the tree */
 	atomic_inc(&eb->refs);
+	spin_unlock(&tree->buffer_lock);
 	return eb;
 
 free_eb:
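The ordering this patch enforces can be sketched in userspace C. Everything here is hypothetical: `buf_sketch`, `cache_sketch`, `sketch_get()` and `sketch_try_release()` are illustrative names, a pthread mutex stands in for tree->buffer_lock, and the sketch assumes that the release side checks the reference count under that same lock.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical cached object; not struct extent_buffer. */
struct buf_sketch {
	int refs;	/* protected by cache_sketch.lock in this sketch */
	int in_cache;
};

struct cache_sketch {
	pthread_mutex_t lock;	/* plays the role of tree->buffer_lock */
	struct buf_sketch *slot;
};

/* Look up the cached buffer and take a reference before unlocking. */
static struct buf_sketch *sketch_get(struct cache_sketch *c)
{
	struct buf_sketch *b;

	pthread_mutex_lock(&c->lock);
	b = c->slot;
	if (b)
		b->refs++;	/* taken under the lock, so release sees it */
	pthread_mutex_unlock(&c->lock);

	return b;
}

/* Drop the buffer only if the cache holds the last reference. */
static int sketch_try_release(struct cache_sketch *c)
{
	int released = 0;

	pthread_mutex_lock(&c->lock);
	if (c->slot && c->slot->refs == 1) {
		c->slot->in_cache = 0;
		c->slot = NULL;
		released = 1;
	}
	pthread_mutex_unlock(&c->lock);

	return released;
}
```

If the increment happened after the unlock, a release running in the gap would still see a count of 1 and drop a buffer that the caller is about to use.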
Re: [PATCH] btrfs: fix race between allocate and release extent buffer.
On 02/04/2010 04:46 PM, Yan, Zheng wrote: Increase extent buffer's reference count while holding the lock. Otherwise it can race with try_release_extent_buffer. Signed-off-by: Yan Zheng zheng@oracle.com --- diff -urp 1/fs/btrfs/extent_io.c 2/fs/btrfs/extent_io.c --- 1/fs/btrfs/extent_io.c2010-01-17 15:48:16.770302026 +0800 +++ 2/fs/btrfs/extent_io.c2010-02-04 16:37:45.704800682 +0800 @@ -3165,10 +3165,9 @@ struct extent_buffer *alloc_extent_buffe spin_unlock(tree-buffer_lock); goto free_eb; } - spin_unlock(tree-buffer_lock); - /* add one reference for the tree */ atomic_inc(eb-refs); + spin_unlock(tree-buffer_lock); return eb; free_eb: Oops caused by this bug are attached below. Modules linked in: btrfs ipt_MASQUERADE iptable_nat nf_nat bridge stp zlib_deflate libcrc32c llc sunrpc xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 p4_clockmod freq_table speedstep_lib dm_multipath kvm uinput snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm ppdev parport_pc parport dcdbas serio_raw i2c_i801 pcspkr snd_timer snd soundcore iTCO_wdt iTCO_vendor_support snd_page_alloc e1000e ata_generic pata_acpi i915 drm_kms_helper drm i2c_algo_bit i2c_core video output [last unloaded: freq_table] Pid: 3302, comm: flush-btrfs-1 Tainted: GW 2.6.32 #1 OptiPlex 755 RIP: 0010:[a0396718] [a0396718] btrfs_set_buffer_uptodate+0x14/0x25 [btrfs] RSP: 0018:880077e47480 EFLAGS: 00010202 RAX: RBX: 88003d8a4000 RCX: RDX: 0001 RSI: 88003d8a4000 RDI: 88003d8a4000 RBP: 880077e47480 R08: 880001c555c0 R09: R10: 880001c55630 R11: 880001c555c0 R12: 88007910eb80 R13: 88007a39c800 R14: 0022 R15: 88007910eb80 FS: () GS:880001c4() knlGS: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b CR2: CR3: 0a991000 CR4: 06e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process flush-btrfs-1 (pid: 3302, threadinfo 880077e46000, task 8800796a2e60) Stack: 880077e474b0 a038c334 88007a39c800 88007a39c9e0 0 1000 880077e47550 a039237b 0 0003 8800288935c0 814627da Call Trace: 
[a038c334] btrfs_init_new_buffer+0x78/0xe9 [btrfs] [a039237b] btrfs_alloc_free_block+0x1ef/0x1f4 [btrfs] [814627da] ? sub_preempt_count+0x9/0x83 [a038708e] split_leaf+0x243/0x449 [btrfs] [814600d2] ? _spin_unlock+0x2a/0x35 [a038826a] btrfs_search_slot+0x45c/0x518 [btrfs] [a0388e0b] btrfs_insert_empty_items+0x6a/0xbc [btrfs] [8146285d] ? add_preempt_count+0x9/0x83 [a039effe] insert_inline_extent+0xc0/0x251 [btrfs] [a03b4eeb] ? extent_clear_unlock_delalloc+0x1c7/0x1e4 [btrfs] [a039f2a5] cow_file_range_inline+0x116/0x159 [btrfs] [a039bb6e] ? start_transaction+0x1b8/0x1ea [btrfs] [a039f384] cow_file_range+0x9c/0x354 [btrfs] [a03b3dae] ? set_extent_bit+0x390/0x3e8 [btrfs] [a039fc67] run_delalloc_range+0xb4/0x364 [btrfs] [a03b6198] ? find_lock_delalloc_range+0x186/0x1a6 [btrfs] [a03b6343] __extent_writepage+0x18b/0x584 [btrfs] [811156e5] ? mem_cgroup_add_lru_list+0x81/0x8a [a03b6b73] extent_write_cache_pages.clone.0+0x155/0x2b1 [btrfs] [8145e6ab] ? thread_return+0xa8/0xd0 [8104ad22] ? finish_task_switch+0x85/0xa8 [8103fe77] ? need_resched+0x23/0x2d [a03b6dda] extent_writepages+0x44/0x5a [btrfs] [a039e608] ? btrfs_get_extent+0x0/0x753 [btrfs] [81076de8] ? bit_waitqueue+0x17/0xa9 [a039e4da] btrfs_writepages+0x27/0x29 [btrfs] [810dd8d5] do_writepages+0x21/0x2a [8113a5e2] writeback_single_inode+0xd1/0x1f6 [8113ade1] writeback_inodes_wb+0x388/0x423 [8113afa4] wb_writeback+0x128/0x1ac [810b0ded] ? call_rcu_sched+0x15/0x17 [810b0dfd] ? call_rcu+0xe/0x10 [8113b147] wb_do_writeback+0x6e/0x166 [8113b27e] bdi_writeback_task+0x3f/0xaf [810ecf94] ? bdi_start_fn+0x0/0xd4 [810ed00a] bdi_start_fn+0x76/0xd4 [810ecf94] ? bdi_start_fn+0x0/0xd4 [81076b9c] kthread+0x7f/0x87 [81012dda] child_rip+0xa/0x20 [81076b1d] ? kthread+0x0/0x87 [81012dd0] ? 
child_rip+0x0/0x20 Code: 00 00 48 81 c7 d0 20 00 00 e8 ad 99 0c e1 5b 41 5c 41 5d 41 5e c9 c3 55 48 89 e5 0f 1f 44 00 00 48 8b 47 30 48 89 fe 48 8b 40 18 48 8b 38 48 81 ef 78 01 00 00 e8 0a d7 01 00 c9 c3 55 48 89 e5 RIP [a0396718] btrfs_set_buffer_uptodate+0x14/0x25 [btrfs] RSP 880077e47480 CR2: Modules linked
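The ordering fix in the patch above (take the tree's reference before dropping the lock) can be illustrated with a small user-space model. This is a minimal sketch, not the btrfs code: demo_buffer, demo_tree, demo_insert and demo_try_release are hypothetical names, and a C11 atomic_flag stands in for tree->buffer_lock. It assumes, as the patch description implies, that the release path treats "refs == 1" as "only the tree holds this buffer".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an extent buffer; not the btrfs structure. */
struct demo_buffer {
	atomic_int refs;		/* starts at 1 for the allocating caller */
};

/* Hypothetical stand-in for the tree that caches buffers. */
struct demo_tree {
	atomic_flag lock;		/* plays the role of tree->buffer_lock */
	struct demo_buffer *cached;
};

void demo_tree_init(struct demo_tree *tree)
{
	atomic_flag_clear(&tree->lock);
	tree->cached = NULL;
}

static void demo_lock(struct demo_tree *tree)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&tree->lock, memory_order_acquire))
		;
}

static void demo_unlock(struct demo_tree *tree)
{
	atomic_flag_clear_explicit(&tree->lock, memory_order_release);
}

/*
 * Publish the buffer in the tree and add the tree's reference while the
 * lock is still held, so the release path can never observe a published
 * buffer that appears to have no other user.
 */
struct demo_buffer *demo_insert(struct demo_tree *tree, struct demo_buffer *eb)
{
	assert(eb != NULL);
	demo_lock(tree);
	tree->cached = eb;
	atomic_fetch_add(&eb->refs, 1);	/* reference for the tree */
	demo_unlock(tree);
	return eb;
}

/* Drop the buffer from the tree only if the tree holds the last reference. */
bool demo_try_release(struct demo_tree *tree)
{
	bool released = false;

	demo_lock(tree);
	if (tree->cached && atomic_load(&tree->cached->refs) == 1) {
		tree->cached = NULL;
		released = true;
	}
	demo_unlock(tree);
	return released;
}
```

With the increment moved after demo_unlock(), a concurrent demo_try_release() could run in the gap, see refs == 1 on a buffer the caller is still using, and drop it. That is the window the patch closes.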
Re: btrfs-vol -b endless loop
On Thu, Feb 4, 2010 at 6:25 AM, Pär Andersson pa...@lysator.liu.se wrote:
> Hi,
>
> I have an 11G btrfs file system where btrfs-vol -b gets stuck in a loop.
> The "found 7 extents" messages and heavy disk I/O continue until I reboot:
>
> [26223.544037] btrfs: relocating block group 42079027200 flags 36
> [26228.082740] btrfs: found 2457 extents
> [26228.104515] btrfs: relocating block group 41407938560 flags 1
> [26228.423981] btrfs: found 124 extents
> [26229.514583] btrfs: found 124 extents
> [26229.544160] btrfs: found 7 extents
> [26229.812494] btrfs: found 7 extents
> [26230.453720] btrfs: found 7 extents
> [26230.733233] btrfs: found 7 extents
> [26231.383321] btrfs: found 7 extents
> [26231.652815] btrfs: found 7 extents
> ...
>
> The error is repeatable. btrfsck does not report any errors:
>
> r...@faran:~# btrfsck /dev/sda5
> found 9035055104 bytes used err is 0
> total csum bytes: 8293152
> total tree bytes: 542867456
> total fs tree bytes: 503087104
> btree space waste bytes: 108089491
> file data blocks allocated: 8492187648
>  referenced 10314125312
> Btrfs Btrfs v0.19
>
> The kernel is Linus' master from two days ago, and seems to have all the
> latest btrfs code merged.

The loop is due to fragments of free space. I'm working on making
'btrfs-vol -b' return -ENOSPC in this case.

Yan, Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs: Fix oopsen when dropping empty tree.
When dropping an empty tree, walk_down_tree() skips checking extent
information for the tree root. This triggers a BUG_ON in walk_up_proc().

Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp 1/fs/btrfs/extent-tree.c 2/fs/btrfs/extent-tree.c
--- 1/fs/btrfs/extent-tree.c	2010-01-22 12:16:34.203525744 +0800
+++ 2/fs/btrfs/extent-tree.c	2010-02-01 10:26:19.865562007 +0800
@@ -5402,10 +5402,6 @@ static noinline int walk_down_tree(struc
 	int ret;
 
 	while (level >= 0) {
-		if (path->slots[level] >=
-		    btrfs_header_nritems(path->nodes[level]))
-			break;
-
 		ret = walk_down_proc(trans, root, path, wc, lookup_info);
 		if (ret > 0)
 			break;
@@ -5413,6 +5409,10 @@ static noinline int walk_down_tree(struc
 		if (level == 0)
 			break;
 
+		if (path->slots[level] >=
+		    btrfs_header_nritems(path->nodes[level]))
+			break;
+
 		ret = do_walk_down(trans, root, path, wc, &lookup_info);
 		if (ret > 0) {
 			path->slots[level]++;
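The effect of moving the slot check can be seen in a reduced model of the loop. This is only a sketch under simplifying assumptions: demo_node and demo_walk_down are hypothetical, the walk follows slot 0 at every level, and "processing" a node merely counts it, standing in for the extent lookup done by walk_down_proc().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_SLOTS 4

/* Hypothetical tree node; level 0 is a leaf, nritems counts used slots. */
struct demo_node {
	int level;
	int nritems;
	struct demo_node *child[DEMO_MAX_SLOTS];
};

/*
 * Walk down from the root along slot 0 and return how many nodes were
 * processed. check_slot_first selects the old ordering (slot bound check
 * before processing the node); false selects the patched ordering.
 */
int demo_walk_down(struct demo_node *node, bool check_slot_first)
{
	int processed = 0;
	int slot = 0;

	while (node) {
		assert(node->nritems <= DEMO_MAX_SLOTS);

		/* Old order: an empty node ends the walk before it is processed. */
		if (check_slot_first && slot >= node->nritems)
			break;

		processed++;		/* stand-in for walk_down_proc() */

		if (node->level == 0)
			break;

		/* New order: the bound check only guards the descent. */
		if (!check_slot_first && slot >= node->nritems)
			break;

		node = node->child[slot];
		slot = 0;
	}
	return processed;
}
```

For an empty tree the root is a leaf with no items: the old ordering processes nothing, while the patched ordering still processes the root once. For a non-empty tree both orderings visit the same nodes.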
[PATCH] btrfsck: Remove superfluous WARN_ON
Signed-off-by: Yan Zheng zheng@oracle.com

---
diff -urp btrfs-progs-unstable/btrfsck.c btrfs-progs-2/btrfsck.c
--- btrfs-progs-unstable/btrfsck.c	2009-09-28 15:54:55.980479398 +0800
+++ btrfs-progs-2/btrfsck.c	2010-01-31 09:46:24.645485459 +0800
@@ -581,7 +581,6 @@ again:
 			}
 			ret = insert_existing_cache_extent(dst, &ins->cache);
 			if (ret == -EEXIST) {
-				WARN_ON(src == &src_node->root_cache);
 				conflict = get_inode_rec(dst, rec->ino, 1);
 				merge_inode_recs(rec, conflict, dst);
 				if (rec->checked) {
Re: panic during rebalance, and now upon mount
On Mon, Feb 1, 2010 at 3:33 AM, Troy Ablan tab...@gmail.com wrote: Yan, Zheng wrote: Please try the patch attached below. It should solve the bug during mounting that fs. But I don't know why there are so many link count errors in that fs. How old is that fs? what was that fs used for? Thank you very much. Yan, Zheng Good, so far. Thanks! The filesystem is less than 2 weeks old, created and managed exclusively with the unstable tools Btrfs v0.19-4-gab8fb4c-dirty I created the filesystem -d raid1 -m raid1. There are 14 dm-crypt mappings corresponding to 14 partitions on 14 drives. There's one filesystem made up from these devices with about 14 TB of space (a mixture of devices ranging from 500GB to 2TB) The filesystem is used for incremental backup from remote computers using rsync. The filesystem tree is as follows / /machine1 - normal directory /machine1/machine1 - a subvolume /machine1/machine1-20100120-1220 - a snapshot of the subvolume above /machine1/machine1-20100131-1220 - more snapshots of the subvolume above /machine2 - normal directory /machine2/machine1 - a subvolume /machine2/machine2-20100120-1020 - a snapshot of the subvolume above /machine2/machine2-20100131-1020 - more snapshots of the subvolume above The files are backed up with `rsync -aH --inplace` onto the subvolume for each machine. The only oddness I can think of is that during initial testing of this filesystem, I yanked a drive physically from the machine while it was writing. btrfs seemed to continue to try to write to the inaccessible device, and indeed, btrfs-show showed the used space on the missing drive increasing over time. Also, I was unable to remove the drive from the volume (ioctl returned -1), so it was in this state until I rebooted a couple hours later. I then did a btrfs-vol -r missing on the drive, and then added it back in as a new device. I did btrfs-vol -b which succeeded once. 
After adding more drives, I did btrfs-vol -b again, and that left me in the state where this thread began. As I understand it, btrfs-vol -b is currently one of the only ways to re-duplicate unmirrored chunks after a drive failure (aside from rewriting the data or removing and re-adding devices). Is my understanding correct? Yes, Thanks again for helping debug. Yan, Zheng
Re: panic during rebalance, and now upon mount
On Sat, Jan 30, 2010 at 2:05 PM, Troy Ablan tab...@gmail.com wrote: Hi folks, During a very lengthy btrfs-vol -b (3.5 days in), btrfs BUGged out. Upon rebooting and trying to mount that fs, the exact same bug (with the exact same call trace) happens. I moved up to 2.6.33-rc6 from gentoo-maintained 2.6.32-r2 to see what would happen, and it appears to panic at the equivalent line of the same source file as before. Let me know if I can do anything to assist. I won't do anything to the disks for the next few days in case some forensics will be useful. [ 154.899692] device label bk0 devid 14 transid 34 /dev/mapper/btrn [ 154.958264] btrfs: use compression [ 202.394048] [ cut here ] [ 202.394136] kernel BUG at fs/btrfs/extent-tree.c:5377! [ 202.394220] invalid opcode: [#1] SMP [ 202.394372] last sysfs file: /sys/devices/virtual/block/md1/md/metadata_version [ 202.394500] CPU 5 [ 202.394655] Pid: 5838, comm: btrfs-relocate- Tainted: G W 2.6.33-rc6 #1 P55M-GD45 (MS-7588) /MS-7588 [ 202.394787] RIP: 0010:[8129e5ad] [8129e5ad] walk_up_proc+0x37d/0x3c0 [ 202.394955] RSP: 0018:880139729ca0 EFLAGS: 00010282 [ 202.395039] RAX: 0218 RBX: 88013c460300 RCX: 880139728000 [ 202.395127] RDX: 8800 RSI: fff8 RDI: 880138ac08e0 [ 202.395214] RBP: 880139729d00 R08: 0008 R09: 0001 [ 202.395301] R10: 0001 R11: 0001 R12: 880138ab8880 [ 202.395389] R13: R14: 88013f72f880 R15: 88013b646800 [ 202.395476] FS: () GS:88002834() knlGS: [ 202.395606] CS: 0010 DS: ES: CR0: 8005003b [ 202.395691] CR2: 00425f40 CR3: 018d3000 CR4: 06e0 [ 202.395778] DR0: DR1: DR2: [ 202.395865] DR3: DR6: 0ff0 DR7: 0400 [ 202.395953] Process btrfs-relocate- (pid: 5838, threadinfo 880139728000, task 88013f0e28f0) [ 202.396083] Stack: [ 202.396162] 880139729cf0 0002 88013f72f880 0206 [ 202.397142] 0 880139729d30 880138ac08e0 [ 202.397444] 0 88013c460300 88013f72f880 880139728000 [ 202.397856] Call Trace: [ 202.397937] [8129e72f] walk_up_tree+0x13f/0x1c0 [ 202.398023] [8129f99c] btrfs_drop_snapshot+0x21c/0x600 [ 202.398110] 
[812a9dd0] ? __btrfs_end_transaction+0x100/0x170 [ 202.398198] [812e7d7d] merge_func+0x7d/0xc0 [ 202.398284] [812d25aa] worker_loop+0x17a/0x540 [ 202.398379] [812d2430] ? worker_loop+0x0/0x540 [ 202.398487] [812d2430] ? worker_loop+0x0/0x540 [ 202.398611] [81095936] kthread+0x96/0xa0 [ 202.398697] [81034bd4] kernel_thread_helper+0x4/0x10 [ 202.398784] [816ac869] ? restore_args+0x0/0x30 [ 202.398869] [810958a0] ? kthread+0x0/0xa0 [ 202.398953] [81034bd0] ? kernel_thread_helper+0x0/0x10 [ 202.399039] Code: 6d db b6 6d 48 c1 f8 03 48 0f af c2 48 ba 00 00 00 00 00 88 ff ff 48 c1 e0 0c 48 8b 44 10 58 ff 49 1c 48 39 c6 0f 84 ab fd ff ff 0f 0b eb fe 0f 1f 80 00 00 00 00 47 8b 4c ae 60 45 85 c9 0f 85 [ 202.401551] RIP [8129e5ad] walk_up_proc+0x37d/0x3c0 [ 202.401671] RSP 880139729ca0 [ 202.401796] ---[ end trace 4c085bcc2bd215f6 ]--- Thank you for reporting this. Would you please run btrsck and mount that fs again with the debug patch attached below. Regards Yan, Zheng --- diff -urp 1/fs/btrfs/extent-tree.c 2/fs/btrfs/extent-tree.c --- 1/fs/btrfs/extent-tree.c2010-01-22 12:16:34.203525744 +0800 +++ 2/fs/btrfs/extent-tree.c2010-01-30 20:03:23.609292953 +0800 @@ -5373,8 +5373,18 @@ static noinline int walk_up_proc(struct if (wc-flags[level] BTRFS_BLOCK_FLAG_FULL_BACKREF) parent = eb-start; else - BUG_ON(root-root_key.objectid != - btrfs_header_owner(eb)); + if (root-root_key.objectid != + btrfs_header_owner(eb)) { + printk(root %llu %llu\n, + root-root_key.objectid, + root-root_key.offset); + printk(node %llu refs %llu flags %llu owner %llu reloc %d\n, + eb-start, wc-refs[level], wc-flags[level], + btrfs_header_owner(eb), + btrfs_header_flag(eb, BTRFS_HEADER_FLAG_RELOC)); + + BUG(); + } } else { if (wc-flags[level + 1] BTRFS_BLOCK_FLAG_FULL_BACKREF
Re: [PATCH]btrfs: avoid comparing with NULL pointer
2010/1/27 Liuwenyi qingshen...@gmail.com:
> In this patch, I adjust the sequence of if-conditions.
> It will assess the page->private situation.
>
> First, we make sure the page->private is not null.
> And then, we can do something with this page->private.
>
> ---
> Signed-off-by: Liuwenyi qingshen...@gmail.com
> Cc: Chris Mason chris.ma...@oracle.com
> Cc: Yan Zheng zheng@oracle.com
> Cc: Josef Bacik jba...@redhat.com
> Cc: Jens Axboe jens.ax...@oracle.com
> Cc: linux-btrfs@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
>
> ---
>  fs/btrfs/disk-io.c | 4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 009e3bd..a300dca 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1407,11 +1407,11 @@ static int bio_ready_for_csum(struct bio *bio)
>  	bio_for_each_segment(bvec, bio, i) {
>  		page = bvec->bv_page;
> -		if (page->private == EXTENT_PAGE_PRIVATE) {
> +		if (!page->private) {
>  			length += bvec->bv_len;
>  			continue;
>  		}
> -		if (!page->private) {
> +		if (page->private == EXTENT_PAGE_PRIVATE) {
>  			length += bvec->bv_len;
>  			continue;
>  		}
> --

Why do you want to do this? The code is perfectly safe even if
page->private is NULL. Furthermore, your patch is malformed.

Yan, Zheng
Re: btrfs fsck doesn't modify a thing, and btrfs can not balance any data on a new device
> in the btree ?

The short answer is that repairing errors isn't implemented yet. I'm
afraid the only way to save your data is to try mounting the FS in
read-only mode and copy the data out.

Yan, Zheng
Re: [PATCH] Btrfs: fix another orphan cleanup problem
On Tue, Jan 26, 2010 at 5:01 AM, Josef Bacik jo...@redhat.com wrote:
> Because orphan cleanup now happens well after the fs is all initialized
> and such, we can run into this problem where we find orphan entries that
> were just added to the fs, not ones that were added previously during a
> crash. This does not bode well for the system, and results in a couple of
> odd things happening, like truncate being run on non-regular files. In
> order to fix this we just check and see if the inode has been added to
> the in-memory orphan list, and if it has, set the key to its inode
> number - 1 so we don't find this orphan entry again, and continue
> searching. This problem kept popping up while running xfs tests, and was
> 100% reproducible. With this patch the problem no longer happens.
> Thanks,

Hi Josef,

I think this problem was introduced by your previous orphan cleanup fix.
Before I introduced the orphan cleanup regression, orphan cleanup on a
subvol was triggered by the first access. Your previous orphan cleanup
fix broke this rule: orphan cleanup is now triggered the first time
btrfs_lookup finds a valid inode. I think it's better to keep the old
rule: revert your previous changes and add code to open_ctree() to do
orphan cleanup for the default subvol.

Regards
Yan, Zheng
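The skip logic Josef describes (lower the search key to "inode number - 1" when the orphan is already on the in-memory list) can be sketched outside the kernel. Everything here is an assumption made for illustration: demo_orphan, demo_search and demo_orphan_cleanup are hypothetical, the orphan items live in a plain array rather than a btree, and "cleaning up" an orphan just clears its on_disk flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical orphan item; not a btrfs structure. */
struct demo_orphan {
	uint64_t ino;
	bool on_disk;	/* orphan item still present in the tree */
	bool in_memory;	/* inode was orphaned after mount and is still live */
};

/* Index of the present item with the largest ino <= key, or -1 if none. */
int demo_search(const struct demo_orphan *items, size_t n, uint64_t key)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (!items[i].on_disk || items[i].ino > key)
			continue;
		if (best < 0 || items[i].ino > items[best].ino)
			best = (int)i;
	}
	return best;
}

/* Clean up stale orphans, skipping live ones; returns how many were cleaned. */
size_t demo_orphan_cleanup(struct demo_orphan *items, size_t n)
{
	uint64_t key = UINT64_MAX;	/* start from the highest possible key */
	size_t cleaned = 0;

	assert(items != NULL || n == 0);
	for (;;) {
		int i = demo_search(items, n, key);

		if (i < 0)
			break;
		if (items[i].in_memory) {
			/* Live orphan: step the key past it so it is not found again. */
			if (items[i].ino == 0)
				break;
			key = items[i].ino - 1;
			continue;
		}
		/* Stale orphan left over from a crash: remove its item. */
		items[i].on_disk = false;
		cleaned++;
	}
	return cleaned;
}
```

Without the key adjustment, the search would keep returning the same live entry, because nothing removes it from the tree. Yan's objection in the reply is about where cleanup is triggered, not about this loop.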
Re: [RFC] make btrfs-image work
On Wed, Jan 20, 2010 at 12:04 AM, Josef Bacik jo...@redhat.com wrote:
> Hello,
>
> btrfs-image would be very helpful for debugging some users' problems
> that we can't reproduce ourselves, but every image that I try to
> re-create with btrfs-image makes btrfs panic. This is because we zero
> out the superblock's chunk array and re-create our uuid. This means that
> we end up not being able to read the chunk tree on mount, and then even
> if we could, the uuids of the metadata we read back wouldn't match the
> uuid of the device. The way I've fixed this is to just spit the metadata
> back onto the disk exactly the way we got it. The caveat to this, I
> think, is that if we try to image a multi-device setup, it won't work
> right unless we have a multi-device setup to restore the image onto. I'm
> not sure if that's the goal or not. This patch makes the single disk
> case work fine for me. Let me know what you think. Thanks,

The goal of btrfs-image is to create an image that can be examined by
btrfsck and btrfs-debug-tree. btrfs-image creates a metadata image of
btrfs' logical address space. So your patch only works for the uncommon
case where a btrfs logical address maps to the same offset on the device.

Yan, Zheng
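Yan's point about logical versus device addresses can be made concrete with a simplified chunk lookup. This is a sketch under strong assumptions: demo_chunk and demo_map_logical are hypothetical, and each chunk is modelled as one contiguous range on a single device, whereas real btrfs chunks may be striped or mirrored across several devices.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-device, single-stripe chunk mapping. */
struct demo_chunk {
	uint64_t logical;	/* start of the range in the logical address space */
	uint64_t length;	/* size of the range in bytes */
	uint64_t physical;	/* where the range starts on the device */
};

/*
 * Translate a logical address to a device offset. Returns false when no
 * chunk covers the address; *physical is left untouched in that case.
 */
bool demo_map_logical(const struct demo_chunk *chunks, size_t n,
		      uint64_t logical, uint64_t *physical)
{
	assert(physical != NULL);

	for (size_t i = 0; i < n; i++) {
		const struct demo_chunk *c = &chunks[i];

		if (logical >= c->logical && logical - c->logical < c->length) {
			*physical = c->physical + (logical - c->logical);
			return true;
		}
	}
	return false;
}
```

Writing a block back at "device offset == logical address" is only correct for chunks where physical equals logical, such as the first chunk in the test data; for any other chunk the block lands in the wrong place.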
[PATCH] btrfs: Fix race in btrfs_mark_extent_written
Fix bug reported by Johannes Hirte. The reason of that bug is btrfs_del_items is called after btrfs_duplicate_item and btrfs_del_items triggers tree balance. The fix is check that case and call btrfs_search_slot when needed. Signed-off-by: Yan Zheng zheng@oracle.com --- diff -urp 1/fs/btrfs/file.c 2/fs/btrfs/file.c --- 1/fs/btrfs/file.c 2009-12-28 12:23:42.081546961 +0800 +++ 2/fs/btrfs/file.c 2010-01-11 13:02:08.082735125 +0800 @@ -506,7 +506,8 @@ next_slot: } static int extent_mergeable(struct extent_buffer *leaf, int slot, - u64 objectid, u64 bytenr, u64 *start, u64 *end) + u64 objectid, u64 bytenr, u64 orig_offset, + u64 *start, u64 *end) { struct btrfs_file_extent_item *fi; struct btrfs_key key; @@ -522,6 +523,7 @@ static int extent_mergeable(struct exten fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item); if (btrfs_file_extent_type(leaf, fi) != BTRFS_FILE_EXTENT_REG || btrfs_file_extent_disk_bytenr(leaf, fi) != bytenr || + btrfs_file_extent_offset(leaf, fi) != key.offset - orig_offset || btrfs_file_extent_compression(leaf, fi) || btrfs_file_extent_encryption(leaf, fi) || btrfs_file_extent_other_encoding(leaf, fi)) @@ -561,6 +563,7 @@ int btrfs_mark_extent_written(struct btr u64 split; int del_nr = 0; int del_slot = 0; + int recow; int ret; btrfs_drop_extent_cache(inode, start, end - 1, 0); @@ -568,6 +571,7 @@ int btrfs_mark_extent_written(struct btr path = btrfs_alloc_path(); BUG_ON(!path); again: + recow = 0; split = start; key.objectid = inode-i_ino; key.type = BTRFS_EXTENT_DATA_KEY; @@ -591,12 +595,60 @@ again: bytenr = btrfs_file_extent_disk_bytenr(leaf, fi); num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi); orig_offset = key.offset - btrfs_file_extent_offset(leaf, fi); + memcpy(new_key, key, sizeof(new_key)); + + if (start == key.offset end extent_end) { + other_start = 0; + other_end = start; + if (extent_mergeable(leaf, path-slots[0] - 1, +inode-i_ino, bytenr, orig_offset, +other_start, other_end)) { + new_key.offset = end; + 
btrfs_set_item_key_safe(trans, root, path, new_key); + fi = btrfs_item_ptr(leaf, path-slots[0], + struct btrfs_file_extent_item); + btrfs_set_file_extent_num_bytes(leaf, fi, + extent_end - end); + btrfs_set_file_extent_offset(leaf, fi, +end - orig_offset); + fi = btrfs_item_ptr(leaf, path-slots[0] - 1, + struct btrfs_file_extent_item); + btrfs_set_file_extent_num_bytes(leaf, fi, + end - other_start); + btrfs_mark_buffer_dirty(leaf); + goto out; + } + } + + if (start key.offset end == extent_end) { + other_start = end; + other_end = 0; + if (extent_mergeable(leaf, path-slots[0] + 1, +inode-i_ino, bytenr, orig_offset, +other_start, other_end)) { + fi = btrfs_item_ptr(leaf, path-slots[0], + struct btrfs_file_extent_item); + btrfs_set_file_extent_num_bytes(leaf, fi, + start - key.offset); + path-slots[0]++; + new_key.offset = start; + btrfs_set_item_key_safe(trans, root, path, new_key); + + fi = btrfs_item_ptr(leaf, path-slots[0], + struct btrfs_file_extent_item); + btrfs_set_file_extent_num_bytes(leaf, fi, + other_end - start); + btrfs_set_file_extent_offset(leaf, fi, +start - orig_offset); + btrfs_mark_buffer_dirty(leaf); + goto out; + } + } while (start key.offset || end extent_end) { if (key.offset == start) split = end; - memcpy(new_key, key, sizeof(new_key)); new_key.offset = split; ret = btrfs_duplicate_item(trans, root, path, new_key); if (ret == -EAGAIN