Re: [PATCH v3 0/8] Support for transparent PUD pages for DAX files
On 01/08/2016 11:49 AM, Matthew Wilcox wrote:
> From: Matthew Wilcox
>
> Andrew, I think this is ready for a spin in -mm.
>
> v3: Rebased against current mmotm
> v2: Reduced churn in filesystems by switching to ->huge_fault interface
>     Addressed concerns from Kirill
>
> We have customer demand to use 1GB pages to map DAX files.  Unlike the 2MB
> page support, the Linux MM does not currently support PUD pages, so I have
> attempted to add support for the necessary pieces for DAX huge PUD pages.
>
> Filesystems still need work to allocate 1GB pages.  With ext4, I can
> only get 16MB of contiguous space, although it is aligned.  With XFS,
> I can get 80MB less than 1GB, and it's not aligned.  The XFS problem
> may be due to the small amount of RAM in my test machine.
>

I don't think ext4 can do 1G at this time, due to the extent length bits
(15 for unwritten) and the block group size boundary (well, with flex_bg we
may be able to relax this). I have seen about 125M of contiguous space
allocated on my fresh new ext4 filesystem. I do remember mballoc in ext4
used to normalize the allocation request up to 8 or 16M, but it appears
not to be that small any more.

Thanks,
Mingming

> This patch set is against something approximately current -mm.  I'd like
> to thank Dave Chinner & Kirill Shutemov for their reviews of v1.
> The conversion of pmd_fault & pud_fault to huge_fault is thanks to
> Dave's poking, and Kirill spotted a couple of problems in the MM code.
> Version 2 of the patch set is about 200 lines smaller (1016 insertions,
> 23 deletions in v1).
>
> I've done some light testing using a program to mmap a block device
> with DAX enabled, calling mincore() and examining /proc/smaps and
> /proc/pagemap.
>
> Matthew Wilcox (8):
>   mm: Convert an open-coded VM_BUG_ON_VMA
>   mm,fs,dax: Change ->pmd_fault to ->huge_fault
>   mm: Add support for PUD-sized transparent hugepages
>   mincore: Add support for PUDs
>   procfs: Add support for PUDs to smaps, clear_refs and pagemap
>   x86: Add support for PUD-sized transparent hugepages
>   dax: Support for transparent PUD pages
>   ext4: Support for PUD-sized transparent huge pages
>
>  Documentation/filesystems/dax.txt     |  12 +-
>  arch/Kconfig                          |   3 +
>  arch/x86/Kconfig                      |   1 +
>  arch/x86/include/asm/paravirt.h       |  11 ++
>  arch/x86/include/asm/paravirt_types.h |   2 +
>  arch/x86/include/asm/pgtable.h        |  94
>  arch/x86/include/asm/pgtable_64.h     |  13 ++
>  arch/x86/kernel/paravirt.c            |   1 +
>  arch/x86/mm/pgtable.c                 |  31
>  fs/block_dev.c                        |  10 +-
>  fs/dax.c                              | 272 +-
>  fs/ext2/file.c                        |  27 +---
>  fs/ext4/file.c                        |  60 +++-
>  fs/proc/task_mmu.c                    | 109 ++
>  fs/xfs/xfs_file.c                     |  25 ++--
>  fs/xfs/xfs_trace.h                    |   2 +-
>  include/asm-generic/pgtable.h         |  62 +++-
>  include/asm-generic/tlb.h             |  14 ++
>  include/linux/dax.h                   |  17 ---
>  include/linux/huge_mm.h               |  50 +++
>  include/linux/mm.h                    |  43 +-
>  include/linux/mmu_notifier.h          |  13 ++
>  include/linux/pfn_t.h                 |   8 +
>  mm/huge_memory.c                      | 151 +++
>  mm/memory.c                           | 101 +++--
>  mm/mincore.c                          |  13 ++
>  mm/pagewalk.c                         |  19 ++-
>  mm/pgtable-generic.c                  |  14 ++
>  28 files changed, 980 insertions(+), 198 deletions(-)
>
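
To put rough numbers on the ext4 limits mentioned in the reply above, here is a small C sketch. It assumes 4 KiB filesystem blocks (the thread does not state the block size) and the usual ext4 on-disk limits: a 16-bit extent length field in which an initialized extent covers at most 2^15 blocks and an unwritten one covers one block fewer, and one bitmap block per block group. The `model_` helper names are made up for illustration and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Largest extent in blocks: 2^15 when initialized, 2^15 - 1 when unwritten. */
static uint64_t model_max_extent_blocks(int unwritten)
{
	const uint64_t init_max = 1ULL << 15;

	return unwritten ? init_max - 1 : init_max;
}

/* One bitmap block tracks 8 * block_size blocks, which bounds a block group. */
static uint64_t model_blocks_per_group(uint64_t block_size)
{
	return 8 * block_size;
}

/* Minimum number of extents needed to cover `bytes` of file data. */
static uint64_t model_min_extents(uint64_t bytes, uint64_t block_size,
				  int unwritten)
{
	uint64_t blocks = (bytes + block_size - 1) / block_size;
	uint64_t max = model_max_extent_blocks(unwritten);

	return (blocks + max - 1) / max;
}
```

Under these assumptions a single extent and a single block group both top out at 128 MiB, so a 1 GiB PUD mapping would need at least eight physically adjacent extents spanning several block groups. That is consistent with the roughly 125M of contiguous space reported above.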
Re: [PATCH resend] ext2/3/4: convert byte order of constant instead of variable
On Thu, 2008-02-14 at 14:20 -0800, Andrew Morton wrote:
> On Sun, 10 Feb 2008 11:10:15 +0100
> Marcin Slusarz <[EMAIL PROTECTED]> wrote:
>
> >  fs/ext2/super.c |    8 +++-
> >  fs/ext3/super.c |    2 +-
> >  fs/ext4/super.c |    2 +-
>
> Please don't bundle the filesystem patches in this manner.  I split
> it into three patches.
>

Andrew, Ted, I added the ext4 patch to the ext4 patch queue.

Regards,
Mingming
Re: [PATCH] ext4 can fail badly when device stops accepting BIO_RW_BARRIER requests.
On Wed, 2008-02-06 at 22:25 -0600, Dave Kleikamp wrote:
> Duplicating Neil Brown's jbd patch for jbd2.  I guess this can go
> through the ext4 queue rather than straight into -mm.
>

Checked-in. Thanks Shaggy and Neil.

Mingming

> Neil's text:
>
> Some devices - notably dm and md - can change their behaviour in
> response to BIO_RW_BARRIER requests.  They might start out accepting
> such requests but on reconfiguration, they find out that they cannot
> any more.
>
> ext3 (and other filesystems) deal with this by always testing if
> BIO_RW_BARRIER requests fail with EOPNOTSUPP, and retrying the write
> requests without the barrier (probably after waiting for any pending
> writes to complete).
>
> However there is a bug in the handling for this for ext3.
>
> When ext3 (jbd actually) decides to submit a BIO_RW_BARRIER request,
> it sets the buffer_ordered flag on the buffer head.
> If the request completes successfully, the flag STAYS SET.
>
> Other code might then write the same buffer_head after the device has
> been reconfigured to not accept barriers.  This write will then fail,
> but the "other code" is not ready to handle EOPNOTSUPP errors and the
> error will be treated as fatal.
>
> This can be seen without having to reconfigure a device at exactly the
> wrong time by putting:
>
> 	if (buffer_ordered(bh))
> 		printk("OH DEAR, and ordered buffer\n");
>
> in the while loop in "commit phase 5" of journal_commit_transaction.
>
> If it ever prints the "OH DEAR ..." message (as it does sometimes for
> me), then that request could (in different circumstances) have failed
> with EOPNOTSUPP, but that isn't tested for.
>
> My proposed fix is to clear the buffer_ordered flag after it has been
> used, as in the following patch.
>
> Thanks,
> NeilBrown
>
> Signed-off-by: Dave Kleikamp <[EMAIL PROTECTED]>
>
> diff -Nurp linux-2.6.24-mm1/fs/jbd2/commit.c linux/fs/jbd2/commit.c
> --- linux-2.6.24-mm1/fs/jbd2/commit.c	2008-02-04 09:08:44.0 -0600
> +++ linux/fs/jbd2/commit.c	2008-02-06 22:11:14.0 -0600
> @@ -148,6 +148,8 @@ static int journal_submit_commit_record(
>  		barrier_done = 1;
>  	}
>  	ret = submit_bh(WRITE, bh);
> +	if (barrier_done)
> +		clear_buffer_ordered(bh);
>
>  	/* is it possible for another commit to fail at roughly
>  	 * the same time as this one?  If so, we don't want to
> @@ -166,7 +168,6 @@ static int journal_submit_commit_record(
>  		spin_unlock(&journal->j_state_lock);
>
>  		/* And try again, without the barrier */
> -		clear_buffer_ordered(bh);
>  		set_buffer_uptodate(bh);
>  		set_buffer_dirty(bh);
>  		ret = submit_bh(WRITE, bh);
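
The stale-flag problem Neil describes can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `model_dev`, `model_bh` and the `model_*` functions are hypothetical stand-ins, not the kernel's buffer_head or block layer, and the "device" simply rejects any write whose ordered flag is set once it has stopped accepting barriers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct model_dev { bool accepts_barriers; };	/* hypothetical device */
struct model_bh  { bool ordered; };		/* hypothetical buffer head */

/* A write carrying the ordered flag fails once barriers are unsupported. */
static int model_submit(const struct model_dev *dev, const struct model_bh *bh)
{
	if (bh->ordered && !dev->accepts_barriers)
		return -EOPNOTSUPP;
	return 0;
}

/* Commit-record write; clear_after_submit selects the patched behaviour. */
static int model_commit_write(const struct model_dev *dev, struct model_bh *bh,
			      bool clear_after_submit)
{
	int ret;

	bh->ordered = true;			/* request a barrier write */
	ret = model_submit(dev, bh);
	if (clear_after_submit)
		bh->ordered = false;		/* patched: always drop the flag */
	if (ret == -EOPNOTSUPP) {
		bh->ordered = false;		/* retry without the barrier */
		ret = model_submit(dev, bh);
	}
	return ret;
}

/* Commit succeeds, the device is reconfigured, then other code writes bh. */
static int model_scenario(bool clear_after_submit)
{
	struct model_dev dev = { .accepts_barriers = true };
	struct model_bh bh = { .ordered = false };

	if (model_commit_write(&dev, &bh, clear_after_submit) != 0)
		return 1;			/* not expected in this model */
	dev.accepts_barriers = false;		/* e.g. dm/md reconfiguration */
	return model_submit(&dev, &bh);		/* later write by "other code" */
}
```

In this model the unpatched flow leaves the flag set after a successful barrier write, so the later plain write fails with EOPNOTSUPP; clearing the flag right after the submit, as the patch does, lets the later write go through.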
Re: [PATCH] jbd: fix assertion failure in journal_next_log_block
On Tue, 2008-02-05 at 13:59 -0500, Josef Bacik wrote:
> On Tuesday 05 February 2008 12:27:31 pm Jan Kara wrote:
> >   Hello,
> >
> >   Sorry for replying a bit late but I'm currently falling behind in
> > mailing-list reading...
> >
> > > The way jbd tries to determine if there is enough space left on the
> > > journal in order to start a new transaction is looking at the space left
> > > in the journal and the space needed for the committing transaction, which
> > > is 1/4 of the journal + the number of t_outstanding_credits for that
> > > transaction.  In this case it's assumed that t_outstanding_credits
> > > accurately represents the number of credits
> >
> >   Yes.
> >
> > > waiting to be written to the journal, but this sometimes isn't the case.
> > > The transaction has two counters for buffers, t_outstanding_credits which
> > > is used in conjunction with handles that are added to the transaction,
> > > and t_nr_buffers which is only incremented/decremented when buffers are
> > > added/removed from the transaction and are actually destined to be
> > > journaled.  Normally these two
> >
> >   t_nr_buffers actually represents number of buffers on BJ_Metadata list
> > and nobody uses it (except for the assertion in
> > __journal_temp_unlink_buffer()). t_outstanding_credits is supposed to be
> > *the* counter making sure we don't write more than we have credits for.
> >
> > > counters are the same, however there are cases where the committing
> > > transaction can have buffers moved to the next running transaction, for
> > > example any buffers on the committing transaction's t_reserved list would
> > > be moved to the next (running) transaction, and if it had been dirtied in
> > > the process it would immediately make it onto the t_updates list, which
> > > would increment t_nr_buffers
> >
> >   You probably mean t_buffers list here...
> >
> > > but not t_outstanding_credits.  So you get into this situation where
> >
> >   But which moving and dirtying do you mean? The caller which dirties
> > the buffer must make sure that he has acquired enough credits for the
> > transaction where the buffer ends up... So if there were not enough
> > buffers in the running transaction where we refiled the buffer it is a
> > bug in the caller which dirties the buffer.
> >
>
> You know, now that you say that I feel like an idiot. You are right, the only
> way for something to actually end up on that list was if somebody dirtied it,
> and if they did it would have had to be accounted for at some point on the
> running transaction.
>
> > > t_nr_buffers (the actual number of buffers that are on the transaction)
> > > is greater than the number of buffers accounted for via
> > > t_outstanding_credits.  This presents a problem since as we loop through
> > > writing buffers to the journal, we decrement t_outstanding_credits, and
> > > if t_nr_buffers is more than t_outstanding_credits then we end up with a
> > > negative number for
> > > t_outstanding_credits, which means we start saying we need less than 1/4
> > > of the journal for our committing transaction and allow more transactions
> > > than we can handle to start, and then bam we fail because
> > > journal_next_log_block doesn't have enough free blocks in order to handle
> > > the request.  This has been tested and fixes the issue (which could not
> > > be reproduced by me but several other people could get it to reproduce
> > > using postmark), and although I couldn't reproduce the assertion, I could
> > > very easily reproduce the situation where t_outstanding_credits was less
> > > than t_nr_buffers.
> >
> >   I suppose you see the assertion J_ASSERT(journal->j_free > 1); to
> > fail, right? I don't see how your patch could help avoid that assertion.
> > You've just removed accounting of t_outstanding_credits which has no
> > impact on the real number of free blocks in the journal stored in
> > j_free. Anyway, if you can reproduce t_outstanding_credits <
> > t_nr_buffers, then there's something fishy. Are you able to reproduce it
> > also with a current kernel?
> >   Thanks for looking into the problem :)
> >
>
> Well, my patch helped avoid the assertion because t_outstanding_credits was
> going negative, therefore we were letting transactions start when we shouldn't
> be, and eventually we would end up with too much of the journal in use and we'd
> assert.  Of course I can't reproduce where t_outstanding_credits < t_nr_buffers
> upstream (again I feel like an idiot, should have tested that first).  Thanks
> for looking at this Jan.
>
> Mingming, would you mind pulling this patch out of the patch queue please,
> since it's wrong?  Thanks much,
>

Sure, done!

Mingming
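
The space check being argued about can be illustrated with a small userspace sketch. It mirrors the shape of jbd_space_needed() as quoted elsewhere in this thread, but the `model_` structs and the `model_has_room` helper are simplified assumptions for illustration, not the kernel's journal_t or its admission logic. Note that the thread concludes the undercount could not be reproduced upstream, so this only shows why a negative counter would be harmful, not that it occurs.

```c
#include <assert.h>
#include <stddef.h>

struct model_transaction {
	int t_outstanding_credits;	/* credits still reserved by handles */
};

struct model_journal {
	int j_max_transaction_buffers;	/* roughly 1/4 of the journal */
	struct model_transaction *j_committing_transaction;
};

/* Space to keep free: one full transaction plus what the commit still needs. */
static int model_space_needed(const struct model_journal *journal)
{
	int nblocks = journal->j_max_transaction_buffers;

	if (journal->j_committing_transaction)
		nblocks += journal->j_committing_transaction->t_outstanding_credits;
	return nblocks;
}

/* Hypothetical admission check: may a new transaction start right now? */
static int model_has_room(const struct model_journal *journal, int free_blocks)
{
	return free_blocks >= model_space_needed(journal);
}
```

If the credit counter ever went negative, `model_space_needed()` would return less than `j_max_transaction_buffers`, and the admission check would let a transaction start with less free space than a full transaction requires.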
Re: [PATCH] jbd: fix assertion failure in journal_next_log_block
On Thu, 2008-01-31 at 16:52 -0500, Josef Bacik wrote:
> On Thu, Jan 31, 2008 at 12:35:43PM -0700, Andreas Dilger wrote:
> > On Jan 31, 2008  11:14 -0500, Josef Bacik wrote:
> > [snip excellent analysis]
> > > So you get into this situation where
> > > t_nr_buffers (the actual number of buffers that are on the transaction) is
> > > greater than the number of buffers accounted for via t_outstanding_credits.
> > > This presents a problem since as we loop through writing buffers to the
> > > journal, we decrement t_outstanding_credits, and if t_nr_buffers is more than
> > > t_outstanding_credits then we end up with a negative number for
> > > t_outstanding_credits
> > >
> > > Signed-off-by: Josef Bacik <[EMAIL PROTECTED]>
> >
> > Do you know what kernel this problem was introduced in, or is this a
> > long standing problem?  Presumably the same is needed for jbd2?
> >
> > Once we have some decent amount of testing going on with ext4, I think
> > it makes sense to merge the jbd2 changes back into jbd and return to
> > a single code base, since there is nothing in the jbd2 code that ext3
> > can't also work with (i.e. all of the changes are properly isolated
> > with compatibility flags and such).
> >
> > > @@ -1056,7 +1056,7 @@ static inline int jbd_space_needed(journal_t *journal)
> > >  	int nblocks = journal->j_max_transaction_buffers;
> > >  	if (journal->j_committing_transaction)
> > >  		nblocks += journal->j_committing_transaction->
> > > -							t_outstanding_credits;
> > > +							t_nr_buffers;
> >
> > (trivial) this can be moved back onto the previous line.
> >
> > > @@ -1168,7 +1168,7 @@ static inline int jbd_space_needed(journal_t *journal)
> > >  	int nblocks = journal->j_max_transaction_buffers;
> > >  	if (journal->j_committing_transaction)
> > >  		nblocks += journal->j_committing_transaction->
> > > -							t_outstanding_credits;
> > > +							t_nr_buffers;
> >
> > Same...
> >
>
> The original issue was reported on RHEL4, so that's 2.6.9, and looking through
> the old-bkcvs git tree I can't see where this was introduced, so it has
> probably existed since before that.  The same problem looks to exist in jbd2,
> though I haven't tested it myself; I just went ahead and included the fixes.
> Here is the updated patch, thanks much for the comments.
>

Added to ext4 patch queue.

Thanks,
Mingming

> Signed-off-by: Josef Bacik <[EMAIL PROTECTED]>
>
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index 31853eb..e385a5c 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -561,13 +561,6 @@ void journal_commit_transaction(journal_t *journal)
>  			continue;
>  		}
>
> -		/*
> -		 * start_this_handle() uses t_outstanding_credits to determine
> -		 * the free space in the log, but this counter is changed
> -		 * by journal_next_log_block() also.
> -		 */
> -		commit_transaction->t_outstanding_credits--;
> -
>  		/* Bump b_count to prevent truncate from stumbling over
>  		   the shadowed buffer!  @@@ This can go if we ever get
>  		   rid of the BJ_IO/BJ_Shadow pairing of buffers. */
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 4f302d2..c0f93f5 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -580,7 +580,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  	stats.u.run.rs_logging = jiffies;
>  	stats.u.run.rs_flushing = jbd2_time_diff(stats.u.run.rs_flushing,
>  						 stats.u.run.rs_logging);
> -	stats.u.run.rs_blocks = commit_transaction->t_outstanding_credits;
> +	stats.u.run.rs_blocks = commit_transaction->t_nr_buffers;
>  	stats.u.run.rs_blocks_logged = 0;
>
>  	descriptor = NULL;
> @@ -655,13 +655,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  			continue;
>  		}
>
> -		/*
> -		 * start_this_handle() uses t_outstanding_credits to determine
> -		 * the free space in the log, but this counter is changed
> -		 * by jbd2_journal_next_log_block() also.
> -		 */
> -		commit_transaction->t_outstanding_credits--;
> -
>  		/* Bump b_count to prevent truncate from stumbling over
>  		   the shadowed buffer!  @@@ This can go if we ever get
>  		   rid of the BJ_IO/BJ_Shadow pairing of buffers. */
> diff --git a/include/linux/jbd.h b/include/linux/jbd.h
> index d9ecd13..eaeb3db 100644
> --- a/include/linux/jbd.h
> +++ b/include/linux/jbd.h
> @@ -1055,8 +1055,7 @@ static inline int jbd_space_needed(journal_t *journal)
>  {
>  	int nblocks = journal->j_max_transaction_buffers;
>  	if (journal->j_committing_transaction)
> -		nblocks += journal->j_committing_transaction->
> -							t_outstanding_credits;
> +		nblocks += journal->j_committing_transaction->t_nr_buffers;
>  	return nblocks;
>  }
> diff --git
Re: [patch 12/26] mount options: fix ext4
On Thu, 2008-01-24 at 20:33 +0100, Miklos Szeredi wrote:
> plain text document attachment (ext4_opts.patch)
> From: Miklos Szeredi <[EMAIL PROTECTED]>
>
> Add stripe= option to /proc/mounts for ext4 filesystems.
>
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
> ---
>
> Index: linux/fs/ext4/super.c
> ===================================================================
> --- linux.orig/fs/ext4/super.c	2008-01-23 12:57:07.0 +0100
> +++ linux/fs/ext4/super.c	2008-01-23 21:43:51.0 +0100
> @@ -742,7 +742,8 @@ static int ext4_show_options(struct seq_
>  		seq_puts(seq, ",nomballoc");
>  	if (!test_opt(sb, DELALLOC))
>  		seq_puts(seq, ",nodelalloc");
> -
> +	if (sbi->s_stripe)
> +		seq_printf(seq, ",stripe=%lu", sbi->s_stripe);
>
>  	/*
>  	 * journal mode get enabled in different ways
>

Added to ext4 patch queue. Thanks!

http://repo.or.cz/w/ext4-patch-queue.git

Mingming
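
As a rough userspace illustration of what the patch changes in the /proc/mounts output, the sketch below builds the option suffix with snprintf() instead of the kernel's seq_file helpers. `model_show_options` and its parameters are hypothetical; only the ",nodelalloc" and ",stripe=%lu" strings come from the quoted hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Append the option strings from the quoted hunk to buf.
 * Returns the number of characters the full string needs.
 */
static size_t model_show_options(char *buf, size_t len, bool delalloc,
				 unsigned long stripe)
{
	size_t off = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';
	if (!delalloc)
		off += (size_t)snprintf(buf + off, len - off, ",nodelalloc");
	/* New with the patch: report a non-zero stripe setting. */
	if (stripe && off < len)
		off += (size_t)snprintf(buf + off, len - off, ",stripe=%lu",
					stripe);
	return off;
}
```

A zero stripe value prints nothing, which matches the `if (sbi->s_stripe)` guard in the patch, so mounts without a stripe setting keep their existing option string.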
Re: [PATCH 33/49] ext4: Add the journal checksum feature
al_t *journal,
> >  			struct recovery_info *info, enum passtype pass)
> >  {
> > @@ -328,6 +360,7 @@ static int do_one_pass(journal_t *journal,
> >  	unsigned int		sequence;
> >  	int			blocktype;
> >  	int			tag_bytes = journal_tag_bytes(journal);
> > +	__u32			crc32_sum = ~0; /* Transactional Checksums */
> >
> >  	/* Precompute the maximum metadata descriptors in a descriptor block */
> >  	int			MAX_BLOCKS_PER_DESC;
> > @@ -419,9 +452,23 @@ static int do_one_pass(journal_t *journal,
> >  		switch(blocktype) {
> >  		case JBD2_DESCRIPTOR_BLOCK:
> >  			/* If it is a valid descriptor block, replay it
> > -			 * in pass REPLAY; otherwise, just skip over the
> > -			 * blocks it describes. */
> > +			 * in pass REPLAY; if journal_checksums enabled, then
> > +			 * calculate checksums in PASS_SCAN, otherwise,
> > +			 * just skip over the blocks it describes. */
> >  			if (pass != PASS_REPLAY) {
> > +				if (pass == PASS_SCAN &&
> > +				    JBD2_HAS_COMPAT_FEATURE(journal,
> > +					    JBD2_FEATURE_COMPAT_CHECKSUM) &&
> > +				    !info->end_transaction) {
> > +					if (calc_chksums(journal, bh,
> > +							&next_log_block,
> > +							&crc32_sum)) {
>
> put_bh()
>
> > +						brelse(bh);
> > +						break;
> > +					}
> > +					brelse(bh);
> > +					continue;
>
> put_bh()
>
> > +				}
> >  				next_log_block += count_tags(journal, bh);
> >  				wrap(journal, next_log_block);
> >  				brelse(bh);
> > @@ -516,9 +563,96 @@ static int do_one_pass(journal_t *journal,
> >  					continue;
> >
> > +				brelse(bh);
>
> etc
>

Thanks,

Updated patch below:

ext4: Add the journal checksum feature

From: Girish Shilamkar <[EMAIL PROTECTED]>

The journal checksum feature adds two new flags, i.e.
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM.

JBD2_FEATURE_CHECKSUM flag indicates that the commit block contains the
checksum for the blocks described by the descriptor blocks.
Due to checksums, writing of the commit record no longer needs to be
synchronous. Now the commit record can be sent to disk without waiting for
descriptor blocks to be written to disk. This behavior is controlled
using the JBD2_FEATURE_ASYNC_COMMIT flag. Older kernels/e2fsck should not be
able to recover the journal with _ASYNC_COMMIT, hence it is made incompat.
The commit header has been extended to hold the checksum along with the
type of the checksum.

During recovery, checksums are verified in the scan pass to ensure the
sanity and completeness (in case of _ASYNC_COMMIT) of every transaction.

Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>
Signed-off-by: Girish Shilamkar <[EMAIL PROTECTED]>
Signed-off-by: Dave Kleikamp <[EMAIL PROTECTED]>
Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---

 Documentation/filesystems/ext4.txt |   10 +
 fs/Kconfig                         |    1
 fs/ext4/super.c                    |   25
 fs/jbd2/commit.c                   |  198 +++--
 fs/jbd2/journal.c                  |   26
 fs/jbd2/recovery.c                 |  151 ++--
 include/linux/ext4_fs.h            |    3
 include/linux/jbd2.h               |   36 +-
 8 files changed, 388 insertions(+), 62 deletions(-)

Index: linux-2.6.24-rc8/Documentation/filesystems/ext4.txt
===================================================================
--- linux-2.6.24-rc8.orig/Documentation/filesystems/ext4.txt	2008-01-24 11:18:08.0 -0800
+++ linux-2.6.24-rc8/Documentation/filesystems/ext4.txt	2008-01-24 13:00:44.0 -0800
@@ -89,6 +89,16 @@ When mounting an ext4 filesystem, the fo
 extents			ext4 will use extents to address file data.  The
 			file system will no longer be mountable by ext3.
 
+journal_checksum	Enable checksumming of the journal transactions.
+			This will allow the recovery code in e2fsck and the
Re: [PATCH 33/49] ext4: Add the journal checksum feature
, +* just skip over the blocks it describes. */ if (pass != PASS_REPLAY) { + if (pass == PASS_SCAN + JBD2_HAS_COMPAT_FEATURE(journal, + JBD2_FEATURE_COMPAT_CHECKSUM) + !info-end_transaction) { + if (calc_chksums(journal, bh, + next_log_block, + crc32_sum)) { put_bh() + brelse(bh); + break; + } + brelse(bh); + continue; put_bh() + } next_log_block += count_tags(journal, bh); wrap(journal, next_log_block); brelse(bh); @@ -516,9 +563,96 @@ static int do_one_pass(journal_t *journal, continue; + brelse(bh); etc Thanks, Updated patch below: ext4: Add the journal checksum feature From: Girish Shilamkar [EMAIL PROTECTED] The journal checksum feature adds two new flags i.e JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM. JBD2_FEATURE_CHECKSUM flag indicates that the commit block contains the checksum for the blocks described by the descriptor blocks. Due to checksums, writing of the commit record no longer needs to be synchronous. Now commit record can be sent to disk without waiting for descriptor blocks to be written to disk. This behavior is controlled using JBD2_FEATURE_ASYNC_COMMIT flag. Older kernels/e2fsck should not be able to recover the journal with _ASYNC_COMMIT hence it is made incompat. The commit header has been extended to hold the checksum along with the type of the checksum. For recovery in pass scan checksums are verified to ensure the sanity and completeness(in case of _ASYNC_COMMIT) of every transaction. 
Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Girish Shilamkar [EMAIL PROTECTED] Signed-off-by: Dave Kleikamp [EMAIL PROTECTED] Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- Documentation/filesystems/ext4.txt | 10 + fs/Kconfig |1 fs/ext4/super.c| 25 fs/jbd2/commit.c | 198 +++-- fs/jbd2/journal.c | 26 fs/jbd2/recovery.c | 151 ++-- include/linux/ext4_fs.h|3 include/linux/jbd2.h | 36 +- 8 files changed, 388 insertions(+), 62 deletions(-) Index: linux-2.6.24-rc8/Documentation/filesystems/ext4.txt === --- linux-2.6.24-rc8.orig/Documentation/filesystems/ext4.txt2008-01-24 11:18:08.0 -0800 +++ linux-2.6.24-rc8/Documentation/filesystems/ext4.txt 2008-01-24 13:00:44.0 -0800 @@ -89,6 +89,16 @@ When mounting an ext4 filesystem, the fo extentsext4 will use extents to address file data. The file system will no longer be mountable by ext3. +journal_checksum Enable checksumming of the journal transactions. + This will allow the recovery code in e2fsck and the + kernel to detect corruption in the kernel. It is a + compatible change and will be ignored by older kernels. + +journal_async_commit Commit block can be written to disk without waiting + for descriptor blocks. If enabled older kernels cannot + mount the device. This will enable 'journal_checksum' + internally. + journal=update Update the ext4 file system's journal to the current format. Index: linux-2.6.24-rc8/fs/Kconfig === --- linux-2.6.24-rc8.orig/fs/Kconfig2008-01-24 11:18:08.0 -0800 +++ linux-2.6.24-rc8/fs/Kconfig 2008-01-24 11:18:55.0 -0800 @@ -236,6 +236,7 @@ config JBD_DEBUG config JBD2 tristate + select CRC32 help This is a generic journaling layer for block devices that support both 32-bit and 64-bit block numbers. 
It is currently used by Index: linux-2.6.24-rc8/fs/ext4/super.c === --- linux-2.6.24-rc8.orig/fs/ext4/super.c 2008-01-24 11:18:52.0 -0800 +++ linux-2.6.24-rc8/fs/ext4/super.c2008-01-24 13:00:45.0 -0800 @@ -869,6 +869,7 @@ enum { Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl, Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh
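As an aside on the checksum patch above: the quoted hunk seeds crc32_sum with ~0 and hands it by pointer to calc_chksums() for each descriptor block, which reads like one running checksum per transaction. The following is a minimal userspace sketch of that idea only. It assumes a bitwise, MSB-first CRC-32 with polynomial 0x04c11db7; crc32_msb_first() and checksum_blocks() are hypothetical names for illustration and are neither the patch's calc_chksums() nor the kernel's CRC routines.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise MSB-first CRC-32 (polynomial 0x04c11db7), no final inversion.
 * A userspace stand-in for illustration; an assumption, not kernel code. */
static uint32_t crc32_msb_first(uint32_t crc, const unsigned char *p,
				size_t len)
{
	while (len--) {
		crc ^= (uint32_t)*p++ << 24;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04c11db7u
						  : (crc << 1);
	}
	return crc;
}

/* Fold every block of a (hypothetical) transaction into one running sum,
 * starting from ~0 the way crc32_sum is initialised in the quoted hunk. */
static uint32_t checksum_blocks(const unsigned char *const *blocks,
				size_t nblocks, size_t block_size)
{
	uint32_t sum = ~(uint32_t)0;

	for (size_t i = 0; i < nblocks; i++)
		sum = crc32_msb_first(sum, blocks[i], block_size);
	return sum;
}
```

Because the sum is carried from block to block, the result equals the checksum of the concatenated blocks, so a recovery pass could compare a single value against the one stored in the commit block and treat a mismatch as an incomplete or corrupted transaction.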
Re: [PATCH] Shrink ext3_inode_info by 8 bytes for !POSIX_ACL.
On Sat, 2008-01-12 at 21:35 +0100, Indan Zupancic wrote:
> i_file_acl and i_dir_acl aren't always needed.
>
> With certain configs this makes 10 ext3_inode_cache objects fit in
> one slab instead of the current 9, as the size shrinks from 416 to
> 408 bytes for 32 bit, !POSIX_ACL and !EXT3_FS_XATTR configs.
>
> Signed-off-by: Indan Zupancic <[EMAIL PROTECTED]>
> ---
>  fs/ext3/ialloc.c          |    2 ++
>  fs/ext3/inode.c           |   29 +++--
>  include/linux/ext3_fs_i.h |    2 ++
>  3 files changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
> index 1bc8cd8..01745bc 100644
> --- a/fs/ext3/ialloc.c
> +++ b/fs/ext3/ialloc.c
> @@ -574,8 +574,10 @@ got:
>  	ei->i_frag_no = 0;
>  	ei->i_frag_size = 0;
>  #endif
> +#ifdef CONFIG_EXT3_FS_POSIX_ACL
>  	ei->i_file_acl = 0;
>  	ei->i_dir_acl = 0;
> +#endif

For regular file, i_dir_acl is being reused as i_size_high to support
large file.

Mingming
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
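The "10 objects instead of 9" figure quoted above can be checked with a little arithmetic. The sketch below assumes a 4096-byte slab (one page) with no per-slab management overhead or alignment padding; that is an assumption for illustration, not something stated in the thread, and objs_per_slab() is a hypothetical helper rather than a kernel function.

```c
#include <stddef.h>

/* Hypothetical helper (not a kernel API): how many objects of obj_size
 * bytes fit in a slab of slab_size bytes, ignoring any per-slab
 * management overhead and alignment padding. */
static size_t objs_per_slab(size_t slab_size, size_t obj_size)
{
	return obj_size ? slab_size / obj_size : 0;
}
```

Under these assumptions, objs_per_slab(4096, 416) gives 9 (9 * 416 = 3744, and a tenth object would need 4160 bytes), while objs_per_slab(4096, 408) gives 10 (10 * 408 = 4080), which matches the numbers in the patch description.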
[PATCH 3/4]JBD2: user of the jiffies rounding code
Ported from JBD changes from Arjan van de Ven <[EMAIL PROTECTED]>
---
Date: Sun, 10 Dec 2006 10:21:26 + (-0800)
Subject: [PATCH] user of the jiffies rounding code: JBD
X-Git-Tag: v2.6.20-rc1~15^2~43
X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=44d306e1508fef6fa7a6eb15a1aba86ef68389a6

[PATCH] user of the jiffies rounding code: JBD

This patch introduces a user: of the round_jiffies() function; the "5 second"
ext3/jbd wakeup.

While "every 5 seconds" doesn't sound as a problem, there can be many of these
(and these timers do add up over all the kernel). The "5 second" wakeup isn't
really timing sensitive; in addition even with rounding it'll still happen
every 5 seconds (with the exception of the very first time, which is likely to
be rounded up to somewhere closer to 6 seconds)

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---

---
 fs/jbd2/transaction.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.24-rc7/fs/jbd2/transaction.c
===
--- linux-2.6.24-rc7.orig/fs/jbd2/transaction.c	2008-01-16 15:41:14.0 -0800
+++ linux-2.6.24-rc7/fs/jbd2/transaction.c	2008-01-16 17:49:48.0 -0800
@@ -54,7 +54,7 @@ jbd2_get_transaction(journal_t *journal,
 	spin_lock_init(&transaction->t_handle_lock);

 	/* Set up the commit timer for the new transaction. */
-	journal->j_commit_timer.expires = transaction->t_expires;
+	journal->j_commit_timer.expires = round_jiffies(transaction->t_expires);
 	add_timer(&journal->j_commit_timer);

 	J_ASSERT(journal->j_running_transaction == NULL);
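To make the rounding described in this patch concrete, here is a simplified userspace model. It is a sketch under stated assumptions and not the kernel's round_jiffies(): it assumes 1000 ticks per second, ignores any per-CPU skew, rounds down when the expiry is less than a quarter second past a whole second and up otherwise, and falls back to the original value if rounding would move the expiry into the past. MODEL_HZ and model_round_jiffies() are hypothetical names.

```c
/* Simplified userspace model of rounding a timer expiry to a whole
 * second. MODEL_HZ and model_round_jiffies are hypothetical names. */
#define MODEL_HZ 1000UL	/* assumed ticks per second */

static unsigned long model_round_jiffies(unsigned long expires,
					 unsigned long now)
{
	unsigned long rem = expires % MODEL_HZ;
	unsigned long rounded;

	if (rem < MODEL_HZ / 4)
		rounded = expires - rem;		/* round down */
	else
		rounded = expires - rem + MODEL_HZ;	/* round up */

	/* Never return an expiry that is not in the future. */
	return rounded <= now ? expires : rounded;
}
```

In this model a transaction started at tick 300 with a 5 second timeout (expiry 5300) is rounded to tick 6000, that is 5.7 seconds after the start, which is in line with the "closer to 6 seconds" remark for the first wakeup. Timers from unrelated subsystems that are rounded the same way then tend to fire on the same tick.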
[PATCH 2/4]JBD2: Group short-lived and reclaimable kernel allocations
JBD2: Group short-lived and reclaimable kernel allocations

From: Mingming Cao <[EMAIL PROTECTED]>

Ported from JBD to JBD2

From: Mel Gorman <[EMAIL PROTECTED]>
Date: Tue, 16 Oct 2007 08:25:52 + (-0700)
Subject: Group short-lived and reclaimable kernel allocations
X-Git-Tag: v2.6.24-rc1~1137
X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=e12ba74d8ff3e2f73a583500d7095e406df4d093

Group short-lived and reclaimable kernel allocations

This patch marks a number of allocations that are either short-lived such as
network buffers or are reclaimable such as inode allocations. When something
like updatedb is called, long-lived and unmovable kernel allocations tend to
be spread throughout the address space which increases fragmentation.

This patch groups these allocations together as much as possible by adding a
new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be
reclaimed on demand, but not moved. i.e. they can be migrated by deleting
them and re-reading the information from elsewhere.

Cc: Mel Gorman <[EMAIL PROTECTED]>
Cc: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---

---
 fs/jbd2/journal.c |    4 ++--
 fs/jbd2/revoke.c  |    6 --
 2 files changed, 6 insertions(+), 4 deletions(-)

Index: linux-2.6.24-rc7/fs/jbd2/journal.c
===
--- linux-2.6.24-rc7.orig/fs/jbd2/journal.c	2008-01-16 15:02:40.0 -0800
+++ linux-2.6.24-rc7/fs/jbd2/journal.c	2008-01-16 17:40:24.0 -0800
@@ -1975,7 +1975,7 @@ static int journal_init_jbd2_journal_hea
 	jbd2_journal_head_cache = kmem_cache_create("jbd2_journal_head",
 				sizeof(struct journal_head),
 				0,		/* offset */
-				0,		/* flags */
+				SLAB_TEMPORARY,	/* flags */
 				NULL);		/* ctor */
 	retval = 0;
 	if (jbd2_journal_head_cache == 0) {
@@ -2271,7 +2271,7 @@ static int __init journal_init_handle_ca
 	jbd2_handle_cache = kmem_cache_create("jbd2_journal_handle",
 				sizeof(handle_t),
 				0,		/* offset */
-				0,		/* flags */
+				SLAB_TEMPORARY,	/* flags */
 				NULL);		/* ctor */
 	if (jbd2_handle_cache == NULL) {
 		printk(KERN_EMERG "JBD: failed to create handle cache\n");
Index: linux-2.6.24-rc7/fs/jbd2/revoke.c
===
--- linux-2.6.24-rc7.orig/fs/jbd2/revoke.c	2008-01-06 13:45:38.0 -0800
+++ linux-2.6.24-rc7/fs/jbd2/revoke.c	2008-01-16 17:40:24.0 -0800
@@ -171,13 +171,15 @@ int __init jbd2_journal_init_revoke_cach
 {
 	jbd2_revoke_record_cache = kmem_cache_create("jbd2_revoke_record",
 					   sizeof(struct jbd2_revoke_record_s),
-					   0, SLAB_HWCACHE_ALIGN, NULL);
+					   0,
+					   SLAB_HWCACHE_ALIGN|SLAB_TEMPORARY,
+					   NULL);
 	if (jbd2_revoke_record_cache == 0)
 		return -ENOMEM;

 	jbd2_revoke_table_cache = kmem_cache_create("jbd2_revoke_table",
 					   sizeof(struct jbd2_revoke_table_s),
-					   0, 0, NULL);
+					   0, SLAB_TEMPORARY, NULL);
 	if (jbd2_revoke_table_cache == 0) {
 		kmem_cache_destroy(jbd2_revoke_record_cache);
 		jbd2_revoke_record_cache = NULL;
[PATCH 4/4]JBD2: sparse pointer use of zero as null
Ported from upstream jbd changes to jbd2

sparse pointer use of zero as null

Get rid of sparse related warnings from places that use integer as NULL
pointer.

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd2/transaction.c |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux-2.6.24-rc7/fs/jbd2/transaction.c
===
--- linux-2.6.24-rc7.orig/fs/jbd2/transaction.c	2008-01-16 17:49:48.0 -0800
+++ linux-2.6.24-rc7/fs/jbd2/transaction.c	2008-01-16 18:06:33.0 -0800
@@ -1182,7 +1182,7 @@ int jbd2_journal_dirty_metadata(handle_t
 	}

 	/* That test should have eliminated the following case: */
-	J_ASSERT_JH(jh, jh->b_frozen_data == 0);
+	J_ASSERT_JH(jh, jh->b_frozen_data == NULL);

 	JBUFFER_TRACE(jh, "file as BJ_Metadata");
 	spin_lock(&journal->j_list_lock);
@@ -1532,7 +1532,7 @@ void __jbd2_journal_temp_unlink_buffer(s

 	J_ASSERT_JH(jh, jh->b_jlist < BJ_Types);
 	if (jh->b_jlist != BJ_None)
-		J_ASSERT_JH(jh, transaction != 0);
+		J_ASSERT_JH(jh, transaction != NULL);

 	switch (jh->b_jlist) {
 	case BJ_None:
@@ -1601,11 +1601,11 @@ __journal_try_to_free_buffer(journal_t *
 	if (buffer_locked(bh) || buffer_dirty(bh))
 		goto out;

-	if (jh->b_next_transaction != 0)
+	if (jh->b_next_transaction != NULL)
 		goto out;

 	spin_lock(&journal->j_list_lock);
-	if (jh->b_transaction != 0 && jh->b_cp_transaction == 0) {
+	if (jh->b_transaction != NULL && jh->b_cp_transaction == NULL) {
 		if (jh->b_jlist == BJ_SyncData || jh->b_jlist == BJ_Locked) {
 			/* A written-back ordered data buffer */
 			JBUFFER_TRACE(jh, "release data");
@@ -1613,7 +1613,7 @@ __journal_try_to_free_buffer(journal_t *
 			jbd2_journal_remove_journal_head(bh);
 			__brelse(bh);
 		}
-	} else if (jh->b_cp_transaction != 0 && jh->b_transaction == 0) {
+	} else if (jh->b_cp_transaction != NULL && jh->b_transaction == NULL) {
 		/* written-back checkpointed metadata buffer */
 		if (jh->b_jlist == BJ_None) {
 			JBUFFER_TRACE(jh, "remove from checkpoint list");
@@ -1973,7 +1973,7 @@ void __jbd2_journal_file_buffer(struct j

 	J_ASSERT_JH(jh, jh->b_jlist < BJ_Types);
 	J_ASSERT_JH(jh, jh->b_transaction == transaction ||
-				jh->b_transaction == 0);
+				jh->b_transaction == NULL);

 	if (jh->b_transaction && jh->b_jlist == jlist)
 		return;
[PATCH 1/4]jbd2: port jbd lockdep support to jbd2
Hi Andrew, Ted,

I walked through Linus's git tree history and found 4 patches that should
be ported from ext3/jbd to ext4/jbd2, covering the period from the day ext4
was forked (2006.10.11) to today. I have already queued the ported patches
in the ext4 patch queue and verified that they seem fine.

Here is the first one.

jbd2: port jbd lockdep support to jbd2

> Except lockdep doesn't know about journal_start(), which has ranking
> requirements similar to a semaphore.

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd2/transaction.c |   11 +++
 include/linux/jbd2.h  |    4
 2 files changed, 15 insertions(+)

Index: linux-2.6.24-rc7/fs/jbd2/transaction.c
===
--- linux-2.6.24-rc7.orig/fs/jbd2/transaction.c	2008-01-16 15:30:24.0 -0800
+++ linux-2.6.24-rc7/fs/jbd2/transaction.c	2008-01-16 15:41:14.0 -0800
@@ -241,6 +241,8 @@ out:
 	return ret;
 }

+static struct lock_class_key jbd2_handle_key;
+
 /* Allocate a new handle.  This should probably be in a slab... */
 static handle_t *new_handle(int nblocks)
 {
@@ -251,6 +253,9 @@ static handle_t *new_handle(int nblocks)
 	handle->h_buffer_credits = nblocks;
 	handle->h_ref = 1;

+	lockdep_init_map(&handle->h_lockdep_map, "jbd2_handle",
+						&jbd2_handle_key, 0);
+
 	return handle;
 }

@@ -293,7 +298,11 @@ handle_t *jbd2_journal_start(journal_t *
 		jbd2_free_handle(handle);
 		current->journal_info = NULL;
 		handle = ERR_PTR(err);
+		goto out;
 	}
+
+	lock_acquire(&handle->h_lockdep_map, 0, 0, 0, 2, _THIS_IP_);
+out:
 	return handle;
 }

@@ -1419,6 +1428,8 @@ int jbd2_journal_stop(handle_t *handle)
 		spin_unlock(&journal->j_state_lock);
 	}

+	lock_release(&handle->h_lockdep_map, 1, _THIS_IP_);
+
 	jbd2_free_handle(handle);
 	return err;
 }
Index: linux-2.6.24-rc7/include/linux/jbd2.h
===
--- linux-2.6.24-rc7.orig/include/linux/jbd2.h	2008-01-16 15:29:03.0 -0800
+++ linux-2.6.24-rc7/include/linux/jbd2.h	2008-01-16 15:29:54.0 -0800
@@ -418,6 +418,10 @@ struct handle_s
 	unsigned int	h_sync:		1;	/* sync-on-close */
 	unsigned int	h_jdata:	1;	/* force data journaling */
 	unsigned int	h_aborted:	1;	/* fatal error on handle */
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	h_lockdep_map;
+#endif
 };
Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs
On Fri, 2007-11-30 at 17:08 -0600, Eric Sandeen wrote:
> Mingming Cao wrote:
> > [PATCH] jbd2 stats through procfs
> >
> > The patch below updates the jbd stats patch to 2.6.20/jbd2.
> > The initial patch was posted by Alex Tomas in December 2005
> > (http://marc.info/?l=linux-ext4&m=113538565128617&w=2).
> > It provides statistics via procfs such as transaction lifetime and size.
> >
> > [ This probably should be rewritten to use debugfs?   -- Ted]
> >
> > Signed-off-by: Johann Lombardi <[EMAIL PROTECTED]>
>
> I've started going through this one to clean it up to the point where it
> can go forward.  It's been sitting at the top of the unstable portion of
> the patch queue for long enough, I think :)
>

Thanks!

> I've converted the msecs to jiffies until the user boundary, changed the
> union #defines as suggested by Andrew, and various other little issues etc.
>
> Remaining to do is a generic time-difference calculator (instead of
> jbd2_time_diff), and looking into whether it should be made a config
> option; I tend to think it should, but it's fairly well sprinkled
> through the code, so I'll see how well that works.
>
> Also did we ever decided if this should go to debugfs?
>

I thought it was decided to keep it on procfs as debugfs is not always
on...

> Thanks,
>
> -Eric
Re: [PATCH] ext4: dir inode reservation V3
On Tue, 2007-11-20 at 12:14 +0800, Coly Li wrote:
> Thanks for the feedback :-)
>
> Mingming Cao wrote:
> > On Tue, 2007-11-13 at 22:12 +0800, Coly Li wrote:
> >> Basic idea of my dir inode reservation patch can be found here,
> >> http://lists.openwall.net/linux-ext4/2007/11/05/3
> >>
> >> 1, What does dir inode reservation do
> >> Dir inode reservation tries to reserve several inodes in inodes table for
> >> a directory when this directory is created. When create new file under
> >> this directory, try to allocate inode from the reserved inodes area.
> >> This is called as dir_ireserve inode allocator.
> >>
> > Thanks for the update.
> >
> > Let me try to understand your method:
> >
> > So the basic idea is not do linear inode allocation for directory? Inode
> > structure block for directory file is only coming from block 0, N, N+N,...
> > where the number of skipped blocks N is stored in the in-core
> > superblock structure.
>
> N is not stored in in-core superblock. N = s_dir_ireserve_nr /
> inodes_per_block. What is stored in in-core superblock is number of
> inodes to be reserved for each directory.
>
> >
> > When ever need to allocate an inode for directory, skip N reserved bits
> > (space for N*16 inodes) if the previous block is already allocated. That
> > way place two directories with the hole of N*16 inodes structures, then
> > allow files under the first directory stay closer with their parent
> > directory. Is this correct?
>
> The hole is (s_dir_ireserve_nr - 1), not N * s_dir_ireserve_nr. Because
> directory inode will also use a inode slot from reserved area, rest slots
> number for files is (s_dir_ireserve_nr - 1).
> Except for the reserved inodes number, your understanding exactly matches my
> idea.
>

Ok, thanks for the clarification.

The performance gain with a large number of directories looks interesting, but I am afraid this makes the uninitialized block group feature (which speeds up fsck) less useful than before :( The reserved inodes will push the unused-inode watermark higher than before and spread directories into other block groups earlier than before. Maybe it is worth it... not sure.

> >
> >> 4, Dir inode reservation is optional
> >> Dir inode reservation is optional, you can use -o followed by one of these
> >> options to enable dir inode reservation during mount ext4 file system:
> >> dir_ireserve=low
> >> dir_ireserve=normal
> >> dir_ireserve=high
> >
> > Would be nice to pass the tuning info low/normal/high(16/64/128 blocks
> > correspondingly) via something else rather than mount options.
>
> Sure, I agree with you. Also I am thinking should this patch permit user to
> input reserved inodes number directly other than a low/normal/high. Also I am
> looking for methods to display the tuning info more convenient to users.
>

Export/tune through /procfs?

> >
> >> Currently, 'low' reserves 15 file inodes for each directory, 'normal'
> >> reserves 31 inodes and 'high' reserves 127 inodes. Reserving more than
> >> 127 inodes does not help to performance obviously.
> >>
> >>
> >> 5, Performance number
> >> On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4
> >> partition, and tried to create 5 directories, and create 15 (1KB) files
> >> in each directory alternatively. After a remount, I tried to remove all
> >> the directories and files recursively by a 'rm -rf'.
> >> Below is the benchmark result,
> >>                  normal ext4            ext4 with dir inode reservation
> >> mount options:   -o data=writeback      -o data=writeback,dir_ireserve=low
> >> Create dirs:     real 0m49.101s         real 2m59.703s
> >> Create files:    real 24m17.962s        real 21m8.161s
> >> Unlink all:      real 24m43.788s        real 17m29.862s
> >> Creating dirs with dir inode reservation is slower than normal ext4 as
> >> predicted, because allocating directory inodes in non-linear order will
> >> cause extra hard disk seeking and block I/O.
> >
> > Hmm...I suspect there is bug in your patch, the extra seek should not
> > contribute to 4 times slower
>
> I agree with you :-)
>
> >
> >> #include <linux/time.h>
> >> @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb,
> >> struct
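The arithmetic Coly describes above can be written down as a small sketch. This is not code from the patch: the struct and function names are hypothetical, and the mapping of 'low' to s_dir_ireserve_nr = 16 is only inferred from "15 file inodes" plus the one slot taken by the directory itself. The figure of 16 inodes per block comes from Mingming's "N*16" and would differ with other block or inode sizes.

```c
/*
 * Sketch of the reservation arithmetic described in the thread.
 * ireserve_layout and ireserve_layout_for are hypothetical names,
 * not identifiers from the dir_ireserve patch.
 */
struct ireserve_layout {
	unsigned int blocks_per_dir; /* N = s_dir_ireserve_nr / inodes_per_block */
	unsigned int file_slots;     /* s_dir_ireserve_nr - 1; one slot holds the directory inode */
};

/* inodes_per_block must be non-zero. */
struct ireserve_layout ireserve_layout_for(unsigned int s_dir_ireserve_nr,
					   unsigned int inodes_per_block)
{
	struct ireserve_layout l;

	l.blocks_per_dir = s_dir_ireserve_nr / inodes_per_block;
	l.file_slots = s_dir_ireserve_nr - 1;
	return l;
}
```

Under these assumptions, ireserve_layout_for(16, 16) gives one inode-table block per directory with 15 slots left for files, and ireserve_layout_for(128, 16) gives eight blocks with 127 slots.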
Re: [PATCH] ext4: dir inode reservation V3
On Tue, 2007-11-13 at 22:12 +0800, Coly Li wrote:
> Basic idea of my dir inode reservation patch can be found here,
> http://lists.openwall.net/linux-ext4/2007/11/05/3
>
> 1, What does dir inode reservation do
> Dir inode reservation tries to reserve several inodes in inodes table for a
> directory when this directory is created. When create new file under this
> directory, try to allocate inode from the reserved inodes area. This is
> called as dir_ireserve inode allocator.
>

Thanks for the update.

Let me try to understand your method:

So the basic idea is not to do linear inode allocation for directories? The inode structure block for a directory only comes from block 0, N, N+N, ..., where the number of skipped blocks N is stored in the in-core superblock structure.

Whenever an inode needs to be allocated for a directory, skip N reserved bits (space for N*16 inodes) if the previous block is already allocated. That places two directories with a hole of N*16 inode structures between them, which lets files under the first directory stay closer to their parent directory. Is this correct?

> 4, Dir inode reservation is optional
> Dir inode reservation is optional, you can use -o followed by one of these
> options to enable dir inode reservation during mount ext4 file system:
> dir_ireserve=low
> dir_ireserve=normal
> dir_ireserve=high

It would be nice to pass the tuning info low/normal/high (16/64/128 blocks correspondingly) via something other than mount options.

> Currently, 'low' reserves 15 file inodes for each directory, 'normal'
> reserves 31 inodes and 'high' reserves 127 inodes. Reserving more than 127
> inodes does not help to performance obviously.
>
>
> 5, Performance number
> On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4
> partition, and tried to create 5 directories, and create 15 (1KB) files in
> each directory alternatively. After a remount, I tried to remove all the
> directories and files recursively by a 'rm -rf'.
> Below is the benchmark result,
>                  normal ext4            ext4 with dir inode reservation
> mount options:   -o data=writeback      -o data=writeback,dir_ireserve=low
> Create dirs:     real 0m49.101s         real 2m59.703s
> Create files:    real 24m17.962s        real 21m8.161s
> Unlink all:      real 24m43.788s        real 17m29.862s
> Creating dirs with dir inode reservation is slower than normal ext4 as
> predicted, because allocating directory inodes in non-linear order will
> cause extra hard disk seeking and block I/O.

Hmm... I suspect there is a bug in your patch; the extra seeks should not make it 4 times slower.

>
> #include <linux/time.h>
> @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
>  	return -1;
>  }
>
> +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir,
> +			int mode, ext4_group_t *group, unsigned long *ino)
> +{
> +	struct super_block *sb;
> +	struct ext4_sb_info *sbi;
> +	struct ext4_group_desc *gdp = NULL;
> +	struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL;
> +	ext4_group_t ires_group = *group;
> +	unsigned long ires_ino;
> +	int i, bit;
> +
> +	sb = dir->i_sb;
> +	sbi = EXT4_SB(sb);
> +
> +	/* if the inode number is not for directory,
> +	 * only try to allocate after directory's inode
> +	 */
> +	if (!S_ISDIR(mode)) {
> +		*ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb);
> +		return 0;
> +	}
> +
> +	/* reserve inodes for new directory */
> +	for (i = 0; i < sbi->s_groups_count; i++) {
> +		gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh);
> +		if (!gdp)
> +			goto fail;
> +		bit = 0;
> +try_same_group:
> +		if (bit < EXT4_INODES_PER_GROUP(sb)) {
> +			brelse(bitmap_bh);
> +			bitmap_bh = read_inode_bitmap(sb, ires_group);
> +			if (!bitmap_bh)
> +				goto fail;
> +
> +			BUFFER_TRACE(bitmap_bh, "get_write_access");
> +			if (ext4_journal_get_write_access(
> +					handle, bitmap_bh) != 0)
> +				goto fail;
> +			if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
> +					bit, bitmap_bh->b_data)) {
> +				/* we won it */
> +				BUFFER_TRACE(bitmap_bh,
> +					"call ext4_journal_dirty_metadata");
> +				if (ext4_journal_dirty_metadata(handle,
> +						bitmap_bh) != 0)
> +					goto fail;
> +				ires_ino = bit;
> +				goto find;
> +			}
> +			/* we lost it */
> +
Re: [2.6 patch] ext4/super.c: fix #ifdef's
Acked-by: Mingming Cao <[EMAIL PROTECTED]>

Ted, I added this patch to the ext4 patch queue.

On Mon, 2007-11-05 at 18:07 +0100, Adrian Bunk wrote:
> This patch fixes the names of two variables in #ifdef's.
>
> Based on a report by Robert P. J. Day.
>
> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
>
> ---
>
>  fs/ext4/super.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> 44e9889e6a3952ea225704b2e494d31e00f34a6b
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 8031dc0..6673672 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -646,7 +646,7 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
>  		seq_puts(seq, ",debug");
>  	if (test_opt(sb, OLDALLOC))
>  		seq_puts(seq, ",oldalloc");
> -#ifdef CONFIG_EXT4_FS_XATTR
> +#ifdef CONFIG_EXT4DEV_FS_XATTR
>  	if (test_opt(sb, XATTR_USER))
>  		seq_puts(seq, ",user_xattr");
>  	if (!test_opt(sb, XATTR_USER) &&
> @@ -654,7 +654,7 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
>  		seq_puts(seq, ",nouser_xattr");
>  	}
>  #endif
> -#ifdef CONFIG_EXT4_FS_POSIX_ACL
> +#ifdef CONFIG_EXT4DEV_FS_POSIX_ACL
>  	if (test_opt(sb, POSIX_ACL))
>  		seq_puts(seq, ",acl");
>  	if (!test_opt(sb, POSIX_ACL) && (def_mount_opts & EXT4_DEFM_ACL))
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
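Why a misspelled symbol in an #ifdef goes unnoticed can be shown with a standalone sketch. The two function names below are made up for illustration, and the #define stands in for what the build system would provide when the option is enabled. The preprocessor treats an unknown macro as simply undefined, so the guarded code is dropped without any warning.

```c
/* Stand-in for the symbol the build would define when the option is on. */
#define CONFIG_EXT4DEV_FS_XATTR 1

/* Guarded by the misspelled name: the branch is silently compiled out. */
int xattr_option_shown_old_guard(void)
{
#ifdef CONFIG_EXT4_FS_XATTR
	return 1;
#else
	return 0;
#endif
}

/* Guarded by the name that is actually defined. */
int xattr_option_shown_fixed_guard(void)
{
#ifdef CONFIG_EXT4DEV_FS_XATTR
	return 1;
#else
	return 0;
#endif
}
```

With the old guard the function returns 0 even though the option is enabled, which matches the symptom the patch addresses: the xattr and ACL mount options were never shown.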
Re: [PATCH 1/2] i_version update - vfs part
On Thu, 2007-10-25 at 19:04 +0200, Cordenner jean noel wrote:
> Hi,
>
> This is an update of the previous patches on the ext4 git tree, the 2
> coming patches applies at the end of the current ext4-patch-queue, and
> replaces the inode-version related patches:
> 64-bit-i_version.patch
> i_version_hi.patch
> ext4_i_version_hi_2.patch
> i_version_update_ext4.patch
>
> The first part deals with the vfs part.
> The i_version field of the inode is changed to be a 64-bit counter that
> is set on every inode creation and that is incremented every time the
> inode data is modified (similarly to the "ctime" time-stamp).
> The aim is to fulfill a NFSv4 requirement for rfc3530.
> This first part concerns the vfs, it converts the 32-bit i_version in
> the generic inode to a 64-bit, a flag is added in the super block in
> order to check if the feature is enabled and the i_version is
> incremented in the vfs.
>

Thanks for reposting it.

> Index: linux-2.6.23-ext4-1/fs/inode.c
> ===
> --- linux-2.6.23-ext4-1.orig/fs/inode.c	2007-10-25 16:15:52.0 +0200
> +++ linux-2.6.23-ext4-1/fs/inode.c	2007-10-25 16:25:53.0 +0200
> @@ -1216,6 +1216,24 @@
>  EXPORT_SYMBOL(touch_atime);
>
>  /**
> + *	inode_inc_iversion	-	increments i_version
> + *	@inode: inode that need to be updated
> + *
> + *	Every time the inode is modified, the i_version field
> + *	will be incremented.
> + *	The filesystem has to be mounted with i_version flag
> + *
> + */
> +
> +void inode_inc_iversion(struct inode *inode)
> +{
> +	spin_lock(&inode->i_lock);
> +	inode->i_version++;
> +	spin_unlock(&inode->i_lock);
> +}

I wonder whether we really need i_lock here for the inode version update. I understand this is a 64-bit counter, but file_update_time() and ext4_mark_inode_dirty() (where the inode version is updated) are called on the file write path, so i_mutex should be held all the time. As long as the read path holds i_mutex, everything should be fine, shouldn't it? Have you had a chance to check the performance impact on ext4?

> +
> +/**
>   *	file_update_time	-	update mtime and ctime time
>   *	@file: file accessed
>   *
> @@ -1249,6 +1267,11 @@
>  		sync_it = 1;
>  	}
>
> +	if (IS_I_VERSION(inode)) {
> +		inode_inc_iversion(inode);
> +		sync_it = 1;
> +	}
> +
>  	if (sync_it)
>  		mark_inode_dirty_sync(inode);
>  }
>
>
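The gating that the file_update_time() hunk adds can be modelled in user space as below. This is only a sketch: demo_inode, DEMO_I_VERSION and demo_update_version are hypothetical stand-ins for the kernel's inode, its superblock flag test and the patched code path, and locking is left out entirely, since whether a lock is needed is exactly the question raised in the reply.

```c
#include <stdint.h>

/* Hypothetical stand-in for the superblock flag tested by IS_I_VERSION(). */
#define DEMO_I_VERSION (1u << 0)

struct demo_inode {
	uint64_t i_version;    /* 64-bit change counter */
	unsigned int sb_flags; /* mount-time flags of the owning filesystem */
};

/*
 * Bump the counter only when the filesystem opted in, and report
 * whether the inode now has to be marked dirty (the sync_it flag).
 */
int demo_update_version(struct demo_inode *inode)
{
	int sync_it = 0;

	if (inode->sb_flags & DEMO_I_VERSION) {
		inode->i_version++;
		sync_it = 1;
	}
	return sync_it;
}
```

A filesystem mounted without the flag leaves the counter untouched and triggers no extra dirtying, so the feature costs nothing unless it is enabled.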
Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
On Wed, 2007-10-17 at 21:09 -0700, Andrew Morton wrote:
> On Thu, 11 Oct 2007 13:18:49 +0200 Jan Kara <[EMAIL PROTECTED]> wrote:
>
> > With 64KB blocksize, a directory entry can have size 64KB which does not fit
> > into 16 bits we have for entry length. So we store 0x instead and convert
> > value when read from / written to disk.
>
> btw, this changes ext2's on-disk format.
>

Just to clarify: this only changes the directory entry format on ext2/3/4 filesystems with a 64k block size. But currently, without kernel changes, ext2/3/4 does not support a 64k block size.

> a) is the ext2 format documented anywhere? If so, that document will
>    need updating.
>

e2fsprogs needs to be changed to stay in sync with this change.

Ted wrote a paper a while back that shows the ext2 disk format:
http://web.mit.edu/tytso/www/linux/ext2intro.html

Documentation/filesystems/ext2.txt does not document the ext2 format. That document is outdated and needs to be reviewed and cleaned up.

> b) what happens when an old ext2 driver tries to read and/or write this
>    directory entry? Do we need a compat flag for it?
>
> c) what happens when old and new ext3 or ext4 try to read/write this
>    directory entry?
>

Without the first patch in this series (the ext2 large blocksize support patch), mounting an ext2 filesystem with a 64k block size fails.

[PATCH 1/2] ext2: Support large blocksize up to PAGESIZE
http://lkml.org/lkml/2007/10/1/361

So an old ext2/3/4 driver will never get to access directory entries that use the 64k block size format change.

Regards,
Mingming
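The conversion Jan describes might look roughly like the sketch below. The sentinel value is cut off in the quoted text, so 0xFFFF (the largest value a 16-bit field can hold) is an assumption here, as are the names DEMO_MAX_REC_LEN, rec_len_from_disk and rec_len_to_disk. The on-disk little-endian byte order handling is left out.

```c
#include <stdint.h>

/* Assumed sentinel: the thread only shows a truncated "0x". */
#define DEMO_MAX_REC_LEN 0xFFFFu

/* On-disk 16-bit field -> in-memory length; the sentinel means 64KB. */
uint32_t rec_len_from_disk(uint16_t dlen)
{
	return dlen == DEMO_MAX_REC_LEN ? 65536u : dlen;
}

/* In-memory length -> on-disk 16-bit field; 64KB is stored as the sentinel. */
uint16_t rec_len_to_disk(uint32_t len)
{
	return len == 65536u ? (uint16_t)DEMO_MAX_REC_LEN : (uint16_t)len;
}
```

Such a sentinel is only safe if it can never be a real entry length. With this assumed value that holds as long as directory entry lengths are multiples of 4, since 0xFFFF is not.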
Re: [PATCH] jbd: JBD replace jbd_kmalloc with kmalloc
On Mon, 2007-10-08 at 10:46 -0700, Christoph Lameter wrote:
> On Fri, 5 Oct 2007, Mingming Cao wrote:
>
> > Index: linux-2.6.23-rc9/fs/jbd/transaction.c
> > ===
> > --- linux-2.6.23-rc9.orig/fs/jbd/transaction.c	2007-10-05 12:08:08.0 -0700
> > +++ linux-2.6.23-rc9/fs/jbd/transaction.c	2007-10-05 12:08:29.0 -0700
> > @@ -96,8 +96,8 @@ static int start_this_handle(journal_t *
> >
> >  alloc_transaction:
> >  	if (!journal->j_running_transaction) {
> > -		new_transaction = jbd_kmalloc(sizeof(*new_transaction),
> > -						GFP_NOFS);
> > +		new_transaction = kmalloc(sizeof(*new_transaction),
> > +						GFP_NOFS|__GFP_NOFAIL);
>
>
> Why was a __GFP_NOFAIL added here? I do not see a use of jbd_rep_kmalloc?
>
> > -#define jbd_kmalloc(size, flags) \
> > -	__jbd_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
>
> journal_oom_retry is no longer used?
>

journal_oom_retry (which is currently defined as 1) is still being used in revoke.c, so the cleanup patch does not remove the definition of journal_oom_retry.

journal_oom_retry is always passed as 1 to __jbd_kmalloc:

void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
{
	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
}

So in start_this_handle() we replace jbd_kmalloc() with kmalloc() with the __GFP_NOFAIL flag set. The other two places where it is replaced with kmalloc() are part of the init process, so there is no need for the __GFP_NOFAIL flag there.

Mingming
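The flag computation in __jbd_kmalloc() can be checked in isolation with a user-space sketch. The two flag values below are arbitrary placeholders rather than the kernel's real GFP bit values, and jbd_effective_flags is a hypothetical helper that mirrors only the expression inside the wrapper.

```c
/* Placeholder bit values, not the kernel's actual GFP definitions. */
#define DEMO_GFP_NOFS   0x10u
#define DEMO_GFP_NOFAIL 0x800u

/* Mirrors: flags | (retry ? __GFP_NOFAIL : 0) */
unsigned int jbd_effective_flags(unsigned int flags, int retry)
{
	return flags | (retry ? DEMO_GFP_NOFAIL : 0u);
}
```

Since retry is always 1 for callers of the old macro, the wrapper reduces to OR-ing in the no-fail flag, which is why the open-coded kmalloc() call in start_this_handle() carries it explicitly.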
Re: [PATCH 1/2] i_version update - vfs part
On Fri, 2007-10-05 at 17:28 +0200, Cordenner jean noel wrote:
> Hi,
>

Hi Jean Noel,

> This is an update of the i_version patch.

Just to make sure: are this VFS patch and the following ext4 patch together
going to replace the four inode-version related patches currently in the
ext4-patch-queue (and git tree)?

> The i_version field is a 64bit counter that is set on every inode
> creation and that is incremented every time the inode data is modified
> (similarly to the "ctime" time-stamp).
> The aim is to fulfill a NFSv4 requirement for rfc3530:
> "5.5.  Mandatory Attributes - Definitions
> Name      #    DataType   Access   Description
> ___
> change    3    uint64     READ     A value created by the
>         server that the client can use to determine if file
>         data, directory contents or attributes of the object
>         have been modified.  The server may return the object's
>         time_metadata attribute for this attribute's value but
>         only if the filesystem object can not be updated more
>         frequently than the resolution of time_metadata.
> "
>
> This first part deals with adding a flag in the super block and incrementing
> the i_version in the vfs.
>
> Signed-off-by: Jean Noel Cordenner <[EMAIL PROTECTED]>
> ---
>  fs/inode.c | 23 +++
>  fs/libfs.c | 12 
>  include/linux/fs.h |3 +++
>  3 files changed, 38 insertions(+)
>
> Index: linux-2.6.23-rc8-ext4-i_version/fs/inode.c
> ===
> --- linux-2.6.23-rc8-ext4-i_version.orig/fs/inode.c	2007-09-26 14:41:41.0 +0200
> +++ linux-2.6.23-rc8-ext4-i_version/fs/inode.c	2007-10-05 16:14:41.0 +0200
> @@ -1216,6 +1216,24 @@
>  EXPORT_SYMBOL(touch_atime);
>
>  /**
> + *	inode_inc_iversion	-	increments i_version
> + *	@inode: inode that need to be updated
> + *
> + *	Every time the inode is modified, the i_version field
> + *	will be incremented.
> + *	The filesystem has to be mounted with i_version flag
> + *
> + */
> +
> +void inode_inc_iversion(struct inode *inode)
> +{
> +	spin_lock(&inode->i_lock);
> +	inode->i_version++;
> +	spin_unlock(&inode->i_lock);
> +}

I doubt we need a lock here: the places that need to update
inode->i_version are already updating the inode, mostly under i_mutex.
You could remove the above function and update the counter directly in
the places that need it.

> +EXPORT_SYMBOL(inode_inc_iversion);
> +

Seems unnecessary.

> +/**
>   *	file_update_time	-	update mtime and ctime time
>   *	@file: file accessed
>   *
> @@ -1249,6 +1267,11 @@
>  		sync_it = 1;
>  	}
>
> +	if (IS_I_VERSION(inode)) {
> +		inode_inc_iversion(inode);
> +		sync_it = 1;
> +	}
> +
>  	if (sync_it)
>  		mark_inode_dirty_sync(inode);
>  }
> Index: linux-2.6.23-rc8-ext4-i_version/fs/libfs.c
> ===
> --- linux-2.6.23-rc8-ext4-i_version.orig/fs/libfs.c	2007-07-09 01:32:17.0 +0200
> +++ linux-2.6.23-rc8-ext4-i_version/fs/libfs.c	2007-09-26 14:51:08.0 +0200
> @@ -255,6 +255,10 @@
>  	struct inode *inode = old_dentry->d_inode;
>
>  	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
> +	if (IS_I_VERSION(inode)) {
> +		inode_inc_iversion(inode);
> +		inode_inc_iversion(dir);
> +	}
>  	inc_nlink(inode);
>  	atomic_inc(&inode->i_count);
>  	dget(dentry);
> @@ -287,6 +291,10 @@
>  	struct inode *inode = dentry->d_inode;
>
>  	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
> +	if (IS_I_VERSION(inode)) {
> +		inode_inc_iversion(inode);
> +		inode_inc_iversion(dir);
> +	}
>  	drop_nlink(inode);
>  	dput(dentry);
>  	return 0;
> @@ -323,6 +331,10 @@
>
>  	old_dir->i_ctime = old_dir->i_mtime = new_dir->i_ctime =
>  		new_dir->i_mtime = inode->i_ctime = CURRENT_TIME;
> +	if (IS_I_VERSION(old_dir)) {
> +		inode_inc_iversion(old_dir);
> +		inode_inc_iversion(new_dir);
> +	}
>
>  	return 0;
>  }

Do we need to update the counter in libfs.c?

> Index: linux-2.6.23-rc8-ext4-i_version/include/linux/fs.h
> ===
> --- linux-2.6.23-rc8-ext4-i_version.orig/include/linux/fs.h	2007-09-26 14:46:15.0 +0200
> +++ linux-2.6.23-rc8-ext4-i_version/include/linux/fs.h	2007-09-26 14:51:08.0 +0200
> @@ -123,6 +123,7 @@
>  #define MS_SLAVE	(1<<19)	/* change to slave */
>  #define MS_SHARED	(1<<20)	/* change to shared */
>  #define MS_RELATIME	(1<<21)	/* Update atime relative to mtime/ctime. */
> +#define MS_I_VERSION	(1<<22)	/* Update inode i_version field */
>  #define MS_ACTIVE
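To illustrate the change-attribute semantics quoted from RFC 3530 in the message above, here is a minimal userspace sketch. It is not the kernel code from the patch: the struct and function names (toy_inode, toy_modify, toy_changed_since) are hypothetical, and the one-second timestamp resolution is an assumption. It only shows why a 64-bit counter bumped on every modification lets a client detect a change even when two updates fall within the same timestamp tick.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the two struct inode fields discussed above. */
struct toy_inode {
	uint64_t i_version;	/* change counter, bumped on every modification */
	time_t   i_ctime;	/* coarse timestamp (assumed 1-second resolution) */
};

/*
 * Record one modification at time 'now'. The caller is assumed to hold
 * whatever lock already serialises inode updates, so no extra lock is
 * taken around the increment.
 */
static void toy_modify(struct toy_inode *inode, time_t now)
{
	assert(inode != NULL);
	inode->i_ctime = now;
	inode->i_version++;
}

/* Client-side check: has the object changed since 'cached' was sampled? */
static int toy_changed_since(const struct toy_inode *inode, uint64_t cached)
{
	return inode->i_version != cached;
}
```

Two calls to toy_modify() with the same 'now' leave i_ctime unchanged, yet toy_changed_since() still reports the second update. That is the case a timestamp alone cannot cover.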
Re: [PATCH 2/2] i_version update - ext4 part
On Fri, 2007-10-05 at 17:28 +0200, Cordenner jean noel wrote:
> This patch update the i_version field of the inode and add a mount
> option to enable this feature. The other condition to enable this
> feature is that the inode size should be 256-bytes.
>
> Signed-off-by: Jean Noel Cordenner <[EMAIL PROTECTED]>
> ---
>  fs/ext4/inode.c |4 +++-
>  fs/ext4/super.c |7 ++-
>  include/linux/ext4_fs.h |1 +
>  3 files changed, 10 insertions(+), 2 deletions(-)
>
> Index: linux-2.6.23-rc8-ext4-i_version/fs/ext4/inode.c
> ===
> --- linux-2.6.23-rc8-ext4-i_version.orig/fs/ext4/inode.c	2007-10-03 18:11:17.0 +0200
> +++ linux-2.6.23-rc8-ext4-i_version/fs/ext4/inode.c	2007-10-05 10:26:42.0 +0200
> @@ -3173,7 +3173,9 @@
>  {
>  	int err = 0;
>
> -	inode->i_version++;
> +	if (test_opt(inode->i_sb, I_VERSION))
> +		inode_inc_iversion(inode);
> +
>  	/* the do_update_inode consumes one bh->b_count */
>  	get_bh(iloc->bh);
>
> Index: linux-2.6.23-rc8-ext4-i_version/fs/ext4/super.c
> ===
> --- linux-2.6.23-rc8-ext4-i_version.orig/fs/ext4/super.c	2007-10-03 18:11:17.0 +0200
> +++ linux-2.6.23-rc8-ext4-i_version/fs/ext4/super.c	2007-10-03 18:17:44.0 +0200
> @@ -742,7 +742,7 @@
>  	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
>  	Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
>  	Opt_grpquota, Opt_extents, Opt_noextents, Opt_delalloc,
> -	Opt_mballoc, Opt_nomballoc, Opt_stripe,
> +	Opt_mballoc, Opt_nomballoc, Opt_stripe, Opt_i_version,
>  };
>
>  static match_table_t tokens = {
> @@ -800,6 +800,7 @@
>  	{Opt_mballoc, "mballoc"},
>  	{Opt_nomballoc, "nomballoc"},
>  	{Opt_stripe, "stripe=%u"},
> +	{Opt_i_version, "i_version"},
>  	{Opt_err, NULL},
>  	{Opt_resize, "resize"},
>  };
> @@ -1161,6 +1162,10 @@
>  				return 0;
>  			sbi->s_stripe = option;
>  			break;
> +		case Opt_i_version:
> +			set_opt (sbi->s_mount_opt, I_VERSION);
> +			sb->s_flags |= MS_I_VERSION;
> +			break;

We need to make sure this flag is cleared if the filesystem is remounted
without I_VERSION.

>  		default:
>  			printk (KERN_ERR
>  				"EXT4-fs: Unrecognized mount option \"%s\" "
> Index: linux-2.6.23-rc8-ext4-i_version/include/linux/ext4_fs.h
> ===
> --- linux-2.6.23-rc8-ext4-i_version.orig/include/linux/ext4_fs.h	2007-10-03 18:11:17.0 +0200
> +++ linux-2.6.23-rc8-ext4-i_version/include/linux/ext4_fs.h	2007-10-03 18:11:54.0 +0200
> @@ -500,6 +500,7 @@
>  #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x100 /* Journal Async Commit */
>  #define EXT4_MOUNT_DELALLOC 0x200 /* Delalloc support */
>  #define EXT4_MOUNT_MBALLOC 0x400 /* Buddy allocation support */
> +#define EXT4_MOUNT_I_VERSION 0x800 /* i_version support */
>  /* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
>  #ifndef _LINUX_EXT2_FS_H
>  #define clear_opt(o, opt)		o &= ~EXT4_MOUNT_##opt
>

I don't see where this counter is stored to or loaded from disk, so I
assume this is not the full patch series?

Mingming

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
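The remount concern raised in the review above can be sketched in userspace C. This is only an illustration under stated assumptions, not ext4's option parser: toy_apply_options is a hypothetical helper, and only the MS_I_VERSION value (1<<22) comes from the quoted fs.h hunk. The idea is to recompute the flag from the option string on every mount or remount, so that leaving out "i_version" clears it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MS_I_VERSION (1UL << 22)	/* value from the quoted fs.h hunk */

/*
 * Hypothetical helper: derive MS_I_VERSION from a comma-separated option
 * string. The flag is cleared first, so a remount that omits "i_version"
 * does not inherit it from the previous mount.
 */
static void toy_apply_options(unsigned long *s_flags, const char *opts)
{
	const char *p = opts;

	assert(s_flags != NULL && opts != NULL);
	*s_flags &= ~MS_I_VERSION;		/* start from "option absent" */

	while (*p) {
		size_t len = strcspn(p, ",");	/* length of the current token */

		if (len == strlen("i_version") &&
		    strncmp(p, "i_version", len) == 0)
			*s_flags |= MS_I_VERSION;
		p += len;
		if (*p == ',')
			p++;			/* skip the separator */
	}
}
```

Calling it with "barrier,i_version" and then with "barrier" on the same flags word leaves the bit clear after the second call. Setting the bit only in the option's case branch, as the quoted hunk does, would leave it set.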
[PATCH] jbd2: JBD replace jbd2_kmalloc with kmalloc
JBD2: JBD2 replace jbd2_kmalloc with kmalloc

From: Mingming Cao <[EMAIL PROTECTED]>

This patch removes jbd_kmalloc and replaces it with direct calls to kmalloc.

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd2/journal.c | 11 +--
 fs/jbd2/transaction.c |4 ++--
 include/linux/jbd2.h |7 ---
 3 files changed, 3 insertions(+), 19 deletions(-)

Index: linux-2.6.23-rc9/fs/jbd2/journal.c
===
--- linux-2.6.23-rc9.orig/fs/jbd2/journal.c	2007-10-05 12:08:26.0 -0700
+++ linux-2.6.23-rc9/fs/jbd2/journal.c	2007-10-05 12:08:32.0 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;

-	journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
@@ -1619,15 +1619,6 @@ size_t journal_tag_bytes(journal_t *jour
 }

 /*
- * Simple support for retrying memory allocations.  Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
-	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
  * Journal_head storage management
  */
 static struct kmem_cache *jbd2_journal_head_cache;
Index: linux-2.6.23-rc9/fs/jbd2/transaction.c
===
--- linux-2.6.23-rc9.orig/fs/jbd2/transaction.c	2007-10-05 12:08:26.0 -0700
+++ linux-2.6.23-rc9/fs/jbd2/transaction.c	2007-10-05 12:08:32.0 -0700
@@ -96,8 +96,8 @@ static int start_this_handle(journal_t *

 alloc_transaction:
 	if (!journal->j_running_transaction) {
-		new_transaction = jbd_kmalloc(sizeof(*new_transaction),
-						GFP_NOFS);
+		new_transaction = kmalloc(sizeof(*new_transaction),
+						GFP_NOFS|__GFP_NOFAIL);
 		if (!new_transaction) {
 			ret = -ENOMEM;
 			goto out;
Index: linux-2.6.23-rc9/include/linux/jbd2.h
===
--- linux-2.6.23-rc9.orig/include/linux/jbd2.h	2007-10-05 12:08:26.0 -0700
+++ linux-2.6.23-rc9/include/linux/jbd2.h	2007-10-05 12:08:32.0 -0700
@@ -71,13 +71,6 @@ extern u8 jbd2_journal_enable_debug;
 #define jbd_debug(f, a...)	/**/
 #endif

-extern void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
-#define jbd_kmalloc(size, flags) \
-	__jbd2_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
-#define jbd_rep_kmalloc(size, flags) \
-	__jbd2_kmalloc(__FUNCTION__, (size), (flags), 1)
-
-
 static inline void *jbd2_alloc(size_t size, gfp_t flags)
 {
 	return (void *)__get_free_pages(flags, get_order(size));
[PATCH] jbd: JBD replace jbd_kmalloc with kmalloc
JBD: JBD replace jbd_kmalloc with kmalloc

From: Mingming Cao <[EMAIL PROTECTED]>

This patch removes jbd_kmalloc and replaces it with direct calls to kmalloc.

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd/journal.c | 11 +--
 fs/jbd/transaction.c |4 ++--
 include/linux/jbd.h |6 --
 3 files changed, 3 insertions(+), 18 deletions(-)

Index: linux-2.6.23-rc9/fs/jbd/journal.c
===
--- linux-2.6.23-rc9.orig/fs/jbd/journal.c	2007-10-05 12:08:08.0 -0700
+++ linux-2.6.23-rc9/fs/jbd/journal.c	2007-10-05 12:08:29.0 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;

-	journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
@@ -1607,15 +1607,6 @@ int journal_blocks_per_page(struct inode
 }

 /*
- * Simple support for retrying memory allocations.  Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
-	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
  * Journal_head storage management
  */
 static struct kmem_cache *journal_head_cache;
Index: linux-2.6.23-rc9/fs/jbd/transaction.c
===
--- linux-2.6.23-rc9.orig/fs/jbd/transaction.c	2007-10-05 12:08:08.0 -0700
+++ linux-2.6.23-rc9/fs/jbd/transaction.c	2007-10-05 12:08:29.0 -0700
@@ -96,8 +96,8 @@ static int start_this_handle(journal_t *

 alloc_transaction:
 	if (!journal->j_running_transaction) {
-		new_transaction = jbd_kmalloc(sizeof(*new_transaction),
-						GFP_NOFS);
+		new_transaction = kmalloc(sizeof(*new_transaction),
+						GFP_NOFS|__GFP_NOFAIL);
 		if (!new_transaction) {
 			ret = -ENOMEM;
 			goto out;
Index: linux-2.6.23-rc9/include/linux/jbd.h
===
--- linux-2.6.23-rc9.orig/include/linux/jbd.h	2007-10-05 12:08:08.0 -0700
+++ linux-2.6.23-rc9/include/linux/jbd.h	2007-10-05 12:08:29.0 -0700
@@ -71,12 +71,6 @@ extern int journal_enable_debug;
 #define jbd_debug(f, a...)	/**/
 #endif

-extern void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
-#define jbd_kmalloc(size, flags) \
-	__jbd_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
-#define jbd_rep_kmalloc(size, flags) \
-	__jbd_kmalloc(__FUNCTION__, (size), (flags), 1)
-
 static inline void *jbd_alloc(size_t size, gfp_t flags)
 {
 	return (void *)__get_free_pages(flags, get_order(size));
[PATCH] jbd2: JBD2 slab allocation cleanups
JBD2: jbd2 slab allocation cleanups

From: Mingming Cao <[EMAIL PROTECTED]>

JBD2: Replace slab allocations with page allocations

JBD2 allocates memory for committed_data and frozen_data from slab. However,
JBD2 should not pass slab pages down to the block layer. Use page allocator
pages instead. This will also prepare JBD for the large blocksize patchset.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd2/commit.c |6 +--
 fs/jbd2/journal.c | 88 ++
 fs/jbd2/transaction.c | 14 +++
 include/linux/jbd2.h | 18 +++---
 4 files changed, 27 insertions(+), 99 deletions(-)

Index: linux-2.6.23-rc9/fs/jbd2/commit.c
===
--- linux-2.6.23-rc9.orig/fs/jbd2/commit.c	2007-10-05 12:03:43.0 -0700
+++ linux-2.6.23-rc9/fs/jbd2/commit.c	2007-10-05 12:08:26.0 -0700
@@ -384,7 +384,7 @@ void jbd2_journal_commit_transaction(jou
 			struct buffer_head *bh = jh2bh(jh);

 			jbd_lock_bh_state(bh);
-			jbd2_slab_free(jh->b_committed_data, bh->b_size);
+			jbd2_free(jh->b_committed_data, bh->b_size);
 			jh->b_committed_data = NULL;
 			jbd_unlock_bh_state(bh);
 		}
@@ -801,14 +801,14 @@ restart_loop:
 		 * Otherwise, we can just throw away the frozen data now.
 		 */
 		if (jh->b_committed_data) {
-			jbd2_slab_free(jh->b_committed_data, bh->b_size);
+			jbd2_free(jh->b_committed_data, bh->b_size);
 			jh->b_committed_data = NULL;
 			if (jh->b_frozen_data) {
 				jh->b_committed_data = jh->b_frozen_data;
 				jh->b_frozen_data = NULL;
 			}
 		} else if (jh->b_frozen_data) {
-			jbd2_slab_free(jh->b_frozen_data, bh->b_size);
+			jbd2_free(jh->b_frozen_data, bh->b_size);
 			jh->b_frozen_data = NULL;
 		}

Index: linux-2.6.23-rc9/fs/jbd2/journal.c
===
--- linux-2.6.23-rc9.orig/fs/jbd2/journal.c	2007-10-05 12:03:43.0 -0700
+++ linux-2.6.23-rc9/fs/jbd2/journal.c	2007-10-05 12:08:26.0 -0700
@@ -84,7 +84,6 @@ EXPORT_SYMBOL(jbd2_journal_force_commit)

 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
-static int jbd2_journal_create_jbd_slab(size_t slab_size);

 /*
  * Helper function used to manage commit timeouts
@@ -335,10 +334,10 @@ repeat:
 		char *tmp;

 		jbd_unlock_bh_state(bh_in);
-		tmp = jbd2_slab_alloc(bh_in->b_size, GFP_NOFS);
+		tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
 		jbd_lock_bh_state(bh_in);
 		if (jh_in->b_frozen_data) {
-			jbd2_slab_free(tmp, bh_in->b_size);
+			jbd2_free(tmp, bh_in->b_size);
 			goto repeat;
 		}

@@ -1096,13 +1095,6 @@ int jbd2_journal_load(journal_t *journal
 		}
 	}

-	/*
-	 * Create a slab for this blocksize
-	 */
-	err = jbd2_journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
-	if (err)
-		return err;
-
 	/* Let the recovery code check whether it needs to recover any
 	 * data from the journal. */
 	if (jbd2_journal_recover(journal))
@@ -1636,77 +1628,6 @@ void * __jbd2_kmalloc (const char *where
 }

 /*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
- */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size)  (size >> 11)
-
-static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
-static const char *jbd_slab_names[JBD_MAX_SLABS] = {
-	"jbd2_1k", "jbd2_2k", "jbd2_4k", NULL, "jbd2_8k"
-};
-
-static void jbd2_journal_destroy_jbd_slabs(void)
-{
-	int i;
-
-	for (i = 0; i < JBD_MAX_SLABS; i++) {
-		if (jbd_slab[i])
-			kmem_cache_destroy(jbd_slab[i]);
-		jbd_slab[i] = NULL;
-	}
-}
-
-static int jbd2_journal_create_jbd_slab(size_t slab_size)
-{
-	int i = JBD_SLAB_INDEX(slab_size);
-
-	BUG_ON(i >= JBD_MAX_SLABS);
-
-	/*
-	 * Check if we already have a slab created for this size
-	 */
-	if (jbd_slab[i])
-		return 0;
-
-	/*
-	 * Create a slab and force alignment to be same as slabsize -
-	 * this will make sure that allocations won't cross the page
-	 * boundary
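The slab table removed by this patch has five slots, but the comment names only four block sizes. The reason is the index macro: shifting the block size right by 11 maps 1k, 2k, 4k and 8k to 0, 1, 2 and 4, so slot 3 is never used. The sketch below reproduces that mapping in userspace C. JBD_MAX_SLABS, JBD_SLAB_INDEX and the name table are copied from the removed code (with the macro argument parenthesised); toy_slab_name is a hypothetical helper added only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define JBD_MAX_SLABS 5
#define JBD_SLAB_INDEX(size) ((size) >> 11)	/* 1k->0, 2k->1, 4k->2, 8k->4 */

/* Cache names as listed in the removed jbd2 code; slot 3 is unused. */
static const char *jbd_slab_names[JBD_MAX_SLABS] = {
	"jbd2_1k", "jbd2_2k", "jbd2_4k", NULL, "jbd2_8k"
};

/* Hypothetical helper: cache name for a block size, or NULL for slot 3. */
static const char *toy_slab_name(size_t block_size)
{
	size_t i = JBD_SLAB_INDEX(block_size);

	assert(i < JBD_MAX_SLABS);	/* mirrors the BUG_ON() in the old code */
	return jbd_slab_names[i];
}
```

Any block size of 10240 bytes or more gives an index of 5 or higher and trips the bound check. This is one way to see why the old scheme would not extend to larger block sizes without changes.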
[PATCH] jbd: JBD slab allocation cleanups
JBD: JBD slab allocation cleanups

From: Mingming Cao <[EMAIL PROTECTED]>

JBD: Replace slab allocations with page allocations

JBD allocates memory for committed_data and frozen_data from slab. However,
JBD should not pass slab pages down to the block layer. Use page allocator
pages instead. This will also prepare JBD for the large blocksize patchset.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd/commit.c |6 +--
 fs/jbd/journal.c | 88 ++-
 fs/jbd/transaction.c |8 ++--
 include/linux/jbd.h | 13 +--
 4 files changed, 21 insertions(+), 94 deletions(-)

Index: linux-2.6.23-rc9/fs/jbd/commit.c
===
--- linux-2.6.23-rc9.orig/fs/jbd/commit.c	2007-10-05 12:03:43.0 -0700
+++ linux-2.6.23-rc9/fs/jbd/commit.c	2007-10-05 12:08:08.0 -0700
@@ -375,7 +375,7 @@ void journal_commit_transaction(journal_
 			struct buffer_head *bh = jh2bh(jh);

 			jbd_lock_bh_state(bh);
-			jbd_slab_free(jh->b_committed_data, bh->b_size);
+			jbd_free(jh->b_committed_data, bh->b_size);
 			jh->b_committed_data = NULL;
 			jbd_unlock_bh_state(bh);
 		}
@@ -792,14 +792,14 @@ restart_loop:
 		 * Otherwise, we can just throw away the frozen data now.
 		 */
 		if (jh->b_committed_data) {
-			jbd_slab_free(jh->b_committed_data, bh->b_size);
+			jbd_free(jh->b_committed_data, bh->b_size);
 			jh->b_committed_data = NULL;
 			if (jh->b_frozen_data) {
 				jh->b_committed_data = jh->b_frozen_data;
 				jh->b_frozen_data = NULL;
 			}
 		} else if (jh->b_frozen_data) {
-			jbd_slab_free(jh->b_frozen_data, bh->b_size);
+			jbd_free(jh->b_frozen_data, bh->b_size);
 			jh->b_frozen_data = NULL;
 		}

Index: linux-2.6.23-rc9/fs/jbd/journal.c
===
--- linux-2.6.23-rc9.orig/fs/jbd/journal.c	2007-10-05 12:03:43.0 -0700
+++ linux-2.6.23-rc9/fs/jbd/journal.c	2007-10-05 12:08:08.0 -0700
@@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);

 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
-static int journal_create_jbd_slab(size_t slab_size);

 /*
  * Helper function used to manage commit timeouts
@@ -334,10 +333,10 @@ repeat:
 		char *tmp;

 		jbd_unlock_bh_state(bh_in);
-		tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
+		tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
 		jbd_lock_bh_state(bh_in);
 		if (jh_in->b_frozen_data) {
-			jbd_slab_free(tmp, bh_in->b_size);
+			jbd_free(tmp, bh_in->b_size);
 			goto repeat;
 		}

@@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
 		}
 	}

-	/*
-	 * Create a slab for this blocksize
-	 */
-	err = journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
-	if (err)
-		return err;
-
 	/* Let the recovery code check whether it needs to recover any
 	 * data from the journal. */
 	if (journal_recover(journal))
@@ -1624,77 +1616,6 @@ void * __jbd_kmalloc (const char *where,
 }

 /*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
- */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size)  (size >> 11)
-
-static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
-static const char *jbd_slab_names[JBD_MAX_SLABS] = {
-	"jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
-};
-
-static void journal_destroy_jbd_slabs(void)
-{
-	int i;
-
-	for (i = 0; i < JBD_MAX_SLABS; i++) {
-		if (jbd_slab[i])
-			kmem_cache_destroy(jbd_slab[i]);
-		jbd_slab[i] = NULL;
-	}
-}
-
-static int journal_create_jbd_slab(size_t slab_size)
-{
-	int i = JBD_SLAB_INDEX(slab_size);
-
-	BUG_ON(i >= JBD_MAX_SLABS);
-
-	/*
-	 * Check if we already have a slab created for this size
-	 */
-	if (jbd_slab[i])
-		return 0;
-
-	/*
-	 * Create a slab and force alignment to be same as slabsize -
-	 * this will make sure that allocations won't cross the page
-	 * boundary.
-	 */
-	jbd_slab[i] = kmem_cache_create(jbd_slab_names[i
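The replacement allocator shown in the kmalloc patches, jbd_alloc(), calls __get_free_pages(flags, get_order(size)), which hands out a power-of-two number of whole pages. The following userspace sketch shows the rounding this implies. It is a re-implementation of the idea only: toy_get_order and toy_alloc_bytes are hypothetical names, and the 4 KiB page size is an assumption rather than something stated in the thread.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12			/* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE  ((size_t)1 << TOY_PAGE_SHIFT)

/*
 * Smallest order such that (page size << order) >= size. This mirrors
 * what get_order() computes, not its actual kernel implementation.
 */
static unsigned int toy_get_order(size_t size)
{
	unsigned int order = 0;

	assert(size > 0);
	while ((TOY_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Bytes actually reserved by a page allocation of the given size. */
static size_t toy_alloc_bytes(size_t size)
{
	return TOY_PAGE_SIZE << toy_get_order(size);
}
```

Under that assumption a 1k or 2k journal block occupies a full 4 KiB page, while an 8k block gets an order-1 allocation of exactly 8 KiB. The buffer is then always page-backed and naturally aligned, which is the property the description says the block layer needs. The cost is wasted space for block sizes smaller than a page.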
Re: [PATCH] jbd/jbd2: JBD memory allocation cleanups
On Thu, 2007-10-04 at 07:52 +0100, Christoph Hellwig wrote:
> On Thu, Oct 04, 2007 at 01:50:36AM -0400, Theodore Ts'o wrote:
> > From: Mingming Cao <[EMAIL PROTECTED]>
> >
> > JBD: Replace slab allocations with page cache allocations
>
> It's page allocations, not page cache allocations.
>
> > Also this patch cleans up jbd_kmalloc and replace it with kmalloc directly
>
> That sounds like it should be a different patch..

Okay. I will send the patches, with the JBD2 changes also separated into a
different patch.
Re: 2.6.23-rc9: Oops in cache_alloc_refill() mm/slab.c
Forwarded Message
From: Valerie Clement <[EMAIL PROTECTED]>
To: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: 2.6.23-rc9: Oops in cache_alloc_refill() mm/slab.c
Date: Thu, 04 Oct 2007 18:13:46 +0200

While running ffsb tests on my ext4 filesystem, I got an Oops in
cache_alloc_refill().
I turned on SLAB debugging and here is the message I got:

slab: Internal list corruption detected in cache 'buffer_head'(30), slabp 81007e100100(1515870810). Hexdump:

===> The slabp->inuse counter looks corrupted (1515870810); it should not be
greater than cachep->num, which looks valid (30).

000: 5a 5a 5a 5a 5a 5a 5a 5a b8 23 34 7e 00 81 ff ff
010: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
020: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
030: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
040: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
050: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
060: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a a5
070: c0 88 56 63 c5 56 41 d8 f1 37 4a 80 ff ff ff ff
080: c0 88 56 63 c5 56 41 d8 80 33 53 7d 00 81 ff ff
090: e8 25 60 7d 00 81 ff ff 68 cb 3b 01 00 81 ff ff
0a0: 18 68 50 7d 00 81 ff ff
[ cut here ]
kernel BUG at /home/clementv/src/linux-2.6.23-rc9/mm/slab.c:2923!
invalid opcode: [1] SMP
CPU 2
Modules linked in: qla2xxx
Pid: 4041, comm: ffsb Not tainted 2.6.23-rc9 #2
RIP: 0010:[802758b6] [802758b6] check_slabp+0xb5/0xc1
RSP: 0018:8100774bb958 EFLAGS: 00010096
RAX: 0001 RBX: 81007e100100 RCX: 6d20
RDX: RSI: 0046 RDI: 81007e347280
RBP: 00a8 R08: 0005 R09: 8060bb10
R10: 000ae468 R11: 00050002 R12: 00a8
R13: 81007e347280 R14: 81007e347280 R15: 0002
FS: 41802950(0063) GS:81007e0c4728() knlGS:
CS: 0010 DS: ES: CR0: 8005003b
CR2: 5f83d00c CR3: 78149000 CR4: 06e0
DR0: DR1: DR2:
DR3: DR6: 0ff0 DR7: 0400
Process ffsb (pid: 4041, threadinfo 8100774ba000, task 81007dbdc7a0)
Stack: 000d 000e 81007e100100 81007e342398
 81007e078488 80277069 8050 81007e347280
 8050 0246 80299539 f000
Call Trace:
 [80277069] cache_alloc_refill+0xc8/0x23f
 [80299539] alloc_buffer_head+0x14/0x45
 [802774cd] kmem_cache_alloc+0x94/0xe9
 [80299539] alloc_buffer_head+0x14/0x45
 [80299cf7] alloc_page_buffers+0x38/0xd5
 [80299da8] create_empty_buffers+0x14/0x9b
 [8029a875] __block_prepare_write+0x7c/0x45b
 [802f6e29] ext4_get_block+0x0/0x139
 [8029ac6e] block_prepare_write+0x1a/0x25
 [802f8340] ext4_prepare_write+0xaf/0x175
 [802576c2] generic_file_buffered_write+0x288/0x631
 [80257daa] __generic_file_aio_write_nolock+0x33f/0x3a9
 [8022b7d5] enqueue_entity+0x17c/0x1a3
 [80257e75] generic_file_aio_write+0x61/0xc1
 [8022c512] __check_preempt_curr_fair+0x56/0x76
 [802f4022] ext4_file_write+0x16/0x91
 [8027c4f4] do_sync_write+0xc9/0x10c
 [8027d50a] file_move+0x1d/0x4c
 [80245992] autoremove_wake_function+0x0/0x2e
 [8027b216] do_filp_open+0x2a/0x38
 [80275f7a] poison_obj+0x26/0x30
 [8027cc34] vfs_write+0xad/0x136
 [8027d171] sys_write+0x45/0x6e
 [8020b32e] system_call+0x7e/0x83

=> The stack trace shows ext4_new_block(); is the problem repeatable? Does it
go away without the multiple block allocation patch?

Mingming
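A note on the suspicious counter value in the report above: 1515870810 is 0x5a5a5a5a in hexadecimal, the same byte that fills most of the hexdump. 0x5a is the slab debug fill byte POISON_INUSE from include/linux/poison.h, and the a5 at offset 0x6f matches POISON_END. The small C sketch below checks that arithmetic. toy_is_fill_pattern is a hypothetical helper, and reading the dump this way is only an interpretation: it would be consistent with the slab descriptor having been overwritten by poison bytes, but the thread does not confirm the cause.

```c
#include <assert.h>
#include <stdint.h>

#define POISON_INUSE 0x5a	/* slab debug fill byte for uninitialised objects */
#define POISON_END   0xa5	/* slab debug end-of-poison byte */

/* Return 1 if all four bytes of 'value' equal 'byte', 0 otherwise. */
static int toy_is_fill_pattern(uint32_t value, uint8_t byte)
{
	uint32_t pattern = (uint32_t)byte * 0x01010101u;	/* replicate byte */

	assert(byte != 0);	/* a zero fill would match any cleared field */
	return value == pattern;
}
```

With the values from the report, toy_is_fill_pattern(1515870810u, POISON_INUSE) holds, whereas the cache's object count of 30 matches no fill pattern.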
Re: 2.6.23-rc9: Oops in cache_alloc_refill() mm/slab.c
On Fri, 2007-10-05 at 07:54 -0700, Badari Pulavarty wrote:
> On Fri, 2007-10-05 at 15:41 +0200, Valerie Clement wrote:
> > Badari Pulavarty wrote:
> > > On Thu, 2007-10-04 at 18:13 +0200, Valerie Clement wrote:
> > >> While running ffsb tests on my ext4 filesystem, I got an Oops in
> > >> cache_alloc_refill().
> > >> I turned on SLAB debugging and here is the message I got:
> > >>
> > >> slab: Internal list corruption detected in cache 'buffer_head'(30),
> > >> slabp 81007e100100(1515870810). Hexdump:
> > >
> > > slabp->inuse = 1515870810 looks bogus. Is this easily reproducible ?
> >
> > Hi Badari,
> > Thanks for your answer.
> > I didn't reproduce it without the latest ext4 patches. So I suspect a
> > bug in one of them.
> > But how debugging this?
> > Which other debug traces can I turn on?
>
> Let me understand. You applied latest ext4 patchsets ? If so, Mingming
> has some slab-cleanup changes in the patchset. You can try backing them
> out and see.
>

It's unlikely to be jbd_slab_cleanup.patch, which actually gets rid of the
slab allocation for buffers passed down to disk I/O and replaces it with
get_free_page directly.

Could you send me the profile used for the ffsb test?

Thanks,
Mingming
[PATCH] jbd: JBD replace jbd_kmalloc with kmalloc
JBD: JBD replace jbd_kmalloc with kmalloc

From: Mingming Cao [EMAIL PROTECTED]

This patch cleans up jbd_kmalloc and replaces it with kmalloc directly.

Signed-off-by: Mingming Cao [EMAIL PROTECTED]
---
 fs/jbd/journal.c     |   11 +--
 fs/jbd/transaction.c |    4 ++--
 include/linux/jbd.h  |    6 --
 3 files changed, 3 insertions(+), 18 deletions(-)

Index: linux-2.6.23-rc9/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc9.orig/fs/jbd/journal.c	2007-10-05 12:08:08.0 -0700
+++ linux-2.6.23-rc9/fs/jbd/journal.c	2007-10-05 12:08:29.0 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
@@ -1607,15 +1607,6 @@ int journal_blocks_per_page(struct inode
 }
 
 /*
- * Simple support for retrying memory allocations.  Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
-	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
  * Journal_head storage management
  */
 static struct kmem_cache *journal_head_cache;
Index: linux-2.6.23-rc9/fs/jbd/transaction.c
===================================================================
--- linux-2.6.23-rc9.orig/fs/jbd/transaction.c	2007-10-05 12:08:08.0 -0700
+++ linux-2.6.23-rc9/fs/jbd/transaction.c	2007-10-05 12:08:29.0 -0700
@@ -96,8 +96,8 @@ static int start_this_handle(journal_t *
 
 alloc_transaction:
 	if (!journal->j_running_transaction) {
-		new_transaction = jbd_kmalloc(sizeof(*new_transaction),
-						GFP_NOFS);
+		new_transaction = kmalloc(sizeof(*new_transaction),
+						GFP_NOFS|__GFP_NOFAIL);
 		if (!new_transaction) {
 			ret = -ENOMEM;
 			goto out;
Index: linux-2.6.23-rc9/include/linux/jbd.h
===================================================================
--- linux-2.6.23-rc9.orig/include/linux/jbd.h	2007-10-05 12:08:08.0 -0700
+++ linux-2.6.23-rc9/include/linux/jbd.h	2007-10-05 12:08:29.0 -0700
@@ -71,12 +71,6 @@ extern int journal_enable_debug;
 #define jbd_debug(f, a...)	/**/
 #endif
 
-extern void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
-#define jbd_kmalloc(size, flags) \
-	__jbd_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
-#define jbd_rep_kmalloc(size, flags) \
-	__jbd_kmalloc(__FUNCTION__, (size), (flags), 1)
-
 static inline void *jbd_alloc(size_t size, gfp_t flags)
 {
 	return (void *)__get_free_pages(flags, get_order(size));
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
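A note on the jbd_alloc() helper that appears as context in the last hunk: it asks the page allocator for whole pages, with get_order() rounding the requested size up to a power-of-two page count. The rounding can be sketched in userspace as below. This is only an illustration, assuming 4 KiB pages; SKETCH_PAGE_SIZE and sketch_get_order() are made-up names, not the kernel's implementation.

```c
#include <stddef.h>

/* Assumption: 4 KiB pages. The kernel's PAGE_SIZE depends on the architecture. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Hypothetical userspace stand-in for get_order(): returns the smallest
 * order such that (SKETCH_PAGE_SIZE << order) >= size.
 */
static unsigned int sketch_get_order(size_t size)
{
	unsigned int order = 0;
	size_t span = SKETCH_PAGE_SIZE;

	/* Double the span until it covers the request. */
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

Under these assumptions a 1k, 2k or 4k journal buffer each takes one full page (order 0), while a 64k buffer needs order 4.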
[PATCH] jbd2: JBD replace jbd2_kmalloc with kmalloc
JBD2: JBD2 replace jbd2_kmalloc with kmalloc

From: Mingming Cao [EMAIL PROTECTED]

This patch cleans up jbd_kmalloc and replaces it with kmalloc directly.

Signed-off-by: Mingming Cao [EMAIL PROTECTED]
---
 fs/jbd2/journal.c     |   11 +--
 fs/jbd2/transaction.c |    4 ++--
 include/linux/jbd2.h  |    7 ---
 3 files changed, 3 insertions(+), 19 deletions(-)

Index: linux-2.6.23-rc9/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc9.orig/fs/jbd2/journal.c	2007-10-05 12:08:26.0 -0700
+++ linux-2.6.23-rc9/fs/jbd2/journal.c	2007-10-05 12:08:32.0 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
@@ -1619,15 +1619,6 @@ size_t journal_tag_bytes(journal_t *jour
 }
 
 /*
- * Simple support for retrying memory allocations.  Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
-	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
  * Journal_head storage management
  */
 static struct kmem_cache *jbd2_journal_head_cache;
Index: linux-2.6.23-rc9/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.23-rc9.orig/fs/jbd2/transaction.c	2007-10-05 12:08:26.0 -0700
+++ linux-2.6.23-rc9/fs/jbd2/transaction.c	2007-10-05 12:08:32.0 -0700
@@ -96,8 +96,8 @@ static int start_this_handle(journal_t *
 
 alloc_transaction:
 	if (!journal->j_running_transaction) {
-		new_transaction = jbd_kmalloc(sizeof(*new_transaction),
-						GFP_NOFS);
+		new_transaction = kmalloc(sizeof(*new_transaction),
+						GFP_NOFS|__GFP_NOFAIL);
 		if (!new_transaction) {
 			ret = -ENOMEM;
 			goto out;
Index: linux-2.6.23-rc9/include/linux/jbd2.h
===================================================================
--- linux-2.6.23-rc9.orig/include/linux/jbd2.h	2007-10-05 12:08:26.0 -0700
+++ linux-2.6.23-rc9/include/linux/jbd2.h	2007-10-05 12:08:32.0 -0700
@@ -71,13 +71,6 @@ extern u8 jbd2_journal_enable_debug;
 #define jbd_debug(f, a...)	/**/
 #endif
 
-extern void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
-#define jbd_kmalloc(size, flags) \
-	__jbd2_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
-#define jbd_rep_kmalloc(size, flags) \
-	__jbd2_kmalloc(__FUNCTION__, (size), (flags), 1)
-
-
 static inline void *jbd2_alloc(size_t size, gfp_t flags)
 {
 	return (void *)__get_free_pages(flags, get_order(size));
Re: [PATCH 2/2] i_version update - ext4 part
On Fri, 2007-10-05 at 17:28 +0200, Cordenner jean noel wrote:
> This patch update the i_version field of the inode and add a mount
> option to enable this feature. The other condition to enable this
> feature is that the inode size should be 256-bytes.
>
> Signed-off-by: Jean Noel Cordenner [EMAIL PROTECTED]
> ---
>  fs/ext4/inode.c         |    4 +++-
>  fs/ext4/super.c         |    7 ++-
>  include/linux/ext4_fs.h |    1 +
>  3 files changed, 10 insertions(+), 2 deletions(-)
>
> Index: linux-2.6.23-rc8-ext4-i_version/fs/ext4/inode.c
> ===================================================================
> --- linux-2.6.23-rc8-ext4-i_version.orig/fs/ext4/inode.c	2007-10-03 18:11:17.0 +0200
> +++ linux-2.6.23-rc8-ext4-i_version/fs/ext4/inode.c	2007-10-05 10:26:42.0 +0200
> @@ -3173,7 +3173,9 @@
>  {
>  	int err = 0;
>  
> -	inode->i_version++;
> +	if (test_opt(inode->i_sb, I_VERSION))
> +		inode_inc_iversion(inode);
> +
>  	/* the do_update_inode consumes one bh->b_count */
>  	get_bh(iloc->bh);
>  
> Index: linux-2.6.23-rc8-ext4-i_version/fs/ext4/super.c
> ===================================================================
> --- linux-2.6.23-rc8-ext4-i_version.orig/fs/ext4/super.c	2007-10-03 18:11:17.0 +0200
> +++ linux-2.6.23-rc8-ext4-i_version/fs/ext4/super.c	2007-10-03 18:17:44.0 +0200
> @@ -742,7 +742,7 @@
>  	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
>  	Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
>  	Opt_grpquota, Opt_extents, Opt_noextents, Opt_delalloc,
> -	Opt_mballoc, Opt_nomballoc, Opt_stripe,
> +	Opt_mballoc, Opt_nomballoc, Opt_stripe, Opt_i_version,
>  };
>  
>  static match_table_t tokens = {
> @@ -800,6 +800,7 @@
>  	{Opt_mballoc, "mballoc"},
>  	{Opt_nomballoc, "nomballoc"},
>  	{Opt_stripe, "stripe=%u"},
> +	{Opt_i_version, "i_version"},
>  	{Opt_err, NULL},
>  	{Opt_resize, "resize"},
>  };
> @@ -1161,6 +1162,10 @@
>  				return 0;
>  			sbi->s_stripe = option;
>  			break;
> +		case Opt_i_version:
> +			set_opt (sbi->s_mount_opt, I_VERSION);
> +			sb->s_flags |= MS_I_VERSION;
> +			break;

Need to make sure this flag is cleared if the fs is remounted without
I_VERSION.

>  		default:
>  			printk (KERN_ERR
>  				"EXT4-fs: Unrecognized mount option \"%s\" "
> Index: linux-2.6.23-rc8-ext4-i_version/include/linux/ext4_fs.h
> ===================================================================
> --- linux-2.6.23-rc8-ext4-i_version.orig/include/linux/ext4_fs.h	2007-10-03 18:11:17.0 +0200
> +++ linux-2.6.23-rc8-ext4-i_version/include/linux/ext4_fs.h	2007-10-03 18:11:54.0 +0200
> @@ -500,6 +500,7 @@
>  #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT	0x100 /* Journal Async Commit */
>  #define EXT4_MOUNT_DELALLOC		0x200 /* Delalloc support */
>  #define EXT4_MOUNT_MBALLOC		0x400 /* Buddy allocation support */
> +#define EXT4_MOUNT_I_VERSION		0x800 /* i_version support */
>  /* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
>  #ifndef _LINUX_EXT2_FS_H
>  #define clear_opt(o, opt)		o &= ~EXT4_MOUNT_##opt

I don't see where this counter is stored to or loaded from disk, so I
assume this is not the full patch series?

Mingming
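To illustrate the review comment about clearing the flag on remount, here is a small userspace model of the idea: reset both bits before parsing the option string, then set them only if the option is present. It is a sketch under stated assumptions, not the ext4 code. The struct, the function and the flag values are all hypothetical, and the substring match stands in for the real token parser.

```c
#include <string.h>

/* Hypothetical flag bits for illustration; not the kernel's values. */
#define SKETCH_MOUNT_I_VERSION 0x1UL	/* filesystem-private mount option */
#define SKETCH_MS_I_VERSION    0x2UL	/* superblock-level flag */

/* Hypothetical, cut-down stand-in for the superblock state. */
struct sketch_sb {
	unsigned long s_mount_opt;
	unsigned long s_flags;
};

/*
 * Model of (re)mount option handling: start from "off", then turn the
 * feature on only if the option string asks for it. A remount without
 * "i_version" therefore leaves both bits clear.
 */
static void sketch_apply_options(struct sketch_sb *sb, const char *options)
{
	sb->s_mount_opt &= ~SKETCH_MOUNT_I_VERSION;
	sb->s_flags &= ~SKETCH_MS_I_VERSION;

	/* Simplification: substring search instead of real token matching. */
	if (options && strstr(options, "i_version")) {
		sb->s_mount_opt |= SKETCH_MOUNT_I_VERSION;
		sb->s_flags |= SKETCH_MS_I_VERSION;
	}
}
```

Calling sketch_apply_options() first with "i_version" and then with an option string that omits it leaves both bits cleared after the second call, which is the behaviour the comment asks for.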
Re: [PATCH 1/2] i_version update - vfs part
On Fri, 2007-10-05 at 17:28 +0200, Cordenner jean noel wrote: Hi, Hi Jean Noel, This is an update of the i_version patch. Just to make sure, is this vfs patch and next ext4 patch together going to replace the 4 inode-version related patches currently in ext4-patch-queue (and git tree)? The i_version field is a 64bit counter that is set on every inode creation and that is incremented every time the inode data is modified (similarly to the ctime time-stamp). The aim is to fulfill a NFSv4 requirement for rfc3530: 5.5. Mandatory Attributes - Definitions Name#DataType Access Description ___ change3uint64 READ A value created by the server that the client can use to determine if file data, directory contents or attributes of the object have been modified. The servermay return the object's time_metadata attribute for this attribute's value but only if the filesystem object can not be updated more frequently than the resolution of time_metadata. This first part deals with adding a flag in the super block and incrementing the i_version in the vfs. Signed-off-by: Jean Noel Cordenner [EMAIL PROTECTED] --- fs/inode.c | 23 +++ fs/libfs.c | 12 include/linux/fs.h |3 +++ 3 files changed, 38 insertions(+) Index: linux-2.6.23-rc8-ext4-i_version/fs/inode.c === --- linux-2.6.23-rc8-ext4-i_version.orig/fs/inode.c 2007-09-26 14:41:41.0 +0200 +++ linux-2.6.23-rc8-ext4-i_version/fs/inode.c2007-10-05 16:14:41.0 +0200 @@ -1216,6 +1216,24 @@ EXPORT_SYMBOL(touch_atime); /** + * inode_inc_iversion - increments i_version + * @inode: inode that need to be updated + * + * Every time the inode is modified, the i_version field + * will be incremented. + * The filesystem has to be mounted with i_version flag + * + */ + +void inode_inc_iversion(struct inode *inode) +{ + spin_lock(inode-i_lock); + inode-i_version++; + spin_unlock(inode-i_lock); +} I suspect we need a lock here, the places where need to update the inode-i_version are already doing update for inode, mostly protected by i_mutex. 
You could remove the above function and update the counter directly at the places it need to. +EXPORT_SYMBOL(inode_inc_iversion); + Seems unnecessary. +/** * file_update_time- update mtime and ctime time * @file: file accessed * @@ -1249,6 +1267,11 @@ sync_it = 1; } + if (IS_I_VERSION(inode)) { + inode_inc_iversion(inode); + sync_it = 1; + } + if (sync_it) mark_inode_dirty_sync(inode); } Index: linux-2.6.23-rc8-ext4-i_version/fs/libfs.c === --- linux-2.6.23-rc8-ext4-i_version.orig/fs/libfs.c 2007-07-09 01:32:17.0 +0200 +++ linux-2.6.23-rc8-ext4-i_version/fs/libfs.c2007-09-26 14:51:08.0 +0200 @@ -255,6 +255,10 @@ struct inode *inode = old_dentry-d_inode; inode-i_ctime = dir-i_ctime = dir-i_mtime = CURRENT_TIME; + if (IS_I_VERSION(inode)) { + inode_inc_iversion(inode); + inode_inc_iversion(dir); + } inc_nlink(inode); atomic_inc(inode-i_count); dget(dentry); @@ -287,6 +291,10 @@ struct inode *inode = dentry-d_inode; inode-i_ctime = dir-i_ctime = dir-i_mtime = CURRENT_TIME; + if (IS_I_VERSION(inode)) { + inode_inc_iversion(inode); + inode_inc_iversion(dir); + } drop_nlink(inode); dput(dentry); return 0; @@ -323,6 +331,10 @@ old_dir-i_ctime = old_dir-i_mtime = new_dir-i_ctime = new_dir-i_mtime = inode-i_ctime = CURRENT_TIME; + if (IS_I_VERSION(old_dir)) { + inode_inc_iversion(old_dir); + inode_inc_iversion(new_dir); + } return 0; } Need to update the counter in libfs.c? Index: linux-2.6.23-rc8-ext4-i_version/include/linux/fs.h === --- linux-2.6.23-rc8-ext4-i_version.orig/include/linux/fs.h 2007-09-26 14:46:15.0 +0200 +++ linux-2.6.23-rc8-ext4-i_version/include/linux/fs.h2007-09-26 14:51:08.0 +0200 @@ -123,6 +123,7 @@ #define MS_SLAVE (119) /* change to slave */ #define MS_SHARED(120) /* change to shared */ #define MS_RELATIME (121) /* Update atime relative to mtime/ctime. */ +#define MS_I_VERSION (122) /* Update inode i_version field */ #define MS_ACTIVE(130) #define MS_NOUSER(131) @@ -172,6 +173,7 @@ ((inode)-i_flags (S_SYNC|S_DIRSYNC)))
[PATCH 1/2] ext3: Support large blocksize up to PAGESIZE
Support large blocksize up to PAGESIZE (max 64KB) for ext3

From: Takashi Sato <[EMAIL PROTECTED]>

This patch set supports large block sizes (>4k, <=64k) in ext3 simply by
enlarging the block size limit. But it is NOT possible to have 64kB
blocksize on ext3 without some changes to the directory handling code.
The reason is that an empty 64kB directory block would have a
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
the filesystem. The proposed solution is to represent a 64k rec_len with
an otherwise impossible value, rec_len = 0xffff.

The patch set consists of the following 2 patches.
  [1/2] ext3: enlarge blocksize
        - Allow blocksize up to pagesize
  [2/2] ext3: fix rec_len overflow
        - prevent rec_len from overflow with 64KB blocksize

On a ppc64 box with 64k pages running this patch set, we could create
an ext3 filesystem with a 64k block size, and it handles empty directory
blocks.

Signed-off-by: Takashi Sato <[EMAIL PROTECTED]>
Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/ext3/super.c         |    6 +-
 include/linux/ext3_fs.h |    4 ++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 9537316..b4bfd36 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1549,7 +1549,11 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 		}
 
 		brelse (bh);
-		sb_set_blocksize(sb, blocksize);
+		if (!sb_set_blocksize(sb, blocksize)) {
+			printk(KERN_ERR "EXT3-fs: bad blocksize %d.\n",
+				blocksize);
+			goto out_fail;
+		}
 		logic_sb_block = (sb_block * EXT3_MIN_BLOCK_SIZE) / blocksize;
 		offset = (sb_block * EXT3_MIN_BLOCK_SIZE) % blocksize;
 		bh = sb_bread(sb, logic_sb_block);
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index ece49a8..7aa5556 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -76,8 +76,8 @@
  * Macro-instructions used to manage several block sizes
  */
 #define EXT3_MIN_BLOCK_SIZE		1024
-#define	EXT3_MAX_BLOCK_SIZE		4096
-#define EXT3_MIN_BLOCK_LOG_SIZE		  10
+#define	EXT3_MAX_BLOCK_SIZE		65536
+#define EXT3_MIN_BLOCK_LOG_SIZE		10
 #ifdef __KERNEL__
 # define EXT3_BLOCK_SIZE(s)		((s)->s_blocksize)
 #else
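The rec_len problem described in the changelog above can be reproduced in a few lines of userspace C. The sketch below shows the 16-bit wraparound and one way to encode and decode around it. It assumes the reserved on-disk value is 0xffff, which I believe cannot be a genuine length because directory record lengths are multiples of four. The sketch_* names are hypothetical, and the little-endian conversion done by the real helpers is left out.

```c
#include <stdint.h>

/* Assumption: the reserved on-disk value is the largest 16-bit number. */
#define SKETCH_MAX_REC_LEN 0xffffu

/* What a plain 16-bit store does to a 64KB record length: 65536 wraps to 0. */
static uint16_t sketch_naive_store(unsigned int len)
{
	return (uint16_t)len;
}

/* Encode: map the one length that does not fit onto the reserved value. */
static uint16_t sketch_rec_len_to_disk(unsigned int len)
{
	if (len == (1u << 16))
		return SKETCH_MAX_REC_LEN;
	return (uint16_t)len;
}

/* Decode: the reserved value stands for a full 64KB record. */
static unsigned int sketch_rec_len_from_disk(uint16_t dlen)
{
	if (dlen == SKETCH_MAX_REC_LEN)
		return 1u << 16;
	return dlen;
}
```

With these helpers a 65536-byte record survives a round trip through the 16-bit field, whereas the naive store turns it into 0, the value the directory checks reject.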
[PATCH 2/2] ext3: Avoid rec_len overflow with 64KB block size
ext3: Avoid rec_len overflow with 64KB block size From: Jan Kara <[EMAIL PROTECTED]> With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry length. So we store 0xffff instead and convert value when read from / written to disk. The patch also converts some places to use ext3_next_entry() when we are changing them anyway. Signed-off-by: Jan Kara <[EMAIL PROTECTED]> Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> --- fs/ext3/dir.c | 10 +++-- fs/ext3/namei.c | 90 ++- include/linux/ext3_fs.h | 20 ++ 3 files changed, 68 insertions(+), 52 deletions(-) diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c index c00723a..3c4c43a 100644 --- a/fs/ext3/dir.c +++ b/fs/ext3/dir.c @@ -69,7 +69,7 @@ int ext3_check_dir_entry (const char * function, struct inode * dir, unsigned long offset) { const char * error_msg = NULL; - const int rlen = le16_to_cpu(de->rec_len); + const int rlen = ext3_rec_len_from_disk(de->rec_len); if (rlen < EXT3_DIR_REC_LEN(1)) error_msg = "rec_len is smaller than minimal"; @@ -177,10 +177,10 @@ revalidate: * least that it is non-zero. A * failure will be detected in the * dirent test below.
*/ - if (le16_to_cpu(de->rec_len) < + if (ext3_rec_len_from_disk(de->rec_len) < EXT3_DIR_REC_LEN(1)) break; - i += le16_to_cpu(de->rec_len); + i += ext3_rec_len_from_disk(de->rec_len); } offset = i; filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1)) @@ -201,7 +201,7 @@ revalidate: ret = stored; goto out; } - offset += le16_to_cpu(de->rec_len); + offset += ext3_rec_len_from_disk(de->rec_len); if (le32_to_cpu(de->inode)) { /* We might block in the next section * if the data destination is @@ -223,7 +223,7 @@ revalidate: goto revalidate; stored ++; } - filp->f_pos += le16_to_cpu(de->rec_len); + filp->f_pos += ext3_rec_len_from_disk(de->rec_len); } offset = 0; brelse (bh); diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c index c1fa190..2c38eb6 100644 --- a/fs/ext3/namei.c +++ b/fs/ext3/namei.c @@ -144,6 +144,15 @@ struct dx_map_entry u16 size; }; +/* + * p is at least 6 bytes before the end of page + */ +static inline struct ext3_dir_entry_2 *ext3_next_entry(struct ext3_dir_entry_2 *p) +{ + return (struct ext3_dir_entry_2 *)((char*)p + + ext3_rec_len_from_disk(p->rec_len)); +} + #ifdef CONFIG_EXT3_INDEX static inline unsigned dx_get_block (struct dx_entry *entry); static void dx_set_block (struct dx_entry *entry, unsigned value); @@ -281,7 +290,7 @@ static struct stats dx_show_leaf(struct dx_hash_info *hinfo, struct ext3_dir_ent space += EXT3_DIR_REC_LEN(de->name_len); names++; } - de = (struct ext3_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len)); + de = ext3_next_entry(de); } printk("(%i)\n", names); return (struct stats) { names, space, 1 }; @@ -548,14 +557,6 @@ static int ext3_htree_next_block(struct inode *dir, __u32 hash, /* - * p is at least 6 bytes before the end of page - */ -static inline struct ext3_dir_entry_2 *ext3_next_entry(struct ext3_dir_entry_2 *p) -{ - return (struct ext3_dir_entry_2 *)((char*)p + le16_to_cpu(p->rec_len)); -} - -/* * This function fills a red-black tree with information from a * directory block. 
It returns the number directory entries loaded * into the tree. If there is an error it is returned in err. @@ -721,7 +722,7 @@ static int dx_make_map (struct ext3_dir_entry_2 *de, int size, cond_resched(); } /* XXX: do we need to check rec_len == 0 case? -Chris */ - de = (struct ext3_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len)); + de = ext3_next_entry(de); } return count; } @@ -825,7 +826,7 @@ static inline int search_dirblock(struct buffer_head * bh, return 1;
[PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
ext2: Avoid rec_len overflow with 64KB block size From: Jan Kara <[EMAIL PROTECTED]> With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry length. So we store 0xffff instead and convert value when read from / written to disk. Signed-off-by: Jan Kara <[EMAIL PROTECTED]> Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> --- fs/ext2/dir.c | 43 +++ include/linux/ext2_fs.h |1 + 2 files changed, 32 insertions(+), 12 deletions(-) diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c index 2bf49d7..1329bdb 100644 --- a/fs/ext2/dir.c +++ b/fs/ext2/dir.c @@ -26,6 +26,24 @@ typedef struct ext2_dir_entry_2 ext2_dirent; +static inline unsigned ext2_rec_len_from_disk(__le16 dlen) +{ + unsigned len = le16_to_cpu(dlen); + + if (len == EXT2_MAX_REC_LEN) + return 1 << 16; + return len; +} + +static inline __le16 ext2_rec_len_to_disk(unsigned len) +{ + if (len == (1 << 16)) + return cpu_to_le16(EXT2_MAX_REC_LEN); + else if (len > (1 << 16)) + BUG(); + return cpu_to_le16(len); +} + /* * ext2 uses block-sized chunks.
Arguably, sector-sized ones would be * more robust, but we have what we have @@ -95,7 +113,7 @@ static void ext2_check_page(struct page *page) } for (offs = 0; offs <= limit - EXT2_DIR_REC_LEN(1); offs += rec_len) { p = (ext2_dirent *)(kaddr + offs); - rec_len = le16_to_cpu(p->rec_len); + rec_len = ext2_rec_len_from_disk(p->rec_len); if (rec_len < EXT2_DIR_REC_LEN(1)) goto Eshort; @@ -193,7 +211,8 @@ static inline int ext2_match (int len, const char * const name, */ static inline ext2_dirent *ext2_next_entry(ext2_dirent *p) { - return (ext2_dirent *)((char*)p + le16_to_cpu(p->rec_len)); + return (ext2_dirent *)((char*)p + + ext2_rec_len_from_disk(p->rec_len)); } static inline unsigned @@ -305,7 +324,7 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir) return 0; } } - filp->f_pos += le16_to_cpu(de->rec_len); + filp->f_pos += ext2_rec_len_from_disk(de->rec_len); } ext2_put_page(page); } @@ -413,7 +432,7 @@ void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de, struct page *page, struct inode *inode) { unsigned from = (char *) de - (char *) page_address(page); - unsigned to = from + le16_to_cpu(de->rec_len); + unsigned to = from + ext2_rec_len_from_disk(de->rec_len); int err; lock_page(page); @@ -469,7 +488,7 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode) /* We hit i_size */ name_len = 0; rec_len = chunk_size; - de->rec_len = cpu_to_le16(chunk_size); + de->rec_len = ext2_rec_len_to_disk(chunk_size); de->inode = 0; goto got_it; } @@ -483,7 +502,7 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode) if (ext2_match (namelen, name, de)) goto out_unlock; name_len = EXT2_DIR_REC_LEN(de->name_len); - rec_len = le16_to_cpu(de->rec_len); + rec_len = ext2_rec_len_from_disk(de->rec_len); if (!de->inode && rec_len >= reclen) goto got_it; if (rec_len >= name_len + reclen) @@ -504,8 +523,8 @@ got_it: goto out_unlock; if (de->inode) { ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len); - de1->rec_len = 
cpu_to_le16(rec_len - name_len); - de->rec_len = cpu_to_le16(name_len); + de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len); + de->rec_len = ext2_rec_len_to_disk(name_len); de = de1; } de->name_len = namelen; @@ -536,7 +555,7 @@ int ext2_delete_entry (struct ext2_dir_entry_2 * dir, struct page * page ) struct inode *inode = mapping->host; char *kaddr = page_address(page); unsigned from = ((char*)dir - kaddr) & ~(ext2_chunk_size(inode)-1); - unsigned to = ((char*)dir - kaddr) + le16_to_cpu(dir->rec_len); + unsigned to = ((char*)dir - kaddr) + ext2_rec_len_from_disk(dir->rec_len); ext2_dirent * pde = NULL; ext2_dirent *
[PATCH 1/2] ext2: Support large blocksize up to PAGESIZE
Support large blocksize up to PAGESIZE (max 64KB) for ext2

From: Takashi Sato <[EMAIL PROTECTED]>

This patch set supports large block sizes (>4k, <=64k) in ext2 simply by
enlarging the block size limit. But it is NOT possible to have 64kB
blocksize on ext2 without some changes to the directory handling code.
The reason is that an empty 64kB directory block would have a
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
the filesystem. The proposed solution is to represent a 64k rec_len with
an otherwise impossible value, rec_len = 0xffff.

The patch set consists of the following 2 patches.
  [1/2] ext2: enlarge blocksize
        - Allow blocksize up to pagesize
  [2/2] ext2: fix rec_len overflow
        - prevent rec_len from overflow with 64KB blocksize

On a ppc64 box with 64k pages running this patch set, we could create
an ext2 filesystem with a 64k block size, and it handles empty directory
blocks.

Please consider including this in the mm tree.

Signed-off-by: Takashi Sato <[EMAIL PROTECTED]>
Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/ext2/super.c         |    3 ++-
 include/linux/ext2_fs.h |    4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 639a32c..765c805 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -775,7 +775,8 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 		brelse(bh);
 
 		if (!sb_set_blocksize(sb, blocksize)) {
-			printk(KERN_ERR "EXT2-fs: blocksize too small for device.\n");
+			printk(KERN_ERR "EXT2-fs: bad blocksize %d.\n",
+				blocksize);
 			goto failed_sbi;
 		}
 
diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
index 153d755..910a705 100644
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -86,8 +86,8 @@ static inline struct ext2_sb_info *EXT2_SB(struct super_block *sb)
  * Macro-instructions used to manage several block sizes
  */
 #define EXT2_MIN_BLOCK_SIZE		1024
-#define	EXT2_MAX_BLOCK_SIZE		4096
-#define EXT2_MIN_BLOCK_LOG_SIZE		  10
+#define EXT2_MAX_BLOCK_SIZE		65536
+#define EXT2_MIN_BLOCK_LOG_SIZE		10
 #ifdef __KERNEL__
 # define EXT2_BLOCK_SIZE(s)		((s)->s_blocksize)
 #else
[PATCH 2/2] ext4: Avoid rec_len overflow with 64KB block size
ext4: Avoid rec_len overflow with 64KB block size From: Jan Kara <[EMAIL PROTECTED]> With 64KB blocksize, a directory entry can have size 64KB which does not fit into 16 bits we have for entry length. So we store 0xffff instead and convert value when read from / written to disk. The patch also converts some places to use ext4_next_entry() when we are changing them anyway. Signed-off-by: Jan Kara <[EMAIL PROTECTED]> Signed-off-by: Mingming Cao <[EMAIL PROTECTED]> --- fs/ext4/dir.c | 12 --- fs/ext4/namei.c | 76 ++- include/linux/ext4_fs.h | 20 3 files changed, 62 insertions(+), 46 deletions(-) diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c index 3ab01c0..20b1e28 100644 --- a/fs/ext4/dir.c +++ b/fs/ext4/dir.c @@ -69,7 +69,7 @@ int ext4_check_dir_entry (const char * function, struct inode * dir, unsigned long offset) { const char * error_msg = NULL; - const int rlen = le16_to_cpu(de->rec_len); + const int rlen = ext4_rec_len_from_disk(de->rec_len); if (rlen < EXT4_DIR_REC_LEN(1)) error_msg = "rec_len is smaller than minimal"; @@ -176,10 +176,10 @@ revalidate: * least that it is non-zero. A * failure will be detected in the * dirent test below.
*/ - if (le16_to_cpu(de->rec_len) < - EXT4_DIR_REC_LEN(1)) + if (ext4_rec_len_from_disk(de->rec_len) + < EXT4_DIR_REC_LEN(1)) break; - i += le16_to_cpu(de->rec_len); + i += ext4_rec_len_from_disk(de->rec_len); } offset = i; filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1)) @@ -201,7 +201,7 @@ revalidate: ret = stored; goto out; } - offset += le16_to_cpu(de->rec_len); + offset += ext4_rec_len_from_disk(de->rec_len); if (le32_to_cpu(de->inode)) { /* We might block in the next section * if the data destination is @@ -223,7 +223,7 @@ revalidate: goto revalidate; stored ++; } - filp->f_pos += le16_to_cpu(de->rec_len); + filp->f_pos += ext4_rec_len_from_disk(de->rec_len); } offset = 0; brelse (bh); diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 5fdb862..96e8a85 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -281,7 +281,7 @@ static struct stats dx_show_leaf(struct dx_hash_info *hinfo, struct ext4_dir_ent space += EXT4_DIR_REC_LEN(de->name_len); names++; } - de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len)); + de = ext4_next_entry(de); } printk("(%i)\n", names); return (struct stats) { names, space, 1 }; @@ -552,7 +552,8 @@ static int ext4_htree_next_block(struct inode *dir, __u32 hash, */ static inline struct ext4_dir_entry_2 *ext4_next_entry(struct ext4_dir_entry_2 *p) { - return (struct ext4_dir_entry_2 *)((char*)p + le16_to_cpu(p->rec_len)); + return (struct ext4_dir_entry_2 *)((char*)p + + ext4_rec_len_from_disk(p->rec_len)); } /* @@ -721,7 +722,7 @@ static int dx_make_map (struct ext4_dir_entry_2 *de, int size, cond_resched(); } /* XXX: do we need to check rec_len == 0 case? 
-Chris */ - de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len)); + de = ext4_next_entry(de); } return count; } @@ -823,7 +824,7 @@ static inline int search_dirblock(struct buffer_head * bh, return 1; } /* prevent looping on a bad block */ - de_len = le16_to_cpu(de->rec_len); + de_len = ext4_rec_len_from_disk(de->rec_len); if (de_len <= 0) return -1; offset += de_len; @@ -1136,7 +1137,7 @@ dx_move_dirents(char *from, char *to, struct dx_map_entry *map, int count) rec_len = EXT4_DIR_REC_LEN(de->name_len); memcpy (to, de, rec_len); ((struct ext4_dir_entry_2 *) to)->rec_len = -
[PATCH 1/2] ext4: Support large blocksize up to PAGESIZE
Support large blocksize up to PAGESIZE (max 64KB) for ext4.

From: Takashi Sato <[EMAIL PROTECTED]>

This patch set supports large block sizes (>4k, <=64k) in ext4 by simply
enlarging the block size limit. However, it is NOT possible to have a 64kB
blocksize on ext4 without some changes to the directory handling code. The
reason is that an empty 64kB directory block would have
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in the
filesystem. The proposed solution is to represent a 64k rec_len with an
otherwise impossible value, rec_len = 0xffff.

The patch set consists of the following 2 patches.
  [1/2] ext4: enlarge blocksize
        - Allow blocksize up to pagesize
  [2/2] ext4: fix rec_len overflow
        - prevent rec_len from overflow with 64KB blocksize

With this patch set applied, on a ppc64 box with 64k pages we could create
an ext4dev filesystem with a 64k block size, and it handles an empty
directory block correctly.

Please consider this patch for merging into 2.6.24-rc1.

Signed-off-by: Takashi Sato <[EMAIL PROTECTED]>
Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/ext4/super.c         |    5 +++++
 include/linux/ext4_fs.h |    4 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 619db84..d8bb279 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1548,6 +1548,11 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)
                goto out_fail;
        }

+       if (!sb_set_blocksize(sb, blocksize)) {
+               printk(KERN_ERR "EXT4-fs: bad blocksize %d.\n", blocksize);
+               goto out_fail;
+       }
+
        /*
         * The ext4 superblock will not be buffer aligned for other than 1kB
         * block sizes.  We need to calculate the offset from buffer start.
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index f9881b6..d15a15e 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -77,8 +77,8 @@
  * Macro-instructions used to manage several block sizes
  */
 #define EXT4_MIN_BLOCK_SIZE             1024
-#define EXT4_MAX_BLOCK_SIZE             4096
-#define EXT4_MIN_BLOCK_LOG_SIZE           10
+#define EXT4_MAX_BLOCK_SIZE             65536
+#define EXT4_MIN_BLOCK_LOG_SIZE         10
 #ifdef __KERNEL__
 # define EXT4_BLOCK_SIZE(s)             ((s)->s_blocksize)
 #else
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
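The rec_len wrap-around described in the cover text can be reproduced in user space. What follows is only an illustrative sketch, not code from the patch set: every name starting with `sketch_` or `naive_` is hypothetical, `uint16_t` stands in for the on-disk `__u16`/`__le16` field, byte-order conversion is left out, and the sentinel is assumed to be 0xffff, the largest value a 16-bit field can hold.

```c
#include <stdint.h>

/* Assumed sentinel (not taken from the patch text): largest 16-bit value. */
#define SKETCH_MAX_REC_LEN 0xffffu
#define SKETCH_64K         (1u << 16)

/* Naive store: conversion to an unsigned 16-bit type is modulo 2^16,
 * so a 64KB record length comes back as 0. */
static uint16_t naive_rec_len_to_disk(uint32_t len)
{
	return (uint16_t)len;
}

/* Encode: a full 64KB record is written as the sentinel instead of 0.
 * Callers are assumed to pass len <= 64KB. */
static uint16_t sketch_rec_len_to_disk(uint32_t len)
{
	if (len == SKETCH_64K)
		return SKETCH_MAX_REC_LEN;
	return (uint16_t)len;
}

/* Decode: the sentinel read from disk means a full 64KB record. */
static uint32_t sketch_rec_len_from_disk(uint16_t dlen)
{
	if (dlen == SKETCH_MAX_REC_LEN)
		return SKETCH_64K;
	return dlen;
}
```

With the naive store a 64KB entry reads back as length 0, which is the error case the cover text describes; with the sentinel pair, 65536 survives a round trip and smaller lengths are unchanged.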
[PATCH 2/2] ext4: Avoid rec_len overflow with 64KB block size
ext4: Avoid rec_len overflow with 64KB block size

From: Jan Kara <[EMAIL PROTECTED]>

With a 64KB blocksize, a directory entry can have a size of 64KB, which does
not fit into the 16 bits we have for the entry length. So we store 0xffff
instead and convert the value when it is read from / written to disk. The
patch also converts some places to use ext4_next_entry() since we are
changing them anyway.

Signed-off-by: Jan Kara <[EMAIL PROTECTED]>
Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/ext4/dir.c           |   12
 fs/ext4/namei.c         |   76
 include/linux/ext4_fs.h |   20
 3 files changed, 62 insertions(+), 46 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 3ab01c0..20b1e28 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -69,7 +69,7 @@ int ext4_check_dir_entry (const char * function, struct inode * dir,
                          unsigned long offset)
 {
        const char * error_msg = NULL;
-       const int rlen = le16_to_cpu(de->rec_len);
+       const int rlen = ext4_rec_len_from_disk(de->rec_len);

        if (rlen < EXT4_DIR_REC_LEN(1))
                error_msg = "rec_len is smaller than minimal";
@@ -176,10 +176,10 @@ revalidate:
                                 * least that it is non-zero.  A
                                 * failure will be detected in the
                                 * dirent test below. */
-                               if (le16_to_cpu(de->rec_len) <
-                                               EXT4_DIR_REC_LEN(1))
+                               if (ext4_rec_len_from_disk(de->rec_len)
+                                               < EXT4_DIR_REC_LEN(1))
                                        break;
-                               i += le16_to_cpu(de->rec_len);
+                               i += ext4_rec_len_from_disk(de->rec_len);
                        }
                        offset = i;
                        filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1))
@@ -201,7 +201,7 @@ revalidate:
                                ret = stored;
                                goto out;
                        }
-                       offset += le16_to_cpu(de->rec_len);
+                       offset += ext4_rec_len_from_disk(de->rec_len);
                        if (le32_to_cpu(de->inode)) {
                                /* We might block in the next section
                                 * if the data destination is
@@ -223,7 +223,7 @@ revalidate:
                                        goto revalidate;
                                stored ++;
                        }
-                       filp->f_pos += le16_to_cpu(de->rec_len);
+                       filp->f_pos += ext4_rec_len_from_disk(de->rec_len);
                }
                offset = 0;
                brelse (bh);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 5fdb862..96e8a85 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -281,7 +281,7 @@ static struct stats dx_show_leaf(struct dx_hash_info *hinfo, struct ext4_dir_ent
                                space += EXT4_DIR_REC_LEN(de->name_len);
                                names++;
                        }
-                       de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+                       de = ext4_next_entry(de);
                }
                printk("(%i)\n", names);
                return (struct stats) { names, space, 1 };
@@ -552,7 +552,8 @@ static int ext4_htree_next_block(struct inode *dir, __u32 hash,
  */
 static inline struct ext4_dir_entry_2 *ext4_next_entry(struct ext4_dir_entry_2 *p)
 {
-       return (struct ext4_dir_entry_2 *)((char*)p + le16_to_cpu(p->rec_len));
+       return (struct ext4_dir_entry_2 *)((char*)p +
+               ext4_rec_len_from_disk(p->rec_len));
 }

 /*
@@ -721,7 +722,7 @@ static int dx_make_map (struct ext4_dir_entry_2 *de, int size,
                        cond_resched();
                }
                /* XXX: do we need to check rec_len == 0 case? -Chris */
-               de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+               de = ext4_next_entry(de);
        }
        return count;
 }
@@ -823,7 +824,7 @@ static inline int search_dirblock(struct buffer_head * bh,
                        return 1;
                }
                /* prevent looping on a bad block */
-               de_len = le16_to_cpu(de->rec_len);
+               de_len = ext4_rec_len_from_disk(de->rec_len);
                if (de_len <= 0)
                        return -1;
                offset += de_len;
@@ -1136,7 +1137,7 @@ dx_move_dirents(char *from, char *to, struct dx_map_entry *map, int count)
                rec_len = EXT4_DIR_REC_LEN(de->name_len);
                memcpy (to, de, rec_len);
                ((struct ext4_dir_entry_2 *) to)->rec_len =
-                               cpu_to_le16(rec_len);
+                               ext4_rec_len_to_disk(rec_len);
                de->inode = 0;
                map
[PATCH 1/2] ext2: Support large blocksize up to PAGESIZE
Support large blocksize up to PAGESIZE (max 64KB) for ext2

From: Takashi Sato <[EMAIL PROTECTED]>

This patch set supports large block sizes (>4k, <=64k) in ext2 by simply
enlarging the block size limit. However, it is NOT possible to have a 64kB
blocksize on ext2 without some changes to the directory handling code. The
reason is that an empty 64kB directory block would have
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in the
filesystem. The proposed solution is to represent a 64k rec_len with an
otherwise impossible value, rec_len = 0xffff.

The patch set consists of the following 2 patches.
  [1/2] ext2: enlarge blocksize
        - Allow blocksize up to pagesize
  [2/2] ext2: fix rec_len overflow
        - prevent rec_len from overflow with 64KB blocksize

With this patch set applied, on a ppc64 box with 64k pages we could create
an ext2 filesystem with a 64k block size, and it handles an empty directory
block correctly.

Please consider including this in the -mm tree.

Signed-off-by: Takashi Sato <[EMAIL PROTECTED]>
Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/ext2/super.c         |    3 ++-
 include/linux/ext2_fs.h |    4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 639a32c..765c805 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -775,7 +775,8 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
                brelse(bh);

                if (!sb_set_blocksize(sb, blocksize)) {
-                       printk(KERN_ERR "EXT2-fs: blocksize too small for device.\n");
+                       printk(KERN_ERR "EXT2-fs: bad blocksize %d.\n",
+                               blocksize);
                        goto failed_sbi;
                }

diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
index 153d755..910a705 100644
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -86,8 +86,8 @@ static inline struct ext2_sb_info *EXT2_SB(struct super_block *sb)
  * Macro-instructions used to manage several block sizes
  */
 #define EXT2_MIN_BLOCK_SIZE             1024
-#define EXT2_MAX_BLOCK_SIZE             4096
-#define EXT2_MIN_BLOCK_LOG_SIZE           10
+#define EXT2_MAX_BLOCK_SIZE             65536
+#define EXT2_MIN_BLOCK_LOG_SIZE         10
 #ifdef __KERNEL__
 # define EXT2_BLOCK_SIZE(s)             ((s)->s_blocksize)
 #else
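The patch makes the mount fail with "bad blocksize" whenever sb_set_blocksize() rejects the requested size, which is what ties the new 64KB maximum to the page size of the machine. The checks performed inside sb_set_blocksize() are not shown in the patch, so the sketch below is an assumption about the kind of validation involved (power of two, at least 512 bytes, no larger than a page); `sketch_blocksize_ok` and `sketch_is_power_of_2` are hypothetical user-space names, not kernel functions.

```c
#include <stdbool.h>

/* True when exactly one bit is set. */
static bool sketch_is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Assumed acceptance rule for a block size: a power of two between
 * 512 bytes and the page size of the running machine (inclusive).
 */
static bool sketch_blocksize_ok(unsigned long size, unsigned long page_size)
{
	if (size < 512 || size > page_size)
		return false;
	return sketch_is_power_of_2(size);
}
```

Under these assumptions a 64KB block size passes on a machine with 64KB pages and is rejected on one with 4KB pages, which matches the "up to PAGESIZE" wording of the subject line.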
[PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
ext2: Avoid rec_len overflow with 64KB block size

From: Jan Kara <[EMAIL PROTECTED]>

With a 64KB blocksize, a directory entry can have a size of 64KB, which does
not fit into the 16 bits we have for the entry length. So we store 0xffff
instead and convert the value when it is read from / written to disk.

Signed-off-by: Jan Kara <[EMAIL PROTECTED]>
Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/ext2/dir.c           |   43
 include/linux/ext2_fs.h |    1 +
 2 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 2bf49d7..1329bdb 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -26,6 +26,24 @@

 typedef struct ext2_dir_entry_2 ext2_dirent;

+static inline unsigned ext2_rec_len_from_disk(__le16 dlen)
+{
+       unsigned len = le16_to_cpu(dlen);
+
+       if (len == EXT2_MAX_REC_LEN)
+               return 1 << 16;
+       return len;
+}
+
+static inline __le16 ext2_rec_len_to_disk(unsigned len)
+{
+       if (len == (1 << 16))
+               return cpu_to_le16(EXT2_MAX_REC_LEN);
+       else if (len > (1 << 16))
+               BUG();
+       return cpu_to_le16(len);
+}
+
 /*
  * ext2 uses block-sized chunks. Arguably, sector-sized ones would be
  * more robust, but we have what we have
@@ -95,7 +113,7 @@ static void ext2_check_page(struct page *page)
        }
        for (offs = 0; offs <= limit - EXT2_DIR_REC_LEN(1); offs += rec_len) {
                p = (ext2_dirent *)(kaddr + offs);
-               rec_len = le16_to_cpu(p->rec_len);
+               rec_len = ext2_rec_len_from_disk(p->rec_len);

                if (rec_len < EXT2_DIR_REC_LEN(1))
                        goto Eshort;
@@ -193,7 +211,8 @@ static inline int ext2_match (int len, const char * const name,
  */
 static inline ext2_dirent *ext2_next_entry(ext2_dirent *p)
 {
-       return (ext2_dirent *)((char*)p + le16_to_cpu(p->rec_len));
+       return (ext2_dirent *)((char*)p +
+                       ext2_rec_len_from_disk(p->rec_len));
 }

 static inline unsigned
@@ -305,7 +324,7 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir)
                                        return 0;
                                }
                        }
-                       filp->f_pos += le16_to_cpu(de->rec_len);
+                       filp->f_pos += ext2_rec_len_from_disk(de->rec_len);
                }
                ext2_put_page(page);
        }
@@ -413,7 +432,7 @@ void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
                        struct page *page, struct inode *inode)
 {
        unsigned from = (char *) de - (char *) page_address(page);
-       unsigned to = from + le16_to_cpu(de->rec_len);
+       unsigned to = from + ext2_rec_len_from_disk(de->rec_len);
        int err;

        lock_page(page);
@@ -469,7 +488,7 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
                                /* We hit i_size */
                                name_len = 0;
                                rec_len = chunk_size;
-                               de->rec_len = cpu_to_le16(chunk_size);
+                               de->rec_len = ext2_rec_len_to_disk(chunk_size);
                                de->inode = 0;
                                goto got_it;
                        }
@@ -483,7 +502,7 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
                        if (ext2_match (namelen, name, de))
                                goto out_unlock;
                        name_len = EXT2_DIR_REC_LEN(de->name_len);
-                       rec_len = le16_to_cpu(de->rec_len);
+                       rec_len = ext2_rec_len_from_disk(de->rec_len);
                        if (!de->inode && rec_len >= reclen)
                                goto got_it;
                        if (rec_len >= name_len + reclen)
@@ -504,8 +523,8 @@ got_it:
                goto out_unlock;
        if (de->inode) {
                ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
-               de1->rec_len = cpu_to_le16(rec_len - name_len);
-               de->rec_len = cpu_to_le16(name_len);
+               de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
+               de->rec_len = ext2_rec_len_to_disk(name_len);
                de = de1;
        }
        de->name_len = namelen;
@@ -536,7 +555,7 @@ int ext2_delete_entry (struct ext2_dir_entry_2 * dir, struct page * page )
        struct inode *inode = mapping->host;
        char *kaddr = page_address(page);
        unsigned from = ((char*)dir - kaddr) & ~(ext2_chunk_size(inode)-1);
-       unsigned to = ((char*)dir - kaddr) + le16_to_cpu(dir->rec_len);
+       unsigned to = ((char*)dir - kaddr) + ext2_rec_len_from_disk(dir->rec_len);
        ext2_dirent * pde = NULL;
        ext2_dirent * de = (ext2_dirent *) (kaddr + from);
        int err;
@@ -557,7 +576,7 @@ int ext2_delete_entry (struct ext2_dir_entry_2 * dir, struct page * page )
        err = mapping->a_ops->prepare_write
[PATCH 1/2] ext3: Support large blocksize up to PAGESIZE
Support large blocksize up to PAGESIZE (max 64KB) for ext3

From: Takashi Sato <[EMAIL PROTECTED]>

This patch set supports large block sizes (>4k, <=64k) in ext3 by simply
enlarging the block size limit. However, it is NOT possible to have a 64kB
blocksize on ext3 without some changes to the directory handling code. The
reason is that an empty 64kB directory block would have
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in the
filesystem. The proposed solution is to represent a 64k rec_len with an
otherwise impossible value, rec_len = 0xffff.

The patch set consists of the following 2 patches.
  [1/2] ext3: enlarge blocksize
        - Allow blocksize up to pagesize
  [2/2] ext3: fix rec_len overflow
        - prevent rec_len from overflow with 64KB blocksize

With this patch set applied, on a ppc64 box with 64k pages we could create
an ext3 filesystem with a 64k block size, and it handles an empty directory
block correctly.

Signed-off-by: Takashi Sato <[EMAIL PROTECTED]>
Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/ext3/super.c         |    6 +++++-
 include/linux/ext3_fs.h |    4 ++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 9537316..b4bfd36 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1549,7 +1549,11 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
        }

        brelse (bh);
-       sb_set_blocksize(sb, blocksize);
+       if (!sb_set_blocksize(sb, blocksize)) {
+               printk(KERN_ERR "EXT3-fs: bad blocksize %d.\n",
+                       blocksize);
+               goto out_fail;
+       }
        logic_sb_block = (sb_block * EXT3_MIN_BLOCK_SIZE) / blocksize;
        offset = (sb_block * EXT3_MIN_BLOCK_SIZE) % blocksize;
        bh = sb_bread(sb, logic_sb_block);
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index ece49a8..7aa5556 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -76,8 +76,8 @@
  * Macro-instructions used to manage several block sizes
  */
 #define EXT3_MIN_BLOCK_SIZE             1024
-#define EXT3_MAX_BLOCK_SIZE             4096
-#define EXT3_MIN_BLOCK_LOG_SIZE           10
+#define EXT3_MAX_BLOCK_SIZE             65536
+#define EXT3_MIN_BLOCK_LOG_SIZE         10
 #ifdef __KERNEL__
 # define EXT3_BLOCK_SIZE(s)             ((s)->s_blocksize)
 #else
[PATCH 2/2] ext3: Avoid rec_len overflow with 64KB block size
ext3: Avoid rec_len overflow with 64KB block size

From: Jan Kara <[EMAIL PROTECTED]>

With a 64KB blocksize, a directory entry can have a size of 64KB, which does
not fit into the 16 bits we have for the entry length. So we store 0xffff
instead and convert the value when it is read from / written to disk. The
patch also converts some places to use ext3_next_entry() since we are
changing them anyway.

Signed-off-by: Jan Kara <[EMAIL PROTECTED]>
Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/ext3/dir.c           |   10
 fs/ext3/namei.c         |   90
 include/linux/ext3_fs.h |   20
 3 files changed, 68 insertions(+), 52 deletions(-)

diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index c00723a..3c4c43a 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -69,7 +69,7 @@ int ext3_check_dir_entry (const char * function, struct inode * dir,
                          unsigned long offset)
 {
        const char * error_msg = NULL;
-       const int rlen = le16_to_cpu(de->rec_len);
+       const int rlen = ext3_rec_len_from_disk(de->rec_len);

        if (rlen < EXT3_DIR_REC_LEN(1))
                error_msg = "rec_len is smaller than minimal";
@@ -177,10 +177,10 @@ revalidate:
                                 * least that it is non-zero.  A
                                 * failure will be detected in the
                                 * dirent test below. */
-                               if (le16_to_cpu(de->rec_len) <
+                               if (ext3_rec_len_from_disk(de->rec_len) <
                                                EXT3_DIR_REC_LEN(1))
                                        break;
-                               i += le16_to_cpu(de->rec_len);
+                               i += ext3_rec_len_from_disk(de->rec_len);
                        }
                        offset = i;
                        filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1))
@@ -201,7 +201,7 @@ revalidate:
                                ret = stored;
                                goto out;
                        }
-                       offset += le16_to_cpu(de->rec_len);
+                       offset += ext3_rec_len_from_disk(de->rec_len);
                        if (le32_to_cpu(de->inode)) {
                                /* We might block in the next section
                                 * if the data destination is
@@ -223,7 +223,7 @@ revalidate:
                                        goto revalidate;
                                stored ++;
                        }
-                       filp->f_pos += le16_to_cpu(de->rec_len);
+                       filp->f_pos += ext3_rec_len_from_disk(de->rec_len);
                }
                offset = 0;
                brelse (bh);
diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index c1fa190..2c38eb6 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -144,6 +144,15 @@ struct dx_map_entry
        u16 size;
 };

+/*
+ * p is at least 6 bytes before the end of page
+ */
+static inline struct ext3_dir_entry_2 *ext3_next_entry(struct ext3_dir_entry_2 *p)
+{
+       return (struct ext3_dir_entry_2 *)((char*)p +
+               ext3_rec_len_from_disk(p->rec_len));
+}
+
 #ifdef CONFIG_EXT3_INDEX
 static inline unsigned dx_get_block (struct dx_entry *entry);
 static void dx_set_block (struct dx_entry *entry, unsigned value);
@@ -281,7 +290,7 @@ static struct stats dx_show_leaf(struct dx_hash_info *hinfo, struct ext3_dir_ent
                                space += EXT3_DIR_REC_LEN(de->name_len);
                                names++;
                        }
-                       de = (struct ext3_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+                       de = ext3_next_entry(de);
                }
                printk("(%i)\n", names);
                return (struct stats) { names, space, 1 };
@@ -548,14 +557,6 @@ static int ext3_htree_next_block(struct inode *dir, __u32 hash,


 /*
- * p is at least 6 bytes before the end of page
- */
-static inline struct ext3_dir_entry_2 *ext3_next_entry(struct ext3_dir_entry_2 *p)
-{
-       return (struct ext3_dir_entry_2 *)((char*)p + le16_to_cpu(p->rec_len));
-}
-
-/*
  * This function fills a red-black tree with information from a
  * directory block.  It returns the number directory entries loaded
  * into the tree.  If there is an error it is returned in err.
@@ -721,7 +722,7 @@ static int dx_make_map (struct ext3_dir_entry_2 *de, int size,
                        cond_resched();
                }
                /* XXX: do we need to check rec_len == 0 case? -Chris */
-               de = (struct ext3_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+               de = ext3_next_entry(de);
        }
        return count;
 }
@@ -825,7 +826,7 @@ static inline int search_dirblock(struct buffer_head * bh,
                        return 1;
                }
                /* prevent looping on a bad block */
-               de_len = le16_to_cpu(de->rec_len);
+               de_len = ext3_rec_len_from_disk(de
Re: kernel Oops in ext3 code
> BUG: unable to handle kernel paging request at virtual address 104b
>  printing eip:
> c0195bd3
> *pde =
> Oops: [#1]
> PREEMPT SMP
> Modules linked in: vboxdrv binfmt_misc fuse coretemp hwmon gspca videodev
> v4l2_common v4l1_compat iwl3945 mac80211 tifm_7xx1 tifm_core joydev irda
> crc_ccitt 8250_pnp 8250 serial_core firewire_ohci firewire_core crc_itu_t
> CPU:    0
> EIP:    0060:[<c0195bd3>]    Not tainted VLI
> EFLAGS: 00010206   (2.6.23-rc6 #1)
> EIP is at ext3_discard_reservation+0x18/0x4d
> eax: dff23800   ebx: 1033   ecx: dfc15ec0   edx:
> esi: c0007c44   edi: 1033   ebp: dfc2bef4   esp: dfc2beac
> ds: 007b   es: 007b   fs: 00d8  gs:   ss: 0068
> Process kswapd0 (pid: 261, ti=dfc2a000 task=dfcac570 task.ti=dfc2a000)
> Stack: c0007ba4 c0007c44 1033 c019ec51 c0007c44 c0007d8c 002c
> c0171b1b
>        002c c0007c44 c0007c4c c0171da2 c050880c 0080
> 0080
>        c0171fb8 0080 c0007e48 df9e3910 7404 c03f5634 0080
> 00d0
> Call Trace:
>  [<c019ec51>] ext3_clear_inode+0x5d/0x76
>  [<c0171b1b>] clear_inode+0x6b/0xb9
>  [<c0171da2>] dispose_list+0x48/0xc9
>  [<c0171fb8>] shrink_icache_memory+0x195/0x1bd
>  [<c014f5ec>] shrink_slab+0xe2/0x159
>  [<c014f9a0>] kswapd+0x2d3/0x431
>  [<c0132520>] autoremove_wake_function+0x0/0x33
>  [<c014f6cd>] kswapd+0x0/0x431
>  [<c0132453>] kthread+0x38/0x5d
>  [<c013241b>] kthread+0x0/0x5d
>  [<c0104b73>] kernel_thread_helper+0x7/0x10
>  ===
> Code: 83 f8 01 19 c0 f7 d0 83 e0 08 89 42 0c 89 56 b4 5b 5e c3 57 56 89 c6
> 53 8b 58 b4 8b 80 a4 00 00 00 85 db 8b 80 78 01 00 00 74 30 <83> 7b 18 00 74
> 2a 8d b8 00 03 00 00 89 f8 e8 b8 ca 1a 00 83 7b
> EIP: [<c0195bd3>] ext3_discard_reservation+0x18/0x4d SS:ESP 0068:dfc2beac
>
>
On Fri, 2007-09-28 at 17:00 +0200, Norbert Preining wrote:
> On Fr, 28 Sep 2007, Badari Pulavarty wrote:
> > objdump -DlS balloc.o
>
> Here it is
>

Thanks

Looks like the kernel oops is at 1753 (173b+0x18):

173b <ext3_discard_reservation>:
ext3_discard_reservation():
    173b:       57                      push   %edi
    173c:       56                      push   %esi
    173d:       89 c6                   mov    %eax,%esi
    173f:       53                      push   %ebx
    1740:       8b 58 b4                mov    -0x4c(%eax),%ebx
    1743:       8b 80 a4 00 00 00       mov    0xa4(%eax),%eax
    1749:       85 db                   test   %ebx,%ebx
    174b:       8b 80 78 01 00 00       mov    0x178(%eax),%eax
    1751:       74 30                   je     1783 <ext3_discard_reservation+0x48>
    1753:       83 7b 18 00             cmpl   $0x0,0x18(%ebx)
==> Kernel oops here, ebx=1033, which matches the bad page location 104b (=1033+0x18)
    1757:       74 2a                   je     1783 <ext3_discard_reservation+0x48>
    1759:       8d b8 00 03 00 00       lea    0x300(%eax),%edi
    175f:       89 f8                   mov    %edi,%eax
    1761:       e8 fc ff ff ff          call   1762 <ext3_discard_reservation+0x27>
    1766:       83 7b 18 00             cmpl   $0x0,0x18(%ebx)
    176a:       74 0d                   je     1779 <ext3_discard_reservation+0x3e>
    176c:       8b 86 a4 00 00 00       mov    0xa4(%esi),%eax
    1772:       89 da                   mov    %ebx,%edx
    1774:       e8 dc eb ff ff          call   355 <rsv_window_remove>
    1779:       89 f8                   mov    %edi,%eax
    177b:       5b                      pop    %ebx
    177c:       5e                      pop    %esi
    177d:       5f                      pop    %edi
    177e:       e9 fc ff ff ff          jmp    177f <ext3_discard_reservation+0x44>
    1783:       5b                      pop    %ebx
    1784:       5e                      pop    %esi
    1785:       5f                      pop    %edi
    1786:       c3                      ret

And trying to match this to the code:

void ext3_discard_reservation(struct inode *inode)
{
        struct ext3_inode_info *ei = EXT3_I(inode);
        struct ext3_block_alloc_info *block_i = ei->i_block_alloc_info;
        struct ext3_reserve_window_node *rsv;
        spinlock_t *rsv_lock = &EXT3_SB(inode->i_sb)->s_rsv_window_lock;

        if (!block_i)
                return;

        rsv = &block_i->rsv_window_node;
        if (!rsv_is_empty(&rsv->rsv_window)) {          => kernel oops here
                spin_lock(rsv_lock);
                if (!rsv_is_empty(&rsv->rsv_window))
                        rsv_window_remove(inode->i_sb, rsv);
                spin_unlock(rsv_lock);
        }
}

It seems ebx points to block_i (i_block_alloc_info), and that is a bad memory location, which leads to the bad paging request when trying to get the rsv_window structure. But it confuses me why the rsv_window offset is 0x18 from i_block_alloc_info; it should be 0x14 (20 bytes)... Are you running a vanilla 2.6.23-rc6?

No clue for now how i_block_alloc_info came to point to a bad location. ext3_alloc_inode() clearly initializes this field to NULL, and ext3_clear_inode() clearly sets this field to NULL. So during the lifecycle of the inode, i_block_alloc_info should either point to a valid address or be NULL. And the stack trace indicates the oops happened when pushing the inode from the
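The address arithmetic in the analysis above can be checked with a small user-space sketch. The layout below is an assumption, not something stated in the thread: it models a 32-bit reservation node with fixed-width fields, and all `sketch_` names are hypothetical. Under that assumed layout the embedded window starts at 0x14, and its second field sits at 0x18; whether the compiled rsv_is_empty() actually reads that second field is a guess that would need to be confirmed against the source.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed 32-bit layout of a red-black tree node: three 4-byte words. */
struct sketch_rb_node {
	uint32_t parent_color;
	uint32_t right;
	uint32_t left;
};

/* Assumed reservation window: start and end block, 4 bytes each. */
struct sketch_reserve_window {
	uint32_t start;
	uint32_t end;
};

/* Assumed reservation node, placed first in the allocation info. */
struct sketch_reserve_window_node {
	struct sketch_rb_node node;          /* offset 0x00, 12 bytes */
	uint32_t goal_size;                  /* offset 0x0c */
	uint32_t alloc_hit;                  /* offset 0x10 */
	struct sketch_reserve_window window; /* offset 0x14 */
};

/* Address touched by a load of the field at `offset` from `base`. */
static uint32_t sketch_fault_address(uint32_t base, uint32_t offset)
{
	return base + offset;
}
```

With ebx = 0x1033 as the base, `sketch_fault_address(0x1033, 0x18)` gives 0x104b, the faulting address reported in the oops.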
Re: kernel Oops in ext3 code
Hi,

Could you please send the objdump of the ext4_discard_reservation function? It doesn't match what I see here.

Thanks,
Mingming

On Thu, 2007-09-27 at 12:31 +0200, [EMAIL PROTECTED] wrote:
> Hi all!
>
> (Please Cc)
>
> kernel 2.6.23-rc6
> Debian/sid
>
> kernel oops:
>
> BUG: unable to handle kernel paging request at virtual address 104b
>  printing eip:
> c0195bd3
> *pde =
> Oops: [#1]
> PREEMPT SMP
> Modules linked in: vboxdrv binfmt_misc fuse coretemp hwmon gspca videodev
> v4l2_common v4l1_compat iwl3945 mac80211 tifm_7xx1 tifm_core joydev irda
> crc_ccitt 8250_pnp 8250 serial_core firewire_ohci firewire_core crc_itu_t
> CPU:    0
> EIP:    0060:[<c0195bd3>]    Not tainted VLI
> EFLAGS: 00010206   (2.6.23-rc6 #1)
> EIP is at ext3_discard_reservation+0x18/0x4d
> eax: dff23800   ebx: 1033   ecx: dfc15ec0   edx:
> esi: c0007c44   edi: 1033   ebp: dfc2bef4   esp: dfc2beac
> ds: 007b   es: 007b   fs: 00d8  gs:   ss: 0068
> Process kswapd0 (pid: 261, ti=dfc2a000 task=dfcac570 task.ti=dfc2a000)
> Stack: c0007ba4 c0007c44 1033 c019ec51 c0007c44 c0007d8c 002c
> c0171b1b
>        002c c0007c44 c0007c4c c0171da2 c050880c 0080
> 0080
>        c0171fb8 0080 c0007e48 df9e3910 7404 c03f5634 0080
> 00d0
> Call Trace:
>  [<c019ec51>] ext3_clear_inode+0x5d/0x76
>  [<c0171b1b>] clear_inode+0x6b/0xb9
>  [<c0171da2>] dispose_list+0x48/0xc9
>  [<c0171fb8>] shrink_icache_memory+0x195/0x1bd
>  [<c014f5ec>] shrink_slab+0xe2/0x159
>  [<c014f9a0>] kswapd+0x2d3/0x431
>  [<c0132520>] autoremove_wake_function+0x0/0x33
>  [<c014f6cd>] kswapd+0x0/0x431
>  [<c0132453>] kthread+0x38/0x5d
>  [<c013241b>] kthread+0x0/0x5d
>  [<c0104b73>] kernel_thread_helper+0x7/0x10
>  ===
> Code: 83 f8 01 19 c0 f7 d0 83 e0 08 89 42 0c 89 56 b4 5b 5e c3 57 56 89 c6
> 53 8b 58 b4 8b 80 a4 00 00 00 85 db 8b 80 78 01 00 00 74 30 <83> 7b 18 00 74
> 2a 8d b8 00 03 00 00 89 f8 e8 b8 ca 1a 00 83 7b
> EIP: [<c0195bd3>] ext3_discard_reservation+0x18/0x4d SS:ESP 0068:dfc2beac
>
>
> Sysrq did work, so the oops was saved. Good.
>
> Any ideas?
>
> Best wishes
>
> Norbert
>
> ---
> Dr. Norbert Preining <[EMAIL PROTECTED]>        Vienna University of Technology
> Debian Developer <[EMAIL PROTECTED]>                         Debian TeX Group
> gpg DSA: 0x09C5B094      fp: 14DF 2E6C 0307 BE6D AD76  A9C0 D2BF 4AA3 09C5 B094
> ---
> As he came into the light they could see his black and
> gold uniform on which the buttons were so highly polished
> that they shone with an intensity that would have made an
> approaching motorist flash his lights in annoyance.
>                        --- Douglas Adams, The Hitchhikers Guide to the Galaxy
Re: [PATCH] JBD/ext34 cleanups: convert to kzalloc
On Wed, 2007-09-26 at 12:54 -0700, Andrew Morton wrote:
> On Fri, 21 Sep 2007 16:13:56 -0700
> Mingming Cao <[EMAIL PROTECTED]> wrote:
>
> > Convert kmalloc to kzalloc() and get rid of the memset().
>
> I split this into separate ext3/jbd and ext4/jbd2 patches. It's generally
> better to raise separate patches, please - the ext3 patches I'll merge
> directly but the ext4 patches should go through (and be against) the ext4
> devel tree.
>

Sure. The patches (including ext3/jbd and ext4/jbd2) were already merged into the ext4 devel tree; I will remove the ext3/jbd part from the ext4 devel tree.

> I fixed lots of rejects against the already-pending changes to these
> filesystems.
>
> You forgot to remove the memsets in both start_this_handle()s.
>

Thanks for catching this.

Mingming
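The conversion discussed in this thread can be sketched in userspace, under the assumption that `malloc()` plus `memset()` stands in for `kmalloc()` plus `memset()` and `calloc()` stands in for `kzalloc()`. The `struct txn` type and the function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a transaction object. */
struct txn {
    int state;
    long tid;
    void *buffers;
};

/* Before: allocate, then zero by hand (the kmalloc + memset pattern). */
static struct txn *txn_alloc_old(void)
{
    struct txn *t = malloc(sizeof(*t));

    if (!t)
        return NULL;
    memset(t, 0, sizeof(*t));
    return t;
}

/*
 * After: a zeroing allocator (the kzalloc pattern). Any memset that is
 * left behind after this call is redundant and can be dropped.
 */
static struct txn *txn_alloc_new(void)
{
    return calloc(1, sizeof(struct txn));
}

/* True when every field has its zero value. */
static int txn_is_zeroed(const struct txn *t)
{
    return t->state == 0 && t->tid == 0 && t->buffers == NULL;
}
```

Both helpers hand back an object whose fields are all zero. A `memset()` left behind after the zeroing allocator is harmless but redundant, which is the kind of leftover the review comment about `start_this_handle()` points at.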
[PATCH] JBD2/ext4 naming cleanup
JBD2 naming cleanup

From: Mingming Cao <[EMAIL PROTECTED]>

Change macro names from JBD_XXX to JBD2_XXX in JBD2/ext4.

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/ext4/extents.c         |  2 +-
 fs/ext4/super.c           |  2 +-
 fs/jbd2/commit.c          |  2 +-
 fs/jbd2/journal.c         |  8
 fs/jbd2/recovery.c        |  2 +-
 fs/jbd2/revoke.c          |  4 ++--
 include/linux/ext4_jbd2.h |  6 +++---
 include/linux/jbd2.h      | 30 +++---
 8 files changed, 28 insertions(+), 28 deletions(-)

Index: linux-2.6.23-rc6/fs/ext4/super.c
===
--- linux-2.6.23-rc6.orig/fs/ext4/super.c 2007-09-21 16:27:31.0 -0700
+++ linux-2.6.23-rc6/fs/ext4/super.c 2007-09-21 16:27:46.0 -0700
@@ -966,7 +966,7 @@ static int parse_options (char *options,
 			if (option < 0)
 				return 0;
 			if (option == 0)
-				option = JBD_DEFAULT_MAX_COMMIT_AGE;
+				option = JBD2_DEFAULT_MAX_COMMIT_AGE;
 			sbi->s_commit_interval = HZ * option;
 			break;
 		case Opt_data_journal:
Index: linux-2.6.23-rc6/include/linux/ext4_jbd2.h
===
--- linux-2.6.23-rc6.orig/include/linux/ext4_jbd2.h 2007-09-10 19:50:29.0 -0700
+++ linux-2.6.23-rc6/include/linux/ext4_jbd2.h 2007-09-21 16:27:46.0 -0700
@@ -12,8 +12,8 @@
  * Ext4-specific journaling extensions.
  */
 
-#ifndef _LINUX_EXT4_JBD_H
-#define _LINUX_EXT4_JBD_H
+#ifndef _LINUX_EXT4_JBD2_H
+#define _LINUX_EXT4_JBD2_H
 
 #include <linux/fs.h>
 #include <linux/jbd2.h>
@@ -228,4 +228,4 @@ static inline int ext4_should_writeback_
 	return 0;
 }
 
-#endif	/* _LINUX_EXT4_JBD_H */
+#endif	/* _LINUX_EXT4_JBD2_H */
Index: linux-2.6.23-rc6/include/linux/jbd2.h
===
--- linux-2.6.23-rc6.orig/include/linux/jbd2.h 2007-09-21 09:07:09.0 -0700
+++ linux-2.6.23-rc6/include/linux/jbd2.h 2007-09-21 16:27:46.0 -0700
@@ -13,8 +13,8 @@
  * filesystem journaling support.
  */
 
-#ifndef _LINUX_JBD_H
-#define _LINUX_JBD_H
+#ifndef _LINUX_JBD2_H
+#define _LINUX_JBD2_H
 
 /* Allow this file to be included directly into e2fsprogs */
 #ifndef __KERNEL__
@@ -37,26 +37,26 @@
 #define journal_oom_retry 1
 
 /*
- * Define JBD_PARANIOD_IOFAIL to cause a kernel BUG() if ext3 finds
+ * Define JBD2_PARANIOD_IOFAIL to cause a kernel BUG() if ext4 finds
  * certain classes of error which can occur due to failed IOs. Under
- * normal use we want ext3 to continue after such errors, because
+ * normal use we want ext4 to continue after such errors, because
  * hardware _can_ fail, but for debugging purposes when running tests on
  * known-good hardware we may want to trap these errors.
  */
-#undef JBD_PARANOID_IOFAIL
+#undef JBD2_PARANOID_IOFAIL
 
 /*
  * The default maximum commit age, in seconds.
  */
-#define JBD_DEFAULT_MAX_COMMIT_AGE 5
+#define JBD2_DEFAULT_MAX_COMMIT_AGE 5
 
 #ifdef CONFIG_JBD2_DEBUG
 /*
- * Define JBD_EXPENSIVE_CHECKING to enable more expensive internal
+ * Define JBD2_EXPENSIVE_CHECKING to enable more expensive internal
  * consistency checks. By default we don't do this unless
  * CONFIG_JBD2_DEBUG is on.
  */
-#define JBD_EXPENSIVE_CHECKING
+#define JBD2_EXPENSIVE_CHECKING
 extern u8 jbd2_journal_enable_debug;
 
 #define jbd_debug(n, f, a...) \
@@ -163,8 +163,8 @@ typedef struct journal_block_tag_s
 	__be32 t_blocknr_high; /* most-significant high 32bits. */
 } journal_block_tag_t;
 
-#define JBD_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high))
-#define JBD_TAG_SIZE64 (sizeof(journal_block_tag_t))
+#define JBD2_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high))
+#define JBD2_TAG_SIZE64 (sizeof(journal_block_tag_t))
 
 /*
  * The revoke descriptor: used on disk to describe a series of blocks to
@@ -256,8 +256,8 @@ typedef struct journal_superblock_s
 #include <linux/fs.h>
 #include <linux/sched.h>
 
-#define JBD_ASSERTIONS
-#ifdef JBD_ASSERTIONS
+#define JBD2_ASSERTIONS
+#ifdef JBD2_ASSERTIONS
 #define J_ASSERT(assert) \
 do { \
 	if (!(assert)) { \
@@ -284,9 +284,9 @@ void buffer_assertion_failure(struct buf
 
 #else
 #define J_ASSERT(assert) do { } while (0)
-#endif /* JBD_ASSERTIONS */
+#endif /* JBD2_ASSERTIONS */
 
-#if defined(JBD_PARANOID_IOFAIL)
+#if defined(JBD2_PARANOID_IOFAIL)
 #define J_EXPECT(expr, why...) J_ASSERT(expr)
 #define J_EXPECT_BH(bh, expr, why...) J_ASSERT_BH(bh, expr)
 #define J_EXPECT_JH(jh, expr, why...) J_ASSERT_JH(jh, expr)
@@ -1104,4 +11
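The two tag-size macros renamed by this patch rely on a layout trick: the 32-bit tag size is the offset of the high-word field, and the 64-bit tag size is the whole structure. A minimal userspace sketch follows. Only `t_blocknr_high` is visible in the quoted hunk; the other two fields, the use of `uint32_t` in place of `__be32`, and the helper functions are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace mirror of the on-disk block tag. Only t_blocknr_high is
 * shown in the quoted hunk; t_blocknr and t_flags are assumed here.
 */
typedef struct journal_block_tag_s {
    uint32_t t_blocknr;       /* assumed: low 32 bits of the block number */
    uint32_t t_flags;         /* assumed: tag flags */
    uint32_t t_blocknr_high;  /* most-significant high 32 bits */
} journal_block_tag_t;

/* Same definitions as in the patch. */
#define JBD2_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high))
#define JBD2_TAG_SIZE64 (sizeof(journal_block_tag_t))

/* Hypothetical helper: bytes used by one tag for a given journal format. */
static size_t tag_bytes(int has_64bit)
{
    return has_64bit ? JBD2_TAG_SIZE64 : JBD2_TAG_SIZE32;
}

/*
 * Hypothetical helper: upper bound on tags per block, ignoring any
 * descriptor block header.
 */
static size_t tags_per_block(size_t blocksize, int has_64bit)
{
    return blocksize / tag_bytes(has_64bit);
}
```

Under these assumptions a 32-bit tag takes 8 bytes and a 64-bit tag 12 bytes, so a 4096-byte block holds at most 512 or 341 tags respectively. Using `offsetof` keeps the 32-bit size tied to the struct definition instead of a hard-coded constant.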
[PATCH] JBD/ext34 cleanups: convert to kzalloc
Convert kmalloc to kzalloc() and get rid of the memset().

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/ext3/xattr.c       | 3 +--
 fs/ext4/xattr.c       | 3 +--
 fs/jbd/journal.c      | 3 +--
 fs/jbd/transaction.c  | 2 +-
 fs/jbd2/journal.c     | 3 +--
 fs/jbd2/transaction.c | 2 +-
 6 files changed, 6 insertions(+), 10 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-21 09:08:02.0 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-21 09:10:37.0 -0700
@@ -653,10 +653,9 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+	journal = kzalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
 	if (!journal)
 		goto fail;
-	memset(journal, 0, sizeof(*journal));
 
 	init_waitqueue_head(&journal->j_wait_transaction_locked);
 	init_waitqueue_head(&journal->j_wait_logspace);
Index: linux-2.6.23-rc6/fs/jbd/transaction.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/transaction.c 2007-09-21 09:13:11.0 -0700
+++ linux-2.6.23-rc6/fs/jbd/transaction.c 2007-09-21 09:13:24.0 -0700
@@ -96,7 +96,7 @@ static int start_this_handle(journal_t *
 
 alloc_transaction:
 	if (!journal->j_running_transaction) {
-		new_transaction = kmalloc(sizeof(*new_transaction),
+		new_transaction = kzalloc(sizeof(*new_transaction),
 						GFP_NOFS|__GFP_NOFAIL);
 		if (!new_transaction) {
 			ret = -ENOMEM;
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-21 09:10:53.0 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-21 09:11:13.0 -0700
@@ -654,10 +654,9 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+	journal = kzalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
 	if (!journal)
 		goto fail;
-	memset(journal, 0, sizeof(*journal));
 
 	init_waitqueue_head(&journal->j_wait_transaction_locked);
 	init_waitqueue_head(&journal->j_wait_logspace);
Index: linux-2.6.23-rc6/fs/jbd2/transaction.c
===
--- linux-2.6.23-rc6.orig/fs/jbd2/transaction.c 2007-09-21 09:12:46.0 -0700
+++ linux-2.6.23-rc6/fs/jbd2/transaction.c 2007-09-21 09:12:59.0 -0700
@@ -96,7 +96,7 @@ static int start_this_handle(journal_t *
 
 alloc_transaction:
 	if (!journal->j_running_transaction) {
-		new_transaction = kmalloc(sizeof(*new_transaction),
+		new_transaction = kzalloc(sizeof(*new_transaction),
 						GFP_NOFS|__GFP_NOFAIL);
 		if (!new_transaction) {
 			ret = -ENOMEM;
Index: linux-2.6.23-rc6/fs/ext3/xattr.c
===
--- linux-2.6.23-rc6.orig/fs/ext3/xattr.c 2007-09-21 10:22:24.0 -0700
+++ linux-2.6.23-rc6/fs/ext3/xattr.c 2007-09-21 10:24:19.0 -0700
@@ -741,12 +741,11 @@ ext3_xattr_block_set(handle_t *handle, s
 		}
 	} else {
 		/* Allocate a buffer where we construct the new block. */
-		s->base = kmalloc(sb->s_blocksize, GFP_KERNEL);
+		s->base = kzalloc(sb->s_blocksize, GFP_KERNEL);
 		/* assert(header == s->base) */
 		error = -ENOMEM;
 		if (s->base == NULL)
 			goto cleanup;
-		memset(s->base, 0, sb->s_blocksize);
 		header(s->base)->h_magic = cpu_to_le32(EXT3_XATTR_MAGIC);
 		header(s->base)->h_blocks = cpu_to_le32(1);
 		header(s->base)->h_refcount = cpu_to_le32(1);
Index: linux-2.6.23-rc6/fs/ext4/xattr.c
===
--- linux-2.6.23-rc6.orig/fs/ext4/xattr.c 2007-09-21 10:20:21.0 -0700
+++ linux-2.6.23-rc6/fs/ext4/xattr.c 2007-09-21 10:21:00.0 -0700
@@ -750,12 +750,11 @@ ext4_xattr_block_set(handle_t *handle, s
 		}
 	} else {
 		/* Allocate a buffer where we construct the new block. */
-		s->base = kmalloc(sb->s_blocksize, GFP_KERNEL);
+		s->base = kzalloc(sb->s_blocksize, GFP_KERNEL);
 		/* assert(header == s->base) */
 		error = -ENOMEM;
 		if (s->base == NULL)
 			goto cleanup;
-
Re: [PATCH] JBD slab cleanups
On Wed, 2007-09-19 at 13:48 -0600, Andreas Dilger wrote:
> On Sep 19, 2007 12:15 -0700, Mingming Cao wrote:
> > @@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
> >
> >  alloc_transaction:
> >  	if (!journal->j_running_transaction) {
> > -		new_transaction = kmalloc(sizeof(*new_transaction),
> > -						GFP_NOFS|__GFP_NOFAIL);
> > +		new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
>
> This should probably be a __GFP_NOFAIL if we are trying to start a new
> handle in truncate, as there is no way to propagate an error to the caller.
>

Thanks, updated version.

Here is the patch to clean up the __GFP_NOFAIL flag in jbd/jbd2; in most cases it is not needed.

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd/journal.c  | 2 +-
 fs/jbd2/journal.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:47:58.0 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 14:23:45.0 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-19 11:48:14.0 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-19 14:23:45.0 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
Re: [PATCH] JBD: use GFP_NOFS in kmalloc
On Wed, 2007-09-19 at 14:34 -0700, Andrew Morton wrote:
> On Wed, 19 Sep 2007 12:22:09 -0700
> Mingming Cao <[EMAIL PROTECTED]> wrote:
>
> > Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
> > with the rest of kmalloc flag used in the JBD/JBD2 layer.
> >
> > Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> >
> > ---
> >  fs/jbd/journal.c  | 6 +++---
> >  fs/jbd/revoke.c   | 8
> >  fs/jbd2/journal.c | 6 +++---
> >  fs/jbd2/revoke.c  | 8
> >  4 files changed, 14 insertions(+), 14 deletions(-)
> >
> > Index: linux-2.6.23-rc6/fs/jbd/journal.c
> > ===
> > --- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:51:10.0 -0700
> > +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:51:57.0 -0700
> > @@ -653,7 +653,7 @@ static journal_t * journal_init_common (
> >  	journal_t *journal;
> >  	int err;
> >
> > -	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
> > +	journal = kmalloc(sizeof(*journal), GFP_NOFS);
> >  	if (!journal)
> >  		goto fail;
> >  	memset(journal, 0, sizeof(*journal));
> > @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
> >  	journal->j_blocksize = blocksize;
> >  	n = journal->j_blocksize / sizeof(journal_block_tag_t);
> >  	journal->j_wbufsize = n;
> > -	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> > +	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> >  	if (!journal->j_wbuf) {
> >  		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> >  			__FUNCTION__);
> > @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
> >  	/* journal descriptor can store up to n blocks -bzzz */
> >  	n = journal->j_blocksize / sizeof(journal_block_tag_t);
> >  	journal->j_wbufsize = n;
> > -	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> > +	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> >  	if (!journal->j_wbuf) {
> >  		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> >  			__FUNCTION__);
> > Index: linux-2.6.23-rc6/fs/jbd/revoke.c
> > ===
> > --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-19 11:51:30.0 -0700
> > +++ linux-2.6.23-rc6/fs/jbd/revoke.c 2007-09-19 11:52:34.0 -0700
> > @@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
> >  	while((tmp >>= 1UL) != 0UL)
> >  		shift++;
> >
> > -	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
> > +	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
> >  	if (!journal->j_revoke_table[0])
> >  		return -ENOMEM;
> >  	journal->j_revoke = journal->j_revoke_table[0];
> > @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
> >  	journal->j_revoke->hash_shift = shift;
> >
> >  	journal->j_revoke->hash_table =
> > -		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> > +		kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
> >  	if (!journal->j_revoke->hash_table) {
> >  		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> >  		journal->j_revoke = NULL;
> > @@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
> >  	for (tmp = 0; tmp < hash_size; tmp++)
> >  		INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
> >
> > -	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
> > +	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
> >  	if (!journal->j_revoke_table[1]) {
> >  		kfree(journal->j_revoke_table[0]->hash_table);
> >  		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> > @@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
> >  	journal->j_revoke->hash_shift = shift;
> >
> >  	journal->j_revoke->hash_table =
> > -		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> > +		kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
> >  	if (!journal->j
Re: [PATCH] JBD slab cleanups
On Wed, 2007-09-19 at 19:28 +, Dave Kleikamp wrote:
> On Wed, 2007-09-19 at 14:26 -0500, Dave Kleikamp wrote:
> > On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote:
> >
> > > Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2. In all
> > > cases except one handles memory allocation failure so I get rid of those
> > > GFP_NOFAIL flags.
> > >
> > > Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc
> > > in jbd/jbd2? I will send a separate patch to cleanup that.
> >
> > No. GFP_NOFS avoids deadlock. It prevents the allocation from making
> > recursive calls back into the file system that could end up blocking on
> > jbd code.
>
> Oh, I see your patch now. You mean use GFP_NOFS instead of
> GFP_KERNEL. :-) OK then.
>

Oops, I did mean what you say here. :-)

>
> Shaggy
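Dave's point about recursion can be sketched as a toy userspace model. Nothing below is kernel API: `TOY_GFP_FS`, `toy_alloc_under_pressure()`, and the `struct toy_journal` flags are invented for illustration. The only fact borrowed from the kernel is that GFP_NOFS is GFP_KERNEL without the bit that allows reclaim to call into filesystem code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy flag values; the real kernel flags have different encodings. */
#define TOY_GFP_FS      0x1u          /* reclaim may call into the filesystem */
#define TOY_GFP_KERNEL  (TOY_GFP_FS)
#define TOY_GFP_NOFS    0x0u

struct toy_journal {
    bool locked;          /* the caller is inside journal code */
    bool would_deadlock;  /* set when reclaim re-entered the journal */
};

/* Filesystem reclaim path: it needs the journal to write data back. */
static void toy_fs_reclaim(struct toy_journal *j)
{
    if (j->locked)
        j->would_deadlock = true;  /* real code would block here forever */
}

/* Allocation under memory pressure: reclaim only if the flags allow it. */
static void toy_alloc_under_pressure(struct toy_journal *j, unsigned flags)
{
    if (flags & TOY_GFP_FS)
        toy_fs_reclaim(j);
}

/* Run one journal operation that allocates while holding the journal. */
static bool toy_journal_op(unsigned flags)
{
    struct toy_journal j = { .locked = true, .would_deadlock = false };

    toy_alloc_under_pressure(&j, flags);
    return j.would_deadlock;
}
```

In this model `toy_journal_op(TOY_GFP_KERNEL)` reports the self-deadlock and `toy_journal_op(TOY_GFP_NOFS)` does not, which is the reason allocations made from inside journal code avoid the filesystem-reclaim bit.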
[PATCH] JBD: use GFP_NOFS in kmalloc
Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent with the rest of kmalloc flag used in the JBD/JBD2 layer.

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd/journal.c  | 6 +++---
 fs/jbd/revoke.c   | 8
 fs/jbd2/journal.c | 6 +++---
 fs/jbd2/revoke.c  | 8
 4 files changed, 14 insertions(+), 14 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:51:10.0 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:51:57.0 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
+	journal = kmalloc(sizeof(*journal), GFP_NOFS);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
@@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
 	journal->j_blocksize = blocksize;
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
@@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
 	/* journal descriptor can store up to n blocks -bzzz */
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
Index: linux-2.6.23-rc6/fs/jbd/revoke.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-19 11:51:30.0 -0700
+++ linux-2.6.23-rc6/fs/jbd/revoke.c 2007-09-19 11:52:34.0 -0700
@@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
 	while((tmp >>= 1UL) != 0UL)
 		shift++;
 
-	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
+	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
 	if (!journal->j_revoke_table[0])
 		return -ENOMEM;
 	journal->j_revoke = journal->j_revoke_table[0];
@@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
 	journal->j_revoke->hash_shift = shift;
 
 	journal->j_revoke->hash_table =
-		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+		kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
 	if (!journal->j_revoke->hash_table) {
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
 		journal->j_revoke = NULL;
@@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
 	for (tmp = 0; tmp < hash_size; tmp++)
 		INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
 
-	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
+	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
 	if (!journal->j_revoke_table[1]) {
 		kfree(journal->j_revoke_table[0]->hash_table);
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
@@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
 	journal->j_revoke->hash_shift = shift;
 
 	journal->j_revoke->hash_table =
-		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+		kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
 	if (!journal->j_revoke->hash_table) {
 		kfree(journal->j_revoke_table[0]->hash_table);
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-19 11:52:48.0 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-19 11:53:12.0 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
+	journal = kmalloc(sizeof(*journal), GFP_NOFS);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
@@ -724,7 +724,7 @@ journal_t * jbd2_journal_init_dev(struct
 journal-
Re: [PATCH] JBD slab cleanups
On Tue, 2007-09-18 at 19:19 -0700, Andrew Morton wrote:
> On Tue, 18 Sep 2007 18:00:01 -0700 Mingming Cao <[EMAIL PROTECTED]> wrote:
>
> > JBD: Replace slab allocations with page cache allocations
> >
> > JBD allocate memory for committed_data and frozen_data from slab. However
> > JBD should not pass slab pages down to the block layer. Use page allocator
> > pages instead. This will also prepare JBD for the large blocksize patchset.
> >
> >
> > Also this patch cleans up jbd_kmalloc and replace it with kmalloc directly
>
> __GFP_NOFAIL should only be used when we have no way of recovering
> from failure. The allocation in journal_init_common() (at least)
> _can_ recover and hence really shouldn't be using __GFP_NOFAIL.
>
> (Actually, nothing in the kernel should be using __GFP_NOFAIL. It is
> there as a marker which says "we really shouldn't be doing this but
> we don't know how to fix it").
>
> So sometime it'd be good if you could review all the __GFP_NOFAILs in
> there and see if we can remove some, thanks.

Here is the patch to clean up the __GFP_NOFAIL flag in jbd/jbd2. All cases except one handle memory allocation failure, so I got rid of those __GFP_NOFAIL flags.

Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc in jbd/jbd2? I will send a separate patch to clean that up.

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd/journal.c      | 2 +-
 fs/jbd/transaction.c  | 3 +--
 fs/jbd2/journal.c     | 2 +-
 fs/jbd2/transaction.c | 3 +--
 4 files changed, 4 insertions(+), 6 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:47:58.0 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:48:40.0 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd/transaction.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/transaction.c 2007-09-19 11:48:05.0 -0700
+++ linux-2.6.23-rc6/fs/jbd/transaction.c 2007-09-19 11:49:10.0 -0700
@@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
 
 alloc_transaction:
 	if (!journal->j_running_transaction) {
-		new_transaction = kmalloc(sizeof(*new_transaction),
-						GFP_NOFS|__GFP_NOFAIL);
+		new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
 		if (!new_transaction) {
 			ret = -ENOMEM;
 			goto out;
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-19 11:48:14.0 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-19 11:49:46.0 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd2/transaction.c
===
--- linux-2.6.23-rc6.orig/fs/jbd2/transaction.c 2007-09-19 11:48:08.0 -0700
+++ linux-2.6.23-rc6/fs/jbd2/transaction.c 2007-09-19 11:50:12.0 -0700
@@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
 
 alloc_transaction:
 	if (!journal->j_running_transaction) {
-		new_transaction = kmalloc(sizeof(*new_transaction),
-						GFP_NOFS|__GFP_NOFAIL);
+		new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
 		if (!new_transaction) {
 			ret = -ENOMEM;
 			goto out;
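Andrew's argument is that an init path which already has a `goto fail` exit can recover from a failed allocation, so it has no need for a never-fail allocator. A minimal userspace sketch of that shape follows, with a pluggable allocator so the failure path can be exercised. All names here (`toy_journal_init`, `toy_mount`, `failing_alloc`) are hypothetical and do not correspond to kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_journal {
    void *wbuf;
};

/* Allocator hook so that failure can be injected. */
typedef void *(*alloc_fn)(size_t);

/* Always-failing allocator used to exercise the error path. */
static void *failing_alloc(size_t n)
{
    (void)n;
    return NULL;
}

/*
 * Init path: every allocation failure unwinds what was set up so far
 * and returns NULL, so the caller can report an error instead of
 * relying on an allocator that never fails.
 */
static struct toy_journal *toy_journal_init(alloc_fn alloc, size_t wbuf_bytes)
{
    struct toy_journal *j = alloc(sizeof(*j));

    if (!j)
        goto fail;
    j->wbuf = alloc(wbuf_bytes);
    if (!j->wbuf)
        goto fail_free;
    return j;

fail_free:
    free(j);
fail:
    return NULL;
}

/* Caller that turns a failed init into an error code. */
static int toy_mount(alloc_fn alloc)
{
    struct toy_journal *j = toy_journal_init(alloc, 64);

    if (!j)
        return -ENOMEM;
    free(j->wbuf);
    free(j);
    return 0;
}
```

`toy_mount(malloc)` succeeds and `toy_mount(failing_alloc)` returns `-ENOMEM`. The contrast drawn earlier in the thread is a path such as truncate, where there is no caller to hand the error to.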
[PATCH] JBD: use GFP_NOFS in kmalloc
Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent with
the other kmalloc flags used in the JBD/JBD2 layer.

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd/journal.c  |    6 +++---
 fs/jbd/revoke.c   |    8 ++++----
 fs/jbd2/journal.c |    6 +++---
 fs/jbd2/revoke.c  |    8 ++++----
 4 files changed, 14 insertions(+), 14 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-19 11:51:10.0 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-19 11:51:57.0 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
+	journal = kmalloc(sizeof(*journal), GFP_NOFS);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
@@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
 	journal->j_blocksize = blocksize;
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
@@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
 	/* journal descriptor can store up to n blocks -bzzz */
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
Index: linux-2.6.23-rc6/fs/jbd/revoke.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/revoke.c	2007-09-19 11:51:30.0 -0700
+++ linux-2.6.23-rc6/fs/jbd/revoke.c	2007-09-19 11:52:34.0 -0700
@@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
 	while((tmp >>= 1UL) != 0UL)
 		shift++;
 
-	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
+	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
 	if (!journal->j_revoke_table[0])
 		return -ENOMEM;
 	journal->j_revoke = journal->j_revoke_table[0];
@@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
 	journal->j_revoke->hash_shift = shift;
 
 	journal->j_revoke->hash_table =
-		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+		kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
 	if (!journal->j_revoke->hash_table) {
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
 		journal->j_revoke = NULL;
@@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
 	for (tmp = 0; tmp < hash_size; tmp++)
 		INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
 
-	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
+	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
 	if (!journal->j_revoke_table[1]) {
 		kfree(journal->j_revoke_table[0]->hash_table);
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
@@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
 	journal->j_revoke->hash_shift = shift;
 
 	journal->j_revoke->hash_table =
-		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+		kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
 	if (!journal->j_revoke->hash_table) {
 		kfree(journal->j_revoke_table[0]->hash_table);
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c	2007-09-19 11:52:48.0 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c	2007-09-19 11:53:12.0 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
+	journal = kmalloc(sizeof(*journal), GFP_NOFS);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
@@ -724,7 +724,7 @@ journal_t * jbd2_journal_init_dev(struct
 	journal->j_blocksize = blocksize;
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL
Re: [PATCH] JBD slab cleanups
On Wed, 2007-09-19 at 19:28 +0000, Dave Kleikamp wrote:
> On Wed, 2007-09-19 at 14:26 -0500, Dave Kleikamp wrote:
> > On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote:
> > > Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2. In all
> > > cases except one handles memory allocation failure so I get rid of
> > > those GFP_NOFAIL flags.
> > >
> > > Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc
> > > in jbd/jbd2? I will send a separate patch to cleanup that.
> >
> > No. GFP_NOFS avoids deadlock. It prevents the allocation from making
> > recursive calls back into the file system that could end up blocking on
> > jbd code.
>
> Oh, I see your patch now. You mean use GFP_NOFS instead of
> GFP_KERNEL. :-) OK then.
>

Oops, I did mean what you say here. :-)

> Shaggy
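The deadlock Shaggy describes can be sketched as a small userspace model. This is only an illustration under stated assumptions: the MODEL_* flag values and the function name are hypothetical and are not the kernel's GFP bits or any kernel API. The model says an allocation can block forever only when memory is tight, the flags allow reclaim to call into the filesystem, and the caller already holds a lock that the filesystem path needs.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this model only; not the kernel's values. */
#define MODEL_GFP_FS     0x1u                 /* reclaim may enter the fs  */
#define MODEL_GFP_NOFS   0x0u                 /* reclaim must stay out     */
#define MODEL_GFP_KERNEL (MODEL_GFP_NOFS | MODEL_GFP_FS)

/*
 * Returns true when, in this model, an allocation could recurse into the
 * filesystem and block on a lock the caller already holds.
 */
static bool model_alloc_can_deadlock(unsigned int flags, bool memory_tight,
				     bool caller_holds_fs_lock)
{
	/* Reclaim only runs under memory pressure, and only enters the
	 * filesystem when the FS bit is set in the allocation flags. */
	bool reclaim_may_enter_fs = memory_tight && (flags & MODEL_GFP_FS);

	return reclaim_may_enter_fs && caller_holds_fs_lock;
}
```

In this model, clearing the FS bit removes the deadlock for a caller that holds a filesystem lock, while a caller that holds no such lock is safe with either flag set.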
Re: [PATCH] JBD: use GFP_NOFS in kmalloc
On Wed, 2007-09-19 at 14:34 -0700, Andrew Morton wrote:
> On Wed, 19 Sep 2007 12:22:09 -0700
> Mingming Cao <[EMAIL PROTECTED]> wrote:
>
> > Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
> > with the rest of kmalloc flag used in the JBD/JBD2 layer.
> >
> > Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> > ---
> >  fs/jbd/journal.c  |    6 +++---
> >  fs/jbd/revoke.c   |    8 ++++----
> >  fs/jbd2/journal.c |    6 +++---
> >  fs/jbd2/revoke.c  |    8 ++++----
> >  4 files changed, 14 insertions(+), 14 deletions(-)
> >
> > Index: linux-2.6.23-rc6/fs/jbd/journal.c
> > ===
> > --- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-19 11:51:10.0 -0700
> > +++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-19 11:51:57.0 -0700
> > @@ -653,7 +653,7 @@ static journal_t * journal_init_common (
> >  	journal_t *journal;
> >  	int err;
> >  
> > -	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
> > +	journal = kmalloc(sizeof(*journal), GFP_NOFS);
> >  	if (!journal)
> >  		goto fail;
> >  	memset(journal, 0, sizeof(*journal));
> > @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
> >  	journal->j_blocksize = blocksize;
> >  	n = journal->j_blocksize / sizeof(journal_block_tag_t);
> >  	journal->j_wbufsize = n;
> > -	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> > +	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> >  	if (!journal->j_wbuf) {
> >  		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> >  			__FUNCTION__);
> > @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
> >  	/* journal descriptor can store up to n blocks -bzzz */
> >  	n = journal->j_blocksize / sizeof(journal_block_tag_t);
> >  	journal->j_wbufsize = n;
> > -	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> > +	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> >  	if (!journal->j_wbuf) {
> >  		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> >  			__FUNCTION__);
> > Index: linux-2.6.23-rc6/fs/jbd/revoke.c
> > ===
> > --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c	2007-09-19 11:51:30.0 -0700
> > +++ linux-2.6.23-rc6/fs/jbd/revoke.c	2007-09-19 11:52:34.0 -0700
> > @@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
> >  	while((tmp >>= 1UL) != 0UL)
> >  		shift++;
> >  
> > -	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
> > +	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
> >  	if (!journal->j_revoke_table[0])
> >  		return -ENOMEM;
> >  	journal->j_revoke = journal->j_revoke_table[0];
> > @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
> >  	journal->j_revoke->hash_shift = shift;
> >  
> >  	journal->j_revoke->hash_table =
> > -		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> > +		kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
> >  	if (!journal->j_revoke->hash_table) {
> >  		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> >  		journal->j_revoke = NULL;
> > @@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
> >  	for (tmp = 0; tmp < hash_size; tmp++)
> >  		INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
> >  
> > -	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
> > +	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
> >  	if (!journal->j_revoke_table[1]) {
> >  		kfree(journal->j_revoke_table[0]->hash_table);
> >  		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> > @@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
> >  	journal->j_revoke->hash_shift = shift;
> >  
> >  	journal->j_revoke->hash_table =
> > -		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> > +		kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
> >  	if (!journal->j_revoke->hash_table) {
> >  		kfree(journal->j_revoke_table[0]->hash_table);
> >  		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
>
> These were all OK using GFP_KERNEL.
>
> GFP_NOFS should only be used when the caller is holding some fs locks which
> might cause a deadlock if that caller reentered the fs in ->writepage (and
> maybe put_inode and such). That isn't the case in any of the above code,
> which is all mount time stuff (I think).
>

You are right, they all occur at initialization time.

> ext3/4 should be using GFP_NOFS when the caller has a transaction open, has
> a page locked, is holding i_mutex, etc.
>

Thanks for your feedback.

Mingming
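Andrew's rule of thumb can be written down as a tiny decision helper. This is a hedged sketch, not kernel code: the struct, enum, and function names are hypothetical, and the three conditions are only the ones listed in the message above (transaction open, page locked, i_mutex held); the "etc." in the original is not modelled.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two flag choices discussed above. */
enum pick_gfp { PICK_GFP_KERNEL, PICK_GFP_NOFS };

/* What the caller is holding at the time of the allocation. */
struct model_alloc_ctx {
	bool transaction_open;
	bool page_locked;
	bool holds_i_mutex;
};

/*
 * Mount-time code holds none of these, so it may use the less
 * restrictive flag; anything holding fs state must stay out of the fs.
 */
static enum pick_gfp model_pick_gfp(const struct model_alloc_ctx *ctx)
{
	if (ctx->transaction_open || ctx->page_locked || ctx->holds_i_mutex)
		return PICK_GFP_NOFS;
	return PICK_GFP_KERNEL;
}
```

Under this reading, the journal_init_* and journal_init_revoke() allocations in the quoted patch fall in the first category, which is why the GFP_KERNEL to GFP_NOFS conversion was unnecessary there.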
Re: [PATCH] JBD slab cleanups
On Wed, 2007-09-19 at 13:48 -0600, Andreas Dilger wrote:
> On Sep 19, 2007 12:15 -0700, Mingming Cao wrote:
> > @@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
> >  
> >  alloc_transaction:
> >  	if (!journal->j_running_transaction) {
> > -		new_transaction = kmalloc(sizeof(*new_transaction),
> > -						GFP_NOFS|__GFP_NOFAIL);
> > +		new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
>
> This should probably be a __GFP_NOFAIL if we are trying to start a new
> handle in truncate, as there is no way to propagate an error to the caller.
>

Thanks, updated version.

Here is the patch to clean up the __GFP_NOFAIL flag in jbd/jbd2; in most
cases it is not needed.

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd/journal.c  |    2 +-
 fs/jbd2/journal.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-19 11:47:58.0 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-19 14:23:45.0 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c	2007-09-19 11:48:14.0 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c	2007-09-19 14:23:45.0 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
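The distinction Andreas draws, between a caller that can return -ENOMEM and one that has no error path, can be sketched in userspace C. Everything here is an assumption made for illustration: model_alloc(), failures_left, and struct model_journal are hypothetical names, and the retry loop only imitates the "never return NULL" behaviour that __GFP_NOFAIL requests from the kernel allocator.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator hook: fails while failures_left > 0. */
static int failures_left;

static void *model_alloc(size_t size)
{
	if (failures_left > 0) {
		failures_left--;
		return NULL;
	}
	return calloc(1, size);
}

struct model_journal {
	int blocksize;
};

/* Mount-time style: the caller can see and handle -ENOMEM. */
static int model_journal_init(struct model_journal **out)
{
	struct model_journal *journal = model_alloc(sizeof(*journal));

	if (!journal)
		return -ENOMEM;
	*out = journal;
	return 0;
}

/* Truncate style: no error path exists, so retry until it succeeds. */
static struct model_journal *model_journal_get_nofail(void)
{
	struct model_journal *journal;

	do {
		journal = model_alloc(sizeof(*journal));
	} while (!journal);
	return journal;
}
```

The first function corresponds to journal_init_common(), where the updated patch drops the flag; the second corresponds to the start_this_handle() allocation, which the updated patch leaves alone.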
Re: [PATCH] JBD slab cleanups
On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote:
> On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
> > Here is the incremental small cleanup patch.
> >
> > Remove kamlloc usages in jbd/jbd2 and consistently use
> > jbd_kmalloc/jbd2_malloc.
>
> Shouldn't we kill jbd_kmalloc instead?
>

It seems useful to me to keep jbd_kmalloc/jbd_free. They are central
places to handle memory (de)allocation (<page size) via kmalloc/kfree, so
in the future if we need to change memory allocation in jbd (e.g. not
using kmalloc, or using a different flag), we don't need to touch every
place in the jbd code calling jbd_kmalloc.

Regards,
Mingming
Re: [PATCH] JBD slab cleanups
On Tue, 2007-09-18 at 13:04 -0500, Dave Kleikamp wrote:
> On Tue, 2007-09-18 at 09:35 -0700, Mingming Cao wrote:
> > On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote:
> > > On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
> > > > Here is the incremental small cleanup patch.
> > > >
> > > > Remove kamlloc usages in jbd/jbd2 and consistently use
> > > > jbd_kmalloc/jbd2_malloc.
> > >
> > > Shouldn't we kill jbd_kmalloc instead?
> > >
> >
> > It seems useful to me to keep jbd_kmalloc/jbd_free. They are central
> > places to handle memory (de)allocation(<page size) via kmalloc/kfree, so
> > in the future if we need to change memory allocation in jbd(e.g. not
> > using kmalloc or using different flag), we don't need to touch every
> > place in the jbd code calling jbd_kmalloc.
>
> I disagree. Why would jbd need to globally change the way it allocates
> memory? It currently uses kmalloc (and jbd_kmalloc) for allocating a
> variety of structures. Having to change one particular instance won't
> necessarily mean we want to change all of them. Adding unnecessary
> wrappers only obfuscates the code making it harder to understand. You
> wouldn't want every subsystem to have it's own *_kmalloc() that took
> different arguments. Besides, there aren't that many calls to kmalloc
> and kfree in the jbd code, so there wouldn't be much pain in changing
> GFP flags or whatever, if it ever needed to be done.
>
> Shaggy

Okay, points taken. Here is the updated patch to get rid of slab
management and jbd_kmalloc from jbd totally. This patch is intended to
replace the patch in the mm tree. Andrew, could you pick up this one
instead?

Thanks,

Mingming

jbd/jbd2: JBD memory allocation cleanups

From: Christoph Lameter <[EMAIL PROTECTED]>

JBD: Replace slab allocations with page cache allocations

JBD allocates memory for committed_data and frozen_data from slab. However,
JBD should not pass slab pages down to the block layer. Use page allocator
pages instead. This will also prepare JBD for the large blocksize patchset.

Also, this patch cleans up jbd_kmalloc and replaces it with kmalloc directly.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd/commit.c       |    6 +--
 fs/jbd/journal.c      |   99 ++
 fs/jbd/transaction.c  |   12 +++---
 fs/jbd2/commit.c      |    6 +--
 fs/jbd2/journal.c     |   99 ++
 fs/jbd2/transaction.c |   18 -
 include/linux/jbd.h   |   18 +
 include/linux/jbd2.h  |   21 +-
 8 files changed, 52 insertions(+), 227 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-18 17:19:01.0 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-18 17:51:21.0 -0700
@@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);
 
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
-static int journal_create_jbd_slab(size_t slab_size);
 
 /*
  * Helper function used to manage commit timeouts
@@ -334,10 +333,10 @@ repeat:
 		char *tmp;
 
 		jbd_unlock_bh_state(bh_in);
-		tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
+		tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
 		jbd_lock_bh_state(bh_in);
 		if (jh_in->b_frozen_data) {
-			jbd_slab_free(tmp, bh_in->b_size);
+			jbd_free(tmp, bh_in->b_size);
 			goto repeat;
 		}
 
@@ -654,7 +653,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
@@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
 		}
 	}
 
-	/*
-	 * Create a slab for this blocksize
-	 */
-	err = journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
-	if (err)
-		return err;
-
 	/* Let the recovery code check whether it needs to recover any
 	 * data from the journal. */
 	if (journal_recover(journal))
@@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode
 }
 
 /*
- * Simple support for retrying memory allocations. Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
-	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
Re: [PATCH] JBD slab cleanups
On Mon, 2007-09-17 at 15:01 -0700, Badari Pulavarty wrote:
> On Mon, 2007-09-17 at 12:29 -0700, Mingming Cao wrote:
> > On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote:
> > > jbd/jbd2: Replace slab allocations with page cache allocations
> > >
> > > From: Christoph Lameter <[EMAIL PROTECTED]>
> > >
> > > JBD should not pass slab pages down to the block layer.
> > > Use page allocator pages instead. This will also prepare
> > > JBD for the large blocksize patchset.
> > >
> >
> > Currently memory allocation for committed_data(and frozen_buffer) for
> > bufferhead is done through jbd slab management, as Christoph Hellwig
> > pointed out that this is broken as jbd should not pass slab pages down
> > to IO layer. and suggested to use get_free_pages() directly.
> >
> > The problem with this patch, as Andreas Dilger pointed today in ext4
> > interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste
> > 1/3-1/2 page space.
> >
> > What was the originally intention to set up slabs for committed_data(and
> > frozen_buffer) in JBD? Why not using kmalloc?
> >
> > Mingming
>
> Looks good. Small suggestion is to get rid of all kmalloc() usages and
> consistently use jbd_kmalloc() or jbd2_kmalloc().
>
> Thanks,
> Badari
>

Here is the incremental small cleanup patch.

Remove kmalloc usages in jbd/jbd2 and consistently use
jbd_kmalloc/jbd2_malloc.

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd/journal.c  |    8 +---
 fs/jbd/revoke.c   |   12 ++--
 fs/jbd2/journal.c |    8 +---
 fs/jbd2/revoke.c  |   12 ++--
 4 files changed, 22 insertions(+), 18 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-17 14:32:16.0 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-17 14:33:59.0 -0700
@@ -723,7 +723,8 @@ journal_t * journal_init_dev(struct bloc
 	journal->j_blocksize = blocksize;
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*),
+					GFP_KERNEL);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
@@ -777,7 +778,8 @@ journal_t * journal_init_inode (struct i
 	/* journal descriptor can store up to n blocks -bzzz */
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*),
+					GFP_KERNEL);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
@@ -1157,7 +1159,7 @@ void journal_destroy(journal_t *journal)
 		iput(journal->j_inode);
 	if (journal->j_revoke)
 		journal_destroy_revoke(journal);
-	kfree(journal->j_wbuf);
+	jbd_kfree(journal->j_wbuf);
 	jbd_kfree(journal);
 }
 
Index: linux-2.6.23-rc6/fs/jbd/revoke.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/revoke.c	2007-09-17 14:32:22.0 -0700
+++ linux-2.6.23-rc6/fs/jbd/revoke.c	2007-09-17 14:35:13.0 -0700
@@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
 	journal->j_revoke->hash_shift = shift;
 
 	journal->j_revoke->hash_table =
-		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+		jbd_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
 	if (!journal->j_revoke->hash_table) {
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
 		journal->j_revoke = NULL;
@@ -231,7 +231,7 @@ int journal_init_revoke(journal_t *journ
 
 	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
 	if (!journal->j_revoke_table[1]) {
-		kfree(journal->j_revoke_table[0]->hash_table);
+		jbd_kfree(journal->j_revoke_table[0]->hash_table);
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
 		return -ENOMEM;
 	}
@@ -246,9 +246,9 @@ int journal_init_revoke(journal_t *journ
 	journal->j_revoke->hash_shift = shift;
 
 	journal->j_revoke->hash_table =
-
Re: [PATCH] JBD slab cleanups
On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote:
> jbd/jbd2: Replace slab allocations with page cache allocations
>
> From: Christoph Lameter <[EMAIL PROTECTED]>
>
> JBD should not pass slab pages down to the block layer.
> Use page allocator pages instead. This will also prepare
> JBD for the large blocksize patchset.
>

Currently, memory allocation for committed_data (and frozen_buffer) for a
buffer head is done through jbd slab management. Christoph Hellwig pointed
out that this is broken, as jbd should not pass slab pages down to the IO
layer, and suggested using get_free_pages() directly.

The problem with this patch, as Andreas Dilger pointed out today in the
ext4 interlock call, is that for 1k and 2k block size ext2/3/4,
get_free_pages() wastes 1/3-1/2 page space.

What was the original intention in setting up slabs for committed_data (and
frozen_buffer) in JBD? Why not use kmalloc?

Mingming

> Tested on 2.6.23-rc6 with fsx runs fine.
>
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
> Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> ---
>  fs/jbd/checkpoint.c   |    2
>  fs/jbd/commit.c       |    6 +-
>  fs/jbd/journal.c      |  107 -
>  fs/jbd/transaction.c  |   10 ++--
>  fs/jbd2/checkpoint.c  |    2
>  fs/jbd2/commit.c      |    6 +-
>  fs/jbd2/journal.c     |  109 --
>  fs/jbd2/transaction.c |   18
>  include/linux/jbd.h   |   23 +-
>  include/linux/jbd2.h  |   28 ++--
>  10 files changed, 83 insertions(+), 228 deletions(-)
>
> Index: linux-2.6.23-rc5/fs/jbd/journal.c
> ===
> --- linux-2.6.23-rc5.orig/fs/jbd/journal.c	2007-09-13 13:37:57.0 -0700
> +++ linux-2.6.23-rc5/fs/jbd/journal.c	2007-09-13 13:45:39.0 -0700
> @@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);
>  
>  static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
>  static void __journal_abort_soft (journal_t *journal, int errno);
> -static int journal_create_jbd_slab(size_t slab_size);
>  
>  /*
>   * Helper function used to manage commit timeouts
> @@ -334,10 +333,10 @@ repeat:
>  		char *tmp;
>  
>  		jbd_unlock_bh_state(bh_in);
> -		tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
> +		tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
>  		jbd_lock_bh_state(bh_in);
>  		if (jh_in->b_frozen_data) {
> -			jbd_slab_free(tmp, bh_in->b_size);
> +			jbd_free(tmp, bh_in->b_size);
>  			goto repeat;
>  		}
>  
> @@ -679,7 +678,7 @@ static journal_t * journal_init_common (
>  	/* Set up a default-sized revoke table for the new mount. */
>  	err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
>  	if (err) {
> -		kfree(journal);
> +		jbd_kfree(journal);
>  		goto fail;
>  	}
>  	return journal;
> @@ -728,7 +727,7 @@ journal_t * journal_init_dev(struct bloc
>  	if (!journal->j_wbuf) {
>  		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
>  			__FUNCTION__);
> -		kfree(journal);
> +		jbd_kfree(journal);
>  		journal = NULL;
>  		goto out;
>  	}
> @@ -782,7 +781,7 @@ journal_t * journal_init_inode (struct i
>  	if (!journal->j_wbuf) {
>  		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
>  			__FUNCTION__);
> -		kfree(journal);
> +		jbd_kfree(journal);
>  		return NULL;
>  	}
>  
> @@ -791,7 +790,7 @@ journal_t * journal_init_inode (struct i
>  	if (err) {
>  		printk(KERN_ERR "%s: Cannnot locate journal superblock\n",
>  			__FUNCTION__);
> -		kfree(journal);
> +		jbd_kfree(journal);
>  		return NULL;
>  	}
>  
> @@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
>  		}
>  	}
>  
> -	/*
> -	 * Create a slab for this blocksize
> -	 */
> -	err = journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
> -	if (err)
> -		return err;
> -
>  	/* Let the recovery code check whether it needs to recover any
>  	 * data from the journal. */
>  	if (journal_recover(journal))
> @@ -1166,7 +1158,7 @@ void journal_destroy(journal_t *journal)
>  	if (journal->j_revoke)
>  		journal_destroy_revoke(journal);
>  	kfree(journal->j_wbuf);
> -	kfree(journal);
> +	jbd_kfree(journal);
>  }
>  
>  
> @@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode
>  }
>  
>  /*
> - * Simple support for retrying memory allocations. Introduced to help to
> - * debug different VM deadlock avoidance strategies.
> - */
> -void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
> -{
> -	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
> -}
> -
> -/*
> - * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
> - * and allocate frozen
Re: [PATCH] JBD slab cleanups
On Mon, 2007-09-17 at 15:01 -0700, Badari Pulavarty wrote:
> On Mon, 2007-09-17 at 12:29 -0700, Mingming Cao wrote:
> > On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote:
> > > jbd/jbd2: Replace slab allocations with page cache allocations
> > >
> > > From: Christoph Lameter <[EMAIL PROTECTED]>
> > >
> > > JBD should not pass slab pages down to the block layer.
> > > Use page allocator pages instead. This will also prepare
> > > JBD for the large blocksize patchset.
> > >
> >
> > Currently memory allocation for committed_data(and frozen_buffer) for
> > bufferhead is done through jbd slab management, as Christoph Hellwig
> > pointed out that this is broken as jbd should not pass slab pages down
> > to IO layer. and suggested to use get_free_pages() directly.
> >
> > The problem with this patch, as Andreas Dilger pointed today in ext4
> > interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste
> > 1/3-1/2 page space.
> >
> > What was the originally intention to set up slabs for committed_data(and
> > frozen_buffer) in JBD? Why not using kmalloc?
> >
> > Mingming
>
> Looks good. Small suggestion is to get rid of all kmalloc() usages and
> consistently use jbd_kmalloc() or jbd2_kmalloc().
>
> Thanks,
> Badari

Here is the small incremental cleanup patch.

Remove the remaining kmalloc usages in jbd/jbd2 and consistently use
jbd_kmalloc/jbd2_kmalloc.

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd/journal.c  |    8 +---
 fs/jbd/revoke.c   |   12 ++--
 fs/jbd2/journal.c |    8 +---
 fs/jbd2/revoke.c  |   12 ++--
 4 files changed, 22 insertions(+), 18 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-17 14:32:16.0 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-17 14:33:59.0 -0700
@@ -723,7 +723,8 @@ journal_t * journal_init_dev(struct bloc
 	journal->j_blocksize = blocksize;
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*),
+					GFP_KERNEL);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
@@ -777,7 +778,8 @@ journal_t * journal_init_inode (struct i
 	/* journal descriptor can store up to n blocks -bzzz */
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*),
+					GFP_KERNEL);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
@@ -1157,7 +1159,7 @@ void journal_destroy(journal_t *journal)
 		iput(journal->j_inode);
 	if (journal->j_revoke)
 		journal_destroy_revoke(journal);
-	kfree(journal->j_wbuf);
+	jbd_kfree(journal->j_wbuf);
 	jbd_kfree(journal);
 }

Index: linux-2.6.23-rc6/fs/jbd/revoke.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/revoke.c	2007-09-17 14:32:22.0 -0700
+++ linux-2.6.23-rc6/fs/jbd/revoke.c	2007-09-17 14:35:13.0 -0700
@@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
 	journal->j_revoke->hash_shift = shift;

 	journal->j_revoke->hash_table =
-		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+		jbd_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
 	if (!journal->j_revoke->hash_table) {
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
 		journal->j_revoke = NULL;
@@ -231,7 +231,7 @@ int journal_init_revoke(journal_t *journ

 	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
 	if (!journal->j_revoke_table[1]) {
-		kfree(journal->j_revoke_table[0]->hash_table);
+		jbd_kfree(journal->j_revoke_table[0]->hash_table);
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
 		return -ENOMEM;
 	}
@@ -246,9 +246,9 @@ int journal_init_revoke(journal_t *journ
 	journal->j_revoke->hash_shift = shift;

 	journal->j_revoke->hash_table =
-		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+		jbd_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
 	if (!journal->j_revoke->hash_table) {
-		kfree(journal->j_revoke_table[0]->hash_table);
+		jbd_kfree(journal->j_revoke_table[0]->hash_table);
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
 		kmem_cache_free
[PATCH] JBD slab cleanups
jbd/jbd2: Replace slab allocations with page cache allocations

From: Christoph Lameter <[EMAIL PROTECTED]>

JBD should not pass slab pages down to the block layer.
Use page allocator pages instead. This will also prepare
JBD for the large blocksize patchset.

Tested on 2.6.23-rc6; fsx runs fine.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd/checkpoint.c   |    2
 fs/jbd/commit.c       |    6 +-
 fs/jbd/journal.c      |  107 -
 fs/jbd/transaction.c  |   10 ++--
 fs/jbd2/checkpoint.c  |    2
 fs/jbd2/commit.c      |    6 +-
 fs/jbd2/journal.c     |  109 --
 fs/jbd2/transaction.c |   18
 include/linux/jbd.h   |   23 +-
 include/linux/jbd2.h  |   28 ++--
 10 files changed, 83 insertions(+), 228 deletions(-)

Index: linux-2.6.23-rc5/fs/jbd/journal.c
===
--- linux-2.6.23-rc5.orig/fs/jbd/journal.c	2007-09-13 13:37:57.0 -0700
+++ linux-2.6.23-rc5/fs/jbd/journal.c	2007-09-13 13:45:39.0 -0700
@@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);

 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
-static int journal_create_jbd_slab(size_t slab_size);

 /*
  * Helper function used to manage commit timeouts
@@ -334,10 +333,10 @@ repeat:
 		char *tmp;

 		jbd_unlock_bh_state(bh_in);
-		tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
+		tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
 		jbd_lock_bh_state(bh_in);
 		if (jh_in->b_frozen_data) {
-			jbd_slab_free(tmp, bh_in->b_size);
+			jbd_free(tmp, bh_in->b_size);
 			goto repeat;
 		}

@@ -679,7 +678,7 @@ static journal_t * journal_init_common (
 	/* Set up a default-sized revoke table for the new mount. */
 	err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
 	if (err) {
-		kfree(journal);
+		jbd_kfree(journal);
 		goto fail;
 	}
 	return journal;
@@ -728,7 +727,7 @@ journal_t * journal_init_dev(struct bloc
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
-		kfree(journal);
+		jbd_kfree(journal);
 		journal = NULL;
 		goto out;
 	}
@@ -782,7 +781,7 @@ journal_t * journal_init_inode (struct i
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
-		kfree(journal);
+		jbd_kfree(journal);
 		return NULL;
 	}

@@ -791,7 +790,7 @@ journal_t * journal_init_inode (struct i
 	if (err) {
 		printk(KERN_ERR "%s: Cannnot locate journal superblock\n",
 			__FUNCTION__);
-		kfree(journal);
+		jbd_kfree(journal);
 		return NULL;
 	}

@@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
 		}
 	}

-	/*
-	 * Create a slab for this blocksize
-	 */
-	err = journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
-	if (err)
-		return err;
-
 	/* Let the recovery code check whether it needs to recover any
 	 * data from the journal. */
 	if (journal_recover(journal))
@@ -1166,7 +1158,7 @@ void journal_destroy(journal_t *journal)
 	if (journal->j_revoke)
 		journal_destroy_revoke(journal);
 	kfree(journal->j_wbuf);
-	kfree(journal);
+	jbd_kfree(journal);
 }


@@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode
 }

 /*
- * Simple support for retrying memory allocations. Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
-	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
- */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size) (size >> 11)
-
-static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
-static const char *jbd_slab_names[JBD_MAX_SLABS] = {
-	"jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
-};
-
-static void journal_destroy_jbd_slabs(void)
-{
-	int i;
-
-
[RFC 2/2] JBD: blocks reservation fix for large block support
With large block support in the VM, the number of blocks per page can be
less than or equal to 1. This patch fixes the calculation of the number
of blocks to reserve in the journal when blocksize > pagesize.

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>

Index: my2.6/fs/jbd/journal.c
===
--- my2.6.orig/fs/jbd/journal.c	2007-08-31 13:27:16.0 -0700
+++ my2.6/fs/jbd/journal.c	2007-08-31 13:28:18.0 -0700
@@ -1611,7 +1611,12 @@ void journal_ack_err(journal_t *journal)

 int journal_blocks_per_page(struct inode *inode)
 {
-	return 1 << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+	int bits = PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits;
+
+	if (bits > 0)
+		return 1 << bits;
+	else
+		return 1;
 }

 /*
Index: my2.6/fs/jbd2/journal.c
===
--- my2.6.orig/fs/jbd2/journal.c	2007-08-31 13:32:21.0 -0700
+++ my2.6/fs/jbd2/journal.c	2007-08-31 13:32:30.0 -0700
@@ -1612,7 +1612,12 @@ void jbd2_journal_ack_err(journal_t *jou

 int jbd2_journal_blocks_per_page(struct inode *inode)
 {
-	return 1 << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+	int bits = PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits;
+
+	if (bits > 0)
+		return 1 << bits;
+	else
+		return 1;
 }

 /*
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
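The clamping in this patch can be exercised outside the kernel. What follows is only a minimal userspace sketch, not the kernel code: blocks_per_page() is a hypothetical stand-in for journal_blocks_per_page(), and its two parameters stand in for PAGE_CACHE_SHIFT and s_blocksize_bits. With a 4k page (shift 12) and a 64k block (16 bits) the difference is -4, and a shift by a negative count is undefined in C, so the one-line version cannot be kept once blocksize > pagesize is allowed.

```c
#include <assert.h>

/*
 * Userspace sketch of the clamped calculation (hypothetical helper,
 * not the kernel function). page_shift stands in for PAGE_CACHE_SHIFT,
 * blocksize_bits for inode->i_sb->s_blocksize_bits.
 */
int blocks_per_page(int page_shift, int blocksize_bits)
{
	int bits = page_shift - blocksize_bits;

	/* Keep the shift count inside the range that is valid for int. */
	assert(bits < 31);

	/* blocksize < pagesize: several blocks fit in one page. */
	if (bits > 0)
		return 1 << bits;

	/* blocksize >= pagesize: never shift by a negative count. */
	return 1;
}
```

Under these assumptions a 1k block on a 4k page gives 4 blocks per page, while 4k and 64k blocks on a 4k page both give 1.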
[RFC 1/2] JBD: slab management support for large block(>8k)
From clameter:

Teach jbd/jbd2 slab management to support >8k block size. Without this,
it refuses to mount ext3 with a >8k block size.

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>

Index: my2.6/fs/jbd/journal.c
===
--- my2.6.orig/fs/jbd/journal.c	2007-08-30 18:40:02.0 -0700
+++ my2.6/fs/jbd/journal.c	2007-08-31 11:01:18.0 -0700
@@ -1627,16 +1627,17 @@ void * __jbd_kmalloc (const char *where,
  * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
  * and allocate frozen and commit buffers from these slabs.
  *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
+ * (Note: We only seem to need the definitions here for the SLAB_DEBUG
+ * case. In non debug operations SLUB will find the corresponding kmalloc
+ * cache and create an alias. --clameter)
  */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size) (size >> 11)
+#define JBD_MAX_SLABS 7
+#define JBD_SLAB_INDEX(size) get_order((size) << (PAGE_SHIFT - 10))

 static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
 static const char *jbd_slab_names[JBD_MAX_SLABS] = {
-	"jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
+	"jbd_1k", "jbd_2k", "jbd_4k", "jbd_8k",
+	"jbd_16k", "jbd_32k", "jbd_64k"
 };

 static void journal_destroy_jbd_slabs(void)
Index: my2.6/fs/jbd2/journal.c
===
--- my2.6.orig/fs/jbd2/journal.c	2007-08-30 18:40:02.0 -0700
+++ my2.6/fs/jbd2/journal.c	2007-08-31 11:04:37.0 -0700
@@ -1639,16 +1639,18 @@ void * __jbd2_kmalloc (const char *where
  * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
  * and allocate frozen and commit buffers from these slabs.
  *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
+ * (Note: We only seem to need the definitions here for the SLAB_DEBUG
+ * case. In non debug operations SLUB will find the corresponding kmalloc
+ * cache and create an alias. --clameter)
  */

-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size) (size >> 11)
+#define JBD_MAX_SLABS 7
+#define JBD_SLAB_INDEX(size) get_order((size) << (PAGE_SHIFT - 10))

 static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
 static const char *jbd_slab_names[JBD_MAX_SLABS] = {
-	"jbd2_1k", "jbd2_2k", "jbd2_4k", NULL, "jbd2_8k"
+	"jbd2_1k", "jbd2_2k", "jbd2_4k", "jbd2_8k",
+	"jbd2_16k", "jbd2_32k", "jbd2_64k"
 };

 static void jbd2_journal_destroy_jbd_slabs(void)
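To see what the JBD_SLAB_INDEX change does, the two mappings can be compared in userspace. This is a rough sketch under stated assumptions: sketch_get_order() is a hypothetical stand-in for the kernel's get_order(), SKETCH_PAGE_SHIFT assumes 4k pages, and old_slab_index()/new_slab_index() are illustrative names, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4k pages */

/*
 * Userspace stand-in for get_order(): the smallest order such that
 * (page size << order) covers size bytes.
 */
int sketch_get_order(size_t size)
{
	int order = 0;
	size_t span = (size_t)1 << SKETCH_PAGE_SHIFT;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Old mapping: 1k -> 0, 2k -> 1, 4k -> 2, 8k -> 4 (slot 3 unused). */
int old_slab_index(size_t size)
{
	return (int)(size >> 11);
}

/* New mapping from the patch: log2(size / 1k) for power-of-two sizes. */
int new_slab_index(size_t size)
{
	/* Block sizes here are powers of two between 1k and 64k. */
	assert(size >= 1024 && size <= 65536 && (size & (size - 1)) == 0);
	return sketch_get_order(size << (SKETCH_PAGE_SHIFT - 10));
}
```

With these definitions the old mapping sends a 16k block to index 8, past the end of the five-entry table, while the new mapping sends 1k through 64k to slots 0 through 6, matching the seven names in the new jbd_slab_names table.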
Re: [RFC 1/4] Large Blocksize support for Ext2/3/4
On Wed, 2007-08-29 at 17:47 -0700, Mingming Cao wrote:
> Just rebase to 2.6.23-rc4 and against the ext4 patch queue. Compile tested
> only.
>
> Next steps:
> Need a e2fsprogs changes to able test this feature. As mkfs needs to be
> educated not assuming rec_len to be blocksize all the time.
> Will try it with Christoph Lameter's large block patch next.
>

Two problems were found when testing large block sizes on ext3. Patches
to follow.

The good news is that with your changes, plus all these extN changes, I
am able to run ext2/3/4 with a 64k block size, tested on x86 and ppc64
with 4k page size. The fsx test runs fine for an hour on ext3 with a 16k
block size on x86 :-)

Mingming