[patch 5/6] reiserfs v3 patches

2006-01-15 Thread Chris Mason
reiserfs: journal_transaction_should_end should increase the count of blocks
allocated so the transaction subsystem can keep new writers from creating
a transaction that is too large.

diff -r 890bf922a629 fs/reiserfs/journal.c
--- a/fs/reiserfs/journal.c Fri Jan 13 14:00:50 2006 -0500
+++ b/fs/reiserfs/journal.c Fri Jan 13 14:01:36 2006 -0500
@@ -2854,6 +2854,9 @@ int journal_transaction_should_end(struc
journal->j_cnode_free < (journal->j_trans_max * 3)) {
return 1;
}
+   /* protected by the BKL here */
+   journal->j_len_alloc += new_alloc;
+   th->t_blocks_allocated += new_alloc ;
return 0;
 }
 

--


[patch 6/6] reiserfs v3 patches

2006-01-15 Thread Chris Mason
When a filesystem has been converted from 3.5.x to 3.6.x, we need
an extra check during file write to make sure we are not trying
to make a 3.5.x file > 2GB.

diff -r ee81eb208598 fs/reiserfs/file.c
--- a/fs/reiserfs/file.c Fri Jan 13 14:01:37 2006 -0500
+++ b/fs/reiserfs/file.c Fri Jan 13 14:08:12 2006 -0500
@@ -1285,6 +1285,23 @@ static ssize_t reiserfs_file_write(struc
struct reiserfs_transaction_handle th;
th.t_trans_id = 0;
 
+   /* If a filesystem is converted from 3.5 to 3.6, we'll have v3.5 items
+   * lying around (most of the disk, in fact). Despite the filesystem
+   * now being a v3.6 format, the old items still can't support large
+   * file sizes. Catch this case here, as the rest of the VFS layer is
+   * oblivious to the different limitations between old and new items.
+   * reiserfs_setattr catches this for truncates. This chunk is lifted
+   * from generic_write_checks. */
+   if (get_inode_item_key_version (inode) == KEY_FORMAT_3_5 && 
+   *ppos + count > MAX_NON_LFS) {
+   if (*ppos >= MAX_NON_LFS) {
+   send_sig(SIGXFSZ, current, 0);
+   return -EFBIG;
+   }
+   if (count > MAX_NON_LFS - (unsigned long)*ppos)
+   count = MAX_NON_LFS - (unsigned long)*ppos;
+   }
+
if (file->f_flags & O_DIRECT) { // Direct IO needs treatment
ssize_t result, after_file_end = 0;
if ((*ppos + count >= inode->i_size)

--


[patch 3/6] reiserfs v3 patches

2006-01-15 Thread Chris Mason
In data=journal mode, reiserfs writepage needs to make sure not to
trigger transactions while being run under PF_MEMALLOC.  This patch
makes sure to redirty the page instead of forcing a transaction start
in this case.

Also, calling filemap_fdata* in order to trigger io on the block device
can cause lock inversions on the page lock.  Instead, do simple
batching from flush_commit_list.

diff -r c10585019f18 fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c   Fri Jan 13 13:51:10 2006 -0500
+++ b/fs/reiserfs/inode.c   Fri Jan 13 13:55:09 2006 -0500
@@ -2363,6 +2363,13 @@ static int reiserfs_write_full_page(stru
int bh_per_page = PAGE_CACHE_SIZE / s->s_blocksize;
th.t_trans_id = 0;
 
+   /* no logging allowed when nonblocking or from PF_MEMALLOC */
+   if (checked && (current->flags & PF_MEMALLOC)) {
+   redirty_page_for_writepage(wbc, page);
+   unlock_page(page);
+   return 0;
+   }
+
/* The page dirty bit is cleared before writepage is called, which
 * means we have to tell create_empty_buffers to make dirty buffers
 * The page really should be up to date at this point, so tossing
diff -r c10585019f18 fs/reiserfs/journal.c
--- a/fs/reiserfs/journal.c Fri Jan 13 13:51:10 2006 -0500
+++ b/fs/reiserfs/journal.c Fri Jan 13 13:55:09 2006 -0500
@@ -990,6 +990,7 @@ static int flush_commit_list(struct supe
struct reiserfs_journal *journal = SB_JOURNAL(s);
int barrier = 0;
int retval = 0;
+   int write_len;
 
reiserfs_check_lock_depth(s, "flush_commit_list");
 
@@ -1039,16 +1040,24 @@ static int flush_commit_list(struct supe
BUG_ON(!list_empty(&jl->j_bh_list));
/*
 * for the description block and all the log blocks, submit any buffers
-* that haven't already reached the disk
+* that haven't already reached the disk.  Try to write at least 256
+* log blocks. later on, we will only wait on blocks that correspond
+* to this transaction, but while we're unplugging we might as well
+* get a chunk of data on there.
 */
atomic_inc(&journal->j_async_throttle);
-   for (i = 0; i < (jl->j_len + 1); i++) {
+   write_len = jl->j_len + 1;
+   if (write_len < 256)
+   write_len = 256;
+   for (i = 0 ; i < write_len ; i++) {
bn = SB_ONDISK_JOURNAL_1st_BLOCK(s) + (jl->j_start + i) %
SB_ONDISK_JOURNAL_SIZE(s);
tbh = journal_find_get_block(s, bn);
-   if (buffer_dirty(tbh))  /* redundant, ll_rw_block() checks */
-   ll_rw_block(SWRITE, 1, &tbh);
-   put_bh(tbh);
+   if (tbh) {
+   if (buffer_dirty(tbh))
+   ll_rw_block(WRITE, 1, &tbh) ;
+   put_bh(tbh) ; 
+   }
}
atomic_dec(&journal->j_async_throttle);
 

--


[patch 4/6] reiserfs v3 patches

2006-01-15 Thread Chris Mason
write_ordered_buffers should handle dirty non-uptodate buffers without
a BUG()

diff -r 18fa5554d7e2 fs/reiserfs/journal.c
--- a/fs/reiserfs/journal.c Fri Jan 13 13:55:10 2006 -0500
+++ b/fs/reiserfs/journal.c Fri Jan 13 14:00:49 2006 -0500
@@ -848,6 +848,14 @@ static int write_ordered_buffers(spinloc
spin_lock(lock);
goto loop_next;
}
+   /* in theory, dirty non-uptodate buffers should never get here,
+* but the upper layer io error paths still have a few quirks.  
+* Handle them here as gracefully as we can
+*/
+   if (!buffer_uptodate(bh) && buffer_dirty(bh)) {
+   clear_buffer_dirty(bh);
+   ret = -EIO;
+   }
if (buffer_dirty(bh)) {
list_del_init(&jh->list);
list_add(&jh->list, &tmp);
@@ -1032,9 +1040,12 @@ static int flush_commit_list(struct supe
}
 
if (!list_empty(&jl->j_bh_list)) {
+   int ret;
unlock_kernel();
-   write_ordered_buffers(&journal->j_dirty_buffers_lock,
- journal, jl, &jl->j_bh_list);
+   ret = write_ordered_buffers(&journal->j_dirty_buffers_lock,
+   journal, jl, &jl->j_bh_list);
+   if (ret < 0 && retval == 0)
+   retval = ret;
lock_kernel();
}
BUG_ON(!list_empty(&jl->j_bh_list));

--


[patch 2/6] reiserfs v3 patches

2006-01-15 Thread Chris Mason
The b_private field in buffer heads needs to be zero filled
when the buffers are allocated.  Thanks to Nathan Scott for
finding this.  It was causing problems on systems with both XFS and
reiserfs.

diff -r 5ef1fa0a021a fs/buffer.c
--- a/fs/buffer.c   Fri Jan 13 13:50:39 2006 -0500
+++ b/fs/buffer.c   Fri Jan 13 13:51:09 2006 -0500
@@ -1022,6 +1022,7 @@ try_again:
 
bh->b_state = 0;
atomic_set(&bh->b_count, 0);
+   bh->b_private = NULL;
bh->b_size = size;
 
/* Link the buffer to its page */

--


[patch 1/6] reiserfs v3 patches

2006-01-15 Thread Chris Mason
After a transaction has closed but before it has finished commit, there
is a window where data=ordered mode requires invalidatepage to pin pages
instead of freeing them.  This patch fixes a race between the
invalidatepage checks and data=ordered writeback, and it also adds a
check to the reiserfs write_ordered_buffers routines to write any
anonymous buffers that were dirtied after its first writeback loop.

That bug works like this:

proc1: transaction closes and a new one starts
proc1: write_ordered_buffers starts processing data=ordered list
proc1: buffer A is cleaned and written
proc2: buffer A is dirtied by another process
proc2: File is truncated to zero, page A goes through invalidatepage
proc2: reiserfs_invalidatepage sees dirty buffer A with reiserfs
   journal head, pins it
proc1: write_ordered_buffers frees the journal head on buffer A

At this point, buffer A stays dirty forever

diff -r 21be96fa294a fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c   Fri Jan 13 13:48:03 2006 -0500
+++ b/fs/reiserfs/inode.c   Fri Jan 13 13:50:37 2006 -0500
@@ -2743,6 +2743,7 @@ static int invalidatepage_can_drop(struc
int ret = 1;
struct reiserfs_journal *j = SB_JOURNAL(inode->i_sb);
 
+   lock_buffer(bh);
spin_lock(&j->j_dirty_buffers_lock);
if (!buffer_mapped(bh)) {
goto free_jh;
@@ -2758,7 +2759,7 @@ static int invalidatepage_can_drop(struc
if (buffer_journaled(bh) || buffer_journal_dirty(bh)) {
ret = 0;
}
-   } else if (buffer_dirty(bh) || buffer_locked(bh)) {
+   } else  if (buffer_dirty(bh)) {
struct reiserfs_journal_list *jl;
struct reiserfs_jh *jh = bh->b_private;
 
@@ -2784,6 +2785,7 @@ static int invalidatepage_can_drop(struc
reiserfs_free_jh(bh);
}
spin_unlock(&j->j_dirty_buffers_lock);
+   unlock_buffer(bh);
return ret;
 }
 
diff -r 21be96fa294a fs/reiserfs/journal.c
--- a/fs/reiserfs/journal.c Fri Jan 13 13:48:03 2006 -0500
+++ b/fs/reiserfs/journal.c Fri Jan 13 13:50:37 2006 -0500
@@ -878,6 +878,19 @@ static int write_ordered_buffers(spinloc
}
if (!buffer_uptodate(bh)) {
ret = -EIO;
+   }
+   /* ugly interaction with invalidatepage here.
+* reiserfs_invalidate_page will pin any buffer that has a valid
+* journal head from an older transaction.  If someone else sets
+* our buffer dirty after we write it in the first loop, and
+* then someone truncates the page away, nobody will ever write
+* the buffer. We're safe if we write the page one last time
+* after freeing the journal header.
+*/
+   if (buffer_dirty(bh) && unlikely(bh->b_page->mapping == NULL)) {
+   spin_unlock(lock);
+   ll_rw_block(WRITE, 1, &bh);
+   spin_lock(lock);
}
put_bh(bh);
cond_resched_lock(lock);

--


[patch 0/6] reiserfs v3 patches

2006-01-15 Thread Chris Mason
Hello everyone,

Here is my current queue of reiserfs patches.  These originated from
various bugs solved in the suse sles9 kernel, and have been ported to
2.6.15-git9.

-chris

--


Re: Data being corrupted on reiserfs 3.6

2006-01-15 Thread Michael Barnwell

Thanks,

Pierre Etchemaïté wrote:

On Sun, 15 Jan 2006 21:36:20 +, Michael Barnwell <[EMAIL PROTECTED]> wrote:


I'm not sure how to search 
through a binary file for non-zero bytes.


cmp -b ?


[EMAIL PROTECTED]:~$ cmp -b /tmp/1GB.tst /home/michael/1GB.tst
/tmp/1GB.tst /home/michael/1GB.tst differ: byte 68494094, line 1 is   0 ^@  40


That seems to stop after the first difference, so I did:

[EMAIL PROTECTED]:~$ cmp -bl /tmp/1GB.tst /home/michael/1GB.tst | wc -l
243

The full output of cmp -bl is at http://pastebin.com/507389

Regards,

Michael.


Re: Data being corrupted on reiserfs 3.6

2006-01-15 Thread Pierre Etchemaïté
On Sun, 15 Jan 2006 21:36:20 +, Michael Barnwell <[EMAIL PROTECTED]> wrote:

> I'm not sure how to search 
> through a binary file for non-zero bytes.

cmp -b ?


Re: Data being corrupted on reiserfs 3.6

2006-01-15 Thread Michael Barnwell

Hi,

Jan Kara wrote:

  Hmm, that is really strange. Do the files have the same size? Do you
also get an error if you just create a file full of zeros? If so, what do
the differences look like (e.g., any signs of flipped bits)?

[EMAIL PROTECTED]:/tmp$ dd bs=1024 count=1000k if=/dev/zero of=./1GB.tst
1024000+0 records in
1024000+0 records out
1048576000 bytes transferred in 61.578769 seconds (17028207 bytes/sec)
[EMAIL PROTECTED]:/tmp$ ls -l 1GB.tst
-rw-r--r--  1 michael michael 1048576000 2006-01-15 20:51 1GB.tst
[EMAIL PROTECTED]:/tmp$ md5sum 1GB.tst
e5c834fbdaa6bfd8eac5eb9404eefdd4  1GB.tst
[EMAIL PROTECTED]:/tmp$ ls -l /home/michael/1GB.tst
-rw-r--r--  1 michael michael 1048576000 2006-01-15 20:54 
/home/michael/1GB.tst

[EMAIL PROTECTED]:/tmp$ md5sum /home/michael/1GB.tst
92c51557041ebd6424b4467a878c9f44  /home/michael/1GB.tst

I looked at the file in /home/michael/1GB.tst with xxd for about 5 
minutes but couldn't see anything but zeros - I'm not sure how to search 
through a binary file for non-zero bytes.


So yes, error if the file is all zeros and they have the same size.

Thanks,

Michael Barnwell.


Re: Data being corrupted on reiserfs 3.6

2006-01-15 Thread Jan Kara
  Hello,

> I'm experiencing data corruption when creating or copying data to my 
> reiserfs 3.6 partition mounted under /home. The following extract gives 
> a pretty clear indication that it's getting corrupted somewhere.
> 
> [EMAIL PROTECTED]:/tmp$ mount
> /dev/md0 on / type ext3 (rw,errors=remount-ro)
> proc on /proc type proc (rw)
> sysfs on /sys type sysfs (rw)
> devpts on /dev/pts type devpts (rw,gid=5,mode=620)
> tmpfs on /dev/shm type tmpfs (rw)
> usbfs on /proc/bus/usb type usbfs (rw)
> tmpfs on /dev type tmpfs (rw,size=10M,mode=0755)
> /dev/md2 on /home type reiserfs (rw)
> 
> [EMAIL PROTECTED]:/tmp$ dd bs=1024 count=1000k if=/dev/urandom of=./1GB.tst
> 1024000+0 records in
> 1024000+0 records out
> 1048576000 bytes transferred in 231.749782 seconds (4524604 bytes/sec)
> 
> [EMAIL PROTECTED]:/tmp$ md5sum 1GB.tst
> 48f46744c7e50c42c061a00d11541a85  1GB.tst
> 
> [EMAIL PROTECTED]:/tmp$ cp 1GB.tst /home/michael/
> 
> [EMAIL PROTECTED]:/tmp$ md5sum /home/michael/1GB.tst
> 042d8c462882f848412679e3cea03fe2  /home/michael/1GB.tst
  Hmm, that is really strange. Do the files have the same size? Do you
also get an error if you just create a file full of zeros? If so, what do
the differences look like (e.g., any signs of flipped bits)?

> I'm running Debian Sarge on an Athlon XP 2200+, /dev/md2 is made up of 
> four 400GB SATA hard disks on a Silicon Image 3114 controller in RAID 5. 
> Dmesg is showing no errors whatsoever; the RAID array has been stable 
> since I installed it a couple of weeks ago, and the drive was formatted 
> with mkfs.reiserfs with no special options.
> 
> [EMAIL PROTECTED]:/tmp$ uname -a
> Linux biggs 2.6.8-2-k7 #1 Tue Aug 16 14:00:15 UTC 2005 i686 GNU/Linux
  Any chance of trying some newer kernel? 2.6.8 is really old...

Honza


Re: reiserfsck --rebuild-tree aborts at same block

2006-01-15 Thread Jan Kara
> I have a situation where if I run "reiserfsck --rebuild-tree" multiple 
> times, it always aborts at the same block.  The output includes
> "Send us the bug report only if the second run dies at the same place 
> with the same block number."
> 
> Before sending a bunch of info to the wrong place though, could someone 
> please confirm if I should submit details here as a bug report, or would 
> this be something to go through the support channel with?
  First check that you have the latest version of reiserfsck. If so, then
this is the appropriate list for the report.

Honza


reiserfsck --rebuild-tree aborts at same block

2006-01-15 Thread Mike Depot
I have a situation where if I run "reiserfsck --rebuild-tree" multiple 
times, it always aborts at the same block.  The output includes
"Send us the bug report only if the second run dies at the same place 
with the same block number."


Before sending a bunch of info to the wrong place though, could someone 
please confirm if I should submit details here as a bug report, or would 
this be something to go through the support channel with?


Thanks