Re: [PATCH] reiserfs:fix journaling issue regarding fsync()

2006-06-30 Thread Chris Mason
On Thursday 29 June 2006 21:36, Hisashi Hifumi wrote:
 Hi,

 At 09:47 06/06/30, Chris Mason wrote:
  Thanks for the patch.  One problem is this will bump the transaction
   marker for atime updates too.  I'd rather see the change done inside
  reiserfs_file_write.

 I did not realize that an atime updates is also influenced.

  reiserfs_file_write already updates the transaction when blocks are
   allocated, but you're right that to be 100% correct we should cover the
   case when i_size increases but new blocks are not added.  Was this the
   case you were trying to fix?

 Yes, that's right.

 So, I remade my patch as follows.
 I tested this patch and confirmed that the kernel with this patch work
 well.

This is correct, excpet you need to put the update_inode_transaction call 
inside reiserfs_write_lock/unlock.

-chris


[patch 0/6] reiserfs v3 patches

2006-01-15 Thread Chris Mason
Hello everyone,

Here is my current queue of reiserfs patches.  These originated from
various bugs solved in the suse sles9 kernel, and have been ported to
2.6.15-git9.

-chris

--


[patch 1/6] reiserfs v3 patches

2006-01-15 Thread Chris Mason
After a transaction has closed but before it has finished commit, there
is a window where data=ordered mode requires invalidatepage to pin pages
instead of freeing them.  This patch fixes a race between the
invalidatepage checks and data=ordered writeback, and it also adds a
check to the reiserfs write_ordered_buffers routines to write any
anonymous buffers that were dirtied after its first writeback loop.

That bug works like this:

proc1: transaction closes and a new one starts
proc1: write_ordered_buffers starts processing data=ordered list
proc1: buffer A is cleaned and written
proc2: buffer A is dirtied by another process
proc2: File is truncated to zero, page A goes through invalidatepage
proc2: reiserfs_invalidatepage sees dirty buffer A with reiserfs
   journal head, pins it
proc1: write_ordered_buffers frees the journal head on buffer A

At this point, buffer A stays dirty forever

diff -r 21be96fa294a fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c   Fri Jan 13 13:48:03 2006 -0500
+++ b/fs/reiserfs/inode.c   Fri Jan 13 13:50:37 2006 -0500
@@ -2743,6 +2743,7 @@ static int invalidatepage_can_drop(struc
int ret = 1;
struct reiserfs_journal *j = SB_JOURNAL(inode-i_sb);
 
+   lock_buffer(bh);
spin_lock(j-j_dirty_buffers_lock);
if (!buffer_mapped(bh)) {
goto free_jh;
@@ -2758,7 +2759,7 @@ static int invalidatepage_can_drop(struc
if (buffer_journaled(bh) || buffer_journal_dirty(bh)) {
ret = 0;
}
-   } else if (buffer_dirty(bh) || buffer_locked(bh)) {
+   } else  if (buffer_dirty(bh)) {
struct reiserfs_journal_list *jl;
struct reiserfs_jh *jh = bh-b_private;
 
@@ -2784,6 +2785,7 @@ static int invalidatepage_can_drop(struc
reiserfs_free_jh(bh);
}
spin_unlock(j-j_dirty_buffers_lock);
+   unlock_buffer(bh);
return ret;
 }
 
diff -r 21be96fa294a fs/reiserfs/journal.c
--- a/fs/reiserfs/journal.c Fri Jan 13 13:48:03 2006 -0500
+++ b/fs/reiserfs/journal.c Fri Jan 13 13:50:37 2006 -0500
@@ -878,6 +878,19 @@ static int write_ordered_buffers(spinloc
}
if (!buffer_uptodate(bh)) {
ret = -EIO;
+   }
+   /* ugly interaction with invalidatepage here.
+* reiserfs_invalidate_page will pin any buffer that has a valid
+* journal head from an older transaction.  If someone else sets
+* our buffer dirty after we write it in the first loop, and
+* then someone truncates the page away, nobody will ever write
+* the buffer. We're safe if we write the page one last time
+* after freeing the journal header.
+*/
+   if (buffer_dirty(bh)  unlikely(bh-b_page-mapping == NULL)) {
+   spin_unlock(lock);
+   ll_rw_block(WRITE, 1, bh);
+   spin_lock(lock);
}
put_bh(bh);
cond_resched_lock(lock);

--


[patch 2/6] reiserfs v3 patches

2006-01-15 Thread Chris Mason
The b_private field in buffer heads needs to be zero filled
when the buffers are allocated.  Thanks to Nathan Scott for
finding this.  It was causing problems on systems with both XFS and
reiserfs.

diff -r 5ef1fa0a021a fs/buffer.c
--- a/fs/buffer.c   Fri Jan 13 13:50:39 2006 -0500
+++ b/fs/buffer.c   Fri Jan 13 13:51:09 2006 -0500
@@ -1022,6 +1022,7 @@ try_again:
 
bh-b_state = 0;
atomic_set(bh-b_count, 0);
+   bh-b_private = NULL;
bh-b_size = size;
 
/* Link the buffer to its page */

--


[patch 4/6] reiserfs v3 patches

2006-01-15 Thread Chris Mason
write_ordered_buffers should handle dirty non-uptodate buffers without
a BUG()

diff -r 18fa5554d7e2 fs/reiserfs/journal.c
--- a/fs/reiserfs/journal.c Fri Jan 13 13:55:10 2006 -0500
+++ b/fs/reiserfs/journal.c Fri Jan 13 14:00:49 2006 -0500
@@ -848,6 +848,14 @@ static int write_ordered_buffers(spinloc
spin_lock(lock);
goto loop_next;
}
+   /* in theory, dirty non-uptodate buffers should never get here,
+* but the upper layer io error paths still have a few quirks.  
+* Handle them here as gracefully as we can
+*/
+   if (!buffer_uptodate(bh)  buffer_dirty(bh)) {
+   clear_buffer_dirty(bh);
+   ret = -EIO;
+   }
if (buffer_dirty(bh)) {
list_del_init(jh-list);
list_add(jh-list, tmp);
@@ -1032,9 +1040,12 @@ static int flush_commit_list(struct supe
}
 
if (!list_empty(jl-j_bh_list)) {
+   int ret;
unlock_kernel();
-   write_ordered_buffers(journal-j_dirty_buffers_lock,
- journal, jl, jl-j_bh_list);
+   ret = write_ordered_buffers(journal-j_dirty_buffers_lock,
+   journal, jl, jl-j_bh_list);
+   if (ret  0  retval == 0)
+   retval = ret;
lock_kernel();
}
BUG_ON(!list_empty(jl-j_bh_list));

--


[patch 3/6] reiserfs v3 patches

2006-01-15 Thread Chris Mason
In data=journal mode, reiserfs writepage needs to make sure not to
trigger transactions while being run under PF_MEMALLOC.  This patch
makes sure to redirty the page instead of forcing a transaction start
in this case.

Also, calling filemap_fdata* in order to trigger io on the block device
can cause lock inversions on the page lock.  Instead, do simple
batching from flush_commit_list.

diff -r c10585019f18 fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c   Fri Jan 13 13:51:10 2006 -0500
+++ b/fs/reiserfs/inode.c   Fri Jan 13 13:55:09 2006 -0500
@@ -2363,6 +2363,13 @@ static int reiserfs_write_full_page(stru
int bh_per_page = PAGE_CACHE_SIZE / s-s_blocksize;
th.t_trans_id = 0;
 
+   /* no logging allowed when nonblocking or from PF_MEMALLOC */
+   if (checked  (current-flags  PF_MEMALLOC)) {
+   redirty_page_for_writepage(wbc, page);
+   unlock_page(page);
+   return 0;
+   }
+
/* The page dirty bit is cleared before writepage is called, which
 * means we have to tell create_empty_buffers to make dirty buffers
 * The page really should be up to date at this point, so tossing
diff -r c10585019f18 fs/reiserfs/journal.c
--- a/fs/reiserfs/journal.c Fri Jan 13 13:51:10 2006 -0500
+++ b/fs/reiserfs/journal.c Fri Jan 13 13:55:09 2006 -0500
@@ -990,6 +990,7 @@ static int flush_commit_list(struct supe
struct reiserfs_journal *journal = SB_JOURNAL(s);
int barrier = 0;
int retval = 0;
+   int write_len;
 
reiserfs_check_lock_depth(s, flush_commit_list);
 
@@ -1039,16 +1040,24 @@ static int flush_commit_list(struct supe
BUG_ON(!list_empty(jl-j_bh_list));
/*
 * for the description block and all the log blocks, submit any buffers
-* that haven't already reached the disk
+* that haven't already reached the disk.  Try to write at least 256
+* log blocks. later on, we will only wait on blocks that correspond
+* to this transaction, but while we're unplugging we might as well
+* get a chunk of data on there.
 */
atomic_inc(journal-j_async_throttle);
-   for (i = 0; i  (jl-j_len + 1); i++) {
+   write_len = jl-j_len + 1;
+   if (write_len  256)
+   write_len = 256;
+   for (i = 0 ; i  write_len ; i++) {
bn = SB_ONDISK_JOURNAL_1st_BLOCK(s) + (jl-j_start + i) %
SB_ONDISK_JOURNAL_SIZE(s);
tbh = journal_find_get_block(s, bn);
-   if (buffer_dirty(tbh))  /* redundant, ll_rw_block() checks */
-   ll_rw_block(SWRITE, 1, tbh);
-   put_bh(tbh);
+   if (tbh) {
+   if (buffer_dirty(tbh))
+   ll_rw_block(WRITE, 1, tbh) ;
+   put_bh(tbh) ; 
+   }
}
atomic_dec(journal-j_async_throttle);
 

--


[patch 6/6] reiserfs v3 patches

2006-01-15 Thread Chris Mason
When a filesystem has been converted from 3.5.x to 3.6.x, we need
an extra check during file write to make sure we are not trying
to make a 3.5.x file  2GB.

diff -r ee81eb208598 fs/reiserfs/file.c
--- a/fs/reiserfs/file.cFri Jan 13 14:01:37 2006 -0500
+++ b/fs/reiserfs/file.cFri Jan 13 14:08:12 2006 -0500
@@ -1285,6 +1285,23 @@ static ssize_t reiserfs_file_write(struc
struct reiserfs_transaction_handle th;
th.t_trans_id = 0;
 
+   /* If a filesystem is converted from 3.5 to 3.6, we'll have v3.5 items
+   * lying around (most of the disk, in fact). Despite the filesystem
+   * now being a v3.6 format, the old items still can't support large
+   * file sizes. Catch this case here, as the rest of the VFS layer is
+   * oblivious to the different limitations between old and new items.
+   * reiserfs_setattr catches this for truncates. This chunk is lifted
+   * from generic_write_checks. */
+   if (get_inode_item_key_version (inode) == KEY_FORMAT_3_5  
+   *ppos + count  MAX_NON_LFS) {
+   if (*ppos = MAX_NON_LFS) {
+   send_sig(SIGXFSZ, current, 0);
+   return -EFBIG;
+   }
+   if (count  MAX_NON_LFS - (unsigned long)*ppos)
+   count = MAX_NON_LFS - (unsigned long)*ppos;
+   }
+
if (file-f_flags  O_DIRECT) { // Direct IO needs treatment
ssize_t result, after_file_end = 0;
if ((*ppos + count = inode-i_size)

--


[patch 5/6] reiserfs v3 patches

2006-01-15 Thread Chris Mason
reiserfs: journal_transaction_should_end should increase the count of blocks
allocated so the transaction subsystem can keep new writers from creating
a transaction that is too large.

diff -r 890bf922a629 fs/reiserfs/journal.c
--- a/fs/reiserfs/journal.c Fri Jan 13 14:00:50 2006 -0500
+++ b/fs/reiserfs/journal.c Fri Jan 13 14:01:36 2006 -0500
@@ -2854,6 +2854,9 @@ int journal_transaction_should_end(struc
journal-j_cnode_free  (journal-j_trans_max * 3)) {
return 1;
}
+   /* protected by the BKL here */
+   journal-j_len_alloc += new_alloc;
+   th-t_blocks_allocated += new_alloc ;
return 0;
 }
 

--


Re: [PATCH] fix problems related to journaling in Reiserfs

2005-09-01 Thread Chris Mason
On Wed, 31 Aug 2005 20:35:52 -0700
Hans Reiser [EMAIL PROTECTED] wrote:

 Thanks much Hifumi!
 
 Chris, please comment on the patch.

The problem is that I'm not always making the inode dirty during the
reiserfs_file_write.  The get_block based write function does an
explicit commit during O_SYNC mode.  I've got a cleanup related to this
for quotas and other things, but I didn't realize it would help O_SYNC
as well.

I'll diff/test against mainline in the morning and send out.

-chris


Re: [PATCH] Make reiserfs BUG on too big transaction

2005-05-19 Thread Chris Mason
On Thursday 19 May 2005 05:36, Jan Kara wrote:
   Hello!

   Attached patch makes reiserfs BUG() when somebody tries to start a
 larger transaction than it's allowed (currently the code just silently
 deadlocks). I think this is a better behaviour. Can you please apply the
 patch?

Ack, looks ok.  In theory, we could return an error instead and force the FS 
into readonly mode, but it's better to catch the offending caller.

-chris


Re: [PATCH] Fix quota transaction size

2005-05-19 Thread Chris Mason
On Thursday 19 May 2005 05:40, Jan Kara wrote:
   Hello,

   attached patch improves the estimates on the number of credits needed
 for a quota operation. This is needed as currently quota overflows the
 maximum size of a transaction if 1KB blocksize is used. Please apply.

Thanks Jan,

It would make more sense to only allocate for the quota if quotas are in use.  
When you have 10 or more concurrent procs unlinking things, they end up 
waiting for each other because they are trying to reserve so many blocks in 
the transaction.  So, a smaller reservation allows for better concurrency 
when quotas are off.

-chris


Re: BUG in submit_ordered_buffer at fs/reiserfs/journal.c:616!

2005-03-14 Thread Chris Mason
On Friday 11 March 2005 18:39, Hans Reiser wrote:
 I/O errors usually indicate bad hardware not bad software,
 probably you need to get a new disk and use dd_rescue to copy everything

This is your user friendly error message targeted at users that don't know 
what an I/O error is?  

What's an I/O error?
What's software?
What's hardware?
What's a disk?
What's dd_rescue?
How do I copy everything?
How do I put a new disk in?
How do I make the kernel recognize use new disk instead of the old one?

The list goes on and on.  You'll never make the kernel more usable by making 
messages in the syslog more verbose.  You can make it more usable by having 
consistent error messages that can be found via search engines or the manual.

Jeff's completely right here.

-chris


Re: reiserfs3, rsync and hardlinks

2005-02-07 Thread Chris Mason
On Monday 07 February 2005 15:50, Pierre Etchemaite wrote:
 Le lun 07 fév 2005 13:22:51 CET, Vladimir Saveliev [EMAIL PROTECTED] a écrit

  Hello

   Hi,

  yes, reiserfs reuses inode number of removed files for newly created
  files. However, ext2 also does that. Have  you ever noticed this problem
  on other filesystems?

 No, but I'm only using rsync -H for a few weeks. The problem may also exist
 with tar, but unnoticed (unless tar detects hardlinks in a different way,
 or does more checks, like checking the consistency with references
 counters, whatever, to avoid it). rsync handles hardlinks in a final pass,
 so as soon as the verbosity level is raised, problems are easy to detect.

 I have only one server left that uses ext2. It's also saved with rsync, no
 problem seen so far (a few weeks only, as I said).
 But the filesystem used isn't the only difference. Usage pattern probably
 matters a lot. On the system where it happens, hardlinked files are often
 Maildir files (unsurprizingly) and mrtg log files (which are rotated every
 5 minutes). inodes are probably freed by mrtg, and one reused for a new
 email.

If you've got files being deleted in the middle of the backup,  then it is 
extremely difficult for rsync (or any tool) to get the hard link detection 
correct.  You've got a few choices:

1) put everything on lvm and backup snapshots instead of the live filesystem.  
This has a number of benefits.

2) link everyfile into some temp directory before the backup starts.  This 
will prevent that particular inode number from being reused during the 
backup, but won't help if new files are added during the rsync (since those 
new files could also be deleted).

for each file in backup list
ln file tmpdir/counter
counter++
rsync
rm -rf  tmpdir

-chris


Re: Stacktrace

2004-12-16 Thread Chris Mason
On Thu, 2004-12-16 at 14:17 +0100, Joachim Reichelt wrote:
 Dear all,
 I just got in my dmesg:
 
 reiserfs/1: page allocation failure. order:0, mode:0x21

This is an atomic allocation, which is allowed to fail.  You can ignore
the message (which comes from the VM), later versions of the suse kernel
don't even print it for atomic failures.

-chris




Re: Oops with large file in 2.6.8, reiser 3.6.13

2004-11-29 Thread Chris Mason
On Mon, 2004-11-29 at 14:46 -0500, Jeff Mahoney wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 Alex Zarochentsev wrote:
 | Hello,
 |
 | On Fri, Oct 29, 2004 at 10:55:36AM +0100, Richard Gregory wrote:
 |
 |Hi Alex,
 |
 |That fixed it. I created a 617gig file that filled the filesystem. It
 |then deleted without a problem. The delete took a long time, but at
 |least it got there.
 |
 |
 | Thanks a lot.  Your reply is what we needed to make a correct fix.
 
 Alex -
 
 Is the fix in the parent message the correct fix? It seems to leave an
 if (1 || ...) in, and I've yet to see the fix appear in bk or lkml.

Jeff, look for subject: reiserfs_do_truncate patch from zam.

-chris




Re: [PATCH] Fix reiserfs oops on small fs

2004-11-18 Thread Chris Mason
On Thu, 2004-11-18 at 12:49 +0100, Jan Kara wrote:
   Hello!
 
   Attached patch fixes oops of reiserfs on a filesystem with just one
 bitmap block - current code always tries to return second bitmap even if
 there's not any. Could someone review it please so that it can be merged
 in mainline?

Hi Jan,

A slightly different form of this patch is in already.  Look for the
checks in bmap_hash_id.  Are you still able to reproduce this bug on
kernels newer than October 18?

-chris




Re: [PATCH] Expose sync_fs()

2004-11-18 Thread Chris Mason
On Thu, 2004-11-18 at 13:00 +0100, Jan Kara wrote:
   Hello!
 
   Attached patch makes reiserfs provide sync_fs() function. It is
 necessary for a new quota code to work correctly and expose quota data
 to the user space after quotaoff. Currently the functionality is hidden
 behind the write_super() call which also seems a bit non-intuitive to me.
 Do you think the patch is acceptable?


Looks fine.

-chris




Re: [PATCH] Compile fix for reiserfs quota debug

2004-11-18 Thread Chris Mason
On Thu, 2004-11-18 at 12:44 +0100, Jan Kara wrote:
   Hello!
 
   Attached patch fixes debugging messages of the quota code in the
 reiserfs so that they compile. Could some of the reiserfs developers
 have a look at it please so that it can be merged in the mainline?

Looks fine, thanks Jan.

-chris




Re: [PATCH] ReiserFS v3 I/O error handling

2004-09-21 Thread Chris Mason
On Wed, 2004-09-15 at 10:31, Hans Reiser wrote:
 Jeff Mahoney wrote:
 
  Alex Zarochentsev wrote:
  | I assume that was tested with some simulated i/o errors, wasn't it?.
 
  Of course. The debugging code is since removed, but every place there
  was a !buffer_uptodate(bh) check, I added a trigger such that I could
  trigger each error path individually. I triggered various error paths
  while running fsx, stress.sh, and/or ltp's fsstress.
 
 Your patch is much needed stuff.  Would be nice to see it for reiser4 
 someday.;-)

;-) Any objections if we start by sending this to Andrew for v3?

-chris




Re: BUG in Reiserfs Journal Thread

2004-09-15 Thread Chris Mason
On Wed, 2004-09-15 at 16:02, Vijayan Prabhakaran wrote:
 Dear Chris Mason,
 
 I found a bug in Reiserfs journal thread. This bug is in function
 reiserfs_journal_commit_thread().

Hi,

Which version of the code are you reading?

-chris




Re: silent semantic changes with reiser4

2004-08-25 Thread Chris Mason
On Wed, 2004-08-25 at 16:41, Hans Reiser wrote:
 I just want to add that I AM capable of working with the other 
 filesystem developers in a team-player way, and I am happy to cooperate 
 with making portions more reusable where there is serious interest from 
 other filesystems in that, 

Prove it.  Stop replying for today and come back tomorrow with some
useful discussions.  Christoph suggested that some of the v4 semantics
belong in the VFS and therefore linux as a whole.  He's helping you to
make sure the semantics and fit nicely with the rest of kernel
interfaces and are race free.

Take him up on the offer.

-chris




Re: Quicker alternative to find /?

2004-08-16 Thread Chris Mason
On Mon, 2004-08-16 at 08:52, Christophe Saout wrote:
 Am Sonntag, den 15.08.2004, 23:16 +0200 schrieb Felix E. Klee:
 
  I'd like to store the directory structure of a partition formatted as
  ReiserFS into a file. Currently, I use
  
  find / file
  
  This process takes approximately 5 minutes (the result is 26MB of
  data). Are there any alternative *quicker* ways to do this?
 
 The main problem is that find only uses one thread. This thread only
 reads one directory at once and as a result of that you'll get a lot of
 seeks.
 
 This can usually be improved *a lot* by doing a massively multi-threaded
 search with a lot of threads trying to read a lot of directories at
 once. The disk io scheduler will then linearize all the outstanding read
 requests.
 
 I've done something similar to speed up a diff -r using a shell script
 (not for find but for reading the file content that should be compared).
 

This gets tricky quickly.  Many threads reading many different
directories at once will introduce a lot of seeks, since the directories
are likely to be far apart on disk.  It is far better for the filesystem
to realize a sequential scan of the directory is in progress and do
smarter readahead based on that.  

The latest patches in 2.6.8 for reiser v3 do some of this, triggering
metadata readahead on readdirs.  You can also make things faster by
mounting with -o noatime,nodiratime.

This is one workload where v4 should do better, since the inode data is
close to the directory entry.

-chris




Re: Quicker alternative to find /?

2004-08-16 Thread Chris Mason
On Mon, 2004-08-16 at 09:19, Spam wrote:
  Am Sonntag, den 15.08.2004, 23:16 +0200 schrieb Felix E. Klee:
 
  I'd like to store the directory structure of a partition formatted as
  ReiserFS into a file. Currently, I use
  
  find / file
  
  This process takes approximately 5 minutes (the result is 26MB of
  data). Are there any alternative *quicker* ways to do this?
 
  The main problem is that find only uses one thread. This thread only
  reads one directory at once and as a result of that you'll get a lot of
  seeks.
 
   I am confused about this in general with most filesystems. I thought
   that all filenames/foldernames etc were stored in one place and not
   spread out over the intire filesystem.
 
   It seem to me very strange that things like find/ls -R etc takes so
   so long just read/list files like this on any modern filesystem.

It varies by filesystem.  The filenames and folder names are stored in
one place on almost all filesystems.  But, the actual inode information
that tells you what kind of file it is and how to read the file are
sometimes stored in a different place (v3, ext[23]).

find has to read both sets of information because a recursive find has
to descend into all subdirectories, and the only way it can know if
something is a subdirectory is by reading the inode.  There is an
optimization that ext[23] use to store the mode information (identifying
things as a file or dir) in the directory listing.  reiserfs v3 doesn't
have this in the disk format, not sure if v4 does or not.

Different directories are likely to be stored in different areas of the
disk.  So, a multithreaded find that tries to read multiple dirs at once
is likely to introduce more seeks.

-chris




Re: Odd Block allocation behavior on Reiser3

2004-08-10 Thread Chris Mason
On Mon, 2004-08-09 at 18:04, Sonny Rao wrote:
 On Mon, Aug 09, 2004 at 04:30:51PM -0400, Chris Mason wrote:
  On Mon, 2004-08-09 at 16:19, Sonny Rao wrote:
   Hi, I'm investigating filesystem performance on sequential read
   patterns of large files, and I discovered an odd pattern of block
   allocation and subsequent re-allocation after overwrite under reiser3:
   
  Exactly which kernel is this?  The block allocator in v3 has changed
  recently.
 
 2.6.7 stock
 
Ok, the block allocator optimizations went in after 2.6.7.  I'd be
curious to see how 2.6.8-rc3 does in your tests.

   This was done on a newly created filesystem with plenty of available
   space and no other files.  I tried this test several times and saw the
   number of extents for the file vary from 5,6,7 and 134 extents, but it
   is always different after each over-write.
   
  You've hit a feature of the journal.  When you delete a file, the data
  blocks aren't available for reuse until the transaction that allocated
  them is committed to the log.  If you were to put a sync in between each
  run of dd, you should get roughly the same blocks allocated each time. 
  ext3 does the same things, although somewhat differently.  The
  asynchronous commit is probably just finishing a little sooner on ext3.
  
   First, I expect that an extent-based filesystem like reiserfs
  
  reiser4 is extent based, reiser3 is not.
 
 
 Ah, I didn't know that.  I'm still confused as to why on the first
 allocation/create we get such bad fragmentation, you can see that even
 though the file is fragmented into 134 blocks, the blocks are very
 close together.  Most of the extents are only 2 blocks apart.

This could be the metadata mixed in with the file data.  In general this
is a good
Still there were a number of cases the old allocator didn't do as well
with. thing, when you read the file sequentially, the metadata required
to find the next block will already be in the drive's cache.

Whenever you're doing fragmentation tests, it helps to also identify the
actual effect of the fragmentation on the time it takes to read a file
or set of files.  It's easy to create a directory where all the files
are 99.99% contiguous, but that takes 3x as much time to read in.

-chris



Re: mongo_copy: cp: cannot stat `/mnt/testfs/testdir0-0-0/f92': Input/output error

2004-08-04 Thread Chris Mason
On Wed, 2004-08-04 at 13:38, Hans Reiser wrote:
 Please do whatever you can to reproduce this.  We are going to delay 
 release by one day to see if it can be reproduced.  Vs thinks it might 
 be a hardware problem, I am not so optimistic, what are your thoughts?
 

If it currently passes all of your internal tests, and the disk format
is fixed, I'd release it.  There are going to be bugs found, pretty much
everyone expects v4 to go through a churning period.

-chris




Re: Reiser4 on SuSE 9.1

2004-06-04 Thread Chris Mason
On Fri, 2004-06-04 at 10:14, Mike Young wrote:
 Has anyone put Reiser4 on the latest SuSE 9.1 release?  I'd like to use it
 there without having to patch a pristine kernel.  Preferably, I'd like to be
 able to use their RPM build environment so I can continue to take updates
 from SuSE.
 
 I've placed a couple of requests into their support group, but have gotten
 nowhere with them.
 
 Anyone got a suggestion.  I can patch kernels.  I just haven't deciphered
 the build environment.  I'm use to doing this in a chroot environment.

We're interested in v4 of course, as things stabilize we'll consider
including it.  This doc is fairly up to date, and it might help you
figure out the suse build setup:

http://www.suse.de/~agruen/kernel-doc/

-chris




Re: I would like to see ReiserFS V3 enter a feature freeze real soon.

2004-06-01 Thread Chris Mason
On Tue, 2004-06-01 at 13:02, Hans Reiser wrote:

  I can't promise that I'll never making another
 change in there, but my goal is to keep them to a minimum.
 
   
 
 Also, I would like to see some serious benchmarks of the bitmap 
 algorithm changes before they go in.  They seem nice in theory, and some 
 users liked them for their uses, but that does not make a serious 
 scientific study.  Such a study has a high chance of making them even 
 better.;-)
 
 
 
 
 Some benchmarks have been posted on reiserfs-list, but I'd love to
 coordinate with you on getting some mongo numbers. 
 
 Ok.

 A good start would be to just rebenchmark against v4.
   
 
 V4 performance is not at a stable point at the moment I think, I have 
 not been monitoring things closely due to trying to earn bucks 
 consulting, and performance did not get tested every week, but there 
 have been reports of performance decreasing and no reports of anyone 
 investigating it, so I need to
 

Sure, since v4 is being done again -mm right now (right?) you can just
benchmark against a few of the new options.  mount -o alloc=skip_busy
will give you the old allocator.

 Elena, please compose a URL consisting of past benchmarks of various V4 
 snapshots and send it to me.  (I did not read the last one you sent, 
 sorry about that, so include the contents of that one also).
 
 If the objective is to determine if the algorithm is good, then we 
 should test it with only the algorithm in question changed.
 
 I would be quite happy to add the algorithm to V4 (or Chris and Jeff can 
 do that) and test it on vs. off.

The algorithm has a few key components, but v4 doesn't need most of it. 
The part to inherit packing localities down the chain would be most
interesting in v4.

The rest approximates things v4 should already be good at, like grouping
some of the metadata near the data blocks.

-chris




Re: journal viewer for reiserfs

2004-05-21 Thread Chris Mason
On Fri, 2004-05-21 at 11:12, Redeeman wrote:
 hey, i just lost power today, and i saw it mentioned some transactions
 it replayed. now, i dont like that, but happend is happend. though, i
 would like to know what the transactions really were, is such thing
 possible? 
 
 note: reiserfs, not reiser4
 

The transactions are block based, so there's no nice logical way to sum
them up.  You can look at the blocks in the log with debugreiserfs -j
/dev/xxx /dev/xxx, but that's not very exciting.

-chris





Re: large files

2004-05-18 Thread Chris Mason
On Tue, 2004-05-18 at 09:42, Bernd Schubert wrote:
 Hello Chris,
 
 
  As a comparison data point, could you please try 2.6.6-mm3?  I realize
  you don't want to run this kernel in production, but it would tell us if
  I understand the problems at hand.
 
 the results in 2.6.6-mm3 are below, we almost consider to run this kernel 
 version.
 
 Here are two other interesting facts:
 
 1.) During the filecreation in 2.4.26 the load on the system was around 3-4, 
 whereas in 2.6.6-mm3 the load was at about 8-9.
 
Which procs contributed to this load?  The simple dd should have kept
the load at one.

 2.) When the dd file creation process finished (2.4.26 was running) the system 
 became so unresponsible, that the drdb connection timed out and a resync 
 process automatically started when the system became responsible again. I 
 don't have any comparism to 2.6.6-mm3 since we would need another drbd 
 version. Also, I don't know if this happend when dd finished or when 
 rm-started, since both were running from a script.
 
Probably the rm.

[ 2.6.6-mm3 is much faster ]

 Do you have any ideas how we could improve 2.4.x? 
 

2.6.6-mm has a few key improvements.  There's less metadata
fragmentation thanks to some block allocator fixes.  More importantly,
during the rm, metadata blocks are read in 16 at a time instead of 1 at
a time.  I'd be happy to give someone pointers on porting the metadata
readahead bits back to 2.4.

-chris




Re: large files

2004-05-17 Thread Chris Mason
On Mon, 2004-05-17 at 15:48, Bernd Schubert wrote:
 Hello,
 
 I'm currently testing our new server and though it will primarily not serve 
 really large files (about 40-60 users will have a quota of 25GB each on a 2TB 
 array), I'm still testing the performance for large files.
 
 So I created an about 300GB fil and the problem is to remove it now. 
 Removing it took much more than 15 minutes. Here's the the relevant top line:
 
  5012 root  18   0   368  368   312 D21.9  0.0   5:48 rm
 
 Since I didn't expect it to take so much time, I didn't measure the time to 
 delete this file.
 
 system specifications:
   - dual opteron 242 (1600 MHz)
   - linux-2.4.26 with all patches from Chris, no further patches
   - reiserfs-3.6 format
 
 The partition with the 300GB file has a size of 1.7TB.

This is most likely a combination of metadata fragmentation and the fact
that during deletes, 2.4.x reiserfs ends up reading one block at a time.

As a comparison data point, could you please try 2.6.6-mm3?  I realize
you don't want to run this kernel in production, but it would tell us if
I understand the problems at hand.

-chris




Re: 2 Terabyte install

2004-05-12 Thread Chris Mason
On Wed, 2004-05-12 at 14:24, Mike Benoit wrote:
 On Wed, 2004-05-12 at 09:59, Hans Reiser wrote:
   There were a few bugs of ours that acted as red herrings, but Linspire 
   is now up and running on this system with ReiserFS 3 and kernel 2.6.5.
  
   While I'm here, I have some other questions:
  
   * What is the time complexity of mounting a ReiserFS partition? 
 It seems to be proportional to the size of the partition?  Is it
 different for Reiser4?
   * Is there a tool to determine the type of file system on a
 partition without mounting it?
  
  is the time a problem in practice, or just a curiousity point?
 
 Hans,
 
   The last place I worked had a 2.5TB RAID array that had 10's of
 millions of files on it, basically a perfect fit for ReiserFS. However
 they rejected it simply because of the mount times. It was a slower
 machine (mainly used for dumb storage, however uptime was critical) but
 the mount times were just crazy, I don't have hard numbers, but if I
 recall correctly I was told around 5-10 minutes. 
 
Hans' guys made mount times better in v3 by reading the bitmaps in big
chunks.  I suspect you were using a kernel from before that fix.

-chris




Re: metas Permission Denied

2004-04-30 Thread Chris Mason
On Fri, 2004-04-30 at 01:19, Hans Reiser wrote:
 Chris Mason wrote:
 
 On Thu, 2004-04-29 at 12:22, Nikita Danilov wrote:
   
 
 [EMAIL PROTECTED] writes:
   On Thu, 29 Apr 2004 19:59:22 +0400, Nikita Danilov said:
   
chmod u+rx backup/fsplit.c

x bit is necessary for lookups, and r bit---for readdir.
   
   This is going to be *such* a non-starter - there's many decades of
   C files are mode 644 and executables are 755 tradition that this
   will fly against.  What this basically implies is that the 'execute'
 
 Eh? What I described is precisely decades old meaning of rwx bits for
 directories.
 
 Problem is that we have to fit objects that are both regular files and
 directories into access control scheme that wasn't designed for such a
 mix. I don't see better solution short of inventing new bit(s).
 
 
 
 Please forgive me for jumping into the end of a thread without reading
 the whole thing, but it seems like the r bit should be sufficient here. 
 If you can read the file, you should be able to read the metas.
 
 x should be for execution of the file...
 
 what if the file/directory contains real files which are not metas, 
 and it also has a file body?  This is possible in reiser4.

Well, that would explain needing the execute bit ;-)

I guess this is a matter of taste, but to me, the metas are really part
of the file.  If you can read the file you should be able to at least
read the listing of metas, for the same reasons that you can read the
file size and atime/mtime etc.  

This could hold true for /somedir/metas as well.

-chris






Re: Reiserfs concurrent write problems

2004-04-27 Thread Chris Mason
On Tue, 2004-04-27 at 12:15, Bruce Guenter wrote:
 On Mon, Apr 26, 2004 at 01:36:11PM -0400, Chris Mason wrote:
  Please try 2.6.6-rc2-mm2, which has new block allocator patches and
  other speedups.
  
  The default mount options should work best for you, but this might work
  too:
  
  mount -o alloc=skip_busy:dirid_groups /dev/xxx /mnt
 
 Using the new kernel and the above mount options boosted the write
 performance of ReiserFS, almost to the level of XFS.  At the highest
 concurrency level, the reading performance dropped, but it's still much
 faster than the other FSs.
 
 Thanks for your help.

Good to hear, could I trouble you to ping me (or better yet
reiserfs-list) when you have the graphs updated?

Other notes about 2.6.6-rc2-mm2: it has some ext3 optimizations as well,
so you might want to rebenchmark there.  reiserfs defaults to -o
data=ordered in 2.6.6-rc2 (and -mm); you should get slightly better
numbers with -o data=writeback.

-chris




Re: I oppose Chris and Jeff's patch to add an unnecessary additional namespace to ReiserFS

2004-04-26 Thread Chris Mason
On Mon, 2004-04-26 at 14:15, Hans Reiser wrote:
 Chris Mason wrote:
 
 On Mon, 2004-04-26 at 12:59, Hans Reiser wrote:
 
 v4 didn't factor into these decisions because it was still in extremely
 early stages back then (2.4.16 or so). 
 
 It was clearly indicated then that accessing acls was scheduled for V4 
 not V3. 
 
Well, the part we've always disagreed on most is how to support
existing users.  SUSE implemented the acls for v3 because we felt they
were an important feature, and didn't want to tell users asking for ACLs
to switch filesystems when it was reasonable to implement in v3.

It seems that you don't want the ACLs in v3 for two major reasons:

1) it's not v4
2) it's based on xattrs

I don't feel this is a good way to support v3, since v4 still means
telling someone to switch just for acls, and not using xattrs means not
using the same API as the rest of the kernel.

I hope v4 does improve the xattr api, and I hope it manages to do so for
more then just reiser4.  It is important that application writers are
able to code to a single interface and get coverage across all the major
linux filesystems.

-chris




Re: online fsck

2004-04-22 Thread Chris Mason
On Thu, 2004-04-22 at 13:51, Hans Reiser wrote:
 Chris Mason wrote:
 
 On Thu, 2004-04-22 at 09:00, Jure Pear wrote:
   
 
 Hi all,
 
 Is it theoretically posible?
 
 Like, does it need a drastic redesing of reiserfs or just sufficient $$
 directed to the team to be implemented?
 
 
 Because i think that reiserfsck --check in 12h + --rebuild-tree in 18h is
 still waaay too much downtime for a 500gb mail server...
 
 
 
 Online check is easy, just use lvm or evms to make a snapshot and then
 check the snapshot. 
 
 Requires that users use lvm before discovering the need for fsck, but, 
 yes.  What would be ideal would be some support for finding the 
 inconsistency on the snapshot, and then fixing it on the real fs using 
 the information learned from the snapshot fsck.

Which should be possible, especially for corruptions at the leaf
levels.  Things like incorrect i_size, pointers to files that don't
exist, etc.  Corruptions that require a full-blown rebuild-tree will be
much harder.

I was wrong to say that an online rebuild-tree would be impossible, but
it certainly does seem tricky.  Basically you could freeze the old tree,
using it for readonly lookups.  Rebuild to a new tree in the background,
and verify things you find in the old tree in the new tree (to catch a
file that has been deleted while the FS was mounted but is present in
the old tree).

-chris




Re: Reiser Fragmentation Effects (was compression vs performance)

2004-04-09 Thread Chris Mason
On Fri, 2004-04-09 at 01:53, Hans Reiser wrote:
 Burnes, James wrote:
 
 I thought nearly all filesystems designed since Berkeley FFS were nearly
 immune to fragmentation problems.
 
 After reading the following analysis at Harvard, it seems that
 fragmentation is still a problem.
 
 http://www.eecs.harvard.edu/~keith/research/tr94.html
 
   
 
 Yeah, I wish I had read this in 94.  V3 suffers from the same problems 
 as FFS does as described in the abstract (all that I read, sorry about 
 that, I really am a bit busy, so unless someone suggests I should read 
 more ) .  V4 cures it though.

I put out some patches last week that try to deal with this in v3.  Take
a look through the archives for mail from me.

The v3 patches are an attempt to do better under common workloads.  I
think they are a big improvement, and I doubt there's much more that can
(or should) be done beyond simple tweaking.

v4 does a better job, and even if it doesn't, it should at least have
enough info in the metadata such that any problems can be fixed.

-chris




Re: reiserfs v3 patches updated

2004-04-06 Thread Chris Mason
On Tue, 2004-04-06 at 14:29, camis wrote:
  Seem to run fine so far with
  rw,noatime,nodiratime,acl,user_xattr,data=ordered,alloc=skip_busy:dirid_groups,packing_groups
 
 Has anyone done any throughput/benchmarks for the various
 new patches/code?
 
The block allocator stuff is fresh from the oven and still warm inside. 
I'll be posting benchmarks for it as the week goes on.

 The block allocator improvements is our attempt to reduce
 fragmentation.  The patch defaults to the regular 2.6.5 block allocator,
 but has options documented at the top of the patch that allow grouping
 of blocks by packing locality or object id.  It also has an option to
 inherit lightly used packing localities across multiple subdirs, which
 keeps things closer together in the tree if you have a bunch of subdirs
 without much in them.
 
 Any of these new features documented anywhere? ;)

I'm writing up a mount option summary, the block allocator stuff is all
documented at the top of the patch.

-chris




Re: reiserfs v3 patches updated

2004-04-06 Thread Chris Mason
On Tue, 2004-04-06 at 15:17, camis wrote:
 Has anyone done any throughput/benchmarks for the various
 new patches/code?
  
  The block allocator stuff is fresh from the oven and still warm inside. 
  I'll be posting benchmarks for it as the week goes on.
 
 Something interesting:
 I have 8 dell 2650's (dual xeon 2.8ghz+hyperthreading) and i've setup
 the one machine above with the patches.. All 8 machines are incoming
 MX/smtp machines which handle quite large loads (5gigs of mail per day).
 
 machine 1: 
 (mounted:noatime,nodiratime,data=ordered,alloc=skip_busy:dirid_groups,packing_groups)
 
   21:16:17  up 12:17,  1 user,  load average: 0.45, 1.04, 1.07
 375 processes: 374 sleeping, 1 running, 0 zombie, 0 stopped
 CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
            total    7.2%    0.0%   32.4%   0.0%     0.8%    5.2%  353.2%
            cpu00    1.8%    0.0%   10.6%   0.1%     0.9%    1.3%   85.0%
            cpu01    1.5%    0.0%    7.4%   0.0%     0.1%    0.9%   89.9%
            cpu02    1.7%    0.0%    7.0%   0.0%     0.1%    1.7%   89.3%
            cpu03    1.8%    0.0%    7.4%   0.0%     0.1%    1.5%   88.9%
 
 Iowait total stays hovering at around the 6-8% mark..
 
 machine 2-8: (mounted: noatime,nodiratime)
 
   21:14:31  up 1 day,  8:41,  1 user,  load average: 0.43, 0.72, 0.66
 392 processes: 391 sleeping, 1 running, 0 zombie, 0 stopped
 CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
            total    8.0%    0.0%   35.2%   0.0%     1.6%    0.8%  352.8%
            cpu00    2.0%    0.0%   11.1%   0.1%     1.5%    0.3%   84.6%
            cpu01    2.0%    0.0%    7.7%   0.0%     0.3%    0.1%   89.5%
            cpu02    2.2%    0.0%    7.7%   0.0%     0.0%    0.3%   89.5%
            cpu03    1.8%    0.0%    8.7%   0.0%     0.0%    0.0%   89.3%
 
 The majority of the rest of the machines iowait hover around the 1%
 mark.. CPU time tends to be about the same, just the iowait is much
 much higher..

Very interesting.  data=ordered makes fsync more expensive, since it
ends up syncing more than just the buffers for that one file.  Could you
please try removing data=ordered from machine1?

-chris




Re: reiserfs v3 patches updated

2004-04-06 Thread Chris Mason
On Tue, 2004-04-06 at 15:53, camis wrote:
 The majority of the rest of the machines iowait hover around the 1%
 mark.. CPU time tends to be about the same, just the iowait is much
 much higher..
  
  Very interesting.  data=ordered makes fsync more expensive, since it
  ends up syncing more than just the buffers for that one file.  Could you
  please try removing data=ordered from machine1?
 
 Ok.. after leaving it for a few minutes..

[ higher io wait percentage with patches applied ]

 What i then tried on machine1 was remounting it noatime,nodiratime and
 what was weird was that the iowait stayed exactly the same, no indication
 of dropping at all.. Do the kernel patches change any of the default
 mount options at all? If i put the stock 2.6.5 kernel back on without
 the patch applied and mount it back as noatime,nodiratime, the iowait
 drops back to its normal 1%..

Do you have any numbers for the amount of mail delivered per second on
each box?  I've got an idea that should help lower the io wait
percentage as well, trying a few things here.

-chris




Re: reiserfs v3 patches updated

2004-04-06 Thread Chris Mason
On Tue, 2004-04-06 at 15:53, camis wrote:
 The majority of the rest of the machines iowait hover around the 1%
 mark.. CPU time tends to be about the same, just the iowait is much
 much higher..
  
  Very interesting.  data=ordered makes fsync more expensive, since it
  ends up syncing more than just the buffers for that one file.  Could you
  please try removing data=ordered from machine1?
 
 Ok.. after leaving it for a few minutes..

I'm a little slow today.  data=ordered is the default with these
patches.  You need to mount -o data=writeback.

-chris




Re: reiserfs v3 patches updated

2004-04-06 Thread Chris Mason
On Tue, 2004-04-06 at 16:51, Cami wrote:
 The majority of the rest of the machines iowait hover around the 1%
 mark.. CPU time tends to be about the same, just the iowait is much
 much higher..
 
 Very interesting.  data=ordered makes fsync more expensive, since it
 ends up syncing more than just the buffers for that one file.  Could you
 please try removing data=ordered from machine1?
 
 Ok.. after leaving it for a few minutes..
  
  I'm a little slow today.  data=ordered is the default with these
  patches.  You need to mount -o data=writeback.
 
 data=writeback yields pretty much the same iowait results..
 (machine 1+2 are around 5%-8% whereas machine 3-8 are at around 0.8%)

This is so much faster I'm worried the io isn't actually getting done. 
In the mail server benchmark I use (synctest -n 1 -t 50 -f -F), the time
went from 2m15s to 43s.  The old 2m15s was still faster than I used to
get with unpatched reiserfs.

I'm posting this for the truly brave among you and a little review.  I
need to do more tests on it.  The basic idea is to make sure we don't
start writeback on the log buffers and metadata if someone is already
doing it.

Also, the reiserfs work queue is not kicked during transaction end if
some other process is going to do the commit.  This saves a lot of
context switches.

-chris

Index: linux.mm/fs/reiserfs/journal.c
===
--- linux.mm.orig/fs/reiserfs/journal.c	2004-04-05 17:46:12.0 -0400
+++ linux.mm/fs/reiserfs/journal.c	2004-04-06 17:00:25.391877520 -0400
@@ -86,6 +86,7 @@ static struct workqueue_struct *commit_w
 /* journal list state bits */
 #define LIST_TOUCHED 1
 #define LIST_DIRTY   2
+#define LIST_COMMIT_PENDING  4		/* someone will commit this list */
 
 /* flags for do_journal_end */
 #define FLUSH_ALL   1		/* flush commit and real blocks */
@@ -2484,7 +2485,9 @@ static void let_transaction_grow(struct 
 {
 unsigned long bcount = SB_JOURNAL(sb)->j_bcount;
 while(1) {
-	yield();
+	set_current_state(TASK_UNINTERRUPTIBLE);
+	schedule_timeout(1);
+	SB_JOURNAL(sb)->j_current_jl->j_state |= LIST_COMMIT_PENDING;
 while ((atomic_read(&SB_JOURNAL(sb)->j_wcount) > 0 ||
 	atomic_read(&SB_JOURNAL(sb)->j_jlock)) &&
 	   SB_JOURNAL(sb)->j_trans_id == trans_id) {
@@ -2920,9 +2923,15 @@ static void flush_async_commits(void *p)
   flush_commit_list(p_s_sb, jl, 1);
   }
   unlock_kernel();
-  atomic_inc(&SB_JOURNAL(p_s_sb)->j_async_throttle);
-  filemap_fdatawrite(p_s_sb->s_bdev->bd_inode->i_mapping);
-  atomic_dec(&SB_JOURNAL(p_s_sb)->j_async_throttle);
+  /*
+   * this is a little racey, but there's no harm in missing
+   * the filemap_fdata_write
+   */
+  if (!atomic_read(&SB_JOURNAL(p_s_sb)->j_async_throttle)) {
+  atomic_inc(&SB_JOURNAL(p_s_sb)->j_async_throttle);
+  filemap_fdatawrite(p_s_sb->s_bdev->bd_inode->i_mapping);
+  atomic_dec(&SB_JOURNAL(p_s_sb)->j_async_throttle);
+  }
 }
 
 /*
@@ -3011,7 +3020,8 @@ static int check_journal_end(struct reis
 
  jl = SB_JOURNAL(p_s_sb)->j_current_jl;
  trans_id = jl->j_trans_id;
-
+  if (wait_on_commit)
+jl->j_state |= LIST_COMMIT_PENDING;
  atomic_set(&(SB_JOURNAL(p_s_sb)->j_jlock), 1) ;
   if (flush) {
 SB_JOURNAL(p_s_sb)->j_next_full_flush = 1 ;
@@ -3533,8 +3543,8 @@ static int do_journal_end(struct reiserf
   if (flush) {
 flush_commit_list(p_s_sb, jl, 1) ;
 flush_journal_list(p_s_sb, jl, 1) ;
-  } else
-queue_work(commit_wq, &SB_JOURNAL(p_s_sb)->j_work);
+  } else if (!(jl->j_state & LIST_COMMIT_PENDING))
+queue_delayed_work(commit_wq, &SB_JOURNAL(p_s_sb)->j_work, HZ/10);
 
 
   /* if the next transaction has any chance of wrapping, flush 


Re: NFS issues with reiserfs 3.6? ext3 works...

2004-03-26 Thread Chris Mason
On Fri, 2004-03-26 at 08:18, Vladimir Saveliev wrote:
 Hello
 
 On Fri, 2004-03-26 at 14:53, Bernhard Sadlowski wrote:
  Hi!
  
  A short question:
  
  What are the remaining differences between reiserfs and ext3 regarding
  NFS? 
  
 
 We thought no before your mail
 
  Details:
  
  I have two Linux machines with current kernel 2.4.25 and two
  Directories on the NFS Server dematl04:
  
  dematl04:/vol01 (ext3 fs)
  dematl04:/raid  (reiserfs 3.6)
  
  On both machines we use Helios Ethershare (a commercial software for
  print and fileservices for Macintosh). Helios Ethershare uses and
  updates a file .Desktop every time some mac is creating, writing or
  changing files in a Macintosh volume.  The dt utility has to be used
  in a shell for mkdir, rm, mv, touch, ... I guess it has to lock this
 file while applying changes.
 
Does Helios Ethershare use the standard linux in kernel nfs server, or
does it patch things somehow?

-chris




Re: NFS issues with reiserfs 3.6? ext3 works...

2004-03-26 Thread Chris Mason
On Fri, 2004-03-26 at 09:09, Bernhard Sadlowski wrote:
 Hi Chris,
 
 On 26 Mar 2004 08:42, Chris Mason [EMAIL PROTECTED] wrote:
  Does Helios Ethershare use the standard linux in kernel nfs server, or
  does it patch things somehow?
 
 System1 (dematl04) is Debian/woody with Linux 2.4.25 kernel-nfs-server
 System2 (dematl08) is Redhat9 with Linux 2.4.25 with its standard nfs client
 
 I did compile both kernels myself from original kernel.org source. 
 
Thanks for the clarification.  Running dt under strace as Vladimir
suggested will help.

-chris




Re: NFS issues with reiserfs 3.6? ext3 works...

2004-03-26 Thread Chris Mason
On Fri, 2004-03-26 at 09:20, Bernhard Sadlowski wrote:
 On 26 Mar 2004 09:18, Chris Mason [EMAIL PROTECTED] wrote:
   I did compile both kernels myself from original kernel.org source. 
   
  Thanks for the clarification.  Running dt under strace as Vladimir
  suggested will help.
 
 I did send him the output by private mail, because the list doesn't
 accept bigger attachments. If you are interested, I can send you the
 strace output too.

Vladimir sent them along.  I only read them quickly, but the traces look
the same to me for reiserfs and ext3.  It tries to find directory test3 on
the disk and then calls off to helios, which returns a failure.

You'll have to check things on the helios side I think.

-chris





reiserfs logging patches updated

2004-03-24 Thread Chris Mason
Hello everyone,

ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.5-rc2-mm2

Has a new set of reiserfs patches.  These should also work on 2.6.5-rc2,
but I did testing on top of -mm2 because I'm submitting part of the
patch set to Andrew.

New since the 2.6.4 code:

- add reiserfs laptop mode support
- removed patches that made it into mainline
- fixed buffer refcount problem in reiserfs-writepage-ordered-race when 
block size < page size
- fix buffer head leak in my invalidatepage func
- fix bogus warning message about locked buffers during logging

The series file has the list of the patches in the order they should be
applied.

Andrew, this subset is ready for testing -mm.  It's everything except
the xattr and ACLs, I'm still trying to convince Hans those should go
in.

reiserfs-nesting-02
reiserfs-journal-writer
reiserfs-logging
reiserfs-jh-2
reiserfs-prealloc
reiserfs-tail-jh
reiserfs-writepage-ordered-race
reiserfs-file_write_hole_sd.diff
reiserfs-laptop-mode
reiserfs-truncate-leak
reiserfs-ordered-lat
reiserfs-dirty-warning

-chris




Re: reiserfs logging patches updated

2004-03-24 Thread Chris Mason
On Wed, 2004-03-24 at 14:49, Hubert Chan wrote:
  Chris == Chris Mason [EMAIL PROTECTED] writes:
 
 [...]
 
 Chris - add reiserfs laptop mode support
 
 Can you explain what laptop mode is?

Take a look at linux/Documentation/laptop-mode.txt on any recent -mm
kernel.

The short description is that it allows you to force the filesystem into
a mode where it doesn't flush things to disk.  The risk is increased
loss of data if you crash, the gain is better battery life on laptops. 
The docs include a sample control script and other details.

The reiserfs patch is quite small, just making sure the journal honors
the flush timings requested by the rest of the kernel.

-chris




Re: reiserfs logging patches updated

2004-03-24 Thread Chris Mason
On Wed, 2004-03-24 at 17:18, Andrew Morton wrote:
 Chris Mason [EMAIL PROTECTED] wrote:
 
  ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.5-rc2-mm2
  
  Has a new set of reiserfs patches. 
 
 -ENODOCCO.  If people are going to test this stuff we will need setup and
 usage instructions, links to userspace tool upgrades, etc, etc.

For data=ordered?  The only docs are to mount -o data=writeback if you
don't want data=ordered (which is the new default).  No tool upgrades
are required.

reiserfs needs a Documentation/filesystems/reiserfs.txt, but it needs
that in general.  I'll write one up and have the namesys guys review.

-chris




Re: reiserfs logging patches updated

2004-03-24 Thread Chris Mason
On Wed, 2004-03-24 at 17:35, Andrew Morton wrote:
  
  For data=ordered?  The only docs are to mount -o data=writeback if you
  don't want data=ordered (which is the new default).  No tool upgrades
  are required.
 
 OK, thanks.  Switching the default on day one sounds radical doesn't it?
 
It does, but the code has been in testing -suse and on the reiserfs
list.  This doesn't mean data=ordered is perfect, but it's not quite
day one either.  I can switch the default back, but I'd rather have a
trial by fire ;-)

  reiserfs needs a Documentation/filesystems/reiserfs.txt, but it needs
  that in general.  I'll write one up and have the namesys guys review.
 
 OK.  And these guys need boring old changelogs please:
 
The top of each patch has a boring old changelog.  I can reformat them
if needed.

-chris





Re: reiserfs logging patches updated

2004-03-24 Thread Chris Mason
On Wed, 2004-03-24 at 17:59, Andrew Morton wrote:
  The top of each patch has a boring old changelog.  I can reformat them
  if needed.
 
 Oh, I didn't notice that.
 
 Anything other than one-patch-per-email with changelog in the body is a bit
 of a pain.
 
 I'll go fetch the patches again :(

I can send them over private mail to you that way if you haven't already
started downloading.

-chris



Re: reiserfs logging patches updated

2004-03-24 Thread Chris Mason
On Wed, 2004-03-24 at 19:47, Bernd Schubert wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
  It does, but the code has been in testing -suse and on the reiserfs
  list.  This doesn't mean data=ordered is perfect, but it's not quite
  day one either.  I can switch the default back, but I'd rather have a
  trial by fire ;-)
 
 Oh please not, just have a look at German newsgroups, there are real flamewars
 about the stability of reiserfs. Every time someone has a disk problem and a
 reiserfs partition is affected by it, there are more than a dozen answers
 that it's the fault of reiserfs. PLEASE, don't prove them right.
 
The code just won't get tested if it isn't turned on.  I think I've got
all the major issues fixed, and I'm sending to -mm to try and prove
that.  When we're all happy with the stability of the patches, they can
go to -linus.

But there are just as many threads about reiserfs not having
data=ordered support.  It's an important feature that has been out of
mainline for far too long.

-chris




Re: reiserfsprogs: lib/misc.c: why die() aborts?

2004-03-22 Thread Chris Mason
On Mon, 2004-03-22 at 07:49, Vladimir Saveliev wrote:
 Hi
 
 On Sat, 2004-03-20 at 19:18, Domenico Andreoli wrote:
  hi all,
  
  in trying to figure out what is the unpack program in reiserfsprogs
  and if it supposed to be distributed in the debian package,
 
 No, it should not be distributed
 
Hmmm, I disagree.  Normal users can use debugreiserfs -p and unpack to
do test rebuilds on copies of broken filesystems when they are
rebuilding critical data.

-chris




Re: new v3 2.6.4 logging/xattr patches

2004-03-19 Thread Chris Mason
On Fri, 2004-03-19 at 03:00, Hans Reiser wrote:
 Chris Mason wrote:
 
 Hello everyone,
 
 I've just uploaded all the reiserfs patches from the suse 2.6 kernel to:
 
 ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.4
 
 (they should also apply to 2.6.5-rc1 and 2.6.5-rc1-mm2)
 
 These include:
 Latency fixes
 Logging performance fixes
 data=ordered support
 quota support
 xattrs and acls from Jeff Mahoney
 
 and a few random bug fixes.  There's a README file in the directory with
some more details.
 
 are these ready to go to Andrew or?

They are ready for Andrew for additional testing, I'd like to submit the
whole bunch to him actually.  Any objections?

-chris






Re: new v3 2.6.4 logging/xattr patches

2004-03-19 Thread Chris Mason
On Fri, 2004-03-19 at 09:03, Chris Mason wrote:
 On Fri, 2004-03-19 at 03:00, Hans Reiser wrote:
  Chris Mason wrote:
  
  Hello everyone,
  
  I've just uploaded all the reiserfs patches from the suse 2.6 kernel to:
  
  ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.4
  
  (they should also apply to 2.6.5-rc1 and 2.6.5-rc1-mm2)

Since 2.6.5-rc1-mm2 includes some of these patches, I've put a series.mm
file in the ftp directory to tell you which ones are needed when you
want to apply to Andrew's tree.

-chris




new v3 2.6.4 logging/xattr patches

2004-03-18 Thread Chris Mason
Hello everyone,

I've just uploaded all the reiserfs patches from the suse 2.6 kernel to:

ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.4

(they should also apply to 2.6.5-rc1 and 2.6.5-rc1-mm2)

These include:
Latency fixes
Logging performance fixes
data=ordered support
quota support
xattrs and acls from Jeff Mahoney

and a few random bug fixes.  There's a README file in the directory with
some more details.

The names of the files no longer start with numbers to help you figure
out the order the patches get applied.  The directory includes a file
called series that has a list of all the patches in the proper order.

There are various shell tricks to apply things in order.  Or a few
different patch management tools that understand a series file.  I use
quilt (http://savannah.nongnu.org/projects/quilt/) which is very nice.
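One such shell trick is just a read loop over the series file. The sketch below builds a throwaway tree and two fabricated one-line patches purely for illustration (the real patch names live in the ftp directory above) and then applies them in series order:

```shell
# Demo of applying patches in series-file order.  The tree, the two
# patches, and the series file are fabricated for the example; with
# the real patch set you would run the same loop from the top of the
# kernel tree against its series file.
set -e
work=$(mktemp -d)
mkdir -p "$work/patches" "$work/linux"
echo "one" > "$work/linux/file.txt"
cat > "$work/patches/01.diff" <<'EOF'
--- a/file.txt
+++ b/file.txt
@@ -1 +1,2 @@
 one
+two
EOF
cat > "$work/patches/02.diff" <<'EOF'
--- a/file.txt
+++ b/file.txt
@@ -1,2 +1,3 @@
 one
 two
+three
EOF
printf '01.diff\n02.diff\n' > "$work/patches/series"

# the actual trick: apply each patch named in series, in order
cd "$work/linux"
while read -r p; do
    patch -p1 < "../patches/$p"
done < ../patches/series

cat file.txt
```

quilt drives the same kind of series file for you, with commands like `quilt push -a` to apply the whole stack and `quilt pop` to back patches out.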

-chris






Re: mount hang on kernel 2.6

2004-03-07 Thread Chris Mason
On Sun, 2004-03-07 at 17:32, Fabiano Reis wrote:
 Hi list, 
 
 I'm having problems mounting a reiserfs filesystem using kernel 2.6.3 on
 RedHat 9. 
 
 The fs I'm trying to mount was formatted using kernel 2.4.20-24.9 (the latest
 kernel for RH9) with ReiserFS version 3.6.25 (from dmesg) and reiserfsprogs 
 3.6.4 (from reiserfsck -v command) 
 
 I upgraded to kernel 2.6.3 (rpm from 
 http://people.redhat.com/arjanv/2.6/RPMS.kernel/) 
 
 The symptom: When I execute the command 
 

Can you reproduce with a vanilla or -mm kernel?

-chris




[PATCH] corruption bugs in 2.6 v3

2004-03-03 Thread Chris Mason
Hello everyone,

These two patches fix corruption problems I've been hitting on 2.6. 
Both bugs are present in the vanilla and suse kernels.

reiserfs-search-restart:
This was originally from [EMAIL PROTECTED], I recently made a small
addition to make sure the expected height was checked after reading in
blocks in search_by_key (this depends on reiserfs-lock-lat from my data
logging directory).

reiserfs-write-sched-bug:
Fixes two schedule related bugs during reiserfs_file_write.  One place
in the code assumes that after a schedule, path.pos_in_item will still
be valid even if the item has moved.  Since items can split during a
schedule this is incorrect.

The second bug took a little longer to figure out,
reiserfs_prepare_file_region_for_write needs to make sure a stale item
pointer isn't used if search_for_position_by_key doesn't return
POSITION_FOUND.

The most common symptoms of the two bugs are attempts to read beyond the
end of the device, file data corruption, and errors like this:

is_tree_node: node level X does not match to the expected one Y

Where X and Y are both valid tree heights (between 1 and 5) and usually
one away from each other.

These reproduced reliably for me by running dbench 20 in a loop for
about 20 minutes on an amd64 box.

Hans, I'd like to submit these along with the other fixes I ported from
2.4 and sent to reiserfs-dev.  Any objections?

-chris

fix a bug in reiserfs search_by_key call, where it might not properly detect
a change in tree height during a schedule.  Originally from [EMAIL PROTECTED]

Index: linux.t/fs/reiserfs/stree.c
===
--- linux.t.orig/fs/reiserfs/stree.c2004-03-03 14:11:40.984705584 -0500
+++ linux.t/fs/reiserfs/stree.c 2004-03-03 14:12:59.466460675 -0500
@@ -678,7 +678,7 @@
current node, and calculate the next current node(next path element)
for the next iteration of this loop.. */
 n_block_number = SB_ROOT_BLOCK (p_s_sb);
-expected_level = SB_TREE_HEIGHT (p_s_sb);
+expected_level = -1;
 while ( 1 ) {
 
 #ifdef CONFIG_REISERFS_CHECK
@@ -692,7 +692,6 @@
/* prep path to have another element added to it. */
p_s_last_element = PATH_OFFSET_PELEMENT(p_s_search_path, 
++p_s_search_path->path_length);
fs_gen = get_generation (p_s_sb);
-   expected_level --;
 
 #ifdef SEARCH_BY_KEY_READA
/* schedule read of right neighbor */
@@ -707,21 +706,26 @@
pathrelse(p_s_search_path);
return IO_ERROR;
}
+   if (expected_level == -1)
+   expected_level = SB_TREE_HEIGHT (p_s_sb);
+   expected_level --;
 
/* It is possible that schedule occurred. We must check whether the key
   to search is still in the tree rooted from the current buffer. If
   not then repeat search from the root. */
	if ( fs_changed (fs_gen, p_s_sb) &&
-(!B_IS_IN_TREE (p_s_bh) || !key_in_buffer(p_s_search_path, p_s_key, 
p_s_sb)) ) {
+   (!B_IS_IN_TREE (p_s_bh) || 
+B_LEVEL(p_s_bh) != expected_level || 
+!key_in_buffer(p_s_search_path, p_s_key, p_s_sb))) {
PROC_INFO_INC( p_s_sb, search_by_key_fs_changed );
-   PROC_INFO_INC( p_s_sb, search_by_key_restarted );
+   PROC_INFO_INC( p_s_sb, search_by_key_restarted );
PROC_INFO_INC( p_s_sb, sbk_restarted[ expected_level - 1 ] );
decrement_counters_in_path(p_s_search_path);

/* Get the root block number so that we can repeat the search
-   starting from the root. */
+  starting from the root. */
n_block_number = SB_ROOT_BLOCK (p_s_sb);
-   expected_level = SB_TREE_HEIGHT (p_s_sb);
+   expected_level = -1;
right_neighbor_of_leaf_node = 0;

/* repeat search from the root */
Index: linux.t/fs/reiserfs/file.c
===
--- linux.t.orig/fs/reiserfs/file.c	2004-03-03 14:16:44.762750907 -0500
+++ linux.t/fs/reiserfs/file.c	2004-03-03 14:16:57.361012562 -0500
@@ -365,7 +365,7 @@
 // it means there are no existing in-tree representation for file area
 // we are going to overwrite, so there is nothing to scan through for holes.
 for ( curr_block = 0, itempos = path.pos_in_item ; curr_block < blocks_to_allocate && res == POSITION_FOUND ; ) {
-
+retry:
 	if ( itempos >= ih_item_len(ih)/UNFM_P_SIZE ) {
 	/* We run out of data in this indirect item, let's look for another
 	   one. */
@@ -422,8 +422,8 @@
 		bh=get_last_bh(path);
 		ih=get_ih(path);
 		item = get_item(path);
-		// Itempos is still the same
-		continue;
+		itempos = path.pos_in_item;
+		goto retry;
 		}
 		modifying_this_item = 1;
 	}
@@ -856,8 +856,12 @@
 			/* Try to find next item */
			res = search_for_position_by_key(inode->i_sb, &key, &path);
 			/* Abort if no more items */
-			if ( res 

Re: [PATCH] updated data=ordered patch for 2.6.3

2004-03-01 Thread Chris Mason
On Mon, 2004-03-01 at 08:30, Christophe Saout wrote:
 Hi,
 
  Also, the code has some extra performance tweaks to smooth out
  performance both with and without data=ordered.  There are new
  mechanisms to trigger metadata/commit block writeback and to help
  throttle writers.  The goal is to reduce the huge bursts of io during a
  commit and during data=ordered writeback.
 
 It seems you introduced a bug here. I installed the patches yesterday
 and found a lockup on my notebook when running lilo (with /boot on the
 root reiserfs filesystem).
 
 A SysRq-T showed that lilo is stuck in fsync:
 

Ugh, I use grub so I haven't tried lilo.  Could you please send me the
full sysrq-t, this is probably something stupid.

-chris




Re: [PATCH] updated data=ordered patch for 2.6.3

2004-03-01 Thread Chris Mason
On Mon, 2004-03-01 at 09:30, Christophe Saout wrote:
 Am Mo, den 01.03.2004 schrieb Chris Mason um 15:01:
 
   It seems you introduced a bug here. I installed the patches yesterday
   and found a lockup on my notebook when running lilo (with /boot on the
   root reiserfs filesystem).
   
   A SysRq-T showed that lilo is stuck in fsync:
  
  Ugh, I use grub so I haven't tried lilo.  Could you please send me the
  full sysrq-t, this is probably something stupid.
 
 Yes. I could reproduce it by simply creating a dummy /boot volume on
 reiserfs. I copied the content of /boot, ran lilo and it hung again. The
 other reiserfs filesystems were still usable (but a global sync hangs
 afterwards). I also attached a bzipped strace of the lilo process.
 

Ok, thanks.  The problem is in reiserfs_unpack(), which needs updating
for the patch.  Fixing.

-chris




Re: data-logging 2.4.23 patches: 1 reject in fs/inode.c in 2.4.25

2004-02-20 Thread Chris Mason
On Thu, 2004-02-19 at 07:40, Jens Benecke wrote:
 Hi,
 
 Chris' data-logging patches do not apply cleanly to 2.4.25, are they being
 updated?
 
 I think the fix for inode.c is not very hard but I don't dare to fiddle
 around with file system code. ;)

Ok, I'll rediff the data logging + quota against 2.4.25.

-chris




Re: Big Block Device does not format...

2004-02-19 Thread Chris Mason
On Thu, 2004-02-19 at 09:02, Vitaly Fertman wrote:
 On Thursday 19 February 2004 12:39, Oliver Pabst wrote:
  Hello list,
 
 Hello Oliver,
 
  I try to mkreiserfs /dev/evms/userhomes and /dev/evms/userhomes is a
  drive-linked evms feature object with a sum of 18.8TB.
 
  mkreiserfs /dev/evms/userhomes gives following result:
 
  count_blocks: block device too large
  Aborted
 
 which reiserfsprogs version do you use?
 would you try the latest one from our ftp site?
 
The limit for 4k-block-based filesystems is 16TB.

Try something with extents, or try on ia64 where you can use bigger
pages.

-chris




[PATCH] updated data=ordered patch for 2.6.3

2004-02-18 Thread Chris Mason
Hello everyone,

I've updated the reiserfs logging speedups and data=ordered support to
2.6.3, and fixed a few bugs:

i_block count not properly updated on files with holes
oops when the disk is full
small files were not being packed into tails

Also, the code has some extra performance tweaks to smooth out
performance both with and without data=ordered.  There are new
mechanisms to trigger metadata/commit block writeback and to help
throttle writers.  The goal is to reduce the huge bursts of io during a
commit and during data=ordered writeback.

I'm very interested in benchmarks here and info about how the patches
feel for interactive performance.

Please note that data=ordered is not the default yet, you have to mount
with -o data=ordered to get it.

You can get the patches from:

ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.3

(once the suse mirror copies it over there)

-chris



Re: v3 experimental data=ordered and logging speedups for 2.6.1

2004-02-12 Thread Chris Mason
On Wed, 2004-02-11 at 10:09, Oleg Drokin wrote:
 Hello!
 
 On Wed, Feb 11, 2004 at 09:59:31AM -0500, Chris Mason wrote:
thousands (hundreds of thousands) of times per day.  It wasn't an easy
bug to hit.
   What are the symptoms?
  The rpm database is corrupted, rpm --rebuild-db is required.
 
 Hm, that's really strange. But on ia64 default io size is 64k, do they have
 same problems there?

No, ia64 works fine.  It really is strange.

 
other tricky parts when the data=journal code is added.  We've already
made our own file_write call, it doesn't make sense to warp it just to
avoid our own __block_commit_write ;-)
   Well, code duplication is not very good thing.
  It depends on how much you have to twist things to use the generic
  code.  If we used __block_commit_write, buffers would be marked dirty
  when it completes.  This won't work for data=journal at all, we don't
  want them marked dirty.
 
 Well, if you do not want them marked dirty, you just do not need to call
 commit_write at all since the only thing it does is marking buffers dirty ;)
 And you can have a list of buffers at the allocation time anyway,
 so no need to do extra checks about partial page writes and so on since
 all these checks were already done.

Interesting.  I'll look harder at that.

-chris




Re: v3 experimental data=ordered and logging speedups for 2.6.1

2004-01-20 Thread Chris Mason
On Mon, 2004-01-19 at 17:53, Dieter Nützel wrote:

 05 and 06 needed some handwork 'cause the SuSE kernel includes xattrs and posix
 acl's but nothing special.
 

Good to hear.  I wasn't expecting the suse merge to be difficult,
luckily it doesn't have many patches in it yet.  Jeff and I will look at
getting them into the suse kernel once data=journal is done as well.

 An EXPORT was missing in linux/fs/buffer.c to compile ReiserFS 3.x.x as a module
 (inode.c, unresolved symbol):
 

Thanks, I'll add it into the patch when I get back from linux world.

-chris




Re: 3.6.25 - Journal replayed back to 3 weeks ago

2004-01-15 Thread Chris Mason
On Thu, 2004-01-15 at 00:23, Neil Robinson wrote:
 Chris,
 
 I haven't heard anything since my response, so just in case it wasn't
 complete enough:
 
 uname -a results:
 
 2.4.20-gentoo-r9 #1 Sat Dec 6 03:17:43 GMT 2003 i686 Intel(R) Pentium(R)
 4 CPU 2.66GHz GenuineIntel GNU/Linux
 
 This is running in a vmware workstation 4.0 Windows XP host. It is set
 up with 3 MAX 4GB emulated SCSI drives and 1 emulated MAX 2GB FAT32 IDE
 hard drive.
 
 sda1 is ext2 on /boot
 sda2 is swap on swap
 sda3 is reiserfs on /
 sdb1 is reiserfs on /home
 sdc1 is reiserfs on /var/tmp/portage
 hda1 is vfat on /work

Ok, please tell me more about the vmware setup.  Are the vmware
drives configured to do copy on write?

-chris




Re: 3.6.25 - Journal replayed back to 3 weeks ago

2004-01-15 Thread Chris Mason
On Thu, 2004-01-15 at 11:33, Neil Robinson wrote:

  Ok, please tell me more about the vmware setup.  Are the vmware
  drives configured to do copy on write?
 
 According to the documentation, changes made to the virtual drives are
 committed to the physical drive immediately. Here is the salient info
 from the dmesg:

vmware has a mode where the changes are written to a log file instead of
to the main file.  The client doesn't see things are split between two
files, since vmware exports them as a single device.  This allows you to
do fancy things like roll back changes, or share a base file between two
hosts and have changes go into private files.

It's possible that you're in this mode and somehow lost a log file.

Are the virtual drives being exported to linux as real block devices in
windows, or are the files on some windows filesystem?

-chris




Re: 3.6.25 - Journal replayed back to 3 weeks ago

2004-01-15 Thread Chris Mason
On Thu, 2004-01-15 at 14:19, Neil Robinson wrote:

 Uh oh, I think I now know what happened. Ouch. I was not clear what the
 REDO files were for, since I seemed to have lots of them taking up a
 considerable amount of drive space. I deleted them (probably taking all
 of my changes with them). On the whole, I am quite relieved. Although I
 did it to myself (this is my first experience using vmware), I realize
 now what happened and it won't happen again. Even more important, it
 seems my problems have nothing to do with the Reiser File System which
 is also a relief. I wish to apologize to you and everyone on the list
 for taking up your time with something that, as it turns out, has nothing
 really to do with you.
 
 Thanks for all of your help.

Good news, I wasn't looking forward to hunting that as an FS bug ;-) 
Thanks for the details.

-chris




Re: v3 logging speedups for 2.6

2004-01-12 Thread Chris Mason
On Mon, 2004-01-12 at 02:07, Jens Benecke wrote:
 Chris Mason wrote:
 
  Hello everyone,
  
  This is part one of the data logging port to 2.6, it includes all the
  cleanups and journal performance fixes.  Basically, it's everything
  except the data=journal and data=ordered changes.
  ftp.suse.com/pub/people/mason/patches/data-logging
 experimental/2.6.0-test11
 
 Hi,
 
 Does it make sense to apply those to 2.6.1-mm2? 
 
Not those at least, since I managed to screw up the diff.  I've got a
2.6.1 directory under experimental now with better patches.

I'm checking now to see if they apply to -mm2.

 Does except the data=ordered changes mean that data journalling ist _not_
 in there, or that that data journalling is there but hasn't been updated to
 what is there for 2.4.x yet?

Correct, but I'm almost there.  Things got off track a lot during xmas
break.

-chris



Re: 3.6.25 - Journal replayed back to 3 weeks ago

2004-01-12 Thread Chris Mason
On Mon, 2004-01-12 at 13:47, Neil Robinson wrote:
 Hi,
 
 this morning when I started up my notebook (running Windows XP) with a
 VMware session running Gentoo, the boot sequence claimed that the
 reiserfs drives had not been cleanly umounted (not true, I powered down
 the usual way on Friday evening -- su to root and then issued the
 poweroff command). It then replayed the journals of the two partitions
 using reiserfs. When it finished and booted, it was as if my entire
 machine had stepped back in time by 3 weeks or so (to around the 23rd of
 December). Since then I had installed and built openoffice, emacs, and
 numerous other bits and pieces. I also lost all of the email that was
 living in the courier-imap server.
 
 I am *very* concerned about this behaviour. I have successfully
 restarted, booted, etc. literally dozens of times since mid-December. I
 have now just installed a software RAID using RAID 5 on Gentoo and using
 reiserfs for a fairly large system (250GB on 8 SCSI U160 drives) with an
 available hot spare and a tape backup unit. Losing a few weeks of
 relatively insignificant changes is nothing compared with possibly
 losing the contents of my company's master file server. Can anyone tell
 me why reiserfs rolled back all the way to mid-December in spite of
 numerous reboots and how I can avoid a rerun of this scenario *ever*
 again. Is there some way to tell it to commit its changes that I am not
 doing and should be aware of?

That's not supposed to happen.  Lets start with details about which
version of the kernel you were using.

-chris




Re: Errors requiring --rebuild-tree in 2.4.23

2003-12-11 Thread Chris Mason
On Thu, 2003-12-11 at 08:51, Jens Benecke wrote:
 Hi,
 
 I posted earlier about quota problems. We updated to 2.4.23 because of the
 logging patches because some power failures made our /home partition spew
 out these: (QUESTIONS at the end of the mail)

Sorry, before we got to the questions, what was the order of the events
above?

-chris




v3 logging speedups for 2.6

2003-12-11 Thread Chris Mason
Hello everyone,

This is part one of the data logging port to 2.6, it includes all the
cleanups and journal performance fixes.  Basically, it's everything
except the data=journal and data=ordered changes.

The 2.6 merge has a few new things as well, I've changed things around
so that metadata and log blocks will go onto the system dirty lists. 
This should make it easier to improve log performance, since most of the
work will be done outside the journal locks.

The code works for me, but should be considered highly experimental.  In
general, it is significantly faster than vanilla 2.6.0-test11, I've done
tests with dbench, iozone, synctest and a few others.  streaming writes
didn't see much improvement (they were already at disk speeds), but most
other tests did.

Anyway, for the truly daring among you:

ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.0-test11

The more bug reports I get now, the faster I'll be able to stabilize
things.  Get the latest reiserfsck and check your disks after each use.

-chris




Re: Errors requiring --rebuild-tree in 2.4.23

2003-12-11 Thread Chris Mason
On Thu, 2003-12-11 at 11:43, Jens Benecke wrote:
 Chris Mason wrote:
 
  On Thu, 2003-12-11 at 08:51, Jens Benecke wrote:
  Hi,
  
  I posted earlier about quota problems. We updated to 2.4.23 because of
  the logging patches because some power failures made our /home partition
  spew out these: (QUESTIONS at the end of the mail)
  
  Sorry, before we got to the questions, what was the order of the events
  above?
 
 Oops. I guess I was a bit too confused myself. :)
 
 1. Errors on /home in syslog, cron jobs running wild with i/o failures
  system kept running for a couple days because nobody was there 
  to fix it, though
  Those errors were probably caused by power outages and 
  a non-data-logging ReiserFS kernel.
 2. Backup what's left of /home to firewire harddisk.
 3. Update to 2.4.23 with Chris' patches for data logging/quota
 4. Repartition hda2..4 (was needed anyway for drbd), 
  reformat new /home (drbd), restore /home on drbd device
 5. crash of the server overnight, reboot (don't know why yet)

Ok, we need to better understand step 5 here.

 6. couldn't reboot because root partition was totally b0rken
 7. reiserfsck --rebuild-tree under Knoppix, killed a couple files
 8. still running Knoppix, secondary server took over and is running now
 
 btw: Is there a reiserfs stress test kind of thing to make sure a
 configuration works before sending it two time zones away for production? I
 plan on doing that in the next couple weeks. =;)
 Would bonnie++ accomplish this or are there better tests?

The best test is whatever that environment is going to use in
production.  I've got a ton of different scripts that get used based on
different situations, most are ugly hacks.

-chris




Re: v3 logging speedups for 2.6

2003-12-11 Thread Chris Mason
On Thu, 2003-12-11 at 13:30, Dieter Nützel wrote:
 On Thursday, 11 December 2003 19:10, Chris Mason wrote:
  Hello everyone,
 
  This is part one of the data logging port to 2.6, it includes all the
  cleanups and journal performance fixes.  Basically, it's everything
  except the data=journal and data=ordered changes.
 
  The 2.6 merge has a few new things as well, I've changed things around
  so that metadata and log blocks will go onto the system dirty lists.
  This should make it easier to improve log performance, since most of the
  work will be done outside the journal locks.
 
  The code works for me, but should be considered highly experimental.  In
  general, it is significantly faster than vanilla 2.6.0-test11, I've done
  tests with dbench, iozone, synctest and a few others.  streaming writes
  didn't see much improvement (they were already at disk speeds), but most
  other tests did.
 
  Anyway, for the truly daring among you:
 
  ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.0-test11
 
  The more bug reports I get now, the faster I'll be able to stabilize
  things.  Get the latest reiserfsck and check your disks after each use.
 
 Chris,
 
 with which kernel should I start on my SuSE 9.0?
 A special SuSE 2.6.0-test11 + data logging?
 Or plain native? --- There are so many patches in SuSE kernels...

For the moment you can only try it on vanilla 2.6.0-test11.   The suse
2.6 rpms have acls/xattrs and the new logging stuff won't apply.

Jeff and I will fix that when the logging merge is really complete.  At
the rate I'm going, that should be by the end of next week, this part of
the merge was the really tricky bits.

-chris




Re: Strange errors with 2.4.22 patches (from Chris) and bonnie++

2003-12-03 Thread Chris Mason
On Tue, 2003-12-02 at 03:35, Jens Benecke wrote:
 Chris Mason wrote:
 
  On Fri, 2003-11-28 at 16:38, Jens Benecke wrote:
  
  b) bonnie++ on a (previously created) reiserfs partition (with
 mkreiserfs 3.6.6) exited with random disk full errors, although
 the disk was never full. This didn't happen before.
  (...)
  I would appreciate any help. Thank you! :-)
  
  Ugh, the new block allocator isn't properly forcing a commit when an
  allocation fails, so we don't reclaim blocks deleted in an uncommitted
  transaction.  I thought this fix got pulled out of the suse kernel when
  namesys and I pulled out the important bug fixes, but it got missed.
  
  porting.
 
 Hi Chris,
 
 so... the worst-case impact on this bug is that reiserfs will report disk
 full when you still have some space available. Right? No data loss,
 corruption, or similar Bad Things(tm)?
 
Correct.  I'll have a fix available today along with a remerge of data
logging and quota against 2.4.23.

 I have two servers here that are supposed to be deployed next week and use
 ReiserFS. I've had some bad issues with MySQL files becoming corrupted
 after a crash in the past so I'd really like to put these into production
 with data-logging patches.
 
 btw, what are the patches that SuSE uses? IIRC, SuSE ships with data-logging
 enabled, right?
 

SUSE ships data logging and the xattr patches, along with a few others.

-chris




Re: FW: reiser4 plugin for maildir

2003-12-03 Thread Chris Mason
[reiser4 plugins to make mail delivery faster]

There are a few basic things that slow down mail servers when they talk
to filesystems:

1) multiple threads delivering to the same directory contend for the
directory semaphore for creating new files

2) atomic creation of an entire file

3) high load leads to numerous small synchronous filesystem operations
when multiple threads all try to deliver at once.

#1 is best fixed at the VFS level, basically allowing more fine grained
directory locks (lustre needs this as well I think).  It could be dealt
with entirely inside reiser4, but this is a generic need for linux as a
whole I think.

#2 and #3 could be dealt with in a reiser4 transaction subsystem. 
Basically the mail server would deliver a number of messages and trigger
commits at regular intervals, and reiser4 would make sure the files were
created atomically in the FS.  The mail server would not report the file
as delivered to the smtp client until the commit had completed.

So, I definitely think you could use reiser4 to gain significant mail
delivery performance.  But you probably don't need  a specific plugin
outside the transaction subsystems.  The obvious benefit of only using
the transaction systems is that you won't need to teach mail clients
about the reiser4 plugin.

There's lots of other plugin topics related to teaching the FS about
various file formats.  This may or may not be a good idea in some cases,
but it is a good place for research.  I'd start by looking at the xattr
interfaces and Hans' ideas for FS-as-a-database.

-chris




Re: precise characterization of ext3 atomicity

2003-09-05 Thread Chris Mason
On Thu, 2003-09-04 at 18:03, Andreas Dilger wrote:
 On Sep 05, 2003  01:32 +0400, Hans Reiser wrote:
  Andreas Dilger wrote:
  It is possible to do the same with ext3, namely exporting journal_start()
  and journal_stop() (or some interface to them) to userspace so the application
  can start a transaction for multiple operations.  We had discussed this in
  the past, but decided not to do so because user applications can screw up in
  so many ways, and if an application uses these interfaces it is possible to
  deadlock the entire filesystem if the application isn't well behaved.
 
  That's why we confine it to a (finite #defined number) set of 
  operations within one sys_reiser4 call.  At some point we will allow 
  trusted user space processes to span multiple system calls (mail server 
  applicances, database appliances, etc., might find this useful).  You 
  might consider supporting sys_reiser4 at some point.

Please rename sys_reiser4 if you want it to be a generic use syscall ;-)

-chris




Re: write barrier patches for 2.4.21

2003-08-28 Thread Chris Mason
On Wed, 2003-08-27 at 18:03, Tom Vier wrote:
 On Wed, Aug 27, 2003 at 10:41:03AM +0400, Oleg Drokin wrote:
  There was a discussion about that on Kernel Summit 2003 and general opinion was 
  that SCSI
  does not need the WB stuff at all as it does the correct thing anyway.
 
 i found this, but no real details. do you have a better link? or could you
 tell me why scsi drives don't need wb's? afaik, they don't have nvram cache.
 so, the danger is still there, even if small (unless i'm wrong).
 
 http://lwn.net/Articles/40850/
 

scsi drives don't really need them because most scsi drives don't have
write back caching on by default, and most actually listen when you turn
the cache off.  The scsi tag queuing makes good performance possible
even without writeback caching.

-chris




Re: BUG in reiserfs_write_full_page().

2003-07-18 Thread Chris Mason
On Fri, 2003-07-18 at 09:45, Nikita Danilov wrote:

So, in the case of reiserfs_write_full_page(), the BUG() is falsely 
triggered
due to a transaction that was started on another filesystem (ext3).  And the
fix would simply be to do something along the lines of ext3...
   
   The reiserfs data logging actually does more than ext3 to make sure
   things get along, recording the super of the filesystem holding the
   transaction.  So, it is actually possible to start a new non-nested
   transaction.
 
 This can result in a/b-b/a deadlock, right?
 
Sorry, I wasn't clear.  The transaction nesting code could detect and
deal with it (making sure not to nest into an ext3 transaction or a
reiserfs transaction on a different FS), but there are still other
deadlocks to deal with.

   
   But, we shouldn't have to.  Other parts of the OS should be protecting
   us from a writepage being called at this time, which is why the bug is
   there.  Someone did a non GFP_NOFS allocation with a transaction
 
 __bread()->__getblk()->__find_get_block()->find_get_page() allocates
 page with bdev->bd_inode->i_mapping->gfp_flags, which is GFP_USER, that
 includes GFP_FS.

You're in 2.5 land ;-)  There seem to be a few problems there, I've got
an oops in find_inode and a deadlock under load, but I still need to
read the (huge) sysrq-t to figure things out.

-chris




Re: BUG in reiserfs_write_full_page().

2003-07-17 Thread Chris Mason
On Thu, 2003-07-17 at 17:19, Michael Gaughen wrote:
 Hello,
 
 We have a test machine that continues to BUG() in 
 reiserfs_write_full_page().
 The machine is running SLES8 (2.4.19-152, UP). Here is the (kdb) stack 
 trace:

Hmmm, the allocation masks are supposed to be set such that writepage
won't get called.  I'll take a look.  How easy is it to reproduce?  If
you have any tests cases that can trigger it, please send them along.

-chris




Re: Horrible ftruncate performance

2003-07-11 Thread Chris Mason
On Fri, 2003-07-11 at 11:44, Oleg Drokin wrote:
 Hello!
 
 On Fri, Jul 11, 2003 at 05:34:12PM +0200, Marc-Christian Petersen wrote:
 
   Actually I did it already, as data-logging patches can be applied to
   2.4.22-pre3 (where this truncate patch was included).
Maybe it _IS_ time for this _AND_ all the other data-logging patches?
2.4.22-pre5?
   It's Chris turn. I thought it is good idea to test in -ac first, though
   (even taking into account that these patches are part of SuSE's stock
   kernels).
  Well, I don't think that testing in -ac is necessary at all in this case.
 
 May be not. But it is still useful ;)
 
  I am using WOLK on many production machines with ReiserFS mostly as Fileserver 
  (hundred of gigabytes) and proxy caches.
 
 I am using this code on my production server myself ;)
 
  If someone would ask me: Go for 2.4 mainline inclusion w/o going via -ac! :)
 
 Chris should decide (and Marcelo should agree) (Actually Chris thought it is good
 idea to submit data-logging to Marcelo now, too). I have no objections.
 Also now, that quota v2 code is in place, even quota code can be included.
 
 Also it would be great to port this stuff to 2.5 (yes, I know Chris wants this to be 
 in 2.4 first)

Marcelo seems to like being really conservative on this point, and I
don't have a problem with Oleg's original idea to just do relocation in
2.4.22 and the full data logging in 2.4.23-pre4 (perhaps +quota now that
32 bit quota support is in there).

2.5 porting work has restarted at last, Oleg's really been helpful with
keeping the 2.4 stuff up to date.

-chris




Re: Horrible ftruncate performance

2003-07-11 Thread Chris Mason
On Fri, 2003-07-11 at 13:27, Dieter Nützel wrote:

  2.5 porting work has restarted at last, Oleg's really been helpful with
  keeping the 2.4 stuff up to date.
 
 Nice but.
 
 Patches against latest -aa could be helpful, then.

Hmmm, the latest -aa isn't all that latest right now.  Do you want
something against 2.4.21-rc8aa1 or should I wait until andrea updates to
2.4.22-pre something?

-chris




updated data logging available

2003-07-11 Thread Chris Mason
Hello all,

ftp.suse.com/pub/people/mason/patches/data-logging/2.4.22

Has a merge of the data logging and quota code into 2.4.22-pre4 (should
apply to -pre5).  Overall, the performance of pre5 + reiserfs-jh is nice
and smooth, I'm very happy with how things are turning out.

Thanks to Oleg for merging data logging with his reiserfs warning
patches and hole performance fixes.

The relocation and base quota code is now in, so our number of patches
is somewhat smaller.  The SuSE ftp server might need an hour or so to
update, please be patient if the patches aren't there yet.

-chris




Re: reiserfs on removable media

2003-07-02 Thread Chris Mason
On Wed, 2003-07-02 at 14:53, Hans Reiser wrote:

 This is called ordered data mode, and exists on ext3 and also reiserfs with
 Chris Mason's patches.  Under normal usage it shouldn't change performance
 compared to writeback data mode (which is what reiserfs does by default).
 
 It had some impact, I forget exactly how much, maybe Chris can 
 resuscitate his benchmark of it?
 

The major cost of data=ordered is that dirty blocks are flushed every 5
seconds instead of every 30.  The journal header patch in my
experimental data logging directory changes things so that only new
bytes in the file are done in data=ordered mode (either adding a new
block or appending onto the end of the file).

This helps a lot in the file rewrite tests.

-chris




Re: reiserfs on removable media

2003-07-02 Thread Chris Mason
On Wed, 2003-07-02 at 15:08, Dieter Nützel wrote:
 On Wednesday, 2 July 2003 20:59, Chris Mason wrote:
  On Wed, 2003-07-02 at 14:53, Hans Reiser wrote:
   This is called ordered data mode, and exists on ext3 and also reiserfs
with Chris Mason's patches.  Under normal usage it shouldn't change
performance compared to writeback data mode (which is what reiserfs
does by default).
 
 Chris,
 
 I thought data=ordered is the new default with your patch?
 
It is.

   It had some impact, I forget exactly how much, maybe Chris can
   resuscitate his benchmark of it?
 
  The major cost of data=ordered is that dirty blocks are flushed every 5
  seconds instead of every 30.  The journal header patch in my
  experimental data logging directory changes things so that only new
  bytes in the file are done in data=ordered mode (either adding a new
  block or appending onto the end of the file).
 
  This helps a lot in the file rewrite tests.
 
 Which is faster with your patches? ordered|journal|writeback?
 
 I thought the order is: writeback > ordered > journal ;-)

Usually ;-)  ordered is faster in a few rare benchmarks because it helps
keep the number of dirty buffers lower and generally sends the dirty
buffers to the disk in a big flood.

journal is faster for some fsync heavy benchmarks.

For practical desktop usage, data=ordered and writeback are usually
close to each other.

-chris




Re: updated data logging available

2003-07-01 Thread Chris Mason
On Tue, 2003-07-01 at 20:46, Manuel Krause wrote:
  
  Setting HZ=1000 (from 100) in linux/include/asm/param.h gives me a very 
  impressive latency boost. 2.4.21-rc1-jam1 (-rc1aa1)
 
 Just tried this HZ=1000 setting, too.
 
 (With the following patches only: data-logging, search_reada-4 and
 rml-preempt on 2.4.21-final, and AA.00_nanosleep-6.diff (THAT ONE
 decreased my VMware+system CPU idle usage in these circumstances by
 approx. 1/2 [from 25% to approx. 12.5%] ).
 

Some userspace tools depend on the HZ value.  It's clear the io-stalls-7
patch can't be the final one, I need to add userspace knobs to tune
things towards latency or multi-writer throughput (basically server or
desktop workloads).

-chris



Re: vpf-10680, minor corruptions

2003-06-27 Thread Chris Mason
On Fri, 2003-06-27 at 12:13, Oleg Drokin wrote:
 Hello!
 
 On Fri, Jun 27, 2003 at 04:38:00PM +0400, Oleg Drokin wrote:
 
  I was looking in the wrong direction, when I produced that patch,
  so it will produce zero output.
  I hope to come up with ultimate fix soon enough. ;)
 
 Well, there is a patch below that does *not* work for me ;)
 But it should work.
 I have traced the new problem to a cross compiler that compiles
 code in a different way than native compiler for whatever reason
 (demo is attached as test.c program, it should print result is 1
 in case it is compiled correctly and stuff about unknown
 uniqueness if it is miscompiled. In fact may be this is just correct compiler 
 behaviour.)
 I now think that when I compile a kernel with native compiler, it should work
 with below patch. But I can verify that only tomorrow it seems.
 You might try that patch as well to see if it helps you before I try it ;)
 The patch is obviously correct one. (except that it does not work
 with my cross compiler and kernel does work without patch which is really-really 
 strange).
 

Most of these changes are in 2.4.21, which I've been using on an AMD64
bit box for a while without any problems.  The bug should be somewhere
else, it looks to me like these spots aren't trying to send an unsigned
long to disk.

-chris




Re: 2.4.21 reiserfs oops

2003-06-24 Thread Chris Mason
On Tue, 2003-06-24 at 16:34, Nix wrote:
 On Tue, 24 Jun 2003, Oleg Drokin moaned:
  Hello!
  
  On Mon, Jun 23, 2003 at 11:16:27PM +0100, Nix wrote:
  
   Jun 22 13:52:42 loki kernel: Unable to handle kernel NULL pointer dereference 
   at virtual address 0001 
   This is very strange address to oops on.
  I'll say! Looks almost like it JMPed to a null pointer or something.
  
  No, if it'd jumped to a NULL pointer, we'd see 0 in EIP.
 
 JMPed to ((long)NULL)+1 or something then :) the fact remains that it's
 not somewhere that even a memory error would make us likely to jump to.
 
   Jun 22 13:52:43 loki kernel: EIP:0010:[c0092df4]Not tainted 

The EIP isn't zero or 1, you've got a bad null pointer dereference at
address 1.  You get this when you do something like *(char *)1 =
some_val.

The ram is most likely bad, you're 1 bit away from zero, but you might
try a reiserfsck on any drives affected by the scsi errors.

-chris




Re: xattr

2003-06-19 Thread Chris Mason
On Mon, 2003-06-16 at 08:26, Russell Coker wrote:
 What is the status of xattr support in 2.5.x?
 
 How is journalling of xattr's being handled?
 
 For correct operation of SE Linux we need to have the xattr that is used for 
 the security context be changed atomically, and if a file is created and 
 immediately has the xattr set then ideally we would have the file creation 
 and the xattr creation in the same journal entry.
 
 Is this possible?  If doing this requires that the file system be mounted with 
 data=journal then this will be fine.

How big are the xattrs you have in mind?  We can get atomic writes of 4k
in length but beyond that things get more difficult.

As for the xattr and the create in the same transaction, that's a little
harder.  We'd probably need a new syscall, or to change the semantics of
the xattr call such that creating an xattr on a file that doesn't exist
also creates the file.

-chris




Re: xattr

2003-06-19 Thread Chris Mason
On Thu, 2003-06-19 at 10:46, Russell Coker wrote:
 On the topic of atomic xattr operations on ReiserFS as needed for the new 
 LSM/SE Linux operations.
 
 On Thu, 19 Jun 2003 23:52, Chris Mason wrote:
  How big are the xattrs you have in mind?  We can get atomic writes of 4k
  in length but beyond that things get more difficult.
 
 Most of them will be less than 80 bytes.  They are currently of the form:
 user-name:object_r:type
 
 The user-name is the Unix account name which usually isn't much more than 8 
 bytes.  The type is usually less than 15 bytes (the longest I've used so 
 far is 20 bytes).
 
 So the longest value I've used is 38 bytes.
 

Then data=journal mode will do what you want.  You'll get atomic writes
up to 4k.  If you really don't want data=journal for the rest of the FS,
we can make an option for using data logging on xattr files only.  Jeff
and I had wanted to avoid the complication but it is at least possible.

 Also they can't be chosen arbitrarily by the user.  The user gets some small 
 control over the type within a range of types that the administrator permits.
 If the administrator permits overly long type names and has to deal with 
 non-atomicity as a result then it's their issue.
 
 If you can guarantee atomic operations on 160 byte operations (twice what I 
 expect anyone to use) then it'll be fine.
 
  As for the xattr and the create in the same transaction, that's a little
  harder.  We'd probably need a new syscall, or to change the semantics of
  the xattr call such that creating an xattr on a file that doesn't exist
  also creates the file.
 
 Creating a file by creating the xattr sounds like a bad idea as you can't 
 control the Unix permissions of the file.  This isn't much of a big deal with 
 SE Linux as the security type determines who can access the file.  But for 
 other uses it may be a serious problem.
 
 I agree that we need a new syscall and other people had the same idea before 
 either of us.
 
 Maybe ReiserFS could be used as the first implementation of this proposed new 
 syscall...

It would be best to go through Andreas Gruenbacher for xattr API
changes.  He's quite reasonable.

-chris




Re: xattr

2003-06-19 Thread Chris Mason
On Thu, 2003-06-19 at 13:25, Stephen Smalley wrote:
 On Thu, 2003-06-19 at 11:21, Chris Mason wrote:
  Ok, so in the new api, the xattr information is available at the time of
  the create.  reiserfs would be able to include it all into the same
  transaction but doesn't do it right now.
 
 This was true of the old API as well; the only difference is whether the
 attribute is specified as a parameter to an extended open/mkdir/etc call
 or whether it is set separately as a process attribute that is applied
 to subsequent ordinary open/mkdir/etc calls.  Including the setxattr in
 the same transaction as the create is not strictly necessary, although
 it would be nice.  The SELinux API change didn't change the create+set
 atomicity, which is still provided by performing the set before the
 parent directory semaphore is released.
 

Ok, I get it.  You would need a special reiserfs xattr add on patch to
get the atomicity right.

  First we need to get the data logging code in (which Hans has agreed
  to); getting the xattr code in depends on Hans, and Jeff Mahoney will
  be maintaining it as an external patch otherwise.
 
 My impression (possibly wrong) is that Hans prefers a EA-as-file model,
 which I understand is also the Solaris model.  The key question then
 becomes whether mainline reiser{3,4} will ever support the xattr inode
 operations (e.g. implementing them as reads/writes of the EA files
 associated with the inode).  If not, then neither the SELinux module
 nor the SELinux userland will be able to access file security attributes
 on reiser{3,4} in the same manner as on ext[23], xfs, or jfs.

The reiserfsv3 xattr patches maintained by SuSE implement the xattr api
(acl.bestbits.at).  The xattrs happen to be implemented as individual
files on disk because reiserfs is so well suited for it, and because it
allowed Jeff to code them without changing the v3 disk format.

But, these files are only available through the xattr api right now, and
they are not visible via tools like ls etc.

Not sure how namesys is doing things in version 4, but I'd bet they are 
willing to talk about making it work with SELinux.

-chris




updated data logging available

2003-06-16 Thread Chris Mason
Hello all,

This doesn't have the data=ordered performance fixes because I got
distracted fixing io latency problems in 2.4.21.  Those were screwing up
my benchmarks, so I couldn't really tell if things were faster or not
;-)  Anyway, I'm back on track now, and since 2.4.21 is out I've just
copied Oleg's merge into my data logging directory.  I'll add the
experimental performance patches later today.

But, the code in my data logging directory now is what I would like to
see merged into 2.4.22-pre asap (pending namesys approval), so review
and testing would be appreciated.

ftp.suse.com/pub/people/mason/patches/data-logging/2.4.21

It might take 30 minutes or so for the rsync to complete.  

-chris






Re: Timeframe for 2.4.21 quota patches?

2003-06-16 Thread Chris Mason
On Mon, 2003-06-16 at 10:45, Christian Mayrhuber wrote:

 
 What has happened to 10-quota-link-fix.diff? Is it not necessary any more?
 I'm asking because it is still mentioned in the README, but seems to have
 been replaced by 10-quota_nospace_fix.diff

Whoops, I screwed that up.



Re: Recent spam

2003-06-06 Thread Chris Mason
On Thu, 2003-06-05 at 09:26, Hans Reiser wrote:

 
 This is a bit inconsiderate of you, don't you think?  Why don't you just 
 unsubscribe?  I have a sysadmin with too many tasks to get them done, 
 and he doesn't need his time wasted with such crap as dealing with spamcop.

Honestly Hans, this has come up a bunch of times over the last few
months.  The reiserfs mail server setup doesn't play nicely with
automated spam reporting tools, and many people other than Russell use
them.

There have been a ton of suggestions on ways to make things better, and
they range from 5 minute tasks to complex ones.  They've largely been
ignored.  I'd really rather not lose valuable contributors to the list
over something as stupid as spam.

-chris




Re: [PATCH] various allocator optimizations

2003-03-14 Thread Chris Mason
On Fri, 2003-03-14 at 08:59, Manuel Krause wrote:
 On 03/14/2003 02:34 AM, Chris Mason wrote:
  On Thu, 2003-03-13 at 19:15, Hans Reiser wrote:
  
 
 [ discussion on how to implement lower fragmentation on ReiserFS ]
 
 
 Let's get lots of different testers.  You may have a nice heuristic here 
 though
 
  
  
  If everyone agrees the approach is worth trying, I'll make a patch that
  enables it via a mount option.
  
 [...]
 
 A dumb question in between: How do we - possible testers, users - get 
 information about fragmentation on our ReiserFS partitions?

The best tool I've seen so far originally came from Vladimir and was
modified for a study on fragmentation of reiserfs and ext2, Jeff found
the link somewhere in his archives:

http://www.informatik.uni-frankfurt.de/~loizides/reiserfs/index.html

There are also filesystem aging tools there that I haven't played with
yet.

-chris




Re: [PATCH] new data logging and quota patches available

2003-02-23 Thread Chris Mason
On Sat, 2003-02-22 at 20:04, Manuel Krause wrote:
 On 02/23/2003 01:50 AM, Manuel Krause wrote:
  On 02/22/2003 05:12 PM, Chris Mason wrote:
  
  And here's one more patch for those of you that want to experiment a
  little.  reiserfs_dirty_inode gets called to log the inode every time
  the vfs layer does an atime/mtime/ctime update, which is one of the
  reasons mounting with -o noatime and -o nodiratime makes things
  faster.  We had to do this because otherwise kswapd can end up trying
  to write inodes, which can lead to deadlock as it tries to wait on the
  log.
 
  One of the patches in my data logging directory is kinoded-8, which
  changes things so a new kinoded does all inode writeback instead of
  kswapd.
 
  That means that if you apply up to 05-data-logging-36 and then apply
  09-kinoded-8 (you won't need any of the other quota patches), you can
  also apply this patch.  It changes reiserfs to leave inodes dirty, which
  saves us lots of time on atime/mtime updates.
 
  I'll upload this after it gets a little additional testing, but wanted to
  include it here in case anyone else was interested in benchmarking.
 
  [11.dirty-inodes-for-kinoded.diff]
  
  Hi Chris!
  
  At least I'm not able to copy through my partition via cp -a ... with 
  the last supposed setup + preempt without quota ...
  
  When umounting the destination partition it says device is busy.
  
  I don't know what is gone for now on my repository partition, but I know 
  from earlier times this should not happen.
 
 Sorry, my report was incomplete, at least in the following respect: 
 after thinking over the remount strategy for some minutes over a pizza, 
 the partition was umountable again.  I don't have exact values, but it 
 took under about 15 minutes max, with 3.5GB copied previously.

Hmmm, I think I need to change kinoded around slightly.  Thanks for
giving it a try...

-chris




Re: [PATCH] new data logging and quota patches available

2003-02-22 Thread Chris Mason
On Sat, 2003-02-22 at 08:54, Oleg Drokin wrote:

 Replacement 05-data-logging-36.diff.gz file that applies to 2.4.21-pre4-ac5
 is available at
 ftp://namesys.com/pub/reiserfs-for-2.4/testing/05-data-logging-36-ac5.diff.gz

Thanks Oleg.

 It compiles, boots, and survives my (simple) testing (I'm writing this email
 from patched 2.4.21-pre4-ac5, too).  Quota works, and symlinks now have
 correct block counts too.
 The reason for the rejects is mostly the DIRECTIO fix that also went into the
 current bk snapshot, so it will probably apply to Marcelo's bk tree, too.

 Chris: Is it intended that directio only works on data=writeback
 mounted filesystems?
 

Yes, the way ordered writes work is that the buffers are put onto a per
transaction list that gets flushed before the commit.  Since a buffer
can only be on one list, it never gets onto the list of buffers for
that particular inode, and that makes it difficult to make sure all
pending io to the file is finished before allowing the directio to
proceed (just like the tail alias bug).

The only way to do it is by forcing a commit before the directio, which
would be horribly slow so I've disabled the data=ordered o_direct
support.  The real solution is to allocate a data structure and point to
it from the private journal pointer in the bh.  That requires some other
changes and other performance stuff is more important right now.

 Also following README file diff should be considered:
 

Thanks, I'm uploading a few more optimizations shortly, I'll include
this change.

-chris




Re: [PATCH] new data logging and quota patches available

2003-02-22 Thread Chris Mason
On Sat, 2003-02-22 at 07:41, Dieter Nützel wrote:

 Which files are needed for 2.4.21-pre4aa3?

ftp.suse.com/pub/people/mason/patches/data-logging/2.4.21-aa

I just uploaded a fresh set of data logging and quota patches for -aa

-chris




[PATCH] new data logging and quota patches available

2003-02-21 Thread Chris Mason
Hello all,

ftp.suse.com/pub/people/mason/patches/data-logging/2.4.21 will soon be
updated with a new set of data logging and quota patches against
2.4.21-pre4

The data logging code is updated with another set of io stalling fixes;
they should improve performance of data=ordered and data=writeback by
being smarter about forcing commits under heavy write load and kicking
kreiserfsd.

Treat these with care, they've gotten a ton of testing under the suse
kernel, but the port to vanilla was just done today.

The quota patches include a fix for incorrect sd_block counts on
symlinks.

-chris




Re: no reiserfs quota in 2.4 yet? 2.4.21-pre4-ac4 says different

2003-02-17 Thread Chris Mason
On Mon, 2003-02-17 at 12:39, Ookhoi wrote:
 Hi Reiserfs team,
 
 Today I put a new kernel on a server which has reiserfs and needs quota.
 I searched for the quota patches (found them in the mail archive) and
 saw that they are very old:
 
 ftp://ftp.namesys.com/pub/reiserfs-for-2.4/testing/quota-2.4.20
 
 3 december 2002. They don't apply to a current kernel. 
 

Well, 2.4.20 is the current kernel ;-)  Which kernel do you want them
against?  I've got patches against 2.4.21-preX in testing here, but not
against -ac.  They should merge against -ac now more easily, but I
haven't had time to really test it.

Do you want to try the merge on -ac, or would you rather try against
2.4.21-preX?

-chris






Re: What is [PATCH] 02-directio-fix.diff (namesys.com) for?

2003-02-17 Thread Chris Mason
On Mon, 2003-02-17 at 15:55, Manuel Krause wrote:
 Hi!
 
 Is this patch from 030213 needed by anyone using ReiserFS with 
 2.4.20 and 2.4.21-preX?
 
 What is DIRECT IO with reiserfs from the topic line of the patch:
 # reiserfs: Fix DIRECT IO interference with tail packing ?

It fixes a bug where a recently unpacked tail might race to the disk
with bytes modified via DIRECT IO.  The common way to trigger the bug is
via a mixture of direct io and regular file access at the same time.

Most people won't see the bug, since it is uncommon to mix regular and
direct io that way.

-chris





Re: when distros do not support official Marcelo kernels they arenot being team players (was Re: reiserfs on redhat advanced server?)

2003-02-03 Thread Chris Mason
On Mon, 2003-02-03 at 09:20, Hans Reiser wrote:

 It is different from refusing to support the user who downloads 
 Marcelo's kernel after it does ship (after the distro CD went into the 
 stamping plant). That is what I am complaining about.  The default 
 should be to support all Marcelo kernels unless there is a motivated 
 reason not to (e.g. he ships a broken NFS kernel and the user is 
 complaining about NFS).  Users should feel that they can download any 
 latest official stable kernel (it is okay though to tell them to check a 
 website created by the distro to see if it is a known bad/unsupported 
 kernel), and everything will be fine with the distro.  When distros 
 don't do this, they are not being team players.

Hans, the vanilla kernels are lacking both bug fixes and features that
are critical to what our users are doing.  Even if the bug fixes all got
in, there are various reasons the features probably won't.  

If there was any vanilla kernel that had everything we needed, we'd
support it, and do a dance around a bonfire made from all of our patch
maintenance scripts and code.

The whole point of buying the distro is that you don't have the time and
energy to collect and compile every application and turn it into
something you can easily install on your personal machine.  The kernel
is one of those applications.  Feel free to replace it, but it doesn't 
make sense to expect us to help you fix the problems when we don't have
control over the configuration, compile or sources.

That would be like switching engines in your car and expecting the
original car company to do a warranty repair on the new engine.

-chris (speaking only for himself and not SuSE)




