Re: [PATCH] reiserfs:fix journaling issue regarding fsync()
On Thursday 29 June 2006 21:36, Hisashi Hifumi wrote: Hi, At 09:47 06/06/30, Chris Mason wrote: Thanks for the patch. One problem is this will bump the transaction marker for atime updates too. I'd rather see the change done inside reiserfs_file_write. I did not realize that atime updates are also affected. reiserfs_file_write already updates the transaction when blocks are allocated, but you're right that to be 100% correct we should cover the case when i_size increases but new blocks are not added. Was this the case you were trying to fix? Yes, that's right. So, I remade my patch as follows. I tested this patch and confirmed that the kernel with this patch works well. This is correct, except you need to put the update_inode_transaction call inside reiserfs_write_lock/unlock. -chris
[patch 0/6] reiserfs v3 patches
Hello everyone, Here is my current queue of reiserfs patches. These originated from various bugs solved in the suse sles9 kernel, and have been ported to 2.6.15-git9. -chris --
[patch 1/6] reiserfs v3 patches
After a transaction has closed but before it has finished commit, there is a window where data=ordered mode requires invalidatepage to pin pages instead of freeing them. This patch fixes a race between the invalidatepage checks and data=ordered writeback, and it also adds a check to the reiserfs write_ordered_buffers routines to write any anonymous buffers that were dirtied after its first writeback loop.

That bug works like this:
proc1: transaction closes and a new one starts
proc1: write_ordered_buffers starts processing data=ordered list
proc1: buffer A is cleaned and written
proc2: buffer A is dirtied by another process
proc2: File is truncated to zero, page A goes through invalidatepage
proc2: reiserfs_invalidatepage sees dirty buffer A with reiserfs journal head, pins it
proc1: write_ordered_buffers frees the journal head on buffer A
At this point, buffer A stays dirty forever

diff -r 21be96fa294a fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c	Fri Jan 13 13:48:03 2006 -0500
+++ b/fs/reiserfs/inode.c	Fri Jan 13 13:50:37 2006 -0500
@@ -2743,6 +2743,7 @@ static int invalidatepage_can_drop(struc
 	int ret = 1;
 	struct reiserfs_journal *j = SB_JOURNAL(inode->i_sb);

+	lock_buffer(bh);
 	spin_lock(&j->j_dirty_buffers_lock);
 	if (!buffer_mapped(bh)) {
 		goto free_jh;
@@ -2758,7 +2759,7 @@ static int invalidatepage_can_drop(struc
 		if (buffer_journaled(bh) || buffer_journal_dirty(bh)) {
 			ret = 0;
 		}
-	} else if (buffer_dirty(bh) || buffer_locked(bh)) {
+	} else if (buffer_dirty(bh)) {
 		struct reiserfs_journal_list *jl;
 		struct reiserfs_jh *jh = bh->b_private;
@@ -2784,6 +2785,7 @@ static int invalidatepage_can_drop(struc
 		reiserfs_free_jh(bh);
 	}
 	spin_unlock(&j->j_dirty_buffers_lock);
+	unlock_buffer(bh);
 	return ret;
 }
diff -r 21be96fa294a fs/reiserfs/journal.c
--- a/fs/reiserfs/journal.c	Fri Jan 13 13:48:03 2006 -0500
+++ b/fs/reiserfs/journal.c	Fri Jan 13 13:50:37 2006 -0500
@@ -878,6 +878,19 @@ static int write_ordered_buffers(spinloc
 		}
 		if (!buffer_uptodate(bh)) {
 			ret = -EIO;
+		}
+		/* ugly interaction with invalidatepage here.
+		 * reiserfs_invalidate_page will pin any buffer that has a valid
+		 * journal head from an older transaction.  If someone else sets
+		 * our buffer dirty after we write it in the first loop, and
+		 * then someone truncates the page away, nobody will ever write
+		 * the buffer.  We're safe if we write the page one last time
+		 * after freeing the journal header.
+		 */
+		if (buffer_dirty(bh) && unlikely(bh->b_page->mapping == NULL)) {
+			spin_unlock(lock);
+			ll_rw_block(WRITE, 1, &bh);
+			spin_lock(lock);
 		}
 		put_bh(bh);
 		cond_resched_lock(lock);
--
[patch 2/6] reiserfs v3 patches
The b_private field in buffer heads needs to be zero filled when the buffers are allocated. Thanks to Nathan Scott for finding this. It was causing problems on systems with both XFS and reiserfs.

diff -r 5ef1fa0a021a fs/buffer.c
--- a/fs/buffer.c	Fri Jan 13 13:50:39 2006 -0500
+++ b/fs/buffer.c	Fri Jan 13 13:51:09 2006 -0500
@@ -1022,6 +1022,7 @@ try_again:
 		bh->b_state = 0;
 		atomic_set(&bh->b_count, 0);
+		bh->b_private = NULL;
 		bh->b_size = size;

 		/* Link the buffer to its page */
--
[patch 4/6] reiserfs v3 patches
write_ordered_buffers should handle dirty non-uptodate buffers without a BUG()

diff -r 18fa5554d7e2 fs/reiserfs/journal.c
--- a/fs/reiserfs/journal.c	Fri Jan 13 13:55:10 2006 -0500
+++ b/fs/reiserfs/journal.c	Fri Jan 13 14:00:49 2006 -0500
@@ -848,6 +848,14 @@ static int write_ordered_buffers(spinloc
 			spin_lock(lock);
 			goto loop_next;
 		}
+		/* in theory, dirty non-uptodate buffers should never get here,
+		 * but the upper layer io error paths still have a few quirks.
+		 * Handle them here as gracefully as we can
+		 */
+		if (!buffer_uptodate(bh) && buffer_dirty(bh)) {
+			clear_buffer_dirty(bh);
+			ret = -EIO;
+		}
 		if (buffer_dirty(bh)) {
 			list_del_init(&jh->list);
 			list_add(&jh->list, &tmp);
@@ -1032,9 +1040,12 @@ static int flush_commit_list(struct supe
 	}

 	if (!list_empty(&jl->j_bh_list)) {
+		int ret;
 		unlock_kernel();
-		write_ordered_buffers(&journal->j_dirty_buffers_lock,
-				      journal, jl, &jl->j_bh_list);
+		ret = write_ordered_buffers(&journal->j_dirty_buffers_lock,
+					    journal, jl, &jl->j_bh_list);
+		if (ret < 0 && retval == 0)
+			retval = ret;
 		lock_kernel();
 	}
 	BUG_ON(!list_empty(&jl->j_bh_list));
--
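The flush_commit_list hunk records only the first failure from write_ordered_buffers, so a later success cannot clobber an earlier error. A minimal userspace sketch of that idiom (the function name and array-of-results framing are illustrative, not reiserfs code):

```c
#include <assert.h>

/* Scan a batch of operation results and keep only the earliest error,
 * mirroring the "if (ret < 0 && retval == 0) retval = ret;" hunk. */
static int first_error(const int *results, int n)
{
    int retval = 0;
    for (int i = 0; i < n; i++) {
        int ret = results[i];
        if (ret < 0 && retval == 0)
            retval = ret;   /* record the first error, ignore later ones */
    }
    return retval;
}
```

The point of the guard on `retval == 0` is that the caller reports the root-cause error, not whichever failure happened to come last.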
[patch 3/6] reiserfs v3 patches
In data=journal mode, reiserfs writepage needs to make sure not to trigger transactions while being run under PF_MEMALLOC. This patch makes sure to redirty the page instead of forcing a transaction start in this case. Also, calling filemap_fdata* in order to trigger io on the block device can cause lock inversions on the page lock. Instead, do simple batching from flush_commit_list.

diff -r c10585019f18 fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c	Fri Jan 13 13:51:10 2006 -0500
+++ b/fs/reiserfs/inode.c	Fri Jan 13 13:55:09 2006 -0500
@@ -2363,6 +2363,13 @@ static int reiserfs_write_full_page(stru
 	int bh_per_page = PAGE_CACHE_SIZE / s->s_blocksize;
 	th.t_trans_id = 0;

+	/* no logging allowed when nonblocking or from PF_MEMALLOC */
+	if (checked && (current->flags & PF_MEMALLOC)) {
+		redirty_page_for_writepage(wbc, page);
+		unlock_page(page);
+		return 0;
+	}
+
 	/* The page dirty bit is cleared before writepage is called, which
 	 * means we have to tell create_empty_buffers to make dirty buffers
 	 * The page really should be up to date at this point, so tossing
diff -r c10585019f18 fs/reiserfs/journal.c
--- a/fs/reiserfs/journal.c	Fri Jan 13 13:51:10 2006 -0500
+++ b/fs/reiserfs/journal.c	Fri Jan 13 13:55:09 2006 -0500
@@ -990,6 +990,7 @@ static int flush_commit_list(struct supe
 	struct reiserfs_journal *journal = SB_JOURNAL(s);
 	int barrier = 0;
 	int retval = 0;
+	int write_len;

 	reiserfs_check_lock_depth(s, "flush_commit_list");
@@ -1039,16 +1040,24 @@ static int flush_commit_list(struct supe
 	BUG_ON(!list_empty(&jl->j_bh_list));
 	/*
 	 * for the description block and all the log blocks, submit any buffers
-	 * that haven't already reached the disk
+	 * that haven't already reached the disk.  Try to write at least 256
+	 * log blocks.  later on, we will only wait on blocks that correspond
+	 * to this transaction, but while we're unplugging we might as well
+	 * get a chunk of data on there.
 	 */
 	atomic_inc(&journal->j_async_throttle);
-	for (i = 0; i < (jl->j_len + 1); i++) {
+	write_len = jl->j_len + 1;
+	if (write_len < 256)
+		write_len = 256;
+	for (i = 0 ; i < write_len ; i++) {
 		bn = SB_ONDISK_JOURNAL_1st_BLOCK(s) + (jl->j_start + i) %
 		    SB_ONDISK_JOURNAL_SIZE(s);
 		tbh = journal_find_get_block(s, bn);
-		if (buffer_dirty(tbh))	/* redundant, ll_rw_block() checks */
-			ll_rw_block(SWRITE, 1, &tbh);
-		put_bh(tbh);
+		if (tbh) {
+			if (buffer_dirty(tbh))
+				ll_rw_block(WRITE, 1, &tbh) ;
+			put_bh(tbh) ;
+		}
 	}
 	atomic_dec(&journal->j_async_throttle);
--
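The submit loop in flush_commit_list does two things worth noting: it maps each log-block index onto the circular on-disk journal area with a modulo, and it rounds the batch up to at least 256 blocks so the unplug pushes a decent chunk of data. A userspace sketch of both calculations (the journal geometry numbers in the test are hypothetical, not reiserfs's real layout):

```c
#include <assert.h>

/* On-disk block number of the i-th log block of a transaction,
 * wrapping around the circular journal area, as in
 * SB_ONDISK_JOURNAL_1st_BLOCK(s) + (jl->j_start + i) % SB_ONDISK_JOURNAL_SIZE(s). */
static unsigned long log_block(unsigned long journal_1st_block,
                               unsigned long journal_size,
                               unsigned long j_start, unsigned long i)
{
    return journal_1st_block + (j_start + i) % journal_size;
}

/* Number of blocks to submit in one batch: the description block plus
 * the log blocks, rounded up to at least 256 as the patch does. */
static unsigned long batch_len(unsigned long j_len)
{
    unsigned long write_len = j_len + 1;  /* +1 for the description block */
    if (write_len < 256)
        write_len = 256;                  /* write at least 256 log blocks */
    return write_len;
}
```

Because the rounded-up batch can reach past this transaction's blocks, the loop only submits buffers that journal_find_get_block actually returns, which is why the patch adds the `if (tbh)` guard.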
[patch 6/6] reiserfs v3 patches
When a filesystem has been converted from 3.5.x to 3.6.x, we need an extra check during file write to make sure we are not trying to make a 3.5.x file larger than 2GB.

diff -r ee81eb208598 fs/reiserfs/file.c
--- a/fs/reiserfs/file.c	Fri Jan 13 14:01:37 2006 -0500
+++ b/fs/reiserfs/file.c	Fri Jan 13 14:08:12 2006 -0500
@@ -1285,6 +1285,23 @@ static ssize_t reiserfs_file_write(struc
 	struct reiserfs_transaction_handle th;
 	th.t_trans_id = 0;

+	/* If a filesystem is converted from 3.5 to 3.6, we'll have v3.5 items
+	 * lying around (most of the disk, in fact).  Despite the filesystem
+	 * now being a v3.6 format, the old items still can't support large
+	 * file sizes.  Catch this case here, as the rest of the VFS layer is
+	 * oblivious to the different limitations between old and new items.
+	 * reiserfs_setattr catches this for truncates.  This chunk is lifted
+	 * from generic_write_checks. */
+	if (get_inode_item_key_version (inode) == KEY_FORMAT_3_5 &&
+	    *ppos + count > MAX_NON_LFS) {
+		if (*ppos >= MAX_NON_LFS) {
+			send_sig(SIGXFSZ, current, 0);
+			return -EFBIG;
+		}
+		if (count > MAX_NON_LFS - (unsigned long)*ppos)
+			count = MAX_NON_LFS - (unsigned long)*ppos;
+	}
+
 	if (file->f_flags & O_DIRECT) {	// Direct IO needs treatment
 		ssize_t result, after_file_end = 0;
 		if ((*ppos + count >= inode->i_size)
--
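The size check borrowed from generic_write_checks has three outcomes: the write fits entirely below the 2GB limit, the file position is already at or past the limit (reject with EFBIG), or the write straddles the limit (trim the count so it stops exactly at the boundary). A userspace model of that decision, assuming the kernel's MAX_NON_LFS value of 2^31 - 1 (the function and macro names here are illustrative):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MY_MAX_NON_LFS ((1UL << 31) - 1)   /* 2GB - 1, as in the kernel */

/* Model of the old-item size check: returns 0 on success with *count
 * possibly trimmed to stop at the 2GB boundary, -EFBIG if the write
 * cannot make any progress below the limit. */
static int check_old_item_write(unsigned long pos, size_t *count)
{
    if (pos + *count <= MY_MAX_NON_LFS)
        return 0;                       /* fits entirely below the limit */
    if (pos >= MY_MAX_NON_LFS)
        return -EFBIG;                  /* already at/past the limit */
    if (*count > MY_MAX_NON_LFS - pos)
        *count = MY_MAX_NON_LFS - pos;  /* partial write up to the limit */
    return 0;
}
```

Trimming rather than failing in the straddling case matches POSIX short-write semantics: the caller gets back a byte count smaller than requested and can decide what to do next.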
[patch 5/6] reiserfs v3 patches
reiserfs: journal_transaction_should_end should increase the count of blocks allocated so the transaction subsystem can keep new writers from creating a transaction that is too large.

diff -r 890bf922a629 fs/reiserfs/journal.c
--- a/fs/reiserfs/journal.c	Fri Jan 13 14:00:50 2006 -0500
+++ b/fs/reiserfs/journal.c	Fri Jan 13 14:01:36 2006 -0500
@@ -2854,6 +2854,9 @@ int journal_transaction_should_end(struc
 	    journal->j_cnode_free < (journal->j_trans_max * 3)) {
 		return 1;
 	}
+	/* protected by the BKL here */
+	journal->j_len_alloc += new_alloc;
+	th->t_blocks_allocated += new_alloc ;
 	return 0;
 }
--
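The effect of the patch is that a caller asking "should I end?" now also charges its planned allocation to both the journal-wide running total and its own handle, so later callers see an accurate picture. A toy model of just this behavior (the struct fields echo the patch, but the fullness threshold models only the one clause visible in the hunk context, and the numbers in the test are invented):

```c
#include <assert.h>

struct toy_journal { int j_len_alloc; int j_cnode_free; int j_trans_max; };
struct toy_handle  { int t_blocks_allocated; };

/* Returns 1 if the transaction should end before taking on more
 * blocks; otherwise charges new_alloc to both running totals, as the
 * patched journal_transaction_should_end does. */
static int toy_should_end(struct toy_journal *j, struct toy_handle *th,
                          int new_alloc)
{
    if (j->j_cnode_free < j->j_trans_max * 3)
        return 1;                        /* journal nearly full: end now */
    j->j_len_alloc += new_alloc;         /* journal-wide reservation */
    th->t_blocks_allocated += new_alloc; /* this handle's share */
    return 0;
}
```

Without the accounting, many writers could each pass the fullness check while collectively planning far more blocks than one transaction can hold, which is exactly the oversized-transaction case the changelog describes.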
Re: [PATCH] fix problems related to journaling in Reiserfs
On Wed, 31 Aug 2005 20:35:52 -0700 Hans Reiser [EMAIL PROTECTED] wrote: Thanks much Hifumi! Chris, please comment on the patch. The problem is that I'm not always making the inode dirty during the reiserfs_file_write. The get_block based write function does an explicit commit during O_SYNC mode. I've got a cleanup related to this for quotas and other things, but I didn't realize it would help O_SYNC as well. I'll diff/test against mainline in the morning and send out. -chris
Re: [PATCH] Make reiserfs BUG on too big transaction
On Thursday 19 May 2005 05:36, Jan Kara wrote: Hello! Attached patch makes reiserfs BUG() when somebody tries to start a larger transaction than it's allowed (currently the code just silently deadlocks). I think this is a better behaviour. Can you please apply the patch? Ack, looks ok. In theory, we could return an error instead and force the FS into readonly mode, but it's better to catch the offending caller. -chris
Re: [PATCH] Fix quota transaction size
On Thursday 19 May 2005 05:40, Jan Kara wrote: Hello, attached patch improves the estimates on the number of credits needed for a quota operation. This is needed as currently quota overflows the maximum size of a transaction if 1KB blocksize is used. Please apply. Thanks Jan, It would make more sense to only allocate for the quota if quotas are in use. When you have 10 or more concurrent procs unlinking things, they end up waiting for each other because they are trying to reserve so many blocks in the transaction. So, a smaller reservation allows for better concurrency when quotas are off. -chris
Re: BUG in submit_ordered_buffer at fs/reiserfs/journal.c:616!
On Friday 11 March 2005 18:39, Hans Reiser wrote: I/O errors usually indicate bad hardware not bad software, probably you need to get a new disk and use dd_rescue to copy everything This is your user friendly error message targeted at users that don't know what an I/O error is? What's an I/O error? What's software? What's hardware? What's a disk? What's dd_rescue? How do I copy everything? How do I put a new disk in? How do I make the kernel recognize and use the new disk instead of the old one? The list goes on and on. You'll never make the kernel more usable by making messages in the syslog more verbose. You can make it more usable by having consistent error messages that can be found via search engines or the manual. Jeff's completely right here. -chris
Re: reiserfs3, rsync and hardlinks
On Monday 07 February 2005 15:50, Pierre Etchemaite wrote: On Mon, 07 Feb 2005 13:22:51 CET, Vladimir Saveliev [EMAIL PROTECTED] wrote: Hello Hi, yes, reiserfs reuses inode numbers of removed files for newly created files. However, ext2 also does that. Have you ever noticed this problem on other filesystems? No, but I've only been using rsync -H for a few weeks. The problem may also exist with tar, but unnoticed (unless tar detects hardlinks in a different way, or does more checks, like checking consistency with reference counters, whatever, to avoid it). rsync handles hardlinks in a final pass, so as soon as the verbosity level is raised, problems are easy to detect. I have only one server left that uses ext2. It's also saved with rsync, no problem seen so far (a few weeks only, as I said). But the filesystem used isn't the only difference. Usage pattern probably matters a lot. On the system where it happens, hardlinked files are often Maildir files (unsurprisingly) and mrtg log files (which are rotated every 5 minutes). inodes are probably freed by mrtg, and one reused for a new email. If you've got files being deleted in the middle of the backup, then it is extremely difficult for rsync (or any tool) to get the hard link detection correct. You've got a few choices: 1) put everything on lvm and backup snapshots instead of the live filesystem. This has a number of benefits. 2) link every file into some temp directory before the backup starts. This will prevent that particular inode number from being reused during the backup, but won't help if new files are added during the rsync (since those new files could also be deleted).

for each file in backup list
    ln file tmpdir/counter
    counter++
rsync
rm -rf tmpdir

-chris
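Option 2 works because a hard link keeps the inode allocated: as long as the temp directory holds a link, the inode number cannot be freed and recycled, no matter what happens to the original path mid-backup. A small demonstration of that pinning effect (the /tmp scratch paths are made up for illustration):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create `orig`, hard link it to `pin`, delete `orig`, and report
 * whether the inode survives under `pin` with the same inode number.
 * Returns 1 on success, 0 if the inode changed, -1 on setup failure. */
static int inode_survives_unlink(const char *orig, const char *pin)
{
    struct stat before, after;

    unlink(orig);
    unlink(pin);                       /* start from a clean slate */
    int fd = open(orig, O_CREAT | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    if (stat(orig, &before) != 0 || link(orig, pin) != 0)
        return -1;
    unlink(orig);                      /* the "file deleted mid-backup" case */

    /* The extra link pins the inode, so its number cannot be reused
     * until the backup drops the temp directory. */
    int ok = (stat(pin, &after) == 0 && after.st_ino == before.st_ino);
    unlink(pin);
    return ok;
}
```

This is also why the trick cannot help with files created after the backup starts: nothing has linked them yet, so their inodes are still free to come and go.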
Re: Stacktrace
On Thu, 2004-12-16 at 14:17 +0100, Joachim Reichelt wrote: Dear all, I just got in my dmesg: reiserfs/1: page allocation failure. order:0, mode:0x21 This is an atomic allocation, which is allowed to fail. You can ignore the message (which comes from the VM), later versions of the suse kernel don't even print it for atomic failures. -chris
Re: Oops with large file in 2.6.8, reiser 3.6.13
On Mon, 2004-11-29 at 14:46 -0500, Jeff Mahoney wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Alex Zarochentsev wrote: | Hello, | | On Fri, Oct 29, 2004 at 10:55:36AM +0100, Richard Gregory wrote: | |Hi Alex, | |That fixed it. I created a 617gig file that filled the filesystem. It |then deleted without a problem. The delete took a long time, but at |least it got there. | | | Thanks a lot. Your reply is what we needed to make a correct fix. Alex - Is the fix in the parent message the correct fix? It seems to leave an if (1 || ...) in, and I've yet to see the fix appear in bk or lkml. Jeff, look for subject: reiserfs_do_truncate patch from zam. -chris
Re: [PATCH] Fix reiserfs oops on small fs
On Thu, 2004-11-18 at 12:49 +0100, Jan Kara wrote: Hello! Attached patch fixes oops of reiserfs on a filesystem with just one bitmap block - current code always tries to return second bitmap even if there's not any. Could someone review it please so that it can be merged in mainline? Hi Jan, A slightly different form of this patch is in already. Look for the checks in bmap_hash_id. Are you still able to reproduce this bug on kernels newer than October 18? -chris
Re: [PATCH] Expose sync_fs()
On Thu, 2004-11-18 at 13:00 +0100, Jan Kara wrote: Hello! Attached patch makes reiserfs provide sync_fs() function. It is necessary for a new quota code to work correctly and expose quota data to the user space after quotaoff. Currently the functionality is hidden behind the write_super() call which also seems a bit non-intuitive to me. Do you think the patch is acceptable? Looks fine. -chris
Re: [PATCH] Compile fix for reiserfs quota debug
On Thu, 2004-11-18 at 12:44 +0100, Jan Kara wrote: Hello! Attached patch fixes debugging messages of the quota code in the reiserfs so that they compile. Could some of the reiserfs developers have a look at it please so that it can be merged in the mainline? Looks fine, thanks Jan. -chris
Re: [PATCH] ReiserFS v3 I/O error handling
On Wed, 2004-09-15 at 10:31, Hans Reiser wrote: Jeff Mahoney wrote: Alex Zarochentsev wrote: | I assume that was tested with some simulated i/o errors, wasn't it?. Of course. The debugging code is since removed, but every place there was a !buffer_uptodate(bh) check, I added a trigger such that I could trigger each error path individually. I triggered various error paths while running fsx, stress.sh, and/or ltp's fsstress. Your patch is much needed stuff. Would be nice to see it for reiser4 someday.;-) ;-) Any objections if we start by sending this to Andrew for v3? -chris
Re: BUG in Reiserfs Journal Thread
On Wed, 2004-09-15 at 16:02, Vijayan Prabhakaran wrote: Dear Chris Mason, I found a bug in Reiserfs journal thread. This bug is in function reiserfs_journal_commit_thread(). Hi, Which version of the code are you reading? -chris
Re: silent semantic changes with reiser4
On Wed, 2004-08-25 at 16:41, Hans Reiser wrote: I just want to add that I AM capable of working with the other filesystem developers in a team-player way, and I am happy to cooperate with making portions more reusable where there is serious interest from other filesystems in that, Prove it. Stop replying for today and come back tomorrow with some useful discussions. Christoph suggested that some of the v4 semantics belong in the VFS and therefore linux as a whole. He's helping you to make sure the semantics fit nicely with the rest of the kernel interfaces and are race free. Take him up on the offer. -chris
Re: Quicker alternative to find /?
On Mon, 2004-08-16 at 08:52, Christophe Saout wrote: Am Sonntag, den 15.08.2004, 23:16 +0200 schrieb Felix E. Klee: I'd like to store the directory structure of a partition formatted as ReiserFS into a file. Currently, I use find / > file This process takes approximately 5 minutes (the result is 26MB of data). Are there any alternative *quicker* ways to do this? The main problem is that find only uses one thread. This thread only reads one directory at once and as a result of that you'll get a lot of seeks. This can usually be improved *a lot* by doing a massively multi-threaded search with a lot of threads trying to read a lot of directories at once. The disk io scheduler will then linearize all the outstanding read requests. I've done something similar to speed up a diff -r using a shell script (not for find but for reading the file content that should be compared). This gets tricky quickly. Many threads reading many different directories at once will introduce a lot of seeks, since the directories are likely to be far apart on disk. It is far better for the filesystem to realize a sequential scan of the directory is in progress and do smarter readahead based on that. The latest patches in 2.6.8 for reiser v3 do some of this, triggering metadata readahead on readdirs. You can also make things faster by mounting with -o noatime,nodiratime. This is one workload where v4 should do better, since the inode data is close to the directory entry. -chris
Re: Quicker alternative to find /?
On Mon, 2004-08-16 at 09:19, Spam wrote: Am Sonntag, den 15.08.2004, 23:16 +0200 schrieb Felix E. Klee: I'd like to store the directory structure of a partition formatted as ReiserFS into a file. Currently, I use find / > file This process takes approximately 5 minutes (the result is 26MB of data). Are there any alternative *quicker* ways to do this? The main problem is that find only uses one thread. This thread only reads one directory at once and as a result of that you'll get a lot of seeks. I am confused about this in general with most filesystems. I thought that all filenames/foldernames etc were stored in one place and not spread out over the entire filesystem. It seems to me very strange that things like find/ls -R etc take so long just to read/list files like this on any modern filesystem. It varies by filesystem. The filenames and folder names are stored in one place on almost all filesystems. But, the actual inode information that tells you what kind of file it is and how to read the file is sometimes stored in a different place (v3, ext[23]). find has to read both sets of information because a recursive find has to descend into all subdirectories, and the only way it can know if something is a subdirectory is by reading the inode. There is an optimization that ext[23] uses to store the mode information (identifying things as a file or dir) in the directory listing. reiserfs v3 doesn't have this in the disk format, not sure if v4 does or not. Different directories are likely to be stored in different areas of the disk. So, a multithreaded find that tries to read multiple dirs at once is likely to introduce more seeks. -chris
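The cost being discussed is the per-entry inode read that a recursive find pays. At the syscall level this shows up in readdir's d_type field: filesystems that store the file type in the directory entry (the ext[23] optimization mentioned above) can answer "is this a subdirectory?" for free, while those that don't report DT_UNKNOWN and force an lstat per entry. A sketch of that fallback pattern (the directory paths in the test are made up):

```c
#define _DEFAULT_SOURCE   /* for d_type / DT_* on glibc */
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Count subdirectories of `path`, preferring the type stored in the
 * directory entry and falling back to lstat() only when the
 * filesystem reports DT_UNKNOWN (as a filesystem without type info
 * in its directory format would). */
static int count_subdirs(const char *path)
{
    DIR *d = opendir(path);
    if (!d)
        return -1;
    int dirs = 0;
    struct dirent *de;
    while ((de = readdir(d)) != NULL) {
        if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
            continue;
        int is_dir;
        if (de->d_type != DT_UNKNOWN) {
            is_dir = (de->d_type == DT_DIR);   /* free: no inode read */
        } else {
            char buf[4096];
            struct stat st;
            snprintf(buf, sizeof(buf), "%s/%s", path, de->d_name);
            is_dir = (lstat(buf, &st) == 0 && S_ISDIR(st.st_mode));
        }
        dirs += is_dir;
    }
    closedir(d);
    return dirs;
}
```

On a filesystem like reiserfs v3, every entry takes the lstat branch, which is exactly the extra seek-heavy inode traffic this thread is describing.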
Re: Odd Block allocation behavior on Reiser3
On Mon, 2004-08-09 at 18:04, Sonny Rao wrote: On Mon, Aug 09, 2004 at 04:30:51PM -0400, Chris Mason wrote: On Mon, 2004-08-09 at 16:19, Sonny Rao wrote: Hi, I'm investigating filesystem performance on sequential read patterns of large files, and I discovered an odd pattern of block allocation and subsequent re-allocation after overwrite under reiser3: Exactly which kernel is this? The block allocator in v3 has changed recently. 2.6.7 stock Ok, the block allocator optimizations went in after 2.6.7. I'd be curious to see how 2.6.8-rc3 does in your tests. This was done on a newly created filesystem with plenty of available space and no other files. I tried this test several times and saw the number of extents for the file vary between 5, 6, 7, and 134 extents, but it is always different after each over-write. You've hit a feature of the journal. When you delete a file, the data blocks aren't available for reuse until the transaction that allocated them is committed to the log. If you were to put a sync in between each run of dd, you should get roughly the same blocks allocated each time. ext3 does the same things, although somewhat differently. The asynchronous commit is probably just finishing a little sooner on ext3. First, I expect that an extent-based filesystem like reiserfs reiser4 is extent based, reiser3 is not. Ah, I didn't know that. I'm still confused as to why on the first allocation/create we get such bad fragmentation; you can see that even though the file is fragmented into 134 blocks, the blocks are very close together. Most of the extents are only 2 blocks apart. Still, there were a number of cases the old allocator didn't do as well with. This could be the metadata mixed in with the file data. In general this is a good thing: when you read the file sequentially, the metadata required to find the next block will already be in the drive's cache.
Whenever you're doing fragmentation tests, it helps to also identify the actual effect of the fragmentation on the time it takes to read a file or set of files. It's easy to create a directory where all the files are 99.99% contiguous, but that takes 3x as much time to read in. -chris
Re: mongo_copy: cp: cannot stat `/mnt/testfs/testdir0-0-0/f92': Input/output error
On Wed, 2004-08-04 at 13:38, Hans Reiser wrote: Please do whatever you can to reproduce this. We are going to delay release by one day to see if it can be reproduced. Vs thinks it might be a hardware problem, I am not so optimistic, what are your thoughts? If it currently passes all of your internal tests, and the disk format is fixed, I'd release it. There are going to be bugs found, pretty much everyone expects v4 to go through a churning period. -chris
Re: Reiser4 on SuSE 9.1
On Fri, 2004-06-04 at 10:14, Mike Young wrote: Has anyone put Reiser4 on the latest SuSE 9.1 release? I'd like to use it there without having to patch a pristine kernel. Preferably, I'd like to be able to use their RPM build environment so I can continue to take updates from SuSE. I've placed a couple of requests into their support group, but have gotten nowhere with them. Anyone got a suggestion? I can patch kernels. I just haven't deciphered the build environment. I'm used to doing this in a chroot environment. We're interested in v4 of course, as things stabilize we'll consider including it. This doc is fairly up to date, and it might help you figure out the suse build setup: http://www.suse.de/~agruen/kernel-doc/ -chris
Re: I would like to see ReiserFS V3 enter a feature freeze real soon.
On Tue, 2004-06-01 at 13:02, Hans Reiser wrote: I can't promise that I'll never make another change in there, but my goal is to keep them to a minimum. Also, I would like to see some serious benchmarks of the bitmap algorithm changes before they go in. They seem nice in theory, and some users liked them for their uses, but that does not make a serious scientific study. Such a study has a high chance of making them even better.;-) Some benchmarks have been posted on reiserfs-list, but I'd love to coordinate with you on getting some mongo numbers. Ok. A good start would be to just rebenchmark against v4. V4 performance is not at a stable point at the moment I think, I have not been monitoring things closely due to trying to earn bucks consulting, and performance did not get tested every week, but there have been reports of performance decreasing and no reports of anyone investigating it, so I need to Sure, since v4 is being done against -mm right now (right?) you can just benchmark against a few of the new options. mount -o alloc=skip_busy will give you the old allocator. Elena, please compose a URL consisting of past benchmarks of various V4 snapshots and send it to me. (I did not read the last one you sent, sorry about that, so include the contents of that one also). If the objective is to determine if the algorithm is good, then we should test it with only the algorithm in question changed. I would be quite happy to add the algorithm to V4 (or Chris and Jeff can do that) and test it on vs. off. The algorithm has a few key components, but v4 doesn't need most of it. The part to inherit packing localities down the chain would be most interesting in v4. The rest approximates things v4 should already be good at, like grouping some of the metadata near the data blocks. -chris
Re: journal viewer for reiserfs
On Fri, 2004-05-21 at 11:12, Redeeman wrote: Hey, I just lost power today, and I saw it mentioned some transactions it replayed. Now, I don't like that, but what's happened has happened. Though, I would like to know what the transactions really were; is such a thing possible? Note: reiserfs, not reiser4. The transactions are block based, so there's no nice logical way to sum them up. You can look at the blocks in the log with debugreiserfs -j /dev/xxx, but that's not very exciting. -chris
Re: large files
On Tue, 2004-05-18 at 09:42, Bernd Schubert wrote: Hello Chris, As a comparison data point, could you please try 2.6.6-mm3? I realize you don't want to run this kernel in production, but it would tell us if I understand the problems at hand. the results in 2.6.6-mm3 are below, we are almost considering running this kernel version. Here are two other interesting facts: 1.) During the file creation in 2.4.26 the load on the system was around 3-4, whereas in 2.6.6-mm3 the load was at about 8-9. Which procs contributed to this load? The simple dd should have kept the load at one. 2.) When the dd file creation process finished (2.4.26 was running) the system became so unresponsive that the drbd connection timed out and a resync process automatically started when the system became responsive again. I don't have any comparison to 2.6.6-mm3 since we would need another drbd version. Also, I don't know if this happened when dd finished or when rm started, since both were running from a script. Probably the rm. [ 2.6.6-mm3 is much faster ] Do you have any ideas how we could improve 2.4.x? 2.6.6-mm has a few key improvements. There's less metadata fragmentation thanks to some block allocator fixes. More importantly, during the rm, metadata blocks are read in 16 at a time instead of 1 at a time. I'd be happy to give someone pointers on porting the metadata readahead bits back to 2.4. -chris
Re: large files
On Mon, 2004-05-17 at 15:48, Bernd Schubert wrote: Hello, I'm currently testing our new server and though it will primarily not serve really large files (about 40-60 users will have a quota of 25GB each on a 2TB array), I'm still testing the performance for large files. So I created a roughly 300GB file, and the problem now is removing it. Removing it took much more than 15 minutes. Here's the relevant top line: 5012 root 18 0 368 368 312 D 21.9 0.0 5:48 rm Since I didn't expect it to take so much time, I didn't measure the time to delete this file. system specifications: - dual opteron 242 (1600 MHz) - linux-2.4.26 with all patches from Chris, no further patches - reiserfs-3.6 format The partition with the 300GB file has a size of 1.7TB. This is most likely a combination of metadata fragmentation and the fact that during deletes, 2.4.x reiserfs ends up reading one block at a time. As a comparison data point, could you please try 2.6.6-mm3? I realize you don't want to run this kernel in production, but it would tell us if I understand the problems at hand. -chris
Re: 2 Terabyte install
On Wed, 2004-05-12 at 14:24, Mike Benoit wrote: On Wed, 2004-05-12 at 09:59, Hans Reiser wrote: There were a few bugs of ours that acted as red herrings, but Linspire is now up and running on this system with ReiserFS 3 and kernel 2.6.5. While I'm here, I have some other questions: * What is the time complexity of mounting a ReiserFS partition? It seems to be proportional to the size of the partition? Is it different for Reiser4? * Is there a tool to determine the type of file system on a partition without mounting it? is the time a problem in practice, or just a curiosity point? Hans, The last place I worked had a 2.5TB RAID array that had tens of millions of files on it, basically a perfect fit for ReiserFS. However they rejected it simply because of the mount times. It was a slower machine (mainly used for dumb storage, however uptime was critical) but the mount times were just crazy, I don't have hard numbers, but if I recall correctly I was told around 5-10 minutes. Hans' guys made mount times better in v3 by reading the bitmaps in big chunks. I suspect you were using a kernel from before that fix. -chris
Re: metas Permission Denied
On Fri, 2004-04-30 at 01:19, Hans Reiser wrote: Chris Mason wrote: On Thu, 2004-04-29 at 12:22, Nikita Danilov wrote: [EMAIL PROTECTED] writes: On Thu, 29 Apr 2004 19:59:22 +0400, Nikita Danilov said: chmod u+rx backup/fsplit.c x bit is necessary for lookups, and r bit---for readdir. This is going to be *such* a non-starter - there's many decades of C files are mode 644 and executables are 755 tradition that this will fly against. What this basically implies is that the 'execute' Eh? What I described is precisely decades old meaning of rwx bits for directories. Problem is that we have to fit objects that are both regular files and directories into access control scheme that wasn't designed for such a mix. I don't see better solution short of inventing new bit(s). Please forgive me for jumping into the end of a thread without reading the whole thing, but it seems like the r bit should be sufficient here. If you can read the file, you should be able to read the metas. x should be for execution of the file... what if the file/directory contains real files which are not metas, and it also has a file body? This is possible in reiser4. Well, that would explain needing the execute bit ;-) I guess this is a matter of taste, but to me, the metas are really part of the file. If you can read the file you should be able to at least read the listing of metas, for the same reasons that you can read the file size and atime/mtime etc. This could hold true for /somedir/metas as well. -chris
Re: Reiserfs concurrent write problems
On Tue, 2004-04-27 at 12:15, Bruce Guenter wrote: On Mon, Apr 26, 2004 at 01:36:11PM -0400, Chris Mason wrote: Please try 2.6.6-rc2-mm2, which has new block allocator patches and other speedups. The default mount options should work best for you, but this might work too: mount -o alloc=skip_busy:dirid_groups /dev/xxx /mnt Using the new kernel and the above mount options boosted the write performance of ReiserFS, almost to the level of XFS. At the highest concurrency level, the reading performance dropped, but it's still much faster than the other FSs. Thanks for your help. Good to hear, could I trouble you to ping me (or better yet reiserfs-list) when you have the graphs updated? Other notes about 2.6.6-rc2-mm2, it has some ext3 optimizations as well, you might want to rebenchmark there. reiserfs defaults to -o data=ordered in the 2.6.6-rc2 (and -mm), you should get slightly better numbers with -o data=writeback. -chris
Re: I oppose Chris and Jeff's patch to add an unnecessary additional namespace to ReiserFS
On Mon, 2004-04-26 at 14:15, Hans Reiser wrote: Chris Mason wrote: On Mon, 2004-04-26 at 12:59, Hans Reiser wrote: v4 didn't factor into these decisions because it was still in extremely early stages back then (2.4.16 or so). It was clearly indicated then that accessing acls was scheduled for V4 not V3. Well, the part we've always disagreed on most is how to support existing users. SUSE implemented the acls for v3 because we felt they were an important feature, and didn't want to tell users asking for ACLs to switch filesystems when it was reasonable to implement in v3. It seems that you don't want the ACLs in v3 for two major reasons: 1) it's not v4 2) it's based on xattrs I don't feel this is a good way to support v3, since v4 still means telling someone to switch just for acls, and not using xattrs means not using the same API as the rest of the kernel. I hope v4 does improve the xattr api, and I hope it manages to do so for more than just reiser4. It is important that application writers are able to code to a single interface and get coverage across all the major linux filesystems. -chris
Re: online fsck
On Thu, 2004-04-22 at 13:51, Hans Reiser wrote: Chris Mason wrote: On Thu, 2004-04-22 at 09:00, Jure Pear wrote: Hi all, Is it theoretically possible? Like, does it need a drastic redesign of reiserfs or just sufficient $$ directed to the team to be implemented? Because I think that reiserfsck --check in 12h + --rebuild-tree in 18h is still waaay too much downtime for a 500gb mail server... Online check is easy, just use lvm or evms to make a snapshot and then check the snapshot. Requires that users use lvm before discovering the need for fsck, but, yes. What would be ideal would be some support for finding the inconsistency on the snapshot, and then fixing it on the real fs using the information learned from the snapshot fsck. Which should be possible, especially for corruptions at the leaf levels. Things like incorrect i_size, pointers to files that don't exist, etc. Corruptions that require a full-blown rebuild-tree will be much harder. I was wrong to say that an online rebuild-tree would be impossible, but it certainly does seem tricky. Basically you could freeze the old tree, using it for readonly lookups. Rebuild the new tree in the background, and verify things you find in the old tree in the new tree (to catch a file that has been deleted while the FS was mounted but is present in the old tree). -chris
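A sketch of the snapshot check under LVM2 (the volume names vg0/home and the snapshot size are made up for illustration; needs root; reiserfsck's --check mode only reads, so the snapshot is never modified):

```shell
# Freeze a point-in-time copy of the live volume; the fs stays mounted.
lvcreate --snapshot --size 2G --name home-snap /dev/vg0/home

# Check the snapshot instead of the live filesystem.
reiserfsck --check /dev/vg0/home-snap

# Throw the snapshot away when done.
lvremove -f /dev/vg0/home-snap
```

If the check finds damage, you still have to schedule downtime to fix the real volume, but at least you learn about it without unmounting anything.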
Re: Reiser Fragmentation Effects (was compression vs performance)
On Fri, 2004-04-09 at 01:53, Hans Reiser wrote: Burnes, James wrote: I thought nearly all filesystems designed since Berkeley FFS were nearly immune to fragmentation problems. After reading the following analysis at Harvard, it seems that fragmentation is still a problem. http://www.eecs.harvard.edu/~keith/research/tr94.html Yeah, I wish I had read this in 94. V3 suffers from the same problems as FFS does as described in the abstract (all that I read, sorry about that, I really am a bit busy, so unless someone suggests I should read more ) . V4 cures it though. I put out some patches last week that try to deal with this in v3. Take a look through the archives for mail from me. The v3 patches are an attempt to do better under common workloads. I think they are a big improvement, and I doubt there's much more that can (or should) be done beyond simple tweaking. v4 does a better job, and even if it doesn't, it should at least have enough info in the metadata such that any problems can be fixed. -chris
Re: reiserfs v3 patches updated
On Tue, 2004-04-06 at 14:29, camis wrote: Seems to run fine so far with rw,noatime,nodiratime,acl,user_xattr,data=ordered,alloc=skip_busy:dirid_groups,packing_groups Has anyone done any throughput/benchmarks for the various new patches/code? The block allocator stuff is fresh from the oven and still warm inside. I'll be posting benchmarks for it as the week goes on. The block allocator improvements are our attempt to reduce fragmentation. The patch defaults to the regular 2.6.5 block allocator, but has options documented at the top of the patch that allow grouping of blocks by packing locality or object id. It also has an option to inherit lightly used packing localities across multiple subdirs, which keeps things closer together in the tree if you have a bunch of subdirs without much in them. Any of these new features documented anywhere? ;) I'm writing up a mount option summary, the block allocator stuff is all documented at the top of the patch. -chris
Re: reiserfs v3 patches updated
On Tue, 2004-04-06 at 15:17, camis wrote: Has anyone done any throughput/benchmarks for the various new patches/code? The block allocator stuff is fresh from the oven and still warm inside. I'll be posting benchmarks for it as the week goes on. Something interesting: I have 8 dell 2650's (dual xeon 2.8ghz+hyperthreading) and I've set up the one machine above with the patches.. All 8 machines are incoming MX/smtp machines which handle quite large loads (5gigs of mail per day). machine 1: (mounted: noatime,nodiratime,data=ordered,alloc=skip_busy:dirid_groups,packing_groups)
21:16:17 up 12:17, 1 user, load average: 0.45, 1.04, 1.07
375 processes: 374 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:         user   nice  system   irq  softirq  iowait    idle
            total   7.2%   0.0%   32.4%  0.0%     0.8%    5.2%  353.2%
            cpu00   1.8%   0.0%   10.6%  0.1%     0.9%    1.3%   85.0%
            cpu01   1.5%   0.0%    7.4%  0.0%     0.1%    0.9%   89.9%
            cpu02   1.7%   0.0%    7.0%  0.0%     0.1%    1.7%   89.3%
            cpu03   1.8%   0.0%    7.4%  0.0%     0.1%    1.5%   88.9%
Iowait total stays hovering at around the 6-8% mark.. machine 2-8: (mounted: noatime,nodiratime)
21:14:31 up 1 day, 8:41, 1 user, load average: 0.43, 0.72, 0.66
392 processes: 391 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:         user   nice  system   irq  softirq  iowait    idle
            total   8.0%   0.0%   35.2%  0.0%     1.6%    0.8%  352.8%
            cpu00   2.0%   0.0%   11.1%  0.1%     1.5%    0.3%   84.6%
            cpu01   2.0%   0.0%    7.7%  0.0%     0.3%    0.1%   89.5%
            cpu02   2.2%   0.0%    7.7%  0.0%     0.0%    0.3%   89.5%
            cpu03   1.8%   0.0%    8.7%  0.0%     0.0%    0.0%   89.3%
The majority of the rest of the machines' iowait hovers around the 1% mark.. CPU time tends to be about the same, just the iowait is much much higher.. Very interesting. data=ordered makes fsync more expensive, since it ends up syncing more than just the buffers for that one file. Could you please try removing data=ordered from machine1? -chris
Re: reiserfs v3 patches updated
On Tue, 2004-04-06 at 15:53, camis wrote: The majority of the rest of the machines' iowait hovers around the 1% mark.. CPU time tends to be about the same, just the iowait is much much higher.. Very interesting. data=ordered makes fsync more expensive, since it ends up syncing more than just the buffers for that one file. Could you please try removing data=ordered from machine1? Ok.. after leaving it for a few minutes.. [ higher io wait percentage with patches applied ] What I then tried on machine1 was remounting it noatime,nodiratime and what was weird was that the iowait stayed exactly the same, no indication of dropping at all.. Do the kernel patches change any of the default mount options at all? If I put the stock 2.6.5 kernel back on without the patch applied and mount it back as noatime,nodiratime, the iowait drops back to its normal 1%.. Do you have any numbers for the amount of mail delivered per second on each box? I've got an idea that should help lower the io wait percentage as well, trying a few things here. -chris
Re: reiserfs v3 patches updated
On Tue, 2004-04-06 at 15:53, camis wrote: The majority of the rest of the machines' iowait hovers around the 1% mark.. CPU time tends to be about the same, just the iowait is much much higher.. Very interesting. data=ordered makes fsync more expensive, since it ends up syncing more than just the buffers for that one file. Could you please try removing data=ordered from machine1? Ok.. after leaving it for a few minutes.. I'm a little slow today. data=ordered is the default with these patches. You need to mount -o data=writeback. -chris
Re: reiserfs v3 patches updated
On Tue, 2004-04-06 at 16:51, Cami wrote: The majority of the rest of the machines' iowait hovers around the 1% mark.. CPU time tends to be about the same, just the iowait is much much higher.. Very interesting. data=ordered makes fsync more expensive, since it ends up syncing more than just the buffers for that one file. Could you please try removing data=ordered from machine1? Ok.. after leaving it for a few minutes.. I'm a little slow today. data=ordered is the default with these patches. You need to mount -o data=writeback. data=writeback yields pretty much the same iowait results.. (machine 1+2 are around 5%-8% whereas machine 3-8 are at around 0.8%) This is so much faster I'm worried the io isn't actually getting done. In the mail server benchmark I use (synctest -n 1 -t 50 -f -F), the time went from 2m15s to 43s. The old 2m15s was still faster than I used to get with unpatched reiserfs. I'm posting this for the truly brave among you and a little review. I need to do more tests on it. The basic idea is to make sure we don't start writeback on the log buffers and metadata if someone is already doing it. Also, the reiserfs work queue is not kicked during transaction end if some other process is going to do the commit. This saves a lot of context switches. 
-chris

Index: linux.mm/fs/reiserfs/journal.c
===
--- linux.mm.orig/fs/reiserfs/journal.c	2004-04-05 17:46:12.0 -0400
+++ linux.mm/fs/reiserfs/journal.c	2004-04-06 17:00:25.391877520 -0400
@@ -86,6 +86,7 @@ static struct workqueue_struct *commit_w
 /* journal list state bits */
 #define LIST_TOUCHED 1
 #define LIST_DIRTY 2
+#define LIST_COMMIT_PENDING 4	/* someone will commit this list */
 /* flags for do_journal_end */
 #define FLUSH_ALL 1		/* flush commit and real blocks */
@@ -2484,7 +2485,9 @@ static void let_transaction_grow(struct
 {
     unsigned long bcount = SB_JOURNAL(sb)->j_bcount;
     while(1) {
-	yield();
+	set_current_state(TASK_UNINTERRUPTIBLE);
+	schedule_timeout(1);
+	SB_JOURNAL(sb)->j_current_jl->j_state |= LIST_COMMIT_PENDING;
 	while ((atomic_read(&SB_JOURNAL(sb)->j_wcount) > 0 ||
 		atomic_read(&SB_JOURNAL(sb)->j_jlock)) &&
 	       SB_JOURNAL(sb)->j_trans_id == trans_id) {
@@ -2920,9 +2923,15 @@ static void flush_async_commits(void *p)
 	    flush_commit_list(p_s_sb, jl, 1);
 	}
 	unlock_kernel();
-	atomic_inc(&SB_JOURNAL(p_s_sb)->j_async_throttle);
-	filemap_fdatawrite(p_s_sb->s_bdev->bd_inode->i_mapping);
-	atomic_dec(&SB_JOURNAL(p_s_sb)->j_async_throttle);
+	/*
+	 * this is a little racey, but there's no harm in missing
+	 * the filemap_fdata_write
+	 */
+	if (!atomic_read(&SB_JOURNAL(p_s_sb)->j_async_throttle)) {
+		atomic_inc(&SB_JOURNAL(p_s_sb)->j_async_throttle);
+		filemap_fdatawrite(p_s_sb->s_bdev->bd_inode->i_mapping);
+		atomic_dec(&SB_JOURNAL(p_s_sb)->j_async_throttle);
+	}
 }
 /*
@@ -3011,7 +3020,8 @@ static int check_journal_end(struct reis
     jl = SB_JOURNAL(p_s_sb)->j_current_jl;
     trans_id = jl->j_trans_id;
-
+    if (wait_on_commit)
+	jl->j_state |= LIST_COMMIT_PENDING;
     atomic_set(&(SB_JOURNAL(p_s_sb)->j_jlock), 1) ;
     if (flush) {
 	SB_JOURNAL(p_s_sb)->j_next_full_flush = 1 ;
@@ -3533,8 +3543,8 @@ static int do_journal_end(struct reiserf
     if (flush) {
 	flush_commit_list(p_s_sb, jl, 1) ;
 	flush_journal_list(p_s_sb, jl, 1) ;
-    } else
-	queue_work(commit_wq, &SB_JOURNAL(p_s_sb)->j_work);
+    } else if (!(jl->j_state & LIST_COMMIT_PENDING))
+	queue_delayed_work(commit_wq, &SB_JOURNAL(p_s_sb)->j_work, HZ/10);
 /* if the next transaction has any chance of wrapping, flush
Re: NFS issues with reiserfs 3.6? ext3 works...
On Fri, 2004-03-26 at 08:18, Vladimir Saveliev wrote: Hello On Fri, 2004-03-26 at 14:53, Bernhard Sadlowski wrote: Hi! A short question: What are the remaining differences between reiserfs and ext3 regarding NFS? We thought none before your mail Details: I have two Linux machines with current kernel 2.4.25 and two directories on the NFS Server dematl04: dematl04:/vol01 (ext3 fs) dematl04:/raid (reiserfs 3.6) On both machines we use Helios Ethershare (a commercial software for print and fileservices for Macintosh). Helios Ethershare uses and updates a file .Desktop every time some mac is creating, writing or changing files in a Macintosh volume. The dt utility has to be used in a shell for mkdir, rm, mv, touch, ... I guess it has to lock this file while applying changes. Does Helios Ethershare use the standard linux in-kernel nfs server, or does it patch things somehow? -chris
Re: NFS issues with reiserfs 3.6? ext3 works...
On Fri, 2004-03-26 at 09:09, Bernhard Sadlowski wrote: Hi Chris, On 26 Mar 2004 08:42, Chris Mason [EMAIL PROTECTED] wrote: Does Helios Ethershare use the standard linux in kernel nfs server, or does it patch things somehow? System1 (dematl04) is Debian/woody with Linux 2.4.25 kernel-nfs-server System2 (dematl08) is Redhat9 with Linux 2.4.25 with its standard nfs client I did compile both kernels myself from original kernel.org source. Thanks for the clarification. Running dt under strace as Vladimir suggested will help. -chris
Re: NFS issues with reiserfs 3.6? ext3 works...
On Fri, 2004-03-26 at 09:20, Bernhard Sadlowski wrote: On 26 Mar 2004 09:18, Chris Mason [EMAIL PROTECTED] wrote: I did compile both kernels myself from original kernel.org source. Thanks for the clarification. Running dt under strace as Vladimir suggested will help. I did send him the output by private mail, because the list doesn't accept bigger attachments. If you are interested, I can send you the strace output too. Vladimir sent them along. I only read them quickly, but the traces look the same to me for reiserfs and ext3. It tries to find directory test3 on the disk and then calls off to helios, which returns a failure. You'll have to check things on the helios side I think. -chris
reiserfs logging patches updated
Hello everyone, ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.5-rc2-mm2 Has a new set of reiserfs patches. These should also work on 2.6.5-rc2, but I did testing on top of -mm2 because I'm submitting part of the patch set to Andrew. New since the 2.6.4 code: - add reiserfs laptop mode support - removed patches that made it into mainline - fixed buffer refcount problem in reiserfs-writepage-ordered-race when block size < page size - fix buffer head leak in my invalidatepage func - fix bogus warning message about locked buffers during logging The series file has the list of the patches in the order they should be applied. Andrew, this subset is ready for testing -mm. It's everything except the xattrs and ACLs, I'm still trying to convince Hans those should go in. reiserfs-nesting-02 reiserfs-journal-writer reiserfs-logging reiserfs-jh-2 reiserfs-prealloc reiserfs-tail-jh reiserfs-writepage-ordered-race reiserfs-file_write_hole_sd.diff reiserfs-laptop-mode reiserfs-truncate-leak reiserfs-ordered-lat reiserfs-dirty-warning -chris
Re: reiserfs logging patches updated
On Wed, 2004-03-24 at 14:49, Hubert Chan wrote: Chris == Chris Mason [EMAIL PROTECTED] writes: [...] Chris - add reiserfs laptop mode support Can you explain what laptop mode is? Take a look at linux/Documentation/laptop-mode.txt on any recent -mm kernel. The short description is that it allows you to force the filesystem into a mode where it doesn't flush things to disk. The risk is increased loss of data if you crash, the gain is better battery life on laptops. The docs include a sample control script and other details. The reiserfs patch is quite small, just making sure the journal honors the flush timings requested by the rest of the kernel. -chris
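For the curious, the knobs involved look roughly like this (root shell; paths per Documentation/laptop-mode.txt in -mm kernels of this era, and the exact values and semantics changed between versions, so treat these numbers as illustrative):

```shell
# Turn laptop mode on: writeback gets batched around disk spin-ups.
echo 1 > /proc/sys/vm/laptop_mode
# Let dirty data age ~10 minutes (60000 centisecs) instead of the usual
# 30 seconds, so the disk can stay spun down between bursts.
echo 60000 > /proc/sys/vm/dirty_expire_centisecs

# Back to normal behavior:
echo 0 > /proc/sys/vm/laptop_mode
echo 3000 > /proc/sys/vm/dirty_expire_centisecs
```

The sample control script in laptop-mode.txt toggles these (and a few more) automatically when the machine switches between AC and battery.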
Re: reiserfs logging patches updated
On Wed, 2004-03-24 at 17:18, Andrew Morton wrote: Chris Mason [EMAIL PROTECTED] wrote: ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.5-rc2-mm2 Has a new set of reiserfs patches. -ENODOCCO. If people are going to test this stuff we will need setup and usage instructions, links to userspace tool upgrades, etc, etc. For data=ordered? The only docs are to mount -o data=writeback if you don't want data=ordered (which is the new default). No tool upgrades are required. reiserfs needs a Documentation/filesystems/reiserfs.txt, but it needs that in general. I'll write one up and have the namesys guys review. -chris
Re: reiserfs logging patches updated
On Wed, 2004-03-24 at 17:35, Andrew Morton wrote: For data=ordered? The only docs are to mount -o data=writeback if you don't want data=ordered (which is the new default). No tool upgrades are required. OK, thanks. Switching the default on day one sounds radical, doesn't it? It does, but the code has been in testing in -suse and on the reiserfs list. This doesn't mean data=ordered is perfect, but it's not quite day one either. I can switch the default back, but I'd rather have a trial by fire ;-) reiserfs needs a Documentation/filesystems/reiserfs.txt, but it needs that in general. I'll write one up and have the namesys guys review. OK. And these guys need boring old changelogs please: The top of each patch has a boring old changelog. I can reformat them if needed. -chris
Re: reiserfs logging patches updated
On Wed, 2004-03-24 at 17:59, Andrew Morton wrote: The top of each patch has a boring old changelog. I can reformat them if needed. Oh, I didn't notice that. Anything other than one-patch-per-email with changelog in the body is a bit of a pain. I'll go fetch the patches again :( I can send them over private mail to you that way if you haven't already started downloading. -chris
Re: reiserfs logging patches updated
On Wed, 2004-03-24 at 19:47, Bernd Schubert wrote: It does, but the code has been in testing in -suse and on the reiserfs list. This doesn't mean data=ordered is perfect, but it's not quite day one either. I can switch the default back, but I'd rather have a trial by fire ;-) Oh please not, just have a look at german newsgroups, there are real flamewars about the stability of reiserfs. Every time someone has a disk problem and a reiserfs partition is affected by it, then there are more than a dozen answers that it's the fault of reiserfs. PLEASE, don't prove them right. The code just won't get tested if it isn't turned on. I think I've got all the major issues fixed, and I'm sending to -mm to try and prove that. When we're all happy with the stability of the patches, they can go to -linus. But there are just as many threads about reiserfs not having data=ordered support. It's an important feature that has been out of mainline for far too long. -chris
Re: reiserfsprogs: lib/misc.c: why die() aborts?
On Mon, 2004-03-22 at 07:49, Vladimir Saveliev wrote: Hi On Sat, 2004-03-20 at 19:18, Domenico Andreoli wrote: hi all, in trying to figure out what the unpack program in reiserfsprogs is and whether it is supposed to be distributed in the debian package, No, it should not be distributed Hmmm, I disagree. Normal users can use debugreiserfs -p and unpack to do test rebuilds on copies of broken filesystems when they are rebuilding critical data. -chris
Re: new v3 2.6.4 logging/xattr patches
On Fri, 2004-03-19 at 03:00, Hans Reiser wrote: Chris Mason wrote: Hello everyone, I've just uploaded all the reiserfs patches from the suse 2.6 kernel to: ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.4 (they should also apply to 2.6.5-rc1 and 2.6.5-rc1-mm2) These include: Latency fixes Logging performance fixes data=ordered support quota support xattrs and acls from Jeff Mahoney and a few random bug fixes. There's a README file in the directory with some more details. are these ready to go to Andrew or? They are ready for Andrew for additional testing, I'd like to submit the whole bunch to him actually. Any objections? -chris
Re: new v3 2.6.4 logging/xattr patches
On Fri, 2004-03-19 at 09:03, Chris Mason wrote: On Fri, 2004-03-19 at 03:00, Hans Reiser wrote: Chris Mason wrote: Hello everyone, I've just uploaded all the reiserfs patches from the suse 2.6 kernel to: ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.4 (they should also apply to 2.6.5-rc1 and 2.6.5-rc1-mm2) Since 2.6.5-rc1-mm2 includes some of these patches, I've put a series.mm file in the ftp directory to tell you which ones are needed when you want to apply to Andrew's tree. -chris
new v3 2.6.4 logging/xattr patches
Hello everyone, I've just uploaded all the reiserfs patches from the suse 2.6 kernel to: ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.4 (they should also apply to 2.6.5-rc1 and 2.6.5-rc1-mm2) These include: Latency fixes Logging performance fixes data=ordered support quota support xattrs and acls from Jeff Mahoney and a few random bug fixes. There's a README file in the directory with some more details. The names of the files no longer start with numbers to help you figure out the order the patches get applied. The directory includes a file called series that has a list of all the patches in the proper order. There are various shell tricks to apply things in order, or a few different patch-management tools that understand a series file. I use quilt (http://savannah.nongnu.org/projects/quilt/) which is very nice. -chris
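The simplest of those shell tricks is one `patch -p1` per line of the series file, run from the top of the tree (quilt does essentially this with `quilt push -a`). The demo below uses a throwaway tree and a trivial one-line patch as stand-ins; with the real patch set you would run the same loop from the kernel source root against the downloaded series file:

```shell
# Build a tiny stand-in tree plus one patch and a series file.
work=$(mktemp -d); cd "$work"
mkdir tree; echo old > tree/file.txt

cat > fix-file.diff <<'EOF'
--- a/file.txt
+++ b/file.txt
@@ -1 +1 @@
-old
+new
EOF
printf '%s\n' fix-file.diff > series

# Apply every patch listed in series, in order, from the tree root.
cd tree
while read -r p; do
    patch -p1 < "../$p"
done < ../series
cat file.txt    # new
```

The order matters because later patches assume earlier ones are applied, which is exactly what the series file encodes.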
Re: mount hang on kernel 2.6
On Sun, 2004-03-07 at 17:32, Fabiano Reis wrote: Hi list, I'm having problems mounting a reiserfs filesystem using kernel 2.6.3 on RedHat 9. The fs I'm trying to mount was formatted using kernel 2.4.20-24.9 (the latest kernel for RH9) with ReiserFS version 3.6.25 (from dmesg) and reiserfsprogs 3.6.4 (from reiserfsck -v command) I upgraded to kernel 2.6.3 (rpm from http://people.redhat.com/arjanv/2.6/RPMS.kernel/) The symptom: When I execute the command Can you reproduce with a vanilla or -mm kernel? -chris
[PATCH] corruption bugs in 2.6 v3
Hello everyone, These two patches fix corruption problems I've been hitting on 2.6. Both bugs are present in the vanilla and suse kernels. reiserfs-search-restart: This was originally from [EMAIL PROTECTED], I recently made a small addition to make sure the expected height was checked after reading in blocks in search_by_key (this depends on reiserfs-lock-lat from my data logging directory). reiserfs-write-sched-bug: Fixes two schedule-related bugs during reiserfs_file_write. One place in the code assumes that after a schedule, path.pos_in_item will still be valid even if the item has moved. Since items can split during a schedule this is incorrect. The second bug took a little longer to figure out, reiserfs_prepare_file_region_for_write needs to make sure a stale item pointer isn't used if search_for_position_by_key doesn't return POSITION_FOUND. The most common symptoms of the two bugs are attempts to read beyond the end of the device, file data corruption and errors like this: is_tree_node: node level X does not match to the expected one Y Where X and Y are both valid tree heights (between 1 and 5) and usually one away from each other. These reproduced reliably for me by running dbench 20 in a loop for about 20 minutes on an amd64 box. Hans, I'd like to submit these along with the other fixes I ported from 2.4 and sent to reiserfs-dev. Any objections? -chris

fix a bug in reiserfs search_by_key call, where it might not properly detect a change in tree height during a schedule. Originally from [EMAIL PROTECTED]

Index: linux.t/fs/reiserfs/stree.c
===
--- linux.t.orig/fs/reiserfs/stree.c	2004-03-03 14:11:40.984705584 -0500
+++ linux.t/fs/reiserfs/stree.c	2004-03-03 14:12:59.466460675 -0500
@@ -678,7 +678,7 @@
        current node, and calculate the next current node(next path element)
        for the next iteration of this loop.. */
     n_block_number = SB_ROOT_BLOCK (p_s_sb);
-    expected_level = SB_TREE_HEIGHT (p_s_sb);
+    expected_level = -1;
     while ( 1 ) {
 #ifdef CONFIG_REISERFS_CHECK
@@ -692,7 +692,6 @@
 	/* prep path to have another element added to it. */
 	p_s_last_element = PATH_OFFSET_PELEMENT(p_s_search_path, ++p_s_search_path->path_length);
 	fs_gen = get_generation (p_s_sb);
-	expected_level --;
 #ifdef SEARCH_BY_KEY_READA
 	/* schedule read of right neighbor */
@@ -707,21 +706,26 @@
 	    pathrelse(p_s_search_path);
 	    return IO_ERROR;
 	}
+	if (expected_level == -1)
+	    expected_level = SB_TREE_HEIGHT (p_s_sb);
+	expected_level --;
 	/* It is possible that schedule occurred. We must check whether the
 	   key to search is still in the tree rooted from the current buffer.
 	   If not then repeat search from the root. */
 	if ( fs_changed (fs_gen, p_s_sb) &&
-	    (!B_IS_IN_TREE (p_s_bh) || !key_in_buffer(p_s_search_path, p_s_key, p_s_sb)) ) {
+	    (!B_IS_IN_TREE (p_s_bh) ||
+	     B_LEVEL(p_s_bh) != expected_level ||
+	     !key_in_buffer(p_s_search_path, p_s_key, p_s_sb))) {
 	    PROC_INFO_INC( p_s_sb, search_by_key_fs_changed );
-	    PROC_INFO_INC( p_s_sb, search_by_key_restarted );
+	    PROC_INFO_INC( p_s_sb, search_by_key_restarted );
 	    PROC_INFO_INC( p_s_sb, sbk_restarted[ expected_level - 1 ] );
 	    decrement_counters_in_path(p_s_search_path);
 	    /* Get the root block number so that we can repeat the search
-	       starting from the root. */
+	       starting from the root. */
 	    n_block_number = SB_ROOT_BLOCK (p_s_sb);
-	    expected_level = SB_TREE_HEIGHT (p_s_sb);
+	    expected_level = -1;
 	    right_neighbor_of_leaf_node = 0;
 	    /* repeat search from the root */

Index: linux.t/fs/reiserfs/file.c
===
--- linux.t.orig/fs/reiserfs/file.c	2004-03-03 14:16:44.762750907 -0500
+++ linux.t/fs/reiserfs/file.c	2004-03-03 14:16:57.361012562 -0500
@@ -365,7 +365,7 @@
     // it means there are no existing in-tree representation for file area
     // we are going to overwrite, so there is nothing to scan through for holes.
     for ( curr_block = 0, itempos = path.pos_in_item ;
	   curr_block < blocks_to_allocate && res == POSITION_FOUND ; ) {
-
+retry:
 	if ( itempos >= ih_item_len(ih)/UNFM_P_SIZE ) {
 	    /* We run out of data in this indirect item, let's look for
 	       another one. */
@@ -422,8 +422,8 @@
 		bh=get_last_bh(path);
 		ih=get_ih(path);
 		item = get_item(path);
-		// Itempos is still the same
-		continue;
+		itempos = path.pos_in_item;
+		goto retry;
 	    }
 	    modifying_this_item = 1;
 	}
@@ -856,8 +856,12 @@
 	/* Try to find next item */
 	res = search_for_position_by_key(inode->i_sb, &key, &path);
 	/* Abort if no more items */
-	if ( res
Re: [PATCH] updated data=ordered patch for 2.6.3
On Mon, 2004-03-01 at 08:30, Christophe Saout wrote: Hi, Also, the code has some extra performance tweaks to smooth out performance both with and without data=ordered. There are new mechanisms to trigger metadata/commit block writeback and to help throttle writers. The goal is to reduce the huge bursts of io during a commit and during data=ordered writeback. It seems you introduced a bug here. I installed the patches yesterday and found a lockup on my notebook when running lilo (with /boot on the root reiserfs filesystem). A SysRq-T showed that lilo is stuck in fsync: Ugh, I use grub so I haven't tried lilo. Could you please send me the full sysrq-t, this is probably something stupid. -chris
Re: [PATCH] updated data=ordered patch for 2.6.3
On Mon, 2004-03-01 at 09:30, Christophe Saout wrote: Am Mo, den 01.03.2004 schrieb Chris Mason um 15:01: It seems you introduced a bug here. I installed the patches yesterday and found a lockup on my notebook when running lilo (with /boot on the root reiserfs filesystem). A SysRq-T showed that lilo is stuck in fsync: Ugh, I use grub so I haven't tried lilo. Could you please send me the full sysrq-t, this is probably something stupid. Yes. I could reproduce it by simply creating a dummy /boot volume on reiserfs. I copied the content of /boot, ran lilo and it hung again. The other reiserfs filesystems were still usable (but a global sync hangs afterwards). I also attached a bzipped strace of the lilo process. Ok, thanks. The problem is in reiserfs_unpack(), which needs updating for the patch. Fixing. -chris
Re: data-logging 2.4.23 patches: 1 reject in fs/inode.c in 2.4.25
On Thu, 2004-02-19 at 07:40, Jens Benecke wrote: Hi, Chris' data-logging patches do not apply cleanly to 2.4.25, are they being updated? I think the fix for inode.c is not very hard but I don't dare to fiddle around with file system code. ;) Ok, I'll rediff the data logging + quota against 2.4.25. -chris
Re: Big Block Device does not format...
On Thu, 2004-02-19 at 09:02, Vitaly Fertman wrote: On Thursday 19 February 2004 12:39, Oliver Pabst wrote: Hello list, Hello Oliver, I try to mkreiserfs /dev/evms/userhomes and /dev/evms/userhomes is a drive-linked evms feature object with a sum of 18.8TB. mkreiserfs /dev/evms/userhomes gives the following result: count_blocks: block device too large Aborted which reiserfsprogs version do you use? would you try the latest one from our ftp site? The limit for 4k-block-based filesystems is 16TB. Try something with extents, or try on ia64 where you can use bigger pages. -chris
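The 16TB figure falls out of 32-bit block numbers in the v3 on-disk format: 2^32 addressable blocks times the block size. A quick sanity check, pure arithmetic (needs a shell with 64-bit arithmetic, e.g. bash on a 64-bit box):

```shell
# 2^32 block numbers * 4KiB blocks, expressed in TiB:
echo $(( (1 << 32) * 4096 / (1 << 40) ))    # 16
# Bigger blocks (e.g. 64KiB pages on ia64) raise the ceiling in step:
echo $(( (1 << 32) * 65536 / (1 << 40) ))   # 256
```

So an 18.8TB device simply cannot be addressed with 4k blocks, which is what count_blocks is complaining about.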
[PATCH] updated data=ordered patch for 2.6.3
Hello everyone, I've updated the reiserfs logging speedups and data=ordered support to 2.6.3, and fixed a few bugs: i_block count not properly updated on files with holes oops when the disk is full small files were not being packed into tails Also, the code has some extra performance tweaks to smooth out performance both with and without data=ordered. There are new mechanisms to trigger metadata/commit block writeback and to help throttle writers. The goal is to reduce the huge bursts of io during a commit and during data=ordered writeback. I'm very interested in benchmarks here and info about how the patches feel for interactive performance. Please note that data=ordered is not the default yet, you have to mount with -o data=ordered to get it. You can get the patches from: ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.3 (once the suse mirror copies it over there) -chris
Re: v3 experimental data=ordered and logging speedups for 2.6.1
On Wed, 2004-02-11 at 10:09, Oleg Drokin wrote: Hello! On Wed, Feb 11, 2004 at 09:59:31AM -0500, Chris Mason wrote: thousands (hundreds of thousands) of times per day. It wasn't an easy bug to hit. What are the symptoms? The rpm database is corrupted, rpm --rebuild-db is required. Hm, that's really strange. But on ia64 the default io size is 64k, do they have the same problems there? No, ia64 works fine. It really is strange. other tricky parts when the data=journal code is added. We've already made our own file_write call, it doesn't make sense to warp it just to avoid our own __block_commit_write ;-) Well, code duplication is not a very good thing. It depends on how much you have to twist things to use the generic code. If we used __block_commit_write, buffers would be marked dirty when it completes. This won't work for data=journal at all, we don't want them marked dirty. Well, if you do not want them marked dirty, you just do not need to call commit_write at all since the only thing it does is marking buffers dirty ;) And you can have a list of buffers at the allocation time anyway, so no need to do extra checks about partial page writes and so on since all these checks were already done. Interesting. I'll look harder at that. -chris
Re: v3 experimental data=ordered and logging speedups for 2.6.1
On Mon, 2004-01-19 at 17:53, Dieter Nützel wrote: 05 and 06 needed some handwork 'cause the SuSE kernel includes xattrs and posix acl's but nothing special. Good to hear. I wasn't expecting the suse merge to be difficult, luckily it doesn't have many patches in it yet. Jeff and I will look at getting them into the suse kernel once data=journal is done as well. An EXPORT was missing in linux/fs/buffer.c to compile ReiserFS 3.x.x as a module (inode.c, unresolved symbol): Thanks, I'll add it into the patch when I get back from linux world. -chris
Re: 3.6.25 - Journal replayed back to 3 weeks ago
On Thu, 2004-01-15 at 00:23, Neil Robinson wrote: Chris, I haven't heard anything since my response, so just in case it wasn't complete enough: uname -a results: 2.4.20-gentoo-r9 #1 Sat Dec 6 03:17:43 GMT 2003 i686 Intel(R) Pentium(R) 4 CPU 2.66GHz GenuineIntel GNU/Linux This is running in a vmware workstation 4.0 Windows XP host. It is set up with 3 MAX 4GB emulated SCSI drives and 1 emulated MAX 2GB FAT32 IDE hard drive. sda1 is ext2 on /boot sda2 is swap on swap sda3 is reiserfs on / sdb1 is reiserfs on /home sdc1 is reiserfs on /var/tmp/portage hda1 is vfat on /work Ok, please tell me more about the vmware setup. Are the vmware drives configured to do copy on write? -chris
Re: 3.6.25 - Journal replayed back to 3 weeks ago
On Thu, 2004-01-15 at 11:33, Neil Robinson wrote: Ok, please tell me more about the vmware setup. Are the vmware drives configured to do copy on write? According to the documentation, changes made to the virtual drives are committed to the physical drive immediately. Here is the salient info from the dmesg: vmware has a mode where the changes are written to a log file instead of to the main file. The client doesn't see that things are split between two files, since vmware exports them as a single device. This allows you to do fancy things like roll back changes, or share a base file between two hosts and have changes go into private files. It's possible that you're in this mode and somehow lost a log file. Are the virtual drives being exported to linux as real block devices in windows, or are they files on some windows filesystem? -chris
Re: 3.6.25 - Journal replayed back to 3 weeks ago
On Thu, 2004-01-15 at 14:19, Neil Robinson wrote: Uh oh, I think I now know what happened. Ouch. I was not clear what the REDO files were for, since I seemed to have lots of them taking up a considerable amount of drive space. I deleted them (probably taking all of my changes with them). On the whole, I am quite relieved. Although I did it to myself (this is my first experience using vmware), I realize now what happened and it won't happen again. Even more important, it seems my problems have nothing to do with the Reiser File System, which is also a relief. I wish to apologize to you and everyone on the list for taking up your time with something that, as it turns out, has nothing really to do with you. Thanks for all of your help. Good news, I wasn't looking forward to hunting that as an FS bug ;-) Thanks for the details. -chris
Re: v3 logging speedups for 2.6
On Mon, 2004-01-12 at 02:07, Jens Benecke wrote: Chris Mason wrote: Hello everyone, This is part one of the data logging port to 2.6, it includes all the cleanups and journal performance fixes. Basically, it's everything except the data=journal and data=ordered changes. ftp.suse.com/pub/people/mason/patches/data-logging experimental/2.6.0-test11 Hi, Does it make sense to apply those to 2.6.1-mm2? Not those at least, since I managed to screw up the diff. I've got a 2.6.1 directory under experimental now with better patches. I'm checking now to see if they apply to -mm2. Does except the data=ordered changes mean that data journalling is _not_ in there, or that data journalling is there but hasn't been updated to what is there for 2.4.x yet? Correct, but I'm almost there. Things got off track a lot during the xmas break. -chris
Re: 3.6.25 - Journal replayed back to 3 weeks ago
On Mon, 2004-01-12 at 13:47, Neil Robinson wrote: Hi, this morning when I started up my notebook (running Windows XP) with a VMware session running Gentoo, the boot sequence claimed that the reiserfs drives had not been cleanly umounted (not true, I powered down the usual way on Friday evening -- su to root and then issued the poweroff command). It then replayed the journals of the two partitions using reiserfs. When it finished and booted, it was as if my entire machine had stepped back in time by 3 weeks or so (to around the 23rd of December). Since then I had installed and built openoffice, emacs, and numerous other bits and pieces. I also lost all of the email that was living in the courier-imap server. I am *very* concerned about this behaviour. I have successfully restarted, booted, etc. literally dozens of times since mid-December. I have now just installed a software RAID using RAID 5 on Gentoo and using reiserfs for a fairly large system (250GB on 8 SCSI U160 drives) with an available hot spare and a tape backup unit. Losing a few weeks of relatively insignificant changes is nothing compared with possibly losing the contents of my company's master file server. Can anyone tell me why reiserfs rolled back all the way to mid-December in spite of numerous reboots and how I can avoid a rerun of this scenario *ever* again. Is there some way to tell it to commit its changes that I am not doing and should be aware of? That's not supposed to happen. Let's start with details about which version of the kernel you were using. -chris
Re: Errors requiring --rebuild-tree in 2.4.23
On Thu, 2003-12-11 at 08:51, Jens Benecke wrote: Hi, I posted earlier about quota problems. We updated to 2.4.23 to get the logging patches, because some power failures made our /home partition spew out these: (QUESTIONS at the end of the mail) Sorry, before we got to the questions, what was the order of the events above? -chris
v3 logging speedups for 2.6
Hello everyone, This is part one of the data logging port to 2.6, it includes all the cleanups and journal performance fixes. Basically, it's everything except the data=journal and data=ordered changes. The 2.6 merge has a few new things as well, I've changed things around so that metadata and log blocks will go onto the system dirty lists. This should make it easier to improve log performance, since most of the work will be done outside the journal locks. The code works for me, but should be considered highly experimental. In general, it is significantly faster than vanilla 2.6.0-test11, I've done tests with dbench, iozone, synctest and a few others. streaming writes didn't see much improvement (they were already at disk speeds), but most other tests did. Anyway, for the truly daring among you: ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.0-test11 The more bug reports I get now, the faster I'll be able to stabilize things. Get the latest reiserfsck and check your disks after each use. -chris
Re: Errors requiring --rebuild-tree in 2.4.23
On Thu, 2003-12-11 at 11:43, Jens Benecke wrote: Chris Mason wrote: On Thu, 2003-12-11 at 08:51, Jens Benecke wrote: Hi, I posted earlier about quota problems. We updated to 2.4.23 to get the logging patches, because some power failures made our /home partition spew out these: (QUESTIONS at the end of the mail) Sorry, before we got to the questions, what was the order of the events above? Oops. I guess I was a bit too confused myself. :) 1. Errors on /home in syslog, cron jobs running wild with i/o failures; the system kept running for a couple days because nobody was there to fix it, though Those errors were probably caused by power outages and a non-data-logging ReiserFS kernel. 2. Backup what's left of /home to firewire harddisk. 3. Update to 2.4.23 with Chris' patches for data logging/quota 4. Repartition hda2..4 (was needed anyway for drbd), reformat new /home (drbd), restore /home on drbd device 5. crash of the server overnight, reboot (don't know why yet) Ok, we need to better understand step 5 here. 6. couldn't reboot because root partition was totally b0rken 7. reiserfsck --rebuild-tree under Knoppix, killed a couple files 8. still running Knoppix, secondary server took over and is running now btw: Is there a reiserfs stress test kind of thing to make sure a configuration works before sending it two time zones away for production? I plan on doing that in the next couple weeks. =;) Would bonnie++ accomplish this or are there better tests? The best test is whatever that environment is going to use in production. I've got a ton of different scripts that get used based on different situations, most are ugly hacks. -chris
Re: v3 logging speedups for 2.6
On Thu, 2003-12-11 at 13:30, Dieter Nützel wrote: Am Donnerstag, 11. Dezember 2003 19:10 schrieb Chris Mason: Hello everyone, This is part one of the data logging port to 2.6, it includes all the cleanups and journal performance fixes. Basically, it's everything except the data=journal and data=ordered changes. The 2.6 merge has a few new things as well, I've changed things around so that metadata and log blocks will go onto the system dirty lists. This should make it easier to improve log performance, since most of the work will be done outside the journal locks. The code works for me, but should be considered highly experimental. In general, it is significantly faster than vanilla 2.6.0-test11, I've done tests with dbench, iozone, synctest and a few others. streaming writes didn't see much improvement (they were already at disk speeds), but most other tests did. Anyway, for the truly daring among you: ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.0-test11 The more bug reports I get now, the faster I'll be able to stabilize things. Get the latest reiserfsck and check your disks after each use. Chris, with which kernel should I start on my SuSE 9.0? A special SuSE 2.6.0-test11 + data logging? Or plain native? --- There are so many patches in SuSE kernels... For the moment you can only try it on vanilla 2.6.0-test11. The suse 2.6 rpms have acls/xattrs and the new logging stuff won't apply. Jeff and I will fix that when the logging merge is really complete. At the rate I'm going, that should be by the end of next week, this part of the merge was the really tricky bits. -chris
Re: Strange errors with 2.4.22 patches (from Chris) and bonnie++
On Tue, 2003-12-02 at 03:35, Jens Benecke wrote: Chris Mason wrote: On Fri, 2003-11-28 at 16:38, Jens Benecke wrote: b) bonnie++ on a (previously created) reiserfs partition (with mkreiserfs 3.6.6) exited with random disk full errors, although the disk was never full. This didn't happen before. (...) I would appreciate any help. Thank you! :-) Ugh, the new block allocator isn't properly forcing a commit when an allocation fails, so we don't reclaim blocks deleted in an uncommitted transaction. I thought this fix got pulled out of the suse kernel when namesys and I pulled out the important bug fixes, but it got missed. Hi Chris, so... the worst-case impact of this bug is that reiserfs will report disk full when you still have some space available. Right? No data loss, corruption, or similar Bad Things(tm)? Correct. I'll have a fix available today along with a remerge of data logging and quota against 2.4.23. I have two servers here that are supposed to be deployed next week and use ReiserFS. I've had some bad issues with MySQL files becoming corrupted after a crash in the past, so I'd really like to put these into production with data-logging patches. btw, what are the patches that SuSE uses? IIRC, SuSE ships with data-logging enabled, right? SUSE ships data logging and the xattr patches, along with a few others. -chris
Re: FW: reiser4 plugin for maildir
[reiser4 plugins to make mail delivery faster] There are a few basic things that slow down mail servers when they talk to filesystems: 1) multiple threads delivering to the same directory contend for the directory semaphore for creating new files 2) atomic creation of an entire file 3) high load leads to numerous small synchronous filesystem operations when multiple threads all try to deliver at once. #1 is best fixed at the VFS level, basically allowing more fine grained directory locks (lustre needs this as well I think). It could be dealt with entirely inside reiser4, but this is a generic need for linux as a whole I think. #2 and #3 could be dealt with in a reiser4 transaction subsystem. Basically the mail server would deliver a number of messages and trigger commits at regular intervals, and reiser4 would make sure the files were created atomically in the FS. The mail server would not report the file as delivered to the smtp client until the commit had completed. So, I definitely think you could use reiser4 to gain significant mail delivery performance. But you probably don't need a specific plugin outside the transaction subsystems. The obvious benefit of only using the transaction systems is that you won't need to teach mail clients about the reiser4 plugin. There's lots of other plugin topics related to teaching the FS about various file formats. This may or may not be a good idea in some cases, but it is a good place for research. I'd start by looking at the xattr interfaces and Hans' ideas for FS-as-a-database. -chris
Re: precise characterization of ext3 atomicity
On Thu, 2003-09-04 at 18:03, Andreas Dilger wrote: On Sep 05, 2003 01:32 +0400, Hans Reiser wrote: Andreas Dilger wrote: It is possible to do the same with ext3, namely exporting journal_start() and journal_stop() (or some interface to them) to userspace so the application can start a transaction for multiple operations. We had discussed this in the past, but decided not to do so because user applications can screw up in so many ways, and if an application uses these interfaces it is possible to deadlock the entire filesystem if the application isn't well behaved. That's why we confine it to a (finite #defined number) set of operations within one sys_reiser4 call. At some point we will allow trusted user space processes to span multiple system calls (mail server appliances, database appliances, etc., might find this useful). You might consider supporting sys_reiser4 at some point. Please rename sys_reiser4 if you want it to be a generic use syscall ;-) -chris
Re: write barrier patches for 2.4.21
On Wed, 2003-08-27 at 18:03, Tom Vier wrote: On Wed, Aug 27, 2003 at 10:41:03AM +0400, Oleg Drokin wrote: There was a discussion about that on Kernel Summit 2003 and general opinion was that SCSI does not need the WB stuff at all as it does the correct thing anyway. i found this, but no real details. do you have a better link? or could you tell me why scsi drives don't need wb's? afaik, they don't have nvram cache. so, the danger is still there, even if small (unless i'm wrong). http://lwn.net/Articles/40850/ scsi drives don't really need them because most scsi drives don't have write back caching on by default, and most actually listen when you turn the cache off. The scsi tag queuing makes good performance possible even without writeback caching. -chris
Re: BUG in reiserfs_write_full_page().
On Fri, 2003-07-18 at 09:45, Nikita Danilov wrote: So, in the case of reiserfs_write_full_page(), the BUG() is falsely triggered due to a transaction that was started on another filesystem (ext3). And the fix would simply be to do something along the lines of ext3... The reiserfs data logging actually does more than ext3 to make sure things get along, recording the super of the filesystem holding the transaction. So, it is actually possible to start a new non-nested transaction. This can result in a/b-b/a deadlock, right? Sorry, I wasn't clear. The transaction nesting code could detect and deal with it (making sure not to nest into an ext3 transaction or a reiserfs transaction on a different FS), but there are still other deadlocks to deal with. But, we shouldn't have to. Other parts of the OS should be protecting us from a writepage being called at this time, which is why the bug is there. Someone did a non-GFP_NOFS allocation with a transaction __bread()->__getblk()->__find_get_block()->find_get_page() allocates the page with bdev->bd_inode->i_mapping->gfp_flags, which is GFP_USER, that includes GFP_FS. You're in 2.5 land ;-) There seem to be a few problems there, I've got an oops in find_inode and a deadlock under load, but I still need to read the (huge) sysrq-t to figure things out. -chris
Re: BUG in reiserfs_write_full_page().
On Thu, 2003-07-17 at 17:19, Michael Gaughen wrote: Hello, We have a test machine that continues to BUG() in reiserfs_write_full_page(). The machine is running SLES8 (2.4.19-152, UP). Here is the (kdb) stack trace: Hmmm, the allocation masks are supposed to be set such that writepage won't get called. I'll take a look. How easy is it to reproduce? If you have any test cases that can trigger it, please send them along. -chris
Re: Horrible ftruncate performance
On Fri, 2003-07-11 at 11:44, Oleg Drokin wrote: Hello! On Fri, Jul 11, 2003 at 05:34:12PM +0200, Marc-Christian Petersen wrote: Actually I did it already, as data-logging patches can be applied to 2.4.22-pre3 (where this truncate patch was included). Maybe it _IS_ time for this _AND_ all the other data-logging patches? 2.4.22-pre5? It's Chris turn. I thought it is good idea to test in -ac first, though (even taking into account that these patches are part of SuSE's stock kernels). Well, I don't think that testing in -ac is necessary at all in this case. May be not. But it is still useful ;) I am using WOLK on many production machines with ReiserFS mostly as Fileserver (hundred of gigabytes) and proxy caches. I am using this code on my production server myself ;) If someone would ask me: Go for 2.4 mainline inclusion w/o going via -ac! :) Chris should decide (and Marcelo should agree) (Actually Chris thought it is good idea to submit data-logging to Marcelo now, too). I have no objections. Also now, that quota v2 code is in place, even quota code can be included. Also it would be great to port this stuff to 2.5 (yes, I know Chris wants this to be in 2.4 first) Marcelo seems to like being really conservative on this point, and I don't have a problem with Oleg's original idea to just do relocation in 2.4.22 and the full data logging in 2.4.23-pre4 (perhaps +quota now that 32 bit quota support is in there). 2.5 porting work has restarted at last, Oleg's really been helpful with keeping the 2.4 stuff up to date. -chris
Re: Horrible ftruncate performance
On Fri, 2003-07-11 at 13:27, Dieter Nützel wrote: 2.5 porting work has restarted at last, Oleg's really been helpful with keeping the 2.4 stuff up to date. Nice, but patches against the latest -aa could be helpful, then. Hmmm, the latest -aa isn't all that latest right now. Do you want something against 2.4.21-rc8aa1 or should I wait until andrea updates to 2.4.22-pre something? -chris
updated data logging available
Hello all, ftp.suse.com/pub/people/mason/patches/data-logging/2.4.22 has a merge of the data logging and quota code into 2.4.22-pre4 (should apply to -pre5). Overall, the performance of pre5 + reiserfs-jh is nice and smooth, I'm very happy with how things are turning out. Thanks to Oleg for merging data logging with his reiserfs warning patches and hole performance fixes. The relocation and base quota code is now in, so our number of patches is somewhat smaller. The SuSE ftp server might need an hour or so to update, please be patient if the patches aren't there yet. -chris
Re: reiserfs on removable media
On Wed, 2003-07-02 at 14:53, Hans Reiser wrote: This is called ordered data mode, and exists on ext3 and also reiserfs with Chris Mason's patches. Under normal usage it shouldn't change performance compared to writeback data mode (which is what reiserfs does by default). It had some impact, I forget exactly how much, maybe Chris can resuscitate his benchmark of it? The major cost of data=ordered is that dirty blocks are flushed every 5 seconds instead of every 30. The journal header patch in my experimental data logging directory changes things so that only new bytes in the file are done in data=ordered mode (either adding a new block or appending onto the end of the file). This helps a lot in the file rewrite tests. -chris
Re: reiserfs on removable media
On Wed, 2003-07-02 at 15:08, Dieter Nützel wrote: Am Mittwoch, 2. Juli 2003 20:59 schrieb Chris Mason: On Wed, 2003-07-02 at 14:53, Hans Reiser wrote: This is called ordered data mode, and exists on ext3 and also reiserfs with Chris Mason's patches. Under normal usage it shouldn't change performance compared to writeback data mode (which is what reiserfs does by default). Chris, I thought data=ordered is the new default with your patch? It is. It had some impact, I forget exactly how much, maybe Chris can resuscitate his benchmark of it? The major cost of data=ordered is that dirty blocks are flushed every 5 seconds instead of every 30. The journal header patch in my experimental data logging directory changes things so that only new bytes in the file are done in data=ordered mode (either adding a new block or appending onto the end of the file). This helps a lot in the file rewrite tests. Which is faster with your patches? ordered|journal|writeback? I thought the order is: writeback > ordered > journal ;-) Usually ;-) ordered is faster in a few rare benchmarks because it helps keep the number of dirty buffers lower and generally sends the dirty buffers to the disk in a big flood. journal is faster for some fsync heavy benchmarks. For practical desktop usage, data=ordered and writeback are usually close to each other. -chris
Re: updated data logging available
On Tue, 2003-07-01 at 20:46, Manuel Krause wrote: Setting HZ=1000 (from 100) in linux/include/asm/param.h gives me a very impressive latency boost. 2.4.21-rc1-jam1 (-rc1aa1) Just tried this HZ=1000 setting, too. (With the following patches only: data-logging, search_reada-4 and rml-preempt on 2.4.21-final, and AA.00_nanosleep-6.diff (THAT ONE decreased my VMware+system CPU idle usage in these circumstances by approx. 1/2 [from 25% to approx. 12.5%] ). Some userspace tools depend on the HZ value. It's clear the io-stalls-7 patch can't be the final one, I need to add userspace knobs to tune things towards latency or multi-writer throughput (basically server or desktop workloads). -chris
Re: vpf-10680, minor corruptions
On Fri, 2003-06-27 at 12:13, Oleg Drokin wrote: Hello! On Fri, Jun 27, 2003 at 04:38:00PM +0400, Oleg Drokin wrote: I was looking in the wrong direction, when I produced that patch, so it will produce zero output. I hope to come up with ultimate fix soon enough. ;) Well, there is a patch below that does *not* work for me ;) But it should work. I have traced the new problem to a cross compiler that compiles code in a different way than the native compiler for whatever reason (demo is attached as test.c program, it should print 'result is 1' if it is compiled correctly and stuff about unknown uniqueness if it is miscompiled. In fact maybe this is just correct compiler behaviour.) I now think that when I compile a kernel with the native compiler, it should work with the below patch. But I can verify that only tomorrow it seems. You might try that patch as well to see if it helps you before I try it ;) The patch is obviously the correct one. (except that it does not work with my cross compiler and the kernel does work without the patch, which is really-really strange). Most of these changes are in 2.4.21, which I've been using on an AMD 64-bit box for a while without any problems. The bug should be somewhere else, it looks to me like these spots aren't trying to send an unsigned long to disk. -chris
Re: 2.4.21 reiserfs oops
On Tue, 2003-06-24 at 16:34, Nix wrote: On Tue, 24 Jun 2003, Oleg Drokin moaned: Hello! On Mon, Jun 23, 2003 at 11:16:27PM +0100, Nix wrote: Jun 22 13:52:42 loki kernel: Unable to handle kernel NULL pointer dereference at virtual address 0001 This is very strange address to oops on. I'll say! Looks almost like it JMPed to a null pointer or something. No, if it'd jumped to a NULL pointer, we'd see 0 in EIP. JMPed to ((long)NULL)+1 or something then :) the fact remains that it's not somewhere that even a memory error would make us likely to jump to. Jun 22 13:52:43 loki kernel: EIP:0010:[c0092df4]Not tainted The EIP isn't zero or 1, you've got a bad null pointer dereference at address 1. You get this when you do something like *(char *)1 = some_val. The RAM is most likely bad, you're 1 bit away from zero, but you might try a reiserfsck on any drives affected by the scsi errors. -chris
Re: xattr
On Mon, 2003-06-16 at 08:26, Russell Coker wrote: What is the status of xattr support in 2.5.x? How is journalling of xattr's being handled? For correct operation of SE Linux we need to have the xattr that is used for the security context be changed atomically, and if a file is created and immediately has the xattr set then ideally we would have the file creation and the xattr creation in the same journal entry. Is this possible? If doing this requires that the file system be mounted with data=journal then this will be fine. How big are the xattrs you have in mind? We can get atomic writes of 4k in length but beyond that things get more difficult. As for the xattr and the create in the same transaction, that's a little harder. We'd probably need a new syscall, or to change the semantics of the xattr call such that creating an xattr on a file that doesn't exist also creates the file. -chris
Re: xattr
On Thu, 2003-06-19 at 10:46, Russell Coker wrote: On the topic of atomic xattr operations on ReiserFS as needed for the new LSM/SE Linux operations. On Thu, 19 Jun 2003 23:52, Chris Mason wrote: How big are the xattrs you have in mind? We can get atomic writes of 4k in length but beyond that things get more difficult. Most of them will be less than 80 bytes. They are currently of the form: user-name:object_r:type The user-name is the Unix account name which usually isn't much more than 8 bytes. The type is usually less than 15 bytes (the longest I've used so far is 20 bytes). So the longest value I've used is 38 bytes. Then data=journal mode will do what you want. You'll get atomic writes up to 4k. If you really don't want data=journal for the rest of the FS, we can make an option for using data logging on xattr files only. Jeff and I had wanted to avoid the complication but it is at least possible. Also they can't be chosen arbitrarily by the user. The user gets some small control over the type within a range of types that the administrator permits. If the administrator permits overly long type names and has to deal with non-atomicity as a result then it's their issue. If you can guarantee atomic operations on 160 byte values (twice what I expect anyone to use) then it'll be fine. As for the xattr and the create in the same transaction, that's a little harder. We'd probably need a new syscall, or to change the semantics of the xattr call such that creating an xattr on a file that doesn't exist also creates the file. Creating a file by creating the xattr sounds like a bad idea as you can't control the Unix permissions of the file. This isn't much of a big deal with SE Linux as the security type determines who can access the file. But for other uses it may be a serious problem. I agree that we need a new syscall and other people had the same idea before either of us. Maybe ReiserFS could be used as the first implementation of this proposed new syscall... 
It would be best to go through Andreas Gruenbacher for xattr API changes. He's quite reasonable. -chris
Re: xattr
On Thu, 2003-06-19 at 13:25, Stephen Smalley wrote: On Thu, 2003-06-19 at 11:21, Chris Mason wrote: Ok, so in the new api, the xattr information is available at the time of the create. reiserfs would be able to include it all into the same transaction but doesn't do it right now. This was true of the old API as well; the only difference is whether the attribute is specified as a parameter to an extended open/mkdir/etc call or whether it is set separately as a process attribute that is applied to subsequent ordinary open/mkdir/etc calls. Including the setxattr in the same transaction as the create is not strictly necessary, although it would be nice. The SELinux API change didn't change the create+set atomicity, which is still provided by performing the set before the parent directory semaphore is released. Ok, I get it. You would need a special reiserfs xattr add-on patch to get the atomicity right. First we need to get the data logging code in (which Hans has agreed to); getting the xattr code in depends on Hans, and Jeff Mahoney will maintain it as an external patch otherwise. My impression (possibly wrong) is that Hans prefers a EA-as-file model, which I understand is also the Solaris model. The key question then becomes whether mainline reiser{3,4} will ever support the xattr inode operations (e.g. implementing them as reads/writes of the EA files associated with the inode). If not, then neither the SELinux module nor the SELinux userland will be able to access file security attributes on reiser{3,4} in the same manner as on ext[23], xfs, or jfs. The reiserfsv3 xattr patches maintained by SuSE implement the xattr api (acl.bestbits.at). The xattrs happen to be implemented as individual files on disk because reiserfs is so well suited for it, and because it allowed Jeff to code them without changing the v3 disk format. But, these files are only available through the xattr api right now, and they are not visible via tools like ls etc. 
Not sure how namesys is doing things in version 4, but I'd bet they are willing to talk about making it work with SELinux. -chris
updated data logging available
Hello all, This doesn't have the data=ordered performance fixes because I got distracted fixing io latency problems in 2.4.21. Those were screwing up my benchmarks, so I couldn't really tell if things were faster or not ;-) Anyway, I'm back on track now, and since 2.4.21 is out I've just copied Oleg's merge into my data logging directory. I'll add the experimental performance patches later today. But, the code in my data logging directory now is what I would like to see merged into 2.4.22-pre asap (pending namesys approval), so review and testing would be appreciated. ftp.suse.com/pub/people/mason/patches/data-logging/2.4.21 It might take 30 minutes or so for the rsync to complete. -chris
Re: Timeframe for 2.4.21 quota patches?
On Mon, 2003-06-16 at 10:45, Christian Mayrhuber wrote: What has happened to 10-quota-link-fix.diff? Is it not necessary any more? I'm asking because it is still mentioned in the README, but seems to have been replaced by 10-quota_nospace_fix.diff Whoops, I screwed that up.
Re: Recent spam
On Thu, 2003-06-05 at 09:26, Hans Reiser wrote: This is a bit inconsiderate of you, don't you think? Why don't you just unsubscribe? I have a sysadmin with too many tasks to get them done, and he doesn't need his time wasted with such crap as dealing with spamcop. Honestly Hans, this has come up a bunch of times over the last few months. The reiserfs mail server setup doesn't play nicely with automated spam reporting tools, and many people other than Russell use them. There have been a ton of suggestions on ways to make things better, and they range from 5 minute tasks to complex ones. They've largely been ignored. I'd really rather not lose valuable contributors to the list over something as stupid as spam. -chris
Re: [PATCH] various allocator optimizations
On Fri, 2003-03-14 at 08:59, Manuel Krause wrote: On 03/14/2003 02:34 AM, Chris Mason wrote: On Thu, 2003-03-13 at 19:15, Hans Reiser wrote: [ discussion on how to implement lower fragmentation on ReiserFS ] Let's get lots of different testers. You may have a nice heuristic here though If everyone agrees the approach is worth trying, I'll make a patch that enables it via a mount option. [...] A dumb question inbetween: How do we - possible testers, users - get information about fragmentation on our ReiserFS partitions? The best tool I've seen so far originally came from Vladimir and was modified for a study on fragmentation of reiserfs and ext2, Jeff found the link somewhere in his archives: http://www.informatik.uni-frankfurt.de/~loizides/reiserfs/index.html There are also filesystem aging tools there that I haven't played with yet. -chris
Re: [PATCH] new data logging and quota patches available
On Sat, 2003-02-22 at 20:04, Manuel Krause wrote: On 02/23/2003 01:50 AM, Manuel Krause wrote: On 02/22/2003 05:12 PM, Chris Mason wrote: And here's one more patch for those of you that want to experiment a little. reiserfs_dirty_inode gets called to log the inode every time the vfs layer does an atime/mtime/ctime update, which is one of the reasons mounting with -o noatime and -o nodiratime makes things faster. We had to do this because otherwise kswapd can end up trying to write inodes, which can lead to deadlock as it tries to wait on the log. One of the patches in my data logging directory is kinoded-8, which changes things so a new kinoded does all inode writeback instead of kswapd. That means that if you apply up to 05-data-logging-36 and then apply 09-kinoded-8 (you won't need any of the other quota patches), you can also apply this patch. It changes reiserfs to leave inodes dirty, which saves us lots of time on atime/mtime updates. I'll upload this after it gets a little additional testing, but wanted to include it here in case anyone else was interested in benchmarking. [11.dirty-inodes-for-kinoded.diff] Hi Chris! At least I'm not able to copy through my partition via cp -a ... with the last supposed setup + preempt without quota ... When umounting the destination partition it says the device is busy. I don't know what has gone wrong on my repository partition, but I know from earlier times this should not happen. Sorry, I was incomplete with my report, at least with the following: After thinking over the remount strategy for some minutes with a pizza, the partition could be unmounted again. I don't have exact values, but it was under about 15 minutes max. with 3.5GB copied previously. Hmmm, I think I need to change kinoded around slightly. Thanks for giving it a try... -chris
Re: [PATCH] new data logging and quota patches available
On Sat, 2003-02-22 at 08:54, Oleg Drokin wrote: A replacement 05-data-logging-36.diff.gz file that applies to 2.4.21-pre4-ac5 is available at ftp://namesys.com/pub/reiserfs-for-2.4/testing/05-data-logging-36-ac5.diff.gz Thanks Oleg. It compiles, boots, survives my (simple) testing. (writing this email from patched 2.4.21-pre4-ac5, too). Quota works. Symlinks now have correct block counts too. The reason for the rejects is mostly the DIRECTIO fix that also went into the current bk snapshot, so it will probably apply to Marcelo's bk tree, too. Chris: Is it intended that directio only works on data=writeback mounted filesystems? Yes, the way ordered writes work is the buffers are put onto a per-transaction list that gets flushed before the commit. Since buffers can only be on one list, this means they don't get onto the list of buffers for that particular inode, and that makes it difficult to make sure all pending io to the file is finished before allowing the directio to proceed (just like the tail alias bug). The only way to do it is by forcing a commit before the directio, which would be horribly slow, so I've disabled the data=ordered o_direct support. The real solution is to allocate a data structure and point to it from the private journal pointer in the bh. That requires some other changes, and other performance stuff is more important right now. Also, the following README file diff should be considered: Thanks, I'm uploading a few more optimizations shortly, I'll include this change. -chris
Re: [PATCH] new data logging and quota patches available
On Sat, 2003-02-22 at 07:41, Dieter Nützel wrote: Which files are needed for 2.4.21-pre4aa3? ftp.suse.com/pub/people/mason/patches/data-logging/2.4.21-aa I just uploaded a fresh set of data logging and quota patches for -aa -chris
[PATCH] new data logging and quota patches available
Hello all, ftp.suse.com/pub/people/mason/patches/data-logging/2.4.21 will soon be updated with a new set of data logging and quota patches against 2.4.21-pre4. The data logging code is updated with another set of io stalling fixes; they should improve performance of data=ordered and data=writeback by being smarter about forcing commits under heavy write load and kicking kreiserfsd. Treat these with care: they've gotten a ton of testing under the suse kernel, but the port to vanilla was just done today. The quota patches include a fix for incorrect sd_block counts on symlinks. -chris
Re: no reiserfs quota in 2.4 yet? 2.4.21-pre4-ac4 says different
On Mon, 2003-02-17 at 12:39, Ookhoi wrote: Hi Reiserfs team, Today I put a new kernel on a server which has reiserfs and needs quota. I searched for the quota patches (found them in the mail archive) and saw that they are very old: ftp://ftp.namesys.com/pub/reiserfs-for-2.4/testing/quota-2.4.20 3 December 2002. They don't apply to a current kernel. Well, 2.4.20 is the current kernel ;-) Which kernel do you want them against? I've got patches against 2.4.21-preX in testing here, but not against -ac. They should merge against -ac more easily now, but I haven't had time to really test it. Do you want to try the merge on -ac, or would you rather try against 2.4.21-preX? -chris
Re: What is [PATCH] 02-directio-fix.diff (namesys.com) for?
On Mon, 2003-02-17 at 15:55, Manuel Krause wrote: Hi! Is this patch from 030213 needed by anyone using ReiserFS with 2.4.20 and 2.4.21-preX? What does DIRECT IO with reiserfs mean, from the topic line of the patch: # reiserfs: Fix DIRECT IO interference with tail packing ? It fixes a bug where a recently unpacked tail might race to the disk with bytes modified via DIRECT IO. The common way to trigger the bug is via a mixture of direct io and regular file access at the same time. Most people won't see the bug, since it is uncommon to mix regular and direct io that way. -chris
Re: when distros do not support official Marcelo kernels they are not being team players (was Re: reiserfs on redhat advanced server?)
On Mon, 2003-02-03 at 09:20, Hans Reiser wrote: It is different from refusing to support the user who downloads Marcelo's kernel after it does ship (after the distro CD went into the stamping plant). That is what I am complaining about. The default should be to support all Marcelo kernels unless there is a motivated reason not to (e.g. he ships a broken NFS kernel and the user is complaining about NFS). Users should feel that they can download any latest official stable kernel (it is okay though to tell them to check a website created by the distro to see if it is a known bad/unsupported kernel), and everything will be fine with the distro. When distros don't do this, they are not being team players. Hans, the vanilla kernels are lacking both bug fixes and features that are critical to what our users are doing. Even if the bug fixes all got in, there are various reasons the features probably won't. If there was any vanilla kernel that had everything we needed, we'd support it, and do a dance around a bonfire made from all of our patch maintenance scripts and code. The whole point of buying the distro is that you don't have the time and energy to collect and compile every application and turn it into something you can easily install on your personal machine. The kernel is one of those applications. Feel free to replace it, but it doesn't make sense to expect us to help you fix the problems when we don't have control over the configuration, compile or sources. That would be like switching engines in your car and expecting the original car company to do a warranty repair on the new engine. -chris (speaking only for himself and not SuSE)