Re: vpf-10680, minor corruptions
Hello!

On Fri, Jun 27, 2003 at 12:23:07PM -0400, Chris Mason wrote:
> Most of these changes are in 2.4.21, which I've been using on an AMD64

Not the reiserfs_file_write() ones.

> bit box for a while without any problems. The bug should be somewhere
> else, it looks to me like these spots aren't trying to send an unsigned
> long to disk.

The reiserfs_file_write() code has an array of b_blocknr_t elements.
It then submits this array to reiserfs_paste_into_item/reiserfs_insert_item,
but b_blocknr_t is unsigned long (read - 64 bit on alpha - oops).
Funny thing is, when I declare b_blocknr_t as u32, the kernel basically falls
apart if cross compiled. E.g. key comparison does not work and all kinds of
weird things start to happen.

In short - if you want to make sure the bug is there - compile 2.5.70+ code on
any 64 bit platform, write any file bigger than 2 blocks, unmount and remount
the fs and see what's in the file.

Bye,
    Oleg
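The size mismatch described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the type and function names (mem_blocknr_t, disk_blocknr_t, copy_raw, copy_narrowed) are hypothetical and are not taken from the reiserfs source, a 64-bit "unsigned long" is modelled with uint64_t, and on-disk byte order is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins, not the reiserfs definitions. */
typedef uint64_t mem_blocknr_t;   /* models "unsigned long" on a 64-bit box such as alpha */
typedef uint32_t disk_blocknr_t;  /* models a 32-bit on-disk block pointer */

/* Wrong: copy the in-memory array byte for byte into an item made of
 * 32-bit slots. Returns the number of 32-bit slots the copy occupies. */
static size_t copy_raw(const mem_blocknr_t *src, size_t n, void *item)
{
    memcpy(item, src, n * sizeof(*src));
    return n * sizeof(*src) / sizeof(disk_blocknr_t);
}

/* Right: narrow each element explicitly, one slot per block number. */
static size_t copy_narrowed(const mem_blocknr_t *src, size_t n,
                            disk_blocknr_t *item)
{
    for (size_t i = 0; i < n; i++) {
        assert(src[i] <= UINT32_MAX);   /* block number must fit in a slot */
        item[i] = (disk_blocknr_t)src[i];
    }
    return n;
}
```

Called with three block numbers, copy_raw() fills six 32-bit slots where copy_narrowed() fills three, so every other slot in the raw copy holds half of a neighbouring 64-bit value instead of a block number. On a 32-bit platform both types would have the same width and the two copies would agree.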
Re: vpf-10680, minor corruptions
Oleg Drokin wrote:
> I have traced the new problem to a cross compiler that compiles
> code in a different way than native compiler for whatever reason
> (demo is attached as test.c program, it should print "result is 1"

yes, that's what it prints, no warnings were shown.

> You might try that patch as well to see if it helps you before I try it ;)

yes, compiling with _this_ patch but _not_ with the last patch you sent (file.c) is under way again...

Thank you,
Christian.
Re: vpf-10680, minor corruptions
On Fri, 2003-06-27 at 12:13, Oleg Drokin wrote:
> Hello!
>
> On Fri, Jun 27, 2003 at 04:38:00PM +0400, Oleg Drokin wrote:
>
> > I was looking in the wrong direction, when I produced that patch,
> > so it will produce zero output.
> > I hope to come up with ultimate fix soon enough. ;)
>
> Well, there is a patch below that does *not* work for me ;)
> But it should work.
> I have traced the new problem to a cross compiler that compiles
> code in a different way than native compiler for whatever reason
> (demo is attached as test.c program, it should print "result is 1"
> in case it is compiled correctly and stuff about unknown
> uniqueness if it is miscompiled. In fact may be this is just correct compiler
> behaviour.)
> I now think that when I compile a kernel with native compiler, it should work
> with below patch. But I can verify that only tomorrow it seems.
> You might try that patch as well to see if it helps you before I try it ;)
> The patch is "obviously correct" one. (except that it does not work
> with my cross compiler and kernel does work without patch which is really-really
> strange).
>

Most of these changes are in 2.4.21, which I've been using on an AMD64
bit box for a while without any problems. The bug should be somewhere
else, it looks to me like these spots aren't trying to send an unsigned
long to disk.

-chris
Re: vpf-10680, minor corruptions
Hello!

On Fri, Jun 27, 2003 at 04:38:00PM +0400, Oleg Drokin wrote:
> I was looking in the wrong direction, when I produced that patch,
> so it will produce zero output.
> I hope to come up with ultimate fix soon enough. ;)

Well, there is a patch below that does *not* work for me ;)
But it should work.
I have traced the new problem to a cross compiler that compiles
code in a different way than native compiler for whatever reason
(demo is attached as test.c program, it should print "result is 1"
in case it is compiled correctly and stuff about unknown
uniqueness if it is miscompiled. In fact may be this is just correct compiler
behaviour.)
I now think that when I compile a kernel with native compiler, it should work
with below patch. But I can verify that only tomorrow it seems.
You might try that patch as well to see if it helps you before I try it ;)
The patch is "obviously correct" one. (except that it does not work
with my cross compiler and kernel does work without patch which is really-really
strange).

= fs/reiserfs/bitmap.c 1.26 vs edited =
--- 1.26/fs/reiserfs/bitmap.c Sun May 18 01:09:36 2003
+++ edited/fs/reiserfs/bitmap.c Fri Jun 27 16:58:44 2003
@@ -43,7 +43,7 @@
     test_bit(_ALLOC_ ## optname , &SB_ALLOC_OPTS(s))
 
 static inline void get_bit_address (struct super_block * s,
-                                    unsigned long block, int * bmap_nr, int * offset)
+                                    b_blocknr_t block, int * bmap_nr, int * offset)
 {
     /* It is in the bitmap block number equal to the block
      * number divided by the number of bits in a block. */
@@ -54,7 +54,7 @@
 }
 
 #ifdef CONFIG_REISERFS_CHECK
-int is_reusable (struct super_block * s, unsigned long block, int bit_value)
+int is_reusable (struct super_block * s, b_blocknr_t block, int bit_value)
 {
     int i, j;
 
@@ -107,7 +107,7 @@
 static inline int is_block_in_journal
 (struct super_block * s, int bmap, int off, int *next)
 {
-    unsigned long tmp;
+    b_blocknr_t tmp;
 
     if (reiserfs_in_journal (s, bmap, off, 1, &tmp)) {
         if (tmp) { /* hint supplied */
@@ -235,7 +235,7 @@
 /* Tries to find contiguous zero bit window (given size) in given region of
  * bitmap and place new blocks there. Returns number of allocated blocks. */
 static int scan_bitmap (struct reiserfs_transaction_handle *th,
-                        unsigned long *start, unsigned long finish,
+                        b_blocknr_t *start, b_blocknr_t finish,
                         int min, int max, int unfm, unsigned long file_block)
 {
     int nr_allocated=0;
@@ -281,7 +281,7 @@
 }
 
 static void _reiserfs_free_block (struct reiserfs_transaction_handle *th,
-                                  unsigned long block)
+                                  b_blocknr_t block)
 {
     struct super_block * s = th->t_super;
     struct reiserfs_super_block * rs;
@@ -327,7 +327,7 @@
 }
 
 void reiserfs_free_block (struct reiserfs_transaction_handle *th,
-                          unsigned long block)
+                          b_blocknr_t block)
 {
     struct super_block * s = th->t_super;
 
@@ -340,7 +340,7 @@
 
 /* preallocated blocks don't need to be run through journal_mark_freed */
 void reiserfs_free_prealloc_block (struct reiserfs_transaction_handle *th,
-                          unsigned long block) {
+                          b_blocknr_t block) {
     RFALSE(!th->t_super, "vs-4060: trying to free block on nonexistent device");
     RFALSE(is_reusable (th->t_super, block, 1) == 0, "vs-4070: can not free such block");
     _reiserfs_free_block(th, block) ;
@@ -589,15 +589,15 @@
 
 static inline int old_hashed_relocation (reiserfs_blocknr_hint_t * hint)
 {
-    unsigned long border;
-    unsigned long hash_in;
+    b_blocknr_t border;
+    u32 long hash_in;
 
     if (hint->formatted_node || hint->inode == NULL) {
         return 0;
     }
 
     hash_in = le32_to_cpu((INODE_PKEY(hint->inode))->k_dir_id);
-    border = hint->beg + (unsigned long) keyed_hash(((char *) (&hash_in)), 4) % (hint->end - hint->beg - 1);
+    border = hint->beg + (u32) keyed_hash(((char *) (&hash_in)), 4) % (hint->end - hint->beg - 1);
     if (border > hint->search_start)
         hint->search_start = border;
 
@@ -606,7 +606,7 @@
 
 static inline int old_way (reiserfs_blocknr_hint_t * hint)
 {
-    unsigned long border;
+    b_blocknr_t border;
 
     if (hint->formatted_node || hint->inode == NULL) {
         return 0;
@@ -622,7 +622,7 @@
 static inline void hundredth_slices (reiserfs_blocknr_hint_t * hint)
 {
     struct key * key = &hint->key;
-    unsigned long slice_start;
+    b_blocknr_t slice_start;
 
     slice_start = (keyed_hash((char*)(&key->k_dir_id),4) % 100) * (hint->end / 100);
     if ( slice_start > hint->search_start || slice_start + (hint->end / 100) <= hint->search_start) {
@@ -910,7 +910,7 @@
 int reiserfs_can_fit_pages ( struct super_block *sb /* superblock of filesystem
Re: vpf-10680, minor corruptions
Oleg Drokin wrote:
> Try to compile with CONFIG_REISERFS_CHECK=y the kernel that known-bad for you.
> (e.g. 2.5.72/2.5.73)

yes, 2.5.72 with CONFIG_REISERFS_CHECK=y is compiling now.

over night the alpha finished compiling 2.5.65 and 2.5.69. i had to compile reiserfs statically, because inserting modules gave these "Invalid module format" errors.

under both (2.5.65+2.5.69) i was able to mkreiserfs sde2. mounting the fs went ok, but copying data (cp -a /lib /mnt/reiserfs) brought several kernel errors (see https://ephigenie.kicks-ass.net/browse/reiserfs/).

but: diff -r showed _no_ differences between the directories, and a following reiserfsck brought no vpf-10680 anymore!

so i'd say the problem occurs somewhere between 2.5.69 and 2.5.70.

thanks,
Christian.
Re: vpf-10680, minor corruptions
Hello!

On Wed, Jun 25, 2003 at 02:42:24AM +0200, Christian Kujau wrote:
> (/lib/modules/2.5.65/kernel/fs/reiserfs/reiserfs.ko): Invalid module format
> lila:~# uname -a
> Linux lila 2.5.65 #4 Wed Jun 25 00:48:46 CEST 2003 alpha GNU/Linux
> i compiled the module with CONFIG_REISERFS_CHECK=y.
> shall i go on with 2.5.64 or better 2.5.67 ?

Try to compile with CONFIG_REISERFS_CHECK=y the kernel that is known-bad for you
(e.g. 2.5.72/2.5.73).

Bye,
    Oleg
Re: vpf-10680, minor corruptions
Christian Kujau wrote:
> of course, the best thing i can do is the el-cheapo-hacking approach:
> compiling 2.5.60...up to 2.5.72 and see *when* it breaks. hm, compiling
> a 2.5 kernel takes 180min on this machine. but anyway, i'll start with
> 2.5.60 now, see what it gives.

no, i started with 2.5.66 but the kernel did not compile. 2.5.65 did compile (don't ask how long) and has already booted. but trying to mount the newly created reiserfs gives:

module reiserfs: Relocation overflow vs section 9

in the log. the reiserfs module was not loaded. "modprobe reiserfs" gives:

lila:~# modprobe reiserfs
FATAL: Error inserting reiserfs (/lib/modules/2.5.65/kernel/fs/reiserfs/reiserfs.ko): Invalid module format
lila:~# uname -a
Linux lila 2.5.65 #4 Wed Jun 25 00:48:46 CEST 2003 alpha GNU/Linux

i compiled the module with CONFIG_REISERFS_CHECK=y.
shall i go on with 2.5.64 or better 2.5.67 ?

good night,
Christian.
Re: vpf-10680, minor corruptions
Oleg Drokin wrote:
> I see that you have used 2.5.70 and earlier kernels on alpha too.
> Do you have any idea of when stuff broke for you?

hm, i used 2.5.6x kernels too on this machine, but i noticed the vpf-10680 for the first time with 2.5.70.

of course, the best thing i can do is the el-cheapo-hacking approach: compiling 2.5.60...up to 2.5.72 and see *when* it breaks. hm, compiling a 2.5 kernel takes 180min on this machine. but anyway, i'll start with 2.5.60 now, see what it gives.

> You are certainly not the one person with alpha and 2.5, but I do not know if
> others are using reiserfs.

you gotta send ads (read: spam) to all the linux-alpha lists :-)

> BTW, have you tried to run with CONFIG_REISERFS_CHECK enabled to see if it will
> break and panic in kernel or something like that?

no, only CONFIG_REISERFS_PROC_INFO, but i'll do so now.

Thanks,
Christian.
Re: vpf-10680, minor corruptions
Hello!

On Mon, Jun 23, 2003 at 03:38:20PM +0200, Christian Kujau wrote:
> as stated before, the corruptions occur only on this very alpha machine,

Well, I still cannot build the kernel myself and still working on it.
(having "make: *** [vmlinux] Error 139" and zero length vmlinux)
BTW, I realised that I have not looked into your kernel config for that box,
can you send it to me please?

> bread: Cannot read the block (523914): (Input/output error).

Hm, but still it means kernel returned some error for read request.

> hah! i was not aware that the disk might have an hw problem, not a
> single error ever showed up in my logs. this was weird. so i
> re-partitioned the disk with a 10MB sde (to circumvent the bread error)
> on the beginning and a 2 GB sde2. now reiserfsck/cp/diff are all working
> fine under 2.4.21, but 2.5.72 is still erroneous.

Sigh.

>
> btw: i am still using reiserfsprogs 3.6.8 now (since debian/testing has
> 3.6.6) and i have compiled these utils under a 2.5.72 kernel. is it safe
> to use them under 2.4 ?

I see that you have used 2.5.70 and earlier kernels on alpha too.
Do you have any idea of when stuff broke for you?

Bye,
    Oleg
Re: vpf-10680, minor corruptions -- oooh!
Oleg Drokin wrote:
> Well, normally reiserfs is caring about consistency. There are two
> noticeable omissions, though:
> 1. if the unexpected shutdown was because of power loss and you have write
> cache enabled and your drive reorders write requests, then it is possible
> invalid data gets written to disk, before "transaction is finished" mark is
> written to the drive.

yes, the on-disk write cache. this could indeed be a problem that is hard to cover for any fs. i could disable it, yes.

> So can you say check/fix the fs, mount it write some files to it, unmount it
> and run fsck again to see if everything is ok?

oh, oh! i was about to answer this question with a plain "Yes".

ok, with --fix-fixable the corruptions got fixed, and a reiserfsck went O.K. with "no corruptions". i mounted the device yesterday, but no files were written to it until today. now, i've just unmounted the partition, reiserfsck went O.K. again, no corruptions. mounted again, i created a directory on the fs and copied 329 files into it (cp -a /lib /path-to-reiser-fs/). unmounted, and reiserfsck found 131 corruptions in an instant:

lila:~# reiserfsck /dev/sde2
[...]
Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
### reiserfsck --check started at Thu Jun 19 16:51:49 2003 ###
Replaying journal..
0 transactions replayed
Checking internal tree..finished
Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
Checking Semantic tree:
/temp/lib/libnss_files-2.3.1.sovpf-10680: The file [3214 3538] has the wrong block count in the StatData (104), should be (56)
[...]
finished
131 found corruptions can be fixed with --fix-fixable
### reiserfsck finished at Thu Jun 19 16:52:04 2003 ###
lila:~# find /lib/ | wc -l
329
lila:~#

the pathnames (/temp/lib/...) are the same files i just copied to the fs. i was not aware of a reproducible bug (?) at all on this issue. the fs is used heavily once a week, but rarely _during_ the week. this could be why i never noticed the errors before, or they were fixed by a journal replay at boot time.

fyi only: i have _one_ weird issue with this alpha: it has 128 MB RAM inside, but only 64 MB are recognized. putting 64 MB into it gives 32 MB, but 32 MB is still 32MB. this is odd, but kernel compiling / heavy load causes no oopses. well, i got some with 2.5.6x kernels, but this is long ago. and: the hd is a little old, it's a ST34573N (SCSI, 2 GB). but there are no odd kernel messages or failures in the log. i mention this because "bad RAM" and similar issues are often on-topic here.

Thank you,
Christian.

PS: sorry for the delay. mail probs.
Re: vpf-10680, minor corruptions
Hello!

On Wed, Jun 18, 2003 at 08:01:12PM +0200, Christian Kujau wrote:
> >Hm, interesting. Do you had crashes/unexpected shutdowns before
> >corruptions appears
> >or are they appear without any reason at all?
> i had this issue once before -- did a check and noticed vpf-10680/some
> corruptions. but these must have been from an crash.
> but now, i think as i rebooted the machine yesterday (because i upgraded
> to kernel 2.5.72) the journal was checked (replayed?) anyway at boot:
> found reiserfs format "3.6" with standard journal
> Reiserfs journal params: device sde2, size 8192, journal first block 18,
> max trans len 1024, max batch 900, max commit age 30, max trans age 30
> reiserfs: checking transaction log (sde2) for (sde2)
> Using r5 hash to sort names
> (from dmesg, booting process)

No, there is no sign of replaying journal.
If there was a replay, you'd normally see an "x transactions replayed in y seconds" message.

> and i thought the fs is "O.K." at least after boot, because ReiserFS
> cares about consistency for itsself. if not, the corruptions are likely
> from the unclean shutdowns. but that would mean, that i still have to
> manually reiserfsck from time to time.

Well, normally reiserfs does care about consistency. There are two
noticeable omissions, though:
1. if the unexpected shutdown was because of power loss and you have write
   cache enabled and your drive reorders write requests, then it is possible
   that invalid data gets written to disk before the "transaction is finished"
   mark is written to the drive.
   (there is a way to avoid this with some drives, by explicitly flushing the
   drive cache in some cases, but this method seems to create some problems of
   its own. So this is not yet merged in any mainstream kernel).
2. there is no protection against kernel bugs.

The first usually leads to bitmap problems, but I have also seen names pointing to nowhere.
Your corruption is somewhat strange in that the number of blocks in statdata
is ~2x bigger than it should be (on several files).
Sounds like a pattern to me.

> btw, is there a switch like "Maximum mount count before doing the next
> fsck while booting"?

No.

> >Well, I guess it's time to clear the dust off our alpha and do some
> >testing.
> hehe, should it be architecture related?

This is also possible.
So can you, say, check/fix the fs, mount it, write some files to it, unmount it
and run fsck again to see if everything is ok?

Thank you.

Bye,
    Oleg
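One way to read point 1 above is as a small simulation: a transaction is a few journal blocks followed by a commit mark, and the drive may put them on the platter in a different order than they were issued. The sketch below is a toy model only. The struct, the function names and the three-block transaction are hypothetical, and nothing here is taken from the reiserfs journal code.

```c
#include <stdbool.h>
#include <stddef.h>

enum { N_DATA = 3 };   /* hypothetical: journal blocks in one transaction */

struct disk {
    bool data[N_DATA]; /* true once that journal block is on the platter */
    bool commit;       /* true once the "transaction is finished" mark is */
};

/* Apply the first `done` writes of `order`, then lose power.
 * A value i < N_DATA means data block i; the value N_DATA is the mark. */
static struct disk crash_after(const int *order, size_t done)
{
    struct disk d = { {false}, false };
    for (size_t i = 0; i < done; i++) {
        if (order[i] == N_DATA)
            d.commit = true;
        else
            d.data[order[i]] = true;
    }
    return d;
}

/* True if a replay would trust a transaction that is partly missing. */
static bool replay_sees_bad_transaction(const struct disk *d)
{
    if (!d->commit)
        return false;  /* no mark: the transaction is simply ignored */
    for (size_t i = 0; i < N_DATA; i++)
        if (!d->data[i])
            return true;
    return false;
}
```

With the issue order {0, 1, 2, 3} a crash at any point leaves either no mark or a complete transaction. With a reordered sequence such as {0, 3, 1, 2}, a crash after the second write leaves the mark on disk while blocks 1 and 2 are missing, which is the case a cache flush before the mark is meant to rule out.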
Re: vpf-10680, minor corruptions
Oleg Drokin wrote:
> Hm, interesting. Do you had crashes/unexpected shutdowns before corruptions appears
> or are they appear without any reason at all?

i had this issue once before -- did a check and noticed vpf-10680/some corruptions. but these must have been from a crash.
but now, i think as i rebooted the machine yesterday (because i upgraded to kernel 2.5.72) the journal was checked (replayed?) anyway at boot:

found reiserfs format "3.6" with standard journal
Reiserfs journal params: device sde2, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
reiserfs: checking transaction log (sde2) for (sde2)
Using r5 hash to sort names
(from dmesg, booting process)

and i thought the fs is "O.K." at least after boot, because ReiserFS cares about consistency by itself. if not, the corruptions are likely from the unclean shutdowns. but that would mean that i still have to run reiserfsck manually from time to time.

btw, is there a switch like "Maximum mount count before doing the next fsck while booting"?

> Well, I guess it's time to clear the dust off our alpha and do some testing.

hehe, should it be architecture related?

Thank you,
Christian.