Re: vpf-10680, minor corruptions

2003-06-27 Thread Oleg Drokin
Hello!

On Fri, Jun 27, 2003 at 12:23:07PM -0400, Chris Mason wrote:

> Most of these changes are in 2.4.21, which I've been using on an AMD64

Not the reiserfs_file_write() ones.

> bit box for a while without any problems.  The bug should be somewhere
> else, it looks to me like these spots aren't trying to send an unsigned
> long to disk.

The reiserfs_file_write() code
has an array of b_blocknr_t elements.
It then submits this array to reiserfs_paste_into_item/reiserfs_insert_item,
but b_blocknr_t is unsigned long (read: 64 bits on alpha, oops).
The funny thing is that when I declare b_blocknr_t as u32, the kernel basically falls apart
if cross-compiled. E.g. key comparison does not work and
all kinds of weird things start to happen.

In short - if you want to make sure the bug is there - compile 2.5.70+ code
on any 64 bit platform, write any file bigger than 2 blocks,
unmount and remount the fs and see what's in the file.

Bye,
Oleg
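
A minimal user-space sketch of the width mismatch described above, not kernel code. It assumes the consumer expects 32-bit block pointers (as the u32 experiment suggests) and uses uint64_t as a stand-in for an unsigned long b_blocknr_t on alpha. The names disk_ptr_t, wide_blocknr_t, pack_raw and pack_narrowed are hypothetical and do not exist in reiserfs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: the consumer of the array expects 32-bit block pointers. */
typedef uint32_t disk_ptr_t;

/* Stand-in for b_blocknr_t when it is a 64-bit unsigned long (e.g. alpha). */
typedef uint64_t wide_blocknr_t;

/*
 * Buggy variant: hands the in-memory array over byte for byte, as if its
 * elements were already 32 bits wide. `out` must have room for 2 * n slots.
 * Returns the number of 32-bit slots that were filled.
 */
size_t pack_raw(const wide_blocknr_t *blocks, size_t n, disk_ptr_t *out)
{
    memcpy(out, blocks, n * sizeof(*blocks));
    return n * sizeof(*blocks) / sizeof(*out);
}

/*
 * Narrowing variant: converts element by element, so n block numbers
 * produce exactly n 32-bit pointers. `out` must have room for n slots.
 */
size_t pack_narrowed(const wide_blocknr_t *blocks, size_t n, disk_ptr_t *out)
{
    for (size_t i = 0; i < n; i++) {
        assert(blocks[i] <= UINT32_MAX); /* block number must fit in 32 bits */
        out[i] = (disk_ptr_t)blocks[i];
    }
    return n;
}
```

With three block numbers, pack_raw fills six 32-bit slots (each real pointer paired with a zero half), while pack_narrowed fills three. Whether this has anything to do with the roughly doubled block counts that reiserfsck reports elsewhere in this thread is only a guess.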


Re: vpf-10680, minor corruptions

2003-06-27 Thread Christian Kujau
Oleg Drokin wrote:
> I have traced the new problem to a cross compiler that compiles
> code in a different way than native compiler for whatever reason
> (demo is attached as test.c program, it should print "result is 1"

yes, that's what it prints, no warnings were shown.

> You might try that patch as well to see if it helps you before I try it ;)

yes, compiling with _this_ patch but _not_ with the last patch you sent
(file.c) is under way again...

Thank you,
Christian.


Re: vpf-10680, minor corruptions

2003-06-27 Thread Chris Mason
On Fri, 2003-06-27 at 12:13, Oleg Drokin wrote:
> Hello!
> 
> On Fri, Jun 27, 2003 at 04:38:00PM +0400, Oleg Drokin wrote:
> 
> > I was looking in the wrong direction, when I produced that patch,
> > so it will produce zero output.
> > I hope to come up with ultimate fix soon enough. ;)
> 
> Well, there is a patch below that does *not* work for me ;)
> But it should work.
> I have traced the new problem to a cross compiler that compiles
> code in a different way than native compiler for whatever reason
> (demo is attached as test.c program, it should print "result is 1"
> in case it is compiled correctly and stuff about unknown
> uniqueness if it is miscompiled. In fact may be this is just correct compiler 
> behaviour.)
> I now think that when I compile a kernel with native compiler, it should work
> with below patch. But I can verify that only tomorrow it seems.
> You might try that patch as well to see if it helps you before I try it ;)
> The patch is "obviously correct" one. (except that it does not work
> with my cross compiler and kernel does work without patch which is really-really 
> strange).
> 

Most of these changes are in 2.4.21, which I've been using on an AMD64
bit box for a while without any problems.  The bug should be somewhere
else, it looks to me like these spots aren't trying to send an unsigned
long to disk.

-chris




Re: vpf-10680, minor corruptions

2003-06-27 Thread Oleg Drokin
Hello!

On Fri, Jun 27, 2003 at 04:38:00PM +0400, Oleg Drokin wrote:

> I was looking in the wrong direction, when I produced that patch,
> so it will produce zero output.
> I hope to come up with ultimate fix soon enough. ;)

Well, there is a patch below that does *not* work for me ;)
But it should work.
I have traced the new problem to a cross compiler that compiles
code differently from the native compiler for whatever reason
(a demo is attached as the test.c program; it should print "result is 1"
if it is compiled correctly, and stuff about unknown
uniqueness if it is miscompiled. In fact, maybe this is just correct compiler
behaviour.)
I now think that when I compile a kernel with the native compiler, it should work
with the patch below. But it seems I can verify that only tomorrow.
You might try that patch as well to see if it helps you before I try it ;)
The patch is an "obviously correct" one (except that it does not work
with my cross compiler, and the kernel does work without the patch, which is really, really
strange).

= fs/reiserfs/bitmap.c 1.26 vs edited =
--- 1.26/fs/reiserfs/bitmap.c   Sun May 18 01:09:36 2003
+++ edited/fs/reiserfs/bitmap.c Fri Jun 27 16:58:44 2003
@@ -43,7 +43,7 @@
 test_bit(_ALLOC_ ## optname , &SB_ALLOC_OPTS(s))
 
 static inline void get_bit_address (struct super_block * s,
-   unsigned long block, int * bmap_nr, int * offset)
+   b_blocknr_t block, int * bmap_nr, int * offset)
 {
 /* It is in the bitmap block number equal to the block
  * number divided by the number of bits in a block. */
@@ -54,7 +54,7 @@
 }
 
 #ifdef CONFIG_REISERFS_CHECK
-int is_reusable (struct super_block * s, unsigned long block, int bit_value)
+int is_reusable (struct super_block * s, b_blocknr_t block, int bit_value)
 {
 int i, j;
 
@@ -107,7 +107,7 @@
 static inline  int is_block_in_journal (struct super_block * s, int bmap, int off, int *next)
 {
-unsigned long tmp;
+b_blocknr_t tmp;
 
 if (reiserfs_in_journal (s, bmap, off, 1, &tmp)) {
 if (tmp) {  /* hint supplied */
@@ -235,7 +235,7 @@
 /* Tries to find contiguous zero bit window (given size) in given region of
  * bitmap and place new blocks there. Returns number of allocated blocks. */
 static int scan_bitmap (struct reiserfs_transaction_handle *th,
-   unsigned long *start, unsigned long finish,
+   b_blocknr_t *start, b_blocknr_t finish,
 int min, int max, int unfm, unsigned long file_block)
 {
 int nr_allocated=0;
@@ -281,7 +281,7 @@
 }
 
 static void _reiserfs_free_block (struct reiserfs_transaction_handle *th,
- unsigned long block)
+ b_blocknr_t block)
 {
 struct super_block * s = th->t_super;
 struct reiserfs_super_block * rs;
@@ -327,7 +327,7 @@
 }
 
 void reiserfs_free_block (struct reiserfs_transaction_handle *th, 
-  unsigned long block)
+  b_blocknr_t block)
 {
 struct super_block * s = th->t_super;
 
@@ -340,7 +340,7 @@
 
 /* preallocated blocks don't need to be run through journal_mark_freed */
 void reiserfs_free_prealloc_block (struct reiserfs_transaction_handle *th, 
-  unsigned long block) {
+  b_blocknr_t block) {
 RFALSE(!th->t_super, "vs-4060: trying to free block on nonexistent device");
 RFALSE(is_reusable (th->t_super, block, 1) == 0, "vs-4070: can not free such block");
 _reiserfs_free_block(th, block) ;
@@ -589,15 +589,15 @@
 
 static inline int old_hashed_relocation (reiserfs_blocknr_hint_t * hint)
 {
-unsigned long border;
-unsigned long hash_in;
+b_blocknr_t border;
+u32 hash_in;
 
 if (hint->formatted_node || hint->inode == NULL) {
 return 0;
   }
 
 hash_in = le32_to_cpu((INODE_PKEY(hint->inode))->k_dir_id);
-border = hint->beg + (unsigned long) keyed_hash(((char *) (&hash_in)), 4) % (hint->end - hint->beg - 1);
+border = hint->beg + (u32) keyed_hash(((char *) (&hash_in)), 4) % (hint->end - hint->beg - 1);
 if (border > hint->search_start)
 hint->search_start = border;
 
@@ -606,7 +606,7 @@
   
 static inline int old_way (reiserfs_blocknr_hint_t * hint)
 {
-unsigned long border;
+b_blocknr_t border;
 
 if (hint->formatted_node || hint->inode == NULL) {
 return 0;
@@ -622,7 +622,7 @@
 static inline void hundredth_slices (reiserfs_blocknr_hint_t * hint)
 {
 struct key * key = &hint->key;
-unsigned long slice_start;
+b_blocknr_t slice_start;
 
 slice_start = (keyed_hash((char*)(&key->k_dir_id),4) % 100) * (hint->end / 100);
 if ( slice_start > hint->search_start || slice_start + (hint->end / 100) <= hint->search_start) {
@@ -910,7 +910,7 @@
 int reiserfs_can_fit_pages ( struct super_block *sb /* superblock of filesystem
   

Re: vpf-10680, minor corruptions

2003-06-25 Thread Christian Kujau
Oleg Drokin wrote:
> Try to compile with CONFIG_REISERFS_CHECK=y the kernel that known-bad for you.
> (e.g. 2.5.72/2.5.73)

yes, 2.5.72 with CONFIG_REISERFS_CHECK=y is compiling now.

over night the alpha finished compiling 2.5.65 and 2.5.69. i had to 
compile reiserfs statically, inserting modules gave these "Invalid 
module format" errors.

under both (2.5.65+2.5.69) i was able to mkreiserfs sde2. mounting the 
fs went ok, but copying data (cp -a /lib /mnt/reiserfs) brought several 
kernel-errors (see https://ephigenie.kicks-ass.net/browse/reiserfs/).

but: diff -r showed _no_ differences between the directories, and a
following reiserfsck brought no vpf-10680 anymore!

so i'd say the problem occurs somewhere between 2.5.69 and 2.5.70.

thanks,
Christian.


Re: vpf-10680, minor corruptions

2003-06-24 Thread Oleg Drokin
Hello!

On Wed, Jun 25, 2003 at 02:42:24AM +0200, Christian Kujau wrote:
> (/lib/modules/2.5.65/kernel/fs/reiserfs/reiserfs.ko): Invalid module format
> lila:~# uname -a
> Linux lila 2.5.65 #4 Wed Jun 25 00:48:46 CEST 2003 alpha GNU/Linux
> i compiled the module with CONFIG_REISERFS_CHECK=y.
> shall i go on with 2.5.64 or better 2.5.67 ?

Try compiling the kernel that is known-bad for you (e.g. 2.5.72/2.5.73)
with CONFIG_REISERFS_CHECK=y.

Bye,
Oleg


Re: vpf-10680, minor corruptions

2003-06-24 Thread Christian Kujau
Christian Kujau wrote:
> of course, the best thing i can do is the el-cheapo-hacking approach:
> compiling 2.5.60...up to 2.5.72 and see *when* it breaks. hm, compiling
> a 2.5 kernel takes 180min on this machine. but anyway, i'll start with
> 2.5.60 now, see what it gives.

no, i started with 2.5.66 but the kernel did not compile. 2.5.65 did
compile (don't ask how long) and has already booted. but trying to
mount the newly created reiserfs gives:

module reiserfs: Relocation overflow vs section 9

in the log. the reiserfs module was not loaded. "modprobe reiserfs" gives:

lila:~# modprobe reiserfs
FATAL: Error inserting reiserfs 
(/lib/modules/2.5.65/kernel/fs/reiserfs/reiserfs.ko): Invalid module format
lila:~# uname -a
Linux lila 2.5.65 #4 Wed Jun 25 00:48:46 CEST 2003 alpha GNU/Linux

i compiled the module with CONFIG_REISERFS_CHECK=y.

shall i go on with 2.5.64 or better 2.5.67 ?

good night,
Christian.


Re: vpf-10680, minor corruptions

2003-06-24 Thread Christian Kujau
Oleg Drokin wrote:
> I see that you have used 2.5.70 and earlier kernels on alpha too.
> Do you have any idea of when stuff broke for you?

hm, i used 2.5.6x kernels too on this machine, but i noticed the
vpf-10680 the first time with 2.5.70.
of course, the best thing i can do is the el-cheapo-hacking approach:
compiling 2.5.60...up to 2.5.72 and see *when* it breaks. hm, compiling
a 2.5 kernel takes 180min on this machine. but anyway, i'll start with
2.5.60 now, see what it gives.

> You are certainly not the one person with alpha and 2.5, but I do not know
> if others are using reiserfs.

you gotta send ads (read: spam) to all the linux-alpha lists :-)

> BTW, have you tried to run with CONFIG_REISERFS_CHECK enabled to see if it will break
> and panic in kernel or something like that?

no, only CONFIG_REISERFS_PROC_INFO, but i'll do so now.

Thanks,
Christian.
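
At 180 minutes per build, the version-by-version search described above could be shortened by bisecting the range instead of walking it. The following is only a sketch under stated assumptions: it assumes one known-good and one known-bad patchlevel and that every version after the first bad one is also bad. first_bad and bad_from_70 are hypothetical names, and bad_from_70 merely stands in for "build, boot, copy files, run reiserfsck".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Callback that reports whether kernel 2.5.<patchlevel> shows the corruption. */
typedef bool (*is_bad_fn)(int patchlevel);

/*
 * Hypothetical stand-in for a real test run. It pretends the breakage
 * starts at 2.5.70; a real predicate would build and test the kernel.
 */
bool bad_from_70(int patchlevel)
{
    return patchlevel >= 70;
}

/*
 * Binary search for the first bad patchlevel in (good, bad].
 * Precondition: `good` is known good and `bad` is known bad.
 * If `builds` is not NULL, it is incremented once per tested version.
 */
int first_bad(int good, int bad, is_bad_fn is_bad, int *builds)
{
    assert(good < bad);
    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;
        if (builds != NULL)
            (*builds)++;
        if (is_bad(mid))
            bad = mid;   /* breakage is at mid or earlier */
        else
            good = mid;  /* breakage is after mid */
    }
    return bad;
}
```

For the range 2.5.60 to 2.5.72 this needs at most four test builds rather than up to twelve. A version that fails to compile at all would have to be skipped by hand, which the sketch does not handle.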


Re: vpf-10680, minor corruptions

2003-06-24 Thread Oleg Drokin
Hello!

On Mon, Jun 23, 2003 at 03:38:20PM +0200, Christian Kujau wrote:

> as stated before, the corruptions occur only on this very alpha machine, 

Well, I still cannot build the kernel myself and am still working on it.
(I get "make: *** [vmlinux] Error 139" and a zero-length vmlinux.)

BTW, I realised that I have not looked into your kernel config for that box,
can you send it to me please?

> bread: Cannot read the block (523914): (Input/output error).

Hm, but still it means kernel returned some error for read request.

> hah! i was not aware that the disk might have an hw problem, not a 
> single error ever showed up in my logs. this was weird. so i 
> re-partitioned the disk with a 10MB sde (to circumvent the bread error) 
> on the beginning and a 2 GB sde2. now reiserfsck/cp/diff are all working 
> fine under 2.4.21, but 2.5.72 is still erroneous.

Sigh.

> 
> btw: i am still using reiserfsprogs 3.6.8 now (since debian/testing has 
> 3.6.6) and i have compiled these utils under a 2.5.72 kernel. is it safe 
> to use them under 2.4 ?

I see that you have used 2.5.70 and earlier kernels on alpha too.
Do you have any idea of when stuff broke for you?

Bye,
Oleg


Re: vpf-10680, minor corruptions -- oooh!

2003-06-19 Thread Christian Kujau
Oleg Drokin wrote:
> Well, normally reiserfs is caring about consistency.
> There are two noticeable omissions, though:
> 1. if the unexpected shutdown was because of power loss and you have write cache
>    enabled and your write reorders write requests, then it is possible invalid data gets
>    written to disk, before "transaction is finished" mark is written to the drive.

yes, the on-disk write cache. this could indeed be a problem that is hard
for any fs to cover. i could disable it, yes.

> So can you say check/fix the fs, mount it write some files to it,
> unmount it and run fsck again to see if everything is ok?

oh, oh! i was about to answer this question with a plain "Yes". ok, with
--fix-fixable the corruptions got fixed, a reiserfsck went O.K. with "no
corruptions". i mounted the device yesterday, but no files were written
to it until today. now, i've just unmounted the partition, reiserfsck
went O.K. again, no corruptions. mounted again, i created a directory on
the fs and copied 329 files into it (cp -a /lib /path-to-reiser-fs/).
unmounted, reiserfsck found 131 corruptions in an instant:

lila:~# reiserfsck /dev/sde2

[...]

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###
reiserfsck --check started at Thu Jun 19 16:51:49 2003
###
Replaying journal..
0 transactions replayed
Checking internal tree..finished
Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
Checking Semantic tree:
/temp/lib/libnss_files-2.3.1.sovpf-10680: The file [3214 3538] has the
wrong block count in the StatData (104), should be (56)
[...]

finished

131 found corruptions can be fixed with --fix-fixable
###
reiserfsck finished at Thu Jun 19 16:52:04 2003
###
lila:~# find /lib/ | wc -l
329
lila:~#

the pathnames (/temp/lib/...) are the same files i just copied to the
fs. i was not aware of a reproducible bug (?) at all on this issue.
the fs is used heavily about once a week, but rarely _during_ the week. this
could be why i never noticed the errors before, or why they were
fixed by a journal replay at boot time.

fyi only: i have _one_ weird issue with this alpha: it has 128 MB RAM
inside, but only 64 MB are recognized. putting 64 MB into it gives 32
MB, but 32 MB is still 32MB. this is odd, but kernel compiling / heavy
load causes no oopses. well, i got some with 2.5.6x kernels, but that was
long ago. and: the hd is a little old, it's a ST34573N (SCSI, 2 GB). but
there are no odd kernel messages or failures in the log. i say this
because "bad RAM" or other hardware issues are often on-topic here.

Thank you,
Christian.

PS: sorry for the delay. mail probs.



Re: vpf-10680, minor corruptions

2003-06-18 Thread Oleg Drokin
Hello!

On Wed, Jun 18, 2003 at 08:01:12PM +0200, Christian Kujau wrote:

> >Hm, interesting. Do you had crashes/unexpected shutdowns before 
> >corruptions appears
> >or are they appear without any reason at all?
> i had this issue once before -- did a check and noticed vpf-10680/some 
> corruptions. but these must have been from an crash.
> but now, i think as i rebooted the machine yesterday (because i upgraded 
> to kernel 2.5.72) the journal was checked (replayed?) anyway at boot:
> found reiserfs format "3.6" with standard journal
> Reiserfs journal params: device sde2, size 8192, journal first block 18, 
> max trans len 1024, max batch 900, max commit age 30, max trans age 30
> reiserfs: checking transaction log (sde2) for (sde2)
> Using r5 hash to sort names
> (from dmesg, booting process)

No, there is no sign of replaying journal.
If there was replay, you'd normally see "x transactions replayed in y seconds" message.

> and i thought the fs is "O.K." at least after boot, because ReiserFS 
> cares about consistency for itsself. if not, the corruptions are likely 
> from the unclean shutdowns. but that would mean, that i still have to 
> manually reiserfsck from time to time.

Well, normally reiserfs does care about consistency.
There are two noticeable omissions, though:
1. If the unexpected shutdown was because of power loss, you have the write cache enabled,
   and your drive reorders write requests, then it is possible that invalid data gets
   written to disk before the "transaction is finished" mark is written to the drive.
   (There is a way to avoid this with some drives, by explicitly flushing the
   drive cache in some cases, but this method seems to create some problems
   of its own, so it is not yet merged in any mainstream kernel.)
2. There is no protection against kernel bugs.

The 1st usually leads to bitmap problems, but I have also seen names pointing to nowhere.
Your corruption is somewhat strange in that the number of blocks
in statdata is ~2x bigger than it should be (on several files). Sounds like a pattern
to me.

> btw, is there a switch like "Maximum mount counft before doing the next 
> fsck while booting"?

No.

> >Well, I guess it's time to clear the dust off our alpha and do some 
> >testing.
> hehe, should it be architecture related?

This is also possible.

So can you, say, check/fix the fs, mount it, write some files to it,
unmount it and run fsck again to see if everything is ok?

Thank you.

Bye,
Oleg
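
The ordering problem in point 1 above can be illustrated with a small user-space analogy. This is not the reiserfs journal code: the record layout, COMMIT_MAGIC, and the functions journal_open_temp, journal_append and journal_is_committed are all hypothetical. The sketch writes the payload, asks for it to be flushed, and only then writes a commit mark. Note that fsync() may not force a drive's own write cache to stable media on every system, which is exactly the gap described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define COMMIT_MAGIC 0x434f4d54u /* arbitrary marker, hypothetical */

/* Opens an anonymous scratch file to play the role of the journal. */
int journal_open_temp(void)
{
    char path[] = "/tmp/journal-sketch-XXXXXX";
    int fd = mkstemp(path);
    if (fd >= 0)
        unlink(path); /* file disappears once fd is closed */
    return fd;
}

/* Writes length + payload, flushes, then writes the commit mark and flushes. */
int journal_append(int fd, const void *payload, uint32_t len)
{
    uint32_t magic = COMMIT_MAGIC;

    assert(payload != NULL);
    if (write(fd, &len, sizeof len) != (ssize_t)sizeof len)
        return -1;
    if (write(fd, payload, len) != (ssize_t)len)
        return -1;
    if (fsync(fd) != 0) /* payload should be durable before the mark */
        return -1;
    if (write(fd, &magic, sizeof magic) != (ssize_t)sizeof magic)
        return -1;
    return fsync(fd);
}

/* Returns 1 if the first record carries a commit mark, 0 if not, -1 on error. */
int journal_is_committed(int fd)
{
    uint32_t len, magic;

    if (lseek(fd, 0, SEEK_SET) < 0)
        return -1;
    if (read(fd, &len, sizeof len) != (ssize_t)sizeof len)
        return 0; /* nothing (or a torn header) was written */
    if (lseek(fd, (off_t)len, SEEK_CUR) < 0)
        return -1;
    if (read(fd, &magic, sizeof magic) != (ssize_t)sizeof magic)
        return 0; /* payload present, commit mark missing */
    return magic == COMMIT_MAGIC;
}
```

If the commit mark could reach the disk before the payload, as with a reordering write cache, a replay after power loss would treat an incomplete record as finished. The flush between the two writes is what is meant to rule that out.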


Re: vpf-10680, minor corruptions

2003-06-18 Thread Christian Kujau
Oleg Drokin wrote:
> Hm, interesting. Do you had crashes/unexpected shutdowns before corruptions appears
> or are they appear without any reason at all?

i had this issue once before -- did a check and noticed vpf-10680/some
corruptions. but these must have been from a crash.
but now, i think as i rebooted the machine yesterday (because i upgraded
to kernel 2.5.72) the journal was checked (replayed?) anyway at boot:

found reiserfs format "3.6" with standard journal
Reiserfs journal params: device sde2, size 8192, journal first block 18, 
max trans len 1024, max batch 900, max commit age 30, max trans age 30
reiserfs: checking transaction log (sde2) for (sde2)
Using r5 hash to sort names

(from dmesg, booting process)

and i thought the fs is "O.K." at least after boot, because ReiserFS
cares about consistency by itself. if not, the corruptions are likely
from the unclean shutdowns. but that would mean that i still have to
run reiserfsck manually from time to time.

btw, is there a switch like "Maximum mount count before doing the next
fsck while booting"?

> Well, I guess it's time to clear the dust off our alpha and do some testing.

hehe, should it be architecture related?

Thank you,
Christian.