Re: 2.4.21 reiserfs oops

2003-06-24 Thread Chris Mason
On Tue, 2003-06-24 at 16:34, Nix wrote:
> On Tue, 24 Jun 2003, Oleg Drokin moaned:
> > Hello!
> > 
> > On Mon, Jun 23, 2003 at 11:16:27PM +0100, Nix wrote:
> > 
> >> >> Jun 22 13:52:42 loki kernel: Unable to handle kernel NULL pointer dereference 
> >> >> at virtual address 0001 
> >> > This is very strange address to oops on.
> >> I'll say! Looks almost like it JMPed to a null pointer or something.
> > 
> > No, if it'd jumped to a NULL pointer, we'd see 0 in EIP.
> 
> JMPed to ((long)NULL)+1 or something then :) the fact remains that it's
> not somewhere that even a memory error would make us likely to jump to.
> 
> >> >> Jun 22 13:52:43 loki kernel: EIP:0010:[]Not tainted 

The EIP isn't zero or 1, you've got a bad null pinter dereference at
address 1.  You get this when you do something like *(char *)1 =
some_val.

The ram is most likely bad, you're 1 bit away from zero, but you might
try a reiserfsck on any drives affected by the scsi errors.

-chris




Re: 2.4.21 reiserfs oops

2003-06-24 Thread Nix
On Tue, 24 Jun 2003, Oleg Drokin moaned:
> Hello!
> 
> On Mon, Jun 23, 2003 at 11:16:27PM +0100, Nix wrote:
> 
>> >> Jun 22 13:52:42 loki kernel: Unable to handle kernel NULL pointer dereference at 
>> >> virtual address 0001 
>> > This is very strange address to oops on.
>> I'll say! Looks almost like it JMPed to a null pointer or something.
> 
> No, if it'd jumped to a NULL pointer, we'd see 0 in EIP.

JMPed to ((long)NULL)+1 or something then :) the fact remains that it's
not somewhere that even a memory error would make us likely to jump to.

>> >> Jun 22 13:52:43 loki kernel: EIP:0010:[]Not tainted 
>> > And the EIP is prior to kernel start which is also very strange.
>> > On the other hand the address c0192df4 is somewhere inside reiserfs code,
>> > so it looks like a single bit error, I'd say.
>> I think it unlikely to be RAM problems given that the problem happened
>> shortly after upgrading to 2.4.21; this was about half a day after I
>> rebooted it because it threw a pile of never-seen-again, un-syslogged
>> SCSI abort errors at me (sym53c875); and *that* was a few minutes after
>> I rebooted into 2.4.21 for the first time.
> 
> Hm, so first there were some scsi problems and then reiserfs oops?

Different boots. I upgraded, the first boot crashed within five minutes
with weird SCSI errors, so I rebooted again and this happened six hours
later.

I'm willing to write off the SCSI errors to the shock effect of having
just been powered down for the first time in a year (the shutdown
scripts didn't quite work and the reset button is disconnected).

>> Actually since the RAM is good, I see no good reason for this to happen.
>> (actually I see no good reason for valid code before _text, either).
> 
> I wonder if 2.4.21 constantly crashes like that for you, then?

No obvious sign of it:

  9:34pm  up 1 day 22:30,  14 users,  load average: 0.09, 0.12, 0.16

(it is of course waiting until I am hundreds of miles away. *Then* it'll
crash.)

-- 
`It is an unfortunate coincidence that the date locarchive.h was
 written (in hex) matches Ritchie's birthday (in octal).'
   -- Roland McGrath on the libc-alpha list


Re: 2.4.21 reiserfs oops

2003-06-23 Thread Oleg Drokin
Hello!

On Mon, Jun 23, 2003 at 11:16:27PM +0100, Nix wrote:

> >> Jun 22 13:52:42 loki kernel: Unable to handle kernel NULL pointer dereference at 
> >> virtual address 0001 
> > This is very strange address to oops on.
> I'll say! Looks almost like it JMPed to a null pointer or something.

No, if it'd jumped to a NULL pointer, we'd see 0 in EIP.

> >> Jun 22 13:52:43 loki kernel: EIP:0010:[]Not tainted 
> > And the EIP is prior to kernel start which is also very strange.
> > On the other hand the address c0192df4 is somewhere inside reiserfs code,
> > so it looks like a single bit error, I'd say.
> I think it unlikely to be RAM problems given that the problem happened
> shortly after upgrading to 2.4.21; this was about half a day after I
> rebooted it because it threw a pile of never-seen-again, un-syslogged
> SCSI abort errors at me (sym53c875); and *that* was a few minutes after
> I rebooted into 2.4.21 for the first time.

Hm, so first there were some scsi problems and then reiserfs oops?

Actually since the RAM is good, I see no good reason for this to happen.
(actually I see no good reason for valid code before _text, either).

I wonder if 2.4.21 constantly crashes like that for you, then?

Bye,
Oleg


Re: 2.4.21 reiserfs oops

2003-06-23 Thread Nix
On Mon, 23 Jun 2003, Oleg Drokin said:
> Hello!
> 
> On Sun, Jun 22, 2003 at 03:00:20PM +0100, Nix wrote:
> 
>> Jun 22 13:52:42 loki kernel: Unable to handle kernel NULL pointer dereference at 
>> virtual address 0001 
> 
> This is very strange address to oops on.

I'll say! Looks almost like it JMPed to a null pointer or something.

>> Jun 22 13:52:43 loki kernel: EIP:0010:[]Not tainted 
> 
> And the EIP is prior to kernel start which is also very strange.
> On the other hand the address c0192df4 is somewhere inside reiserfs code,
> so it looks like a single bit error, I'd say.

I think it unlikely to be RAM problems given that the problem happened
shortly after upgrading to 2.4.21; this was about half a day after I
rebooted it because it threw a pile of never-seen-again, un-syslogged
SCSI abort errors at me (sym53c875); and *that* was a few minutes after
I rebooted into 2.4.21 for the first time.

All my other boxes love 2.4.21, but this one dislikes it. (Of course it
has to be my second-most-critical server... ah well, the NFS problems
in 2.4.20 bit my most critical server and my home directory both, so
I guess this is less unpleasant.)

> Can you run memtest86 for some time to verify that your RAM is OK?

Did that last night; no problems reported. (Not really surprising.)

> (hm, and the oops got twice to the logs which is pretty strange thing, too,
> never seen anything like this).

That's my weirdly broken syslog config. I've never got around to fixing
it; it only happens with kernel messages and I don't get all that many
of those.

-- 
`It is an unfortunate coincidence that the date locarchive.h was
 written (in hex) matches Ritchie's birthday (in octal).'
   -- Roland McGrath on the libc-alpha list