Re: Bug Hunting 101 - Finding The Alpha Bug

2005-12-29 Thread J.C. Roberts
On Tue, 27 Dec 2005 09:01:00 +0200, nikns [EMAIL PROTECTED] wrote:

> Upgraded alphastation to 3.8 and first time in my life hit
> alpha bug. ;)
> Kernel panicked while ungzipping src.tar.gz.
> When I hit continue in ddb I was dropped into
> another panic.
> There are photos of the panic, maybe they help someone to
> find the alpha bug :))
>
> http://secure.lv/~nikns/alphabug/

Any chance you can post a dmesg for the box?

thanks,
jcr



Re: Bug Hunting 101 - Finding The Alpha Bug

2005-12-29 Thread nikns
On Thu, Dec 29, 2005 at 01:51:34PM -0800, J.C. Roberts wrote:
> On Tue, 27 Dec 2005 09:01:00 +0200, nikns [EMAIL PROTECTED] wrote:
>
>> Upgraded alphastation to 3.8 and first time in my life hit
>> alpha bug. ;)
>> Kernel panicked while ungzipping src.tar.gz.
>> When I hit continue in ddb I was dropped into
>> another panic.
>> There are photos of the panic, maybe they help someone to
>> find the alpha bug :))
>>
>> http://secure.lv/~nikns/alphabug/
>
> Any chance you can post a dmesg for the box?

http://marc.theaimsgroup.com/?l=openbsd-alpha&m=113051046212041&w=2

Welcome!



Re: Bug Hunting 101 - Finding The Alpha Bug

2005-12-26 Thread nikns
Upgraded alphastation to 3.8 and first time in my life hit
alpha bug. ;)
Kernel panicked while ungzipping src.tar.gz.
When I hit continue in ddb I was dropped into
another panic.
There are photos of the panic, maybe they help someone to
find the alpha bug :))

http://secure.lv/~nikns/alphabug/



Re: Bug Hunting 101 - Finding The Alpha Bug

2005-12-22 Thread J.C. Roberts
On Wed, 21 Dec 2005 12:13:54 -0800, J.C. Roberts [EMAIL PROTECTED]
wrote:

> I found something interesting, namely a (more than once)
> reported bug that looks very similar to The alpha bug. The primary
> difference is you get cpu_switch_queuescan rather than cpu_switch in
> the trace output.
>
> 2003-10-01 21:40:00
> http://marc.theaimsgroup.com/?l=openbsd-alpha&m=106504464724168&w=2
>
> 2003-08-03 12:00:14
> http://marc.theaimsgroup.com/?l=openbsd-alpha&m=105999853009839&w=2
>
> There is also another report, but it is vague: since it is missing the
> needed trace information, there's no way to tell if it's related.
> 2003-05-13 22:13:50
> http://marc.theaimsgroup.com/?l=openbsd-bugs&m=105286536018393&w=2


Yes, the two bugs, one which shows cpu_switch in the trace output and
the other that shows cpu_switch_queuescan in the trace output, are
definitely related. 

I managed to reproduce the cpu_switch_queuescan output originally
reported from OpenBSD 3.3 while compiling 3.8-STABLE tonight.

The only change in the source files is that I enabled the

  #makeoptions DEBUG=-g

line in the /src/sys/conf/GENERIC file. I'm going to try flipping this back
and forth a few times to see if it really is the deciding factor for
which output the bug displays.
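
One way to keep the flip-flop test honest is to log every panic together with the kernel configuration it came from and only then decide whether DEBUG=-g is the deciding factor. A minimal sketch, assuming each observation is recorded by hand as a (DEBUG enabled?, top trace symbol) pair; the function names `tally` and `looks_decisive` are made up for illustration:

```python
from collections import Counter

def tally(observations):
    """Count panics per (DEBUG enabled?, top trace symbol) pair."""
    return Counter((bool(debug), symbol) for debug, symbol in observations)

def looks_decisive(counts):
    """True only if both configurations were observed, each configuration
    always produced the same symbol, and the two symbols differ."""
    by_cfg = {}
    for (debug, symbol), _count in counts.items():
        by_cfg.setdefault(debug, set()).add(symbol)
    return (
        len(by_cfg) == 2
        and all(len(symbols) == 1 for symbols in by_cfg.values())
        and by_cfg[True] != by_cfg[False]
    )
```

A single mixed result (for example cpu_switch showing up under both settings) makes `looks_decisive` return False, which would suggest the DEBUG line is not what selects the trace output.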

JCR



Bug Hunting 101 - Finding The Alpha Bug

2005-12-21 Thread J.C. Roberts

I've been told that The alpha bug has been around for quite some time
and no one has been able to find or fix it. I've also been told looking
for this bug has driven a few developers to drink, well, probably drink
more is a better description. Anyhow, since I could use a drink, I'm
going to give it a shot.

Since I don't have the skill to fix it myself, my goal is simply to
figure out when The alpha bug entered the tree. If I can just figure
out the `when' hopefully someone a lot smarter than me can figure out
the `what' of the problem. Basically I'm going to turn loose a half
dozen alpha systems compiling various versions of OpenBSD until I find
where the bug stops occurring.
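
The search for the `when' can be sketched roughly as below. This is only an illustration under two assumptions that may not hold: that the bug, once it entered the tree, stayed in every later release, and that a fixed number of clean builds is enough to call a release good (with a race, it may not be). The names `reproduces`, `first_bad_release` and the `run_build` callback are hypothetical; `run_build(release)` stands for "build that release on an alpha and report True if the kernel panicked".

```python
from typing import Callable, Optional, Sequence

def reproduces(release: str, run_build: Callable[[str], bool], attempts: int) -> bool:
    """Return True as soon as any attempt panics; a race may need many tries."""
    return any(run_build(release) for _ in range(attempts))

def first_bad_release(
    releases: Sequence[str],
    run_build: Callable[[str], bool],
    attempts: int = 5,
) -> Optional[str]:
    """Binary-search for the earliest release in which the bug reproduces.

    `releases` must be ordered oldest to newest. Returns None if no
    release reproduced the bug within the allowed attempts.
    """
    lo, hi = 0, len(releases)  # releases[:lo] look good, releases[hi:] look bad
    while lo < hi:
        mid = (lo + hi) // 2
        if reproduces(releases[mid], run_build, attempts):
            hi = mid        # bug present here, look at older releases
        else:
            lo = mid + 1    # no panic seen, look at newer releases
    return releases[lo] if lo < len(releases) else None
```

Because a "good" verdict is only as strong as the number of attempts behind it, a plain walk backwards with several machines in parallel may turn out to be just as practical as a binary search.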

As far as I can tell, the bug smells like a race condition of some sort
and if my wild guess is correct, it will be difficult to reproduce
consistently. With some (but not all) race conditions, you can increase
the chance of triggering them by increasing loads. Since I want the race
condition to occur, what is the best way to stress the systems while
also doing make build?

http://www.holm.cc/stress/
http://www.openbsd.org/cgi-bin/cvsweb/ports/sysutils/stress/
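
For illustration only, the idea of loading the machine while a build runs can be sketched without any particular stress tool. The sketch below is a stand-in, not the stress program linked above; `run_with_load` and `BURN` are made-up names, and whether extra load actually raises the odds of hitting the bug is still just my guess.

```python
import subprocess
import sys

# Child program: churn a 1 MB buffer in a tight loop to keep a CPU
# and its caches busy until the parent terminates it.
BURN = (
    "buf = bytearray(1 << 20)\n"
    "i = 0\n"
    "while True:\n"
    "    buf[i % len(buf)] = (buf[i % len(buf)] + 1) % 256\n"
    "    i += 4099\n"
)

def run_with_load(build_cmd, workers=2):
    """Run build_cmd while `workers` background processes generate load.

    Returns the exit status of build_cmd. The load processes are always
    stopped afterwards, even if the build command cannot be started.
    """
    procs = [subprocess.Popen([sys.executable, "-c", BURN]) for _ in range(workers)]
    try:
        return subprocess.run(build_cmd).returncode
    finally:
        for p in procs:
            p.terminate()
        for p in procs:
            p.wait()
```

Something like `run_with_load(["make", "build"], workers=4)` from the source directory would be the intended use; the worker count is arbitrary.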

I simply don't know and I'm only guessing but the prime suspects for
where the race might live seem to be physical memory management,
PAL/interrupt handling or even the scheduler. 

Are there better ways to stress the system?
Are there better ways to increase the odds of a race occurring?

Since I needed to find a starting point, I went searching and reading
through the archives of misc@, tech@, alpha@ and bugs@, and even the NetBSD
archives, in hopes of finding a patient zero where the bug was first
reported. I found something interesting, namely a (more than once)
reported bug that looks very similar to The alpha bug. The primary
difference is you get cpu_switch_queuescan rather than cpu_switch in
the trace output.

2003-10-01 21:40:00
http://marc.theaimsgroup.com/?l=openbsd-alpha&m=106504464724168&w=2

2003-08-03 12:00:14
http://marc.theaimsgroup.com/?l=openbsd-alpha&m=105999853009839&w=2

There is also another report, but it is vague: since it is missing the
needed trace information, there's no way to tell if it's related.
2003-05-13 22:13:50
http://marc.theaimsgroup.com/?l=openbsd-bugs&m=105286536018393&w=2

From other bug reports in the archive I know 3.8, 3.7 and 3.6 are all
affected by The alpha bug. If my hunch is correct and the bugs linked
above are related to The alpha bug, then I should start the
compile-a-thon at OpenBSD v3.3 and work backwards.

If you've got a better idea, please let me know.

Kind Regards,
jcr



Re: Bug Hunting 101 - Finding The Alpha Bug

2005-12-21 Thread Siegbert Marschall
Hi,

> As far as I can tell, the bug smells like a race condition of some sort
> and if my wild guess is correct, it will be difficult to reproduce
> consistently. With some (but not all) race conditions, you can increase
> the chance of triggering them by increasing loads. Since I want the race
> condition to occur, what is the best way to stress the systems while
> also doing make build?

well, I have three alphas in the basement where I am trying to figure
this one out. nothing provable yet, but everything is pointing to
some hardware problem with the low-end alpha cpus and second-level cache:
llsc errors, stuck cachelines and stuff. but I didn't dive deep enough
into the code and processor documentation to figure out what's going
on there, and will not in the next weeks/months since I have a few
more pressing issues to take care of first before having the spare
time for this ;)

only thing I can tell is that with netbsd the machines stay up for
weeks/months and with obsd they crash after a few days at the latest.
no flame, this doesn't show that netbsd is better, it is probably just missing the
tripwire or doesn't care whether it blows.

good luck, siggi.



Re: Bug Hunting 101 - Finding The Alpha Bug

2005-12-21 Thread J.C. Roberts
On Wed, 21 Dec 2005 22:46:00 +0100 (CET), Siegbert Marschall
[EMAIL PROTECTED] wrote:

> Hi,
>
>> As far as I can tell, the bug smells like a race condition of some sort
>> and if my wild guess is correct, it will be difficult to reproduce
>> consistently. With some (but not all) race conditions, you can increase
>> the chance of triggering them by increasing loads. Since I want the race
>> condition to occur, what is the best way to stress the systems while
>> also doing make build?
>
> well, I have three alphas in the basement where I am trying to figure
> this one out. nothing provable yet, but everything is pointing to
> some hardware problem with the low-end alpha cpus and second-level cache.

Due to the old bug reports which may or may not be related, I've been
looking into the changes in src/sys/arch/alpha/alpha/locore.s

> llsc errors, stuck cachelines and stuff. but I didn't dive deep enough
> into the code and processor documentation to figure out what's going
> on there, and will not in the next weeks/months since I have a few
> more pressing issues to take care of first before having the spare
> time for this ;)


If I can figure out when the bug entered the tree, it will hopefully
make it easy for someone else to figure out the what of the problem.
Since I lack the skill and experience to deal with figuring out the
what, I'm just going to use brute force to figure out the when. ;-)

> only thing I can tell is that with netbsd the machines stay up for
> weeks/months and with obsd they crash after a few days at the latest.
> no flame, this doesn't show that netbsd is better, it is probably just missing the
> tripwire or doesn't care whether it blows.
>
> good luck, siggi.

I've searched the netbsd list archives thoroughly and found no similar
bug reports. As far as I know netbsd is not affected.

jcr



Re: Bug Hunting 101 - Finding The Alpha Bug

2005-12-21 Thread ober
I know this is going to be OT, but since this bug seems to deal with only
OpenBSD on alpha, possibly in locore.s, and does not seem to affect netbsd,
I thought I might point out a coincidental, but most likely unrelated, bug.


http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/arch/alpha/alpha/locore.s.diff?r1=1.19&r2=1.20&f=h
Search on OpenBSD. :D



-Ober

On Wed, 21 Dec 2005, J.C. Roberts wrote:


> On Wed, 21 Dec 2005 22:46:00 +0100 (CET), Siegbert Marschall
> [EMAIL PROTECTED] wrote:
>
>> Hi,
>>
>>> As far as I can tell, the bug smells like a race condition of some sort
>>> and if my wild guess is correct, it will be difficult to reproduce
>>> consistently. With some (but not all) race conditions, you can increase
>>> the chance of triggering them by increasing loads. Since I want the race
>>> condition to occur, what is the best way to stress the systems while
>>> also doing make build?
>>
>> well, I have three alphas in the basement where I am trying to figure
>> this one out. nothing provable yet, but everything is pointing to
>> some hardware problem with the low-end alpha cpus and second-level cache.
>
> Due to the old bug reports which may or may not be related, I've been
> looking into the changes in src/sys/arch/alpha/alpha/locore.s
>
>> llsc errors, stuck cachelines and stuff. but I didn't dive deep enough
>> into the code and processor documentation to figure out what's going
>> on there, and will not in the next weeks/months since I have a few
>> more pressing issues to take care of first before having the spare
>> time for this ;)
>
> If I can figure out when the bug entered the tree, it will hopefully
> make it easy for someone else to figure out the what of the problem.
> Since I lack the skill and experience to deal with figuring out the
> what, I'm just going to use brute force to figure out the when. ;-)
>
>> only thing I can tell is that with netbsd the machines stay up for
>> weeks/months and with obsd they crash after a few days at the latest.
>> no flame, this doesn't show that netbsd is better, it is probably just missing the
>> tripwire or doesn't care whether it blows.
>>
>> good luck, siggi.
>
> I've searched the netbsd list archives thoroughly and found no similar
> bug reports. As far as I know netbsd is not affected.
>
> jcr