Re: -current lockups

2001-08-23 Thread Vincent Poy

On Tue, 31 Jul 2001, John Baldwin wrote:


 On 31-Jul-01 Vincent Poy wrote:
  On Mon, 30 Jul 2001, John Baldwin wrote:
 
  On 30-Jul-01 Sheldon Hearn wrote:
  
  
   On Mon, 30 Jul 2001 07:38:47 MST, David O'Brien wrote:
  
   However, those boxes were panicking often before I made that statement.
   So I still believe current is now in better shape than it was in June.
  
   I'll be a lot happier when I can enable DDB_UNATTENDED and do whatever
   it is that causes my panic of the day and actually get a crashdump
   instead of
  
 panic: witness_restore: lock (sleep mutex) Giant not locked
 
  This is a different one.  Is this during the dump itself?  That I can try to
  work on.  (Basically, I need to make witness just stop doing all of its
  various
  checks if panicstr != NULL).
 
I'm getting the following lock order reversal for any -current
  since July 19, 2001 including today and it just hangs solid after this, no
  db prompt or anything...  It only happens after passwd or chpass
  successfully rebuilds the database, vipw works fine.
 
  root@pele [9:29pm][/usr/temp] 
  Jul 28 21:29:40 pele /boot/kernel/kernel: lock order reversal
  Jul 28 21:29:40 pele /boot/kernel/kernel: lock order reversal
  Jul 28 21:29:40 pele /boot/kernel/kernel: 1st 0xd92fea9c process lock @
  /usr/src/sys/vm/vm_glue.c:469
  Jul 28 21:29:40 pele /boot/kernel/kernel: 1st 0xd92fea9c process lock @
  /usr/src/sys/vm/vm_glue.c:469
  Jul 28 21:29:40 pele /boot/kernel/kernel: 2nd 0xc118dfb0 lockmgr interlock
  @ /usr/src/sys/kern/kern_lock.c:239
  Jul 28 21:29:40 pele /boot/kernel/kernel: 2nd 0xc118dfb0 lockmgr interlock
  @ /usr/src/sys/kern/kern_lock.c:239

 This is due to the way that lockmgr locks are implemented unfortunately, and
 will be fixed when vm maps switch to sx locks instead of lockmgr locks.

Just a note to say thanks to John Baldwin, Peter Wemm, Ian Dowse
and a few others for all their hard work and code commits: the panics,
both the general instability and the ones triggered by running passwd,
have completely disappeared.  The system is solid as a rock!  Thanks guys!


Cheers,
Vince - [EMAIL PROTECTED] - Vice President    __ 
Unix Networking Operations - FreeBSD-Real Unix for Free / / / / |  / |[__  ]
WurldLink Corporation  / / / /  | /  | __] ]
San Francisco - Honolulu - Hong Kong  / / / / / |/ / | __] ]
HongKong Stars/Gravis UltraSound Mailing Lists Admin /_/_/_/_/|___/|_|[]
Almighty1@IRC - oahu.DAL.NET Hawaii's DALnet IRC Network Server Admin



To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-current in the body of the message



Re: -current lockups

2001-08-01 Thread Sheldon Hearn



On Tue, 31 Jul 2001 12:06:49 -1000, Vincent Poy wrote:

   Yeah, that's the weird part... I thought adding a DDB_UNATTENDED
 as an option would at least make it reboot or something...

For the record, DDB_UNATTENDED is mostly pointless.  It just sets the
default value of debug.debugger_on_panic, which you can just as well set
in /etc/sysctl.conf.  Unless, of course, you're seeing a panic in the
startup process.  But then do you really want an indefinite panic cycle?
:-)

Ciao,
Sheldon.




Re: -current lockups

2001-08-01 Thread David Scheidt

On Wed, 1 Aug 2001, Sheldon Hearn wrote:

:
:
:On Tue, 31 Jul 2001 12:06:49 -1000, Vincent Poy wrote:
:
:  Yeah, that's the weird part... I thought adding a DDB_UNATTENDED
: as an option would at least make it reboot or something...
:
:For the record, DDB_UNATTENDED is mostly pointless.  It just sets the
:default value of debug.debugger_on_panic, which you can just as well set
:in /etc/sysctl.conf.  Unless, of course, you're seeing a panic in the
:startup process.  But then do you really want an indefinite panic cycle?
::-)

Well, my current startup panic only happens at cold boot.  After it panics
the first time, it boots fine.  If DDB_UNATTENDED isn't set, it hangs trying
to enter DDB.

-- 
[EMAIL PROTECTED]
Bipedalism is only a fad.





Re: -current lockups

2001-07-31 Thread Vincent Poy

On Tue, 31 Jul 2001, John Baldwin wrote:


 On 31-Jul-01 Vincent Poy wrote:
  On Mon, 30 Jul 2001, John Baldwin wrote:
 
  On 30-Jul-01 Sheldon Hearn wrote:
  
  
   On Mon, 30 Jul 2001 07:38:47 MST, David O'Brien wrote:
  
   However, those boxes were panicking often before I made that statement.
   So I still believe current is now in better shape than it was in June.
  
   I'll be a lot happier when I can enable DDB_UNATTENDED and do whatever
   it is that causes my panic of the day and actually get a crashdump
   instead of
  
 panic: witness_restore: lock (sleep mutex) Giant not locked
 
  This is a different one.  Is this during the dump itself?  That I can try to
  work on.  (Basically, I need to make witness just stop doing all of its
  various
  checks if panicstr != NULL).
 
I'm getting the following lock order reversal for any -current
  since July 19, 2001 including today and it just hangs solid after this, no
  db prompt or anything...  It only happens after passwd or chpass
  successfully rebuilds the database, vipw works fine.
 
  root@pele [9:29pm][/usr/temp] 
  Jul 28 21:29:40 pele /boot/kernel/kernel: lock order reversal
  Jul 28 21:29:40 pele /boot/kernel/kernel: lock order reversal
  Jul 28 21:29:40 pele /boot/kernel/kernel: 1st 0xd92fea9c process lock @
  /usr/src/sys/vm/vm_glue.c:469
  Jul 28 21:29:40 pele /boot/kernel/kernel: 1st 0xd92fea9c process lock @
  /usr/src/sys/vm/vm_glue.c:469
  Jul 28 21:29:40 pele /boot/kernel/kernel: 2nd 0xc118dfb0 lockmgr interlock
  @ /usr/src/sys/kern/kern_lock.c:239
  Jul 28 21:29:40 pele /boot/kernel/kernel: 2nd 0xc118dfb0 lockmgr interlock
  @ /usr/src/sys/kern/kern_lock.c:239

 This is due to the way that lockmgr locks are implemented unfortunately, and
 will be fixed when vm maps switch to sx locks instead of lockmgr locks.

Interesting.  Is there a workaround so it just reboots instead of
freezing?  Also, I noticed that you committed some changes to the kernel,
is that supposed to help it any?


Cheers,
Vince - [EMAIL PROTECTED] - Vice President    __ 
Unix Networking Operations - FreeBSD-Real Unix for Free / / / / |  / |[__  ]
WurldLink Corporation  / / / /  | /  | __] ]
San Francisco - Honolulu - Hong Kong  / / / / / |/ / | __] ]
HongKong Stars/Gravis UltraSound Mailing Lists Admin /_/_/_/_/|___/|_|[]
Almighty1@IRC - oahu.DAL.NET Hawaii's DALnet IRC Network Server Admin






Re: -current lockups

2001-07-31 Thread John Baldwin


On 31-Jul-01 Vincent Poy wrote:
 On Tue, 31 Jul 2001, John Baldwin wrote:
  root@pele [9:29pm][/usr/temp] 
  Jul 28 21:29:40 pele /boot/kernel/kernel: lock order reversal
  Jul 28 21:29:40 pele /boot/kernel/kernel: lock order reversal
  Jul 28 21:29:40 pele /boot/kernel/kernel: 1st 0xd92fea9c process lock @
  /usr/src/sys/vm/vm_glue.c:469
  Jul 28 21:29:40 pele /boot/kernel/kernel: 1st 0xd92fea9c process lock @
  /usr/src/sys/vm/vm_glue.c:469
  Jul 28 21:29:40 pele /boot/kernel/kernel: 2nd 0xc118dfb0 lockmgr interlock
  @ /usr/src/sys/kern/kern_lock.c:239
  Jul 28 21:29:40 pele /boot/kernel/kernel: 2nd 0xc118dfb0 lockmgr interlock
  @ /usr/src/sys/kern/kern_lock.c:239

 This is due to the way that lockmgr locks are implemented unfortunately, and
 will be fixed when vm maps switch to sx locks instead of lockmgr locks.
 
   Interesting.  Is there a workaround so it just reboots instead of
 freezing?  Also, I noticed that you committed some changes to the kernel,
 is that supposed to help it any?

There is currently not a workaround.  The changes committed fix other things,
but not this problem.  I haven't actually seen this lock order cause a freeze
before to be honest.

-- 

John Baldwin [EMAIL PROTECTED] -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
Power Users Use the Power to Serve!  -  http://www.FreeBSD.org/




Re: -current lockups

2001-07-31 Thread Vincent Poy

On Tue, 31 Jul 2001, John Baldwin wrote:


 On 31-Jul-01 Vincent Poy wrote:
  On Tue, 31 Jul 2001, John Baldwin wrote:
   root@pele [9:29pm][/usr/temp] 
   Jul 28 21:29:40 pele /boot/kernel/kernel: lock order reversal
   Jul 28 21:29:40 pele /boot/kernel/kernel: lock order reversal
   Jul 28 21:29:40 pele /boot/kernel/kernel: 1st 0xd92fea9c process lock @
   /usr/src/sys/vm/vm_glue.c:469
   Jul 28 21:29:40 pele /boot/kernel/kernel: 1st 0xd92fea9c process lock @
   /usr/src/sys/vm/vm_glue.c:469
   Jul 28 21:29:40 pele /boot/kernel/kernel: 2nd 0xc118dfb0 lockmgr interlock
   @ /usr/src/sys/kern/kern_lock.c:239
   Jul 28 21:29:40 pele /boot/kernel/kernel: 2nd 0xc118dfb0 lockmgr interlock
   @ /usr/src/sys/kern/kern_lock.c:239
 
  This is due to the way that lockmgr locks are implemented unfortunately, and
  will be fixed when vm maps switch to sx locks instead of lockmgr locks.
 
Interesting.  Is there a workaround so it just reboots instead of
  freezing?  Also, I noticed that you committed some changes to the kernel,
  is that supposed to help it any?

 There is currently not a workaround.  The changes committed fix other things,
 but not this problem.  I haven't actually seen this lock order cause a freeze
 before to be honest.

Yeah, that's the weird part... I thought adding a DDB_UNATTENDED
as an option would at least make it reboot or something...


Cheers,
Vince - [EMAIL PROTECTED] - Vice President    __ 
Unix Networking Operations - FreeBSD-Real Unix for Free / / / / |  / |[__  ]
WurldLink Corporation  / / / /  | /  | __] ]
San Francisco - Honolulu - Hong Kong  / / / / / |/ / | __] ]
HongKong Stars/Gravis UltraSound Mailing Lists Admin /_/_/_/_/|___/|_|[]
Almighty1@IRC - oahu.DAL.NET Hawaii's DALnet IRC Network Server Admin






Re: -current lockups

2001-07-31 Thread John Baldwin


On 31-Jul-01 Vincent Poy wrote:
 On Tue, 31 Jul 2001, John Baldwin wrote:
 

 On 31-Jul-01 Vincent Poy wrote:
  On Tue, 31 Jul 2001, John Baldwin wrote:
   root@pele [9:29pm][/usr/temp] 
   Jul 28 21:29:40 pele /boot/kernel/kernel: lock order reversal
   Jul 28 21:29:40 pele /boot/kernel/kernel: lock order reversal
   Jul 28 21:29:40 pele /boot/kernel/kernel: 1st 0xd92fea9c process lock @
   /usr/src/sys/vm/vm_glue.c:469
   Jul 28 21:29:40 pele /boot/kernel/kernel: 1st 0xd92fea9c process lock @
   /usr/src/sys/vm/vm_glue.c:469
   Jul 28 21:29:40 pele /boot/kernel/kernel: 2nd 0xc118dfb0 lockmgr
   interlock
   @ /usr/src/sys/kern/kern_lock.c:239
   Jul 28 21:29:40 pele /boot/kernel/kernel: 2nd 0xc118dfb0 lockmgr
   interlock
   @ /usr/src/sys/kern/kern_lock.c:239
 
  This is due to the way that lockmgr locks are implemented unfortunately,
  and
  will be fixed when vm maps switch to sx locks instead of lockmgr locks.
 
Interesting.  Is there a workaround so it just reboots instead of
  freezing?  Also, I noticed that you committed some changes to the kernel,
  is that supposed to help it any?

 There is currently not a workaround.  The changes committed fix other
 things,
 but not this problem.  I haven't actually seen this lock order cause a
 freeze
 before to be honest.
 
   Yeah, that's the weird part... I thought adding a DDB_UNATTENDED
 as an option would at least make it reboot or something...

Well, since it is a lock order reversal, there is the chance of it resulting in
a deadlock though the chances of that on a UP machine would be very, very rare
indeed.  The reversal in question is triggered when we swap a process out.

-- 

John Baldwin [EMAIL PROTECTED] -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
Power Users Use the Power to Serve!  -  http://www.FreeBSD.org/




Re: -current lockups

2001-07-31 Thread Vincent Poy

On Tue, 31 Jul 2001, John Baldwin wrote:

  On 31-Jul-01 Vincent Poy wrote:
   On Tue, 31 Jul 2001, John Baldwin wrote:
root@pele [9:29pm][/usr/temp] 
Jul 28 21:29:40 pele /boot/kernel/kernel: lock order reversal
Jul 28 21:29:40 pele /boot/kernel/kernel: lock order reversal
Jul 28 21:29:40 pele /boot/kernel/kernel: 1st 0xd92fea9c process lock @
/usr/src/sys/vm/vm_glue.c:469
Jul 28 21:29:40 pele /boot/kernel/kernel: 1st 0xd92fea9c process lock @
/usr/src/sys/vm/vm_glue.c:469
Jul 28 21:29:40 pele /boot/kernel/kernel: 2nd 0xc118dfb0 lockmgr
interlock
@ /usr/src/sys/kern/kern_lock.c:239
Jul 28 21:29:40 pele /boot/kernel/kernel: 2nd 0xc118dfb0 lockmgr
interlock
@ /usr/src/sys/kern/kern_lock.c:239
  
   This is due to the way that lockmgr locks are implemented unfortunately,
   and
   will be fixed when vm maps switch to sx locks instead of lockmgr locks.
  
 Interesting.  Is there a workaround so it just reboots instead of
   freezing?  Also, I noticed that you committed some changes to the kernel,
   is that supposed to help it any?
 
  There is currently not a workaround.  The changes committed fix other
  things,
  but not this problem.  I haven't actually seen this lock order cause a
  freeze
  before to be honest.
 
Yeah, that's the weird part... I thought adding a DDB_UNATTENDED
  as an option would at least make it reboot or something...

 Well, since it is a lock order reversal, there is the chance of it
 resulting in a deadlock though the chances of that on a UP machine
 would be very, very rare indeed.  The reversal in question is
 triggered when we swap a process out.

Yep, it's so rare that nothing can trigger it except passwd and
chpass, and only after they have successfully printed the following
and exited...

passwd: updating the database...
passwd: done

Even vipw doesn't trigger it, which I thought it would, since it
rebuilds the database for all the users rather than just one.


Cheers,
Vince - [EMAIL PROTECTED] - Vice President    __ 
Unix Networking Operations - FreeBSD-Real Unix for Free / / / / |  / |[__  ]
WurldLink Corporation  / / / /  | /  | __] ]
San Francisco - Honolulu - Hong Kong  / / / / / |/ / | __] ]
HongKong Stars/Gravis UltraSound Mailing Lists Admin /_/_/_/_/|___/|_|[]
Almighty1@IRC - oahu.DAL.NET Hawaii's DALnet IRC Network Server Admin







Re: -current lockups

2001-07-31 Thread Vincent Poy

On Mon, 30 Jul 2001, John Baldwin wrote:

 On 30-Jul-01 Sheldon Hearn wrote:
 
 
  On Mon, 30 Jul 2001 07:38:47 MST, David O'Brien wrote:
 
  However, those boxes were panicking often before I made that statement.
  So I still believe current is now in better shape than it was in June.
 
  I'll be a lot happier when I can enable DDB_UNATTENDED and do whatever
  it is that causes my panic of the day and actually get a crashdump
  instead of
 
panic: witness_restore: lock (sleep mutex) Giant not locked

 This is a different one.  Is this during the dump itself?  That I can try to
 work on.  (Basically, I need to make witness just stop doing all of its various
 checks if panicstr != NULL).

I'm getting the following lock order reversal for any -current
since July 19, 2001 including today and it just hangs solid after this, no
db prompt or anything...  It only happens after passwd or chpass
successfully rebuilds the database, vipw works fine.

root@pele [9:29pm][/usr/temp] 
Jul 28 21:29:40 pele /boot/kernel/kernel: lock order reversal
Jul 28 21:29:40 pele /boot/kernel/kernel: lock order reversal
Jul 28 21:29:40 pele /boot/kernel/kernel: 1st 0xd92fea9c process lock @
/usr/src/sys/vm/vm_glue.c:469
Jul 28 21:29:40 pele /boot/kernel/kernel: 1st 0xd92fea9c process lock @
/usr/src/sys/vm/vm_glue.c:469
Jul 28 21:29:40 pele /boot/kernel/kernel: 2nd 0xc118dfb0 lockmgr interlock
@ /usr/src/sys/kern/kern_lock.c:239
Jul 28 21:29:40 pele /boot/kernel/kernel: 2nd 0xc118dfb0 lockmgr interlock
@ /usr/src/sys/kern/kern_lock.c:239


Cheers,
Vince - [EMAIL PROTECTED] - Vice President    __ 
Unix Networking Operations - FreeBSD-Real Unix for Free / / / / |  / |[__  ]
WurldLink Corporation  / / / /  | /  | __] ]
San Francisco - Honolulu - Hong Kong  / / / / / |/ / | __] ]
HongKong Stars/Gravis UltraSound Mailing Lists Admin /_/_/_/_/|___/|_|[]
Almighty1@IRC - oahu.DAL.NET Hawaii's DALnet IRC Network Server Admin







Re: -current lockups

2001-07-31 Thread Sheldon Hearn



On Mon, 30 Jul 2001 16:52:27 +0200, Sheldon Hearn wrote:

 I'll be a lot happier when I can enable DDB_UNATTENDED and do whatever
 it is that causes my panic of the day and actually get a crashdump
 instead of
 
   panic: witness_restore: lock (sleep mutex) Giant not locked

Right!  We have interesting progress!

The following patchset fixes two problems:

1) The witness code may still panic when panicstr is not NULL.  This is
   a problem when the original panic was caused by the witness code
   itself.

   Solution: when panicstr != NULL, witness backs off on the fascism.

2) The ata code calls await/mawait inside a dump.  This results in a
   hard, uninterruptible lock in ad_reinit().  An NMI might interrupt
   it, but not all of us have NMI boards.

   Solution: as a quick fix, when panicstr != NULL, bail out of
   masleep() without spinning on mutexes.

   The real solution here is arguably for the ata code to be aware of
   the fact that sleeping inside a dump (i.e. panicstr != NULL) is bad
   and stop doing so.

With these fixes in place, I can get a crashdump with a populated
ktr_buf from a witness-related panic.  Joy!
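
Both fixes follow the same pattern: once panicstr is set, code that runs on
the dump path stops enforcing invariants and stops sleeping.  A minimal
user-space sketch of that pattern follows.  It is only an illustration under
stated assumptions: struct sketch_lock, sketch_assert_owned(),
sketch_sleep(), sketch_panic() and assert_failures are hypothetical names,
and only panicstr and cold mirror identifiers from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the kernel globals of the same name. */
static const char *panicstr = NULL;	/* non-NULL once a panic has started */
static bool cold = false;		/* true during early autoconfiguration */

/* Hypothetical lock type: just remembers whether it is owned. */
struct sketch_lock {
	bool owned;
};

/* Counts invariant violations instead of panicking, to keep this testable. */
static int assert_failures;

/* Ownership check that backs off once a panic is already in progress. */
void
sketch_assert_owned(const struct sketch_lock *l)
{
	assert(l != NULL);
	if (panicstr != NULL)
		return;			/* never panic again during the dump */
	if (!l->owned)
		assert_failures++;
}

/*
 * Sleep request.  Returns 0 at once, without blocking, when the system is
 * cold or panicking; returns 1 to mean "would have gone to sleep".
 */
int
sketch_sleep(void)
{
	if (cold || panicstr != NULL)
		return (0);
	return (1);
}

/* Record the first panic message; later panics do not overwrite it. */
void
sketch_panic(const char *msg)
{
	if (panicstr == NULL)
		panicstr = msg;
}
```

Before sketch_panic() is called, a failed ownership check is counted and
sketch_sleep() reports that it would block; afterwards both return
immediately, which is what lets the dump code run to completion.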

Many thanks to jhb for providing the patch for #1 above and to peter for
the patch for #2 above.  None of this is actually my own work.  I'm just
the guy pressing the panic button on his box incessantly. :-)

Ciao,
Sheldon.

Index: kern/kern_mutex.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_mutex.c,v
retrieving revision 1.64
diff -u -d -r1.64 kern_mutex.c
--- kern/kern_mutex.c	2001/06/25 18:29:32	1.64
+++ kern/kern_mutex.c	2001/07/30 23:14:32
@@ -562,6 +562,9 @@
 void
 _mtx_assert(struct mtx *m, int what, const char *file, int line)
 {
+
+	if (panicstr != NULL)
+		return;
 	switch (what) {
 	case MA_OWNED:
 	case MA_OWNED | MA_RECURSED:
Index: kern/kern_synch.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_synch.c,v
retrieving revision 1.148
diff -u -d -r1.148 kern_synch.c
--- kern/kern_synch.c	2001/07/06 01:16:42	1.148
+++ kern/kern_synch.c	2001/07/31 01:07:23
@@ -554,6 +554,18 @@
 	KASSERT(timo > 0 || mtx_owned(&Giant) || mtx != NULL,
 	    ("sleeping without a mutex"));
 	mtx_lock_spin(&sched_lock);
+	if (cold || panicstr) {
+		/*
+		 * After a panic, or during autoconfiguration,
+		 * just give interrupts a chance, then just return;
+		 * don't run any other procs or panic below,
+		 * in case this is the idle process and already asleep.
+		 */
+		if (mtx != NULL && priority & PDROP)
+			mtx_unlock_flags(mtx, MTX_NOSWITCH);
+		mtx_unlock_spin(&sched_lock);
+		return (0);
+	}
 	DROP_GIANT_NOSWITCH();
 	if (mtx != NULL) {
 		mtx_assert(mtx, MA_OWNED | MA_NOTRECURSED);
Index: kern/subr_trap.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/subr_trap.c,v
retrieving revision 1.196
diff -u -d -r1.196 subr_trap.c
--- kern/subr_trap.c	2001/07/04 15:36:30	1.196
+++ kern/subr_trap.c	2001/07/30 23:14:42
@@ -72,9 +72,9 @@
 	while ((sig = CURSIG(p)) != 0)
 		postsig(sig);
 	mtx_unlock(&Giant);
+	PROC_UNLOCK(p);
 
 	mtx_lock_spin(&sched_lock);
-	PROC_UNLOCK_NOSWITCH(p);
 	p->p_pri.pri_level = p->p_pri.pri_user;
 	if (resched_wanted(p)) {
 		/*
@@ -96,24 +96,22 @@
 		while ((sig = CURSIG(p)) != 0)
 			postsig(sig);
 		mtx_unlock(&Giant);
-		mtx_lock_spin(&sched_lock);
-		PROC_UNLOCK_NOSWITCH(p);
-	}
+		PROC_UNLOCK(p);
+	} else
+		mtx_unlock_spin(&sched_lock);
 
 	/*
 	 * Charge system time if profiling.
 	 */
-	if (p->p_sflag & PS_PROFIL) {
-		mtx_unlock_spin(&sched_lock);
+	if (p->p_sflag & PS_PROFIL)
 		addupc_task(p, TRAPF_PC(frame),
 		    (u_int)(p->p_sticks - oticks) * psratio);
-	} else
-		mtx_unlock_spin(&sched_lock);
 }
 
 /*
  * Process an asynchronous software trap.
  * This is relatively easy.
+ * This function will return with interrupts disabled.
  */
 void
 ast(framep)
@@ -121,68 +119,64 @@
 {
 	struct proc *p = CURPROC;
 	u_quad_t sticks;
+	critical_t s;
+	int sflag;
 #if defined(DEV_NPX) && !defined(SMP)
 	int ucode;
 #endif
 
 	KASSERT(TRAPF_USERMODE(framep), ("ast in kernel mode"));
-
-	/*
-	 * We check for a pending AST here rather than in the assembly as
-	 * acquiring and releasing mutexes in assembly is not fun.
-	 */
-	mtx_lock_spin(&sched_lock);
-	if (!(astpending(p) || resched_wanted(p))) {
-   

Re: -current lockups

2001-07-31 Thread John Baldwin


On 31-Jul-01 Vincent Poy wrote:
 On Mon, 30 Jul 2001, John Baldwin wrote:
 
 On 30-Jul-01 Sheldon Hearn wrote:
 
 
  On Mon, 30 Jul 2001 07:38:47 MST, David O'Brien wrote:
 
  However, those boxes were panicking often before I made that statement.
  So I still believe current is now in better shape than it was in June.
 
  I'll be a lot happier when I can enable DDB_UNATTENDED and do whatever
  it is that causes my panic of the day and actually get a crashdump
  instead of
 
panic: witness_restore: lock (sleep mutex) Giant not locked

 This is a different one.  Is this during the dump itself?  That I can try to
 work on.  (Basically, I need to make witness just stop doing all of its
 various
 checks if panicstr != NULL).
 
   I'm getting the following lock order reversal for any -current
 since July 19, 2001 including today and it just hangs solid after this, no
 db prompt or anything...  It only happens after passwd or chpass
 successfully rebuilds the database, vipw works fine.
 
 root@pele [9:29pm][/usr/temp] 
 Jul 28 21:29:40 pele /boot/kernel/kernel: lock order reversal
 Jul 28 21:29:40 pele /boot/kernel/kernel: lock order reversal
 Jul 28 21:29:40 pele /boot/kernel/kernel: 1st 0xd92fea9c process lock @
 /usr/src/sys/vm/vm_glue.c:469
 Jul 28 21:29:40 pele /boot/kernel/kernel: 1st 0xd92fea9c process lock @
 /usr/src/sys/vm/vm_glue.c:469
 Jul 28 21:29:40 pele /boot/kernel/kernel: 2nd 0xc118dfb0 lockmgr interlock
 @ /usr/src/sys/kern/kern_lock.c:239
 Jul 28 21:29:40 pele /boot/kernel/kernel: 2nd 0xc118dfb0 lockmgr interlock
 @ /usr/src/sys/kern/kern_lock.c:239

This is due to the way that lockmgr locks are implemented unfortunately, and
will be fixed when vm maps switch to sx locks instead of lockmgr locks.

-- 

John Baldwin [EMAIL PROTECTED] -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
Power Users Use the Power to Serve!  -  http://www.FreeBSD.org/




Re: -current lockups

2001-07-30 Thread David O'Brien

On Sun, Jul 29, 2001 at 07:00:11PM -0700, Kris Kennaway wrote:
 For the past 2 or 3 weeks my -current system has been experiencing
 temporary lockups, usually under disk load.  The entire system will
 hang for around 20-30 seconds, during which time absolutely no
 network/IO/keyboard/mouse activity is accepted.  Usually, after 20-30
 seconds the system will unwedge and activity will resume, but
 sometimes it hangs forever.  There are no console messages logged by
 this event.  I cannot break into DDB until after system activity
 resumes normally.

I am also experiencing total wedging on disk activity (vi foo, was one)
on a SCSI system since I updated late last week.  My May 7th kernel was
rock solid.

-- 
-- David  ([EMAIL PROTECTED])




Re: -current lockups

2001-07-30 Thread Sheldon Hearn



On Mon, 30 Jul 2001 09:28:09 MST, John Baldwin wrote:

panic: witness_restore: lock (sleep mutex) Giant not locked
 
 This is a different one.  Is this during the dump itself?  That I can
 try to work on.  (Basically, I need to make witness just stop doing
 all of its various checks if panicstr != NULL).

Oh cool!  Yes, this is during the dump.  I get the witness panic,
followed by something along the lines of "dump already in progress,
bailing".

Ciao,
Sheldon.




Re: -current lockups

2001-07-30 Thread Sheldon Hearn



On Mon, 30 Jul 2001 07:26:55 MST, David O'Brien wrote:

 I am also experiencing total wedging on disk activity (vi foo, was one)
 on a SCSI system since I updated late last week.  My May 7th kernel was
 rock solid.

Was this before or after you posted publicly that -CURRENT seemed
stable and that now is a good time to upgrade if you've been holding
back? :-)

Ciao,
Sheldon.




Re: -current lockups

2001-07-30 Thread Sheldon Hearn



On Mon, 30 Jul 2001 07:38:47 MST, David O'Brien wrote:

 However, those boxes were panicking often before I made that statement.
 So I still believe current is now in better shape than it was in June.

I'll be a lot happier when I can enable DDB_UNATTENDED and do whatever
it is that causes my panic of the day and actually get a crashdump
instead of

panic: witness_restore: lock (sleep mutex) Giant not locked

:-)

Fortunately, jhb has said he'll try to take a look at this some time this
week.  However, if I hadn't interacted with the guy directly, I'd be
pretty frustrated with -CURRENT.

Ciao,
Sheldon.




Re: -current lockups

2001-07-29 Thread Matthew Jacob


This has happened on and off for various platforms since SMPNG.


On Sun, 29 Jul 2001, Kris Kennaway wrote:

 Hi all,
 
 For the past 2 or 3 weeks my -current system has been experiencing
 temporary lockups, usually under disk load.  The entire system will
 hang for around 20-30 seconds, during which time absolutely no
 network/IO/keyboard/mouse activity is accepted.  Usually, after 20-30
 seconds the system will unwedge and activity will resume, but
 sometimes it hangs forever.  There are no console messages logged by
 this event.  I cannot break into DDB until after system activity
 resumes normally.
 
 My system is a PPro 233 using IDE drives.  Has anyone else seen
 something like this?
 
 Kris
 
 





Re: current lockups

2000-03-18 Thread Arun Sharma

 Compiling Mozilla with make -j 2 got -current to lock up, twice in
 succession. I'm running a fairly recent snapshot (a week or two old)
 on a Dual celeron box (BP6) with UDMA66 enabled.
 
 The kernel had DDB enabled. I was running X, but I didn't see any
 signs of the kernel attempting to get into the debugger.
 
 Has this been fixed ? Is anyone interested in investigating ?
 I'll post more info if I find anything.

Another data point: I had another lockup today. I left the box to do
a buildworld and went out for dinner. When I'd returned, the machine
had locked up tight, but the orange LED on the disk was on. Now, I don't
know if the problem is my WDC 20 GB disk or something else in the ATA 
driver.

I'm running on the RELENG_4 branch as of yesterday night.

-Arun





Re: current lockups

2000-03-09 Thread Peter Dufault

 On 2000-Mar-09 10:05:21 +1100, Peter Dufault [EMAIL PROTECTED] wrote:
 There's no difference between rtprio and P1003.1B scheduling other than
 the name.  rtprio is the same as P1003.1B "SCHED_RR".
 
 I wasn't aware of that.
 
 I'd like to remove the rtprio call from ntpd.  I think we ought to do
 it now before 4.0 ships.
 
 Given there is a known priority inversion bug related to realtime
 (or idle) scheduling, it would seem wise not to use it in any system
 utilities.  The relevant patch would appear to be (untested):
 
 --- /usr/src/usr.sbin/ntp/config.hTue Feb  1 13:56:05 2000
 +++ /tmp/config.h Thu Mar  9 11:46:11 2000

You have to do something in the "./configure" stuff. Hopefully someone
in the know can suggest the "--with-no-foobar" option needed on the
command line so I don't have to wade into it.

Autoconfiguring POSIX realtime is a bad, bad idea because:

1. You don't know if it is available in all environments;
2. You don't know who is allowed to use it;
3. You don't know what the heck it does.

It decidedly does not mean "run as fast as you can".
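
A minimal sketch of the alternative to deciding this at configure time:
ask at run time whether SCHED_RR is available and permitted, and simply
carry on at normal priority when it is not.  try_realtime() is a
hypothetical helper written for this illustration, not anything from ntpd;
the calls it makes (sched_get_priority_min(), sched_get_priority_max(),
sched_setscheduler()) are the standard POSIX ones.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <string.h>
#include <unistd.h>

/*
 * Try to move the calling process to SCHED_RR at the lowest realtime
 * priority.  Returns 0 on success, otherwise an errno value: ENOSYS when
 * the option is not available at all, EPERM when the caller is not
 * allowed to use it, and so on.
 */
int
try_realtime(void)
{
#if defined(_POSIX_PRIORITY_SCHEDULING) && _POSIX_PRIORITY_SCHEDULING > 0
	struct sched_param sp;
	int lo = sched_get_priority_min(SCHED_RR);
	int hi = sched_get_priority_max(SCHED_RR);

	if (lo == -1 || hi == -1)
		return (errno);		/* policy not supported here */
	assert(lo <= hi);

	memset(&sp, 0, sizeof(sp));
	sp.sched_priority = lo;		/* ask for as little as possible */
	if (sched_setscheduler(0, SCHED_RR, &sp) == -1)
		return (errno);		/* typically EPERM without privilege */
	return (0);
#else
	return (ENOSYS);		/* option not offered at compile time */
#endif
}
```

A caller would treat any non-zero return as "run without realtime
scheduling" rather than as a fatal error, which covers the first two points
above; the third still needs a human who knows what the priority is for.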

Peter
--
Peter Dufault ([EMAIL PROTECTED])   Realtime development, Machine control,
HD Associates, Inc.   Fail-Safe systems, Agency approval





Re: current lockups

2000-03-07 Thread Dave Boers

It is rumoured that Vallo Kallaste had the courage to say:
 I had a lockup yesterday while stress-testing a new SMP machine. Tyan
 motherboard with Intel GX chipset, 256MB of memory, one 20GB IBM UDMA66
 disk, but running at UDMA33. All power management disabled completely in
 the BIOS. I was doing massive parallel compiling of GENERIC kernels.
 I left the machine doing this overnight, and in the morning the console
 had about 20 'microuptime() went backwards' messages. I was able to
 switch vtys but not log in; the machine responded to pings, but there
 was no disk activity. I'm using the ata driver and only one unusual
 kernel option, HZ=1000.

Your symptoms are not the same as mine. In my case the lockups are
complete. No switching of vt's, no pings, nothing at all. 

I never saw any "microuptime() went backwards" messages either. But then
again, I never had the machine lockup on the console; I was usually logged
in over the network or working in X. 

Regards, 

Dave Boers. 

-- 
  Dave Boers  djb @ relativity . student . utwente . nl 
  Don't let your schooling interfere with your education. (Mark Twain)





Re: current lockups

2000-03-07 Thread Peter Jeremy

On 2000-Mar-07 06:29:17 +1100, Dave Boers [EMAIL PROTECTED] wrote:
It is rumoured that Arun Sharma had the courage to say:
 Compiling Mozilla with make -j 2 got -current to lock up, twice in
 succession. I'm running a fairly recent snapshot (a week or two old)
 on a Dual celeron box (BP6) with UDMA66 enabled.

Finally. I've been complaining about this on several occasions. I'm also
running UDMA66 and Dual Celeron BP6. No overclocking. 

Later postings mention possible problems with UDMA66.  The other
possibility that has been discussed recently is potential priority
inversions for processes using rtprio and idprio.

Note that ntpd will use rtprio if the Posix P1003.1b extensions aren't
enabled in the kernel.  (These were enabled by default in GENERIC on
i386 in mid-January).  If you have the new ntpd (rather than xntpd)
and are running a kernel without options P1003_1B,
_KPOSIX_PRIORITY_SCHEDULING and _KPOSIX_VERSION=199309L, you could
potentially get a lockup due to a priority inversion.  (Though I
think the probability is very small).
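
For anyone unfamiliar with the term, here is a toy illustration of a
priority inversion under a strict-priority scheduler.  It is a made-up
discrete simulation, not a model of the FreeBSD scheduler or of ntpd:
simulate() and its three tasks LOW, MID and HIGH are hypothetical.  LOW
holds a lock for two ticks of work, HIGH needs that lock and then one tick,
and MID just wants CPU.  The `inherit` flag switches on priority
inheritance, a standard remedy that is named here only for contrast.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Run the toy scheduler for at most `ticks` ticks.  Returns the tick at
 * which HIGH finishes, or -1 if HIGH is still waiting when time runs out.
 */
int
simulate(int ticks, int mid_work, bool inherit)
{
	int low_left = 2;		/* LOW holds the lock until this is 0 */
	int high_left = 1;		/* HIGH's work once it gets the lock */
	int mid_left = mid_work;	/* MID's lock-free CPU demand */

	assert(ticks >= 0 && mid_work >= 0);
	for (int t = 1; t <= ticks; t++) {
		bool lock_held = low_left > 0;

		if (!lock_held) {
			/* Lock is free: HIGH has top priority and runs. */
			if (--high_left == 0)
				return (t);
		} else if (inherit) {
			/* LOW borrows HIGH's priority and outranks MID. */
			low_left--;
		} else if (mid_left > 0) {
			/* MID outranks LOW, so HIGH stays blocked. */
			mid_left--;
		} else {
			/* Nothing else is runnable: LOW finally runs. */
			low_left--;
		}
	}
	return (-1);
}
```

With a long-running MID, simulate(10, 100, false) returns -1 (HIGH never
runs, which from the outside looks like a lockup), while
simulate(10, 100, true) returns 3.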

Peter





Re: current lockups

2000-03-07 Thread Matthew Dillon


:On 2000-Mar-07 06:29:17 +1100, Dave Boers [EMAIL PROTECTED] wrote:
:It is rumoured that Arun Sharma had the courage to say:
: Compiling Mozilla with make -j 2 got -current to lock up, twice in
: succession. I'm running a fairly recent snapshot (a week or two old)
: on a Dual celeron box (BP6) with UDMA66 enabled.
:
:Finally. I've been complaining about this on several occasions. I'm also
:running UDMA66 and Dual Celeron BP6. No overclocking. 
:
:Later postings mention possible problems with UDMA66.  The other
:possibility that has been discussed recently is potential priority
:inversions for processes using rtprio and idprio.
:
:Note that ntpd will use rtprio if the Posix P1003.1b extensions aren't
:enabled in the kernel.  (These were enabled by default in GENERIC on
:i386 in mid-January).  If you have the new ntpd (rather than xntpd)
:and are running a kernel without options P1003_1B,
:_KPOSIX_PRIORITY_SCHEDULING and _KPOSIX_VERSION=199309L, you could
:potentially get a lockup due to a priority inversion.  (Though I
:think the probability is very small).
:
:Peter

p.s. the first thing anyone having potential IDE problems should do is
try the older 'wd' driver (if it supports their chipset) and see if
that solves the problem.  At least then we can focus on the precise
location of the problem.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]





Re: current lockups

2000-03-06 Thread Vallo Kallaste

On Mon, Mar 06, 2000 at 08:27:18PM +0100, Dave Boers 
[EMAIL PROTECTED] wrote:

 I'm interested in the fix, of course :-) But where to start looking? I've
 had three lockups so far (none before January 2000) but I didn't find
 anything that reliably triggered it. 

I had a lockup yesterday while stress-testing a new SMP machine: Tyan
motherboard with Intel GX chipset, 256MB of memory, one 20GB IBM UDMA66
disk, but running at UDMA33. All power management disabled completely in
the BIOS. I was doing massive parallel compiling of GENERIC kernels.
I left the machine doing this overnight, and in the morning the console
had about 20 'microuptime() went backwards' messages. I was able to
switch vtys but not log in; the machine responded to pings, but there was
no disk activity. I'm using the ata driver and only one unusual kernel
option, HZ=1000.
-- 

Vallo Kallaste
[EMAIL PROTECTED]





Re: current lockups

2000-03-06 Thread Matthew Sean Thyer

I'll second this email...

My computer had been stable all winter (with setiathome running full
time) but suddenly come the Australian summer it started freezing.

Not panicking, just totally freezing under load.

I could reproduce it by trying to build the whole of KDE and each time
it was a freeze, never a panic.

Windows 98 was freezing too but I didn't think that was abnormal ;/

It turned out to be heat related as the machine is now stable after I
installed a case fan (before I only had the power supply fan and CPU
fan).

I see that the internal case temperature still gets up to about 50 or 51
degrees Celsius whereas it was getting to 52 degrees before.

Note that I AM overclocking a Celeron 300a to 450 MHz by running with
a 100 MHz FSB instead of 66 MHz so I suppose I shouldn't be surprised at
the need for better cooling.

As I'd prefer better CPU cooling to the case fan on the grounds of noise,
can people recommend good CPU fans (over the standard Intel retail version
Celeron 300a fan) ?  How about these Peltier (sp ?) cooling devices I have
heard about ?

On Sun, 5 Mar 2000, Dan Papasian wrote:

 1. Is your computer overclocked?
 
 2. Is the computer totally frozen?  (i.e. scroll lock doesn't turn the light on)
 
 3. Does similar load crash the box as well?  (try make -j2 world)
 
 4. Does it freeze in the same spot?
 
 5. Is the computer not responding to pings?
 
 If you've answered yes to a good amount of these questions, there is a good
 chance that your processor(s) are overheating.  Try improving the airflow to the
 case (But using a household fan isn't recommended due to EMI)
 
 -Dan Papasian
 [EMAIL PROTECTED]
 
 On Sat, Mar 04, 2000 at 11:50:10PM -0800, Arun Sharma wrote:
  Compiling Mozilla with make -j 2 got -current to lock up, twice in
  succession. I'm running a fairly recent snapshot (a week or two old)
  on a Dual celeron box (BP6) with UDMA66 enabled.
  
  The kernel had DDB enabled. I was running X, but I didn't see any
  signs of the kernel attempting to get into the debugger.
  
  Has this been fixed ? Is anyone interested in investigating ?
  I'll post more info if I find anything.
  
  -Arun
  
  
 






Re: current lockups

2000-03-06 Thread Dave Boers

It is rumoured that Arun Sharma had the courage to say:
 Compiling Mozilla with make -j 2 got -current to lock up, twice in
 succession. I'm running a fairly recent snapshot (a week or two old)
 on a Dual celeron box (BP6) with UDMA66 enabled.

Finally. I've been complaining about this on several occasions. I'm also
running UDMA66 and Dual Celeron BP6. No overclocking. 
 
 The kernel had DDB enabled. I was running X, but I didn't see any
 signs of the kernel attempting to get into the debugger.

Ditto here. 
 
 Has this been fixed ? Is anyone interested in investigating ?
 I'll post more info if I find anything.

I'm interested in the fix, of course :-) But where to start looking? I've
had three lockups so far (none before January 2000) but I didn't find
anything that reliably triggered it. 

Regards, 

Dave. 

-- 
  Dave Boers  djb @ relativity . student . utwente . nl 
  Don't let your schooling interfere with your education. (Mark Twain)





Re: current lockups

2000-03-06 Thread Arun Sharma

On Mon, Mar 06, 2000 at 08:27:18PM +0100, Dave Boers wrote:
  Has this been fixed ? Is anyone interested in investigating ?
  I'll post more info if I find anything.
 
 I'm interested in the fix, of course :-) But where to start looking? I've
 had three lockups so far (none before January 2000) but I didn't find
 anything that reliably triggered it. 

The cooling theory sounds the most plausible so far. I'm not over clocking
my CPUs (Celeron 366s) and have appropriate cooling installed. But the
machine is kept in a small room, with a bunch of other machines and gets
a bit warm at times.

There has been no reproducible case of locking up. Each one looks different.
But most were triggered by heavy compilation and I/O. One was a lockup
overnight with no activity on the system. When it happens, it does not
respond to pings or scroll lock.

If you'd like to do something about it, working on getting a reproducible
hang would be the most beneficial one.

-Arun





Re: current lockups

2000-03-06 Thread Dave Boers

It is rumoured that Arun Sharma had the courage to say:
 The cooling theory sounds the most plausible so far. I'm not over clocking
 my CPUs (Celeron 366s) and have appropriate cooling installed. But the
 machine is kept in a small room, with a bunch of other machines and gets
 a bit warm at times.

My system has been at 50 degrees Celsius for the past half year or so. Yet,
the lockups only started occurring around January 2000. Once again, my
system is not overclocked and the temperature is well within Intel's and
Abit's temperature specifications, so there shouldn't be hardware problems. 

 There has been no reproducible case of locking up. Each one looks different.
 But most were triggered by heavy compilation and I/O. One was a lockup
 overnight with no activity on the system. When it happens, it does not
 respond to pings or scroll lock.

Most of my lockups occurred when the system was relatively idle. Mostly
they happened only after 9 - 11 days of uptime. As you say, each one looks
different and there doesn't seem to be a pattern to it. When it locks up,
there is no response to the console, the network or the serial terminal.
Only the reset button is obeyed. I have DDB in my kernel, but there's no
getting into it. Also, no log messages of any kind from just before the
lockups.  

 If you'd like to do something about it, working on getting a reproducible
 hang would be the most beneficial one.

That's what I have been trying to do for the past few weeks, but I can't
seem to trigger it. Uptime is now 2 days and I intend to let it run to 12
or so before make installworld again, to see if I can reproduce it.
However, I did recently change from UDMA66 to an U2W SCSI disk for my main
partitions (/, /usr, /var, /tmp and swap). It may have impact on the
situation and it is the reason for the short uptime. If the problem has
gone away now, it might indicate something with the ATA driver. I'll keep
you informed. So far, since the disk change I've been putting my system
under some heavy load from time to time (like building three large ports
and make -j 12 buildworld at the same time). So far, the system is quite
stable. 

Regards, 

Dave Boers. 

-- 
  Dave Boers  djb @ relativity . student . utwente . nl 
  Don't let your schooling interfere with your education. (Mark Twain)





Re: current lockups

2000-03-06 Thread sthaug

 The cooling theory sounds the most plausible so far. I'm not over clocking
 my CPUs (Celeron 366s) and have appropriate cooling installed. But the
 machine is kept in a small room, with a bunch of other machines and gets
 a bit warm at times.

I have seen a couple of suggestions that this may not be the CPUs - but
that the 82443BX chip (the one with the large green cooling fin) doesn't
always get sufficient cooling on a BP6 board. Some thermal compound
between the 82443BX and the cooling fin may be a good idea.

Steinar Haug, Nethelp consulting, [EMAIL PROTECTED]





Re: current lockups

2000-03-06 Thread Dan Papasian

On Mon, Mar 06, 2000 at 08:27:18PM +0100, Dave Boers wrote:
  on a Dual celeron box (BP6) with UDMA66 enabled.
 
 Finally. I've been complaining about this on several occasions. I'm also
 running UDMA66 and Dual Celeron BP6. No overclocking. 

Can you people reproduce this on a kernel without SMP enabled?
Perhaps there is a locking issue?  However, that'd lead to a panic I'd imagine..
So see if you can reproduce this with one CPU running so we can at least
eliminate one of the variables.

-Dan Papasian
[EMAIL PROTECTED]





Re: current lockups

2000-03-06 Thread Marius Strom

I'm willing to bet a nickel (perhaps more) you people are running non-IBM
UDMA66 drives on that BP6.  Seems that most UDMA66 drives are not actually
UDMA66 compliant, and the only drives that have been reported successful
on the BP6 are IBM.  Try taking your HD's off the UDMA66 controller and
put them on the Standard UDMA33 controllers, and it should clear things
up.

-- 
Marius Strom [EMAIL PROTECTED]
Professional Geek/Unix System Administrator
Alpha1 Internet http://www.alpha1.net
http://www.marius.org/marius.pgp 0x42C74CBA *UPDATED PGP KEY 2/24/2000*

In theory, there is no difference between theory and practice...
...In practice, there is a big difference.

On Mon, 6 Mar 2000, Dan Papasian wrote:

 On Mon, Mar 06, 2000 at 08:27:18PM +0100, Dave Boers wrote:
   on a Dual celeron box (BP6) with UDMA66 enabled.
  
  Finally. I've been complaining about this on several occasions. I'm also
  running UDMA66 and Dual Celeron BP6. No overclocking. 
 
 Can you people reproduce this on a kernel without SMP enabled?
 Perhaps there is a locking issue?  However, that'd lead to a panic I'd imagine..
 So see if you can reproduce this with one CPU running so we can at least
 eliminate one of the variables.
 
 -Dan Papasian
 [EMAIL PROTECTED]
 
 
 






Re: current lockups

2000-03-06 Thread Dave Boers

It is rumoured that Marius Strom had the courage to say:
 I'm willing to bet a nickel (perhaps more) you people are running non-IBM
 UDMA66 drives on that BP6.  Seems that most UDMA66 drives are not actually
 UDMA66 compliant, and the only drives that have been reported successful
 on the BP6 are IBM.  Try taking your HD's off the UDMA66 controller and
 put them on the Standard UDMA33 controllers, and it should clear things
 up.

I'm interested in the sources of your statement about IBM drives vs.
non-IBM drives. 

In my case, I have a WD 18.2 Gb 7200 rpm disk which has been reported to be
identical to the IBM 18.2 Gb 7200 rpm disk on more than one occasion. And
by the way, my system has been running quite stable before January 2000
with the same disk on the same controller and the same mainboard. 

Regards, 

Dave Boers. 

-- 
  Dave Boers  djb @ relativity . student . utwente . nl 
  Don't let your schooling interfere with your education. (Mark Twain)





Re: current lockups

2000-03-06 Thread Marius Strom

Dave,
Well, there was a discussion a few weeks back with Soren Schmidt and a few
others.  I believe the conclusion was made that this occurred with most WD
drives (interesting about the WD == IBM part, I did notice he mentioned
that in -current a few weeks ago as well).  I had a WD20 gig that would
just hang, and a number of other people had similar problems. (Theirs
would log "Lost Disk Contact" in the dmesg as their root dev wasn't a
UDMA66 drive)

Unfortunately, the discussions occurred while the mailing list archive was
kaput (WD Drive on UDMA66? =]) so it's not archived where I can find it.

Seems to only happen with the ata driver, IIRC.

-- 
Marius Strom [EMAIL PROTECTED]
Professional Geek/Unix System Administrator
Alpha1 Internet http://www.alpha1.net
http://www.marius.org/marius.pgp 0x42C74CBA *UPDATED PGP KEY 2/24/2000*

In theory, there is no difference between theory and practice...
...In practice, there is a big difference.

On Mon, 6 Mar 2000, Dave Boers wrote:

 It is rumoured that Marius Strom had the courage to say:
  I'm willing to bet a nickel (perhaps more) you people are running non-IBM
  UDMA66 drives on that BP6.  Seems that most UDMA66 drives are not actually
  UDMA66 compliant, and the only drives that have been reported successful
  on the BP6 are IBM.  Try taking your HD's off the UDMA66 controller and
  put them on the Standard UDMA33 controllers, and it should clear things
  up.
 
 I'm interested in the sources of your statement about IBM drives vs.
 non-IBM drives. 
 
 In my case, I have a WD 18.2 Gb 7200 rpm disk which has been reported to be
 identical to the IBM 18.2 Gb 7200 rpm disk on more than one occasion. And
 by the way, my system has been running quite stable before January 2000
 with the same disk on the same controller and the same mainboard. 
 
 Regards, 
 
 Dave Boers. 
 
 






Re: current lockups

2000-03-06 Thread Dave Boers

It is rumoured that Marius Strom had the courage to say:
 Well, there was a discussion a few weeks back with Soren Schmidt and a few
 others.  I believe the conclusion was made that this occurred with most WD
 drives (interesting about the WD == IBM part, I did notice he mentioned
 that in -current a few weeks ago as well).  I had a WD20 gig that would
 just hang, and a number of other people had similar problems. (Theirs
 would log "Lost Disk Contact" in the dmesg as their root dev wasn't a
 UDMA66 drive)

Interesting. I'll check my own archives of -current to see if I can find
the discussion. I always thought that the "Lost Disk Contact" messages were
due to the disk recalibrating itself after six days of continued use. After
Soren increased the timeout from 5 to 10 seconds, I never saw the problem
again, IIRC. 

For the record, (see my mail elsewhere in the thread) I have recently added
an U2W SCSI harddisk to the system (because I found that the UDMA
effectively cuts off memory access for the two Celerons for long times and
because the Celerons haven't got nearly enough cache they are effectively
waiting for the IDE disk all the time) and I'm now running my root
filesystem on that drive (as well as most of my other important
filesystems). So I guess that if your assertion is right then my problem
should have gone away now.  I haven't seen any "Lost Disk Contact" messages
recently, however, though the UDMA66 drive is still connected. 

BTW, are there any people out there that have similar hangs and are NOT
using UDMA66 or the ATA driver ? 

 Unfortunately, the discussions occurred while the mailing list archive was
 kaput (WD Drive on UDMA66? =]) so it's not archived where I can find it.

:-)
 
Regards, 

Dave Boers. 

-- 
  Dave Boers  djb @ relativity . student . utwente . nl 
  Don't let your schooling interfere with your education. (Mark Twain)





Re: current lockups

2000-03-06 Thread Marius Strom

 Interesting. I'll check my own archives of -current to see if I can find
 the discussion. I always thought that the "Lost Disk Contact" messages were
 due to the disk recalibrating itself after six days of continued use. After
 Soren increased the timeout from 5 to 10 seconds, I never saw the problem
 again, IIRC. 

Six days?  Nah.. I had the problem occur anywhere from 5 minutes to 12
hours after a system boot.  Moved the 20G WD to UDMA33 channel, works
flawlessly.  Usually, I could reproduce the problem doing heavy disk
I/O. However, I was once able to make it through a "make buildworld",
so that's not entirely true either.

 
 For the record, (see my mail elsewhere in the thread) I have recently added
 an U2W SCSI harddisk to the system (because I found that the UDMA
 effectively cuts off memory access for the two Celerons for long times and
 because the Celerons haven't got nearly enough cache they are effectively
 waiting for the IDE disk all the time) and I'm now running my root
 filesystem on that drive (as well as most of my other important
 filesystems). So I guess that if your assertion is right then my problem
 should have gone away now.  I haven't seen any "Lost Disk Contact" messages
 recently, however, though the UDMA66 drive is still connected. 
 

For my record, I was unable to get dmesg output because the system was
completely hung.  Other people could get it because they had other drives
to write logging information to when the UDMA drive was locked.

---
Marius Strom [EMAIL PROTECTED]
Professional Geek/Unix System Administrator
Alpha1 Internet http://www.alpha1.net
http://www.marius.org/marius.pgp 0x42C74CBA *UPDATED PGP KEY 2/24/2000*

In theory, there is no difference between theory and practice...
...In practice, there is a big difference.






Re: current lockups

2000-03-06 Thread Peter Jeremy

On 2000-Mar-06 21:39:11 +1100, Matthew Sean Thyer [EMAIL PROTECTED] wrote:
My computer had been stable all winter (with setiathome running full
time) but suddenly come the Australian summer it started freezing.

And it's been the coldest summer for something like 5 years...

 How about these Peltier (sp ?) cooling devices I have heard about ?

A Peltier cell is just a semiconductor heat pump.  It effectively just
reduces the junction-to-heatsink thermal resistance, allowing you (in
theory) to use a less efficient heatsink (or have the CPU run cooler
with the same heatsink).  The downside is that they're relatively
inefficient - your power supply will need to supply an extra 3-4A at
12V and you need to dissipate that extra power.  Unless you
significantly improve the airflow through the case, you'll probably
find that the internal temperature rises significantly - further
stressing everything except the CPU.

Note that the chip that most needs cooling may not be the CPU - the
big support chips can also run very hot.

Peter





Re: current lockups

2000-03-06 Thread Poul-Henning Kamp

In message [EMAIL PROTECTED], Peter Jeremy writes:

 How about these Peltier (sp ?) cooling devices I have heard about ?

A Peltier cell is just a semiconductor heat pump.  It effectively just
reduces the junction-to-heatsink thermal resistance, allowing you (in
theory) to use a less efficient heatsink (or have the CPU run cooler
with the same heatsink).

This is actually not true, quite the contrary in fact:  You need
a better heat-sink with a Peltier because of the significant
electrical power you pump into it.

As a general rule you can expect to *raise* your CPU temperature if
you put a peltier under anything less than a *very good* heat-sink.


Example:

A Celeron 500 dissipates about 25W

An average heatsink is about .8 C/W

delta-T becomes 25W * .8C/W = 20C

At 30C ambient that becomes 50C CPU temperature.

Now, add a peltier.  To remove 25W and keep a 25C
temperature difference we need to feed it about 50W

Now the heatsink has to deal with 25 + 50 W and the
delta-T becomes: (25W + 50W) * .8C/W = 60C

Subtract the 25C difference from the peltier and add
the ambient temperature and we find:

30C + 60C - 25C = 65C

We just raised our CPU temperature about 15 C :-(
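
The arithmetic above can be replayed with a few lines of Python. This is
just a sketch of the same steady-state model (temperature rise = power
times thermal resistance); the function name `cpu_temp` and its
parameter names are invented here, and the figures are the ones from the
example.

```python
def cpu_temp(ambient_c, cpu_watts, heatsink_c_per_w,
             peltier_watts=0.0, peltier_delta_c=0.0):
    """Steady-state CPU temperature in degrees C (simplified model).

    The heatsink has to shed the CPU power plus whatever electrical
    power is fed to the Peltier; the Peltier then keeps the CPU
    `peltier_delta_c` below the heatsink temperature.
    """
    heatsink_rise = (cpu_watts + peltier_watts) * heatsink_c_per_w
    return ambient_c + heatsink_rise - peltier_delta_c


# Figures from the example: 25W CPU, .8 C/W heatsink, 30C ambient,
# and a Peltier fed 50W to hold a 25C difference.
plain = cpu_temp(30, 25, 0.8)
with_peltier = cpu_temp(30, 25, 0.8, peltier_watts=50, peltier_delta_c=25)

print(f"heatsink only: {plain:.0f} C")
print(f"with peltier:  {with_peltier:.0f} C")
print(f"difference:    {with_peltier - plain:+.0f} C")
```

Plugging in a lower C/W figure for the heatsink shows how good it has to
be before the Peltier stops making things worse.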

--
Poul-Henning Kamp FreeBSD coreteam member
[EMAIL PROTECTED]   "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!





Re: current lockups

2000-03-06 Thread Dave Boers

It is rumoured that Peter Jeremy had the courage to say:
 Note that ntpd will use rtprio if the Posix P1003.1b extensions aren't
 enabled in the kernel.  (These were enabled by default in GENERIC on
 i386 in mid-January).  If you have the new ntpd (rather than xntpd)
 and are running a kernel without options P1003_1B,
 _KPOSIX_PRIORITY_SCHEDULING and _KPOSIX_VERSION=199309L, you could
 potentially get a lockup due to a priority inversion.  (Though I
 think the probability is very small).

I don't use ntpd (I use ntpdate) and I do have those options enabled in my
kernel (all three of them). IIRC they are needed to get either cdrdao or
cdrecord to work. 

Seems that everything points to UDMA66 so far...

Regards, 

Dave Boers. 

-- 
  Dave Boers  djb @ relativity . student . utwente . nl 
  Don't let your schooling interfere with your education. (Mark Twain)





Re: current lockups

2000-03-06 Thread Chris Piazza

On Mon, Mar 06, 2000 at 11:59:21PM +0100, Dave Boers wrote:
 It is rumoured that Peter Jeremy had the courage to say:
  Note that ntpd will use rtprio if the Posix P1003.1b extensions aren't
  enabled in the kernel.  (These were enabled by default in GENERIC on
  i386 in mid-January).  If you have the new ntpd (rather than xntpd)
  and are running a kernel without options P1003_1B,
  _KPOSIX_PRIORITY_SCHEDULING and _KPOSIX_VERSION=199309L, you could
  potentially get a lockup due to a priority inversion.  (Though I
  think the probability is very small).
 
 I don't use ntpd (I use ntpdate) and I do have those options enabled in my
 kernel (all three of them). IIRC they are needed to get either cdrdao or
 cdrecord to work. 
 
 Seems that everything points to UDMA66 so far...

...maybe in certain combinations.

I have a BP6 with dual celerons (466's @ 504) and have had no problems
whatsoever.

FreeBSD 4.0-CURRENT #4: Sun Mar  5 12:20:41 PST 2000
[EMAIL PROTECTED]:/usr/src/sys/compile/NORN
Timecounter "i8254"  frequency 1193182 Hz
CPU: Pentium II/Pentium II Xeon/Celeron (503.92-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0x665  Stepping = 5
  Features=0x183fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR>
real memory  = 268369920 (262080K bytes)
avail memory = 256987136 (250964K bytes)
Programming 24 pins in IOAPIC #0
IOAPIC #0 intpin 2 - irq 0
FreeBSD/SMP: Multiprocessor motherboard
 cpu0 (BSP): apic id:  0, version: 0x00040011, at 0xfee0
 cpu1 (AP):  apic id:  1, version: 0x00040011, at 0xfee0
 io0 (APIC): apic id:  2, version: 0x00170011, at 0xfec0

ad0: 9765MB FUJITSU MPC3102AT E [19841/16/63] at ata0-master using UDMA33
ad4: 12949MB IBM-DJNA-371350 [28064/15/63] at ata2-master using UDMA66
acd0: CDROM DELTA OPC-K101/ST1 F/W by OIPD at ata1-slave using PIO4

ad0 is a DOS drive, ad4 is what I have FreeBSD on.

-Chris
-- 
[EMAIL PROTECTED]   [EMAIL PROTECTED]
Abbotsford, BC, Canada





Re: current lockups

2000-03-06 Thread Thimble Smith

On Mon, Mar 06, 2000 at 03:46:25PM -0600, Marius Strom wrote:
Unfortunately, the discussions occurred while the mailing list archive was
kaput (WD Drive on UDMA66? =]) so it's not archived where I can find it.

I think this is the thread you're looking for:

http://marc.theaimsgroup.com/?t=9503732951w=2r=1

Tim





Re: current lockups

2000-03-05 Thread Dan Papasian

1. Is your computer overclocked?

2. Is the computer totally frozen?  (i.e. scroll lock doesn't turn the light on)

3. Does similar load crash the box as well?  (try make -j2 world)

4. Does it freeze in the same spot?

5. Is the computer not responding to pings?

If you've answered yes to a good amount of these questions, there is a good
chance that your processor(s) are overheating.  Try improving the airflow to the
case (But using a household fan isn't recommended due to EMI)
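
As a rough way to tally the five questions above, something like the
following Python sketch would do. The function name, the list of
question labels and especially the threshold of three "yes" answers are
assumptions; "a good amount" is not defined any more precisely than
that.

```python
# Short labels for the five questions, in the order they are asked.
QUESTIONS = (
    "overclocked",
    "totally frozen (scroll lock LED does not respond)",
    "similar load also crashes the box",
    "freezes in the same spot",
    "not responding to pings",
)


def overheating_suspected(answers, threshold=3):
    """Return True if enough questions were answered 'yes'.

    `answers` is one boolean per question; `threshold` is an assumed
    cut-off for "a good amount" and can be tuned.
    """
    if len(answers) != len(QUESTIONS):
        raise ValueError("expected one answer per question")
    yes_count = sum(1 for answer in answers if answer)
    return yes_count >= threshold


# Hypothetical answers, purely to show the call.
print(overheating_suspected([True, True, True, False, True]))
```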

-Dan Papasian
[EMAIL PROTECTED]

On Sat, Mar 04, 2000 at 11:50:10PM -0800, Arun Sharma wrote:
 Compiling Mozilla with make -j 2 got -current to lock up, twice in
 succession. I'm running a fairly recent snapshot (a week or two old)
 on a Dual celeron box (BP6) with UDMA66 enabled.
 
 The kernel had DDB enabled. I was running X, but I didn't see any
 signs of the kernel attempting to get into the debugger.
 
 Has this been fixed ? Is anyone interested in investigating ?
 I'll post more info if I find anything.
 
   -Arun
 
 

