Re: DLTs, disconnection and hangs--solved (was: Recent current hangs frequently for 1 to 2 seconds.)
On Wed, Dec 22, 1999 at 09:37:55AM +1030, Greg Lehey wrote:
> On Tuesday, 21 December 1999 at 10:07:28 +0100, Wilko Bulte wrote:
>> On Mon, Dec 20, 1999 at 05:08:27PM -0800, Matthew Dillon wrote:
>>>> It's possible you might be on to something. I've been running iostat
>>>> at 1 second intervals, and during the last hang I saw:
>>>
>>> * Not properly terminate the SCSI bus (especially when mixing
>>>   bus architectures). For example, a tape drive may only half-terminate
>>>   a wide SCSI bus. Never use a tape drive to terminate a SCSI bus,
>>>   not even an older SCSI bus.
>>
>> As has been discussed many times already: use external terminators
>> attached to the cable. SCSI termination is not difficult, it is just made
>> that way by some :(
>
> Who said this?
>
>   There is nothing magic about SCSI cabling. There are sound technical
>   reasons why they occasionally require the sacrifice of a young goat.

Not me, but it sounds familiar. But really: using good stuff to start with
saves you a lot of heartburn down the road.

>> A more interesting question would be if the DLT drive has a more or less
>> recent firmware loaded.
>
> The discussion's worthwhile, but remember that I have had this when
> the DLT wasn't running as well. How do I find the firmware release?
> Is that the supplementary information (CC1E) in the dmesg output?

Yep. CC1E breaks down into:

  0xCC = drive revision = servo code = 204
  0x1E = SCSI r/w code = 30

Obviously the last byte is the more interesting one.

> sa1 at ahc0 bus 0 target 3 lun 0
> sa1: <Quantum DLT4000 CC1E> Removable Sequential Access SCSI-2 device
> sa1: 10.000MB/s transfers (10.000MHz, offset 15)

Mine is:

sa2: <DEC TZ88 (C) DEC D473> Removable Sequential Access SCSI-2 device

I'm not sure if the DEC f/w revs correlate 1:1 with Quantum revs though.
(NB TZ88 == DLT4000)

W/
--
Wilko Bulte     Arnhem, The Netherlands - The FreeBSD Project
                WWW : http://www.tcja.nl http://www.freebsd.org

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message
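As a small illustration of the breakdown Wilko gives above, the four-character revision string from dmesg can be split into its two bytes in C. This is only a sketch: `dlt_fw_rev` and `dlt_decode_fw_rev` are hypothetical names, not part of any FreeBSD interface, and the meaning of the two bytes is taken from the message above. As Wilko notes, DEC revisions such as D473 may not follow Quantum's scheme, so the decoded values should be read with that caveat.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical container for the two halves of a DLT revision string. */
struct dlt_fw_rev {
    unsigned servo;   /* first byte: drive revision / servo code */
    unsigned scsi_rw; /* second byte: SCSI read/write code */
};

/*
 * Split a revision string of exactly four hex digits (e.g. "CC1E")
 * into its two bytes. Returns 0 on success, -1 on malformed input.
 */
int dlt_decode_fw_rev(const char *rev, struct dlt_fw_rev *out)
{
    unsigned long value;
    int i;

    if (rev == NULL || out == NULL)
        return -1;

    /* Check each character; the loop stops at an early NUL terminator. */
    for (i = 0; i < 4; i++) {
        if (!isxdigit((unsigned char)rev[i]))
            return -1;
    }
    if (rev[4] != '\0')
        return -1;

    value = strtoul(rev, NULL, 16);
    out->servo = (unsigned)((value >> 8) & 0xff);
    out->scsi_rw = (unsigned)(value & 0xff);
    return 0;
}
```

Calling `dlt_decode_fw_rev("CC1E", &r)` fills `r.servo` with 204 and `r.scsi_rw` with 30, matching the figures in the message.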
Re: Recent current hangs frequently for 1 to 2 seconds.
On Mon, Dec 20, 1999 at 05:08:27PM -0800, Matthew Dillon wrote:
> :It's possible you might be on to something. I've been running iostat
> :at 1 second intervals, and during the last hang I saw:
> :
> :      tty               ad2               da1               sa1             cpu
> : tin tout   KB/t  tps  MB/s   KB/t  tps  MB/s   KB/t  tps  MB/s  us ni sy in id
> :  36  142   7.75   95  0.72   0.00 0.00     0  10.00   27  0.27  29  0  9  1 61
> :  21  142   8.00   69  0.54   0.00 0.00     0   0.00    0  0.00   6  0  1  0 93
> :  37  143   8.00   44  0.34   0.00 8.00     3   0.00    0  0.00   5  0  1  1 94
> :  41  142   1.76  106  0.18  16.00 5.25     4  10.00   14  0.13  24  0 18  0 57
> :  15  143   1.98   87  0.17   0.00 0.00     0  10.00   16  0.15  30  0 15  2 54
> :
> :Note that the stop in tape activity corresponds with a start in disk
> :activity. I'll keep an eye on that and see if it looks the same the
> :next time.
>
> Tape drives may:
>
> * Not support disconnection (the SCSI bus is locked through the
>   entire write sequence), or only partially support disconnection
>   but run the bus so slowly that other devices are left out in the
>   cold.

DLT drives do support disconnect/reconnect.

> * Implement a crappy SCSI command stack that breaks down when

So do many disks :)

>   higher-speed operations are running on the same bus (e.g. the
>   disks with their higher synchronous transfer rates).
>
> * Not properly terminate the SCSI bus (especially when mixing
>   bus architectures). For example, a tape drive may only half-terminate
>   a wide SCSI bus. Never use a tape drive to terminate a SCSI bus,
>   not even an older SCSI bus.

As has been discussed many times already: use external terminators
attached to the cable. SCSI termination is not difficult, it is just made
that way by some :(

> * Introduce too much noise onto the SCSI bus due to bad design.

Does not apply to DLT drives that I've seen.

A more interesting question would be if the DLT drive has a more or less
recent firmware loaded.

--
Wilko Bulte     Arnhem, The Netherlands - The FreeBSD Project
                WWW : http://www.tcja.nl http://www.freebsd.org
Re: Recent current hangs frequently for 1 to 2 seconds.
:
:On 1999-Dec-21 12:08:27 +1100, Matthew Dillon [EMAIL PROTECTED] wrote:
:>Tape drives may:
:>* Not support disconnection ...
:>* Implement a crappy SCSI command stack ...
:>* Not properly terminate the SCSI bus ...
:>* Introduce too much noise onto the SCSI bus due to bad design.
:
:Do modern drives still support these various `features'? I know
:there used to be problems, but I thought Greg was using a (presumably
:fairly recent) DLT drive. Given the normal data rates associated with
:DLT drives, I would have thought they would behave fairly well.
:
:BTW, since Greg mentioned he was thinking of using an AHA-1542B -
:I've found that mine won't work in sync mode. Whenever I tried to
:read or write, it would just time out, until I disabled SYNC. (And
:if I increased the sync speed beyond 5MHz, the kernel would usually
:panic).
:
:Peter

    They are supposed to support disconnection, and most do. But tape
    drives tend to have buggy firmware - for example, the 20G Exabyte
    drives can crash a SCSI bus if you use too small a block size. It
    just isn't worth mixing them with your disks.

                                        -Matt
                                        Matthew Dillon
                                        [EMAIL PROTECTED]
DLTs, disconnection and hangs--solved (was: Recent current hangs frequently for 1 to 2 seconds.)
On Tuesday, 21 December 1999 at 10:07:28 +0100, Wilko Bulte wrote:
> On Mon, Dec 20, 1999 at 05:08:27PM -0800, Matthew Dillon wrote:
>>> It's possible you might be on to something. I've been running iostat
>>> at 1 second intervals, and during the last hang I saw:
>>>
>>>      tty               ad2               da1               sa1             cpu
>>> tin tout   KB/t  tps  MB/s   KB/t  tps  MB/s   KB/t  tps  MB/s  us ni sy in id
>>>  36  142   7.75   95  0.72   0.00 0.00     0  10.00   27  0.27  29  0  9  1 61
>>>  21  142   8.00   69  0.54   0.00 0.00     0   0.00    0  0.00   6  0  1  0 93
>>>  37  143   8.00   44  0.34   0.00 8.00     3   0.00    0  0.00   5  0  1  1 94
>>>  41  142   1.76  106  0.18  16.00 5.25     4  10.00   14  0.13  24  0 18  0 57
>>>  15  143   1.98   87  0.17   0.00 0.00     0  10.00   16  0.15  30  0 15  2 54
>>>
>>> Note that the stop in tape activity corresponds with a start in disk
>>> activity. I'll keep an eye on that and see if it looks the same the
>>> next time.
>>
>> Tape drives may:
>>
>> * Not support disconnection (the SCSI bus is locked through the
>>   entire write sequence), or only partially support disconnection
>>   but run the bus so slowly that other devices are left out in the
>>   cold.
>
> DLT drives do support disconnect/reconnect.
>
>> * Implement a crappy SCSI command stack that breaks down when
>
> So do many disks :)
>
>>   higher-speed operations are running on the same bus (e.g. the
>>   disks with their higher synchronous transfer rates).
>>
>> * Not properly terminate the SCSI bus (especially when mixing
>>   bus architectures). For example, a tape drive may only half-terminate
>>   a wide SCSI bus. Never use a tape drive to terminate a SCSI bus,
>>   not even an older SCSI bus.
>
> As has been discussed many times already: use external terminators
> attached to the cable. SCSI termination is not difficult, it is just made
> that way by some :(

Who said this?

  There is nothing magic about SCSI cabling. There are sound technical
  reasons why they occasionally require the sacrifice of a young goat.

>> * Introduce too much noise onto the SCSI bus due to bad design.
>
> Does not apply to DLT drives that I've seen.
>
> A more interesting question would be if the DLT drive has a more or less
> recent firmware loaded.

The discussion's worthwhile, but remember that I have had this when
the DLT wasn't running as well. How do I find the firmware release?
Is that the supplementary information (CC1E) in the dmesg output?

sa1 at ahc0 bus 0 target 3 lun 0
sa1: <Quantum DLT4000 CC1E> Removable Sequential Access SCSI-2 device
sa1: 10.000MB/s transfers (10.000MHz, offset 15)

Anyway, I think I found the problem. I had been reconfiguring the
disks, and in the process I removed a disk with swap on it and added
another. The primary swap on the system disk is 50 MB, the swap I
took away was 256 MB, the swap I added was 512 MB. But: I forgot to
change /etc/fstab. So I was running with swap near full, and it
wasn't until I realised it and mounted the other swap space that
things started to improve. In the process, it used a further 10 MB of
swap without any obvious increase in the number of processes (I was
mainly running mail and editors).

In the meantime I had noticed that the hangs were, for the most part,
in biowr.

I'm fascinated that the system was able to run without adequate swap
without showing anything worse than poor performance.

Greg
--
Finger [EMAIL PROTECTED] for PGP public key
See complete headers for address and phone numbers
Re: Recent current hangs frequently for 1 to 2 seconds.
On Saturday, 18 December 1999 at 20:16:53 -0800, Matthew Dillon wrote:
>> I've just upgraded to -CURRENT as of yesterday, and I'm noticing a
>> number of occasions where all activity ceases for a second or two at a
>> time; it seems to be related to IDE disk activity with the new ATA
>> driver, but I don't have much evidence. I'm running a SiS 5591
>> chipset. Has anybody else seen something like this?
>
> It's possible that the blockages you are seeing are due to the ATA
> driver, but it's also possible that they are due to a bug in the
> buffer cache flushing code which the following patch fixes. So try
> the patch and see if that fixes your problem. If it doesn't then
> we can at least rule it out as being the cause of the problem you
> are seeing.

Thanks. I've put in the patch, but I'm still seeing the problems. It
seems to be related to SCSI activity (I'm currently performing a
backup on a DLT drive, and apart from that very little disk I/O). Any
other ideas? It seems to me as if the whole system freezes
(keystrokes don't echo, for example), so possibly something is going
into splhigh for too long.

Greg
--
Finger [EMAIL PROTECTED] for PGP public key
See complete headers for address and phone numbers
Re: Recent current hangs frequently for 1 to 2 seconds.
:Thanks. I've put in the patch, but I'm still seeing the problems. It
:seems to be related to SCSI activity (I'm currently performing a
:backup on a DLT drive, and apart from that very little disk I/O). Any
:other ideas? It seems to me as if the whole system freezes
:(keystrokes don't echo, for example), so possibly something is going
:into splhigh for too long.
:
:Greg

    No, this is very odd. Certainly reading from disk should not cause
    any blockages. But DLT SCSI -- there are lots of possibilities there.

    Is the DLT device sharing the same SCSI bus as the disks? I've
    historically had bad luck with a shared arrangement and now always
    put SCSI tape units on their own SCSI bus.

    If the SCSI bus is hanging something should show up in the kernel
    logs or dmesg output.

    Another possibility is that the SCSI operation is causing a hangup
    or bringing out a bug in the networking somewhere. A lockup for a
    second or two could be an indication of packet loss. Haven't there
    been a couple of mbuf-related commits recently? It would be
    something to review, anyway.

                                        -Matt
                                        Matthew Dillon
                                        [EMAIL PROTECTED]
Re: Recent current hangs frequently for 1 to 2 seconds.
On Monday, 20 December 1999 at 16:19:06 -0800, Matthew Dillon wrote:
>> Thanks. I've put in the patch, but I'm still seeing the problems. It
>> seems to be related to SCSI activity (I'm currently performing a
>> backup on a DLT drive, and apart from that very little disk I/O). Any
>> other ideas? It seems to me as if the whole system freezes
>> (keystrokes don't echo, for example), so possibly something is going
>> into splhigh for too long.
>>
>> Greg
>
> No, this is very odd. Certainly reading from disk should not cause
> any blockages. But DLT SCSI -- there are lots of possibilities there.
>
> Is the DLT device sharing the same SCSI bus as the disks?

Yes.

> I've historically had bad luck with a shared arrangement and now
> always put SCSI tape units on their own SCSI bus.

For other reasons, I intend to do just this, though I'm not sure it's
a good idea: for one thing, I don't have any spare PCI slots, so it
will have to be a 1542B. But it will be worth checking.

It's possible you might be on to something. I've been running iostat
at 1 second intervals, and during the last hang I saw:

      tty               ad2               da1               sa1             cpu
 tin tout   KB/t  tps  MB/s   KB/t  tps  MB/s   KB/t  tps  MB/s  us ni sy in id
  36  142   7.75   95  0.72   0.00 0.00     0  10.00   27  0.27  29  0  9  1 61
  21  142   8.00   69  0.54   0.00 0.00     0   0.00    0  0.00   6  0  1  0 93
  37  143   8.00   44  0.34   0.00 8.00     3   0.00    0  0.00   5  0  1  1 94
  41  142   1.76  106  0.18  16.00 5.25     4  10.00   14  0.13  24  0 18  0 57
  15  143   1.98   87  0.17   0.00 0.00     0  10.00   16  0.15  30  0 15  2 54

Note that the stop in tape activity corresponds with a start in disk
activity. I'll keep an eye on that and see if it looks the same the
next time.

> If the SCSI bus is hanging something should show up in the kernel
> logs or dmesg output.

Right. But there's nothing there.

> Another possibility is that the SCSI operation is causing a hangup
> or bringing out a bug in the networking somewhere. A lockup for a
> second or two could be an indication of packet loss. Haven't there
> been a couple of mbuf-related commits recently? It would be
> something to review, anyway.

I don't see any evidence of network participation. All the activity
here is local.

Greg
--
Finger [EMAIL PROTECTED] for PGP public key
See complete headers for address and phone numbers
Re: Recent current hangs frequently for 1 to 2 seconds.
:It's possible you might be on to something. I've been running iostat
:at 1 second intervals, and during the last hang I saw:
:
:      tty               ad2               da1               sa1             cpu
: tin tout   KB/t  tps  MB/s   KB/t  tps  MB/s   KB/t  tps  MB/s  us ni sy in id
:  36  142   7.75   95  0.72   0.00 0.00     0  10.00   27  0.27  29  0  9  1 61
:  21  142   8.00   69  0.54   0.00 0.00     0   0.00    0  0.00   6  0  1  0 93
:  37  143   8.00   44  0.34   0.00 8.00     3   0.00    0  0.00   5  0  1  1 94
:  41  142   1.76  106  0.18  16.00 5.25     4  10.00   14  0.13  24  0 18  0 57
:  15  143   1.98   87  0.17   0.00 0.00     0  10.00   16  0.15  30  0 15  2 54
:
:Note that the stop in tape activity corresponds with a start in disk
:activity. I'll keep an eye on that and see if it looks the same the
:next time.

    Tape drives may:

    * Not support disconnection (the SCSI bus is locked through the
      entire write sequence), or only partially support disconnection
      but run the bus so slowly that other devices are left out in the
      cold.

    * Implement a crappy SCSI command stack that breaks down when
      higher-speed operations are running on the same bus (e.g. the
      disks with their higher synchronous transfer rates).

    * Not properly terminate the SCSI bus (especially when mixing
      bus architectures). For example, a tape drive may only
      half-terminate a wide SCSI bus. Never use a tape drive to
      terminate a SCSI bus, not even an older SCSI bus.

    * Introduce too much noise onto the SCSI bus due to bad design.

    At one time or another I've been hit with all of these problems.
    You may be able to work around some of them by going into the
    adaptec bios config and intentionally slowing down all the devices
    on the bus.

                                        -Matt
Re: Recent current hangs frequently for 1 to 2 seconds.
:    driver, but it's also possible that they are due to a bug in the
:    buffer cache flushing code which the following patch fixes. So try
:    the patch and see if that fixes your problem. If it doesn't then
:    we can at least rule it out as being the cause of the problem you
:    are seeing.
:[..]
:
:Just one comment.. You've replaced vfs_bio's call to speedup_syncer() with
:a bufdaemon speedup.. Granted I don't understand the details here, but
:I'm curious why? (or why not use both?) The reason I ask is that I wonder
:whether removing of the speedup of the vnode cleanup rate is a good idea or
:not.. or if the bufdaemon speedup does this as a side effect via the hooks
:softupdates has got in the bio system. This is not a criticism, just a
:request for enlightenment. :-)

    speedup_syncer() was designed for softupdates to allow softupdates
    to regulate the number of pending transactions. The problem with
    it, though, is that calling the function results in a 'slow
    reaction' by the system rather than an 'immediate reaction', and in
    this particular case we need an immediate reaction. A slow reaction
    gets us nowhere (gets us multi-second delays, in fact).

    bd_speedup() wakes up the buf_daemon and as part of the patch the
    buf_daemon has been redesigned to handle the 'immediate reaction'
    case without compromising its dynamic stability characteristics
    (which are what tend to make it efficient under normal operation).

                                        -Matt
                                        Matthew Dillon
                                        [EMAIL PROTECTED]

:
: @@ -1571,9 +1573,8 @@
:                 flags = VFS_BIO_NEED_ANY;
:         }
:
: -       /* XXX */
: +       bd_speedup();   /* hlp */
:
: -       (void) speedup_syncer();
:         needsbuffer |= flags;
:         while (needsbuffer & flags) {
:                 if (tsleep(&needsbuffer, (PRIBIO + 4) | slpflag,
:
:Cheers,
:-Peter
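The "immediate reaction" Matt describes can be illustrated with the wakeup branch of the buf_daemon loop from the patch posted in this thread. What follows is a user-space sketch, not the kernel code: `bd_state` and `bd_on_request` are hypothetical names, the kernel's globals are passed in as plain arguments, and the comparison operators are assumed to be `>=` and `<`. The timed-out branch is left out because the thread does not show it in full.

```c
/* Hypothetical stand-in for the daemon's bd_flushto / bd_flushinc globals. */
struct bd_state {
    int flushto;  /* dirty-buffer count the daemon flushes down to */
    int flushinc; /* step used when backing off again */
};

/*
 * Called when the daemon is woken by a request instead of timing out.
 * Each wakeup lowers the flush target, so repeated requests make the
 * daemon flush progressively harder.
 */
void bd_on_request(struct bd_state *st, int numdirtybuffers)
{
    --st->flushto;

    /* If the target is still close to the current dirty count,
     * jump it well below that count and reset the back-off step. */
    if (st->flushto >= numdirtybuffers - 5) {
        st->flushto = numdirtybuffers - 10;
        st->flushinc = 1;
    }

    /* Never aim below a small floor. */
    if (st->flushto < 2)
        st->flushto = 2;
}
```

With a target of 100 and 50 dirty buffers, a single request drops the target to 40, which is below the current dirty count, so flushing starts right away rather than after several intervals.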
Re: Recent current hangs frequently for 1 to 2 seconds.
:I've just upgraded to -CURRENT as of yesterday, and I'm noticing a
:number of occasions where all activity ceases for a second or two at a
:time; it seems to be related to IDE disk activity with the new ATA
:driver, but I don't have much evidence. I'm running a SiS 5591
:chipset. Has anybody else seen something like this?
:
:Greg

    It's possible that the blockages you are seeing are due to the ATA
    driver, but it's also possible that they are due to a bug in the
    buffer cache flushing code which the following patch fixes. So try
    the patch and see if that fixes your problem. If it doesn't then
    we can at least rule it out as being the cause of the problem you
    are seeing.

    (This patch is currently slated for commit on Sunday).

                                        -Matt
                                        Matthew Dillon
                                        [EMAIL PROTECTED]

Index: vfs_bio.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
retrieving revision 1.237
diff -u -r1.237 vfs_bio.c
--- vfs_bio.c   1999/12/01 02:09:29     1.237
+++ vfs_bio.c   1999/12/17 18:44:40
@@ -88,7 +88,7 @@
                bufmallocspace, maxbufmallocspace, hibufspace;
 static int maxbdrun;
 static int needsbuffer;
-static int numdirtybuffers, lodirtybuffers, hidirtybuffers;
+static int numdirtybuffers, hidirtybuffers;
 static int numfreebuffers, lofreebuffers, hifreebuffers;
 static int getnewbufcalls;
 static int getnewbufrestarts;
@@ -96,8 +96,6 @@
 SYSCTL_INT(_vfs, OID_AUTO, numdirtybuffers, CTLFLAG_RD,
        &numdirtybuffers, 0, "");
-SYSCTL_INT(_vfs, OID_AUTO, lodirtybuffers, CTLFLAG_RW,
-       &lodirtybuffers, 0, "");
 SYSCTL_INT(_vfs, OID_AUTO, hidirtybuffers, CTLFLAG_RW,
        &hidirtybuffers, 0, "");
 SYSCTL_INT(_vfs, OID_AUTO, numfreebuffers, CTLFLAG_RD,
@@ -275,6 +273,16 @@
        }
 }
+/*
+ * bd_speedup - speedup the buffer cache flushing code
+ */
+
+static __inline__
+void
+bd_speedup(void)
+{
+       bd_wakeup(1);
+}
 /*
  * Initialize buffer headers and related structures.
@@ -353,7 +361,6 @@
  * Reduce the chance of a deadlock occuring by limiting the number
  * of delayed-write dirty buffers we allow to stack up.
  */
-       lodirtybuffers = nbuf / 7 + 10;
        hidirtybuffers = nbuf / 4 + 20;
        numdirtybuffers = 0;
 /*
@@ -365,14 +372,9 @@
  * the buffer cache.
  */
        while (hidirtybuffers * BKVASIZE > 3 * hibufspace / 4) {
-               lodirtybuffers >>= 1;
                hidirtybuffers >>= 1;
                buf_maxio >>= 1;
        }
-       if (lodirtybuffers < 2) {
-               lodirtybuffers = 2;
-               hidirtybuffers = 4;
-       }
 /*
  * Temporary, BKVASIZE may be manipulated soon, make sure we don't
@@ -799,9 +801,9 @@
 void
 bwillwrite(void)
 {
-       int twenty = (hidirtybuffers - lodirtybuffers) / 5;
+       int slop = hidirtybuffers / 10;
-       if (numdirtybuffers > hidirtybuffers + twenty) {
+       if (numdirtybuffers > hidirtybuffers + slop) {
                int s;
                s = splbio();
@@ -1571,9 +1573,8 @@
                flags = VFS_BIO_NEED_ANY;
        }
-       /* XXX */
+       bd_speedup();   /* hlp */
-       (void) speedup_syncer();
        needsbuffer |= flags;
        while (needsbuffer & flags) {
                if (tsleep(&needsbuffer, (PRIBIO + 4) | slpflag,
@@ -1652,6 +1653,7 @@
 static struct proc *bufdaemonproc;
 static int bd_interval;
 static int bd_flushto;
+static int bd_flushinc;
 static struct kproc_desc buf_kp = {
        "bufdaemon",
@@ -1672,6 +1674,7 @@
        bd_interval = 5 * hz;   /* dynamically adjusted */
        bd_flushto = hidirtybuffers;    /* dynamically adjusted */
+       bd_flushinc = 1;
        while (TRUE) {
                bd_request = 0;
@@ -1694,44 +1697,38 @@
                        }
                }
-               /*
-                * If nobody is requesting anything we sleep
-                */
-               if (bd_request == 0)
-                       tsleep(&bd_request, PVM, "psleep", bd_interval);
+               if (bd_request ||
+                   tsleep(&bd_request, PVM, "psleep", bd_interval) == 0) {
+                       /*
+                        * Another request is pending or we were woken up
+                        * without timing out.  Flush more.
+                        */
+                       --bd_flushto;
+                       if (bd_flushto >= numdirtybuffers - 5) {
+                               bd_flushto = numdirtybuffers - 10;
+                               bd_flushinc = 1;
+                       }
+                       if (bd_flushto < 2)
+                               bd_flushto = 2;
+               } else {
+                       /*
+                        * We slept and timed out, we can slow down.
+                        */
+
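To put numbers on the dirty-buffer thresholds the patch touches, here is a small worked example in C. It is a sketch under stated assumptions: the function names are hypothetical, `nbuf = 1000` in the note below is an arbitrary illustrative value, the comparison in `bwillwrite()` is assumed to be a strict `>`, and the low-memory loop that halves `hidirtybuffers` is left out.

```c
/* High-water mark for dirty buffers, as in the patch: nbuf / 4 + 20. */
int hidirty_limit(int nbuf)
{
    return nbuf / 4 + 20;
}

/* Old throttle point: hi + (hi - lo) / 5, with lo = nbuf / 7 + 10. */
int old_throttle_point(int nbuf)
{
    int hi = hidirty_limit(nbuf);
    int lo = nbuf / 7 + 10;

    return hi + (hi - lo) / 5;
}

/* New throttle point: hi + hi / 10, with no low-water mark involved. */
int new_throttle_point(int nbuf)
{
    int hi = hidirty_limit(nbuf);

    return hi + hi / 10;
}

/* Nonzero if a writer would be made to wait under the new rule. */
int writer_must_wait(int numdirtybuffers, int nbuf)
{
    return numdirtybuffers > new_throttle_point(nbuf);
}
```

For `nbuf = 1000` the high-water mark is 270; the old rule throttles writers above 293 dirty buffers and the new rule above 297, so the change mainly removes the dependence on `lodirtybuffers` rather than moving the limit far.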
Re: Recent current hangs frequently for 1 to 2 seconds.
On Saturday, 18 December 1999 at 20:16:53 -0800, Matthew Dillon wrote:
> :I've just upgraded to -CURRENT as of yesterday, and I'm noticing a
> :number of occasions where all activity ceases for a second or two at a
> :time; it seems to be related to IDE disk activity with the new ATA
> :driver, but I don't have much evidence. I'm running a SiS 5591
> :chipset. Has anybody else seen something like this?
> :
> :Greg
>
> It's possible that the blockages you are seeing are due to the ATA
> driver, but it's also possible that they are due to a bug in the
> buffer cache flushing code which the following patch fixes. So try
> the patch and see if that fixes your problem. If it doesn't then
> we can at least rule it out as being the cause of the problem you
> are seeing.

Thanks. I'll do that Real Soon Now. BTW, it looks as if the disk
activity was on SCSI disks, so the ATA driver is probably a red
herring.

> (This patch is currently slated for commit on Sunday).

It *is* Sunday :-)

Greg
--
Finger [EMAIL PROTECTED] for PGP public key
See complete headers for address and phone numbers