Re: [asterisk-users] Problem with new AEX800 card dying because of interrupt problems
Christian Weeks c...@weeksfamily.ca writes:

> Hello,
>
> I purchased an AEX800 card to replace the ageing cheap channel bank/T1
> card solution a few months ago, assuming that it would be a more robust
> solution for my small-scale phone system. However, it appears to be
> anything but that. Originally implemented as a Xen domU virtual machine
> on a large server-class machine, using PCI passthrough to pass the
> AEX800 and a small older TDM400, then recently migrated to the dom0,
> the AEX800 has continued to experience interrupt errors:
>
>   wctdm24xxp :04:08.0: Missed interrupt. Increasing latency to 8 ms in order to compensate.
>   wctdm24xxp :04:08.0: ERROR: Unable to service card within 25 ms and unable to further increase latency.

Can you do a cat of /proc/interrupts?

/Benny

--
Bandwidth and Colocation Provided by http://www.api-digital.com
New to Asterisk? Join us for a live introductory webinar every Thurs:
http://www.asterisk.org/hello
asterisk-users mailing list
To UNSUBSCRIBE or update options visit:
http://lists.digium.com/mailman/listinfo/asterisk-users
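Benny's check can be taken a step further with a short script (a sketch: it assumes the card's IRQ is registered under a name matching `wctdm`, which depends on the driver version):

```shell
# Show which IRQ line the DAHDI card is on and whether it is shared
# with another device (a shared line is one common cause of missed
# interrupts). Prints a note instead of failing if no card is present.
grep -i wctdm /proc/interrupts || echo "no wctdm IRQ registered"

# Snapshot the counters twice, one second apart. An 8-port analog card
# interrupts roughly 1000 times per second, so its counter should
# advance by about 1000 between the two snapshots.
cat /proc/interrupts > /tmp/irq.1
sleep 1
cat /proc/interrupts > /tmp/irq.2
diff /tmp/irq.1 /tmp/irq.2 || true
```

A counter that stops advancing, or an interrupt line shared with a busy device, would show up immediately in the diff.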
Re: [asterisk-users] Problem with new AEX800 card dying because of interrupt problems
On 09/08/2010 10:38 AM, Christian Weeks wrote:

> So I am asking the list, do you have any advice except perhaps to go
> back to the broken channel bank? Is it really true that my modern
> server-class machine (quad-core Xeon) cannot handle the AEX800, whereas
> my seven-year-old AMD desktop (previous host to the T1) could handle
> what seems to have been about 3x the capacity? Isn't this a massive
> regression?

Does the AEX800 work fine in your old AMD desktop? If the wctdm24xxp driver is having problems servicing the interrupt in a timely fashion in your server, I would be surprised if other cards in the same system didn't also experience high interrupt latencies, which would probably manifest themselves as pops and noise on the channels.

Some server-class machines can have problems here since they aren't optimized for real-time performance but are instead (typically) optimized for overall throughput, and telephony has hard timing requirements. In other words, it doesn't matter if your server can handle a thousand channels; if it can't service any one channel within 25 ms consistently, you're going to have issues with audio.

I would recommend:

a) Checking the transfer rate to your hard drive ('hdparm -t /dev/[sda|hda]'). If it's below 4 MB/s, that's the likely culprit. Sometimes setting the kernel command line parameter hda=none can help, depending on the kernel version you're using. I've also seen slow transfer rates fixed by changing BIOS settings.

b) Using cyclictest (https://rt.wiki.kernel.org/index.php/Cyclictest) and then stressing your system to make sure maximum latencies remain low without DAHDI loaded. System Management Interrupts / Baseboard Management Controllers can cause problems here on some servers.
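A minimal way to combine suggestion (b) with artificial load might look like this (a sketch: `cyclictest` ships in the rt-tests package, the SCHED_FIFO priority needs root, and the `dd` run is just a stand-in for whatever disk/CPU load the box sees in production):

```shell
#!/bin/sh
# Kick off some disk and CPU load in the background.
dd if=/dev/zero of=/tmp/latency-load.img bs=1M count=256 conv=fsync 2>/dev/null &
LOAD_PID=$!

# Measure worst-case wakeup latency (microseconds) while the load runs:
# -t one thread per core, -p 80 SCHED_FIFO priority 80 (needs root),
# -n use clock_nanosleep, -l 10000 loops then exit.
if command -v cyclictest >/dev/null; then
    cyclictest -t -p 80 -n -l 10000
else
    echo "cyclictest not installed (see the rt-tests package)"
fi

wait $LOAD_PID
rm -f /tmp/latency-load.img
```

The interesting figure is the Max column under load; on a box suitable for telephony it should stay a comfortable margin below the driver's service deadline.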
If cyclictest shows you have some maximum latency above 128ms, I would recommend trying to fix that first. If for some reason you can't, you could trade some of your system memory for increased tolerance to system conditions by editing DRING_SIZE in drivers/dahdi/voicebus.h to 256 or 512, depending on what cyclictest reports your maximum latency to be. Keep in mind this isn't a fix, since you'll still have problems in your audio for any latency above 25 ms.

Good luck,

--
Shaun Ruffell
Digium, Inc. | Linux Kernel Developer
445 Jan Davis Drive NW - Huntsville, AL 35806 - USA
Check us out at: www.digium.com www.asterisk.org
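If you do end up trading memory for tolerance, the change Shaun describes is a one-line edit before rebuilding DAHDI. A hedged sketch (the default DRING_SIZE value, the source-tree location, and the exact spelling of the #define may differ between DAHDI releases, so check your own voicebus.h first):

```shell
#!/bin/sh
# Assumed location of the DAHDI source tree; adjust to where you
# unpacked it.
DAHDI_SRC=${DAHDI_SRC:-/usr/src/dahdi-linux}
HDR="$DAHDI_SRC/drivers/dahdi/voicebus.h"

if [ -f "$HDR" ]; then
    # Bump the voicebus descriptor ring to 256 entries, keeping a
    # .bak copy of the original header.
    sed -i.bak \
        's/#define[[:space:]]*DRING_SIZE[[:space:]]*[0-9]*/#define DRING_SIZE 256/' \
        "$HDR"
    grep 'DRING_SIZE' "$HDR"
    # Then rebuild and reinstall the drivers:  make && make install
else
    echo "voicebus.h not found under $DAHDI_SRC" >&2
fi
```

As the thread notes, a larger ring only buys tolerance to latency spikes; it does not remove the underlying latency.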
Re: [asterisk-users] Problem with new AEX800 card dying because of interrupt problems
On Wed, 2010-09-08 at 11:06 -0500, Shaun Ruffell wrote:

> On 09/08/2010 10:38 AM, Christian Weeks wrote:
>> So I am asking the list, do you have any advice except perhaps to go
>> back to the broken channel bank? Is it really true that my modern
>> server-class machine (quad-core Xeon) cannot handle the AEX800,
>> whereas my seven-year-old AMD desktop (previous host to the T1) could
>> handle what seems to have been about 3x the capacity? Isn't this a
>> massive regression?
>
> Does the AEX800 work fine in your old AMD desktop? If the wctdm24xxp
> driver is having problems servicing the interrupt in a timely fashion
> in your server I would be surprised if other cards in the same system
> wouldn't also experience high interrupt latencies which would probably
> manifest themselves as pops and noise on the channels.

OK. The AEX800 can't go in the old server; it's a PCI Express card and the AMD doesn't have a PCI Express slot (it's that old). With regard to your comment about latencies on the other channels, none is noticeable. The other card (the older PCI card) has absolutely no problems at all; it's getting clear audio. In fact, so is the new card. There's no sign of anything wrong with it at all, except that it suddenly stops working with these interrupt errors. Which is why I suspect the driver (especially given some of the fixes in the DAHDI 2.4 release) rather than the card or the computer.

> Some server-class machines can have problems since they aren't
> optimized for real-time performance but are instead optimized for
> overall throughput (typically) and there are timing requirements for
> telephony. In other words, it doesn't matter if your server can handle
> a thousand channels...if it can't service any one channel within 25ms
> consistently, you're going to have issues with audio.

This is not observed in any way. The other card, on the PCI bus, has no issues, despite being slower and older.

> I would recommend:
>
> a) checking the transfer rate to your hard drive
> ('hdparm -t /dev/[sda|hda]').
> If it's below 4MB/s that's the likely culprit. Sometimes setting the
> kernel command line parameter to hda=none can help depending on the
> kernel version you're using. I've also seen slow transfer rates fixed
> by changing BIOS settings.

/dev/sdb:
 Timing buffered disk reads: 190 MB in 3.03 seconds = 62.71 MB/sec

Hmm, I don't think that's the culprit, somehow. Before being repurposed as a phone server, this machine spent two years as a disk server for MythTV. I'd have noticed disk latency on it a long time ago.

> b) Use cyclictest (https://rt.wiki.kernel.org/index.php/Cyclictest)
> and then stress your system to make sure maximum latencies remain low
> without DAHDI loaded. System Management Interrupts / Baseboard
> Management Controllers can cause problems here on some servers.

OK. I'm not sure which tests I need to run here. Here's a run at idle:

:~# cyclictest -t -p 80 -n -l 1
policy: fifo: loadavg: 0.03 0.02 0.00 1/210 16899

T: 0 (16896) P:80 I:1000 C:    1 Min:  8 Act: 16 Avg: 22 Max:  568
T: 1 (16897) P:79 I:1500 C: 6673 Min:  8 Act: 12 Avg: 25 Max:  119
T: 2 (16898) P:78 I:2000 C: 5005 Min:  9 Act: 14 Avg: 24 Max:  150
T: 3 (16899) P:77 I:2500 C: 4004 Min:  8 Act: 13 Avg: 30 Max:  420

And here's one with some CPU load:

:~# cyclictest -t -p 80 -n -l 1
policy: fifo: loadavg: 0.82 0.35 0.12 3/217 17212

T: 0 (17209) P:80 I:1000 C:    1 Min:  8 Act: 14 Avg: 26 Max: 8047
T: 1 (17210) P:79 I:1500 C: 6667 Min:  8 Act: 12 Avg: 15 Max:  820
T: 2 (17211) P:78 I:2000 C: 5001 Min:  7 Act: 17 Avg: 34 Max: 8184
T: 3 (17212) P:77 I:2500 C: 4001 Min:  9 Act: 40 Avg: 27 Max: 8786

Max is higher (obviously), but there's not really any evidence of a significant difference in latency between the two runs, and it looks well below your threshold (I think those numbers are in microseconds, so they're roughly three orders of magnitude below it).
> If cyclictest shows you have some maximum latency above 128ms, I would
> recommend trying to fix that first, but if for some reason you can't,
> you could trade some of your system memory for increased tolerance to
> system conditions by editing the DRING_SIZE in
> drivers/dahdi/voicebus.h to 256 or 512 depending on what cyclictest
> reports your maximum latency to be. Keep in mind this isn't a fix
> since you'll still have problems in your audio for any latency above
> 25ms.

I'm not sure where to go from here. Every diagnostic seems to be telling the same story: the computer is fine. Is it possible I have a hardware problem somehow? Maybe there's something wrong with the card?

Thanks,
Christian
Re: [asterisk-users] Problem with new AEX800 card dying because of interrupt problems
First off, Digium technical support should be able to help you troubleshoot.

On 09/08/2010 03:27 PM, Christian Weeks wrote:
> On Wed, 2010-09-08 at 11:06 -0500, Shaun Ruffell wrote:
>> On 09/08/2010 10:38 AM, Christian Weeks wrote:
>>> So I am asking the list, do you have any advice except perhaps to go
>>> back to the broken channel bank? Is it really true that my modern
>>> server-class machine (quad-core Xeon) cannot handle the AEX800,
>>> whereas my seven-year-old AMD desktop (previous host to the T1)
>>> could handle what seems to have been about 3x the capacity? Isn't
>>> this a massive regression?
>>
>> Does the AEX800 work fine in your old AMD desktop? If the wctdm24xxp
>> driver is having problems servicing the interrupt in a timely fashion
>> in your server I would be surprised if other cards in the same system
>> wouldn't also experience high interrupt latencies which would
>> probably manifest themselves as pops and noise on the channels.
>
> OK. The AEX800 can't go in the old server; it's a PCI Express card and
> the AMD doesn't have a PCI Express slot (it's that old). With regard
> to your comment about latencies on the other channels, none is
> noticeable. The other card (the older PCI card) has absolutely no
> problems at all; it's getting clear audio. In fact, so is the new
> card. There's no sign of anything wrong with it at all, except that it
> suddenly stops working with these interrupt errors. Which is why I
> suspect the driver (especially given some of the fixes in the DAHDI
> 2.4 release) rather than the card or the computer.

Is there anything else in dmesg or /var/log/messages when the card suddenly stops working with the interrupt errors? Do you see messages about the latency increasing at some regular interval (e.g., every hour)? I've seen systems that have flash and SATA drives where the flash drives are connected as /dev/hda, and periodically flushing them can cause huge latencies even though you don't see this happening at runtime.
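One way to test for the regular interval Shaun asks about is to pull the timestamps of the latency messages out of the logs (a sketch; the log path and message text match the errors quoted earlier in the thread, but both can differ per distribution):

```shell
# List every latency-related kernel message with its syslog timestamp,
# so an hourly/daily pattern (cron jobs, periodic flash flushes, ...)
# stands out at a glance.
grep -h -e 'Missed interrupt' -e 'Increasing latency' \
    /var/log/messages* 2>/dev/null | sort

# dmesg shows the same events ordered by time since boot.
dmesg 2>/dev/null | grep -i wctdm24xxp || echo "no wctdm24xxp messages in dmesg"
```

If the timestamps cluster on a fixed period, correlating them against /etc/crontab and the cron spool usually identifies the culprit.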
>> a) checking the transfer rate to your hard drive
>> ('hdparm -t /dev/[sda|hda]'). If it's below 4MB/s that's the likely
>> culprit. Sometimes setting the kernel command line parameter to
>> hda=none can help depending on the kernel version you're using. I've
>> also seen slow transfer rates fixed by changing BIOS settings.
>
> /dev/sdb:
>  Timing buffered disk reads: 190 MB in 3.03 seconds = 62.71 MB/sec
>
> Hmm, I don't think that's the culprit, somehow. Before being
> repurposed as a phone server, this machine spent two years as a disk
> server for MythTV. I'd have noticed disk latency on it a long time
> ago.

I've also seen cases where the latency increases like you describe because of poorly implemented X video drivers. Are you running without X installed on this server? Do you have a serial console connected? Slow baud rates on a serial console can correlate with an inability to service the interrupts on some systems.

>> b) Use cyclictest (https://rt.wiki.kernel.org/index.php/Cyclictest)
>> and then stress your system to make sure maximum latencies remain low
>> without DAHDI loaded. System Management Interrupts / Baseboard
>> Management Controllers can cause problems here on some servers.
>
> OK. I'm not sure which tests I need to run here.
> Here's a run at idle:
>
> :~# cyclictest -t -p 80 -n -l 1
> policy: fifo: loadavg: 0.03 0.02 0.00 1/210 16899
>
> T: 0 (16896) P:80 I:1000 C:    1 Min:  8 Act: 16 Avg: 22 Max:  568
> T: 1 (16897) P:79 I:1500 C: 6673 Min:  8 Act: 12 Avg: 25 Max:  119
> T: 2 (16898) P:78 I:2000 C: 5005 Min:  9 Act: 14 Avg: 24 Max:  150
> T: 3 (16899) P:77 I:2500 C: 4004 Min:  8 Act: 13 Avg: 30 Max:  420
>
> And here's one with some CPU load:
>
> :~# cyclictest -t -p 80 -n -l 1
> policy: fifo: loadavg: 0.82 0.35 0.12 3/217 17212
>
> T: 0 (17209) P:80 I:1000 C:    1 Min:  8 Act: 14 Avg: 26 Max: 8047
> T: 1 (17210) P:79 I:1500 C: 6667 Min:  8 Act: 12 Avg: 15 Max:  820
> T: 2 (17211) P:78 I:2000 C: 5001 Min:  7 Act: 17 Avg: 34 Max: 8184
> T: 3 (17212) P:77 I:2500 C: 4001 Min:  9 Act: 40 Avg: 27 Max: 8786
>
> Max is higher (obviously), but there's not really any evidence of a
> significant difference in latency between the two runs, and it looks
> well below your threshold (I think those numbers are in microseconds,
> so they're roughly three orders of magnitude below it).

The cyclictest output looks good. What about when running disk transfer tests on all the SATA / IDE drives installed? You could also start up cyclictest while your system is attempting to operate normally and see if DAHDI and cyclictest agree on what the latency is. Do you get the same results for cyclictest when you run from the console (serial, X, whatever) as you do when running via ssh?

>> If cyclictest shows you have some maximum latency above 128ms, I
>> would recommend trying to fix that first, but if for some reason you
>> can't, you could trade some of your system memory for increased
>> tolerance to system conditions by
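Shaun's "disk transfer tests on all the SATA / IDE drives" could be scripted roughly as follows (a sketch: the /dev/sd? glob, the 256 MB read size, and the cyclictest loop count are placeholders, and both the raw-device reads and the SCHED_FIFO priority need root):

```shell
#!/bin/sh
# Sequentially read a chunk of every installed SATA/SCSI drive in the
# background to stress the storage path (reads only, non-destructive).
for dev in /dev/sd?; do
    [ -b "$dev" ] || continue
    dd if="$dev" of=/dev/null bs=1M count=256 2>/dev/null &
done

# Watch worst-case latency at the same time. A Max column that jumps
# by orders of magnitude under disk load would implicate the storage
# path (controller, driver, SMIs) rather than the telephony card.
if command -v cyclictest >/dev/null; then
    cyclictest -t -p 80 -n -l 5000
else
    echo "cyclictest not installed (see the rt-tests package)"
fi

wait
```

Running the same measurement once from the local console and once over ssh, as Shaun suggests, only requires repeating the script from each session.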