Re: [asterisk-users] Problem with new AEX800 card dying because of interrupt problems

2010-09-13 Thread Benny Amorsen
Christian Weeks c...@weeksfamily.ca writes:

 Hello
 I purchased an AEX800 card to replace the ageing cheap channel bank/T1
 card solution a few months ago, assuming that it would be a more robust
 solution for my small scale phone system. However, it appears to be
 anything but that.

 Originally implemented as a XEN dom-u virtual machine on a large server
 class machine, using PCI passthrough to pass the AEX800 and a small
 older TDM400, then recently migrated to the dom-0, the aex800 has
 continued to experience interrupt errors:

 wctdm24xxp :04:08.0: Missed interrupt. Increasing latency to 8 ms in
 order to compensate.
 wctdm24xxp :04:08.0: ERROR: Unable to service card within 25 ms and
 unable to further increase latency.

Can you do a cat /proc/interrupts?
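
For anyone following along, one thing that output can reveal is whether the card shares an IRQ line with another device. Here is a minimal Python sketch that scans text in the /proc/interrupts layout for shared lines. The sample table is made up for illustration (it is not from Christian's machine), and the parser assumes a single controller-type column after the per-CPU counts, which varies between kernels.

```python
# Hypothetical sample in the /proc/interrupts layout; counts and devices are invented.
SAMPLE = """\
           CPU0       CPU1
  0:        125          0   IO-APIC-edge      timer
 16:     901234       1200   IO-APIC-fasteoi   uhci_hcd:usb3, wctdm24xxp
 17:      45210          0   IO-APIC-fasteoi   wctdm
NMI:          0          0   Non-maskable interrupts
"""


def shared_irqs(text):
    """Return {irq_number: [device, ...]} for IRQ lines used by more than one device."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row lists CPU0, CPU1, ...
    shared = {}
    for line in lines[1:]:
        irq, _, rest = line.partition(":")
        fields = rest.split()
        # Skip summary rows (NMI, LOC, ...) and rows with no device column.
        if not irq.strip().isdigit() or len(fields) <= ncpus + 1:
            continue
        # Assumed layout: per-CPU counts, one controller-type token, then devices.
        devices = " ".join(fields[ncpus + 1:]).split(", ")
        if len(devices) > 1:
            shared[int(irq)] = devices
    return shared


print(shared_irqs(SAMPLE))
```

On a real system the text would come from reading /proc/interrupts instead of SAMPLE.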


/Benny

-- 
_
-- Bandwidth and Colocation Provided by http://www.api-digital.com --
New to Asterisk? Join us for a live introductory webinar every Thurs:
   http://www.asterisk.org/hello

asterisk-users mailing list
To UNSUBSCRIBE or update options visit:
   http://lists.digium.com/mailman/listinfo/asterisk-users


Re: [asterisk-users] Problem with new AEX800 card dying because of interrupt problems

2010-09-08 Thread Shaun Ruffell
On 09/08/2010 10:38 AM, Christian Weeks wrote:
 So I am asking the list, do you have any advice except perhaps to go
 back to the broken channel bank? Is it really true that my modern server
 class machine (quad core xeon) cannot handle the AEX800, whereas my
 seven year old AMD desktop (previous host to the T1) could handle what
 seems to have been about 3x the capacity? Isn't this a massive
 regression?

Does the AEX800 work fine in your old AMD desktop?  If the wctdm24xxp
driver is having problems servicing the interrupt in a timely fashion in
your server I would be surprised if other cards in the same system
wouldn't also experience high interrupt latencies which would probably
manifest itself as pops and noise on the channels.

Some server class machines can have problems since they aren't optimized
for real-time performance but are instead optimized for overall
throughput (typically) and there are timing requirements for telephony.
In other words, it doesn't matter if your server can handle a thousand
channels...if it can't service any one channel within 25ms consistently,
you're going to have issues with audio.

I would recommend:

a) checking the transfer rate to your hard drive ('hdparm -t
/dev/[sda|hda]').  If it's below 4MB/s that's the likely culprit.
Sometimes setting the kernel command line parameter to hda=none can
help depending on the kernel version you're using.  I've also seen slow
transfer rates fixed by changing BIOS settings.

b) Use cyclictest (https://rt.wiki.kernel.org/index.php/Cyclictest) and
then stress your system to make sure maximum latencies remain low
without DAHDI loaded.  System Management Interrupts / Baseboard
Management Controllers can cause problems here on some servers.

If cyclictest shows you have some maximum latency above 128ms, I
would recommend trying to fix that first, but if for some reason you
can't, you could trade some of your system memory for increased
tolerance to system conditions by editing the DRING_SIZE in
drivers/dahdi/voicebus.h to 256 or 512, depending on the maximum
latency that cyclictest reported.  Keep in mind this isn't a fix,
since you'll still have problems in your audio for any latency above 25ms.
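
The choice between 256 and 512 can be sketched as a small Python helper. This is only an illustration of the rule of thumb above: the default of 128 and the idea that one ring entry covers roughly one millisecond are assumptions inferred from the 128ms figure, not something checked against voicebus.h, and the function names are hypothetical.

```python
AUDIO_LIMIT_MS = 25  # latency above this still causes audio problems, per the message above


def suggest_dring_size(max_latency_ms, default=128, candidates=(256, 512)):
    """Pick the smallest ring size that covers the measured maximum latency.

    Assumes (not verified) that one ring entry corresponds to about 1 ms.
    Returns None when the latency exceeds every size mentioned in the thread.
    """
    if max_latency_ms <= default:
        return default
    for size in candidates:
        if max_latency_ms <= size:
            return size
    return None


def audio_at_risk(max_latency_ms):
    """True when the latency exceeds the 25 ms servicing window."""
    return max_latency_ms > AUDIO_LIMIT_MS


for latency in (9, 200, 300):
    print(latency, suggest_dring_size(latency), audio_at_risk(latency))
```

A larger ring only buys tolerance to stalls; any latency over 25 ms remains an audio problem either way.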

Good luck,

-- 
Shaun Ruffell
Digium, Inc. | Linux Kernel Developer
445 Jan Davis Drive NW - Huntsville, AL 35806 - USA
Check us out at: www.digium.com  www.asterisk.org



Re: [asterisk-users] Problem with new AEX800 card dying because of interrupt problems

2010-09-08 Thread Christian Weeks
On Wed, 2010-09-08 at 11:06 -0500, Shaun Ruffell wrote:
 On 09/08/2010 10:38 AM, Christian Weeks wrote:
  So I am asking the list, do you have any advice except perhaps to go
  back to the broken channel bank? Is it really true that my modern server
  class machine (quad core xeon) cannot handle the AEX800, whereas my
  seven year old AMD desktop (previous host to the T1) could handle what
  seems to have been about 3x the capacity? Isn't this a massive
  regression?
 
 Does the AEX800 work fine in your old AMD desktop?  If the wctdm24xxp
 driver is having problems servicing the interrupt in a timely fashion in
 your server I would be surprised if other cards in the same system
 wouldn't also experience high interrupt latencies which would probably
 manifest itself as pops and noise on the channels.
OK. The AEX800 can't go in the old server: it's a PCI express card and
the AMD doesn't have a PCI express slot (it's that old). Regarding your
comment about latencies on the other channels, there are none that are
noticeable. The other card (the older PCI card) has absolutely no
problems at all; it's getting clear audio. In fact, so is the new card.
There's no sign of anything wrong with it at all, except that it suddenly
stops working with these interrupt errors. That is why I suspect the
driver (esp. given some of the fixes in the dahdi 2.4 release) rather
than the card or the computer.

 
 Some server class machines can have problems since they aren't optimized
 for real-time performance but are instead optimized for overall
 throughput (typically) and there are timing requirements for telephony.
  In other words, it doesn't matter if your server can handle a thousand
 channels...if it can't service any one channel within 25ms consistently,
 you're going to have issues with audio.
This is not observed in any way. The other card, on the PCI bus, has no
issues, despite being slower and older.
 
 I would recommend:
 
 a) checking the transfer rate to your hard drive ('hdparm -t
 /dev/[sda|hda]').  If it's below 4MB/s that's the likely culprit.
 Sometimes setting the kernel command line parameter to hda=none can
 help depending on the kernel version you're using.  I've also seen slow
 transfer rates fixed by changing BIOS settings.

/dev/sdb:
 Timing buffered disk reads:  190 MB in  3.03 seconds =  62.71 MB/sec

Hmm, I don't think that's the culprit, somehow. Before being repurposed
as a phone server, this machine spent two years as a disk server for
mythtv. I'd have noticed disk latency on it a long time ago.
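
The same check can be written as a short Python sketch that pulls the throughput figure out of an hdparm -t result line and compares it with the 4MB/s rule of thumb quoted above. The sample line is the one posted in this message; the function name and regular expression are illustrative assumptions about the output format.

```python
import re

MIN_MB_PER_SEC = 4.0  # threshold suggested earlier in the thread

# Result line as posted above.
SAMPLE = " Timing buffered disk reads:  190 MB in  3.03 seconds =  62.71 MB/sec"


def hdparm_mb_per_sec(output):
    """Return the MB/sec figure from an 'hdparm -t' result, or None if absent."""
    match = re.search(r"=\s*([\d.]+)\s*MB/sec", output)
    return float(match.group(1)) if match else None


rate = hdparm_mb_per_sec(SAMPLE)
print(rate, "suspect" if rate < MIN_MB_PER_SEC else "fine")
```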

 
 b) Use cyclictest (https://rt.wiki.kernel.org/index.php/Cyclictest) and
 then stress your system to make sure maximum latencies remain low
 without DAHDI loaded.  System Management Interrupts / Baseboard
 Management Controllers can cause problems here on some servers.

OK. I'm not sure which tests I need to run here.

Here's a run at idle:
:~# cyclictest -t -p 80 -n -l 1
policy: fifo: loadavg: 0.03 0.02 0.00 1/210 16899  

T: 0 (16896) P:80 I:1000 C:  1 Min:  8 Act:   16 Avg:   22 Max:     568
T: 1 (16897) P:79 I:1500 C:   6673 Min:  8 Act:   12 Avg:   25 Max:     119
T: 2 (16898) P:78 I:2000 C:   5005 Min:  9 Act:   14 Avg:   24 Max:     150
T: 3 (16899) P:77 I:2500 C:   4004 Min:  8 Act:   13 Avg:   30 Max:     420

And here's one with some cpu load:

:~# cyclictest -t -p 80 -n -l 1
policy: fifo: loadavg: 0.82 0.35 0.12 3/217 17212  

T: 0 (17209) P:80 I:1000 C:  1 Min:  8 Act:   14 Avg:   26 Max:    8047
T: 1 (17210) P:79 I:1500 C:   6667 Min:  8 Act:   12 Avg:   15 Max:     820
T: 2 (17211) P:78 I:2000 C:   5001 Min:  7 Act:   17 Avg:   34 Max:    8184
T: 3 (17212) P:77 I:2500 C:   4001 Min:  9 Act:   40 Avg:   27 Max:    8786

Max is higher (obviously) but there's not really any evidence of a
significant difference in latency between the two runs, and it looks well
below your threshold (I think that's usecs for those numbers, so it's
about 3 orders of magnitude lower).
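
To make the unit conversion explicit, here is a minimal Python sketch that extracts the Max column from the loaded run and compares the worst case with the 25 ms servicing window. It assumes the values are microseconds (cyclictest's default unit, and what is guessed above); the regular expression tolerates the Max value wrapping onto the next line, as it did in the mail. The function name is hypothetical.

```python
import re

AUDIO_LIMIT_MS = 25  # servicing window mentioned earlier in the thread

# Loaded run as posted, including the mail client's line wrapping.
LOADED_RUN = """\
T: 0 (17209) P:80 I:1000 C:  1 Min:  8 Act:   14 Avg:   26 Max:
8047
T: 1 (17210) P:79 I:1500 C:   6667 Min:  8 Act:   12 Avg:   15 Max:
820
T: 2 (17211) P:78 I:2000 C:   5001 Min:  7 Act:   17 Avg:   34 Max:
8184
T: 3 (17212) P:77 I:2500 C:   4001 Min:  9 Act:   40 Avg:   27 Max:
8786
"""


def max_latencies_us(output):
    """Return every thread's Max value; \\s* also matches a wrapped newline."""
    return [int(value) for value in re.findall(r"Max:\s*(\d+)", output)]


worst_us = max(max_latencies_us(LOADED_RUN))
worst_ms = worst_us / 1000  # microseconds to milliseconds
print(worst_us, "us =", worst_ms, "ms; within limit:", worst_ms < AUDIO_LIMIT_MS)
```

Under that assumption the worst case in the loaded run is a little under 9 ms, which is inside the 25 ms window.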

 If cyclictest shows you have some maximum latency above 128ms, I
 would recommend trying to fix that first, but if for some reason you
 can't, you could trade some of your system memory for increased
 tolerance to system conditions by editing the DRING_SIZE in
 drivers/dahdi/voicebus.h to 256 or 512, depending on the maximum
 latency that cyclictest reported.  Keep in mind this isn't a fix,
 since you'll still have problems in your audio for any latency above 25ms.

I'm not sure where to go from here. Every diagnostic seems to be telling
the same story- the computer is fine. Is it possible I have a hardware
problem somehow? Maybe there's something wrong with the card?

Thanks
Christian





Re: [asterisk-users] Problem with new AEX800 card dying because of interrupt problems

2010-09-08 Thread Shaun Ruffell
First off, Digium technical support should be able to help you troubleshoot.

On 09/08/2010 03:27 PM, Christian Weeks wrote:
 On Wed, 2010-09-08 at 11:06 -0500, Shaun Ruffell wrote:
 On 09/08/2010 10:38 AM, Christian Weeks wrote:
 So I am asking the list, do you have any advice except perhaps to go
 back to the broken channel bank? Is it really true that my modern server
 class machine (quad core xeon) cannot handle the AEX800, whereas my
 seven year old AMD desktop (previous host to the T1) could handle what
 seems to have been about 3x the capacity? Isn't this a massive
 regression?

 Does the AEX800 work fine in your old AMD desktop?  If the wctdm24xxp
 driver is having problems servicing the interrupt in a timely fashion in
 your server I would be surprised if other cards in the same system
 wouldn't also experience high interrupt latencies which would probably
 manifest itself as pops and noise on the channels.
 OK. The AEX800 can't go in the old server- it's a PCI express card and
 the AMD doesn't have a PCI express slot (it's that old). wrt to your
 comment about the latencies on the other channels, there is none that is
 noticeable. The other card (the older PCI card) has absolutely no
 problems at all- it's getting clear audio. In fact, so is the new card-
 there's not a sign of anything wrong with it at all, except it suddenly
 stops working with these interrupt errors. Which is why I suspect the
 driver (esp. given some of the fixes in the dahdi 2.4 release) rather
 than the card or the computer.

Is there anything else in dmesg or /var/log/messages when the card
suddenly stops working with the interrupt errors?  Do you see messages
about the latency increasing at some regular interval (e.g. every hour)?
I've seen systems with both flash and SATA drives where the flash
drives are connected as /dev/hda, and periodically flushing them can
cause huge latencies even though you don't see this happening at runtime.
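
One way to look for such a regular interval is to pull the timestamps of the "Increasing latency" messages out of the kernel log and inspect the gaps between them. The Python sketch below assumes dmesg-style "[seconds]" timestamps are enabled; the sample lines, their timestamps and the latency values are invented for illustration, and the function names are hypothetical.

```python
import re

# Hypothetical dmesg excerpt; timestamps and values are made up.
SAMPLE_LOG = """\
[ 3600.500000] wctdm24xxp: Missed interrupt. Increasing latency to 4 ms in order to compensate.
[ 3700.000000] eth0: link up
[ 7200.500000] wctdm24xxp: Missed interrupt. Increasing latency to 6 ms in order to compensate.
[10800.500000] wctdm24xxp: Missed interrupt. Increasing latency to 8 ms in order to compensate.
"""

EVENT = re.compile(r"\[\s*(\d+(?:\.\d+)?)\].*Increasing latency to (\d+) ms")


def latency_events(log_text):
    """Return (timestamp_seconds, new_latency_ms) for each latency bump."""
    return [(float(t), int(ms)) for t, ms in EVENT.findall(log_text)]


def gaps(events):
    """Seconds between consecutive latency bumps."""
    times = [t for t, _ in events]
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]


print(gaps(latency_events(SAMPLE_LOG)))
```

Roughly equal gaps would point at something periodic (a cron job, a drive flush) rather than at random load.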

 a) checking the transfer rate to your hard drive ('hdparm -t
 /dev/[sda|hda]').  If it's below 4MB/s that's the likely culprit.
 Sometimes setting the kernel command line parameter to hda=none can
 help depending on the kernel version you're using.  I've also seen slow
 transfer rates fixed by changing BIOS settings.
 
 /dev/sdb:
  Timing buffered disk reads:  190 MB in  3.03 seconds =  62.71 MB/sec
 
 Hmm, don't think that's the culprit, somehow. The server has spent two
 years before being repurposed as a phone server as a disk server for
 mythtv. I'd have noticed disk latency on it a long time ago.

I've also seen cases where the latency increases like you describe
because of poorly implemented X video drivers.  Are you running without
X installed on this server?  Do you have a serial console connected?
Slow baud rates on a serial console can be correlated to inability to
service the interrupts on some systems.

 

 b) Use cyclictest (https://rt.wiki.kernel.org/index.php/Cyclictest) and
 then stress your system to make sure maximum latencies remain low
 without DAHDI loaded.  System Management Interrupts / Baseboard
 Management Controllers can cause problems here on some servers.
 
 OK. I'm not sure which tests I need to run here.
 
 Here's a run at idle:
 :~# cyclictest -t -p 80 -n -l 1
 policy: fifo: loadavg: 0.03 0.02 0.00 1/210 16899  
 
 T: 0 (16896) P:80 I:1000 C:  1 Min:  8 Act:   16 Avg:   22 Max:     568
 T: 1 (16897) P:79 I:1500 C:   6673 Min:  8 Act:   12 Avg:   25 Max:     119
 T: 2 (16898) P:78 I:2000 C:   5005 Min:  9 Act:   14 Avg:   24 Max:     150
 T: 3 (16899) P:77 I:2500 C:   4004 Min:  8 Act:   13 Avg:   30 Max:     420
 
 And here's one with some cpu load:
 
 :~# cyclictest -t -p 80 -n -l 1
 policy: fifo: loadavg: 0.82 0.35 0.12 3/217 17212  
 
 T: 0 (17209) P:80 I:1000 C:  1 Min:  8 Act:   14 Avg:   26 Max:    8047
 T: 1 (17210) P:79 I:1500 C:   6667 Min:  8 Act:   12 Avg:   15 Max:     820
 T: 2 (17211) P:78 I:2000 C:   5001 Min:  7 Act:   17 Avg:   34 Max:    8184
 T: 3 (17212) P:77 I:2500 C:   4001 Min:  9 Act:   40 Avg:   27 Max:    8786
 
 Max is higher (obviously) but there's not really any evidence of a
 significant difference in latency between the two runs, and it looks well
 below your threshold (I think that's usecs for those numbers, so it's
 about 3 orders of magnitude lower).

The cyclictest output looks good. What about when running disk transfer
tests on all the SATA / IDE drives installed?  You could also start up
cyclictest while your system is attempting to operate normally and see
if DAHDI and cyclictest agree on what the latency is.

Do you get the same results for cyclictest when you run from the console
(serial, X, whatever) as you do when running via ssh?

 
 If cyclictest shows you have some maximum latency above 128ms, I
 would recommend trying to fix that first, but if for some reason you
 can't, you could trade some of your system memory for increased
 tolerance to system conditions by