That is interesting, but I'm pretty sure that we must be experiencing different crashes with different underlying causes. It is pretty clear that ours are related to external HSS/RADIUS.

In fact, there appear to be 2 separate bugs:

1. When a RADIUS request times out (or possibly when a number of RADIUS requests are made at roughly the same time and they all time out...the rate of requests seems to be a factor), the process responsible for external HSS (aaa_iwk) is at risk of segfaulting. This shouldn't happen, but is also compounded by the fact that the EPC does not try to restart the process gracefully, but instead initiates a 5 minute countdown, after which it reboots itself.

2. When a RADIUS request times out, and the EPC marks that RADIUS server as being unavailable/down, the EPC will also decide (apropos of nothing) that the eNB that the request was generated from is also down (even though it's not) and will reset its SCTP/S1 link to it, kicking all UEs off in the process.

The running theory we have developed at this point is that often what happens is that #2 occurs first, which snowballs into a huge number of UE detach/reattaches and combines with #1 + a slow-to-respond RADIUS server to ultimately cause the crash and reboot (e.g., the EPC transmits an interim accounting packet to the RADIUS server for a given UE, gets no response, decides to flag the eNB as down, which kicks all UEs attached to that eNB off-line, and then when the eNB re-establishes S1 with the EPC, there is a sudden flood of UE attachment requests that all come in at the same time, which also are not responded to by the RADIUS server fast enough, which ultimately leads to the aaa_iwk process dying).

The next few days will hopefully confirm this, but we think we have worked around the issue that was causing the RADIUS server to sometimes be slow to respond, so hopefully things will stabilize for us now. But these two bugs still need to get fixed, especially since one is a crasher (though they are both service-disruptive).
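
One way to check the running theory against a saved tlsyslog is to count how many AAA timeout errors were logged shortly before each aaa_iwk segfault. What follows is only a minimal Python sketch, not anything supplied by Telrad. The field layout in the regex is inferred from the log excerpt quoted further down this thread, the leading number on each line is assumed to be an optional line index, and the names parse_line and timeouts_before_segfaults are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed layout, inferred from the excerpt in this thread:
#   [index:]YYYY-MM-DD,HH:MM:SS.ffffff:LEVEL:n:version:SUBSYSTEM:pid:message
LINE_RE = re.compile(
    r"^(?:(?P<index>\d+):)?"
    r"(?P<ts>\d{4}-\d{2}-\d{2},\d{2}:\d{2}:\d{2}\.\d{6}):"
    r"(?P<level>[A-Z]+):\d+:(?P<version>[\d.]+):"
    r"(?P<subsys>[A-Z0-9_]+):(?P<pid>\d+):(?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, level, subsystem, message), or None if the line
    does not match the assumed layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d,%H:%M:%S.%f")
    return ts, m.group("level"), m.group("subsys"), m.group("msg")


def timeouts_before_segfaults(lines, window_seconds=10):
    """For each logged segfault, count the AAA timeout errors seen in the
    preceding window. Returns a list of (segfault_time, timeout_count)."""
    window = timedelta(seconds=window_seconds)
    timeouts = []  # timestamps of AAA ERROR lines that mention a timeout
    results = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # skip console noise and lines in other formats
        ts, level, subsys, msg = rec
        if subsys == "AAA" and level == "ERROR" and "timeout" in msg.lower():
            timeouts.append(ts)
        elif subsys == "SIGNAL" and "Segmentation fault" in msg:
            recent = [t for t in timeouts if timedelta(0) <= ts - t <= window]
            results.append((ts, len(recent)))
    return results
```

To try it, pass any iterable of log lines, for example an open file object for wherever a copy of the tlsyslog was saved. A consistently high count right before each segfault would fit the theory that a burst of unanswered RADIUS requests precedes the aaa_iwk crash, though it would not prove the cause.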

-- Nathan

From: [email protected] [mailto:[email protected]] On Behalf Of Gabriel Pike
Sent: Thursday, January 05, 2017 10:27 AM
To: [email protected]
Subject: Re: [Telrad] BreezeWAY EPC spontaneous reboots?

I have experienced 2 of these crash/reboots and I am using built in iHSS. Once the EPC rebooted and the other time it completely locked up and needed power cycled. I have tickets open about this and thus far have not found a cause. I may have to do the same thing you did and plug in a laptop via serial cable and hope to capture what is happening.

Regards,
Gabriel Pike
Network Engineering
MTCNA
DMCI Broadband, LLC<http://dmcibb.net/>
[email protected]<mailto:[email protected]>
877.936.2422 Ext. 103

From: [email protected]<mailto:[email protected]> [mailto:[email protected]] On Behalf Of Matthew Carpenter
Sent: Tuesday, January 03, 2017 9:24 AM
To: Telrad List
Subject: Re: [Telrad] BreezeWAY EPC spontaneous reboots?

Yes, we are only using the built in iHSS of the EPC.

Matt Carpenter

On Tue, Jan 3, 2017 at 6:34 AM, Nathan Anderson <[email protected]<mailto:[email protected]>> wrote:

Okay, after seeing from the console that the AAA process is the one dying and that the BreezeWAY waits a whole 5 minutes after its death before it reboots itself, I was able to pinpoint the place in the tlsyslog where the failure is being referenced. It appears that for whatever reason (which I will work on hunting down), the BreezeWAY is sometimes failing to get a response from our RADIUS server. And that sometimes when this happens, the AAA process apparently segfaults.

40087:2016-12-30,22:51:37.352929:ERROR:0:06.06.00729:AAA:1241:accounting request timeout max retransmissions reached [MSID = 0010010000xxxxx]
40088:2016-12-30,22:51:39.352964:ERROR:0:06.06.00729:AAA:1241:accounting request timeout max retransmissions reached [MSID = 0010010000xxxxx]
40089:2016-12-30,22:51:39.353026:ERROR:0:06.06.00729:AAA:1241:No response from AAA server, after interim retransmission, [MSID=0010010000xxxxx], srvc_grp[0], AAA_IP[xx.xx.xx.xx]
40090:2016-12-30,22:51:39.353075:ERROR:0:06.06.00729:AAA:1241:Accounting Interim timeout - send CLR to S6A, UE [0010010000xxxxx]
40100:2016-12-30,22:51:39.361255:NOTICE:0:06.06.00729:PGWC:1289:Counters indication sent to AAA IWK
40102:2016-12-30,22:51:40.215186:ERROR:0:06.06.00729:AAA:1241:accounting request timeout max retransmissions reached [MSID = 0010010000xxxxx]
40109:2016-12-30,22:51:40.294878:ERROR:0:06.06.00729:AAA:1241:UE [0010010000xxxxx] does not exist!
40118:2016-12-30,22:51:41.384408:ERROR:0:06.06.00729:AAA:1241:Auth T2 failure [MSID = 0010010000xxxxx] AAA server address [xx.xx.xx.xx]
40119:2016-12-30,22:51:41.882861:ERROR:0:06.06.00729:AAA:1241:Auth T2 failure [MSID = 0010010000xxxxx] AAA server address [xx.xx.xx.xx]
40120:2016-12-30,22:51:42.215202:ERROR:0:06.06.00729:AAA:1241:accounting request timeout max retransmissions reached [MSID = 0010010000xxxxx]
40121:2016-12-30,22:51:42.215252:ERROR:0:06.06.00729:AAA:1241:No response from AAA server, after interim retransmission, [MSID=0010010000xxxxx], srvc_grp[0], AAA_IP[xx.xx.xx.xx]
40122:2016-12-30,22:51:42.215301:ERROR:0:06.06.00729:AAA:1241:accounting request timeout max retransmissions reached [MSID = 0010010000xxxxx]
40123:2016-12-30,22:51:42.215448:ERROR:0:06.06.00729:AAA:1241:Accounting Interim timeout - send CLR to S6A, UE [0010010000xxxxx]
40135:2016-12-30,22:51:42.240884:NOTICE:0:06.06.00729:SIGNAL:1241:tl_signal.c:signal_handler(65):Traceback data saved: process:aaa_iwk.out, pid:1241, sig:Segmentation fault
2016-12-30,22:51:42:ERROR: COREDUMP Generated: /mnt/bigstore/coredumps/core_1483138302_06.06.00729_aaa_iwk.out.1241.gz
40142:2016-12-30,22:51:44.881842:ERROR:0:06.06.00729:FORKER:907:forker_api.cpp:StateHandler(855):Subsystem AAA SERVICE (id:9) process dead

Hopefully the generated core dumps can aid Telrad engineers in debugging the issue. If most everybody is either using the iHSS, or they are using external HSS but not experiencing issues with their RADIUS server (or the network between the EPC and the RADIUS server), then that would explain why we are getting hit with this bug and others are not.

-- Nathan

From: [email protected]<mailto:[email protected]> [mailto:[email protected]<mailto:[email protected]>] On Behalf Of Nathan Anderson
Sent: Tuesday, January 03, 2017 4:10 AM
To: Telrad List
Subject: Re: [Telrad] BreezeWAY EPC spontaneous reboots?

We have had further crashes/reboots. Yesterday it crashed twice within 10 minutes, then stabilized and hasn't crashed since. This time, though, I had a serial cable hooked up to the system, and this got logged to the console both times:

+ AAA SERVICE................................................ [DEAD]
- PGWC SERVICE............................................... [STOP]
- SGWC SERVICE............................................... [STOP]
- S6A SERVICE................................................ [STOP]
- MME SERVICE................................................ [STOP]
- UPGRADE INTERFACE.......................................... [STOP]
WARNING : recevied Error Message with reason=21 and cause =207 from Application/OAM-CL
WARNING : recevied Error Message with reason=21 and cause =157 from Application/OAM-CL
WARNING : recevied Error Message with reason=21 and cause =107 from Application/OAM-CL
WARNING : recevied Error Message with reason=21 and cause =107 from Application/OAM-CL
- CONFIGURATION AGENT........................................ [STOP]
- BASE CONFD................................................. [STOP]
- CONFD PHASE-0.............................................. [STOP]
- FSTS....................................................... [STOP]
- HWIF....................................................... [STOP]
#################################################################################
System failure condition detected! POWER restart scheduled in 300 second(s)
#################################################################################
Forker timeout has expired. Reset the board...
Requesting Power On system reset...

So I guess the AAA process is failing for some reason. We are using external HSS (RADIUS), so I presume it has something to do with that. Guess I'll dig through our FreeRADIUS logs and then open a ticket...

-- Nathan

From: [email protected]<mailto:[email protected]> [mailto:[email protected]] On Behalf Of Nathan Anderson
Sent: Saturday, December 31, 2016 2:09 PM
To: Telrad List
Subject: Re: [Telrad] BreezeWAY EPC spontaneous reboots?

If a full reset is what it takes to fix, I am familiar enough with the procedure that I can do it myself.

To me, it sounds like confd is bombing out, and other processes (mme, pgwc) are trying to talk to it, unsuccessfully (since it isn't running). A few minutes later, I presume the 'forker' process sees the problem and issues a system reboot.

Yesterday was the first and only time it has done this, and it happened 4 times. We have not had a recurrence in 24 hours.

-- Nathan

From: [email protected]<mailto:[email protected]> [mailto:[email protected]] On Behalf Of Matthew Carpenter
Sent: Saturday, December 31, 2016 8:00 AM
To: Telrad List
Subject: Re: [Telrad] BreezeWAY EPC spontaneous reboots?

Have not had any issues like this with our 2 EPCs. I would contact support (Nick) and see about doing a full factory default on the EPC and set it back up from scratch. We had some odd issues with an eNB and that was the solution, at least for an eNB.
The error messages sound more like confd is trying to start a process with parameters and it's not working.

Matt Carpenter

On Sat, Dec 31, 2016 at 12:34 AM, Nathan Anderson <[email protected]<mailto:[email protected]>> wrote:

Okay, I take it back. I found a clue in the 'tlsyslog' after all. Before it reboots, I see several of these logged in there, interwoven with otherwise normally-expected messages:

MME:1229:MME - Failed to start the session with confd for operational params

(...or...)

PGWC:1287:Failed to start the session with confd for operational params

So it sounds like confd is dying for some reason, and then the watchdog kicks the box a few minutes later. So I guess the question is, why is confd dying.

-- Nathan

-----Original Message-----
From: [email protected]<mailto:[email protected]> [mailto:[email protected]<mailto:[email protected]>] On Behalf Of Nathan Anderson
Sent: Friday, December 30, 2016 10:27 PM
To: [email protected]<mailto:[email protected]>
Subject: [Telrad] BreezeWAY EPC spontaneous reboots?

So, we had a new one today. One of our EPCs rebooted itself 4 times within the span of 90 minutes. Yes, latest public code level (6.6 729). Are we the only ones who have seen THIS happen??

I didn't observe this particular detail myself, but others whose eyeballs were trained on the physical BreezeWAY box at the time say that the alarm light went red when it stopped responding, sat like that for a few minutes, and then the box finally rebooted (presumably some sort of watchdog process).

Is there anything that I can look for to explain the reboots? 'show notification stream alarms' only shows the 'device-is-up-and-running' event with nothing suspicious-looking showing up before that (or at least that managed to get committed to NVRAM before the reboot occurred). Similarly, the tlsyslog file just abruptly ends with "-- SYSTEM STARTED --" and "NOTICE:Last restart type is POWERUP" with nothing suspicious-looking getting logged right before that.
I now have a serial cable hooked up to it in case it happens again and in case whatever is causing the crash logs something to the console that isn't getting written to disk for some reason. But of course it hasn't happened again (it's been 6 hours since the last event).

We haven't made a change to the config on this thing since the 6.6 upgrade was installed last August or whenever that was. So, whyyyyyyyy after months of stability is this happeningggggggg.

...actually, I take that back. We increased the uplink AMBR value to help troubleshoot the eNB capacity issues we have been seeing. But that's all.

Ugh. If it isn't one thing...

--
Nathan Anderson
First Step Internet, LLC
[email protected]<mailto:[email protected]>

_______________________________________________
Telrad mailing list
[email protected]<mailto:[email protected]>
http://lists.wispa.org/mailman/listinfo/telrad

--
Matthew Carpenter
806-316-5071 office
806-236-9558 cell
