Yes, we are only using the built-in iHSS of the EPC.

Matt Carpenter
On Tue, Jan 3, 2017 at 6:34 AM, Nathan Anderson <[email protected]> wrote:

> Okay, after seeing from the console that the AAA process is the one dying and that the BreezeWAY waits a whole 5 minutes after its death before it reboots itself, I was able to pinpoint the place in the tlsyslog where the failure is being referenced. It appears that for whatever reason (which I will work on hunting down), the BreezeWAY is sometimes failing to get a response from our RADIUS server. And that sometimes when this happens, the AAA process apparently segfaults.
>
> 40087:2016-12-30,22:51:37.352929:ERROR:0:06.06.00729:AAA:1241:accounting request timeout max retransmissions reached [MSID = 0010010000xxxxx]
> 40088:2016-12-30,22:51:39.352964:ERROR:0:06.06.00729:AAA:1241:accounting request timeout max retransmissions reached [MSID = 0010010000xxxxx]
> 40089:2016-12-30,22:51:39.353026:ERROR:0:06.06.00729:AAA:1241:No response from AAA server, after interim retransmission, [MSID=0010010000xxxxx], srvc_grp[0], AAA_IP[xx.xx.xx.xx]
> 40090:2016-12-30,22:51:39.353075:ERROR:0:06.06.00729:AAA:1241:Accounting Interim timeout - send CLR to S6A, UE [0010010000xxxxx]
> 40100:2016-12-30,22:51:39.361255:NOTICE:0:06.06.00729:PGWC:1289:Counters indication sent to AAA IWK
> 40102:2016-12-30,22:51:40.215186:ERROR:0:06.06.00729:AAA:1241:accounting request timeout max retransmissions reached [MSID = 0010010000xxxxx]
> 40109:2016-12-30,22:51:40.294878:ERROR:0:06.06.00729:AAA:1241:UE [0010010000xxxxx] does not exist!
> 40118:2016-12-30,22:51:41.384408:ERROR:0:06.06.00729:AAA:1241:Auth T2 failure [MSID = 0010010000xxxxx] AAA server address [xx.xx.xx.xx]
> 40119:2016-12-30,22:51:41.882861:ERROR:0:06.06.00729:AAA:1241:Auth T2 failure [MSID = 0010010000xxxxx] AAA server address [xx.xx.xx.xx]
> 40120:2016-12-30,22:51:42.215202:ERROR:0:06.06.00729:AAA:1241:accounting request timeout max retransmissions reached [MSID = 0010010000xxxxx]
> 40121:2016-12-30,22:51:42.215252:ERROR:0:06.06.00729:AAA:1241:No response from AAA server, after interim retransmission, [MSID=0010010000xxxxx], srvc_grp[0], AAA_IP[xx.xx.xx.xx]
> 40122:2016-12-30,22:51:42.215301:ERROR:0:06.06.00729:AAA:1241:accounting request timeout max retransmissions reached [MSID = 0010010000xxxxx]
> 40123:2016-12-30,22:51:42.215448:ERROR:0:06.06.00729:AAA:1241:Accounting Interim timeout - send CLR to S6A, UE [0010010000xxxxx]
>
> 40135:2016-12-30,22:51:42.240884:NOTICE:0:06.06.00729: SIGNAL:1241:tl_signal.c:signal_handler(65):Traceback data saved: process:aaa_iwk.out, pid:1241, sig:Segmentation fault
>
> 2016-12-30,22:51:42:ERROR: COREDUMP Generated: /mnt/bigstore/coredumps/core_1483138302_06.06.00729_aaa_iwk.out.1241.gz
>
> 40142:2016-12-30,22:51:44.881842:ERROR:0:06.06.00729: FORKER:907:forker_api.cpp:StateHandler(855):Subsystem AAA SERVICE (id:9) process dead
>
> Hopefully the generated core dumps can aid Telrad engineers in debugging the issue.
>
> If most everybody is either using the iHSS, or they are using external HSS but not experiencing issues with their RADIUS server (or the network between the EPC and the RADIUS server), then that would explain why we are getting hit with this bug and others are not.
>
> -- Nathan
>
> *From:* [email protected] [mailto:[email protected]] *On Behalf Of *Nathan Anderson
> *Sent:* Tuesday, January 03, 2017 4:10 AM
> *To:* Telrad List
> *Subject:* Re: [Telrad] BreezeWAY EPC spontaneous reboots?
>
> We have had further crashes/reboots. Yesterday it crashed twice within 10 minutes, then stabilized and hasn't crashed since.
>
> This time, though, I had a serial cable hooked up to the system, and this got logged to the console both times:
>
> + AAA SERVICE................................................ [DEAD]
> - PGWC SERVICE............................................... [STOP]
> - SGWC SERVICE............................................... [STOP]
> - S6A SERVICE................................................ [STOP]
> - MME SERVICE................................................ [STOP]
> - UPGRADE INTERFACE.......................................... [STOP]
> WARNING : recevied Error Message with reason=21 and cause =207 from Application/OAM-CL
> WARNING : recevied Error Message with reason=21 and cause =157 from Application/OAM-CL
> WARNING : recevied Error Message with reason=21 and cause =107 from Application/OAM-CL
> WARNING : recevied Error Message with reason=21 and cause =107 from Application/OAM-CL
> - CONFIGURATION AGENT........................................ [STOP]
> - BASE CONFD................................................. [STOP]
> - CONFD PHASE-0.............................................. [STOP]
> - FSTS....................................................... [STOP]
> - HWIF....................................................... [STOP]
>
> #################################################################################
> System failure condition detected!
> POWER restart scheduled in 300 second(s)
> #################################################################################
> Forker timeout has expired. Reset the board...
> Requesting Power On system reset...
>
> So I guess the AAA process is failing for some reason. We are using external HSS (RADIUS), so I presume it has something to do with that.
>
> Guess I'll dig through our FreeRADIUS logs and then open a ticket...
>
> -- Nathan
>
> *From:* [email protected] [mailto:[email protected] <[email protected]>] *On Behalf Of *Nathan Anderson
> *Sent:* Saturday, December 31, 2016 2:09 PM
> *To:* Telrad List
> *Subject:* Re: [Telrad] BreezeWAY EPC spontaneous reboots?
>
> If a full reset is what it takes to fix, I am familiar enough with the procedure that I can do it myself.
>
> To me, it sounds like confd is bombing out, and other processes (mme, pgwc) are trying to talk to it, unsuccessfully (since it isn't running). A few minutes later, I presume the 'forker' process sees the problem and issues a system reboot.
>
> Yesterday was the first and only time it has done this, and it happened 4 times. We have not had a recurrence in 24 hours.
>
> -- Nathan
>
> *From:* [email protected] [mailto:[email protected] <[email protected]>] *On Behalf Of *Matthew Carpenter
> *Sent:* Saturday, December 31, 2016 8:00 AM
> *To:* Telrad List
> *Subject:* Re: [Telrad] BreezeWAY EPC spontaneous reboots?
>
> Have not had any issues like this with our 2 EPCs.
>
> I would contact support (Nick) and see about doing a full factory default on the EPC and set it back up from scratch.
>
> We had some odd issues with an eNB and that was the solution, at least for an eNB.
>
> The error messages sound more like confd is trying to start a process with parameters and it's not working.
>
> Matt Carpenter
>
> On Sat, Dec 31, 2016 at 12:34 AM, Nathan Anderson <[email protected]> wrote:
>
> Okay, I take it back. I found a clue in the 'tlsyslog' after all. Before it reboots, I see several of these logged in there, interwoven with otherwise normally-expected messages:
>
> MME:1229:MME - Failed to start the session with confd for operational params
>
> (...or...)
>
> PGWC:1287:Failed to start the session with confd for operational params
>
> So it sounds like confd is dying for some reason, and then the watchdog kicks the box a few minutes later.
>
> So I guess the question is, why is confd dying.
>
> -- Nathan
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Nathan Anderson
> Sent: Friday, December 30, 2016 10:27 PM
> To: [email protected]
> Subject: [Telrad] BreezeWAY EPC spontaneous reboots?
>
> So, we had a new one today. One of our EPCs rebooted itself 4 times within the span of 90 minutes.
>
> Yes, latest public code level (6.6 729).
>
> Are we the only ones who have seen THIS happen??
>
> I didn't observe this particular detail myself, but others whose eyeballs were trained on the physical BreezeWAY box at the time say that the alarm light went red when it stopped responding, sat like that for a few minutes, and then the box finally rebooted (presumably some sort of watchdog process).
>
> Is there anything that I can look for to explain the reboots? 'show notification stream alarms' only shows the 'device-is-up-and-running' event with nothing suspicious-looking showing up before that (or at least that managed to get committed to NVRAM before the reboot occurred). Similarly, the tlsyslog file just abruptly ends with "-- SYSTEM STARTED --" and "NOTICE:Last restart type is POWERUP" with nothing suspicious-looking getting logged right before that.
>
> I now have a serial cable hooked up to it in case it happens again and in case whatever is causing the crash logs something to the console that isn't getting written to disk for some reason. But of course it hasn't happened again (it's been 6 hours since the last event).
>
> We haven't made a change to the config on this thing since the 6.6 upgrade was installed last August or whenever that was. So, whyyyyyyyy after months of stability is this happeningggggggg.
>
> ...actually, I take that back. We increased the uplink AMBR value to help troubleshoot the eNB capacity issues we have been seeing. But that's all.
>
> Ugh. If it isn't one thing...
>
> --
> Nathan Anderson
> First Step Internet, LLC
> [email protected]
>
> _______________________________________________
> Telrad mailing list
> [email protected]
> http://lists.wispa.org/mailman/listinfo/telrad

--
*Matthew Carpenter*
*806-316-5071 office*
*806-236-9558 cell*
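
For anyone who wants to check their own tlsyslog for the pattern Nathan describes in the quoted thread (a run of AAA errors followed by a SIGNAL traceback for the same pid), here is a minimal Python sketch. The field layout and the field names (seq, level, flag, version, subsystem, pid) are inferred from the quoted excerpt only and are assumptions, not a documented Telrad format; the 10-second look-back window is arbitrary as well.

```python
import re
from datetime import datetime, timedelta

# Layout guessed from the quoted excerpt; names are illustrative assumptions.
# Optional whitespace is allowed before the subsystem because the excerpt
# shows "06.06.00729: SIGNAL:..." on some lines.
LINE_RE = re.compile(
    r"^(?P<seq>\d+):(?P<date>\d{4}-\d{2}-\d{2}),(?P<time>\d{2}:\d{2}:\d{2}\.\d+):"
    r"(?P<level>[A-Z]+):(?P<flag>\d+):(?P<version>[\d.]+):\s*"
    r"(?P<subsystem>[A-Z0-9_]+):(?P<pid>\d+):(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict for a line in the assumed layout, or None if it differs."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["when"] = datetime.strptime(
        rec["date"] + " " + rec["time"], "%Y-%m-%d %H:%M:%S.%f"
    )
    rec["pid"] = int(rec["pid"])
    return rec


def errors_before_crash(lines, window_seconds=10):
    """For each segfault traceback, collect ERROR lines from the same pid
    logged within window_seconds before it."""
    records = [r for r in map(parse_line, lines) if r is not None]
    results = []
    for crash in records:
        if crash["subsystem"] != "SIGNAL":
            continue
        if "Segmentation fault" not in crash["message"]:
            continue
        start = crash["when"] - timedelta(seconds=window_seconds)
        prior = [
            r for r in records
            if r["level"] == "ERROR"
            and r["pid"] == crash["pid"]
            and start <= r["when"] <= crash["when"]
        ]
        results.append((crash, prior))
    return results


# Lines copied from the quoted excerpt, with the mail wrapping undone.
SAMPLE = [
    "40087:2016-12-30,22:51:37.352929:ERROR:0:06.06.00729:AAA:1241:accounting request timeout max retransmissions reached [MSID = 0010010000xxxxx]",
    "40100:2016-12-30,22:51:39.361255:NOTICE:0:06.06.00729:PGWC:1289:Counters indication sent to AAA IWK",
    "40123:2016-12-30,22:51:42.215448:ERROR:0:06.06.00729:AAA:1241:Accounting Interim timeout - send CLR to S6A, UE [0010010000xxxxx]",
    "40135:2016-12-30,22:51:42.240884:NOTICE:0:06.06.00729: SIGNAL:1241:tl_signal.c:signal_handler(65):Traceback data saved: process:aaa_iwk.out, pid:1241, sig:Segmentation fault",
]

if __name__ == "__main__":
    for crash, prior in errors_before_crash(SAMPLE):
        print("crash: pid", crash["pid"], "at", crash["when"])
        for r in prior:
            print("  preceded by", r["seq"], r["subsystem"], r["message"])
```

To run it against a real file, pass the file's lines instead of SAMPLE, for example `errors_before_crash(open(path))`, where `path` is wherever a copy of the tlsyslog has been saved. Lines that do not fit the assumed layout (such as the COREDUMP line, which has no sequence number) are skipped rather than guessed at.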
