Yes, we are only using the built-in iHSS of the EPC.

Matt Carpenter


On Tue, Jan 3, 2017 at 6:34 AM, Nathan Anderson <[email protected]> wrote:

> Okay, after seeing from the console that the AAA process is the one dying
> and that the BreezeWAY waits a whole 5 minutes after its death before it
> reboots itself, I was able to pinpoint the place in the tlsyslog where the
> failure is being referenced.  It appears that for whatever reason (which I
> will work on hunting down), the BreezeWAY is sometimes failing to get a
> response from our RADIUS server.  And that sometimes when this happens, the
> AAA process apparently segfaults.
>
>
>
> 40087:2016-12-30,22:51:37.352929:ERROR:0:06.06.00729:AAA:1241:accounting
> request timeout max retransmissions reached [MSID = 0010010000xxxxx]
>
> 40088:2016-12-30,22:51:39.352964:ERROR:0:06.06.00729:AAA:1241:accounting
> request timeout max retransmissions reached [MSID = 0010010000xxxxx]
>
> 40089:2016-12-30,22:51:39.353026:ERROR:0:06.06.00729:AAA:1241:No response
> from AAA server, after interim retransmission, [MSID=0010010000xxxxx],
> srvc_grp[0], AAA_IP[xx.xx.xx.xx]
>
> 40090:2016-12-30,22:51:39.353075:ERROR:0:06.06.00729:AAA:1241:Accounting
> Interim timeout - send CLR to S6A, UE [0010010000xxxxx]
>
> 40100:2016-12-30,22:51:39.361255:NOTICE:0:06.06.00729:PGWC:1289:Counters
> indication sent to AAA IWK
>
> 40102:2016-12-30,22:51:40.215186:ERROR:0:06.06.00729:AAA:1241:accounting
> request timeout max retransmissions reached [MSID = 0010010000xxxxx]
>
> 40109:2016-12-30,22:51:40.294878:ERROR:0:06.06.00729:AAA:1241:UE
> [0010010000xxxxx] does not exist!
>
> 40118:2016-12-30,22:51:41.384408:ERROR:0:06.06.00729:AAA:1241:Auth T2
> failure [MSID = 0010010000xxxxx] AAA server address [xx.xx.xx.xx]
>
> 40119:2016-12-30,22:51:41.882861:ERROR:0:06.06.00729:AAA:1241:Auth T2
> failure [MSID = 0010010000xxxxx] AAA server address [xx.xx.xx.xx]
>
> 40120:2016-12-30,22:51:42.215202:ERROR:0:06.06.00729:AAA:1241:accounting
> request timeout max retransmissions reached [MSID = 0010010000xxxxx]
>
> 40121:2016-12-30,22:51:42.215252:ERROR:0:06.06.00729:AAA:1241:No response
> from AAA server, after interim retransmission, [MSID=0010010000xxxxx],
> srvc_grp[0], AAA_IP[xx.xx.xx.xx]
>
> 40122:2016-12-30,22:51:42.215301:ERROR:0:06.06.00729:AAA:1241:accounting
> request timeout max retransmissions reached [MSID = 0010010000xxxxx]
>
> 40123:2016-12-30,22:51:42.215448:ERROR:0:06.06.00729:AAA:1241:Accounting
> Interim timeout - send CLR to S6A, UE [0010010000xxxxx]
>
>
>
> 40135:2016-12-30,22:51:42.240884:NOTICE:0:06.06.00729:
> SIGNAL:1241:tl_signal.c:signal_handler(65):Traceback data saved:
> process:aaa_iwk.out, pid:1241, sig:Segmentation fault
>
>
>
> 2016-12-30,22:51:42:ERROR: COREDUMP Generated:
> /mnt/bigstore/coredumps/core_1483138302_06.06.00729_aaa_iwk.out.1241.gz
>
>
>
> 40142:2016-12-30,22:51:44.881842:ERROR:0:06.06.00729:
> FORKER:907:forker_api.cpp:StateHandler(855):Subsystem AAA SERVICE (id:9)
> process dead
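
For anyone who wants to pull the same context out of their own tlsyslog, here is a minimal Python sketch. The field layout (sequence number, timestamp, level, an unidentified numeric field, firmware version, module, PID, message) is only inferred from the excerpt above, and the sketch assumes each entry sits on one unwrapped line, unlike the mail-wrapped copy here. `parse_line` and `errors_before_segfault` are made-up names, not anything Telrad ships.

```python
import re
from datetime import datetime, timedelta

# Assumed layout, inferred from the excerpt only:
# seq:date,time:LEVEL:n:version:MODULE:pid:message
LINE_RE = re.compile(
    r"^(?P<seq>\d+):(?P<ts>\d{4}-\d{2}-\d{2},\d{2}:\d{2}:\d{2}\.\d+):"
    r"(?P<level>\w+):\d+:(?P<version>[\d.]+):(?P<module>\w+):(?P<pid>\d+):(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict for one log entry, or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%Y-%m-%d,%H:%M:%S.%f")
    rec["pid"] = int(rec["pid"])
    return rec


def errors_before_segfault(lines, window=timedelta(seconds=10)):
    """ERROR entries from the crashing PID logged within `window` before the first segfault."""
    records = [r for r in map(parse_line, lines) if r is not None]
    for crash in records:
        if "Segmentation fault" not in crash["msg"]:
            continue
        return [
            r for r in records
            if r["level"] == "ERROR"
            and r["pid"] == crash["pid"]
            and timedelta(0) <= crash["ts"] - r["ts"] <= window
        ]
    return []


# Entries copied from the excerpt above, rejoined onto single lines.
SAMPLE = [
    "40087:2016-12-30,22:51:37.352929:ERROR:0:06.06.00729:AAA:1241:accounting request timeout max retransmissions reached [MSID = 0010010000xxxxx]",
    "40123:2016-12-30,22:51:42.215448:ERROR:0:06.06.00729:AAA:1241:Accounting Interim timeout - send CLR to S6A, UE [0010010000xxxxx]",
    "40135:2016-12-30,22:51:42.240884:NOTICE:0:06.06.00729:SIGNAL:1241:tl_signal.c:signal_handler(65):Traceback data saved: process:aaa_iwk.out, pid:1241, sig:Segmentation fault",
    "40142:2016-12-30,22:51:44.881842:ERROR:0:06.06.00729:FORKER:907:forker_api.cpp:StateHandler(855):Subsystem AAA SERVICE (id:9) process dead",
]

for rec in errors_before_segfault(SAMPLE):
    print(rec["seq"], rec["module"], rec["msg"])
```

The 10-second window is arbitrary; widen it if the timeouts leading up to a crash are spread further apart in your log.
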
>
>
>
> Hopefully the generated core dumps can aid Telrad engineers in debugging
> the issue.
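
The core dump file name itself seems to carry enough metadata to match a dump to a log entry. Below is a small Python sketch that assumes the name follows the pattern core_<epoch seconds>_<version>_<process>.<pid>.gz. That pattern is a guess from the single file name above, and `parse_core_name` is a hypothetical helper, not a Telrad tool.

```python
import re
from datetime import datetime, timezone

# Assumed pattern, guessed from one example name:
# core_<epoch seconds>_<version>_<process>.<pid>.gz
CORE_RE = re.compile(
    r"core_(?P<epoch>\d+)_(?P<version>[0-9.]+)_(?P<process>.+)\.(?P<pid>\d+)\.gz$"
)


def parse_core_name(path):
    """Split a core dump path into timestamp (UTC), version, process, and PID."""
    m = CORE_RE.search(path)
    if m is None:
        return None
    return {
        "time_utc": datetime.fromtimestamp(int(m.group("epoch")), tz=timezone.utc),
        "version": m.group("version"),
        "process": m.group("process"),
        "pid": int(m.group("pid")),
    }


info = parse_core_name(
    "/mnt/bigstore/coredumps/core_1483138302_06.06.00729_aaa_iwk.out.1241.gz"
)
print(info)
```

For the file above, the epoch value decodes to 2016-12-30 22:51:42 UTC, which lines up with the segfault entry in the tlsyslog excerpt if the box logs in UTC.
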
>
>
>
> If most everybody is either using the iHSS, or they are using external HSS
> but not experiencing issues with their RADIUS server (or the network
> between the EPC and the RADIUS server), then that would explain why we are
> getting hit with this bug and others are not.
>
>
>
> -- Nathan
>
>
>
> *From:* [email protected] [mailto:[email protected]] *On
> Behalf Of *Nathan Anderson
> *Sent:* Tuesday, January 03, 2017 4:10 AM
>
> *To:* Telrad List
> *Subject:* Re: [Telrad] BreezeWAY EPC spontaneous reboots?
>
>
>
> We have had further crashes/reboots.  Yesterday it crashed twice within 10
> minutes, then stabilized and hasn't crashed since.
>
>
>
> This time, though, I had a serial cable hooked up to the system, and this
> got logged to the console both times:
>
>
>
>   + AAA SERVICE................................................ [DEAD]
>
>   - PGWC SERVICE............................................... [STOP]
>
>   - SGWC SERVICE............................................... [STOP]
>
>   - S6A SERVICE................................................ [STOP]
>
>   - MME SERVICE................................................ [STOP]
>
>   - UPGRADE INTERFACE.......................................... [STOP]
>
> WARNING : recevied Error Message with reason=21 and cause =207 from
> Application/OAM-CL
>
> WARNING : recevied Error Message with reason=21 and cause =157 from
> Application/OAM-CL
>
> WARNING : recevied Error Message with reason=21 and cause =107 from
> Application/OAM-CL
>
> WARNING : recevied Error Message with reason=21 and cause =107 from
> Application/OAM-CL
>
>   - CONFIGURATION AGENT........................................ [STOP]
>
>   - BASE CONFD................................................. [STOP]
>
>   - CONFD PHASE-0.............................................. [STOP]
>
>   - FSTS....................................................... [STOP]
>
>   - HWIF....................................................... [STOP]
>
>
>
> #################################################################################
>
>                        System failure condition detected!
>
>                      POWER restart scheduled in 300 second(s)
>
> #################################################################################
>
> Forker timeout has expired. Reset the board...
>
> Requesting Power On system reset...
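
If you are capturing the serial console to a file, a short Python sketch like the following could flag which subsystem died first. It assumes the status lines look exactly like the ones above (a leading + or -, the service name, a run of dots, and the state in brackets) and that the capture is the raw console text without mail quoting. `service_states` is a made-up name for illustration.

```python
import re

# Assumed line shape, taken from the console capture above:
#   "  + AAA SERVICE......... [DEAD]"
STATUS_RE = re.compile(r"^\s*[+-]\s+(?P<name>.+?)\.+\s*\[(?P<state>\w+)\]\s*$")


def service_states(console_text):
    """Map each service name to the state printed in brackets."""
    states = {}
    for line in console_text.splitlines():
        m = STATUS_RE.match(line)
        if m:
            states[m.group("name")] = m.group("state")
    return states


CONSOLE = """\
  + AAA SERVICE................................................ [DEAD]
  - PGWC SERVICE............................................... [STOP]
WARNING : recevied Error Message with reason=21 and cause =207 from Application/OAM-CL
  - CONFD PHASE-0.............................................. [STOP]
"""

states = service_states(CONSOLE)
dead = [name for name, state in states.items() if state == "DEAD"]
print(dead)
```

Lines that do not match the status shape, such as the WARNING messages, are simply skipped.
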
>
>
>
> So I guess the AAA process is failing for some reason.  We are using
> external HSS (RADIUS), so I presume it has something to do with that.
>
>
>
> Guess I'll dig through our FreeRADIUS logs and then open a ticket...
>
>
>
> -- Nathan
>
>
>
> *From:* [email protected] [mailto:[email protected]] *On
> Behalf Of *Nathan Anderson
> *Sent:* Saturday, December 31, 2016 2:09 PM
> *To:* Telrad List
> *Subject:* Re: [Telrad] BreezeWAY EPC spontaneous reboots?
>
>
>
> If a full reset is what it takes to fix, I am familiar enough with the
> procedure that I can do it myself.
>
>
>
> To me, it sounds like confd is bombing out, and other processes (mme,
> pgwc) are trying to talk to it, unsuccessfully (since it isn't running).  A
> few minutes later, I presume the 'forker' process sees the problem and
> issues a system reboot.
>
>
>
> Yesterday was the first and only time it has done this, and it happened 4
> times.  We have not had a recurrence in 24 hours.
>
>
>
> -- Nathan
>
>
>
> *From:* [email protected] [mailto:[email protected]] *On
> Behalf Of *Matthew Carpenter
> *Sent:* Saturday, December 31, 2016 8:00 AM
> *To:* Telrad List
> *Subject:* Re: [Telrad] BreezeWAY EPC spontaneous reboots?
>
>
>
> Have not had any issues like this with our 2 EPCs.
>
>
>
> I would contact support (Nick) and see about doing a full factory default
> on the EPC and set it back up from scratch.
>
> We had some odd issues with an eNB and that was the solution, at least for
> an eNB.
>
>
>
> The error messages sound more like confd is trying to start a process
> with parameters and it's not working.
>
>
>
> Matt Carpenter
>
>
>
>
>
>
>
> On Sat, Dec 31, 2016 at 12:34 AM, Nathan Anderson <[email protected]> wrote:
>
> Okay, I take it back.  I found a clue in the 'tlsyslog' after all.  Before
> it reboots, I see several of these logged in there, interwoven with
> otherwise normally-expected messages:
>
> MME:1229:MME - Failed to start the session with confd for operational
> params
>
> (...or...)
>
> PGWC:1287:Failed to start the session with confd for operational params
>
> So it sounds like confd is dying for some reason, and then the watchdog
> kicks the box a few minutes later.
>
> So I guess the question is, why is confd dying?
>
> -- Nathan
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Nathan Anderson
> Sent: Friday, December 30, 2016 10:27 PM
> To: [email protected]
> Subject: [Telrad] BreezeWAY EPC spontaneous reboots?
>
> So, we had a new one today.  One of our EPCs rebooted itself 4 times
> within the span of 90 minutes.
>
> Yes, latest public code level (6.6 729).
>
> Are we the only ones who have seen THIS happen??
>
> I didn't observe this particular detail myself, but others whose eyeballs
> were trained on the physical BreezeWAY box at the time say that the alarm
> light went red when it stopped responding, sat like that for a few minutes,
> and then the box finally rebooted (presumably some sort of watchdog
> process).
>
> Is there anything that I can look for to explain the reboots?  'show
> notification stream alarms' only shows the 'device-is-up-and-running' event
> with nothing suspicious-looking showing up before that (or at least that
> managed to get committed to NVRAM before the reboot occurred).  Similarly,
> the tlsyslog file just abruptly ends with "-- SYSTEM STARTED --" and
> "NOTICE:Last restart type is POWERUP" with nothing suspicious-looking
> getting logged right before that.
>
> I now have a serial cable hooked up to it in case it happens again and in
> case whatever is causing the crash logs something to the console that isn't
> getting written to disk for some reason.  But of course it hasn't happened
> again (it's been 6 hours since the last event).
>
> We haven't made a change to the config on this thing since the 6.6 upgrade
> was installed last August or whenever that was.  So, whyyyyyyyy after
> months of stability is this happeningggggggg.
>
> ...actually, I take that back.  We increased the uplink AMBR value to help
> troubleshoot the eNB capacity issues we have been seeing.  But that's all.
>
> Ugh.  If it isn't one thing...
>
> --
> Nathan Anderson
> First Step Internet, LLC
> [email protected]
>
> _______________________________________________
> Telrad mailing list
> [email protected]
> http://lists.wispa.org/mailman/listinfo/telrad
>
>
>
>
>
> --
>
> *Matthew Carpenter*
>
> *806-316-5071 office*
>
> *806-236-9558 cell*
>
>
>
>
>
>


-- 
*Matthew Carpenter*
*806-316-5071 office*
*806-236-9558 cell*
