[uknof] NETCONF & YANG and device health stats

James Bensley Tue, 22 Sep 2015 13:51:08 -0700

Hi All,

Those of you that use NETCONF and YANG (or something similar [1]), how
are you (or are you not?) "managing" that process? Just pushing out
config is the tip of the iceberg for most networks looking to automate
operations...


A classic example: I need to bring up a new eBGP session on a router
that carries high risk services (let's say 999 calls) to a new voice
carrier. If I was doing this by hand the following would occur:

Pre change checks: For example, if the CPU is running at 99% due to a
bug I'd call off the maintenance work, if the FIB is full, this change
won’t succeed :)

Apply the proposed config: Here I lock the config database, apply my
config to bring up a new peer, then check the router has accepted the
config (it's syntactically correct and not rejected)

Check the change has worked: Next I would check the config has done as
I had hoped (the new eBGP session is up and exchanging routes,
assuming the peer is already configured in this example), check that
any “policy” regarding this new configuration is working as expected
(we are only sending and receiving the relevant routes due to prefix
filters for example)

Post change checks: Next I could check the router isn’t now running at
100% CPU usage when it wasn’t before I fiddled with it, check memory
usage etc. (some conditions would be defined that, if met, would require
the change to be rolled back so as not to affect live traffic).
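
As a rough illustration of the pre and post change checks described above, here is a minimal Python sketch of the go/no-go and rollback decisions. The Health fields, the LIMITS values and the max_cpu_jump parameter are all assumptions made up for the example; how the numbers are collected (SNMP, NETCONF, CLI) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Health:
    """A snapshot of device health, expressed as percentages (hypothetical)."""
    cpu_pct: float
    mem_pct: float
    fib_pct: float


# Hypothetical limits; real values would come from your own change policy.
LIMITS = Health(cpu_pct=80.0, mem_pct=85.0, fib_pct=90.0)


def pre_check_failures(now, limits=LIMITS):
    """Return the names of the metrics that are at or above their limit."""
    return [name for name in ("cpu_pct", "mem_pct", "fib_pct")
            if getattr(now, name) >= getattr(limits, name)]


def needs_rollback(before, after, limits=LIMITS, max_cpu_jump=20.0):
    """True if the device is over a limit after the change,
    or if CPU usage rose by more than max_cpu_jump points."""
    if pre_check_failures(after, limits):
        return True
    return (after.cpu_pct - before.cpu_pct) > max_cpu_jump
```

With these assumed limits, a router at 99% CPU fails the pre check and the maintenance is called off, while a router that goes from 10% to 100% CPU after the change is flagged for rollback.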

With all that in mind, only the “Apply the proposed config” part
actually falls under the remit of NETCONF and YANG (as per the
functionality defined in the existing RFCs). Does anyone know if there
is any scope to extend NETCONF/YANG to include reading stats such as
number of routes received from a specified BGP peer, CPU usage, memory
usage, FIB usage, and so on, everything we are probably getting via
SNMP? [2]

Are any of you doing all the above alongside your NETCONF & YANG
deployments? If so, are you just getting it from your existing SNMP/NMS
systems and "gluing" it together by hand?

I’m seeing scope for lots of problems here having to write an
automation system that scrapes data from the NMS to check the current
state of the device, use NETCONF & YANG to reconfigure it, kick the
NMS to poll immediately, scrape the new data and check it against some
pre-supplied criteria to ensure the change was a success, then have
NETCONF release the lock on the config database or roll back if
required.
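
The sequence described above (check state, lock, reconfigure, re-check, then release the lock or roll back) could be sketched roughly as below. This is only an outline under stated assumptions: `device` is a hypothetical object exposing lock(), unlock(), apply(), rollback() and read_state(), not the API of any real NETCONF library, and the three check functions are supplied by the caller.

```python
from contextlib import contextmanager


class ChangeAborted(Exception):
    """Raised when a check fails and the change is not left in place."""


@contextmanager
def config_lock(device):
    """Hold the config database lock and always release it afterwards."""
    device.lock()
    try:
        yield
    finally:
        device.unlock()


def run_change(device, config, pre_ok, change_ok, post_ok):
    """Pre-check, lock, apply, verify, and roll back if a check fails.

    device    -- hypothetical object (see lead-in), not a real library class
    pre_ok    -- callable(state) -> bool, run before anything is applied
    change_ok -- callable(state) -> bool, e.g. "is the new BGP session up?"
    post_ok   -- callable(before, after) -> bool, e.g. CPU/memory comparison
    """
    before = device.read_state()
    if not pre_ok(before):
        raise ChangeAborted("pre-change checks failed; nothing applied")

    with config_lock(device):
        # If the device rejects the config here, the lock is still released.
        device.apply(config)
        after = device.read_state()
        if not (change_ok(after) and post_ok(before, after)):
            device.rollback()
            raise ChangeAborted("post-apply checks failed; rolled back")
    return after
```

Passing the checks in as plain callables means read_state() could be backed by an NMS scrape, SNMP or NETCONF operational data without changing the control flow.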

What are others doing here to keep the whole process sane?

Cheers,
James.


[1] We (the networking community) have been able to do ALL of this for
a while using expect scripts; all of the above can be done on the CLI,
so expect can do it. It's not elegant at scaling or at providing top
levels of reliability though, IMO.

[2] I don't want to reinvent the wheel, and I don't want loads of
different systems storing disparate data sets either.
