Hi all,

Those of you who use NETCONF and YANG (or something similar [1]): how are you "managing" that process, if at all? Just pushing out config is the tip of the iceberg for most networks looking to automate operations.
A classic example: I need to bring up a new eBGP session to a new voice carrier on a router that carries high-risk services (let's say 999 calls). If I were doing this by hand, the following would occur:

1. Pre-change checks: for example, if the CPU is running at 99% due to a bug, I'd call off the maintenance work; if the FIB is full, this change won't succeed :)

2. Apply the proposed config: here I lock the config database, apply my config to bring up the new peer, then check that the router has accepted the config (it's syntactically correct and not rejected).

3. Check the change has worked: next I would check that the config has done what I had hoped (the new eBGP session is up and exchanging routes, assuming the peer is already configured in this example), and that any "policy" regarding this new configuration is working as expected (for example, we are only sending and receiving the relevant routes because of prefix filters).

4. Post-change checks: next I would check that the router isn't now running at 100% CPU when it wasn't before I fiddled with it, check memory usage, etc. Some conditions would be defined that, if met, would require the change to be rolled back so as not to affect live traffic.

With all that in mind, only the "Apply the proposed config" step actually falls under the remit of NETCONF and YANG (as per the functionality defined in the existing RFCs). Does anyone know if there is any scope to extend NETCONF/YANG to include reading stats such as the number of routes received from a specified BGP peer, CPU usage, memory usage, FIB usage and so on, i.e. everything we are probably getting via SNMP today? [2]

Are any of you doing all of the above alongside your NETCONF and YANG deployments? If so, are you just getting the data from your existing SNMP/NMS systems and "gluing" it together by hand?
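To make the four phases concrete, here is a minimal Python sketch of how they could be chained so that any failed check triggers a rollback and the lock is always released. It assumes a hypothetical `Device` interface (`get_stats`, `lock`, `apply`, `commit`, `rollback`, `unlock`) that is not any real library's API. In a NETCONF deployment these might map onto `<lock>`, `<edit-config>` plus a confirmed `<commit>` (live but reverting unless confirmed), a confirming `<commit>`, `<cancel-commit>` or letting the confirm timeout expire, and `<unlock>`, with `get_stats` fed from wherever the CPU/FIB/BGP counters actually live. The metric names and thresholds are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol

Stats = Dict[str, float]  # metric name -> value, e.g. {"cpu_pct": 20.0}


class Device(Protocol):
    """Hypothetical device interface, not a real library API."""

    def get_stats(self) -> Stats: ...
    def lock(self) -> None: ...
    def apply(self, config: str) -> bool: ...  # assumed live but revertible; False if rejected
    def commit(self) -> None: ...              # make the applied change permanent
    def rollback(self) -> None: ...            # revert the applied change
    def unlock(self) -> None: ...


@dataclass
class ChangePlan:
    config: str
    pre_checks: List[Callable[[Stats], bool]] = field(default_factory=list)
    verify_checks: List[Callable[[Stats, Stats], bool]] = field(default_factory=list)
    post_checks: List[Callable[[Stats, Stats], bool]] = field(default_factory=list)


def run_change(dev: Device, plan: ChangePlan) -> str:
    before = dev.get_stats()
    # 1. Pre-change checks: abort before the device is touched.
    if not all(check(before) for check in plan.pre_checks):
        return "aborted: pre-checks failed"
    dev.lock()
    try:
        # 2. Apply: the device must accept the config.
        if not dev.apply(plan.config):
            dev.rollback()
            return "rolled back: config rejected"
        after = dev.get_stats()
        # 3. Verify the intended effect, then 4. post-change health checks.
        for check in plan.verify_checks + plan.post_checks:
            if not check(before, after):
                dev.rollback()
                return "rolled back: check failed"
        dev.commit()
        return "committed"
    finally:
        dev.unlock()  # always release the lock, whatever the outcome


class FakeDevice:
    """In-memory stand-in used only to exercise run_change()."""

    def __init__(self, cpu_after: float) -> None:
        self.stats: Stats = {"cpu_pct": 20.0, "fib_pct": 50.0, "bgp_prefixes_rx": 0.0}
        self.cpu_after = cpu_after
        self.log: List[str] = []

    def get_stats(self) -> Stats:
        return dict(self.stats)

    def lock(self) -> None:
        self.log.append("lock")

    def apply(self, config: str) -> bool:
        self.log.append("apply")
        # Pretend the new peer comes up and CPU moves to cpu_after.
        self.stats.update(cpu_pct=self.cpu_after, bgp_prefixes_rx=10.0)
        return True

    def commit(self) -> None:
        self.log.append("commit")

    def rollback(self) -> None:
        self.log.append("rollback")

    def unlock(self) -> None:
        self.log.append("unlock")


plan = ChangePlan(
    config="<hypothetical eBGP peer config>",
    pre_checks=[lambda s: s["cpu_pct"] < 90, lambda s: s["fib_pct"] < 95],
    verify_checks=[lambda b, a: a["bgp_prefixes_rx"] > 0],
    post_checks=[lambda b, a: a["cpu_pct"] <= b["cpu_pct"] + 30],
)
print(run_change(FakeDevice(cpu_after=25.0), plan))  # committed
print(run_change(FakeDevice(cpu_after=99.0), plan))  # rolled back: check failed
```

The `try`/`finally` guarantees the config lock is released on every path, which is the property most easily lost when the steps are glued together by hand.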
I'm seeing scope for lots of problems in having to write an automation system that scrapes data from the NMS to check the current state of the device, uses NETCONF and YANG to reconfigure it, kicks the NMS to poll immediately, scrapes the new data and checks it against some pre-supplied criteria to ensure the change was a success, then has NETCONF release the lock on the config database or roll back if required. What are others doing here to keep the whole process sane?

Cheers,
James.

[1] We (the networking community) have been able to do ALL of this for a while using expect scripts; all of the above can be done on the CLI, so expect can do it. IMO, though, it doesn't scale elegantly or provide top levels of reliability.

[2] I don't want to reinvent the wheel, and I don't want loads of different systems storing disparate data sets either.
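The "check it against some pre-supplied criteria" step could be kept sane by making the criteria data rather than code, so they can be written down alongside the change request. Below is a minimal Python sketch, assuming the before/after snapshots have already been collected into dictionaries from whichever source holds them (NMS, SNMP or otherwise). The metric names, the three criterion kinds and the thresholds are hypothetical examples, not any existing tool's format.

```python
from dataclasses import dataclass
from typing import Dict, List

Stats = Dict[str, float]


@dataclass(frozen=True)
class Criterion:
    metric: str  # hypothetical metric name, e.g. "cpu_pct"
    kind: str    # "max_abs", "min_abs" or "max_increase"
    limit: float


def evaluate(criteria: List[Criterion], before: Stats, after: Stats) -> List[str]:
    """Return a list of failure messages; an empty list means the change passed."""
    failures: List[str] = []
    for c in criteria:
        if c.metric not in after:
            failures.append(f"{c.metric}: missing from post-change data")
            continue
        value = after[c.metric]
        if c.kind == "max_abs":
            if value > c.limit:
                failures.append(f"{c.metric}: {value:g} above limit {c.limit:g}")
        elif c.kind == "min_abs":
            if value < c.limit:
                failures.append(f"{c.metric}: {value:g} below limit {c.limit:g}")
        elif c.kind == "max_increase":
            if c.metric not in before:
                failures.append(f"{c.metric}: missing from pre-change data")
            elif value - before[c.metric] > c.limit:
                failures.append(
                    f"{c.metric}: increased by {value - before[c.metric]:g}, limit {c.limit:g}"
                )
        else:
            raise ValueError(f"unknown criterion kind: {c.kind}")
    return failures


criteria = [
    Criterion("cpu_pct", "max_increase", 30),
    Criterion("mem_pct", "max_abs", 90),
    Criterion("bgp_prefixes_rx", "min_abs", 1),
]
before = {"cpu_pct": 20.0, "mem_pct": 60.0, "bgp_prefixes_rx": 0.0}
after = {"cpu_pct": 100.0, "mem_pct": 61.0, "bgp_prefixes_rx": 12.0}
print(evaluate(criteria, before, after))  # ['cpu_pct: increased by 80, limit 30']
```

An empty result would mean commit (or confirm) the change; anything else would mean roll back and report the messages.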
