Best Effort

BE: Automate All The Network Things

Erwan James, November 19, 2020

Network automation is a very large topic, so large in fact that we’re going to do a series within a series on it. In this episode we cover the areas of networking where automation can, and probably should, be applied. We don’t talk about specific tools or frameworks quite yet, but rather about network automation philosophy.

Hosts: Bruce Wallis and Erwan James

2 thoughts on “BE: Automate All The Network Things”

Justin Pietsch

November 22, 2020

Hello fellas,

I haven’t listened to your other episodes, so you might have already covered my comments and questions. And I have a lot of comments and questions. I’m trying to understand the state of network automation, and it seems like a mess. You present a world that is more functional than I think is actually true. One of my chief concerns is that I don’t think the vendors are trustworthy enough. I listened to your podcast before I realized you work for Nokia; I have no experience with Nokia. In general, vendors are pretty consistent in showing that management and monitoring are not priorities, and that they aren’t trustworthy there.

You make config generation sound trivial, as if it just works well. Is that true? Is that true for most people? As far as I can see, most networks (that even do automation) use something like Ansible and start from scratch. Is there anything better that people in general are using, something that is available to a team without writing a bunch of their own software?

You talk about wanting to use modeled interfaces for changing any device state, but I don’t trust the vendors. I haven’t actually tried for a decade, but I can’t believe any vendor has actually gotten the priorities right. For me to trust those interfaces I’d have to have good evidence that they were using a database, that everything is modeled internally, that everything uses the same database, etc. I’m in the camp of pushing full configs and rebooting the box because I don’t trust vendor software. Also, it’s hard to really test all the different states here; the most tested device state is turning it on. But as I said, I haven’t really checked in a decade.

Simulating with a NOS is really important. I’m more interested in figuring out designs with simulation, but simulation as part of a deployment pipeline is really important too. Do you know that there are network OSes that use a VM instead of a container? Why do you think a container is better than a VM? We’ve been using Vagrant and have some examples of different NOSes with EVPN topologies at https://github.com/netenglabs/automatic-for-the-people.

You talk about not CLI scraping, but the vendors are exceedingly frustrating, and there is too much information in the CLI that isn’t available through a programmable interface. Sometimes it works to use the CLI and get JSON, but even that isn’t always trustworthy, and sometimes you have to fall back to TextFSM. Which is sadness. This is still true, at least with the biggest vendors; again, I have no Nokia experience.
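The "JSON first, text parsing as a fallback" pattern described above can be sketched roughly as follows. This is only an illustration using the Python standard library: the `interfaces`/`state` JSON layout and the "Ethernet1 is up" text format are hypothetical, and the regular expression merely stands in for a real TextFSM template.

```python
import json
import re
from typing import Dict

# Hypothetical plain-text format, e.g. "Ethernet1 is up".
# Real CLI output differs per vendor and software version.
_TEXT_LINE = re.compile(r"^(?P<name>\S+) is (?P<state>up|down)$", re.MULTILINE)


def parse_interfaces(raw: str) -> Dict[str, str]:
    """Return {interface: state}, preferring JSON and falling back to text."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = None  # Output was not JSON at all.

    # Only trust the JSON if it has the (assumed) shape we expect.
    if isinstance(data, dict) and isinstance(data.get("interfaces"), dict):
        return {
            name: str(info.get("state", "unknown"))
            for name, info in data["interfaces"].items()
            if isinstance(info, dict)
        }

    # Fallback: scrape the plain-text output line by line.
    return {m["name"]: m["state"] for m in _TEXT_LINE.finditer(raw)}
```

Keeping both paths behind one function means callers do not need to know which devices return usable JSON and which do not.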

You mentioned writing one-off scripts, and the example was to get data. Why can’t people just get data from their monitoring systems? Is it that the data you are thinking of isn’t already being gathered, or is it that it’s too hard to get data out of the monitoring systems? As a marketing plug, I’ve been working on https://github.com/netenglabs/suzieq, whose goal is to always be gathering the data that you need to troubleshoot and understand your network. Right now, although it’s gathering interface counters, we aren’t doing anything useful with them. Suzieq is currently more focused on table state, like BGP, OSPF, EVPN, etc. Whether or not you use Suzieq, it seems like most of this data should be captured all the time. (If you look at the Suzieq code, you’ll see that we mostly use the CLI with JSON because the data is not reliably found any other way.)

Remediation demonstrates a really important concept, one that actually matters for all network automation: it’s much easier to automate networks that are automatable. In the remediation case, it’s much safer to reboot a device if you have more than 8 spine nodes than if you have only 2. When you are doing remediation, stick to really simple things like reboot, interface down/up, etc. I heard about a mail service 10+ years ago that, if a machine was reporting errors, would reboot it; if that didn’t work, it would reconfigure the machine; and if that didn’t work, it would turn the machine off. Even really simple remediation can be really powerful, but it has to be simple and it does take work to make it safe. While I have seen remediation applied to a network, I don’t think there is a common platform or software that anybody is using. I think it’s only custom, which is too bad. As you mentioned, there are two pieces needed: workflow and event correlation/detection.
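The escalation ladder from the mail service story (reboot, then reconfigure, then turn off), combined with the "only if there is enough redundancy" check, can be sketched like this. It is a minimal illustration, not a description of any existing platform: the function name, the callables, and the `min_healthy_peers` threshold are all hypothetical.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def remediate(
    steps: List[Step],
    isolate: Callable[[], None],
    is_healthy: Callable[[], bool],
    healthy_peers: int,
    min_healthy_peers: int,
) -> str:
    """Try each simple step in order; isolate the device if none of them help.

    Returns the name of the step that restored health, "isolated" if every
    step failed, or "skipped" if too few healthy peers remain to act safely.
    """
    # Safety gate: do not touch a device unless enough peers can carry its load.
    if healthy_peers < min_healthy_peers:
        return "skipped"

    for name, action in steps:
        action()
        if is_healthy():
            return name

    # Nothing worked: take the device out of service (e.g. power it off).
    isolate()
    return "isolated"
```

A caller would pass something like `[("reboot", reboot), ("reconfigure", reconfigure)]` as `steps` and a power-off function as `isolate`. The safety gate is where the 8-spines-versus-2 point shows up: with only 2 spines the check should fail and a human should decide.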

Are there any controllers/fabric managers that are actually useful? (Don’t talk about SD-WAN, just anything else.) SDN controllers based on OpenFlow are a waste; I have no idea if somebody has actually implemented something useful.

As far as AI/ML, yeah, but we aren’t really there yet. I think we don’t have the foundational pieces to benefit from ML. It’s not the lowest hanging fruit at all.

That’s enough from me. Hope you both are doing well.

Erwan James
December 4, 2020
Hi Justin,

Sorry for the very late reply and thank you for commenting!

There’s a lot to unpack there. I think we’ll try and do a better response in one of our future podcasts, but here are some thoughts of mine about some of the stuff you’ve brought up.

Firstly, I can’t really comment on the trustworthiness of vendors or how you feel about them. Indeed both Bruce and I work for a vendor, but this project, ntwr.kn, is not affiliated with our employer at all. We do our utmost to remove any biases we may have; this is truly a project we started as friends and networking nerds. We’re doing our best to keep this space vendor neutral and remove any references to our employer. So far I think we’ve done a pretty good job, but we’re happy to be called out on potential biases in our podcast which we may have missed!

Alright with that out of the way…

Another note is that indeed we often speak about a utopian world, but we do try and make our comments applicable to the real world. I think you’ll see that more in the upcoming podcasts where we start to dissect some of the tooling people are using today. Hopefully we can strike a balance between where we think the networking world should be and where it really is. It’s about understanding how people get to that utopian world and what that world indeed looks like!

As far as config generation goes, I think many have this figured out. Today I would say the use of templates is probably the most prominent approach. Ansible is, to me, really nothing more than a well formatted (YAML), potentially vendor agnostic (although I think you’ll find that’s not realistic today) way of templating or modeling a device configuration. Then of course there is the other aspect of Ansible: the transport and management of the config on the device itself. I think people are starting to move into using a proper coding language such as Python to generate their configs, "starting" being the operative word here…take your preferred OS’s YANG models (or OpenConfig), generate Python (or other) bindings for the data model, use Python to then retrieve and populate the data model, and output it in a format that is understood by your preferred management protocol (gNMI with JSON for instance). This could/should be a step in your switch config and turn-up pipeline. Is everyone there yet? No, certainly not. And in many cases there is no need for all that if you are using other management platforms; maybe the data model you are dealing with is already an abstraction and the switch configs themselves get auto-generated by that management platform.
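To make the "populate a data model in Python, then emit JSON for the management protocol" step a bit more concrete, here is a rough sketch using only the standard library. The `Interface` dataclass is a hand-written stand-in for bindings that would normally be generated from YANG, and the JSON layout only loosely follows the shape of the OpenConfig interfaces model. It is not validated against any real schema, and the interface names and descriptions are made up.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Interface:
    """Hand-written stand-in for a class generated from a YANG model."""

    name: str
    description: str = ""
    enabled: bool = True

    def to_model(self) -> Dict[str, Any]:
        # Layout loosely modeled on OpenConfig interfaces (an approximation).
        return {
            "name": self.name,
            "config": {
                "name": self.name,
                "description": self.description,
                "enabled": self.enabled,
            },
        }


def render_interfaces(interfaces: List[Interface]) -> str:
    """Serialize the populated model to JSON, e.g. as a gNMI Set payload."""
    payload = {
        "openconfig-interfaces:interfaces": {
            "interface": [intf.to_model() for intf in interfaces]
        }
    }
    return json.dumps(payload, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical inventory; in a pipeline this would come from a source of truth.
    print(render_interfaces([
        Interface("ethernet-1/1", "to spine1"),
        Interface("ethernet-1/2", "unused", enabled=False),
    ]))
```

The difference from plain templating is that the structure is checked in code before anything reaches the device: a typo in a field name fails in Python rather than on the box.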

CLI scraping exists exactly for the reason you described: vendors have been terrible at exposing their data models via other northbound interfaces. To be honest, though, most NOSes these days are pretty good at it. There certainly was a time when the CLI was your source of truth: it had the entire data model available and many northbound interfaces did not, so you had to CLI scrape to get some state or config information. I think today you’ll find that’s less and less the case. Not 100% equal, but getting there.

I’ll check out suzieq! Seems like an interesting project. I think you’ll find that more and more updated/modern network operating systems are making that stateful information available via streaming telemetry – mostly via gNMI these days. The more streaming telemetry we can get out of the devices the better! Push your vendors to stream all the things! 🙂

For fabric managers / controllers I’m not sure. I can obviously speak for what I know well (my employer’s products), but outside of that I’m not super familiar with what’s around. Apstra looks like a good product for multivendor support in fabric management; as far as open source software goes, I’m not sure there are many. As far as OpenFlow…agreed…I think it’s run its course and is not widely deployed. I think Google published a paper about their uses for it; they were able to do something useful with it, but the controller is not available publicly, I assume because it is too tailored to their environment to make publishing it worthwhile.

Alright…I wrote way more than I thought I would. Thanks for the discussion, happy to talk more about it anytime!

Cheers
Erwan