Friday, April 6, 2018

Automating Mellanox ONYX and MLNX-OS using Ansible: quick guide

For those who have already evolved to NetOps, manual command-by-command execution and copy-pasting of config templates are things of the past. NETCONF and REST-based APIs are becoming a must; if your devices don't have them, there are still options to automate. Among other orchestration solutions, Ansible stands out: it's simple, SSH-based, and already has modules to rule the ocean of your network devices.
 
MLNX-OS CLI

By creating a Cisco-like CLI, Mellanox inherited the classic IOS problems: no rollback, no configuration management, no filtering options, and other things (100 things why we love Juniper). Later, an XML API was introduced, but it still looks like a wrapper around the old CLI commands (why not do NETCONF?). To this day, the underlying interfaces in ONYX and MLNX-OS are still the same, which means we still have to feed old commands to these systems. ONYX did gain a new JSON API (for x86 platforms), but, again, it took only a little markup magic to create it. We can only wish for features present in NX-OS and Junos, such as shell access, wildcards, Linux-based filters and commands, and, one day, the ability to integrate into a proper SDN.

Ansible 2.5

Before Ansible 2.5, automating Mellanox switches was pretty sad. Thankfully, it was at least possible to send multiple commands over SSH: https://community.mellanox.com/docs/DOC-2092.
With the release of 2.5, new network modules were introduced:

As you can see, the majority of those manage Ethernet-specific protocols, but a couple, such as onyx_command and onyx_config used below, can manage both InfiniBand (MLNX-OS VPI) and Ethernet (MLNX-OS/ONYX) devices.

Editing Ansible configuration
We need to change /etc/ansible/ansible.cfg

A few changes will prevent headaches while using Ansible: uncomment the lines below and change the options as needed. These values are not final, and you can freely adjust them to your liking.

In the [defaults] section:

gathering = explicit
Fact gathering doesn't work properly with network equipment, so we tell Ansible not to do it by default.

host_key_checking = False
Crucial for device management: SSH host key checking sometimes causes timeouts and failed plays. Either live with the security drawback of disabling it, or maintain a known_hosts list on your master host.

timeout = 30
Increases the SSH timeout for remote connections from 10 to 30 seconds.

In the [paramiko_connection] section:

look_for_keys = False
A paramiko option that stops it from trying the private keys in ~/.ssh. The network_cli connection is built on paramiko, so this matters when you log in with a password, here and on other devices.

host_key_auto_add = True
Another paramiko option: add new SSH host keys automatically.

In the [persistent_connection] section:

connect_timeout = 60
Increases the persistent connection timeout from 30 to 60 seconds.

connect_retry_timeout = 45
Increases the retry timeout from 15 to 45 seconds.

command_timeout = 30
Increases the time to wait for a command before timing out from 10 to 30 seconds (hello to the slow and old PowerPC board inside).

Setting up inventory
Let's use the standard Ansible hosts file, /etc/ansible/hosts:

[test_cluster]
192.168.0.[10:25]
[test_cluster:vars]
ansible_network_os=onyx

So, our Mellanox switches have IP addresses from 192.168.0.10 to 192.168.0.25. They have the default user admin with password admin configured, SSH access is enabled, and no enable password is configured. We also specify the network OS so that Ansible handles the CLI properly.
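The same per-group settings can also live in a group variables file instead of the hosts file and the command line. This is a minimal sketch, assuming a file named after the group and placed in a group_vars directory next to the inventory; the path is an illustration, not something the setup above requires:

```yaml
# /etc/ansible/group_vars/test_cluster.yml  (hypothetical file, named after the group)
ansible_network_os: onyx         # same value as in the [test_cluster:vars] block
ansible_connection: network_cli  # lets plays omit "connection: network_cli"
ansible_user: admin              # the default user assumed in this guide
```

With ansible_user set here, passing -u admin to ansible-playbook becomes optional; the password is still requested with --ask-pass.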

Creating playbooks
Create a .yml file with the following content:

-  hosts: test_cluster
   gather_facts: false
   connection: network_cli
   tasks:
   - name: run command on MLNX-OS/Onyx device
     onyx_command:
      commands:
       - enable
       - show version
       - show ntp
       - show usernames


Here we specify that we're connecting to our test_cluster group using the network_cli connection to execute a series of commands.
However, if you add some configuration commands, such as
       - conf t
       - show version
       - xml-gw enable


the module will fail:

"msg": "onyx_command does not support running config mode commands.  Please use onyx_config instead"

Okay, let's run our playbook:

ansible-playbook <filename>.yml -u admin --ask-pass -vvv

We will be prompted for the SSH password (admin by default) and will soon see the output lines:

    "stdout_lines": [
        [
            ""
        ],
        [
            "Product name:      MLNX-OS",
            "Product release:   3.6.6000",
            "Build ID:          #1-dev",
            "Build date:        2018-03-04 16:48:04",
            "Target arch:       ppc",
            "Target hw:         m460ex",
            "Built by:          jenkins@2811f8c7d517",
            "Version summary:   PPC_M460EX 3.6.6000 2018-03-04 16:48:04 ppc",
            "",
            "Product model:     ppc",
            "Host ID:           EC0D9ACED572",
            "",
            "Uptime:            14d 17h 28m 13.056s",
            "CPU load averages: 1.37 / 1.18 / 1.09",
            "Number of CPUs:    1",
            "System memory:     268 MB used / 1759 MB free / 2027 MB total",
            "Swap:              0 MB used / 0 MB free / 0 MB total"
        ],
        [
            "NTP is administratively            : enabled",
            "NTP Authentication administratively: disabled",
            "",
            "Clock is unsynchronized.",
            "",
            "Active servers and peers:",
            "  No NTP associations present."
        ],
        [
            "USERNAME    FULL NAME               CAPABILITY  ACCOUNT STATUS",
            "admin       System Administrator    admin       Password set (SHA512)",
            "monitor     System Monitor          monitor     Password set (SHA512)",
            "xmladmin    XML Admin User          admin       Password set (SHA512)",
            "xmluser     XML Monitor User        monitor     Password set (SHA512)"
        ]
    ]
}


Of course, you can run Infiniband-specific commands just as easily: 
- show ib ha
- show ib smnodes

 
Using onyx_config module


Here's an example of using onyx_config to back up the running configuration:

-  hosts: test_cluster
   gather_facts: false
   connection: network_cli
   become: yes
   become_method: enable
   tasks:
   - name: change config on MLNX-OS device
     onyx_config:
       backup: yes


The only change is the addition of the "become" and "become_method" parameters, which are required to enter enable mode on Mellanox switches; without them, we cannot read the running configuration.

Run it with:
ansible-playbook <new_filename>.yml -u admin --ask-pass
Configuration files will be saved to the ./backup directory.
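
The onyx_command error quoted earlier points to onyx_config for configuration-mode commands. Here is a minimal sketch of that, assuming the module's lines option and reusing the xml-gw enable command from the failing example; adapt the lines to whatever you actually need to configure.

```yaml
- hosts: test_cluster
  gather_facts: false
  connection: network_cli
  become: yes
  become_method: enable
  tasks:
    - name: enable the XML gateway on MLNX-OS/Onyx device
      onyx_config:
        lines:
          - xml-gw enable   # configuration command taken from the example above
        backup: yes         # also keep a copy of the running config in ./backup
```

There is no need to add "conf t" to the lines: the module enters configuration mode itself. It is run the same way as the backup playbook.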

Troubleshooting

First of all, ensure that you have the correct versions of Ansible and Python installed.

user@somewhere:$ ansible --version

ansible 2.5.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/user/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/dist-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.14+ (default, Feb  6 2018, 19:12:18) [GCC 7.3.0]


Currently, ONYX modules work best with python2.7.

If something doesn't work, first try connecting to the target host over SSH and executing the commands manually to see if that works. Then increase the verbosity of Ansible commands by adding -vvvv.

Then, use this great troubleshooting guide:

http://docs.ansible.com/ansible/latest/network/user_guide/network_debug_troubleshooting.html

See best practices here (use ssh-keys for authentication instead of default admin user, for example):
http://docs.ansible.com/ansible/latest/network/user_guide/network_best_practices_2.5.html

If that helped, please endorse: https://goo.gl/RfjbnG
