The Dark Side of Extreme ELRP and Some Best Practices

This is my first blog post here, and I hope to keep it going. I have tried several places before, such as my own website and WordPress, and I have come back to Blogspot after a long time. From now on I will blog only here.

Without further delay, let me jump to the subject. Recently, a colleague brought me an interesting problem. He explained that he could not find a loop on his Extreme switches: every switch was reporting a loop, and two or three switches were not even reachable.

After solving this problem, I decided to build a lab on the topic. My main focus is on two questions:

1. Why did all the switches report a loop when the loop existed on only one switch?

2. What should be done to avoid a complete outage if a loop is detected again?


I designed a lab with the following setup:



Let's change some configuration on the switches. I am using the Default VLAN for this lab. Because MST is enabled by default on the Default VLAN, I will disable it and enable ELRP on the Default VLAN and on all switch ports.

Run the commands below on all switches. For now, I am not configuring ELRP to disable any ports when a loop is detected. Everything else on the switches is at default settings.

disable stpd s0

enable elrp-client

configure elrp-client periodic Default ports all


Start a ping between the PCs (IP addresses are already assigned on the PCs and switches in VLAN 1).


Before creating the loop, let's capture traffic on SW3, port 4.


ELRP (Extreme Loop Recovery Protocol) is a simple loop protection protocol that sends a multicast packet out of each configured port and checks whether it receives that packet back on any port. If the packet comes back to the same switch, the port can be disabled, or the switch can notify via logs, SNMP traps, or both.

ELRP should be configured so that it never disables uplink ports. Otherwise, a switch can shut down its own uplink and lose network connectivity.

ELRP sends out a multicast frame every second (in periodic mode only). However, the destination field contains neither an IEEE reserved multicast address nor a fixed Extreme-specific multicast MAC address. As far as I understand, this is the main reason a loop shows up on every switch in the network, along with high CPU load. A neighboring switch does not recognize the destination MAC as belonging to a protocol it should process, so it does not consume the frame. Instead, it floods the frame out of all ports except the one it arrived on.

If you have 50 switches in your network, you also create unnecessary traffic for every client: 50 packets per second that each client on the network has to inspect and drop. As the capture above shows, the client port sees three ELRP packets, one from each of the three switches.
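To make that arithmetic concrete, here is a minimal Python sketch of flooding in a loop-free network. The three-switch chain and the PC names are hypothetical stand-ins, not the actual lab diagram, and the model simply assumes that every switch floods ELRP frames it did not originate.

```python
from collections import deque


def elrp_frames_per_client(adjacency, switches):
    """Count ELRP frames each client sees per interval in a loop-free network.

    Assumption: every switch sends one ELRP frame per interval and every
    other switch floods it unchanged out of all other ports.
    """
    received = {node: 0 for node in adjacency if node not in switches}
    for origin in switches:
        seen = {origin}
        queue = deque([origin])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor in seen:
                    continue
                seen.add(neighbor)
                if neighbor in switches:
                    queue.append(neighbor)   # a switch floods the frame onward
                else:
                    received[neighbor] += 1  # a client receives one copy
    return received


# Hypothetical topology: SW1 - SW2 - SW3 in a chain, one PC per switch.
topology = {
    "SW1": ["SW2", "PC1"],
    "SW2": ["SW1", "SW3", "PC2"],
    "SW3": ["SW2", "PC3"],
    "PC1": ["SW1"],
    "PC2": ["SW2"],
    "PC3": ["SW3"],
}
switches = {"SW1", "SW2", "SW3"}

print(elrp_frames_per_client(topology, switches))
# {'PC1': 3, 'PC2': 3, 'PC3': 3}
```

With one frame per switch per second, the load on each client grows linearly with the number of switches, which is where the figure of 50 packets per second for 50 switches comes from.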

How can you identify the source switch of each ELRP packet in the capture above?

I strongly suggest you take another look at the lab diagram, where I have also noted the system MAC addresses. What are the source and destination addresses of the ELRP packets? The Extreme guide says:

Starting from 22.5.
The Extreme Loop Recovery Protocol (ELRP) source MAC address has changed from "00:e0:2b:00:00:01" to "0e:Switch-MAC" starting with ExtremeXOS 22.5.

The destination MAC address of an ELRP packet will be the MAC address of the switch it originated from, with the first bit in the highest-order byte replaced with a 1.

For example, if the MAC address of the switch sending the ELRP packet is "00:11:22:33:44:55", the destination address of the ELRP packet will be "01:11:22:33:44:55".

So, you can see that packet no. 257 is sourced from SW1, 258 from SW3, and 259 from SW2. All packets are multicast (check that the IG bit is set), and the source address is a locally administered address rather than the built-in hardware MAC (the LG bit is also set).
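The address rule quoted from the guide is easy to reproduce. The following is a small Python sketch, not Extreme code: the function names are my own, and the "0e:..." source address in the usage lines is a hypothetical value used only to show the bit test. It sets the group (IG) bit to build the destination address, clears it again to recover the originating switch, and checks the IG and LG bits that Wireshark displays.

```python
def parse_mac(mac):
    """Split 'aa:bb:cc:dd:ee:ff' into a list of six integers."""
    octets = mac.split(":")
    if len(octets) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    return [int(octet, 16) for octet in octets]


def format_mac(octets):
    return ":".join(f"{octet:02x}" for octet in octets)


def elrp_destination(switch_mac):
    """Switch MAC with the group (IG) bit set, as in the guide's example."""
    octets = parse_mac(switch_mac)
    octets[0] |= 0x01
    return format_mac(octets)


def originating_switch(destination_mac):
    """Reverse step: clear the IG bit to get the sender's switch MAC."""
    octets = parse_mac(destination_mac)
    octets[0] &= 0xFE
    return format_mac(octets)


def is_group(mac):
    """True if the IG bit is set (multicast or broadcast)."""
    return bool(parse_mac(mac)[0] & 0x01)


def is_locally_administered(mac):
    """True if the LG bit is set (not a burned-in hardware address)."""
    return bool(parse_mac(mac)[0] & 0x02)


print(elrp_destination("00:11:22:33:44:55"))          # 01:11:22:33:44:55
print(originating_switch("01:11:22:33:44:55"))        # 00:11:22:33:44:55
print(is_group("01:11:22:33:44:55"))                  # True
# Hypothetical source address starting with 0e, for illustration only.
print(is_locally_administered("0e:11:22:33:44:55"))   # True
```

Clearing the IG bit of the destination address in a capture is therefore a quick way to map each ELRP packet back to the switch MAC noted in the lab diagram.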

Why did Extreme have to choose this kind of destination MAC address for ELRP? ELRP can't identify an uplink interface the way STP does. It is a standalone protocol.

Let's create a loop on SW2 (I am connecting a HUB on this Switch) and examine the network state. 

* SW2.5 # show log

09/14/2022 16:08:39.10 <Warn:ELRP.Report.Message> [CLI:Default:1] LOOP DETECTED : 2669 transmitted, 238 received, ingress slot:port (3) egress slot:port (2)

09/14/2022 16:08:38.63 <Warn:ELRP.Report.Message> [CLI:Default:1] LOOP DETECTED : 2669 transmitted, 237 received, ingress slot:port (2) egress slot: port (3)

!

!

* SW1.1 # sho log

09/14/2022 16:09:38.48 <Warn:ELRP.Report.Message> [CLI:Default:1] LOOP DETECTED : 2659 transmitted, 17631 received, ingress slot:port (1) egress slot:port (1)

09/14/2022 16:08:38.48 <Warn:ELRP.Report.Message> [CLI:Default:1] LOOP DETECTED : 2599 transmitted, 11407 received, ingress slot:port (1) egress slot:port (1)

09/14/2022 16:07:38.38 <Warn:ELRP.Report.Message> [CLI:Default:1] LOOP DETECTED : 2539 transmitted, 7050 received, ingress slot:port (1) egress slot:port (1)

09/14/2022 16:06:38.38 <Warn:ELRP.Report.Message> [CLI:Default:1] LOOP DETECTED : 2479 transmitted, 1 received, ingress slot:port (1) egress slot: port (1)

!

!


* SW3.1 # show log

09/14/2022 16:10:37.96 <Warn:ELRP.Report.Message> [CLI:Default:1] LOOP DETECTED : 2711 transmitted, 24438 received, ingress slot:port (2) egress slot:port (2)

09/14/2022 16:09:37.96 <Warn:ELRP.Report.Message> [CLI:Default:1] LOOP DETECTED : 2651 transmitted, 17956 received, ingress slot:port (2) egress slot:port (2)

09/14/2022 16:08:37.80 <Warn:ELRP.Report.Message> [CLI:Default:1] LOOP DETECTED : 2591 transmitted, 11737 received, ingress slot:port (2) egress slot:port (2)

09/14/2022 16:07:37.80 <Warn:ELRP.Report.Message> [CLI:Default:1] LOOP DETECTED : 2531 transmitted, 7334 received, ingress slot:port (2) egress slot:port (2)

09/14/2022 16:06:37.55 <Warn:ELRP.Report.Message> [CLI:Default:1] LOOP DETECTED : 2471 transmitted, 1 received, ingress slot:port (2) egress slot: port (2)
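The pattern in these logs, where SW2 reports the loop on ports 2 and 3 while SW1 and SW3 report it on their own ports 1 and 2, can be reproduced with a toy flooding model. This is only a sketch under assumptions: the wiring below is hypothetical (chosen to be consistent with the port numbers in the logs, not copied from the lab diagram), the hub is modeled as a two-port repeater, and every device is assumed to flood a frame out of all ports except the one it arrived on.

```python
from collections import deque


def ports_seeing_own_frames(links, switches):
    """Return, per switch, the ports on which its own ELRP frame comes back."""
    peer = {}    # (node, port) -> (node, port) at the other end of the cable
    ports = {}   # node -> list of its connected ports
    for end_a, end_b in links:
        peer[end_a] = end_b
        peer[end_b] = end_a
        ports.setdefault(end_a[0], []).append(end_a[1])
        ports.setdefault(end_b[0], []).append(end_b[1])

    looped = {sw: set() for sw in switches}
    for origin in switches:
        # Each (node, egress port) pair is processed at most once per origin,
        # so the walk terminates even when the topology contains a loop.
        seen = {(origin, p) for p in ports[origin]}
        queue = deque(seen)
        while queue:
            node, out_port = queue.popleft()
            nxt, in_port = peer[(node, out_port)]
            if nxt == origin:
                looped[origin].add(in_port)   # own frame received: loop reported
                continue
            for p in ports[nxt]:
                if p != in_port and (nxt, p) not in seen:
                    seen.add((nxt, p))
                    queue.append((nxt, p))    # flooded out of every other port
    return looped


# Hypothetical wiring: the hub bridges SW2 ports 2 and 3.
links = [
    (("SW1", 1), ("SW2", 1)),
    (("SW2", 4), ("SW3", 2)),
    (("SW2", 2), ("HUB", 1)),
    (("SW2", 3), ("HUB", 2)),
    (("SW1", 2), ("PC1", 1)),
    (("SW3", 4), ("PC3", 1)),
]

result = ports_seeing_own_frames(links, ["SW1", "SW2", "SW3"])
for switch in sorted(result):
    print(switch, sorted(result[switch]))
# SW1 [1]
# SW2 [2, 3]
# SW3 [2]
```

In this model the loop exists only behind SW2, yet SW1 and SW3 each get their own frames back on the uplink, because SW2 floods the looped copies in every direction.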


Now let's capture the traffic on Sw1 Port 1. 



In the 13-second capture file, I counted roughly 1477 ELRP packets.
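As a cross-check, 1477 packets in 13 seconds is roughly 113 packets per second, and a similar rate can be estimated from the SW1 log entries shown earlier. The Python sketch below parses those lines and divides the growth of the "received" counter by the time between entries. It assumes the counter is cumulative, which the steadily increasing values suggest but which I have not confirmed in the documentation.

```python
import re
from datetime import datetime

LOOP_RE = re.compile(
    r"(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\.\d+) .*?"
    r"LOOP DETECTED\s*:\s*(?P<tx>\d+) transmitted,\s*(?P<rx>\d+) received"
)


def parse_loop_log(text):
    """Return (timestamp, transmitted, received) tuples, oldest first."""
    entries = []
    for line in text.splitlines():
        match = LOOP_RE.search(line)
        if match:
            ts = datetime.strptime(match["ts"], "%m/%d/%Y %H:%M:%S.%f")
            entries.append((ts, int(match["tx"]), int(match["rx"])))
    return sorted(entries)


def received_rates(entries):
    """Frames received per second between consecutive log entries."""
    rates = []
    for (t0, _, rx0), (t1, _, rx1) in zip(entries, entries[1:]):
        seconds = (t1 - t0).total_seconds()
        rates.append((rx1 - rx0) / seconds)
    return rates


# SW1 log lines copied from the output above.
sw1_log = """
09/14/2022 16:09:38.48 <Warn:ELRP.Report.Message> [CLI:Default:1] LOOP DETECTED : 2659 transmitted, 17631 received, ingress slot:port (1) egress slot:port (1)
09/14/2022 16:08:38.48 <Warn:ELRP.Report.Message> [CLI:Default:1] LOOP DETECTED : 2599 transmitted, 11407 received, ingress slot:port (1) egress slot:port (1)
09/14/2022 16:07:38.38 <Warn:ELRP.Report.Message> [CLI:Default:1] LOOP DETECTED : 2539 transmitted, 7050 received, ingress slot:port (1) egress slot:port (1)
09/14/2022 16:06:38.38 <Warn:ELRP.Report.Message> [CLI:Default:1] LOOP DETECTED : 2479 transmitted, 1 received, ingress slot:port (1) egress slot: port (1)
"""

entries = parse_loop_log(sw1_log)
rates = received_rates(entries)
print([round(rate) for rate in rates])
```

For the SW1 entries this gives roughly 117, 72, and 104 frames per second, the same order of magnitude as the capture, while the transmitted counter grows by only 60 per minute.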

Now let's capture the traffic on Sw3, Port 2


Check the 5-second capture file conversations:



Now change the configuration on SW2 so that ports on which a loop is detected are disabled for a few seconds.

unconfigure elrp-client "Default"

configure elrp-client periodic Default ports all log disable-port duration 15


The default duration is 30 Seconds. I changed it to 15 seconds. 


Now the Switch has disabled the port in the Ingress direction. 

* SW2.5 # sho elrp disabled-ports 

  Exclude EAPS ring ports : No

  Exclude VXLAN RTEPs     : No

  Exclude inter-VLAN loop ports :No

  Excluded Ports

  -------------------------------------------------------------------

   ------------------------------------------------------------------

  Disabled                   Detected             Duration  Time                     Disable  

  Port/Virtual Port          Vlan                 (sec)     Disabled                 Direction

  -------------------------------------------------------------------

  3                          Default              15        Wed Sep 14 16:31:25 2022 Ingress  

  -------------------------------------------------------------------

(!) Port disabled due to Inter-VLAN loop detected

* SW2.6 # 


Summary:

1. If you are using ELRP, use STP on the uplink interfaces. If your network design allows it, create multiple MST instances, or one instance per VLAN so that it behaves like PVST.

2. If you don't want to use MST, exclude the uplink interfaces from automatic deactivation. Otherwise, ELRP may disable the wrong port on the wrong switch and take it offline.

3. If you don't disable ELRP on the uplink ports, your client machines will have to process unnecessary ELRP packets every second.

In the next blog post, we'll dive into the technical stuff.
