Network availability technology allows one or more failures in a network without losing connectivity, or at least losing it only for a short time.
Plenty of different technologies can be used for this purpose. We have tools in Software-Defined Networking (SDN) or Software-Defined Wide Area Network (SD-WAN) that are able to monitor several links simultaneously. They can direct the traffic to the best links in real time and, of course, remove from the path a link that has lost its connectivity. This kind of redundancy is managed by software, L7 of OSI (Open Systems Interconnection).
In today’s blog we will focus on redundancy based on network protocols, L2 and L3 of OSI.
With L2 we have two main actors: STP (Spanning Tree Protocol) and LAG (Link Aggregation Group). These protocols work with Media Access Control (MAC) addresses, not IP.
- STP: The purpose of STP is to remove the loops that can happen in an L2 network, in other words in a bunch of switches. To summarize, STP disables one or more ports in order to stop the traffic on one or several links. So, where is the redundancy here? When a link in use goes down, STP re-computes its data to enable a link that was previously blocked and lets the traffic through again. The STP hello timer is 2 seconds by default, which means the switches exchange hellos every 2 seconds; a directly connected link failure is seen at once, while classic STP only declares an indirect failure when the max age timer (20 seconds by default) expires. (The convergence will take longer, but that’s another blog in itself.)
- LAG: A LAG is an aggregation of several links that is named and seen as one link. In the IEEE standard (802.3ad, now 802.1AX) we speak of Link Aggregation Control Protocol (LACP); with Cisco we speak of EtherChannel or Port Aggregation Protocol (PAgP). LACP, EtherChannel and PAgP all essentially do the same job with different parameters. Let’s take a scenario where we have 4 links in an LACP bundle: the switch will only see one link, but the traffic will be load balanced between the 4 links. If one link goes down, the other 3 links stay active and the traffic is load balanced across the 3 remaining links. There is however one small downside to this: all sessions open on the link that has gone down are interrupted until the source retransmits or starts a new connection (TCP, ICMP, etc.).
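To make the LAG behaviour a bit more concrete, here is a minimal Python sketch of per-flow load balancing. It is only an illustration under my own assumptions: real switches use vendor-specific hash inputs and algorithms, and the names `pick_link`, `lag_links` and the MAC addresses are made up for the example.

```python
import zlib

def pick_link(flow, active_links):
    """Map a flow (src MAC, dst MAC) to one active member link by hashing.

    Assumption: a CRC32 of the MAC pair stands in for the switch's real hash.
    """
    key = "|".join(flow).encode()
    return active_links[zlib.crc32(key) % len(active_links)]

# Hypothetical 4-link bundle and 8 flows towards the same destination.
lag_links = ["eth1", "eth2", "eth3", "eth4"]
lag_flows = [("aa:aa:aa:00:00:%02x" % i, "bb:bb:bb:00:00:01") for i in range(8)]

# Every flow sticks to one link while the bundle is healthy.
lag_before = {flow: pick_link(flow, lag_links) for flow in lag_flows}

# Simulate the loss of eth2: flows are re-hashed over the 3 remaining links.
lag_remaining = [link for link in lag_links if link != "eth2"]
lag_after = {flow: pick_link(flow, lag_remaining) for flow in lag_flows}

for flow in lag_flows:
    print(flow[0], lag_before[flow], "->", lag_after[flow])
```

The point of hashing per flow rather than per packet is that all packets of a session follow the same link, which is also why a session pinned to the failed link is disturbed when that link goes down.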
With L3 we have more options, and they are all very different from each other. L3 deals with IPs and mainly routing.
- ICMP: The first one is Internet Control Message Protocol (ICMP). ‘What?!’ I hear you ask. Don’t worry, this is the reaction of most people reading this. When a source tries to reach an IP outside its network, it sends the traffic to its gateway (GW). When the GW has lost the route, or knows that another GW on the same network is a better next hop, it sends an ICMP Redirect message (Type 5, Code 0) back to the source. This message gives an alternative GW in the network, even if a GW is hardcoded on the source. The ICMP Redirect message is easy to miss: most of the time it is sent only once, hence the difficulty of seeing it on the network. In most VPN configurations, ICMP Redirect needs to be disabled. So while this is very useful, it can certainly give support some grey hairs when they troubleshoot the network.
- First Hop Redundancy Protocol (FHRP): Other GW redundancy protocols, collectively known as FHRP, include Hot Standby Router Protocol (HSRP), Virtual Router Redundancy Protocol (VRRP) and Gateway Load Balancing Protocol (GLBP). These 3 protocols do almost the same job, so it’s often simply a case of personal preference. I have a preference for GLBP because of its ability to load balance the traffic on different links. HSRP and GLBP are Cisco proprietary, whereas VRRP is a standard. Without deep diving into the details, these protocols work with a Virtual IP (VIP). This VIP is the GW used by the sources. Behind this VIP, one or often several routers are linked to it. The routers have plenty of options to track: interfaces, links, protocols. When a tracked condition goes down, the VIP moves to a router that has the proper routing table. This is done by the new active router taking over the virtual MAC address associated with the VIP, so it is totally transparent for the sources. The network convergence can be very quick; it depends on the type of tracking we set on the routers. FHRP are often the first technologies we think of when we speak of redundancy. See the links below for more info:
- HSRP: https://en.wikipedia.org/wiki/Hot_Standby_Router_Protocol
- VRRP: https://en.wikipedia.org/wiki/Virtual_Router_Redundancy_Protocol
- GLBP: https://en.wikipedia.org/wiki/Gateway_Load_Balancing_Protocol
- Routing Protocols: Finally, we will speak of routing protocols. I won’t be exhaustive on all routing protocols and will only cover the most used protocols today: Enhanced Interior Gateway Routing Protocol (EIGRP), Open Shortest Path First (OSPF) and Border Gateway Protocol (BGP). I could write a line or two on Routing Information Protocol (RIP) or Intermediate System to Intermediate System (IS-IS), but my Grandpa might be better placed to speak about those technologies. EIGRP, OSPF and BGP can all load balance the traffic, depending on the configuration given. This is the first level of redundancy we can have with routing protocols.
The king of network convergence is EIGRP: it computes very quickly. EIGRP is an advanced distance vector routing protocol with a complex metric calculation. When a route is lost, EIGRP can converge in about a second. Of course it depends on different parameters, but it is fast. In my opinion the only flaw of EIGRP is that we can use it only with Cisco; the debate is open.
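To give an idea of that metric calculation, here is a small Python sketch of the classic EIGRP composite metric with the default K values (only bandwidth and delay count). It is a simplification under my own assumptions: the function name is made up, and the newer wide metrics, load and reliability are ignored.

```python
def eigrp_classic_metric(min_bandwidth_kbps, total_delay_usec):
    """Classic EIGRP composite metric with default K values (K1 = K3 = 1).

    min_bandwidth_kbps: lowest bandwidth along the path, in kbit/s.
    total_delay_usec:   sum of interface delays along the path, in microseconds.
    """
    bandwidth_term = 10_000_000 // min_bandwidth_kbps  # inverse of the slowest link
    delay_term = total_delay_usec // 10                # delay counted in tens of microseconds
    return 256 * (bandwidth_term + delay_term)

# A T1 serial link: 1544 kbit/s, 20000 microseconds of delay.
print(eigrp_classic_metric(1544, 20000))   # 2169856

# A 100 Mbit/s link: 100000 kbit/s, 100 microseconds of delay.
print(eigrp_classic_metric(100000, 100))   # 28160
```

The lower the metric, the better the path, so a faster link with less delay wins.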
OSPF is a link state routing protocol. OSPF gives a cost to each link, then computes the routing table. OSPF works with areas, a kind of mini OSPF within OSPF. When a link goes down, the information is flooded through the network; on a shared segment this goes through the DR (Designated Router). The hello timer is 10 seconds by default and a silent neighbour is declared dead after 40 seconds by default; once the failure is known, an alternative route can be recomputed quickly. Of course we can tune the timers, but it will depend on the network requirements.
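The "cost per link, then compute" idea can be sketched in a few lines of Python. This is only an illustration of a shortest path computation after a link failure, not how a router implements OSPF: the topology, the costs and the names (`shortest_costs`, `ospf_links`) are assumptions for the example.

```python
import heapq

def shortest_costs(links, source):
    """Dijkstra over bidirectional links {(a, b): cost}; returns {router: total cost}."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    best = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper path was already found
        for neighbour, link_cost in graph.get(node, []):
            new_cost = cost + link_cost
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return best

# Hypothetical topology: a cheap path R1-R2-R4 and an expensive backup R1-R3-R4.
ospf_links = {("R1", "R2"): 1, ("R2", "R4"): 1, ("R1", "R3"): 10, ("R3", "R4"): 10}
ospf_before = shortest_costs(ospf_links, "R1")

# The R2-R4 link goes down: recompute with the remaining links.
ospf_failed = {pair: cost for pair, cost in ospf_links.items() if pair != ("R2", "R4")}
ospf_after = shortest_costs(ospf_failed, "R1")

print(ospf_before["R4"], ospf_after["R4"])  # 2 20
```

R4 stays reachable after the failure, just at a higher cost through R3, which is exactly the redundancy we are after.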
Where EIGRP and OSPF are more focused on the LAN, BGP is more appropriate for backbone networks. When we speak of a backbone, our first thought goes to the internet, so yes, BGP is the main routing protocol for the internet. BGP is a path vector routing protocol. So the redundancy is everywhere with BGP. BGP works by Autonomous System (AS). An AS is a kind of area where BGP knows its fellow routers and exchanges routes with them. Earlier I said L3 protocols; to be really honest, BGP speaks over TCP port 179, so it is not OSI L3 but L4 here. While BGP is the most sophisticated routing protocol and can be tuned in many different ways, it has one main issue: it is the slowest routing protocol to converge. That is an issue for a LAN but not for the internet; we don’t want the internet to change its mind too often, do we? BGP brings stability and keeps the redundancy when we need it.
Redundancy can be implemented everywhere on the network: on the LAN with STP and LAG, on the GW with FHRP, and on external networks with the routing protocols. The only way to justify the network being down today would be the end of the internet, and I am inclined to think that would be the end of the world as well.
Thanks for reading.
Olivier Farhat | Ergo Solutions Architect