Kubernetes Network Model

2022-11-05 16:30:00
DevOps Aura International
Summary: This article discusses the network models used in Kubernetes and analyzes each of them.

I. Underlay Network Model

1. What is an Underlay Network?

Underlay networks refer to the physical network infrastructure, such as switches and routers, connected into a network topology by network media such as DWDM.

Underlay network topology

The underlay network can be a Layer 2 or a Layer 3 network. A typical example of a Layer 2 underlay network is Ethernet, and the Internet is a typical example of a Layer 3 underlay network.


The typical technology at Layer 2 is VLAN, while Layer 3 relies on routing protocols such as OSPF and BGP.

2. Underlay Network in Kubernetes

In Kubernetes, a typical underlay setup uses the host as a router: the Pod network achieves cross-node communication through routing entries learned or distributed between nodes.

The underlay network topology in Kubernetes

The flannel host-gw mode and the Calico BGP mode are typical examples of this model.

  • Flannel host-gw

In flannel host-gw mode, every Node must be in the same Layer 2 network, and each Node acts as a router. Cross-node communication is carried out through routing tables, so the Pod network effectively behaves like an underlay network.

Layer2 ethernet topology

Notes: Because routing is used, the cluster CIDR should be at least a /16, so that each Node gets its own subnet: Pods on different Nodes are in different subnets routed via the Node, while Pods on the same Node share one subnet. Otherwise, the routing entries of different Nodes would fall into the same network and traffic could become unreachable.
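As a hedged illustration of such a routing table, a Node in host-gw mode would hold entries along these lines (the 10.244.0.0/16 cluster CIDR and the 192.168.1.x Node addresses are made-up example values):

$ ip route
# Pods on this Node: traffic goes straight to the local bridge
10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1
# Pods on other Nodes: the peer Node itself is the next-hop gateway
10.244.2.0/24 via 192.168.1.3 dev eth0
10.244.3.0/24 via 192.168.1.4 dev eth0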

  • Calico BGP

BGP (Border Gateway Protocol) is a decentralized autonomous routing protocol. It achieves reachability between ASes (Autonomous Systems) by maintaining IP routing tables or 'prefix' tables, and it is a path-vector routing protocol.

BGP network topology

Unlike flannel, Calico provides a BGP-based network solution. In terms of the network model, Calico and flannel host-gw are similar, but they differ in software architecture. Flannel uses the flanneld process to maintain routing information, while Calico runs several daemons, of which the Bird process acts as the BGP client and, optionally, as a Route Reflector (RR). The BGP client is responsible for taking routes from Felix and distributing them to other BGP peers, while the RR optimizes BGP: within the same IBGP mesh, a BGP client only needs to connect to an RR, which greatly reduces the number of BGP connections that must be maintained inside the AS. Typically, the RR is an actual routing device, while Bird works as the BGP client.
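To check which BGP peers the local Bird instance has established sessions with, calicoctl can be used; the output below is only an illustrative sketch, not captured from a real cluster:

$ calicoctl node status
# Example output (values are illustrative)
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+------------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |   SINCE    |    INFO     |
+--------------+-------------------+-------+------------+-------------+
| 192.168.1.3  | node-to-node mesh | up    | 2022-11-05 | Established |
+--------------+-------------------+-------+------------+-------------+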

Calico Network Architecture

3. IPVLAN & MACVLAN

IPVLAN and MACVLAN are network card (NIC) virtualization technologies. The difference between them is that IPVLAN allows one physical NIC to carry multiple IP addresses while all virtual interfaces share the same MAC address, whereas MACVLAN gives the same NIC multiple MAC addresses, and the resulting virtual interfaces do not even need an IP address.
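A minimal sketch of the difference in practice, creating one sub-interface of each kind on top of an assumed physical NIC eth0 (the interface names are arbitrary):

# IPVLAN: ipvl0 shares eth0's MAC address but can carry its own IP
$ ip link add link eth0 name ipvl0 type ipvlan mode l2
# MACVLAN: macvl0 gets its own MAC address
$ ip link add link eth0 name macvl0 type macvlan mode bridge
$ ip -br link    # list the new interfaces briefly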


Because this is NIC virtualization rather than network virtualization, it essentially belongs to the underlay network. Compared with an overlay network in a virtualized environment, the biggest advantage of this approach is that it puts the Pod network on the same plane as the Node network, thus providing a network interface with higher performance and lower latency. Essentially, its network model corresponds to the second mode in the figure below.

  • Virtual bridge: we create a veth pair, with one end in the container and the other in the host's root namespace. Packets sent from the container enter the host network stack through the bridge, and packets destined for the container reach it through the bridge as well.
  • Multiplexing: we use an intermediate network device that exposes multiple virtual interfaces. Containers attach to this intermediate device, which uses MAC/IP addresses to decide which container a packet should be delivered to.
  • Hardware switching: we assign a virtual network card to each Pod, so Pod-to-Pod connectivity becomes very clear because it is close to communication between physical machines. Most NICs today support SR-IOV, which virtualizes a single physical NIC into multiple VF interfaces; each VF has a separate virtual PCIe channel, and these virtual channels share the PCIe channel of the physical NIC (see the sketch below).

Virtual networking modes: bridging, multiplexing, and SR-IOV
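As a rough sketch of the hardware-switching mode, SR-IOV VFs are usually created through sysfs; the NIC name ens2f0 and the VF count here are placeholder assumptions, and the NIC and its driver must actually support SR-IOV:

# Create 4 VFs on the physical function ens2f0
$ echo 4 > /sys/class/net/ens2f0/device/sriov_numvfs
# The VFs now appear as separate PCIe devices and netdevs
$ ip link show ens2f0
$ lspci | grep -i 'virtual function'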

In Kubernetes, typical CNIs that use the IPVLAN network model are Multus and DANM.

4. Multus

Multus is an Intel open-source CNI solution that combines traditional CNI plugins with Multus itself, and it provides an SR-IOV CNI plugin that lets a Kubernetes Pod attach to an SR-IOV VF. This is where the IPVLAN/MACVLAN capability is used.


When a new Pod is created, the SR-IOV plugin starts working: the configured VF is moved into the Pod's network namespace, the plugin renames the interface according to the "name" option in the CNI configuration file, and finally the VF is set to UP.
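What the plugin does can be approximated by hand with iproute2; the VF netdev name ens2f0v0, the namespace name pod-netns, and the interface name south0 are assumptions for illustration, not what the plugin literally executes:

$ ip netns add pod-netns                 # stand-in for the Pod's network namespace
# Move the VF into the Pod's network namespace
$ ip link set ens2f0v0 netns pod-netns
# Rename it according to the "name" option and bring it up
$ ip netns exec pod-netns ip link set ens2f0v0 name south0
$ ip netns exec pod-netns ip link set south0 up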


The following figure shows a network environment using the Multus and SR-IOV CNI plugins, with a Pod that has three interfaces.

  • eth0 is attached to the flannel network plugin and serves as the Pod's default network.
  • One VF is instantiated from the host's physical port ens2f0, a port on an Intel X710-DA4; inside the Pod this VF interface is named south0.
  • Another VF is instantiated from the host's physical port ens2f1, another port on the Intel X710-DA4; inside the Pod it is named north0, and it is bound to the DPDK driver vfio-pci.

Multus networking architecture: overlay and SR-IOV

Notes: terminology

  • NIC: network interface card.
  • SR-IOV: single root I/O virtualization, a hardware-implemented feature that allows PCIe devices to be shared among virtual machines.
  • VF: Virtual Function, based on PF, shares a physical resource with PF or other VFs.
  • PF: PCIe Physical Function with full control over PCIe resources.
  • DPDK: Data Plane Development Kit.

We can also move a host interface directly into the Pod's network namespace. Of course, the interface must already exist and cannot be the same interface used by the default network. In this case, with ordinary network cards, the Pod network and the Node network sit directly on the same plane.

Multus networking architecture: overlay and IPVLAN

5. DANM

DANM is Nokia's open-source CNI project, which aims to bring telecom-grade networking to Kubernetes. Like Multus, it also supports SR-IOV/DPDK hardware technologies and IPVLAN.

II. Overlay Network Model

1. What is an Overlay Network?

An overlay network is a virtual logical network built on top of an underlay network using network virtualization technology, without changing the physical network architecture. Essentially, an overlay network uses one or more tunneling protocols to carry traffic from one network over another by encapsulating packets; specifically, the tunneling protocols operate on packets (or frames).

Overlay Network Topology

Common network tunneling techniques
  • GRE (Generic Routing Encapsulation) encapsulates IPv4/IPv6 packets inside packets of another protocol and usually works at Layer 3.
  • VxLAN (Virtual Extensible LAN) is a simple tunneling protocol that essentially encapsulates L2 Ethernet frames inside L4 UDP packets, using 4789 as the default port. VxLAN is also an extension of VLAN, extending the 4096 logical networks of the 12-bit VLAN ID to about 16 million with the 24-bit VNID.
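The encapsulation itself is ordinary Linux functionality: a VxLAN device can be created by hand, independent of any CNI. The VNI 100, the underlay interface eth0, and the multicast group below are illustrative values:

# Create a VxLAN device with VNI 100 on top of eth0, using the standard port 4789
$ ip link add vxlan100 type vxlan id 100 dstport 4789 group 239.1.1.1 dev eth0
$ ip link set vxlan100 up
$ ip -d link show vxlan100   # shows the vxlan id, port, and underlying device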

Typical examples of this overlay model are the VxLAN and IPIP modes of flannel and Calico.

2. IPIP

IP in IP (IPIP) is also a tunneling protocol. Like VxLAN, IPIP is implemented with Linux kernel functionality. IPIP requires the kernel module ipip.ko: use lsmod | grep ipip to check whether the kernel has loaded the module, and modprobe ipip to load it.
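A quick check, plus a minimal hand-made IPIP tunnel for illustration (the two endpoint addresses are placeholders):

$ lsmod | grep ipip          # check whether the ipip module is loaded
$ modprobe ipip              # load it if necessary
# Optional: create a raw IPIP tunnel between two hosts by hand
$ ip tunnel add ipip0 mode ipip local 192.168.1.2 remote 192.168.1.3
$ ip link set ipip0 up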

A simple IPIP network workflow

In Kubernetes, IPIP is similar to VxLAN in that both are implemented as network tunnels. The difference is that a VxLAN packet is essentially carried inside UDP, while IPIP encapsulates the IP packet directly inside another IP packet.

IPIP in Kubernetes

IPIP packet with Wireshark unpack

Notes: Some public clouds, such as Azure, may not allow IPIP traffic.

3. VxLAN

In Kubernetes, both flannel's and Calico's VxLAN implementations use Linux kernel functionality for encapsulation. Linux support for the VxLAN protocol is relatively recent: in 2012, Stephen Hemminger merged the relevant work into the kernel, and it first appeared in version 3.7.0. For stability and feature completeness, some software recommends using VxLAN only on kernel versions 3.9.0 or 3.10.0 and later.

A simple VxLAN network topology

In a Kubernetes VxLAN network such as flannel, the daemon maintains one VxLAN device per Kubernetes Node, named flannel.1, and maintains the routing for this network based on the VNID. When cross-node traffic occurs, the local node keeps the MAC address of the peer's VxLAN device; through this address it knows where the packet should be sent. After the peer's VxLAN device flannel.1 receives the packet, it decapsulates it and obtains the real destination address.


Checking the Forwarding database list

$ bridge fdb
26:5e:87:90:91:fc dev flannel.1 dst 10.0.0.3 self permanent
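The VxLAN parameters that flannel configured on the device can also be inspected; the output below is abridged and illustrative, but the non-standard port 8472 is flannel's default:

$ ip -d link show flannel.1
# Example output (abridged, values are illustrative)
flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 ...
    vxlan id 1 local 10.0.0.2 dev eth0 srcport 0 0 dstport 8472 nolearning ...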

VxLAN in Kubernetes

VxLAN packet with Wireshark unpack

Notes: Wireshark dissects VxLAN based on its standard port 4789, but flannel's default port on Linux is 8472, so by default Wireshark shows these packets only as plain UDP.


The architecture above shows that a tunnel is an abstract concept rather than a real tunnel established between two endpoints: one packet is encapsulated inside another, transported across the physical network, and then decapsulated by the corresponding device (the network tunnel) at the other end, forming an overlay network.

4. Weave VxLAN

Weave also encapsulates packets using VxLAN technology, which it calls fastdp (fast data path). Unlike calico and flannel, Weave uses the Open vSwitch datapath module in the Linux kernel, and Weave also encrypts network traffic.
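Whether a given connection actually uses fastdp or has fallen back to sleeve can be checked with the weave CLI; the peer address and details below are illustrative:

$ weave status connections
# Example output (illustrative)
-> 192.168.1.3:6783   established fastdp   7e:8a:aa:bb:cc:dd(node2)   mtu=1376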

Weave fastdp network topology

Notes: fastdp requires Linux kernel version 3.12 or higher. On older kernels, such as CentOS 7, Weave falls back to user-space forwarding, which it calls sleeve mode.
