Advanced buffer mechanism on the EXA32100

Aggregation is one of the most important functions of a Network Packet Broker. Many people assume that aggregation is a simple task in packet brokering, but it is actually the most challenging part: everything must be done in real time, and if packets are dropped, there is no way to retransmit them.
When a switch drops a packet because of a burst, that is undesirable, but the sender can usually retransmit the packet. This is not the case with network packet brokers: dropped packets cannot be retransmitted to the monitoring tools, so they are simply missing from the monitoring. This is especially critical for signalling monitoring, where a single missing packet can make a whole session unreadable for the monitoring systems. This is a major difference compared to switches.
Aggregation is particularly challenging with strongly bursty traffic. When such traffic is aggregated, the output ports usually become oversubscribed, and as a consequence packets are dropped.

This is a typical bursting situation: if the combined traffic exceeds 1000 Mbit/s, some of the incoming traffic will not appear at the combined output. Such a burst may last only microseconds, but it still causes packet loss.
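To make the effect concrete, the following minimal Python sketch simulates two inputs aggregated onto one 1000 Mbit/s output in 1 µs time slots, so the output can send at most 1000 bits per slot. The traffic traces, the burst shape and the function name `aggregate` are illustrative assumptions, not measurements. In each slot the output sends up to its capacity, the excess goes into a FIFO buffer, and whatever does not fit in the buffer is dropped.

```python
def aggregate(input_a, input_b, out_bits_per_slot, buffer_bits):
    """Simulate 2:1 aggregation slot by slot (hypothetical helper).

    input_a, input_b: bits arriving on each input per time slot.
    out_bits_per_slot: output port capacity per slot.
    buffer_bits: buffer space available to the output (0 = no buffer).
    Returns (bits_sent, bits_dropped).
    """
    queued = 0
    sent = 0
    dropped = 0
    for a, b in zip(input_a, input_b):
        offered = queued + a + b              # backlog plus new arrivals
        tx = min(offered, out_bits_per_slot)  # output sends at most its capacity
        sent += tx
        leftover = offered - tx
        kept = min(leftover, buffer_bits)     # keep what fits in the buffer
        dropped += leftover - kept            # the rest is lost for good
        queued = kept
    # Drain whatever is still buffered after the traces end
    # (assumes no further arrivals).
    sent += queued
    return sent, dropped


a = [600] * 20                           # steady 600 Mbit/s on input A
b = [100] * 5 + [1000] * 5 + [100] * 10  # 5 µs burst to 1000 Mbit/s on input B
print(aggregate(a, b, 1000, 0))          # (15500, 3000): no buffer, 3000 bits lost
print(aggregate(a, b, 1000, 4000))       # (18500, 0): a 4000-bit buffer absorbs it
```

Even though the burst lasts only five microseconds, the unbuffered output loses 3000 bits, while a small buffer lets every bit through once the burst has passed.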
A packet buffer is the only way to overcome this issue, and the buffer must also be managed intelligently. Cubro offers 24 MB of SRAM, which is the largest memory available on a 32 x 100 Gbit platform; all competitors use standard switches with much less memory. Another significant advantage is that Cubro offers active buffer management*, so users can assign more memory to specific ports that have bursting issues.
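How long a burst a buffer can absorb follows from its size and the excess rate: t = buffer size / (ingress rate − egress rate). The Python sketch below applies this to a 24 MB buffer. It assumes the buffer starts empty, that rates are constant during the burst, that the whole 24 MB is available to a single output (per-port limits would reduce this), and that 1 MB = 10^6 bytes; the function name is hypothetical.

```python
def burst_absorption_time_s(buffer_bytes, ingress_bps, egress_bps):
    """Longest burst (in seconds) a buffer can absorb without drops.

    Assumes an initially empty buffer and constant rates during the burst.
    """
    excess_bps = ingress_bps - egress_bps
    if excess_bps <= 0:
        return float("inf")  # the output keeps up; the buffer never fills
    return buffer_bytes * 8 / excess_bps


BUFFER = 24 * 10**6  # 24 MB, taking 1 MB = 10**6 bytes
# Two fully loaded 100G inputs aggregated onto one 100G output:
# 100 Gbit/s of excess traffic.
t = burst_absorption_time_s(BUFFER, 200e9, 100e9)
print(f"{t * 1e3:.2f} ms")  # 1.92 ms
```

Microsecond-scale bursts are therefore absorbed comfortably, while an overload lasting much longer than about 2 ms at this excess rate would still cause drops.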
Switches that offer gigabytes of RAM, such as deep buffer switches, are not network packet brokers. To learn more about these switches, please refer to the Deep Buffer Switches section of this document.
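*On the EXA32100, each port can be assigned a maximum amount of space it may use out of the 24 MB buffer, which is shared across all ports. Because no single port can take more than its maximum, the remaining space stays available to the other ports, so this feature can be used to guarantee a certain buffer size per port.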
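The footnote describes per-port maximums carved out of one shared buffer. The following Python model is a hypothetical sketch of that idea, not the EXA32100's internal algorithm; the class name `SharedBuffer`, the port names and the purely byte-based accounting are assumptions. It also spells out where the guarantee comes from: if the maximums of all other ports add up to no more than the total minus X, a port can always obtain X bytes.

```python
class SharedBuffer:
    """Hypothetical shared packet buffer with per-port maximums (bytes only)."""

    def __init__(self, total_bytes, port_max_bytes):
        self.total = total_bytes
        self.port_max = dict(port_max_bytes)  # port -> max bytes it may occupy
        self.used = {p: 0 for p in self.port_max}

    def free(self):
        """Unused space left in the shared pool."""
        return self.total - sum(self.used.values())

    def admit(self, port, size):
        """Store a packet only if both the port maximum and the pool allow it."""
        if self.used[port] + size > self.port_max[port]:
            return False  # this port has reached its maximum
        if size > self.free():
            return False  # shared pool exhausted
        self.used[port] += size
        return True

    def release(self, port, size):
        """The packet has been transmitted; return its space to the pool."""
        self.used[port] -= size

    def guaranteed(self, port):
        """Space this port can always get, even if every other port is at its max."""
        others = sum(m for p, m in self.port_max.items() if p != port)
        return max(0, min(self.port_max[port], self.total - others))


MB = 10**6
buf = SharedBuffer(24 * MB, {"p1": 16 * MB, "p2": 4 * MB, "p3": 4 * MB})
print(buf.guaranteed("p1") // MB)  # 16: the other ports can never take more than 8 MB
```

If the configured maximums add up to more than the total, the pool is oversubscribed and `guaranteed` shrinks accordingly: each port can still burst up to its maximum, but only while the others leave room.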
Configuring the buffer size is very simple because it is done directly in the standard port settings. The maximum buffer space a port can use is defined by a single value between 10 and 90000, counted in packets of 256 bytes: a value of 90000 means that 90000 packets of 256 bytes each can be buffered in case of a burst.
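Because the setting is counted in 256-byte packets, it converts directly into bytes: the maximum of 90000 corresponds to 23,040,000 bytes, about 23 MB, i.e. nearly the whole 24 MB buffer. The Python sketch below does this conversion; the helper names are hypothetical, and estimating how many frames of another size fit assumes purely byte-based accounting, whereas switch buffers often allocate fixed-size cells, so the real figure can be lower.

```python
PACKET_SIZE = 256               # bytes per packet used by the port setting
MIN_VALUE, MAX_VALUE = 10, 90000


def buffer_value_to_bytes(value):
    """Convert the per-port buffer setting (in 256-byte packets) to bytes."""
    if not MIN_VALUE <= value <= MAX_VALUE:
        raise ValueError(f"value must be between {MIN_VALUE} and {MAX_VALUE}")
    return value * PACKET_SIZE


def packets_that_fit(value, frame_size):
    """Rough count of frames of another size that fit in the same space.

    Assumes byte-based accounting (an assumption, see the text above).
    """
    return buffer_value_to_bytes(value) // frame_size


print(buffer_value_to_bytes(90000))   # 23040000 bytes, about 23 MB
print(packets_that_fit(90000, 1518))  # 15177 full-size Ethernet frames
```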

Here is an example of buffering on our EXA32100. Scenario: Port 1 and Port 15-1 are aggregated to Port 2. Port 1 (100G) is fully loaded, so Port 2 (100G) is also fully loaded. We then inject additional traffic on Port 15-1 (10G, 90000 packets) to simulate a burst. Because the output has no capacity left, the aggregator has to buffer these packets until Port 2 has room for them, so the 90000 packets wait until the traffic on Port 1 stops. As the top right of the picture shows, all packets were transmitted successfully.
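The numbers in this scenario can be checked with a short calculation. The Python sketch below assumes that Port 1 keeps Port 2 fully loaded for the entire burst, so every burst packet must be buffered, and it ignores the Ethernet preamble and inter-frame gap; the function name is hypothetical.

```python
def burst_scenario(packets, packet_bytes, burst_bps, out_bps):
    """Return (buffered_bytes, burst_seconds, drain_seconds) for a burst that
    arrives while the output is already fully loaded.

    Simplification: Ethernet preamble and inter-frame gap are ignored.
    """
    buffered_bytes = packets * packet_bytes  # the whole burst must be buffered
    bits = buffered_bytes * 8
    burst_seconds = bits / burst_bps         # time the burst takes to arrive
    drain_seconds = bits / out_bps           # time to empty the buffer once the
                                             # output has free capacity again
    return buffered_bytes, burst_seconds, drain_seconds


# Port 15-1 (10G) bursts 90000 x 256-byte packets towards Port 2 (100G).
buffered, burst_s, drain_s = burst_scenario(90000, 256, 10e9, 100e9)
print(buffered)                   # 23040000 bytes, just under 24 MB
print(f"{burst_s * 1e3:.3f} ms")  # 18.432 ms
print(f"{drain_s * 1e3:.4f} ms")  # 1.8432 ms
```

Under these assumptions the burst occupies about 23 MB, which still fits into the 24 MB buffer, and once Port 1 goes quiet Port 2 empties the backlog in under 2 ms.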
Deep Buffer Switches:
Deep buffer switches are switches, not network packet brokers. Their gigantic buffers can hold packets for a very long time, which makes network troubleshooting difficult. This and other issues related to large buffers, such as bufferbloat, are described at https://en.wikipedia.org/wiki/Bufferbloat.
Switches with ultra-deep buffers were initially designed to address the modular switch market. These chipsets with external buffers are not relevant for storage, big data and other high-performance applications inside the data centre.
The challenges of implementing ultra-deep buffers
Let us take a 1.0 Tbps single-chip switch as an example. Ideally, the ultra-deep buffer should be able to support 1.0 Tbps of packet writes and 1.0 Tbps of packet reads. The buffer is built from commodity DRAMs designed for the server market, and using these DRAMs poses the following challenges:

DRAMs are SO SLOW (low bandwidth)

To build a buffer with reasonable bandwidth, multiple DRAM banks (e.g. 8) must be used (see Figure 1). With multiple banks, the external memory can absorb several hundred Gbps of traffic. However, some oversubscription remains, because even multiple DRAM banks typically cannot sustain the full 1.0 Tbps line rate.
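The size of this gap can be estimated with simple arithmetic. The Python sketch below assumes that the 512 Gbps in the Figure 1 caption is the aggregate bandwidth of all eight DRAM instances (64 Gbps each) and that, in the worst case, every packet is both written to and read from external memory, so a 1.0 Tbps switch needs 2.0 Tbps of memory bandwidth. The helper names are hypothetical.

```python
import math


def dram_oversubscription(switch_bps, dram_instances, bps_per_instance):
    """Ratio of required to available external-memory bandwidth.

    Worst case: every packet is written into the buffer and read back out,
    so the required bandwidth is twice the switching rate.
    """
    required = 2 * switch_bps
    available = dram_instances * bps_per_instance
    return required / available


def instances_needed(switch_bps, bps_per_instance):
    """DRAM instances needed to sustain full-rate writes plus reads."""
    return math.ceil(2 * switch_bps / bps_per_instance)


PER_INSTANCE = 512e9 / 8  # 64 Gbps, derived from the Figure 1 caption (assumption)
print(round(dram_oversubscription(1.0e12, 8, PER_INSTANCE), 2))  # 3.91
print(instances_needed(1.0e12, PER_INSTANCE))                    # 32
```

Under these assumptions eight instances leave the buffer oversubscribed by roughly 3.9:1, and about 32 instances would be needed to keep up with full line rate.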

DRAMs are SO BIG (high capacity)

The minimum DIMM size is around 512 MB. If you use 8 instances, you get a 4 GB buffer whether you want that capacity or not.

Figure 1: Eight instances of slow DRAMs needed to provide 512Gbps bandwidth
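The capacity that comes with these DRAMs also explains why packets can sit in such buffers for a very long time. The Python sketch below computes the buffer size from eight 512 MB DIMMs and the worst-case queueing delay when a full buffer drains through one 100 Gbit/s port, compared with a 24 MB buffer. The drain rate, the binary units for the DIMMs (1 MB = 2^20 bytes), the decimal unit for the 24 MB SRAM and the helper names are assumptions.

```python
def dram_buffer_capacity_bytes(instances, dimm_bytes):
    """Total buffer capacity provided by identical DRAM instances."""
    return instances * dimm_bytes


def max_queueing_delay_s(buffer_bytes, drain_bps):
    """Worst-case wait for a packet queued behind a completely full buffer."""
    return buffer_bytes * 8 / drain_bps


MiB, GiB = 2**20, 2**30
cap = dram_buffer_capacity_bytes(8, 512 * MiB)
print(cap // GiB)                                                  # 4 (GB)
print(f"{max_queueing_delay_s(cap, 100e9) * 1e3:.1f} ms")          # 343.6 ms
print(f"{max_queueing_delay_s(24 * 10**6, 100e9) * 1e3:.2f} ms")   # 1.92 ms
```

A delay of several hundred milliseconds is far beyond what is useful for traffic inside the data centre, which is the bufferbloat effect mentioned earlier.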

Some vendors offer such switches with a 4 GB buffer and market them as “ultra-deep” buffer switches for storage and big data applications. However, these platforms are not suited to high-performance applications running inside the data centre, for the following reasons:
Ultra-deep buffers do not translate to better burst absorption
Platforms with ultra-deep buffers have multiple bottlenecks and an oversubscribed buffer architecture (see Figures 1 and 2). As a result, packets can be dropped indiscriminately, even before classification and forwarding lookup. This means that these switches are blocking in nature and suffer from port interference. Performance degrades further once a packet touches the slower external memory. What is the point of having “ultra-deep” buffers if one cannot guarantee non-blocking switching and better burst absorption?

Figure 2: Multiple bottlenecks in external packet buffer-based switches

Ultra-deep buffer switches exhibit higher latency and jitter

Today’s ultra-deep buffer platforms do not support cut-through switching. High-performance applications such as storage typically carry a higher proportion of large packets, and these large packets incur extra latency from store-and-forward operation. In addition, moving packet data between on-chip SRAM and off-chip DRAM introduces more jitter.
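The extra latency comes from serialization: a store-and-forward switch must receive the entire frame before it starts transmitting it, while a cut-through switch can begin once the headers have arrived. The Python sketch below estimates this per-hop penalty. The 64-byte cut-through threshold and the 100 Gbit/s port speed are assumptions, lookup time is ignored, and ingress and egress are taken to run at the same speed; the helper names are hypothetical.

```python
def serialization_ns(nbytes, rate_bps):
    """Time to clock nbytes onto (or off) a link, in nanoseconds."""
    return nbytes * 8 / rate_bps * 1e9


def store_and_forward_penalty_ns(frame_bytes, rate_bps, cut_through_bytes=64):
    """Extra per-hop delay of store-and-forward compared with cut-through.

    cut_through_bytes is an assumed amount of the frame a cut-through switch
    must receive (enough for the headers) before it can start forwarding.
    """
    return serialization_ns(max(frame_bytes - cut_through_bytes, 0), rate_bps)


for size in (64, 1518, 9000):
    print(size, round(store_and_forward_penalty_ns(size, 100e9), 2))
# 64 0.0
# 1518 116.32
# 9000 714.88
```

The penalty grows linearly with frame size, so jumbo-frame-heavy storage traffic pays the most, and it accumulates at every store-and-forward hop.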

Ultra-deep buffer switches lack density and are less reliable

External packet memory is slow, which limits the throughput a single switching element can provide. Vendors therefore have to use multiple switching elements (e.g. 6) to reach a density of 32 x 100GbE. In addition, these switch chips need many dedicated interface pins to connect to the external DRAMs, and they spend expensive silicon real estate on implementing thousands of queues that are only relevant to WAN applications. These chips are sub-optimal for high-performance applications inside the data centre. More components also mean lower reliability and higher power consumption.
Therefore, the ultra-deep buffer switches on the market today are not a good fit for packet broker applications at all.