Buffer Usage in Homa

One of the most controversial aspects of Homa is its usage of buffer space in network switches. A recent paper on Aeolus claimed that Homa’s buffer usage exceeds the capacity of switches, resulting in packet drops, timeouts, and poor performance. The paper is fundamentally flawed and its conclusions are invalid, but its claims have spread widely in the networking community and are taken as fact by many. This article attempts to provide a comprehensive discussion of Homa’s buffer usage and whether it is problematic. The overall points of the article are:

  • All existing claims of problems with Homa’s buffer usage, such as those in the Aeolus paper, are based on unnecessary and artificial restrictions in switch buffer configurations.

  • No problems with buffer exhaustion have occurred to date with the Linux kernel implementation of Homa, so we know that at least some switches provide plenty of buffer space for Homa.

  • Buffer space is getting tighter (newer switching chips appear to be scaling bandwidth faster than buffer space). There is not yet any experimental data on Homa running with these chips, and it’s hard to tell a priori whether their buffer space will be adequate.

  • If buffer space does turn out to be a problem, there are multiple options for reducing Homa’s buffer usage.

Thus there is currently uncertainty about Homa’s buffer usage. More work is needed to gain a better understanding of the factors that influence Homa’s buffer usage and to explore options for reducing it.

In the meantime, it is unfair to assume that Homa’s buffer requirements are unacceptable.

What we know about buffer usage

  • We measured the Linux kernel implementation running benchmarks on a 40-node cluster at 25 Gbps. The Mellanox switch for this cluster provides 13 MB of shared buffer space; Homa’s worst-case buffer occupancy across all nodes varied from 246 KB to 8.5 MB, depending on the workload. No packets were ever dropped because of lack of buffer space. The shared buffer pool could be reduced to 7 MB without significant performance degradation.

  • On these same workloads, TCP experienced noticeable performance degradation when the buffer pool was restricted to 10 MB. This suggests that any switch that works well with TCP will also work well with Homa.

  • On these same workloads, DCTCP ran efficiently with a shared buffer pool as small as 2 MB, so it is clearly superior to Homa in buffer usage (though not in performance).

  • Aeolus simulated Homa running at 100 Gbps with statically partitioned buffers of 200 KB per egress port. This resulted in significant packet loss, leading to timeouts, retransmissions, and poor performance. These results are flawed because (a) there is no need to partition buffer space in modern switches and (b) the 200 KB size of the static buffers is unnecessarily low (e.g., Tomahawk-3 chips provide 64 MB of buffer space for 128 ports, or 500 KB per port even if divided statically). However, the Aeolus results do indicate that static buffer partitioning is not a good match for Homa.

  • Other papers cite the Aeolus results, and some (such as dcPIM) appear to have made their own measurements; however, no concrete data is available from these measurements, and they also appear to have imposed unnecessary restrictions.

  • Simulations of Homa in the SIGCOMM paper showed maximum buffer usage of 146 KB for a single egress port, but we did not measure aggregate usage across all ports. Furthermore, the measurements were made with 10 Gbps network links, and buffer usage is likely to scale with network link speed.

Factors that affect Homa’s buffer usage

There are many factors that can affect the amount of buffer space used by Homa:

  • Unscheduled packets. Homa allows senders to transmit a certain amount of data for each message unilaterally, without receiving permission from the receiver. These are called unscheduled packets; once the unscheduled packets have been sent, senders must wait for grants before sending the remaining data. If many senders transmit unscheduled packets to the same receiver at the same time, buffers will accumulate in the network switch at the receiver’s downlink. Although it is unlikely that a large number of senders will transmit simultaneously, there is no upper limit on how much buffer space could be occupied by unscheduled packets. Homa is normally configured so that the unscheduled packets contain a bandwidth-delay product (BDP) worth of data, because this optimizes performance on an unloaded network (the receiver can return the first grant before the sender has transmitted all the unscheduled packets).

  • Overcommitment. When unscheduled packets result in buffer accumulation, the receiver will detect this and delay sending grants until buffer occupancy has dropped. However, Homa intentionally tries to maintain a certain amount of buffer occupancy at its downlink (up to 8*BDP in practice) in order to keep link utilization high; this is called overcommitment (the sketch after this list gives a rough upper bound on the resulting occupancy). Overcommitment also allows Homa to perform an efficient bipartite matching between senders and receivers.

  • Link speed. Both the amount of unscheduled data and the limit on overcommitment are normally defined in terms of the BDP, which grows with link speed; this means that buffer usage will scale with link speed.

  • Network utilization. Higher aggregate loads on the network are likely to produce more intense bursts, which will result in higher buffer occupancy.

  • Round-trip time (RTT). Higher RTTs also increase the BDP and thus result in higher buffer usage. Higher RTTs can come about either because of networking hardware (e.g. links in the network core may have higher latency than those within a rack) or because of overheads in the software stack. A hardware implementation of Homa in the NIC would reduce RTTs significantly compared to an implementation in kernel software, so it would reduce buffer usage as well.

  • Workload. We found significant variations in buffer usage as the workload changes. Across the 5 workloads we simulated in the SIGCOMM paper, maximum buffer usage for a single downlink varied by more than a factor of 2. Across the 4 workloads measured with the Linux kernel implementation, aggregate buffer usage varied by more than a factor of 3 (though some of this variance is because some workloads couldn’t drive the network at high utilization).

  • Degree of pool sharing. If the buffer pool is statically partitioned among egress ports as in the Aeolus experiments (i.e. each pool serves only a single port), significantly more buffer space is required to prevent packet drops than if buffers are shared. Pools shared by more ports are likely to use buffer space more efficiently and hence require less total space. Thus, for example, a switch with buffers shared across 80 ports will probably need less than 2x as much buffer space as a switch with buffers shared across 40 ports.
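
To make the BDP-related factors above more concrete, here is a minimal back-of-the-envelope sketch (in Python) of how link speed, RTT, and the degree of overcommitment combine to bound the scheduled buffer occupancy at a single downlink. The function names and the RTT values below are illustrative assumptions, not parameters of the Homa implementation; note also that unscheduled packets are not bounded in this way.

```python
def bdp_bytes(link_gbps: float, rtt_usec: float) -> float:
    """Bandwidth-delay product in bytes: the data in flight during one RTT."""
    return link_gbps * 1e9 / 8 * rtt_usec * 1e-6


def scheduled_buffer_bound(link_gbps: float, rtt_usec: float,
                           overcommit_degree: int = 8) -> float:
    """Rough upper bound on granted (scheduled) data that can accumulate at
    one downlink: the receiver keeps at most overcommit_degree * BDP of
    grants outstanding at any time (8*BDP in practice, per the text)."""
    return overcommit_degree * bdp_bytes(link_gbps, rtt_usec)


# Illustrative numbers only (the RTT values are assumptions, not
# measurements): a 25 Gbps link with a 15 usec software RTT and a
# 100 Gbps link with a 5 usec hardware RTT.
for gbps, rtt in [(25, 15), (100, 5)]:
    print(f"{gbps} Gbps, RTT {rtt} usec: "
          f"BDP = {bdp_bytes(gbps, rtt) / 1e3:.0f} KB, "
          f"8*BDP = {scheduled_buffer_bound(gbps, rtt) / 1e3:.0f} KB")
```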

Most of these factors have not been analyzed comprehensively. For example, we don’t yet know the precise relationship between degree of overcommitment and buffer occupancy, or between the degree of pool sharing and buffer occupancy. Gaining a better understanding of these relationships is an important area for future work.

Quantifying buffer usage

A convenient way to measure buffer usage is to take the total amount of buffer space available (or required) across a collection of egress ports in a switch and divide it by the aggregate network bandwidth of those ports. The resulting number will have units of time (microseconds).

For example, the Mellanox 2410 switches used to evaluate the Linux kernel implementation of Homa have a total of 13 MB of buffer space available for downlink egress ports, and in our experiments 40 nodes were attached to the switch at 25 Gbps. Thus, available buffer space for downlinks was 13 MB / (40 * 25 Gbps), or 104 usecs. Homa’s worst-case buffer usage was 8.5 MB, which is 68 usecs, and it was able to run efficiently at 7 MB, or 56 usecs.
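
The arithmetic behind these numbers is easy to reproduce. The following Python sketch (the helper name `buffer_usecs` is just a label chosen here) computes the metric from the figures above, treating 1 MB as 10^6 bytes, which matches the numbers quoted in the text:

```python
def buffer_usecs(buffer_bytes: float, aggregate_gbps: float) -> float:
    """Buffer capacity expressed as time: buffer bits divided by the
    aggregate bandwidth (bits/sec) of the ports sharing the buffer."""
    return buffer_bytes * 8 / (aggregate_gbps * 1e9) * 1e6


downlink_gbps = 40 * 25                     # 40 nodes at 25 Gbps
print(buffer_usecs(13e6, downlink_gbps))    # available pool:   104 usec
print(buffer_usecs(8.5e6, downlink_gbps))   # worst-case usage:  68 usec
print(buffer_usecs(7e6, downlink_gbps))     # still efficient:   56 usec
```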

One way of thinking about this time is that if data arrives at full bandwidth but no data is actually removed from buffers, it will take this long before all buffer space is exhausted (assuming a shared pool).

In principle, this usage metric should be fairly constant across changes in link speed or number of hosts, but as described in the previous section, there are many factors that can affect buffer usage. Thus, the usage metric is unlikely to stay constant across all environments. For example, an environment with a larger RTT will almost certainly require more usecs of buffer space than an environment with a smaller RTT.

Extrapolating to 100 Gbps switches

Given what we know about Homa’s performance with the Mellanox 2410 switches, can we predict its behavior in 100 Gbps environments? Consider the Tomahawk-3 switching chip. It has 12.8 Tbps aggregate throughput (128 ports at 100 Gbps) and 64 MB of buffer space, which equates to 40 usec of buffer capacity. This would appear to be too little for Homa’s minimum requirement of 56 usec. However, not all of the Tomahawk ports will be used for downlinks. If the switch is configured with 2:1 oversubscription, then 1/3 of its bandwidth will be for uplinks, so the total downlink throughput will be 8.53 Tbps. We don’t know how much buffer space will be needed for the uplinks; if it turns out that very little buffering is needed for them (because Homa can load-balance effectively with packet spraying) there could be nearly 60 usec available for the downlinks, which would meet Homa’s needs.
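
The same metric makes this extrapolation easy to check. Here is a sketch that reproduces the Tomahawk-3 numbers (it reuses the illustrative `buffer_usecs` helper from the previous section, repeated so the example is self-contained):

```python
def buffer_usecs(buffer_bytes: float, aggregate_gbps: float) -> float:
    """Buffer capacity expressed as time (usec)."""
    return buffer_bytes * 8 / (aggregate_gbps * 1e9) * 1e6


tomahawk3_buffer = 64e6              # 64 MB of shared buffer space
all_ports_gbps = 128 * 100           # 128 ports at 100 Gbps = 12.8 Tbps
print(buffer_usecs(tomahawk3_buffer, all_ports_gbps))    # 40 usec

# With 2:1 oversubscription, 1/3 of the bandwidth serves uplinks; if the
# uplinks turn out to need almost no buffering, the downlinks could get:
downlink_gbps = all_ports_gbps * 2 / 3                   # ~8.53 Tbps
print(buffer_usecs(tomahawk3_buffer, downlink_gbps))     # ~60 usec
```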

In addition, the other factors discussed above could either increase or decrease Homa’s buffering needs in a 100 Gbps environment. For example, if it turns out that hosts are not able to fully utilize 100 Gbps uplinks, buffer needs will drop. On the other hand, our Linux kernel measurements were made entirely within a single rack; if 100 Gbps switches are used in a multi-level fabric, RTTs could go up, which would increase buffer requirements.

Thus, we don’t currently have enough information to determine whether Homa will work well with Tomahawk-3 chips: there could be enough buffer space (barely), or there could be a shortfall.

In general, it appears that newer switching chips are scaling their bandwidth faster than their buffer space; if this trend continues then it will put pressure on Homa (and many other protocols, including TCP).

What can be done if switches don’t meet Homa’s buffer requirements?

If it turns out that Homa’s use of buffer space exceeds the capacity of switches, that will not necessarily make Homa impractical. Here are some thoughts on how to handle this situation, if it arises:

  • Reduce Homa’s buffer needs. So far we have made no attempt to reduce Homa’s buffer usage: we’ve optimized Homa for performance under the assumption that switches will have enough buffer space to meet its needs. We have several ideas for ways to reduce buffer usage (see the Homa projects page). Some of these approaches would not impact performance at all, while others might entail some loss of performance. For example, reducing the degree of overcommitment would reduce buffer usage, but it would also limit maximum network throughput. We don’t yet have enough data to quantify the tradeoff (e.g. how much buffer space would be saved vs. how much throughput would be sacrificed).

  • Idle some switch ports. If some ports on a switch are not used (i.e. the switch’s throughput is reduced), then the usecs of buffer space available to the remaining ports will increase (a quick calculation appears after this list). This introduces a cost-performance tradeoff: more switching chips would be needed in the network hardware, thereby increasing its cost, but Homa’s performance would improve.

  • Aeolus isn’t the answer. Although Aeolus reduced Homa’s buffer requirements significantly, it did so by severely damaging the protocol, resulting in much worse performance than “real” Homa. It seems likely that there are better ways to reduce Homa’s buffer usage, which will allow Homa to work with future switches at much higher performance than Aeolus can provide.

  • TCP will also have problems. Our measurements indicate that Homa uses less buffer space than TCP. Thus, if it is hard for Homa to run efficiently with a given switch, it will also be hard for TCP to run efficiently. This will create pressure for switch designers to provide more buffer space.
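
As a purely illustrative calculation of the port-idling option above (the 96-port configuration is an assumption chosen only to show the effect, not a recommendation), idling a quarter of a Tomahawk-3’s ports would raise the per-port buffer time from 40 usec to about 53 usec:

```python
def buffer_usecs(buffer_bytes: float, aggregate_gbps: float) -> float:
    """Buffer capacity expressed as time (usec)."""
    return buffer_bytes * 8 / (aggregate_gbps * 1e9) * 1e6


# Tomahawk-3 figures from the text; the 96-port configuration is
# hypothetical, chosen only to illustrate the effect of idling ports.
print(buffer_usecs(64e6, 128 * 100))   # all 128 ports active:  40.0 usec
print(buffer_usecs(64e6, 96 * 100))    # 32 ports idled:       ~53.3 usec
```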