In discussions about Homa, there are several concerns that occur repeatedly. Most of these have already been addressed in the Homa papers, but this page summarizes the concerns and the counter-arguments.
Homa doesn’t deal with congestion in the network core
This is true. Homa’s congestion control mechanisms are focused on the network edge (the downlinks between top-of-rack switches and hosts); Homa doesn’t take any steps to mitigate congestion in the core. However, with Homa there should be no core congestion in the first place. This is because Homa allows “packet spraying”, where each packet of a message can be routed independently through the datacenter switching fabric in order to balance load. The reason current datacenters suffer core congestion is that TCP requires packets to be delivered in order; this requires flow-consistent routing, where all of the packets of a given connection follow the same path through the fabric. Unfortunately, with flow-consistent routing it’s highly likely that multiple large flows will sometimes hash to the same link, even if the fabric is running at low overall utilization. Once this happens, that link will be congested for the lifetime of those flows, affecting not just those flows but any other flows that hash to the same link. This is the cause of virtually all core congestion today. If Homa becomes widely deployed, there will no longer be significant core congestion for anyone to deal with.
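To see why flow-consistent routing invites collisions, here is a small Monte Carlo sketch of the birthday-paradox effect. The flow and link counts are made up for illustration; they do not come from the Homa papers:

```python
import random

def ecmp_collision_prob(num_flows, num_links, trials=10000):
    """Estimate the probability that at least two flows hash to the
    same core link under flow-consistent (ECMP-style) routing."""
    collisions = 0
    for _ in range(trials):
        # Each flow is pinned to one link chosen by a hash; model the
        # hash as a uniform random choice.
        links = [random.randrange(num_links) for _ in range(num_flows)]
        if len(set(links)) < num_flows:
            collisions += 1
    return collisions / trials

# Even with only 10 large flows spread across 64 core links, a
# collision is more likely than not (roughly a 50% chance), so a
# congested core link can appear at low overall utilization.
print(ecmp_collision_prob(10, 64))
```

With packet spraying, by contrast, each packet picks a link independently, so no single link stays overloaded for the lifetime of a flow.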
All modern datacenters oversubscribe the core; doesn’t this guarantee that there will always be core congestion?
No. It’s important to distinguish between oversubscription and overload. Oversubscription means that the aggregate bandwidth of all of the host downlinks exceeds the aggregate bandwidth of the network core. In an oversubscribed system, if every host were to transmit at full bandwidth to destinations across the datacenter, the core network could become overloaded. However, overload virtually never happens in practice, because hosts typically use only a small fraction of their uplink bandwidth and some of the traffic targets neighbors attached to the same top-of-rack switch. As a result, even with oversubscription, core networks tend to run at relatively low utilization. It would not be cost-effective to underprovision core networks so that they can’t keep up with the actual loads, because this would result in under-utilization of the more expensive host machines.
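As a back-of-the-envelope illustration of the distinction, consider a single rack with hypothetical numbers (these are invented for this sketch, not measurements from any particular datacenter):

```python
def core_utilization(hosts, link_gbps, oversub, host_util, intra_rack):
    """Fraction of core capacity actually used by one rack's traffic."""
    downlinks = hosts * link_gbps                        # aggregate host bandwidth
    core = downlinks / oversub                           # core capacity after oversubscription
    offered = downlinks * host_util * (1 - intra_rack)   # load that actually reaches the core
    return offered / core

# 40 hosts at 25 Gbps, 2:1 oversubscription, hosts averaging 15% of
# their link bandwidth, 25% of traffic staying within the rack:
print(core_utilization(40, 25, 2.0, 0.15, 0.25))  # ~0.225: core runs at ~22.5%
```

Even though the rack is oversubscribed 2:1, the core links carry well under a quarter of their capacity; overload would require hosts to sustain utilization several times higher than is typical.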
Does Homa have problems with excessive buffer usage?
This is an open question. A few recent papers, most notably Aeolus, have claimed that Homa uses excessive buffer space. However, these papers based their claims on inaccurate assumptions about switch buffer management. For example, the Aeolus paper assumes that switch buffer space is statically divided among egress ports, whereas in fact switches provide shared pools of buffers, so they can handle brief spikes at a particular port. See the Aeolus rebuttal for more discussion of the Aeolus claims.
There has been no problem with buffer overflows in the existing implementations of Homa. For example, the worst-case buffer consumption in benchmarks of the Linux kernel implementation was about 8.5 MB, for a switch with 13 MB capacity (these benchmarks used a 25 Gbps network). Our implementation of Homa in RAMCloud, which used Infiniband networking, also had no problems with buffer overflows, though we did not measure Homa’s actual buffer usage.
Extrapolations to newer 100 Gbps switching chips, such as Broadcom’s Tomahawk-3, suggest there may be challenges for Homa. To see this, take the ratio of total required buffer space to total host downlink bandwidth; this ratio has units of time, and it seems plausible that it will remain roughly constant as network speeds scale. In the Linux kernel implementation, Homa used 8.5 MB of buffer space to drive 40 nodes at 25 Gbps: the ratio is 68 microseconds of buffering. Tomahawk-3 switches offer 128 ports at 100 Gbps, for 12.8 Tbps total bandwidth, and they have 64 MB of buffer space, which is 40 usecs worth. This would appear to be insufficient for Homa. However, with 2:1 oversubscription, only ⅔ of the switch bandwidth will be for downlinks. Assuming that there will be little or no buffering on the uplinks (since Homa can use packet spraying), this works out to 60 usecs of buffering on the downlinks, which is very close to what Homa needs.
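The arithmetic above can be checked with a small helper (the function name is ours):

```python
def buffer_usecs(buffer_mbytes, bandwidth_gbps):
    """Express switch buffer capacity as time: buffered bits divided
    by aggregate downlink bandwidth (bits/sec), in microseconds."""
    return buffer_mbytes * 8e3 / bandwidth_gbps

print(buffer_usecs(8.5, 40 * 25))           # Linux benchmark: 68.0 usec
print(buffer_usecs(64, 128 * 100))          # Tomahawk-3, all ports: 40.0 usec
print(buffer_usecs(64, 128 * 100 * 2 / 3))  # downlinks only (2:1): ~60.0 usec
```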
Another consideration is that our measurements indicate that TCP needs at least as much buffer space as Homa. Thus, any switch that works for TCP is likely to work for Homa.
It appears that newer switching chips are increasing their bandwidth faster than their buffer space. Suppose there comes a time when switches no longer have enough buffer space for Homa: will that make Homa useless?
No. To date we have not made any attempt to reduce Homa’s buffer usage, but it seems likely that it could be reduced significantly. For example, most buffer usage comes from either unscheduled packets or overcommitment. In the worst case, these could be scaled back to reduce buffer consumption (though this would come at some cost in performance). We also have ideas for optimizations that might reduce buffer usage without any performance impact. See the projects page for details.
Bottom line: it is premature to declare that Homa is impractical because of its buffer usage when (a) we have actual implementation experience that shows this is not a problem, and (b) we have ideas how to reduce buffer usage in the future if that should be needed.
Is Homa resilient against dropped packets?
Homa is resilient in the sense that it will detect dropped packets and retransmit them, but Homa assumes that drops are extremely rare; if packets are dropped frequently, Homa will perform poorly. Packet drops caused by corruption are extremely rare, so the only significant risk is drops from buffer overflow.
How does Homa handle incast?
Incast refers to a situation where many hosts simultaneously transmit to a single receiver. In the worst case this could result in rttBytes of buffered data at the receiver’s downlink for each transmitter; since the number of transmitters is in theory unbounded, this could result in buffer overflow and dropped packets. It’s worth considering two different scenarios for incast:
Self-inflicted: a server issues a large number of requests to different peers; the requests all take about the same amount of time to process, so the responses all arrive at about the same time. This appears to be the most common cause of incast. Because Homa is RPC-oriented, it can easily predict and mitigate this form of incast. To do so, Homa tracks the number of outstanding RPCs at any given point in time. If this number exceeds a small threshold, Homa sets a flag in new RPCs, which indicates to the servers for those RPCs that they should reduce the amount of unscheduled data they send in their responses (perhaps to only a few hundred bytes). This reduces buffering to the point where Homa can handle incasts of degree 1000 or more without buffer overflow (see measurements in Section 5.1 of the Homa SIGCOMM paper).
Unpredictable: many machines independently decide to transmit to the same recipient at the same time. It is highly unlikely that this form of incast could reach a high degree unless there is a common underlying cause; if there is a common cause, perhaps it could also provide a signal similar to what Homa does for self-inflicted incast. Even so, Homa limits the amount of buffer buildup to rttBytes per source, and Homa receivers act very aggressively to drain away accumulated buffers. Homa can handle degrees of at least several hundred for unpredictable incasts.
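The self-inflicted mitigation might be sketched as follows. The class and function names, the threshold, and the byte limits are invented for illustration; the actual Linux implementation differs:

```python
# Hypothetical constants, not taken from the Linux Homa code:
OUTSTANDING_THRESHOLD = 5
NORMAL_UNSCHED = 60_000   # bytes of unscheduled response data normally allowed
INCAST_UNSCHED = 400      # reduced limit when an incast looks likely

class Client:
    def __init__(self):
        self.outstanding_rpcs = 0

    def start_rpc(self):
        """Issue an RPC; returns the flags to carry in its header."""
        self.outstanding_rpcs += 1
        # Many outstanding RPCs means their responses may all arrive
        # at once, so ask servers to hold back unscheduled data.
        return {"incast_likely": self.outstanding_rpcs > OUTSTANDING_THRESHOLD}

    def finish_rpc(self):
        self.outstanding_rpcs -= 1

def response_unsched_limit(flags):
    """Server side: how much unscheduled data the response may carry."""
    return INCAST_UNSCHED if flags["incast_likely"] else NORMAL_UNSCHED
```

Once the flag is set, each response contributes only a few hundred bytes of unsolicited data to the receiver’s downlink; the rest is granted by the receiver at a rate the downlink can absorb.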
Since Homa prioritizes shorter messages, can’t the longest messages starve?
In principle yes, but in practice no. In order for longer messages to starve, the receiver’s TOR downlink would need to be completely saturated for a long period of time; this is uncommon in practice. We ran adversarial experiments in which we attempted to generate starvation, and found it difficult to do even when running at 80% average network load. The Linux kernel implementation of Homa also contains an extra safeguard against starvation: it reserves a small fraction of bandwidth for the oldest message instead of the shortest one, which eliminates starvation. It’s also important to note that Homa’s SRPT policy has a built-in advantage over the fair-sharing approach used by TCP, in that it uses run-to-completion: Homa receivers tend to pick one message and grant to it until it completes. When there are many competing messages, at least one of them will finish relatively quickly. With fair sharing, if there are many long messages, they all complete slowly. As a result, even though Homa favors short messages, it also speeds up longer messages in comparison to fair sharing.
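A toy model on a unit-rate link (not Homa’s actual grant mechanism) illustrates the run-to-completion advantage. SRPT finishes short messages almost immediately, and even when all messages are long it finishes them one at a time, whereas fair sharing makes every long message wait until nearly the end:

```python
def srpt_completions(sizes):
    """Run-to-completion, shortest message first, on a unit-rate link."""
    t, done = 0.0, {}
    for i in sorted(range(len(sizes)), key=lambda i: sizes[i]):
        t += sizes[i]
        done[i] = t
    return [done[i] for i in range(len(sizes))]

def ps_completions(sizes):
    """Fair sharing (processor sharing): all remaining messages get
    equal bandwidth, so k active messages each proceed at rate 1/k."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    t, prev, remaining, done = 0.0, 0.0, len(sizes), {}
    for i in order:
        t += remaining * (sizes[i] - prev)  # time until next-shortest finishes
        prev = sizes[i]
        done[i] = t
        remaining -= 1
    return [done[i] for i in range(len(sizes))]

print(srpt_completions([1, 1, 1, 10]))  # [1.0, 2.0, 3.0, 13.0]
print(ps_completions([1, 1, 1, 10]))    # [4.0, 4.0, 4.0, 13.0]
print(srpt_completions([10, 10, 10]))   # [10.0, 20.0, 30.0]
print(ps_completions([10, 10, 10]))     # [30.0, 30.0, 30.0]
```

Note that the long message finishes no later under SRPT in these examples, while the short messages finish far sooner; and with three equal long messages, SRPT completes two of them well before fair sharing completes any.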
Homa depends on using the priority queues in modern switches, but some datacenter operators already use those queues to implement their own quality-of-service mechanisms. Won’t this make it difficult for Homa to achieve its performance potential?
Getting access to the switch priority queues is a potential challenge for Homa deployment, but it’s worth considering the following additional factors:
Even if all Homa packets must flow through a single priority queue, Homa still outperforms TCP by a considerable margin.
Homa doesn’t need very many priority queues: 4 is plenty, and even 2 queues provide considerable benefit.
The number of queues per port appears to be increasing in switches.
Managing priority queues statically “by hand” is difficult to do effectively. A simple experiment in the ATC Homa paper (see Section 5.4) suggests that it may be possible to get better overall performance by eliminating existing static allocations and letting Homa manage the priority levels dynamically. Even though all applications and uses would get the same quality of service, the applications that previously had higher priority would see better performance under almost all conditions!