Comments on "PowerTCP: Pushing the Performance Limits of Datacenter Networks"
The paper “PowerTCP: Pushing the Performance Limits of Datacenter Networks” (NSDI 2022) describes a new congestion control algorithm for datacenters and compares it against several existing protocols, including Homa. Among other things, the paper claims better performance than Homa, including a 99% reduction in tail latency, and it claims that Homa’s overcommitment mechanism is harmful to performance. Unfortunately, the third-party simulator used to evaluate Homa suffers from numerous problems, including bugs, missing features, and misconfiguration. After discussions with the authors, we have agreed that, because of the simulator issues, most of the Homa measurements in the paper are incorrect. In addition, there remain anomalies that have not been resolved. Until all of the issues are resolved and measurements are retaken, the paper’s conclusions relating to Homa should be considered unreliable.
The remainder of this article discusses in more detail the problems with the Homa evaluation in the paper. It then argues that there is good reason to believe Homa will outperform PowerTCP, due to its inherent advantages of knowing message lengths and using network priority queues.
I have not scrutinized the paper’s analysis of PowerTCP or its comparison with other systems such as TIMELY and HPCC, and I have no reason to question those measurements. Based on my reading of the paper, its conclusion that PowerTCP provides significant benefits in today’s TCP-based datacenters seems plausible.
The PowerTCP authors have issued their own statement acknowledging the problems described above.
About the simulator
The Homa simulator used for the PowerTCP paper was not developed by either the PowerTCP authors or the Homa developers. It is in a relatively early stage of development, but unfortunately advertised itself in a way that suggested it was endorsed by the Homa creators as a reference implementation. This is not the case, and the wording has subsequently been changed, but the simulator’s original description could easily mislead people into placing undue trust in it.
Known problems with the simulator
As a result of discussions with the PowerTCP authors about anomalies in the Homa measurements, we have uncovered the following bugs and missing features in the simulator:
Incorrect implementation of priority queues. The simulator reads packet priorities incorrectly.
No shared buffer pools. The simulator implements only statically allocated pools. For best performance, Homa needs shared buffer pools (which are common in modern switches); statically allocated pools can result in unnecessary packet drops.
No support for non-uniform RTTs. Homa senders are allowed to transmit one BDP of data unilaterally for each message, without waiting for grants. This amount depends on the round-trip time between sender and receiver. In most datacenter fabrics, RTTs are non-uniform (e.g., in the PowerTCP simulations intra-rack RTTs were 4 usec, while inter-rack RTTs were 28 usec); it has been our expectation that in environments like this Homa should use different BDP values for different sender-receiver pairs, depending on their RTTs, which should be measured dynamically (see the sketch just after this list). However, this expected behavior is not specified in any of the Homa papers, and the simulator supports only a single RTT value for all client-server pairs.
Sender-driven retry. The simulator had an option to use sender-driven retry, which was enabled for the PowerTCP measurements. This is not part of the Homa design (Homa uses receiver-driven retry) and it causes poor performance under some conditions.
All of these issues have now been registered in the simulator’s GitHub repository.
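To make the expected behavior concrete, here is a minimal sketch of the per-pair computation; it is illustrative only (the link speed and function names are hypothetical, and this code comes from neither the Homa papers nor the simulator):

```python
# Hypothetical sketch: size the unscheduled ("unilateral") allowance for each
# sender-receiver pair from that pair's measured RTT.

LINK_GBPS = 100  # hypothetical link speed (Gbit/s)

def unscheduled_limit_bytes(rtt_usec, link_gbps=LINK_GBPS):
    """One bandwidth-delay product for this sender-receiver pair."""
    bytes_per_usec = link_gbps * 1e9 / 8 / 1e6   # Gbit/s -> bytes/usec
    return int(bytes_per_usec * rtt_usec)

# With the RTTs used in the PowerTCP simulations:
intra_rack = unscheduled_limit_bytes(4.0)    # 4 usec intra-rack RTT
inter_rack = unscheduled_limit_bytes(28.0)   # 28 usec inter-rack RTT
print(intra_rack, inter_rack, inter_rack / intra_rack)   # 50000 350000 7.0
```

Using the single inter-rack value everywhere sends 7x more unscheduled data than an intra-rack pair needs, which is the source of the buffer buildup discussed under Figure 4 below.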
Consequences of the simulator problems
Figures 9 and 10 are incorrect. These figures show step-wise increases in buffer occupancy for Homa. However, this is an artifact of the simulator’s use of sender-driven retry (senders were timing out and resending all their unscheduled data). When sender-driven retry was disabled, the steps went away in both figures.
Figure 11 is incorrect. This figure indicates that bandwidth sharing is occurring between flows, which should not happen in Homa due to its SRPT mechanism. This result was a consequence of the bug in the implementation of priority queues. After fixing the bug all of Figures 11(b)-(f) look like Figure 11(a) (overcommitment has no impact on incast situations).
The paper’s claim that Homa works best with a degree of overcommitment of 1 is incorrect: this conclusion was based on the incorrect data in Figures 9-11. This calls into question all of the other Homa measurements in the paper, since they were made with an overcommitment degree of 1 (a sketch of Homa’s overcommitment mechanism appears just after this list).
Figure 4 is an artifact of non-uniform RTT handling. Figure 4 seems to show that Homa suffers higher peak buffer usage under incast than PowerTCP. However, this is entirely an artifact of how non-uniform RTTs were handled. Because the simulator supports only a single RTT value for computing unscheduled data, the PowerTCP authors decided to use the largest value (28 usec). This means that 7x too much unilateral data is sent when communicating within a rack, which results in excessive buffer buildup. The PowerTCP authors agree that if Homa were implemented in the way we assumed, its peak buffer usage under incast would be identical to PowerTCP’s.
Figure 6 contradicts prior work. The behavior shown for Homa in this figure is dramatically worse than, and qualitatively different from, measurements of 3 other Homa implementations running the same workload. It seems likely that the known simulator problems affected this figure, so the figure should be considered unreliable until it has undergone additional analysis to explain exactly why it differs from previous Homa measurements.
Additional anomalies. Even after fixing the bugs above, there are still anomalies in the simulation results (as one example, Homa’s buffer usage appears too high in the revised Figure 10; see this GitHub issue for details). These anomalies have not yet been resolved.
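For readers unfamiliar with the mechanism in question, the following sketch illustrates what a degree of overcommitment means for a Homa receiver. It is illustrative only: the names are hypothetical, it omits priorities and timeouts, and it is not code from the simulator or from any Homa implementation.

```python
RTT_BYTES = 50_000   # hypothetical one-BDP grant window per message

class IncomingMessage:
    def __init__(self, total_bytes):
        self.total = total_bytes
        self.received = 0      # bytes that have arrived so far
        self.granted = 0       # bytes the receiver has authorized so far

    @property
    def remaining(self):
        return self.total - self.received

def issue_grants(messages, degree):
    """Grant to at most `degree` messages, shortest-remaining first, keeping
    at most one BDP of granted-but-unreceived data per granted message."""
    grants = {}
    active = sorted((m for m in messages if m.remaining > 0),
                    key=lambda m: m.remaining)
    for m in active[:degree]:
        limit = min(m.received + RTT_BYTES, m.total)
        if limit > m.granted:
            grants[m] = limit - m.granted   # additional bytes to grant now
            m.granted = limit
    return grants
```

With a degree of 1, the downlink goes idle whenever the single granted sender is slow to respond; higher degrees keep the link busy at the cost of some extra buffering, which is why Homa uses overcommitment in the first place.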
Before any of the paper’s Homa measurements or conclusions can be considered valid, the known bugs need to be fixed, missing simulator features need to be implemented, and the remaining anomalies need to be analyzed to see if there are additional simulator problems. Once all of these issues have been resolved, all of the Homa measurements need to be retaken.
Homa vs. PowerTCP: which is better?
Contrary to the paper’s conclusions, there are good reasons to believe that Homa will outperform PowerTCP once all of the simulator problems are fixed. Further measurements are needed to confirm these arguments, but the arguments provide additional reasons to be skeptical of the paper’s claims about Homa until proper measurements have been made.
Homa’s response to incast is strictly faster than PowerTCP’s. Knowledge of message sizes gives Homa a fundamental advantage: it can predict incasts before they happen. During incast, a Homa receiver becomes aware of a problem as soon as it has received packets from at least two messages. At this point it knows exactly how much data is coming, so it can instantly stop issuing grants, and all senders will stop transmitting after their initial BDP of data (modulo an extra packet or two for the first message). This is the fastest possible response, absent additional in-network mechanisms for incast detection. PowerTCP depends on queue buildup to provide a congestion signal, which must inevitably take at least a bit more time. In addition, senders scale back gradually as they receive ever-more-dire feedback; this also takes additional time. The only way to improve upon Homa’s response is to reduce the amount of data sent unilaterally. Either Homa or PowerTCP could do this, but it would result in loss of throughput on lightly loaded networks.
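To illustrate the timing argument, here is a sketch (hypothetical names and constants, not actual Homa code) of why no queue buildup is needed before a Homa receiver can react: the first packet of each message announces the message’s total length, so packets from two messages are enough to know exactly how much more data is on the way.

```python
RTT_BYTES = 50_000   # hypothetical unscheduled allowance per sender

def incast_detected(announced_lengths):
    """announced_lengths: total length of each active incoming message,
    learned from the first packet of that message."""
    if len(announced_lengths) < 2:
        return False
    # Each sender stops on its own after RTT_BYTES of unscheduled data;
    # anything beyond that arrives only if the receiver grants it, so the
    # receiver can cap the incast immediately by withholding grants.
    ungranted_backlog = sum(max(n - RTT_BYTES, 0) for n in announced_lengths)
    return ungranted_backlog > 0
```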
Homa also responds faster when flows complete. Because it has message length information, Homa knows when each message will complete and it can proactively grant to other messages to avoid “bubbles” on its downlink. PowerTCP has no such information, so it cannot know that a flow has completed until buffer occupancy begins to drop; this results in delays in ramping up other senders, which risks buffer underflow and underutilization of the downlink.
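A minimal sketch of this idea, with hypothetical constants and names (real Homa’s grant logic is more involved):

```python
LINK_BYTES_PER_USEC = 12_500   # hypothetical 100 Gbit/s downlink
RTT_USEC = 4.0                 # hypothetical round-trip time

def should_grant_next(current_remaining_bytes):
    """True when the data still due from the currently granted message will
    drain within one RTT, i.e. it is time to start granting the next
    message so its packets arrive just as the downlink frees up."""
    drain_time_usec = current_remaining_bytes / LINK_BYTES_PER_USEC
    return drain_time_usec <= RTT_USEC
```

A protocol without message-length information cannot compute the remaining bytes of an incoming flow, so it must instead wait for buffer occupancy to change before ramping up other senders.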
Priorities give Homa an additional advantage. With its use of priorities, Homa can provide low latency for short messages even when there is significant buffer buildup. Buffering is necessary in order to maintain high link utilization in the face of fluctuations in traffic. With protocols like TCP and PowerTCP, which do not use priority queues, it is difficult to optimize both throughput and latency. Buffering improves throughput but impacts the latency of short messages; if buffering is reduced to improve the performance of short messages, overall throughput will be affected. Homa can achieve the “best of both worlds”: buffers accumulate in the low-priority queues to maintain throughput, but short messages have higher priority, so they can bypass those queues.
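To illustrate, here is a sketch of the kind of priority mapping Homa uses for unscheduled packets. The cutoff values below are hypothetical (Homa derives its cutoffs from the measured workload), and the names are not from any Homa implementation.

```python
# Hypothetical cutoffs: smaller messages map to higher priority levels.
UNSCHEDULED_CUTOFFS = [   # (max message bytes, priority level)
    (1_500, 7),           # messages that fit in one packet: highest priority
    (10_000, 6),
    (100_000, 5),
    (float("inf"), 4),
]

def unscheduled_priority(message_length):
    """Priority for a message's unscheduled packets, chosen by the sender
    based on the message's total length."""
    for max_bytes, priority in UNSCHEDULED_CUTOFFS:
        if message_length <= max_bytes:
            return priority
```

Scheduled (granted) packets use the remaining lower priority levels, assigned by the receiver so that the message with the fewest remaining bytes gets the most favorable of them; long messages therefore queue behind one another without delaying short messages.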
It is worth noting that PowerTCP addresses congestion in the network core fabric, while Homa does not. PowerTCP appears to offer significant benefits in today’s datacenters, which are based on TCP and suffer core congestion. At the same time, core congestion is not an issue for Homa. Core congestion only exists in current datacenters because they use flow-consistent routing, which is required by protocols such as TCP. With flow-consistent routing, core congestion can happen even at very low overall network utilization, when two long flows happen to randomly hash to the same link. In contrast, Homa uses packet spraying, which should eliminate core congestion as a significant factor, so there is no need for Homa to address it.
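The difference between the two routing disciplines can be sketched as follows (illustrative only; the hash function and uplink count are hypothetical):

```python
import random
import zlib

UPLINKS = 4   # hypothetical number of uplinks on a leaf switch

def flow_consistent_uplink(src, dst, sport, dport, proto):
    """All packets of a flow hash to the same uplink, so two long flows can
    collide on one link even while the other uplinks sit idle."""
    key = f"{src}:{dst}:{sport}:{dport}:{proto}".encode()
    return zlib.crc32(key) % UPLINKS

def sprayed_uplink():
    """Each packet independently picks an uplink, so a single message's
    packets spread across all core paths and no two flows can persistently
    overload one link."""
    return random.randrange(UPLINKS)
```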
It is also worth noting that Homa’s overcommitment mechanism consumes additional buffer space. Overall, PowerTCP is likely to use less buffer space in the steady-state than Homa. Homa assumes that modern switches have adequate buffer space for its overcommitment mechanism. If this assumption turns out to be false, then the Homa protocol will need to be modified, and this will likely affect its performance. See this article for a more thorough discussion of buffer utilization in Homa.
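As a rough, back-of-the-envelope illustration (all numbers hypothetical), the steady-state scheduled buffering on one downlink is bounded by roughly one BDP per overcommitted message:

```python
RTT_BYTES = 50_000        # hypothetical BDP (e.g., 100 Gbit/s x 4 usec)
OVERCOMMIT_DEGREE = 4     # hypothetical degree of overcommitment

# At most one BDP of granted-but-undelivered data per overcommitted message:
worst_case_scheduled = OVERCOMMIT_DEGREE * RTT_BYTES
print(f"~{worst_case_scheduled // 1000} KB of scheduled buffering per downlink")
```

Whether this fits comfortably depends on the switch’s total buffer and on whether that buffer is shared across ports, which is why the shared-buffer-pool issue noted earlier matters.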
Overall, it seems likely that PowerTCP provides significant value in TCP environments, but that Homa will outperform PowerTCP as long as it runs in an environment that supports packet spraying and has adequate buffer space.