A Critique of "dcPIM: Near-Optimal Proactive Datacenter Transport"
The paper "dcPIM: Near-Optimal Proactive Datacenter Transport" (SIGCOMM 2022) describes a new network transport protocol for datacenters, which applies a bipartite matching technique from switches to schedule network flows across a datacenter. The paper claims that dcPIM produces good results both in terms of tail latency and in terms of network utilization, and compares dcPIM to other protocols, including NDP and HPCC. The paper claims that dcPIM is superior to Homa, but it doesn’t actually compare to Homa; instead it compares dcPIM to a hobbled modification of Homa called “Homa Aelous”. This document makes the following points:
The paper should have compared against “real” Homa.
Homa’s bipartite matching mechanism is better than dcPIM’s.
Homa outperforms dcPIM in both latency and network throughput. The only situation in which dcPIM matches Homa’s performance is a special case for short messages in which dcPIM bypasses its bipartite matching mechanism.
Note: I raised these concerns with the dcPIM authors in early July of 2022. We had a Zoom meeting in early August of 2022 in which we discussed some of the issues. At the end of that call there were many unresolved issues and we agreed to continue discussing these. However, since then the authors have declined to engage in any further discussion about the paper; I invited them to offer counter-arguments to the points below, but they declined.
The paper benchmarks against Homa Aeolus, not Homa
Instead of comparing dcPIM to Homa in its evaluation, the paper compares dcPIM with “Homa Aeolus”, a modified version of the Homa protocol developed by the authors of the Aeolus paper. The paper justifies this by claiming that Homa is impractical because of excessive buffer usage. It cites the Aeolus paper as evidence of this. However, the Aeolus work is based on incorrect assumptions about switch buffers and its results are invalid; see this critique for more details.
The dcPIM paper also states the following in Section 4.1:
Our own evaluation confirms the observations made in Homa Aeolus: when realistic buffer sizes are used, Homa suffers from significant packet drops and low network utilization, especially for 100Gbps links.
The paper provides no specifics on exactly how Homa’s buffer usage was evaluated (e.g. what “realistic buffer sizes” are) or what results were produced. I asked the authors for more details, but they have declined to provide any. Based on our Zoom call, I think they may have partitioned the buffer space, but I have not been able to confirm this.
To the contrary, we have demonstrated that the Homa Linux implementation works fine with Mellanox 2410 switches (25 Gbps). Buffer space will be tighter on 100 Gbps switching chips, but there has not yet been a convincing demonstration of problems for Homa, and there are several unexplored avenues for reducing Homa’s buffer usage if that should become necessary. See this discussion of Homa’s buffer usage for more information.
Thus, it is premature to dismiss Homa because of its buffer usage.
The paper makes claims about Homa based on Homa Aeolus, but Homa Aeolus has inferior performance compared to Homa
The paper uses Homa Aeolus as a stand-in for Homa, claiming that it “makes a better tradeoff” than Homa. Homa Aeolus is a modification to Homa designed by the Aeolus authors to reduce buffer utilization. Although Homa Aeolus does reduce buffer utilization compared to Homa, it severely damages the Homa protocol (for example, it drops packets in a way that conflicts with Homa’s SRPT policy). The performance measurements in the Aeolus paper camouflage these deficiencies, but they are clearly visible in the dcPIM measurements. For example, the short-message tail latencies measured for Homa Aeolus in Figure 3 are far worse than those that have been published for Homa. Homa Aeolus also appears not to implement Homa’s incast optimization, which damages its performance in incast experiments.
Homa Aeolus does not fairly represent Homa’s performance. If it should become necessary to reduce Homa’s buffer usage, there are better ways to do it than Homa Aeolus.
Homa performs bipartite matching better than dcPIM
The primary contribution of the dcPIM paper is its new approach to bipartite matching. However, Homa also performs bipartite matching, and its approach is superior to dcPIM’s approach.
In Section 3.1 the dcPIM paper claims that Homa “essentially do[es] one round of matching”, and that dcPIM is therefore superior because it performs multiple rounds. This characterization is incorrect: Homa’s approach, which uses priority queues and overcommitment, is equivalent to multiple rounds of matching (typically 8), more than dcPIM’s 4 rounds. For a full discussion of how bipartite matching works in Homa, see this page.
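To make the equivalence concrete, the sketch below shows (in highly simplified form) how an SRPT priority queue plus overcommitment lets a receiver keep several senders authorized at once. This is my own illustration, not code from Homa: the class name, the overcommit and grant_bytes parameters, and the data structures are invented for the example, and the real protocol also handles unscheduled data, packet priorities, and pacing.

```python
import heapq

class GrantScheduler:
    """Simplified sketch of a Homa-style receiver: an SRPT priority queue
    plus overcommitment. Illustrative only, not Homa's actual implementation."""

    def __init__(self, overcommit=8, grant_bytes=10000):
        self.overcommit = overcommit    # how many senders may hold grants at once
        self.grant_bytes = grant_bytes  # bytes authorized per grant
        self.pending = {}               # sender -> ungranted bytes remaining

    def message_arrived(self, sender, length):
        # Register a new inbound message (unscheduled data is ignored here).
        self.pending[sender] = self.pending.get(sender, 0) + length

    def issue_grants(self):
        # Grant to the 'overcommit' senders with the fewest remaining bytes (SRPT).
        # Because several senders are authorized simultaneously, the downlink can
        # stay busy even if the top-priority sender pauses; this is what makes the
        # mechanism behave like multiple rounds of matching.
        grants = []
        for sender, remaining in heapq.nsmallest(self.overcommit,
                                                 self.pending.items(),
                                                 key=lambda kv: kv[1]):
            granted = min(self.grant_bytes, remaining)
            self.pending[sender] = remaining - granted
            if self.pending[sender] == 0:
                del self.pending[sender]
            grants.append((sender, granted))
        return grants
```

Calling issue_grants whenever outstanding grants are consumed keeps up to overcommit senders authorized at any time; that parameter plays roughly the role of the number of matching rounds discussed above.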
Homa’s approach to bipartite matching is more efficient than dcPIM’s:
Homa is asynchronous while dcPIM is synchronous. In dcPIM, time is divided into fixed-length intervals (roughly 5 RTTs each), and both matching and transmission are slotted into those intervals. This creates inefficiencies:
A new message must wait until the beginning of the next matching interval before it can begin matching; once matched, it must then wait until the next interval to begin data transmission. Thus, the total delay before data transmission can begin is about 7.5 RTTs on average (see the arithmetic sketch after this list).
If the data for a message does not exactly fit the length of a transmission interval, bandwidth is wasted. The channel mechanism reduces but does not eliminate this overhead.
Slack time must be added to maintain synchrony (e.g. 1.3 RTTs are allocated for each round of matching, instead of 1 RTT, to handle variations in speed among machines).
In contrast, Homa is asynchronous: every machine makes independent decisions. Messages can begin matching (and transmitting) immediately, and no slack is needed.
In Homa, senders transmit continuously while matching takes place (data packets are themselves part of the matching process); there is no delay between when a message appears and when data transmission can begin, compared to about 7.5 RTTs in dcPIM. This helps Homa in terms of both latency and network utilization.
Homa implements a closer approximation to run-to-completion than dcPIM. For example, dcPIM’s channel mechanism, which was introduced to reduce wasted bandwidth because of fixed-length intervals, means that a single message can only use ¼ of the incoming bandwidth of a receiver. In addition, the randomized elements of dcPIM’s matching mechanism tend to produce fair sharing rather than run to completion. Fair sharing results in higher latency than run to completion.
In dcPIM, messages < 1 BDP bypass the matching mechanism and are transmitted directly. This improves short-message latency but wastes network bandwidth. For example, suppose that sender A has matched with receiver B and is transmitting to B when a short message for C appears on A. A will stop transmitting to B in order to send the message to C; no one else will transmit to B while this is happening, so B’s downlink bandwidth will be wasted. Homa’s overcommitment mechanism allows buffered data from other senders to use the downlink bandwidth if the primary sender stops transmitting, which improves Homa’s network utilization.
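To spell out the 7.5 RTT figure mentioned in the list above, here is a back-of-the-envelope calculation. It assumes only the interval length quoted above (roughly 5 RTTs); the exact delay depends on dcPIM’s configuration, so treat the numbers as illustrative rather than as a model of the protocol.

```python
# Back-of-the-envelope estimate of dcPIM's delay before data transmission,
# assuming fixed-length intervals of about 5 RTTs (the figure quoted above).
# This is my own arithmetic, not a calculation taken from the dcPIM paper.

INTERVAL_RTTS = 5.0   # assumed length of one matching/transmission interval

# A message that arrives at a random time waits, on average, half an interval
# for the next matching interval to begin...
avg_wait_for_matching = INTERVAL_RTTS / 2        # ~2.5 RTTs

# ...then spends the matching interval being matched, and cannot begin
# transmitting data until the interval after that.
matching_time = INTERVAL_RTTS                    # ~5 RTTs

avg_delay = avg_wait_for_matching + matching_time
print(f"Average delay before data transmission: {avg_delay} RTTs")   # 7.5 RTTs
# Homa, by contrast, starts transmitting immediately (data packets are part
# of the matching process), so the corresponding delay is essentially zero.
```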
Homa has lower latency than dcPIM
Homa’s workload W5 is the same as the WebSearch workload for dcPIM, so it is possible to compare P99 slowdown for dcPIM at 60% load (Figure 3(d) in the dcPIM paper) vs. Homa at 80% load (Figure 12 in the Homa SIGCOMM paper). Even though Homa was measured under a higher load, its P99 slowdowns are better than dcPIM’s for all but the shortest messages:
For short messages (< 1 BDP), dcPIM is a bit faster (P99 slowdown about 1.1 vs. about 1.4 for Homa). The two protocols handle these messages virtually identically, so the difference is most likely due to differences in the simulation environments: dcPIM was simulated at 100 Gbps, Homa at 10 Gbps. For Homa, we found that slowdown for small messages is almost entirely due to link-level preemption lag, where a high-priority packet arriving at an egress port must wait for a lower-priority packet already in transmission to finish; this lag is longer at 10 Gbps than at 100 Gbps because packets take longer to transmit.
For mid-size messages, Homa is significantly faster (P99 slowdown 1.4-1.5 vs. 4-8 for dcPIM), even though it is running at a higher load.
For large messages (> 16 BDP), Homa P99 slowdown increases with message size, from about 2.1x up to about 16x for the very largest messages, with an average of about 4.7x. dcPIM lumps all large message sizes together, with an average P99 slowdown of about 9x.
Homa supports higher network utilization than dcPIM
dcPIM claims a maximum load of 0.84 for the WebSearch workload, with a range of 0.7-0.84 across all workloads considered. Homa claims a maximum load of 0.87 for the WebSearch workload, with a range of 0.87-0.92 across all workloads considered (see Figure 15).
Advantages of dcPIM
I am aware of two ways in which dcPIM is better than Homa:
dcPIM uses less buffer space.
If N senders begin transmitting to one receiver simultaneously, then N-1 BDPs of buffer space will accumulate at the receiver’s downlink. The receiver will have to “burn this off” by receiving the accumulated data before it can focus its link bandwidth on a single sender (a rough calculation of the burn-off time appears below). This will delay the highest-priority sender (effectively, Homa will use fair sharing for a while, as it receives the unscheduled data). dcPIM will not have this buffer buildup, so it can focus more quickly on a single preferred message (run to completion).
However, dcPIM’s channel mechanism limits any given sender to ¼ of the receiver’s bandwidth, so dcPIM effectively uses at least 4-way fair sharing throughout message transmission. For long messages, Homa will still be closer to run-to-completion than dcPIM.
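To put rough numbers on the burn-off effect described above, here is a simple calculation under idealized assumptions: each of N senders blind-transmits one BDP of unscheduled data at the same instant, and nothing else competes for the receiver’s downlink. The function and its parameters are mine; actual behavior depends on Homa’s unscheduled-byte limit and on cross-traffic.

```python
# Rough illustration of the incast "burn-off" described above, under idealized
# assumptions: N senders each blind-transmit one BDP of unscheduled data at the
# same instant, and nothing else competes for the receiver's downlink.
# These numbers are illustrative, not measurements.

def burn_off_rtts(num_senders: int) -> int:
    """Approximate RTTs of downlink time spent draining buffered unscheduled
    data before the receiver can devote its full bandwidth to one sender."""
    # N BDPs arrive during roughly one RTT, but the downlink drains only
    # 1 BDP per RTT, so about N-1 BDPs queue up and take about N-1 RTTs
    # to burn off.
    return max(num_senders - 1, 0)

for n in (2, 10, 50):
    print(f"{n:3d} simultaneous senders -> ~{burn_off_rtts(n)} RTTs of "
          f"fair sharing before Homa converges on a single sender")
```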