Comments on "SIRD: A Sender-Informed, Receiver-Driven Datacenter Transport Protocol"
The paper “SIRD: A Sender-Informed, Receiver-Driven Datacenter Transport Protocol” (NSDI 2025) describes a new receiver-driven transport protocol that improves on Homa by using buffer space more efficiently. The key idea is for senders to inform receivers when their uplinks are congested (i.e., a sender has received grants for multiple outgoing messages, so it cannot transmit all of them at once). This congestion information allows receivers to redirect grants to other senders that are not congested. As a result, receivers use their grants more efficiently, which allows them to reduce the degree of overcommitment, which in turn lowers buffer usage in top-of-rack switches (TORs).
The current design of Homa solves the congested-sender problem with overcommitment: a receiver issues grants to multiple incoming messages concurrently. If the highest-priority sender does not transmit its message immediately due to output congestion, messages from lower-priority senders can be received in order to keep the TOR downlink fully utilized. In practice, a degree of overcommitment of 4-8 is needed to ensure full downlink utilization under mixed workloads.
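To make the overcommitment mechanism concrete, here is a minimal sketch of a receiver granting to multiple incoming messages at once. All names and numbers are invented for illustration; this is not Homa's actual implementation.

```python
# Hypothetical sketch of Homa-style overcommitment (names and constants
# invented for illustration; not Homa's actual code).

def issue_grants(incoming_messages, overcommitment=4, window_bytes=10_000):
    """Grant to the `overcommitment` highest-priority messages (SRPT:
    shortest remaining bytes first) so that if the top sender stalls,
    lower-priority senders can keep the downlink busy."""
    by_priority = sorted(incoming_messages, key=lambda m: m["bytes_remaining"])
    grants = []
    for msg in by_priority[:overcommitment]:
        grants.append({"sender": msg["sender"],
                       "grant_bytes": min(window_bytes, msg["bytes_remaining"])})
    return grants

msgs = [
    {"sender": "A", "bytes_remaining": 5_000},
    {"sender": "B", "bytes_remaining": 50_000},
    {"sender": "C", "bytes_remaining": 20_000},
    {"sender": "D", "bytes_remaining": 1_000_000},
    {"sender": "E", "bytes_remaining": 8_000},
]
print(issue_grants(msgs))  # grants go to the 4 shortest messages: A, E, C, B
```

The downside, as described below, is that if all four granted senders transmit simultaneously, three of their transmissions pile up in TOR buffers.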
However, Homa’s overcommitment results in increased buffer utilization in the TOR: if all of the granted senders do in fact transmit their messages, only the highest priority message will actually get through to the receiver; the others will be buffered in the TOR. The higher the degree of overcommitment, the greater the worst-case buffering. As network speeds increase, buffer space is becoming an increasingly precious resource, so transport protocols must use buffers as efficiently as possible in order to avoid buffer overflow and dropped packets.
SIRD's feedback from senders allows receivers to reduce the degree of overcommitment while still maintaining high downlink utilization. Simulation-based measurements reported in the paper indicate that SIRD can reduce worst-case buffer utilization by a factor of 10 or more compared to Homa.
These results are very promising. It should be possible to implement SIRD-like sender feedback in Homa and thereby reduce Homa’s buffer utilization significantly. This task is now on my “to do” list.
The rest of this article discusses a few areas of concern or disagreement with the SIRD paper. These are mostly second-order issues and should not be taken as criticism of the overall idea. Overall, I think SIRD makes a very nice contribution that could result in significant improvements to Homa. I’m excited about implementing the SIRD ideas in Homa!
Simulations don’t include software overheads
The greatest risk with the SIRD measurements is that they were made with simulations, and simulations don’t include the software overheads that occur in real systems. My experience with Homa is that software overheads make a big difference. For example, our simulations of Homa produced round-trip times around 5 usec, but the best achievable RTT with the Linux kernel implementation of Homa is around 15 usec. The difference is due to software overheads in the Linux networking stack. Furthermore, software delays can be highly variable: it’s not unusual for packet processing to be delayed by hundreds of usec in Linux. Simulators don’t capture this variability.
Software delays interfere with the control loop between senders and receivers. SIRD achieves its efficiency by having only a very small level of overcommitment (about 1.5x, vs 4-8x in Homa). For this to work, SIRD receivers must receive feedback about sender congestion quickly so that they can reallocate grants; otherwise there will be insufficient outstanding grants to keep the receiver downlinks fully utilized and throughput will suffer.
It seems likely that in a real implementation of SIRD, the additional delays from software overheads will require a higher degree of overcommitment to keep links fully utilized: the longer it takes to react to changes in load, the more grants must be outstanding to keep the downlink busy. This in turn will result in higher buffer utilization. The interesting question is: how much higher? Sender feedback will probably still provide benefits for Homa, but the benefits are unlikely to be as large as those reported in the paper.
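A back-of-the-envelope calculation shows the relationship between reaction delay and required outstanding grants: to keep a downlink fully utilized, a receiver needs at least link_rate × reaction_delay bytes of grants outstanding. The numbers below are illustrative (using the 5 usec simulated vs. 15 usec real RTTs mentioned above), not measurements from the paper.

```python
def outstanding_bytes_needed(link_gbps, reaction_delay_usec):
    """Minimum granted-but-unreceived bytes needed to keep a downlink
    fully utilized while the receiver reacts to a stalled sender."""
    bytes_per_usec = link_gbps * 1e9 / 8 / 1e6   # bytes transmitted per usec
    return bytes_per_usec * reaction_delay_usec

# A 100 Gbps link transmits 12,500 bytes per usec, so every extra usec of
# reaction delay requires 12,500 more bytes of outstanding grants.
print(outstanding_bytes_needed(100, 5))    # 5 usec (simulated RTT) -> 62500.0
print(outstanding_bytes_needed(100, 15))   # 15 usec (real RTT)     -> 187500.0
```

Tripling the reaction delay triples the grants that must be outstanding, which is why software overheads could push SIRD's overcommitment (and hence its buffer usage) above the simulated numbers.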
In general, I am skeptical of the value of simulations in analyzing transport protocols. Given the dramatic increases that are occurring in network speed, without corresponding increases in CPU speed, software overheads now dominate performance both for latency and throughput. Without simulating those overheads it’s unclear that simulation results have much value.
Alternate feedback mechanisms
SIRD uses an ECN-like mechanism for senders to indicate that they are congested. The information they send to receivers consists of a single bit saying “My uplink is congested” without indicating how much it is congested or how long the congestion is likely to last. This requires receivers to use an AIMD (additive increase multiplicative decrease) approach for managing receive windows, which requires multiple RTTs to stabilize after a change in load. In addition, this approach interferes with SRPT priorities (a significant amount of traffic must be sent on a fair-sharing basis in order to support the congestion feedback).
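As I understand it, the receiver's per-sender window update behaves roughly like textbook AIMD. The sketch below is my own illustration of that behavior (constants and names invented, not SIRD's actual parameters); note how a load change takes several steps, i.e., several RTTs, to work through.

```python
def aimd_update(window_bytes, sender_congested,
                additive_step=1500, decrease_factor=0.5,
                min_window=1500, max_window=100_000):
    """One AIMD step for a per-sender receive window: grow linearly while
    the sender reports no congestion, cut multiplicatively when it sets
    its ECN-like "uplink congested" bit. (Illustrative constants only.)"""
    if sender_congested:
        return max(min_window, int(window_bytes * decrease_factor))
    return min(max_window, window_bytes + additive_step)

# Four feedback rounds: two clear, one congested, one clear.
w = 50_000
for congested_bit in [False, False, True, False]:
    w = aimd_update(w, congested_bit)
print(w)  # 50000 -> 51500 -> 53000 -> 26500 -> 28000
```

The multiplicative cut discards a lot of information: the receiver learns only that the sender was congested, not by how much, so it must probe its way back up one additive step per round.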
I think it should be possible for senders to provide more precise information to receivers, such as “I have too many higher priority messages right now, so I can’t transmit my message to you. I’m returning all of my grants for that message. I’ll get back in touch when I’m ready to transmit bytes to you, so you can send new grants.” This would allow receivers to make precise decisions about grants immediately, and it would eliminate any impact on the SRPT priority scheme.
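One hypothetical shape for such feedback messages is sketched below. Everything here is my own invention for illustration; it is not part of SIRD, Homa, or any wire format.

```python
from dataclasses import dataclass

# Hypothetical explicit-feedback packets a sender might use instead of a
# single congestion bit (invented for illustration; not an actual protocol).

@dataclass
class GrantReturn:
    """Sender -> receiver: 'I can't use these grants now; reallocate them.'"""
    message_id: int
    returned_bytes: int      # all currently unused grant bytes for this message

@dataclass
class ReadyToSend:
    """Sender -> receiver: 'I'm ready again; please issue fresh grants.'"""
    message_id: int
    bytes_remaining: int     # lets the receiver re-rank the message for SRPT

def on_uplink_congested(unused_grants):
    """Return grants for every message that lost the SRPT race on this uplink."""
    return [GrantReturn(mid, g) for mid, g in unused_grants.items() if g > 0]

print(on_uplink_congested({7: 20_000, 9: 0, 12: 5_000}))
```

With feedback like this, the receiver knows immediately exactly how many grant bytes it can reallocate and to whom, rather than inferring it over multiple AIMD rounds.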
One of the reasons for the SIRD approach was so that it could handle congestion in the network fabric (where only ECN bits are available) as well as at sender uplinks. However, as discussed elsewhere, I don’t expect congestion in the network fabric to be an issue for Homa, so there’s no need to tie the mechanism for sender congestion feedback to that used for fabric congestion.
Incast measurements not meaningful for Homa
Figure 6 of the SIRD paper measures buffer utilization under incast. However, the simulator used for Homa did not implement Homa’s incast optimization, so the Homa incast results are not meaningful. If the Homa incast optimization were implemented, I would expect Figures 6(g-i) to look the same as Figures 6(a-c) for Homa (i.e., no noticeable impact from incast).
Are priorities still needed?
The paper argues that SIRD can provide good performance with fewer priorities than Homa. The last paragraph of Section 6.2.4 seems to suggest that even a single priority level might be adequate for SIRD. I disagree with this conclusion: long messages may not need priorities for high performance, but short ones do. Figure 11 shows a 2-3x reduction in tail latency for short messages when a second priority level is used.
That said, it does appear that SIRD can get by with fewer priority levels than Homa (Homa sees continuing performance improvements up to about 4 priority levels). This could make a big difference in terms of ease of deployment, because priority levels appear to be a scarce resource in datacenters. It seems likely that a datacenter will have a priority level that is dedicated to high-priority traffic, and that the traffic on that level uses only a small fraction of available bandwidth (so there is no need to worry about contention within that priority level). If Homa only needed 2 priority levels, it could most likely share its higher priority level (which has low throughput) with the existing high priority level in the datacenter and send the rest of its traffic over the “normal” priority level. This would eliminate priority levels as a challenge in deploying Homa.
P99 measurements for mixed distributions are misleading
Figures 10 and 11 in the paper contain an “all” column with data aggregated from the entire workload. Unfortunately, the 99th-percentile (P99) measurements in these columns are not particularly meaningful and can easily be misinterpreted. For example, the “all” column in the right graph of Figure 11 seems to suggest that priorities do not impact tail latency overall (P99 numbers are about the same with and without priorities). Yet the B column shows that more than half of the messages experience a 2x reduction in P99 with priorities, so priorities really do matter (assuming that you care about short messages).
99th-percentile measurements only make sense if all of the data points come from the same distribution. If data from different distributions are aggregated, the P99 really only represents data from the worst of the constituent distributions. For example, suppose a workload consists of two message types: 90% of the messages are short ones that can be delivered in a few usec and 10% are longer ones that take 100 usec on average. The two message types have very different time distributions. Now consider the P99 latency for the combined workload. It will be determined entirely by the longer messages: even an average long message is likely to take longer than the slowest short message. The P99 for the entire workload will probably be the same as the P90 for the long messages. This is misleading because it completely ignores 90% of the traffic: if the times for short messages increased by a factor of 2x, there would likely be no change in the P99 for the combined workload.
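The two-distribution example above can be checked numerically. The latencies below are synthetic, chosen only to match the 90%/10% mix described (short messages of a few usec, long ones averaging 100 usec).

```python
import random

random.seed(42)
# Synthetic workload: 90% short messages (1-5 usec), 10% long (50-150 usec).
short = [random.uniform(1, 5) for _ in range(9000)]
long_ = [random.uniform(50, 150) for _ in range(1000)]

def p(data, q):
    """Simple q-th percentile: the value below which q% of the data falls."""
    s = sorted(data)
    return s[int(q / 100 * len(s)) - 1]

combined = short + long_
print(round(p(combined, 99), 1))   # combined P99 ...
print(round(p(long_, 90), 1))      # ... is exactly the long messages' P90

# Doubling every short-message time leaves the combined P99 unchanged:
doubled = [2 * t for t in short] + long_
print(round(p(doubled, 99), 1))
```

Because every long message here exceeds every short one, the combined P99 lands on the same sample as the long messages' P90, and a 2x slowdown of 90% of the traffic is completely invisible in the aggregate P99.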
Because of this, I recommend against reporting tail numbers such as P99 unless all of the measurements come from about the same distribution. It’s better to report separate P99s for each of the constituent distributions. Fortunately, the SIRD paper does this (e.g., the A, B, C, and D columns in Figures 10 and 11). I recommend focusing on those columns and ignoring the P99 for the “all” columns.
That said, average measurements for mixed distributions such as the “all” columns do still have some validity, since they incorporate information from all of the data points.