Response to Ivan Pepelnjak's Blog Post

In a recent blog post, Ivan Pepelnjak raised numerous objections to my position paper “It’s Time to Replace TCP in the Datacenter”. This article responds to those objections.

Let me begin with two overall responses:

  • In spite of all its criticisms of the position paper, Ivan’s posting doesn’t present any arguments that TCP is actually the right protocol for datacenters. The closest it gets is to argue that TCP avoids ordering issues that arise with message-based approaches, but it then suggests that TCP should be used in ways that introduce the same problems. See below for details. The blog post feels in many places like it’s just trying to find things to argue about, as opposed to addressing the issue of whether TCP is the right transport protocol for datacenters.

  • We need more data. The blog post makes numerous claims with no supporting data. I tried to base the position paper as much as possible on data, but there are several important areas where I don’t know of any data, such as the prevalence and causes of contention in datacenter core fabrics, or what datacenter workloads look like today. I hope this discussion will stimulate people at places like Google, Microsoft, and Amazon to measure their datacenters and make the data available publicly.

The rest of this article addresses many of Ivan’s specific concerns in detail. I have not attempted to respond to every one of his gibes, because that would make this article so long that no one would want to read it. But if you think there’s an important issue raised by Ivan that I haven’t responded to, let me know and I’ll try to respond.

Message reordering

Homa treats each message independently; it does not guarantee any ordering among concurrent messages. Ivan points out that this could be problematic if an application issues multiple concurrent requests and expects them to be processed in the same order they were issued. I agree that this is a potential problem. However:

  • It’s unclear to me how often this occurs in practice. In fact, Ivan himself seems to be skeptical; in a different argument he says (without supporting data) “Environments in which a single application client would send multiple independent competing messages to a single server tend to be rare.” This is an area where more data is needed.

  • As Ivan pointed out, there are many situations where the ordering doesn’t matter (such as any non-mutating operation); a message-based approach allows higher performance for these cases.

  • In situations where order matters, there are relatively simple solutions. For example, it’s fairly easy to add serial numbers to requests that need to be ordered and use them at the application (or RPC framework) level to ensure that the requests are executed in order; this code is simpler than the code TCP applications need in order to reassemble messages that are split across multiple receive buffers (see the sketch after this list). Or, an application can send multiple requests in a single message, and the server can then process them in order.

  • Later on, Ivan argues that concurrent TCP requests should be distributed across parallel streams to avoid head-of-line blocking; this will create the same ordering problem for TCP.
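
To make the serial-number approach concrete, here is a minimal sketch in Go (illustrative names throughout; this is not code from Homa or from any RPC framework) of a server-side reorderer that executes requests in issue order even though the transport delivers them unordered:

```go
package main

import "fmt"

// Request carries an application-assigned sequence number; Seq and
// Payload are illustrative fields, not part of any real framework.
type Request struct {
	Seq     uint64
	Payload string
}

// Reorderer releases requests in strict sequence order, buffering any
// that arrive early.
type Reorderer struct {
	next    uint64
	pending map[uint64]Request
}

func NewReorderer() *Reorderer {
	return &Reorderer{pending: map[uint64]Request{}}
}

// Submit accepts a request in arrival order and returns the requests
// (possibly none, possibly several) that are now ready to execute.
func (r *Reorderer) Submit(req Request) []Request {
	r.pending[req.Seq] = req
	var ready []Request
	for {
		next, ok := r.pending[r.next]
		if !ok {
			break
		}
		delete(r.pending, r.next)
		ready = append(ready, next)
		r.next++
	}
	return ready
}

func main() {
	r := NewReorderer()
	// Simulate arrival out of issue order, as an unordered transport allows.
	for _, req := range []Request{{2, "c"}, {0, "a"}, {1, "b"}} {
		for _, ex := range r.Submit(req) {
			fmt.Printf("execute #%d: %s\n", ex.Seq, ex.Payload)
		}
	}
}
```

The point is only that the bookkeeping is small: a counter and a map, considerably less code than reassembling message boundaries from a TCP byte stream.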

Software overheads

Ivan does not seem to believe that efficient load-balancing across server cores is an important issue, suggesting that my assertion on this “might indicate unfamiliarity with the state-of-the-art high-speed packet processing.” My claim is based on detailed measurements of the Linux kernel, which are discussed in the paper on the Linux implementation of Homa. If Ivan has contradictory data, I’d be delighted to see it.

Ivan then goes on to suggest that if a machine has a large number of open connections, “statistical distribution” will ensure good load balancing across cores. I do not believe this is the case. The problem is that even if a large number of connections exist, typically only a very small number of them are active at once (supporting data is available in papers such as A Microscopic View of Bursts, Buffer Contention, and Loss in Data Centers, by Ghabashneh et al.), and the active set varies over time, while a static allocation of connections to cores cannot adapt to those variations. Again, if Ivan has contradictory data, it would be great if he could present it.
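
As a toy illustration of this point (my own sketch with arbitrary parameters, not a measurement of any real system), the following Go program statically hashes connections to cores and then samples a small active set:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// core statically maps a connection to a core, as a hash-based
// steering scheme would; the hash choice is illustrative.
func core(connID int, numCores int) int {
	h := fnv.New32a()
	fmt.Fprintf(h, "%d", connID)
	return int(h.Sum32() % uint32(numCores))
}

func main() {
	const openConns = 10000 // connections that exist
	const activeConns = 8   // connections carrying traffic right now
	const numCores = 8

	load := make([]int, numCores)
	// Pick a small random active set from the open connections.
	for _, c := range rand.Perm(openConns)[:activeConns] {
		load[core(c, numCores)]++
	}
	// Typically prints an uneven distribution: some cores carry 2-3
	// active connections while others carry none.
	fmt.Println("active connections per core:", load)
}
```

Statistical smoothing works when the active set is large; when only a handful of connections carry traffic at any instant, a static hash routinely stacks several of them on one core while other cores sit idle.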

How many Gbps can one core sustain?

Ivan challenges my claim that “Driving a 100 Gbps network at 80% utilization in both directions consumes 10–20 cores just in the networking stack”, and asserts that “the Snap paper quoted as a reference definitely does not support that claim.” I stand by my claims about the Snap paper; let’s look at that paper in detail.

The Snap paper contains two different measurements of throughput per core. The first is in Table 1, discussed in the second paragraph of Section 5.1. In this experiment, Snap was able to achieve one-way throughput of up to 80 Gbps with a single core. However, this is a highly artificial experiment because it uses only a single core and thus is not scalable.

The second measurement of throughput per core is in Figure 6(b). This experiment is still overoptimistic because it uses only very large (1 MB) transfers, but it does use load balancing to scale across multiple cores, which is what would be required in any production implementation. At 80 Gbps of total offered load the “spreading” approach to load balancing requires about 7 cores, while the “compacting” approach requires about 4.5 cores. But this is still unidirectional throughput (small requests and large responses); I assume that scaling to 80 Gbps in both directions would double the core utilization (to 14 or 9 cores, depending on the load-balancing policy). And things will get even worse for workloads with smaller messages, where the software overhead per byte is higher.

Thus the Snap paper supports my claim of 10–20 cores to drive a 100 Gbps network at 80% utilization in both directions.

One of the discoveries I found most interesting in my measurements of the Linux kernel is the discrepancy between performance with a single core and performance with load balancing across multiple cores. For example, I measured Homa in both the single-core and load-balanced cases, and found a 3x increase in software overhead once transport-level processing is distributed across multiple cores. For Snap the discrepancy is even higher (4.5x or 7x, depending on the load-balancing policy), though the Snap paper did not explicitly discuss it. Scaling is really expensive! Unfortunately, the single-core measurement approach has been widely used in research papers (including, alas, some of mine); conclusions based on such measurements are questionable.

Popularity implies quality?

Ivan seems to suggest that because TCP is widely used in the datacenter, it must be the best approach: “Why is everyone still using TCP?”, he asks, if it is as problematic as I have suggested. This argument bewilders me, because I’m sure that Ivan understands the network effect and the power of an entrenched standard. Everyone is using TCP because… well, because everyone else is using it. It’s by far the best-supported option, and in many situations it’s the only viable option. Over the years, TCP has done many things right (especially in the wide area), so it has earned its place as a widely used standard. But that doesn’t mean it is automatically the best way to do everything in every environment. If there are reasons other than entrenchment to justify TCP’s use in the datacenter, I’d be interested in hearing them.

Workloads

Ivan objects to the workloads we used to evaluate Homa, such as a Hadoop cluster at Facebook. I agree that workload selection is very important for benchmarking. We did our best to find representative workloads and also to measure a variety of workloads with different behaviors (for example, the Homa measurements include 5 different message size distributions, all measured at production sites such as Google or Facebook). We couldn’t find any data on production traffic patterns (i.e., which machines communicate with which), so we used a uniform random distribution; I agree that this probably isn’t representative, and it could conceivably bias our results. However, Ivan’s suggestion that “this might exacerbate TCP shortcomings due to heavy incast” is completely speculative (and probably wrong: I think it’s more likely that our workloads erred on the side of spreading requests too evenly across machines rather than generating incast). Before dismissing our arguments based on this speculation, I would hope that he would put forward data on the right workloads for datacenter benchmarks (we’ll be happy to measure with them!) or at least join me in calling for more data.

Workloads, part 2: shared connections?

One of Ivan’s objections to the workloads we used for evaluation is that they use a single TCP stream per sender-receiver pair. Concurrent messages can end up sharing that connection, resulting in head-of-line blocking if a short message ends up behind a long one. Ivan asserted (without supporting evidence):

“… this is not how TCP is commonly used. Environments in which a single application client would send multiple independent competing messages to a single server tend to be rare.”

This does not match my personal experience. For example, in the RAMCloud project it was fairly common for a server to have multiple outstanding requests to the same peer; RAMCloud used a single connection in these situations and experienced significant head-of-line blocking. As more and more applications and services become multithreaded in order to take advantage of machines with large numbers of cores, it seems likely that concurrent requests from one application or service to another will become more common. I would be happy to see additional data on this issue.
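
As a back-of-envelope illustration of what the blocking costs (my numbers, not from the RAMCloud measurements): on a 25 Gbps link, a 1 MB message occupies the connection for roughly 320 microseconds, so a short request queued behind it on the same stream has its latency inflated by about that much, many times a typical datacenter round-trip time.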

Ivan also suggested using parallel sessions to eliminate head-of-line blocking. While this is possible, it could explode the number of connections and their associated state; my understanding is that connection state is already a significant issue for many datacenter applications even with a single connection per peer. And, it’s hard to manage such connections cleverly (e.g. by sharing a connection for short messages while using separate connections for long messages): for example, when issuing a request, it isn’t always possible to know how large the response will be.
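
To illustrate the bookkeeping that such cleverness would entail, here is a sketch in Go (illustrative names and threshold; the conn type stands in for a real TCP connection) that multiplexes short messages on a shared connection and dedicates connections to long transfers, assuming the sender can predict sizes, which, as just noted, it often cannot:

```go
package main

import "fmt"

const shortThreshold = 16 * 1024 // bytes; an arbitrary cutoff

type conn string // stand-in for a real TCP connection

type peerConns struct {
	shared conn   // short messages multiplex onto one shared connection
	bulk   []conn // each long transfer gets a dedicated connection
}

// pick chooses a connection based on predicted message size. State
// grows with each concurrent long transfer, and a mispredicted size
// reintroduces head-of-line blocking on the shared connection.
func (p *peerConns) pick(size int) conn {
	if size <= shortThreshold {
		return p.shared
	}
	c := conn(fmt.Sprintf("bulk-%d", len(p.bulk)))
	p.bulk = append(p.bulk, c)
	return c
}

func main() {
	p := &peerConns{shared: "shared"}
	for _, size := range []int{200, 1 << 20, 500, 4 << 20} {
		fmt.Printf("%8d bytes -> %s\n", size, p.pick(size))
	}
}
```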

Ivan asserted (without evidence) that using a single connection per host pair “skews the results so far that they become meaningless.” However, we have also made TCP measurements in which we used multiple parallel connections to eliminate head-of-line blocking (see Figure 8 in the Homa SIGCOMM paper). This does improve TCP’s performance, but Homa is still superior.

Finally, as one data point, I checked the gRPC framework and determined that it does indeed share connections. Details: I created a client that issued three concurrent asynchronous requests to the same server; using gdb to set breakpoints on the socket library calls, I was able to determine that gRPC used the same TCP socket for all three requests.
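
For concreteness, the client was structured roughly like the following Go sketch (the Echo service and its generated pb stub are hypothetical placeholders, the original test was not necessarily written in Go, and the actual determination came from gdb breakpoints on socket calls rather than anything in the code itself):

```go
package main

import (
	"context"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "example.com/echo" // hypothetical generated gRPC stubs
)

func main() {
	// One ClientConn per server; gRPC multiplexes RPCs onto it.
	conn, err := grpc.Dial("server:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	client := pb.NewEchoClient(conn)

	// Issue three concurrent requests to the same server; breakpoints
	// on the socket library calls showed all three requests using the
	// same underlying TCP socket.
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int32) {
			defer wg.Done()
			client.Echo(context.Background(), &pb.EchoRequest{Id: i})
		}(int32(i))
	}
	wg.Wait()
}
```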

All in all, the potential for head-of-line blocking (and/or the cost and complexity of trying to prevent it) represents a significant problem for TCP.

Bandwidth sharing

Ivan (and several of the blog commenters) objected to my use of the terms “bandwidth sharing” and “fair scheduling” to describe TCP’s congestion control. For example, Ivan seemed to think that TCP can’t really control this and that it doesn’t have a particular policy:

“It’s really hard to influence incoming traffic.”
“Bandwidth sharing is just a side effect of running many independent TCP sessions across a network using FIFO queuing.”
“While a TCP/IP stack might try to provide fair scheduling of outbound traffic (but often does not), all bets are off once the traffic enters the network.”

I agree that both the sender and the network can influence the arrival rate of packets for a stream, but the receiving transport layer also has significant control. TCP has chosen bandwidth sharing (by doing nothing to skew bandwidth usage): if it wanted to allocate its link bandwidth unevenly across active connections, it could do so by delaying ACKs or reducing window sizes.
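
To make the policy choice concrete, here is a purely conceptual sketch in Go (not real TCP stack code; names are illustrative) of the knob the receiver holds: divide a fixed buffering budget evenly and you reproduce TCP’s implicit even sharing; weight it and you skew bandwidth:

```go
package main

import "fmt"

// allocateWindows divides a receiver's buffering budget among active
// connections according to weights. Equal weights reproduce even
// sharing; unequal weights skew the senders' effective rates, since a
// sender cannot exceed the window the receiver advertises.
func allocateWindows(totalBytes int, weights []float64) []int {
	var sum float64
	for _, w := range weights {
		sum += w
	}
	windows := make([]int, len(weights))
	for i, w := range weights {
		windows[i] = int(float64(totalBytes) * w / sum)
	}
	return windows
}

func main() {
	// Even sharing (TCP's implicit policy) vs. a skewed allocation.
	fmt.Println(allocateWindows(1<<20, []float64{1, 1, 1, 1}))
	fmt.Println(allocateWindows(1<<20, []float64{8, 1, 1, 1}))
}
```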

Others argued in comments that bandwidth sharing has nothing to do with congestion control. I disagree. In order to manage congestion, the rate of incoming bytes must be reduced. There are many ways to do this. One way is to allow each connection to use L/N bandwidth, where L is the link speed and N is the number of active connections. Another way is to let a single connection use all the available bandwidth, then when that connection finishes, switch to a different connection, and so on. This choice is necessarily a part of congestion control: if there is no congestion, then there’s no need to decide how to share the link.

In a discussion comment, Minh argued “Why should TCP or any transport protocol care about scheduling? Queuing/scheduling is a network function.” Queuing and scheduling can be done at any level in a system; I don’t see why it must be done only in the network. Performing this at a lower level can be advantageous, because then higher levels don’t have to worry about it. But higher levels typically have more information to guide the scheduling decision. For example, the network layer doesn’t know about message lengths, so it can’t implement SRPT as effectively as the transport layer, which (at least in Homa’s case) does know about message lengths.
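
As a small sketch of why that information matters (illustrative Go with made-up names, not Homa’s implementation), here is SRPT over messages using a heap keyed on bytes remaining, an ordering that a switch seeing only individual packets cannot reconstruct:

```go
package main

import (
	"container/heap"
	"fmt"
)

type Message struct {
	ID        int
	Remaining int // bytes left to transmit
}

// srptQueue implements heap.Interface, ordered by fewest bytes remaining.
type srptQueue []*Message

func (q srptQueue) Len() int           { return len(q) }
func (q srptQueue) Less(i, j int) bool { return q[i].Remaining < q[j].Remaining }
func (q srptQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *srptQueue) Push(x any)        { *q = append(*q, x.(*Message)) }
func (q *srptQueue) Pop() any {
	old := *q
	m := old[len(old)-1]
	*q = old[:len(old)-1]
	return m
}

func main() {
	q := &srptQueue{{1, 1 << 20}, {2, 512}, {3, 64 << 10}}
	heap.Init(q)
	// The transport always serves the message with the fewest bytes
	// remaining. (A real transport would send one packet at a time and
	// re-insert the message with a reduced Remaining count; popping
	// whole messages keeps this toy short.)
	for q.Len() > 0 {
		m := heap.Pop(q).(*Message)
		fmt.Printf("transmit message %d (%d bytes)\n", m.ID, m.Remaining)
	}
}
```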

Application stacks generate higher latency than TCP

Ivan asserts (again, without data) “Application stacks often include components that generate much higher latency than what TCP could cause” and then says that it thus isn’t worth solving the TCP problem. I agree that application stacks are also a problem. For example, in measuring gRPC I was shocked to discover that the software overheads for a round-trip with small messages are about 60 microseconds in the unloaded case, whereas the time spent in the TCP stack and network is “only” 30 microseconds. It’s hard to imagine where all this time is going.

But I reach a different conclusion than Ivan. I’d argue that we need to fix all of these problems. Not only do we need a new transport protocol, but we also need a new super-efficient RPC framework. I’d also argue that it probably makes sense to work from the bottom up: first introduce a new transport, then a new RPC framework. If we go in the opposite direction and create a new RPC framework that gets the best possible performance out of today’s TCP, it may still turn out to be too slow when layered on a faster transport such as Homa.

Also, gRPC may add more overhead than TCP in the unloaded case, but TCP’s tail latency can easily reach 5–10 ms, which is far higher than the gRPC overheads.

Out-of-order packets

The position paper states that TCP expects packets to arrive in the same order in which they were transmitted, and that out-of-order arrivals are taken as an indication of packet drops and result in retransmissions; this creates problems for load balancing in both network hardware and system software. Ivan asserted that the need for in-order arrivals “might have been valid in certain TCP implementations decades ago but is no longer true.” After doing some more reading and consulting with colleagues who know more about the details of TCP than I do, I’ve learned that Ivan is right; mea culpa. TCP can tolerate moderate amounts of reordering pretty well. I have updated the position paper to correct this error.

However, my TCP expert sources have told me that asymmetries in datacenter networks (caused by device heterogeneity or link/switch failures) can cause significant reordering that exceeds TCP’s tolerance. In addition, there are performance optimizations in NICs and the Linux stack, such as LRO and GRO, that become ineffective in the presence of even modest packet reordering, resulting in significant performance degradation. Thus, both networking hardware and Linux kernel software attempt to deliver packets in order, resulting in the problems discussed in the position paper.

I’m not sure that the problems leading to in-order packet delivery are fundamental to TCP. It could probably be argued that, with sufficient reimplementation effort, TCP could achieve high performance with packet spraying. Of course, this would be a daunting task, given the entrenchment of the in-order assumption.

Flow-consistent routing

Ivan objected to my hypothesis that flow-consistent routing is responsible for virtually all of the congestion that occurs in the core of datacenter networks (congestion occurs if enough flows hash to the same link to overload the link). He pointed out that core links are faster than edge links, so a core link can handle traffic from at least a few nodes before it saturates. I agree with this point (though his claim of an order-of-magnitude difference in speed is higher than the numbers I’ve heard).

But even so, the core fabric will perform strictly better with packet spraying than with flow-consistent routing. With packet spraying, it’s unlikely that any one link will be congested for a significant period of time unless the entire core is overloaded (and I’ve been told that the big datacenter operators configure their systems with enough bandwidth headroom that this virtually never happens). Thus, if a network doesn’t use flow-consistent routing, I would expect essentially no core congestion; congestion would occur only on the TOR-to-host links. For example, failure of a link won’t cause congestion; it just reduces the headroom in overall core fabric bandwidth.

But this is just a hypothesis; maybe I’m missing something. I would love to see someone at Google, Amazon, Microsoft, Meta, etc. publish data about the amount of congestion that occurs in their core network fabrics and what its root causes are.

If TCP is not the answer, what is?

Ivan objected to the fact that the position paper focuses only on Homa as a replacement for TCP and does not address other alternatives. Discussing every possible transport protocol was not the goal of the position paper. The goals were (a) to show that TCP is problematic and (b) to demonstrate that all of TCP’s problems can be solved with a new protocol. I used Homa as an illustration of such a new protocol. Once people agree that TCP needs to be replaced, I’d be happy to debate the merits of various alternatives.

Even so, I do have opinions about some of the alternatives that Ivan mentioned. These opinions are only semi-educated and may not withstand scrutiny, so take them with a grain of salt:

  • InfiniBand: assuming you want to do RPCs, you’ll have to use either its reliable connections, which are stream-based and thus have the same problems as TCP streams, or its unreliable datagrams, which have all the awkwardness of UDP. InfiniBand congestion control is based on priority flow control (PFC), which I believe has serious problems that have been widely documented. InfiniBand’s one-sided RDMA operations can be very fast, but they only work well for tasks with relatively static data layouts (in general I’d argue that distributed shared memory is not a good abstraction for distributed systems, but that’s another discussion). I will say that the NICs developed in the InfiniBand world (especially those from Mellanox) are spectacularly fast; fortunately this NIC technology can also be used for protocols like TCP and Homa.

  • RoCE: it has all of InfiniBand’s problems and contaminates our Ethernet-based networks with PFC, which seems like a bad idea.

  • SCTP: I don’t know enough about this protocol to make meaningful comments, but it looks like it’s designed for WANs. Any particular reason to believe it will solve TCP’s problems in the datacenter?

  • QUIC: my knowledge is limited, but QUIC appears to be designed primarily for long-haul communication between browsers and Web servers. Thus, it addresses a variety of problems that made HTTP-over-TCP slow. I haven’t seen any indication that its designers considered issues related to low-latency communication in datacenters, so it would be fortuitous if it turned out to solve that problem. I have heard QUIC characterized as a protocol for RPCs, which would seem to make it similar to Homa, but its abstractions are reliable streams and unreliable datagrams, which sound more like TCP and UDP. Its congestion control mechanism is sender-based and has a lot of similarity with TCP’s. Thus there is reason to believe that QUIC would suffer from many of TCP’s problems if used for datacenter applications.

Just a niche?

In his conclusion Ivan claims that the position paper (and presumably Homa) focus on “niche” applications while ignoring “long-lived high-volume connections that represent the majority of traffic in most data centers.” That was definitely not our intent when we designed Homa. I think that the kinds of applications Homa is designed for, which use RPCs of a variety of sizes to communicate between servers, account for a significant fraction of datacenter network bandwidth, and that the future is likely to see even more applications of this sort (e.g., many machine learning applications fit this model). I will agree that TCP works pretty well for applications sending very large sequential streams of data where nothing matters but throughput, such as video streaming. But for almost everything else I think TCP is problematic.

Of course, if anyone has data that characterizes datacenter applications, I’d be happy to see it.

Conclusion

With the exception of out-of-order packet handling, I believe that all of my criticisms of TCP have held up through Ivan’s critique, and he agreed with several of them. And, I didn’t see any arguments that Homa’s approaches are actually wrong (other than the discussion about message ordering, which potentially affects TCP too). Although we disagree on several issues, these disagreements are mostly unrelated to the core tenets of the position paper and there doesn’t appear to be enough data available today to resolve them. Hopefully that data will become available in the future.