Bipartite Matching and Homa

TL;DR: Homa implements a novel and effective form of bipartite matching that allows it to achieve high network utilization while also respecting message priorities.

Why bipartite matching is good

For a transport protocol to make the best use of a datacenter network under heavy load, it needs to create a bipartite matching. A bipartite matching occurs when a group of hosts are paired up such that at any given time each host is sending to exactly one other host and receiving from exactly one other host (i.e., if a graph is created with edges from sending ports to receiving ports, each port participates in exactly one edge). A bipartite matching results in optimal network utilization: it allows each host link to run at full capacity in both directions.
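
To make the definition concrete, here is a minimal Python sketch (illustrative only; the function name and host representation are made up and are not part of Homa) that checks whether a set of (sender, receiver) pairs forms a perfect bipartite matching over a group of hosts:

```python
def is_perfect_matching(hosts, transfers):
    """Return True if the (sender, receiver) pairs in `transfers` pair up
    `hosts` so that every host sends to exactly one host and receives from
    exactly one host."""
    senders = sorted(s for s, _ in transfers)
    receivers = sorted(r for _, r in transfers)
    return senders == sorted(hosts) and receivers == sorted(hosts)

hosts = ["A", "B", "C"]
print(is_perfect_matching(hosts, [("A", "B"), ("B", "C"), ("C", "A")]))  # True
print(is_perfect_matching(hosts, [("A", "C"), ("B", "C"), ("C", "B")]))  # False: C receives twice, A never
```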

Without a bipartite matching, some links will be underutilized and buffers will accumulate in top-of-rack switches (TORs). Consider two examples, assuming a cluster of N hosts that all have the same uplink and downlink speeds. If two hosts send to the same target, the target’s downlink will be overloaded, resulting in buffer accumulation in the target’s TOR; in addition, some other host must be receiving no data, so its downlink is underutilized. Alternatively, if two destinations receive from the same source, the source’s uplink is split between them, so each destination’s downlink will only be half utilized; at the same time, the other N-2 receivers will not be able to keep up with the remaining N-1 senders, resulting in buffer accumulation.
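
As a tiny illustration of the first collision case (illustrative Python only; nothing here comes from Homa), the following snippet tallies the load offered to each downlink when two of four hosts pick the same target:

```python
# Four hosts, each trying to send one link-interval's worth of data, but
# A and B both pick C as their target.
sends = {"A": "C", "B": "C", "C": "D", "D": "A"}

offered_downlink = {host: 0 for host in sends}
for src, dst in sends.items():
    offered_downlink[dst] += 1

print(offered_downlink)   # {'A': 1, 'B': 0, 'C': 2, 'D': 1}
# C's downlink is offered twice its capacity (buffers accumulate in C's TOR)
# while B's downlink sits idle, so aggregate utilization is below 100%.
```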

(It’s possible to imagine schemes where hosts send or receive to/from multiple hosts simultaneously while keeping every uplink and downlink fully utilized, but from a latency standpoint it’s better for each host to dedicate its uplink or downlink to a single message, so that message finishes as quickly as possible. The bandwidth-sharing approach causes all messages to finish more slowly.)

Achieving good matchings is hard

Achieving perfect bipartite matchings (where every host is both transmitting and receiving at full speed) is very difficult in practice, for two reasons. First, the available messages may not permit a perfect bipartite matching; one example is a situation where every host is trying to transmit to the same target and has no messages for other destinations. When this happens, there is nothing the transport protocol can do to achieve a perfect bipartite matching.

Second, even if there are enough active messages to allow a perfect matching, it is difficult for the transport protocol to discover this and implement the matching. The most obvious approach is to collect information about all messages in a central place and schedule the messages from there, as in Fastpass [1]. However, it isn’t possible to build a central scheduler that can scale to manage the traffic of thousands of hosts in a modern datacenter. And even if this were possible, the centralized approach has other problems. A central scheduler introduces a delay of at least one round-trip time (RTT) between when a host reports its status and when it gets instructions; bandwidth is wasted while the host waits for instructions, and by the time the instructions arrive, new messages may have arisen, changing the optimal configuration. Furthermore, many messages in modern datacenters take less than one RTT to transmit (a fraction that grows as networks get faster), so central scheduling would waste significant amounts of bandwidth for them. Thus, a central scheduler is undesirable as well as infeasible.

Thus, even the best systems implement bipartite matchings that are imperfect: at times there will be hosts that could potentially be communicating with each other, but the system does not discover them. As a result, network bandwidth is wasted. No practical system can fully utilize 100% of a network’s bandwidth under typical operating conditions. The effectiveness of a protocol’s bipartite matching mechanism can be measured in terms of the network utilization it can support under given workloads.

Homa generates bipartite matchings

One of the interesting features of Homa is that it creates good bipartite matchings in an efficient and fully distributed fashion. Furthermore, Homa’s bipartite matching also respects priorities, where senders establish a priority order among their outgoing messages and receivers establish a priority order among incoming messages. In Homa’s case, priority is determined by the number of bytes remaining to transmit in a message. Where there are choices in the bipartite matching, Homa will favor the transmission of messages with fewer remaining bytes.

Homa implements priority-based bipartite matching using two resources in TORs (priority queues and buffer space) and two concepts: receiver-driven flow control and overcommitment. Before explaining how Homa implements bipartite matching, here is a quick review of Homa’s key features:

  • Homa is message based. A sender is allowed to transmit the first rtt_bytes of each message unilaterally, without advance permission from the receiver. These are called unscheduled bytes. For messages longer than rtt_bytes, the sender must receive explicit grants from the receiver; each grant permits the transmission of one or a few packets. The value of rtt_bytes is chosen to cover the round-trip time (including software time on the receiver), so that in an unloaded network the first grant will arrive before the unscheduled bytes have all been transmitted.

  • If a sender has multiple messages that are transmissible, it transmits them in increasing order of bytes remaining to send (the message with the fewest remaining bytes goes first).

  • If a receiver has a single incoming message, it will issue grants for that message so as to maintain rtt_bytes of granted but not yet received data at all times. This allows messages to be transmitted at the full network bandwidth.

  • If a receiver has multiple incoming messages, it uses overcommitment: it grants to multiple messages simultaneously, maintaining rtt_bytes of granted but not yet received data for each message. This means that the receiver’s downlink will be overcommitted and buffers will accumulate in the TOR. If the degree of overcommitment is N, then up to (N-1)*rtt_bytes of buffers may be occupied in the TOR. The maximum degree of overcommitment is a system parameter; 8 works well in practice (it allows network utilizations around 90%), but 4 is not much worse.

  • If the number of incoming messages for a receiver exceeds its maximum overcommitment, then it grants only to the highest priority messages (a sketch of this grant logic appears after this list).

  • Receivers use the priority queues in TORs to prioritize incoming messages. Each grant indicates the priority level that the sender should use for the granted packets; packets with different priorities use different queues in the switch egress ports, and packets from higher priority queues are transmitted preferentially. Thus, when multiple senders transmit to the same receiver, packets from the highest priority message (fewest remaining bytes) get to the receiver first.
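
To make these rules concrete, here is a rough Python sketch of a single grant pass at a receiver. It is a simplified model of the behavior described in the list above, not code from any Homa implementation; the names and values (Incoming, issue_grants, RTT_BYTES, MAX_OVERCOMMIT) are made up for illustration:

```python
from dataclasses import dataclass

RTT_BYTES = 60_000        # made-up value covering one round-trip time
MAX_OVERCOMMIT = 8        # made-up degree of overcommitment

@dataclass
class Incoming:
    """Receiver-side state for one partially received inbound message."""
    msg_id: int
    length: int            # total message length in bytes
    received: int = 0      # bytes received so far
    granted: int = 0       # highest byte offset granted so far

    @property
    def bytes_remaining(self):
        return self.length - self.received

def issue_grants(active):
    """One grant pass over the active incoming messages.

    Grants only to the MAX_OVERCOMMIT messages with the fewest bytes
    remaining, keeping RTT_BYTES of granted-but-not-yet-received data
    outstanding for each.  Returns (msg_id, grant_offset, priority)
    tuples; priority 0 is the highest.
    """
    grants = []
    by_priority = sorted(active, key=lambda m: m.bytes_remaining)
    for prio, msg in enumerate(by_priority[:MAX_OVERCOMMIT]):
        offset = min(msg.length, msg.received + RTT_BYTES)
        if offset > msg.granted:
            msg.granted = offset
            grants.append((msg.msg_id, offset, prio))
    return grants

# Three senders collide on one receiver; the shortest message (id 2) gets
# the highest priority, and each granted message stays one RTT ahead of
# what has actually been received.
msgs = [Incoming(msg_id=1, length=500_000,   received=100_000, granted=100_000),
        Incoming(msg_id=2, length=80_000,    received=60_000,  granted=60_000),
        Incoming(msg_id=3, length=2_000_000, received=100_000, granted=100_000)]
print(issue_grants(msgs))   # [(2, 80000, 0), (1, 160000, 1), (3, 160000, 2)]
```

In this model, up to MAX_OVERCOMMIT messages each have RTT_BYTES of granted-but-unreceived data outstanding while the downlink drains only one message’s worth, which corresponds to the (N-1)*rtt_bytes buffering bound mentioned in the list above.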

How does this mechanism implement priority-based bipartite matching? Imagine that a collection of outgoing messages appears simultaneously on all hosts in a cluster. Each host will transmit the first rtt_bytes of its highest priority message. By the time the unscheduled bytes have been transmitted, senders will begin receiving grants from receivers. In the unlikely event that each sender has chosen a different receiver, a perfect bipartite matching will have been achieved, honoring the priorities of the senders.

Of course, it is almost certain that there will be collisions, where multiple senders transmit to the same receiver. When this happens, the receiver will send grants with different priority levels to each sender. If the highest-priority sender responds to its incoming grants, its packets will get through to the receiver, while packets from lower priority messages will be queued in the TOR. Because a receiver issues new grants only as data arrives, and the lower priority messages’ data is stuck behind higher-priority packets in the TOR, those senders will receive no further grants; after sending rtt_bytes they will have to stop transmitting packets from those messages. When this happens, they will begin transmitting packets from their next lower priority messages. These additional messages will create more options for receivers, and the receivers will adjust their priority allocations to reflect the new messages. This process will continue, with senders trying new messages every RTT, until eventually the system settles into a state where receivers grant to their favorite senders, and senders transmit their favorite messages (among those that have received grants). Packets from lower priority messages will be queued either at senders or in switch queues.
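
The following Python sketch models this convergence as a sequence of discrete rounds, one per RTT. It is an idealized model invented for this explanation rather than Homa’s actual implementation: it ignores overcommitment and unscheduled bytes, and settle_matching and its data layout are made up:

```python
def settle_matching(sender_msgs):
    """Round-based model of how the matching settles (one round per RTT).

    sender_msgs maps each sender to a list of (receiver, bytes_remaining)
    messages.  Each round, every sender offers its shortest message that has
    not been passed over; every receiver grants the shortest message offered
    to it and passes over the rest; passed-over senders move on to their next
    message.  Returns the final {receiver: sender} matching.
    """
    passed_over = {s: set() for s in sender_msgs}    # message indices no longer granted
    while True:
        # Each sender offers its highest-priority (fewest remaining bytes)
        # message among those not yet passed over.
        offers = {}                                  # receiver -> [(bytes, sender, index), ...]
        for s, msgs in sender_msgs.items():
            candidates = [(b, r, i) for i, (r, b) in enumerate(msgs)
                          if i not in passed_over[s]]
            if candidates:
                b, r, i = min(candidates)
                offers.setdefault(r, []).append((b, s, i))
        # Each receiver grants to the shortest message offered to it.
        matching = {}
        changed = False
        for r, offered in offers.items():
            offered.sort()
            matching[r] = offered[0][1]
            for b, s, i in offered[1:]:              # everyone else is passed over
                passed_over[s].add(i)
                changed = True
        if not changed:
            return matching

# S1 and S2 both prefer R1; S1 wins it (fewest remaining bytes), S2 falls
# back to R3, and S3's only message waits in a queue.
print(settle_matching({
    "S1": [("R1", 100), ("R2", 900)],
    "S2": [("R1", 300), ("R3", 500)],
    "S3": [("R1", 700)],
}))                     # -> {'R1': 'S1', 'R3': 'S2'}
```

Re-running the model from an updated state when a message completes or a new one arrives mirrors the incremental adjustment described next: only the affected senders and receivers change their choices.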

A particular matching will be stable until messages complete or new messages arrive; at that point the matching will adjust automatically. For example, suppose a new message appears on a sender, with higher priority than any current message. The sender will stop transmitting its current message and begin transmitting the new one. As a result, packets for the old message will stop arriving at the destination TOR; this will allow packets from lower priority switch queues to be transmitted to that destination. As those packets arrive, the receiver will issue grants for the newly arriving message, and if that message’s sender responds to the grants, the receiver will seamlessly switch over to this new matching. This can cause a ripple effect, where the sender for the new message might now be neglecting some other message, which causes its receiver to switch to a new sender, and so on. Each “ripple” takes 1 RTT.

Homa generates matchings efficiently

The Homa matching mechanism is efficient. There is no delay while waiting for a scheduler to make a decision (there is no “scheduler” per se): each host immediately begins transmitting as soon as outgoing messages appear. If a sender receives no grants for its highest priority message, it immediately begins transmitting the next lower priority message, so its uplink will continue to be fully utilized unless it gets to a point where no receiver will grant to any of its outgoing messages. Packets buffered in TOR queues ensure high utilization of downlinks: if packets stop arriving for the current highest priority message, the TOR will immediately begin transmitting from the next highest priority message. Each receiver’s downlink will be fully utilized unless there is no sender willing to transmit to this receiver.

The Homa matching mechanism does not require synchronized decision making: each sender and receiver can make independent decisions, which combine to produce a good matching. Perturbations to the system, such as message arrivals and completions, cause the global matching to adjust automatically and incrementally.

Homa’s mechanism does not necessarily find the best possible bipartite matching (i.e. highest throughput). In particular, it gives first preference to priorities, and only secondary preference to keeping all senders and receivers busy. Thus, there may be situations where more links could be kept busy if some senders or receivers did not follow the priorities exactly (of course, discovering such situations would be quite challenging). In addition, the limit on overcommitment means that receivers may miss some opportunities to grant to willing senders.

Nonetheless, Homa’s matching mechanism performs well in practice. In simulations described in the SIGCOMM paper, Homa achieved overall network utilizations of 87-92% for different benchmarks. The Linux kernel implementation achieves throughputs above 80% (for workloads with enough large messages that software overheads don’t limit throughput).

Short messages

Messages shorter than rtt_bytes can be transmitted entirely with unscheduled packets, so they bypass the matching mechanism. However, Homa still assigns priorities to unscheduled packets so that shorter messages are favored. For full details on how this works, see the Homa papers.

Buffer usage

Homa’s matching mechanism does come at a price: it consumes TOR buffer space. For a full discussion of Homa’s usage of buffers (and whether it is problematic), see this page.

[1] J. Perry, A. Ousterhout, H. Balakrishnan, D. Shah, and H. Fugal, “Fastpass: A Centralized ‘Zero-Queue’ Datacenter Network,” SIGCOMM 2014, August 17-22, 2014, Chicago, IL, pp. 307-318.