For workloads executing on distributed compute fabrics, performance is often limited by data movement rather than computation. As a result, efficient execution depends critically on how the workloads are partitioned and placed across hardware resources.
This page describes an algorithm I developed (code) for distributed systems that maps a task graph onto a hardware connectivity graph. The task graph captures computation and communication dependencies, while the hardware graph represents compute elements and their physical interconnects. The algorithm automatically explores a large search space through an iterative process and converges to communication-efficient task assignments while respecting dependency constraints.
Based on Ant Colony Optimization, the algorithm sends out a group of ants to explore possible paths through the hardware graph. Initially, paths are selected randomly, but over iterations selection becomes biased towards more efficient paths. To guide path selection, top-performing ants leave trails where they travel. And to avoid local minima, the trails that aren’t reinforced by top runners gradually fade away. Over time, the system converges on good solutions without exhaustively exploring every possibility.

mapping

The animation features a linear graph being mapped onto a mesh, showing how the algorithm finds one of the optimal solutions. Linear graphs are well-suited for developing mapping intuition because the optimal solutions are known. In fact, the mapping for such graphs can be entirely rule-based, following the obvious up/down/left/right/stair/snake patterns.
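For intuition, the snake pattern for a linear chain can be sketched in a few lines (a hypothetical illustration, not the algorithm’s own code):

```python
def snake_order(rows, cols):
    """Rule-based 'snake' mapping for a linear task chain:
    left-to-right on even rows, right-to-left on odd rows,
    so consecutive tasks always land on adjacent mesh nodes."""
    order = []
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in cs)
    return order

# Map a 16-task linear chain onto a 4x4 mesh:
placement = snake_order(4, 4)
# Every consecutive pair is exactly one hop apart (Manhattan distance 1):
hops = [abs(a[0] - b[0]) + abs(a[1] - b[1])
        for a, b in zip(placement, placement[1:])]
```

Every consecutive pair of tasks lands one hop apart, which is why linear graphs make a convenient sanity check for the mapper.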
However, today’s most prominent workloads are rarely linear. Whatever the term workload brings to mind, be it hyperscale orchestration, multi-agent systems, neural networks, or micro-op-level decompositions such as conv2d into im2col and gemm, the result is usually a non-linear graph like the one below. Mapping patterns for such graphs aren’t obvious, especially when the mapping space is constrained.

mapping

Consider mapping this non-linear task graph onto the smallest mesh it can fit into, a 4x4. Having a tight space means that every mapping decision directly impacts subsequent ones. This interdependence makes it challenging to construct a direct solution and apply it generally.
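To make that impact concrete, a candidate mapping can be scored by total communication hops: the sum of Manhattan distances between the mesh locations of communicating tasks. A minimal sketch (the function name and toy graph are illustrative, not the actual task graph):

```python
def mapping_cost(task_edges, placement):
    """Total communication hops for a candidate mapping.
    task_edges: (src, dst) pairs from the task graph;
    placement: task -> (row, col) location on the mesh."""
    total = 0
    for src, dst in task_edges:
        (r1, c1), (r2, c2) = placement[src], placement[dst]
        total += abs(r1 - r2) + abs(c1 - c2)
    return total

# Toy non-linear graph: a fork-join of four tasks.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
placement = {"a": (0, 0), "b": (0, 1), "c": (1, 0), "d": (1, 1)}
cost = mapping_cost(edges, placement)  # each edge is 1 hop -> 4 total
```

A mapper can compare candidate placements by this cost; the interdependence shows up as soon as one greedy placement forces a later task onto a distant node.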
————————————
The iterative method sits between these extremes: more guided than a brute-force search, yet far less rigid than a direct solution. It integrates a heuristic with stochastic elements to navigate the search space and find satisfactory solutions, including optimal ones, sometimes very quickly with proper biasing. But even without nuanced biasing, it’s only a matter of time until a good solution emerges.

Give Good Randomness a Chance
The iterative mapping employs Ants to search for an optimal path. Each Ant starts at a specific location in the hardware graph (e.g. the top-left corner in the animation above) and is given a number of choices on where to go next. Before hopping to the next location, the Ant assigns a task to the current location and removes it from the task graph, marking it as complete. Each Ant keeps hopping until all tasks from the task graph are gone.
This entire process represents an algorithm’s iteration - a single Ant Colony run. During the very first run, the Ants are unbiased. This means one randomly happens to find a more efficient path than the others. Just like in any competition, the winner gets to brag about how it did it: “I started in the top left, moved right, then down…” You get the idea. Having listened to the winner, the next Ant Colony run becomes slightly biased towards the randomly chosen path. This is good randomness though, as it produced the best result so far.
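In code, the biased hop selection can be sketched as a weighted random choice (names and the boost value are illustrative):

```python
import random

def choose_hop(choices, pheromone):
    """Pick the next hop with probability proportional to its
    pheromone level (random.choices normalizes the weights)."""
    weights = [pheromone[c] for c in choices]
    return random.choices(choices, weights=weights, k=1)[0]

# First run: uniform pheromone, so the choice is purely random.
pheromone = {"right": 1.0, "down": 1.0, "left": 1.0, "up": 1.0}
hop = choose_hop(list(pheromone), pheromone)

# After the winner reports "right, then down", those hops get a
# boost (value illustrative) and become likelier, not guaranteed.
pheromone["right"] += 0.75
pheromone["down"] += 0.75
```

Because the boost only skews the weights rather than fixing the path, losing hops still get picked occasionally, which is what keeps exploration alive.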
To keep producing improvements, the heuristic ought to maintain just the right conditions for randomness to work its magic: providing guidance while still allowing deviation. Here’s how this is done in practice:
# Ant's path is constructed by randomly choosing hops.
# Initially, the probabilities are uniformly distributed.
choices =       [c1, c2, ..., cN]
probabilities = [p1, p2, ..., pN]
# After normalization:
p1 = p2 = ... = pN = 1/N

# After an iteration, hops in the top performer's
# path (e.g., c1, c2) receive a probability boost.
choices =       [c1,     c2,     ..., cN]
probabilities = [p1 + x, p2 + x, ..., pN]
# After normalization:
[p1, p2, ..., pN] = [p1 + x, p2 + x, ..., pN]/sum(P)

# Since the first path was chosen randomly,
# the boost shouldn't be excessive.
# A simple model with pheromone deposit and 
# evaporation will do:
deposit = AntHeuristicParams.PHEROMONE_DEPOSIT
rate = AntHeuristicParams.PHEROMONE_EVAPORATION_RATE
x = deposit + (1 - rate)*current_pheromone 

# Using deposit = 1.0 and evaporation rate = 0.247:
# The boost for hops in the top performer's path:
x = 1.0 + (1 - 0.247)*1.0 = 1.753
# The boost for other hops:
x = (1 - 0.247)*1.0 = 0.753

# These numbers will translate into probabilities below.

# In addition to keeping history of previous choices, 
# path selection is also guided by the energy and time 
# it takes to travel a path. As part of the heuristic, 
# both components depend on the Manhattan distance
# between nodes in the hardware connectivity graph:
e_route = AntHeuristicParams.ENERGY_ROUTE_ONE_BYTE
e_link = AntHeuristicParams.ENERGY_LINK_ONE_BYTE
energy_per_byte = (dist + 1)*e_route + dist*e_link
n_bytes = dag.edges[edge]['weight']
energy_per_transfer = n_bytes*energy_per_byte
tau = 1/energy_per_transfer

t_compute = dag.nodes[node]['weight']
t_communicate = dag.edges[edge]['weight']*dist
time_total = t_compute + t_communicate
phi = 1/time_total 

# Weights alpha and beta adjust the importance of 
# energy and time components respectively:
p = (tau**alpha)*(phi**beta)

# Example calculation with dist=3, n_bytes=1024:
# (plugging in 1 femtojoule per byte for energy
# and a 128-bit bus @ 200 MHz for timing):
n_bytes = 1024
dist = 3
e_route = 1
e_link = 1
energy_per_byte = (3 + 1)*1 + 3*1 = 7
energy_per_transfer = 1024*7 = 7168
tau = 1/7168 = 0.000139509
t_compute = 290
# 1024 bytes over a 16-byte bus = 64 cycles = 320 ns @ 200 MHz:
t_communicate = 320*3 = 960
time_total = 290 + 960 = 1250
phi = 1/1250 = 0.0008
# With alpha = beta = 0.2:
p = (0.000139509**0.2)*(0.0008**0.2) = 0.0407

# After adding pheromone bias:
# Top performer edges:
p = 0.0407 + 1.753 = 1.7937
# Others:
p = 0.0407 + 0.753 = 0.7937

# The final probabilities after the first iteration 
# (example for a 4x4 mesh / 240 edges and
# a 19-edge graph from the above):
sum_all = 1.7937*19 + 0.7937*221 ≈ 209.49
prob_top = 1.7937 / 209.49 ≈ 0.856%
prob_other = 0.7937 / 209.49 ≈ 0.379%

# In other words, the edges in the top performer's
# path are only about half a percent likelier to be 
# chosen in the next iteration (0.856-0.379=0.477%).

# Sanity check:
19*0.856% + 221*0.379% ≈ 100%

# The algorithm simply iterates from here.
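The worked example above can be verified end-to-end with a short script (assuming alpha = beta = 0.2, the exponents used in the example, and 240 candidate edges with 19 on the winning path):

```python
# Pheromone boosts (deposit = 1.0, evaporation rate = 0.247):
deposit, rate, current = 1.0, 0.247, 1.0
x_top = deposit + (1 - rate) * current    # 1.753 for the winner's hops
x_other = (1 - rate) * current            # 0.753 for the rest

# Heuristic components for dist = 3, n_bytes = 1024:
dist, n_bytes = 3, 1024
energy_per_byte = (dist + 1) * 1 + dist * 1     # 7
tau = 1 / (n_bytes * energy_per_byte)           # ~0.000139509
phi = 1 / (290 + 320 * dist)                    # 1/1250 = 0.0008
alpha = beta = 0.2
p = (tau ** alpha) * (phi ** beta)              # ~0.0407

# Probabilities after the first iteration (4x4 mesh,
# 240 candidate edges, 19 on the winning path):
p_top, p_other = p + x_top, p + x_other
total = 19 * p_top + 221 * p_other              # ~209.49
prob_top = p_top / total                        # ~0.856%
prob_other = p_other / total                    # ~0.379%
```
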

mapping

———————————
The result found by the algorithm for the task graph above is 25 hops. Given the constraints (fixed location for the first and last tasks [0 & 15], and the smallest possible mesh for this graph [4x4]), the algorithm quickly finds a good solution that is only 6 hops away from the optimal (unconstrained) one.
———————————

Guide The Algorithm
The random component is what helps the algorithm explore new possibilities. Tweaking probabilities and seeing how the search evolves can be really engaging. To get the most out of this, I recommend using a live plot to watch the convergence in action.
The number of individual edges and probabilities is typically quite large. Tuning them one by one is challenging and, in essence, not much different from rule-based mapping. Working with edges in groups is much more manageable. Below is an example where the edges are grouped by Manhattan distance (bin 1 = shortest, bin 16 = longest). Grouping makes it easier to see what the algorithm values as it iterates.
Once a pattern emerges, further bias can be applied to whole groups of edges instead of individual ones. For example, a uniform distribution means the algorithm values all edge lengths equally. But if the graph is linear, it makes sense to favor short edges much more than long ones. A function like radioactive decay can achieve that: mixing \(y = e^{-\text{edge\_length}}\) into the heuristic makes longer paths exponentially less likely to be chosen, essentially steering the algorithm towards more efficient paths. The animation below shows this first-order steering process.
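That group-level biasing can be sketched as follows (assuming a 4x4 mesh, where Manhattan distances range from 1 to 6; names are illustrative):

```python
import math

def decay_bias(edge_length):
    """Exponential-decay bias: short edges are favored, long
    edges become exponentially less likely to be chosen."""
    return math.exp(-edge_length)

# Group-level weights for edges binned by Manhattan distance:
bins = range(1, 7)   # possible distances on a 4x4 mesh
weights = {d: decay_bias(d) for d in bins}
total = sum(weights.values())
probs = {d: w / total for d, w in weights.items()}
# The shortest bin dominates; bin 6 is e^5 (~148x) less likely than bin 1.
```

In practice this bias would be multiplied into the heuristic alongside the pheromone term, so the decay shapes exploration without overriding what the Ants learn.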

biasing

For complex graphs, such first-order steering may not seem nuanced enough. But as with many things in tech, it’s more of a tradeoff between computer time and human time, and a bit of guidance goes a long way.