The XMTP network is evolving. As we move toward full decentralization with D14N, Gateway nodes need smarter ways to decide which node handles a message. Until now, we only had one approach: stable hashing. It works well for production load balancing—but what about when you're debugging a tricky issue at 2am and need all traffic to hit a specific node? Or when you're rolling out a new node and want to gradually shift traffic to it?
That's why we built a configurable node selection system with five distinct strategies. Here's how each one works, why it matters, and how you can use it.
## The problem: one size doesn't fit all
Our original stable hashing selector does exactly what it sounds like—it deterministically maps topics to nodes using SHA-256 hashing. Messages for the same topic always route to the same originator node, which is great for message ordering consistency.
But operators kept running into walls:
- Testing and debugging: "I need to see exactly what Node 100 is doing with this traffic"
- Staged rollouts: "Route to our new node first, but fall back to others if it's down"
- Latency optimization: "Pick the closest node—I care about speed"
- Load experiments: "Let's see what pure random distribution looks like"
Without alternative strategies, operators had no lever to pull. So we built four new ones.
## The five strategies
### 1. Stable (default)
This is the original behavior, now formalized. Topics hash to consistent nodes, giving you "topic stickiness"—all messages for a topic route to the same originator, reducing ordering anomalies. If that node goes down, we seamlessly iterate to the next one.
Use this for: Production environments where message ordering matters.
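To make the mapping concrete, here's a minimal, self-contained sketch of the idea. The sorted node-ID slice, `banned` set, and byte-slice topic are simplifications standing in for the real registry and `topic.Topic` type:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"errors"
	"fmt"
)

// stableSelect hashes the topic with SHA-256 and maps it onto the node
// list, then walks forward past unavailable nodes. A hypothetical helper
// for illustration; the real selector works against the registry.
func stableSelect(topic []byte, nodes []uint32, banned map[uint32]bool) (uint32, error) {
	if len(nodes) == 0 {
		return 0, errors.New("no nodes in registry")
	}
	sum := sha256.Sum256(topic)
	start := binary.BigEndian.Uint64(sum[:8]) % uint64(len(nodes))
	// Same topic, same start index: topic stickiness. If the preferred
	// node is banned, iterate deterministically to the next one.
	for i := 0; i < len(nodes); i++ {
		candidate := nodes[(start+uint64(i))%uint64(len(nodes))]
		if !banned[candidate] {
			return candidate, nil
		}
	}
	return 0, errors.New("all nodes unavailable")
}

func main() {
	nodes := []uint32{100, 200, 300}
	id, _ := stableSelect([]byte("conversation/abc"), nodes, map[uint32]bool{})
	fmt.Println("selected node:", id) // always the same node for this topic
}
```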
### 2. Manual
Need surgical control? Manual mode lets you specify exactly which node(s) handle traffic. Configure a list of node IDs, and we use the first available one. Period.
Use this for: Debugging, testing specific node behavior, or isolating traffic during incident response.
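A sketch of the behavior, with a hypothetical `isAvailable` standing in for the real health check:

```go
package main

import (
	"errors"
	"fmt"
)

// manualSelect returns the first configured node that is reachable and
// errors out if none are; there is deliberately no fallback.
func manualSelect(configured []uint32, isAvailable func(uint32) bool) (uint32, error) {
	for _, id := range configured {
		if isAvailable(id) {
			return id, nil
		}
	}
	return 0, errors.New("no configured node is available")
}

func main() {
	up := map[uint32]bool{200: true}
	id, err := manualSelect([]uint32{100, 200}, func(n uint32) bool { return up[n] })
	fmt.Println(id, err) // 200 <nil>: node 100 is down, so node 200 is used
}
```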
### 3. Ordered
Like Manual, but with a safety net. You specify preferred nodes, and we try them in order—but if none of your preferred nodes are available, we automatically fall back to any healthy node in the registry.
Use this for: Staged rollouts where you want to prioritize new infrastructure but maintain resilience.
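The difference from Manual is just the extra fallback pass. A sketch, again with a hypothetical `isAvailable` health check:

```go
package selector

import "errors"

// orderedSelect tries the preferred nodes in order, then falls back to
// any healthy node in the registry.
func orderedSelect(preferred, registry []uint32, isAvailable func(uint32) bool) (uint32, error) {
	for _, id := range preferred {
		if isAvailable(id) {
			return id, nil
		}
	}
	// Safety net: no preferred node answered, so use anything healthy.
	for _, id := range registry {
		if isAvailable(id) {
			return id, nil
		}
	}
	return 0, errors.New("no node available")
}
```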
### 4. Random
Pure statistical load distribution. We pick a node uniformly at random using cryptographically secure randomness (crypto/rand). No topic affinity, no determinism—just even distribution.
Use this for: Load balancing experiments, or when you explicitly don't want topic stickiness.
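Uniform selection with crypto/rand looks roughly like this sketch:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// randomSelect picks uniformly from the candidates using crypto/rand,
// mirroring the strategy's use of secure randomness.
func randomSelect(nodes []uint32) (uint32, error) {
	if len(nodes) == 0 {
		return 0, fmt.Errorf("no nodes available")
	}
	n, err := rand.Int(rand.Reader, big.NewInt(int64(len(nodes))))
	if err != nil {
		return 0, err
	}
	return nodes[n.Int64()], nil
}

func main() {
	id, _ := randomSelect([]uint32{100, 200, 300})
	fmt.Println("selected node:", id) // no topic affinity across calls
}
```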
### 5. Closest
Selects the node with the lowest measured TCP latency. We probe each candidate node's connection time, cache the results (default: 5 minutes), and route to the fastest one.
Use this for: Latency-sensitive deployments, especially with geographically distributed nodes.
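Here's a hedged sketch of the probe-and-pick logic, assuming a hypothetical node-ID-to-address map from the registry and omitting the cache layer for brevity:

```go
package selector

import (
	"fmt"
	"net"
	"time"
)

// probeLatency measures how long a TCP dial to addr takes, bounded by timeout.
func probeLatency(addr string, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return 0, err
	}
	conn.Close()
	return time.Since(start), nil
}

// closestSelect probes every candidate and returns the fastest responder.
func closestSelect(addrs map[uint32]string) (uint32, error) {
	best, bestLatency := uint32(0), time.Duration(-1)
	for id, addr := range addrs {
		latency, err := probeLatency(addr, 2*time.Second) // default probe timeout
		if err != nil {
			continue // unreachable nodes are skipped
		}
		if bestLatency < 0 || latency < bestLatency {
			best, bestLatency = id, latency
		}
	}
	if bestLatency < 0 {
		return 0, fmt.Errorf("no node responded to probes")
	}
	return best, nil
}
```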
## How it works under the hood
All strategies implement a common interface:
```go
type NodeSelectorAlgorithm interface {
	GetNode(topic topic.Topic, banlist ...[]uint32) (uint32, error)
}
```

The optional banlist parameter is key to our fault tolerance. When a publish fails, the failed node is added to a request-scoped unavailable list, and we retry with a different node, up to 5 attempts. This means transient failures don't break message delivery.
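To make that retry flow concrete, here's a sketch of a publish loop built on the interface above. `publish` is a hypothetical stand-in for the real publish call, and `maxAttempts` mirrors the 5-attempt limit:

```go
const maxAttempts = 5

func publishWithRetry(selector NodeSelectorAlgorithm, t topic.Topic, publish func(uint32) error) error {
	var banned []uint32
	for attempt := 0; attempt < maxAttempts; attempt++ {
		nodeID, err := selector.GetNode(t, banned)
		if err != nil {
			return err // nothing left to select
		}
		if err := publish(nodeID); err == nil {
			return nil
		}
		// The failed node joins the request-scoped banlist before the retry.
		banned = append(banned, nodeID)
	}
	return fmt.Errorf("publish failed after %d attempts", maxAttempts)
}
```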
## Configuration
Switching strategies works via flags or environment variables:
| Option | Environment variable | Description |
|---|---|---|
| --payer.node-selector-strategy | XMTPD_PAYER_NODE_SELECTOR_STRATEGY | Strategy: stable, manual, ordered, random, closest |
| --payer.node-selector-preferred-nodes | XMTPD_PAYER_NODE_SELECTOR_PREFERRED_NODES | Comma-separated node IDs |
| --payer.node-selector-cache-expiry | XMTPD_PAYER_NODE_SELECTOR_CACHE_EXPIRY | Latency cache TTL (default: 5m) |
| --payer.node-selector-connect-timeout | XMTPD_PAYER_NODE_SELECTOR_CONNECT_TIMEOUT | TCP probe timeout (default: 2s) |
Example: Route to nodes 100 and 200 with automatic fallback:
```bash
export XMTPD_PAYER_NODE_SELECTOR_STRATEGY=ordered
export XMTPD_PAYER_NODE_SELECTOR_PREFERRED_NODES=100,200
```

## Security considerations
Be aware of these security implications:
- Closest strategy probing: TCP probes reveal connection patterns to network observers. Deploy this strategy only in trusted network environments.
- Manual misconfiguration: If configured node IDs don't exist in the registry, message delivery fails. Verify node IDs against the active registry before deployment.
- Configuration protection: An attacker with configuration access can redirect traffic via the manual strategy. Protect configuration with standard infrastructure security practices.
## What's next
This change is fully backward-compatible—existing deployments continue working exactly as before with stable hashing as the default. But now operators have the flexibility to optimize for their specific needs.
Check out the reference implementation in PR #1483 and the full XIP specification for all the details.
Isaac Hartford is a software engineer at XMTP Labs working on decentralized network infrastructure.