Clustering
Embedded etcd metadata, gRPC routing, session takeover, and cluster message flow
Last Updated: 2026-02-12
FluxMQ clustering is designed around one idea: keep coordination data consistent across nodes, and keep payload data fast and local wherever possible.
This page explains how etcd, the inter-node transport, and the protocol brokers work together to route publishes, take over sessions, and (optionally) replicate durable queue logs.
Two Planes: Metadata vs Data
In clustered mode there are two planes:
- Metadata plane (etcd): “who owns what”, “who is subscribed”, “who is consuming”.
- Data plane (gRPC transport): the actual routed messages and takeover payloads.
The system stays understandable if you keep this split in mind: etcd does not stream user traffic; it tells nodes where to send it.
etcd Keyspace: What Lives Where
etcd is the source of truth for cluster metadata. FluxMQ maintains local in-memory caches (subscription and session-owner caches) to reduce etcd round trips; caches are kept up-to-date via etcd watches.
Key prefixes:
| Prefix | Meaning |
|---|---|
| `/mqtt/sessions/<client>/owner` | Session ownership (written with a lease so it expires on node death) |
| `/mqtt/subscriptions/<client>/<filter>` | Subscription registry for cross-node pub/sub routing |
| `/mqtt/queue-consumers/<queue>/<group>/<consumer>` | Queue consumer registry for cross-node queue delivery |
| `/mqtt/retained-data/*` and `/mqtt/retained-index/*` | Hybrid retained store (small payloads replicated; large payloads indexed) |
| `/mqtt/will-data/*` and `/mqtt/will-index/*` | Hybrid will store (same strategy as retained) |
| `/mqtt/leader` | Cluster leader election (used for coordination and visibility) |
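The key shapes above can be sketched as simple builders. This is an illustrative sketch, not FluxMQ's internal code; the helper names are ours, but the key layouts match the table.

```go
package main

import "fmt"

// sessionOwnerKey builds the session-ownership key for a client.
func sessionOwnerKey(clientID string) string {
	return fmt.Sprintf("/mqtt/sessions/%s/owner", clientID)
}

// subscriptionKey builds the registry key for one client's topic filter.
func subscriptionKey(clientID, filter string) string {
	return fmt.Sprintf("/mqtt/subscriptions/%s/%s", clientID, filter)
}

// queueConsumerKey builds the registry key for a queue consumer.
func queueConsumerKey(queue, group, consumer string) string {
	return fmt.Sprintf("/mqtt/queue-consumers/%s/%s/%s", queue, group, consumer)
}

func main() {
	fmt.Println(sessionOwnerKey("sensor-1"))
	fmt.Println(subscriptionKey("sensor-1", "telemetry/+"))
	fmt.Println(queueConsumerKey("orders", "billing", "worker-3"))
}
```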
Session Ownership: Why It Exists
MQTT sessions are stateful (inflight QoS 1/2, offline queue, subscriptions, will). In a cluster, you need a single node to be “the owner” at any time so publishes, acks, and retained/will management don’t split-brain.
Ownership is stored in etcd and written with a lease:
- If a node crashes, its lease expires and ownership keys disappear automatically.
- Nodes cache ownership locally for fast routing, but can fall back to etcd when needed.
Session Takeover (MQTT): End-to-End Flow
Takeover happens when a client reconnects to a node that is not the current owner.
The goal: move session state from the old owner to the new owner, and guarantee that only one node continues the session.
Important details:
- The takeover request uses the gRPC transport, not etcd.
- Ownership is updated after the state transfer completes (so the new owner can safely overwrite).
- The old owner closes the session as part of preparing the state.
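The ordering above can be sketched as a small sequence. This is a toy model under stated assumptions: the function and type names are ours, the remote call is mocked with an in-process struct, and the real transfer happens over gRPC with ownership recorded in etcd.

```go
package main

import (
	"errors"
	"fmt"
)

// sessionState stands in for the transferred snapshot
// (inflight QoS 1/2, offline queue, subscriptions, will).
type sessionState struct {
	clientID string
	inflight int
}

// oldOwner models the previous owner's side: it closes the live
// session as part of preparing the state, then hands back a snapshot.
type oldOwner struct {
	sessions map[string]*sessionState
}

func (o *oldOwner) PrepareTakeover(clientID string) (*sessionState, error) {
	st, ok := o.sessions[clientID]
	if !ok {
		return nil, errors.New("session not found")
	}
	delete(o.sessions, clientID) // old owner closes the session first
	return st, nil
}

// takeover follows the order described above: transfer state over the
// data plane first, update ownership metadata only after that succeeds,
// so the new owner can safely overwrite.
func takeover(clientID, newNode string, remote *oldOwner, owners map[string]string) error {
	st, err := remote.PrepareTakeover(clientID) // 1. state transfer (gRPC in reality)
	if err != nil {
		return err
	}
	owners[clientID] = newNode // 2. ownership update happens last
	fmt.Printf("took over %s (%d inflight) on %s\n", st.clientID, st.inflight, newNode)
	return nil
}

func main() {
	remote := &oldOwner{sessions: map[string]*sessionState{
		"veh-42": {clientID: "veh-42", inflight: 3},
	}}
	owners := map[string]string{"veh-42": "node-a"}
	_ = takeover("veh-42", "node-b", remote, owners)
	fmt.Println("owner now:", owners["veh-42"])
}
```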
Pub/Sub Routing Across Nodes
For “normal” pub/sub topics, the originating node does two things:
- Deliver locally to matching subscriptions.
- Forward to remote nodes that own sessions with matching subscriptions.
The routing decision is based on the subscription registry and the session-owner map.
Remote RoutePublish deliveries are dispatched to the local protocol broker by client ID namespace:
- MQTT clients: no protocol prefix
- AMQP 1.0 clients: `amqp:` prefix
- AMQP 0.9.1 clients: `amqp091-` prefix
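The dispatch rule can be sketched as a single prefix check. The helper name and returned broker labels are ours, not FluxMQ's; only the prefixes come from this page.

```go
package main

import (
	"fmt"
	"strings"
)

// brokerFor dispatches a remote RoutePublish delivery to a local protocol
// broker based on the client-ID namespace prefixes listed above.
func brokerFor(clientID string) string {
	switch {
	case strings.HasPrefix(clientID, "amqp091-"):
		return "amqp-0-9-1"
	case strings.HasPrefix(clientID, "amqp:"):
		return "amqp-1-0"
	default:
		return "mqtt" // no prefix: plain MQTT client
	}
}

func main() {
	for _, id := range []string{"sensor-1", "amqp:billing-svc", "amqp091-legacy-worker"} {
		fmt.Printf("%s -> %s broker\n", id, brokerFor(id))
	}
}
```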
Hybrid Retained and Will Storage
Retained messages and wills need to be available cluster-wide, but replicating large payloads through etcd is expensive.
FluxMQ uses a hybrid strategy (threshold configurable):
- Small payloads: store metadata + payload in etcd (replicated to all nodes).
- Large payloads: store payload in the owner’s local store; store metadata in etcd; fetch payload on-demand via gRPC.
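The placement decision can be sketched as a threshold check against the key prefixes from the table above. The function name, struct, and the 16 KiB cutoff are illustrative — the real threshold is configurable.

```go
package main

import "fmt"

// placement describes where a retained message lands under the hybrid strategy.
type placement struct {
	etcdKey     string // always written: metadata (and payload, if small)
	payloadHere bool   // true when the payload stays in the owner's local store
}

func planRetained(topic string, payload []byte, threshold int) placement {
	if len(payload) <= threshold {
		// Small: metadata + payload replicate to every node via etcd.
		return placement{etcdKey: "/mqtt/retained-data/" + topic}
	}
	// Large: only an index entry goes to etcd; peers fetch the payload
	// on demand over gRPC from the owning node.
	return placement{etcdKey: "/mqtt/retained-index/" + topic, payloadHere: true}
}

func main() {
	small := planRetained("status/node-a", make([]byte, 512), 16*1024)
	large := planRetained("firmware/latest", make([]byte, 1<<20), 16*1024)
	fmt.Println(small.etcdKey, small.payloadHere)
	fmt.Println(large.etcdKey, large.payloadHere)
}
```

The same split applies to wills via the `/mqtt/will-data/*` and `/mqtt/will-index/*` prefixes.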
Durable Queues Across Nodes
Queue consumers are registered cluster-wide so a node receiving a queue publish can find where consumption is happening.
Two distribution styles exist (configured via `cluster.raft.distribution_mode`):
- `forward`: the node that appends to the queue also delivers (or routes) messages to remote consumers.
- `replicate`: the queue log is replicated (Raft), so each node with consumers can deliver from its local log.
Queue Flow: Storage vs Real-Time Delivery
For replicated queues, FluxMQ intentionally separates:
- Storage path (durable): publish -> append to queue log (Raft for replicated queues)
- Delivery path (real-time): claim from log -> deliver to connected consumers (local or remote)
This keeps durability decisions in one path and transport fanout in another.
Scenario: Producer on node A, consumer connected on node B (B is not queue leader)
Assume queue replication is enabled and `cluster.raft.write_policy=forward`.
What Gets Stored vs What Gets Routed
- The queue log and consumer-group progress are stored (and replicated when enabled).
- Deliveries to live consumers are routed (they are not persisted as “delivery events”).
Acks, Cursors, and Consistency
For replicated queues:
- Acks, cursor updates, and PEL changes are leader-owned and replicated.
- If a client is connected to a non-leader node, the mutation is forwarded to the leader.
- If the leader is unreachable, the mutation fails rather than being applied locally.
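The follower-side rule can be sketched as follows. Function and type names are illustrative; the point is the two branches: forward to the leader, or fail — never apply locally.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a minimal stand-in for a cluster member.
type node struct {
	id       string
	isLeader bool
}

// applyMutation models how acks, cursor updates, and PEL changes are
// handled: only the queue leader applies them. A follower forwards; if
// the leader is unreachable, the mutation fails rather than being
// applied locally, so follower state never diverges from the committed log.
func applyMutation(local node, leaderReachable bool, mutation string) error {
	if local.isLeader {
		fmt.Printf("%s: applied %q, replicating to followers\n", local.id, mutation)
		return nil
	}
	if !leaderReachable {
		return errors.New("leader unreachable: " + mutation + " rejected")
	}
	fmt.Printf("%s: forwarded %q to leader\n", local.id, mutation)
	return nil
}

func main() {
	_ = applyMutation(node{id: "node-a", isLeader: true}, true, "ack msg-17")
	_ = applyMutation(node{id: "node-b"}, true, "ack msg-18")
	err := applyMutation(node{id: "node-b"}, false, "ack msg-19")
	fmt.Println("err:", err)
}
```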
This keeps failover behavior predictable: on leadership change, the new leader’s state matches what was committed.
Performance Summary
- `write_policy=forward` adds an extra hop on non-leader publishes.
- `sync_mode=true` increases publish latency (the leader waits for commit up to `ack_timeout`).
- `distribution_mode=forward` increases cross-node delivery traffic.
- `distribution_mode=replicate` increases cross-node replication traffic (but can make delivery more local).
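The knobs above might be combined like this. This is a hypothetical snippet: the key names follow this page, but the exact file layout and value formats may differ — see the configuration reference.

```yaml
cluster:
  raft:
    write_policy: forward        # non-leader publishes take an extra hop
    sync_mode: true              # leader waits for commit, up to ack_timeout
    ack_timeout: 5s
    distribution_mode: replicate # nodes deliver from their local replicated log
```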
Choosing distribution_mode: Forward vs Replicate
Both modes preserve the same queue semantics (consumer groups, ack/nack/reject, retry). The difference is primarily where delivery happens and which network traffic you pay for.
| Mode | forward | replicate |
|---|---|---|
| Where delivery happens | The node that appends also delivers, routing to remote consumers as needed | Nodes deliver from their local replicated log |
| What crosses the network | Per-message delivery traffic (leader -> consumer’s node) | Replication traffic (Raft log + snapshots) |
| Typical benefits | Simple operationally; good when most consumers are local to the leader; avoids replicating logs just for delivery | More local delivery; scales better when consumers are spread across nodes; reduces per-message routing load |
| Typical costs | More cross-node traffic when consumers are spread out; delivery depends on transport health to consumer nodes | More background overhead (replication, snapshots); delivery can lag behind replication during spikes |
| Best when | Consumers are concentrated on one node/zone; you want minimal replication overhead | Consumers are distributed; you want to minimize cross-node delivery fan-out |
Practical guidance:
- Start with `forward` unless you have a clear need for `replicate`.
- Prefer `replicate` for high fan-out queue consumption across many nodes.
- Measure: the right choice depends on your workload shape (publish rate, consumer placement, and payload sizes).
For multi-queue workloads, use replication groups to shard hot queues and reduce contention. See Queue Replication Groups (Raft).
Configuration Entry Points
- Cluster basics: Cluster configuration
- Replication and tuning: Configuration reference