
Clustering

Embedded etcd metadata, gRPC routing, session takeover, and cluster message flow

Last Updated: 2026-02-12

FluxMQ clustering is designed around one idea: keep coordination data consistent across nodes, and keep payload data fast and local wherever possible.

This page explains how etcd, the inter-node transport, and the protocol brokers work together to route publishes, take over sessions, and (optionally) replicate durable queue logs.

Two Planes: Metadata vs Data

In clustered mode there are two planes:

  • Metadata plane (etcd): “who owns what”, “who is subscribed”, “who is consuming”.
  • Data plane (gRPC transport): the actual routed messages and takeover payloads.

The system stays understandable if you keep this split in mind: etcd does not stream user traffic; it tells nodes where to send it.
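The split can be sketched with plain data structures standing in for etcd and the gRPC transport (a minimal sketch; `session_owner` and `route_publish` are illustrative names, not FluxMQ APIs):

```python
# Metadata plane: who owns which session (normally read from etcd / a watch-fed cache).
session_owner = {"sensor-1": "node-b", "sensor-2": "node-a"}

def route_publish(client_id: str, payload: bytes, local_node: str) -> str:
    """Consult metadata to pick a destination, then move the payload on the data plane."""
    owner = session_owner.get(client_id)
    if owner is None:
        return "no-owner"           # no session anywhere: nothing to route
    if owner == local_node:
        return "delivered-locally"  # payload never leaves the node
    # Data plane: the payload itself would travel over gRPC, never through etcd.
    return f"forwarded-to-{owner}"
```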

etcd Keyspace: What Lives Where

etcd is the source of truth for cluster metadata. FluxMQ maintains local in-memory caches (subscription and session-owner caches) to reduce etcd round trips; caches are kept up-to-date via etcd watches.

Key prefixes:

| Prefix | Meaning |
| --- | --- |
| `/mqtt/sessions/<client>/owner` | Session ownership (written with a lease so it expires on node death) |
| `/mqtt/subscriptions/<client>/<filter>` | Subscription registry for cross-node pub/sub routing |
| `/mqtt/queue-consumers/<queue>/<group>/<consumer>` | Queue consumer registry for cross-node queue delivery |
| `/mqtt/retained-data/*` and `/mqtt/retained-index/*` | Hybrid retained store (small payloads replicated; large payloads indexed) |
| `/mqtt/will-data/*` and `/mqtt/will-index/*` | Hybrid will store (same strategy as retained) |
| `/mqtt/leader` | Cluster leader election (used for coordination and visibility) |
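As a sketch, the key layout above maps to simple path builders (the helper names are hypothetical):

```python
def session_owner_key(client_id: str) -> str:
    """Key holding the current owner node for a session (lease-backed)."""
    return f"/mqtt/sessions/{client_id}/owner"

def subscription_key(client_id: str, topic_filter: str) -> str:
    """Key registering a subscription for cross-node routing."""
    return f"/mqtt/subscriptions/{client_id}/{topic_filter}"

def queue_consumer_key(queue: str, group: str, consumer: str) -> str:
    """Key registering a queue consumer for cross-node delivery."""
    return f"/mqtt/queue-consumers/{queue}/{group}/{consumer}"
```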

Session Ownership: Why It Exists

MQTT sessions are stateful (inflight QoS 1/2 messages, offline queue, subscriptions, will). In a cluster, exactly one node must be “the owner” of a session at any time, so that publishes, acks, and retained/will management cannot split-brain across nodes.

Ownership is stored in etcd and written with a lease:

  • If a node crashes, its lease expires and ownership keys disappear automatically.
  • Nodes cache ownership locally for fast routing, but can fall back to etcd when needed.
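A minimal sketch of the cache-with-fallback behavior, with etcd stubbed as a dict (`OwnershipCache` and its methods are illustrative, not FluxMQ code):

```python
class OwnershipCache:
    """Local owner cache with an etcd fallback (etcd stubbed as a dict)."""

    def __init__(self, etcd: dict):
        self.etcd = etcd   # stand-in for the replicated /mqtt/sessions/<client>/owner keys
        self.cache = {}

    def owner(self, client_id: str):
        if client_id in self.cache:
            return self.cache[client_id]  # fast path: no etcd round trip
        owner = self.etcd.get(f"/mqtt/sessions/{client_id}/owner")
        if owner is not None:
            self.cache[client_id] = owner
        return owner

    def invalidate(self, client_id: str):
        # In practice this is driven by etcd watch events (e.g. lease expiry on node death).
        self.cache.pop(client_id, None)
```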

Session Takeover (MQTT): End-to-End Flow

Takeover happens when a client reconnects to a node that is not the current owner.

The goal: move session state from the old owner to the new owner, and guarantee that only one node continues the session.

Important details:

  • The takeover request uses the gRPC transport, not etcd.
  • Ownership is updated after the state transfer completes (so the new owner can safely overwrite).
  • The old owner closes the session as part of preparing the state.
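The ordering above can be sketched as follows, with dicts standing in for the gRPC transfer and etcd (all names are illustrative):

```python
def take_over(client_id: str, old_sessions: dict, new_sessions: dict,
              etcd: dict, new_node: str) -> None:
    """Takeover ordering sketch: transfer state over gRPC first, flip etcd ownership last."""
    # 1. gRPC request to the old owner; it packages the state and closes its copy.
    state = old_sessions.pop(client_id)
    # 2. The new owner installs the transferred state.
    new_sessions[client_id] = state
    # 3. Only after the transfer completes is ownership updated in etcd,
    #    so the new owner can safely overwrite the key.
    etcd[f"/mqtt/sessions/{client_id}/owner"] = new_node
```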

Pub/Sub Routing Across Nodes

For “normal” pub/sub topics, the originating node does two things:

  1. Deliver locally to matching subscriptions.
  2. Forward to remote nodes that own sessions with matching subscriptions.

The routing decision is based on the subscription registry and the session-owner map.

Remote RoutePublish deliveries are dispatched to the local protocol broker by client ID namespace:

  • MQTT clients: no protocol prefix
  • AMQP 1.0 clients: amqp:
  • AMQP 0.9.1 clients: amqp091-
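A sketch of that dispatch, assuming the prefixes above are matched literally at the start of the client ID (`broker_for` is an illustrative name):

```python
def broker_for(client_id: str) -> str:
    """Pick the local protocol broker from the client-ID namespace prefix."""
    if client_id.startswith("amqp091-"):
        return "amqp091"
    if client_id.startswith("amqp:"):
        return "amqp10"
    return "mqtt"  # no prefix: plain MQTT client
```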

Hybrid Retained and Will Storage

Retained messages and wills need to be available cluster-wide, but replicating large payloads through etcd is expensive.

FluxMQ uses a hybrid strategy (threshold configurable):

  • Small payloads: store metadata + payload in etcd (replicated to all nodes).
  • Large payloads: store payload in the owner’s local store; store metadata in etcd; fetch payload on-demand via gRPC.
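A sketch of the small/large decision, with an assumed 16 KiB threshold (the real threshold is configurable) and dicts standing in for etcd and the owner's local store:

```python
THRESHOLD = 16 * 1024  # illustrative value; the actual threshold is configurable

def store_retained(topic: str, payload: bytes, etcd: dict,
                   local_store: dict, node: str) -> str:
    """Hybrid retained-store sketch: replicate small payloads, index large ones."""
    if len(payload) <= THRESHOLD:
        # Small: payload travels with the metadata, replicated to all nodes.
        etcd[f"/mqtt/retained-data/{topic}"] = payload
        return "etcd"
    # Large: payload stays on the owner; etcd only records where to fetch it (via gRPC).
    local_store[topic] = payload
    etcd[f"/mqtt/retained-index/{topic}"] = node
    return "local+index"
```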

Durable Queues Across Nodes

Queue consumers are registered cluster-wide so a node receiving a queue publish can find where consumption is happening.

Two distribution styles exist (configured via cluster.raft.distribution_mode):

  • forward: the node that appends to the queue also delivers (or routes) messages to remote consumers.
  • replicate: the queue log is replicated (Raft), so each node with consumers can deliver from its local log.
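The two styles can be contrasted in a small sketch (node names and the `delivery_plan` helper are illustrative, not FluxMQ APIs):

```python
def delivery_plan(mode: str, appender: str, consumer_node: str):
    """Which node serves a queue message to a consumer, per distribution mode."""
    if mode == "forward":
        # The appending node delivers, routing across the network when the
        # consumer is connected elsewhere.
        if consumer_node == appender:
            return ("local", appender)
        return ("route", appender)
    if mode == "replicate":
        # Every replica holds the log, so the consumer's own node delivers locally.
        return ("local", consumer_node)
    raise ValueError(f"unknown distribution mode: {mode}")
```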

Queue Flow: Storage vs Real-Time Delivery

For replicated queues, FluxMQ intentionally separates:

  • Storage path (durable): publish -> append to queue log (Raft for replicated queues)
  • Delivery path (real-time): claim from log -> deliver to connected consumers (local or remote)

This keeps durability decisions in one path and transport fanout in another.

Scenario: Producer on node A, consumer connected on node B (B is not queue leader)

Assume queue replication is enabled and cluster.raft.write_policy=forward.
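A hypothetical configuration fragment for this scenario. Only `write_policy`, `distribution_mode`, `sync_mode`, and `ack_timeout` are named on this page; the surrounding structure and the example values are assumptions:

```yaml
cluster:
  raft:
    write_policy: forward     # non-leader publishes are forwarded to the queue leader
    distribution_mode: forward  # or "replicate": deliver from each node's local log
    sync_mode: true           # leader waits for commit before acking the publish
    ack_timeout: 5s           # upper bound on that wait
```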

What Gets Stored vs What Gets Routed

  • The queue log and consumer-group progress are stored (and replicated when enabled).
  • Deliveries to live consumers are routed (they are not persisted as “delivery events”).

Acks, Cursors, and Consistency

For replicated queues:

  • Acks, cursor updates, and PEL changes are leader-owned and replicated.
  • If a client is connected to a non-leader node, the mutation is forwarded to the leader.
  • If the leader is unreachable, the mutation fails rather than being applied locally.
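The leader-owned rule can be sketched as follows, with `forward` standing in for the gRPC call to the leader (all names are illustrative):

```python
def apply_ack(mutation: str, local_node: str, leader, forward):
    """Leader-owned mutation sketch: apply on the leader, forward from followers,
    and fail outright rather than applying locally when no leader is reachable."""
    if leader is None:
        raise RuntimeError("no leader reachable: mutation rejected, not applied locally")
    if local_node == leader:
        return f"committed-on-{leader}"   # leader applies and replicates
    return forward(leader, mutation)      # follower forwards instead of applying
```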

This keeps failover behavior predictable: on leadership change, the new leader’s state matches what was committed.

Performance Summary

  • write_policy=forward adds an extra hop on non-leader publishes.
  • sync_mode=true increases publish latency (leader waits for commit up to ack_timeout).
  • distribution_mode=forward increases cross-node delivery traffic.
  • distribution_mode=replicate increases cross-node replication traffic (but can make delivery more local).

Choosing distribution_mode: Forward vs Replicate

Both modes preserve the same queue semantics (consumer groups, ack/nack/reject, retry). The difference is primarily where delivery happens and which network traffic you pay for.

| Mode | forward | replicate |
| --- | --- | --- |
| Where delivery happens | The node that appends also delivers, routing to remote consumers as needed | Nodes deliver from their local replicated log |
| What crosses the network | Per-message delivery traffic (leader -> consumer’s node) | Replication traffic (Raft log + snapshots) |
| Typical benefits | Simple operationally; good when most consumers are local to the leader; avoids replicating logs just for delivery | More local delivery; scales better when consumers are spread across nodes; reduces per-message routing load |
| Typical costs | More cross-node traffic when consumers are spread out; delivery depends on transport health to consumer nodes | More background overhead (replication, snapshots); delivery can lag behind replication during spikes |
| Best when | Consumers are concentrated on one node/zone; you want minimal replication overhead | Consumers are distributed; you want to minimize cross-node delivery fan-out |

Practical guidance:

  • Start with forward unless you have a clear need for replicate.
  • Prefer replicate for high fan-out queue consumption across many nodes.
  • Measure: the right choice depends on your workload shape (publish rate, consumer placement, and payload sizes).

For multi-queue workloads, use replication groups to shard hot queues and reduce contention. See Queue Replication Groups (Raft).

Configuration Entry Points
