Broker Replication

Brokers can run in replication mode where multiple JVM processes could act as broker.

Only a JVM is the leader broker, the others will be defined followers.

Leadership is obtained using ZooKeeper with a simple master election algorithm.

Architecture

Replication use BookKeeper in order to replicate status between brokers.

Leader Broker writes status changes on a set of BK ledgers and followers reads continuously from the same ledgers, replicating every change to their own in memory copy of the status.

Majordodo leverages the fencing features of BK in order to ensure that only on Broker can change the shared "view" of the status of the system.

Checkpoints

Brokers do not share any resource other than BK ledgers and ZK filesystem. In order to be able to delete old ledgers a local checkpoint strategy is needed.

When it is time to create a checkpoint a dump of the broker status is saved to the local file system (directory "snapshot"), each snapshot file will contain:

a dump of the status of every task
a dump of the status of every worker

The sequence number (ledgerid + seqnumber) of the last entry of the log stream applied to the in memory status at the time of the snapshot.
The time between checkpoints is configurable.

Followers

The brokers start in follower mode and recovers from snapshots directory and BK ledgers. When the recovery is over the leader election will start. While in follower mode the broker "tails" the log and applies changes to its own copy of the view of the system. When a broker is elected as the leader then it opens the actual ledger (this operation will "fence" any other zombie broker...) and it begins to accept incoming requests from clients and from workers.

When a new broker joins the cluster it looks on ZK for ledgers, and if the whole set of ledgers is still on BK than it replays the whole logs, if not it connects to the actual leader broker, download a snapshot and recovers from it.