Each worker connects to the current leader broker with a single TCP connection, over which all message deliveries are multiplexed.
If the TCP connection goes down for any reason, the node must look for a new leader broker and retry the connection.
When a node boots, it creates a unique id (the "process id").
From the broker's point of view, if a node does not reconnect within a configurable amount of time, it is considered dead and its current process id can no longer be used.
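The reconnect behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual client code: the `Worker` class, the `find_leader` and `connect` callables, and the `retry_delay` and `max_attempts` parameters are all hypothetical names introduced here.

```python
import time
import uuid


class Worker:
    """Hypothetical worker that keeps one connection to the leader broker."""

    def __init__(self, find_leader, connect, retry_delay=1.0):
        # The process id is generated once, when the node boots.
        self.process_id = uuid.uuid4().hex
        self.find_leader = find_leader  # assumed callable: () -> broker address
        self.connect = connect          # assumed callable: (address, process_id) -> connection
        self.retry_delay = retry_delay

    def connect_with_retry(self, max_attempts=5, sleep=time.sleep):
        """Look up the current leader and connect, retrying on failure."""
        for attempt in range(1, max_attempts + 1):
            try:
                # The leader may have changed, so look it up on every attempt.
                address = self.find_leader()
                return self.connect(address, self.process_id)
            except ConnectionError:
                if attempt == max_attempts:
                    raise
                sleep(self.retry_delay)
```

The same process id is sent on every attempt, since the broker only invalidates it once the node has been considered dead.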
A worker can be in one of the following states:
- CONNECTED: a valid TCP connection is active
- DISCONNECTED: no TCP connection is active, but there are tasks assigned to the node
- DEAD: no connection has been present for longer than the configured timeout
When a node transitions from DISCONNECTED to DEAD, recovery is scheduled for each task present on the node, according to the retry policy of that task.
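The state transitions above can be pictured as a small broker-side state machine. The sketch below is only an assumption about how such tracking might look: `WorkerTracker`, `max_disconnected_seconds`, and the `schedule_recovery` callback are hypothetical names, the callback is assumed to apply each task's retry policy, and for brevity the sketch moves to DISCONNECTED whether or not tasks are assigned.

```python
from enum import Enum


class WorkerState(Enum):
    CONNECTED = "CONNECTED"
    DISCONNECTED = "DISCONNECTED"
    DEAD = "DEAD"


class WorkerTracker:
    """Hypothetical broker-side record for one worker process id."""

    def __init__(self, process_id, max_disconnected_seconds, schedule_recovery):
        self.process_id = process_id
        self.max_disconnected_seconds = max_disconnected_seconds
        self.schedule_recovery = schedule_recovery  # assumed callable: (task_id) -> None
        self.state = WorkerState.CONNECTED
        self.disconnected_at = None
        self.tasks = set()

    def on_connection_lost(self, now):
        if self.state is WorkerState.CONNECTED:
            self.state = WorkerState.DISCONNECTED
            self.disconnected_at = now

    def on_reconnect(self):
        # Once a worker is dead, its process id can no longer be used.
        if self.state is WorkerState.DEAD:
            raise RuntimeError("process id %s is dead" % self.process_id)
        self.state = WorkerState.CONNECTED
        self.disconnected_at = None

    def check_timeout(self, now):
        """Mark the worker DEAD if it stayed disconnected for too long."""
        if (self.state is WorkerState.DISCONNECTED
                and now - self.disconnected_at >= self.max_disconnected_seconds):
            self.state = WorkerState.DEAD
            # Recovery is scheduled once per task still assigned to the node.
            for task_id in sorted(self.tasks):
                self.schedule_recovery(task_id)
            self.tasks.clear()
```

Passing `now` explicitly, instead of reading the clock inside the tracker, keeps the timeout logic easy to exercise in tests.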