To migrate to new machine images in HashiCorp’s Nomad, we can’t easily take a GitOps approach; it requires manual cluster management.
From a high level, we need to bring up a parallel set of nodes, get them configured correctly, and then turn off the original set. This is made more difficult because both Consul and Nomad use Raft consensus, so we have to carefully manage that state and which node is the leader, or we break the cluster and cause even more manual intervention.
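Before touching anything, it helps to know who the current Raft peers and leader are. One way to check, using the operator subcommands both tools ship with:

consul operator raft list-peers
nomad operator raft list-peers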
First, bring up a new set of nodes. In Terraform, there are two instances of the application platform, blue and green. One is active and has nodes (the server_count parameter controls this); the other doesn’t. Add nodes to the instance that doesn’t have any, along with the updated machine image. Once that completes, you should have new instances up and running the new machine image.
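One way that step might look, assuming hypothetical variable names for the green instance’s node count and machine image (the real names depend on how the Terraform is structured):

# variable names here are hypothetical; match them to the actual Terraform code
terraform plan -var="green_server_count=3" -var="green_machine_image=<new-image-id>"
terraform apply -var="green_server_count=3" -var="green_machine_image=<new-image-id>"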
Validate that things are running correctly and that the new nodes have joined the peer list.
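A quick sanity check, for example:

consul members        # every agent should report "alive"
nomad server members  # the new servers should appear alongside the old ones
nomad node status     # the new client nodes should be "ready"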
Running consul maint -enable -reason="" on a node puts it into maintenance mode in Consul, which takes it out of DNS queries so nothing new gets routed to it. Do this on a node before you delete it. Note that this is a stateful flag: the node won’t come out of maintenance mode until you disable it. That’s fine if you’re deleting the node.
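A sketch of how that looks, run on the old node itself (the reason string is arbitrary):

consul maint -enable -reason="retiring old machine image"
# and if you ever need to bring the node back into service instead:
consul maint -disable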
From there, we need to delete the old servers.
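If the old nodes live in, say, the blue instance, deleting them is the mirror of the scale-up above (again, the variable name is hypothetical):

# scale the old instance down to zero
terraform apply -var="blue_server_count=0"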
After that’s done, tell the other peers that the old Nomad servers should be removed from the peer list. You can get the list of servers with nomad server members and then remove any that shouldn’t be there with nomad server force-leave green-test-server-0. If that server is still up, it’ll rejoin.
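As a concrete sequence (use the exact server name that nomad server members prints, which may include a region suffix such as .global):

nomad server members
nomad server force-leave green-test-server-0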
For the servers, then, the goal is to make sure the new ones are part of the Raft consensus and the old ones are removed from it.
If the old servers keep coming back, you’ll need to turn off Nomad on those servers with systemctl stop nomad.
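Assuming Nomad runs as a systemd unit named nomad, on each stubborn old server:

sudo systemctl stop nomad
sudo systemctl disable nomad   # optional: keeps it from starting again on reboot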
Drain the “client” server. This may take a bit.
Here, 46f1 is the start of the node ID; Nomad commands accept a unique ID prefix, as in the sketch below.
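A sketch of the drain, assuming 46f1 uniquely identifies the client node:

nomad node drain -enable -yes 46f1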
Verify the node is ineligible for scheduling.
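One way to check, for example:

nomad node status 46f1   # Eligibility should read "ineligible"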
You can mark a node as ineligible for scheduling before draining it, so old workloads don’t get placed onto a server that’s about to be turned off.
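That would look something like this, again using the node ID prefix:

nomad node eligibility -disable 46f1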