Recover from an Initial Node Failure

The first computer you install Tableau on, the "initial node", has some unique characteristics. Three processes run only on the initial node and cannot be moved to any other node except in a failure situation: the Licence Service (Licence Manager), the Activation Service and the TSM Controller (Administration Controller). Tableau Server includes a script that automates moving these processes to one of your other existing nodes so you can regain full access to TSM and keep Tableau Server running.

Two other processes are initially included on the initial node but can be added or moved to additional nodes: the Client File Service (CFS) and the Coordination Service. Depending on how CFS and the Coordination Service were configured in your installation, you may also need to take steps to redeploy them.

If an initial node fails

If there is a problem with the initial node, there is no guarantee that Tableau Server will continue to run, even if you have redundant processes on your other nodes.

  • Tableau Server can continue to run for up to 72 hours after an initial node failure, before the lack of the licensing service impacts other processes. During that time, your users may be able to continue to sign in and see and use their content, but you will not be able to reconfigure Tableau Server because you won't have access to the Administration Controller.
  • If you are running Tableau Server 2021.4.2 or older and it is configured for ATR, problems with the initial node will render all server functionality unavailable. This is true whether the node fails or you intentionally stop it (for instance, to apply a system-level patch).
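The version condition in the second bullet can be checked with a small shell helper. This is a minimal sketch, not part of Tableau Server: the function name version_lte and the variable installed are hypothetical, the version string is an illustrative value you would replace with your own, and the comparison assumes GNU sort with the -V (version sort) option is available on the host.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Tableau command): succeeds when version $1 <= version $2.
# Assumes GNU sort -V (version sort) is available.
version_lte() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Illustrative value only; substitute the version you are actually running.
installed="2021.3.5"

if version_lte "$installed" "2021.4.2"; then
  echo "At or below 2021.4.2: if ATR is configured, an initial node outage stops all server functionality."
else
  echo "Newer than 2021.4.2."
fi
```

Version sort compares each numeric component, so 2021.4.10 is treated as newer than 2021.4.2, which a plain string comparison would get wrong.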

Even when configured with redundant processes, it is possible that Tableau Server may not continue to function after the initial node fails. This is true even when an installation is configured for high availability. This means you should make a point of moving the three unique processes to another of your running nodes as soon as possible. If your initial node fails for reasons that are recoverable in a relatively short amount of time (for example, a hardware failure you can correct), you should first attempt to bring the node back up without using the procedure below.

Note: The steps in this article require server downtime, can be disruptive, and should only be used in the event of a catastrophic failure of the initial node. If you are unable to get your initial node running again, use the following steps to move key TSM processes to another node in your cluster.

General requirements

The 2021.1 version of Tableau Server has been updated with improved recovery functionality. The procedure in this topic has been written for Tableau Server 2021.1.

If you are attempting to recover a failed node from an earlier version of Tableau Server, you must follow the procedure for that version. To view archived versions of Tableau help, see Tableau Help.

  • As part of the process for setting up a multi-node Tableau Server installation you should have deployed a Coordination Service ensemble. The process below assumes there was a Coordination Ensemble deployed before there was a problem with the initial node. For more information about deploying a Coordination Service ensemble, see Deploy a Coordination Service Ensemble.
  • This process assumes that you have configured instances of Client File Service (CFS) on every node that is running the Coordination Service. If you did not add additional instances of CFS, your only instance was on the initial node, and you will need to add at least one instance of CFS to another node. You will also need to repopulate CFS. Tableau Server requires at least one instance of the CFS. For more information, see Configure Client File Service and Tableau Server Client File Service.

Note: This operation includes steps that you may need to perform using the TSM command line.

Move the TSM Controller, Licence Service and Activation Service to another node

If there is a problem with the initial node, the TSM Controller, the Licence Service and the Activation Service need to be started on another node. Follow these steps to use the provided move-tsm-controller script and get the TSM Controller, Licence Service and Activation Service working on another node.

  1. On a node that is still working, run the Controller recovery script. At a terminal prompt on a working node, type the following command:

    sudo /opt/tableau/tableau_server/packages/scripts.<version_code>/move-tsm-controller -n <nodeID>

    where "nodeID" is the ID for the node you want the TSM Controller to run on. For example:

    sudo /opt/tableau/tableau_server/packages/scripts.10400.17.0802.1319/move-tsm-controller -n node2

  2. Verify the Administration Controller is running on the node:

    tsm status -v

  3. Stop Tableau Server.

    The remainder of this procedure includes some commands with the --ignore-node-status option. When a command is run with the --ignore-node-status option, the command will run without consideration of the status of the specified node. To use --ignore-node-status, specify the failed node:

    tsm stop --ignore-node-status <nodeID>

    For example, if node1 has failed, run the command as follows:

    tsm stop --ignore-node-status node1

  4. Add the Licence Service to the node:

    tsm topology set-process -pr licenseservice -n <nodeID> -c 1

  5. Remove the old Licence Service from the original node, where "nodeID" is the initial node that has failed:

    tsm topology set-process -pr licenseservice -n <nodeID> -c 0

  6. If you are running one of the following versions

    • 2023.3.0 or later

    • 2023.1.3 or later

    • 2022.3.7 or later

    • 2022.1.15 or later

    or you are running an earlier version and using ATR, add the Activation Service to the new node:

    tsm topology set-process -pr activationservice -n <nodeID> -c 1

  7. If you are running one of the following versions

    • 2023.3.0 or later

    • 2023.1.3 or later

    • 2022.3.7 or later

    • 2022.1.15 or later

    or you are running an earlier version and using ATR, remove the old Activation Service from the original node, where "nodeID" is the initial node that has failed:

    tsm topology set-process -pr activationservice -n <nodeID> -c 0

    Important: In a cluster, if a node that is running your only instance of CFS fails, any files being managed by CFS will be lost, and you will need to repopulate CFS with those files by reimporting certificates and custom images, and making any related configuration changes. For a list of files managed by CFS, see Tableau Server Client File Service.

  8. If the initial node had been running the Messaging Service, add the Messaging Service to this node:

    tsm topology set-process -pr activemqserver -n node2 -c 1

  9. (Optional) You can also add other processes that had been running on the initial node but are not running on this node. For example, to add a Cache Server:

    tsm topology set-process -pr cacheserver -n node2 -c 1

  10. Apply the changes:

    tsm pending-changes apply --ignore-node-status <nodeID>

    If the pending changes require a server restart, the pending-changes apply command will display a prompt to let you know a restart will occur. This prompt displays even if the server is stopped, but in that case, there is no restart. You can suppress the prompt using the --ignore-prompt option, but this does not change the restart behaviour. If the changes do not require a restart, the changes are applied without a prompt. For more information, see tsm pending-changes apply.

  11. Restart the TSM Administration Controller (as tableau system account):

    sudo su -l tableau -c "systemctl --user restart tabadmincontroller_0.service"

    Note: It may take a few minutes for tabadmincontroller to restart. If you attempt to apply pending changes in the next step before the controller has fully restarted, TSM will not be able to connect to the controller. You can verify that the controller is running by using the tsm status -v command. Tableau Server Administration Controller should be listed as "is running".

  12. Apply pending changes (there may not appear to be any, but this step is required):

    tsm pending-changes apply --ignore-node-status <nodeID>

  13. Activate the Tableau Server licence on the new Controller node:

    tsm licenses activate -k <product-key>

  14. Verify the licence is properly activated:

    tsm licenses list

  15. If the initial node was running the Coordination Service, you need to deploy a new Coordination Service ensemble that does not include that node. If you have a three-node cluster and the initial node was running the Coordination Service, you must deploy a new, single-instance Coordination Service ensemble on a different node and clean up the old ensemble. In this example, a single instance of the Coordination Service is being deployed to the second node:

    tsm topology deploy-coordination-service -n node2 --ignore-node-status node1

  16. If the initial node was running a File Store instance, you need to remove that instance:

    tsm topology filestore decommission -n <nodeID> --delete-filestore

    Where nodeID is the initial node that has failed.

  17. Apply pending changes, using the --ignore-warnings flag if the new Coordination Service ensemble you deployed above is a single-node ensemble:

    tsm pending-changes apply --ignore-node-status node1 --ignore-warnings

  18. Remove the initial node, where nodeID is the initial node that has failed:

    tsm topology remove-nodes -n <nodeID>

  19. Apply pending changes, using the --ignore-warnings flag if the new Coordination Service ensemble you deployed above is a single-node ensemble:

    tsm pending-changes apply --ignore-warnings

  20. Start Tableau Server:

    tsm start

    At this point your server should start, and you will be able to use TSM to configure it. The next step is to replace your initial node so your cluster has the original number of nodes. How you do this depends on whether or not you want to reuse the node that failed. We recommend that you only reuse that node if you are able to identify the reason it failed, and take steps to keep the failure from recurring.

  21. If you plan to reuse the original node, you first need to completely remove Tableau from it. Do this by running the tableau-server-obliterate script. For details on doing this, see Remove Tableau Server from Your Computer.

  22. On a fresh computer, or on your original computer after completely removing Tableau, install Tableau using your original Setup program and a bootstrap file generated from the node that is now running the Administration Controller and Licence Service. This creates an additional node you can configure as part of your cluster. For details on how to add the node, see Install and Configure Additional Nodes.

    A best practice is to configure any processes you lost when the original node failed, to make sure your cluster is fully redundant. You may want to move processes from your new initial node to the newly added additional node to duplicate your original configuration. For example, if your initial node was only running gateway and File Store, you may want to configure the new initial node the same way.

  23. You should also redeploy a new Coordination Service ensemble, once you have your nodes up and running the way you want. For details, see Deploy a Coordination Service Ensemble.

  24. Finally, if you have not already done this, add an instance of CFS to every node that is running the Coordination Service. For more information, see Configure Client File Service.

    In a cluster, if a node that is running your only instance of CFS fails, any files being managed by CFS will be lost, and you will need to repopulate CFS with those files by reimporting certificates and custom images, and making any related configuration changes. For a list of files managed by CFS, see Tableau Server Client File Service.
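The core command sequence in steps 3 through 20 can be collected into a dry-run checklist. The following is a minimal sketch, not an official Tableau script: it only prints commands taken from this procedure with your node IDs filled in, and it runs nothing. The function names version_gte, needs_activation_step and print_recovery_plan, and their argument order, are hypothetical. The version check assumes that each "or later" entry in steps 6 and 7 applies within its own release branch and that GNU sort -V is available; if you run an earlier version with ATR, pass yes as the fourth argument. Step 1 (the move-tsm-controller script) and the topology-dependent steps are left out and only noted in the output.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the core TSM commands from this procedure; runs nothing.
# All function names are hypothetical helpers, not part of Tableau Server.

# Succeeds when version $1 >= version $2 (assumes GNU sort -V).
version_gte() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Succeeds when the Activation Service steps (6 and 7) apply.
# Assumption: each "or later" minimum applies within its own release branch.
needs_activation_step() {
  local version="$1" uses_atr="${2:-no}"
  [ "$uses_atr" = "yes" ] && return 0
  case "$version" in
    2022.1.*) version_gte "$version" "2022.1.15" ;;
    2022.3.*) version_gte "$version" "2022.3.7" ;;
    2023.1.*) version_gte "$version" "2023.1.3" ;;
    *)        version_gte "$version" "2023.3.0" ;;
  esac
}

# Arguments: <failed nodeID> <target nodeID> <server version> [yes|no for ATR]
print_recovery_plan() {
  local failed="$1" target="$2" version="$3" uses_atr="${4:-no}"
  echo "tsm stop --ignore-node-status $failed"
  echo "tsm topology set-process -pr licenseservice -n $target -c 1"
  echo "tsm topology set-process -pr licenseservice -n $failed -c 0"
  if needs_activation_step "$version" "$uses_atr"; then
    echo "tsm topology set-process -pr activationservice -n $target -c 1"
    echo "tsm topology set-process -pr activationservice -n $failed -c 0"
  fi
  echo "# Steps 8-9 (Messaging Service, other processes): add if they apply to your topology"
  echo "tsm pending-changes apply --ignore-node-status $failed"
  echo 'sudo su -l tableau -c "systemctl --user restart tabadmincontroller_0.service"'
  echo "tsm pending-changes apply --ignore-node-status $failed"
  echo "tsm licenses activate -k <product-key>"
  echo "tsm licenses list"
  echo "# Steps 15-17 (Coordination Service, File Store): add if they apply to your topology"
  echo "tsm topology remove-nodes -n $failed"
  echo "# Add --ignore-warnings to the next command for a single-node ensemble"
  echo "tsm pending-changes apply"
  echo "tsm start"
}

# Example: node1 failed, node2 takes over, server version 2023.1.4.
print_recovery_plan node1 node2 2023.1.4
```

Review each printed line against the numbered steps before running anything, and wait for the Administration Controller to report "is running" in tsm status -v after the restart command before applying pending changes again.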

 
