Recover from a Node Failure

If one of your server nodes fails and you have redundant processes on your other nodes, Tableau Server can continue to run. Users can still sign in and use their content after the node fails, but they may experience degraded performance. In addition, your server is at greater risk of catastrophic failure if the failed node was running processes that are no longer redundant, so you should remove and replace the failed node as soon as you can. If the node failed for a reason you can fix in a relatively short time (for example, a hardware failure you can correct), first try to bring the node back up before using the procedure below.

Note: If the failed node is your initial node, there are larger implications for your Tableau Server installation. For details on how to recover from the failure of an initial node, see Recover from an Initial Node Failure.

General requirements

The 2020.1 version of Tableau Server has been updated with improved recovery functionality. The procedure in this topic has been written for Tableau Server 2020.1.

If you are attempting to recover a failed node from an earlier version of Tableau Server, you must follow the procedure for that version. To view archived versions of Tableau help, see Tableau Help.

To use this procedure, your installation must meet these requirements:

  • There is at least one functioning node with an instance of the File Store on it.
  • There is at least one functioning node with a Repository on it.
  • There is at least one functioning node with the Client File Service (CFS) on it.

Note: This operation includes steps that you may need to perform using the TSM command line. To use the TSM CLI you need administrator access to the command line on one of the nodes in your installation and TSM administrator credentials to run TSM commands.
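
The requirements above can be checked against the output of tsm status -v. The following is a minimal sketch, not an official Tableau tool. The function name has_running_process is hypothetical. It assumes that each node block starts with a "node<n>:" line followed by a "Status:" line, and that a healthy process line contains the words "is running". The healthy wording is an assumption, so verify it against your own output. The function reads status text from standard input, so you can also try it on saved output.

```shell
#!/bin/sh
# has_running_process NAME   (hypothetical helper, not part of TSM)
# Reads `tsm status -v` output on stdin. Succeeds (exit 0) if a node whose
# Status is not ERROR reports a process whose line contains NAME and the
# words "is running"; fails (exit 1) otherwise.
# ASSUMPTION: healthy process lines contain the text "is running".
has_running_process() {
    awk -v name="$1" '
        $1 ~ /^node[0-9]+:$/   { node_ok = 1 }   # a new node block begins
        /Status:[ \t]*ERROR/   { node_ok = 0 }   # this node is marked failed
        node_ok && index($0, name) && /is running/ { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Example use on a live installation (requires the TSM CLI):
#   tsm status -v | has_running_process "File Store" \
#       || echo "No functioning node reports a running File Store"
```

Run the check once for each required process, using the process name as it appears in your own status output.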

Removing a Failed Node

To remove a failed node from a Tableau Server cluster:

  1. Identify the failed node:

    tsm status -v

    The failed node will have a status of "ERROR" and processes will show as unavailable. The node ID is listed as "node<n>" with the machine name following it. For example, node3:

    node3: WIN-OO915SFASVH
        Status: ERROR
        'Tableau Server Gateway 0' status is unavailable.
  2. Stop Tableau Server.

    The remainder of this procedure includes some commands with the --ignore-node-status option. A command run with this option runs regardless of the status of the specified node. To use --ignore-node-status, specify the failed node:

    tsm stop --ignore-node-status <nodeID>

    For example, if node3 has failed, run the command as follows:

    tsm stop --ignore-node-status node3
  3. Determine any key processes that were running on the node:

    • If the failed node was running the Messaging Service, you need to remove the service from the failed node and add it to a working node.

      Remove it from the failed node:

      tsm topology set-process -pr activemqserver -n <nodeID> -c 0
      

      Add it to a working node:

      tsm topology set-process -pr activemqserver -n <nodeID> -c 1
      
    • If the failed node was running the Coordination Service, you need to deploy a new ensemble before you can remove the node:

      tsm topology deploy-coordination-service -n <good_nodeID> --ignore-node-status <failed_nodeID>
      
    • If the failed node was running the only instance of Client File Service (CFS), you need to configure a new instance of CFS on a working node. We recommend that you configure CFS on every node that is running the Coordination Service. For detailed steps, see Configure Client File Service.

    • If the failed node was running File Store, you need to force-decommission File Store and remove it before you can remove the node.

      tsm topology filestore decommission -n <nodeID> --delete-filestore

      Apply pending changes (include the --ignore-warnings option if you had a three-node cluster and a single Coordination Service instance):

      tsm pending-changes apply --ignore-warnings --ignore-node-status <nodeID>
  4. If the cluster was a three-node cluster and there are repositories on the remaining working nodes, you need to either remove one repository, or add a new node. This is because you are limited to a single instance of the repository when you have fewer than three nodes.

    To remove one repository:

    tsm topology set-process -n <nodeID> -pr pgsql -c 0
  5. Run the command to remove the failed node. This adds the change to the pending changes list:

    tsm topology remove-nodes -n <nodeID>
  6. Verify the node removal is pending:

    tsm pending-changes list
  7. Apply pending changes to remove the node:

    tsm pending-changes apply
  8. Start Tableau Server:

    tsm start
  9. Install Tableau Server on a new node and configure the node with the processes that the old, failed node had been running.

  10. On a fresh computer, or on your original computer after completely removing Tableau Server, install Tableau Server using your original Setup program and a bootstrap file generated from the initial node. For details on how to do this, see Install and Configure Additional Nodes.

    As a best practice, configure any processes you lost when the original node failed, so that your cluster is fully redundant again.

  11. Once your nodes are up and running the way you want, deploy a new Coordination Service ensemble. For details, see Deploy a Coordination Service Ensemble.

  12. Finally, if you have not already done this, add an instance of CFS to every node that is running the Coordination Service. For more information, see Configure Client File Service.
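
Steps 1, 2, and 5 through 8 of the procedure above can be sketched as a small shell helper. This is an illustration under stated assumptions, not an official Tableau script. The function names failed_nodes and print_removal_plan are hypothetical. failed_nodes assumes the status layout shown in step 1. print_removal_plan only prints the commands from this procedure in order and does not run them. It leaves out the conditional steps 3 and 4, because those depend on which processes the failed node was running.

```shell
#!/bin/sh
# failed_nodes   (hypothetical helper, not part of TSM)
# Reads `tsm status -v` output on stdin and prints the ID of every node
# whose Status line reads ERROR, using the layout shown in step 1.
failed_nodes() {
    awk '
        $1 ~ /^node[0-9]+:$/ { split($1, parts, ":"); node = parts[1] }
        /Status:[ \t]*ERROR/ && node != "" { print node }
    '
}

# print_removal_plan NODE_ID   (hypothetical helper, not part of TSM)
# Prints, without running them, the unconditional commands from the
# procedure for the given failed node. Review the list, handle steps 3
# and 4 by hand, and then run the commands yourself.
print_removal_plan() {
    node_id="$1"
    case "$node_id" in
        node[0-9]*) ;;   # looks like a node ID such as node3
        *) echo "usage: print_removal_plan node<n>" >&2; return 1 ;;
    esac
    printf '%s\n' \
        "tsm stop --ignore-node-status $node_id" \
        "tsm topology remove-nodes -n $node_id" \
        "tsm pending-changes list" \
        "tsm pending-changes apply" \
        "tsm start"
}

# Example use on a live installation (requires the TSM CLI):
#   tsm status -v | failed_nodes
#   print_removal_plan node3
```

Printing the plan instead of executing it keeps the decision to stop the server and remove a node with the administrator, which matters because the conditional steps change what must happen between the stop and remove commands.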

 
