Part 7 - Validation, Tools, and Troubleshooting

This part includes post-installation validation steps and troubleshooting guidance.

Failover system validation

After you have configured your deployment, we recommend running the following simple failover tests to validate system redundancy:

  1. Shut down the first instance of Independent Gateway (TSIG1). All inbound traffic should route through the second instance of Independent Gateway (TSIG2).
  2. Restart TSIG1 and then shut down TSIG2. All inbound traffic should route through TSIG1.
  3. Restart TSIG2.
  4. Shut down Tableau Server Node 1. All Vizportal/Application service traffic will fail over to Node 2.

    Note: As of September 2022, Node 1 high availability was compromised on certain versions of Tableau Server 2021.4 and later. Client connections will fail if Node 1 is down. This issue has been fixed in these maintenance releases:

    - 2021.4.15 and later
    - 2022.1.11 and later
    - 2023.1.3 and later

    To ensure that a Tableau Server installation using ATR activations has a 72-hour grace period after initial node failure, install or upgrade to one of these versions. For more details, see "Tableau Server HA using ATR Does Not Have a Grace Period After the Initial Node Failure" in the Tableau Knowledge Base.

  5. Restart Node 1 and shut down Node 2. All Vizportal/Application service traffic will fail over to Node 1.
  6. Restart Node 2.

In this context, "shutting down" or "restarting" means turning off the operating system or virtual machine without attempting a graceful shutdown of the application beforehand. The goal is to simulate a hardware or virtual machine failure.

The minimum validation step for each failover test is to authenticate with a user and perform basic view operations.

You may get a "Bad Request" browser error when you attempt to sign in after a simulated failure. You may see this error even if you clear the cache in the browser. Often this issue occurs when the browser is caching data from a previous IdP session. If this error persists even after you clear the local browser cache, validate the Tableau scenario by connecting with a different browser.
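While a node or gateway is shut down, it can help to watch for the moment the front end starts answering again before you sign in and run the manual checks. The following is a minimal sketch of such a polling helper, not part of the Tableau tooling. The function name poll_until_ok and the example URL are hypothetical, and a successful HTTP response only shows that something answered; it does not replace authenticating with a user and performing basic view operations.

```shell
#!/bin/bash
# poll_until_ok ATTEMPTS DELAY COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times, sleeping DELAY seconds between tries.
# Returns 0 as soon as COMMAND succeeds, or 1 if every attempt fails.
poll_until_ok() {
    local attempts=$1 delay=$2 i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            echo "attempt $i: ok"
            return 0
        fi
        echo "attempt $i: failed"
        # Do not sleep after the final attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example (hypothetical address; replace with your load balancer URL):
#   poll_until_ok 30 10 curl --fail --silent --max-time 5 https://tableau.example.com/
```

Passing the probe as a command keeps the helper independent of any particular check, so the same loop can wrap curl, wget, or a script of your own.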

Initial node automated recovery

Tableau Server versions 2021.2.4 and later include an automated initial node recovery script, auto-node-recovery, in the scripts directory (/app/tableau_server/packages/scripts.<version>).

If there is a problem with the initial node and you have redundant processes on Node 2, there is no guarantee that Tableau Server will continue to run. Tableau Server may continue to run for up to 72 hours after an initial node failure, before the lack of the licensing service impacts other processes. If so, your users may be able to continue to sign in and see and use their content after the initial node fails, but you will not be able to reconfigure Tableau Server because you won't have access to the Administration Controller.

Even when configured with redundant processes, it is possible that Tableau Server may not continue to function after the initial node fails.

To recover from an initial node (Node 1) failure:

  1. Sign in to Tableau Server Node 2.

  2. Change to the scripts directory:

    cd /app/tableau_server/packages/scripts.<version>

  3. Run the following command to launch the script:

    sudo ./auto-node-recovery -p node1 -n node2 -k <license keys>

    Where <license keys> is a comma-separated (no spaces) list of the license keys for your deployment. If you do not have access to your license keys, visit the Tableau Customer Portal to retrieve them. For example:

    sudo ./auto-node-recovery -p node1 -n node2 -k TSB4-8675-309F-TW50-9RUS,TSNM-559N-ULL6-22VE-SIEN

The auto-node-recovery script will execute about 20 steps to recover services to Node 2. Each step is displayed in the terminal as the script progresses. More detailed status is logged to /data/tableau_data/logs/app-controller-move.log. In most environments, the script takes between 35 and 45 minutes to complete.
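Because the -k argument must be a comma-separated list with no spaces, a small helper can build that string from individual keys and avoid a copy/paste mistake. This is a sketch only; the function name join_license_keys is hypothetical and is not part of the auto-node-recovery script.

```shell
#!/bin/bash
# join_license_keys KEY [KEY...]
# Prints the keys as a single comma-separated string and removes any
# stray spaces or tabs that were pasted along with a key.
join_license_keys() {
    local IFS=,
    # "$*" joins the arguments with the first character of IFS (a comma).
    printf '%s\n' "$*" | tr -d ' \t'
}

# Example, using the placeholder keys shown above:
#   sudo ./auto-node-recovery -p node1 -n node2 \
#       -k "$(join_license_keys TSB4-8675-309F-TW50-9RUS TSNM-559N-ULL6-22VE-SIEN)"
```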

Troubleshooting initial node recovery

If node recovery fails, it can be useful to run the script interactively so that you can allow or skip discrete steps in the process. For example, if the script fails partway through, you can review the log file, make changes to the configuration, and then run the script again. By running in interactive mode, you can skip all the steps until you reach the step that failed.

To run in interactive mode, add the -i switch to the script arguments.

Rebuilding the failed node

After you have run the script, Node 2 will be running all of the services that were formerly on the failed Node 1 host. To restore the deployment to four nodes, deploy a fresh Tableau Server host with the bootstrap file and configure it as you did for the original Node 2, as specified in Part 4. See Configure Node 2.

switchto

switchto is a script from Tim that makes it easy to switch between the hosts in the deployment over SSH from the bastion host.

  1. Copy the following code into a file called switchto in the home directory on your bastion host.

    #!/bin/bash
    #-------------------------------------------------------------------
    # switchto
    #
    # Helper function to simplify SSH into the various AWS hosts when
    # following the Tableau Server Enterprise Deployment Guide (EDG).
    #
    # Place this file on your bastion host and provide your AWS hosts'
    # internal ip addresses or machine names here.
    # Example: readonly NODE1="10.0.3.187"
    #
    readonly NODE1=""
    readonly NODE2=""
    readonly NODE3=""
    readonly NODE4=""
    readonly PGSQL=""
    readonly PROXY1=""
    readonly PROXY2=""

    usage() {
        echo "Usage: switchto [ node1 | node2 | node3 | node4 | pgsql | proxy1 | proxy2 ]"
    }

    ip=""

    # Map the requested target name to its configured address.
    case "$1" in
        node1)
            ip="$NODE1"
            ;;
        node2)
            ip="$NODE2"
            ;;
        node3)
            ip="$NODE3"
            ;;
        node4)
            ip="$NODE4"
            ;;
        pgsql)
            ip="$PGSQL"
            ;;
        proxy1)
            ip="$PROXY1"
            ;;
        proxy2)
            ip="$PROXY2"
            ;;
        ?)
            # Any single-character argument (for example, "?") prints usage.
            usage
            exit 0
            ;;
        *)
            echo "Unknown option $1."
            usage
            exit 1
            ;;
    esac

    # An empty address means the variables above have not been filled in.
    if [[ -z $ip ]]; then
        echo "You must first edit this file to provide the ip addresses of your AWS hosts."
        exit 1
    fi

    # Connect with SSH agent forwarding enabled (-A).
    ssh -A ec2-user@"$ip"

  2. Update the IP addresses in the script to map to your EC2 instances and then save the file.
  3. Apply permissions to the script file:

    sudo chmod +x switchto

Usage:

To switch to a host, run the following command:

./switchto <target>

For example, to switch to Node 1, run the following command:

./switchto node1

Troubleshooting Tableau Server Independent Gateway

Configuring Independent Gateway, Okta, Mellon, and SAML on Tableau Server can be an error-prone process. The most common root cause of failures is a string error. For example, a trailing slash (/) on the Okta URLs specified during configuration may cause a SAML assertion-related mismatch error. This is just one example. There are many opportunities during configuration to enter an incorrect string in any of the applications.

Restart tableau-tsig service

Always start (and finish) troubleshooting by restarting the tableau-tsig service on the Independent Gateway computers. Restarting this service is quick and often triggers the updated config to load from the Tableau Server.

Run the following commands on the Independent Gateway computer:

sudo su - tableau-tsig
systemctl --user restart tsig-httpd
exit

Find incorrect strings

If you have made a string error (a copy/paste mistake, a truncated string, and so on), take time to walk through each of the settings that you configured:

  • Okta pre-authentication configuration. Carefully review the URLs that you have set. Look for trailing slashes. Verify HTTP vs HTTPS.
  • Shell history for SAML configuration on Node 1. Review the tsm authentication saml configure command that you ran. Verify that all of the URLs match those that you have configured in Okta. While you are reviewing shell history from Node 1, verify that the tsm configuration set commands that specify the Mellon configuration file paths map exactly to the file paths where you copied the files on Independent Gateway.
  • Mellon configuration on Independent Gateway. Review the shell history to verify that you created the metadata with the same URL string that you have configured in Okta and Tableau SAML. Verify that all the paths that are specified in /etc/mellon/conf.d/global.conf are correct and that the MellonCookieDomain is set to your root domain, not your Tableau subdomain.
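Comparing URLs by eye is where trailing slashes and HTTP/HTTPS differences slip through. As an illustration, the sketch below compares two URL strings and names the kind of mismatch. The function name compare_urls and the example URLs are hypothetical; paste in the values you actually configured in Okta, in the tsm commands, and in the Mellon metadata.

```shell
#!/bin/bash
# compare_urls EXPECTED ACTUAL
# Prints "match" and returns 0 when the two strings are identical.
# Otherwise prints the kind of difference found and returns 1.
compare_urls() {
    local a=$1 b=$2
    if [ "$a" = "$b" ]; then
        echo "match"
        return 0
    fi
    # Equal once a single trailing slash is removed from each string.
    if [ "${a%/}" = "${b%/}" ]; then
        echo "trailing slash mismatch"
        return 1
    fi
    # Equal once the scheme prefix (everything through "://") is removed.
    if [ "${a#*://}" = "${b#*://}" ]; then
        echo "scheme mismatch"
        return 1
    fi
    echo "different"
    return 1
}

# Example (hypothetical URLs):
#   compare_urls "https://tableau.example.com/sso/" "https://tableau.example.com/sso"
```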

Search relevant logs

If all strings appear to be set correctly, then you should inspect logs for errors.

Tableau Server logs errors and events to dozens of different log files. Independent Gateway logs to a set of local files as well. We recommend inspecting these logs in the following order.

Independent Gateway log files

The default location of the Independent Gateway log files is /var/opt/tableau/tableau_tsig/logs.

  • access.log: This log is useful to the extent that it has entries that show connections from the Tableau Server nodes. If you are getting gateway errors (won't start) when you attempt to start TSM, and there are no entries in the access.log file, then there is a core connectivity issue. Always verify AWS security group configuration as a first step. Another common issue is a typo in tsig.json. If you make an update to tsig.json, run tsm stop before running tsm topology external-services gateway update -c tsig.json. After tsig.json is updated, run tsm start.
  • error.log: Among other entries, this log includes SAML and Mellon errors.
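To check quickly whether a given Tableau Server node has reached Independent Gateway at all, you can count the access log lines that mention that node's address. The following sketch assumes that the client address appears as a whole token in each log line and that the access logs match access*.log in the log directory. The function name count_client_entries is hypothetical.

```shell
#!/bin/bash
# count_client_entries LOG_DIR IP
# Prints the number of lines across LOG_DIR/access*.log that contain IP
# as a whole word, so 10.0.3.18 does not also count 10.0.3.187.
count_client_entries() {
    local log_dir=$1 ip=$2
    # -F treats the address as a literal string, -w requires a whole-word
    # match, and -c prints the count of matching lines.
    cat "$log_dir"/access*.log 2>/dev/null | grep -c -F -w -e "$ip"
}

# Example (Node 1 address taken from the switchto example):
#   count_client_entries /var/opt/tableau/tableau_tsig/logs 10.0.3.187
```

A count of 0 for every node is consistent with the core connectivity issue described above, which makes the security group configuration the first thing to verify.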

Tableau Server tabadminagent log file

The tabadminagent (not tabadmincontroller) set of files are the only relevant log files for troubleshooting Independent Gateway-related errors.

You must find where Independent Gateway errors have been logged to tabadminagent. These errors can be on any node, but they are only on one node. Perform the following steps on each node in the Tableau Server cluster until you find the “independent” string:

  1. Change to the tabadminagent log directory. In the EDG setup, the location is the same on Tableau Server nodes 1-4:

    cd /data/tableau_data/data/tabsvc/logs/tabadminagent

  2. Open the latest log to read:

    less tabadminagent_nodeN.log

    (replace N with the node number)

  3. Search for all instances of “Independent” and “independent” by using the following search string:

    /ndependent

    If there are no matches, go to the next node and repeat steps 1-3.

  4. When you get a match, press Shift + G to move to the bottom of the file and read the most recent error messages.
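As an alternative to paging through each file with less, the same search can be sketched with grep. The helper below prints the most recent matching lines from one log file. The function name last_gateway_matches is hypothetical, and the default of five lines is an arbitrary choice.

```shell
#!/bin/bash
# last_gateway_matches LOG_FILE [COUNT]
# Prints the last COUNT (default 5) lines of LOG_FILE that contain
# "independent" in any letter case, each prefixed with its line number.
last_gateway_matches() {
    local log_file=$1 count=${2:-5}
    grep -n -i -e 'independent' -- "$log_file" | tail -n "$count"
}

# Example, run on each node in turn:
#   cd /data/tableau_data/data/tabsvc/logs/tabadminagent
#   for f in tabadminagent_node*.log; do
#       echo "== $f"
#       last_gateway_matches "$f"
#   done
```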

Reload httpd stub file

Independent Gateway manages configuration of httpd for Apache. A generic operation that will often fix transient issues is to reload the httpd stub file that seeds the underlying Apache configuration. Run the following commands on both instances of Independent Gateway.

  1. Copy the stub file over to httpd.conf:

    cp /var/opt/tableau/tableau_tsig/config/httpd.conf.stub /var/opt/tableau/tableau_tsig/config/httpd.conf

  2. Restart the Independent Gateway service:

    sudo su - tableau-tsig
    systemctl --user restart tsig-httpd
    exit
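Copying the stub overwrites the current httpd.conf, so you may want to keep a copy of the existing file first. The sketch below adds that precaution, which is an assumption and not part of the documented procedure. The function name reload_httpd_stub and the backup file naming are hypothetical. You still need to restart the service afterwards as shown above.

```shell
#!/bin/bash
# reload_httpd_stub CONFIG_DIR
# Saves a timestamped copy of CONFIG_DIR/httpd.conf (if it exists) and
# prints the backup path, then overwrites httpd.conf with httpd.conf.stub.
# Returns 1 if the stub file is missing or the backup cannot be written.
reload_httpd_stub() {
    local config_dir=$1
    local conf="$config_dir/httpd.conf"
    local stub="$config_dir/httpd.conf.stub"
    local backup
    if [ ! -f "$stub" ]; then
        echo "stub file not found: $stub" >&2
        return 1
    fi
    if [ -f "$conf" ]; then
        backup="$conf.$(date +%Y%m%d%H%M%S).bak"
        cp -p -- "$conf" "$backup" || return 1
        echo "$backup"
    fi
    cp -- "$stub" "$conf"
}

# Example:
#   reload_httpd_stub /var/opt/tableau/tableau_tsig/config
```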

Delete or move log files

Independent Gateway logs all access events. You will need to manage log file storage to avoid filling up disk space. If your disk fills up, Independent Gateway will be unable to write access events and the service will fail. The following message will be logged to error.log on Independent Gateway:

(28)No space left on device: [client 10.0.2.209:54332] AH00646: Error writing to /var/opt/tableau/tableau_tsig/logs/access.%Y_%m_%d_%H_%M_%S.log

This failure will result in a status of DEGRADED for the external node when you run tsm status -v on Tableau Node 1. The external node in the status output refers to Independent Gateway.

To resolve this issue, delete or move the access.log files off the disk. Access.log files are stored at /var/opt/tableau/tableau_tsig/logs. After you have cleared the disk, restart the tableau-tsig service.
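One way to keep the disk from filling up is to move older access logs to another location on a schedule. The following is a minimal sketch using find. The function name archive_old_access_logs, the archive path, and the seven-day threshold in the example are assumptions; choose values that match your retention requirements. The file name pattern access.*.log follows the name shown in the error message above.

```shell
#!/bin/bash
# archive_old_access_logs LOG_DIR ARCHIVE_DIR DAYS
# Moves access.*.log files whose age in whole days is greater than DAYS
# from LOG_DIR to ARCHIVE_DIR, and prints each file that was moved.
archive_old_access_logs() {
    local log_dir=$1 archive_dir=$2 days=$3
    mkdir -p -- "$archive_dir" || return 1
    # -maxdepth 1 keeps find from descending into subdirectories.
    find "$log_dir" -maxdepth 1 -type f -name 'access.*.log' \
        -mtime +"$days" -exec mv -- {} "$archive_dir"/ \; -print
}

# Example (hypothetical archive location on a different volume):
#   df -h /var/opt/tableau/tableau_tsig/logs
#   archive_old_access_logs /var/opt/tableau/tableau_tsig/logs /mnt/archive/tsig-logs 7
```

Selecting by age leaves recently written files in place, which avoids moving a log that the service may still have open.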

Browser errors

Bad Request: A common error for this scenario is a "Bad Request" error from Okta. Often this issue occurs when the browser is caching data from a previous Okta session. For example, if you manage the Okta applications as an Okta administrator and then attempt to access Tableau using a different Okta-enabled account, session data from the administrator session may cause the "Bad Request" error. If this error persists even after you clear the local browser cache, try validating the Tableau scenario by connecting with a different browser.

Another cause of the "Bad Request" error is a typo in one of the many URLs that you enter during the Okta, Mellon, and SAML configuration processes. Check that you entered all of these without error.

Often the error.log file on the Independent Gateway server will specify which URL is causing the error.

Not Found - The requested URL was not found on this server: This error indicates one of many configuration errors.

If the user is authenticated with Okta and then receives this error, it's likely that you uploaded the Okta pre-auth application metadata to Tableau Server when you configured SAML. Verify that you have the Okta Tableau Server application metadata configured on Tableau Server, and not the Okta pre-auth application metadata.

Other troubleshooting steps:

  • Review the Okta pre-auth application settings. Be sure HTTP vs HTTPS protocols are set as specified in this topic.
  • Restart tsig-httpd on both Independent Gateway servers.
  • Verify that sudo apachectl configtest returns “Syntax OK” on both Independent Gateways.
  • Verify that the test user is assigned to both applications in Okta.
  • Verify that stickiness is set on the load balancer and associated target groups.

Verify TLS connection from Tableau Server to Independent Gateway

Use the wget command to verify connectivity and access from Tableau Server to Independent Gateway. Variations of this command can help you understand if certificate issues are causing connection problems.

For example, run this wget command to verify the housekeeping (HK) protocol from Tableau Server:

wget https://ip-10-0-1-38.us-west-1.compute.internal:21319

Construct the URL with the same host address that you included for the host option of the tsig.json file. Specify the https protocol, and append the URL with the HK port 21319.

To check connectivity and ignore certificate verification:

wget https://ip-10-0-1-38.us-west-1.compute.internal:21319 --no-check-certificate

To verify root CA cert for TSIG is valid:

wget https://ip-10-0-1-38.us-west-1.compute.internal:21319 --ca-certificate=tsigRootCA.pem

If Tableau is able to communicate, then you may still get content-related errors, but you will not get connection-related errors. If Tableau is unable to connect at all, then start by verifying protocol configuration in the firewall/security groups. For example, the inbound rules for the security group where Independent Gateway resides must allow TCP 21319.
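The exit status that wget returns can help separate connection-related problems from content-related ones. The sketch below maps the exit codes documented in the GNU Wget manual to the checks described in this section. The function name explain_wget_exit is hypothetical, and the suggested next steps are summaries of the guidance above, not output from wget itself.

```shell
#!/bin/bash
# explain_wget_exit CODE
# Prints a short interpretation of a wget exit status.
explain_wget_exit() {
    case $1 in
        0) echo "success: connected and received a response" ;;
        4) echo "network failure: check security groups, host name, and port" ;;
        5) echo "SSL verification failure: check the certificate and root CA" ;;
        8) echo "server error response: connection works, error is content-related" ;;
        *) echo "other wget error (exit code $1): see the wget manual" ;;
    esac
}

# Example, using the host from the commands above:
#   wget -O /dev/null https://ip-10-0-1-38.us-west-1.compute.internal:21319
#   explain_wget_exit "$?"
```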
