Saturday, April 20, 2019

More tips and a check list for that elusive Percona Xtra DB (pxc) (galera) cluster SST or IST

While it's clear that SST/IST are the magic sauce of Galera and their implementation in MySQL by Percona there are dragons lurking that can prevent this from working.  It seems that lots of people have this issue so I created this checklist you can use to try and resolve your IST/SST issues.  At one time or another every one of these issues has caused me to have an SST/IST fail.

 In general, when debugging SST/IST issues it is best to use a specific donor for your joiner node so that you can isolate your troubleshooting and testing to a specific set of nodes.

Checklist


  • Are the nodes in the same cluster?  Check wsrep_cluster_name
  • Is the server_id the same for all nodes in the cluster?
  • Is wsrep_node_address correct?  A typo here causes very interesting errors.
  • Is wsrep_node_name correct?  Again, a simple error here causes SST to fail.
  • Is gcom the same for all nodes?  Triple check this.  Use nslookup on every address to check for correctness.   Do a reverse nslookup to make sure fqdn resolves to the expected ip address.  
  • Is wsrep_sst_donor a fqdn or node name?  Do not use an ip address in this field.  Refer to the documentation for methods to code this. 
  • Are ports 3306, 4444, 4567, and 4568 open?  This is a biggie so see below for tips on how to test.
  • Are you using xtrabackup_v2 for your wsrep_sst_method?  If not you should.  My experience with rsync, mysqldump, and the deprecated xtrabackup (no _v2) have been poor.  
  • Are xtrabackup versions the same on all nodes?  
  • Is /etc/hosts fqdn and ip address correct on all nodes?  Test using dns for every name and ip address you see.
  • Make sure you examine every ERROR in the MySQL error log and sometimes the WARNINGS are also worth a look.  Infrequently another error can cause SST/IST to fail for issues actually unrelated to either SST or IST.  
  • is socat working?  (see below on how to use socat to test ports._)
  • will the MySQL server start at all?  If you have a valid install you can try something simple like mysqld --wsrep-recover to see if MySQL will even start.  If your datadir is invalid then blow it away and reinitialize MySQL using mysqld initialize_insecure to bootstrap mysql using your my.cnf and then try to start it using mysqld --wsrep-recover only to make sure it will start.  It may uncover other issues you have with MySQL that are not specific to SST/IST.  This helped me once when an error existed in my.cnf which was not logged to the error log until I ran this standalone test.  

Port testing

Now about those pesky ports.  The best way to test this is using socat which is the same utility that sst/ist uses to move data between nodes.  The socat utility established a bi-directional tunnel between hosts and should be tested on both hosts.   Substitute your ip address below for donor and joiner.  You must test this in both directions to be certain SST/IST will work.

run this pair of tests for each port 4444, 4567, and 4568.  

from the joiner run this
    socat TCP4-LISTEN:4444 STDOUT    # this starts a listener on the port

from the donor run this.  You should see hello on your joiner
    echo "hello" | socat -t 30 tcp:
:4444 -

Now reverse this and do the same where the donor becomes the listener.  Repeat for all three pots.  If your socat works then you have eliminated ports as an issue.  So go back and check the items in the checklist once again.  

For port 3306 use the MySQL client from each host to contact the other host.  Since your joiner is down you may need to start it without using the cluster in order to ensure that port 3306 is open.  (See the checklist on how to get a bare bones mysqld setup to run.) 




I hope this is useful.

See also this article (somewhat dated and redundant) from Percona.
https://www.percona.com/blog/2014/12/30/diagnosing-sst-errors-with-percona-xtradb-cluster-for-mysql/