After lots of head scratching and work with the network team, we finally figured this one out. Percona support also played an important role. We started with the usual suspect: ports. PXC needs 3306, 4444, 4567, and 4568 open. For SST the critical port is 4444; 4567 carries group communication and 4568 is used for IST. Our hosts were located in two data centers and all of them were in the DMZ. Because of this, the networking was more complex than normal, with two sets of firewalls and NAT in play as well. I am no network person, but thankfully I had good support.
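A quick way to test the four ports from one node to another is sketched below. It is only a rough probe, assuming bash and the coreutils `timeout` command are available; the function names and the example hostname are made up for illustration. Keep in mind that some of these ports only have a listener while a state transfer is in progress, so a refused connection does not by itself prove a firewall problem, whereas a timeout more often points at filtering.

```shell
# Ports PXC needs open between nodes.
PXC_PORTS="3306 4444 4567 4568"

# probe HOST PORT: try a TCP connection and print one line describing the result.
probe() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
    case $? in
        0)   echo "$1:$2 open" ;;
        124) echo "$1:$2 no response (possibly filtered by a firewall)" ;;
        *)   echo "$1:$2 refused or unreachable (nothing listening, or rejected)" ;;
    esac
}

# check_node HOST: probe every PXC port on HOST.
check_node() {
    for port in $PXC_PORTS; do
        probe "$1" "$port"
    done
}

# Example (hypothetical hostname), run on the joiner against the donor:
#   check_node donor.example.com
```

Running it in both directions, joiner to donor and donor to joiner, helps show which side is blocking.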
Our first dig into the problem showed that ports were closed, but it was unclear on which end the connection was breaking. We used tcpdump to finally convince ourselves that traffic could flow across the network on the critical ports. We did this by running
tcpdump -n -i eth0 host <ip address of the other node>
We ran this on the joiner with the IP address of the donor, and on the donor with the IP address of the joiner. This gives you a view of the traffic in both directions and confirms whether the connection is actually being made.
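To cut down the noise, the capture can be narrowed to the PXC ports. The helper below is a small sketch that only builds the filter expression; `pxc_filter`, the interface name, and the address are placeholders rather than anything from our setup. With NAT in the path, the address to filter on is the peer's address as seen from the host doing the capture, which may not be its real one.

```shell
# pxc_filter PEER_IP: print a capture filter matching TCP traffic
# to or from PEER_IP on the four PXC ports.
pxc_filter() {
    echo "host $1 and tcp and (port 3306 or port 4444 or port 4567 or port 4568)"
}

# On the joiner pass the donor's address; on the donor pass the joiner's.
# 192.0.2.10 and eth0 are placeholders:
#   sudo tcpdump -n -i eth0 "$(pxc_filter 192.0.2.10)"
```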
After much work by the network team to open firewalls, we were finally able to see traffic flow between the two hosts. During the investigation we decided to pin down the joiner and donor so that we would always know which hosts to investigate. Here comes our first major error. We added the setting
wsrep_sst_donor=
on our joiner, and we used the IP address of the donor! Don't do this. It turns out that Galera specifically recommends against using the IP address and wants either the wsrep_node_name or the hostname of the donor. Once we fixed this, our SST succeeded. But our woes were not over.
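A small sanity check along these lines would have caught our mistake. This is a sketch, not an official tool: the function names are hypothetical, and the commented mysql commands assume the client can log in without extra options (for example via ~/.my.cnf).

```shell
# looks_like_ipv4 VALUE: succeed if VALUE has the form of a dotted-quad IPv4 address.
looks_like_ipv4() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
}

# check_donor_setting VALUE: warn when the donor value is an IP address.
check_donor_setting() {
    if looks_like_ipv4 "$1"; then
        echo "wsrep_sst_donor='$1' looks like an IP address; use the donor's node name instead"
        return 1
    fi
    echo "wsrep_sst_donor='$1' looks like a name"
}

# On the donor, print the name to use:
#   mysql -N -B -e "SELECT @@wsrep_node_name"
# On the joiner, check what is currently configured:
#   check_donor_setting "$(mysql -N -B -e "SELECT @@wsrep_sst_donor")"
```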
We then turned our attention to another node. It too failed SST, and the head scratching started again. Yes, the ports were open. Yes, we had the hostname in the donor setting. After some digging on Google we found a post that mentioned a failed SST caused by a version difference. Voilà! Our Puppet manifest had accidentally put the wrong version of MySQL on one of the hosts, and that was the reason for the SST failure. Once the node was upgraded to the correct version, the SST worked.
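Checking for this kind of mismatch takes only a few lines. The sketch below assumes the mysql client can reach every node with credentials from ~/.my.cnf; the function names and hostnames are placeholders. A node that has failed to join may not accept connections, in which case the package version on that host has to be checked directly.

```shell
# collect_versions HOST...: print "host version" for each node.
collect_versions() {
    for host in "$@"; do
        printf '%s %s\n' "$host" "$(mysql -h "$host" -N -B -e 'SELECT VERSION()')"
    done
}

# versions_match: read "host version" lines on stdin and succeed only
# if every line carries the same version string.
versions_match() {
    [ "$(awk '{print $2}' | sort -u | wc -l)" -le 1 ]
}

# Example with placeholder hostnames:
#   collect_versions node1 node2 node3 | tee /dev/stderr | versions_match \
#       || echo "version mismatch between nodes"
```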
Hopefully this will help someone someday.
You can read the full Galera post on why not to use an IP address as the donor here:
http://galeracluster.com/documentation-webpages/mysqlwsrepoptions.html#wsrep-sst-donor3