Some time ago I was pulled into a support call because a network switch had gone down. Normally that shouldn’t be a big deal, but one of our Citrix SDX NetScalers was connected to it. My first thought was “What is the issue? The HA pair should fail over to the other node.” After getting a few more details, it was clear that access to our XenApp and XenDesktop environment was impacted because the VPX NetScaler instance did not fail over. In this article I describe how to make sure failover between NetScaler instances works.
I should mention that I didn’t design or build this environment. That isn’t meant as an excuse for the issue; it just explains why I ran all the tests you are about to read, since I was learning the configuration as I was troubleshooting it. That is always fun.
There are three NetScaler families: SDX, MPX and VPX. A more detailed description can be found here, but essentially VPX are virtual appliances that need a hypervisor (XenServer, ESX or Hyper-V) to run; MPX is the middle-of-the-road family that, in my view, covers the needs of about 60% of the enterprises out there; and SDX is the high-end family. An SDX NetScaler runs XenServer as its core OS, with multiple VPX instances (from 5 to 80) running on top of it. The image below is a quick reference design that illustrates this description.
The issue and troubleshooting
During the call I was told that the network switch was still down and the network team was still looking into it, but failover hadn’t happened automatically, so I started troubleshooting.
As my first troubleshooting step I connected to the SDX web management interface to verify the status of the different instances. This should not be a problem, because the management NIC is connected to a different set of switches on a different VLAN. The image below shows what I found when I checked the SDX: NetScaler01 shows the VMs as up, but the instances as not connected (red dots) or unreachable.
NetScaler instance did not fail over
For this article we are going to focus on two VPX instances, Internal01 (.59) and Internal02 (.60), which are configured as a high availability (HA) pair. Up to this point I didn’t see any reason why failover wouldn’t work, so I decided to ping both instances. As expected, I wasn’t able to contact Internal01, but I was able to ping Internal02.
The second troubleshooting step was to connect to VPX Internal02 via the web management tool to check the HA status. The big surprise here was that Internal02 was reporting Internal01 as up. How is that possible? Not even the SDX hosting the Internal01 instance was able to reach it. So we had to go a little deeper and decided to use PuTTY to SSH into the VPX.
Once connected to Internal02 via PuTTY (SSH), the first test I ran was to check the HA node status using “sh ha node”. The results confirmed what the web interface was telling us: the devices were talking to each other.
Then I pinged Internal01 from Internal02 and got positive responses. I also successfully SSH’ed into Internal01 from within Internal02, so there was definitely communication between the two. But how was this happening if the network was down?
The next logical step was to get with the network team and try to find an explanation for this behavior. After checking everything, we couldn’t find anything there either, so following a suggestion from my network engineer we started looking at routing on the NetScaler. I checked Channels, Traffic Domains, IP tunnels, etc., and everything was clear; nothing had been created in those configuration sections. We finally checked the ARP table and there it was: an entry in Internal02’s ARP table that reached Internal01 via port 0/1.
So what are ports 0/1 and 0/2, and how is the VPX able to route the traffic? If you take a closer look at the image above, the highlighted line is the IP address .59 for Internal01. That IP is part of the GeneralTraffic-VLAN but was dynamically learned on port 0/2, which is actually on Management-VLAN, the management VLAN for the SDX. At this point we decided to contact Citrix Support to validate our findings and our thoughts.
VLAN90 doesn’t know or have a route for GeneralTraffic-VLAN, so basically Internal02 was using the SDX appliance’s management interface to send traffic between Internal01 and Internal02. Because the two nodes could still reach each other over that path, Internal02 kept seeing Internal01 as up, which prevented the automatic failover from happening. Citrix Support confirmed that this was the case.
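The ARP finding above can be turned into a quick check. The following is a minimal Python sketch, not a NetScaler tool: it assumes the ARP table has already been exported into simple (IP, interface) pairs, and the subnet and addresses are hypothetical documentation ranges, since this article only gives the last octets (.59 and .60). The names `ArpEntry` and `find_leaked_entries` are made up for illustration.

```python
import ipaddress
from typing import NamedTuple


class ArpEntry(NamedTuple):
    """One ARP table row, reduced to the two fields we care about."""
    ip: str
    interface: str


# SDX management ports mentioned in the article.
MGMT_INTERFACES = {"0/1", "0/2"}

# Hypothetical subnet standing in for GeneralTraffic-VLAN (assumption).
DATA_SUBNET = ipaddress.ip_network("192.0.2.0/24")


def find_leaked_entries(entries, data_subnet=DATA_SUBNET, mgmt=MGMT_INTERFACES):
    """Return entries whose IP is in the data subnet but was learned on a management port."""
    return [
        e for e in entries
        if ipaddress.ip_address(e.ip) in data_subnet and e.interface in mgmt
    ]


# Hypothetical sample data: the first row mirrors the situation in the article.
sample = [
    ArpEntry("192.0.2.59", "0/2"),     # data-VLAN peer learned on a management port
    ArpEntry("192.0.2.1", "1/1"),      # data-VLAN address on a data port: fine
    ArpEntry("198.51.100.10", "0/1"),  # management address on a management port: fine
]

leaked = find_leaked_entries(sample)
for entry in leaked:
    print(f"{entry.ip} was learned on management interface {entry.interface}")
```

Any row this reports is worth a closer look, because it means the peer is reachable over a path that bypasses the data network.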
When I checked the Internal02 interfaces, I saw that the SDX management interfaces 0/1 and 0/2 had been added to the VPX instance, and that is a mistake in the configuration. When you create VPX interfaces you should only select interfaces 1/x or 10/x; don’t add 0/1 or 0/2.
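The interface rule above is simple enough to check before provisioning. This is a small Python sketch under the assumption that the planned interface list is available as plain strings such as "1/1" or "0/2"; the function name `check_vpx_interfaces` is hypothetical and not part of any Citrix tooling.

```python
# Slots the article says are safe to assign to a VPX instance (1/x and 10/x).
ALLOWED_SLOTS = {"1", "10"}


def check_vpx_interfaces(selected):
    """Split a planned interface list into (allowed, rejected).

    Anything outside the 1/x and 10/x slots is rejected, which covers
    the SDX management ports 0/1 and 0/2.
    """
    allowed, rejected = [], []
    for name in selected:
        slot = name.split("/", 1)[0]
        if slot in ALLOWED_SLOTS:
            allowed.append(name)
        else:
            rejected.append(name)
    return allowed, rejected


# Hypothetical selection resembling the misconfigured instance.
allowed, rejected = check_vpx_interfaces(["0/1", "0/2", "1/1", "10/1"])
print("OK to assign:", allowed)
print("Do not assign:", rejected)
```

Running a check like this against the planned configuration catches the problem before the instance exists, which matters because the interface list cannot be changed afterwards.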
There are only two ways to fix the issue. The easiest and fastest is to disable the interfaces: simply right-click 0/1 and 0/2 and select Disable. While this is the easiest way, it also leaves two interfaces showing as down, which could confuse other support personnel in the future. Something I didn’t know, and which Citrix Support confirmed, is that there is no way to add or delete interfaces once the VPX is created. Therefore, to remove the interfaces from the VPX, we would need to delete the VPX and create a new one from scratch. With either solution, failover will work as designed.
Special thanks go out to the @citrixsupport team for working quickly with us on this. I believe we all learned something from this one.
Thank you for reading, and please leave us a message with your feedback.