ESXi host disconnects randomly from vCenter Server
I had a problem with my vCenter server intermittently disconnecting from the ESXi host for about 3 minutes, a couple of times per day. In this article I will share how I finally managed to fix it.
I read many articles addressing this issue, where heartbeats are not received, but none of them really applied to the scenario I was seeing. Many articles, including article 1005757 from VMware, state the importance of:
- Stable network connection between the vCenter server and the ESXi host (port 902/udp).
- Time being the same on vCenter server and the ESXi host.
My setup relevant to this case:
- The vCenter server appliance is a virtual machine on the ESXi host itself.
- There are no firewall rules blocking port 902/udp between vCenter and the host.
- I use VLANs in my network setup.
- The time on both the host and the vCenter server appliance is set by the same NTP server.
- A Synology NAS is connected to the ESXi host via NFS protocol.
So I addressed this issue one step at a time, based on the information I could find.
First, I checked that the heartbeat between the ESXi host and the vCenter server could get through. I added the heartbeat timeout setting config.vpxd.heartbeat.notRespondingTimeout under advanced configuration and set it to "120" and then to "240". A guide on how to do this is found in the article I linked to above. This workaround did not help me, though.
I actually had the vCenter server and the ESXi host on different VLANs, so the traffic went from the ESXi host to the physical router and back to the vCenter server on the host. Maybe my EdgeRouter was having problems with congestion at times and dropped packets, so to eliminate this possibility I changed the IP address of the vCenter server appliance and put it on the same VLAN as the ESXi host. Now the traffic between them would only go through the virtual switch (vSwitch0) and not leave the physical host.
This did not help though.
I noticed in the logs that there were some errors related to a datastore not being found. Usually this was the NFS-mounted datastore. I couldn't tell in what order it happened or if it was related, but maybe my Synology NAS didn't answer requests quickly enough, so I decided to check its power settings and disable HDD hibernation in the "Hardware & Power" settings. This did not help.
I had two DNS servers configured on my servers and devices: one for the local domain and one on the router, which got its DNS from my internet provider. I thought that maybe my local DNS sometimes did not answer queries fast enough, so the vCenter server and ESXi host would ask the router instead, which has no knowledge of the internal network or the FQDN of the vCenter server. So I removed the router as a DNS server, leaving only the local domain DNS on the network, with a public DNS as its upstream server. This way I would make sure that all requests to the DNS about internal hosts were resolved correctly. I'm not sure if this helped much, since I did get an error after this change, but I kept it since it makes more sense.
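One way to test the DNS theory above is to resolve the vCenter FQDN repeatedly and see whether the answer ever changes or fails. The following is only a minimal sketch, not something from my original troubleshooting: the function names resolve and check_stable are made up for illustration, and the hostname in the usage comment is a placeholder.

```python
import socket
import time


def resolve(hostname):
    """Return (sorted unique addresses, lookup time in seconds) for hostname."""
    start = time.monotonic()
    infos = socket.getaddrinfo(hostname, None)
    elapsed = time.monotonic() - start
    # Each getaddrinfo entry ends with a sockaddr tuple; its first item is the address.
    addresses = sorted({info[4][0] for info in infos})
    return addresses, elapsed


def check_stable(hostname, attempts=5, resolver=resolve):
    """Resolve hostname several times and return the set of distinct answers.

    A failed lookup is recorded as the marker ("<failed>",). More than one
    element in the result means the answers were not consistent.
    """
    answers = set()
    for _ in range(attempts):
        try:
            addresses, _elapsed = resolver(hostname)
        except socket.gaierror:
            answers.add(("<failed>",))
            continue
        answers.add(tuple(addresses))
    return answers


# Hypothetical usage (placeholder hostname, replace with your vCenter FQDN):
# print(check_stable("vcenter.example.local"))
```

If the returned set contains more than one entry, or the failure marker, the clients are not always getting the same answer, which would fit the scenario with two DNS servers that disagree about internal names.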
Time (problem solved)
I read in many articles about the importance of the time being exactly the same on the ESXi host and the vCenter server, and that they should have the same setting. I had both of them set to NTP from "se.pool.ntp.org", so this seemed an unlikely cause. Still, I changed it so that the ESXi host got its time from NTP and the vCenter server got its time from the host instead.
After this final change, the problem seems to be solved. No more errors about the host randomly being disconnected from vCenter!
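To see how far a machine's clock actually is from an NTP server, a small SNTP query can be used. This is only a rough sketch under my own assumptions and not how I diagnosed the problem: the function names are made up, it sends a basic SNTP client request, and it uses the standard NTP offset formula. Running it on both the host and the vCenter appliance against the same server would give two offsets to compare.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def clock_offset(t1, t2, t3, t4):
    """Standard NTP offset estimate.

    t1: client send time, t2: server receive time,
    t3: server transmit time, t4: client receive time (all in seconds).
    A positive result means the local clock is behind the server.
    """
    return ((t2 - t1) + (t3 - t4)) / 2.0


def parse_timestamp(packet, start):
    """Decode the 64-bit NTP timestamp at byte offset start into Unix seconds."""
    seconds, fraction = struct.unpack("!II", packet[start:start + 8])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def query_offset(server, timeout=2.0):
    """Send one SNTP request to server and return the estimated offset in seconds."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, version 3, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t1 = time.time()
        sock.sendto(request, (server, 123))
        packet, _ = sock.recvfrom(512)
        t4 = time.time()
    if len(packet) < 48:
        raise ValueError("short NTP reply")
    t2 = parse_timestamp(packet, 32)  # server receive timestamp
    t3 = parse_timestamp(packet, 40)  # server transmit timestamp
    return clock_offset(t1, t2, t3, t4)


# Hypothetical usage, run on each machine and compare the results:
# print(query_offset("se.pool.ntp.org"))
```

Keep in mind that a pool name like "se.pool.ntp.org" can point to different servers on each lookup, so two machines using the same pool name are not necessarily synchronizing against the same clock.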