The ESXi host stops responding and displays a purple diagnostic screen. This problem is typically caused by one of several errors:
• CPU exception
• Driver or module panic message
• Machine check exception
• Hardware fault
• Software defect
The available information for many problems might prove inconclusive. Server hangs, purple screens without disk dumps, or disk failures might leave the server with little information logged about a problem. Although the root cause of this outage might be elusive, we can better prepare for the next time the problem occurs.
We review logs for diagnostic messages that were generated both leading up to the problem and during the occurrence of the problem.
For hardware faults, we run hardware diagnostics.
Faulty CPUs can manifest as unusual behavior, such as abrupt reboots, hangs, or purple screens. Most often, the CPU generates an exception that is trapped by the VMkernel and handled with a purple screen.
Verifying an ESXi Host Failure
We view the ESXi local console at the DCUI to verify the purple diagnostic screen.
The ESXi host fails when the VMkernel enters a condition where it cannot or should not proceed. A VMkernel fault is manifested by a purple screen on the ESXi console.
When recovering from a host failure in a production environment, the main goal is to get our VMs running again as soon as possible. A vSphere HA cluster can help us recover quickly when one of the hosts in the cluster fails.
Recovering from a Purple Diagnostic Screen Failure
To recover from a purple diagnostic screen failure on an ESXi host:
1. Record the state of the system:
• Take a screenshot or photograph of the purple diagnostic screen.
• Note any relevant environmental issues or conditions.
2. Restart the host:
• If vSphere HA does not restart the VMs, get the VMs up and running on a different host.
• Collect a vm-support log bundle from the affected host.
3. Contact VMware Technical Support:
• If VMware Technical Support determines that the issue is a hardware problem, we must contact our hardware vendor.
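The log collection step above can be scripted. The following is a minimal sketch, assuming it runs in the ESXi Shell (or over SSH) on the affected host after the restart and that a Python interpreter is available there. The vm-support command and its -w option (the directory where the bundle is written) exist on ESXi; the function names and the example datastore path are hypothetical.

```python
import subprocess
from typing import List, Optional


def build_vm_support_cmd(working_dir: Optional[str] = None) -> List[str]:
    """Return the argument list for vm-support, optionally with an output directory."""
    cmd = ["vm-support"]
    if working_dir:
        # -w sets the directory in which vm-support writes the bundle.
        cmd += ["-w", working_dir]
    return cmd


def collect_bundle(working_dir: Optional[str] = None) -> int:
    """Run vm-support and return its exit code (only meaningful on an ESXi host)."""
    result = subprocess.run(build_vm_support_cmd(working_dir), check=False)
    return result.returncode


if __name__ == "__main__":
    # Hypothetical datastore path; replace it with a datastore that has free space.
    print(" ".join(build_vm_support_cmd("/vmfs/volumes/datastore1")))
```

Writing the bundle to a datastore rather than the default location is a design choice here: a datastore is more likely to have room for a large bundle.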
When a host stops responding, the entire server becomes unresponsive. We might not be able to determine whether the issue is related to the hardware or software without collecting further data.
If the host is unresponsive and we cannot boot the system properly, the problem might be a corrupt configuration or a hardware fault. In this case, we try to boot from a diagnostics CD or the installer CD, if possible.
If the ESXi host is using encryption, the core dumps must be decrypted before they can be investigated.
ESXi Host Is Unresponsive
An ESXi host can stop responding when one of these events occurs:
• The VMkernel is too busy or is deadlocked.
• A hardware lockup occurs.
Verifying ESXi Host Unresponsiveness
To confirm that an ESXi host is not responding, we determine whether we can perform the following tasks on the host:
• Ping the VMkernel network interface.
• Determine whether the vSphere Client responds to queries.
• Monitor network traffic from the ESXi host and its VMs.
If any of these tasks succeeds, the ESXi host is at least minimally operational.
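Part of this check can be automated from a management workstation. The sketch below is an illustration under stated assumptions, not a definitive procedure: it assumes the host's management interface answers on TCP port 443 (HTTPS), and the function names and the host name in the usage example are hypothetical. The ping test and the traffic monitoring are left as manual steps, and their results can be passed in as booleans.

```python
import socket
from typing import Mapping


def tcp_port_open(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def host_looks_alive(results: Mapping[str, bool]) -> bool:
    """Apply the rule from the text: any successful task means the host is
    at least minimally operational."""
    return any(results.values())


if __name__ == "__main__":
    host = "esxi01.example.com"  # hypothetical host name
    checks = {
        "ping": False,  # fill in from a manual ping of the VMkernel interface
        "https": tcp_port_open(host),
    }
    print(checks, "->", host_looks_alive(checks))
```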
To verify that the host is not responding, we use the ESXi host’s DCUI to display VMkernel messages on the screen (press ALT+F12).
The main goal is to get the VMs back up and running as soon as possible. After they are up and running, we do research and try to determine why the ESXi host locked up. We check the VMkernel log file (/var/log/vmkernel.log) for error messages.
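Reviewing the VMkernel log can be sped up with a small filter. The following is a minimal sketch: the log path comes from the text above, while the keyword list and the function names are assumptions chosen for illustration and are not an official list of VMkernel messages.

```python
from typing import Iterable, List, Sequence

# Illustrative search terms only (an assumption, not an official list).
KEYWORDS = ("error", "panic", "exception", "lockup")


def find_suspect_lines(lines: Iterable[str],
                       keywords: Sequence[str] = KEYWORDS) -> List[str]:
    """Return the lines that contain any keyword, compared case-insensitively."""
    lowered = [k.lower() for k in keywords]
    return [
        line.rstrip("\n")
        for line in lines
        if any(k in line.lower() for k in lowered)
    ]


def scan_log(path: str = "/var/log/vmkernel.log") -> List[str]:
    """Read the log at path and return the suspect lines."""
    # errors="replace" keeps the scan going if the log holds undecodable bytes.
    with open(path, errors="replace") as fh:
        return find_suspect_lines(fh)


if __name__ == "__main__":
    for hit in scan_log():
        print(hit)
```

A plain substring match is deliberately broad. It will produce false positives, which is usually acceptable for a first pass over the messages logged before the outage.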
We run the esxtop command to gather performance statistics. The esxtop command shows current performance statistics of the entire ESXi host.
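For later analysis, esxtop can also run in batch mode and write its statistics to a file. The sketch below assumes it runs on the ESXi host with a Python interpreter available. The -b (batch), -d (delay in seconds), and -n (number of iterations) options are esxtop options; the function names, default values, and output path are hypothetical.

```python
import subprocess
from typing import List


def build_esxtop_cmd(delay_s: int = 5, iterations: int = 12) -> List[str]:
    """Return the argument list for an esxtop batch-mode capture."""
    if delay_s < 1 or iterations < 1:
        raise ValueError("delay_s and iterations must be positive")
    return ["esxtop", "-b", "-d", str(delay_s), "-n", str(iterations)]


def capture_esxtop(out_path: str, delay_s: int = 5, iterations: int = 12) -> int:
    """Run esxtop in batch mode, write its CSV output to out_path,
    and return the exit code (only meaningful on an ESXi host)."""
    with open(out_path, "w") as out:
        result = subprocess.run(
            build_esxtop_cmd(delay_s, iterations), stdout=out, check=False
        )
    return result.returncode


if __name__ == "__main__":
    # Hypothetical output path; with the defaults this samples for about a minute.
    print(capture_esxtop("/tmp/esxtop-capture.csv"))
```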
A best practice is to capture the logs, independent of disk or network connectivity, when we are troubleshooting issues. Enabling serial logging sends all VMkernel logs to the serial port in addition to their normal destination.
For more information about backing up an ESXi host, see VMware knowledge base article 2042141.
Recovering from an ESXi Host Failure
To recover from an ESXi host failure:
1. Reboot the host.
2. Determine why the host locked up:
• Review system logs from before the outage.
• Set up serial-line logging in case the issue occurs again.
• Review the performance charts.
3. After hardware problems are corrected, reinstall the ESXi host and restore its configuration from our most recent backup, in case the faulty hardware corrupted the disk.
4. Install the latest patches and updates for the ESXi host.
I hope this has been useful to you. See you next time!