This article is part of the Becoming a VMware Storage Expert series.
APD and PDL are among the most discussed topics in vSphere storage, yet the subject never gets old. VMware introduced its handling of APD and PDL conditions in vSphere 5.x and 6.x.
This article is part of the How to Handle APD & PDL series.
But what exactly happens when disaster strikes and an APD (All Paths Down) condition occurs? When an APD condition is detected, a timer starts. After 140 seconds the APD condition is officially declared and the device is marked as APD timed out. Once those 140 seconds have passed, HA starts its own countdown; the default HA timeout is 3 minutes.
The timeout parameter controls how many seconds the ESXi host retries non-virtual-machine I/O commands to a storage device that is in an all paths down (APD) state. If needed, you can change the default timeout value.
The timer starts immediately after the device enters the APD state. When the timeout expires, the host marks the APD device as unreachable and fails any pending or new non-virtual-machine I/O. Virtual machine I/O continues to be retried.
The default timeout on your host is 140 seconds. You can increase the value if, for example, storage devices connected to your ESXi host take longer than 140 seconds to recover from a connection loss.
Under Advanced System Settings, select the Misc.APDTimeout parameter and change the default value.
You can enter a value between 20 and 99999 seconds.
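The same change can also be scripted from an ESXi shell session. The sketch below is only an illustration: `apd_timeout_cmd` is a hypothetical helper name, and it assumes the command-line option path `/Misc/APDTimeout` is the esxcli form of the Misc.APDTimeout setting. It checks the value against the 20 to 99999 range given above and prints the command instead of running it, so you can review it first.

```shell
#!/bin/sh
# Hypothetical helper: validate a new APD timeout and print (not run) the
# esxcli command that would apply it. The 20-99999 range comes from the text.
apd_timeout_cmd() {
    value="$1"
    # Reject empty input and anything that is not a whole number.
    case "$value" in
        ''|*[!0-9]*)
            echo "error: timeout must be a whole number" >&2
            return 1 ;;
    esac
    if [ "$value" -lt 20 ] || [ "$value" -gt 99999 ]; then
        echo "error: timeout must be between 20 and 99999 seconds" >&2
        return 1
    fi
    # Assumption: /Misc/APDTimeout is the esxcli path for Misc.APDTimeout.
    echo "esxcli system settings advanced set -o /Misc/APDTimeout -i $value"
}

apd_timeout_cmd 180
```

Printing the command rather than executing it keeps the helper safe to try on any machine; on a host you would copy the printed line into the shell once you are happy with the value.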
Once the 3 minutes have passed, HA VMCP (VM Component Protection) can restart the impacted virtual machines, depending on how you configured it.
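Put together, the timeline is simple addition. As a rough sketch, assuming the 140-second APD timeout and the 3-minute (180-second) HA delay described in this article, the hypothetical helper `vmcp_earliest_reaction` prints how many seconds after the start of an APD event VMCP could first react. The function name and its arguments are illustrative and not part of any VMware tool.

```shell
#!/bin/sh
# Hypothetical helper: earliest time, in seconds after the APD begins, at
# which HA VMCP may react. Defaults mirror the values used in this article.
vmcp_earliest_reaction() {
    apd_timeout="${1:-140}"   # APD timeout in seconds (Misc.APDTimeout)
    ha_delay="${2:-180}"      # HA delay after the APD timeout (3 minutes)
    echo $((apd_timeout + ha_delay))
}

vmcp_earliest_reaction            # defaults: 140 + 180
vmcp_earliest_reaction 200 180    # with a raised APD timeout
```

With the defaults this gives 320 seconds, a little over five minutes, which is worth keeping in mind when you raise the APD timeout for slow-recovering arrays.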
An unplanned permanent device loss (PDL) condition occurs when a storage device becomes permanently unavailable without being properly detached from the ESXi host.
The following items in the vSphere Web Client indicate that the device is in the PDL state:
- The datastore deployed on the device is unavailable.
- The operational state of the device changes to Lost Communication.
- All paths are shown as Dead.
- A warning about the device being permanently inaccessible appears in the VMkernel log file.
This table outlines possible SCSI sense codes that determine if a device is in a PDL state:
SCSI sense code | Description |
H:0x0 D:0x2 P:0x0 Valid sense data: 0x__/0x25/0x0 | *LOGICAL UNIT NOT SUPPORTED |
H:0x0 D:0x2 P:0x0 Valid sense data: 0x__/0x68/0x0 | *LOGICAL UNIT NOT CONFIGURED |
H:0x0 D:0x2 P:0x0 Valid sense data: 0x4/0x4c/0x0 | HARDWARE ERROR/LOGICAL UNIT FAILED SELF-CONFIGURATION |
H:0x0 D:0x2 P:0x0 Valid sense data: 0x4/0x3e/0x3 | HARDWARE ERROR/LOGICAL UNIT FAILED SELF-TEST |
H:0x0 D:0x2 P:0x0 Valid sense data: 0x4/0x3e/0x1 | HARDWARE ERROR/LOGICAL UNIT FAILURE |
H:0x0 D:0x2 P:0x0 Valid sense data: 0x2/0x4c/0x0 | NOT READY/LOGICAL UNIT FAILED SELF-CONFIGURATION |
H:0x0 D:0x2 P:0x0 Valid sense data: 0x2/0x3e/0x3 | NOT READY/LOGICAL UNIT FAILED SELF-TEST |
H:0x0 D:0x2 P:0x0 Valid sense data: 0x2/0x3e/0x1 | NOT READY/LOGICAL UNIT FAILURE |
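When reading logs, it can help to check a sense key / ASC / ASCQ triple against the table mechanically. The following is a minimal sketch, not a VMware utility: `is_pdl_sense` is a hypothetical function, and it assumes that the `0x__` placeholder in the first two rows means "any sense key".

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if the sense key / ASC / ASCQ triple matches a
# PDL row from the table above, exit 1 otherwise.
# Assumption: "0x__" in the table means "any sense key".
is_pdl_sense() {
    key="$1"; asc="$2"; ascq="$3"
    case "$asc/$ascq" in
        # LOGICAL UNIT NOT SUPPORTED / NOT CONFIGURED, any sense key
        0x25/0x0|0x68/0x0) return 0 ;;
    esac
    case "$key/$asc/$ascq" in
        # HARDWARE ERROR rows (sense key 0x4)
        0x4/0x4c/0x0|0x4/0x3e/0x3|0x4/0x3e/0x1) return 0 ;;
        # NOT READY rows (sense key 0x2)
        0x2/0x4c/0x0|0x2/0x3e/0x3|0x2/0x3e/0x1) return 0 ;;
    esac
    return 1
}

if is_pdl_sense 0x4 0x3e 0x1; then echo "PDL"; else echo "not PDL"; fi
```

The function only compares the three sense values; it does not check the `H:`, `D:` and `P:` status fields, which you would still need to confirm match the table.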
When the device enters this PDL state, the vSphere host can take action to stop directing further, unnecessary I/O to the device. This avoids other problems that such I/O might cause on the host. vSphere 5.5 introduced a feature called PDL AutoRemove, which automatically removes a device from a host when it enters a PDL state. This matters because vSphere ESXi hosts have a limit of 255 disk devices per host: a device in a PDL state can no longer accept I/O, yet it would otherwise still occupy one of the available disk device slots.
PDL AutoRemove occurs only if there are no open handles left on the device. The auto-remove takes place when the last handle on the device closes. If the device recovers, or if it is re-added after having been inadvertently removed, it is treated as a new device.
- Run this command to disable AutoRemove: esxcli system settings advanced set -o "/Disk/AutoremoveOnPDL" -i 0
Notes:
- No guarantee of data integrity can be given if a device returns from PDL.
- To re-enable the AutoRemove feature, run this command from a shell session to the ESXi host: esxcli system settings advanced set -o "/Disk/AutoremoveOnPDL" -i 1
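Before or after changing the setting, you may want to confirm its current value. The sketch below assumes that `esxcli system settings advanced list -o /Disk/AutoremoveOnPDL` prints the value on an `Int Value:` line; the sample text is an illustrative stand-in for that output, not captured from a real host, and `advanced_int_value` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read "esxcli system settings advanced list" output on
# stdin and print the option's current integer value.
advanced_int_value() {
    # Match only the "Int Value:" line, not "Default Int Value:".
    awk -F': *' '/^[ \t]*Int Value:/ { print $2; exit }'
}

# Illustrative sample of the list output (the field layout is an assumption).
sample='   Path: /Disk/AutoremoveOnPDL
   Type: integer
   Int Value: 1
   Default Int Value: 1'

printf '%s\n' "$sample" | advanced_int_value
```

On a host, the same function would be fed from the real command, for example `esxcli system settings advanced list -o /Disk/AutoremoveOnPDL | advanced_int_value`, where 1 means AutoRemove is enabled and 0 means it is disabled.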