vSekar Blog

Getting Hands Dirty With Technology

Another blog post on APD & PDL

March 6, 2018 By v

This article is part of the Becoming a VMware Storage Expert series.

APD and PDL are among the most discussed topics in vSphere storage, yet the subject never gets old. Handling for APD and PDL conditions was introduced across the vSphere 5.x and 6.x releases.

This article is also part of the How to Handle APD & PDL series.

But what exactly happens when disaster strikes and an APD has occurred? When an APD condition is detected, a timer starts. After 140 seconds the APD condition is officially declared and the device is marked as APD timed out. Once those 140 seconds have passed, HA starts its own countdown; the default HA timeout is 3 minutes.
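
The timeline above comes down to simple arithmetic. Here is a minimal sketch using the default values quoted in this post (140 seconds for the APD timeout, 3 minutes for the HA delay); the variable and function names are hypothetical and only there for illustration.

```shell
#!/bin/sh
# Defaults quoted in this post; adjust them to match your environment.
APD_TIMEOUT=140   # seconds until the APD timeout is declared
HA_DELAY=180      # seconds HA waits after the APD timeout (3 minutes)

# seconds_until_ha_reacts <apd_timeout> <ha_delay>
# Prints the total time from APD detection until HA may act.
seconds_until_ha_reacts() {
    echo $(( $1 + $2 ))
}

total=$(seconds_until_ha_reacts "$APD_TIMEOUT" "$HA_DELAY")
echo "HA can react about ${total} seconds after the APD is detected"
```

With the defaults, that is a little over five minutes between the first lost path and a possible VM restart.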

The timeout parameter controls how many seconds the ESXi host retries non-virtual machine I/O commands to a storage device in an all paths down (APD) state. If needed, you can change the default timeout value.

The timer starts immediately after the device enters the APD state. When the timeout expires, the host marks the APD device as unreachable and fails any pending or new non-virtual machine I/O. Virtual machine I/O continues to be retried.

The default timeout parameter on your host is 140 seconds. You can increase the value of the timeout if, for example, storage devices connected to your ESXi host take longer than 140 seconds to recover from a connection loss.

Under Advanced System Settings, select the Misc.APDTimeout parameter and change the default value. You can enter a value between 20 and 99999 seconds.
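
The same setting can also be inspected and changed from the ESXi shell. This is a sketch, assuming the option is exposed to esxcli as /Misc/APDTimeout; the helper name valid_apd_timeout and the example value 180 are hypothetical, and the range check simply mirrors the 20 to 99999 second limits mentioned above.

```shell
#!/bin/sh
# valid_apd_timeout <seconds>
# Succeeds only for whole numbers in the 20-99999 range quoted above.
valid_apd_timeout() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge 20 ] && [ "$1" -le 99999 ]
}

NEW_TIMEOUT=180   # hypothetical example value

# Only talk to the host when esxcli is actually available.
if command -v esxcli >/dev/null 2>&1 && valid_apd_timeout "$NEW_TIMEOUT"; then
    # Show the current and default values first.
    esxcli system settings advanced list -o /Misc/APDTimeout
    # Apply the new timeout.
    esxcli system settings advanced set -o /Misc/APDTimeout -i "$NEW_TIMEOUT"
fi
```

Validating the value before calling esxcli keeps a typo from ever reaching the host.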

When the 3 minutes have passed, HA VMCP (VM Component Protection) can restart the impacted virtual machines, depending on how you configured it.

An unplanned permanent device loss (PDL) condition occurs when a storage device becomes permanently unavailable without being properly detached from the ESXi host.

The following items in the vSphere Web Client indicate that the device is in the PDL state:

  • The datastore deployed on the device is unavailable.

  • The operational state of the device changes to Lost Communication.

  • All paths are shown as Dead.

  • A warning about the device being permanently inaccessible appears in the VMkernel log file.
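
Some of these indicators can also be checked from the ESXi shell. This is a rough sketch, assuming the VMkernel log lives at /var/log/vmkernel.log and that the warning contains the phrase "permanently inaccessible" (the exact wording may differ between releases). The function name find_pdl_warnings and the device ID placeholder are hypothetical.

```shell
#!/bin/sh
# find_pdl_warnings <logfile>
# Prints log lines that mention a permanently inaccessible device.
find_pdl_warnings() {
    grep -i "permanently inaccessible" "$1"
}

LOG=/var/log/vmkernel.log        # assumed default log location
DEVICE="naa.xxxxxxxxxxxxxxxx"    # placeholder, replace with a real device ID

if [ -r "$LOG" ]; then
    find_pdl_warnings "$LOG" || echo "no PDL warnings found in $LOG"
fi

if command -v esxcli >/dev/null 2>&1; then
    # Device details, including its status field.
    esxcli storage core device list -d "$DEVICE"
    # Per-path state for the same device.
    esxcli storage core path list -d "$DEVICE"
fi
```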

This table outlines possible SCSI sense codes that determine if a device is in a PDL state:

SCSI sense code | Description
H:0x0 D:0x2 P:0x0 Valid sense data: 0x__/0x25/0x0 | *LOGICAL UNIT NOT SUPPORTED
H:0x0 D:0x2 P:0x0 Valid sense data: 0x__/0x68/0x0 | *LOGICAL UNIT NOT CONFIGURED
H:0x0 D:0x2 P:0x0 Valid sense data: 0x4/0x4c/0x0 | HARDWARE ERROR/LOGICAL UNIT FAILED SELF-CONFIGURATION
H:0x0 D:0x2 P:0x0 Valid sense data: 0x4/0x3e/0x3 | HARDWARE ERROR/LOGICAL UNIT FAILED SELF-TEST
H:0x0 D:0x2 P:0x0 Valid sense data: 0x4/0x3e/0x1 | HARDWARE ERROR/LOGICAL UNIT FAILURE
H:0x0 D:0x2 P:0x0 Valid sense data: 0x2/0x4c/0x0 | NOT READY/LOGICAL UNIT FAILED SELF-CONFIGURATION
H:0x0 D:0x2 P:0x0 Valid sense data: 0x2/0x3e/0x3 | NOT READY/LOGICAL UNIT FAILED SELF-TEST
H:0x0 D:0x2 P:0x0 Valid sense data: 0x2/0x3e/0x1 | NOT READY/LOGICAL UNIT FAILURE
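
To make the table easier to apply when reading logs, here is a small sketch that checks whether a sense key / ASC / ASCQ triple matches one of the rows above. The function name is_pdl_sense is hypothetical. Because the sense key is not shown for the first two rows, the sketch matches those on ASC/ASCQ alone, which is an assumption.

```shell
#!/bin/sh
# is_pdl_sense <sense_key> <asc> <ascq>
# Succeeds when the triple matches a row of the PDL table above.
is_pdl_sense() {
    # Normalise hex digits to lower case so 0x4C and 0x4c compare equal.
    key=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    asc=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    ascq=$(printf '%s' "$3" | tr 'A-F' 'a-f')

    # Rows whose sense key is not given in the table (assumption: any key).
    case "$asc/$ascq" in
        0x25/0x0|0x68/0x0) return 0 ;;
    esac
    # Rows with an explicit sense key (0x4 HARDWARE ERROR, 0x2 NOT READY).
    case "$key/$asc/$ascq" in
        0x4/0x4c/0x0|0x4/0x3e/0x3|0x4/0x3e/0x1) return 0 ;;
        0x2/0x4c/0x0|0x2/0x3e/0x3|0x2/0x3e/0x1) return 0 ;;
    esac
    return 1
}

if is_pdl_sense 0x4 0x3e 0x1; then
    echo "0x4/0x3e/0x1 indicates PDL"
fi
```

Any triple that is not in the table makes the function fail, so it can be used directly in an if statement.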

When the device enters this PDL state, the vSphere host can take action to stop directing any further, unnecessary I/O to this device. This alleviates other conditions that might arise on the host as a result of this unnecessary I/O. vSphere 5.5 introduced a new feature called PDL AutoRemove, which automatically removes a device from a host when it enters a PDL state. This matters because ESXi hosts have a limit of 255 disk devices per host: a device in a PDL state can no longer accept I/O, yet it still occupies one of the available disk device slots.

PDL AutoRemove occurs only if there are no open handles left on the device. The auto-remove takes place when the last handle on the device closes. If the device recovers, or if it is re-added after having been inadvertently removed, it is treated as a new device.

  1. Run this command to disable AutoRemove:

     esxcli system settings advanced set -o "/Disk/AutoremoveOnPDL" -i 0

Notes:

  • No guarantee of data integrity can be given if a device returns from PDL.
  • To re-enable the AutoRemove feature, run this command from a shell session on the ESXi host:

     esxcli system settings advanced set -o "/Disk/AutoremoveOnPDL" -i 1
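
Before changing the setting, it can help to check its current value. This is a sketch, assuming that the output of "esxcli system settings advanced list" reports the current value on a line labelled "Int Value:"; the function name autoremove_value is hypothetical.

```shell
#!/bin/sh
# autoremove_value
# Reads "esxcli system settings advanced list" output on stdin and prints
# the current integer value. The "Int Value:" label is an assumption about
# the output format.
autoremove_value() {
    awk -F': *' '/^ *Int Value:/ { print $2 }'
}

if command -v esxcli >/dev/null 2>&1; then
    current=$(esxcli system settings advanced list -o /Disk/AutoremoveOnPDL | autoremove_value)
    case "$current" in
        1) echo "PDL AutoRemove is enabled" ;;
        0) echo "PDL AutoRemove is disabled" ;;
        *) echo "could not determine the AutoRemove setting" ;;
    esac
fi
```
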

In vSphere 6.x, the expectation is that a device in a PDL state will not return. Therefore, the device needs to be removed from the ESXi host before it can be recovered. If the AutoremoveOnPDL feature is disabled, a manual rescan is required to remove the device while it is in a PDL state.

Note: For vSphere Metro Storage Cluster (vMSC) environments, VMware recommends explicitly setting AutoremoveOnPDL to 1 for vSphere 6.x.

A planned PDL occurs when there is an intent to remove a device presented to the ESXi host. The datastore must first be unmounted, then the device detached, before the storage device can be unpresented at the storage array. For more information on how to correctly unpresent a LUN, this KB explains how to unmount a LUN safely.

Sometimes we may run into issues when remounting a LUN after a PDL; follow this article if you face the issue.
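
The planned removal order described above (unmount, detach, then unpresent) might look like this from the ESXi shell. This is a sketch only: the datastore label and device ID are placeholders, and DRY_RUN is a hypothetical safety switch that defaults to printing the commands rather than executing them.

```shell
#!/bin/sh
DATASTORE="my-datastore"          # placeholder datastore label
DEVICE="naa.xxxxxxxxxxxxxxxx"     # placeholder device ID
DRY_RUN=${DRY_RUN:-1}             # 1 = print commands only

# run <command...>: print the command, and execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    [ "$DRY_RUN" = "1" ] || "$@"
}

planned_removal() {
    # 1. Unmount the datastore that lives on the device.
    run esxcli storage filesystem unmount -l "$DATASTORE"
    # 2. Detach the device from the host.
    run esxcli storage core device set -d "$DEVICE" --state=off
    # 3. Unpresenting the LUN happens on the storage array, outside ESXi.
    echo "# now unpresent the LUN at the storage array"
}

planned_removal
```

Set DRY_RUN=0 only after the printed commands have been reviewed against the real datastore and device names.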

Filed Under: Availability, storage, Troubleshooting