Since we moved to 5.5, we have seen more APD (All Paths Down) errors than we did on 5.0. In 5.1 VMware changed how APD errors are logged, so the logging is a bit more sensitive, but it is also very telling of issues with the storage.
I had a Linux file system go read-only a month ago and was trying to track down the issue. We have NetApp storage in cluster mode and multiple tools, including vROps, Log Insight and OnCommand Insight. The tools were showing some latency and a lot of APDs, though the two weren’t exactly aligned in time. There was also a vMotion that occurred around the same time as the file system went read-only.
I opened a case with VMware, who pointed to the APDs. You can’t immediately say an APD caused a read-only file system, but it is possible, and it was the only lead we had. My storage team opened a case with NetApp, who asked for various performance captures. They noticed in the configuration that the netmask of the LIFs (the virtual IPs used to present storage) was not the same across the board.
The netmask that should have been in use was 22-bit (255.255.252.0), but a quarter of the LIFs had a 24-bit netmask (255.255.255.0). From the ESXi host there was no problem mounting the datastore via the LIFs with the bad netmask, but you could not ping those LIFs.
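If you can export the LIF addresses and netmasks from your storage system, a quick script can flag the odd ones out. The following is only a minimal sketch using Python's standard `ipaddress` module. The LIF names, IP addresses and host address are made up for illustration; only the two netmasks come from our case. The second helper shows one possible reason a ping can fail: with a 24-bit mask, a LIF no longer treats a host elsewhere in the 22-bit range as being on its own subnet. I did not confirm that this is exactly what happened in our environment.

```python
import ipaddress

EXPECTED_PREFIX = 22  # 255.255.252.0, the netmask that should have been in use

# Hypothetical LIF inventory: (name, IP address, netmask).
# Names and addresses are invented for this example.
LIFS = [
    ("lif_nfs_01", "10.10.4.11", "255.255.252.0"),
    ("lif_nfs_02", "10.10.4.12", "255.255.252.0"),
    ("lif_nfs_03", "10.10.4.13", "255.255.252.0"),
    ("lif_nfs_04", "10.10.4.14", "255.255.255.0"),  # the odd one out
]


def find_mismatched_lifs(lifs, expected_prefix):
    """Return (name, network) for every LIF whose prefix length is not the expected one."""
    mismatched = []
    for name, ip, mask in lifs:
        iface = ipaddress.ip_interface(f"{ip}/{mask}")
        if iface.network.prefixlen != expected_prefix:
            mismatched.append((name, str(iface.network)))
    return mismatched


def on_link(lif_ip, lif_mask, host_ip):
    """Return True if host_ip falls inside the subnet implied by the LIF's own netmask."""
    network = ipaddress.ip_interface(f"{lif_ip}/{lif_mask}").network
    return ipaddress.ip_address(host_ip) in network


if __name__ == "__main__":
    for name, network in find_mismatched_lifs(LIFS, EXPECTED_PREFIX):
        print(f"{name}: unexpected network {network}")

    # Hypothetical ESXi VMkernel address inside the 22-bit range.
    host = "10.10.6.50"
    for name, ip, mask in LIFS:
        print(f"{name} sees {host} as on-link: {on_link(ip, mask, host)}")
```

With this sample data, only `lif_nfs_04` is reported, and it is also the only LIF that does not consider the example host to be on its local subnet.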
I worked with my storage admin to change the Netmask on the LIFs. It was a fairly easy change going from 24-bit to 22-bit. We tested going the other way and we saw a brief disconnect.
In this image from Log Insight you can see that we were getting at least a handful of APDs per hour on the datastore that supported some Linux systems. Once we changed the netmask on the LIF to 22-bit, the APDs disappeared. The image covers two days before and two days after the change.
APDs can be a very tough issue, especially in an NFS environment. VMware has very little visibility into the issue, so they defer to the storage vendor. The storage vendor may ask for performance captures, configs and logs. Sometimes they say they see something, sometimes they don’t. Sometimes they don’t see anything until they re-run their performance captures multiple times or get different eyes to look at the problem. The root issue could be controller performance or some weird condition during controller failover. Regardless, they usually take a really long time to get to the bottom of it, if they get there at all.
Hopefully if you stumble across this page, you can check your netmasks and rule that out right away. If that solved your problem, then great! If not, then I wish you all the patience in the world.