Recovering from a VSAN Array Controller Failure

December 8, 2015

I was taking a look at the new VSAN that I built (three HP DL380 Gen9 nodes), and I noticed an error message on the host/cluster. This was after I resolved the driver issue, so I knew it was something new.

vsan-array-5

In this screenshot, you can see Virtual SAN Health errors at the cluster level.

vsan-array-6

The same errors/alarms are visible in Monitor -> Issues -> Triggered Alarms.

vsan-array-7

Under Monitor -> Virtual SAN -> Physical Disks, there is a red error icon on the host and NO disks are shown.

vsan-array-8

At the virtual disk level, every disk shows as Noncompliant and has missing components (data or witness). Since I only have three nodes, the failure of one node means that there are not enough nodes left to rebuild the components.
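To make the node arithmetic concrete, here is a minimal sketch. It assumes a mirrored storage policy that tolerates one host failure (the policy isn't stated in this post, so treat that as an assumption), where tolerating n failures needs 2n + 1 hosts: n + 1 copies of the data plus n witnesses. The function names are made up for illustration and are not part of any VMware tool.

```python
def min_hosts(failures_to_tolerate: int) -> int:
    """Hosts needed to place a mirrored object: n + 1 replicas plus n witnesses."""
    return 2 * failures_to_tolerate + 1


def can_rebuild(total_hosts: int, failed_hosts: int, failures_to_tolerate: int = 1) -> bool:
    """True if the remaining healthy hosts can still hold a fully compliant layout."""
    healthy_hosts = total_hosts - failed_hosts
    return healthy_hosts >= min_hosts(failures_to_tolerate)


# Assumed policy: tolerate one failure (hypothetical default for this sketch).
print(can_rebuild(total_hosts=3, failed_hosts=1))  # False: only 2 healthy hosts, 3 needed
print(can_rebuild(total_hosts=4, failed_hosts=1))  # True: a fourth host leaves room to rebuild
```

With three nodes there is no spare host to rebuild onto, which is why the objects stay Noncompliant until the failed node's storage comes back.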

vsan-array-9

The Health section shows failed Data Health/Object Health.

The host was still up, but all of the storage was gone. At this point I went into the iLO.

vsan-array-1

The iLO Overview page had a critical alert.

vsan-array-2

The Summary page showed a critical alert in storage.

vsan-array-3

Lastly, the HP Smart Array P440ar Controller itself showed as failed.

I was interested to see what this looked like in Log Insight as well:

vsan-array-4

I used a pre-canned query from the VSAN Content Pack for Log Insight and then went to the event trends.

There wasn’t anything specific to the hardware, but you could definitely see all of the VSAN errors.

At this point I opened a ticket with HP and collected the AHS logs. When they reviewed the logs, they told me that I didn’t have a failed controller; I had hit a known issue. Apparently an updated array controller firmware had been released after the SPP, so it was not on the SPP disk, but it WAS part of the HP Recipe. A footnote in the recipe notes that the firmware has to be downloaded separately.

I had started reviewing the recipe and was set to download the extra firmware, but I got distracted with this VSAN setup and just used the SPP as is.

If your controller has this same issue, you will see this error on boot-up:

vsan-array-10

After using the VMware online components to apply the firmware, the array controller was back in action and everything was green in the Health Checks.

This reminded me of a conversation I had on Twitter the other day.

VSAN Health Check worked great to notify me that I was using the wrong array driver (as determined from the HCL for VSAN), but there weren’t many checks for other firmware or drivers. I was trying to follow the November 2015 HP recipe guide, but I was missing the correct array firmware. What would be really awesome is if Health Check could be loaded with HP’s recipes and tell me that I was using the wrong firmware/driver and whether there were any advisories related to the firmware/driver that I was using. I haven’t used HP OneView for vCenter, but from what I can tell it displays software/firmware versions without matching them against a preferred configuration. Maybe I’m looking for some sort of mash-up of VSAN Health Check and HP OneView.
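Just to illustrate the kind of check I have in mind, here is a rough sketch of comparing what is installed against a recipe. Everything in it is hypothetical: the component names and version strings are placeholders, and it does not use any real VSAN Health Check or HP OneView API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    component: str
    installed: str
    expected: str


def check_against_recipe(installed: dict[str, str], recipe: dict[str, str]) -> list[Finding]:
    """Return one finding per component whose installed version differs from the recipe."""
    findings = []
    for component, expected in recipe.items():
        actual = installed.get(component, "not installed")
        if actual != expected:
            findings.append(Finding(component, actual, expected))
    return findings


# Placeholder data only; these are not real HP recipe entries or versions.
recipe = {
    "array-controller-firmware": "recipe-version",
    "array-controller-driver": "recipe-version",
}
installed = {
    "array-controller-firmware": "older-version",
    "array-controller-driver": "recipe-version",
}

for finding in check_against_recipe(installed, recipe):
    print(f"{finding.component}: installed {finding.installed}, recipe expects {finding.expected}")
```

In my case a check like this would have flagged the array controller firmware while leaving the driver alone, since the driver had already been fixed.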

3 thoughts on “Recovering from a VSAN Array Controller Failure”

  1. tyler anderson

    This happened to us on the 5th, although in our case HP did replace the controller. Once it was back in, we reconfigured for HBA and everything came back up.
