3-node VSAN On-Disk Upgrade Experience

March 13, 2018

This should be a fairly straightforward post.

I have a 3-node hybrid VSAN 6.2 cluster in an isolated environment, and the on-disk format needed to be upgraded to version 3. I had been putting it off, but I wanted to get it out of the way. I hadn’t done a disk upgrade in some time, so I went to the Web Client and kicked it off. It failed with this error:

A general system error occurred: Failed to evacuate data for disk uuid <XXXX> with error: Out of resources to complete the operation 

A quick search brought me to this blog, which reminded me that by default VSAN tries to maintain full redundancy during the upgrade, rebuilding the evacuated components on other nodes. A 3-node cluster can’t do this: with FTT=1, the two copies of the data and the witness component each need to sit on a separate node, so once one node is evacuated there is no spare node to rebuild on. At first this sounded odd to me, since 2-node clusters exist with the off-site virtual VSAN witness node, but that witness really only applies to the 2-node case. Also, if you start with 3 nodes, your data is dispersed across all three, so whichever node you take down, at least some VMs will have components on it, and those components can’t be rebuilt elsewhere because the cluster has no extra node to hold them.
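To put numbers on that, here is a minimal Python sketch of the host-count arithmetic. It assumes mirrored (RAID-1) objects, where tolerating n failures takes n+1 copies plus n witnesses, each on its own host, and it ignores capacity entirely. The function names are mine for illustration; they are not part of VSAN or RVC.

```python
def min_hosts_for_mirroring(ftt: int) -> int:
    """Hosts needed to place one mirrored object: (ftt + 1) copies plus ftt witnesses."""
    return 2 * ftt + 1


def can_rebuild_during_evacuation(total_hosts: int, ftt: int, evacuating: int = 1) -> bool:
    """True if the hosts left after evacuation can still hold every component.

    Simplified model (an assumption): only host count matters, not free space.
    """
    remaining = total_hosts - evacuating
    return remaining >= min_hosts_for_mirroring(ftt)


# A 3-node cluster with FTT=1 has only 2 hosts left while one is evacuated.
print(can_rebuild_during_evacuation(3, ftt=1))  # False
# A 4-node cluster still has the 3 hosts that FTT=1 needs.
print(can_rebuild_during_evacuation(4, ftt=1))  # True
```

With three nodes the answer is always False for FTT=1, which is why the upgrade has to be allowed to run with reduced redundancy.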

I digress.

The solution in my case was to log in to the vCenter server, start RVC, and run the on-disk format upgrade with the option that allows reduced redundancy.

Note: these steps are for a Windows vCenter (for now).

"C:\program files\vmware\vcenter server\rvc\rvc.bat"

Log in with the administrator@vsphere.local password.

cd down to the VSAN cluster

vsan.ondisk_upgrade --allow-reduced-redundancy .

Now this started to work all fine and dandy, just like the referenced blog post described. It evacuated the first node, recreated the disk group with the new version, and then moved to the second node. When it got to the second node, it seemed to wait. Checking the Web Client, I could see that the cluster was resyncing data back to node 1.

Now the part where I started to freak out was after I left my RDP session open and came back an hour later. My RDP session was closed, which isn’t usually a big deal. But when I logged back in, the session was totally new, with no command prompt. I wasn’t sure whether I had to keep my command prompt running for the RVC version of the disk upgrade to complete. It took me some time, but I found the right command in RVC to check whether the VSAN disk upgrade was still occurring:

vsan.upgrade_status .

This command confirmed that the upgrade was still occurring. I could also see from the Web Client resync view (and from the RVC command vsan.resync_dashboard .) that the cluster had moved on to resyncing data to the second node instead of the first, meaning the first node had finished and the second node’s disk group had been upgraded.

I checked the status the next morning, and vsan.upgrade_status reported that no upgrade was occurring. The Web Client showed that all disks were upgraded, but there were still a lot of resyncs occurring (data being copied back to the 3rd node). I checked later that day and all of the resyncs were complete!
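If you would rather not keep logging back in to check, the waiting could be scripted. Below is a rough Python sketch of a polling loop. The is_busy callable is a placeholder of my own: how you actually decide that an upgrade or resync is still running (for example by running vsan.upgrade_status and reading its output) is left to you, since I am not going to guess at the output format here.

```python
import time
from typing import Callable


def wait_until_idle(
    is_busy: Callable[[], bool],
    interval_s: float = 300.0,
    max_checks: int = 288,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call is_busy() until it returns False and return how many checks it took.

    is_busy is a hypothetical, caller-supplied check. Raises TimeoutError if
    the work is still running after max_checks attempts.
    """
    for check in range(1, max_checks + 1):
        if not is_busy():
            return check
        sleep(interval_s)  # wait before asking again
    raise TimeoutError(f"still busy after {max_checks} checks")


# Demo with canned answers instead of a real cluster: busy, busy, then idle.
fake_states = iter([True, True, False])
checks = wait_until_idle(lambda: next(fake_states), sleep=lambda _s: None)
print(checks)  # 3
```

The sleep function is injectable only so the demo finishes instantly; with the defaults it would check every five minutes for up to a day.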

So in the end, VSAN is pretty smart about doing the right thing. The only problem is that sometimes you have to kick things off from RVC instead of the Web Client.

Note: here is a good VMware doc for disk upgrades; it just didn’t have the upgrade_status command.
