I was technically “on vacation” for half of December, but I couldn’t help checking my email to see what was going on (and to avoid a backlog when I got back). I noticed an email from someone in another department whom I had advised on VMware. It said they had run into a bug with the hp-ams driver where they couldn’t power on VMs and couldn’t log on to the ESXi console. He pointed me to the following VMware KB:
I hadn’t had any issues with my servers, so I kind of brushed it off.
The following weekend there were some issues with a few VMs restarting, so I started to grab the logs. Most of the hosts were fine, but log collection failed on one. Strange. I went to enable SSH on that host and started getting odd errors. Then I started to notice a lot of vMotion tasks that were failing. After looking at the host a little longer, I realized that I had hit the same bug I was told about earlier in the week.
I was kind of panicking at this point. We have quite a few hosts that run this driver, and any number of them could be affected. VMware’s KB article said to either upgrade to 10.0.1 or uninstall the service. Since we were in the middle of a change freeze, we went with this approach:
- Disable the hp-ams service on all Gen8 hosts (this prevents a good host from turning into a bad host, but it only holds until the host is rebooted)
- I used a PowerCLI script to enable SSH and then created a batch file of plink commands to run /etc/init.d/hp-ams.sh stop
- EDIT 4-16-2015: if you also run “chkconfig hp-ams.sh off”, it will stop the service from starting on reboot
- vMotion FROM a bad host still worked (just not TO a bad host), so on an affected host: vMotion all of the VMs off, put it into maintenance mode, reboot, disable the hp-ams service, and put the host back into service
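
For reference, here is a minimal sketch of how a batch file of plink commands like the one described above could be generated. It is not the script that was actually used. It assumes SSH is already enabled on each host and that key-based login is set up; the function name `build_plink_lines`, the key file `esxi.ppk`, and the `example.local` host names are all hypothetical placeholders. The two remote commands are the ones from the list above.

```python
from typing import Iterable, List

# Remote commands taken from the steps above: stop the service now,
# then keep it from starting again after a reboot.
REMOTE_COMMANDS = ["/etc/init.d/hp-ams.sh stop", "chkconfig hp-ams.sh off"]


def build_plink_lines(hosts: Iterable[str],
                      user: str = "root",
                      key_file: str = "esxi.ppk") -> List[str]:
    """Return one plink command line per host, ready to paste into a .bat file.

    key_file is a hypothetical PuTTY private key; swap in whatever
    authentication your environment actually uses.
    """
    # Join with ';' so the chkconfig step still runs if the stop step fails
    # (for example because the service is already stopped).
    remote = "; ".join(REMOTE_COMMANDS)
    lines = []
    for host in hosts:
        host = host.strip()
        # Skip blank lines and '#' comments so a plain host list file can be fed in.
        if not host or host.startswith("#"):
            continue
        lines.append(f'plink -ssh -batch -i "{key_file}" {user}@{host} "{remote}"')
    return lines


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    for line in build_plink_lines(["esx01.example.local", "esx02.example.local"]):
        print(line)
```

Redirecting the output to a .bat file gives one line per host. Because of `-batch`, plink will abort instead of prompting if a host's SSH key has not been cached yet, so it may be worth connecting to each host once by hand first.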
There are a few hosts that are vMotion sensitive or have other issues with vMotion, so we could not touch those hosts, but I did notice that once you start vMotioning VMs off a host, SSH and some other management functions start to work. I’m thinking that as the VMs leave, enough memory is freed up to allow for new processes to be launched.
My plan of attack for the hosts that were still having issues, but that I couldn’t remediate using the above strategy, was to move some VMs off (ones that weren’t vMotion sensitive) so that SSH could be enabled, and then stop the hp-ams service. Note that you may have to vMotion off a few more VMs than SSH alone needs, so that there is enough room to launch the process that stops hp-ams.
With all of that done, I still have a choice to make with hp-ams.
1. Uninstall the hp-ams VIB
2. Upgrade hp-ams
The reason for doing #1 (uninstalling) is that a VMware TSE mentioned that he has seen issues with the service and usually recommends that people remove it if it’s not needed. I’m not sure whether we need it; we do use HP SIM and HP IRS (needed for our contract). I don’t like recommending removing the VIB, though, because we use the HP custom image, and if someone asks which image we use, it’s easier to say “HP custom image with a few patches” than “HP custom image with some pieces removed” or “VMware image plus very specific HP agents and drivers.”
That being said, I’m leaning towards #2 at this point.