This post covers VMware error 21009 (unable to install VMware Tools), which also appears as "Unable to install VMware Tools: An error occurred while trying to access image file '/usr/lib/vmware/isoimages/windows.iso' needed to install VMware tools: No such file or directory.", Updater Error 15 (cannot remediate the host), and VMware hosts reverting to their previous configurations on reboot.
I learned more about VMware's bootbank concept than I expected a couple of weeks ago. It started as a basic ticket to VMware tech support: VMware Tools could not be installed on any VM on one host (the host could not find the ISO files needed to run the install), and there was not enough room in the host's root partition to manually copy the Tools ISO files to it. That ticket was closed pretty quickly, and I escalated a new one after I figured out what was going on.
The physical VMware hosts are all UCS B200 M3 blades. All of the hosts boot from a remote 10 GB LUN over fiber. By default they create a 4 GB VMFS root partition on the assigned boot LUN, which is plenty of room for everything a host needs, including the VM Tools ISO files. I expect the host to boot from that VMFS partition every time, since it is presented as LUN 0 in the storage array's mapping configuration.
Each host also creates a root partition in local volatile RAM every time it boots. This is called the Ramdisk. This disk stores configuration data, drivers, and other essential information the host will need should the VMFS volume be unavailable.
That isn't to say the host can boot from the Ramdisk in this scenario; it can't. The system files are still on the VMFS volume.
The Ramdisk as shown in a PuTTY session using the "vdf -h" command:
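If you want to pull up that view yourself, this is all it takes from an SSH or PuTTY session; the ramdisk names in the comment are the typical ones, but sizes vary from host to host:

    # List the host's ramdisks and how much of each is in use
    vdf -h
    # Expect entries such as root, etc, tmp and hostdstats under the Ramdisk section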
The bootbank and altbootbank hold two sets of configuration files: the one currently in use and the one being changed. If you make an update or another change to the host that you don't want committed to the running (current) config, the boot process can be interrupted to point the host at the prior config set and boot from it. This is similar to Cisco's running-config versus startup-config concept.
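A quick way to see the two config sets side by side: each bank carries its own boot.cfg with a boot counter, and the bank with the higher counter is the newer set (at least that is how it looks on my 6.0 hosts):

    # Compare the boot counters in the two banks; the higher number is the newer config set
    grep updated /bootbank/boot.cfg /altbootbank/boot.cfg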
The bootbanks are supposed to be on the VMFS boot volume (PuTTY session using the "ls -l" command):
The VMFS path isn't easy to pick out there, but the point is that the bootbank shortcuts in root point to the VMFS partition on the storage array, which is exactly what they are supposed to do. In this state, all updates and changes are written to the altbootbank and are committed on the next reboot.
Changes will not be committed if the bootbank shortcuts point not to the VMFS volume but to the Ramdisk (in volatile RAM on the VMware host):
Here the bootbank shortcut points to the /tmp folder on the Ramdisk. In this case, every update and change made to the host will vanish when the host reboots, because it is all stored in volatile RAM.
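The fastest way I know to tell which state a host is in is to check the shortcut targets directly; the UUIDs below are placeholders, not real datastores:

    # Healthy: both links resolve to directories on the VMFS boot volume
    readlink /bootbank       # e.g. /vmfs/volumes/<boot-volume-UUID>
    readlink /altbootbank    # e.g. /vmfs/volumes/<another-UUID>

    # Broken: the links fall back to the Ramdisk, typically somewhere under /tmp
    readlink /bootbank       # e.g. /tmp/<generated-directory>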
The bootbank and other folder shortcuts get changed by ESXi's kernel because of FLOGI (Fabric Login) delay. The host's fNICs (Fibre Channel NICs) have to log in to the SAN to gain connectivity, and the VMware kernel allows only three seconds for that to happen. If that threshold is passed, the kernel chooses the Ramdisk as the safer place to store the bootbanks.
Alright, so that explains why hosts have their configuration reverted upon reboot. What about VM Tools installation failure and the inability to stage patches to a host?
After the FLOGI delay, the root-folder shortcuts to the product locker and the other VM Tools locations won't work either, because their targets are on the VMFS volume, which the Ramdisk cannot reach. When a VM is told to update its Tools, the host cannot find the folder or the ISO files and returns error 21009 (cannot access the needed ISO files). The same goes for Update Manager: it cannot stage patches to the host because the folders it copies to are now on the Ramdisk, which has much less room than the VMFS partition. A WinSCP connection will show several broken shortcuts.
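You can follow the same shortcut chain the host uses to find the Tools ISOs; on my hosts it runs through /productLocker, though the exact targets vary by build:

    # The path from the 21009 error is itself a shortcut, not a real directory
    ls -l /usr/lib/vmware/isoimages
    # On a healthy host it resolves through /productLocker to a folder on the VMFS volume
    ls -l /productLocker
    # After the FLOGI fallback these show up as dangling links and the ISOs are unreachable
    ls /usr/lib/vmware/isoimages/ 2>/dev/null || echo "Tools ISOs not reachable"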
There are two ways to figure out whether a VMware host has its bootbanks in the wrong place. Well, maybe three:
1. A VM Tools install fails and the ISOs are not accessible
2. A host comes back from a reboot with an old configuration
3. An SSH session and ls -l show the bootbank shortcut pointing at the wrong target
There shouldn't be any errors or exceeded thresholds in the SAN fabric itself. This is strictly a choice the VMware kernel makes based on its assumptions about the SAN.
The Fix:
1. Change the bootbank shortcuts back to the VMFS volume. Open an SSH session to the host and run the following command (no quotes; note the double dashes before plugin-dir and bootbanks, and see the copy-and-paste sketch after this list): "localcli --plugin-dir=/usr/lib/vmware/esxcli/int/ boot system restore --bootbanks". If this fails, reboot the host and try step 3 with a different number of seconds.
2. Install the VMware ESXi 6.0 patch bundle ESXi-6.0.0-20161104001-standard. It can be downloaded from VMware after some searching. The bundle contains the following patches:
The file can be uploaded into the Update Manager and assigned to a baseline for easier host scanning and remediation.
3. After the patch bundle is installed, reboot the host. During the ESXi boot process, press Shift+O and append the parameter "devListStabilityCount=number-of-seconds". VMware recommended 10, but I needed 30-60 before the kernel would keep the bootbanks on the VMFS volumes. This parameter gives the kernel that many extra seconds before it verifies FLOGI success and decides where the bootbank shortcuts will point.
4. After a successful patch and reboot, change the /bootbank/boot.cfg file. Append the same parameter you used during the boot process, devListStabilityCount=20 (or whatever number of seconds works for you; no quotes in the config file). See the sketch after this list for where the parameter lands.
5. Verify the changes by rebooting the host again without intervening in the boot process, then check the bootbank shortcut targets.
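For steps 1 and 4 above, here is the whole thing in copy-and-paste form. The restore command is the same one from step 1; the boot.cfg excerpt assumes the parameter goes on the kernelopt line (that is where it ended up on my hosts), and the existing option shown there is just an example, so keep whatever your host already has:

    # Step 1: re-point the bootbank shortcuts at the VMFS boot volume, then verify
    localcli --plugin-dir=/usr/lib/vmware/esxcli/int/ boot system restore --bootbanks
    readlink /bootbank       # should now show /vmfs/volumes/..., not /tmp

    # Step 4: make the boot-time parameter permanent by appending it to the
    # kernelopt line in /bootbank/boot.cfg (same parameter you typed at Shift+O)
    #   before: kernelopt=no-auto-partition
    #   after:  kernelopt=no-auto-partition devListStabilityCount=20
    vi /bootbank/boot.cfg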
This may just be coincidence, but of all the VMware vDCs (3), clusters (6), and hosts (36) I have, the only ones with this problem were using old-school RDM disk connections over fiber. The problem hosts also suffer from the 3-5 minute per-RDM-mapping boot delay caused by the satp_alua service. I can't help but speculate that these are connected somehow, but VMware won't validate me or my ideas (and I'm a little more insecure because of that; kidding.)
I spent a few days struggling with a similar problem re: FC login on a Nimble storage array. VMware advised me to make the same devListStabilityCount=10 change as you did, but it wasn't working for me. I wanted to give everyone a heads-up that this fix did *NOT* make it into 6.5 GA (which came out a few days AFTER 6.0 Update 2 Patch 4), which is why it works under this 6.0 build but NOT under 6.5. There is no current fix for 6.5. I'm currently in the process of reinstalling a dozen newly built hosts (ugh). Good luck!!!
Thanks for the info, Nathaniel.