So far, this vFRC feature is working great. I’ve got vmx-10 VMs in a VDI Pool performing at above expected speeds. With vFRC providing another layer of flash storage (the VDI Pool already lives on an all-flash storage array from Pure Storage) and a 10 Gb networking backbone, we have set up an environment that gives us the best chance to succeed as we roll out our VDI initiative across the country. The VDI machines are persistent full clones, and CBRC (Content-Based Read Cache) is enabled on all the hosts in the cluster.
As mentioned in a previous post, we added local SSD storage to the Cisco UCS blades that form the HA cluster housing the VDI VMs. We don’t have any spinning disks to accompany them, so we did not use VMware’s new VSAN technology. The best alternative was to use the local storage for vFRC and host caching.
While the VMs were referencing the cache and saving millions of I/O hits to the Pure Storage SAN, vCenter was unable to keep up with only 3 VMs assigned to the vFRC feature. vCenter was crippled by high CPU and RAM usage, as mentioned in the KB. The web GUI didn’t reflect any updates made from the fat vSphere client and was, overall, useless. I made a call into VMware and opened a ticket. After a six-hour phone call and no resolution, I ended up going with my Plan B and creating a pool of vmx-8 VMs for the VDI mini-rollout to a test group due the next morning.
Most of the troubleshooting done was database-related. According to VMware, any stale record in the DB could cause the web client to freak out and see no changes, or only part of the inventory. Awesome. The vSphere fat client was working fine, so this particular theory doesn’t apply to it. So, why is VMware moving away so rapidly from the fat client if web-based administration is prone to this problem?
After the database changes, which touched data in a couple of tables, the web GUI was still busted.
Here’s the thing that really threw me, though: VMware was unable to turn vFRC off. It was as if they were battling the WOPR from WarGames. vFRC is enabled by a couple of checkboxes, one for the host and one for each VM. Every time the technicians unchecked the caching feature for the host and rebooted it, vFRC would be enabled again. The same thing happened when they disabled vFRC on the individual VMs.
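As an aside, you can at least see the host-side state directly from the ESXi shell, outside the web GUI entirely, using the `esxcli storage vflash` namespace that was added in 5.5. A rough sketch (treat the exact sub-commands and the cache-name placeholder as assumptions and check `esxcli storage vflash --help` on your build):

```shell
# List the local SSDs backing the host's virtual flash resource
esxcli storage vflash device list

# List the vFRC caches currently allocated on this host --
# handy for confirming whether a "disabled" VM still has a cache
esxcli storage vflash cache list

# Statistics for one cache (name taken from the list above)
esxcli storage vflash cache stats get -c <cache-name>
```

If a VM you unchecked still shows a cache in the second command after a reboot, the host agrees with what I was seeing: the GUI change never actually landed.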
After watching this for a few minutes, it made some sort of sense:
1. The web GUI is not working properly
2. VMware has designed a system in which these new features can only be controlled through the web GUI
3. Any change that was made to vFRC in the web GUI didn’t seem to work
4. See #1
So, yeah. That’s not great. My virtual vCenter is also vmx-10, so I guess I can’t control that either through the great new web interface. I CAN control it through VMware’s Workstation 10, with the exception of the vFRC features. There is a vmx file entry that references vFRC (look for vFlash in the vmx), which can be edited to forcibly turn it off. I checked one of the VMs that, during troubleshooting, had vFRC turned off. It was on again today and reading from vFRC at the same rate.
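For reference, the vFlash entries are per-virtual-disk keys in the VM’s .vmx file. The fragment below is only an illustrative sketch: `scsi0:0` is an example disk, and the exact key names and values vary by build, so grep your own vmx for `vFlash` rather than trusting these verbatim:

```
# Hypothetical vmx fragment -- confirm key names in your own .vmx
sched.scsi0:0.vFlash.enabled = "TRUE"          # flip to "FALSE" to force vFRC off for this disk
sched.scsi0:0.vFlash.reservationMB = "4096"    # cache space reserved for the disk
```

Edit with the VM powered off, or the change can be overwritten when the VM’s configuration is re-saved.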
I’m starting to wonder if VMware is rushing these new features into production to compete instantly with all the other vendors that offer flash caching on virtualization hosts. I really hope not, although after this, I’ll need to take more burn-in time on these so-called “features”.
I may update this post after the VMware ticket closes.
This ticket got escalated up the chain after a phone call with my sales representative. During the troubleshooting it was mentioned that waiting for an update may be the only answer IF we couldn’t schedule downtime to move all the ESXi hosts out and back into the cluster. The engineer also gave me some SQL queries to run against the inventory DB and mentioned a reset of the database may need to be done. It’s becoming clear that the VMware support team is aware of the bug and is experimenting with workaround options. It was confirmed during the call that vFRC is the cause, so there’s that.
Finally a breakthrough. The bullet-points:
1. Database queries to change some tables ultimately fixed the web interface problem. I’m not sure if it would fix the CPU-RAM issue, as I haven’t enabled any other VMs to use vFRC. The queries may be specific to the problem, so they aren’t included here. They were done by VMware Support.
2. The new Host Cache is configured on 2 of the 4 hosts. I haven’t made any changes there either. I’m going to ride this out for a while as we are on-boarding more VDI clients.
3. There are occasional Error 1009 pop-ups on the web interface. This is yet another known issue with vCenter 5.5, related to the way vApp links are built in vCOPS and Workspace: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2061667