VMware’s Virtual Flash Read Cache May Crush vCenter

References:
VMware vCenter Server consumes most of the CPU, memory, or disk I/O after enabling Virtual Flash (2072392)
Disabling vSphere Flash Read Cache caching in a virtual machine (2057840)

(Sarcasm) vFRC has made the VDI environment so much better.

So far, this vFRC feature is working great. I’ve got vmx-10 VMs in a VDI Pool performing at above expected speeds. With the vFRC providing another layer of flash storage (the VDI Pool already lives on an all-flash storage array from Pure Storage), and a 10 Gb networking backbone, we have set up an environment that gives us the best chance to succeed as we roll out our VDI initiative across the country. The VDI machines are persistent full clones and CBRC (Content-Based Read Cache) is enabled on all the hosts in the cluster.

As mentioned in a previous post, we added local SSD storage to the Cisco UCS blades that form the HA cluster housing the VDI VMs. We don’t have any spinning disks to accompany them, so we did not use VMware’s new VSAN technology. The best alternative was to use the local storage for vFRC and host caching.

While the VMs were referencing the cache and saving millions of I/O hits to the Pure Storage SAN, vCenter was unable to keep up with only 3 VMs assigned to the vFRC feature. vCenter was crippled by high CPU and RAM usage, as mentioned in the KB. The web GUI didn’t see any updates made from the fat vSphere client and, overall, was useless. I made a call into VMware and opened a ticket. After a six-hour phone call and no resolution, I ended up going with my Plan B: creating a pool of vmx-8 VMs for the VDI mini-rollout to a test group due the next morning.

Most of the troubleshooting done was database related. According to VMware, any stale record in the DB could cause the web client to freak out and see no changes, or just some of the inventory. Awesome. The vSphere fat client was working fine, so this particular theory doesn’t apply to it. So, why is VMware moving away so rapidly from the fat client if the web-based administration is prone to this problem?

After the database changes, which consisted of editing data in a couple of select tables, the web GUI was still busted.

Here’s the thing, though, that really threw me: VMware was unable to turn vFRC off. It was as if they were battling the WOPR from WarGames. vFRC is enabled by a couple of checkboxes, one for the host and one for each VM. Every time the technicians would uncheck the caching feature for the host and reboot it, vFRC would be enabled again. The same thing would happen when they disabled vFRC on the individual VMs.

After watching this for a few minutes, it made some sort of sense:
1. The web GUI is not working properly
2. VMware has designed a system in which these new features can only be controlled through the web GUI
3. Any change that was made to vFRC in the web GUI didn’t seem to work
4. See #1

So, yeah. That’s not great. My virtual vCenter is also vmx-10, so I guess I can’t control that either through the great new web interface. I CAN control it through VMware’s Workstation 10, with the exception of the vFRC features. There is a vmx file entry that references vFRC (look for vFlash in the vmx) which can be edited to forcibly turn it off. I checked one of the VMs that, during troubleshooting, had vFRC turned off. It was on again today and reading from vFRC at the same rate.
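As a rough aid for that kind of check, here is a minimal Python sketch that lists every entry in a .vmx file whose key mentions "vFlash". It assumes the usual `key = "value"` layout of a .vmx file. The function name `find_vflash_entries` and the sample key `exampleVFlashSetting` are made up for illustration; they are not documented VMware names, so check the real keys in your own file before editing anything.

```python
import re

# Matches a .vmx line of the form: key = "value"
VMX_LINE = re.compile(r'^\s*([^#\s=]+)\s*=\s*"(.*)"\s*$')


def find_vflash_entries(vmx_text):
    """Return (key, value) pairs whose key mentions 'vflash', ignoring case."""
    entries = []
    for line in vmx_text.splitlines():
        match = VMX_LINE.match(line)
        if match and "vflash" in match.group(1).lower():
            entries.append((match.group(1), match.group(2)))
    return entries


# Hypothetical sample. The vFlash key below is a placeholder,
# not a documented VMware setting name.
SAMPLE_VMX = '''\
virtualHW.version = "10"
displayName = "vdi-test-01"
scsi0:0.exampleVFlashSetting = "TRUE"
scsi0:0.fileName = "vdi-test-01.vmdk"
'''

if __name__ == "__main__":
    for key, value in find_vflash_entries(SAMPLE_VMX):
        print(f"{key} = {value}")
```

To use it on a real VM, read the .vmx file into a string and pass it to `find_vflash_entries`. The script only reports entries and changes nothing, which makes it easy to run again later to see whether a setting has flipped back.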

I’m starting to wonder if VMware is rushing these new features into production to compete instantly with all the other vendors that offer flash caching on virtualization hosts. I really hope not, although after this, I’ll need to take more burn-in time on these so-called “features”.

I may update this post after the VMware ticket closes.

UPDATE: 3-31-2014

This ticket got escalated up the chain after a phone call with my sales representative. During the troubleshooting it was mentioned that waiting for an update may be the only answer IF we couldn’t schedule downtime to move all the ESXi hosts out and back into the cluster. The engineer also gave me some SQL queries to run against the inventory DB and mentioned a reset of the database may need to be done. It’s becoming clear that the VMware support team is aware of the bug and is experimenting with workaround options. It was confirmed during the call that vFRC is the cause, so there’s that.

UPDATE: 4-2-2014

Finally, a breakthrough. The bullet points:

1. Database queries to change some tables ultimately fixed the web interface problem. I’m not sure if they would fix the CPU and RAM issue, as I haven’t enabled vFRC on any other VMs. The queries may be specific to the problem, so they aren’t included here. They were done by VMware Support.

2. The new Host Cache is configured on 2 of the 4 hosts. I haven’t made any changes there either. I’m going to ride this out for a while as we are on-boarding more VDI clients.

3. There are occasional Error 1009 pop-ups on the web interface. This is yet another known issue with vCenter 5.5 and the way vApp links are built in vCOPS and Workspace:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2061667

 

 

 


Categories: VMware

16 replies

  1. How did your support call end up? Did they have a solution for you?

  2. Any updates?

    • Yes. I contacted VMware support recently about this problem because I want to upgrade our other production cluster from 5.1. I was told that the problem still exists and has not been fixed. To avoid the inventory database corruption that could occur, I was told to upgrade as usual, but NOT TO ENABLE VFRC. So, yeah, don’t check the check box and it should be ok.

      I will be upgrading in the next 2 weeks and will update this entry.

  3. I experienced the exact same problems with 5.5 U1a and when I disabled vFRC – they stopped.

    It’s very disappointing to know that a feature that could be very beneficial like vFRC is plagued with issues that make it unusable.

    My recommendation to everyone – wait. Disabling vFRC is a very painful operation.

  4. Another disappointing thing with vFRC is that when you run a certain workload on a VM, it will PSOD the ESXi host every time. Using the iometer benchmark tests that were used in the “unofficial storage performance thread” on the VMware forums, the test RealLife-60%Rand-65%Read will just kill the host. I have a support call with VMware on it; they identified the issue and are working on a patch.

    • Thanks for the post. I hope I get a chance to ask the experts at VMworld in SF this year about vFRC. After looking through the agenda, there’s a spotlight on VSAN, no mention of vFRC. I’m betting that vFRC will be discontinued as a “feature” relatively quickly as VSAN matures. I’m not planning on using either for the foreseeable future.

    • Same problem here 😦 Installed HP Image 5.5u1 on a HP BL460c G7 on local SD-Card and enabled vFRC on a HP 400GB Enterprise Performance SSD. ESX was updated through Update Manager. Was working well…until I started to test with iometer (full test over night). PSOD when I looked back on the iLO console on the next morning. 😦 Do you use iSCSI or FC?

  5. Hello:

    So, ESXi 5.5 Update 2 and vCenter Server Update 2 were released on September 9, and I still see no fix for the high CPU issue. Does anyone have any further information or updates from VMware support? It’s kind of nuts this isn’t fixed yet.

    • Never mind. This is indeed supposedly fixed in Update 2!

    • ESXi 5.5 update 2 did not fix the issue. VMware just sent me the patch to test. They have been working on a fix for a while now. I will update in a few weeks when I get a chance to test.

      • Hi – Any idea why the VMware KB article 2072392 specifically notes “This issue is resolved in vCenter Server 5.5 Update 2” still? Is there a different issue now?

      • My bad. The vCenter CPU issue was fixed. The PSOD memory leak in vFRC was not fixed in Update 2. That’s what I got a patch for. It’s the issue I reference above.

        I’ve got a one-track mind.

  6. Is it fixed in vSphere 6?
