We recently upgraded and patched one of our VMware ESXi 4.1 implementations. After doing so, we found that the server performance was terrible. Not only were virtual machines running poorly on it, but we also found gaps in the performance graphs in the vSphere Client and long waits trying to get any information about the server from the vSphere Client in general. Even after shutting down both virtual guests, the performance was still poor.

After some investigation, it started to become clear that ESXi was busy reading like crazy from both our iSCSI- and NFS-based datastores. Using the ESXi SSH console, we looked at the logs and saw the following activity in 2 places:

/var/log/messages

The same basic 4 lines kept showing up over and over and over again:

  • vmkernel: 0:03:09:49.164 cpu1:14895)FS3: 3001: [Req mode: 1] Checking liveness of [type 10c00001 offset 21653504 v 6375, hb offset 4141056
  • Jul 30 18:39:03 vmkernel: gen 37, mode 1, owner 4fd22fec-49735477-9cc8-001143d281b5 mtime 2286586] on vol 'VMStore'
  • Jul 30 18:39:07 vmkernel: 0:03:09:53.165 cpu3:14895)FS3: 3099: [Req mode: 1] Lock [type 10c00001 offset 21653504 v 6375, hb offset 4141056
  • Jul 30 18:39:07 vmkernel: gen 37, mode 1, owner 4fd22fec-49735477-9cc8-001143d281b5 mtime 2286586] is not free on vol 'VMStore'

In our case, the VMStore is an iSCSI mount on a NexentaStor system.
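
As a side note, the owner value in those lock messages ends in the MAC address of the host holding the lock, and the lock details for a given file can be dumped by hand. Here is a rough sketch of what that looks like from the ESXi SSH console (the path is a made-up placeholder, not one of our real VMDKs; if memory serves, on ESXi 4.x the dump lands in the vmkernel log rather than on the terminal):

  # Dump the VMFS metadata, including the lock mode and owner, for a file.
  # The path below is only a placeholder for one of the affected VMDKs.
  vmkfstools -D "/vmfs/volumes/VMStore/SomeVM/SomeVM-flat.vmdk"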

/var/log/vmware/hostd.log

In this log, we found lines like these 2 repeated over and over again for every single VMDK file that appeared on both our iSCSI and NFS datastores:

  • [2014-07-30 19:39:43.746 FFC10F40 info 'DiskLib'] DISKLIB-VMFS  : "/vmfs/volumes/4fda6859-b283e8a1-9c83-001d09203a0c/OldVMs/Wordpress Sandbox Server/Wordpress Sandbox Server-000001-delta.vmdk" : open successful (21) size = 34359738368, hd = 0. Type 8
  • [2014-07-30 19:39:43.746 FFC10F40 info 'DiskLib'] DISKLIB-VMFS  : "/vmfs/volumes/4fda6859-b283e8a1-9c83-001d09203a0c/OldVMs/Wordpress Sandbox Server/Wordpress Sandbox Server-000001-delta.vmdk" : closed.

This was appearing for both our iSCSI and NFS mounts on NexentaStor.
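
For anyone chasing the same thing, this is roughly how we kept an eye on both logs from the ESXi SSH console; it's nothing fancier than tail and grep, with the patterns taken from the excerpts above:

  # Watch the vmkernel lock/liveness chatter as it happens
  tail -f /var/log/messages | grep "is not free"

  # Watch hostd opening and closing every VMDK on the datastores
  tail -f /var/log/vmware/hostd.log | grep "DISKLIB-VMFS"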

Other Symptoms

While the hostd.log was spewing all of this stuff by (at least 75-100 log lines per second), we also found that the system would not allow us to unmount the NFS datastore, even though none of the VMs defined on the system were using it. Not even with the system in Maintenance Mode!
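
For reference, the console-side equivalent of that unmount attempt on ESXi 4.x looks roughly like this (a sketch; 'NFSBackup' stands in for whatever the NFS datastore is actually named):

  # List the NFS datastores currently mounted on this host
  esxcfg-nas -l

  # Try to remove (unmount) the NFS datastore by its label.
  # 'NFSBackup' is a placeholder for the real datastore name.
  esxcfg-nas -d NFSBackup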

We also looked at MRTG graphs from our network and saw that, since the upgrade, there had been 10-12 Mbps of solid network traffic going to our NFS-based backup datastore. This was happening even though nothing was set up to actually use it during that time.
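
You don't need MRTG to see this, either; the sustained reads can also be confirmed from the host itself with esxtop. A quick sketch of the views we find useful (nothing here is specific to our environment):

  # Interactive esxtop from the ESXi console:
  #   press 'u' for the disk device view (per-device reads/writes and latency)
  #   press 'n' for the network view (per-vmnic and per-VM traffic)
  esxtop

  # Or grab a handful of samples in batch mode to look at later
  esxtop -b -n 5 > /tmp/esxtop-sample.csv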

The Fix

After hours of digging, searching, rebooting, putting the server in maintenance mode and beating our heads against several walls, we came across a post online talking tangentially about how a Veeam system caused some similar problems. Well, we run Veeam Monitor v5 to watch all of our servers, so I decided to turn off the Veeam collection server temporarily to see if it had any effect.

The result? Within 1 minute of shutting off the Veeam collector, all of the log activity had stopped and the ESXi server/vSphere Client became incredibly responsive. I/O wait times on the iSCSI-based VM went down dramatically.

To confirm that this wasn't some crazy coincidence, I restarted the Veeam Monitor. Within a minute, the log entries started rushing by again at top speed.

The only way to stop this from happening while keeping Veeam as part of the equation was to REMOVE the ESXi host from the Veeam Monitor, losing all of the historical data as part of the process. I then added the ESXi host back into Veeam.

The logs started rushing by again and I thought the problem was still there. However, Veeam only went through and investigated each VMDK file once. Once it had done that (it took about 1 minute in our environment), it stopped. After that, everything was happy and the log volume was normal with Veeam monitoring turned ON.

We hope this serves as a help to others who encounter the same situation, as well as a reminder to ourselves of the cause of this problem, since we have other servers to update in the future.