vSAN hosts not contributing stats reports – vSAN Health check fails

In this blog post we will talk about one of the most commonly failing vSAN Health checks of the performance service: All hosts contributing stats.

First, let's discuss the purpose of this health check. It verifies that all hosts which are currently part of the same network partition are contributing statistics to the collection. Any host that is not in the same network partition is not checked; use the general network health checks to assess that aspect of overall cluster health. In other words, this check verifies that the vSAN performance service can communicate with each host and retrieve statistics. If this check is Yellow, some hosts are not contributing performance statistics.

Symptoms: You see a Warning in Cluster ⇒ Monitor ⇒ Virtual SAN ⇒ Performance service ⇒ All hosts contributing stats.

 

As shown in the screenshot above, this test lists the hosts which are not contributing performance stats. I have blurred that part, but you will see the affected hosts in the details tab.

In this post I will walk you through the ways of resolving this issue. These steps may or may not resolve your issue, because this test can fail for many reasons.

I have combined all the troubleshooting steps below into a single workflow of things to check.

  • Identify the cmmds master and stats master in the vSAN cluster

You can identify the cmmds master and the stats master by running the command “vsan.perf.cluster_info” in RVC. Another way to find the stats master is to run the script “python /usr/lib/vmware/vsan/perfsvc/vsan-perfsvc-status.pyc svc_info” on each host and grep for “isStatsMaster = true”.

--------Perf Service Node Information--------
(vim.cluster.VsanPerfNodeInformation) {
dynamicType = <unset>,
dynamicProperty = (vmodl.DynamicProperty) [],
version = '6.5.0',
hostname = <unset>,
error = <unset>,
isCmmdsMaster = true,
isStatsMaster = true,
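
If you have many hosts, reading each svc_info output by eye gets tedious. Below is a minimal Python sketch that picks out the two master flags from saved outputs. It assumes you have already collected the svc_info text from every host yourself; the host names and the helper names (parse_flags, find_masters) are made up for illustration and are not part of any VMware tool.

```python
import re

# Matches lines such as "isStatsMaster = true," in the svc_info output.
FLAG_RE = re.compile(
    r"^\s*(isCmmdsMaster|isStatsMaster)\s*=\s*(true|false)\b",
    re.MULTILINE,
)


def parse_flags(svc_info_text):
    """Return the master flags found in one host's svc_info output as booleans."""
    return {name: value == "true" for name, value in FLAG_RE.findall(svc_info_text)}


def find_masters(outputs_by_host):
    """Map each flag name to the sorted list of hosts that report it as true."""
    masters = {"isCmmdsMaster": [], "isStatsMaster": []}
    for host, text in outputs_by_host.items():
        for name, is_set in parse_flags(text).items():
            if is_set:
                masters[name].append(host)
    return {name: sorted(hosts) for name, hosts in masters.items()}


if __name__ == "__main__":
    # Hypothetical host names; the text mimics the output shown above.
    collected = {
        "esx-a": "isCmmdsMaster = true,\nisStatsMaster = true,\n",
        "esx-b": "isCmmdsMaster = false,\nisStatsMaster = false,\n",
    }
    print(find_masters(collected))
```

In a healthy cluster you would expect exactly one host in the isStatsMaster list; note it down, because the later steps compare it against an affected agent host.
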
  • Restart the vsan-health service on vCenter by running “/usr/lib/vmware-vmon/vmon-cli -r vsan-health”, and restart vsanmgmt on all the hosts by running “/etc/init.d/vsanmgmtd restart”. Then retest the health check.
root@vcappliance [ ~ ]# /usr/lib/vmware-vmon/vmon-cli -r vsan-health
Completed Restart service request.


[root@blr1:~] /etc/init.d/vsanmgmtd restart
watchdog-vsanperfsvc: Terminating watchdog process with PID 2098519
vsanperfsvc started

Note: Restarting these services does not impact the production environment.

  • Select vSAN Cluster ⇒ Configure ⇒ Health & Performance ⇒ Edit ⇒ uncheck the performance service. This deletes the .vsan.stats/ object from the vsandatastore. Then restart the vsanmgmt service on all the hosts. Follow the same procedure to turn the performance service on again, which recreates the .vsan.stats/ object in the vsandatastore. Retest vSAN health and check. (Be aware that this deletes all your historical performance stats.)

  • If the above steps do not fix the issue, pick one affected host and troubleshoot on that host only. The affected host is an agent host, and we already know which host is the stats master. Let's find the cause of the communication issue between them, because the stats master collects the stats from all the agent hosts and provides them to vCenter. That is also why you will never see the stats master host in the affected host list.
  • Enable SSH and open a PuTTY session to both the stats master and the agent host ⇒ cd /etc/vmware/vsan/ ⇒ vi vsanperf.conf ⇒ set the entries “loglevel = debug” and “logrotate = 10” ⇒ run /etc/init.d/vsanmgmtd restart on both hosts.
  • Check the vsanmgmt logs on the stats master host by grepping for “RetrieveRemoteStats”.
2018-09-12T01:44:38Z VSANMGMTSVC: DEBUG vsanperfsvc[Collector-2] [statscollector::RetrieveRemoteStats] Traceback (most recent call last): File "/build/mts/release/bora-9183449/bora/build/vsan/release/vsanhealth/usr/lib/vmware/vsan/perfsvc/statscollector.py", line 676, in RetrieveRemoteStats File "/build/mts/release/bora-9298722/bora/build/esx/release/vmvisor/sys/lib64/python3.5/site-packages/pyVmomi/VmomiSupport.py", line 557, in <lambda> File "/build/mts/release/bora-9298722/bora/build/esx/release/vmvisor/sys/lib64/python3.5/site-packages/pyVmomi/VmomiSupport.py", line 363, in _InvokeMethod File "/build/mts/release/bora-9298722/bora/build/esx/release/vmvisor/sys/lib64/python3.5/site-packages/pyVmomi/SoapAdapter.py", line 1338, in InvokeMethod pyVmomi.VmomiSupport.vim.fault.NotAuthenticated: (vim.fault.NotAuthenticated) { dynamicType = <unset>, dynamicProperty = (vmodl.DynamicProperty) [],
2018-09-12T01:44:38Z VSANMGMTSVC: WARNING vsanperfsvc[Collector-0] [statscollector::RetrieveRemoteStats] Error happened during RetrieveRemoteStats of host 10.10.10.10, type: <class 'pyVmomi.VmomiSupport.vim.fault.NotAuthenticated'>, message: (vim.fault.NotAuthenticated) { dynamicType = <unset>, dynamicProperty = (vmodl.DynamicProperty) [], msg = '', faultCause = <unset>, faultMessage = (vmodl.LocalizableMessage) [], object = 'vim.cluster.VsanInternalStatsProvider:vsan-internal-statsprovider', privilegeId = 'none' }
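
With debug logging on, the log grows quickly, so it can help to boil the WARNING lines down to which host failed and with which fault. Here is a rough Python sketch of that idea. The regular expression is based only on the WARNING line shown above, so treat the exact wording as an assumption that may differ between builds; summarize_failures is a made-up helper name.

```python
import re
from collections import Counter

# Pattern taken from the WARNING line above; other builds may word it differently.
ERROR_RE = re.compile(
    r"Error happened during RetrieveRemoteStats of host (?P<host>[^,\s]+), "
    r"type: <class '(?P<fault>[^']+)'>"
)


def summarize_failures(log_lines):
    """Count (host, fault name) pairs across RetrieveRemoteStats error lines."""
    counts = Counter()
    for line in log_lines:
        match = ERROR_RE.search(line)
        if match:
            # Keep only the last component, e.g. "NotAuthenticated".
            fault = match.group("fault").rsplit(".", 1)[-1]
            counts[(match.group("host"), fault)] += 1
    return counts


if __name__ == "__main__":
    # Shortened copy of the WARNING line from this post.
    sample = (
        "2018-09-12T01:44:38Z VSANMGMTSVC: WARNING vsanperfsvc[Collector-0] "
        "[statscollector::RetrieveRemoteStats] Error happened during "
        "RetrieveRemoteStats of host 10.10.10.10, type: "
        "<class 'pyVmomi.VmomiSupport.vim.fault.NotAuthenticated'>, message: ..."
    )
    for (host, fault), count in summarize_failures([sample]).items():
        print(host, fault, count)
```

You could feed it the lines of a log file you copied off the stats master. A host that shows up repeatedly with the same fault is a good candidate for the one host to focus on.
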
  • These logs tell you why the stats master host is not able to retrieve stats from its agents. Here, the reason given is “vim.fault.NotAuthenticated”. This shifts the troubleshooting focus to authentication, which is where certificates come into play, so there may be a certificate-related issue with the hosts. Let's check the certificate files rui.crt & castore.pem on the affected agent host. Make sure that the castore.pem file is the same on all the hosts in the cluster and issued by the same CA. If there is any difference, renew the host certificates. In vSAN 6.7 storage providers are internally managed, so you need to check the castore file for the issuer.
[root@ESXI:/etc/vmware/ssl] openssl x509 -in /etc/vmware/ssl/castore.pem -inform pem -noout -text
Certificate:
Data:
Version: 3 (0x2)
Serial Number:
xx:xx:xx:xx:xx:xx:xx:xx
Signature Algorithm: sha256WithRSAEncryption
Issuer: CN=CA, DC=vsphere, DC=local, C=US, ST=California, O=vm-psc.vhabit.com, OU=VMware Engineering
Validity
Not Before: Aug 20 03:32:56 2018 GMT
Not After : Aug 17 03:32:56 2028 GMT
Subject: CN=CA, DC=vsphere, DC=local, C=US, ST=California, O=vm-psc.vhabit.com, OU=VMware Engineering
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
Public-Key: (2048 bit)
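
Comparing castore.pem by eye across a whole cluster is error-prone, so here is a small Python sketch that groups hosts by a hash of the file contents. It assumes you have already copied each host's /etc/vmware/ssl/castore.pem somewhere and loaded the text into a dictionary keyed by host name; the host names and function names are hypothetical. It only tells you whether the files are textually identical and does not parse the issuer, so still check the issuer with openssl as shown above.

```python
import hashlib
from collections import defaultdict


def pem_digest(pem_text):
    """SHA-256 of the PEM text, ignoring line-ending and trailing-space differences."""
    normalized = "\n".join(line.strip() for line in pem_text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_hosts_by_castore(castore_by_host):
    """Return lists of hosts that share an identical castore.pem."""
    groups = defaultdict(list)
    for host, pem_text in castore_by_host.items():
        groups[pem_digest(pem_text)].append(host)
    return [sorted(hosts) for hosts in groups.values()]


def odd_hosts_out(castore_by_host):
    """Hosts whose castore.pem differs from the most common version."""
    groups = sorted(group_hosts_by_castore(castore_by_host), key=len, reverse=True)
    return sorted(host for group in groups[1:] for host in group)


if __name__ == "__main__":
    # Placeholder contents, not real certificates.
    good = "-----BEGIN CERTIFICATE-----\nAAAA\n-----END CERTIFICATE-----\n"
    other = "-----BEGIN CERTIFICATE-----\nBBBB\n-----END CERTIFICATE-----\n"
    print(odd_hosts_out({"esx-a": good, "esx-b": good, "esx-c": other}))
```

Any host this reports is a candidate for the certificate renewal mentioned above. If the castore.pem holds several certificates in a different order on different hosts, this simple hash will flag them as different even though the set of CAs is the same.
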

 

  • Restart the vsan-health and vsanmgmt services again by following the steps mentioned earlier.
  • If this still does not work, put the agent host into maintenance mode (MM) with “Ensure accessibility” ⇒ remove the host from the cluster ⇒ reboot ⇒ add it back to the cluster. Make sure the host is back and part of the vSAN cluster by running “localcli vsan cluster get”. Then retest vSAN health and check.
  • Also look at this KB, as you may be facing SSL issues: https://kb.vmware.com/s/article/2150570
  • If the health check still fails, contact VMware.

 

I hope this post has been informative for you.

Happy learning!!

 

 
