The second part of my journey through the TKG extensions installation is dedicated to the monitoring components. As I explained in the first part, I had to add ClusterRoleBindings to deploy the workloads successfully with the Kubernetes admission controller enabled.
Prometheus collects (“scrapes”) metrics from various sources and lets you run queries and calculations on them. You’ll find the k8s manifests in the folder “monitoring”.
I prepare a ClusterRoleBinding that covers all service accounts already defined for the Prometheus components:
kubectl create clusterrolebinding prometheus --clusterrole=psp:vmware-system-privileged --serviceaccount=tanzu-system-monitoring:prometheus-server --serviceaccount=tanzu-system-monitoring:prometheus-alertmanager --serviceaccount=tanzu-system-monitoring:prometheus-kube-state-metrics --serviceaccount=tanzu-system-monitoring:prometheus-node-exporter --serviceaccount=tanzu-system-monitoring:prometheus-pushgateway --serviceaccount=tanzu-system-monitoring:prometheus-cadvisor
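The long command above repeats one flag per service account. As a minimal sketch, the same command can be assembled from a list so that it is easier to review. The variable and function names (NAMESPACE, ACCOUNTS, build_crb_cmd) are my own for illustration; the namespace, role and account names are copied from the command above. The script only prints the command and does not run it.

```shell
#!/bin/sh
# Namespace and service account names are taken from the command above.
NAMESPACE="tanzu-system-monitoring"
ACCOUNTS="prometheus-server prometheus-alertmanager prometheus-kube-state-metrics prometheus-node-exporter prometheus-pushgateway prometheus-cadvisor"

# Hypothetical helper: print the kubectl command instead of running it,
# so it can be checked before it touches the cluster.
build_crb_cmd() {
  cmd="kubectl create clusterrolebinding prometheus --clusterrole=psp:vmware-system-privileged"
  for sa in $ACCOUNTS; do
    cmd="$cmd --serviceaccount=$NAMESPACE:$sa"
  done
  printf '%s\n' "$cmd"
}

build_crb_cmd
```

Once the printed command looks right, it can be pasted into the shell as is.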
The following adaptations are only necessary if you have changed the “serviceDomain” for your cluster (see first part). In the file monitoring/prometheus/base-files/04-server-configmap.yaml the default cluster domain “svc.cluster.local” is hard-coded for some endpoints. Hence I had to change it to the domain I have set, and I put the correct values in monitoring/prometheus/values.yaml as well. It would also be possible to use only “servicename” or “servicename.namespace” to circumvent this issue.
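A minimal sketch of such a domain replacement with sed, assuming a hypothetical new domain “svc.example.internal”. The sample line is an illustrative stand-in, not the real content of the config map, so the script works on a temporary file and leaves the manifests untouched.

```shell
#!/bin/sh
# OLD_DOMAIN is the hard-coded default; NEW_DOMAIN is a hypothetical value,
# replace it with the domain of your own cluster.
OLD_DOMAIN="svc.cluster.local"
NEW_DOMAIN="svc.example.internal"

# Illustrative stand-in for the config map; not the real file content.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
url: http://my-service.my-namespace.svc.cluster.local:9090
EOF

# Escape the dots so sed matches them literally, then rewrite every occurrence.
old_pattern="$(printf '%s' "$OLD_DOMAIN" | sed 's/\./\\./g')"
rewritten="$(sed "s/$old_pattern/$NEW_DOMAIN/g" "$sample")"
printf '%s\n' "$rewritten"
rm -f "$sample"
```

To apply the same substitution to the real file, point sed at monitoring/prometheus/base-files/04-server-configmap.yaml and review the diff before deploying.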
Another issue, possibly only within my installation, was that Antrea (the CNI networking plugin) was not able to work with “hostPorts”. I found an already closed issue in the project on GitHub; maybe the version I’m using is not yet patched. I worked around it by switching to ClusterIP only, modifying the file as follows:
Later on, while testing Grafana, I found out that the scrape interval of 1 minute was too long to deliver reasonable results for the CPU load percentage. This might depend on the compute power you have at your disposal.
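One possible explanation, which I have not verified against the dashboard: the Prometheus function rate() needs at least two samples inside its range window. If a panel uses a 1-minute window (an assumption on my side), a 1-minute scrape interval guarantees only one sample, while 30 seconds guarantees two. The small calculation below is only a sketch of that arithmetic; the function name and the window length are hypothetical.

```shell
#!/bin/sh
# Guaranteed number of samples inside a range window, whatever the alignment:
# floor(window / scrape interval). Both arguments are in seconds.
min_samples_in_window() {
  window_s="$1"
  interval_s="$2"
  echo $(( window_s / interval_s ))
}

# WINDOW_S is an assumption: a 1-minute range such as rate(...[1m]).
WINDOW_S=60
for interval in 60 30; do
  n="$(min_samples_in_window "$WINDOW_S" "$interval")"
  if [ "$n" -ge 2 ]; then
    echo "interval ${interval}s: at least $n samples, rate() can be computed"
  else
    echo "interval ${interval}s: only $n sample guaranteed, rate() may return nothing"
  fi
done
```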
With these changes I was able to successfully deploy the services:
ytt --ignore-unknown-comments -f common/ -f monitoring/prometheus/ -v infrastructure_provider="vsphere" -v monitoring.ingress.enabled=true -v monitoring.ingress.virtual_host_fqdn="prometheus.system.tanzu.ne.local" | kubectl apply -f-
Remark: the term “ingress” is used rather inconsistently here and for Grafana. In other services there is a distinction between a standard Kubernetes “Ingress” and the Contour “HTTPProxy”. For Prometheus and Grafana you need to have Contour installed in order to enable the “ingress”.
I will cover the configuration for the Alertmanager in a later write-up. The manifest has some stubs to send alerts via email and Slack.
Grafana is a monitoring frontend with attractive dashboards ready to install. In addition to the two preconfigured dashboards, I recommend installing the Prometheus Node Exporter Full dashboard in order to leverage the node exporter component already contained in the TKG extensions manifests.
But let’s first do the necessary adaptations and deployments. As with Prometheus, I had to change the service domain once more:
Prepare the ClusterRoleBindings:
kubectl create clusterrolebinding grafana --clusterrole=psp:vmware-system-privileged --serviceaccount=tanzu-system-monitoring:grafana
And deploy the components:
ytt --ignore-unknown-comments -f common/ -f monitoring/grafana/ -v monitoring.grafana.secret.admin_password="YWRtaW4=" -v infrastructure_provider="vsphere" -v monitoring.grafana.ingress.enabled=true -v monitoring.grafana.ingress.virtual_host_fqdn="grafana.system.tanzu.ne.local" | kubectl apply -f-
Remark: you have to provide the password base64-encoded. Here I use “admin”, which is the default password set for Grafana. You’ll have to change it on the first login.
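A minimal sketch of how to produce the encoded value. Note the use of printf: a plain echo appends a newline, which would end up inside the encoded password. The decode flag -d is the GNU coreutils spelling and is an assumption about your platform.

```shell
#!/bin/sh
# printf avoids the trailing newline that a plain "echo" would add to the input.
encoded="$(printf '%s' 'admin' | base64)"
echo "$encoded"

# Decode again to verify the round trip (-d is the GNU coreutils spelling).
decoded="$(printf '%s' "$encoded" | base64 -d)"
echo "$decoded"
```

The first line printed is the value passed to monitoring.grafana.secret.admin_password in the ytt command above.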
OK, now let’s test the monitoring components:
This is a preconfigured dashboard based on the official kube-state-metrics project.
I assume this dashboard is sponsored by VMware TKG and it offers a similar view on the cluster processes. For this one I had to adapt the scrape interval in Prometheus from 1 minute to 30 seconds in order to get a reliable view of the CPU load. The file system view never worked for me. After checking the logs I found that some errors were logged for the kube-state-metrics component. Digging a little deeper, I checked the compatibility matrix and saw that the version referenced in the manifests (v1.9.7) is compatible with Kubernetes 1.16 but not with 1.18, which I am currently using for my workload cluster. This explains the errors (v1beta APIs were expected). But even after building the latest version I never got a view of the file system. I’ll check it out later and will use this case to show how to build and publish custom images to the Docker registry Harbor. Stay tuned.