Adding monitoring with Prometheus & Grafana
One of the things I wanted to do before expanding the homelab setup was to add some monitoring. I found kube-prometheus-stack to be a good way to get started quickly.
I’m integrating this into my GitOps workflow, and it’s committed into my homelab repo.
The config
I have a values.yaml for the helm chart:
grafana:
  admin:
    existingSecret: grafana-admin
    userKey: admin-user
    passwordKey: admin-password
  persistence:
    enabled: true
    size: 5Gi
  ingress:
    enabled: true
    ingressClassName: traefik
    hosts:
      - grafana.local
    tls:
      - hosts:
          - grafana.local
        secretName: grafana-tls
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
alertmanager:
  enabled: false
# k3s doesn't expose these by default
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeProxy:
  enabled: false
kubeEtcd:
  enabled: false
One key thing here is the ...SelectorNilUsesHelmValues settings: with these set to false, the operator picks up ServiceMonitors, PodMonitors and rules from all namespaces, instead of only the ones labelled for this helm release.
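Without these overrides, a monitor would only be discovered if it carried the chart's release label, roughly like this (assuming the release is named kube-prometheus-stack):
metadata:
  labels:
    release: kube-prometheus-stack # required by the default selector behaviour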
Trying the ArgoCD Application multi-source pattern
For this one, I’m using a multi-source pattern I haven’t used before. Instead of creating a wrapper helm chart and adding the upstream chart as a dependency, the Application lists the chart and my values file as two separate sources, with a ref linking them. I also need ServerSideApply=true: the chart’s CRDs carry annotations that exceed the client-side apply size limit, so the sync failed without it.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-prometheus-stack
  namespace: infra
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  sources:
    - repoURL: https://prometheus-community.github.io/helm-charts
      chart: kube-prometheus-stack
      targetRevision: 84.0.0
      helm:
        valueFiles:
          - $values/monitoring/values.yaml
    - repoURL: https://forgejo.local/user/repo.git
      targetRevision: main
      ref: values
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
Debugging log
Problem 1: Multiple Prometheus operators
After deploying this chart, the Prometheus pod was being recreated every few seconds, and the operator log showed it syncing Prometheus every 100-500ms.
Searching for conflicting operators, I found a leftover prometheus-operator in the default namespace, from a previous install experiment that hadn’t been cleaned up correctly. Scaling it down and then deleting it fixed the issue.
Lesson: cluster-scoped operators watch all namespaces by default. Running two of them will cause issues, no matter which namespaces they live in.
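A quick way to spot this kind of duplicate, assuming the leftover operator runs as a Deployment:
kubectl get deployments --all-namespaces | grep prometheus-operator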
Problem 2: ServiceMonitor discovered but 0 active targets
To get started using my monitoring setup, I added a ServiceMonitor to my Forgejo install. Prometheus discovered it, but showed 0 active targets. The root cause was that the Service had no metadata labels, only the pods did, and a ServiceMonitor selects on Service labels.
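For reference, the ServiceMonitor looked roughly like this (the port name and path reflect my setup and are illustrative):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: forgejo
spec:
  selector:
    matchLabels:
      app: forgejo # matched against Service labels, not pod labels
  endpoints:
    - port: http # must match a named port on the Service
      path: /metrics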
The fix:
apiVersion: v1
kind: Service
metadata:
  name: forgejo
  labels:
    app: forgejo # this is what was missing
  spec:
    selector:
      app: forgejo # this looked like it was good to go, but it's a selector not a label
    ports:
      - name: http
        port: 3000
        targetPort: 3000
Lesson: making sure everything is labelled properly is important in Kubernetes. Using a helper function in my charts will be a good step towards avoiding this kind of issue.
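A minimal sketch of what that helper could look like, assuming a chart named forgejo (the define name is hypothetical):
{{/* templates/_helpers.tpl: one place to define the app labels */}}
{{- define "forgejo.labels" -}}
app: forgejo
{{- end }}
Both the Service labels and its selector can then include the same helper, so they can’t drift apart:
metadata:
  labels:
    {{- include "forgejo.labels" . | nindent 4 }}
spec:
  selector:
    {{- include "forgejo.labels" . | nindent 4 }}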
Problem 3: Target UP but returns 404
After fixing the label issue, I still got 404 Not Found on /metrics. This one simply required enabling the endpoint in the Forgejo configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: forgejo-config
data:
  ...
  GITEA__metrics__ENABLED: "true"
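To double-check the endpoint outside of Prometheus, a port-forward plus curl works (service name and port are from my setup):
kubectl port-forward svc/forgejo 3000:3000
curl http://localhost:3000/metrics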
Grafana dashboard for Forgejo
Now that metrics collection was working, I wanted to see the data in Grafana. I used a community dashboard (id 17802) to get started, with only a couple of tweaks to make it fit my setup. One issue that took a bit of time to figure out: the uptime display looked frozen, even though the value was actually incrementing when checked in the Explore tab. The root cause was simple: the panel resolution was set too high (1h). Moving it to a more sensible value gave me a correctly incrementing display.
Next steps
Now that the monitoring POC is complete, the next step will be to add ServiceMonitors to other apps and explore Grafana dashboards further. Two big things still missing from the observability picture are logs and alerts, which will also need to be added.