Adding monitoring with Prometheus & Grafana

One of the things I wanted to do before expanding the homelab setup was to add some monitoring. The kube-prometheus-stack Helm chart turned out to be a good way to get a quick start.

I’m integrating this into my GitOps workflow, so everything here is committed to my homelab repo.

The config

I have a values.yaml for the helm chart:

grafana:
  admin:
    existingSecret: grafana-admin
    userKey: admin-user
    passwordKey: admin-password
  persistence:
    enabled: true
    size: 5Gi
  ingress:
    enabled: true
    ingressClassName: traefik
    hosts:
      - grafana.local
    tls:
      - hosts:
          - grafana.local
        secretName: grafana-tls

prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false

alertmanager:
  enabled: false

# k3s doesn't expose these by default
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeProxy:
  enabled: false
kubeEtcd:
  enabled: false

One key thing here: the ...SelectorNilUsesHelmValues settings are all set to false, so that Prometheus picks up ServiceMonitors, PodMonitors, and rules from every namespace, instead of only the objects carrying this Helm release’s labels (the chart’s default).
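Conceptually, with those flags set to false the chart renders empty, match-everything selectors onto the Prometheus resource. A heavily abbreviated sketch of the effect (not the full rendered object):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
spec:
  serviceMonitorSelector: {}          # empty selector: match all ServiceMonitors
  serviceMonitorNamespaceSelector: {} # empty selector: all namespaces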

Trying the ArgoCD Application multi-source pattern

For this one, I’m using a multi-source pattern I hadn’t used before. Instead of creating a wrapper Helm chart that pulls in kube-prometheus-stack as a dependency, the Application lists two sources: the upstream chart, and my Git repo declared with ref: values, which only serves to make its files available to the chart under the $values prefix. I also need ServerSideApply=true: some of the chart’s CRDs are so large that the last-applied-configuration annotation used by client-side apply exceeds the annotation size limit, and the sync failed without it.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-prometheus-stack
  namespace: infra
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  sources:
    - repoURL: https://prometheus-community.github.io/helm-charts
      chart: kube-prometheus-stack
      targetRevision: 84.0.0
      helm:
        valueFiles:
          - $values/monitoring/values.yaml
    - repoURL: https://forgejo.local/user/repo.git
      targetRevision: main
      ref: values
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true

Debugging log

Problem 1: Multiple Prometheus operators

When first deploying this chart, I got a Prometheus pod that was recreated every few seconds, and the operator log showed it re-syncing Prometheus every 100–500 ms.

Searching for a conflicting operator, I found a leftover prometheus-operator in the default namespace, from a previous install experiment that hadn’t been cleaned up correctly. Scaling it down and then deleting it fixed the issue.

Lesson: cluster-scoped operators watch all namespaces by default. Running two of them makes them fight over the same resources, regardless of which namespaces they live in.

Problem 2: ServiceMonitor discovered but 0 active targets

To put the new setup to use, I added a ServiceMonitor to my Forgejo install. Prometheus discovered it, but reported 0 active targets. The root cause: a ServiceMonitor selects Services by label, and my Service had no labels in its metadata; only the pods did.
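For reference, the ServiceMonitor looked roughly like this (a sketch from memory; the scrape interval is an assumption):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: forgejo
spec:
  selector:
    matchLabels:
      app: forgejo   # matched against the Service's labels, not the pods'
  endpoints:
    - port: http     # must match the Service's port name
      path: /metrics
      interval: 30s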

The fix:

apiVersion: v1
kind: Service
metadata:
  name: forgejo
  labels:
    app: forgejo       # this is what was missing
spec:
  selector:
    app: forgejo       # this looked fine, but it's the pod selector, not a label on the Service
  ports:
    - name: http
      port: 3000
      targetPort: 3000

Lesson: making sure everything is labelled consistently is important in Kubernetes. Adding a labels helper to my charts will be a good step towards avoiding this kind of issue, as sketched below.
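Something like the usual Helm labels helper would do it (a sketch; the forgejo.labels helper name is hypothetical):

{{/* templates/_helpers.tpl: one shared definition of the labels */}}
{{- define "forgejo.labels" -}}
app: {{ .Chart.Name }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

The Service metadata, the pod template, and the selectors can then all include the same helper, so they can never drift apart:

metadata:
  labels:
    {{- include "forgejo.labels" . | nindent 4 }}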

Problem 3: Target UP but returns 404

After fixing the label issue, I still got 404 Not Found on /metrics. This one simply required enabling the endpoint in the Forgejo configuration (the GITEA__section__KEY environment variables map onto app.ini sections):

apiVersion: v1
kind: ConfigMap
metadata:
  name: forgejo-config
data:
  ...
  GITEA__metrics__ENABLED: "true"  # maps to ENABLED = true under [metrics] in app.ini

Grafana dashboard for Forgejo

Now that metrics collection was working, I wanted to see the data in Grafana. I used a community dashboard (ID 17802) as a starting point, with only a couple of tweaks to make it fit my setup. One issue that took a while to figure out: the uptime display looked frozen, even though the underlying value was clearly incrementing when queried from the Explore tab. The root cause was simple: the panel’s query resolution (min step) was set too high (1h). Lowering it to a more sensible value gave me a correctly incrementing display.

Next steps

Now that the monitoring POC is complete, the next step will be to add ServiceMonitors to the other apps and to explore Grafana dashboards further. Two big things still missing from the observability picture are logs and alerts (Alertmanager is disabled above), which will also need to be added.
