Dynamic config: DCD and RRD file cleanup that can handle a high turnover of objects
Hi all,
following a recent problem on our site: it is currently quite easy for owners and users of a k8s cluster to fill up nearly any amount of space on the Checkmk /omd filesystem, just by rapidly deploying and killing objects in k8s.
An autoscaling GKE Autopilot cluster can easily start and kill nodes, pods and other objects in massive numbers. DCD therefore constantly creates new hosts and removes inactive ones from the config.
The catch is that the RRDs created for each host remain in place after DCD removes the host. As a result, /omd fills up rapidly with data of hosts that are no longer active. "Rapidly" here means constant growth of 80 GB per day or more.
This way k8s, or anything similar such as podman, docker, VMware etc., could basically DoS the site just by running such a workload.
The ordinary disk space cleanup does not help here: it works in units of days at minimum, and there is no way to tell it "remove only the RRDs of hosts matching this label/pattern that have not been written to for more than X hours or Y days".
The proposal is threefold:
1 - DCD could have an option to remove the RRDs (and perhaps the inventory) of a host as well, either on host removal or after a configurable delay. Optionally even based on a regex of the hostname, that would really help.
2 - It would be great if DCD could be given a renaming scheme for objects such as pods, to provide constant naming. For example, a deployment would always contain pods pod_deployment_a_1 to pod_deployment_a_X, regardless of the true object names. This could be done either by a plugin or by calling user-provided code to do the "standardisation".
3 - Diskspace cleanup could likewise have an option to configure host patterns with different "retention periods", not just a single period for everything on the site.
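To illustrate point 2, here is a rough sketch of what such "standardisation" code could look like. It is not an existing DCD hook: the function names, the slot-naming format and the assumption that pod names look like <deployment>-<replicaset-hash>-<5-char-suffix> are all mine, purely for illustration. The idea is that a replacement pod takes over the slot (and thus the host name and RRDs) of the pod that died.

```python
import re

# Assumption: pods created by a Deployment are named
# <deployment>-<replicaset-hash>-<5-char-suffix>.
POD_NAME = re.compile(r"^(?P<deployment>.+)-[a-z0-9]+-[a-z0-9]{5}$")


def deployment_of(pod_name):
    """Return the deployment part of a pod name ('-' replaced by '_')."""
    match = POD_NAME.match(pod_name)
    base = match.group("deployment") if match else pod_name
    return base.replace("-", "_")


def assign_slots(pod_names, previous=None):
    """Map volatile pod names to stable names like pod_<deployment>_<n>.

    'previous' is the mapping returned by the last run. Pods that are
    still alive keep their slot; new pods get the lowest free slot.
    """
    previous = previous or {}
    current = set(pod_names)
    # Keep the assignments of pods that still exist.
    mapping = {pod: slot for pod, slot in previous.items() if pod in current}
    used = set(mapping.values())
    for pod in sorted(current - set(mapping)):
        deployment = deployment_of(pod)
        index = 1
        while f"pod_{deployment}_{index}" in used:
            index += 1
        slot = f"pod_{deployment}_{index}"
        mapping[pod] = slot
        used.add(slot)
    return mapping
```

With something like this, the number of hosts (and RRD directories) per deployment is bounded by the maximum replica count instead of growing with every restart. The previous mapping would have to be stored somewhere between runs.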
So far, I have been forced to whip up a somewhat dirty script to clean up the specific ballast that tends to accumulate and cause problems. It still runs from crontab, though, and cannot help when the filesystem fills up in less than two days.
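For reference, the logic I would like to see built in (points 1 and 3) is roughly the following. This is a simplified sketch, not my actual script and not anything Checkmk ships: the rule list, the function names and the assumption that every host has its own directory under one RRD root are illustrative only, and the root path has to be passed in for your site.

```python
import re
import time
from pathlib import Path

# Hypothetical rules: (hostname regex, retention in hours).
# The first matching rule wins; hosts matching no rule are never touched.
RETENTION_RULES = [
    (re.compile(r"^pod_"), 6),
    (re.compile(r"^node_"), 48),
]


def retention_hours(hostname, rules=RETENTION_RULES):
    """Return the retention in hours for a host, or None if no rule matches."""
    for pattern, hours in rules:
        if pattern.search(hostname):
            return hours
    return None


def stale_host_dirs(rrd_root, rules=RETENTION_RULES, now=None):
    """List per-host directories whose newest file is older than the retention.

    Assumes one sub-directory per host directly under rrd_root.
    """
    now = time.time() if now is None else now
    stale = []
    for host_dir in sorted(Path(rrd_root).iterdir()):
        if not host_dir.is_dir():
            continue
        hours = retention_hours(host_dir.name, rules)
        if hours is None:
            continue
        files = [p for p in host_dir.rglob("*") if p.is_file()]
        # Fall back to the directory mtime if the directory is empty.
        newest = max(
            (p.stat().st_mtime for p in files),
            default=host_dir.stat().st_mtime,
        )
        if now - newest > hours * 3600:
            stale.append(host_dir)
    return stale
```

The function only reports candidates. Actually deleting them (for example with shutil.rmtree) should happen only after checking that the host is really gone from the config, which is exactly the part DCD itself would know best.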