High-availability: cross-datacenter and always-on
Currently, a Checkmk HA cluster is designed to compensate for hardware failures within a single datacenter or local network, not for geographic redundancy. We see the requirements of our internal customer and the regulators growing, and we expect that in a few years we will need a solution that offers always-on, highly available monitoring across datacenters.
Shortcomings we see in the current offering
Checkmk is not designed to run an active/passive cluster stretched across two or three datacenters. When the connection between the two nodes goes down, the passive node assumes control and becomes active. If these nodes are in different datacenters and connectivity between those datacenters fails, there is no way to decide which datacenter is still online and therefore which node should assume the active role. This creates a risk of both nodes attempting to become active at the same time.
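To make the problem concrete, here is a minimal sketch in Python. It is not Checkmk code, and all names in it (`Node`, `naive_should_be_active`, `quorum_should_be_active`, the `witness` member) are hypothetical. It assumes a simple "take over when the peer is unreachable" rule and contrasts it with a majority rule that uses a third site as a tie-breaker, which is one common way such a conflict can be avoided.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical cluster member; not a Checkmk data structure."""
    name: str
    reachable_peers: set[str]  # members this node can currently reach


def naive_should_be_active(node: Node, peer: str) -> bool:
    """Two-node rule: take over whenever the peer cannot be reached."""
    return peer not in node.reachable_peers


def quorum_should_be_active(node: Node, all_members: set[str]) -> bool:
    """Majority rule: act only when this node sees more than half of all members."""
    visible = (node.reachable_peers & all_members) | {node.name}
    return len(visible) > len(all_members) / 2


# Scenario 1: the link between the two datacenters is down.
# Neither node can reach the other, so both decide to become active.
dc1 = Node("dc1", reachable_peers=set())
dc2 = Node("dc2", reachable_peers=set())
both_active = naive_should_be_active(dc1, "dc2") and naive_should_be_active(dc2, "dc1")

# Scenario 2: same outage, but a witness in a third site is still
# reachable from dc1 only. Just one node holds a majority.
members = {"dc1", "dc2", "witness"}
dc1_q = Node("dc1", reachable_peers={"witness"})
dc2_q = Node("dc2", reachable_peers=set())
dc1_active = quorum_should_be_active(dc1_q, members)
dc2_active = quorum_should_be_active(dc2_q, members)
```

In the first scenario both nodes conclude they should be active, which is the risk described above. In the second, only the node that still sees a majority takes the active role.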
The current design also requires both Checkmk servers to be located in the same network segment because the cluster depends on a shared cluster address. While we could technically stretch a network segment across datacenters, this is not always an option: sharing the same segment creates a single point of failure, because a misconfiguration or corruption of that network segment can disrupt both servers.
Another important limitation is that upgrades of a Checkmk cluster require a complete outage. High availability includes the expectation that planned maintenance does not cause downtime, also known as “always-on”. With the current cluster model this is not achievable.
Given the growing importance of uninterrupted monitoring, we would like to request that real high availability across datacenters, not dependent on appliances or stretched network segments, be considered for inclusion in the Checkmk roadmap. We fully recognise the complexity of this request, but we also believe
Comments: 1
15 Apr
Kurt Schraeyen
... but we also believe that such a capability will become increasingly relevant to many customers.
:)