HA vs. DR and “extra” HA for your DB
Principles to prevent cascading failures
In the vein of earlier blogs, this post shares observations built from a couple of decades of helping enterprises build resilient systems, filtered through a lot of listening to Kubernetes and StackStorm users over the last 4–5 years in particular. As such, mileage may vary; I'm learning here and offer this as a way for us all to learn together, so feedback is not just welcome, it is loved and needed.
Here is what I have seen too often, and it is the basis of all the principles I share below. As we build these systems of systems, with more dependencies, more change, and more dynamism than any one human could possibly fully understand, we want to make sure that whatever we do, we don't spawn opaque cascading failures.
In short, Don’t Injure Yourself. DIY.
For example, don't make your automation so intelligent that it kills nodes that are not responding without also looking at why those nodes might be moving slowly. Maybe (true story) your load has peaked the day after Thanksgiving, and by pulling the slow nodes out of the queue you are simply shortening the time before all the other nodes get overwhelmed. These are the things brownouts are made of.
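To make the pattern concrete, here is a minimal sketch of a remediation loop that refuses to evict a slow node when the surviving nodes could not absorb its traffic. All the names, thresholds, and the `Node` shape are hypothetical illustrations, not any real operator's API:

```python
# Hypothetical sketch: before evicting a "slow" node, check whether the
# rest of the cluster can absorb its load. If everyone is busy, eviction
# just shortens the path to a brownout.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    response_ms: float     # recent p99 response time (illustrative metric)
    requests_per_s: float  # traffic this node is currently serving

SLOW_THRESHOLD_MS = 500.0    # hypothetical cutoff for a "slow" node
CLUSTER_LOAD_CEILING = 0.8   # hypothetical safe fraction of per-node capacity

def safe_to_evict(node: Node, cluster: list[Node], capacity_per_node: float) -> bool:
    """Evict only if the survivors can take on this node's traffic."""
    survivors = [n for n in cluster if n.name != node.name]
    if not survivors:
        return False
    total_load = sum(n.requests_per_s for n in cluster)
    load_per_survivor = total_load / len(survivors)
    return load_per_survivor <= CLUSTER_LOAD_CEILING * capacity_per_node

def remediate(cluster: list[Node], capacity_per_node: float) -> list[str]:
    """Return the names of slow nodes it is actually safe to evict."""
    evicted = []
    for node in cluster:
        if node.response_ms > SLOW_THRESHOLD_MS:
            if safe_to_evict(node, cluster, capacity_per_node):
                evicted.append(node.name)
            # else: the node is slow because the whole cluster is busy;
            # killing it would only overwhelm the rest faster.
    return evicted
```

The design point is the second check: the automation asks not just "is this node unhealthy?" but "what happens to the system if I act?"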
So how can you avoid being thrown off the end of your own automation treadmill?
A few hard-learned principles that I draw upon below:
1. Shift down: tackle failure as close as possible to where it occurs, to limit the risk of injuring yourself with cascading failures.
2. Build every layer, every system, so that it is built to fail.
3. Build every layer: don't think you have DR when you have HA; don't think you have HA just because you have one workload that spans clusters.
4. Related to 3: infrastructure as code. Always. No black boxes. The desired state is in the repo. Yes, the control loop at the center of Kubernetes will make…
Continue Reading the article in MayaData’s Blog