High availability in Kubernetes looks straightforward in architecture diagrams—but in real production systems, it’s often the hardest problem to get right.
In one of our long-term client engagements, we were responsible for maintaining a legacy infrastructure where critical applications and backend services were hosted on GKE clusters. For customers that required high availability (HA), the setup involved two GKE clusters, typically deployed in different regions.
While this design appeared resilient, the reality was very different.
The Legacy Problem: Manual HA and Operational Risk
In the legacy setup:
- Applications and services were deployed on GKE
- HA customers had two separate GKE clusters
- Failover required manual intervention
When a cluster or service became unhealthy, an SRE had to:
- Detect the failure
- Manually redirect traffic to the healthy cluster
- Validate service health and stability
This approach introduced several issues:
- ❌ Slower recovery during incidents
- ❌ Increased operational burden
- ❌ Higher risk of human error
- ❌ Limited visibility into cross-cluster health
As the platform matured and customer expectations increased, this model no longer met our reliability goals.
Discovering Cloud Service Mesh (Formerly Anthos Service Mesh)
During the migration planning phase, we evaluated native GCP solutions and identified Anthos Service Mesh, now known as Cloud Service Mesh (CSM).
What made Cloud Service Mesh stand out was that it wasn’t just a service mesh—it provided a production-ready solution for multi-cluster traffic management, failover, and observability, all deeply integrated with GKE.
At its core, Cloud Service Mesh is built on top of Istio, using Envoy proxies for traffic control, security, and telemetry.
How We Designed Multi-Cluster HA Using Cloud Service Mesh
Fleet-Based Architecture
Cloud Service Mesh is built around the GKE concept of a Fleet: a logical grouping of clusters that can be managed together.
- Each customer’s GKE clusters were added to a fleet
- Clusters could live in different regions
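Registering a cluster into a fleet is a single gcloud command. A rough sketch (cluster and region names here are placeholders, not our actual setup):

```shell
# Register each customer cluster into the fleet
# (membership and cluster names below are illustrative)
gcloud container fleet memberships register customer-a-us-east \
  --gke-cluster=us-east1/customer-a-us-east \
  --enable-workload-identity
```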
A dedicated configuration cluster managed:
- Multi-Cluster Ingress (MCI)
- Multi-Cluster Services (MCS)
This configuration automatically created:
- Google Global Load Balancer
- Backend services pointing to multiple GKE clusters
- Health checks across regions
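In practice, MCI and MCS are expressed as two CRDs applied to the configuration cluster. The sketch below shows the general shape; the names, namespace, and ports are placeholders, not our actual manifests:

```yaml
# MultiClusterService: derives a Service in every fleet cluster
# and registers its endpoints with the global load balancer
apiVersion: networking.gke.io/v1
kind: MultiClusterService
metadata:
  name: app-mcs        # placeholder name
  namespace: app       # placeholder namespace
spec:
  template:
    spec:
      selector:
        app: app
      ports:
        - name: http
          protocol: TCP
          port: 8080
          targetPort: 8080
---
# MultiClusterIngress: provisions the Google Global Load Balancer
# with backends and health checks across all member clusters
apiVersion: networking.gke.io/v1
kind: MultiClusterIngress
metadata:
  name: app-mci
  namespace: app
spec:
  template:
    spec:
      backend:
        serviceName: app-mcs
        servicePort: 8080
```

Applying these two objects to the config cluster is what triggers the automatic creation of the load balancer, backend services, and cross-region health checks described above.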
Automatic Failover Across Regions
This was the real game changer.
If a service inside a GKE cluster goes down or an entire cluster becomes unhealthy, traffic is automatically routed to the next available healthy cluster in another region.
→ No NGINX changes.
→ No manual intervention.
→ No midnight incident calls.
This gave us true active-active high availability.
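For east-west traffic, the same failover behavior can be tuned through standard Istio APIs, since CSM is Istio-based. A minimal sketch of a DestinationRule enabling locality-aware failover (hostname, namespace, regions, and thresholds are illustrative assumptions):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: app-failover    # placeholder name
  namespace: app        # placeholder namespace
spec:
  host: app.app.svc.cluster.local
  trafficPolicy:
    # Outlier detection must be set for locality failover to activate:
    # eject endpoints that return consecutive 5xx errors
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 30s
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
          - from: us-east1   # if local endpoints are unhealthy...
            to: us-west1     # ...spill traffic to the other region
```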
North-South and East-West Traffic Handling
Cloud Service Mesh handles both:
- North-South traffic: external traffic entering the system via global load balancers
- East-West traffic: service-to-service communication inside and across clusters
This ensured:
- Consistent routing policies
- Secure communication
- Reliable cross-cluster service discovery
All managed natively by GCP.
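Secure cross-cluster communication, for example, comes down to a single mesh-wide policy. A sketch of enforcing strict mutual TLS via Istio's PeerAuthentication (applied in the mesh root namespace):

```yaml
# Mesh-wide policy: all service-to-service traffic must use mTLS,
# including traffic between clusters in the fleet
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace makes this mesh-wide
spec:
  mtls:
    mode: STRICT
```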
Built-In Observability and SLOs
Another major win during migration was visibility.
Cloud Service Mesh provides:
- Per-service SLOs
- Request latency metrics
- QPS insights
- Traffic flow visualization
- Health check status
- Error rate breakdowns
All of this is available directly in the Google Cloud Console, without stitching together multiple third-party tools.
For SREs, this meant:
- Faster root cause analysis
- Better alerting
- Clear understanding of traffic patterns
Migration Outcome: A Big Win for SRE Life
After completing the migration:
- Manual failover was completely eliminated
- HA became automatic and reliable
- Cross-region traffic routing worked seamlessly
- Observability improved dramatically
Most importantly:
The system has been running smoothly for more than 3 years without major HA-related incidents.
As an SRE, this is the kind of success that lets you:
- Sleep peacefully at night
- Stop worrying about manual switches
- Focus on improvements instead of firefighting
Final Thoughts: Why Cloud Service Mesh Was the Right Choice
Cloud Service Mesh solved multiple problems for us in a single, integrated solution:
- Multi-cluster high availability
- Global load balancing
- Automatic failover
- Service-level observability
- Simplified operations during and after migration
If you’re running stateful or critical workloads on GKE and still relying on manual or semi-automated failover mechanisms, Cloud Service Mesh is absolutely worth evaluating. For us, it wasn’t just a migration—it was a long-term reliability upgrade.
References:
https://docs.cloud.google.com/service-mesh/docs/overview
