High availability in Kubernetes looks straightforward in architecture diagrams—but in real production systems, it’s often the hardest problem to get right.
In one of our long-term client engagements, we were responsible for maintaining a legacy infrastructure where critical applications and backend services were hosted on GKE clusters. For customers that required high availability (HA), the setup involved two GKE clusters, typically deployed in different regions.
While this design appeared resilient, the reality was very different.
The Legacy Problem: Manual HA and Operational Risk
In the legacy setup:
- Applications and services were deployed on GKE
- HA customers had two separate GKE clusters
- Failover required manual intervention
When a cluster or service became unhealthy, an SRE had to:
- Detect the failure
- Manually redirect traffic to the healthy cluster
- Validate service health and stability
This approach introduced several issues:
- ❌ Slower recovery during incidents
- ❌ Increased operational burden
- ❌ Higher risk of human error
- ❌ Limited visibility into cross-cluster health
As the platform matured and customer expectations increased, this model no longer met our reliability goals.
Discovering Cloud Service Mesh (Formerly Anthos Service Mesh)
During the migration planning phase, we evaluated native GCP solutions and identified Anthos Service Mesh, now known as Cloud Service Mesh (CSM).
What made Cloud Service Mesh stand out was that it wasn’t just a service mesh—it provided a production-ready solution for multi-cluster traffic management, failover, and observability, all deeply integrated with GKE.
At its core, Cloud Service Mesh is built on top of Istio, using Envoy proxies for traffic control, security, and telemetry.
How We Designed Multi-Cluster HA Using Cloud Service Mesh
Fleet-Based Architecture
Cloud Service Mesh is built around the GKE concept of a Fleet: a logical grouping of clusters that can be managed together.
- Each customer’s GKE clusters were added to a fleet
- Clusters could live in different regions
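Registering a cluster into a fleet is a single gcloud command. A rough sketch (cluster and region names here are placeholders, not our actual setup):

```shell
# Register each customer cluster into the fleet
# (membership and cluster names below are illustrative)
gcloud container fleet memberships register customer-a-us-east \
  --gke-cluster=us-east1/customer-a-us-east \
  --enable-workload-identity
```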
A dedicated configuration cluster managed:
- Multi-Cluster Ingress (MCI)
- Multi-Cluster Services (MCS)
This configuration automatically created:
- Google Global Load Balancer
- Backend services pointing to multiple GKE clusters
- Health checks across regions
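In practice, MCI and MCS are expressed as two CRDs applied to the configuration cluster. The sketch below shows the general shape; the names, namespace, and ports are placeholders, not our actual manifests:

```yaml
# MultiClusterService: derives a Service in every fleet cluster
# and registers its endpoints with the global load balancer
apiVersion: networking.gke.io/v1
kind: MultiClusterService
metadata:
  name: app-mcs        # placeholder name
  namespace: app       # placeholder namespace
spec:
  template:
    spec:
      selector:
        app: app
      ports:
        - name: http
          protocol: TCP
          port: 8080
          targetPort: 8080
---
# MultiClusterIngress: provisions the Google Global Load Balancer
# with backends and health checks across all member clusters
apiVersion: networking.gke.io/v1
kind: MultiClusterIngress
metadata:
  name: app-mci
  namespace: app
spec:
  template:
    spec:
      backend:
        serviceName: app-mcs
        servicePort: 8080
```

Applying these two objects to the config cluster is what triggers the automatic creation of the load balancer, backend services, and cross-region health checks described above.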
Automatic Failover Across Regions
This was the real game changer.
If a service inside a GKE cluster goes down or an entire cluster becomes unhealthy, traffic is automatically routed to the next available healthy cluster in another region.
→ No NGINX changes.
→ No manual intervention.
→ No midnight incident calls.
This gave us true active-active high availability.
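For east-west traffic, the same failover behavior can be tuned through standard Istio APIs, since CSM is Istio-based. A minimal sketch of a DestinationRule enabling locality-aware failover (hostname, namespace, regions, and thresholds are illustrative assumptions):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: app-failover    # placeholder name
  namespace: app        # placeholder namespace
spec:
  host: app.app.svc.cluster.local
  trafficPolicy:
    # Outlier detection must be set for locality failover to activate:
    # eject endpoints that return consecutive 5xx errors
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 30s
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
          - from: us-east1   # if local endpoints are unhealthy...
            to: us-west1     # ...spill traffic to the other region
```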
North-South and East-West Traffic Handling
Cloud Service Mesh handles both:
- North-South traffic: external traffic entering the system via global load balancers
- East-West traffic: service-to-service communication inside and across clusters
This ensured:
- Consistent routing policies
- Secure communication
- Reliable cross-cluster service discovery
All managed natively by GCP.
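Secure cross-cluster communication, for example, comes down to a single mesh-wide policy. A sketch of enforcing strict mutual TLS via Istio's PeerAuthentication (applied in the mesh root namespace):

```yaml
# Mesh-wide policy: all service-to-service traffic must use mTLS,
# including traffic between clusters in the fleet
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace makes this mesh-wide
spec:
  mtls:
    mode: STRICT
```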
Built-In Observability and SLOs
Another major win during migration was visibility.
Cloud Service Mesh provides:
- Per-service SLOs
- Request latency metrics
- QPS insights
- Traffic flow visualization
- Health check status
- Error rate breakdowns
All of this is available directly in the Google Cloud Console, without stitching together multiple third-party tools.
For SREs, this meant:
- Faster root cause analysis
- Better alerting
- Clear understanding of traffic patterns
Migration Outcome: A Big Win for SRE Life
After completing the migration:
- Manual failover was completely eliminated
- HA became automatic and reliable
- Cross-region traffic routing worked seamlessly
- Observability improved dramatically
Most importantly:
The system has been running smoothly for more than 3 years without major HA-related incidents.
As an SRE, this is the kind of success that lets you:
- Sleep peacefully at night
- Stop worrying about manual switches
- Focus on improvements instead of firefighting
Final Thoughts: Why Cloud Service Mesh Was the Right Choice
Cloud Service Mesh solved multiple problems for us in a single, integrated solution:
- Multi-cluster high availability
- Global load balancing
- Automatic failover
- Service-level observability
- Simplified operations during and after migration
If you’re running stateful or critical workloads on GKE and still relying on manual or semi-automated failover mechanisms, Cloud Service Mesh is absolutely worth evaluating. For us, it wasn’t just a migration—it was a long-term reliability upgrade.
References:
https://docs.cloud.google.com/service-mesh/docs/overview
