Introduction to Cloud Service Mesh (Multi-Cluster Ingress) in GCP: A Real-World HA Migration Story

High availability in Kubernetes looks straightforward in architecture diagrams—but in real production systems, it’s often the hardest problem to get right.

In one of our long-term client engagements, we were responsible for maintaining a legacy infrastructure where critical applications and backend services were hosted on GKE clusters. For customers that required high availability (HA), the setup involved two GKE clusters, typically deployed in different regions.

While this design appeared resilient, the reality was very different.

The Legacy Problem: Manual HA and Operational Risk

In the legacy setup:

When a cluster or service became unhealthy, an SRE had to:

This approach introduced several issues:

As the platform matured and customer expectations increased, this model no longer met our reliability goals.


Discovering Cloud Service Mesh (Formerly Anthos Service Mesh)

During the migration planning phase, we evaluated native GCP solutions and identified Anthos Service Mesh, now known as Cloud Service Mesh (CSM).

What made Cloud Service Mesh stand out was that it wasn’t just a service mesh—it provided a production-ready solution for multi-cluster traffic management, failover, and observability, all deeply integrated with GKE.

At its core, Cloud Service Mesh is built on top of Istio, using Envoy proxies for traffic control, security, and telemetry.


How We Designed Multi-Cluster HA Using Cloud Service Mesh

Cloud Service Mesh builds on the concept of a fleet: a logical group of GKE clusters that are registered together and managed as a single unit.

A dedicated configuration cluster managed:

This configuration automatically created:


Automatic Failover Across Regions

This was the real game changer.
If a service inside a GKE cluster goes down or an entire cluster becomes unhealthy, traffic is automatically routed to the next available healthy cluster in another region.

No NGINX changes.
No manual intervention.
No midnight incident calls.

This gave us true active-active high availability.
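To make the failover behaviour concrete, here is a minimal Python sketch of the decision described above: send traffic to the first healthy cluster, walking regions in preference order. It is only an illustration of the idea, not how Cloud Service Mesh or the Google load balancer implements it. The `Cluster` type, the `pick_cluster` function, and the cluster and region names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Cluster:
    name: str      # hypothetical cluster name
    region: str    # region the cluster runs in
    healthy: bool  # outcome of the most recent health check


def pick_cluster(clusters: List[Cluster],
                 preferred_regions: List[str]) -> Optional[Cluster]:
    """Return the first healthy cluster, walking regions in preference order."""
    for region in preferred_regions:
        for cluster in clusters:
            if cluster.region == region and cluster.healthy:
                return cluster
    # No healthy cluster in any listed region: nothing can serve the request.
    return None


# Assumed example: the primary region is unhealthy, the secondary is fine.
clusters = [
    Cluster("gke-primary", "us-central1", healthy=False),
    Cluster("gke-secondary", "europe-west1", healthy=True),
]
chosen = pick_cluster(clusters, ["us-central1", "europe-west1"])
print(chosen.name if chosen else "no healthy cluster")  # prints "gke-secondary"
```

In the legacy setup an SRE made this decision by hand; after the migration, the platform makes the equivalent decision continuously based on health checks.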


North-South and East-West Traffic Handling

North-South traffic
(External traffic entering the system via global load balancers)

East-West traffic
(Service-to-service communication inside and across clusters)

All managed natively by GCP.


Built-In Observability and SLOs

Another major win during migration was visibility.

Cloud Service Mesh provides:

All of this is available directly in the Google Cloud Console, without stitching together multiple third-party tools.
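For readers new to SLOs, the arithmetic behind an error budget is simple enough to sketch in a few lines of Python. The 99.9% target, the 30-day window, and the request counts below are assumed values for illustration, not figures from our migration, and the function names are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full downtime an availability SLO allows over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of a request-based error budget that is still unspent."""
    allowed_failures = total_requests * (1.0 - slo_target)
    if allowed_failures == 0:
        return 0.0  # a 100% target leaves no budget at all
    return 1.0 - failed_requests / allowed_failures


# Assumed example: 99.9% availability over a 30-day window.
print(f"{error_budget_minutes(0.999, 30):.1f} minutes of downtime allowed")

# Assumed example: 250 failures out of 1,000,000 requests so far.
print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of the budget left")
```

A negative result from `budget_remaining` means the budget is already overspent for the window.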

For SREs, this meant:


Migration Outcome: A Big Win for SRE Life

After completing the migration:

As an SRE, this is the kind of success that lets you:


Final Thoughts: Why Cloud Service Mesh Was the Right Choice

Cloud Service Mesh solved multiple problems for us in a single, integrated solution:

If you’re running stateful or critical workloads on GKE and still relying on manual or semi-automated failover mechanisms, Cloud Service Mesh is absolutely worth evaluating. For us, it wasn’t just a migration—it was a long-term reliability upgrade.

https://docs.cloud.google.com/service-mesh/docs/overview
