High Availability in Azure: Services, Strategies, and Design Decisions

Updated – April 2024

Ensuring high availability is critical for any application or service running in the cloud. High availability (HA), one aspect of resiliency, ensures that your applications and services remain accessible and performant even when failures occur. Azure provides a robust set of services and best practices for achieving it.

High availability refers to the ability of a system or component to remain operational and accessible during a specified period, ensuring minimal downtime and uninterrupted service. Achieving high availability involves redundancy, fault tolerance, and the ability to quickly recover from failures.
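"Minimal downtime" is easier to reason about once it is expressed as a number. Availability targets are commonly stated as a percentage of uptime over a period. The short Python sketch below converts such a percentage into the downtime budget it implies. It assumes a 365-day year and a 30-day month, and the function names are illustrative; it is not tied to any specific Azure SLA.

```python
# Convert an availability target (e.g. an SLA percentage) into the
# maximum downtime it allows over a period.

HOURS_PER_YEAR = 365 * 24          # 8,760 hours; ignores leap years
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200 minutes; assumes a 30-day month


def allowed_downtime_hours_per_year(availability_pct: float) -> float:
    """Hours of downtime per year permitted by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def allowed_downtime_minutes_per_month(availability_pct: float) -> float:
    """Minutes of downtime per 30-day month permitted by an availability percentage."""
    return MINUTES_PER_MONTH * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.95, 99.99):
        print(f"{target}%: {allowed_downtime_hours_per_year(target):.2f} h/year, "
              f"{allowed_downtime_minutes_per_month(target):.1f} min/month")
```

For example, a 99.9% target allows about 8.76 hours of downtime per year, or about 43.2 minutes per 30-day month. Each additional "nine" cuts the budget by a factor of ten.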

Azure services that help you achieve high availability include:

Availability Zones
Azure Load Balancer
Azure Traffic Manager
Azure Application Gateway
Azure SQL Database
Azure Cosmos DB
Azure Virtual Machine Scale Sets
Azure Kubernetes Service (AKS)

1. Availability Zones

Availability Zones are physically separate locations within an Azure region. Each zone is made up of one or more datacentres equipped with independent power, cooling, and networking. Distributing your resources across multiple Availability Zones helps protect your applications and data from datacentre failures.

For example, you can deploy a multi-tier application with web, application, and database layers across three Availability Zones within a region. This ensures that if one zone goes down, the application remains operational in the other zones.
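The placement idea can be sketched in a few lines of Python. This is a planning aid under stated assumptions and not an Azure API: the zone labels, tier names, and instance counts are hypothetical. The sketch spreads each tier's instances round-robin across zones, then checks that every tier still has at least one instance whichever single zone is lost.

```python
# Hypothetical logical zone labels for a region with three Availability Zones.
ZONES = ["1", "2", "3"]


def place_round_robin(tier: str, count: int, zones: list[str]) -> list[tuple[str, str]]:
    """Assign `count` instances of a tier to zones in round-robin order."""
    return [(f"{tier}-{i}", zones[i % len(zones)]) for i in range(count)]


def survivors(placement: list[tuple[str, str]], failed_zone: str) -> list[str]:
    """Instances of one tier that sit outside the failed zone."""
    return [name for name, zone in placement if zone != failed_zone]


def tolerates_single_zone_failure(plan: dict[str, list[tuple[str, str]]],
                                  zones: list[str]) -> bool:
    """True if every tier keeps at least one instance whichever single zone fails."""
    return all(survivors(placement, z) for z in zones for placement in plan.values())


# Hypothetical multi-tier layout: web, application, and database tiers.
plan = {tier: place_round_robin(tier, n, ZONES)
        for tier, n in {"web": 3, "app": 3, "db": 2}.items()}
print(tolerates_single_zone_failure(plan, ZONES))  # True
```

If the database tier had only one instance, the check would fail: zone redundancy only helps when each tier has instances in at least two zones.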

2. Azure Load Balancer

Azure Load Balancer distributes incoming network traffic across multiple virtual machines (VMs) within a region. It ensures that no single VM becomes a point of failure and provides high availability and reliability by spreading the load.

A load balancer can be public, managing traffic to internet-facing services, or internal, managing traffic to applications inside a virtual network. You configure health probes to monitor the health of the backend VMs so that traffic is routed only to healthy instances. The load balancer also reconfigures itself automatically when you scale backend instances up or down.
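Azure Load Balancer's default distribution mode hashes the five-tuple (source IP, source port, destination IP, destination port, protocol) to choose a backend. The Python sketch below imitates that idea together with simple health-probe bookkeeping. It is a conceptual model only: the SHA-256 hash, the `UNHEALTHY_THRESHOLD` value, the recovery rule, and the `Backend`/`record_probe`/`pick_backend` names are all assumptions for illustration, not Azure's implementation.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical: consecutive failed probes before an instance leaves rotation.
UNHEALTHY_THRESHOLD = 2


@dataclass
class Backend:
    name: str
    consecutive_failures: int = 0
    healthy: bool = True


def record_probe(backend: Backend, success: bool) -> None:
    """Update a backend's health from one probe result."""
    if success:
        # Assumption of this sketch: one successful probe returns the instance to rotation.
        backend.consecutive_failures = 0
        backend.healthy = True
    else:
        backend.consecutive_failures += 1
        if backend.consecutive_failures >= UNHEALTHY_THRESHOLD:
            backend.healthy = False


def pick_backend(backends: list[Backend], src_ip: str, src_port: int,
                 dst_ip: str, dst_port: int, protocol: str) -> Backend:
    """Map a five-tuple to one healthy backend.

    While the set of healthy backends is unchanged, the same tuple
    always maps to the same backend.
    """
    healthy = [b for b in backends if b.healthy]
    if not healthy:
        raise RuntimeError("no healthy backends")
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{protocol}".encode()
    digest = hashlib.sha256(key).digest()
    return healthy[int.from_bytes(digest[:8], "big") % len(healthy)]
```

Because the index is taken modulo the number of healthy backends, removing or restoring an instance remaps some flows in this sketch. Production load balancers may handle that transition differently.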

3. Azure Traffic Manager

Azure Traffic Manager is a DNS-based traffic load balancer that distributes traffic across multiple regions, enhancing the availability and responsiveness of your applications. Its routing methods include:

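Performance Routing: Route traffic to the endpoint with the lowest network latency for better performance.
Geographic Routing: Route traffic to specific endpoints based on the geographic location where the DNS query originates.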
Priority Routing: Define primary and backup endpoints to manage failover.
Weighted Routing: Distribute traffic based on predefined weights to different endpoints.
…

For example, a global e-commerce website might use Azure Traffic Manager to route users to the nearest region, improving load times, and to fail over to a secondary region if the primary region experiences an outage.

4. Azure Application Gateway

Azure Application Gateway is a web traffic load balancer that enables you to manage traffic to your web applications. It provides advanced load balancing capabilities, such as SSL termination, Web Application Firewall (WAF), and URL-based routing.

Key Features:

SSL Offloading: Terminate SSL at the gateway to reduce load on backend servers.
Web Application Firewall (WAF): Protect your applications from common web vulnerabilities.
Path-Based Routing: Route requests based on URL paths to specific backend pools.

For example, an organisation might place a multi-site web application behind Azure Application Gateway to get efficient routing, SSL offloading, and WAF protection against common security threats.
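Path-based routing maps URL path patterns to backend pools. The sketch below shows the idea with trailing-wildcard patterns such as `/images/*` and a default pool for unmatched paths. It uses longest-pattern-wins as one simple, explicit policy; that ordering rule, the `PathRule` and `route` names, and the pool names are assumptions of this sketch and do not describe how Application Gateway evaluates its own rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathRule:
    pattern: str       # exact path, or a prefix ending in "/*", e.g. "/images/*"
    backend_pool: str  # hypothetical pool name


def matches(pattern: str, path: str) -> bool:
    """Exact match, or prefix match for patterns ending in '/*'."""
    if pattern.endswith("/*"):
        prefix = pattern[:-1]  # "/images/*" -> "/images/"
        return path.startswith(prefix) or path == pattern[:-2]
    return path == pattern


def route(path: str, rules: list[PathRule], default_pool: str) -> str:
    """Choose a backend pool for a request path."""
    candidates = [r for r in rules if matches(r.pattern, path)]
    if not candidates:
        return default_pool
    # Policy assumed by this sketch: the most specific (longest) pattern wins.
    return max(candidates, key=lambda r: len(r.pattern)).backend_pool


rules = [
    PathRule("/images/*", "image-pool"),
    PathRule("/api/*", "api-pool"),
    PathRule("/api/v2/*", "api-v2-pool"),
]
```

Keeping a default pool means a request that matches no rule is still served, rather than rejected.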

5. Azure SQL Database

Azure SQL Database offers built-in high-availability features that keep your database available during planned and unplanned events. These include:

Active Geo-Replication: Create readable replicas in different regions.
Auto-Failover Groups: Automatically fail over a group of databases to a secondary region in case of an outage.
Zone-Redundant Configuration: Deploy databases across Availability Zones for higher resilience.

For example, a financial services firm might use Azure SQL Database with active geo-replication to maintain secondary replicas in a different region, so that critical data remains accessible even if the primary region fails.
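Replication protects the data, but applications still see dropped connections while a failover is in progress. A common client-side companion is to retry transient errors with capped exponential backoff and jitter. In the sketch below, `operation` stands for whatever opens a connection or runs a query, for example through a failover group's read-write listener, which is updated to point at the new primary after failover. `TransientError` and `retry_with_backoff` are hypothetical names, and the attempt count and delays are illustrative defaults, not recommended values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying, such as a connection dropped mid-failover."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying TransientError with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The delay ceiling doubles each attempt (0.5 s, 1 s, 2 s, ...) up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("unreachable")
```

The `sleep` parameter exists so the function can be exercised without real waiting. Randomised delays also keep many clients from reconnecting to the new primary at the same instant.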

6. Azure Cosmos DB

Azure Cosmos DB is a globally distributed, multi-model database that provides high availability and low-latency access to data across the globe. Key features include:

Multi-Region Writes: Enable read and write operations in multiple regions for high availability.
Automatic Failover: Automatically fail over to another region in case of an outage.
Consistency Levels: Choose from multiple consistency models to balance availability and performance.

For example, a global retail company might use Azure Cosmos DB to keep its product catalogue available with low latency to customers worldwide, even during regional outages.
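With multi-region writes, two regions can update the same item concurrently, so the conflict has to be resolved. Cosmos DB's default conflict resolution policy is Last Writer Wins, which by default compares the `_ts` system property. The plain-Python sketch below models that rule with documents as dictionaries. It is not the Cosmos DB SDK: the function names are hypothetical, and the tie-break (keep the local version) is an assumption of this sketch rather than documented behaviour.

```python
from typing import Any

Document = dict[str, Any]


def last_writer_wins(local: Document, remote: Document, path: str = "_ts") -> Document:
    """Keep the version with the larger value at `path`; ties keep `local` (an assumption)."""
    return remote if remote[path] > local[path] else local


def merge_replicas(a: dict[str, Document], b: dict[str, Document]) -> dict[str, Document]:
    """Merge two regional copies of a container, keyed by document id."""
    merged = dict(a)
    for doc_id, doc in b.items():
        merged[doc_id] = last_writer_wins(merged[doc_id], doc) if doc_id in merged else doc
    return merged


# Hypothetical catalogue items written in two regions.
region_a = {"sku-1": {"id": "sku-1", "price": 10, "_ts": 100},
            "sku-2": {"id": "sku-2", "price": 5, "_ts": 100}}
region_b = {"sku-1": {"id": "sku-1", "price": 12, "_ts": 105},
            "sku-3": {"id": "sku-3", "price": 7, "_ts": 90}}
```

Because the rule compares timestamps rather than arrival order, merging in either direction yields the same winner for `sku-1`.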

7. Azure Virtual Machine Scale Sets

Azure Virtual Machine Scale Sets let you deploy and manage a set of identical, auto-scaling VMs, so your applications can handle varying loads and remain available. Notable features include:

Automatic Scaling: Scale in or out based on demand to ensure optimal performance.
Fault Domains: Distribute VMs across multiple fault domains to reduce the impact of failures.
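Instance Health Monitoring: Detect unhealthy instances through the Application Health extension or a load balancer health probe and, with automatic instance repairs enabled, replace them.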
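Autoscale for a scale set is configured as a profile with minimum, maximum, and default instance counts plus metric-based rules. The function below imitates a simple pair of CPU rules. The thresholds, step size, and names are hypothetical. It also omits the cooldown period and metric time aggregation that real autoscale rules use.

```python
from dataclasses import dataclass


@dataclass
class AutoscaleProfile:
    minimum: int
    maximum: int
    scale_out_cpu: float = 70.0  # hypothetical: add instances above this average CPU %
    scale_in_cpu: float = 30.0   # hypothetical: remove instances below this average CPU %
    step: int = 1                # instances added or removed per decision


def next_capacity(current: int, avg_cpu: float, p: AutoscaleProfile) -> int:
    """Apply one scale-out/scale-in decision, clamped to the profile's bounds."""
    if avg_cpu > p.scale_out_cpu:
        target = current + p.step
    elif avg_cpu < p.scale_in_cpu:
        target = current - p.step
    else:
        target = current
    return max(p.minimum, min(p.maximum, target))
```

Setting the minimum to at least two keeps the workload on more than one VM even when load is low, so a single instance failure does not take the service down. The gap between the two thresholds avoids flapping between scale-out and scale-in.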

8. Azure Kubernetes Service (AKS)

Azure Kubernetes Service (AKS) provides managed Kubernetes clusters for running containerized applications. It supports high availability through features such as node auto-repair, multiple node pools, and automatic scaling. Key features include:

Node Auto-Repair: Automatically detect and repair unhealthy nodes.
Cluster Auto-Scaling: Automatically adjust the number of nodes in your cluster based on application demand, adding nodes when pods cannot be scheduled for lack of resources and removing underused nodes.
Multiple Node Pools: Run different workloads on different types of nodes for better resource management.

For example, a SaaS provider might use AKS to run its microservices on a highly available Kubernetes cluster. The cluster scales automatically with traffic and unhealthy nodes are repaired automatically, keeping service delivery uninterrupted.
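The AKS cluster autoscaler adds nodes when pods cannot be scheduled because no node has enough free resources. The sketch below is a heavily simplified model of that decision. It bin-packs the pending pods' CPU requests onto new, identically sized nodes using first-fit decreasing and caps the result at a maximum node count. It considers CPU only, and the function names and millicore figures are hypothetical. The real autoscaler also accounts for memory and other scheduling constraints.

```python
def nodes_needed(pending_cpu_millicores: list[int], node_allocatable_millicores: int) -> int:
    """First-fit decreasing: how many new nodes would fit all pending pods (CPU only)."""
    if any(r > node_allocatable_millicores for r in pending_cpu_millicores):
        raise ValueError("a pod requests more CPU than a single node can allocate")
    free: list[int] = []  # remaining CPU on each hypothetical new node
    for request in sorted(pending_cpu_millicores, reverse=True):
        for i, capacity in enumerate(free):
            if request <= capacity:
                free[i] -= request
                break
        else:
            # No existing new node has room: open another one.
            free.append(node_allocatable_millicores - request)
    return len(free)


def scale_up_target(current_nodes: int, pending_cpu_millicores: list[int],
                    node_allocatable_millicores: int, max_nodes: int) -> int:
    """Node count after scale-up, never exceeding the configured maximum."""
    extra = nodes_needed(pending_cpu_millicores, node_allocatable_millicores)
    return min(max_nodes, current_nodes + extra)
```

When the maximum caps the scale-up, some pods stay pending. Sizing `max_nodes` is therefore part of the availability design, not just a cost control.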

Strategies and Design Decisions for High Availability

Redundancy: Ensure that critical components are duplicated so that if one fails, the other can take over.
Fault Isolation: Isolate failures to prevent them from affecting the entire system. Use features like fault domains and Availability Zones.
Health Monitoring and Alerts: Implement robust monitoring to detect issues early and set up alerts for immediate response.
Automated Recovery: Automate failover and recovery processes to minimise downtime.
Regular Testing: Regularly test your HA setup, including failover mechanisms and disaster recovery plans, to ensure they work as expected.
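A common back-of-the-envelope way to reason about redundancy is composite availability. Components that are all required (in series) multiply their availabilities, while redundant copies (in parallel) cause an outage only when all of them fail. The sketch below assumes failures are independent, which correlated outages such as a shared dependency can violate. The figures used are hypothetical, not published Azure SLAs.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Components that are all required: the chain is up only if every one is up."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Redundant copies: the group is down only if every copy is down.
    Assumes failures are independent."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical figures, not published Azure SLAs.
web, database = 0.9995, 0.9999
one_region = serial(web, database)
two_regions = parallel(one_region, one_region)
print(f"one region: {one_region:.6%}, two regions: {two_regions:.6%}")
```

Chaining a 99.95% component with a 99.99% one gives about 99.94%, lower than either on its own. Duplicating that whole stack in two independent regions raises it to roughly 99.99996%. This is why redundancy and fault isolation usually go together.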

Achieving high availability in Azure involves leveraging a combination of services, strategies, and design decisions to ensure that your applications and services remain operational and accessible, even in the face of failures. By using Azure’s built-in tools like Availability Zones, Load Balancer, Traffic Manager, Application Gateway, and more, you can build robust, highly available solutions that meet your business needs.
