How to Autoscale Kubernetes


Nov 10, 2025 - 11:55

Autoscaling in Kubernetes is a foundational capability that enables applications to dynamically adjust their resource allocation based on real-time demand. In today's cloud-native environments, where traffic patterns are unpredictable and user expectations for performance are high, manually managing pod replicas or cluster nodes is neither scalable nor sustainable. Autoscaling ensures that your applications remain responsive during traffic spikes while minimizing infrastructure costs during periods of low usage. This tutorial provides a comprehensive, step-by-step guide to implementing autoscaling in Kubernetes, covering the core components, best practices, real-world examples, and essential tools to help you build resilient, cost-efficient systems.

By the end of this guide, you will understand how to configure and optimize three key autoscaling mechanisms: the Horizontal Pod Autoscaler (HPA), the Vertical Pod Autoscaler (VPA), and the Cluster Autoscaler (CA). You will learn how to integrate them effectively, avoid common pitfalls, and monitor their performance using industry-standard tools. Whether you're managing a small microservice deployment or a large-scale enterprise platform, mastering Kubernetes autoscaling is critical to achieving operational excellence.

Step-by-Step Guide

Understanding Kubernetes Autoscaling Components

Before diving into configuration, it's essential to understand the three primary autoscaling mechanisms in Kubernetes:

  • Horizontal Pod Autoscaler (HPA): Scales the number of pod replicas up or down based on observed CPU utilization or custom metrics.
  • Vertical Pod Autoscaler (VPA): Adjusts the CPU and memory requests and limits of individual pods to better match their actual usage.
  • Cluster Autoscaler (CA): Automatically adds or removes worker nodes from the cluster based on resource demand and scheduling constraints.

These components work together to provide end-to-end scalability: VPA ensures pods are sized correctly, HPA ensures enough replicas exist to handle load, and CA ensures the cluster has sufficient capacity to run those pods. They are not mutually exclusive; in fact, using them in combination yields the most efficient and resilient infrastructure.
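It helps to know the control loop HPA runs, because it explains why accurate resource requests matter so much later in this guide. The replica calculation, per the Kubernetes HPA documentation, is:

```text
desiredReplicas = ceil( currentReplicas * currentMetricValue / targetMetricValue )

Example: 4 replicas averaging 90% CPU utilization against a 70% target:
ceil(4 * 90 / 70) = ceil(5.14) = 6 replicas
```

Because utilization is measured relative to each pod's CPU request, a pod with no request (or a wildly wrong one) makes this ratio meaningless.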

Prerequisites

Before configuring autoscaling, ensure your environment meets the following requirements:

  • A running Kubernetes cluster (version 1.19 or higher recommended).
  • Metrics Server installed and operational. This is required for HPA to collect resource usage data.
  • Appropriate RBAC permissions to create HPA, VPA, and CA resources.
  • Cloud provider or on-premises infrastructure that supports dynamic node provisioning (e.g., AWS, GCP, Azure, or a supported on-prem solution like KubeVirt or vSphere).

To verify Metrics Server is running, execute:

kubectl get pods -n kube-system | grep metrics-server

If no output appears or the pod is in a CrashLoopBackOff state, install Metrics Server using:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Step 1: Configure Horizontal Pod Autoscaler (HPA)

HPA is the most commonly used autoscaling mechanism. It monitors resource usage (CPU and memory) or custom metrics (e.g., requests per second, queue length) and adjusts the number of pod replicas accordingly.

Let's walk through deploying a sample application and configuring HPA for it.

First, deploy a simple nginx deployment:

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
EOF

Expose the deployment as a service:

kubectl expose deployment nginx-deployment --type=ClusterIP --port=80

Now create an HPA that scales between 2 and 10 replicas, targeting 70% CPU utilization:

kubectl autoscale deployment nginx-deployment --cpu-percent=70 --min=2 --max=10

Alternatively, define the HPA using a YAML manifest for greater control:

cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
EOF

Verify the HPA status:

kubectl get hpa

Output will show current replicas, target CPU usage, and actual usage. To simulate load and trigger scaling, use a tool like ab (Apache Bench) or hey:

hey -z 5m -c 20 http://<service-ip>

Monitor scaling behavior in real time:

kubectl get hpa nginx-hpa --watch

Within seconds, you should observe the replica count increase as CPU usage exceeds the 70% threshold.

Step 2: Configure Vertical Pod Autoscaler (VPA)

VPA analyzes historical resource usage and recommends or automatically applies changes to pod resource requests and limits. Unlike HPA, it does not scale the number of pods; it scales the size of each pod.

Install VPA from the official kubernetes/autoscaler repository; the documented method runs its setup script:

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

Wait for the VPA pods to become ready:

kubectl get pods -n kube-system | grep vpa

Once installed, create a VPA object for your nginx deployment:

cat <<EOF | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: nginx-deployment
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: nginx
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 1000m
        memory: 1Gi
EOF

Key settings:

  • updateMode: "Auto": VPA will automatically restart pods with updated resource requests.
  • minAllowed and maxAllowed: define boundaries to prevent over- or under-provisioning.

Important: VPA cannot change the resources of a running pod in place; new requests are applied only when a pod is recreated. In Auto mode, the VPA updater evicts pods on its own schedule to apply them. To force an update immediately, delete the pods:

kubectl delete pods -l app=nginx

After restart, check the new resource requests:

kubectl get pods -o yaml | grep -A 5 -B 5 "resources"

VPA will adjust requests based on historical usage. For example, if nginx was using 150m CPU on average, VPA might reduce the request from 200m to 180m, freeing up cluster capacity.

Step 3: Configure Cluster Autoscaler (CA)

Cluster Autoscaler responds to unschedulable pods by adding nodes to the cluster, and removes idle nodes to reduce cost. Configuration varies by cloud provider.

For AWS EKS:

Install CA using the official Helm chart:

helm repo add autoscaler https://kubernetes.github.io/autoscaler

helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=your-eks-cluster-name \
  --set awsRegion=us-west-2 \
  --set rbac.create=true \
  --set image.tag=v1.28.0

Alternatively, deploy using YAML:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

Ensure the CA service account has the necessary IAM permissions to manage EC2 Auto Scaling Groups.

For GCP GKE:

Enable Cluster Autoscaler via the gcloud CLI:

gcloud container clusters update your-cluster-name --enable-autoscaling --min-nodes=1 --max-nodes=10 --zone=us-central1-a

For Azure AKS:

az aks update --resource-group your-resource-group --name your-aks-cluster --enable-cluster-autoscaler --min-count 1 --max-count 10

For on-premises clusters, use the Cluster API Provider or configure CA with a custom cloud provider.

Test CA by creating a deployment that requests more resources than available:

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: heavy-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: heavy-app
  template:
    metadata:
      labels:
        app: heavy-app
    spec:
      containers:
      - name: heavy-app
        image: busybox
        command: ["sleep", "3600"]
        resources:
          requests:
            cpu: "4"
            memory: "8Gi"
EOF

If your cluster has no nodes with sufficient capacity, CA will provision a new node within a few minutes (typically 1–5, depending on the provider). Monitor node creation:

kubectl get nodes --watch

Once the pod is scheduled, you can simulate reduced load and verify node removal by deleting the deployment and waiting for idle node eviction.

Step 4: Integrate HPA with Custom Metrics

While CPU and memory are useful, many applications require scaling based on business metrics such as HTTP requests per second, message queue depth, or database query latency.

To enable custom metrics, install Prometheus and the Prometheus Adapter:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

helm install prometheus prometheus-community/kube-prometheus-stack

helm install prometheus-adapter prometheus-community/prometheus-adapter

Deploy a sample application that exposes custom metrics. The toy service below serves a static http_requests_total counter in Prometheus text format:

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-metric-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: custom-metric-app
  template:
    metadata:
      labels:
        app: custom-metric-app
    spec:
      containers:
      - name: app
        image: quay.io/prometheus/busybox:latest
        command: ["/bin/sh", "-c", "while true; do echo 'http_requests_total{job=\"app\"} 100' | nc -l -p 9090; sleep 10; done"]
        ports:
        - containerPort: 9090
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
---
apiVersion: v1
kind: Service
metadata:
  name: custom-metric-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
spec:
  selector:
    app: custom-metric-app
  ports:
  - protocol: TCP
    port: 9090
    targetPort: 9090
EOF

Now create an HPA that scales based on http_requests_total:

cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: custom-metric-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: custom-metric-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_total
      target:
        type: AverageValue
        averageValue: "100"
EOF

Verify custom metrics are available:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_total" | jq

Once confirmed, HPA will scale based on real business traffic rather than just infrastructure metrics.
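For the custom metrics API to serve http_requests_total at all, the Prometheus Adapter needs a discovery rule mapping the Prometheus series to pods. Below is a minimal sketch of such a rule (passed to the chart via its rules configuration; the label names assume the sample Service above, and the 2m rate window is an arbitrary choice):

```yaml
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)$"
    as: "http_requests_total"   # keep the raw name so the HPA above matches
  metricsQuery: sum(rate(<<.Series>>[2m])) by (<<.GroupBy>>)
```

Note that because metricsQuery wraps the counter in rate(), the value HPA compares against its target is requests per second, not the raw counter total.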

Best Practices

Set Realistic Resource Requests and Limits

Always define explicit requests and limits for CPU and memory in your deployments. Without them, HPA and VPA cannot function effectively, and the scheduler may place pods on overcommitted nodes.

Use historical data or load testing to determine baseline values. Avoid setting limits too high, as this wastes resources; avoid setting them too low, as this causes throttling and degraded performance.

Use HPA with Multiple Metrics

Instead of relying on CPU alone, combine multiple metrics for more intelligent scaling. For example:

  • Scale based on CPU + memory usage.
  • Scale based on HTTP request rate + error rate.
  • Scale based on queue depth + consumer latency.

Example:

metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70
- type: Pods
  pods:
    metric:
      name: http_requests_total
    target:
      type: AverageValue
      averageValue: "100"

When multiple metrics are defined, HPA computes a desired replica count for each metric and scales to the highest of them. Use type: Object or type: External for metrics not tied to pods (e.g., cloud queue depth).

Enable VPA in Recommendation Mode First

Before enabling updateMode: Auto, set updateMode: Off and monitor VPA recommendations for several days:

kubectl get vpa nginx-vpa -o yaml

Check the status.recommendation field to see suggested CPU/memory values. Only enable auto-updates once you're confident the recommendations are accurate and safe.
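A recommendation-only VPA differs from the Auto example only in its update policy; a sketch, reusing the nginx-vpa object from Step 2:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: nginx-deployment
  updatePolicy:
    updateMode: "Off"   # recommend only; never evict or modify pods
```

With this in place, VPA populates status.recommendation continuously without ever touching the workload.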

Configure Pod Disruption Budgets (PDBs)

When VPA or CA evicts pods, ensure your critical services remain available by defining PDBs:

cat <<EOF | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: nginx
EOF

This ensures at least one nginx pod remains running during maintenance or scaling events.

Set Appropriate Scaling Cooldown Periods

By default, HPA applies a five-minute stabilization window to scale-down decisions and scales up without an enforced delay. Adjust these via the behavior field based on your application's behavior:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 2
  maxReplicas: 10
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60

This prevents rapid flapping during transient traffic spikes.

Monitor and Alert on Autoscaling Events

Use monitoring tools like Prometheus, Grafana, or cloud-native observability platforms to track:

  • Number of replicas over time.
  • Node count and utilization.
  • HPA conditions (e.g., FailedGetResourceMetric).
  • CA events (e.g., ScaleUp, ScaleDown).

Set alerts for:

  • HPA reaching max replicas.
  • CA unable to add nodes due to quota limits.
  • VPA recommending resource increases beyond 200% of current.
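With the kube-prometheus-stack installed, the first of these alerts can be expressed as a PrometheusRule. A sketch using kube-state-metrics series (verify the exact metric names against your kube-state-metrics version):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: autoscaling-alerts
spec:
  groups:
  - name: autoscaling
    rules:
    - alert: HPAAtMaxReplicas
      # Fires when an HPA has been pinned at its ceiling, meaning it can
      # no longer absorb additional load by adding replicas.
      expr: kube_horizontalpodautoscaler_status_current_replicas >= kube_horizontalpodautoscaler_spec_max_replicas
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "HPA {{ $labels.horizontalpodautoscaler }} has been at max replicas for 15 minutes"
```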

Avoid Overlapping Autoscaling Policies

Do not use HPA and VPA on the same deployment if VPA is in Auto mode, as this can cause conflicts. Instead, use VPA for sizing and HPA for replica count. Alternatively, use VPA in Off mode and manage requests manually.

Test Scaling Under Realistic Load

Use tools like Locust, k6, or JMeter to simulate production traffic patterns. Test:

  • How quickly HPA responds to traffic spikes.
  • Whether CA provisions nodes fast enough to prevent scheduling failures.
  • Whether VPA recommendations stabilize after sustained load.

Document results and adjust thresholds accordingly.

Tools and Resources

Core Kubernetes Tools

  • Metrics Server: Collects resource usage data for HPA and VPA.
  • Horizontal Pod Autoscaler (HPA): Built into Kubernetes; scales pod replicas.
  • Vertical Pod Autoscaler (VPA): Official Kubernetes project; adjusts pod resource requests.
  • Cluster Autoscaler (CA): Official project; manages node pools across cloud providers.

Monitoring and Observability

  • Prometheus + Grafana: Collect and visualize custom and resource metrics.
  • Prometheus Adapter: Exposes custom metrics to HPA.
  • Kube-State-Metrics: Provides metrics about Kubernetes objects (e.g., number of pending pods).
  • CloudWatch (AWS), Stackdriver (GCP), Azure Monitor: Native cloud observability tools.

Load Testing Tools

  • Hey: Lightweight HTTP load generator.
  • k6: Scriptable load testing with Prometheus integration.
  • Locust: Python-based distributed load testing.
  • Apache Bench (ab): Simple command-line HTTP benchmarking tool.

Real Examples

Example 1: E-Commerce Platform on AWS EKS

An e-commerce site experiences traffic surges during Black Friday sales. The team configured:

  • HPA on the product catalog service to scale based on HTTP request rate (target: 50 req/s per pod).
  • VPA in recommendation mode for the checkout service to optimize memory usage (reduced from 2Gi to 1.2Gi).
  • Cluster Autoscaler with min=5 and max=50 nodes in the worker group.
  • Prometheus Adapter to scale based on Redis queue depth (if orders backlog > 100, scale checkout pods).

Result: During peak traffic, the system scaled from 8 to 42 pods and added 18 nodes within 4 minutes. No timeouts occurred, and infrastructure costs remained 35% lower than static provisioning.

Example 2: Real-Time Analytics Pipeline on GKE

A data pipeline ingests streaming logs and processes them using 10 microservices. Each service has different resource profiles.

  • HPA on ingestion pods based on incoming data rate (from Pub/Sub).
  • VPA on processing pods with updateMode: Recreate to avoid data loss during restarts.
  • Cluster Autoscaler configured to use preemptible VMs for cost savings, with a 10-minute node retention policy.

Result: Processing latency dropped from 120s to 15s during peak loads. Monthly infrastructure costs decreased by 48% due to dynamic node sizing and preemptible instance usage.

Example 3: On-Premises AI Inference Cluster

A financial services firm runs AI models on-premises using Kubernetes. Nodes are high-memory, high-CPU machines.

  • HPA on inference pods based on GPU utilization (via NVIDIA Device Plugin and Prometheus).
  • Custom metrics exporter to track model throughput (inferences per second).
  • Cluster Autoscaler integrated with VMware vSphere to provision new VMs when GPU capacity is exhausted.

Result: GPU utilization increased from 40% to 85% on average. Model response time remained under 200ms even during 3x traffic spikes.

FAQs

Can I use HPA and VPA together on the same deployment?

Yes, but with caution. VPA modifies pod resource requests, which can trigger HPA to scale if CPU/memory usage changes. Use VPA in recommendation mode first, then apply changes manually. Avoid using VPA in Auto mode unless you've thoroughly tested the interaction.

Why is my HPA not scaling?

Common causes:

  • Metrics Server is not installed or not running.
  • Pods lack resource requests (HPA requires them).
  • Target metric is unreachable (e.g., custom metric not exposed).
  • Scaling is blocked by PDB or insufficient cluster capacity.

Check HPA status with kubectl describe hpa <name> to see conditions and events.

How long does Cluster Autoscaler take to add a node?

Typically 1–5 minutes, depending on cloud provider provisioning speed. AWS EC2 instances usually boot in ~2–3 minutes; GCP and Azure are similar. Ensure your node templates have sufficient quotas and IAM permissions.

Does autoscaling work with StatefulSets?

Yes. HPA supports StatefulSets. VPA and CA work with any workload type. However, VPA restarts pods, which may disrupt stateful applications; use it with care and test thoroughly.

Can I autoscale based on external events like weather or stock prices?

Yes. Use the External metric type in HPA. For example, a custom adapter can expose a metric like stock_price_volatility from an external API. HPA will scale based on that value.
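The metrics stanza for an External metric looks like this; the metric name and selector below are hypothetical and would be supplied by whichever external metrics adapter you run:

```yaml
metrics:
- type: External
  external:
    metric:
      name: queue_messages_ready   # hypothetical metric exposed by an external adapter
      selector:
        matchLabels:
          queue: "orders"
    target:
      type: AverageValue           # divides the external value across current replicas
      averageValue: "30"
```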

Is autoscaling expensive?

No; it's a cost optimization. By scaling down during low traffic and avoiding over-provisioning, most organizations reduce infrastructure costs by 30–60%. The overhead of running Metrics Server or CA is negligible.

What happens if I scale too aggressively?

Overly aggressive scaling can cause:

  • Pod churn and instability.
  • Increased cold starts for containerized apps.
  • Node thrashing if CA adds/removes nodes too frequently.

Use stabilization windows and conservative thresholds to avoid this.

Conclusion

Autoscaling Kubernetes is not a one-time configuration; it's an ongoing discipline that requires monitoring, testing, and refinement. By combining Horizontal Pod Autoscaler, Vertical Pod Autoscaler, and Cluster Autoscaler, you create a self-optimizing system that responds intelligently to real-world demand. The key to success lies in understanding your application's behavior, defining clear performance targets, and using the right metrics to drive decisions.

Start small: deploy HPA with CPU-based scaling, monitor its behavior, then layer in VPA and CA. Use custom metrics to align scaling with business outcomes. Always test under load, document your thresholds, and set alerts for failures.

When implemented correctly, autoscaling transforms Kubernetes from a static orchestration platform into a dynamic, cost-efficient, and highly resilient system. It empowers teams to focus on innovation rather than infrastructure management, delivering better user experiences while optimizing operational expenses. In today's fast-paced digital landscape, mastering autoscaling isn't optional; it's essential.