How to Autoscale Kubernetes
Autoscaling in Kubernetes is a foundational capability that enables applications to dynamically adjust their resource allocation based on real-time demand. In today's cloud-native environments, where traffic patterns are unpredictable and user expectations for performance are high, manually managing pod replicas or cluster nodes is neither scalable nor sustainable. Autoscaling ensures that your applications remain responsive during traffic spikes while minimizing infrastructure costs during periods of low usage. This tutorial provides a comprehensive, step-by-step guide to implementing autoscaling in Kubernetes, covering the core components, best practices, real-world examples, and essential tools to help you build resilient, cost-efficient systems.
By the end of this guide, you will understand how to configure and optimize three key autoscaling mechanisms: the Horizontal Pod Autoscaler (HPA), the Vertical Pod Autoscaler (VPA), and the Cluster Autoscaler (CA). You will learn how to integrate them effectively, avoid common pitfalls, and monitor their performance using industry-standard tools. Whether you're managing a small microservice deployment or a large-scale enterprise platform, mastering Kubernetes autoscaling is critical to achieving operational excellence.
Step-by-Step Guide
Understanding Kubernetes Autoscaling Components
Before diving into configuration, it's essential to understand the three primary autoscaling mechanisms in Kubernetes:
- Horizontal Pod Autoscaler (HPA): Scales the number of pod replicas up or down based on observed CPU utilization or custom metrics.
- Vertical Pod Autoscaler (VPA): Adjusts the CPU and memory requests and limits of individual pods to better match their actual usage.
- Cluster Autoscaler (CA): Automatically adds or removes worker nodes from the cluster based on resource demand and scheduling constraints.
These components work together to provide end-to-end scalability: VPA ensures pods are sized correctly, HPA ensures enough replicas exist to handle load, and CA ensures the cluster has sufficient capacity to run those pods. They are not mutually exclusive; in fact, using them in combination yields the most efficient and resilient infrastructure.
Prerequisites
Before configuring autoscaling, ensure your environment meets the following requirements:
- A running Kubernetes cluster (version 1.19 or higher recommended).
- Metrics Server installed and operational. This is required for HPA to collect resource usage data.
- Appropriate RBAC permissions to create HPA, VPA, and CA resources.
- Cloud provider or on-premises infrastructure that supports dynamic node provisioning (e.g., AWS, GCP, Azure, or a supported on-prem solution like KubeVirt or vSphere).
To verify Metrics Server is running, execute:
kubectl get pods -n kube-system | grep metrics-server
If no output appears or the pod is in a CrashLoopBackOff state, install Metrics Server using:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Step 1: Configure Horizontal Pod Autoscaler (HPA)
HPA is the most commonly used autoscaling mechanism. It monitors resource usage (CPU and memory) or custom metrics (e.g., requests per second, queue length) and adjusts the number of pod replicas accordingly.
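The arithmetic behind HPA is documented in the Kubernetes docs: desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). A minimal Python sketch of that formula makes the scaling behavior easy to reason about before you configure anything:

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Core HPA arithmetic: scale the replica count proportionally to how
    far the observed metric is from its target, rounding up."""
    return math.ceil(current_replicas * current_metric / target_metric)

# 2 replicas averaging 90% CPU against a 70% target -> scale up to 3
print(desired_replicas(2, 90, 70))
# 10 replicas averaging 35% CPU against a 70% target -> scale down to 5
print(desired_replicas(10, 35, 70))
```

Rounding up is deliberate: HPA prefers slight over-provisioning to dropping below the target.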
Lets walk through deploying a sample application and configuring HPA for it.
First, deploy a simple nginx deployment:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
EOF
Expose the deployment as a service:
kubectl expose deployment nginx-deployment --type=ClusterIP --port=80
Now create an HPA that scales between 2 and 10 replicas, targeting 70% CPU utilization:
kubectl autoscale deployment nginx-deployment --cpu-percent=70 --min=2 --max=10
Alternatively, define the HPA using a YAML manifest for greater control:
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
EOF
Verify the HPA status:
kubectl get hpa
Output will show current replicas, target CPU usage, and actual usage. To simulate load and trigger scaling, use a tool like ab (Apache Bench) or hey:
hey -z 5m -c 20 http://<service-ip>
Monitor scaling behavior in real time:
kubectl get hpa nginx-hpa --watch
Within a minute or two, you should observe the replica count increase as CPU usage exceeds the 70% threshold.
Step 2: Configure Vertical Pod Autoscaler (VPA)
VPA analyzes historical resource usage and recommends or automatically applies changes to pod resource requests and limits. Unlike HPA, it does not scale the number of pods; it scales the size of each pod.
Install VPA from the official kubernetes/autoscaler repository. There is no single release manifest; clone the repository and run the provided install script:
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
Wait for the VPA pods to become ready:
kubectl get pods -n kube-system | grep vpa
Once installed, create a VPA object for your nginx deployment:
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: nginx-deployment
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: nginx
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 1000m
        memory: 1Gi
EOF
Key settings:
- updateMode: "Auto": VPA will automatically restart pods with updated resource requests.
- minAllowed and maxAllowed: define boundaries to prevent over- or under-provisioning.
Important: VPA does not resize running pods in place. In Auto mode, its updater evicts pods so they restart with the updated requests; in other modes, changes apply at the next pod restart (e.g., during a deployment rollout or node maintenance). To force an update, delete the pods:
kubectl delete pods -l app=nginx
After restart, check the new resource requests:
kubectl get pods -o yaml | grep -A 5 -B 5 "resources"
VPA will adjust requests based on historical usage. For example, if nginx was using 150m CPU on average, VPA might reduce the request from 200m to 180m, freeing up cluster capacity.
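VPA's real recommender builds a decaying histogram of observed usage and targets a high percentile of it. The following is a deliberately simplified Python sketch of that intuition (plain percentile plus a safety margin; the percentile and margin values are illustrative, not VPA's actual constants):

```python
def recommend_cpu(samples_millicores, percentile=0.9, margin=1.15):
    """Toy recommender: take a high percentile of observed CPU usage and
    add a safety margin. VPA's real algorithm uses a decaying histogram
    weighted toward recent samples, but the intuition is the same."""
    ordered = sorted(samples_millicores)
    idx = min(int(len(ordered) * percentile), len(ordered) - 1)
    return round(ordered[idx] * margin)

# A pod requesting 200m but observed using ~150m gets a leaner recommendation
usage = [140, 145, 150, 150, 155, 160, 150, 148, 152, 151]
print(recommend_cpu(usage))
```

The takeaway: recommendations track what the workload actually uses, not what you guessed at deploy time, which is why sustained observation matters before trusting them.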
Step 3: Configure Cluster Autoscaler (CA)
Cluster Autoscaler responds to unschedulable pods by adding nodes to the cluster, and removes idle nodes to reduce cost. Configuration varies by cloud provider.
For AWS EKS:
Install CA using the official Helm chart from the kubernetes/autoscaler repository:
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
--namespace kube-system \
--set autoDiscovery.clusterName=your-eks-cluster-name \
--set awsRegion=us-west-2 \
--set rbac.create=true \
--set image.tag=v1.28.0
Alternatively, deploy using YAML:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
Ensure the CA service account has the necessary IAM permissions to manage EC2 Auto Scaling Groups.
For GCP GKE:
Enable Cluster Autoscaler via the gcloud CLI:
gcloud container clusters update your-cluster-name --enable-autoscaling --min-nodes=1 --max-nodes=10 --zone=us-central1-a
For Azure AKS:
az aks update --resource-group your-resource-group --name your-aks-cluster --enable-cluster-autoscaler --min-count 1 --max-count 10
For on-premises clusters, use the Cluster API Provider or configure CA with a custom cloud provider.
Test CA by creating a deployment that requests more resources than available:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: heavy-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: heavy-app
  template:
    metadata:
      labels:
        app: heavy-app
    spec:
      containers:
      - name: heavy-app
        image: busybox
        command: ["sleep", "3600"]
        resources:
          requests:
            cpu: "4"
            memory: "8Gi"
EOF
If your cluster has no nodes with sufficient capacity, CA will provision a new node, typically within 1 to 5 minutes. Monitor node creation:
kubectl get nodes --watch
Once the pod is scheduled, you can simulate reduced load and verify node removal by deleting the deployment and waiting for idle node eviction.
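Conceptually, CA watches for pods stuck in Pending, simulates whether they would fit on any existing node, and, if not, checks whether a fresh node built from a node group's template would accommodate them. A simplified Python sketch of that decision (real CA also simulates full scheduler predicates like affinity and taints, which are omitted here):

```python
def needs_scale_up(pending_pods, nodes_free, node_template):
    """Toy version of Cluster Autoscaler's decision: if any pending pod
    fits no existing node's free capacity but WOULD fit a brand-new node
    from the node group's template, request a scale-up."""
    def fits(pod, capacity):
        return pod["cpu"] <= capacity["cpu"] and pod["mem"] <= capacity["mem"]

    for pod in pending_pods:
        if not any(fits(pod, n) for n in nodes_free) and fits(pod, node_template):
            return True
    return False

nodes_free = [{"cpu": 0.5, "mem": 1.0}]   # leftover capacity on current nodes
template = {"cpu": 4.0, "mem": 8.0}       # capacity of a fresh node from the group
pending = [{"cpu": 4.0, "mem": 8.0}]      # the heavy-app pod above
print(needs_scale_up(pending, nodes_free, template))  # True -> provision a node
```

Note the second condition: if the pod would not fit even a new node (e.g., it requests more CPU than the instance type offers), CA will not scale up, and the pod stays Pending until you fix the request or the node group.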
Step 4: Integrate HPA with Custom Metrics
While CPU and memory are useful, many applications require scaling based on business metrics such as HTTP requests per second, message queue depth, or database query latency.
To enable custom metrics, install Prometheus and the Prometheus Adapter:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack
helm install prometheus-adapter prometheus-community/prometheus-adapter
Deploy a sample application that exposes custom metrics. For example, a Go service exposing http_requests_total via Prometheus:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-metric-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: custom-metric-app
  template:
    metadata:
      labels:
        app: custom-metric-app
    spec:
      containers:
      - name: app
        image: quay.io/prometheus/busybox:latest
        command: ["/bin/sh", "-c", "while true; do echo 'http_requests_total{job=\"app\"} 100' | nc -l -p 9090; sleep 10; done"]
        ports:
        - containerPort: 9090
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
---
apiVersion: v1
kind: Service
metadata:
  name: custom-metric-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
spec:
  selector:
    app: custom-metric-app
  ports:
  - protocol: TCP
    port: 9090
    targetPort: 9090
EOF
Now create an HPA that scales based on http_requests_total:
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: custom-metric-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: custom-metric-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_total
      target:
        type: AverageValue
        averageValue: "100"
EOF
Verify custom metrics are available:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_total" | jq
Once confirmed, HPA will scale based on real business traffic rather than just infrastructure metrics.
Best Practices
Set Realistic Resource Requests and Limits
Always define explicit requests and limits for CPU and memory in your deployments. Without them, HPA and VPA cannot function effectively, and the scheduler may place pods on overcommitted nodes.
Use historical data or load testing to determine baseline values. Avoid setting limits too high, which wastes resources, or too low, which causes throttling and degraded performance.
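One practical way to turn load-test data into baseline values is to set the request near a high percentile of observed usage and derive the limit as a multiple of the request. The percentile and multiplier below are illustrative starting points, not universal constants:

```python
def size_container(cpu_samples_millicores, percentile=0.9, limit_factor=2.0):
    """Derive a CPU request from load-test samples (a high percentile of
    observed usage) and a limit as a multiple of that request. Tune both
    knobs per workload; these defaults are only a starting point."""
    ordered = sorted(cpu_samples_millicores)
    idx = min(int(len(ordered) * percentile), len(ordered) - 1)
    request = ordered[idx]
    return {"request_m": request, "limit_m": int(request * limit_factor)}

# CPU samples (millicores) collected during a representative load test
samples = [120, 180, 150, 160, 140, 170, 130, 155, 165, 145]
print(size_container(samples))
```

Rerun the sizing whenever traffic patterns shift; stale baselines are a common cause of throttling or waste.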
Use HPA with Multiple Metrics
Instead of relying on CPU alone, combine multiple metrics for more intelligent scaling. For example:
- Scale based on CPU + memory usage.
- Scale based on HTTP request rate + error rate.
- Scale based on queue depth + consumer latency.
Example:
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70
- type: Pods
  pods:
    metric:
      name: http_requests_total
    target:
      type: AverageValue
      averageValue: "100"
When multiple metrics are defined, HPA computes a desired replica count for each metric and uses the largest, so the most demanding metric drives scaling. Use type: Object or type: External for metrics not tied to pods (e.g., cloud queue depth).
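This max-over-metrics behavior is documented HPA semantics and is worth internalizing, because it means one noisy metric can keep a workload scaled up. A small Python sketch:

```python
import math

def desired_from_metrics(current_replicas, metrics):
    """With several metrics, HPA computes a replica proposal per metric
    (observed/target pairs) and takes the maximum, so the most demanding
    signal wins."""
    proposals = [
        math.ceil(current_replicas * observed / target)
        for observed, target in metrics
    ]
    return max(proposals)

# CPU at 50% of a 70% target alone would allow 3 replicas, but request
# rate at 150 against a target of 100 demands 5, so HPA picks 5.
print(desired_from_metrics(3, [(50, 70), (150, 100)]))
```

A practical consequence: scale-down only happens when every configured metric agrees the workload is over-provisioned.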
Enable VPA in Recommendation Mode First
Before enabling updateMode: Auto, set updateMode: Off and monitor VPA recommendations for several days:
kubectl get vpa nginx-vpa -o yaml
Check the status.recommendation field to see suggested CPU/memory values. Only enable auto-updates once you're confident the recommendations are accurate and safe.
Configure Pod Disruption Budgets (PDBs)
When VPA or CA evicts pods, ensure your critical services remain available by defining PDBs:
cat <<EOF | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: nginx
EOF
This ensures at least one nginx pod remains running during maintenance or scaling events.
Set Appropriate Scaling Cooldown Periods
By default, HPA scales up without delay but applies a 5-minute stabilization window before scaling down. Adjust these via the behavior field based on your application's traffic patterns:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 2
  maxReplicas: 10
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
This prevents rapid flapping during transient traffic spikes.
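The stabilization window works by remembering recent desired-replica recommendations and acting on the highest one still inside the window, so a momentary dip cannot trigger an immediate scale-down. A simplified Python sketch of that bookkeeping (the real controller tracks timestamps per recommendation, as here, but with more machinery):

```python
def stabilized_recommendation(samples, now, window_seconds=300):
    """Scale-down stabilization: among desired-replica recommendations
    recorded within the last window_seconds, act on the HIGHEST, so a
    brief dip in load cannot trigger an immediate scale-down."""
    in_window = [replicas for ts, replicas in samples if now - ts <= window_seconds]
    return max(in_window)

# (timestamp, desired replicas) over the last few minutes: the dip to 4
# is overridden because a sample from 300s ago still wanted 8 replicas.
history = [(100, 8), (200, 7), (350, 4), (380, 5)]
print(stabilized_recommendation(history, now=400))  # 8 -> no scale-down yet
```

Lengthening the window trades slower cost recovery for smoother replica counts; shortening it does the reverse.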
Monitor and Alert on Autoscaling Events
Use monitoring tools like Prometheus, Grafana, or cloud-native observability platforms to track:
- Number of replicas over time.
- Node count and utilization.
- HPA conditions (e.g., FailedGetResourceMetric).
- CA events (e.g., ScaleUp, ScaleDown).
Set alerts for:
- HPA reaching max replicas.
- CA unable to add nodes due to quota limits.
- VPA recommending resource increases beyond 200% of current.
Avoid Overlapping Autoscaling Policies
Do not use HPA and VPA on the same deployment if VPA is in Auto mode; the two controllers can fight each other. Instead, use VPA for sizing and HPA for replica count. Alternatively, use VPA in Off mode and manage requests manually.
Test Scaling Under Realistic Load
Use tools like Locust, k6, or JMeter to simulate production traffic patterns. Test:
- How quickly HPA responds to traffic spikes.
- Whether CA provisions nodes fast enough to prevent scheduling failures.
- Whether VPA recommendations stabilize after sustained load.
Document results and adjust thresholds accordingly.
Tools and Resources
Core Kubernetes Tools
- Metrics Server: Collects resource usage data for HPA and VPA.
- Horizontal Pod Autoscaler (HPA): Built into Kubernetes; scales pod replicas.
- Vertical Pod Autoscaler (VPA): Official Kubernetes project; adjusts pod resource requests.
- Cluster Autoscaler (CA): Official project; manages node pools across cloud providers.
Monitoring and Observability
- Prometheus + Grafana: Collect and visualize custom and resource metrics.
- Prometheus Adapter: Exposes custom metrics to HPA.
- Kube-State-Metrics: Provides metrics about Kubernetes objects (e.g., number of pending pods).
- CloudWatch (AWS), Stackdriver (GCP), Azure Monitor: Native cloud observability tools.
Load Testing Tools
- Hey: Lightweight HTTP load generator.
- k6: Scriptable load testing with Prometheus integration.
- Locust: Python-based distributed load testing.
- Apache Bench (ab): Simple command-line HTTP benchmarking tool.
Documentation and Community
- Kubernetes HPA Documentation
- VPA GitHub Repository
- CA GitHub Repository
- Prometheus Documentation
- Kubernetes Slack Community Channels:
#sig-autoscaling, #kubernetes-users
Real Examples
Example 1: E-Commerce Platform on AWS EKS
An e-commerce site experiences traffic surges during Black Friday sales. The team configured:
- HPA on the product catalog service to scale based on HTTP request rate (target: 50 req/s per pod).
- VPA in recommendation mode for the checkout service to optimize memory usage (reduced from 2Gi to 1.2Gi).
- Cluster Autoscaler with min=5 and max=50 nodes in the worker group.
- Prometheus Adapter to scale based on Redis queue depth (if orders backlog > 100, scale checkout pods).
Result: During peak traffic, the system scaled from 8 to 42 pods and added 18 nodes within 4 minutes. No timeouts occurred, and infrastructure costs remained 35% lower than static provisioning.
Example 2: Real-Time Analytics Pipeline on GKE
A data pipeline ingests streaming logs and processes them using 10 microservices. Each service has different resource profiles.
- HPA on ingestion pods based on incoming data rate (from Pub/Sub).
- VPA on processing pods with updateMode: "Recreate" to avoid data loss during restarts.
- Cluster Autoscaler configured to use preemptible VMs for cost savings, with a 10-minute node retention policy.
Result: Processing latency dropped from 120s to 15s during peak loads. Monthly infrastructure costs decreased by 48% due to dynamic node sizing and preemptible instance usage.
Example 3: On-Premises AI Inference Cluster
A financial services firm runs AI models on-premises using Kubernetes. Nodes are high-memory, high-CPU machines.
- HPA on inference pods based on GPU utilization (via NVIDIA Device Plugin and Prometheus).
- Custom metrics exporter to track model throughput (inferences per second).
- Cluster Autoscaler integrated with VMware vSphere to provision new VMs when GPU capacity is exhausted.
Result: GPU utilization increased from 40% to 85% on average. Model response time remained under 200ms even during 3x traffic spikes.
FAQs
Can I use HPA and VPA together on the same deployment?
Yes, but with caution. VPA modifies pod resource requests, which can trigger HPA to scale if CPU/memory usage changes. Use VPA in recommendation mode first, then apply changes manually. Avoid using VPA in Auto mode unless you've thoroughly tested the interaction.
Why is my HPA not scaling?
Common causes:
- Metrics Server is not installed or not running.
- Pods lack resource requests (HPA requires them).
- Target metric is unreachable (e.g., custom metric not exposed).
- Scaling is blocked by PDB or insufficient cluster capacity.
Check HPA status with kubectl describe hpa <name> to see conditions and events.
How long does Cluster Autoscaler take to add a node?
Typically 1 to 5 minutes, depending on cloud provider provisioning speed. AWS EC2 takes roughly 2 to 3 minutes; GCP and Azure are similar. Ensure your node templates have sufficient quotas and IAM permissions.
Does autoscaling work with StatefulSets?
Yes. HPA supports StatefulSets. VPA and CA work with any workload type. However, VPA restarts pods, which may disrupt stateful applications; use with care and test thoroughly.
Can I autoscale based on external events like weather or stock prices?
Yes. Use the External metric type in HPA. For example, a custom adapter can expose a metric like stock_price_volatility from an external API. HPA will scale based on that value.
Is autoscaling expensive?
No; done well, it is cost-optimizing. By scaling down during low traffic and avoiding over-provisioning, most organizations reduce infrastructure costs by 30 to 60%. The overhead of running Metrics Server or CA is negligible.
What happens if I scale too aggressively?
Overly aggressive scaling can cause:
- Pod churn and instability.
- Increased cold starts for containerized apps.
- Node thrashing if CA adds/removes nodes too frequently.
Use stabilization windows and conservative thresholds to avoid this.
Conclusion
Autoscaling Kubernetes is not a one-time configuration; it is an ongoing discipline that requires monitoring, testing, and refinement. By combining Horizontal Pod Autoscaler, Vertical Pod Autoscaler, and Cluster Autoscaler, you create a self-optimizing system that responds intelligently to real-world demand. The key to success lies in understanding your application's behavior, defining clear performance targets, and using the right metrics to drive decisions.
Start small: deploy HPA with CPU-based scaling, monitor its behavior, then layer in VPA and CA. Use custom metrics to align scaling with business outcomes. Always test under load, document your thresholds, and set alerts for failures.
When implemented correctly, autoscaling transforms Kubernetes from a static orchestration platform into a dynamic, cost-efficient, and highly resilient system. It empowers teams to focus on innovation rather than infrastructure management, delivering better user experiences while optimizing operational expenses. In today's fast-paced digital landscape, mastering autoscaling isn't optional; it's essential.