Auto-scaling is the heartbeat of modern microservices. When it’s tuned well, your cluster feels alive: pods materialize as traffic surges, costs melt away when demand drops, and latency hugs your SLOs. When it’s tuned poorly, you get the opposite—thrash, cold starts, timeouts, and a creeping sense that the cluster is secretly your boss.
This post is a field guide to three pillars of Kubernetes scaling:
- HPA (Horizontal Pod Autoscaler) — the built-in, resource-driven workhorse.
- KEDA — event-driven scaling powered by external metrics and queue backlogs.
- Knative — request-driven autoscaling for HTTP/GRPC with elegant scale-to-zero.
We’ll compare how they think about “load,” show realistic YAML you can paste into a cluster, and walk through hybrid patterns for latency-sensitive services. By the end, you’ll know when to pick each—and how to combine them without creating a hydra of competing HPAs.
The Real Problem: What Are We Scaling On?
Before picking a tool, decide what signal represents “work” for your service. Different autoscalers optimize for different signals:
- CPU/Memory — great when work is CPU-bound and steady. HPA’s home turf.
- Backlog/Events — ideal when work is discrete (messages, jobs) and bursty. KEDA shines here.
- Concurrent Requests / RPS / Latency — for interactive HTTP/GRPC where tail latency matters. Knative’s bread and butter.
Three more ideas frame our discussion:
- Reaction time vs. stability: faster scaling reacts to spikes but risks oscillations. Slower scaling is stable but can blow SLOs.
- Scale-to-zero: awesome for cost, tricky for latency due to cold starts.
- End-to-end capacity: pod autoscaling is only half the story—cluster autoscaler, rate limits, and downstreams must keep up.
We’ll keep these trade-offs in mind as we dive into each system.
Kubernetes HPA: The Default, for Good Reasons
Mental model: HPA watches metrics (CPU by default, plus custom metrics if configured) and adjusts replica counts to hit targets. It’s like a thermostat: “keep CPU ~70%.”
How HPA Decides Replica Counts
For a resource metric (say, CPU utilization), the desired replica count is roughly:
desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric))
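For example, if 4 replicas are running at 90% CPU against a 70% target, the desired count is ceil(4 × 90/70) = ceil(5.14) = 6.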
HPA v2 adds multiple metrics, scale-up/down behaviors, and stabilization windows to reduce thrash.
When HPA Works Great
- Services where CPU tracks actual work (e.g., CPU-bound compute, data transforms).
- Long-lived pods where scale-up shouldn’t be hyper-reactive.
- Setups where you want no external dependencies (no Prometheus adapter? no problem).
HPA (v2) Example: CPU + Request Latency (via External Metric)
Suppose you expose a Prometheus histogram for request latency and surface a p90 gauge through a metrics adapter. You want to keep p90 ≤ 200ms but also respect CPU.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 2
  maxReplicas: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100                     # at most double every 15s
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300    # wait 5m before scaling down
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: External
      external:
        metric:
          name: http_request_duration_p90_ms
        target:
          type: Value        # compare the gauge directly; AverageValue would divide it by the pod count
          value: "200"       # keep p90 around 200 ms
```
Pros
- Native, simple, battle-tested.
- Multi-metric support with v2.
- Good for CPU-bound workloads.
Cons
- Needs adapters to use custom/external metrics (latency, queue length).
- Not event-driven: it won’t “wake up” on a queue spike unless you wire that metric in.
- No built-in scale-to-zero.
Use HPA when CPU or a small set of metrics capture “work,” and you can tolerate warm baselines (minReplicas > 0).
KEDA: Event-Driven Scaling (Queues, Schedules, Cloudy Things)
Mental model: KEDA bridges external signals (Kafka lag, SQS depth, Redis list length, cron schedules, Prometheus queries, etc.) to pod scaling. It runs a controller + metrics server, creates an HPA on your behalf, and can scale to zero when there is no work.
Why KEDA Exists
For queue-driven systems, “work” is the backlog and arrival rate—not CPU. If 50,000 messages land at once, you want pods now, even if current CPU is idle. KEDA polls the external system, calculates a desired replica count from triggers, and feeds that to an HPA it manages.
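Under the hood, for most triggers the target works out to roughly desiredReplicas ≈ ceil(current_metric_value / threshold), clamped between minReplicaCount and maxReplicaCount.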
KEDA ScaledObject Example: Kafka Lag
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-worker
spec:
  scaleTargetRef:
    kind: Deployment
    name: orders-worker
  minReplicaCount: 0        # allow scale-to-zero
  maxReplicaCount: 80
  pollingInterval: 5        # seconds
  cooldownPeriod: 300       # seconds
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0
          policies:
            - type: Percent
              value: 200
              periodSeconds: 15
        scaleDown:
          stabilizationWindowSeconds: 300
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: orders-cg
        topic: orders
        lagThreshold: "1000"   # desired lag per replica
```
Interpretation: “Try to keep ~1000 messages of lag per replica.” If lag is 50,000, KEDA asks for ~50 pods (subject to min/max). When lag is zero, it can scale to zero.
KEDA Extras You’ll Actually Use
- ScaledJob: scale Jobs by events (e.g., 1 Job per 100 messages) — see the sketch after this list.
- Multiple triggers: e.g., Kafka lag and a Prometheus rate.
- Fallbacks/behaviors: guardrails when metric sources fail.
- Authentication resources for cloud providers.
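A minimal ScaledJob sketch, assuming the same Kafka setup as above; the job name, image, and limits are placeholders:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: orders-batch            # placeholder name
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: worker
            image: ghcr.io/acme/orders-batch:latest   # placeholder image
        restartPolicy: Never
  pollingInterval: 10           # seconds between trigger checks
  maxReplicaCount: 20           # cap on concurrently running Jobs
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: orders-batch-cg
        topic: orders
        lagThreshold: "100"     # roughly one Job per 100 pending messages
```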
Pros
- Speaks queue fluently; reacts to backlog.
- Scale-to-zero without Knative.
- Minimal changes to your app; no special HTTP sidecars.
Cons
- KEDA owns the HPA it creates; don’t attach a second HPA to the same target.
- Requires configuring triggers and (sometimes) credentials/adapters.
- Backlog-based scaling may over-provision compute for small messages unless you tune thresholds.
Use KEDA when your work arrives via events/queues, or you need aggressive scale-to-zero for background workers.
Knative: Request-Driven Autoscaling for HTTP/GRPC
Mental model: Knative Serving wraps your container in a queue-proxy and watches concurrency and request rates. It scales replicas to keep in-flight requests per pod near a target. It can also scale to zero and route cold traffic through an activator until pods are ready.
Why Knative Feels Different
Knative’s signal is not CPU or backlog; it is live request pressure. That makes it ideal for latency-sensitive endpoints where you want to say, “Never let a pod juggle more than N concurrent requests.” Knative can also switch between:
- KPA (Knative Pod Autoscaler) — concurrency/RPS-based.
- HPA class — CPU-based scaling if you prefer.
Knative Service Example: Concurrency-Based Autoscaling
```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: quotes
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "50"                        # target in-flight reqs per pod
        autoscaling.knative.dev/window: "60s"                       # stable window
        autoscaling.knative.dev/panic-window-percentage: "10"       # faster reaction
        autoscaling.knative.dev/panic-threshold-percentage: "200"
        autoscaling.knative.dev/minScale: "1"                       # avoid cold starts on baseline
        autoscaling.knative.dev/maxScale: "100"
    spec:
      containerConcurrency: 100   # hard cap per pod
      containers:
        - image: ghcr.io/acme/quotes:latest
          ports:
            - containerPort: 8080
```
How it behaves (simplified):
- It estimates current concurrency across pods.
- Desired replicas ≈ observed_concurrency / target_concurrency.
- A panic mode reacts quickly to spikes (short window), then stabilizes using a longer window to avoid flapping.
- If traffic drops to zero and minScale is 0, it scales to zero and later cold-starts via the activator.
Pros
- Optimizes for tail latency on HTTP/GRPC.
- Seamless scale-to-zero and traffic splitting by revision.
- First-class concurrency controls and graceful cold-start handling.
Cons
- Brings its own control plane (serving, activator, networking layer).
- Best fit for HTTP-ish workloads; not a queue consumer.
- Cold starts are real; you'll often use minScale for critical paths.
Use Knative when you have interactive endpoints with strict latency SLOs, need revision traffic splitting, or want on-demand HTTP scale-to-zero.
Side-by-Side: How They React to Metrics
| Dimension | HPA | KEDA | Knative |
|---|---|---|---|
| Primary signal | CPU/memory; custom & external (via adapters) | Event/queue backlog, rates, cron, Prometheus | Concurrent requests, RPS; can optionally use CPU (HPA class) |
| Scale-to-zero | Not native | Yes (minReplicaCount: 0) | Yes (minScale: 0, activator) |
| Reaction speed | Moderate; configurable behaviors | Fast (pollingInterval), but bounded by poll & HPA cool-down | Fast with panic window, then stable window |
| Best for | CPU-bound, steady workloads | Background workers, bursty event streams | Latency-sensitive HTTP/GRPC |
| Requires extra control plane | No | Small (operator + metrics server) | Larger (Knative Serving stack) |
| Custom metrics complexity | Needs adapters | Built-in triggers/adapters | Built-in request metrics |
The Math Behind Latency Targets (Quick Intuition)
If a single pod can handle μ requests/second at your SLO (e.g., measured at 50% CPU or at concurrency 50), and your incoming rate is λ requests/second, then a rough replica count is:
replicas ≈ ceil( λ / (μ * target_utilization) )
Knative bakes this into concurrency targets; HPA bakes it into CPU targets; KEDA translates queue length into “pods to drain backlog in time T.” If you know your service time and arrival rate, you can back into reasonable targets without pure guesswork.
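For instance, if each pod sustains μ = 100 requests/second within SLO and you plan for 70% utilization, then at λ = 3,000 rps you need roughly ceil(3000 / (100 × 0.7)) = 43 replicas.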
Hybrid Patterns for Latency-Sensitive Systems
Here’s where most teams get tangled: mixing autoscalers. The rule of thumb:
One owner per Deployment. If KEDA generates an HPA for a Deployment, do not also attach your own HPA to it. If Knative owns scaling for a Service, don’t bolt on KEDA to the same pods.
You can still compose these systems by splitting responsibilities cleanly.
Pattern 1: Knative for Ingress, KEDA for Background Work
A classic e-commerce “checkout” has two faces:
- HTTP API (payment authorization, order placement) — latency-sensitive.
- Async pipeline (invoice emails, fraud scoring, warehouse updates) — backlog-driven.
Use Knative Service for the API and KEDA ScaledObject for the worker.
```yaml
# API: Knative Service
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: checkout-api
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "40"
        autoscaling.knative.dev/minScale: "2"   # keep warm
        autoscaling.knative.dev/maxScale: "60"
    spec:
      containers:
        - image: ghcr.io/acme/checkout:1.2.3
          env:
            - name: QUEUE_URL
              value: "kafka://orders"
---
# Worker: KEDA ScaledObject on a Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-worker
spec:
  replicas: 0   # KEDA will manage the replica count
  selector:
    matchLabels: { app: orders-worker }
  template:
    metadata:
      labels: { app: orders-worker }
    spec:
      containers:
        - name: worker
          image: ghcr.io/acme/orders-worker:1.7.0
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-worker
spec:
  scaleTargetRef:
    kind: Deployment
    name: orders-worker
  minReplicaCount: 0
  maxReplicaCount: 100
  pollingInterval: 2
  cooldownPeriod: 180
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: orders-cg
        topic: orders
        lagThreshold: "2000"
```
Why it works: The API scales with live request pressure; the worker scales with backlog. Each has a single, clear owner.
Latency tip: set minScale: 2 on the Knative Service to avoid customer-visible cold starts, and let the worker fluctuate from 0→N.
Pattern 2: HPA for CPU + KEDA for a Separate Queue Adapter
You want to scale a CPU-bound image transformer on CPU, but also spike when a queue backlog grows. Avoid attaching both HPA and KEDA to the same Deployment. Instead:
- Keep image-transformer scaled by HPA (CPU & maybe latency).
- Add a thin adapter Deployment that pulls the queue and forwards to the transformer (e.g., through HTTP or NATS). Scale the adapter via KEDA on backlog.
```yaml
# CPU-driven transformer (HPA owner)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: img-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: img-transformer
  minReplicas: 2
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 75 }
---
# KEDA scales a separate adapter that fetches queue messages
apiVersion: apps/v1
kind: Deployment
metadata:
  name: img-adapter
spec:
  replicas: 0
  selector:
    matchLabels: { app: img-adapter }
  template:
    metadata:
      labels: { app: img-adapter }
    spec:
      containers:
        - name: adapter
          image: ghcr.io/acme/img-adapter:2.1
          env:
            - name: TRANSFORMER_URL
              value: "http://img-transformer:8080"
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: img-adapter
spec:
  scaleTargetRef:
    kind: Deployment
    name: img-adapter
  minReplicaCount: 0
  maxReplicaCount: 100
  triggers:
    - type: redis
      metadata:
        address: redis:6379
        listName: transform-queue
        listLength: "500"
```
Why it works: two owners, two Deployments. The adapter increases request pressure on the transformer as backlog grows; HPA adds transformer replicas when CPU rises.
Pattern 3: Knative with HPA Class (CPU) for “HTTP but CPU-Bound”
Your service is HTTP-facing, but CPU is the real bottleneck (e.g., a JSON → Parquet converter). You want Knative's routing and revision management with HPA's CPU logic (note: the HPA class forgoes KPA's scale-to-zero, so keep minScale ≥ 1):
```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: parquetify
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: "hpa.autoscaling.knative.dev"
        autoscaling.knative.dev/metric: "cpu"
        autoscaling.knative.dev/target: "75"   # percent CPU
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "80"
    spec:
      containers:
        - image: ghcr.io/acme/parquetify:5.0.0
```
Why it works: Knative handles routing, revisions, and traffic splitting; HPA semantics decide when to add pods.
Pattern 4: HPA with External Latency Metric (No Knative)
You don’t want Knative, but you do care about latency. Expose a Prometheus metric for p90 and use an external metrics adapter:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa-latency
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 60
  metrics:
    - type: External
      external:
        metric:
          name: http_request_duration_p90_ms
        target:
          type: Value      # compare the p90 gauge directly, not per-pod
          value: "150"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
```
This keeps latency in check while staying within “just Kubernetes.”
Tuning Without Tears: Practical Knobs That Matter
A few settings do most of the work.
For HPA
- behavior.scaleUp / behavior.scaleDown: shape your reaction curve. Aggressive scale-up, conservative scale-down is a sane default.
- stabilizationWindowSeconds: prevents oscillation; set it higher on scale-down.
- Multiple metrics: combine CPU with one external metric instead of stacking HPAs.
For KEDA
- pollingInterval: lower for faster reaction, higher for fewer API calls.
- cooldownPeriod: how long to wait after the last active trigger before scaling down.
- Trigger thresholds: map backlog per pod to realistic processing throughput.
- minReplicaCount: 0 for savings, > 0 for warm workers.
- Avoid dual ownership: let KEDA own the HPA for that Deployment.
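One guardrail worth wiring in (the "fallbacks" mentioned earlier): if the metric source stops answering, KEDA can hold the workload at a fixed replica count instead of letting it drift. A minimal excerpt, added under an existing ScaledObject's spec; the replica number is an assumption, pick one that covers your baseline:

```yaml
# Excerpt: goes under spec: of a ScaledObject
fallback:
  failureThreshold: 3   # after 3 consecutive failed metric reads...
  replicas: 6           # ...pin the target at 6 replicas (illustrative value)
```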
For Knative
- autoscaling.knative.dev/target: concurrency target; measure your pod's sweet spot.
- containerConcurrency: hard cap to avoid head-of-line blocking.
- minScale: keep a floor to avoid customer-visible cold starts.
- Panic vs. stable windows: quicker spike reaction without long-term thrash.
Common Pitfalls (and How to Dodge Them)
- Two HPAs, one Deployment. Don't. If KEDA created an HPA, that's the owner. Split the workload if you need two different scaling signals.
- Backlog target ignores message size. If your messages vary wildly, scale on the age of the oldest message or on processing time instead of raw count, or normalize lag by expected bytes (see the trigger sketch after this list).
- Cold starts tank SLOs. Use minScale (Knative) or minReplicaCount (KEDA) ≥ 1 for critical paths. Consider pre-warming (periodic pings) for infrequent endpoints.
- Cluster autoscaler is asleep. Pod autoscaling can request 100 replicas faster than nodes can appear. Ensure cluster autoscaler limits, cloud quotas, and Pod Priorities align with your SLO.
- Overfitting to synthetic tests. Load tests with perfect Poisson arrivals understate burstiness. Use production traces to set burst budgets and panic windows.
- Scaling on average, ignoring tails. If your SLO is p95 latency, don't scale only on mean CPU. Either adopt Knative (concurrency) or surface a tail-latency metric to HPA.
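For the message-size pitfall, one hedged option is a KEDA Prometheus trigger that scales on the age of the oldest message rather than raw lag. The metric name below is hypothetical and depends on your exporter; the threshold is an assumption to tune:

```yaml
# Excerpt: replaces or supplements the triggers: list of a ScaledObject
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      # Hypothetical metric: seconds since the oldest unprocessed message was enqueued
      query: max(queue_oldest_message_age_seconds{queue="orders"})
      threshold: "30"   # aim for ~30s of "oldest age" per replica
```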
A Worked Example: Meeting a 200 ms p95 SLO
Scenario: An API averages 4 ms CPU time per request (nominal), but at concurrency > 60 per pod, GC and lock contention kick p95 above 200 ms. Traffic ranges from 50 rps to 3,000 rps within a minute.
Knative approach
- Benchmark shows sweet spot at ~50 in-flight reqs per pod to keep p95 ≤ 200 ms.
- Set autoscaling.knative.dev/target: "50".
- Keep minScale: "2" to avoid cold starts during off-hours.
- Use panic-window-percentage: "10" to react within ~6 s to spikes; keep window: "60s" for stability.
Expected behavior: At 3,000 rps, desired replicas ≈ 3,000 / 50 = 60 pods. Panic mode ramps quickly; stable window prevents oscillation as traffic wanes.
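Condensed into a Revision template, those settings might look like the following sketch; the service name, image, and maxScale headroom are placeholders:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: api                                   # placeholder name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "50"
        autoscaling.knative.dev/window: "60s"
        autoscaling.knative.dev/panic-window-percentage: "10"
        autoscaling.knative.dev/minScale: "2"
        autoscaling.knative.dev/maxScale: "100"   # headroom above the expected ~60-pod peak
    spec:
      containers:
        - image: ghcr.io/acme/api:latest          # placeholder image
```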
HPA approach
- Export p90 as an external metric and set its target value to "200".
- Add a CPU target at 70% as a secondary guard.
- Aggressive scale-up (100% per 15s), conservative scale-down (50% per 60s, 5-minute stabilization).
Trade-off: HPA can work but you’re limited by metric freshness and adapter lag. Knative’s direct request signal reacts faster and more precisely for HTTP.
Decision Flow: Picking the Right Tool
Ask three questions:
1. How does work arrive?
   - HTTP/GRPC → Start with Knative (concurrency/RPS).
   - Queues/Events/Jobs → Start with KEDA (backlog/age).
   - CPU-bound batch/stream → Start with HPA (CPU/memory).
2. Do I need scale-to-zero?
   - Yes (and HTTP) → Knative.
   - Yes (and queues) → KEDA.
   - No → HPA might be simplest.
3. What's my SLO?
   - Tail latency → Prefer Knative or HPA with a tail-latency metric.
   - Cost/minimal control plane → Prefer HPA.
   - Bursty events → Prefer KEDA.
If two answers tie, split the workload into two Deployments and give each to its best-fit autoscaler. Fewer knobs per owner beats a mega-HPA with six metrics.
Cheat-Sheet: Sensible Defaults
- HPA
  - CPU target: 70–75%.
  - Scale up: max 100% / 15s; scale down: max 50% / 60s, 5m stabilization.
  - Add one external metric if latency matters.
- KEDA
  - pollingInterval: 2–5s for hot queues.
  - cooldownPeriod: 180–300s.
  - Backlog target (lagThreshold): roughly per_pod_rate × acceptable_drain_time, i.e., the lag one pod can clear within your drain window. Start conservative.
  - minReplicaCount: 0 for workers unless the SLO demands warm pods.
- Knative
  - target (concurrency): measured sweet spot per pod (commonly 30–100).
  - minScale: 1–2 for critical endpoints; 0 for low-traffic ones.
  - Keep a short panic window and a longer stable window.
Summary: The Right Tool for the Right Signal
- HPA is the sturdy default. If CPU tracks your real work (and you don't need scale-to-zero), HPA is simple, native, and plenty capable, especially with v2 behaviors and external metrics.
- KEDA turns external-world signals into replicas. For queue-driven systems and background workers, it's the most natural fit and gives you scale-to-zero without changing your app model.
- Knative optimizes interactive latency by treating concurrency as the first-class signal. It's the easiest way to keep p95 in check for HTTP/GRPC, with graceful scale-to-zero and traffic splitting.
For latency-sensitive microservices, the best setups are often hybrid: Knative for ingress paths, KEDA for background pipelines, HPA for CPU-bound transforms. Keep ownership clean—one autoscaler per Deployment—and tune a handful of behavior knobs. Your cluster (and your on-call rota) will thank you.
Further Reading & Exploration
- Kubernetes Horizontal Pod Autoscaler v2 concepts and API.
- KEDA documentation: triggers, ScaledObject, ScaledJob, and advanced HPA behavior.
- Knative Serving autoscaling: KPA vs. HPA class, concurrency, activator, and scale-to-zero.
- Kubernetes Cluster Autoscaler: ensuring node capacity keeps up with pod scale-ups.
- “Monitoring-driven capacity planning”: building latency and throughput SLOs into autoscaling targets.