Scaling
Goals
- Keep latency within SLO while maintaining cost efficiency.
- Default to horizontal scale (more replicas) and only scale vertically (bigger vCPU/RAM) when necessary.
- Use data (response times, CPU/memory, instance counts, and cost) to tune decisions.
Default Policy
Scaler: ACA HTTP concurrency scaler.
- Target concurrency: 40 concurrent requests per replica (initial default).
  - This value will be tuned per service based on observed p95 latency and CPU.
- Replica bounds:
  - minReplicas: 1 for public APIs (avoid cold starts).
  - maxReplicas: start at 5 (service-specific; raise if needed).
Preference:
- Scale horizontally first for cost efficiency. Adding replicas only incurs cost while they run under load.
- Scale vertically later if a single request needs more headroom; vertical sizing increases "fixed" cost because the larger container size is paid for throughout its active lifetime.
Configuration (ACA)
Example (YAML)
```yaml
template:
  containers:
    - name: api
      image: myacr.azurecr.io/api:latest
      resources:
        cpu: 0.5
        memory: 1Gi
  scale:
    minReplicas: 1
    maxReplicas: 5
    rules:
      - name: http-concurrency
        http:
          metadata:
            concurrentRequests: '40'
```

Notes
- Start with cpu: 0.5, memory: 1Gi. Right-size after profiling (see "When We Scale Vertically").
- If a service is truly bursty and can tolerate cold starts, or is infrequently called, minReplicas: 0 is acceptable.
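For such a bursty, cold-start-tolerant service, the scale block might look like this (illustrative values; note that ACA expects scale-rule metadata values as strings):

```yaml
scale:
  minReplicas: 0        # scale to zero when idle; first request pays a cold start
  maxReplicas: 3
  rules:
    - name: http-concurrency
      http:
        metadata:
          concurrentRequests: '40'
```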
How We Tune concurrentRequests (40 → …)
- Collect a baseline for 3β7 days under representative traffic.
- If CPU < 50% and p95 well below SLO, consider raising concurrency (e.g., 40 → 60).
- If CPU ≥ 75% or p95 approaches SLO, lower concurrency (e.g., 40 → 30) or add replicas by raising maxReplicas.
- Re-assess until we hit the "sweet spot" (steady p95 and 60–75% CPU during normal load).
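As an illustration, if the baseline shows CPU around 45% with p95 comfortably under SLO, the scale rule could be relaxed one step at a time (hypothetical numbers, following the 40 → 60 suggestion above):

```yaml
rules:
  - name: http-concurrency
    http:
      metadata:
        concurrentRequests: '60'   # raised from '40' after a 3–7 day baseline review
```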
Decision Tree: Scale Out vs. Optimize vs. Scale Up
1. Is p95 latency above SLO?
   - No → Do nothing. Keep observing.
   - Yes → Go to 2.
2. Is per-replica CPU ≥ 75% or memory ≥ 80%?
   - Yes → Try horizontal scale first (increase maxReplicas or decrease target concurrency slightly).
   - No → Go to 3.
3. Where is time spent (from traces)?
   - Mostly external (DB, HTTP downstream) → Optimize queries, add caching, reduce payloads; scaling won't help much.
   - Mostly app CPU / GC → Consider vertical scale (more vCPU/RAM) or reduce allocations/compute; then re-test.
4. Are we frequently at maxReplicas with p95 ~ SLO and costs rising?
   - Yes → Prioritize optimization (DB indexes, caching, batching) before adding more replicas.
When We Scale Horizontally (Default)
- Workload is I/O-bound or embarrassingly parallel.
- Increasing replicas reduces queueing and improves p95.
- Cost is proportional to actual load (replicas scale down when idle).
Actions
- Increase maxReplicas with the 60–75% CPU target in mind.
- Keep concurrentRequests near the latency sweet spot.
- Watch DB/Redis/queue limits as you add replicas.
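Putting those actions together, a widened scale block might look like this (illustrative values):

```yaml
scale:
  minReplicas: 1
  maxReplicas: 10      # raised from 5; replicas only bill while they run under load
  rules:
    - name: http-concurrency
      http:
        metadata:
          concurrentRequests: '40'   # unchanged; keep near the latency sweet spot
```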
When We Scale Vertically (Exception, Not Default)
Scale vertically only when many requests are CPU/memory intensive and a single replica is the bottleneck:
- p95/p99 dominated by CPU work inside the service even with low concurrency.
- High GC% time, LOH pressure, or frequent OOM.
- Thread pool starvation at modest concurrency.
Actions
- Move the app to a larger vCPU/RAM size (workload profile).
- Keep horizontal scaling enabled; vertical ≠ disable autoscale.
- Re-check cost: bigger instances increase the "fixed" cost floor because we pay the higher rate for the entire time the container runs.
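A sketch of a vertically resized service, assuming a 2 vCPU / 4Gi size fits the workload (an illustrative combination, not a recommendation); note that the autoscale bounds stay in place:

```yaml
template:
  containers:
    - name: api
      resources:
        cpu: 2.0       # larger size for CPU-bound request handling
        memory: 4Gi
  scale:
    minReplicas: 1     # autoscale remains enabled even after scaling up
    maxReplicas: 5
```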
FAQ
Why 40 concurrent requests as default? It's a safe starting point for I/O-heavy APIs. We will tune up/down using latency and CPU data.
Why horizontal before vertical? Horizontal scaling matches cost to demand; vertical sizing raises the baseline cost as long as the container is running.
What's the signal to go vertical? When p95/p99 are dominated by in-process CPU/memory work and a single request needs more headroom despite modest concurrency.