
Scaling

Goals

  • Keep latency within SLO while maintaining cost efficiency.
  • Default to horizontal scale (more replicas) and only scale vertically (bigger vCPU/RAM) when necessary.
  • Use data (response times, CPU/memory, instance counts, and cost) to tune decisions.

Default Policy

  1. Scaler: ACA HTTP concurrency scaler.

    • Target concurrency: 40 concurrent requests per replica (initial default).
    • This value will be tuned per-service based on observed p95 latency and CPU.
  2. Replica bounds:

    • minReplicas: 1 for public APIs (avoid cold start).
    • maxReplicas: start at 5 (service-specific; raise if needed).
  3. Preference:

    • Scale horizontally first for cost efficiency. Adding replicas only incurs cost while they run under load.
    • Scale vertically later if a single request needs more headroom; vertical sizing increases “fixed” cost because the larger container size is paid for throughout its active lifetime.

Configuration (ACA)

Example (YAML)

yaml
template:
  containers:
    - name: api
      image: myacr.azurecr.io/api:latest
      resources:
        cpu: 0.5
        memory: 1Gi
  scale:
    minReplicas: 1
    maxReplicas: 5
    rules:
      - name: http-concurrency
        http:
          metadata:
            concurrentRequests: "40" # HTTP rule settings sit under metadata; ARM stores the value as a string

Notes

  • Start with cpu: 0.5, memory: 1Gi. Right-size after profiling (see “When We Scale Vertically”).
  • If a service is truly bursty and can tolerate cold starts, or is called only infrequently, minReplicas: 0 is acceptable (see the CLI sketch below).
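
The same defaults can be applied with the Azure CLI rather than YAML. A minimal sketch, assuming the az containerapp scale-rule flags; the app name (api), resource group (my-rg), and rule name mirror the example above and are placeholders:

bash
# Sketch: apply the default scale policy to an existing container app ("api"/"my-rg" are placeholders).
az containerapp update \
  --name api \
  --resource-group my-rg \
  --min-replicas 1 \
  --max-replicas 5 \
  --scale-rule-name http-concurrency \
  --scale-rule-type http \
  --scale-rule-http-concurrency 40
# For bursty or infrequently called services that tolerate cold starts, use --min-replicas 0.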

How We Tune concurrentRequests (40 → …)

  1. Collect a baseline for 3–7 days under representative traffic.
  2. If CPU < 50% and p95 is well below SLO, consider raising concurrency (e.g., 40 → 60; see the CLI sketch after this list).
  3. If CPU ≥ 75% or p95 approaches SLO, lower concurrency (e.g., 40 → 30) or add replicas by raising maxReplicas.
  4. Re-assess until we hit the “sweet spot” (steady p95 and 60–75% CPU during normal load).
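
A sketch of one tuning pass with the Azure CLI, assuming the same placeholder app, resource group, and rule names as in the configuration example:

bash
# Check the scale settings currently in effect.
az containerapp show \
  --name api \
  --resource-group my-rg \
  --query "properties.template.scale"

# If the baseline shows CPU and p95 headroom, raise the concurrency target (here 40 -> 60).
az containerapp update \
  --name api \
  --resource-group my-rg \
  --scale-rule-name http-concurrency \
  --scale-rule-type http \
  --scale-rule-http-concurrency 60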

Decision Tree: Scale Out vs. Optimize vs. Scale Up

  1. Is p95 latency above SLO?

    • No → Do nothing. Keep observing.
    • Yes → Go to 2.
  2. Is per-replica CPU ≥ 75% or memory ≥ 80%?

    • Yes → Try horizontal scale first (increase maxReplicas or decrease target concurrency slightly).
    • No → Go to 3.
  3. Where is time spent (from traces)?

    • Mostly external (DB, HTTP downstream) → Optimize queries, add caching, reduce payloads; scaling won’t help much.
    • Mostly app CPU / GC → Consider vertical scale (more vCPU/RAM) or reduce allocations/compute; then re-test.
  4. Are we frequently at maxReplicas with p95 ~ SLO and costs rising?

    • Yes → Prioritize optimization (DB indexes, caching, batching) before adding more replicas (the metrics sketch below shows one way to check replica counts).
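
To see how often we sit at maxReplicas, replica counts can be pulled from Azure Monitor. A sketch only: the resource ID is a placeholder, and the metric name, aggregation, and offset flags are assumptions to verify against az monitor metrics list-definitions:

bash
# Maximum replica count per hour over roughly the last day.
# "Replicas" is assumed to be the Container Apps replica-count metric; confirm with:
#   az monitor metrics list-definitions --resource <app-resource-id>
az monitor metrics list \
  --resource <app-resource-id> \
  --metric Replicas \
  --aggregation Maximum \
  --interval PT1H \
  --offset 24h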

When We Scale Horizontally (Default)

  • Workload is I/O-bound or embarrassingly parallel.
  • Increasing replicas reduces queueing and improves p95.
  • Cost is proportional to actual load (replicas scale down when idle).

Actions

  • Increase maxReplicas with the 60–75% CPU target in mind (see the CLI sketch after this list).
  • Keep concurrentRequests near the latency sweet spot.
  • Watch DB/Redis/queue limits as you add replicas.
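
For example, raising the replica ceiling without touching the concurrency target (a sketch; app and resource-group names are placeholders):

bash
# Let the scaler add more replicas under load; per-replica behaviour is unchanged.
az containerapp update \
  --name api \
  --resource-group my-rg \
  --max-replicas 10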

When We Scale Vertically (Exception, Not Default)

Scale vertically only when many requests are CPU/memory intensive and a single replica is the bottleneck:

  • p95/p99 dominated by CPU work inside the service even with low concurrency.
  • High GC% time, LOH pressure, or frequent OOM.
  • Thread pool starvation at modest concurrency.

Actions

  • Move the app to a larger vCPU/RAM size (workload profile); see the CLI sketch after this list.
  • Keep horizontal scaling enabled; vertical ≠ disable autoscale.
  • Re-check cost: bigger instances increase the “fixed” cost floor because we pay the higher rate for the entire time the container runs.
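
A sketch of the resize with the Azure CLI, assuming the Consumption plan, where CPU and memory must be one of the supported pairings (for example 0.5 vCPU/1.0Gi or 1.0 vCPU/2.0Gi); names are placeholders:

bash
# Give each replica more headroom; min/max replicas and the HTTP concurrency rule stay as configured.
az containerapp update \
  --name api \
  --resource-group my-rg \
  --cpu 1.0 \
  --memory 2.0Gi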

FAQ

  • Why 40 concurrent requests as default? It’s a safe starting point for I/O-heavy APIs. We will tune up/down using latency and CPU data.

  • Why horizontal before vertical? Horizontal scaling matches cost to demand; vertical sizing raises the baseline cost as long as the container is running.

  • What’s the signal to go vertical? When p95/p99 are dominated by in-process CPU/memory work and a single request needs more headroom despite modest concurrency.