
Scaling

Goals

  • Keep latency within SLO while maintaining cost efficiency.
  • Default to horizontal scale (more replicas) and only scale vertically (bigger vCPU/RAM) when necessary.
  • Use data (response times, CPU/memory, instance counts, and cost) to tune decisions.

Default Policy

  1. Scaler: ACA HTTP concurrency scaler.

    • Target concurrency: 40 concurrent requests per replica (initial default).
    • This value will be tuned per-service based on observed p95 latency and CPU.
  2. Replica bounds:

    • minReplicas: 1 for public APIs (avoid cold start).
    • maxReplicas: start at 5 (service-specific; raise if needed).
  3. Preference:

    • Scale horizontally first for cost efficiency. Adding replicas only incurs cost while they run under load.
    • Scale vertically later if a single request needs more headroom; vertical sizing increases “fixed” cost because the larger container size is paid for throughout its active lifetime.

Configuration (ACA)

Example (YAML)

yaml
template:
  containers:
    - name: api
      image: myacr.azurecr.io/api:latest
      resources:
        cpu: 0.5
        memory: 1Gi
  scale:
    minReplicas: 1
    maxReplicas: 5
    rules:
      - name: http-concurrency
        http:
          metadata:
            concurrentRequests: "40" # HTTP rule settings sit under metadata; ARM stores the value as a string

Notes

  • Start with cpu: 0.5, memory: 1Gi. Right-size after profiling (see “When We Scale Vertically”).
  • If a service is truly bursty and can tolerate cold starts, or is called only infrequently, minReplicas: 0 is acceptable (see the CLI sketch below).
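
The same defaults can be applied with the Azure CLI rather than YAML. A minimal sketch, assuming the az containerapp scale-rule flags; the app name (api), resource group (my-rg), and rule name mirror the example above and are placeholders:

bash
# Sketch: apply the default scale policy to an existing container app ("api"/"my-rg" are placeholders).
az containerapp update \
  --name api \
  --resource-group my-rg \
  --min-replicas 1 \
  --max-replicas 5 \
  --scale-rule-name http-concurrency \
  --scale-rule-type http \
  --scale-rule-http-concurrency 40
# For bursty or infrequently called services that tolerate cold starts, use --min-replicas 0.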

How We Tune concurrentRequests (40 → …)

  1. Collect a baseline for 3–7 days under representative traffic.
  2. If CPU < 50% and p95 is well below SLO, consider raising concurrency (e.g., 40 → 60; see the CLI sketch after this list).
  3. If CPU ≥ 75% or p95 approaches SLO, lower concurrency (e.g., 40 → 30) or add replicas by raising maxReplicas.
  4. Re-assess until we hit the “sweet spot” (steady p95 and 60–75% CPU during normal load).
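
A sketch of one tuning pass with the Azure CLI, assuming the same placeholder app, resource group, and rule names as in the configuration example:

bash
# Check the scale settings currently in effect.
az containerapp show \
  --name api \
  --resource-group my-rg \
  --query "properties.template.scale"

# If the baseline shows CPU and p95 headroom, raise the concurrency target (here 40 -> 60).
az containerapp update \
  --name api \
  --resource-group my-rg \
  --scale-rule-name http-concurrency \
  --scale-rule-type http \
  --scale-rule-http-concurrency 60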

Decision Tree: Scale Out vs. Optimize vs. Scale Up

  1. Is p95 latency above SLO?

    • No → Do nothing. Keep observing.
    • Yes → Go to 2.
  2. Is per-replica CPU ≥ 75% or memory ≥ 80%?

    • Yes → Try horizontal scale first (increase maxReplicas or decrease target concurrency slightly).
    • No → Go to 3.
  3. Where is time spent (from traces)?

    • Mostly external (DB, HTTP downstream) → Optimize queries, add caching, reduce payloads; scaling won’t help much.
    • Mostly app CPU / GC → Consider vertical scale (more vCPU/RAM) or reduce allocations/compute; then re-test.
  4. Are we frequently at maxReplicas with p95 ~ SLO and costs rising?

    • Yes → Prioritize optimization (DB indexes, caching, batching) before adding more replicas (the metrics sketch below shows one way to check replica counts).
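
To see how often we sit at maxReplicas, replica counts can be pulled from Azure Monitor. A sketch only: the resource ID is a placeholder, and the metric name, aggregation, and offset flags are assumptions to verify against az monitor metrics list-definitions:

bash
# Maximum replica count per hour over roughly the last day.
# "Replicas" is assumed to be the Container Apps replica-count metric; confirm with:
#   az monitor metrics list-definitions --resource <app-resource-id>
az monitor metrics list \
  --resource <app-resource-id> \
  --metric Replicas \
  --aggregation Maximum \
  --interval PT1H \
  --offset 24h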

When We Scale Horizontally (Default)

  • Workload is I/O-bound or embarrassingly parallel.
  • Increasing replicas reduces queueing and improves p95.
  • Cost is proportional to actual load (replicas scale down when idle).

Actions

  • Increase maxReplicas with the 60–75% CPU target in mind (see the CLI sketch after this list).
  • Keep concurrentRequests near the latency sweet spot.
  • Watch DB/Redis/queue limits as you add replicas.
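
For example, raising the replica ceiling without touching the concurrency target (a sketch; app and resource-group names are placeholders):

bash
# Let the scaler add more replicas under load; per-replica behaviour is unchanged.
az containerapp update \
  --name api \
  --resource-group my-rg \
  --max-replicas 10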

When We Scale Vertically (Exception, Not Default)

Scale vertically only when many requests are CPU/memory intensive and a single replica is the bottleneck:

  • p95/p99 dominated by CPU work inside the service even with low concurrency.
  • High GC% time, LOH pressure, or frequent OOM.
  • Thread pool starvation at modest concurrency.

Actions

  • Move the app to a larger vCPU/RAM size (workload profile); see the CLI sketch after this list.
  • Keep horizontal scaling enabled; vertical ≠ disable autoscale.
  • Re-check cost: bigger instances increase the “fixed” cost floor because we pay the higher rate for the entire time the container runs.
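
A sketch of the resize with the Azure CLI, assuming the Consumption plan, where CPU and memory must be one of the supported pairings (for example 0.5 vCPU/1.0Gi or 1.0 vCPU/2.0Gi); names are placeholders:

bash
# Give each replica more headroom; min/max replicas and the HTTP concurrency rule stay as configured.
az containerapp update \
  --name api \
  --resource-group my-rg \
  --cpu 1.0 \
  --memory 2.0Gi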

FAQ

  • Why 40 concurrent requests as default? It’s a safe starting point for I/O-heavy APIs. We will tune up/down using latency and CPU data.

  • Why horizontal before vertical? Horizontal scaling matches cost to demand; vertical sizing raises the baseline cost as long as the container is running.

  • What’s the signal to go vertical? When p95/p99 are dominated by in-process CPU/memory work and a single request needs more headroom despite modest concurrency.