Serverless Worker autoscaling

SUPPORT, STABILITY, and DEPENDENCY INFO

Serverless Workers are in Pre-release and available to select Temporal Cloud customers. To request access during Pre-release, create a support ticket or contact your account team. APIs are experimental and may be subject to backwards-incompatible changes. Sign up for updates to be notified when Serverless Workers reach Public Preview.

The Worker Controller Instance (WCI) autoscales Serverless Workers using two signals: sync match failure and Task Queue backlog. The autoscaling algorithm differs by compute provider because of differences in cold start latency, invocation duration limits, and provider APIs.

Scaling signals

Both compute providers use the same two signals to drive scaling decisions.

Sync match failure

When a Task is submitted, the Matching Service attempts to route it directly to an available Worker. If no Worker is available, the sync match fails, and the Matching Service pushes a signal to the WCI. Because the Matching Service pushes match failures as they happen, rather than the WCI polling on a timer, the WCI can react to a capacity shortfall as soon as it occurs instead of at the next polling interval.

Task Queue backlog

The WCI monitors Task Queue metadata to detect pending Tasks that do not have enough Workers to process them. When it finds such a backlog, the WCI scales up.

AWS Lambda

The Lambda algorithm is event-driven and reactive. Sync match failure is the primary control signal, and backlog aids sizing.

When the WCI needs more capacity, it calls the Lambda Invoke API (the lambda:InvokeFunction action) to start new Workers. Each call is a discrete action ("invoke N more functions"), not a target state. The WCI does not manage a fleet of instances.

Scale-out

On sync match failure, the WCI invokes new Lambda functions. Because Lambda cold start is sub-second to low single-digit seconds, reactive-only control does not create meaningful backlog overshoot. The WCI can scale from zero with low latency.
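The discrete "invoke N more functions" action can be sketched as follows. This is a minimal illustration, not the WCI's implementation: the sizing rule (act only when sync match failures occurred, then cover the backlog at an assumed `tasks_per_worker`), the `max_burst` cap, and all function names are assumptions. The injected `invoke` callable stands in for whatever wraps the Lambda invocation call.

```python
import math
from typing import Callable


def invocations_needed(sync_match_failures: int, backlog: int,
                       tasks_per_worker: int, max_burst: int) -> int:
    """Hypothetical sizing rule: failures trigger the decision, backlog sizes it."""
    if sync_match_failures <= 0:
        return 0  # no primary signal, so take no action
    # Enough Workers to cover the backlog at the assumed per-Worker capacity.
    for_backlog = math.ceil(backlog / tasks_per_worker)
    # At least one invocation per failed sync match, capped per decision.
    return min(max(sync_match_failures, for_backlog), max_burst)


def scale_out(n: int, invoke: Callable[[], None]) -> int:
    """Issue n independent invocations; there is no fleet target to reconcile."""
    for _ in range(n):
        invoke()
    return n
```

Because each call is an action rather than a target, nothing has to be undone later: capacity disappears when the invocations finish on their own.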

Scale-in

Scale-in is automatic. Each Lambda invocation runs until the Worker has finished processing available Tasks or approaches the 15-minute execution time limit, then shuts down. There is no drain logic or stabilization window. The WCI does not need to actively remove capacity.

Instance model

Each invocation is independent. The Worker starts, creates a fresh client connection, processes multiple Tasks until near the execution time limit, and then shuts down gracefully. There is no shared state across invocations.
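The lifecycle of a single invocation can be sketched as a loop that stops when no Tasks remain or when the execution time limit is near. This is an illustrative sketch under stated assumptions: `poll_task`, `remaining_ms`, and the 30-second `safety_margin_ms` are hypothetical names and values, and a real Worker would poll through a Temporal SDK rather than a plain callable.

```python
from typing import Callable, Optional

Task = Callable[[], None]


def run_worker(poll_task: Callable[[], Optional[Task]],
               remaining_ms: Callable[[], int],
               safety_margin_ms: int = 30_000) -> int:
    """Process Tasks until none remain or the time limit is near.

    Returns the number of Tasks processed by this invocation.
    """
    processed = 0
    # Stop early enough to leave time for a graceful shutdown (assumed margin).
    while remaining_ms() > safety_margin_ms:
        task = poll_task()
        if task is None:
            break  # nothing left to do, so let the invocation end
        task()
        processed += 1
    return processed
```

In a Python Lambda handler, `remaining_ms` could be the handler context's `get_remaining_time_in_millis` method. No state is kept between calls to `run_worker`, which mirrors the independence of invocations.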

GCP Cloud Run

The Cloud Run algorithm is a hybrid rate-plus-backlog controller. It extends the base algorithm with a latency-first fast-path that reacts to sync match failures.

Unlike Lambda, the WCI outputs a target state ("there should be c instances") rather than discrete invocations. The WCI adjusts Cloud Run's instance count through the Cloud Run admin API.

Scale-out

The algorithm uses four layers to determine the desired instance count:

  1. Feedforward base capacity. The WCI estimates the required fleet size from the Task arrival rate, divided by per-instance throughput at the target utilization. Feedforward sizing is critical because Cloud Run cold start is approximately 10-30 seconds. If the WCI waited for a backlog to signal under-provisioning, new capacity would arrive only 10-30 seconds after it was needed.
  2. Backlog-drain correction. If a backlog exists, the WCI adds instances to drain it within the target queue wait time.
  3. Warm-reserve headroom. The WCI maintains extra capacity above the feedforward estimate to absorb sync match failures without triggering cold starts.
  4. Sync match fast-path. On any sync match failure, the WCI immediately re-evaluates and scales out if the current fleet is undersized. This event-triggered path bypasses the regular control interval.

The final desired count is the maximum of the reactive and event-driven calculations, clamped to the configured minimum and maximum instance counts, and quantized to the scaling granularity.
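One way to read the four layers is sketched below. The formulas are assumptions chosen to match the prose, not the WCI's actual controller: the exact way the layers combine, the field names, and the choice to quantize before clamping (so the configured bounds always hold) are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Inputs:
    arrival_rate: float         # Tasks per second arriving on the Task Queue
    per_instance_rate: float    # Tasks per second one instance handles at full load
    utilization_target: float   # e.g. 0.75
    backlog: int                # pending Tasks
    queue_wait_target_s: float  # e.g. 4.0
    warm_reserve: int           # extra instances kept above the estimate
    c_min: int
    c_max: int
    granularity: int = 1


def desired_instances(x: Inputs) -> int:
    # Throughput one instance contributes when run at the target utilization.
    effective = x.per_instance_rate * x.utilization_target
    base = x.arrival_rate / effective                        # layer 1: feedforward
    drain = x.backlog / (x.queue_wait_target_s * effective)  # layer 2: backlog drain
    raw = base + drain + x.warm_reserve                      # layer 3: warm reserve
    # Round up to the scaling granularity, then enforce the configured bounds.
    quantized = math.ceil(raw / x.granularity) * x.granularity
    return max(x.c_min, min(x.c_max, quantized))


def on_sync_match_failure(current: int, x: Inputs) -> int:
    """Layer 4: re-evaluate immediately; this path only ever scales out."""
    return max(current, desired_instances(x))
```

For example, 30 Tasks per second against instances that handle 10 Tasks per second at a 0.75 target gives a base of 4 instances. A backlog of 60 Tasks with a 4-second wait target adds 2, and a warm reserve of 1 brings the total to 7.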

Scale-in

Scale-in is conservative to avoid oscillation:

  • Scale-down stabilization window. After a scale-down decision, the WCI waits (default 300 seconds) before removing instances. If load increases during this window, the scale-down is canceled.
  • Hold after scale-out. After scaling out, the WCI holds the new capacity for a minimum period before considering scale-in.
  • Drain logic. When removing instances, the WCI drains them over a configurable horizon, allowing in-flight Tasks to complete before the instance is terminated.
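The stabilization window and the hold after scale-out can be sketched as a small gate between the desired count and the applied count. This is an illustration under assumptions: the class and field names are hypothetical, "load increases" is interpreted as the desired count returning to or above the current count, and drain logic is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScaleInGate:
    stabilization_s: float = 300.0
    hold_after_scale_out_s: float = 60.0
    current: int = 0
    last_scale_out_at: float = float("-inf")
    pending_since: Optional[float] = None  # when the pending scale-down began

    def update(self, desired: int, now: float) -> int:
        """Return the instance count to apply at time `now` (seconds)."""
        if desired > self.current:
            # Scale out immediately and cancel any pending scale-down.
            self.current = desired
            self.last_scale_out_at = now
            self.pending_since = None
        elif desired == self.current:
            # Load recovered, so the pending scale-down is canceled.
            self.pending_since = None
        elif now - self.last_scale_out_at < self.hold_after_scale_out_s:
            # Still holding recently added capacity; do not start a window.
            self.pending_since = None
        elif self.pending_since is None:
            self.pending_since = now  # start the stabilization window
        elif now - self.pending_since >= self.stabilization_s:
            self.current = desired    # window elapsed without cancellation
            self.pending_since = None
        return self.current
```

Scale-out is applied on the same call that requests it, while scale-in needs the lower target to persist for the whole window. That asymmetry is what damps oscillation.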

Minimum instances

Setting c_min >= 1 keeps at least one instance warm at all times. With constant traffic, this behaves like an always-on Worker with elastic scale-up and scale-down. Setting c_min = 0 enables full scale-to-zero but means the first Task after an idle period incurs a cold start.

Tuning parameters

The following parameters control Cloud Run autoscaling behavior. These are starting points for latency-first operation.

| Parameter | Starting value | Description |
| --- | --- | --- |
| Control interval | 15s | How often the WCI re-evaluates the desired instance count. |
| Utilization target | 0.70-0.80 | Target per-instance utilization for feedforward sizing. |
| Queue wait target | 3-5s | Target time a Task should wait in the queue before being picked up. |
| Drain horizon | 30-60s | How long the WCI allows for in-flight Tasks to complete when removing an instance. |
| Event cooldown | max(5s, 0.25 x scale-up latency) | Minimum time between event-triggered scale-out evaluations. |
| Scale-down stabilization | 300s | How long the WCI waits after a scale-down decision before removing instances. |
| Hold after scale-out | max(60s, 2 x scale-up latency) | Minimum time to hold new capacity before considering scale-in. |
| Min instances | >= 1 for latency-first | Minimum instance count. Set to 0 for full scale-to-zero. |
| Scaling granularity | 1 | Minimum step size for scaling changes. |
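Two of the starting values are formulas rather than constants. The sketch below evaluates them for a given scale-up latency. The function names are hypothetical, and treating the cold start time as the scale-up latency is an assumption.

```python
def event_cooldown_s(scale_up_latency_s: float) -> float:
    """Event cooldown: max(5s, 0.25 x scale-up latency)."""
    return max(5.0, 0.25 * scale_up_latency_s)


def hold_after_scale_out_s(scale_up_latency_s: float) -> float:
    """Hold after scale-out: max(60s, 2 x scale-up latency)."""
    return max(60.0, 2.0 * scale_up_latency_s)
```

With a 10-second scale-up latency, both floors apply: the cooldown is 5 seconds and the hold is 60 seconds. At 30 seconds, the cooldown rises to 7.5 seconds while the hold stays at 60 seconds. The hold exceeds its floor only when the latency is above 30 seconds.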