Skip to content

Batching

Each ModelSpec has a BatchPolicy. By default sheaf-serve uses @serve.batch with max_batch_size=32 and a 10 ms wait window. Tune both to fit your model's throughput-vs-latency curve.

from sheaf import ModelSpec
from sheaf.api.base import ModelType
from sheaf.scheduling.batch import BatchPolicy

spec = ModelSpec(
    name="forecaster",
    model_type=ModelType.TIME_SERIES,
    backend="chronos2",
    batch_policy=BatchPolicy(
        max_batch_size=64,
        timeout_ms=20,
    ),
)

max_batch_size and timeout_ms map directly onto Ray Serve's @serve.batch parameters; the deployment configures itself at boot via the per-batch runtime setters.

bucket_by — length-variable inputs

For inputs that vary in length within a batch (time-series histories of different horizons, video clips with different frame counts), padding the whole batch to the longest item wastes compute. bucket_by names a scalar field on the request; requests with the same value share a sub-batch, requests with different values go through separate batch_predict calls in the same window.

spec = ModelSpec(
    name="forecaster",
    model_type=ModelType.TIME_SERIES,
    backend="chronos2",
    batch_policy=BatchPolicy(
        max_batch_size=64,
        timeout_ms=20,
        bucket_by="horizon",   # group by forecast horizon
    ),
)

Within one Ray Serve batch window, sheaf-serve calls bucket_requests(requests, "horizon") and dispatches one batch_predict per bucket — preserving original arrival order so results map back cleanly. Requests missing the field land in the None bucket together.

The ModalServer path handles requests one at a time (no @serve.batch), so bucket_by is silently ignored there.

Adapter-aware sub-batching (LoRA)

When ModelSpec.lora is set, the bucket-by-resolved-adapter step happens automatically inside each batch window. pipeline.set_adapters on diffusers is process-global state, so two requests in the same batch with different adapter selections must dispatch separately — the deployment groups them transparently and calls set_active_adapters once per group. See LoRA multiplexing for the full design.

bucket_by and ModelSpec.lora are mutually exclusive in v1; the spec validator rejects the combination.

When the defaults are wrong

Symptom What to change
Latency dominated by timeout_ms (small load, single requests) Lower timeout_ms; or accept the latency cost in exchange for batching when load grows
Batch never fills (low RPS, large max_batch_size) Lower max_batch_size or timeout_ms; the wait window is the latency floor for solo requests
OOM at high load Lower max_batch_size; profile peak GPU memory at the cap
Mix of short + long sequences padding everything to max Add bucket_by="<length-field>"

Reference

Full schema in the Scheduling API reference.