Concepts

The runtime ideas that hold sheaf-serve together. Each page is the short version of one design decision.

Batching — @serve.batch, BatchPolicy, and bucket_by for variable-length inputs.
Caching — opt-in per-deployment LRU response cache with optional TTL.
Streaming — POST /{name}/stream returns a Server-Sent Events (SSE) stream for incremental output (FLUX progress, chunked transcripts).
Feast integration — send a feature_ref instead of raw history; sheaf-serve resolves features online.
Offline batch jobs — BatchRunner: JSONL in, JSONL out, via Ray Data, with stateless-task or warm-actor execution.
Async-job worker — SheafWorker: Redis Streams consumer with at-least-once delivery and webhook callbacks.
LoRA multiplexing — adapter-aware sub-batching for diffusion backends; per-request adapter selection on a single deployment.
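The Caching entry describes an LRU response cache with an optional TTL. The sketch below shows that eviction policy in plain Python. It is not sheaf-serve's code: `TTLLRUCache`, its parameters, and the rule that the TTL counts from when a response was stored (not when it was last read) are all assumptions made for illustration.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class TTLLRUCache:
    """Hypothetical LRU response cache with an optional per-entry TTL.

    Illustrative only; not sheaf-serve's implementation.
    """

    def __init__(self, max_entries: int, ttl_s: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self.max_entries = max_entries
        self.ttl_s = ttl_s          # None means entries never expire
        self._clock = clock         # injectable so expiry can be tested
        # key -> (stored_at, value); dict order doubles as recency order,
        # oldest (least recently used) first.
        self._data: "OrderedDict[Hashable, tuple]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the cached value, or None on a miss or an expired entry."""
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        # Assumption: TTL is measured from the time the entry was stored.
        if self.ttl_s is not None and self._clock() - stored_at >= self.ttl_s:
            del self._data[key]
            return None
        # Hit: mark the key as most recently used.
        self._data.move_to_end(key)
        return value

    def put(self, key: Hashable, value: Any) -> None:
        """Store a value, evicting least recently used entries over capacity."""
        self._data[key] = (self._clock(), value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the oldest entry
```

Injecting the clock keeps expiry testable without sleeping. A real deployment also has to decide how a request payload maps to a cache key; this sketch leaves that to the caller.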
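The LoRA multiplexing entry mentions adapter-aware sub-batching: one incoming batch is split so that each backend call runs with a single adapter, and results are put back in request order. A minimal sketch of that grouping step follows. `Request`, `run_multiplexed`, and the `run_sub_batch` callback are hypothetical names, not sheaf-serve's API; the callback stands in for whatever activates an adapter and runs the diffusion pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


@dataclass
class Request:
    prompt: str
    adapter: Optional[str] = None  # None = base model, no LoRA adapter


def run_multiplexed(
    requests: Sequence[Request],
    run_sub_batch: Callable[[Optional[str], List[str]], List[str]],
) -> List[str]:
    """Split one batch into per-adapter sub-batches; return outputs in request order."""
    # adapter -> indices of the requests that selected it, in arrival order
    groups: Dict[Optional[str], List[int]] = defaultdict(list)
    for i, req in enumerate(requests):
        groups[req.adapter].append(i)

    results: Dict[int, str] = {}
    for adapter, indices in groups.items():
        # One backend call per adapter, so each adapter is activated once per batch.
        outputs = run_sub_batch(adapter, [requests[i].prompt for i in indices])
        if len(outputs) != len(indices):
            raise RuntimeError("sub-batch returned the wrong number of outputs")
        for i, out in zip(indices, outputs):
            results[i] = out

    # Scatter sub-batch outputs back to their original positions.
    return [results[i] for i in range(len(requests))]
```

Grouping means an adapter is swapped in at most once per sub-batch rather than once per request; the trade-off is that a batch spread across many distinct adapters degrades into many small backend calls.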