Concepts
The runtime ideas that hold sheaf-serve together. Each page is the short version of one design decision.
- Batching: `@serve.batch`, `BatchPolicy`, and `bucket_by` for length-variable inputs.
- Caching: opt-in per-deployment LRU response cache with optional TTL.
- Streaming: `POST /{name}/stream` SSE for incremental output (FLUX progress, chunked transcripts).
- Feast integration: send a `feature_ref` instead of raw history; sheaf-serve resolves features online.
- Offline batch jobs: `BatchRunner` takes JSONL in and writes JSONL out via Ray Data, with stateless-task or warm-actor execution.
- Async-job worker: `SheafWorker`, a Redis Streams consumer with at-least-once delivery and webhook callbacks.
- LoRA multiplexing: adapter-aware sub-batching for diffusion backends; per-request adapter selection on a single deployment.
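
To make the caching entry concrete, here is a minimal standalone sketch of what "LRU with optional TTL" means as a data structure. It is not sheaf-serve's implementation or API: the class name `TTLLRUCache`, its parameters (`max_size`, `ttl`, `clock`), and the request keys are all hypothetical and chosen only for illustration.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Illustrative LRU cache with an optional time-to-live.

    Hypothetical helper for explanation only; not part of sheaf-serve.
    """

    def __init__(self, max_size, ttl=None, clock=time.monotonic):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self.ttl = ttl        # seconds; None means entries never expire
        self._clock = clock   # injectable so expiry can be tested without sleeping
        self._entries = OrderedDict()  # key -> (stored_at, value), oldest first

    def get(self, key, default=None):
        item = self._entries.get(key)
        if item is None:
            return default
        stored_at, value = item
        if self.ttl is not None and self._clock() - stored_at >= self.ttl:
            del self._entries[key]      # expired: drop it and report a miss
            return default
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the least recently used


# Usage: a two-entry cache whose entries expire after 30 seconds.
cache = TTLLRUCache(max_size=2, ttl=30.0)
cache.put("req-a", {"y": 1})
cache.put("req-b", {"y": 2})
cache.get("req-a")            # hit; "req-a" becomes most recently used
cache.put("req-c", {"y": 3})  # over capacity, so "req-b" is evicted
```

Two rules are at work here: capacity pressure evicts the least recently used entry, and the TTL independently turns a stale entry into a miss. How sheaf-serve derives cache keys from requests is covered on the Caching page and is not assumed here.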