Engineering · 2026-05-11 · 11 min

LLM-in-the-loop scope generation: prompt caching, structured outputs, and the deterministic fallback ladder


John C. Thomas

Founder, BlueWave Projects

The AI scope generator in BlueWave Projects is the single most-used feature in the product. A contractor opens an iPhone, walks a room, drops the scan into the portal with a couple of photos and a note, and gets back a phase-by-phase scope of work — labor, materials, contingency, tax gross-up — in about 60 seconds.

Under the hood it's the most over-engineered feature in the whole system. Most of that work is *not* the LLM call. It's the deterministic rails around the LLM call. Here's the full pipeline, told without marketing.

The input: parametric RoomPlan, not pixels

Apple's RoomPlan on iOS returns a parametric JSON description of the captured room — walls, openings, fixtures, objects, each with positions and dimensions. Not a point cloud. Not a mesh. A structured tree of named primitives. That structure is the only reason scope generation is affordable to run through an LLM at all.

The iOS app uploads:

  • The RoomPlan JSON (10-50 KB)
  • 2-5 photos (compressed to ~150 KB each)
  • A short text note from the contractor ("master bath gut, keep tub, lengthen vanity")
  • The address (for tax jurisdiction lookup)

The server flattens RoomPlan into a text fixture: "Room: 12' 4" × 9' 8", 92 sqft. Wall A: window opening 4' 2" × 3' 0" centered. Door: 2' 8" interior, north wall." Photos are described by Claude with vision in a single pre-pass, and the descriptions are cached.
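
Here's a minimal sketch of that flattening step, assuming a RoomPlan-style dict with dimensions, walls, and openings keys (the field names are illustrative, not RoomPlan's actual schema):

    from typing import Any

    def flatten_roomplan(scan: dict[str, Any]) -> str:
        # Collapse the parametric scan into the short text fixture the
        # prompt consumes, instead of sending raw geometry to the model.
        dims = scan["dimensions"]
        width, length = dims["width_ft"], dims["length_ft"]
        parts = [f"Room: {width:.1f}' x {length:.1f}', {width * length:.0f} sqft."]
        for wall in scan.get("walls", []):
            for opening in wall.get("openings", []):
                parts.append(
                    f"Wall {wall['label']}: {opening['kind']} opening "
                    f"{opening['width_ft']:.1f}' x {opening['height_ft']:.1f}'."
                )
        return " ".join(parts)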

Prompt design: pinned context, fluid tail

The full system prompt is around 8K tokens and pinned with cache_control: ephemeral. It contains:

  • Role + constraints — "You are a Hawaii design-build contractor's scope generator. Output JSON conforming to the schema below. Do not produce prose. Do not invent line items not derivable from inputs."
  • The schema — phase enum, line-item structure, required ranges (labor low/high, material low/high, contingency %), tax gross-up rules.
  • Worked examples — 4 fully-completed scopes from prior accepted jobs. These are the gold reference for tone, granularity, and pricing realism.
  • The tenant's pricing context — most recent 20 accepted line-item prices from this tenant, anonymized to category. This is the per-tenant signal that makes the output feel like *yours*.
  • The active scan + photos + note — this is the only part that changes per call.

That structure means parts 1-4 (≈ 7.8K tokens) hit the cache on every call. Only the last ~300-500 tokens cost full input rates. At Claude API pricing today that brings the per-call input cost down by roughly 90%.
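
In code, the pinned call looks roughly like this with the Anthropic Python SDK. The block boundaries, variable contents, and model id are illustrative; the load-bearing detail is that every static block carries cache_control and the per-job tail does not:

    import anthropic

    client = anthropic.Anthropic()

    role_and_schema = "..."  # parts 1-2: role, constraints, output schema (static)
    worked_examples = "..."  # part 3: four gold-reference scopes (static)
    tenant_pricing = "..."   # part 4: this tenant's recent accepted prices (static)
    job_tail = "..."         # part 5: scan fixture + photo notes + contractor note

    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model id
        max_tokens=4096,
        system=[
            # Static prefix blocks, marked cacheable across calls.
            {"type": "text", "text": role_and_schema,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": worked_examples,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": tenant_pricing,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Fluid tail: the only part that changes per call.
        messages=[{"role": "user", "content": job_tail}],
    )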

Structured outputs: Pydantic schemas, not regex

Every scope call uses Anthropic's tool-use / structured-output mode with a Pydantic v2 schema. The model gets a tool definition that exactly matches the schema:

    from enum import Enum
    from pydantic import BaseModel

    class Phase(str, Enum):
        demo = "demo"
        framing = "framing"
        electrical = "electrical"
        plumbing = "plumbing"
        finish = "finish"

    class LineItem(BaseModel):
        phase: Phase
        description: str
        labor_low: float
        labor_high: float
        material_low: float
        material_high: float
        contingency_pct: float

    class Scope(BaseModel):
        summary: str
        phases: list[Phase]
        items: list[LineItem]
        tax_gross_up_pct: float

The model can't return malformed JSON. It can't omit required fields. It can't return a phase that isn't in the enum. Pydantic validates on receipt, raises on mismatch, and the retry loop knows what to send back to the model.
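
Wired up, the tool definition comes straight from the Pydantic model. A sketch, reusing the client and job_tail from the earlier snippet (the tool name emit_scope is made up for illustration):

    tool = {
        "name": "emit_scope",
        "description": "Return the completed scope of work as structured data.",
        "input_schema": Scope.model_json_schema(),  # Pydantic v2 -> JSON Schema
    }

    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model id
        max_tokens=4096,
        tools=[tool],
        tool_choice={"type": "tool", "name": "emit_scope"},  # force the tool call
        messages=[{"role": "user", "content": job_tail}],
    )

    # The tool_use block's input arrives as already-parsed JSON; Pydantic is the gate.
    block = next(b for b in response.content if b.type == "tool_use")
    scope = Scope.model_validate(block.input)  # raises ValidationError on mismatch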

This is the single biggest cost saving in the pipeline. Before structured outputs we lost ~8% of calls to bad JSON. Now it's effectively zero.

The deterministic fallback ladder

The first version of the scope generator was "call Claude, return the result." It was 95% magic and 5% disaster. The disasters were the part that got the bug reports.

Now there's a four-rung ladder:

Rung 1: Schema validation. Output passes Pydantic? Ship.

Rung 2: Retry with the validation error. Pydantic raised on field X? Send the error back to Claude with "fix this field, leave the rest alone" and try again. Maximum two retries.

Rung 3: Retry with a smaller context. The model can flake when the prompt is at the edge of its attention. Truncate the worked examples from 4 to 2, retry once.

Rung 4: Template + narrative inserts. If every model attempt has failed, fall back to a deterministic phase skeleton (demo, framing, electrical, plumbing, finish) with template ranges *for this tenant's pricing context* and ask Claude only for the human-readable summary and per-phase rationale. Ranges come from data; prose comes from the model. Worst-case scope still ships.

In production we hit rung 1 on ~94% of calls, rung 2 on ~5%, rung 3 on <1%, and rung 4 on virtually nothing. The ladder exists so the failure mode is "less personalized scope," never "no scope, sorry, try again."
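
Compressed to a sketch, the ladder is one function. build_prompt, call_model, template_scope, and narrate here are stand-ins for the real internals; the control flow is the point:

    from pydantic import ValidationError

    def generate_scope(job, tenant) -> Scope:
        prompt = build_prompt(job, tenant, examples=4)
        # Rung 1: happy path -- output validates on the first try.
        try:
            return call_model(prompt)
        except ValidationError as err:
            last_error = err
        # Rung 2: feed the validation error back, at most two retries.
        for _ in range(2):
            try:
                return call_model(prompt, repair_hint=str(last_error))
            except ValidationError as err:
                last_error = err
        # Rung 3: shrink the context (2 worked examples instead of 4), retry once.
        try:
            return call_model(build_prompt(job, tenant, examples=2))
        except ValidationError:
            pass
        # Rung 4: deterministic skeleton; the model only writes the prose.
        skeleton = template_scope(job, tenant)  # ranges from tenant pricing data
        skeleton.summary = narrate(skeleton)    # summary + rationale from Claude
        return skeleton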

Multi-tenant prompt isolation

Every call is scoped to a tenant. The tenant's pricing context is fetched fresh per call, the tenant_id is in the system prompt, and the output is written to a tenant-scoped table. There is no single shared cache key that could leak one tenant's pricing into another tenant's output. The cache key is (system_prompt_version, tenant_id), not just system_prompt_version.
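
As a sketch, with a hypothetical version tag:

    SYSTEM_PROMPT_VERSION = "v12"  # hypothetical prompt version tag

    def scope_cache_key(tenant_id: str) -> str:
        # Tenant-scoped: two tenants can never share a cached prefix, so one
        # tenant's pricing context can never surface in another's output.
        return f"{SYSTEM_PROMPT_VERSION}:{tenant_id}"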

This sounds paranoid. It's necessary. Hawaii has fewer than 5,000 active GCs. If two competitors' calls ever shared a cache entry, it would be a *small* number of people who got hurt — but it's the kind of problem you'd never recover from.

Cost + latency numbers in production

Today's averages over the last 1,000 calls:

  • Input: ~8,400 tokens (95% cache reads)
  • Output: ~1,400 tokens
  • Latency p50: 9.4s · p99: 18.6s
  • Cost per call: about $0.045 (Claude Sonnet pricing; varies month to month)
  • Cache hit rate on the static blocks: 98.7%

A contractor's time billed at $120/hr would value an 8-minute saving at $16. The scope generator runs at ~$0.04. The economics are obscene in our favor, and we sell the feature on time saved, not cost per token.

What I'd tell another engineer building this

  • Pin the system prompt with cache_control. It's the single line that makes the economics work.
  • Use structured outputs from day one. Don't write a JSON parser. Don't write a JSON repair function. The schema is the contract.
  • Build the deterministic ladder before you scale. The day-one, model-only path is fine for a demo. It is not fine for a paying customer's first scope.
  • Tenant-scope the cache key. Especially in regulated industries.
  • Measure rung-1-hit-rate weekly. A drop is your earliest signal that the model has shifted under you. It's a much better canary than user complaints.

The scope generator looks like magic in a 60-second demo. Most of what makes it work is the unmagical machinery on either side of the Claude call.
