Itinerary v2 Prompt A/B — OLD (names-only) vs NEW (collection-driven)

Date: 2026-06-25
Question: Is Ashwin's collection-driven itinerary-v2 prompt (branch design/travel-preferences-tab) actually better at using a user's saves than main's current names-only prompt?
Verdict: Yes — decisively. NEW wins all 4 scenarios; every one of the 11 returned blind judge verdicts picked NEW. Recommendation: land NEW, with 3 small prompt tunings.

What "OLD" and "NEW" are

| | OLD (current main) | NEW (Ashwin's branch) |
|---|---|---|
| Saves delivery | 3 soft blocks: USER SAVES ("weave in, ignore mismatches") + PRIORITY SAVES (hearted) + KNOWN PLACES | 1 block: CANDIDATE POOLS BY CITY, pre-built, scored, tagged [HEARTED]/[SAVED]/[ICONIC]/[GEM]/[SEEDED] |
| Who picks places | Model picks/invents; saves are hints | Model sequences a pre-built pool; tagged items MUST appear; invented = explicit (gap-fill) |
| Eateries | Included in saves (told to ignore mismatches) | Excluded (day-eligible only; eateries → food rail) |
| Day allocation | Model allocates | Deterministic DAY ALLOCATION + post-gen repairCityCoverage |
| Seasonality | none | SEASONALITY block (month + climate) |
| Item schedule | time: morning/afternoon labels | ordered list = the schedule |

candidate-pool.ts (the scorer) is net-new; main still runs OLD. The system prompt is identical on both sides, so this A/B isolates the user-message structure.
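
To make the NEW column concrete, here is a minimal sketch of how a scored, tagged pool could be rendered into a single CANDIDATE POOLS BY CITY block. The `Candidate` shape, the `score` convention, the line format, and the example tags and scores are all assumptions for illustration; candidate-pool.ts itself is not reproduced in this report and may differ.

```typescript
// Hypothetical shapes for illustration; not taken from candidate-pool.ts.
type Tag = "HEARTED" | "SAVED" | "ICONIC" | "GEM" | "SEEDED";

interface Candidate {
  name: string;
  city: string;
  tag: Tag;
  score: number; // assumed convention: higher score = stronger candidate
}

// Render one block: a header, then each city (in first-seen order)
// followed by its candidates sorted best-first with their tag.
function renderCandidatePools(candidates: Candidate[]): string {
  const cities: string[] = [];
  candidates.forEach((c) => {
    if (cities.indexOf(c.city) === -1) cities.push(c.city);
  });

  const lines: string[] = ["CANDIDATE POOLS BY CITY"];
  cities.forEach((city) => {
    lines.push(`${city}:`);
    candidates
      .filter((c) => c.city === city)
      .sort((a, b) => b.score - a.score)
      .forEach((c) => {
        lines.push(`  - ${c.name} [${c.tag}]`);
      });
  });
  return lines.join("\n");
}

// Example tags and scores are made up; only the place names come from the report.
const pools = renderCandidatePools([
  { name: "Osaka Castle", city: "Osaka", tag: "ICONIC", score: 0.7 },
  { name: "Osaka Aquarium Kaiyukan", city: "Osaka", tag: "SAVED", score: 0.9 },
  { name: "Fushimi Inari", city: "Kyoto", tag: "SAVED", score: 0.8 },
]);
console.log(pools);
```

Because the list is already ordered, the model's job shrinks to sequencing, which matches the "ordered list = the schedule" row above.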

Method

Reproduce via workers/api/scripts/bench-render.ts (render) + bench-score.ts (deterministic score) + the itinerary-prompt-ab workflow.


Results

Scenario wins: NEW 4 — OLD 0 (11/11 blind verdicts → NEW)

Opus panel — dimension averages (0-10)

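| Dimension | OLD | NEW | Δ | Winner |
|---|---|---|---|---|
| saves_coverage | 7.81 | 8.87 | +1.06 | NEW |
| schema_validity | 7.75 | 9.79 | +2.04 | NEW |
| invented_discipline | 8.16 | 8.46 | +0.30 | NEW (mixed) |
| city_lock | 9.88 | 9.75 | -0.13 | tie |
| day_balance | 8.75 | 8.38 | -0.37 | OLD (small) |
| narrative_quality | 8.67 | 8.42 | -0.25 | OLD (small) |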

Deterministic cross-check (ground-truth coverage + city-lock)

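| Scenario | OLD coverage | NEW coverage | OLD lock | NEW lock |
|---|---|---|---|---|
| s1 Thailand | 19/40 (48%) | 21/40 (53%) | 3/3 | 3/3 |
| s2 Italy | 10/10 (100%) | 10/10 (100%) | 3/3 | 3/3 |
| s3 Japan | 12/17 (71%) | 16/17 (94%) | 2/2 | 2/2 |
| s4 India | 9/11 (82%) | 11/11 (100%) | 3/3 | 3/3 |
| Average | 75% | 87% | 100% | 100% |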

Panel and deterministic scorer agree: NEW covers more of the user's real saves, with city-lock effectively tied at 100%.
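
The Average row can be re-derived from the per-scenario counts. The sketch below assumes the average is an unweighted mean of the four scenario ratios, which reproduces the reported 75% and 87%. It is a worked check on the table, not the code in bench-score.ts, and the type and function names are made up.

```typescript
// Counts copied from the deterministic cross-check table above.
interface ScenarioCoverage {
  scenario: string;
  oldHit: number;
  newHit: number;
  total: number;
}

const coverageRows: ScenarioCoverage[] = [
  { scenario: "s1 Thailand", oldHit: 19, newHit: 21, total: 40 },
  { scenario: "s2 Italy", oldHit: 10, newHit: 10, total: 10 },
  { scenario: "s3 Japan", oldHit: 12, newHit: 16, total: 17 },
  { scenario: "s4 India", oldHit: 9, newHit: 11, total: 11 },
];

// Unweighted mean of per-scenario coverage ratios, as a whole percentage.
function meanCoveragePct(ratios: number[]): number {
  const sum = ratios.reduce((acc, r) => acc + r, 0);
  return Math.round((sum / ratios.length) * 100);
}

const oldAvgPct = meanCoveragePct(coverageRows.map((r) => r.oldHit / r.total));
const newAvgPct = meanCoveragePct(coverageRows.map((r) => r.newHit / r.total));

console.log(oldAvgPct, newAvgPct, newAvgPct - oldAvgPct); // 75 87 12
```

Each scenario counts equally here. Pooling all saves instead would weight s1 Thailand most heavily, since it holds 40 of the 78 saves.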


Why NEW wins

  1. It actually uses the saves (the whole point). Average deterministic coverage rises by 12 percentage points (75% → 87%). In s3 Japan, OLD drops the user's saved iconics (Osaka Aquarium Kaiyukan and Universal Studios Japan) and spends slots on model-chosen landmarks (Osaka Castle, Dotonbori); NEW schedules all 14 Osaka must-haves plus both Kyoto items (16/16 must-haves; the deterministic check counts 16/17 saves). In s4 India, OLD leaves 2 saves unscheduled; NEW gets 11/11. The thin-city seed top-up lets NEW fill days from the corpus instead of inventing.
  2. It stops scheduling restaurants as activities (biggest gap, +2.04 schema). OLD's USER SAVES + PRIORITY SAVES feed eateries — including a hearted restaurant — into the day list, and the model schedules "Ristorante Belvedere" / "Il Riccio" as kind:activity (the schema says never a restaurant). This tanked OLD's schema score in s2 Italy (4.83 vs 10). NEW's pool is day-eligible-only, so eateries route to the food rail. OLD is actively misusing the saves it's given.
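
The day-eligible split in point 2 can be pictured as a filter applied before the pool is built. This is a sketch under assumptions: the `Save` shape and its `kind` field are hypothetical, and the report does not say which of the two restaurants was the hearted one, so the flag below is a guess.

```typescript
// Hypothetical save record; the real data model is not shown in this report.
type SaveKind = "activity" | "eatery";

interface Save {
  name: string;
  kind: SaveKind;
  hearted: boolean;
}

// Split saves so that only non-eateries can reach the day list;
// eateries are kept, but routed to a separate food rail.
function splitSaves(saves: Save[]): { dayEligible: Save[]; foodRail: Save[] } {
  return {
    dayEligible: saves.filter((s) => s.kind !== "eatery"),
    foodRail: saves.filter((s) => s.kind === "eatery"),
  };
}

const split = splitSaves([
  { name: "Villa del Balbianello", kind: "activity", hearted: false },
  // Assumed to be the hearted restaurant; the report does not name it.
  { name: "Ristorante Belvedere", kind: "eatery", hearted: true },
  { name: "Il Riccio", kind: "eatery", hearted: false },
]);

console.log(split.dayEligible.map((s) => s.name)); // ["Villa del Balbianello"]
console.log(split.foodRail.map((s) => s.name));    // ["Ristorante Belvedere", "Il Riccio"]
```

The hearted flag plays no part in the split, so under this sketch a hearted restaurant still cannot be scheduled as kind:activity.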

Where NEW is weaker — the 3 things to tune in the port

NEW's only losses are small (day_balance −0.37, narrative −0.25, within panel noise), but the rationales surface 3 concrete, fixable weaknesses:

  1. Hearted items can still get dropped in a dense city. s1 Bangkok: NEW scheduled 3/5 hearted vs OLD's 4/5. The pool is best-first and repairCityCoverage guarantees city coverage, but nothing guarantees every hearted item survives when the pool exceeds the day cap.
  2. Gap-fill wanders. The (gap-fill) escape hatch let the model place a duplicate (s3: Fushimi Inari twice) and a Jaipur mall on a Jodhpur day (s4). Gap-fill isn't constrained to the day's city or de-duped.
  3. "Every SAVED/ICONIC MUST appear" overloads days. s3 day-3 paired a full-day Universal Studios with 4 more stops. The hard must-include fights the pace cap when the pool is large.
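
Weakness 2 is mechanical enough to detect after generation. The sketch below flags gap-fill items that duplicate an earlier item or sit outside their day's city. The `Day` and `ScheduledItem` shapes and the `gapFill` flag are hypothetical, and the sample itinerary mixes the s3 and s4 cases into one trip purely for illustration.

```typescript
// Hypothetical output shapes; the real itinerary schema is not shown in this report.
interface ScheduledItem {
  name: string;
  city: string;     // city the place is actually in
  gapFill: boolean; // true when the model marked the item "(gap-fill)"
}

interface Day {
  city: string;
  items: ScheduledItem[];
}

// Report gap-fill items that repeat an earlier item or fall outside the day's city.
function findGapFillViolations(days: Day[]): string[] {
  const seen: string[] = [];
  const violations: string[] = [];
  days.forEach((day, index) => {
    day.items.forEach((item) => {
      const key = item.name.trim().toLowerCase();
      if (item.gapFill && seen.indexOf(key) !== -1) {
        violations.push(`day ${index + 1}: gap-fill "${item.name}" duplicates an earlier item`);
      }
      if (item.gapFill && item.city !== day.city) {
        violations.push(`day ${index + 1}: gap-fill "${item.name}" is in ${item.city}, not ${day.city}`);
      }
      seen.push(key);
    });
  });
  return violations;
}

const violations = findGapFillViolations([
  { city: "Kyoto", items: [{ name: "Fushimi Inari", city: "Kyoto", gapFill: false }] },
  { city: "Kyoto", items: [{ name: "Fushimi Inari", city: "Kyoto", gapFill: true }] },
  { city: "Jodhpur", items: [{ name: "Jaipur mall", city: "Jaipur", gapFill: true }] },
]);
violations.forEach((v) => console.log(v));
```

A check like this could run next to the deterministic scorer, so that gap-fill drift shows up as a count and not only in judge rationales.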

Data note: "Villa del Balbianello" is a Lake Como villa mis-saved under Rome — it hurt both variants. NEW's MUST-include rule forces the mis-geocoded save in; a pre-pool geo-sanity filter would help, but it's a data issue, not a prompt issue.

Recommendation

Land NEW (the collection-driven rewrite). It wins the two load-bearing dimensions — saves coverage and schema validity — by clear margins, and ties on city-lock. Its losses are marginal and addressable. Graft 3 small tweaks from OLD's strengths into NEW's CANDIDATE POOLS BY CITY rules block (prompts-v2.ts) during the port — no architecture change:

  1. Re-add an explicit hearted hard-include line (port OLD's PRIORITY SAVES idea) so dense-city hearted items can't be silently dropped; optionally extend repairCityCoverage to inject any unplaced [HEARTED] item.
  2. Tighten the gap-fill rule: gap-fill must be in the day's own city and must not duplicate an already-scheduled item.
  3. Soften "every SAVED/ICONIC MUST appear" → "schedule best-first; cover as many as the pace allows" when a city's pool exceeds its day capacity, so the must-include stops overloading days; keep repairCityCoverage as the per-day floor.
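
Tweaks 1 and 3 interact, so it can help to view them as a single selection rule: hearted items always survive, and everything else is taken best-first until the city's capacity is reached. The sketch below is one possible reading. `selectForCity`, the `PoolItem` shape, and the `capacity` parameter are hypothetical; this is not how repairCityCoverage or prompts-v2.ts currently behave.

```typescript
// Hypothetical pool item; tag names follow the CANDIDATE POOLS BY CITY legend.
type PoolTag = "HEARTED" | "SAVED" | "ICONIC" | "GEM" | "SEEDED";

interface PoolItem {
  name: string;
  tag: PoolTag;
  score: number; // assumed convention: higher is better
}

// Keep every hearted item, then fill the remaining capacity best-first.
// If hearted items alone exceed capacity, this sketch keeps them all anyway.
function selectForCity(pool: PoolItem[], capacity: number): PoolItem[] {
  const hearted = pool.filter((p) => p.tag === "HEARTED");
  const rest = pool
    .filter((p) => p.tag !== "HEARTED")
    .sort((a, b) => b.score - a.score);
  const room = Math.max(0, capacity - hearted.length);
  return hearted.concat(rest.slice(0, room));
}

// Placeholder names and scores, not data from the benchmark.
const selected = selectForCity(
  [
    { name: "Place A", tag: "SAVED", score: 0.9 },
    { name: "Place B", tag: "ICONIC", score: 0.8 },
    { name: "Place C", tag: "GEM", score: 0.5 },
    { name: "Place D", tag: "HEARTED", score: 0.2 },
  ],
  3,
);
console.log(selected.map((p) => p.name)); // ["Place D", "Place A", "Place B"]
```

The low-scoring hearted item displaces the weakest non-hearted candidate rather than being dropped, which is the behaviour tweak 1 asks for, and the capacity cap stands in for the softened must-include of tweak 3.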

Net: the rewrite is the right call and directly delivers the "uses more saves" goal (+12pts coverage), while also fixing a real OLD bug (restaurants-as-activities). The 3 tunings recover NEW's only soft spots.