Itinerary v2 Prompt A/B — OLD (names-only) vs NEW (collection-driven)

Date: 2026-06-25

Question: Is Ashwin's collection-driven itinerary-v2 prompt (branch design/travel-preferences-tab) actually better at using a user's saves than main's current names-only prompt?

Verdict: Yes — decisively. NEW wins all 4 scenarios; every one of the 11 returned blind judge verdicts picked NEW. Recommendation: land NEW, with 3 small prompt tunings.

What "OLD" and "NEW" are

Aspect	OLD (current `main`)	NEW (Ashwin's branch)
Saves delivery	3 soft blocks: `USER SAVES` ("weave in, ignore mismatches") + `PRIORITY SAVES` (hearted) + `KNOWN PLACES`	1 block: `CANDIDATE POOLS BY CITY` — pre-built, scored, tagged `[HEARTED]/[SAVED]/[ICONIC]/[GEM]/[SEEDED]`
Who picks places	Model picks/invents; saves are hints	Model sequences a pre-built pool; tagged items MUST appear; invented = explicit `(gap-fill)`
Eateries	Included in saves (told to ignore mismatches)	Excluded (day-eligible only; eateries → food rail)
Day allocation	Model allocates	Deterministic `DAY ALLOCATION` + post-gen `repairCityCoverage`
Seasonality	None	`SEASONALITY` block (month + climate)
Item schedule	`time: morning/afternoon` labels	Ordered list = the schedule

candidate-pool.ts (the scorer) is net-new; main still runs OLD. The system prompt is identical on both sides, so this A/B isolates the user-message structure.
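
To make the NEW column concrete, here is a minimal sketch of how a tagged pool might be rendered into the `CANDIDATE POOLS BY CITY` block. Only the tag names come from the table above; the `Candidate` shape, the `renderPools` helper, the score scale, and the line layout are assumptions for illustration and are not taken from candidate-pool.ts.

```typescript
// Hypothetical shapes: only the tag names come from the comparison table.
type Tag = "HEARTED" | "SAVED" | "ICONIC" | "GEM" | "SEEDED";

interface Candidate {
  name: string;
  city: string;
  tag: Tag;
  score: number; // assumed: higher means a stronger candidate
}

// Group candidates by city and list each city's pool best-first.
function renderPools(candidates: Candidate[]): string {
  const byCity: Record<string, Candidate[]> = Object.create(null);
  const cityOrder: string[] = [];
  for (const c of candidates) {
    if (byCity[c.city] === undefined) {
      byCity[c.city] = [];
      cityOrder.push(c.city);
    }
    byCity[c.city].push(c);
  }
  const lines: string[] = ["CANDIDATE POOLS BY CITY"];
  for (const city of cityOrder) {
    lines.push(city + ":");
    const sorted = byCity[city].slice().sort((a, b) => b.score - a.score);
    for (const c of sorted) lines.push("  [" + c.tag + "] " + c.name);
  }
  return lines.join("\n");
}

// Made-up example data; tags and scores are illustrative only.
const poolBlock = renderPools([
  { name: "Osaka Castle", city: "Osaka", tag: "SEEDED", score: 0.4 },
  { name: "Osaka Aquarium Kaiyukan", city: "Osaka", tag: "SAVED", score: 0.9 },
  { name: "Fushimi Inari", city: "Kyoto", tag: "ICONIC", score: 0.8 },
]);
```

The point of the sketch is the division of labour: ordering and tagging happen in code before the model sees anything, so the model's job shrinks to sequencing.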

Method

Real data. Saves were pulled read-only from prod verso-db for 2 real, saves-heavy users across 4 countries. The 4 scenarios exercise different shapes:
- s1 Thailand 7d — Bangkok(38 saves/5♥)+Phuket+Krabi. Dense + hearted + thin-city top-up.
- s2 Italy 6d — Rome+Amalfi+Capri. Thin saves, hearted spread, eateries among saves.
- s3 Japan 5d — Osaka(18)+Kyoto(3). Dense + thin city top-up in one trip.
- s4 India/Rajasthan 7d — Jaipur+Jodhpur+Jaisalmer. All-iconic, zero hearted.
Identical inputs. Both prompts were rendered from the same fixtures: NEW via the branch's own pure assemblePool/scoreSave, OLD via a faithful re-implementation of main's loadSavedPlaceNamesByCity + getTopCuratedPlaceNamesByCity.
Generation = Sonnet subagents, no product API call. Each prompt was replayed by a Claude Sonnet subagent acting as the production model (8192-token ceiling, strict schema), for 8 generations in total (4 scenarios × 2 prompts).
Judging = Opus panel, blind. 3 Claude Opus judges per scenario, A/B blinded, each scored 6 dimensions against a variant-neutral ground truth. That makes 12 judge runs (4 scenarios × 3); one s4 verdict didn't return, leaving 11 valid.
Deterministic cross-check. A ground-truth scorer computes hard saves-coverage + city-lock from the generated plans, independent of judge opinion.

Reproduce via workers/api/scripts/bench-render.ts (render) + bench-score.ts (deterministic score) + the itinerary-prompt-ab workflow.
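
As a rough illustration of what the deterministic cross-check computes, the sketch below scores saves coverage and city-lock for one generated plan. The `TruthSave` and `PlanDay` shapes, the name normalization, and especially the city-lock definition are assumptions; the actual bench-score.ts may differ.

```typescript
// Hypothetical shapes; the real bench-score.ts may look different.
interface TruthSave { name: string; city: string }
interface PlanDay { city: string; items: string[] } // item names in schedule order

const normalize = (s: string): string => s.trim().toLowerCase();

// Coverage: how many ground-truth saves appear anywhere in the plan.
function savesCoverage(plan: PlanDay[], truth: TruthSave[]): { hit: number; total: number } {
  const scheduled: Record<string, boolean> = Object.create(null);
  for (const day of plan) {
    for (const name of day.items) scheduled[normalize(name)] = true;
  }
  const hit = truth.filter((s) => scheduled[normalize(s.name)] === true).length;
  return { hit, total: truth.length };
}

// City-lock (assumed definition): a trip city counts as locked when it has at
// least one day and none of its days schedules a save that belongs elsewhere.
function cityLock(
  plan: PlanDay[],
  truth: TruthSave[],
  cities: string[]
): { locked: number; total: number } {
  const homeCity: Record<string, string> = Object.create(null);
  for (const s of truth) homeCity[normalize(s.name)] = s.city;
  let locked = 0;
  for (const city of cities) {
    const days = plan.filter((d) => d.city === city);
    const clean = days.every((d) =>
      d.items.every((name) => {
        const home = homeCity[normalize(name)];
        return home === undefined || home === city;
      })
    );
    if (days.length > 0 && clean) locked += 1;
  }
  return { locked, total: cities.length };
}

// Made-up fixture: 3 saves, one of them never scheduled.
const truth: TruthSave[] = [
  { name: "Osaka Aquarium Kaiyukan", city: "Osaka" },
  { name: "Universal Studios Japan", city: "Osaka" },
  { name: "Fushimi Inari", city: "Kyoto" },
];
const plan: PlanDay[] = [
  { city: "Osaka", items: ["Osaka Aquarium Kaiyukan", "Osaka Castle"] },
  { city: "Kyoto", items: ["Fushimi Inari"] },
];
const coverage = savesCoverage(plan, truth); // 2 of 3 saves scheduled
const lock = cityLock(plan, truth, ["Osaka", "Kyoto"]); // both cities locked
```

Because both numbers are pure functions of the plan and the fixture, they can be recomputed without re-running any judge.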

Results

Scenario wins: NEW 4 — OLD 0 (11/11 blind verdicts → NEW)

Opus panel — dimension averages (0-10)

Dimension	OLD	NEW	Δ	Winner
saves_coverage	7.81	8.87	+1.06	NEW
schema_validity	7.75	9.79	+2.04	NEW
invented_discipline	8.16	8.46	+0.30	NEW (mixed)
city_lock	9.88	9.75	-0.13	tie
day_balance	8.75	8.38	-0.37	OLD (small)
narrative_quality	8.67	8.42	-0.25	OLD (small)

Deterministic cross-check (ground-truth coverage + city-lock)

Scenario	OLD coverage	NEW coverage	OLD lock	NEW lock
s1 Thailand	19/40 (48%)	21/40 (53%)	3/3	3/3
s2 Italy	10/10 (100%)	10/10 (100%)	3/3	3/3
s3 Japan	12/17 (71%)	16/17 (94%)	2/2	2/2
s4 India	9/11 (82%)	11/11 (100%)	3/3	3/3
Average	75%	87%	100%	100%

Panel and deterministic scorer agree: NEW covers more of the user's real saves, while city-lock is a tie (9.88 vs 9.75 from the panel; 100% for both variants in the deterministic check).
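
The Average row in the deterministic table is the mean of the four per-scenario percentages, so a 10-save scenario weighs as much as a 40-save one. The short check below reproduces that row from the raw counts and, for comparison, computes the pooled rate (total covered over total saves). The variable names are illustrative; the counts are copied from the table.

```typescript
// Counts copied from the deterministic cross-check table: [covered, total].
interface Row { scenario: string; old: [number, number]; neu: [number, number] }

const rows: Row[] = [
  { scenario: "s1 Thailand", old: [19, 40], neu: [21, 40] },
  { scenario: "s2 Italy", old: [10, 10], neu: [10, 10] },
  { scenario: "s3 Japan", old: [12, 17], neu: [16, 17] },
  { scenario: "s4 India", old: [9, 11], neu: [11, 11] },
];

const pct = (pair: [number, number]): number => (100 * pair[0]) / pair[1];
const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

// Macro average: mean of the per-scenario percentages (the table's "Average" row).
const oldMacro = mean(rows.map((r) => pct(r.old))); // about 74.98
const newMacro = mean(rows.map((r) => pct(r.neu))); // about 86.65

// Pooled average: total covered / total saves across all scenarios.
const sum = (pick: (r: Row) => number): number => rows.reduce((acc, r) => acc + pick(r), 0);
const oldPooled = (100 * sum((r) => r.old[0])) / sum((r) => r.old[1]); // 50 / 78
const newPooled = (100 * sum((r) => r.neu[0])) / sum((r) => r.neu[1]); // 58 / 78
```

Rounded, the macro averages are 75% and 87% (a gap of about 12 points). The pooled rates come out lower, roughly 64% and 74%, because s1 Thailand dominates the totals, but the direction is the same.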

Why NEW wins

It actually uses the saves (the whole point). Deterministic coverage rises 12 points (75% → 87%). In s3 Japan, OLD drops the user's saved iconics (Osaka Aquarium Kaiyukan and Universal Studios Japan) and spends the slots on model-chosen landmarks (Osaka Castle, Dotonbori); NEW schedules all 14 Osaka must-haves plus both Kyoto items (16/16 must-haves, which the deterministic table scores as 16/17 against the ground truth). In s4 India, OLD leaves 2 saves unscheduled; NEW gets 11/11. The thin-city seed top-up lets NEW fill days from the corpus instead of inventing.
It stops scheduling restaurants as activities (biggest gap, +2.04 schema). OLD's USER SAVES + PRIORITY SAVES feed eateries — including a hearted restaurant — into the day list, and the model schedules "Ristorante Belvedere" / "Il Riccio" as kind:activity (the schema says never a restaurant). This tanked OLD's schema score in s2 Italy (4.83 vs 10). NEW's pool is day-eligible-only, so eateries route to the food rail. OLD is actively misusing the saves it's given.
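
The day-eligible rule can be pictured as a simple partition applied before the pool is built. This is a sketch under assumptions: the `Save` shape, the `kind` and `hearted` field names, and `splitSaves` are hypothetical, and the example data (including which items are hearted) is made up.

```typescript
// Hypothetical save shape; `kind` and `hearted` are assumed field names.
interface Save { name: string; kind: "activity" | "eatery"; hearted: boolean }

// Split saves so only day-eligible items reach the day pool;
// eateries go to the food rail even when hearted.
function splitSaves(saves: Save[]): { dayPool: Save[]; foodRail: Save[] } {
  const dayPool: Save[] = [];
  const foodRail: Save[] = [];
  for (const s of saves) {
    if (s.kind === "eatery") foodRail.push(s);
    else dayPool.push(s);
  }
  return { dayPool, foodRail };
}

// Illustrative data; the hearted flags are made up.
const split = splitSaves([
  { name: "Colosseum", kind: "activity", hearted: false },
  { name: "Ristorante Belvedere", kind: "eatery", hearted: true },
  { name: "Il Riccio", kind: "eatery", hearted: false },
]);
```

Filtering in code rather than asking the model to "ignore mismatches" removes the failure mode entirely: a restaurant that never enters the day pool cannot be scheduled as an activity.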

Where NEW is weaker — the 3 things to tune in the port

NEW's only losses are small (day_balance −0.37, narrative −0.25, within panel noise), but the rationales surface 3 concrete, fixable weaknesses:

- Hearted items can still get dropped in a dense city. s1 Bangkok: NEW scheduled 3/5 hearted vs OLD's 4/5. The pool is best-first and repairCityCoverage guarantees city coverage, but nothing guarantees every hearted item survives when the pool exceeds the day cap.
- Gap-fill wanders. The (gap-fill) escape hatch let the model place a duplicate (s3: Fushimi Inari twice) and a Jaipur mall on a Jodhpur day (s4). Gap-fill isn't constrained to the day's city or de-duped.
- "Every SAVED/ICONIC MUST appear" overloads days. s3 day-3 paired a full-day Universal Studios with 4 more stops. The hard must-include fights the pace cap when the pool is large.

Data note: "Villa del Balbianello" is a Lake Como villa mis-saved under Rome — it hurt both variants. NEW's MUST-include rule forces the mis-geocoded save in; a pre-pool geo-sanity filter would help, but it's a data issue, not a prompt issue.
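
One possible shape for such a geo-sanity filter is a distance check against the city centre before a save enters the pool. The sketch below uses the standard haversine formula; the `isPlausiblyInCity` helper, the 50 km threshold, and the coordinates (approximate, for illustration only) are assumptions rather than anything in the branch.

```typescript
interface LatLng { lat: number; lng: number }

const toRad = (deg: number): number => (deg * Math.PI) / 180;

// Great-circle distance in kilometres (haversine formula, mean Earth radius).
function haversineKm(a: LatLng, b: LatLng): number {
  const R = 6371;
  const dLat = toRad(b.lat - a.lat);
  const dLng = toRad(b.lng - a.lng);
  const h =
    Math.pow(Math.sin(dLat / 2), 2) +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.pow(Math.sin(dLng / 2), 2);
  return 2 * R * Math.asin(Math.sqrt(h));
}

// Assumed rule: keep a save only if it lies within maxKm of the city centre.
function isPlausiblyInCity(place: LatLng, cityCentre: LatLng, maxKm: number): boolean {
  return haversineKm(place, cityCentre) <= maxKm;
}

// Approximate coordinates, for illustration only.
const rome: LatLng = { lat: 41.9, lng: 12.5 };
const villaDelBalbianello: LatLng = { lat: 45.97, lng: 9.2 };
const keep = isPlausiblyInCity(villaDelBalbianello, rome, 50); // false: hundreds of km away
```

A fixed radius is crude (regions such as Amalfi are not a single point), so the threshold would likely need to vary per destination.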

Recommendation

Land NEW (the collection-driven rewrite). It wins the two load-bearing dimensions — saves coverage and schema validity — by clear margins, and ties on city-lock. Its losses are marginal and addressable. Graft 3 small tweaks from OLD's strengths into NEW's CANDIDATE POOLS BY CITY rules block (prompts-v2.ts) during the port — no architecture change:

- Re-add an explicit hearted hard-include line (port OLD's PRIORITY SAVES idea) so dense-city hearted items can't be silently dropped; optionally extend repairCityCoverage to inject any unplaced [HEARTED] item.
- Tighten the gap-fill rule: gap-fill must be in the day's own city and must not duplicate an already-scheduled item.
- Soften "every SAVED/ICONIC MUST appear" → "schedule best-first; cover as many as the pace allows" when a city's pool exceeds its day capacity, so the must-include stops overloading days; keep repairCityCoverage as the per-day floor.
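
The first two tweaks are prompt changes, but each also has a natural post-generation counterpart. The sketch below shows one possible form: a repair pass that injects unplaced hearted items, and a check that flags gap-fill items in the wrong city or duplicated. All names (`Item`, `Day`, `injectUnplacedHearted`, `gapFillViolations`, `maxPerDay`) are hypothetical, the eviction policy is an assumption, and none of this is taken from prompts-v2.ts or repairCityCoverage.

```typescript
// Hypothetical shapes; names are illustrative, not taken from prompts-v2.ts.
interface Item { name: string; city: string; hearted: boolean; gapFill: boolean }
interface Day { city: string; items: Item[] }

// Tweak 1 as a repair pass: place any unscheduled hearted item on the
// least-loaded day of its city, evicting a non-hearted item if the day is full.
function injectUnplacedHearted(days: Day[], pool: Item[], maxPerDay: number): Day[] {
  const out: Day[] = days.map((d) => ({ city: d.city, items: d.items.slice() }));
  const placed: Record<string, boolean> = Object.create(null);
  for (const d of out) for (const it of d.items) placed[it.name] = true;

  for (const h of pool) {
    if (!h.hearted || placed[h.name] === true) continue;
    const cityDays = out.filter((d) => d.city === h.city);
    if (cityDays.length === 0) continue; // no day in that city to repair
    let target = cityDays[0];
    for (const d of cityDays) if (d.items.length < target.items.length) target = d;
    if (target.items.length >= maxPerDay) {
      // Assumes best-first order, so the last non-hearted item is the weakest.
      let evict = -1;
      for (let i = target.items.length - 1; i >= 0; i--) {
        if (!target.items[i].hearted) { evict = i; break; }
      }
      if (evict === -1) continue; // day is all hearted; leave it alone
      target.items.splice(evict, 1);
    }
    target.items.push(h);
    placed[h.name] = true;
  }
  return out;
}

// Tweak 2 as a check: flag gap-fill items outside the day's city or
// repeating a name that appears more than once in the plan.
function gapFillViolations(days: Day[]): string[] {
  const counts: Record<string, number> = Object.create(null);
  for (const d of days) for (const it of d.items) counts[it.name] = (counts[it.name] || 0) + 1;
  const problems: string[] = [];
  days.forEach((d, i) => {
    for (const it of d.items) {
      if (!it.gapFill) continue;
      if (it.city !== d.city) problems.push("day " + (i + 1) + ": " + it.name + " is not in " + d.city);
      if (counts[it.name] > 1) problems.push("day " + (i + 1) + ": " + it.name + " is a duplicate");
    }
  });
  return problems;
}

// Made-up fixtures.
const mk = (name: string, city: string, hearted = false, gapFill = false): Item =>
  ({ name, city, hearted, gapFill });
const repaired = injectUnplacedHearted(
  [{ city: "Bangkok", items: [mk("A", "Bangkok", true), mk("B", "Bangkok"), mk("C", "Bangkok")] }],
  [mk("D", "Bangkok", true)],
  3
);
const problems = gapFillViolations([
  { city: "Kyoto", items: [mk("Fushimi Inari", "Kyoto"), mk("Fushimi Inari", "Kyoto", false, true)] },
  { city: "Jodhpur", items: [mk("Some mall", "Jaipur", false, true)] },
]);
```

In the fixture, the full Bangkok day ends up as A, B, D (C is evicted to make room for the hearted D), and the check reports one duplicate and one wrong-city gap-fill. Appending the injected item at the end of the day is a simplification, since the ordered list is the schedule.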

Net: the rewrite is the right call and directly delivers the "uses more saves" goal (+12pts coverage), while also fixing a real OLD bug (restaurants-as-activities). The 3 tunings recover NEW's only soft spots.