Killa K · the plan

THE GAME FACTORY

A pipeline that bottles everything we've learned across your games, one-shots a POC, then sends other agents to playtest & verify it — and only ships when it clears the bar we've already set.

The pipeline

Six levels, six agents

Each stage is a level. Each level has an agent that owns it, a job, and a gate it has to pass before the next unlocks — same as a real game.

1
distill

The Standards Forge

agent::archivist
Reads the brain — game-design-brain, the retro-log, context — and compiles it into a machine-readable taste spec + quality rubric: the self-revealing skill tree, paced onboarding, juice-not-shake, DBZ forms, the Legacy-of-Goku north star. The factory's DNA.
Gate: a rubric a judge can actually score against, not vibes.
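
A rubric the judge can score is easier to picture as data. The sketch below is a minimal illustration, not the archivist's real output: the `Criterion` shape, the keys, and the question wording are all assumptions, paraphrased from the standards this plan names.

```python
from dataclasses import dataclass

EVIDENCE_SOURCES = {"screenshots", "logic_report"}  # the two inputs the judge reads


@dataclass(frozen=True)
class Criterion:
    key: str       # short id the judge reports against (hypothetical naming)
    question: str  # a yes/no question, so the answer is a score and not a vibe
    evidence: str  # which input answers it: "screenshots" or "logic_report"


# Hypothetical rubric v1. Keys and wording are illustrative only.
RUBRIC = [
    Criterion("paced_onboarding", "Is exactly one new mechanic introduced at a time?", "logic_report"),
    Criterion("self_revealing_tree", "Are skill-tree nodes hidden until play unlocks them?", "logic_report"),
    Criterion("mobile_first", "Does the play area fit and stay centred on a phone viewport?", "screenshots"),
    Criterion("no_emoji", "Are all visuals bespoke SVG or sprites, with no emoji?", "screenshots"),
]


def is_scoreable(rubric):
    """True if keys are unique and every criterion is a question with a known evidence source."""
    keys = [c.key for c in rubric]
    return (
        len(keys) == len(set(keys))
        and all(c.question.strip().endswith("?") for c in rubric)
        and all(c.evidence in EVIDENCE_SOURCES for c in rubric)
    )
```

Keeping each criterion a closed question tied to one evidence source is what makes this gate checkable before any game exists.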
2
generate

One-Shot Forge

agent::smith
One-shots a single-file POC from the spec — config-driven, your patterns baked in from the jump: one core loop, a self-revealing progression tree, paced reveals, cinematic juice. Builds it like it's meant to be played, not a throwaway.
Gate: boots clean, single file, mobile-first, zero emoji soup.
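
Part of this gate can be checked before the game ever runs. The sketch below is a rough static pass over the generated HTML, under stated assumptions: "single file" is read as "no external script, image, or stylesheet references", "mobile-first" is approximated by the presence of a viewport meta tag, and the emoji ranges are a coarse guess, not a full Unicode emoji list. "Boots clean" is not covered here; that needs the playtest harness.

```python
import re

# Coarse emoji ranges (assumption): pictographs plus the older symbols/dingbats block.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# External asset references: script/img with a non-data src, or any <link href=...>.
EXTERNAL = re.compile(
    r"""<(?:script|img)\b[^>]*\bsrc\s*=\s*["'](?!data:)|<link\b[^>]*\bhref\s*=""",
    re.IGNORECASE,
)
VIEWPORT = re.compile(r"""<meta\b[^>]*\bname\s*=\s*["']viewport["']""", re.IGNORECASE)


def static_gate(html: str) -> dict:
    """Return one boolean per static check; True means the check passed."""
    return {
        "single_file": EXTERNAL.search(html) is None,
        "mobile_viewport": VIEWPORT.search(html) is not None,
        "no_emoji": EMOJI.search(html) is None,
    }
```

A viewport tag only proves the intent to be mobile-first; whether the game actually fits and stays centred is still a job for the screenshots.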
3
playtest

The Proving Grounds

agent::tester
Drives the game headless with the kit we already proved: Node DOM-stub harness, window.__t rAF chains, served-Playwright screenshots. Runs scripted playthroughs — economy, progression, state machine, structure — and snaps every screen.
Gate: no throws, the loop completes, the numbers add up.
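
"The numbers add up" can be made concrete as a ledger audit over a scripted playthrough. This is a minimal sketch, assuming the tester can export economy events as `(kind, amount)` pairs; that event format and the function name are hypothetical, not something the existing harness is known to produce.

```python
def audit_economy(events, start=0):
    """Replay earn/spend events and return (final_balance, problems).

    events: iterable of (kind, amount) with kind in {"earn", "spend"}.
    A spend larger than the current balance is reported and not applied.
    """
    balance = start
    problems = []
    for i, (kind, amount) in enumerate(events):
        if amount < 0:
            problems.append(f"event {i}: negative amount {amount}")
        elif kind == "earn":
            balance += amount
        elif kind == "spend":
            if amount > balance:
                problems.append(f"event {i}: spent {amount} with only {balance}")
            else:
                balance -= amount
        else:
            problems.append(f"event {i}: unknown kind {kind!r}")
    return balance, problems
```

The tester would then compare the returned balance with whatever the game itself displays at the end of the run; any mismatch or any listed problem fails the gate.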
4
verify

The Standards Gate

agent::judge
Scores the build against the rubric, criterion by criterion — an LLM-as-judge reading the screenshots + the tester's logic report. DONE means verified, never attempted. Hands back a pass/fail card with the holes named.
Gate: every standard passes, or it bounces.
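
The pass/fail card could be assembled like this. It is a sketch under assumptions: the judge's per-criterion output is taken to be a dict of `key -> (passed, note)`, which is an invented shape, and the LLM call itself is left out. The one rule it does encode comes straight from the plan: a criterion nobody scored counts as a fail, because DONE means verified.

```python
def score_card(rubric_keys, verdicts):
    """Build a pass/fail card from per-criterion verdicts.

    rubric_keys: every criterion the build must clear.
    verdicts: dict of key -> (passed: bool, note: str) from the judge.
    Criteria missing from verdicts fail as "not scored".
    """
    holes = []
    for key in rubric_keys:
        passed, note = verdicts.get(key, (False, "not scored"))
        if not passed:
            holes.append(f"{key}: {note}")
    return {"passed": not holes, "holes": holes}
```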
5
gate / loop

The Forge Loop

agent::smith ⟲ judge
A fail routes the notes straight back to Level 2 — regenerate, re-test, re-judge — until it's a clean pass. TESTHOOKS must be empty (no scaffolding shipped). This loop is where consistency actually comes from.
Gate: full pass + a clean test-hook ledger.
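
The loop itself is small once the three agents are treated as plain callables. A minimal sketch, with loud assumptions: `generate`, `playtest`, and `judge` are stand-ins for the smith, tester, and judge agents; the build is assumed to be a dict carrying a `test_hooks` list; and `max_rounds` is an added safety cap that the plan does not specify (it only says "until it's a clean pass").

```python
def forge_loop(generate, playtest, judge, spec, max_rounds=5):
    """Regenerate, re-test, and re-judge until no holes remain or the cap is hit.

    generate(spec, notes) -> build dict (assumed to include "test_hooks")
    playtest(build)       -> logic report
    judge(build, report)  -> list of hole descriptions (empty means pass)
    """
    notes = []
    for round_no in range(1, max_rounds + 1):
        build = generate(spec, notes)
        report = playtest(build)
        holes = list(judge(build, report))
        # A clean rubric pass is not enough: leftover scaffolding also bounces the build.
        hooks = build.get("test_hooks") or []
        if hooks:
            holes.append("test hooks still present: " + ", ".join(hooks))
        if not holes:
            return {"shipped": True, "rounds": round_no, "build": build}
        notes = holes  # the fail notes route straight back to the generator
    return {"shipped": False, "rounds": max_rounds, "notes": notes}
```

Without a cap, a spec the smith can never satisfy would loop forever; with one, the run ends with the outstanding notes in hand, which is a useful signal about the spec.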
6
ship

Launch

agent::shipper → you
Deploys to Cloudflare prod, writes the run's learnings back into the brain (so the next POC starts higher — the estate compounds), and pings you for the one thing no agent can do: the feel check.
Gate: live URL + brain updated + your eyeball booked.
The quality bar

What it gates against

Not invented — pulled straight from the brain. These are the boss requirements every POC has to clear.

The WOW-bar: "fuck me, I did this?" Premium, glow, depth. Not just tidy.
Paced onboarding: one mechanic at a time (targets → enemy → enemy → power-up).
Self-revealing progression: the skill tree blooms as you play. Discovery, not a menu.
Juice = motion, not shake: cinematic, smooth, DBZ-style. No strobe, no rattle.
Cohesion: bespoke SVG / sprites. Zero emoji soup, ever.
Mobile-first: fits + stays centred, touch controls, phone-tested.
Verified, not attempted: nothing's "done" until it's been checked. No false ✅.
North star: Legacy of Goku. Constant visible growth + a meta that respects the grind.
The honest problem

Can an agent feel fun?

No. And pretending otherwise is how you ship a polished, on-spec game that's boring.

Agents can prove a game's logic (the harness) and its structure (screenshots) — but headless WebGL is unreliable and a screenshot can't tell you if it's fun. So the factory's honest job isn't "make fun automatically." It's to get every POC to a structurally-perfect, on-standard state cheaply — so your taste is spent only on the bit that's actually yours: the feel.

✓ logic & economy · ✓ structure & layout · ✓ standards checklist · ✗ is it fun · ✗ on-device feel
The weigh-up

Pros & cons — straight

Pros

  • Speed. Parallel agents one-shot + auto-verify many POCs while you do other things.
  • Consistency. Every game inherits your standards & patterns automatically.
  • The brain compounds. Each ship feeds learnings back — next POC starts higher.
  • Cheap to run. Grunt + judge agents on free coins (OpenRouter / Gemini / Groq).
  • Frees your taste for the one thing only you can do — judge the fun.

Cons / risks

  • Agents can't judge fun. The core risk — structurally perfect ≠ a good game.
  • One-shot ceiling. Complex games won't one-shot well — POC scope only.
  • Sameness. Templated output risks every game feeling the same.
  • Verification gaps. Headless can't see feel, performance, or on-device quirks.
  • Garbage-in scales. If the spec mis-captures your taste, it mass-produces the wrong thing.

The build path

    Action points

    Business-Don order: prove the riskiest assumption for the least money, before building the whole machine.

    1
    Prove one agent can clear the bar — once.

    Riskiest assumption = "an agent can one-shot a POC that passes the structural gate." Hand one agent the spec + one simple genre, one-shot it, run it through the existing harness + screenshot-judge. If it can't clear the bar once, the factory's moot.

    cheapest test of the riskiest bet
    2
    Write the taste spec + rubric v1.

    Distill the brain into a concrete checklist a judge can score. Small and sharp beats big and woolly.

    3
    Package the verify harness.

    Bundle the pieces you already have — Node DOM-stub + served-Playwright screenshot + the LLM-judge prompt — into one reusable module.

    4
    Wire the gate loop for one genre, end-to-end.

    Fail → notes → regenerate → re-judge, until clean. Prove the loop on the genre you've already nailed (survivor / tunnel-run lineage).

    5
    Then parallelise + add brain write-back.

    Fan out across genres on free coins; each ship writes its learnings home so the estate compounds.

    6
    Keep yourself as the final feel-gate. Forever.

    Never automate the fun check. The factory delivers POCs to your bar; you decide what's actually good.
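
Step 5's fan-out could look something like the sketch below. It assumes the whole pipeline for one genre is wrapped in a single callable, here the hypothetical `run_pipeline(genre)`, and it uses threads from the standard library because agent calls are mostly waiting on the network. One genre blowing up is recorded as a failed result so the other runs still finish.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(run_pipeline, genres, max_workers=4):
    """Run one pipeline per genre in parallel and return {genre: result}.

    run_pipeline(genre) -> result dict (hypothetical wrapper around the six levels).
    """
    def safe_run(genre):
        try:
            return run_pipeline(genre)
        except Exception as exc:  # keep one bad genre from sinking the batch
            return {"shipped": False, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_run, genres))  # map preserves input order
    return dict(zip(genres, results))
```

Nothing here touches the feel check: every shipped result still lands in your queue, per step 6.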

    Your call

    Considerations & open questions

    Which genre proves it first? (the survivor / tunnel-run lineage one-shots most reliably.)
    How much scope per one-shot? (one core loop + one progression + juice — not a full game.)
    Free coins for the build too, or Claude builds + free coins do the grunt & judging?
    How do we encode "feel" hints an agent can act on (the juice checklist, motion-not-shake rule)?
    Auto-ship to prod, or stage every POC behind your review first?
    Ties into the Agent Fleet GUI — is the factory a tab inside mission control?
    The Business Don's call
    7.5/10
    Build it — but as a verify-and-gate factory, not a "fun machine." Its real value is consistency + speed to a known bar, with the brain compounding every run. Do the one-genre proof first (step 1); if a single agent clears the bar once, green-light the pipeline. Strong yes on the standards/verify engine — hold on expecting auto-fun. The cap is the feel gap, and that's exactly what keeps you in the loop. Which is fine.
    POC plan · built from the shared brain — your real lessons, named.
    No code shipped — this is the research, the weigh-up, and the call. Your move.