Killa K · the plan

THE GAME FACTORY

A pipeline that bottles everything we've learned across your games, one-shots a POC, then sends other agents to playtest & verify it — and only ships when it clears the bar we've already set.


The pipeline

Six levels, six agents

Each stage is a level. Each level has an agent that owns it, a job, and a gate it has to pass before the next unlocks — same as a real game.

distill

The Standards Forge

agent::archivist

Reads the brain — game-design-brain, the retro-log, context — and compiles it into a machine-readable taste spec + quality rubric: the self-revealing skill tree, paced onboarding, juice-not-shake, DBZ forms, the Legacy-of-Goku north star. The factory's DNA.

Gate: a rubric a judge can actually score against, not vibes.

generate

One-Shot Forge

agent::smith

One-shots a single-file POC from the spec — config-driven, your patterns baked in from the jump: one core loop, a self-revealing progression tree, paced reveals, cinematic juice. Builds it like it's meant to be played, not a throwaway.

Gate: boots clean, single file, mobile-first, zero emoji soup.

playtest

The Proving Grounds

agent::tester

Drives the game headless with the kit we already proved: Node DOM-stub harness, window.__t rAF chains, served-Playwright screenshots. Runs scripted playthroughs — economy, progression, state machine, structure — and snaps every screen.

Gate: no throws, the loop completes, the numbers add up.
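The three conditions in that gate can be written as one small check. This is a sketch only: `PlaythroughReport` and its fields are made-up names for whatever the tester actually hands back, and it assumes every scripted run starts from a balance of zero.

```python
from dataclasses import dataclass, field


@dataclass
class PlaythroughReport:
    """Hypothetical shape of one scripted run's results (not the real harness output)."""
    errors: list[str] = field(default_factory=list)  # uncaught exceptions seen
    loop_completed: bool = False                      # did the core loop finish?
    earned: int = 0                                   # currency granted during the run
    spent: int = 0                                    # currency spent during the run
    final_balance: int = 0                            # balance the game reports at the end


def playtest_gate(report: PlaythroughReport) -> list[str]:
    """Return the reasons the gate fails; an empty list means it passes."""
    failures = []
    if report.errors:
        failures.append(f"threw {len(report.errors)} error(s)")
    if not report.loop_completed:
        failures.append("core loop did not complete")
    # Assumption: starting balance is 0, so earned minus spent must equal the balance.
    if report.earned - report.spent != report.final_balance:
        failures.append("economy does not balance")
    return failures
```

Returning the list of reasons rather than a bare true/false keeps the failure notes available for the judge and for any regenerate step.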

verify

The Standards Gate

agent::judge

Scores the build against the rubric, criterion by criterion — an LLM-as-judge reading the screenshots + the tester's logic report. DONE means verified, never attempted. Hands back a pass/fail card with the holes named.

Gate: every standard passes, or it bounces.
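One way the pass/fail card might be assembled from per-criterion verdicts is sketched below. `Verdict`, `build_card`, and the criterion ids are illustrative assumptions; the LLM call that produces each verdict is left out on purpose.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    criterion: str   # rubric criterion id, e.g. "paced_onboarding" (hypothetical id)
    passed: bool
    note: str = ""   # what the judge saw; becomes the named hole on a fail


def build_card(rubric: list[str], verdicts: list[Verdict]) -> dict:
    """Pass only if every rubric criterion has a passing verdict."""
    by_id = {v.criterion: v for v in verdicts}
    holes = []
    for criterion in rubric:
        verdict = by_id.get(criterion)
        if verdict is None:
            # An unscored criterion is a fail: done means verified, never attempted.
            holes.append(f"{criterion}: not scored")
        elif not verdict.passed:
            holes.append(f"{criterion}: {verdict.note or 'failed'}")
    return {"passed": not holes, "holes": holes}
```

Treating a missing verdict as a hole is a design choice in this sketch; it stops a judge that skipped a criterion from producing a false pass.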

gate / loop

The Forge Loop

agent::smith ⟲ judge

A fail routes the notes straight back to Level 2 — regenerate, re-test, re-judge — until it's a clean pass. TESTHOOKS must be empty (no scaffolding shipped). This loop is where consistency actually comes from.

Gate: full pass + a clean test-hook ledger.
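The loop itself could look roughly like this. It is a sketch under stated assumptions: the three callables stand in for the smith, the tester plus judge (folded into one `evaluate` step for brevity), and a test-hook scan. The `max_attempts` cap is an addition here, since the plan only says "until it's a clean pass".

```python
from typing import Callable, Optional


def forge_loop(
    generate: Callable[[list[str]], str],     # notes -> build (hypothetical smith call)
    evaluate: Callable[[str], list[str]],     # build -> holes; empty list means pass
    find_hooks: Callable[[str], list[str]],   # build -> leftover test hooks
    max_attempts: int = 5,                    # assumption: the plan names no cap
) -> tuple[Optional[str], int]:
    """Regenerate until the build passes with an empty test-hook ledger."""
    notes: list[str] = []
    for attempt in range(1, max_attempts + 1):
        build = generate(notes)
        holes = evaluate(build)
        leftovers = find_hooks(build)
        if not holes and not leftovers:
            return build, attempt
        # Feed the named holes and any leftover scaffolding back to the generator.
        notes = holes + [f"remove test hook: {hook}" for hook in leftovers]
    return None, max_attempts
```

A cap of some kind seems worth having so that a build which never converges bounces to a human instead of burning credits forever.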

ship

Launch

agent::shipper → you

Deploys to Cloudflare prod, writes the run's learnings back into the brain (so the next POC starts higher — the estate compounds), and pings you for the one thing no agent can do: the feel check.

Gate: live URL + brain updated + your eyeball booked.
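The six levels above could be wired as an ordered list of gated stages. A minimal sketch, assuming each gate boils down to a function that returns true or false; `Stage`, `PIPELINE`, and `run_pipeline` are illustrative names, not part of any existing kit.

```python
from typing import Callable, NamedTuple


class Stage(NamedTuple):
    level: int
    name: str
    agent: str


# Stage names and owning agents as listed in the plan.
PIPELINE = [
    Stage(1, "distill", "archivist"),
    Stage(2, "generate", "smith"),
    Stage(3, "playtest", "tester"),
    Stage(4, "verify", "judge"),
    Stage(5, "gate / loop", "smith + judge"),
    Stage(6, "ship", "shipper"),
]


def run_pipeline(gates: dict[str, Callable[[], bool]]) -> str:
    """Run gates in order; return the first stage that fails, or 'shipped'."""
    for stage in PIPELINE:
        gate = gates.get(stage.name)
        # A stage with no gate wired up is treated as a fail, not a free pass.
        if gate is None or not gate():
            return stage.name
    return "shipped"
```

Calling `run_pipeline` with a gate that fails at "verify" returns "verify", which is the point where notes would be routed back to Level 2.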

The quality bar

What it gates against

Not invented — pulled straight from the brain. These are the boss requirements every POC has to clear.

The WOW-bar: "fuck me, I did this?" Premium, glow, depth. Not just tidy.

Paced onboarding: one mechanic at a time (targets → enemy → enemy → power-up).

Self-revealing progression: the skill tree blooms as you play. Discovery, not a menu.

Juice = motion, not shake: cinematic, smooth, DBZ-style. No strobe, no rattle.

Cohesion: bespoke SVG / sprites. Zero emoji soup, ever.

Mobile-first: fits + stays centred, touch controls, phone-tested.

Verified, not attempted: nothing's "done" until it's been checked. No false ✅.

North star, Legacy of Goku: constant visible growth + a meta that respects the grind.

The honest problem

Can an agent feel fun?

No. And pretending otherwise is how you ship a polished, on-spec game that's boring.

Agents can prove a game's logic (the harness) and its structure (screenshots) — but headless WebGL is unreliable and a screenshot can't tell you if it's fun. So the factory's honest job isn't "make fun automatically." It's to get every POC to a structurally-perfect, on-standard state cheaply — so your taste is spent only on the bit that's actually yours: the feel.

✓ logic & economy
✓ structure & layout
✓ standards checklist
✗ is it fun
✗ on-device feel

The weigh-up

Pros & cons — straight

Pros

Speed. Parallel agents one-shot + auto-verify many POCs while you do other things.

Consistency. Every game inherits your standards & patterns automatically.

The brain compounds. Each ship feeds learnings back — next POC starts higher.

Cheap to run. Grunt + judge agents on free coins (OpenRouter / Gemini / Groq).

Frees your taste for the one thing only you can do — judge the fun.

Cons / risks

Agents can't judge fun. The core risk — structurally perfect ≠ a good game.

One-shot ceiling. Complex games won't one-shot well — POC scope only.

Sameness. Templated output risks every game feeling the same.

Verification gaps. Headless can't see feel, performance, or on-device quirks.

Garbage-in scales. If the spec mis-captures your taste, it mass-produces the wrong thing.

The build path

Action points

Business-Don order: prove the riskiest assumption for the least money, before building the whole machine.

1. Prove one agent can clear the bar, once.

Riskiest assumption = "an agent can one-shot a POC that passes the structural gate." Hand one agent the spec + one simple genre, one-shot it, run it through the existing harness + screenshot-judge. If it can't clear the bar once, the factory's moot.

cheapest test of the riskiest bet

2. Write the taste spec + rubric v1.

Distill the brain into a concrete checklist a judge can score. Small and sharp beats big and woolly.

3. Package the verify harness.

Bundle the pieces you already have (Node DOM-stub + served-Playwright screenshot + the LLM-judge prompt) into one reusable module.

4. Wire the gate loop for one genre, end-to-end.

Fail → notes → regenerate → re-judge, until clean. Prove the loop on the genre you've already nailed (survivor / tunnel-run lineage).

5. Then parallelise + add brain write-back.

Fan out across genres on free coins; each ship writes its learnings home so the estate compounds.

6. Keep yourself as the final feel-gate. Forever.

Never automate the fun check. The factory delivers POCs to your bar; you decide what's actually good.

Your call

Considerations & open questions

Which genre proves it first? (the survivor / tunnel-run lineage one-shots most reliably.)

How much scope per one-shot? (one core loop + one progression + juice — not a full game.)

Free coins for the build too, or Claude builds + free coins do the grunt & judging?

How do we encode "feel" hints an agent can act on (the juice checklist, motion-not-shake rule)?

Auto-ship to prod, or stage every POC behind your review first?

Ties into the Agent Fleet GUI — is the factory a tab inside mission control?

The Business Don's call

7.5/10

Build it — but as a verify-and-gate factory, not a "fun machine." Its real value is consistency + speed to a known bar, with the brain compounding every run. Do the one-genre proof first (step 1); if a single agent clears the bar once, green-light the pipeline. Strong yes on the standards/verify engine — hold on expecting auto-fun. The cap is the feel gap, and that's exactly what keeps you in the loop. Which is fine.

POC plan · built from the shared brain — your real lessons, named.
No code shipped — this is the research, the weigh-up, and the call. Your move.
