Building an AI-native startup.
Most founder playbooks were written before AI changed who can build. This one is rebuilt around the four stages that actually matter when a non-technical founder can ship a working product in a weekend — and a technical founder can ship a company. Pradhya's working version of the four-stage founder map, with patterns drawn from a dozen AI-native companies we've watched up close.
- State your startup’s hypothesis sharply enough to test it in a week
- Architect an MVP that won’t collapse under its own AI-generated weight
- Tell genuine product-market fit from early hype using a clear framework
- Wire Claude (Chat, Cowork, Code) into the parts of your stack where it pays off
- Decide what to build, what to delegate, and what to refuse — for the next 90 days
The four-stage map.
Idea → MVP → Launch → Scale. The shapes are old. The work inside each is not. The framing here puts goals, exit criteria, failure modes, and AI exercises into each stage so you stop confusing “busy” with “progress.”
What goes inside each stage
| Stage | Goal | Exit when… | Common failure |
|---|---|---|---|
| Idea | A testable hypothesis backed by real customer signal | 10+ customer interviews say “yes, I’d pay for this” | Falling in love with the solution before you understand the pain |
| MVP | Shippable v1 that demonstrates the core value | 1 paying user uses it weekly without you holding their hand | Over-engineering. Building features for hypothetical scale |
| Launch | Public release, measurable adoption | Retention curve flattens, not zero, at week 4 | Confusing launch buzz with PMF |
| Scale | The company runs without the founder typing every prompt | You can take a week off without revenue dropping | Hiring “ops” instead of building the operating system |
Place your startup on the map.
- Read the “Exit when…” column above.
- For each stage you’ve allegedly cleared, write the evidence (names of customers, weekly active count, retention number).
- If you can’t cite evidence, you haven’t cleared the stage — go back.
- Pin a note on your monitor with the stage name + the next exit criterion.
Stretch. Show your stage diagnosis to one honest friend. If they push you back a stage, listen.
Hypothesis validation.
A startup hypothesis has three pieces: who, pain, willingness. Most founders have one. AI doesn’t fix the missing two — it just makes it cheaper to discover them.
The shape of a testable hypothesis
Bad: “Small businesses need better bookkeeping software.”
Better: “US Schedule-C self-employed earning $50k–$200k will pay $25/month to never see a receipt again, because their current tax-season pain is 30+ hours of reconstruction.”
Three pieces in the better version: who (US Schedule-C, specific income band), pain (30+ hours, named season), willingness ($25/month, specific number).
Write the three-part hypothesis + kill criterion.
- Who: name a specific segment with a number you can size (industry + role + revenue band).
- Pain: name the specific cost — hours, dollars, missed opportunities — with a quantity.
- Willingness: name the price they’d pay and the cadence.
- Kill criterion: what evidence would convince you to abandon this idea?
Stretch. Run the hypothesis past Claude with: “Critique this as a YC partner would. What’s vague? What’s missing?” Tighten where it pushes back.
Landscape mapping.
Thirty minutes with Claude beats two weeks of competitor research the old way. The map you build now becomes your positioning later.
What goes on the map
- Direct competitors — same problem, same solution shape.
- Adjacent solutions — same problem, different shape (often what people use today by default).
- Substitutes — the “none of the above” alternatives (spreadsheets, paper, doing-it-themselves).
- Why each existing option fails — the gap you’re building into.
Generate the landscape with Claude.
- Paste your hypothesis into Claude with web search on.
- Ask: “List every direct competitor, adjacent solution, and substitute. For each, give: 1-line description, who uses it, why they’d leave.”
- Verify 3 random claims by clicking through to the cited sources.
- Save the brief as
landscape.mdin your project folder.
Stretch. Repeat the exercise from the customer’s perspective: “How would a [target segment] currently solve this?” The gap between competitor mental-model and customer mental-model is often your wedge.
Customer discovery with AI.
Twenty interviews in a week used to be heroic. Now the bottleneck isn’t conducting them — it’s synthesizing them. Use AI on the back end, not the front.
The interview shape that actually works
- Past behavior, not future intent. “What did you do last time X happened?” beats “Would you use X?”
- Specific stories, not generalizations. “Tell me about the most recent time…” beats “How often do you…?”
- Their words, not yours. Never describe the solution — let them name what they wish existed.
- Listen for emotion. The right ideas land on real pain. If the interviewee is calm, you’re probably building a vitamin, not a painkiller.
Where AI helps: sourcing prospects (LinkedIn / Twitter), drafting the outreach, transcribing the calls, and — most importantly — synthesizing patterns across 10+ transcripts. See Cowork Recipe R14 · The feedback themes.
Run 3 interviews this week, synthesized by Claude.
- Source 10 candidates matching your “who”. Use LinkedIn search + a Claude-drafted outreach DM.
- Book 3 calls (15-min each).
- Use Granola / Otter / Read.ai to transcribe.
- Feed all 3 transcripts to Claude: “Surface the 3 most-repeated pains across these. Cite verbatim quotes. Note where I led the witness.”
Stretch. After 10 interviews, do the synthesis again. Compare to the 3-interview synthesis — how stable are the themes? Unstable themes mean you need more interviews; stable themes mean you can move to MVP.
Architecture you’ll regret.
AI-generated code accelerates everything, including the regrets. The decisions you make in the first week determine whether you ship in week 4 or rewrite in month 4.
Five decisions that compound
| Decision | Get this wrong & … | The default that ages well |
|---|---|---|
| Language / framework | You can’t hire, can’t debug, AI is weak at it | TypeScript + React, or Python + FastAPI — both well-trained |
| Database | Migrations become impossible at scale | Postgres — Supabase or Neon for managed |
| Auth | You spend a month rolling your own | Clerk or Auth0 — offload until you can’t |
| Hosting | You can’t deploy at 11pm | Vercel / Cloudflare Pages for frontends, Fly.io / Render for services |
| LLM provider | Vendor lock makes the next model migration painful | SDK with a thin abstraction layer for swapping providers |
Make your five decisions.
- Open
ARCHITECTURE.mdin your repo. - For each of the 5 decisions, write: the choice, why, and the “reversal cost” (how painful would it be to change in month 6?).
- If the reversal cost is “catastrophic”, pick a default that’s easier to swap.
- Commit. The doc is the audit trail for future you.
Stretch. Ask Claude to critique your doc as a hostile reviewer. Note where it pushes back. Resolve, don’t dismiss.
Scope discipline.
When you can ship features in hours, scope becomes the most important constraint. The MVP is what you say no to.
The MVP cut
- One workflow. Not one feature. One end-to-end thing the user can do that delivers the value.
- Hardcode the rest. Settings page? Single hardcoded config. Permissions? You’re the only admin. Onboarding? Send them a Loom.
- Reverse the funnel. Don’t build sign-up if you don’t have users. Build for the first 10 manually-added users.
- Cut by week 2. If you can’t ship the one workflow in two weeks of focused work, the scope is too wide.
The scope-cut session.
- List every feature you’ve considered for v1. (Be honest; usually it’s 15–30 items.)
- Identify the one workflow that — if it works — proves the hypothesis.
- Strike everything else. Move struck items to
BACKLOG.md— not deleted, just postponed. - If you can’t strike, ask Claude: “Which of these features could I hardcode, fake, or do manually for the first 10 users?”
Stretch. Tell a friend the v1 spec out loud. If they ask “but what about X?” for something you cut, that’s the test — can you answer with “manually, for now” without flinching?
Security from day zero.
AI-generated code carries new failure modes: prompt injection, data leakage through context, cost runaways. They’re cheap to prevent on day one and ruinous to retrofit at month six.
Five surfaces to lock down on day one
- Prompt injection. Treat all user input + retrieved content as untrusted. Sandbox tool calls. Never let user input change your system prompt’s rules.
- Data leakage. If your agent has access to multiple customers’ data, partition by customer in every retrieval. One leaked tenant kills the company.
- Cost runaway. Per-customer rate limits + budget alerts. A buggy loop can burn $10k overnight.
- Reversibility. Drafts before sends. Reads before writes. Preview before execute. Borrow the discipline from Cowork U11.
- Secrets. Never in code. Never in client-side. Vault them. Rotate quarterly.
See SaaS Practice U11 for the security-review agent pattern that catches these at PR time.
Run the day-zero security checklist on your repo.
- For each of the 5 surfaces, write down: what your current code does, what gap exists, what the fix would be.
- Fix the gaps that don’t require new infrastructure (e.g. add rate limits, partition queries by customer).
- Add the security-review agent (from SaaS U11) to your PR pipeline.
- File the rest as tickets with deadlines. Don’t let “we’ll get to it” be the answer.
Stretch. Wire a cost-alert webhook so a 10x spike in token usage triggers a Slack ping. Catching the bug at $50 is much better than at $5000.
The AI-native build loop.
The day-to-day rhythm of building when Claude Code is your senior IC. Plan mode, CLAUDE.md, ultrareview, deploy preview. Same patterns whether you’re a solo founder or a five-person team.
The loop
- Plan — describe the feature, hit plan mode, edit the plan.
- Build — approve the plan; Claude codes it. You read along.
- Review — run
/ultrareview. Read every flag. - Preview — deploy preview URL. Click through the change as a user would.
- Ship — merge. Watch metrics for 1 hour.
Cycle time: 30 minutes to 2 hours per feature. The constraint stops being typing speed and becomes decision quality.
/ultrareview is a real, copy-able pack: a slash command plus six single-purpose review agents (security, performance, tests, types, comments, simplify). Grab it from code-examples/ultrareview/ — the command file and the six agent prompts — and drop the folder into your repo’s .claude/ directory. Full install-and-run walkthrough in SaaS Practice U12.
Run one full loop on your MVP.
- Pick the smallest visible feature you can think of (a single button + endpoint).
- Plan mode → build → ultrareview → preview → ship.
- Note where the loop felt fast and where it bogged down.
- The bog-downs are your tools to invest in next: better CLAUDE.md, more MCPs, better preview setup.
Stretch. Time five loops over a week. The curve should bend down. If it doesn’t, your scope per feature is too large.
Real PMF vs early hype.
Launch buzz is not product-market fit. PMF is boring — it shows up as retention, not as TechCrunch posts. The framework here is what tells them apart.
The signals that lie
- Launch-day signups. Vanity. People will sign up for anything.
- Twitter engagement. Also vanity. AI products especially get retweet-bumps that don’t convert.
- VC interest. Investors fund stories. Stories aren’t fit.
- Press coverage. Lagging. Often arrives after the product’s peak.
The signals that don’t
- Retention curve. Week-N active / Week-1 active. Healthy: flattens, not zeroes. Unhealthy: cliff at week 2.
- Organic referral rate. Users telling other users — without a referral program.
- Cohort revenue expansion. Customers paying more over time, not less.
- The Sean Ellis test. “How would you feel if you could no longer use [product]?” ≥ 40% “very disappointed” is the classic threshold.
Run the Sean Ellis survey.
- Send one question to users who’ve used the product ≥ 2 weeks: “How would you feel if you could no longer use [product]?” Options: very disappointed / somewhat disappointed / not disappointed / N/A.
- Collect responses. Need ≥ 30 for a signal.
- Compute the “very disappointed” percentage.
- ≥ 40% → you have PMF; scale carefully. 20–40% → you have product-segment fit; double down on the segment. < 20% → iterate on the product itself.
- Pick 5 of the contacts from your U04 problem interviews — the ones who described the sharpest pain.
- Show them the prototype (a clickable mockup or a 2-minute Loom is fine). Then ask the Sean Ellis question worded for intent: “If this existed today and then disappeared, how would you feel — very disappointed / somewhat / not disappointed?”
- Record each answer verbatim plus one sentence of why.
- Count the “very disappointed” responses out of 5.
Stretch. Re-run quarterly (post-launch) or after every prototype revision (pre-launch). The percentage tells you whether the product is getting more sticky or less.
Measurement for AI products.
AI products generate non-deterministic outputs. Traditional metrics break. The new metrics measure trust, recovery, and the gap between “what we shipped” and “what users perceived.”
The new metric stack
| Metric | What it measures | Healthy |
|---|---|---|
| Task success rate | % of user tasks the agent completed without escalation | > 80% on the golden path |
| Edit distance | How much users edit AI output before using it | Trending down over time |
| Refusal rate | How often the agent says “I can’t” (and was it right to?) | < 5%, with a manual audit of correctness |
| Recovery rate | When the agent fails, how often does the user retry? | > 60% (gives up too fast = trust broken) |
| Cost per successful task | $ tokens + infrastructure per shipped value | Drops as you optimize, not flat |
See Practice 14 · Observability for how to instrument these.
Pick three metrics for your product.
- Pick 3 metrics that match your product’s value prop.
- Define each precisely: numerator, denominator, time window, exclusions.
- Measure baseline this week. Even an estimate counts.
- Add to your weekly review (or build a dashboard if scale warrants).
- Pick 3 metrics from the table that match your value prop (task success rate and edit distance work even with a handful of test users).
- Write each one down with its exact definition: numerator, denominator, what counts as “success.”
- For 7 days, every time you or a tester runs the core workflow, tally the result by hand — one tick mark per run in the right column.
- On day 7, compute each metric from your tally sheet.
Stretch. Set targets, not just baselines. Without targets, metrics are decoration.
Agentic workflows in product.
The first wave of AI products was “ChatGPT for X.” The second wave is workflows: opinionated multi-step jobs the agent runs on the user’s behalf. Two questions decide whether to add an agentic feature: can you describe the value in 1 sentence? and can the user verify the result faster than they could have done it themselves?
When to add an agentic workflow
| Reach for an agentic workflow when… | Don’t when… |
|---|---|
| The job is repetitive and the value compounds across runs | The job is one-off |
| The user can verify the result in 30 seconds | The user must check every line |
| The cost of a wrong action is low or reversible | Wrong actions damage data or relationships |
| The job lives in a single tool / context | The job requires cross-system orchestration you can’t fully access |
Map your product to workflows vs chat.
- List your product’s top 5 user actions.
- For each: chat, workflow, or both?
- For workflows, write the 1-sentence value prop and the verification step.
- For chat, name the failure mode: what happens when the model goes off-piste? Is the user OK with it?
Stretch. A common evolution: feature ships as chat, becomes a workflow once you see the 5 actual prompts users send. Bake the most-used prompt into a button.
The launch checklist.
What must be true before public launch. Skip any of these and the launch turns into firefighting instead of growth.
The non-negotiables
- You have 10+ users who’d be very disappointed if it shut down. The Sean Ellis test from U09.
- You have 1 paying customer at full price. Discounts don’t prove fit; full price does.
- Cost per user < price per user. Even by 1¢. Negative unit economics at launch is technical debt with a deadline.
- The 5 most-likely failure modes are handled. Either fixed, monitored with alerts, or explicitly accepted with a runbook.
- Onboarding works without you. A stranger can sign up and reach the first “aha” without DMing you.
- You can shut it down gracefully. If it’s a flop, you have a plan to refund / wind down / pivot without burning users.
Run the launch readiness review.
- For each non-negotiable, write: pass / fail / partial.
- For each fail or partial, what’s the smallest fix?
- If you can fix in a week: delay launch a week. If you can’t: launch in beta only.
- Schedule the post-launch review for week 4.
- Pick 3 prospects who match your “who” (reuse U04 interviewees).
- Run this exact script on each, one at a time: (1) “How do you solve this today, and what does that cost you in time or money?” (2) “If a tool did this for you, what would make it worth paying for?” (3) “At $X/month, is that a yes, a no, or a maybe — and why?” (Name your real price for X.)
- Write down each prospect’s answer to question 3 as a literal yes / no / maybe, plus their stated reason.
- Tally the three answers. Two or more unprompted “yes” at full price is your pre-launch proxy for the “1 paying customer” box.
Stretch. Show the checklist to one founder who’s 2 years ahead of you. Their pushback on your “pass” entries is gold.
Product tools matrix.
There are three product surfaces — Chat (claude.ai), Cowork (Claude Desktop), and Code (Claude Code). Each fits a different stage and shape of work. Picking wrong is the most common founder mistake.
The matrix
| Surface | Best for | Where it fits in your stack |
|---|---|---|
| Chat (claude.ai) | Open-ended exploration, prompt design, customer support replies | Customer-facing “ask Claude” features; pre-prompt-engineering workspace |
| Cowork (Claude Desktop) | Personal operations: read files, run multi-step jobs across your apps | Founder ops (sales prep, weekly review, contract review); not customer-facing |
| Code (Claude Code) | Engineering work: write, refactor, ship | Your engineering stack from day one |
| API (SDK) | Whatever you build into your product | Anywhere AI shows up in your product’s critical path |
Founders frequently confuse the lines. A common mistake: building product features in Cowork (which only you can use) instead of the API. Or treating Chat as your engineering tool, when Code does it better. See Practice 03 (Cowork) and Practice 02 (Claude Code).
Place every AI-touching task into the matrix.
- Brainstorm: customer support, sales prep, content, ops, engineering, customer-facing features.
- For each, assign the right surface.
- Flag mismatches: where are you using the wrong surface today?
- Plan one migration this week.
Stretch. Add your team’s entries too. The matrix becomes a quick onboarding doc for the next hire: “here’s where AI lives for us.”
Founder operating system.
Scale starts when the company stops needing the founder to do every prompt. The operating system is the set of skills, recipes, and agents that take over the work you’d otherwise re-do every week.
What goes in the founder OS
- The Cowork recipe library — meeting prep, weekly review, contract review, customer triage, investor updates. See Cowork Day 04.
- The Claude Code plugin — your team’s conventions encoded as hooks, slash commands, skills. See Practice 02.
- The customer-facing agentic workflows — the API-driven product features.
- The metrics dashboard — from U10, automated and visible.
- The runbooks — what to do when X breaks. Written before the break, not during.
Build the founder OS v1.
- Make a 2-column list: tasks I do weekly · can they be skilled / recipe’d / agent’d?
- For each “yes”, pick one to build this week (Cowork recipe is usually fastest).
- Document each as a starred chat, a skill, or a plugin.
- Repeat weekly. The OS gets thicker; your week gets lighter.
Stretch. Pair: ask your co-founder or first hire which of your recurring tasks they wish they could just run themselves. That’s the most leveraged OS component to build next.
Founder stories.
Five companies, five lessons. The playbook’s case studies are short on purpose — the lesson is always one decision, not the whole company.
Lean team, AI as the leverage point.
Ambral built customer-success software that synthesizes signals from customer activity and interactions into AI-powered models of every account. A small founding team is doing the work of a 20-person CS org because the AI is the org chart, not the headcount.
The lesson: if your product’s value compounds with data per customer, the AI handles the scaling. Headcount stays flat.
Pivoted, then codified what worked.
HumanLayer pivoted and scaled while turning their internal context-engineering practices into a framework now used across the YC ecosystem. Built around the insight that the most useful functions in software are also the riskiest — especially for non-deterministic LLM systems.
The lesson: the practice you develop solving your own hard problem is sometimes more valuable than the product itself. Document and share.
Domain expertise + AI > engineering background.
Vulcan won government contracts as non-engineers. The AI was the engineer; the founders were the domain experts and operators.
The lesson: if you have deep domain knowledge, the “but I’m not technical” objection no longer holds. The technical work is what AI does. Your job is to know the customer, the regulation, the contract terms.
The compliance + AI moat.
Carta Healthcare builds AI tooling for the healthcare data abstraction problem — an area where compliance, accuracy, and trust gate every deal. The wedge is being the AI player that meets the healthcare bar.
The lesson: regulated markets are slow but defensible. Once you’ve cleared the compliance bar, generic AI tools can’t compete.
Shipping the surface where users live.
Anything builds AI-native consumer experiences where the agent is the product, not an add-on. Where most consumer companies bolt a chat onto an existing app, Anything started from the assumption that the chat is the app.
The lesson: the form factor is part of the bet. Choosing “AI-as-feature” vs “AI-as-product” is a strategic decision, not a UX one.
Find the founder story that maps to yours.
- Read each story above. Which one’s pattern matches your startup?
- Write down what the founder did differently than you would have.
- Adapt one specific move from that story to your week.
- If you can’t find a match, search for one founder who’s 18 months ahead in your category. Steal from them instead.
Stretch. Email the founder. Most respond if you ask one specific question. Worst case, no reply — best case, a 15-min call that compresses your learning by a year.
The 90-day plan.
A practice is decoration unless it changes what happens Monday morning. The closing exercise turns everything above into one written 90-day plan you actually execute.
The shape of the plan
- The one number. Pick one metric from U10 you’ll move. The 90 days are about moving it.
- The 3 outputs. What ships in 30 / 60 / 90 days? Be specific. Features, decisions, customers.
- The kill criteria. What would tell you to stop? Don’t skip this.
- The weekly check-in. Same time, same template. Friday afternoon, 20 minutes, alone.
- The accountability partner. One person who reads the plan and asks at week 12 whether you did what you said.
Write the one-page 90-day plan.
- Pick the number from U10. State the current value and the target.
- List the 3 outputs — 30 day, 60 day, 90 day — that would credibly move that number.
- Write the kill criteria. What would convince you to scrap or pivot?
- Schedule the weekly check-in. Same slot, same template, on the calendar.
- Pick the accountability partner. Send them the plan today.
Stretch. At week 12, re-read the plan before writing the next one. Note what you got right, what you got wrong, and what surprised you. The reflection is where the next playbook starts.
The one prompt every founder runs Friday.
Across the dozen AI-native companies we've watched closely, the single highest-leverage habit is a structured Friday retro using Claude as your honest second voice. This is the prompt; install it as a recurring Claude Project.
# Founder weekly review — paste into a Claude Project every Friday You are my honest second voice. I'm a founder building [one sentence about your company]. Right now I am at stage: [Idea / MVP / Launch / Scale]. Every Friday I check in with you. Here's what happened this week: WINS (3 max): - [list] WHAT I LEARNED about my customer or market: - [list] WHAT I AVOIDED that I shouldn't have: - [list, be honest] WHAT I'M TELLING MYSELF that might be a lie: - [list] Based on this, tell me: 1. The ONE move next week that would change the trajectory most. Be specific — a person to call, a feature to ship, a piece to write, a decision to stop avoiding. 2. The thing I'm clearly procrastinating on, and why I'm probably wrong about why. 3. If my stage is right based on what's actually happening. (Founders almost always mis-stage themselves — too far ahead in their head, behind in reality. Push back if my self-assessment is off.) 4. One question I should be sitting with this weekend. Be direct. No hedging. I don't need a coach — I need someone who'd notice if I were drifting.
Why this works: The "things I'm telling myself that might be a lie" line is the unlock. Founders' self-reports are systematically biased toward the version of the story they wish were true. Naming that bias up front gives Claude permission to push back, which is the part you actually need.
Install it today.
- Create a Claude Project called
Friday Founder Review. - Paste the prompt above as the Project's custom instructions.
- Add a recurring Friday-4pm calendar event titled Friday Review — 20 minutes.
- Run it this Friday. Save the output. Re-read Monday before deciding what to work on.