How to Build an AI Content Research Engine: a Claude + Reddit Case Study

TL;DR
For most solo builders and small teams, content research is the bottleneck, not writing.
A weekly Collect → Cluster → Score → Synthesize pipeline turns real customer conversations into a ranked list of topics you can scan over coffee.
The pattern works for any niche where your audience talks publicly somewhere: Reddit, Indie Hackers, niche Discords, forums.
The scoring prompt is where most of the leverage lives. Treat it as a markdown file you keep rewriting until its top three match the ones you’d pick by hand.

As I was building out Storkly, I was pretty confident about getting us up and running to an MVP/soft launch phase. I had enough experience with software and web development, and while there was (and still is) a lot to learn, getting from zero to one wasn’t my primary worry. It was, “How do I drive people to this product?” I knew we needed to create content to get the word out, but I always found myself frozen on what to write about and what to post about.

If you’re running content for a small brand or building out a new product as a team of one, you’ve probably noticed that this is where the real challenge lies. It’s the kind of research that historically meant a content strategist with a Reddit tab (among many others) open, a spreadsheet, and a lot of intuition.

That research step is the natural place to point AI. It’s information processing, not voice. Get the topic right and the rest of the workflow protects itself.

This post walks through an AI content research engine I built: a weekly content topic discovery system that ranks 3–5 themes from real customer conversations and drops them into a phone-readable digest every Sunday. I built it for Storkly, but the architecture is generic. I’ll show the pattern, then point at where you’d swap pieces for your own niche.

The content sprint problem

Most small content teams end up converging on roughly the same weekly workflow:

1. Research: what should we write about this week?
2. Founder or brand voice memo: the take, in their actual words
3. AI draft assembly: a long-form scaffold from the memo
4. Human edit: voice and accuracy pass
5. Distribution: fan one piece out across channels
6. Schedule

Steps 2 through 6 keep a human in the loop because that’s where the brand actually lives. Step 1 doesn’t need one. It’s pattern recognition over public data, which is the kind of thing AI is genuinely good at. Automate it aggressively and you buy back the part of your week that only you can do.

What I was optimizing for

Before writing any code I pinned down what “good” looked like.

Decision in 10 minutes. Output is a phone-readable Sunday morning digest. If I have to log into something to see it, I built the wrong thing.
Anchored in real customer voice. Every theme is backed by 2–3 verbatim quotes with source links. An LLM saying “your audience cares about X” without quotes is hallucination with extra steps. Storkly uses Reddit because that’s where new parents are venting and asking questions. For SaaS or AI products you might use Indie Hackers, for hobbyists a Discord export.
Brand-aligned scoring. No generic “topics in your category.” The system has to score against our specific brand narrative. A generic ranker surfaces the loudest topic in the niche every week, and that’s rarely the one that should be written about.
Don’t Over-Engineer. Keep the focus narrow. One scheduled function, one table, three prompts. No vector DB, no agent framework, no orchestrator. When something breaks, I want to know exactly which file to open.

The architecture

Source ──► Collector ──► Cluster ──► Score ──► Synthesize ──► Store
           (HTTP)        (LLM #1)    (LLM #2)  (LLM #3)       (Airtable)

Five stages, three of which are LLM calls. The Collect → Cluster → Score → Synthesize pattern is the meat of the flow and should be applicable for any niche where your audience has a public watering hole somewhere.

For Storkly, I started with Reddit (a handful of parenting subreddits) as the source, but will likely expand this over time. The whole thing runs as a single Modal scheduled function, Saturday 11pm ET, around 5–10 minutes per run.

Decisions that took thinking

Once the architecture and guiding principles were defined, most of the build was straightforward. A few choices weren’t and took some playing around with.

Unauthenticated Reddit JSON instead of PRAW or OAuth

Reddit serves every public subreddit as JSON at a predictable URL. No app registration, no client secret, no token refresh:

GET https://www.reddit.com/r/{subreddit}/top.json?t=week&limit=30

The weekly job is roughly 210 requests. At 1 req/sec pacing that is about three and a half minutes of collection, and it stays within the ~60 requests-per-minute unauthenticated cap. Skipping OAuth removed a setup step that fails silently, two secrets from .env, and one dependency. If I ever need authenticated access, the collector swap is local and doesn’t touch anything downstream.
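
A minimal sketch of a collector along these lines, assuming the standard library's urllib rather than any particular HTTP client. The `USER_AGENT` string, the function names, and the injectable `opener` and `sleep` parameters are illustrative assumptions, not the project's actual code.

```python
import json
import time
import urllib.request

# Hypothetical identifier; describe your own collector and a contact here.
USER_AGENT = "content-research-collector/0.1 (contact: you@example.com)"


def build_url(subreddit, window="week", limit=30):
    """Return the public top-posts JSON URL for one subreddit."""
    return f"https://www.reddit.com/r/{subreddit}/top.json?t={window}&limit={limit}"


def fetch_listing(url, opener=urllib.request.urlopen):
    """GET one listing and decode it. `opener` is injectable for offline tests."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with opener(request, timeout=30) as response:
        return json.load(response)


def collect(subreddits, opener=urllib.request.urlopen, sleep=time.sleep,
            pace_seconds=1.0):
    """Fetch each subreddit in turn, pausing between requests."""
    posts = []
    for index, name in enumerate(subreddits):
        if index > 0:
            sleep(pace_seconds)  # 1 req/sec pacing between calls
        listing = fetch_listing(build_url(name), opener=opener)
        # Reddit listings nest each post under data.children[].data
        posts.extend(child["data"] for child in listing["data"]["children"])
    return posts
```

Because nothing downstream depends on how `collect` gets its data, swapping in an authenticated client later only touches `fetch_listing`.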

Porting this: for B2B tech audiences, Hacker News’ Algolia API is similarly auth-free and excellent. For hobbyist niches, Discord exports or RSS from a forum work the same way. The pattern is “find the lowest-friction read endpoint for your audience’s watering hole.”
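
As a rough illustration of that swap, here is how a Hacker News request might be built against the public Algolia search endpoint. The function name and defaults are assumptions for this sketch. The query parameters (`query`, `tags`, `numericFilters`, `hitsPerPage`) are the ones the HN Search API documents, but check the current docs before relying on them.

```python
from urllib.parse import urlencode

HN_SEARCH = "https://hn.algolia.com/api/v1/search"


def build_hn_url(query, since_epoch, hits_per_page=30):
    """Return a search URL for stories matching `query` posted after `since_epoch`."""
    params = {
        "query": query,
        "tags": "story",  # stories only, no comments
        "numericFilters": f"created_at_i>{since_epoch}",  # Unix timestamp cutoff
        "hitsPerPage": hits_per_page,
    }
    return f"{HN_SEARCH}?{urlencode(params)}"
```

Only the URL builder changes. The pacing loop and everything after collection can stay the same.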

Three LLM calls instead of one

A lot of people reach for “let one big model do everything in one prompt.” I think that’s a mistake when each step is a different kind of reasoning. I’d be lying if I said I hadn’t been fighting that concept as well (see “Don’t Over-Engineer” above).

Clustering needs broad context, so you hand it everything and let it find structure. Scoring needs sharp criteria, so you hand it a tight rubric and constrain it. Synthesis needs polished prose, so you hand it a small set of scored themes and ask for one good sentence at a time.

Combining them muddles every prompt. Splitting them means each stage has one testable job and one clear failure mode. When the briefs read flat, I know it’s the synthesis prompt. When the ranking is off, I know it’s the scoring prompt. That separation pays for itself the first time something goes wrong.
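
One way to picture that separation, as a sketch only: each stage is a single call with its own prompt, and a failure carries the stage name. The `call_llm` callable, the `prompts` dict, `StageError`, and the JSON-in, JSON-out convention are all assumptions for illustration. In a real run `call_llm` would wrap an SDK call.

```python
import json


class StageError(RuntimeError):
    """Raised with the stage name so a bad run points at one prompt."""


def run_stage(name, instructions, payload, call_llm):
    """Send one stage's prompt plus its input, and parse the JSON reply."""
    raw = call_llm(instructions, json.dumps(payload))
    try:
        return json.loads(raw)
    except json.JSONDecodeError as err:
        raise StageError(f"{name} stage returned non-JSON output") from err


def run_pipeline(posts, prompts, call_llm):
    """Cluster, then score, then synthesize. Each stage sees only the previous output."""
    themes = run_stage("cluster", prompts["cluster"], posts, call_llm)
    scored = run_stage("score", prompts["score"], themes, call_llm)
    return run_stage("synthesize", prompts["synthesize"], scored, call_llm)
```

Passing `call_llm` in as a parameter also means each stage can be exercised with a fake model while prompts are being tuned.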

Porting this: don’t combine. Even in a simpler niche, the debugging cost of a one-shot prompt outweighs the latency win.

Sonnet, not Opus

Claude Sonnet is handling cluster and score fine. I’ve tested with both, and while the synthesis is slightly better with Opus, the improvement wasn’t enough to justify the extra cost. If synthesis ever starts reading flat, promoting that stage to Opus is a one-line change because the prompts are decoupled.

Porting this: start with the cheaper model. Upgrade individual stages only when output quality is actually the bottleneck.

Prompts live in version-controlled `.md` files, not Python strings

The scoring prompt is the highest-leverage piece of the system, and it’s going to change every week for the first month. Markdown diffs cleanly in git, reviews well on GitHub mobile, and a prompt change doesn’t drag in a code review. Each stage loads its prompt at runtime from src/prompts/. I never want to worry about a breaking error in the code just to try a new rubric. Flexibility and speed are key here.
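
Loading a prompt at runtime can be as small as the sketch below. The `src/prompts/` directory comes from the setup described above. The one-file-per-stage naming (`score.md` and so on) and the function name are assumptions.

```python
from pathlib import Path

PROMPT_DIR = Path("src/prompts")


def load_prompt(stage, prompt_dir=PROMPT_DIR):
    """Read the markdown prompt for one stage, e.g. src/prompts/score.md."""
    path = Path(prompt_dir) / f"{stage}.md"
    if not path.is_file():
        raise FileNotFoundError(f"No prompt file for stage {stage!r} at {path}")
    # Read on every call so an edited rubric applies on the next run.
    return path.read_text(encoding="utf-8")
```

Nothing is cached, so trying a new rubric means saving the markdown file and re-running the stage.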

My default is often to lead with code solutions (again, see “Don’t Over-Engineer” above), but when flexibility, iteration, and letting the LLMs do what they do best are key, .md prompts are by far the best approach.

Porting this: non-negotiable. If your prompts live in source code, your iteration loop is broken.

Airtable, not Notion or Postgres

Review happens on a phone over coffee, so mobile UX matters more than anything else. Airtable’s mobile app beats Notion’s table view, and Postgres is overkill until I’m storing 10k+ rows. In all honesty, Airtable was also what I was used to and already had available, so that’s what I went with. The key is something that is easy to update and review.

Porting this: any mobile-friendly review surface works. Airtable, Notion, even a daily email. The constraint is “scannable on a phone in 10 minutes,” not the specific tool.

No SEO keyword data in v1

Audience engagement is a better proxy for topic intent than search volume. Search volume captures curiosity. A forum thread with 200 comments captures pain and engagement. Pain is what you want, because those are the conversations where someone is actively looking for an answer, not just typing a question into Google. I will likely add SEO keyword data later on to capture and synthesize intent with action. But as a phase 1 to get started, I kept the scope focused on audience engagement.

Porting this: if your distribution is SEO-first, add Ahrefs or SEMrush later as an enrichment signal. Engagement still leads.

The most reusable artifact: the scoring prompt

Most of the system can sit untouched for months. The scoring prompt is the one file I expect to keep editing.

It does two jobs. First, it tells the LLM what your brand is actually about: the fit hierarchy. This is the template that ports directly to any brand:

CORE (fit score 8–10): 
	[Topics that are 100% on-brand for you]
ADJACENT (fit score 5–7): 
	[Topics that touch your brand from the side]
OFF-BRAND (fit score 1–4): 
	[Topics in your category but not your story]
(Score conservatively even with a clever angle — off-brand is off-brand.)

For Storkly, CORE includes things like “managing extended-family communication during postpartum” and “photo privacy for newborn photos.” OFF-BRAND includes sleep training and baby gear, which are common parenting topics that just aren’t our story.

The fit hierarchy is the most important thing to write down for your own brand. It’s also the artifact that makes this system yours instead of generic.
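
If you would rather keep the hierarchy as data and render it into the prompt, a sketch might look like this. The tier labels and score ranges follow the template above, and the CORE and OFF-BRAND topics are the Storkly examples already mentioned. The ADJACENT entry is left as the template placeholder, and the function and variable names are assumptions.

```python
# Each tier: (label, fit score range, example topics).
STORKLY_TIERS = [
    ("CORE", "8-10", [
        "managing extended-family communication during postpartum",
        "photo privacy for newborn photos",
    ]),
    ("ADJACENT", "5-7", ["[Topics that touch your brand from the side]"]),
    ("OFF-BRAND", "1-4", ["sleep training", "baby gear"]),
]


def render_fit_hierarchy(tiers):
    """Turn the tier list into the text block that goes into the scoring prompt."""
    lines = []
    for label, score_range, topics in tiers:
        lines.append(f"{label} (fit score {score_range}):")
        lines.extend(f"\t{topic}" for topic in topics)
    lines.append("(Score conservatively even with a clever angle; off-brand is off-brand.)")
    return "\n".join(lines)
```

Calling `render_fit_hierarchy(STORKLY_TIERS)` produces a block in the same shape as the template, ready to paste into or append to the scoring prompt file.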

Second, the prompt scores each theme on four dimensions:

Pain — how emotionally or practically acute is this? Signals: emotional language, exhaustion markers, late-night posts, conflict mentions.
Volume — how often the theme shows up in the data.
Fit — per the hierarchy above. Bias hard.
Readiness — “I’d buy a solution right now” energy vs. venting.

Composite:

Pain × Fit × Readiness × LOG(Volume + 1)

Volume is log-scaled so a high-volume but low-pain theme doesn’t drown out a sharp, low-volume one. Pain, Fit, and Readiness all matter linearly because they’re the dimensions that determine whether the topic is actually worth writing about. Volume is a tiebreaker.
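
The composite above can be checked with a few lines of Python. This sketch assumes a base-10 logarithm, which I believe is the default for Airtable's LOG() when no base is given. Swap in `math.log` if natural log is what you intend. The function name and the sample numbers are illustrative.

```python
import math


def composite_score(pain, fit, readiness, volume):
    """Pain x Fit x Readiness x log10(Volume + 1)."""
    if volume < 0:
        raise ValueError("volume must be zero or positive")
    # The +1 keeps the log defined when a theme has no mentions (score becomes 0).
    return pain * fit * readiness * math.log10(volume + 1)


# A sharp, low-volume theme against a loud, low-pain one.
sharp = composite_score(pain=8, fit=9, readiness=7, volume=9)
loud = composite_score(pain=3, fit=5, readiness=4, volume=99)
```

Here `sharp` works out to 504 × 1 = 504 and `loud` to 60 × 2 = 120. Eleven times the mentions only doubles the volume multiplier, so the quieter theme still ranks first.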

What’s actually built so far

Reddit collector with 1-second pacing and 429 exponential backoff. Descriptive User-Agent header on every request, because Reddit aggressively rate-limits the default httpx/curl agents (which I learned the hard/slow way).
Three Claude processing stages (cluster, score, synthesize), each loading its prompt at runtime via the Anthropic Python SDK.
Airtable writer built on pyairtable, hitting a Weekly Topics schema with the composite score as a formula field.
Local orchestrator plus a fixture cache so I can iterate prompts offline without re-hitting Reddit. Once you start tweaking prompts you want to push the same input data through 20 variations in quick succession without burning rate limit.
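
Two of the pieces in that list, the 429 backoff and the fixture cache, are small enough to sketch together. This assumes urllib's `HTTPError` as the error type and a single JSON file per fixture. The function names, retry count, and delays are illustrative assumptions, not the project's actual values.

```python
import json
import time
import urllib.error
from pathlib import Path


def with_backoff(fetch, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on HTTP 429 wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except urllib.error.HTTPError as err:
            # Anything other than 429, or running out of retries, is re-raised.
            if err.code != 429 or attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, 8s with the defaults


def cached(fixture_path, fetch):
    """Return the saved JSON if the fixture exists; otherwise fetch once and save."""
    path = Path(fixture_path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

With something like `cached("fixtures/week.json", lambda: with_backoff(fetch_week))`, where `fetch_week` is your own collector call, the first run hits the network and every later prompt experiment reads the same file.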

The first end-to-end run produced 5 ranked topics with verbatim quotes and source links, reviewable on my phone in under 10 minutes. That was the bar I set going in.

What’s next

Prompt tuning. I’ll do a few weeks of manual Sunday runs. After each one, I log my picks and skips with the why. I use that to update the scoring prompt. The skips and picks with the why are the key training signal.
Modal deploy. Wrap the orchestrator in a scheduled function. Airtable push notification when the digest lands. I’ve played around with scheduling this as a cron job on a local server, but Modal is cheap enough that the couple of dollars was worth not having to potentially troubleshoot the cron job or my server setup.
Multi-source enrichment. Pinterest Trends and Google Trends. I’m leaving this for later to keep the initial deployment focused and to test my thesis that engagement is a good proxy for interest. If I start to notice a lot of repeat themes or topics, I’ll want to start pulling in more sources for a broader base.
Customer signal loop. Once we start driving more traffic to Storkly directly from blog posts, Reddit, Instagram, etc., I’ll be able to leverage that data to further refine. First-party signals are always the priority.
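
For the prompt tuning step, the pick and skip log could be as simple as an append-only JSON Lines file. This is a sketch under assumptions: the field names, the file format, and both function names are hypothetical.

```python
import json
from pathlib import Path


def log_decision(log_path, week, topic, decision, why):
    """Append one pick or skip, with the reason, to a JSON Lines file."""
    if decision not in {"pick", "skip"}:
        raise ValueError("decision must be 'pick' or 'skip'")
    entry = {"week": week, "topic": topic, "decision": decision, "why": why}
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def load_decisions(log_path):
    """Read every logged decision back, oldest first."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

The `why` field is the part worth reading back when editing the scoring prompt, since it says which rubric line misfired.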

Porting this to your own niche

If you want to build something similar, here’s the rough porting guide:

Pick your audience watering hole. Consumer or lifestyle goes to Reddit. B2B and SaaS founders go to Indie Hackers and Hacker News (Algolia API). Developers go to GitHub Discussions and Stack Overflow tags (I miss when Stack Overflow was the go-to). Hobbyists go to Discord exports, niche forums, or subreddits. Vertical professionals go to industry Slack or Discord communities.
Write your own fit hierarchy. The CORE / ADJACENT / OFF-BRAND template ports directly. Spend an actual hour on this. It’s the artifact that makes the system rank your topics, not generic ones.
Keep the three-call structure. Cluster / Score / Synthesize is independent of niche. Resist the urge to combine.
Use a mobile-friendly review surface. Airtable, Notion, a daily email. Anything you’ll actually open with coffee in the morning.
Define your own success bar before you start tuning. Mine: 4 of 6 weeks the top three includes a topic worth writing about, average review under 15 minutes, and at least one piece sourced from the system tops its primary channel. Yours will be different, but write it down before you start tuning or you’ll move the goalposts.

Closing thought

The interesting AI work in content workflows isn’t the generation step. It’s the ranking step.

Generation is cheap and getting cheaper, which means generation without taste mostly produces more of what already exists, converging on the average. The work that’s actually valuable is putting the right input in front of a human whose voice the audience already wants to hear, and then getting out of the way.

I’ll post weekly updates from the tuning phase as Phase 2 runs. If you want the scoring prompt template as a starting point for your own niche, reach out — happy to share.
