How I Built a RAG System on the SpaceX S-1 in One Weekend

SpaceX filed a 308-page S-1 on May 20, 2026. I wanted to ask it questions and get cited answers — not summaries, not someone else's interpretation. The actual text, with a page reference.

So I built AskS1.com. Here's what that actually involved.

The Architecture

SpaceX S-1 PDF (308 pages)
    ↓ pdfplumber — extract text page by page
    ↓ sliding window chunker — 400 words, 100 overlap
    ↓ all-MiniLM-L6-v2 — embed chunks → 384-dim vectors
    ↓ Qdrant Cloud — store vectors + page metadata

User question
    ↓ same embedding model — embed query
    ↓ cosine similarity → top 15 candidates
    ↓ re-rank — penalize summary pages
    ↓ Claude Haiku — generate cited answer

Four components. Each does one thing.
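
To make the query path concrete, here is a toy sketch of the "cosine similarity → top 15" step in plain Python. In the pipeline above Qdrant performs this search, so this is only an illustration of the idea: the function names, the dict keys, and the 3-dimensional vectors (standing in for 384-dimensional embeddings) are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=15):
    """Return the k (score, chunk) pairs most similar to the query, best first.

    `chunks` is a list of dicts with hypothetical keys "vector" and "page".
    """
    scored = [(cosine(query_vec, c["vector"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy data: three "chunks" with made-up vectors and page numbers.
chunks = [
    {"page": 12, "vector": [1.0, 0.0, 0.0]},
    {"page": 140, "vector": [0.9, 0.1, 0.0]},
    {"page": 201, "vector": [0.0, 1.0, 0.0]},
]
results = top_k([1.0, 0.0, 0.0], chunks, k=2)
print([c["page"] for _, c in results])  # [12, 140]
```

A vector database does the same ranking with an approximate index, so it doesn't have to score every stored chunk on every query.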

The Chunking Decision

400 words per chunk with 100-word overlap. Consecutive chunks share 100 words, so any passage up to 100 words long that straddles a chunk boundary still appears whole in at least one chunk. Smaller chunks lose context for multi-sentence financial disclosures. Larger chunks reduce retrieval precision.
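
A minimal sketch of a sliding-window chunker with these settings. It splits on whitespace, which is an assumption about how "words" are counted, and `chunk_words` is a hypothetical name. A real pipeline would also carry the source page number alongside each chunk.

```python
def chunk_words(text, size=400, overlap=100):
    """Split text into overlapping windows of `size` words.

    Each window starts `size - overlap` words after the previous one,
    so neighbouring chunks share `overlap` words.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# 1,000 synthetic words -> windows starting at word 0, 300, and 600.
sample = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(sample)
print(len(chunks))  # 3
```

With a 300-word step, the last 100 words of one chunk are the first 100 words of the next.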

Why Claude Haiku

I benchmarked Claude Haiku against local models on 15 SpaceX S-1 questions with RAG context injected:

Model	Overall	Latency
Claude Haiku	4.7/5	2.8s
phi4:14b (local)	4.5/5	27.6s
mistral:7b (local)	4.4/5	9.0s
deepseek-r1:14b (local)	4.3/5	102.8s

Quality gap between Haiku and the best local model (phi4:14b): 0.2 points. Latency gap: roughly 10x (2.8s vs 27.6s). For a web product, Haiku wins.
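
The gaps can be checked directly from the table. This snippet only redoes the arithmetic on the numbers above; it is not the benchmark harness itself.

```python
# (overall score, latency in seconds), copied from the table above.
results = {
    "Claude Haiku": (4.7, 2.8),
    "phi4:14b": (4.5, 27.6),
    "mistral:7b": (4.4, 9.0),
    "deepseek-r1:14b": (4.3, 102.8),
}

base_score, base_latency = results["Claude Haiku"]

# For each local model: how many points lower, and how many times slower.
gaps = {
    name: (round(base_score - score, 1), round(latency / base_latency, 1))
    for name, (score, latency) in results.items()
    if name != "Claude Haiku"
}

for name, (score_gap, slowdown) in gaps.items():
    print(f"{name}: {score_gap} points lower, {slowdown}x slower")
```

The 7B model is the fastest local option at about 3x Haiku's latency, while the slowest 14B model is more than 36x slower.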

The Challenges

Page citations were harder than expected.

The S-1 exists only as HTML on SEC EDGAR; there is no official PDF. I saved it to PDF via Chrome, which reflows the HTML to fit the page, so the order of the PDF's text layer doesn't match the visual page layout.

I tried three approaches: regex matching standalone page numbers, position-based extraction using word coordinates, and WeasyPrint HTML→PDF conversion. The regex approach correctly matched Chrome's footer pattern (89/308), but those numbers reflect Chrome's physical page count, not the document's printed page numbers. Position-based extraction picked up random numbers from tables and footnotes.
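
For illustration, the regex approach might look like the sketch below. It assumes the footer lands at the end of a line in the extracted text in the form n/308; the pattern and the function name are hypothetical. As noted above, what it recovers is Chrome's physical page index, not the S-1's printed page number.

```python
import re

# Matches "n/total" at the end of a line, e.g. "89/308".
FOOTER_RE = re.compile(r"(\d{1,3})/(\d{1,3})\s*$")

def find_footer_page(page_text, total_pages=308):
    """Return n from an "n/total" footer, or None if no footer is found.

    Scans from the bottom of the page and requires the denominator to equal
    the known page count, which filters out dates and fractions in the body.
    """
    for line in reversed(page_text.splitlines()):
        match = FOOTER_RE.search(line)
        if match and int(match.group(2)) == total_pages:
            return int(match.group(1))
    return None

# Made-up page text for demonstration only.
sample = "Some body text\nA table row ending in 12/31\n89/308"
print(find_footer_page(sample))  # 89
```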

The pragmatic fix: citations show a ±20 page range (~p.69-109) rather than a single page number. Honest about the uncertainty, still directionally useful.
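
A minimal sketch of how such a range could be produced from a chunk's page number. Clamping to the first and last page is an assumption I've added for the example, and `citation_range` is a hypothetical name.

```python
def citation_range(page, spread=20, first_page=1, last_page=308):
    """Format a +/- `spread` page range around `page`, clamped to the document."""
    low = max(first_page, page - spread)
    high = min(last_page, page + spread)
    return f"~p.{low}-{high}"

print(citation_range(89))   # ~p.69-109
print(citation_range(5))    # ~p.1-25
print(citation_range(300))  # ~p.280-308
```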

Summary pages dominated retrieval.

The executive summary (pages 1-30) mentions every topic at a high level and consistently scored highest in semantic similarity for almost any query — even when detailed content was 100 pages later.

Fix: retrieve 15 candidates, apply a 0.15 penalty to the similarity score of any chunk from pages under 30, return the top 5 after re-ranking. Simple but effective.
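
The re-ranking step can be sketched as follows. Subtracting the penalty from the score is an assumption (a multiplicative penalty would also fit the description), and the `(score, page)` tuple shape and function name are hypothetical.

```python
def rerank(candidates, penalty=0.15, summary_cutoff=30, top_n=5):
    """Penalize early (summary) pages, then return the best `top_n` candidates.

    `candidates` is a list of (similarity_score, page_number) tuples.
    """
    adjusted = []
    for score, page in candidates:
        if page < summary_cutoff:
            score -= penalty  # assumed: penalty is subtracted from the score
        adjusted.append((score, page))
    adjusted.sort(key=lambda pair: pair[0], reverse=True)
    return adjusted[:top_n]

# Made-up scores: the summary chunk on page 12 wins before re-ranking.
candidates = [(0.80, 12), (0.78, 140), (0.75, 141), (0.60, 5)]
print([page for _, page in rerank(candidates, top_n=2)])  # [140, 141]
```

In this toy example the page-12 chunk has the highest raw score but drops to third after the penalty, so the two detailed chunks are returned instead.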

pypdf missed page 1 of the PDF.

Switched to pdfplumber. Handles styled PDFs correctly and extracts text from every page including the first.
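
A minimal sketch of page-by-page extraction with pdfplumber, plus a small check that would flag a page that came back blank. The function names and the file name in the usage comment are hypothetical.

```python
def extract_pages(pdf_path):
    """Return a list of (page_number, text) tuples, one per PDF page."""
    import pdfplumber  # third-party: pip install pdfplumber

    records = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            # Fall back to "" if extraction yields nothing for a page.
            text = page.extract_text() or ""
            records.append((number, text))
    return records

def empty_pages(records):
    """Page numbers whose extracted text is blank: a quick check for missed pages."""
    return [number for number, text in records if not text.strip()]

# Usage (hypothetical file name):
#   records = extract_pages("spacex-s1.pdf")
#   print(len(records), empty_pages(records))
```

Running a check like `empty_pages` after extraction makes a silently skipped page visible before anything gets embedded.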

What's Next

Anthropic and OpenAI S-1s are expected later this year. AskS1.com will be there when they file.

Built with Claude API, Qdrant, Next.js, sentence-transformers, and pdfplumber. Deployed on Railway.

I'm a software engineer working on large-scale ads infrastructure. This was a weekend project to learn RAG engineering by applying it to something real.