Field Report

I built an adaptive AI tutor for the GRE Math Subject Test for my son. Here is what I learned about AI in test prep.

By Steve Hubbard. Published May 18, 2026.

My son is aiming at a Princeton math PhD. The GRE Math Subject Test sits squarely in his way as one of the few remaining quantitative gates that graduate admissions committees actually still read. When I looked at what was available to prepare him, the options were unimpressive. There are expensive courses pitched at students starting from zero. There are scattered .edu mirrors of retired ETS forms with no scoring, no review, no feedback loop. There are a few apps with a clean UI and a static question bank that you grind through, after which the data you generated goes nowhere.

So I built one. I am an AI infrastructure engineer, not a test prep entrepreneur. I write systems for a living. Over the last two weeks of evenings I put together an adaptive practice app with an integrated AI tutor, deployed it to Netlify, and watched my son use it. This is a writeup of what worked, what was harder than I expected, and what the per-student cost actually looks like when you put a frontier model behind a study app.

What it actually is

The app is a single-page vanilla JavaScript front end deployed on Netlify, with four small serverless functions that proxy calls to the Anthropic API. The model is Claude Opus 4.7 for the AI features, with a Sonnet fallback. No frameworks. No database. Per-user state lives in the browser via localStorage, keyed by a profile name so my son and I can have separate stats on the same machine.
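A proxy function of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not the app's actual source: the function name `buildAnthropicRequest`, the environment variable names `ANTHROPIC_API_KEY` and `TUTOR_MODEL`, and the shape of the body sent by the browser are all hypothetical. The endpoint and headers follow Anthropic's public Messages API. The model ID is read from configuration rather than hard-coded, and the Sonnet fallback is not shown.

```javascript
// Sketch of a Netlify-style serverless proxy for the Anthropic Messages API.
// Hypothetical names: buildAnthropicRequest, ANTHROPIC_API_KEY, TUTOR_MODEL.

// Build the URL and fetch options for one Messages API call.
function buildAnthropicRequest({ apiKey, model, system, messages, maxTokens = 512 }) {
  if (!apiKey) throw new Error("missing API key");
  return {
    url: "https://api.anthropic.com/v1/messages",
    options: {
      method: "POST",
      headers: {
        "content-type": "application/json",
        "x-api-key": apiKey, // stays server-side, never shipped to the browser
        "anthropic-version": "2023-06-01",
      },
      body: JSON.stringify({ model, max_tokens: maxTokens, system, messages }),
    },
  };
}

// Classic Netlify Functions handler shape: (event) => { statusCode, body }.
async function handler(event) {
  const { system, messages } = JSON.parse(event.body || "{}");
  const { url, options } = buildAnthropicRequest({
    apiKey: process.env.ANTHROPIC_API_KEY,
    model: process.env.TUTOR_MODEL, // placeholder: set to whichever model ID you use
    system,
    messages,
  });
  const upstream = await fetch(url, options);
  return { statusCode: upstream.status, body: await upstream.text() };
}
```

In a deployed function the handler would be exported (for example `exports.handler = handler`). The point of the proxy is that the API key lives in the function's environment and the browser only ever talks to the function.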

The question bank is 330 items pulled from five complete official ETS Math Subject Test forms (GR3768, GR0568, GR9768, GR9367, GR8767), 66 questions per form. Every question is tagged against a 16-topic canonical taxonomy covering single-variable calculus, multivariable calculus, linear algebra, abstract algebra, real analysis, complex analysis, topology, probability, combinatorics, number theory, ODE, and the rest. Math renders through KaTeX. Figures are cropped from the original PDFs and served as static images.

On top of that bank the app runs four things that matter:

An adaptive Leitner selector that classifies every question as New, Learning, or Mastered, with a 1, 3, 7, 14, 30 day review ladder and a three-correct-in-a-row mastery exit.
A multi-turn AI tutor attached to every question, with chat history persisted per question so a follow-up next week resumes the same conversation.
A "generate three similar problems" button that, when my son misses a question, hands the seed to the model and gets back three brand new five-choice multiple-choice items of the same topic and difficulty family, complete with answer keys and explanations.
A wrong-answer pattern detector that every 30 misses sends a sample to the model and gets back a one-sentence observation about what kind of trap the student is falling into.
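The first item in that list, the Leitner ladder, can be sketched in a few lines. The interval ladder and the three-in-a-row exit come from the description above. How a card moves between rungs is not stated, so the rule used here (a correct answer climbs one rung, a miss drops one rung and resets the streak) is an assumption, as are the names `newCard` and `recordAttempt`.

```javascript
// Leitner ladder sketch. Intervals and the mastery exit come from the article;
// the rung-movement rules are assumptions.
const LADDER_DAYS = [1, 3, 7, 14, 30];
const MASTERY_STREAK = 3;
const DAY_MS = 24 * 60 * 60 * 1000;

// A card that has never been attempted.
function newCard(id) {
  return { id, status: "new", rung: 0, streak: 0, dueAt: null };
}

// Return the card's next state after one attempt at time `now` (ms since epoch).
function recordAttempt(card, correct, now) {
  const streak = correct ? card.streak + 1 : 0;
  if (streak >= MASTERY_STREAK) {
    // Mastery exit: removed from rotation, no further due date.
    return { ...card, status: "mastered", streak, dueAt: null };
  }
  // Assumption: the first attempt lands on rung 0; afterwards a correct
  // answer climbs one rung and a miss drops one rung.
  let rung = card.rung;
  if (card.status !== "new") {
    rung = correct
      ? Math.min(rung + 1, LADDER_DAYS.length - 1)
      : Math.max(rung - 1, 0);
  }
  return {
    ...card,
    status: "learning",
    rung,
    streak,
    dueAt: now + LADDER_DAYS[rung] * DAY_MS,
  };
}
```

Returning a new object instead of mutating the card keeps the update easy to persist: the caller can serialize the result straight into per-profile storage.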

Around those sit four session modes (sprint, quarter, mock, custom), a topic mastery heatmap, bookmarks, per-question notes, a score trend chart, mid-session resume, an export-to-PDF of the session log, and a test-day target date that flips the home screen into a final-week dashboard 14 days out. None of it is novel individually. The combination is what made it useful.

What was harder than I expected

The fantasy version of building this product goes: ask Claude for 500 GRE Math Subject Test questions, paste them into the app, ship. The reality is that the question bank is the product, and the bank cannot be sourced cheaply.

The 330 questions I shipped came from ETS PDFs. Some are old enough that the figures are scanned bitmaps and the text needs OCR. I ran them through a combination of vision OCR and direct PDF extraction, then sampled and audited. The defect rate on the first pass was 8 percent. Eight percent. On a test where a single wrong answer can move the scaled score by 10 points, an 8 percent corruption rate in the practice set is unacceptable.

The defects were not subtle and they were not random. A few examples from the audit log:

GR0568 question 28. Correct answer index recorded as 2 by the extractor. Actual ETS key, verified against the answer page, is 3. The student would have been told they were wrong while picking the right answer.
GR9768 question 17. A delta-f table in the stem was rendered as plain text and lost its column structure. The resulting "stem" was mathematically meaningless. Required manual reconstruction from the original PDF.
GR3768 question 34, option E. Contained a stray OCR marker (a character the vision model emitted to flag uncertainty) that I had failed to strip in post-processing. A student would see a nonsense token in one of their choices.
GR9768 question 43. A matrix in the answer choices had the entries reordered by a column-major reader where the original was row-major. Every option was technically wrong.
GR8767 question 62. The stem said "clockwise" where the original said "counterclockwise." Reversing the orientation flipped the correct answer.

Each of these was caught only by a human (me, in this case) opening the original ETS PDF and comparing it to the rendered question side by side. Per question, the verification time is around 12 minutes when you include reading the stem, walking through the solution, confirming the answer key, and spot-checking the explanation. At 330 questions that is roughly 66 hours of work, and it is the kind of work that an AI model is not yet good at doing for itself, because the model is reading the same defective extraction.
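A mechanical pre-check cannot replace that side-by-side review, but it might narrow where a reviewer looks first. The sketch below is hypothetical: the item fields (`stem`, `choices`, `answerIndex`), the zero-based index convention, and the idea of a separately transcribed answer key are all assumptions. The character test is only a stand-in, since the actual OCR marker is not specified. It would catch a key mismatch or a stray token, and it would not catch a reordered matrix or a flipped "clockwise".

```javascript
// Mechanical pre-checks on one extracted item. Field names are hypothetical,
// and answer indices are assumed to be zero-based (0 = A ... 4 = E).
function lintItem(item, officialKeyIndex) {
  const problems = [];
  if (!Array.isArray(item.choices) || item.choices.length !== 5) {
    problems.push("expected five choices");
  }
  const idx = item.answerIndex;
  if (!Number.isInteger(idx) || idx < 0 || idx > 4) {
    problems.push("answer index out of range");
  } else if (officialKeyIndex !== undefined && idx !== officialKeyIndex) {
    problems.push("answer index disagrees with transcribed key");
  }
  const texts = [item.stem, ...(Array.isArray(item.choices) ? item.choices : [])];
  // U+FFFD is the Unicode replacement character; low control characters
  // rarely belong in a stem. Both are stand-ins for a real OCR marker.
  const suspicious = /[\uFFFD\u0000-\u0008]/;
  if (texts.some((t) => typeof t !== "string" || suspicious.test(t))) {
    problems.push("suspicious character in stem or choices");
  }
  return problems;
}
```

An empty array means only that the item passed these structural checks. Whether the stem is mathematically faithful to the original still needs a person with the PDF open.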

The takeaway: in test prep, AI accelerates almost everything except the part that determines whether the product is honest. The data layer is still a human problem.

What worked

Three features justified the build and would justify the build again. They are also the three features that are very hard to find in any existing prep product at any price.

Adaptive Leitner with a real exit condition

Most prep apps that claim "adaptive" run a difficulty knob. The app I built runs a spaced-repetition schedule. Every question lives in one of three buckets. New, never attempted. Learning, in the Leitner ladder waiting for its next review day. Mastered, three correct in a row and removed from rotation. Session selection targets 60 percent Learning, 30 percent review-due, 10 percent New, with backfill from other buckets if any bucket is empty. The student does not pick what to study. The system picks, and the system gets it right.
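The 60/30/10 target with backfill could be sketched as follows. The pool names (`learning`, `reviewDue`, `fresh`) and the function `composeSession` are hypothetical, and the sketch simply takes items from the front of each pool, where a real selector would presumably order or shuffle them first.

```javascript
// Pick a session of `size` questions from three pools using the 60/30/10
// target, topping up from the other pools when one runs short.
function composeSession(pools, size) {
  const targets = [
    ["learning", 0.6],
    ["reviewDue", 0.3],
    ["fresh", 0.1],
  ];
  const picked = [];
  const leftovers = [];
  for (const [name, share] of targets) {
    const pool = pools[name] || [];
    const want = Math.round(size * share);
    picked.push(...pool.slice(0, want));
    leftovers.push(...pool.slice(want));
  }
  // Backfill: fill any shortfall from whatever the pools still hold.
  const missing = size - picked.length;
  if (missing > 0) picked.push(...leftovers.slice(0, missing));
  // Rounding can overshoot by one for some sizes, so trim to the requested size.
  return picked.slice(0, size);
}
```

With all three pools well stocked, a 10-question session comes out as 6 learning, 3 review-due, and 1 new. If the review-due pool is empty, the three missing slots are filled from the leftovers of the other pools.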

The behavioral effect on my son was immediate. The first week he kept hitting questions he had already missed. The second week the ratio shifted. The third week he was hitting mostly new material because the system had quietly graduated 80 questions to Mastered. The cognitive load shifted from "what should I work on" to "do the next session." This is what spaced repetition is supposed to feel like, and most prep apps do not provide it.

An AI tutor with per-question memory

When my son misses a question, he can open a chat panel right there. The system prompt loads the original question, the choices, his wrong answer, and the correct answer as context. He can ask "why is C wrong" or "is there a faster way" or "what theorem is this." The model is told to teach, to target 80 to 150 words per reply, and to end any teaching block with a one-sentence rule for later spaced recall.
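A prompt of that shape might be assembled like this. The wording of the instructions, the function name `buildTutorSystemPrompt`, and the question fields are hypothetical. Only the ingredients (question, choices, wrong answer, correct answer, the 80 to 150 word target, and the closing one-sentence rule) come from the description above.

```javascript
// Build the system prompt for one tutor conversation.
// Field names and exact wording are hypothetical.
const LETTERS = ["A", "B", "C", "D", "E"];

function buildTutorSystemPrompt(question, studentIndex) {
  const choices = question.choices
    .map((text, i) => `${LETTERS[i]}. ${text}`)
    .join("\n");
  return [
    "You are a tutor for the GRE Math Subject Test. Teach; do not just state the answer.",
    "Target 80 to 150 words per reply.",
    "End any teaching block with a one-sentence rule the student can recall later.",
    "",
    `Question: ${question.stem}`,
    choices,
    `Student answered: ${LETTERS[studentIndex]}`,
    `Correct answer: ${LETTERS[question.answerIndex]}`,
  ].join("\n");
}
```

Because the question context lives in the system prompt, the student's messages can stay short ("why is C wrong") and the model still knows which C is meant.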

The persistence is the part that surprised me. The chat history for every question is saved in localStorage. A week later, he can come back to the same question, and the same conversation resumes. He can ask a follow-up that builds on what was discussed before. That is the difference between a tutor and a search engine, and it is the thing that ChatGPT and Claude as general products do not give you, because they do not know what question he is on.
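Per-question persistence of this kind needs very little code. The sketch below is a guess at the shape, not the app's source: the key format and the function names are hypothetical. The `storage` argument is anything with `getItem` and `setItem`, so `window.localStorage` works in the browser and the in-memory stand-in works elsewhere.

```javascript
// Per-question chat history keyed by profile and question ID.
// Key format and function names are hypothetical.
function chatKey(profile, questionId) {
  return `tutor:${profile}:${questionId}`;
}

// Load the saved turns for one question, or an empty history.
function loadChat(storage, profile, questionId) {
  const raw = storage.getItem(chatKey(profile, questionId));
  return raw ? JSON.parse(raw) : [];
}

// Append one turn and write the whole history back.
function appendTurn(storage, profile, questionId, role, content) {
  const history = loadChat(storage, profile, questionId);
  history.push({ role, content });
  storage.setItem(chatKey(profile, questionId), JSON.stringify(history));
  return history;
}

// In-memory stand-in for localStorage, for use outside a browser.
function memoryStorage() {
  const map = new Map();
  return {
    getItem: (k) => (map.has(k) ? map.get(k) : null),
    setItem: (k, v) => {
      map.set(k, String(v));
    },
  };
}
```

Resuming a conversation a week later is then just `loadChat(...)` followed by sending the stored turns as the `messages` array of the next request.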

Generate three similar problems

This is the feature I am most proud of and the one that points at where the real moat is. When my son misses a question, he can press a button and the model returns three brand new multiple-choice questions of the same topic and difficulty family, with answer keys and explanations. They are stored client-side for the session and the student can attack them immediately.
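Model output of this kind is worth checking structurally before it reaches the student. The sketch below assumes the model is asked to reply with a JSON array of three items, each with `stem`, `choices`, `answerIndex`, and `explanation`. That reply format and the function name `parseGeneratedBatch` are assumptions, not something the app is known to use.

```javascript
// Parse and validate a model reply expected to be a JSON array of three
// five-choice items. The JSON shape is an assumption.
function parseGeneratedBatch(replyText) {
  let items;
  try {
    items = JSON.parse(replyText);
  } catch {
    return { ok: false, error: "reply is not valid JSON" };
  }
  if (!Array.isArray(items) || items.length !== 3) {
    return { ok: false, error: "expected exactly three items" };
  }
  for (const [i, item] of items.entries()) {
    const isObject = item !== null && typeof item === "object";
    const validChoices = isObject && Array.isArray(item.choices) && item.choices.length === 5;
    const validKey =
      isObject && Number.isInteger(item.answerIndex) && item.answerIndex >= 0 && item.answerIndex < 5;
    const validText =
      isObject && typeof item.stem === "string" && typeof item.explanation === "string";
    if (!validChoices || !validKey || !validText) {
      return { ok: false, error: `item ${i + 1} is malformed` };
    }
  }
  return { ok: true, items };
}
```

This confirms only that the batch is well-formed. It says nothing about whether the keyed answer is mathematically correct, which is the same human-review problem the official bank had.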

From the student's point of view, the question bank is no longer finite. The model takes about 10 seconds per batch of three. The cost is a few cents. The pedagogical effect of getting three more reps on the exact thing you just missed, while it is still warm, is the kind of practice that a $300-an-hour human tutor would charge for and could not produce on demand. A vendor-built product would never put this feature in a $99 one-time-purchase tier, because the per-use cost would eat them alive at scale.

What the AI actually costs per student per month

This is the section every founder building on a frontier model should bookmark. I instrumented the app and computed the per-call token math against the actual function source. Here is what a typical study month looks like for one student.

Assumptions: 3 full mock sessions (66 questions each, no tutor calls during the timed mock), 30 tutor chat turns post-session, 15 generate-three-similar runs, 6 wrong-answer pattern detections. Output tokens dominate cost on the generate calls; input tokens dominate on the pattern detection. All numbers include the Opus 4.7 tokenizer overhead, which adds roughly 35 percent to the input token count compared to Opus 4.6.

At full Opus 4.7 pricing ($5 per million input tokens, $25 per million output tokens), the math comes out to:

Tutor chat (30 turns): about $0.44
Generate three similar (15 runs): about $0.99
Pattern detection (6 runs): about $0.09
Total: about $1.52 per student per month on Opus 4.7.
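The arithmetic behind those line items can be reproduced with a few lines. The per-call formula uses the quoted prices. The monthly line items are the dollar figures above expressed in cents. Per-call token counts are not given here, so none are assumed, and the function names are hypothetical.

```javascript
// Prices quoted above, in dollars per million tokens.
const PRICE_PER_MTOK = { input: 5, output: 25 };

// Dollar cost of one call given its token counts.
function callCost(inputTokens, outputTokens) {
  return (
    (inputTokens * PRICE_PER_MTOK.input + outputTokens * PRICE_PER_MTOK.output) / 1e6
  );
}

// Monthly line items from the list above, in cents to avoid floating-point drift.
const MONTHLY_CENTS = { tutorChat: 44, generateSimilar: 99, patternDetection: 9 };

// Sum the line items, optionally scaling some of them (e.g. a heavier user).
function monthlyTotalCents(items, multipliers = {}) {
  return Object.entries(items).reduce(
    (sum, [name, cents]) => sum + cents * (multipliers[name] ?? 1),
    0
  );
}
```

`monthlyTotalCents(MONTHLY_CENTS)` gives 152 cents, the $1.52 total. Doubling the tutor and generate lines with `{ tutorChat: 2, generateSimilar: 2 }` gives 295 cents, which is where a figure of roughly $3.00 for a heavier user comes from.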

A heavier user (double the tutor turns, double the generate runs) lands around $3.00. Switching the model to Claude Sonnet 4.6 cuts the bill by roughly 4x to about $0.40 per student per month, with a noticeable but acceptable quality regression on the generate-three-similar feature. Switching to Haiku 4.5 cuts another 3x to about $0.13, with a more noticeable regression on the tutor.

The implication is uncomfortable for the test-prep industry. A $99 one-time-fee prep app cannot economically bolt on a real AI tutor at Opus quality without raising its price or capping its features. The shelf price of "AI-enhanced GRE prep" at fair quality is in the $5 to $15 per student per month range, all in, for the AI layer alone. Anyone selling adaptive AI tutoring for less than that is either using a much weaker model, capping use aggressively, or losing money on the AI line while making it up elsewhere.

What I would tell a builder thinking about this space

Three lessons from the build, in descending order of how surprising they were to me.

The question bank is the product. Plan to spend 12 minutes per question on human review. AI can extract, AI can tag, AI can write explanations, AI can cross-validate, but on a test where correctness is the entire game, only a human who can solve the problem can sign off on the stem and the key. Budget for it. The 12 minutes per question figure I am quoting came out of my own log, and I am a fast reviewer. Plan for $60 to $80 an hour for a graduate-level math reviewer, which means a 500-question bank costs you on the order of $7,000 in human time alone, on top of whatever you spend on AI generation.
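The review budget is simple enough to compute directly. The function name `reviewBudget` is hypothetical, and the inputs are the figures quoted in this piece.

```javascript
// Human review budget: hours and dollars for a bank of a given size.
function reviewBudget(questions, minutesPerQuestion, hourlyRate) {
  const hours = (questions * minutesPerQuestion) / 60;
  return { hours, dollars: hours * hourlyRate };
}

// The 330-question bank at 12 minutes each.
const shipped = reviewBudget(330, 12, 0);

// A 500-question bank at the low and high ends of the reviewer rate.
const low = reviewBudget(500, 12, 60);
const high = reviewBudget(500, 12, 80);
```

That works out to 66 hours for the shipped bank, and 100 hours costing between $6,000 and $8,000 for a 500-question bank, which brackets the $7,000 figure.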

The adaptive layer is the moat. Static prep is a commodity. Every prep company has a question bank. The differentiator that justifies a premium price is the combination of spaced repetition, per-question conversational memory, and unbounded same-topic practice generation. A static PDF of 500 questions is worth maybe $30. The same 500 questions wrapped in a real adaptive system with a tutor is worth $300 to $500 over a six-month study window, because it replaces the variable that a private tutor would normally provide.

Distribution is the actual problem. Building this took me two weekends and some weeknight tuning. Getting it in front of 1,000 students taking the GRE Math Subject Test each year is the harder challenge, because the population is small, niche, and not on the channels that B2C marketing reaches. A SaaS company building a product like this should think about the question of how to acquire users before the question of what to build. I have not solved this part. It is the reason this article exists.

An honest ask

I am still deciding whether to commercialize this beyond my own kid. The build is done. The economics check out. The legal status of the question bank (verbatim ETS questions are under copyright and would need to be replaced with AI-generated equivalents before any paid distribution) is the next gate to clear. The remaining question is whether anyone outside my household actually wants the thing.

If you, or someone you know, is preparing for the GRE Math Subject Test or a similar quantitative subject test, and you want to try the tool free in exchange for honest feedback, email me at honestaiguide@redsunllc.com. Six slots, first come. I will set you up with a profile, give you the access password, and ask you for an unfiltered take after two weeks of use. If nobody emails, that is also useful data, and I will publish the negative result.

Steve Hubbard is an AI infrastructure engineer and the editor of Honest AI Guide. He is not affiliated with ETS. His son's GRE Math Subject Test attempt is currently scheduled for Fall 2026.

Disclosure. This article describes a personal project. Honest AI Guide earns affiliate commissions on some product reviews elsewhere on the site, but this piece is not an affiliate placement. The tool described is not commercially available as of publication.