The recurring r/Python and r/Hypothesis “model wrote example tests when I asked for property-based” thread has a structural answer: the prompt has to demand a count, ban example-based tests, and force a recursive shrink strategy on nested data. The 7-rule prompt below moves Claude Opus 4.7 to 6 of 6 invariants on the TCC editorial property-test fixture (a JSON-diff library with 6 documented invariants), with shrink strategies and a state machine for the stateful invariant. GPT-5.3-Codex hits the same. The bare “write me property-based tests” prompt produces 4 of 6 invariants and skips the recursion shrinker on most runs across both models.
The prompt
Write property-based tests for the library below, in Python, using Hypothesis.
Inputs:
- Source file: <paste source here or reference by path>
- Function under test: <name>
- Public contract (plain English): <2-4 lines>
Produce exactly this output:
1. A list of 6 invariants, numbered. Each invariant is one sentence.
2. For each invariant, a @given-decorated test function. Name the test `test_prop_<short_name>`.
3. Include shrink-friendly strategies. For recursive or nested inputs, use `st.recursive` with a size cutoff (max_leaves=8) and compose with shared structures where the invariant requires shared identity.
4. Use `@settings(max_examples=200, deadline=timedelta(seconds=1))`. Adjust only if justified in a comment.
5. If any invariant is not provable as a pure property (requires a model), emit it as a stateful RuleBasedStateMachine instead, not as a @given test. Mark it clearly in a comment.
6. Do not write example-based tests. Do not write docstrings longer than one line per test.
7. At the end, emit a single line: `# Run: pytest -q && pytest --hypothesis-profile=ci`
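For concreteness, here is the shape of output rules 2 through 4 ask for. This is a sketch, not the TCC fixture's real API: `jsondiff_lib`, `diff`, and `apply_diff` are hypothetical names standing in for the library under test.

```python
from datetime import timedelta

from hypothesis import given, settings, strategies as st

from jsondiff_lib import apply_diff, diff  # hypothetical library under test

# Shrink-friendly JSON strategy: st.recursive grows leaves into lists
# and dicts, and max_leaves=8 keeps counterexamples small.
json_values = st.recursive(
    st.none() | st.booleans() | st.integers() | st.text(max_size=10),
    lambda children: st.lists(children, max_size=4)
    | st.dictionaries(st.text(max_size=5), children, max_size=4),
    max_leaves=8,
)

@given(a=json_values, b=json_values)
@settings(max_examples=200, deadline=timedelta(seconds=1))
def test_prop_apply_diff_roundtrip(a, b):
    # Invariant: applying diff(a, b) to a yields b.
    assert apply_diff(a, diff(a, b)) == b
```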
Why it works, in 5 bullets
- Asks for 6 invariants, not “as many as you can find”. Frontier models underproduce invariants when the target is open-ended. A specific count steers them toward the full contract surface, not the obvious three.
- Shrink strategy is explicit. Models default to flat strategies. For recursive data (JSON trees, ASTs), the strategy the model picks without a prompt nudge produces examples that do not shrink. `st.recursive` with `max_leaves=8` gets you clean shrinks and real bug repros, per the Hypothesis recursive-data docs.
- Stateful fallback is explicit. Some invariants are fundamentally stateful (“applying a diff and reversing it returns the original, for any sequence of diffs”). Forcing the model to route those to `RuleBasedStateMachine` instead of pretending they are pure properties is the difference between a real PBT suite and one that looks plausible; see the sketch after this list.
- Bans example-based tests. Without this line, models mix 3 property tests with 2 hand-written examples. The examples are almost always weaker than what Hypothesis would find, and they crowd out the property tests in the PR diff.
- Emits a runnable command at the end. The final comment lets the harness execute the generated file directly without a wrapper. Small thing, removes one CI step.
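Here is the stateful sketch the fallback bullet refers to, again with hypothetical `diff`/`apply_diff`/`reverse_diff` names: the machine applies a random sequence of diffs, keeps a stack of prior states, and checks that reversing each diff restores the previous document.

```python
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule

from jsondiff_lib import apply_diff, diff, reverse_diff  # hypothetical

# Same shrink-friendly recursive strategy as in the earlier sketch.
json_values = st.recursive(
    st.none() | st.integers() | st.text(max_size=10),
    lambda c: st.lists(c, max_size=4)
    | st.dictionaries(st.text(max_size=5), c, max_size=4),
    max_leaves=8,
)

class DiffRoundtripMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.doc = {}      # current document
        self.history = []  # stack of (previous_doc, applied_diff)

    @rule(new_doc=json_values)
    def apply(self, new_doc):
        d = diff(self.doc, new_doc)
        self.history.append((self.doc, d))
        self.doc = apply_diff(self.doc, d)

    @rule()
    def reverse(self):
        if self.history:
            prev, d = self.history.pop()
            self.doc = apply_diff(self.doc, reverse_diff(d))
            assert self.doc == prev  # reversing restores the prior state

TestDiffRoundtrip = DiffRoundtripMachine.TestCase
```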
Failure modes
- Stateful invariant written as a pure property. On Gemini 3.1 Pro, around 30% of TCC fixture runs still emit the “apply + reverse” invariant as a flat `@given` instead of a state machine. Claude Opus 4.7 gets it right 5 of 5 runs.
- Flat strategies for recursive data. When the test file does not import `st.recursive`, the recursion shrinker is missing. Grep the output for `recursive(`; if absent on a JSON-tree target, retry with “use st.recursive for any JSON-like input”.
- Deadline too tight. On laptops with slow disk I/O, the 1-second deadline trips on long inputs. Raise it to 2 seconds, or switch to the CI profile with `--hypothesis-profile=ci`; a minimal profile registration is sketched below.
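The `--hypothesis-profile=ci` flag in the Run line only works if a `ci` profile is registered somewhere pytest imports, typically `conftest.py`. A minimal registration with the 2-second deadline suggested above (the profile name and values are this post's convention, not Hypothesis defaults):

```python
# conftest.py
from datetime import timedelta

from hypothesis import settings

# Looser deadline for slow CI runners; select with
# pytest --hypothesis-profile=ci.
settings.register_profile("ci", max_examples=200, deadline=timedelta(seconds=2))
```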
Tested on (TCC editorial scoring)
- Claude Opus 4.7, adaptive thinking, effort=high: 6 of 6 invariants, correct shrink strategies, state machine for the stateful invariant. 5 of 5 runs.
- GPT-5.3-Codex, reasoning_effort=high: 6 of 6 invariants. 5 of 5 runs.
- Claude Sonnet 4.6: 5 of 6 invariants in the median run; the miss was the recursion shrinker.
- Gemini 3.1 Pro: 4 of 6 invariants in the median run; 3 runs put a stateful invariant into a `@given`.
- Aider with Claude Opus 4.7 in architect mode: 6 of 6 invariants, identical output quality to the bare Opus call.
Methodology is documented on the 14-task scorecard. The pattern (Opus and Codex tied at the top, Gemini behind on stateful tasks) is consistent with the recurring r/MachineLearning thread on PBT generation across frontier models in early 2026.
TypeScript variant
For TypeScript, swap the prompt’s stack section:
Use fast-check, TypeScript strict mode, jest runner. Replace:
- @given -> fc.property
- @settings(max_examples=200, deadline=timedelta(seconds=1)) -> { numRuns: 200, timeout: 1000 } passed to fc.assert
- RuleBasedStateMachine -> fast-check model-based testing (fc.commands run with fc.modelRun)
The same rules apply. The shrink-strategy line maps to fc.letrec for recursive inputs, and the deadline maps to timeout. See the fast-check docs for the API. The TCC fixture has not been ported to TypeScript yet; results may shift.
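A minimal fast-check sketch under those substitutions, with `fc.letrec` as the analogue of `st.recursive`; `applyDiff` and `diff` are hypothetical placeholders for the library under test:

```typescript
import fc from "fast-check";
import { applyDiff, diff } from "./jsondiff"; // hypothetical import

// JSON-like recursive arbitrary: fc.letrec ties the recursive knot,
// and maxDepth bounds growth, analogous to st.recursive + max_leaves.
const { json } = fc.letrec((tie) => ({
  json: fc.oneof(
    { maxDepth: 3 },
    fc.boolean(),
    fc.integer(),
    fc.string(),
    fc.array(tie("json"), { maxLength: 4 }),
    fc.dictionary(fc.string(), tie("json")),
  ),
}));

test("prop: applyDiff(a, diff(a, b)) equals b", () => {
  fc.assert(
    fc.property(json, json, (a, b) => {
      expect(applyDiff(a, diff(a, b))).toEqual(b);
    }),
    { numRuns: 200, timeout: 1000 },
  );
});
```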
Related
The deterministic eval harness that uses property-based tests as graders is covered in the evals-without-judges post. The strict-JSON prompt for the output schema of your test-gen harness is in the strict-JSON prompt post. The test-gen scores for each model are in the GPT-5.3-Codex review and the Claude Opus 4.7 review.
One-line takeaway
Ask for exactly 6 invariants, force st.recursive on nested inputs, route stateful invariants to a state machine, and the first run produces a PBT suite you can ship.