Property-based test generation prompt: 6 invariants on the first run

The recurring r/Python and r/Hypothesis “model wrote example tests when I asked for property-based” thread has a structural answer: the prompt has to demand a count, ban example-based tests, and force a recursive shrink strategy on nested data. The 7-rule prompt below moves Claude Opus 4.7 to 6 of 6 invariants on the TCC editorial property-test fixture (a JSON-diff library with 6 documented invariants), with shrink strategies and a state machine for the stateful invariant. GPT-5.3-Codex hits the same. The bare “write me property-based tests” prompt produces 4 of 6 invariants and skips the recursion shrinker on most runs across both models.

The prompt

Write property-based tests for the library below, in Python, using Hypothesis.

Inputs:
- Source file: <paste source here or reference by path>
- Function under test: <name>
- Public contract (plain English): <2-4 lines>

Produce exactly this output:

1. A list of 6 invariants, numbered. Each invariant is one sentence.
2. For each invariant, a @given-decorated test function. Name the test `test_prop_<short_name>`.
3. Include shrink-friendly strategies. For recursive or nested inputs, use `st.recursive` with a size cutoff (max_leaves=8) and compose with shared structures where the invariant requires shared identity.
4. Use `@settings(max_examples=200, deadline=timedelta(seconds=1))`. Adjust only if justified in a comment.
5. If any invariant is not provable as a pure property (requires a model), emit it as a stateful RuleBasedStateMachine instead, not as a @given test. Mark it clearly in a comment.
6. Do not write example-based tests. Do not write docstrings longer than one line per test.
7. At the end, emit a single line: `# Run: pytest -q && pytest --hypothesis-profile=ci`

Why it works, in 5 bullets

Asks for 6 invariants, not “as many as you can find”. Frontier models underproduce invariants when the target is open-ended. A specific count steers them toward the full contract surface, not the obvious three.
Shrink strategy is explicit. Models default to flat strategies. For recursive data (JSON trees, ASTs), the strategy the model picks without a prompt nudge produces examples that do not shrink. st.recursive with max_leaves=8 gets you clean shrinks and real bug repros, per the Hypothesis recursive-data docs.
Stateful fallback is explicit. Some invariants are fundamentally stateful (“applying a diff and reversing it returns the original, for any sequence of diffs”). Forcing the model to route those to RuleBasedStateMachine instead of pretending they are pure properties is the difference between a real PBT suite and one that looks plausible.
Bans example-based tests. Without this line, models mix 3 property tests with 2 hand-written examples. The examples are almost always weaker than what Hypothesis would find, and they crowd out the property tests in the PR diff.
Emits a runnable command at the end. The final comment lets the harness execute the generated file directly without a wrapper. Small thing, removes one CI step.
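
As a rough illustration of the shape rules 3 to 5 push the model toward, here is a minimal sketch. The `diff` and `apply_diff` functions are a hypothetical toy stand-in written for this example, not the TCC fixture or any real JSON-diff library, and the test and class names are likewise invented. Only the Hypothesis calls (`st.recursive`, `@given`, `@settings`, `RuleBasedStateMachine`, `@rule`) are the real API.

```python
from datetime import timedelta

from hypothesis import given, settings, strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule


# Hypothetical toy library under test (assumption: not the TCC fixture).
def diff(old, new):
    """Return a patch that turns `old` into `new`."""
    if old == new:
        return ("same",)
    if isinstance(old, dict) and isinstance(new, dict):
        changed = {k: diff(old[k], new[k]) for k in old if k in new}
        added = {k: new[k] for k in new if k not in old}
        return ("dict", changed, added)  # keys only in `old` are dropped
    return ("replace", new)


def apply_diff(old, patch):
    """Apply a patch produced by `diff` to `old`."""
    if patch[0] == "same":
        return old
    if patch[0] == "replace":
        return patch[1]
    _, changed, added = patch
    return {**{k: apply_diff(old[k], p) for k, p in changed.items()}, **added}


# Rule 3: recursive strategy with a leaf cap, so nested inputs stay small.
json_leaves = (
    st.none() | st.booleans() | st.integers()
    | st.floats(allow_nan=False, allow_infinity=False) | st.text()
)
json_values = st.recursive(
    json_leaves,
    lambda children: st.lists(children) | st.dictionaries(st.text(), children),
    max_leaves=8,
)


# Rules 2 and 4: a pure property as a @given test with the requested settings.
@settings(max_examples=200, deadline=timedelta(seconds=1))
@given(old=json_values, new=json_values)
def test_prop_apply_diff_reaches_target(old, new):
    """Applying diff(old, new) to old yields new."""
    assert apply_diff(old, diff(old, new)) == new


# Rule 5: stateful invariant, expressed as a state machine, not a @given test.
class DiffHistoryMachine(RuleBasedStateMachine):
    """Undoing every applied diff restores the original document."""

    def __init__(self):
        super().__init__()
        self.original = {}
        self.current = {}
        self.undo_stack = []

    @rule(target_doc=json_values)
    def apply_forward(self, target_doc):
        # Record the reverse patch before moving the document forward.
        self.undo_stack.append(diff(target_doc, self.current))
        self.current = apply_diff(self.current, diff(self.current, target_doc))

    @rule()
    def undo_all(self):
        while self.undo_stack:
            self.current = apply_diff(self.current, self.undo_stack.pop())
        assert self.current == self.original


# pytest collects the state machine through its generated TestCase.
TestDiffHistory = DiffHistoryMachine.TestCase
```

The point of the split is visible in the two tests: the first invariant needs only two generated documents, while the second needs an arbitrary sequence of operations, which is what the state machine generates and shrinks.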

Failure modes

Stateful invariant written as a pure property. On Gemini 3.1 Pro, around 30% of TCC fixture runs still emit the “apply + reverse” invariant as a flat @given instead of a state machine. Claude Opus 4.7 gets it right 5 of 5 runs.
Flat strategies for recursive data. When the test file never calls st.recursive, the strategies are flat and nested inputs are not exercised or shrunk properly. Grep the output for recursive(; if it is absent on a JSON-tree target, retry with “use st.recursive for any JSON-like input”.
Deadline too tight. On laptops with slow disk I/O, the 1-second deadline trips on long inputs. Raise it to 2s in the @settings decorator for CI. Selecting a profile with --hypothesis-profile=ci does not help on its own here, because a deadline pinned in @settings takes precedence over the profile's value.

Tested on (TCC editorial scoring)

Claude Opus 4.7, adaptive thinking, effort=high: 6 of 6 invariants, correct shrink strategies, state machine for the stateful invariant. 5 of 5 runs.
GPT-5.3-Codex, reasoning_effort=high: 6 of 6 invariants. 5 of 5 runs.
Claude Sonnet 4.6: 5 of 6 invariants in the median run; the miss was recursion-shrink.
Gemini 3.1 Pro: 4 of 6 invariants on median; 3 runs put a stateful invariant into a @given.
Aider with Claude Opus 4.7 in architect mode: 6 of 6 invariants, identical output quality to the bare Opus call.

Methodology is on the 14-task scorecard. The pattern (Opus and Codex tied at the top, Gemini behind on stateful tasks) is consistent with the recurring r/MachineLearning thread on PBT generation across frontier models in early 2026.

TypeScript variant

For TypeScript, swap the prompt’s stack section:

Use fast-check, TypeScript strict mode, jest runner. Replace:
- @given -> fc.property
- @settings(max_examples=200, deadline=timedelta(seconds=1)) -> { numRuns: 200, timeout: 1000 }
- RuleBasedStateMachine -> fc.commands with fc.modelRun (fast-check's model-based testing API)

The same rules apply. The shrink-strategy line maps to fc.letrec for recursive inputs, and the deadline maps to timeout, which fast-check enforces only for asynchronous properties (fc.asyncProperty). See the fast-check docs for the API. The TCC fixture has not been ported to TypeScript yet; results may shift.

The deterministic eval harness that uses property-based tests as graders is covered in the evals-without-judges post. The prompt for the output schema of your test-gen harness is covered in the strict-JSON prompt post. The test-gen scores for each model are in the GPT-5.3-Codex review and the Claude Opus 4.7 review.

One-line takeaway

Ask for exactly 6 invariants, force st.recursive on nested inputs, route stateful invariants to a state machine, and the first run produces a PBT suite you can ship.