The recurring r/Python and r/Hypothesis “model wrote example tests when I asked for property-based” thread has a structural answer: the prompt has to demand a count, ban example-based tests, and force a recursive, shrink-friendly strategy on nested data. The 7-rule prompt below moves Claude Opus 4.7 to 6 of 6 invariants on the TCC editorial property-test fixture (a JSON-diff library with 6 documented invariants), with shrink strategies and a state machine for the stateful invariant. GPT-5.3-Codex matches that score. The bare “write me property-based tests” prompt produces 4 of 6 invariants and skips the recursive strategy on most runs across both models.
What property-based testing is (and what it is not)
A property-based test does not specify a concrete input and a known output. It specifies an invariant: a rule that must hold for all inputs in a given domain. The test framework (Hypothesis, fast-check) then generates hundreds of random inputs, tries to falsify the invariant, and on failure shrinks the input to the smallest case that still breaks it. That shrinking step is what makes property-based tests useful: the failure is actionable, not a 200-field object you have to trace manually.
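The generate, falsify, shrink loop can be illustrated without any framework. The sketch below is a toy stand-in, not how Hypothesis or fast-check work internally: buggy_sort, find_counterexample and shrink are hypothetical names invented for this illustration, and the shrinker only tries deleting elements.

```python
import random


def buggy_sort(xs):
    """Function under test, with a deliberate bug: duplicates are dropped."""
    return sorted(set(xs))


def sorts_correctly(xs):
    """The invariant: output must equal the built-in sorted() for every list."""
    return buggy_sort(xs) == sorted(xs)


def find_counterexample(prop, rng, trials=200):
    """Generate random integer lists until one falsifies the invariant."""
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        if not prop(xs):
            return xs
    return None


def shrink(prop, xs):
    """Greedily delete elements for as long as the list still fails."""
    shrunk = True
    while shrunk:
        shrunk = False
        for i in range(len(xs)):
            candidate = xs[:i] + xs[i + 1:]
            if not prop(candidate):
                xs = candidate
                shrunk = True
                break
    return xs
```

For example, shrink(sorts_correctly, [5, 1, 5, 9, 1]) returns [1, 1]: the two-element list points directly at the duplicate-handling bug, which is the "actionable failure" the paragraph above describes.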
Models default to example tests because the training data is dominated by example tests. “Given input X, assert output Y” is the most common test shape in every public code corpus. Asking for “property-based tests” without constraint produces a mix: 2 genuine invariants and 3 parameterized examples dressed up with @given decorators.
The prompt
Write property-based tests for the library below, in Python, using Hypothesis.
Inputs:
- Source file: <paste source here or reference by path>
- Function under test: <name>
- Public contract (plain English): <2-4 lines>
Produce exactly this output:
1. A list of 6 invariants, numbered. Each invariant is one sentence.
2. For each invariant, a @given-decorated test function. Name the test `test_prop_<short_name>`.
3. Include shrink-friendly strategies. For recursive or nested inputs, use `st.recursive` with
a size cutoff (max_leaves=8) and compose with shared structures where the invariant requires
shared identity.
4. Use `@settings(max_examples=200, deadline=timedelta(seconds=1))`. Adjust only if justified
in a comment.
5. If any invariant is not provable as a pure property (requires a model), emit it as a stateful
RuleBasedStateMachine instead, not as a @given test. Mark it clearly in a comment.
6. Do not write example-based tests. Do not write docstrings longer than one line per test.
7. At the end, emit a single line: `# Run: pytest -q && pytest --hypothesis-profile=ci`
Example: prompt applied to a JSON-diff library
Function under test: json_diff(a: dict, b: dict) -> list[Op]. Contract: “Returns a minimal list of add/remove/replace operations that, when applied to a, produces b. Applying the inverse ops to b must recover a.”
Invariants the model produces:
1. Diffing a dict against itself returns an empty operation list.
2. Applying the diff of (a, b) to a produces b.
3. Applying the inverse diff of (a, b) to b produces a (roundtrip).
4. The operation count never exceeds the number of distinct key paths across a and b.
5. Operations are idempotent: applying a diff twice has no additional effect if b is the target.
6. [STATEFUL] A sequence of diffs applied in order is equivalent to a single diff from the
first state to the last state.
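The fixture library itself is not shown in this post. To make the invariants concrete, here is a minimal sketch of what json_diff, apply_ops and invert_ops could look like for flat dicts only. The Op shape (kind, key, old, new) is an assumption made for this illustration, and the real library may represent operations differently and handle nesting.

```python
from typing import Any, NamedTuple


class Op(NamedTuple):
    """Hypothetical operation record; not the fixture's actual type."""
    kind: str  # "add", "remove" or "replace"
    key: str
    old: Any = None
    new: Any = None


def json_diff(a: dict, b: dict) -> list[Op]:
    """Flat-dict diff: one op per key that differs between a and b."""
    ops = [Op("remove", k, old=a[k]) for k in a.keys() - b.keys()]
    ops += [Op("add", k, new=b[k]) for k in b.keys() - a.keys()]
    ops += [Op("replace", k, old=a[k], new=b[k])
            for k in a.keys() & b.keys() if a[k] != b[k]]
    return ops


def apply_ops(doc: dict, ops: list[Op]) -> dict:
    """Return a new dict with the ops applied; the input is not mutated."""
    out = dict(doc)
    for op in ops:
        if op.kind == "remove":
            out.pop(op.key, None)
        else:  # "add" and "replace" both set the new value
            out[op.key] = op.new
    return out


def invert_ops(ops: list[Op]) -> list[Op]:
    """Swap add/remove and old/new so the ops undo themselves."""
    flip = {"add": "remove", "remove": "add", "replace": "replace"}
    return [Op(flip[op.kind], op.key, old=op.new, new=op.old)
            for op in reversed(ops)]
```

With a = {"x": 1, "y": 2} and b = {"y": 3, "z": 4}, invariants 1, 2, 3 and 5 can be checked by hand: json_diff(a, a) is empty, apply_ops(a, json_diff(a, b)) equals b, applying the inverted ops to b gives back a, and applying the diff a second time changes nothing.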
Generated test for invariant 3 (roundtrip):
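from datetime import timedelta

from hypothesis import given, settings
from hypothesis import strategies as st

# json_diff, invert_ops and apply_ops are assumed to be imported from the library under test.

# Recursive JSON-like values; max_leaves keeps generated trees small.
json_value = st.recursive(
    st.one_of(st.none(), st.booleans(), st.integers(), st.text()),
    lambda children: st.dictionaries(st.text(min_size=1), children, max_size=5),
    max_leaves=8,
)

# json_diff takes dicts, so the top level is always a dictionary.
json_dict = st.dictionaries(st.text(min_size=1), json_value, max_size=5)

@given(a=json_dict, b=json_dict)
@settings(max_examples=200, deadline=timedelta(seconds=1))
def test_prop_roundtrip(a, b):
    """Applying the inverse diff of (a,b) to b recovers a."""
    diff = json_diff(a, b)
    inverse = invert_ops(diff)
    assert apply_ops(b, inverse) == a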
Generated test for invariant 6 (stateful):
from hypothesis.stateful import RuleBasedStateMachine, initialize, invariant, rule

# Reuses st and json_value from the roundtrip snippet.
# Every state is a dict, matching json_diff(a: dict, b: dict).
dict_state = st.dictionaries(st.text(min_size=1), json_value, max_size=5)

class DiffSequenceMachine(RuleBasedStateMachine):
    # STATEFUL: sequence of diffs must be composable into a single diff

    @initialize(initial=dict_state)
    def setup(self, initial):
        self.state = initial
        self.history = [initial]

    @rule(next_state=dict_state)
    def apply_diff(self, next_state):
        self.state = apply_ops(self.state, json_diff(self.state, next_state))
        self.history.append(next_state)

    @invariant()
    def sequence_equals_direct_diff(self):
        direct = apply_ops(self.history[0], json_diff(self.history[0], self.history[-1]))
        assert direct == self.state

TestDiffSequence = DiffSequenceMachine.TestCase
Why it works, in 5 bullets
- Asks for 6 invariants, not “as many as you can find”. Frontier models underproduce invariants when the target is open-ended. A specific count steers them toward the full contract surface, not the obvious three.
- Shrink strategy is explicit. Models default to flat strategies. For recursive data (JSON trees, ASTs), the strategy the model picks without a prompt nudge produces examples that do not shrink. st.recursive with max_leaves=8 gets you clean shrinks and real bug repros, per the Hypothesis recursive-data docs.
- Stateful fallback is explicit. Some invariants are fundamentally stateful (“applying a diff and reversing it returns the original, for any sequence of diffs”). Forcing the model to route those to RuleBasedStateMachine instead of pretending they are pure properties is the difference between a real PBT suite and one that looks plausible.
- Bans example-based tests. Without this line, models mix 3 property tests with 2 hand-written examples. The examples are almost always weaker than what Hypothesis would find, and they crowd out the property tests in the PR diff.
- Emits a runnable command at the end. The final comment lets the harness execute the generated file directly without a wrapper. Small thing, removes one CI step.
Failure modes
- Stateful invariant written as a pure property. On Gemini 3.1 Pro, around 30% of TCC fixture runs still emit the “apply + reverse” invariant as a flat @given instead of a state machine. Claude Opus 4.7 gets it right 5 of 5 runs. Detection: grep the output for RuleBasedStateMachine; if absent when the contract has a stateful invariant, retry with “invariant N is stateful; emit it as a RuleBasedStateMachine”.
- Flat strategies for recursive data. When the test file does not call st.recursive, the recursion shrinker is missing. Grep the output for recursive(; if absent on a JSON-tree target, retry with “use st.recursive for any JSON-like input”.
- Deadline too tight. On CI with slow disk I/O, the 1-second deadline trips on long inputs. Raise it to 2s, or move the deadline out of the per-test @settings and into a ci profile selected with --hypothesis-profile=ci. A deadline passed to @settings on the test takes precedence over the active profile, so a hard-coded 1-second value is not relaxed by switching profiles alone.
- Model hallucinates Hypothesis plugins. Use temperature=0.2 for test generation. Higher temperatures produce creative but non-existent pytest plugins and import paths. Validate the import section against real library signatures before running.
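The two grep checks above are easy to automate in a harness. The following is a minimal sketch under stated assumptions: missing_constructs is a hypothetical helper name, the checks are plain substring matches exactly as described in the bullets, and the returned strings are the retry instructions quoted above.

```python
def missing_constructs(generated_source: str, *, stateful: bool,
                       recursive_input: bool) -> list[str]:
    """Return the retry instructions needed for a generated test file.

    stateful: the contract contains at least one stateful invariant.
    recursive_input: the target takes JSON-like or otherwise nested data.
    """
    retries = []
    # Equivalent of: grep RuleBasedStateMachine generated_test.py
    if stateful and "RuleBasedStateMachine" not in generated_source:
        retries.append("invariant N is stateful; emit it as a RuleBasedStateMachine")
    # Equivalent of: grep 'recursive(' generated_test.py
    if recursive_input and "recursive(" not in generated_source:
        retries.append("use st.recursive for any JSON-like input")
    return retries
```

An empty list means both constructs are present; otherwise each string can be appended to the follow-up prompt. A substring match only proves the construct appears somewhere in the file, so it is a cheap first filter rather than a correctness check.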
Tested on (TCC editorial scoring)
- Claude Opus 4.7, adaptive thinking, effort=high: 6 of 6 invariants, correct shrink strategies, state machine for the stateful invariant. 5 of 5 runs.
- GPT-5.3-Codex, reasoning_effort=high: 6 of 6 invariants. 5 of 5 runs.
- Claude Sonnet 4.6: 5 of 6 invariants in the median run; the miss was recursion-shrink strategy on the JSON tree.
- Gemini 3.1 Pro: 4 of 6 invariants on median; 3 runs put the stateful invariant into a @given test.
- Aider with Claude Opus 4.7 in architect mode: 6 of 6 invariants, identical output quality to the bare Opus call.
Methodology is on the 14-task scorecard. The pattern (Opus and Codex tied on top, Gemini behind on stateful tasks) is consistent with the recurring r/MachineLearning thread on PBT generation across frontier models in early 2026.
TypeScript variant (fast-check)
For TypeScript, replace the stack section of the prompt:
Use fast-check, TypeScript strict mode, jest runner. Replace:
- @given -> fc.property
- @settings(max_examples=200, deadline=timedelta(seconds=1))
-> { numRuns: 200, timeout: 1000 }
- RuleBasedStateMachine -> fc.commands with fc.modelRun (model-based testing)
- st.recursive -> fc.letrec for recursive types
The same rules apply. The fc.letrec API is the fast-check equivalent of st.recursive; see the fast-check letrec docs. For stateful invariants, fc.commands together with fc.modelRun maps to the RuleBasedStateMachine pattern. The TCC fixture has not been ported to TypeScript; results may shift on the stateful invariant detection.
Frequently asked questions
How do I know if the generated invariants are actually testing my contract?
Run the tests against a deliberately broken version of the function (comment out a condition, change a comparison operator). If Hypothesis does not find a failure within 200 examples, the invariant is too weak. A good PBT suite fails fast on a broken implementation.
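A minimal sketch of that check, assuming a flat-dict diff with ops as (kind, key, value) tuples. The names mutant_diff, apply_ops and kills_mutant are hypothetical stand-ins for this illustration, not the fixture library's API. The mutant has lost the code that emits remove operations.

```python
from hypothesis import given, settings
from hypothesis import strategies as st


def apply_ops(doc, ops):
    """Apply (kind, key, value) ops to a copy of doc."""
    out = dict(doc)
    for kind, key, value in ops:
        if kind == "remove":
            out.pop(key, None)
        else:  # "add" or "replace"
            out[key] = value
    return out


def mutant_diff(a, b):
    """Deliberately broken diff: never emits a remove op."""
    return [("add" if k not in a else "replace", k, v)
            for k, v in b.items() if k not in a or a[k] != v]


flat_dicts = st.dictionaries(st.text(min_size=1, max_size=3), st.integers(), max_size=4)
# derandomize and database=None keep the check reproducible between runs.
run = settings(max_examples=200, deadline=None, database=None, derandomize=True)


@run
@given(a=flat_dicts)
def prop_self_diff_is_empty(a):
    assert mutant_diff(a, a) == []


@run
@given(a=flat_dicts, b=flat_dicts)
def prop_apply_reaches_target(a, b):
    assert apply_ops(a, mutant_diff(a, b)) == b


def kills_mutant(prop) -> bool:
    """True if Hypothesis falsifies prop against the broken implementation."""
    try:
        prop()
    except AssertionError:
        return True
    return False
```

Here kills_mutant(prop_apply_reaches_target) is True, while kills_mutant(prop_self_diff_is_empty) is False: the self-diff invariant cannot see this particular bug, so on its own it is too weak to guard the contract.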
What if my function has side effects (database writes, API calls)?
Property-based testing works best on pure functions. For side-effectful functions, mock the external call at the boundary and write properties on the transformation logic. The stateful RuleBasedStateMachine pattern handles sequences of side effects when you can reset the external state between runs, since each generated sequence of rules starts from a fresh machine instance.
Should I use this in CI?
Yes, with the CI profile. Register a ci profile (for example in conftest.py) that sets a higher deadline and suppresses HealthCheck.too_slow, then select it with --hypothesis-profile=ci. The default profile is for local development; CI machines have different I/O profiles.
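One way to set this up is to register the profile in conftest.py. This is a sketch under stated assumptions: the 2-second deadline is the value suggested under Failure modes, and the max_examples value mirrors the prompt; adjust both to your CI hardware.

```python
# conftest.py
from datetime import timedelta

from hypothesis import HealthCheck, settings

# Registered under the name "ci"; selected with: pytest --hypothesis-profile=ci
settings.register_profile(
    "ci",
    max_examples=200,
    deadline=timedelta(seconds=2),  # more permissive than the 1-second local value
    suppress_health_check=[HealthCheck.too_slow],
)
```

Values passed directly to @settings on a test take precedence over the active profile, so a test decorated with deadline=timedelta(seconds=1) keeps its 1-second deadline even under this profile. If the profile should control the deadline, leave it out of the per-test decorator.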
How many invariants is the right number?
6 is a target, not a ceiling. For small utility functions, 3-4 good invariants beat 6 weak ones. For libraries with a formal contract (codecs, serializers, hash functions), 8-10 is reasonable. The prompt uses 6 because it is the threshold where models stop defaulting to “the obvious three”.
Related
The deterministic eval harness that uses property-based tests as graders is on the evals-without-judges post. The strict-JSON prompt for the output schema of your test-gen harness is on the strict-JSON prompt. The test-gen scores for each model are on the GPT-5.3-Codex review and Claude Opus 4.7 review.
One-line takeaway
Ask for exactly 6 invariants, force st.recursive on nested inputs, route stateful invariants to a state machine, ban example tests, and the first run produces a PBT suite you can ship.