§ PROMPT · APR 23, 2026 GPT-5 · PROPERTY · TESTING v1.0

Property-based test generation prompt: 6 invariants on first run

The prompt that writes 6 Hypothesis invariants for a JSON-diff library on the first run, with shrink strategies. Tested on GPT-5.3-Codex, Claude Opus 4.7, and Aider.
Adrian Marcus. Working engineer. Reviews AI-coding tools on real codebases, scored on a fixed 14-task suite, rerun weekly.

The recurring r/Python and r/Hypothesis “model wrote example tests when I asked for property-based” thread has a structural answer: the prompt has to demand a count, ban example-based tests, and force a bounded recursive strategy on nested data so that failures shrink to something readable. The 7-rule prompt below moves Claude Opus 4.7 to 6 of 6 invariants on the TCC editorial property-test fixture (a JSON-diff library with 6 documented invariants), with shrink-friendly strategies and a state machine for the stateful invariant. GPT-5.3-Codex hits the same. The bare “write me property-based tests” prompt produces 4 of 6 invariants and skips the recursive strategy on most runs across both models.

What property-based testing is (and what it is not)

A property-based test does not specify a concrete input and a known output. It specifies an invariant: a rule that must hold for all inputs in a given domain. The test framework (Hypothesis, fast-check) then generates hundreds of random inputs, tries to falsify the invariant, and on failure shrinks the input to the smallest case that still breaks it. That shrinking step is what makes property-based tests useful: the failure is actionable, not a 200-field object you have to trace manually.
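As a minimal sketch of that loop, the test below states one invariant (decoding a run-length encoding returns the original string) and lets Hypothesis generate the inputs. The encoder and decoder are hypothetical stand-ins written for this illustration; they are not part of the JSON-diff fixture.

```python
from hypothesis import given, settings
from hypothesis import strategies as st


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Collapse runs of equal characters into (char, count) pairs."""
    out: list[tuple[str, int]] = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out


def run_length_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand (char, count) pairs back into a string."""
    return "".join(ch * n for ch, n in pairs)


@given(st.text())
@settings(max_examples=200, deadline=None, database=None)
def test_prop_decode_inverts_encode(s):
    """Invariant: decode(encode(s)) == s for every string s."""
    assert run_length_decode(run_length_encode(s)) == s


# Calling the decorated function directly runs the generated examples.
test_prop_decode_inverts_encode()
```

No concrete input appears anywhere in the test body. If the encoder had a bug, Hypothesis would report a shrunk counterexample rather than the first random string that happened to fail.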

Models default to example tests because the training data is dominated by example tests. “Given input X, assert output Y” is the most common test shape in every public code corpus. Asking for “property-based tests” without constraint produces a mix: 2 genuine invariants and 3 parameterized examples dressed up with @given decorators.
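The difference is easy to miss in review, so here is a hedged side-by-side sketch. Both functions carry @given, but only the second states a rule over a whole input domain. The built-in sorted stands in for the function under test, and the test names are illustrative.

```python
from collections import Counter

from hypothesis import given
from hypothesis import strategies as st

# A fixed table of (input, expected output) pairs: this is an example test,
# even though it is routed through a Hypothesis strategy.
CASES = [([3, 1, 2], [1, 2, 3]), ([], []), ([5], [5])]


@given(st.sampled_from(CASES))
def test_examples_dressed_up(case):
    """Parameterized examples: the expected output is hard-coded per case."""
    xs, expected = case
    assert sorted(xs) == expected


@given(st.lists(st.integers()))
def test_prop_sorted_is_ordered_permutation(xs):
    """Invariant: the output is ordered and is a permutation of the input."""
    out = sorted(xs)
    assert all(x <= y for x, y in zip(out, out[1:]))
    assert Counter(out) == Counter(xs)


test_examples_dressed_up()
test_prop_sorted_is_ordered_permutation()
```

A quick review heuristic: if the test body compares against a literal expected value, or the strategy is a sampled_from over hand-written cases, it is an example test regardless of the decorator.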

The prompt

Write property-based tests for the library below, in Python, using Hypothesis.

Inputs:
- Source file: <paste source here or reference by path>
- Function under test: <name>
- Public contract (plain English): <2-4 lines>

Produce exactly this output:

1. A list of 6 invariants, numbered. Each invariant is one sentence.
2. For each invariant, a @given-decorated test function. Name the test `test_prop_<short_name>`.
3. Include shrink-friendly strategies. For recursive or nested inputs, use `st.recursive` with
   a size cutoff (max_leaves=8) and compose with shared structures where the invariant requires
   shared identity.
4. Use `@settings(max_examples=200, deadline=timedelta(seconds=1))`. Adjust only if justified
   in a comment.
5. If any invariant is not provable as a pure property (requires a model), emit it as a stateful
   RuleBasedStateMachine instead, not as a @given test. Mark it clearly in a comment.
6. Do not write example-based tests. Do not write docstrings longer than one line per test.
7. At the end, emit a single line: `# Run: pytest -q && pytest --hypothesis-profile=ci`

Example: prompt applied to a JSON-diff library

Function under test: json_diff(a: dict, b: dict) -> list[Op]. Contract: “Returns a minimal list of add/remove/replace operations that, when applied to a, produces b. Applying the inverse ops to b must recover a.”

Invariants the model produces:

1. Diffing a dict against itself returns an empty operation list.
2. Applying the diff of (a, b) to a produces b.
3. Applying the inverse diff of (a, b) to b produces a (roundtrip).
4. The operation count never exceeds the number of distinct top-level keys across both dicts.
5. Operations are idempotent: applying a diff twice has no additional effect if b is the target.
6. [STATEFUL] A sequence of diffs applied in order is equivalent to a single diff from the
   first state to the last state.

Generated test for invariant 3 (roundtrip):

from datetime import timedelta

from hypothesis import given, settings
from hypothesis import strategies as st

# Assumed module path; point this at wherever the library under test lives.
from json_diff_lib import apply_ops, invert_ops, json_diff

# Recursive JSON values: scalar leaves, dicts as the only container.
json_value = st.recursive(
    st.one_of(st.none(), st.booleans(), st.integers(), st.text()),
    lambda children: st.dictionaries(st.text(min_size=1), children, max_size=5),
    max_leaves=8,
)

# json_diff takes dicts, so the top level is always a dict, never a bare scalar.
json_dict = st.dictionaries(st.text(min_size=1), json_value, max_size=5)

@given(a=json_dict, b=json_dict)
@settings(max_examples=200, deadline=timedelta(seconds=1))
def test_prop_roundtrip(a, b):
    """Applying the inverse diff of (a,b) to b recovers a."""
    diff = json_diff(a, b)
    inverse = invert_ops(diff)
    assert apply_ops(b, inverse) == a

Generated test for invariant 6 (stateful):

from hypothesis.stateful import RuleBasedStateMachine, initialize, invariant, rule

class DiffSequenceMachine(RuleBasedStateMachine):
    # STATEFUL: a sequence of diffs must compose into a single diff
    @initialize(initial=json_dict)
    def setup(self, initial):
        self.state = initial
        self.history = [initial]

    @rule(next_state=json_dict)
    def apply_diff(self, next_state):
        # Step the document forward by diffing against the new target.
        self.state = apply_ops(self.state, json_diff(self.state, next_state))
        self.history.append(next_state)

    @invariant()
    def sequence_equals_direct_diff(self):
        # One diff from the first state to the last must land on the same document.
        first, last = self.history[0], self.history[-1]
        direct = apply_ops(first, json_diff(first, last))
        assert direct == self.state

TestDiffSequence = DiffSequenceMachine.TestCase
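
To run the two generated tests locally without the real library, a toy stand-in is enough. The sketch below is an assumption made for illustration, not the TCC fixture: it diffs top-level keys only, treats nested values as opaque, and its Op shape (kind, key, old, new) is hypothetical. Each op keeps the old value so that the op list can be inverted.

```python
from typing import Any, NamedTuple


class Op(NamedTuple):
    kind: str        # "add", "remove", or "replace"
    key: str
    old: Any = None  # value before the op, kept so the op can be inverted
    new: Any = None  # value after the op


def json_diff(a: dict, b: dict) -> list[Op]:
    """Top-level diff: one op per key that differs between a and b."""
    ops = []
    for key in a:
        if key not in b:
            ops.append(Op("remove", key, old=a[key]))
        elif a[key] != b[key]:
            ops.append(Op("replace", key, old=a[key], new=b[key]))
    for key in b:
        if key not in a:
            ops.append(Op("add", key, new=b[key]))
    return ops


def apply_ops(doc: dict, ops: list[Op]) -> dict:
    """Return a new dict with ops applied; the input dict is not mutated."""
    out = dict(doc)
    for op in ops:
        if op.kind == "remove":
            del out[op.key]
        else:  # "add" and "replace" both set the new value
            out[op.key] = op.new
    return out


def invert_ops(ops: list[Op]) -> list[Op]:
    """Reverse the op order and swap add/remove and old/new."""
    flipped = {"add": "remove", "remove": "add", "replace": "replace"}
    return [Op(flipped[op.kind], op.key, old=op.new, new=op.old) for op in reversed(ops)]


a = {"name": "x", "tags": {"k": 1}, "gone": None}
b = {"name": "y", "tags": {"k": 1}, "new": True}
ops = json_diff(a, b)
assert json_diff(a, a) == []               # invariant 1
assert apply_ops(a, ops) == b              # invariant 2
assert apply_ops(b, invert_ops(ops)) == a  # invariant 3
```

Saved as a module on the import path, this gives the generated tests something to import. It is deliberately naive, which also makes it a convenient base for hand-made broken variants.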

Why it works, in 5 bullets

Failure modes

Tested on (TCC editorial scoring)

Methodology on the 14-task scorecard. The pattern (Opus and Codex tied on top, Gemini behind on stateful tasks) is consistent with the recurring r/MachineLearning thread on PBT generation across frontier models in early 2026.

TypeScript variant (fast-check)

For TypeScript, replace the stack section of the prompt:

Use fast-check, TypeScript strict mode, jest runner. Replace:
- @given           -> fc.assert(fc.property(...))
- @settings(max_examples=200, deadline=timedelta(seconds=1))
                   -> { numRuns: 200, timeout: 1000 } (timeout only takes effect with fc.asyncProperty)
- RuleBasedStateMachine -> fc.commands + fc.modelRun (model-based testing)
- st.recursive     -> fc.letrec for recursive types

The same rules apply. The fc.letrec API is the fast-check equivalent of st.recursive; see the fast-check letrec docs. For stateful invariants, fc.modelRun driven by fc.commands maps to the RuleBasedStateMachine pattern. The TCC fixture has not been ported to TypeScript; results may shift on the stateful invariant detection.

Frequently asked questions

How do I know if the generated invariants are actually testing my contract?
Run the tests against a deliberately broken version of the function (comment out a condition, change a comparison operator). If Hypothesis does not find a failure within 200 examples, the invariant is too weak. A good PBT suite fails fast on a broken implementation.
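
A hedged sketch of that check, using a small flat-dict diff written for this illustration (flat_diff, flat_diff_mutant and apply_patch are hypothetical, not the fixture's API). The same property is built twice, once per implementation, and the mutant must be falsified.

```python
from hypothesis import given, settings
from hypothesis import strategies as st


def flat_diff(a: dict, b: dict):
    """Correct version: keys to set plus keys to delete."""
    to_set = {k: v for k, v in b.items() if k not in a or a[k] != v}
    to_delete = [k for k in a if k not in b]
    return to_set, to_delete


def flat_diff_mutant(a: dict, b: dict):
    """Deliberately broken version: forgets every deletion."""
    to_set, _ = flat_diff(a, b)
    return to_set, []


def apply_patch(a: dict, patch) -> dict:
    to_set, to_delete = patch
    out = {k: v for k, v in a.items() if k not in to_delete}
    out.update(to_set)
    return out


flat_dict = st.dictionaries(st.text(max_size=3), st.integers(), max_size=4)


def diff_then_apply_property(diff_fn):
    """Build the 'apply(diff(a, b)) to a gives b' property for one implementation."""
    @given(a=flat_dict, b=flat_dict)
    @settings(max_examples=200, deadline=None, database=None)
    def prop(a, b):
        assert apply_patch(a, diff_fn(a, b)) == b
    return prop


def is_falsified(prop) -> bool:
    """True if Hypothesis finds a counterexample within the example budget."""
    try:
        prop()
    except AssertionError:
        return True
    return False


assert not is_falsified(diff_then_apply_property(flat_diff))
assert is_falsified(diff_then_apply_property(flat_diff_mutant))
```

If the second assertion ever failed, the property would be too weak to notice a diff that silently drops deletions, which is the signal to tighten or replace that invariant.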

What if my function has side effects (database writes, API calls)?
Property-based testing works best on pure functions. For side-effectful functions, mock the external call at the boundary and write properties on the transformation logic. The stateful RuleBasedStateMachine pattern handles sequences of side effects when you can reset the state between rule applications.

Should I use this in CI?
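Yes, with a CI profile. Register a profile named ci with settings.register_profile (conftest.py is the usual place) that sets a higher deadline and suppresses HealthCheck.too_slow, then select it with --hypothesis-profile=ci. Arguments passed directly to @settings on a test override the profile, so a test that pins deadline=timedelta(seconds=1) keeps that deadline under the ci profile as well. The default profile is for local development; CI machines have different I/O profiles.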
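
A minimal sketch of such a configuration, assuming it lives in a conftest.py at the test root. The profile names and the 5-second CI deadline are illustrative choices, not values from the TCC fixture.

```python
# conftest.py (assumed location)
from datetime import timedelta

from hypothesis import HealthCheck, settings

# Local development: matches the numbers used in the prompt.
settings.register_profile(
    "dev",
    max_examples=200,
    deadline=timedelta(seconds=1),
)

# CI: same example budget, a looser per-example deadline, and no
# too_slow health check failures on slow shared runners.
settings.register_profile(
    "ci",
    max_examples=200,
    deadline=timedelta(seconds=5),
    suppress_health_check=[HealthCheck.too_slow],
)
```

With the Hypothesis pytest plugin active, `pytest --hypothesis-profile=ci` selects the second profile for the run.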

How many invariants is the right number?
6 is a target, not a ceiling. For small utility functions, 3-4 good invariants beat 6 weak ones. For libraries with a formal contract (codecs, serializers, hash functions), 8-10 is reasonable. The prompt uses 6 because it is the threshold where models stop defaulting to “the obvious three”.

The deterministic eval harness that uses property-based tests as graders is covered in the evals-without-judges post. The strict-JSON prompt for the output schema of your test-gen harness is in the strict-JSON prompt post. The test-gen scores for each model are in the GPT-5.3-Codex review and the Claude Opus 4.7 review.

One-line takeaway

Ask for exactly 6 invariants, force st.recursive on nested inputs, route stateful invariants to a state machine, ban example tests, and the first run produces a PBT suite you can ship.
