
What is property-based testing?

Most tests check a handful of cases you happened to think of. Property-based testing (PBT) checks a rule that should hold for every input, and then lets the computer hunt for the counterexample.

Example-based versus property-based

An example test pins down one point. You write assert add(2, 2) == 4, and you have checked exactly that: two plus two is four. It says nothing about any other pair of numbers.

A property states something that should be true everywhere. For addition: for all a and b, add(a, b) == add(b, a). A property-based runner takes that statement and checks it across hundreds or thousands of generated inputs, including the ones you would never think to type out by hand.

The first approach checks the cases you imagined. The second checks the rule those cases were supposed to be examples of.
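The contrast can be sketched with nothing but the standard library. This is a hand-rolled illustration, not how a real runner works internally: add is a stand-in for the function under test, and the trial count, seed, and value range are arbitrary choices.

```python
import random

def add(a, b):
    # Stand-in for the function under test.
    return a + b

def test_add_example():
    # Example-based: pins down a single point.
    assert add(2, 2) == 4

def test_add_commutes(trials=1000, seed=0):
    # Property-based, by hand: check the rule on many generated pairs.
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    for _ in range(trials):
        a = rng.randint(-10**9, 10**9)
        b = rng.randint(-10**9, 10**9)
        # Attach the failing pair to the assertion message.
        assert add(a, b) == add(b, a), (a, b)

test_add_example()
test_add_commutes()
```

The example test would still pass if add were wrong for every input except (2, 2). The loop has a chance of noticing.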

The three parts

A property-based test has three pieces:

  • A property: a statement that is true for all valid inputs, written as code that either holds or fails.
  • A generator: a recipe that produces many random inputs, from empty and tiny to huge and weird, so the property is stressed across the whole space.
  • A runner: the engine that checks the property over hundreds of generated cases and, when one fails, shrinks the input down to the smallest example that still breaks it.
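To make the three roles concrete, here is a minimal sketch in plain Python. Every name in it (prop_sort_keeps_length, gen_int_list, run_property, dedupe_sort) is hypothetical, and this runner only reports the first failure it finds. It does not shrink.

```python
import random

def prop_sort_keeps_length(xs):
    # Property: sorting never adds or drops elements.
    return len(sorted(xs)) == len(xs)

def gen_int_list(rng, max_len=20):
    # Generator: lists of random length (including empty) drawn from a
    # small value range, so duplicates show up often.
    n = rng.randint(0, max_len)
    return [rng.randint(-5, 5) for _ in range(n)]

def run_property(prop, gen, trials=200, seed=0):
    # Runner: return the first generated input that breaks the property,
    # or None if every trial passes.
    rng = random.Random(seed)
    for _ in range(trials):
        x = gen(rng)
        if not prop(x):
            return x
    return None

def dedupe_sort(xs):
    # Deliberately buggy "sort": silently drops duplicates.
    return sorted(set(xs))

def prop_dedupe_sort_keeps_length(xs):
    return len(dedupe_sort(xs)) == len(xs)

# The built-in sort survives every trial.
assert run_property(prop_sort_keeps_length, gen_int_list) is None

# The buggy sort is caught as soon as a generated list repeats a value.
counterexample = run_property(prop_dedupe_sort_keeps_length, gen_int_list)
assert counterexample is not None
print("property failed for:", counterexample)
```

Keeping the generator separate from the property is the useful design choice here: the same generator can feed many properties, and the same property can be stressed with different generators.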

A concrete example

Here is a property in Python, using Hypothesis. Sorting is idempotent: sorting an already-sorted list changes nothing.

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sort_is_idempotent(xs):
    assert sorted(sorted(xs)) == sorted(xs)

You wrote no example lists. The @given decorator hands test_sort_is_idempotent a batch of generated lists (100 by default): empty lists, single-element lists, lists full of duplicates, lists with negative numbers and huge numbers. If the property holds for all of them, the test passes. If any list breaks it, Hypothesis reports the simplest failing list it can find.

Common shapes of properties

Most useful properties fall into a few recurring shapes:

  • Round-trip: decode(encode(x)) == x. Encoding then decoding returns the original value.
  • Invariant: something that must always be true, such as a balance that never goes negative no matter the sequence of operations.
  • Idempotence: f(f(x)) == f(x). Applying the operation twice is the same as applying it once.
  • Commutativity: order does not matter, so add(a, b) == add(b, a).
  • Model-based, or oracle: compare the real implementation against a simple, obviously-correct reference and assert they agree.
  • Metamorphic: relate the outputs of two related inputs, for example that adding an item with a non-negative price to a cart never lowers the total.
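Each shape in the list above can be written as a one-line assertion. The sketch below uses only the standard library and a plain loop in place of a real runner. The functions normalize_spaces, apply_transaction, insertion_sort, and cart_total are hypothetical stand-ins for code under test, and prices are assumed to be non-negative integers in cents.

```python
import json
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def random_ints(max_len=10):
    return [rng.randint(-50, 50) for _ in range(rng.randint(0, max_len))]

def normalize_spaces(s):
    # Collapse runs of whitespace and trim both ends.
    return " ".join(s.split())

def apply_transaction(balance, amount):
    # Reject any transaction that would overdraw the account.
    new_balance = balance + amount
    return new_balance if new_balance >= 0 else balance

def insertion_sort(xs):
    # Implementation under test; the built-in sorted() acts as the oracle.
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

def cart_total(prices_in_cents):
    return sum(prices_in_cents)

for _ in range(200):
    xs = random_ints()

    # Round-trip: decoding the encoded value gives the original back.
    assert json.loads(json.dumps(xs)) == xs

    # Invariant: the balance never goes negative, whatever the sequence.
    balance = 0
    for amount in xs:
        balance = apply_transaction(balance, amount)
        assert balance >= 0

    # Idempotence: normalizing twice equals normalizing once.
    s = "".join(rng.choice("ab \t") for _ in range(rng.randint(0, 12)))
    once = normalize_spaces(s)
    assert normalize_spaces(once) == once

    # Commutativity: argument order does not matter.
    a, b = rng.randint(-50, 50), rng.randint(-50, 50)
    assert a + b == b + a

    # Model-based: agree with a simple, trusted reference.
    assert insertion_sort(xs) == sorted(xs)

    # Metamorphic: adding an item never lowers the total.
    prices = [abs(x) for x in xs]
    extra = rng.randint(0, 50)
    assert cart_total(prices + [extra]) >= cart_total(prices)
```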

Shrinking is the feature that makes it usable

Random testing without shrinking is miserable. A failure surfaces on a 4,000-element list of random integers, and you are left staring at a wall of numbers with no idea which one matters.

Shrinking fixes that. When a property fails, the runner automatically searches for a smaller input that fails the same way, then a smaller one, until it cannot reduce further. The 4,000-element list collapses to something like [0, -1], a counterexample you can read at a glance and turn into a one-line regression test.
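The search described above can be sketched as a greedy loop. This is a simplified illustration under stated assumptions, not the algorithm Hypothesis or any other library uses: shrink and shrink_candidates are hypothetical helpers, and the failing property ("no element is 10 or more") is chosen only because its smallest counterexample is easy to predict.

```python
import random

def shrink_candidates(xs):
    # Smaller lists first: drop one element at a time.
    for i in range(len(xs)):
        yield xs[:i] + xs[i + 1:]
    # Then simpler values: move one element toward zero.
    for i, x in enumerate(xs):
        for smaller in (0, int(x / 2), x - 1 if x > 0 else x + 1):
            if abs(smaller) < abs(x):
                yield xs[:i] + [smaller] + xs[i + 1:]

def shrink(xs, fails):
    # Greedy search: accept the first simpler candidate that still
    # fails, and repeat until no candidate fails any more.
    assert fails(xs), "shrinking needs a failing input to start from"
    improved = True
    while improved:
        improved = False
        for candidate in shrink_candidates(xs):
            if fails(candidate):
                xs = candidate
                improved = True
                break
    return xs

def fails(xs):
    # The property "every element is below 10" is violated.
    return any(x >= 10 for x in xs)

rng = random.Random(0)
big = [rng.randint(-1000, 1000) for _ in range(4000)]

smallest = shrink(big, fails)
print(smallest)  # → [10]
```

Every accepted step either shortens the list or moves a value closer to zero, so the loop always terminates. The result is only a local minimum, which is why production shrinkers use more elaborate strategies.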

Why this matters now

Example tests encode the cases you imagined. Properties encode what must always be true. That difference is exactly the gap AI-written code falls into: a generated change is very good at passing the examples and fully capable of violating the underlying rule.

Properties catch that, because they do not care how the code is written. They care whether the guarantee still holds.

The hard part of property-based testing has always been writing good properties. That is the work Delta takes on: it mines them from your codebase, proves they catch real regressions, and runs them on every pull request.
