RelA: Reliability as a Control Problem

Reliability failures rarely come from missing mechanisms.

Most systems already have retries, timeouts, hedging, rate limits, and autoscaling. When they still struggle under load, it’s often because these mechanisms activate too late, interact poorly, or operate without shared context.

RelA (Reliability Autohealer) started as a small experiment to observe that behavior more closely.

The goal was not to build a faster circuit breaker or a production-ready reliability system. It was to look at a simpler question:

What changes when reliability decisions are treated as a control problem rather than as static configuration embedded in application code?

Where reliability logic usually lives

In many architectures, reliability behavior is implemented inside SDKs or middleware. Each service independently decides when to retry, hedge, or shed load.

Under normal conditions, this arrangement works well.

As systems approach saturation, a different set of pressures appears. Reliability logic embedded in application code competes with request handling for the same resources the service needs to remain stable. Decisions about what to drop and what to preserve are made locally, with limited visibility into global conditions or downstream impact.

Over time, this creates a familiar pattern. The system spends increasing effort deciding how to respond, while its ability to stabilize continues to erode.

RelA explores an alternative framing: separating decision-making from request handling.

Reliability as a feedback loop

RelA models reliability explicitly as a control loop:

Signals are observed from the system (latency, saturation indicators).
Policies evaluate those signals in priority order.
Actions are asserted back onto the application as control flags.

This separates the cadence of control from the cadence of deployment. Decisions can change without redeploying code. More importantly, decision timing becomes visible rather than implicit.
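A minimal sketch of such a loop in Python may make the three steps concrete. It is an illustration, not RelA's actual code: the names Signals, Policy, evaluate, the flag names, and every threshold are assumptions invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Signals:
    """Observed system state (hypothetical fields)."""
    p99_latency_ms: float
    queue_utilization: float  # 0.0 to 1.0


@dataclass
class Policy:
    """A rule mapping signals to control flags (hypothetical shape)."""
    name: str
    priority: int  # lower number is evaluated first
    condition: Callable[[Signals], bool]
    flags: Dict[str, bool]


# Flags the application would read on each request (assumed names).
DEFAULT_FLAGS = {"shed_load": False, "hedge_requests": False}


def evaluate(policies: List[Policy], signals: Signals) -> Dict[str, bool]:
    """Return the flags of the first matching policy, in priority order."""
    for policy in sorted(policies, key=lambda p: p.priority):
        if policy.condition(signals):
            return {**DEFAULT_FLAGS, **policy.flags}
    return dict(DEFAULT_FLAGS)


# Policies are plain data, so they can change without redeploying the app.
POLICIES = [
    Policy("hedge-on-slow-tail", 20,
           lambda s: s.p99_latency_ms > 250, {"hedge_requests": True}),
    Policy("survive-saturation", 10,
           lambda s: s.queue_utilization > 0.9, {"shed_load": True}),
]

flags_calm = evaluate(POLICIES, Signals(80, 0.3))
flags_slow = evaluate(POLICIES, Signals(400, 0.5))
flags_saturated = evaluate(POLICIES, Signals(400, 0.95))
```

In this sketch only the first matching policy asserts flags, so a saturated system with a slow tail sheds load and does not hedge. A real controller would run evaluate on a timer and publish the result to the application; that transport is left out here.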

One constraint surfaced quickly during the experiment: feedback lag dominates behavior.

When signals arrive late or are averaged over long windows, the system reacts to conditions that may already have passed. Recovery actions overshoot. Load shedding continues after traffic subsides. Dashboards remain stable while user experience degrades.

This is not a tooling issue. It’s an artifact of how feedback loops behave under delay.
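The effect of averaging windows can be reproduced with a toy simulation. The sketch below uses made-up numbers (a latency trace in arbitrary ticks, a 200 ms threshold) and hypothetical helper names; it only illustrates how window length changes when a shedding decision turns on and off.

```python
from collections import deque


def windowed_mean(samples, window):
    """Yield the trailing mean over the last `window` samples."""
    buf = deque(maxlen=window)
    for x in samples:
        buf.append(x)
        yield sum(buf) / len(buf)


# Synthetic trace: calm, then a spike during ticks 10-19, then calm again.
latency_ms = [50] * 10 + [500] * 10 + [50] * 20
THRESHOLD_MS = 200  # assumed trigger for load shedding


def shedding_ticks(window):
    """Ticks at which a controller using this window would shed load."""
    means = windowed_mean(latency_ms, window)
    return [t for t, m in enumerate(means) if m > THRESHOLD_MS]


fast_loop = shedding_ticks(window=2)
slow_loop = shedding_ticks(window=10)
```

With these numbers, the 2-sample window sheds from tick 10 through tick 20, one tick past the end of the spike. The 10-sample window does not start shedding until tick 13 and keeps shedding through tick 25, six ticks after traffic has returned to normal.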

Decision ordering under saturation

Retries, hedging, and load shedding are all reasonable responses to stress. In isolation, each behaves correctly.

Under saturation, they begin to interact.

Latency-preserving actions such as retries and hedging consume capacity. Availability-preserving actions such as load shedding restrict work to protect that capacity.

Without an explicit ordering, these responses can conflict. Actions intended to help begin amplifying the very conditions they are reacting to.

RelA treats reliability actions as mutually exclusive modes rather than independent toggles. When the system enters a survival condition, capacity-preserving behavior dominates. Latency optimizations are suppressed until stability returns.

This isn’t an optimization strategy. It’s a safety boundary that makes behavior predictable under pressure.
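One way to express modes rather than toggles is an enumeration with a fixed set of permitted actions per mode. The following is a sketch under assumptions: the mode names, the action names, and the select_mode inputs are hypothetical and not taken from RelA.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    LATENCY = "latency"    # tail-latency optimizations allowed
    SURVIVAL = "survival"  # capacity preservation only


# Each mode permits a fixed set of actions; there are no independent toggles.
ALLOWED_ACTIONS = {
    Mode.NORMAL: frozenset({"retry"}),
    Mode.LATENCY: frozenset({"retry", "hedge"}),
    Mode.SURVIVAL: frozenset({"shed"}),
}


def select_mode(saturated: bool, slow_tail: bool) -> Mode:
    """Pick exactly one mode; survival is checked first, so it dominates."""
    if saturated:
        return Mode.SURVIVAL
    if slow_tail:
        return Mode.LATENCY
    return Mode.NORMAL


def is_allowed(action: str, mode: Mode) -> bool:
    """True if the action may run in the given mode."""
    return action in ALLOWED_ACTIONS[mode]


mode_under_pressure = select_mode(saturated=True, slow_tail=True)
```

Because the system is in exactly one mode at a time, hedging cannot be active while load is being shed, even when both triggering conditions hold.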

Reliability reflects product intent

Another pattern becomes visible when decisions are centralized.

Random load shedding treats all requests equally. In practice, systems often care about which requests succeed. That preference exists whether or not the system acknowledges it.

Reliability decisions quietly encode product priorities.

RelA makes this explicit by allowing business context to participate in control decisions. During stress, the system preserves value rather than fairness, based on predefined policy.
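As an illustration of value-aware shedding, the sketch below admits requests by a business value score instead of dropping them at random. The Request fields, the scores, and the admit function are assumptions made for this example. A real system would more likely decide per request than per batch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    id: str
    value: int  # hypothetical business value; higher is more important


def admit(requests: List[Request], capacity: int) -> Tuple[List[Request], List[Request]]:
    """Admit up to `capacity` requests, highest value first.

    Returns (admitted, shed). Ties keep arrival order because sorted() is stable.
    """
    ranked = sorted(requests, key=lambda r: r.value, reverse=True)
    return ranked[:capacity], ranked[capacity:]


batch = [
    Request("browse-1", 1),
    Request("checkout-1", 10),
    Request("search-1", 2),
    Request("checkout-2", 10),
]

# Under stress, assume there is only room for two of the four requests.
admitted, shed = admit(batch, capacity=2)
```

Here both checkout requests are admitted and the browse and search requests are shed. The policy is unfair by design: it encodes which work the product cares about most.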

Scope of the experiment

RelA is intentionally small.

It does not aim to establish production-grade benchmarks or absolute performance characteristics. Its purpose is to surface decision behavior, feedback dynamics, and failure modes that remain consistent as systems scale.

The same patterns appear at small volumes and large ones. Scale changes the cost of mistakes, not their structure.

Why this framing matters

Most reliability incidents are not caused by missing mechanisms. They emerge from late decisions, conflicting actions, and signals that arrive after options have narrowed.

Treating reliability as a control problem brings these constraints forward. It makes timing, ordering, and intent explicit while the system still has room to respond.