# Proxy Pool Governor

An **online governor** that continuously steers request traffic toward the
best-performing proxy providers. For every service it watches the live health of
each proxy pool and adjusts that pool's routing **weight** — the share of traffic
it receives — pulling weight away from pools that start to degrade and handing it
back as they recover. The goal is to keep overall success rates high and latency low
without a human having to react to every incident.

The decision logic is a rule-based expert system written in **CLIPS** (`gov.clp`).
A small C harness (`gov-sim`) feeds it observations one cycle at a time and applies
the weight changes it emits. Because the policy is just rules, it can be read,
audited, and tuned without touching the harness.

## How it works

Routing is expressed as integer weights per (pool, service): the more weight a pool
has for a service, the larger its share of that service's traffic. Each pool has a
**base weight** (the operator's default preference for that pool) and an
**effective weight** (what the governor is currently using). The governor only ever
moves the effective weight within `min_weight ≤ effective_weight ≤ base_weight`,
so it can throttle a bad pool down to a floor and restore it up to — but never
above — its baseline.
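The bound above can be written as a small clamp. This is a minimal sketch, not code from the harness: `clamp_effective_weight` is a hypothetical helper, and only the `min_weight` and `base_weight` names come from this README.

```c
#include <assert.h>

/* Hypothetical helper: keeps a proposed weight inside
 * min_weight <= effective_weight <= base_weight. */
int clamp_effective_weight(int proposed, int min_weight, int base_weight)
{
    assert(min_weight <= base_weight);  /* a config with floor > base is invalid */
    if (proposed < min_weight)
        return min_weight;              /* never throttle below the floor */
    if (proposed > base_weight)
        return base_weight;             /* never boost above the baseline */
    return proposed;
}
```

Any reduction or restoration step can pass its result through a function like this so the invariant holds regardless of how the step was computed.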

The governor runs as a loop. Each **cycle** corresponds to one observation
timestamp:

1. Fresh per-pool/per-service observations are injected as facts: success rate,
   response time, timeout rate, SSL-error rate, plus the rolling average and
   standard deviation of success rate and response time.
2. CLIPS fires its rules to detect degradation, roll degradation up to a
   service-wide view, reduce weight on degrading pools, and restore weight on
   recovered ones.
3. The harness reads the `weight-adjustment` and `alert` facts the rules produced,
   applies the new effective weights (carrying them into the next cycle), and
   surfaces alerts.

CLIPS owns the operational state that persists between cycles (degradation status, how
long a pool has been healthy); the harness owns the effective-weight matrix and carries it
forward. Statistics (moving average / stddev) are computed outside the rules and
supplied with each observation: the rules consume them but do not maintain a
history window themselves.
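As an illustration of step 3, the harness side of the split might look roughly like the sketch below. The struct layout, the matrix dimensions, and the `apply_adjustments` name are assumptions made for this example; the real harness reads `weight-adjustment` facts out of CLIPS, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>

#define N_POOLS    3   /* illustrative sizes, not taken from the harness */
#define N_SERVICES 2

/* Hypothetical C mirror of one weight-adjustment fact. */
struct weight_adjustment {
    int pool;
    int service;
    int new_weight;
};

/* Apply one cycle's adjustments in place and return how many were applied.
 * Entries with out-of-range indices are skipped rather than trusted. */
size_t apply_adjustments(int effective[N_POOLS][N_SERVICES],
                         const struct weight_adjustment *adj, size_t n)
{
    size_t applied = 0;

    assert(effective != NULL && (adj != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        if (adj[i].pool < 0 || adj[i].pool >= N_POOLS ||
            adj[i].service < 0 || adj[i].service >= N_SERVICES)
            continue;
        effective[adj[i].pool][adj[i].service] = adj[i].new_weight;
        applied++;
    }
    return applied;
}
```

Because the matrix is updated in place, whatever it holds at the end of one cycle is the starting point for the next.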

### Detecting a degrading provider

A pool is flagged as degraded for a service when any of these fire:

| Signal | Trigger |
|--------|---------|
| Response time | `response_time` exceeds `avg + sigma_threshold × stddev`, **or** exceeds `max_response_time` |
| Success rate | `rate_success` falls below `avg − sigma_threshold × stddev`, **or** below `min_success_rate` |
| SSL errors | SSL error rate exceeds 5% |
| Timeouts | Timeout rate exceeds 10% |

Each detection carries a **severity** derived from how far the metric has moved
(in standard deviations for the statistical checks). Severity drives how hard the
governor reacts.
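The four triggers can be restated as plain C predicates. This is a sketch for reading alongside `gov.clp`, not a translation of it: the struct and function names are hypothetical, rates are assumed to be fractions in `[0, 1]`, and taking severity to be the absolute distance from the average in standard deviations is only one plausible reading of the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-pool, per-service observation for one cycle. */
struct observation {
    double rate_success, response_time;
    double rate_timeout, rate_ssl_error;
    double avg_success, std_success;              /* rolling stats, supplied */
    double avg_response_time, std_response_time;  /* by the harness          */
};

struct thresholds {
    double sigma_threshold, min_success_rate, max_response_time;
};

#define SSL_ERROR_LIMIT 0.05   /* 5%, a constant in the rules  */
#define TIMEOUT_LIMIT   0.10   /* 10%, a constant in the rules */

/* Assumed severity measure: |value - avg| in units of stddev. */
double sigma_distance(double value, double avg, double stddev)
{
    double d;

    if (stddev <= 0.0)
        return 0.0;            /* no spread recorded, no sigma score */
    d = (value - avg) / stddev;
    return d < 0.0 ? -d : d;
}

bool is_degraded(const struct observation *o, const struct thresholds *t)
{
    assert(o != NULL && t != NULL);
    bool slow = o->response_time >
                    o->avg_response_time + t->sigma_threshold * o->std_response_time
             || o->response_time > t->max_response_time;
    bool failing = o->rate_success <
                       o->avg_success - t->sigma_threshold * o->std_success
                || o->rate_success < t->min_success_rate;
    bool ssl = o->rate_ssl_error > SSL_ERROR_LIMIT;
    bool timeouts = o->rate_timeout > TIMEOUT_LIMIT;

    return slow || failing || ssl || timeouts;
}
```

For example, with an average response time of 210, a standard deviation of 20, and `sigma_threshold = 2`, a reading of 300 exceeds the statistical limit of 250 and sits 4.5 standard deviations from the average.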

### Shifting traffic away

When a pool is degraded *and the service as a whole is still healthy*, the governor
reduces that pool's effective weight:

- The first reduction multiplies the weight by `weight_reduction` (e.g. 0.5 halves it),
  clamped to `[min_weight, base_weight]`.
- An already-reduced pool whose severity exceeds 3σ is cut again (halved), still subject
  to the `min_weight` floor.

Reducing one pool's weight naturally shifts its traffic to its healthier peers —
that is the core "move traffic to the best providers" behavior.
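The two reduction cases can be sketched as a single function. The name `reduce_weight`, the `already_reduced` flag, and the truncating integer rounding are assumptions for illustration; `gov.clp` may round or track state differently. The caller is assumed to have already checked that the pool is degraded and the service is healthy.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of one reduction step for a degraded pool. */
int reduce_weight(int effective, int base_weight, int min_weight,
                  double weight_reduction, double severity_sigma,
                  bool already_reduced)
{
    int next;

    assert(min_weight <= base_weight);
    if (!already_reduced)
        next = (int)(effective * weight_reduction);  /* first cut */
    else if (severity_sigma > 3.0)
        next = effective / 2;                        /* repeat cut: halve */
    else
        return effective;                            /* hold at current level */

    if (next < min_weight)
        next = min_weight;
    if (next > base_weight)
        next = base_weight;
    return next;
}
```

With `base_weight = 100`, `min_weight = 10`, and `weight_reduction = 0.5`, a first detection takes the pool from 100 to 50, and a later detection above 3σ takes it from 50 to 25.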

**Service-wide safety:** if too many of a service's pools are degraded at once
(degraded fraction ≥ `service_degrade_threshold`), the service is marked degraded,
an alert is raised, and weight reductions are **suspended**. When everything is
already bad there is no "good" pool to shift toward, so the governor stops cutting
rather than gutting the whole service.
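A minimal sketch of that guard, assuming the degraded fraction is simply degraded pools divided by total pools for the service. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* True when the degraded fraction reaches service_degrade_threshold. */
bool service_is_degraded(int degraded_pools, int total_pools,
                         double service_degrade_threshold)
{
    assert(degraded_pools >= 0);
    if (total_pools <= 0)
        return false;  /* no pools observed, nothing to judge */
    return (double)degraded_pools / total_pools >= service_degrade_threshold;
}

/* Reductions apply only to a degraded pool in a still-healthy service. */
bool may_reduce(bool pool_degraded, bool service_degraded)
{
    return pool_degraded && !service_degraded;
}
```

With a threshold of 0.5, two degraded pools out of four mark the service degraded, so `may_reduce` returns false for every pool in it until the fraction drops again.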

### Restoring traffic

A pool that is no longer degraded is timestamped as healthy. Once it has stayed
healthy for longer than `restore_cooldown`, the governor steps its weight back up by
`base_weight / 4` per cycle (at least 1), never exceeding `base_weight`. Restoration
is deliberately gradual so a flapping pool doesn't immediately reclaim full traffic.
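The restoration step can be sketched as follows. The function name and the use of plain integer timestamps are assumptions; the only requirement here is that `healthy_since`, `now`, and `restore_cooldown` share one time unit.

```c
#include <assert.h>

/* Hypothetical sketch: one restoration step for a pool that is currently healthy. */
int restore_step(int effective, int base_weight,
                 long healthy_since, long now, long restore_cooldown)
{
    int step, next;

    assert(base_weight >= 0);
    if (now - healthy_since <= restore_cooldown)
        return effective;        /* not healthy for long enough yet */

    step = base_weight / 4;
    if (step < 1)
        step = 1;                /* always make some progress */

    next = effective + step;
    if (next > base_weight)
        next = base_weight;      /* never exceed the baseline */
    return next;
}
```

With `base_weight = 100`, a pool sitting at 25 would climb to 50, 75, and then 100 over three eligible cycles.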

## Tunable parameters

Tuning is per service, via `service-config` facts (seeded from `main.c`). The
parameters the rules actually read:

| Parameter | Role |
|-----------|------|
| `sigma_threshold` | How many standard deviations count as anomalous |
| `min_success_rate` | Hard floor below which a pool is degraded regardless of its own baseline |
| `max_response_time` | Hard ceiling above which a pool is degraded |
| `weight_reduction` | Multiplier applied on first reduction |
| `min_weight` | Floor the governor will not throttle below (e.g. a contractual minimum) |
| `restore_cooldown` | How long a pool must stay healthy before restoration begins |
| `service_degrade_threshold` | Fraction of degraded pools that marks the whole service degraded |

The SSL (5%) and timeout (10%) detection thresholds are currently constants in the
rules rather than per-service config.
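For orientation, the parameter set could be mirrored in C roughly as below. This is a hypothetical view, since the actual configuration lives in `service-config` facts; the field types and the sanity ranges in `service_config_is_sane` are assumptions (for instance, that rates and thresholds are fractions in `[0, 1]`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical C view of one service's tunables. */
struct service_config {
    double sigma_threshold;
    double min_success_rate;           /* assumed fraction in [0, 1] */
    double max_response_time;          /* same unit as response_time */
    double weight_reduction;           /* multiplier, e.g. 0.5 */
    int    min_weight;
    long   restore_cooldown;           /* same unit as observation timestamps */
    double service_degrade_threshold;  /* assumed fraction in (0, 1] */
};

/* Reject configurations that cannot express a sensible policy. */
bool service_config_is_sane(const struct service_config *c)
{
    assert(c != NULL);
    return c->sigma_threshold > 0.0
        && c->min_success_rate >= 0.0 && c->min_success_rate <= 1.0
        && c->max_response_time > 0.0
        && c->weight_reduction > 0.0 && c->weight_reduction <= 1.0
        && c->min_weight >= 0
        && c->restore_cooldown >= 0
        && c->service_degrade_threshold > 0.0
        && c->service_degrade_threshold <= 1.0;
}
```

A check of this kind would catch a `weight_reduction` above 1, which would otherwise try to raise a degraded pool's weight instead of cutting it.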

## Building and running

This targets **OpenBSD**: the build uses BSD `make` and the binary calls
`pledge(2)`/`unveil(2)`.

```sh
# CLIPS_DIR must point at a built CLIPS core (libclips.a + clips.h).
make CLIPS_DIR=/path/to/clips/core

./gov-sim                       # synthetic simulation, 60 cycles (default)
./gov-sim -s 200                # synthetic simulation, 200 cycles
./gov-sim -r scenarios/foo.pps  # replay recorded observations from a scenario file
```

- **Simulation mode** generates synthetic pool health (periodic incidents and
  recoveries) to exercise the rules end to end.
- **Replay mode** drives the governor from a recorded `.pps` scenario file, so the
  same observation stream can be replayed against different parameter settings.
  See `scenario.h` for the file format and `scenario.R` for tooling to read, write,
  generate, and analyze scenarios.

Each cycle prints the weight adjustments and alerts it produced, and the weight
matrix is printed periodically and at the end.

## Operational notes

- **Pool semantics:** response times degrade non-linearly as a pool saturates, and
  different providers degrade differently — some gracefully, some sharply. The
  statistical (sigma) checks adapt to each pool's own baseline; the hard floors
  (`min_success_rate`, `max_response_time`) catch absolute-bad behavior.
- **Restart:** effective weights are the only state meant to survive a restart,
  although persisting them is not yet implemented (see "Design intent" below); the
  in-memory facts (healthy-since timers, degradation status) are rebuilt from the next
  few cycles of observations. Behavior after restart is conservative: a still-degraded
  pool is re-detected quickly, while restoration timers simply start over.

## Design intent (not yet implemented)

The following are part of the governor's intended direction but are **not** in the
current rules. They are listed so the gap between intent and implementation is
explicit:

- **Capacity awareness:** respect each pool's soft/hard request limits — reduce
  weight as a pool nears its limit even if quality is fine, block restoration into a
  near-full pool, and allow boosting a healthy pool *above* its base weight (up to a
  ceiling, e.g. 1.5×) to absorb load shed from constrained peers.
- **Proactive, time-of-day shifts:** anticipate the daily traffic cycle
  (trough → ramp-up → peak → ramp-down) and pre-shift toward load-efficient pools
  before peak rather than only reacting after degradation.
- **Probation / flap handling:** a `healthy → degraded → probation → healthy` state
  machine with flap detection and extended probation for pools that oscillate.
- **Trend detection:** act on multi-cycle trends, not just single-cycle snapshots.
- **Persistence and scale:** sourcing observations from a production metrics store
  on a fixed poll interval and persisting effective weights across restarts.

When this document and `gov.clp` disagree, `gov.clp` is what actually runs.

## License

Copyright (C) 2026 SWGY, Inc.

This program is free software: you can redistribute it and/or modify it under the
terms of the GNU Affero General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later version.
See [`LICENSE`](LICENSE) for the full text.