Here’s a question that sounds simple but absolutely isn’t: did the £4 million sewer upgrade at this site last year actually reduce sewage discharges?

The naive approach is to count spills before and after. If it went from 40 spills to 12, that’s a 70% improvement, job done, press release written, shareholder pleased. The problem is that this analysis is almost completely meaningless unless you know one other thing: how much it rained.

CSOs discharge when sewers overflow. Sewers overflow when it rains. A dry year after a wet year will always show “improvement” in raw spill counts regardless of whether anything changed. This is not a minor caveat — it is the entire story. And yet raw before/after counts are what almost every press release, every regulatory report, and most media coverage uses.

WaterWatch’s upgrade analysis tool controls for this. Here’s exactly how.

Step 1: Two windows, one excluded gap

When you run an analysis, you specify the upgrade date and a site permit number. The tool splits the discharge history into two windows: everything before that date, and everything after. Both default to 24 months, though they’re naturally limited by how much data we have.

But we don’t start the “after” window on the day of the upgrade. We exclude ±30 days around it. This matters more than it sounds. Major sewer works cause significant disruption — groundwork, pipework replacement, flow rerouting. The site might actually discharge more during construction as the normal flow pathways are disrupted. Including that in your “after” window and comparing it to normal pre-upgrade operation would be unfair in the other direction. The buffer removes the construction noise.

Window structure

Before window: up to 24 months ending 30 days before the upgrade date.
After window: starting 30 days after the upgrade date, up to 24 months.
The 60-day exclusion zone around the upgrade date is discarded entirely.

Step 2: Scoring daily environmental pressure

This is where it gets interesting. Instead of counting raw spills, we assign every day in both windows an environmental pressure score — a number that represents how hard the sewer network was being pushed by weather on that day.

The score combines three inputs:

24-hour rainfall from an Environment Agency monitoring station matched to the catchment. The nearest station with good coverage is used; if no station can be assigned, the analysis flags a data quality warning.

Antecedent Wetness Index (AWI)— this one has a proper academic pedigree. It’s based on a 1951 paper by Kohler and Linsley, and it models the fact that soil doesn’t drain instantly. A heavy rain event last Tuesday means the catchment is still wetter than average today, even if it’s been dry since. The AWI decays exponentially with a factor of k=0.85 per day: today’s AWI = today’s rainfall + yesterday’s AWI × 0.85. A day 10 days after a storm still carries about 20% of its influence.

18mm of rain on Day 1, then dry. The AWI still shows elevated soil wetness for days afterwards — the sewer doesn’t know the sun came out.

River level anomaly — how far above or below the seasonal baseline the nearest river level gauge is reading. High river levels reduce the effective head pressure that a sewer can drain against, increasing the likelihood of overflow. This catches flood-driven events that pure rainfall data would miss.

Step 3: Spill rate per 100 pressure units

Once we have a daily pressure score for every day in both windows, we can calculate a pressure-adjusted spill rate:

spill_rate = (total spill episodes ÷ total pressure) × 100

This gives us the number of discharge episodes per 100 units of environmental pressure. Now we’re comparing like with like: if the before-window had 40 spills under 1000 pressure units and the after-window had 12 spills under 300 pressure units, the rates are effectively identical (4.0 vs 4.0). No improvement. If the after-window had 12 spills under 600 pressure units (rate: 2.0), that’s a genuine 50% reduction in spill frequency per unit of environmental stress. That’s real.

The same 40→12 headline figure reads very differently once you account for how much pressure the sewer was under in each window.

Step 4: Confidence classification

A 17% pressure-adjusted reduction is still a reduction. But we don’t report it the same way we’d report a 60% reduction. The tool classifies results into confidence tiers:

Verdict	Criteria
Strong improvement	≥30% reduction in spill rate, ≥20 episodes in both windows
Moderate improvement	10–30% reduction, sufficient episode count
Inconclusive	Change <10%, or fewer than 10 episodes in a window
Deterioration signal	≥10% increase in pressure-adjusted spill rate

The episode count threshold matters. A site with 3 before-spills and 1 after-spill has a 67% reduction but almost zero statistical credibility. We flag these as inconclusive rather than “strong improvement” regardless of the percentage.

The data quality problem (and how we handle it)

Here’s a real-world failure mode we encountered. Weybridge CSO showed 77 spills before and 3 after — the tool called it “deteriorating”. That seems absurd. Raw counts showed massive improvement.

What happened: the rain gauge assigned to that catchment had significant data gaps in the pre-upgrade window. Days with missing rain data score zero pressure. Zero pressure days still produce the same denominator contribution as calm dry days, but they’re also the days when you’d least expect spills — which artificially inflates the pre-upgrade rate, making the site look like it was spilling a lot relative to conditions. When the after-window had better rain data coverage, the rates became incomparable.

Data quality warning

The tool now checks rain data coverage for both windows. If either drops below 25% coverage, a red warning banner fires and you should use raw spill counts as the headline figure, not the normalised rate. The normalised rate is still shown but clearly flagged as unreliable.

This is why the data quality diagnostics exist: rain station assigned, coverage percentage, mean daily pressure in each window. Low mean pressure in one window compared to the other is a red flag — it means the two periods weren’t comparable on environmental stress.

What this analysis can and can’t tell you

It can tell you: after controlling for weather, did the number of spill episodes per unit of environmental stress go down? If yes, by how much? If the change is large, is it statistically robust given how many episodes occurred?

It can’t tell you: whether the watercourse is ecologically healthier. Whether spills that did happen were shorter or longer. Whether the improvement was caused by the specific upgrade or by something else. Whether the result will hold up over a wider range of weather conditions than the two windows happened to cover.

It also can’t tell you anything useful when the data is sparse. If an upgrade happened 3 months ago, you don’t have enough post-upgrade data for a credible analysis. We’re not in the business of declaring things improved on the basis of six weeks of dry weather.

For journalists and researchers

If you’re investigating a specific upgrade or want to run an analysis on a site and have questions about the methodology, get in touch. We can walk you through the full result — including the pressure time series, rain station assignment, and what confidence level the data actually supports.

hello@water-watch.co.uk

The pressure bands breakdown

One more output worth explaining: the pressure band breakdown. We split days into three bands — low pressure (0–3 units), medium (3–6), and high (6+) — and show the spill rate in each band, before and after the upgrade.

This tells you something raw rates can’t: where the improvement is concentrated. A site might show overall improvement because it barely spills during low-pressure periods now, but its high-pressure spill rate is unchanged — meaning the upgrade worked well at the margins but the infrastructure still can’t cope with intense rainfall events. Or the opposite: the high-pressure rate dropped dramatically but the low-pressure rate is the same — suggesting the upgrade specifically increased capacity at the extreme end.

Both would register as “improvement” overall. The band breakdown tells you what kind of improvement.

A worked example: how to read the output

Let’s say you run an analysis on a site with an upgrade date of 1 June 2023. The tool pulls 24 months before (with the 30-day buffer), 24 months after, assigns a rain station, scores every day, and returns:

Before: 38 spills / 920 total pressure = rate 4.13
After: 14 spills / 680 total pressure = rate 2.06
Change: −50.1% → Strong improvement
Rain coverage: 91% before, 88% after → data quality OK
Mean daily pressure: 2.8 before vs 2.6 after → conditions broadly comparable

That’s a solid result. The two windows had similar pressure distributions, the rain station had good coverage, and the reduction is large enough that a few unlucky high-pressure events can’t explain it away. You can publish that result with reasonable confidence.

Now consider the same site where the rain station coverage was 43% before and 89% after. The normalised rates might show the same pattern — but the methodology red-flags it, because 57% of the pre-upgrade days had zero pressure contribution from missing rain data. The raw counts might still be worth reporting, but the normalised rate is not reliable enough to build conclusions on.

Related: How we determine if a site is improving (annual trend methodology) · How WaterWatch actually works

Methodology:Kohler & Linsley (1951) antecedent wetness model · Environment Agency EDM and rainfall open data · Environment Agency river level API

Did the upgrade actually work?Here’s how we find out.