TECHNICAL OVERVIEW

How It Works

wx.jamestannahill.com publishes live hyperlocal weather from a private station in Midtown Manhattan. Below is the full data pipeline — physical sensor, AWS Lambdas, ML signals, and 156 years of climate context — that produces every reading on the dashboard.

Station: Ambient WS-2902
Location: 40.75°N · 73.98°W
Cadence: Every 5 minutes
Stack: AWS · Python · uPlot

DATA COLLECTION

The station is an Ambient Weather WS-2902 mounted at the property. It transmits readings every 5 minutes to Ambient Weather's cloud via a local Wi-Fi gateway and is also registered with Weather Underground as KNYNEWYO2140. Live readings are visible at ambientweather.net and wunderground.com/dashboard/pws/KNYNEWYO2140.

A scheduled AWS Lambda (wx-poller) fires every 5 minutes via EventBridge and calls the Ambient Weather REST API to fetch the latest reading:

GET https://api.ambientweather.net/v1/devices/{mac}
    ?apiKey=...&applicationKey=...&limit=1
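A minimal sketch of how the poller's request could be assembled with the standard library. The helper names (build_url, fetch_latest) are hypothetical, and the assumption that the response is a JSON list with the newest reading first is not confirmed by this page.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.ambientweather.net/v1/devices"


def build_url(mac: str, api_key: str, application_key: str, limit: int = 1) -> str:
    """Build the device endpoint URL with the query parameters shown above."""
    query = urlencode(
        {"apiKey": api_key, "applicationKey": application_key, "limit": limit}
    )
    return f"{API_BASE}/{mac}?{query}"


def fetch_latest(mac: str, api_key: str, application_key: str):
    """Fetch the most recent reading, or None if the response is empty."""
    with urlopen(build_url(mac, api_key, application_key), timeout=10) as resp:
        readings = json.load(resp)
    # Assumption: the API returns a JSON list, newest reading first.
    return readings[0] if readings else None
```

Only build_url is side-effect free; fetch_latest performs a live HTTP call and needs real credentials.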

Each reading includes:

Field            Description
tempf            Outdoor temperature (°F)
feelsLike        Feels-like temperature (wind chill / heat index)
humidity         Relative humidity (%)
dewPoint         Dew point temperature (°F)
windspeedmph     10-minute average wind speed (mph)
windgustmph      Peak wind gust in the last 10 minutes (mph)
winddir          Wind direction (0–360°)
baromrelin       Relative barometric pressure (inHg)
solarradiation   Solar radiation (W/m²)
uv               UV index (0–11+)
hourlyrainin     Rainfall rate in the current hour (in/hr)
dailyrainin      Total rainfall since local midnight (in)

On each poll, the poller also fetches live METAR temperatures from NOAA, computes the current UHI delta (see Urban Heat Island below), and appends it as a uhi_delta field on the reading before storing it. Raw readings are written to DynamoDB (wx-readings) with a 90-day TTL. The poller also generates and uploads a fresh OG image (og.png) to S3 on each run.


SOURCE VALIDATION

Before storing a reading, the poller runs two quality checks:

Flagged readings are still stored in wx-readings so the history is complete, but they are excluded from baseline updates. The dashboard shows a banner when the station data is stale or suspect.


HISTORICAL BACKFILL — WEATHER UNDERGROUND

When the station was first set up, the baseline table had no station data. To bootstrap it with real hyperlocal history, a one-time backfill script fetched 90 days of 5-minute readings from Weather Underground's PWS history API for the nearest private station (KNYNEWYO2140):

GET https://api.weather.com/v2/pws/history/all
    ?stationId=KNYNEWYO2140&format=json&units=e
    &date=20260115&apiKey=...

This produced 24,914 readings going back 90 days, which were written to wx-readings and used to seed the wx-daily-stats rolling averages. The baseline is now grounded in real local station data rather than climate reanalysis alone.


BASELINES — OPEN-METEO ERA5

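Anomaly labels ("8.2°F above average for 9am in April") require historical climate normals for this exact location. The initial seed uses Open-Meteo's ERA5 reanalysis archive: a free, public dataset going back to 1940 that reconstructs past atmospheric conditions by assimilating observations into a weather model, rather than a record of direct measurements at this site.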

A bootstrap Lambda fetches the 14th of each month for 2019–2023 (5 years × 12 months) at the station's coordinates and averages the hourly values into per-month-per-hour climate normals:

GET https://archive-api.open-meteo.com/v1/archive
    ?latitude=40.7549&longitude=-73.984
    &start_date=2021-04-14&end_date=2021-04-14
    &hourly=temperature_2m,relativehumidity_2m,
            windspeed_10m,surface_pressure
    &temperature_unit=fahrenheit&windspeed_unit=mph

This produces 288 baseline slots (12 months × 24 hours) stored in wx-daily-stats, weighted as 288 samples. Real station readings blend in via a rolling weighted average — after a few months the baselines are fully derived from the station itself.
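How a seeded slot gives way to station data can be sketched as a running mean that carries a sample count. This is an illustration under assumptions: the function name and slot layout are hypothetical, and it reads "weighted as 288 samples" as a seed count of 288 per slot, which the page does not spell out.

```python
def blend(mean: float, count: int, reading: float) -> tuple[float, int]:
    """Fold one reading into a running mean that already holds `count` samples."""
    new_count = count + 1
    new_mean = mean + (reading - mean) / new_count
    return new_mean, new_count


# Hypothetical slot: ERA5 seed of 52.0°F carrying a weight of 288 samples.
mean, count = 52.0, 288

# After 288 station readings of 56.0°F the seed and the station weigh the same.
for _ in range(288):
    mean, count = blend(mean, count, 56.0)
```

With an uncapped count the seed never vanishes entirely, it only shrinks in weight; a cap or decay on the count is what would make the baseline fully station-derived.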


TODAY IN HISTORY — HISTORICAL CLIMATE CONTEXT

Separate from the rolling-average baseline system, the dashboard surfaces a deeper historical perspective: how does today's temperature rank against every recorded year at Central Park? This uses two independent data sources — NOAA GHCN-Daily for 156 years of station records, and ERA5 reanalysis for 85 years of hourly distribution data.

NOAA GHCN-Daily (daily high/low context). Station USW00094728 ("NY CITY CNTRL PARK") has daily temperature records going back to 1869. The data is fetched directly from NOAA's public S3 bucket in long format and parsed into per-DOY (day-of-year) distributions:

s3://noaa-ghcn-pds/csv/by_station/USW00094728.csv
# format: ID, DATE (YYYYMMDD), ELEMENT, DATA_VALUE, ...
# elements: TMAX, TMIN, AWND (daily high, low, avg wind)

A bootstrap Lambda (WxClimateBootstrap) runs once to load all years, compute per-DOY statistics, and write them to the wx-climate-doy DynamoDB table — 366 rows keyed by MMDD. A nightly Lambda (WxClimateUpdater) refreshes the current year's data each night. Each row stores:

On each /current request, climate_context.py calls daily_verdict() to rank the current station's confirmed daily high against the full historical distribution. The result produces statements like "Warmest April 13th since 1987 · 95th percentile" — scanning the year-descending record to find the most recent year that exceeded today's high.
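The "warmest since" scan can be sketched as follows. This is not the site's daily_verdict(); the function names, the {year: high} input shape, and the percentile definition (share of years strictly below today's high) are assumptions made for illustration.

```python
def warmest_since(today_high: float, highs_by_year: dict[int, float]):
    """Return the most recent year whose high met or exceeded today's, else None."""
    for year in sorted(highs_by_year, reverse=True):
        if highs_by_year[year] >= today_high:
            return year
    return None  # today's high tops every year on record


def percentile_of(today_high: float, highs_by_year: dict[int, float]) -> int:
    """Percent of recorded years with a strictly lower high for this date."""
    values = list(highs_by_year.values())
    below = sum(1 for v in values if v < today_high)
    return round(100 * below / len(values))
```

For example, with recorded highs of 84°F in 1987 and lower values in every later year, a high of 80°F today gives warmest_since(...) == 1987.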

ERA5 hourly distribution (live percentile context). A second table, wx-climate-hourly, stores 8,784 slots (366 DOY × 24 hours), each containing the ERA5 hourly distribution for temperature, dew point, and wind speed at this exact coordinate going back to 1940. These are loaded once by WxClimateBootstrap from the Open-Meteo archive API. live_context() in climate_context.py uses a normal CDF against the slot's mean and std to express the current reading as a percentile: "currently 91st percentile for this hour on April 13th in 85 years of records."
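A normal-CDF percentile of this kind needs only the slot's mean and standard deviation. The sketch below uses math.erf from the standard library; the function name and the clamp to 1–99 are assumptions, not the actual live_context() code.

```python
import math


def normal_percentile(value: float, mean: float, std: float) -> int:
    """Percentile of `value` under a normal distribution N(mean, std^2)."""
    if std <= 0:
        return 50  # degenerate slot: no spread to rank against
    z = (value - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    # Clamp so the output never claims a 0th or 100th percentile (assumption).
    return max(1, min(99, round(cdf * 100)))
```

A reading two standard deviations above the slot mean lands at the 98th percentile; the normal approximation is weakest in the tails, where real temperature distributions are often skewed.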

Headline priority. anomaly_headline() prefers the NOAA GHCN verdict (when a confirmed daily high is available) over the ERA5 live percentile. Both are included in the /current response under climate_context. The dashboard's deviation bar visualizes the span from p5 to p95, with a tick at the historical average and a dot at today's value.


ANOMALY SCORING

On every /current request, the API looks up the baseline for the current month and hour (New York local time) and computes a delta:

delta = current_value − rolling_avg_value
label = "8.2°F above average for 9am in April"

Anomalies are computed for temperature, humidity, wind speed, and UV index. The dashboard surfaces the most notable one as a headline subline beneath the main temperature. Baselines are keyed by MM-HH (month + local hour). When a NOAA GHCN verdict is available, it takes precedence over the rolling-average anomaly label in the headline — the two systems run in parallel and complement each other.
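The lookup key and label can be sketched as below, assuming the reading's timestamp is timezone-aware. The function name and the exact label wording are illustrative; only the MM-HH key format and the delta definition come from the text above.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NEW_YORK = ZoneInfo("America/New_York")


def anomaly_label(current_f: float, baseline_f: float, when: datetime):
    """Return the MM-HH baseline key and a human-readable anomaly label."""
    local = when.astimezone(NEW_YORK)
    key = local.strftime("%m-%H")  # month + local hour, e.g. "04-09"
    delta = current_f - baseline_f
    direction = "above" if delta >= 0 else "below"
    hour = local.strftime("%I%p").lstrip("0").lower()  # "09AM" becomes "9am"
    month = local.strftime("%B")
    return key, f"{abs(delta):.1f}°F {direction} average for {hour} in {month}"


# 13:00 UTC on 13 April is 9:00 a.m. in New York (daylight time).
key, label = anomaly_label(60.2, 52.0, datetime(2026, 4, 13, 13, 0, tzinfo=timezone.utc))
```

Converting to local time before building the key matters: keying on UTC would shift every baseline by four or five hours.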


WEATHER ALERTS

A scheduled Lambda (wx-alerter) runs every 15 minutes and evaluates the latest reading against five anomaly thresholds:

Alert                 Condition                                         Cooldown
Unusually warm        Temp > baseline avg + 10°F                        2 hours
Unusually cold        Temp < baseline avg − 10°F                        2 hours
High wind gust        Gust > avg wind + 20 mph and > 25 mph absolute    1 hour
Rain started          Hourly rain > 0.01"/hr                            30 minutes
Rapid pressure drop   Pressure drops > 0.08" in ~1 hour                 2 hours

Wind gust thresholds are relative rather than fixed: the alert fires only when the gust exceeds the baseline average wind for that hour by more than 20 mph and also exceeds 25 mph in absolute terms. Alert state and debounce timestamps are persisted in wx-alerts. Triggered alerts are delivered via SES email.
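The cooldown (debounce) logic can be sketched as a check against the last time each alert fired. The alert keys, function names, and in-memory dict are hypothetical stand-ins for whatever wx-alerts actually stores; the durations and the gust condition follow the table above.

```python
from datetime import datetime, timedelta, timezone

# Cooldowns from the alert table; the key names are illustrative.
COOLDOWNS = {
    "unusually_warm": timedelta(hours=2),
    "unusually_cold": timedelta(hours=2),
    "high_wind_gust": timedelta(hours=1),
    "rain_started": timedelta(minutes=30),
    "rapid_pressure_drop": timedelta(hours=2),
}


def gust_condition(gust_mph: float, baseline_wind_mph: float) -> bool:
    """High gust: more than 20 mph over the hourly baseline and over 25 mph."""
    return gust_mph > baseline_wind_mph + 20 and gust_mph > 25


def should_fire(alert: str, condition_met: bool, last_fired: dict, now: datetime) -> bool:
    """Fire only if the condition holds and the alert's cooldown has elapsed."""
    if not condition_met:
        return False
    last = last_fired.get(alert)
    return last is None or now - last >= COOLDOWNS[alert]
```

A caller would record now in last_fired[alert] after sending, so the next 15-minute run sees the cooldown.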


ML SIGNALS

Five proprietary signals are computed on every /current request. Each is specific to this station's location, baseline, and accumulated history.

Signal: Comfort Score
Method: Composite of feels-like deviation from 71°F, humidity, wind speed, UV, and rain rate. Season-weighted (UV matters more in summer, wind in winter). Adjusted by temperature anomaly vs. this station's rolling baseline.
Output: 0–100 integer + label (Excellent / Good / Fair / Poor / Harsh)

Signal: Percentile Rank
Method: Current temperature expressed as a percentile against this station's historical distribution for the same month. Uses a normal CDF approximation with season-calibrated standard deviations (winter σ=9.5°F, summer σ=6.0°F).
Output: 1st–99th percentile + label ("one of the warmest days on record here")

Signal: Rain Probability
Method: Logistic regression: p = σ(w·x + b). Features: humidity (normalized), 1-hour pressure trend, dew-point depression, sin/cos of hour-of-day. Coefficients fitted weekly against labeled station history by wx-ml-fitter (see below). Persistence boost if currently raining.
Output: 1–99% + label (Unlikely / Slight chance / Possible / Likely / Very likely)

Signal: Urban Heat Island (UHI)
Method: This station's temperature minus the average of live METAR readings at KJFK, KLGA, and KEWR via NOAA aviationweather.gov. Fetched and persisted on each 5-minute poll. Per-month rolling averages accumulate in wx-uhi-seasonal for the seasonal curve.
Output: ±°F delta + direction label + typical delta for current month

Signal: Analog Forecast
Method: Nearest-neighbor pattern matching on 90 days of hourly data. Pre-computed every 30 minutes by wx-forecaster (see below). The dashboard reports running forecast accuracy (MAE) as evaluations accumulate.
Output: +1h / +2h / +3h for temp, humidity, wind, pressure + confidence %

RAIN PROBABILITY MODEL FITTING

A scheduled Lambda (wx-ml-fitter) runs every Sunday at 3:00 UTC. It scans all 90-day readings, labels each reading with whether measurable rain (hourlyrainin > 0.01) occurred in any reading within the following 60 minutes, then fits logistic regression via gradient descent with L2 regularization (300 epochs, learning rate 0.05).

Rain events are rare (~8–9% of readings), so training uses class weighting: positive examples (rain) are upweighted by n_neg / n_pos. This produces a recall-biased model — appropriate for weather, where a missed rain event is worse than a false alarm.
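A dependency-free sketch of class-weighted logistic regression fitted by batch gradient descent with L2 regularization. It mirrors the stated hyperparameters (300 epochs, learning rate 0.05) and the n_neg / n_pos upweighting, but the L2 strength, the gradient normalization, and the function names are assumptions, not the wx-ml-fitter code.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, epochs=300, lr=0.05, l2=0.01):
    """Fit weights w and bias b; rain examples (y == 1) are upweighted."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    pos_weight = n_neg / n_pos  # requires at least one rain example
    total_weight = n_neg + n_pos * pos_weight
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = (pos_weight if yi == 1 else 1.0) * (p - yi)
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # Weighted-mean gradient plus L2 shrinkage on the weights only.
        w = [wj - lr * (gw / total_weight + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / total_weight
    return w, b


def predict(w, b, x) -> float:
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

Upweighting positives shifts the decision boundary toward predicting rain, which is the source of the high-recall, low-precision profile described above.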

Fitted weights are stored in wx-ml-models. On the next wx-api cold start, ml.py loads them and uses them for all subsequent calls in that warm instance. The model is only stored if F1 ≥ 0.05; otherwise the heuristic coefficients remain active. Training metrics (accuracy, precision, recall, F1, confusion matrix) are stored alongside the weights.

Current model: 24,348 training examples, F1=0.396, recall=82%, precision=26%.


NEARBY STATIONS

Every 5 minutes, alongside the primary station reading, the poller fetches up to 20 nearby Personal Weather Stations from the Weather Underground API using the geocode endpoint (centered on 40.7549°N, 73.984°W). Each result is filtered to exclude the home station (distance < 0.05 mi) and sorted by distance. The snapshot — station ID, temperature, humidity, rain rate, wind, bearing, and distance — is written to the wx-nearby-snapshots DynamoDB table with a 30-day TTL.

The /nearby API endpoint returns the most recent snapshot. The dashboard Nearby Stations strip shows up to 8 of the closest stations with their current temperature and rain rate.


SPATIAL RAIN BOOST

When nearby station data is available, the rain probability estimate is augmented with a spatial boost from upwind stations. A station is considered "upwind" if its bearing from the home station falls within ±60° of the current wind direction. The boost formula is:

boost = min(0.35, √rain_rate × 0.25 / distance_mi)

where rain_rate is in inches per hour and distance_mi is the straight-line distance. The boost is additive to the logistic regression base probability (capped at 0.99). The contributing station ID is surfaced in the dashboard as a "↑ boosted" annotation on the rain probability card when a boost is applied.
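The upwind test and the boost formula translate directly into code. The function names are illustrative, and the sketch assumes distance_mi is always positive (the home station is filtered out before this step) and that wind direction is the compass direction the wind blows from.

```python
import math


def is_upwind(bearing_deg: float, wind_dir_deg: float, tolerance: float = 60.0) -> bool:
    """True if the station's bearing is within ±tolerance of the wind direction."""
    # Smallest angular difference, handling the 360°/0° wrap-around.
    diff = abs((bearing_deg - wind_dir_deg + 180.0) % 360.0 - 180.0)
    return diff <= tolerance


def spatial_boost(rain_rate_in_hr: float, distance_mi: float) -> float:
    """boost = min(0.35, sqrt(rain_rate) * 0.25 / distance_mi)"""
    return min(0.35, math.sqrt(rain_rate_in_hr) * 0.25 / distance_mi)


def boosted_probability(base_p: float, rain_rate_in_hr: float, distance_mi: float) -> float:
    """Add the boost to the logistic base probability, capped at 0.99."""
    return min(0.99, base_p + spatial_boost(rain_rate_in_hr, distance_mi))
```

For instance, an upwind station half a mile away reporting 0.04 in/hr contributes √0.04 × 0.25 / 0.5 = 0.10 to the probability.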

This is Phase 1 of the spatial rain model. Phase 2 (retraining the logistic regression with spatial features as explicit inputs) is deferred until 30+ days of nearby snapshot history accumulates.


SEASONAL UHI CURVE

On every 5-minute poll, the poller calls the NOAA METAR endpoint, computes the UHI delta, and updates two stores: the uhi_delta field on the reading in wx-readings, and the per-month rolling average in wx-uhi-seasonal.

wx-uhi-seasonal stores one row per calendar month (01–12), keyed by station ID + month. Each row holds a capped rolling average (max 10,000 samples per month) so the curve reflects the current year's observations rather than a multi-year stale mean.
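A sketch of the delta and the capped average, under two stated assumptions: METAR temperatures arrive in °C and need converting, and the cap is implemented by freezing the sample count at its maximum so that each new reading keeps a fixed minimum weight. The function names are hypothetical.

```python
def c_to_f(celsius: float) -> float:
    return celsius * 9.0 / 5.0 + 32.0


def uhi_delta(station_temp_f: float, airport_temps_c: dict):
    """Station temperature minus the mean of the available airport readings (°F)."""
    temps_f = [c_to_f(t) for t in airport_temps_c.values() if t is not None]
    if not temps_f:
        return None  # no airport reported; skip this poll
    return round(station_temp_f - sum(temps_f) / len(temps_f), 1)


def capped_update(mean: float, count: int, value: float, cap: int = 10_000):
    """Running mean whose sample count stops growing at `cap`."""
    n = min(count + 1, cap)
    return mean + (value - mean) / n, n
```

Once the count reaches the cap, every new delta carries a weight of 1/10,000, so older observations decay gradually instead of anchoring the mean forever.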

The dashboard shows the typical UHI delta for the current month once at least 10 readings exist for that month. The full 12-month curve is available in the /current API response as uhi_seasonal_curve[].


ANALOG FORECAST

A scheduled Lambda (wx-forecaster) runs every 30 minutes and produces a 3-hour forecast from historical pattern matching. The result is pre-computed and served synchronously by the API.

Algorithm:

  1. Fetch all 90-day readings and bucket into 1-hour averages (~2,100 hourly buckets across 4 fields).
  2. The current fingerprint is the last 6 hourly buckets, normalized to [0,1] within physical bounds for Midtown Manhattan.
  3. For every historical 6-hour window with at least 50% field coverage, compute normalized Euclidean distance to the current fingerprint.
  4. Take the top-5 closest analogs. Average their subsequent 3-hour trajectories into an ensemble forecast.

Confidence (0–100%) reflects how closely the analogs matched: lower average distance → higher confidence. The best-matching analog date is shown for interpretability. The model improves continuously as new readings expand the pool of candidates.
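The four steps above can be sketched in a few lines. This simplified version assumes the hourly buckets are already averaged, normalized to [0,1], and complete (no coverage check); it omits the confidence mapping, and the function name and return shape are illustrative.

```python
import math


def analog_forecast(history, window=6, horizon=3, k=5):
    """history: hourly feature vectors in [0, 1], oldest first.

    Returns (forecast, best): `horizon` averaged vectors and the k closest
    analogs as (distance, start_index) pairs.
    """
    current = history[-window:]
    # Each candidate needs `horizon` hours of known data after its window.
    last_start = len(history) - window - horizon
    candidates = []
    for start in range(last_start + 1):
        past = history[start:start + window]
        dist = math.sqrt(
            sum((a - b) ** 2 for pv, cv in zip(past, current) for a, b in zip(pv, cv))
        )
        candidates.append((dist, start))
    best = sorted(candidates)[:k]
    n_fields = len(history[0])
    forecast = []
    for step in range(horizon):
        rows = [history[start + window + step] for _, start in best]
        forecast.append([sum(r[f] for r in rows) / len(rows) for f in range(n_fields)])
    return forecast, best
```

On a perfectly periodic series the closest analogs match exactly (distance 0) and the forecast reproduces the next values of the cycle, which is a convenient sanity check.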


FORECAST ACCURACY TRACKING

On each run, wx-forecaster first reads the previous forecast from wx-forecasts. If that forecast is between 55 and 240 minutes old — meaning actual hourly readings are available for the +1h, +2h, and +3h windows — it computes mean absolute error (MAE) for all four forecast fields against the actual readings.

Each evaluation is written to wx-forecast-accuracy as a timestamped row. A running row in the same table accumulates the incremental mean across all evaluations. Once 5+ evaluations exist, the dashboard shows the running +1h temperature MAE alongside the confidence percentage.


DAILY SUMMARIES

A scheduled Lambda (wx-summarizer) runs daily at 05:00 UTC (midnight EST, 1:00 a.m. EDT). It processes the last 30 calendar days idempotently, so it is safe to rerun without duplicating data. For each day it computes high/low/average temperature, total rainfall, max wind gust, and average comfort score, then generates a 2–3 sentence prose description of the day.

Prose generation is deterministic — it selects sentence templates based on temperature spread, rain event count and peak intensity, and notable wind. Rain event detection parses the 5-minute readings to find contiguous rain periods (rate > 0.01"/hr), grouping them by start time, duration, total accumulation, and peak rate.

The most recent summary appears on the dashboard beneath the hero section. The full 30-day series is also used to populate the Comfort Calendar — a heatmap grid where each day is colored from red (0/100) to green (85+/100) based on the daily average comfort score.

Results are stored in wx-daily-summaries (keyed by station_id + date) and exposed via GET /daily-summaries.


STATION RECORDS

A scheduled Lambda (wx-records-tracker) runs weekly on Sunday at 02:00 UTC. It scans all 90-day clean readings (no quality_flag), groups them by local calendar month, and computes per-month extremes:

Record                Field
Highest temperature   temp_high + date
Lowest temperature    temp_low + date
Max wind gust         max_gust + date
Peak rain rate        max_rain_rate + date (only recorded if > 0.01"/hr)
Lowest pressure       min_pressure + date
Highest pressure      max_pressure + date

Records are stored in wx-station-records (keyed by station_id + month). The current month's records are included in every /current response and shown in the Station Records section of the dashboard. As the station accumulates history across multiple years, these records will reflect the full observed range for each calendar month.


RAIN EVENTS

The GET /rain-events endpoint scans recent readings on-demand and parses them into discrete rain events. A rain event begins when hourlyrainin > 0.01 (in/hr) and ends when the rate returns to zero for a 5-minute interval. For each event the API returns start, duration_min, total_in, and peak_rate.

Small events (<0.01") that result from sensor noise are naturally excluded by the rate threshold. Rainfall totals are also surfaced in the daily summaries and Comfort Calendar.
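Event parsing of this kind is a small state machine over the 5-minute readings. In the sketch below the input shape (dicts with ts in minutes and hourlyrainin) and the accumulation estimate (rate × 5/60 per reading) are assumptions; the output field names follow the /rain-events description.

```python
def parse_rain_events(readings, threshold=0.01, step_min=5):
    """Group consecutive rainy readings into events, oldest reading first."""
    events, current = [], None
    for r in readings:
        rate = r["hourlyrainin"]
        if current is None:
            if rate > threshold:  # an event starts above the threshold
                current = {
                    "start": r["ts"],
                    "duration_min": step_min,
                    "total_in": rate * step_min / 60,
                    "peak_rate": rate,
                }
        elif rate > 0:  # still raining: extend the open event
            current["duration_min"] += step_min
            current["total_in"] += rate * step_min / 60
            current["peak_rate"] = max(current["peak_rate"], rate)
        else:  # rate back to zero: close the event
            events.append(current)
            current = None
    if current is not None:
        events.append(current)  # rain still falling at the last reading
    return events
```

The asymmetric thresholds (start above 0.01, end at exactly zero) keep light drizzle inside an ongoing event while preventing noise from opening a new one.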


CHART ANOMALY BANDS

In addition to the dotted baseline overlay, the history chart shows a shaded ±1σ band for each metric. The upper and lower edges of the band are baseline ± std_dev, where the standard deviation is computed incrementally by the poller using Welford's online algorithm:

new_var = (old_var × n + (x − old_mean) × (x − new_mean)) / (n + 1)

This allows variance to be updated in O(1) per reading without storing all historical values. Once the rolling sample count reaches 8,640 (30 days × 288 readings/day), the algorithm transitions to exponential moving average variance to prevent very old data from dominating:

new_var = (1 − α) × old_var + α × (x − old_mean)²

The resulting std_{field} values are stored per MM-HH baseline slot and attached to history readings by the API. The dashboard renders the band as a transparent fill between the upper and lower edges using uPlot's bands config, anchored to the upper and lower σ series.
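Both update rules fit in one function. This sketch implements the two formulas as written; the value of α (here 1/8,640) and the EMA update of the mean after the switch are assumptions, since the page does not state them.

```python
def update_stats(mean, var, n, x, cap=8640, alpha=None):
    """Return (new_mean, new_var, new_n) after observing x.

    Below `cap` samples: exact Welford update of the population variance.
    At or above `cap`: exponential moving average with weight alpha.
    """
    if n < cap:
        new_mean = mean + (x - mean) / (n + 1)
        new_var = (var * n + (x - mean) * (x - new_mean)) / (n + 1)
        return new_mean, new_var, n + 1
    a = alpha if alpha is not None else 1.0 / cap  # assumed default
    new_var = (1 - a) * var + a * (x - mean) ** 2
    new_mean = mean + a * (x - mean)
    return new_mean, new_var, n


# Feeding a small series reproduces its population mean and variance.
mean, var, n = 0.0, 0.0, 0
for x in [2, 4, 4, 4, 5, 5, 7, 9]:
    mean, var, n = update_stats(mean, var, n, x)
```

The band edges are then baseline ± var ** 0.5 for each MM-HH slot.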


DOWNSAMPLING FOR LONGER RANGES

The /history endpoint accepts up to 720 hours (30 days). To keep payloads manageable, readings are automatically bucketed:

Numeric fields are averaged within each bucket. The timestamp is the floor of the bucket boundary. Per-slot baselines are batch-fetched from wx-daily-stats and attached to each reading so the chart can overlay the historical normal line.
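Bucketing by floored timestamp can be sketched as below. The bucket width is left as a parameter because the actual widths per range are not given here; the input shape (dicts with an epoch-seconds ts) and the function name are assumptions.

```python
from collections import defaultdict


def downsample(readings, bucket_s):
    """Average numeric fields within fixed-width time buckets.

    Each output row's ts is the floor of its bucket boundary.
    """
    buckets = defaultdict(list)
    for r in readings:
        buckets[r["ts"] - r["ts"] % bucket_s].append(r)
    out = []
    for ts in sorted(buckets):
        rows = buckets[ts]
        fields = {
            k for r in rows for k, v in r.items()
            if k != "ts" and isinstance(v, (int, float))
        }
        row = {"ts": ts}
        for f in sorted(fields):
            vals = [r[f] for r in rows if isinstance(r.get(f), (int, float))]
            row[f] = sum(vals) / len(vals)
        out.append(row)
    return out
```

For example, downsample(readings, 3600) collapses twelve 5-minute readings per hour into a single hourly row.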


API

The API is public and read-only. No authentication required.

Endpoint: GET /current
Description: Latest reading with all ML signals: condition label, anomaly, percentile rank, comfort score, rain probability, UHI delta, seasonal UHI curve, analog forecast with accuracy, yesterday's daily summary, current-month station records, and climate_context (NOAA GHCN verdict + ERA5 hourly percentile with p5/p50/p95 distribution data)

Endpoint: GET /history?hours=N
Description: Last N hours of readings (default 24, max 720). Downsampled for longer ranges. Each reading includes baseline_{field} and baseline_std_{field} values for chart overlay and ±1σ bands.

Endpoint: GET /rain-events?days=N
Description: Parsed rain events from the last N days (default 30). Each event has start, duration_min, total_in, peak_rate.

Endpoint: GET /daily-summaries?days=N
Description: Pre-computed daily summaries from the last N days (default 30). Each entry has high/low/avg temp, total rain, max gust, avg comfort score, and prose summary.

Endpoint: GET /nearby
Description: Most recent snapshot of nearby Manhattan PWS stations.

All responses are JSON. CloudFront caches all endpoints for 5 minutes, keyed by path and query string. CORS is open (Access-Control-Allow-Origin: *).


INFRASTRUCTURE

Everything runs on AWS in us-east-1. Estimated cost: ~$4–6/month.