How It Works
wx.jamestannahill.com publishes live hyperlocal weather from a private station in Midtown Manhattan. Below is the full data pipeline — physical sensor, AWS Lambdas, ML signals, and 156 years of climate context — that produces every reading on the dashboard.
DATA COLLECTION
The station is an Ambient Weather WS-2902 mounted at the property. It transmits readings every 5 minutes to Ambient Weather's cloud via a local Wi-Fi gateway and is also registered with Weather Underground as KNYNEWYO2140. Live readings are visible at ambientweather.net and wunderground.com/dashboard/pws/KNYNEWYO2140.
A scheduled AWS Lambda (wx-poller) fires every 5 minutes via EventBridge and calls the Ambient Weather REST API to fetch the latest reading:
GET https://api.ambientweather.net/v1/devices/{mac}
?apiKey=...&applicationKey=...&limit=1
Each reading includes:
| Field | Description |
|---|---|
| tempf | Outdoor temperature (°F) |
| feelsLike | Feels-like temperature (wind chill / heat index) |
| humidity | Relative humidity (%) |
| dewPoint | Dew point temperature (°F) |
| windspeedmph | 10-minute average wind speed (mph) |
| windgustmph | Peak wind gust in the last 10 minutes (mph) |
| winddir | Wind direction (0–360°) |
| baromrelin | Relative barometric pressure (inHg) |
| solarradiation | Solar radiation (W/m²) |
| uv | UV index (0–11+) |
| hourlyrainin | Rainfall rate in the current hour (in/hr) |
| dailyrainin | Total rainfall since local midnight (in) |
On each poll, the poller also fetches the current METAR UHI delta from NOAA (see Urban Heat Island below) and appends it as a uhi_delta field on the reading before storing it. Raw readings are written to DynamoDB (wx-readings) with a 90-day TTL. The poller also generates and uploads a fresh OG image (og.png) to S3 on each run.
SOURCE VALIDATION
Before storing a reading, the poller runs two quality checks:
- Range check: each field is tested against physical bounds (e.g., temperature −40°F to 140°F, humidity 0–100%). Fields that fall outside the bounds are nulled and the reading is flagged range_error.
- Stuck sensor detection: if the last 6 temperature readings are identical, the reading is flagged stuck. This catches a common failure mode in home weather stations where the outdoor sensor freezes on a stale value.
Flagged readings are still stored in wx-readings so the history is complete, but they are excluded from baseline updates. The dashboard shows a banner when the station data is stale or suspect.
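The two checks can be sketched as follows. This is a minimal illustration, not the poller's actual code: the `validate` function and the `BOUNDS` table are hypothetical names, only the temperature and humidity limits come from the text above, and it assumes the current reading counts as one of the six compared for the stuck check.

```python
from typing import Optional

# Only these two bounds are stated in the text; other fields would need their own.
BOUNDS = {"tempf": (-40.0, 140.0), "humidity": (0.0, 100.0)}
STUCK_WINDOW = 6  # identical consecutive temperatures that count as "stuck"


def validate(reading: dict, recent_temps: list) -> dict:
    """Return a cleaned copy of `reading` with a quality_flag (None when clean)."""
    cleaned = dict(reading)
    flag: Optional[str] = None

    # Range check: null any field that falls outside its physical bounds.
    for field, (lo, hi) in BOUNDS.items():
        value = cleaned.get(field)
        if value is not None and not (lo <= value <= hi):
            cleaned[field] = None
            flag = "range_error"

    # Stuck check: the last STUCK_WINDOW temperatures (assumed to include
    # this reading) are all present and identical.
    temps = (list(recent_temps) + [cleaned.get("tempf")])[-STUCK_WINDOW:]
    if (flag is None and len(temps) == STUCK_WINDOW
            and None not in temps and len(set(temps)) == 1):
        flag = "stuck"

    cleaned["quality_flag"] = flag
    return cleaned
```

A reading with `quality_flag` set would still be written to wx-readings but skipped when baselines are updated.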
HISTORICAL BACKFILL — WEATHER UNDERGROUND
When the station was first set up, the baseline table had no station data. To bootstrap it with real hyperlocal history, a one-time backfill script fetched 90 days of 5-minute readings from Weather Underground's PWS history API for this station's Weather Underground ID (KNYNEWYO2140):
GET https://api.weather.com/v2/pws/history/all
?stationId=KNYNEWYO2140&format=json&units=e
&date=20260115&apiKey=...
This produced 24,914 readings going back 90 days, which were written to wx-readings and used to seed the wx-daily-stats rolling averages. The baseline is now grounded in real local station data rather than climate reanalysis alone.
BASELINES — OPEN-METEO ERA5
Anomaly labels ("8.2°F above average for 9am in April") require historical climate normals for this exact location. The initial seed uses Open-Meteo's ERA5 reanalysis archive: a free, public, gridded reconstruction of past weather that combines model output with observations and goes back to 1940.
A bootstrap Lambda fetches the 14th of each month for 2019–2023 (5 years × 12 months) at the station's coordinates and averages the hourly values into per-month-per-hour climate normals:
GET https://archive-api.open-meteo.com/v1/archive
?latitude=40.7549&longitude=-73.984
&start_date=2021-04-14&end_date=2021-04-14
&hourly=temperature_2m,relativehumidity_2m,
windspeed_10m,surface_pressure
&temperature_unit=fahrenheit&windspeed_unit=mph
This produces 288 baseline slots (12 months × 24 hours) stored in wx-daily-stats, weighted as 288 samples. Real station readings blend in via a rolling weighted average; after a few months the baselines are fully derived from the station itself.
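The averaging step can be sketched like this, assuming the hourly values from the fetched days have already been flattened into `(month, hour, value)` tuples. The `build_normals` name and the input shape are illustrative assumptions; the MM-HH key format matches the baseline keys described elsewhere on this page.

```python
from collections import defaultdict


def build_normals(samples):
    """Average (month, hour, value) samples into {"MM-HH": mean} climate normals."""
    sums = defaultdict(lambda: [0.0, 0])  # slot key -> [running total, sample count]
    for month, hour, value in samples:
        slot = sums[f"{month:02d}-{hour:02d}"]
        slot[0] += value
        slot[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

With one sampled day per month for five years, each of the 288 slots would average five values.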
TODAY IN HISTORY — HISTORICAL CLIMATE CONTEXT
Separate from the rolling-average baseline system, the dashboard surfaces a deeper historical perspective: how does today's temperature rank against every recorded year at Central Park? This uses two independent data sources — NOAA GHCN-Daily for 156 years of station records, and ERA5 reanalysis for 85 years of hourly distribution data.
NOAA GHCN-Daily (daily high/low context). Station USW00094728 ("NY CITY CNTRL PARK") has daily temperature records going back to 1869. The data is fetched directly from NOAA's public S3 bucket in long format and parsed into per-DOY (day-of-year) distributions:
s3://noaa-ghcn-pds/csv/by_station/USW00094728.csv
# format: ID, DATE (YYYYMMDD), ELEMENT, DATA_VALUE, ...
# elements: TMAX, TMIN, AWND (daily high, low, avg wind)
A bootstrap Lambda (WxClimateBootstrap) runs once to load all years, compute per-DOY statistics, and write them to the wx-climate-doy DynamoDB table (366 rows keyed by MMDD). A nightly Lambda (WxClimateUpdater) refreshes the current year's data. Each row stores:
- annual_highs: array of {year, tmax, tmin, awnd} for every recorded year, used for "warmest since YEAR" narratives
- p5/p25/p50/p75/p95: percentile distribution of historical daily highs and lows
- mean/std: mean and standard deviation for normal CDF lookups
- sample_count: number of years contributing to the distribution
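A rough sketch of how the long-format CSV could be folded into per-day rows. GHCN-Daily stores TMAX and TMIN in tenths of a degree Celsius, so the sketch converts to °F. The function names are hypothetical, only TMAX is handled, and using the population standard deviation is an assumption.

```python
import csv
import io
import statistics
from collections import defaultdict


def tenths_c_to_f(value: int) -> float:
    """GHCN-Daily temperatures are tenths of a degree Celsius."""
    return value / 10 * 9 / 5 + 32


def highs_by_doy(csv_text: str) -> dict:
    """Group TMAX rows into {"MMDD": [(year, tmax_f), ...]}."""
    by_doy = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        # Columns: ID, DATE (YYYYMMDD), ELEMENT, DATA_VALUE, ...
        if len(row) < 4 or row[2] != "TMAX":
            continue  # also skips a header row, if present
        date = row[1]
        by_doy[date[4:8]].append((int(date[:4]), tenths_c_to_f(int(row[3]))))
    return by_doy


def summarize(highs: list) -> dict:
    """Mean, spread, and sample count for one day of the year."""
    values = [tmax for _, tmax in highs]
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "sample_count": len(values),
    }
```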
On each /current request, climate_context.py calls daily_verdict() to rank the current station's confirmed daily high against the full historical distribution. The result produces statements like "Warmest April 13th since 1987 · 95th percentile" — scanning the year-descending record to find the most recent year that exceeded today's high.
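The year-descending scan might look like the following. This is an illustrative stand-in for part of what daily_verdict() does, not its actual code; the `warmest_since` name is hypothetical and the row shape follows the annual_highs array described above.

```python
from typing import Optional


def warmest_since(annual_highs: list, today_high: float) -> Optional[int]:
    """Most recent year whose high for this date exceeded today's high.

    Returns None when no recorded year was warmer, i.e. today is a record.
    """
    for row in sorted(annual_highs, key=lambda r: r["year"], reverse=True):
        if row["tmax"] is not None and row["tmax"] > today_high:
            return row["year"]
    return None
```

For example, if 1987 is the first year found with a higher reading, the headline would read "Warmest April 13th since 1987".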
ERA5 hourly distribution (live percentile context). A second table, wx-climate-hourly, stores 8,784 slots (366 DOY × 24 hours), each containing the ERA5 hourly distribution for temperature, dew point, and wind speed at this exact coordinate going back to 1940. These are loaded once by WxClimateBootstrap from the Open-Meteo archive API. live_context() in climate_context.py uses a normal CDF against the slot's mean and std to express the current reading as a percentile: "currently 91st percentile for this hour on April 13th in 85 years of records."
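The normal CDF lookup can be written with the standard library alone. The sketch below is an assumption about the shape of the calculation rather than the code in live_context(); the clamp to 1–99 mirrors the percentile range quoted for the Percentile Rank signal.

```python
import math


def normal_percentile(value: float, mean: float, std: float) -> int:
    """Percentile of `value` under a normal distribution N(mean, std)."""
    if std <= 0:
        raise ValueError("std must be positive")
    z = (value - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    # Clamp so the result never claims a 0th or 100th percentile.
    return max(1, min(99, round(cdf * 100)))
```

A reading one standard deviation above the slot mean lands at roughly the 84th percentile.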
Headline priority. anomaly_headline() prefers the NOAA GHCN verdict (when a confirmed daily high is available) over the ERA5 live percentile. Both are included in the /current response under climate_context. The dashboard's deviation bar visualizes the span from p5 to p95, with a tick at the historical average and a dot at today's value.
ANOMALY SCORING
On every /current request, the API looks up the baseline for the current month and hour (New York local time) and computes a delta:
delta = current_value − rolling_avg_value
label = "8.2°F above average for 9am in April"
Anomalies are computed for temperature, humidity, wind speed, and UV index. The dashboard surfaces the most notable one as a headline subline beneath the main temperature. Baselines are keyed by MM-HH (month + local hour). When a NOAA GHCN verdict is available, it takes precedence over the rolling-average anomaly label in the headline; the two systems run in parallel and complement each other.
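As a sketch, the delta and label for temperature could be produced as below. The function name is hypothetical, and the timestamp is assumed to have already been converted to New York local time before the call.

```python
import calendar
from datetime import datetime


def anomaly_label(current: float, baseline: float, local_ts: datetime) -> str:
    """Format a temperature anomaly against the MM-HH baseline for `local_ts`."""
    delta = current - baseline
    direction = "above" if delta >= 0 else "below"
    # 12-hour clock without a leading zero, e.g. 9am, 12pm.
    suffix = "am" if local_ts.hour < 12 else "pm"
    hour = f"{local_ts.hour % 12 or 12}{suffix}"
    month = calendar.month_name[local_ts.month]
    return f"{abs(delta):.1f}°F {direction} average for {hour} in {month}"
```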
WEATHER ALERTS
A scheduled Lambda (wx-alerter) runs every 15 minutes and evaluates the latest reading against five anomaly thresholds:
| Alert | Condition | Cooldown |
|---|---|---|
| Unusually warm | Temp > baseline avg + 10°F | 2 hours |
| Unusually cold | Temp < baseline avg − 10°F | 2 hours |
| High wind gust | Gust > avg wind + 20 mph and > 25 mph absolute | 1 hour |
| Rain started | Hourly rain > 0.01"/hr | 30 minutes |
| Rapid pressure drop | Pressure drops > 0.08" in ~1 hour | 2 hours |
Wind gust thresholds are statistically calibrated — the alert fires only when the gust is significantly above the baseline average wind for that hour, not just above a fixed value. Alert state and debounce timestamps are persisted in wx-alerts. Triggered alerts are delivered via SES email.
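Two of the conditions and the cooldown logic can be sketched as follows. The alert keys and the in-memory `last_sent` dict are illustrative assumptions; in the pipeline described here the debounce timestamps live in wx-alerts.

```python
from datetime import datetime, timedelta

# Cooldowns from the table above; the key names are hypothetical.
COOLDOWNS = {
    "unusually_warm": timedelta(hours=2),
    "high_wind_gust": timedelta(hours=1),
}


def warm_alert(tempf: float, baseline_temp: float) -> bool:
    return tempf > baseline_temp + 10.0


def gust_alert(gust_mph: float, baseline_wind: float) -> bool:
    # Both conditions must hold: relative to baseline AND above 25 mph absolute.
    return gust_mph > baseline_wind + 20.0 and gust_mph > 25.0


def should_send(name: str, now: datetime, last_sent: dict) -> bool:
    """Return True and record `now` unless the alert is still cooling down."""
    previous = last_sent.get(name)
    if previous is not None and now - previous < COOLDOWNS[name]:
        return False
    last_sent[name] = now
    return True
```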
ML SIGNALS
Five proprietary signals are computed on every /current request. Each is specific to this station's location, baseline, and accumulated history.
| Signal | Method | Output |
|---|---|---|
| Comfort Score | Composite of feels-like deviation from 71°F, humidity, wind speed, UV, and rain rate. Season-weighted (UV matters more in summer, wind in winter). Adjusted by temperature anomaly vs. this station's rolling baseline. | 0–100 integer + label (Excellent / Good / Fair / Poor / Harsh) |
| Percentile Rank | Current temperature expressed as a percentile against this station's historical distribution for the same month. Uses a normal CDF approximation with season-calibrated standard deviations (winter σ=9.5°F, summer σ=6.0°F). | 1st–99th percentile + label ("one of the warmest days on record here") |
| Rain Probability | Logistic regression: p = σ(w·x + b). Features: humidity (normalized), 1-hour pressure trend, dew-point depression, sin/cos of hour-of-day. Coefficients fitted weekly against labeled station history by wx-ml-fitter (see below). Persistence boost if currently raining. | 1–99% + label (Unlikely / Slight chance / Possible / Likely / Very likely) |
| Urban Heat Island (UHI) | This station's temperature minus the average of live METAR readings at KJFK, KLGA, and KEWR via NOAA aviationweather.gov. Fetched and persisted on each 5-minute poll. Per-month rolling averages accumulate in wx-uhi-seasonal for the seasonal curve. | ±°F delta + direction label + typical delta for current month |
| Analog Forecast | Nearest-neighbor pattern matching on 90 days of hourly data. Pre-computed every 30 minutes by wx-forecaster (see below). The dashboard reports running forecast accuracy (MAE) as evaluations accumulate. | +1h / +2h / +3h for temp, humidity, wind, pressure + confidence % |
RAIN PROBABILITY MODEL FITTING
A scheduled Lambda (wx-ml-fitter) runs every Sunday at 3:00 UTC. It scans all 90-day readings, labels each reading with whether measurable rain (hourlyrainin > 0.01) occurred in any reading within the following 60 minutes, then fits logistic regression via gradient descent with L2 regularization (300 epochs, learning rate 0.05).
Rain events are rare (~8–9% of readings), so training uses class weighting: positive examples (rain) are upweighted by n_neg / n_pos. This produces a recall-biased model — appropriate for weather, where a missed rain event is worse than a false alarm.
Fitted weights are stored in wx-ml-models. On the next wx-api cold start, ml.py loads them and uses them for all subsequent calls in that warm instance. The model is only stored if F1 ≥ 0.05; otherwise the heuristic coefficients remain active. Training metrics (accuracy, precision, recall, F1, confusion matrix) are stored alongside the weights.
Current model: 24,348 training examples, F1=0.396, recall=82%, precision=26%.
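For illustration, a dependency-free version of class-weighted logistic regression with L2 regularization is sketched below. The epoch count and learning rate follow the text; the `l2` strength, the function names, and the batch-gradient formulation are assumptions, not the fitter's actual code.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w: list, b: float, x: list) -> float:
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X: list, y: list, epochs: int = 300, lr: float = 0.05, l2: float = 0.01):
    """Batch gradient descent; rain examples are upweighted by n_neg / n_pos."""
    n, d = len(X), len(X[0])
    n_pos = sum(y)
    pos_weight = (n - n_pos) / n_pos if n_pos else 1.0
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = (predict(w, b, xi) - yi) * (pos_weight if yi == 1 else 1.0)
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # L2 penalty applies to the weights only, not the bias.
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b
```

Upweighting the positive class pushes the decision boundary toward predicting rain more often, which is where the high recall and low precision in the reported metrics come from.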
NEARBY STATIONS
Every 5 minutes, alongside the primary station reading, the poller fetches up to 20 nearby Personal Weather Stations from the Weather Underground API using the geocode endpoint (centered on 40.7549°N, 73.984°W). Each result is filtered to exclude the home station (distance < 0.05 mi) and sorted by distance. The snapshot — station ID, temperature, humidity, rain rate, wind, bearing, and distance — is written to the wx-nearby-snapshots DynamoDB table with a 30-day TTL.
The /nearby API endpoint returns the most recent snapshot. The dashboard Nearby Stations strip shows up to 8 of the closest stations with their current temperature and rain rate.
SPATIAL RAIN BOOST
When nearby station data is available, the rain probability estimate is augmented with a spatial boost from upwind stations. A station is considered "upwind" if its bearing from the home station falls within ±60° of the current wind direction. The boost formula is:
boost = min(0.35, √rain_rate × 0.25 / distance_mi)
where rain_rate is in inches per hour and distance_mi is the straight-line distance in miles. The boost is additive to the logistic regression base probability (capped at 0.99). The contributing station ID is surfaced in the dashboard as a "↑ boosted" annotation on the rain probability card when a boost is applied.
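A sketch of the upwind test and the boost, under two assumptions the text does not settle: when several upwind stations report rain, only the largest boost is applied, and the snapshot rows use the hypothetical keys `bearing`, `rain_rate`, and `distance_mi`.

```python
import math


def angular_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two compass bearings, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def spatial_boost(rain_rate: float, distance_mi: float) -> float:
    return min(0.35, math.sqrt(rain_rate) * 0.25 / distance_mi)


def boosted_probability(base_p: float, wind_dir: float, stations: list) -> float:
    """Add the largest upwind boost to the base probability, capped at 0.99."""
    best = 0.0
    for s in stations:
        upwind = angular_diff(s["bearing"], wind_dir) <= 60.0
        if upwind and s["rain_rate"] > 0 and s["distance_mi"] > 0:
            best = max(best, spatial_boost(s["rain_rate"], s["distance_mi"]))
    return min(0.99, base_p + best)
```

The modular arithmetic in `angular_diff` handles the wrap at north, so bearings of 350° and 10° are 20° apart rather than 340°.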
This is Phase 1 of the spatial rain model. Phase 2 (retraining the logistic regression with spatial features as explicit inputs) is deferred until 30+ days of nearby snapshot history accumulates.
SEASONAL UHI CURVE
On every 5-minute poll, the poller calls the NOAA METAR endpoint, computes the UHI delta, and updates two stores: the uhi_delta field on the reading in wx-readings, and the per-month rolling average in wx-uhi-seasonal.
wx-uhi-seasonal stores one row per calendar month (01–12), keyed by station ID + month. Each row holds a capped rolling average (max 10,000 samples per month) so the curve reflects the current year's observations rather than a multi-year stale mean.
The dashboard shows the typical UHI delta for the current month once at least 10 readings exist for that month. The full 12-month curve is available in the /current API response as uhi_seasonal_curve[].
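One way to implement a capped rolling average is to stop the sample count from growing past the cap, so each new reading keeps a fixed minimum weight. The sketch below assumes that interpretation and assumes the airport temperatures have already been converted to °F; the function names are illustrative.

```python
MAX_SAMPLES = 10_000  # cap from the text


def uhi_delta(station_tempf: float, airport_tempsf: list) -> float:
    """Station temperature minus the mean of the airport METAR temperatures."""
    return station_tempf - sum(airport_tempsf) / len(airport_tempsf)


def update_capped_mean(avg: float, count: int, value: float, cap: int = MAX_SAMPLES):
    """Fold `value` into a running mean whose sample count never exceeds `cap`."""
    n = min(count + 1, cap)
    return avg + (value - avg) / n, n
```

Below the cap this is an ordinary running mean; at the cap every new delta carries weight 1/cap, so older observations fade gradually.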
ANALOG FORECAST
A scheduled Lambda (wx-forecaster) runs every 30 minutes and produces a 3-hour forecast from historical pattern matching. The result is pre-computed and served synchronously by the API.
Algorithm:
- Fetch all 90-day readings and bucket into 1-hour averages (~2,100 hourly buckets across 4 fields).
- The current fingerprint is the last 6 hourly buckets, normalized to [0,1] within physical bounds for Midtown Manhattan.
- For every historical 6-hour window with at least 50% field coverage, compute normalized Euclidean distance to the current fingerprint.
- Take the top-5 closest analogs. Average their subsequent 3-hour trajectories into an ensemble forecast.
Confidence (0–100%) reflects how closely the analogs matched: lower average distance → higher confidence. The best-matching analog date is shown for interpretability. The model improves continuously as new readings expand the pool of candidates.
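The matching step can be sketched for a single field as below. This is a simplification under stated assumptions: the real forecaster compares four fields and applies a coverage threshold, candidate windows here are not allowed to overlap the current fingerprint, distance is normalized by window length, and the distance-to-confidence mapping is invented for illustration.

```python
import math

WINDOW = 6    # hours in a fingerprint
HORIZON = 3   # hours forecast ahead
TOP_K = 5     # analogs averaged into the ensemble


def distance(a, b):
    """Euclidean distance between equal-length windows, normalized by length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def analog_forecast(series, k=TOP_K):
    """series: hourly values for one field, normalized to [0, 1], oldest first."""
    if len(series) < 2 * WINDOW:
        raise ValueError("not enough history for a non-overlapping analog")
    current = series[-WINDOW:]
    candidates = []
    # Every candidate window ends before the current fingerprint begins,
    # so its following HORIZON hours are always known.
    for start in range(len(series) - 2 * WINDOW + 1):
        window = series[start:start + WINDOW]
        future = series[start + WINDOW:start + WINDOW + HORIZON]
        candidates.append((distance(window, current), future))
    candidates.sort(key=lambda c: c[0])
    best = candidates[:k]
    forecast = [sum(f[h] for _, f in best) / len(best) for h in range(HORIZON)]
    mean_dist = sum(d for d, _ in best) / len(best)
    # Hypothetical mapping: zero distance gives 100% confidence.
    confidence = round(100 * max(0.0, 1.0 - mean_dist))
    return forecast, confidence
```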
FORECAST ACCURACY TRACKING
On each run, wx-forecaster first reads the previous forecast from wx-forecasts. If that forecast is between 55 and 240 minutes old — meaning actual hourly readings are available for the +1h, +2h, and +3h windows — it computes mean absolute error (MAE) for all four forecast fields against the actual readings.
Each evaluation is written to wx-forecast-accuracy as a timestamped row. A running row in the same table accumulates the incremental mean across all evaluations. Once 5+ evaluations exist, the dashboard shows the running +1h temperature MAE alongside the confidence percentage.
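The evaluation arithmetic is small enough to show directly. The function names below are illustrative, and the running row is reduced to a `(mean, count)` pair.

```python
def is_evaluable(age_minutes: float) -> bool:
    """A previous forecast is scored only when it is 55 to 240 minutes old."""
    return 55 <= age_minutes <= 240


def mean_absolute_error(forecast: list, actual: list) -> float:
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(forecast)


def update_running(running_mae: float, n_evals: int, new_mae: float):
    """Incremental mean: no need to re-read earlier evaluation rows."""
    n = n_evals + 1
    return running_mae + (new_mae - running_mae) / n, n
```

For instance, a temperature forecast of 60, 62, 64°F against actuals of 61, 62, 61°F scores an MAE of about 1.33°F.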
DAILY SUMMARIES
A scheduled Lambda (wx-summarizer) runs daily at 05:00 UTC (midnight Eastern Standard Time, 1:00 a.m. during daylight saving time). It processes the last 30 calendar days idempotently, so it is safe to rerun without duplicating data. For each day it computes high/low/average temperature, total rainfall, max wind gust, and average comfort score, then generates a 2–3 sentence prose description of the day.
Prose generation is deterministic — it selects sentence templates based on temperature spread, rain event count and peak intensity, and notable wind. Rain event detection parses the 5-minute readings to find contiguous rain periods (rate > 0.01"/hr), grouping them by start time, duration, total accumulation, and peak rate.
The most recent summary appears on the dashboard beneath the hero section. The full 30-day series is also used to populate the Comfort Calendar — a heatmap grid where each day is colored from red (0/100) to green (85+/100) based on the daily average comfort score.
Results are stored in wx-daily-summaries (keyed by station_id + date) and exposed via GET /daily-summaries.
STATION RECORDS
A scheduled Lambda (wx-records-tracker) runs weekly on Sunday at 02:00 UTC. It scans all 90-day clean readings (no quality_flag), groups them by local calendar month, and computes per-month extremes:
| Record | Field |
|---|---|
| Highest temperature | temp_high + date |
| Lowest temperature | temp_low + date |
| Max wind gust | max_gust + date |
| Peak rain rate | max_rain_rate + date (only recorded if > 0.01"/hr) |
| Lowest pressure | min_pressure + date |
| Highest pressure | max_pressure + date |
Records are stored in wx-station-records (keyed by station_id + month). The current month's records are included in every /current response and shown in the Station Records section of the dashboard. As the station accumulates history across multiple years, these records will reflect the full observed range for each calendar month.
RAIN EVENTS
The GET /rain-events endpoint scans recent readings on-demand and parses them into discrete rain events. A rain event begins when hourlyrainin > 0.01 (in/hr) and ends when the rate returns to zero for a 5-minute interval. For each event the API returns:
- start: ISO timestamp of the first wet reading
- duration_min: total length of the rain event in minutes (5-minute resolution)
- total_in: accumulated rainfall in inches (sum(rates) × 5/60)
- peak_rate: highest hourly rate observed during the event (in/hr)
Small events (<0.01") that result from sensor noise are naturally excluded by the rate threshold. Rainfall totals are also surfaced in the daily summaries and Comfort Calendar.
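A minimal parser following the start and end rules above might look like this. The input shape (time-ordered `(timestamp, rate)` pairs at 5-minute spacing) and the function names are assumptions for illustration.

```python
def _close(event: dict, step_min: int) -> dict:
    """Summarize the rates collected for one event."""
    rates = event["rates"]
    return {
        "start": event["start"],
        "duration_min": len(rates) * step_min,
        "total_in": sum(rates) * step_min / 60,  # in/hr times hours per reading
        "peak_rate": max(rates),
    }


def parse_rain_events(readings, threshold=0.01, step_min=5):
    """readings: time-ordered (iso_timestamp, hourlyrainin) pairs."""
    events, current = [], None
    for ts, rate in readings:
        if current is None:
            if rate > threshold:      # an event starts above the threshold
                current = {"start": ts, "rates": [rate]}
        elif rate > 0:                # still raining: extend the event
            current["rates"].append(rate)
        else:                         # rate back to zero: close the event
            events.append(_close(current, step_min))
            current = None
    if current is not None:           # rain still falling at the end of the scan
        events.append(_close(current, step_min))
    return events
```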
CHART ANOMALY BANDS
In addition to the dotted baseline overlay, the history chart shows a shaded ±1σ band for each metric. The upper and lower edges of the band are baseline ± std_dev, where the standard deviation is computed incrementally by the poller using Welford's online algorithm:
new_var = (old_var × n + (x − old_mean) × (x − new_mean)) / (n + 1)
This allows variance to be updated in O(1) per reading without storing all historical values. Once the rolling sample count reaches 8,640 (30 days × 288 readings/day), the algorithm transitions to exponential moving average variance to prevent very old data from dominating:
new_var = (1 − α) × old_var + α × (x − old_mean)²
The resulting std_{field} values are stored per MM-HH baseline slot and attached to history readings by the API. The dashboard renders the band as a transparent fill between the upper and lower edges using uPlot's bands config, anchored to the upper and lower σ series.
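Both update rules fit in one small function. In this sketch the value of `alpha` and the matching exponential update for the mean are assumptions, since the text gives only the variance formulas; the variance is the population form, matching the division by n + 1.

```python
MAX_N = 8_640  # 30 days x 288 readings/day, per the text


def update_stats(mean, var, n, x, max_n=MAX_N, alpha=1 / MAX_N):
    """Return (new_mean, new_var, new_n) after folding in reading x."""
    if n < max_n:
        # Welford-style update of the population variance.
        new_mean = mean + (x - mean) / (n + 1)
        new_var = (var * n + (x - mean) * (x - new_mean)) / (n + 1)
        return new_mean, new_var, n + 1
    # Past the cap: exponential moving average, so old data decays.
    new_mean = (1 - alpha) * mean + alpha * x
    new_var = (1 - alpha) * var + alpha * (x - mean) ** 2
    return new_mean, new_var, n
```

Taking the square root of the returned variance gives the std value used for the band edges.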
DOWNSAMPLING FOR LONGER RANGES
The /history endpoint accepts up to 720 hours (30 days). To keep payloads manageable, readings are automatically bucketed:
- ≤24h — raw 5-minute readings (~288 points)
- 25–168h (7d) — hourly averages (~168 points)
- >168h (30d) — daily averages (~30 points)
Numeric fields are averaged within each bucket. The timestamp is the floor of the bucket boundary. Per-slot baselines are batch-fetched from wx-daily-stats and attached to each reading so the chart can overlay the historical normal line.
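The bucketing rule can be sketched as follows for a single numeric field. The function names are illustrative, and flooring daily buckets at midnight of whatever timezone the timestamps carry is an assumption.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bucket_size(hours: int):
    """Bucket width for a requested range, or None for raw 5-minute readings."""
    if hours <= 24:
        return None
    if hours <= 168:
        return timedelta(hours=1)
    return timedelta(days=1)


def floor_ts(ts: datetime, size: timedelta) -> datetime:
    """Floor a timestamp to the start of its hourly or daily bucket."""
    if size >= timedelta(days=1):
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return ts.replace(minute=0, second=0, microsecond=0)


def downsample(readings, size):
    """readings: list of (datetime, value). Returns bucket-averaged pairs."""
    if size is None:
        return list(readings)
    buckets = defaultdict(list)
    for ts, value in readings:
        if value is not None:  # nulled fields do not drag the average down
            buckets[floor_ts(ts, size)].append(value)
    return [(ts, sum(vs) / len(vs)) for ts, vs in sorted(buckets.items())]
```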
API
The API is public and read-only. No authentication required.
| Endpoint | Description |
|---|---|
| GET /current | Latest reading with all ML signals: condition label, anomaly, percentile rank, comfort score, rain probability, UHI delta, seasonal UHI curve, analog forecast with accuracy, yesterday's daily summary, current-month station records, and climate_context (NOAA GHCN verdict + ERA5 hourly percentile with p5/p50/p95 distribution data) |
| GET /history?hours=N | Last N hours of readings (default 24, max 720). Downsampled for longer ranges. Each reading includes baseline_{field} and baseline_std_{field} values for chart overlay and ±1σ bands. |
| GET /rain-events?days=N | Parsed rain events from the last N days (default 30). Each event has start, duration_min, total_in, peak_rate. |
| GET /daily-summaries?days=N | Pre-computed daily summaries from the last N days (default 30). Each entry has high/low/avg temp, total rain, max gust, avg comfort score, and prose summary. |
| GET /nearby | Most recent snapshot of nearby Manhattan PWS stations. |
All responses are JSON. CloudFront caches all endpoints for 5 minutes, keyed by path and query string. CORS is open (Access-Control-Allow-Origin: *).
INFRASTRUCTURE
Everything runs on AWS in us-east-1. Estimated cost: ~$4–6/month.
- DynamoDB (on-demand, 12 tables): wx-readings (90-day TTL), wx-daily-stats (baselines + std dev), wx-alerts (debounce state), wx-forecasts (latest forecast), wx-forecast-accuracy (MAE history + running), wx-uhi-seasonal (monthly UHI averages), wx-ml-models (fitted model weights), wx-daily-summaries (daily stats + prose), wx-station-records (per-month extremes), wx-nearby-snapshots (nearby PWS readings, 30-day TTL), wx-climate-doy (NOAA GHCN per-DOY distributions, 366 rows), wx-climate-hourly (ERA5 per-DOY-hour distributions, 8,784 rows)
- Lambda (Python 3.12, arm64, 10 functions): wx-poller (5 min), wx-api (on-demand), wx-bootstrap (one-time), wx-alerter (15 min), wx-forecaster (30 min), wx-ml-fitter (weekly Sun 3AM), wx-summarizer (daily 5AM UTC), wx-records-tracker (weekly Sun 2AM), WxClimateBootstrap (one-time NOAA + ERA5 historical load), WxClimateUpdater (nightly current-year NOAA refresh)
- API Gateway HTTP API: routes /current, /history, /rain-events, and /daily-summaries to the API Lambda
- CloudFront: in front of both the API Gateway and the S3 dashboard bucket
- S3: static dashboard files, auto-generated OG image, and NOAA GHCN-Daily CSV source (noaa-ghcn-pds, public bucket)
- EventBridge: triggers all scheduled Lambdas
- SES: outbound email delivery for weather alerts
- Secrets Manager: Ambient Weather API keys and station config