diff --git a/README.md b/README.md
index f99b0a2..7237d71 100644
--- a/README.md
+++ b/README.md
@@ -197,3 +197,70 @@ Capture a boot log with timestamps:
 ```bash
 python tools/serial_monitor.py --port /dev/ttyUSB0 --reset --timestamp --seconds 30
 ```
+
+## Deploying firmware 1.1 (network resilience)
+
+### Flash command
+
+```bash
+cd firmware && pio run -e timercam -t upload
+```
+
+### Expected first boot
+
+On the serial log (115200 baud), the device prints the boot banner, then
+initializes `event_log`, then records the reset reason via `EVT_BOOT`.
+The first heartbeat fires roughly 60-70s after power-on (15s WiFi
+busy-wait + NTP sync + 60s `BOOT_REPORT_DELAY_S`). Monitor with
+`pio device monitor` or:
+
+```bash
+python tools/serial_monitor.py --port /dev/ttyUSB0 --reset --timestamp --seconds 90
+```
+
+### What's new in 1.1
+
+- Event-driven WiFi reconnect with 1s→60s exponential backoff (`net_guard` module); disconnect reasons are logged.
+- HTTP timeouts (5s connect / 10s response) and a 3-try retry on every POST.
+- ESP-IDF Task Watchdog (30s) on camera, reporter, and loop tasks; panic → reboot → reason surfaces in the next heartbeat.
+- Software heartbeat-miss watchdog: 6 consecutive missed heartbeats (~6 h) trigger a clean reboot.
+- Persistent NVS event-log ring buffer (32 entries) surfaced in the heartbeat's `recent_events` field.
+- New heartbeat fields: `reset_reason`, `heap_free`, `heap_min_free`, `last_disconnect_code`, `recent_events`.
+
+### 24-hour field checks
+
+After deploying a device, run through this checklist against the server's
+heartbeat records at the 24-hour mark:
+
+- **Heartbeat count ≥ 22**: at least 22 of the 24 expected hourly heartbeats (about 92% uptime).
+- **No sustained `t=6` (`EVT_HEARTBEAT_MISS`) entries in `recent_events`**: transient singletons are expected; repeated misses indicate a sticky network problem worth investigating.
+- **`heap_min_free` stable day over day**: a downward drift indicates a leak. Alert threshold: min-free drops by more than 20% vs baseline.
+- **`last_disconnect_code` matches known AP behavior**: reason 8 (assoc leave) and reason 15 (4-way handshake timeout) are common on busy APs; recurring reason 200+ indicates a firmware bug.
+- **`reset_reason` has no unexpected values**: see table below.
+
+| `reset_reason` | Meaning | Expected? |
+|----------------|---------|-----------|
+| 1 | Power-on (`ESP_RST_POWERON`) | Normal immediately after a deployment. |
+| 3 | Software reset (`ESP_RST_SW`, our `ESP.restart()`) | Correlate with `EVT_REBOOT` in `recent_events`. |
+| 6 | Task watchdog (`ESP_RST_TASK_WDT`) | Investigate: a task hung for 30s. |
+| 9 | Brownout (`ESP_RST_BROWNOUT`) | Investigate power supply / USB cable. |
+| 10 | SDIO reset (`ESP_RST_SDIO`) | Unusual: investigate. |
+
+### Decoding recent_events
+
+The `recent_events` array is a ring buffer of `{t, d0, d1, ts}` entries.
+Tag definitions live in `firmware/lib/event_log/event_log.h`:
+
+| `t` | Event | `d0` | `d1` |
+|-----|-------|------|------|
+| 1 | `EVT_BOOT` | `esp_reset_reason()` | - |
+| 2 | `EVT_WIFI_UP` | RSSI | - |
+| 3 | `EVT_WIFI_DOWN` | disconnect reason code; `0xFF` = silent-death fallback | - |
+| 4 | `EVT_HTTP_OK` | fnv1a-16 path hash | elapsed ms (capped at 65535) |
+| 5 | `EVT_HTTP_FAIL` | fnv1a-16 path hash | HTTP status or negative errno cast to `uint16` |
+| 6 | `EVT_HEARTBEAT_MISS` | consecutive miss count | - |
+| 7 | `EVT_NTP_SYNC` | reserved | - |
+| 8 | `EVT_REBOOT` | `RebootReason`: 1=HEARTBEAT_MISS, 2=FACTORY_RESET, 3=OTA, 4=WIFI_REPROV | - |
+
+Server-side decoder tables (`EVENT_TAG_DECODER`, `REBOOT_REASON_DECODER`)
+live in `server/heartbeat_diagnostics_stub.py`.
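
The `recent_events` tag table and three of the 24-hour checks can be sketched in Python, the language of the server-side stub. This is a minimal illustration under stated assumptions, not the contents of `server/heartbeat_diagnostics_stub.py`: the names `EVENT_NAMES`, `REBOOT_REASONS`, `decode_event`, and `field_checks` are hypothetical, heartbeats are assumed to arrive as plain dicts, and "sustained" misses are assumed to mean a consecutive-miss count (`d0`) of 2 or more.

```python
# Hypothetical sketch: names and record shapes are assumptions, not the
# repository's EVENT_TAG_DECODER / REBOOT_REASON_DECODER or server schema.

EVENT_NAMES = {
    1: "EVT_BOOT", 2: "EVT_WIFI_UP", 3: "EVT_WIFI_DOWN", 4: "EVT_HTTP_OK",
    5: "EVT_HTTP_FAIL", 6: "EVT_HEARTBEAT_MISS", 7: "EVT_NTP_SYNC",
    8: "EVT_REBOOT",
}
REBOOT_REASONS = {1: "HEARTBEAT_MISS", 2: "FACTORY_RESET", 3: "OTA", 4: "WIFI_REPROV"}


def decode_event(entry):
    """Turn one {t, d0, d1, ts} entry into a readable string."""
    tag = entry.get("t")
    name = EVENT_NAMES.get(tag, f"UNKNOWN({tag})")
    d0 = entry.get("d0", 0)
    if name == "EVT_REBOOT":
        detail = REBOOT_REASONS.get(d0, f"reason={d0}")
    elif name == "EVT_WIFI_DOWN":
        # 0xFF marks the silent-death fallback described in the table.
        detail = "silent-death fallback" if d0 == 0xFF else f"reason={d0}"
    else:
        detail = f"d0={d0} d1={entry.get('d1', 0)}"
    return f"{entry.get('ts', '?')} {name} {detail}"


def field_checks(heartbeats, baseline_min_free):
    """Apply three of the 24-hour checks; return a list of problem strings."""
    problems = []
    if len(heartbeats) < 22:
        problems.append(f"only {len(heartbeats)} heartbeats (expected >= 22)")

    # Assumption: "sustained" means a consecutive-miss count (d0) of 2 or more.
    worst_miss = max(
        (e.get("d0", 0)
         for hb in heartbeats
         for e in hb.get("recent_events", [])
         if e.get("t") == 6),
        default=0,
    )
    if worst_miss >= 2:
        problems.append(f"sustained heartbeat misses (max run {worst_miss})")

    # A drop of more than 20% means the lowest value is under 80% of baseline.
    lowest = min((hb["heap_min_free"] for hb in heartbeats),
                 default=baseline_min_free)
    if lowest < 0.8 * baseline_min_free:
        problems.append(f"heap_min_free fell to {lowest} (> 20% below baseline)")
    return problems


if __name__ == "__main__":
    healthy = [{"heap_min_free": 100_000, "recent_events": []} for _ in range(24)]
    print(field_checks(healthy, baseline_min_free=100_000))
    print(decode_event({"t": 8, "d0": 1, "d1": 0, "ts": 1700000000}))
```

Returning a list of problem strings rather than raising keeps every failed check visible in a single pass, which suits a checklist that is read once at the 24-hour mark. The `reset_reason` and `last_disconnect_code` checks are left out because their expected values depend on the deployment.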