docs: network-resilience firmware 1.1 deployment + field diagnostic guide

Flash command, expected first-boot behavior, per-feature summary of the
1.1 release, 24-hour field-check playbook, and a reference table for
decoding the heartbeat's recent_events array.
2026-04-23 14:02:09 -07:00
parent 867e90b1f6
commit 2d95069bd1


@@ -197,3 +197,70 @@ Capture a boot log with timestamps:
```bash
python tools/serial_monitor.py --port /dev/ttyUSB0 --reset --timestamp --seconds 30
```
## Deploying firmware 1.1 (network resilience)
### Flash command
```bash
cd firmware && pio run -e timercam -t upload
```
### Expected first boot
On the serial log (115200 baud), the device prints the boot banner, then
initializes `event_log`, then records the reset reason via `EVT_BOOT`.
The first heartbeat fires roughly 60-75s after power-on (up to 15s WiFi
busy-wait + NTP sync + 60s `BOOT_REPORT_DELAY_S`). Monitor with
`pio device monitor` or:
```bash
python tools/serial_monitor.py --port /dev/ttyUSB0 --reset --timestamp --seconds 90
```
### What's new in 1.1
- Event-driven WiFi reconnect with 1s→60s exponential backoff (`net_guard` module); disconnect reasons logged.
- HTTP timeouts (5s connect / 10s response) + 3-try retry on every POST.
- ESP-IDF Task Watchdog (30s) on camera, reporter, and loop tasks; panic → reboot → reason surfaces in the next heartbeat.
- Software heartbeat-miss watchdog: 6 consecutive missed heartbeats (~6 h) triggers a clean reboot.
- Persistent NVS event-log ring buffer (32 entries) surfaced in the heartbeat's `recent_events` field.
- New heartbeat fields: `reset_reason`, `heap_free`, `heap_min_free`, `last_disconnect_code`, `recent_events`.
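
To make the 1s→60s reconnect schedule concrete, here is a minimal sketch of the delay sequence. The real logic lives in the firmware's `net_guard` module; this Python version is only an illustration, and the doubling factor and the absence of jitter are assumptions, as is the helper name `backoff_delays`.

```python
from itertools import islice


def backoff_delays(initial_s=1.0, max_s=60.0, factor=2.0):
    """Yield reconnect delays in seconds: start at initial_s, multiply by
    factor after each attempt, and cap at max_s.

    factor=2.0 is an assumption; the guide only states the 1s and 60s bounds.
    """
    delay = initial_s
    while True:
        yield delay
        delay = min(delay * factor, max_s)


# First eight delays under the assumed doubling schedule.
print(list(islice(backoff_delays(), 8)))
# → [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

Under these assumptions the cap is reached on the seventh attempt, after which the device retries once a minute.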
### 24-hour field checks
After deploying a device, run through this checklist against the server's
heartbeat records at the 24-hour mark:
- **Heartbeat count ≥ 22** — ≥ 92% uptime across 24 h at the hourly cadence.
- **No sustained `t=6` (EVT_HEARTBEAT_MISS) entries in `recent_events`** — transient singletons are expected; repeated misses indicate a sticky network problem worth investigating.
- **`heap_min_free` stable day over day** — a downward drift indicates a leak. Alert threshold: min-free drops by more than 20% vs baseline.
- **`last_disconnect_code` matches known AP behavior** — reason 8 (`ASSOC_LEAVE`, the station left the AP) and reason 15 (4-way handshake timeout) are common on busy APs; recurring reason 200+ indicates a firmware bug.
- **`reset_reason` has no unexpected values** — see table below.
| `reset_reason` | Meaning | Expected? |
|----------------|---------|-----------|
| 1 | Power-on (`ESP_RST_POWERON`) | Normal immediately after a deployment. |
| 3 | Software reset (`ESP_RST_SW`, our `ESP.restart()`) | Correlate with `EVT_REBOOT` in `recent_events`. |
| 6 | Task watchdog (`ESP_RST_TASK_WDT`) | Investigate — a task hung for 30s. |
| 9 | Brownout (`ESP_RST_BROWNOUT`) | Investigate power supply / USB cable. |
| 10 | SDIO reset (`ESP_RST_SDIO`) | Unusual — investigate. |
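
The checklist above can be automated against stored heartbeat records. The following is a sketch only: the function name `check_24h`, the record layout (a list of dicts keyed by the heartbeat field names), and the rule that a miss event with `d0 >= 2` counts as "sustained" are all assumptions, not part of the server code.

```python
MIN_HEARTBEATS = 22       # checklist: at least 22 of 24 hourly heartbeats
HEAP_DROP_LIMIT = 0.20    # checklist: alert on a >20% drop vs baseline
SUSTAINED_MISS_COUNT = 2  # assumption: d0 >= 2 means "sustained" misses


def check_24h(heartbeats, baseline_min_free, expected_resets=(1,)):
    """Return a list of findings for one device's last 24 h of heartbeats.

    heartbeats: list of dicts with the 1.1 heartbeat fields (assumed layout).
    baseline_min_free: heap_min_free in bytes from a known-good earlier day.
    expected_resets: reset_reason values the operator considers normal.
    """
    findings = []
    if len(heartbeats) < MIN_HEARTBEATS:
        findings.append(f"heartbeat count {len(heartbeats)} < {MIN_HEARTBEATS}")

    # recent_events overlaps between heartbeats, so de-duplicate on (d0, ts).
    misses = {
        (e["d0"], e["ts"])
        for hb in heartbeats
        for e in hb.get("recent_events", [])
        if e["t"] == 6 and e["d0"] >= SUSTAINED_MISS_COUNT
    }
    if misses:
        worst = max(d0 for d0, _ in misses)
        findings.append(f"sustained heartbeat misses (max consecutive: {worst})")

    if heartbeats:
        lowest = min(hb["heap_min_free"] for hb in heartbeats)
        if lowest < baseline_min_free * (1 - HEAP_DROP_LIMIT):
            findings.append(
                f"heap_min_free {lowest} is >20% below baseline {baseline_min_free}"
            )

    high_codes = sorted(
        {hb["last_disconnect_code"] for hb in heartbeats
         if hb["last_disconnect_code"] >= 200}
    )
    if high_codes:
        findings.append(f"disconnect codes >= 200 seen: {high_codes}")

    odd = sorted({hb["reset_reason"] for hb in heartbeats} - set(expected_resets))
    if odd:
        findings.append(f"unexpected reset_reason values: {odd}")
    return findings


healthy = [
    {"reset_reason": 1, "heap_min_free": 150_000,
     "last_disconnect_code": 8, "recent_events": []}
    for _ in range(24)
]
print(check_24h(healthy, baseline_min_free=160_000))  # → []
```

An empty list means every check passed. `expected_resets` defaults to power-on only, so a device that rebooted for any other reason is flagged until the operator widens the set.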
### Decoding recent_events
The `recent_events` array is a ring buffer of `{t, d0, d1, ts}` entries.
Tag definitions live in `firmware/lib/event_log/event_log.h`:
| `t` | Event | `d0` | `d1` |
|-----|-------|------|------|
| 1 | `EVT_BOOT` | `esp_reset_reason()` | — |
| 2 | `EVT_WIFI_UP` | RSSI | — |
| 3 | `EVT_WIFI_DOWN` | disconnect reason code; `0xFF` = silent-death fallback | — |
| 4 | `EVT_HTTP_OK` | fnv1a-16 path hash | elapsed ms (capped at 65535) |
| 5 | `EVT_HTTP_FAIL` | path hash | HTTP status or negative errno cast to `uint16` |
| 6 | `EVT_HEARTBEAT_MISS` | consecutive miss count | — |
| 7 | `EVT_NTP_SYNC` | reserved | — |
| 8 | `EVT_REBOOT` | `RebootReason`: 1=HEARTBEAT_MISS, 2=FACTORY_RESET, 3=OTA, 4=WIFI_REPROV | — |
Server-side decoder tables (`EVENT_TAG_DECODER`, `REBOOT_REASON_DECODER`)
live in `server/heartbeat_diagnostics_stub.py`.
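
For ad-hoc inspection, the table above translates into a small decoder. This is a hedged sketch, not the contents of `server/heartbeat_diagnostics_stub.py`: the names `EVENT_NAMES`, `REBOOT_REASONS`, `decode_http_fail_d1`, and `describe` are hypothetical, and the errno recovery assumes a two's-complement `uint16` cast with HTTP statuses always below `0x8000`. The `ts` value is printed as received because its unit is not specified here.

```python
# Hypothetical lookup tables mirroring the reference table in this guide.
EVENT_NAMES = {
    1: "EVT_BOOT", 2: "EVT_WIFI_UP", 3: "EVT_WIFI_DOWN", 4: "EVT_HTTP_OK",
    5: "EVT_HTTP_FAIL", 6: "EVT_HEARTBEAT_MISS", 7: "EVT_NTP_SYNC",
    8: "EVT_REBOOT",
}
REBOOT_REASONS = {1: "HEARTBEAT_MISS", 2: "FACTORY_RESET", 3: "OTA", 4: "WIFI_REPROV"}


def decode_http_fail_d1(d1):
    """Undo the uint16 cast: values >= 0x8000 are read as negative errnos,
    anything smaller is taken to be an HTTP status (assumption)."""
    return d1 - 0x10000 if d1 >= 0x8000 else d1


def describe(event):
    """Render one {t, d0, d1, ts} entry as a single log-style line."""
    t, d0, d1 = event["t"], event["d0"], event["d1"]
    name = EVENT_NAMES.get(t, f"UNKNOWN({t})")
    if t == 3 and d0 == 0xFF:
        detail = "silent-death fallback"
    elif t == 4:
        detail = f"path_hash=0x{d0:04x} elapsed_ms={d1}"
    elif t == 5:
        detail = f"path_hash=0x{d0:04x} status_or_errno={decode_http_fail_d1(d1)}"
    elif t == 8:
        detail = REBOOT_REASONS.get(d0, f"reason {d0}")
    else:
        detail = f"d0={d0}"
    return f"{event['ts']} {name} {detail}"


print(describe({"t": 5, "d0": 0x1A2B, "d1": 65425, "ts": 1776978129}))
# → 1776978129 EVT_HTTP_FAIL path_hash=0x1a2b status_or_errno=-111
print(describe({"t": 8, "d0": 1, "d1": 0, "ts": 1776981729}))
# → 1776981729 EVT_REBOOT HEARTBEAT_MISS
```

Because `d0` for HTTP events is only a 16-bit hash of the path, map hashes back to endpoints from a known list of paths rather than expecting to reverse them.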