docs: network-resilience firmware 1.1 deployment + field diagnostic guide
Flash command, expected first-boot behavior, per-feature summary of the 1.1 release, 24-hour field-check playbook, and a reference table for decoding the heartbeat's recent_events array.
67
README.md
@@ -197,3 +197,70 @@ Capture a boot log with timestamps:
```bash
python tools/serial_monitor.py --port /dev/ttyUSB0 --reset --timestamp --seconds 30
```

## Deploying firmware 1.1 (network resilience)

### Flash command

```bash
cd firmware && pio run -e timercam -t upload
```

### Expected first boot

On the serial log (115200 baud), the device prints the boot banner, then
initializes `event_log`, then records the reset reason via `EVT_BOOT`.
The first heartbeat fires roughly 60-70s after power-on (15s WiFi
busy-wait + NTP sync + 60s `BOOT_REPORT_DELAY_S`). Monitor with
`pio device monitor` or:

```bash
python tools/serial_monitor.py --port /dev/ttyUSB0 --reset --timestamp --seconds 90
```

### What's new in 1.1

- Event-driven WiFi reconnect with 1s→60s exponential backoff (`net_guard` module); disconnect reasons logged.
- HTTP timeouts (5s connect / 10s response) + 3-try retry on every POST.
- ESP-IDF Task Watchdog (30s) on camera, reporter, and loop tasks; panic → reboot → reason surfaces in the next heartbeat.
- Software heartbeat-miss watchdog: 6 consecutive missed heartbeats (~6 h) triggers a clean reboot.
- Persistent NVS event-log ring buffer (32 entries) surfaced in the heartbeat's `recent_events` field.
- New heartbeat fields: `reset_reason`, `heap_free`, `heap_min_free`, `last_disconnect_code`, `recent_events`.

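To make the 1s→60s reconnect backoff concrete, here is a minimal Python sketch of such a schedule. It is an illustration only, not the `net_guard` implementation: the doubling factor and the helper name `backoff_delays` are assumptions, and only the 1s start and 60s cap come from the list above.

```python
from itertools import islice


def backoff_delays(initial_s=1.0, cap_s=60.0, factor=2.0):
    """Yield reconnect delays starting at initial_s, growing by factor, capped at cap_s.

    The doubling factor is an assumption; the source only states 1s -> 60s.
    """
    delay = initial_s
    while True:
        yield min(delay, cap_s)
        # Clamp the stored value too, so it cannot grow without bound.
        delay = min(delay * factor, cap_s)


# First eight delays under the assumed doubling schedule.
schedule = list(islice(backoff_delays(), 8))
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

Under these assumptions the cap is reached on the seventh attempt, after which every retry waits the full 60s.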
### 24-hour field checks

After deploying a device, run through this checklist against the server's
heartbeat records at the 24-hour mark:

- **Heartbeat count ≥ 22**: at least 22 of the 24 expected hourly heartbeats (about 92% uptime).
- **No sustained `t=6` (EVT_HEARTBEAT_MISS) entries in `recent_events`**: transient singletons are expected; repeated misses indicate a sticky network problem worth investigating.
- **`heap_min_free` stable day over day**: a downward drift indicates a leak. Alert threshold: min-free drops by more than 20% vs baseline.
- **`last_disconnect_code` matches known AP behavior**: reason 8 (assoc lost) and reason 15 (4-way handshake timeout) are common on busy APs; recurring reason 200+ indicates a firmware bug.
- **`reset_reason` has no unexpected values**: see table below.

| `reset_reason` | Meaning | Expected? |
|----------------|---------|-----------|
| 1 | Power-on | Normal immediately after a deployment. |
| 4 | Software reset (our `ESP.restart()`) | Correlate with `EVT_REBOOT` in `recent_events`. |
| 6 | Task watchdog | Investigate: a task hung for 30s. |
| 7 | Brownout | Investigate power supply / USB cable. |
| 8 | SDIO reset | Unusual; investigate. |

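The checklist and table above could be automated roughly as follows. This is a sketch under stated assumptions, not the server's actual code: the record shape (a list of dicts keyed by the heartbeat field names), the helper name `field_check`, the choice to treat a miss count above 1 as "sustained", and the choice to treat reset reasons 1 and 4 as routine are all illustrative.

```python
# Reset reasons treated as routine here (power-on, software reset), using the
# numbering from the table above. This set is an assumption.
EXPECTED_RESET_REASONS = {1, 4}
EVT_HEARTBEAT_MISS = 6  # tag value `t=6` from the checklist


def field_check(heartbeats, baseline_heap_min_free):
    """Return a list of findings for a 24 h window of heartbeat records.

    `heartbeats` is assumed to be a list of dicts with the fields
    `reset_reason`, `heap_min_free`, and `recent_events`.
    """
    findings = []

    if len(heartbeats) < 22:
        findings.append(f"only {len(heartbeats)} heartbeats (expected >= 22)")

    # Highest consecutive-miss count (d0) seen in any EVT_HEARTBEAT_MISS entry.
    worst_miss = max(
        (e["d0"]
         for hb in heartbeats
         for e in hb.get("recent_events", [])
         if e["t"] == EVT_HEARTBEAT_MISS),
        default=0,
    )
    if worst_miss > 1:  # assumption: a count of 1 is a transient singleton
        findings.append(f"{worst_miss} consecutive heartbeat misses")

    if heartbeats:
        lowest = min(hb["heap_min_free"] for hb in heartbeats)
        if lowest < 0.8 * baseline_heap_min_free:  # more than a 20% drop
            drop = 1 - lowest / baseline_heap_min_free
            findings.append(f"heap_min_free dropped {drop:.0%} vs baseline")

    seen = {hb["reset_reason"] for hb in heartbeats}
    unexpected = sorted(seen - EXPECTED_RESET_REASONS)
    if unexpected:
        findings.append(f"unexpected reset_reason values: {unexpected}")

    return findings


# Made-up sample window: 23 healthy records plus one problematic record.
healthy = {"reset_reason": 1, "heap_min_free": 90_000, "recent_events": []}
bad = {"reset_reason": 6, "heap_min_free": 70_000,
       "recent_events": [{"t": 6, "d0": 3, "d1": 0, "ts": 0}]}
findings = field_check([healthy] * 23 + [bad], baseline_heap_min_free=100_000)
for finding in findings:
    print(finding)
```

The `last_disconnect_code` check is left out because judging whether a code "matches known AP behavior" depends on site-specific knowledge.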
### Decoding recent_events

The `recent_events` array is a ring buffer of `{t, d0, d1, ts}` entries.
Tag definitions live in `firmware/lib/event_log/event_log.h`:

| `t` | Event | `d0` | `d1` |
|-----|-------|------|------|
| 1 | `EVT_BOOT` | `esp_reset_reason()` | - |
| 2 | `EVT_WIFI_UP` | RSSI | - |
| 3 | `EVT_WIFI_DOWN` | disconnect reason code; `0xFF` = silent-death fallback | - |
| 4 | `EVT_HTTP_OK` | fnv1a-16 path hash | elapsed ms (capped at 65535) |
| 5 | `EVT_HTTP_FAIL` | path hash | HTTP status or negative errno cast to `uint16` |
| 6 | `EVT_HEARTBEAT_MISS` | consecutive miss count | - |
| 7 | `EVT_NTP_SYNC` | reserved | - |
| 8 | `EVT_REBOOT` | `RebootReason`: 1=HEARTBEAT_MISS, 2=FACTORY_RESET, 3=OTA, 4=WIFI_REPROV | - |

Server-side decoder tables (`EVENT_TAG_DECODER`, `REBOOT_REASON_DECODER`)
live in `server/heartbeat_diagnostics_stub.py`.

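As an illustration of how the table above can be applied, the following Python sketch turns one `recent_events` entry into a readable line. It is not the contents of `server/heartbeat_diagnostics_stub.py`: the names `EVENT_NAMES`, `REBOOT_REASONS`, `decode_http_fail_code`, and `describe_event` are hypothetical, and the sketch assumes the negative errno was stored via a two's-complement cast, so any `d1` of `0x8000` or above is read back as negative.

```python
# Lookup tables transcribed from the table above (hypothetical names).
EVENT_NAMES = {
    1: "EVT_BOOT", 2: "EVT_WIFI_UP", 3: "EVT_WIFI_DOWN", 4: "EVT_HTTP_OK",
    5: "EVT_HTTP_FAIL", 6: "EVT_HEARTBEAT_MISS", 7: "EVT_NTP_SYNC", 8: "EVT_REBOOT",
}
REBOOT_REASONS = {1: "HEARTBEAT_MISS", 2: "FACTORY_RESET", 3: "OTA", 4: "WIFI_REPROV"}


def decode_http_fail_code(d1):
    """Undo the uint16 cast: values >= 0x8000 are read back as negative errnos.

    Assumes a two's-complement cast; HTTP status codes stay well below 0x8000.
    """
    return d1 - 0x10000 if d1 >= 0x8000 else d1


def describe_event(event):
    """Render one {t, d0, d1, ts} entry as 'NAME: detail'."""
    t, d0, d1 = event["t"], event["d0"], event["d1"]
    name = EVENT_NAMES.get(t, f"UNKNOWN({t})")
    if t == 1:
        detail = f"reset reason {d0}"
    elif t == 2:
        detail = f"RSSI field {d0}"  # raw value; its encoding is not specified here
    elif t == 3:
        detail = "silent-death fallback" if d0 == 0xFF else f"disconnect reason {d0}"
    elif t == 4:
        detail = f"path hash 0x{d0:04x}, {d1} ms"
    elif t == 5:
        detail = f"path hash 0x{d0:04x}, code {decode_http_fail_code(d1)}"
    elif t == 6:
        detail = f"{d0} consecutive misses"
    elif t == 8:
        detail = REBOOT_REASONS.get(d0, f"unknown reason {d0}")
    else:
        detail = ""  # EVT_NTP_SYNC (d0 reserved) and unknown tags
    return f"{name}: {detail}" if detail else name


# Made-up sample entries.
sample = [
    {"t": 5, "d0": 0x1A2B, "d1": 65525, "ts": 0},  # 65525 is -11 as uint16
    {"t": 8, "d0": 1, "d1": 0, "ts": 0},
]
for entry in sample:
    print(describe_event(entry))
```

The first sample entry prints code `-11` because `65525 - 65536 = -11`.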