DoorCounter/docs/ota-deployment-status.md
Peter Woolery d2c2d97fb7 feat(ota): harden OTA apply flow + bump firmware to 1.0.1
End-to-end OTA verified on dc-0002 after resolving server-side schema
mismatch (server now emits update/size/sig_b64 alongside existing fields).

Firmware changes:
- Bump FW_VERSION 1.0.0 -> 1.0.1
- Replace log_i/w/e with Serial.printf in ota_updater so output appears
  regardless of CORE_DEBUG_LEVEL (the prior macros were silent in prod)
- Log partition labels/offsets, per-128KB progress, computed sha256,
  HTTP errors with body, esp_ota_* errors by name, Content-Length vs
  expected size
- Check esp_ota_write return value (previously ignored -- silent
  partition corruption on write failure) and abort cleanly on error
- Reject update if expected_size > target partition size
- Serial.flush() + 500ms delay before esp_restart() so the final log
  line escapes the UART
- Boot-time: log running partition label/offset/state + FW_VERSION,
  and call esp_ota_mark_app_valid_cancel_rollback() on PENDING_VERIFY
  to prevent silent rollback after a successful OTA

Docs:
- Rewrite docs/ota-deployment-status.md to reflect resolved state,
  document the schema fix and the .bin/.sig co-deploy invariant
2026-05-14 12:21:52 -07:00


# OTA Deployment — Status
## Current state (2026-05-14)
**End-to-end OTA verified working on `dc-0002`.** Device polled `engagement-api-1`, received a signed manifest, downloaded and verified firmware 1.0.1, set the alternate boot partition, rebooted, and came up reporting `fw=1.0.1`.
## What's deployed
- **Branch `feat/pull-ota-code-signing`** merged to `main` (13 commits, 17 new files, 936 LOC).
- **Signing toolchain**: `tools/gen_signing_key.py`, `tools/sign_firmware.py`, `tools/deploy_firmware.py`.
- **Firmware OTA library**: `firmware/lib/ota_updater/`.
- **Signing key**: `secrets/firmware_signing_key.pem` (gitignored). Public key committed at `firmware/lib/ota_updater/ota_pubkey.h`.
- **Live OTA handler**: served by `engagement-api-1` Docker service (source not in this repo). The stub at `server/ota_endpoint.py` is unwired and not the one responding to devices.
- **Configurable poll interval** via NVS key `ota_interval`. Provision with `flash_device.py --ota-interval-seconds N`. Minimum 10 s, default 21600 s (6 h).
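The interval rule in the last bullet can be sketched as a small provisioning-side helper. This is only an illustration: the constant names and `resolve_ota_interval` are hypothetical, and raising too-small values to the minimum (rather than rejecting them) is an assumption, not confirmed behavior of `flash_device.py` or the firmware.

```python
# Hypothetical sketch; names and clamping behavior are assumptions,
# not taken from tools/flash_device.py.
from typing import Optional

OTA_INTERVAL_MIN_S = 10         # documented minimum
OTA_INTERVAL_DEFAULT_S = 21600  # documented default (6 h)


def resolve_ota_interval(requested: Optional[int]) -> int:
    """Return the poll interval in seconds to store under NVS key `ota_interval`."""
    if requested is None:
        # No --ota-interval-seconds given: fall back to the default.
        return OTA_INTERVAL_DEFAULT_S
    # Assumption: values below the minimum are raised to it, not rejected.
    return max(OTA_INTERVAL_MIN_S, requested)
```

For example, `resolve_ota_interval(3)` yields the 10 s floor, while `resolve_ota_interval(None)` yields the 6 h default.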
## Issues resolved
### 1. HMAC format mismatch (resolved 2026-05-13)
The firmware OTA updater was sending an `X-HMAC-Signature` header with a `millis()`-derived timestamp, while the reporter component sends `X-Signature` with a `time(nullptr)` timestamp. The server expected the reporter format. Fixed by aligning the OTA updater with the reporter's canonical scheme (`add_hmac_headers` in `firmware/lib/ota_updater/ota_updater.cpp`).
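To make the shape of the fix concrete, here is a rough Python model of a request signer that uses wall-clock time and the `X-Signature` header. Only the `X-Signature` name comes from this document. The `X-Device-Id` and `X-Timestamp` header names, the `device_id:ts:path` message layout, HMAC-SHA256, and hex encoding are all assumptions; the real canonical scheme lives in the firmware and the server.

```python
import hashlib
import hmac
import time
from typing import Dict, Optional


def build_hmac_headers(device_id: str, secret: bytes, path: str,
                       now: Optional[int] = None) -> Dict[str, str]:
    """Build illustrative auth headers for a device request."""
    # Wall-clock seconds (the analogue of time(nullptr)), not uptime (millis()).
    ts = int(time.time()) if now is None else now
    # Assumption: the signed message is "device_id:ts:path".
    message = f"{device_id}:{ts}:{path}".encode()
    # Assumption: HMAC-SHA256, hex-encoded.
    sig = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return {
        "X-Device-Id": device_id,  # hypothetical header name
        "X-Timestamp": str(ts),    # hypothetical header name
        "X-Signature": sig,        # header name taken from this document
    }
```

The point of the sketch is that both components must derive the timestamp and header names the same way; an uptime-based timestamp would fall outside any freshness window the server enforces on wall-clock time.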
### 2. `/ota/check` JSON schema mismatch (resolved 2026-05-14)
The server was emitting `{update_available, sha256, url}` but the firmware reads `{update, size, sig_b64}`. The device silently decided it was "up to date" on every poll because `doc["update"]` defaulted to `false` when the key was absent. Fixed server-side: the `/ota/check` response now also includes the fields the firmware needs. The firmware schema remains the source of truth.
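The failure mode is easy to reproduce off-device. The sketch below contrasts a lenient read, analogous to `doc["update"]` defaulting to `false`, with a strict parse that treats a missing key as an error. `parse_check_response` is a hypothetical helper and the field values are placeholders; requiring `size` and `sig_b64` only when `update` is true is an assumption about the schema.

```python
import json


def parse_check_response(body: str) -> dict:
    """Strictly parse an /ota/check response using the firmware's field names."""
    doc = json.loads(body)
    if "update" not in doc:
        # A lenient reader would treat this as update == False ("up to date").
        raise ValueError("response has no 'update' field; schema mismatch?")
    if doc["update"]:
        # Assumption: size and sig_b64 are only required when an update is offered.
        missing = [k for k in ("size", "sig_b64") if k not in doc]
        if missing:
            raise ValueError(f"update offered but missing fields: {missing}")
    return doc


# The pre-fix server response shape, with placeholder values.
old_body = '{"update_available": true, "sha256": "00", "url": "http://example.invalid/fw.bin"}'

# Lenient read: the missing key quietly becomes False.
lenient_update = json.loads(old_body).get("update", False)
```

With the old response shape, `lenient_update` is `False` even though the server is offering an update, whereas `parse_check_response(old_body)` raises and makes the mismatch visible.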
### 3. Signed firmware artifact pipeline (resolved 2026-05-14)
Deploy flow now bumps `FW_VERSION` → builds → copies `.pio/build/timercam/firmware.bin` to `firmware-<version>.bin` → signs with `tools/sign_firmware.py` → SCPs both `.bin` and `.bin.sig` to `root@nginx:/root/engagement-api/firmware/`. Server team updates `firmware_releases.sha256` to match the new binary.
**Gotcha:** the `.bin` and `.bin.sig` must always be deployed together. The signature is computed over the exact bytes of the `.bin`, so replacing one file without the other leaves the server in an inconsistent state and devices will reject the update with `SIGNATURE INVALID`.
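A pre-upload check along the following lines could guard the co-deploy invariant. It is a sketch only: `check_release_pair` is a hypothetical helper, not part of `tools/deploy_firmware.py`, and it does not verify the signature itself; it only confirms that both files are present and computes the image digest that the server-side `firmware_releases.sha256` value is expected to match.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def check_release_pair(bin_path: Path) -> Tuple[Path, str]:
    """Return (sig_path, sha256_hex) for an image, or raise if the pair is incomplete."""
    # firmware-<version>.bin -> firmware-<version>.bin.sig
    sig_path = bin_path.with_name(bin_path.name + ".sig")
    if not bin_path.is_file():
        raise FileNotFoundError(f"missing image: {bin_path}")
    if not sig_path.is_file() or sig_path.stat().st_size == 0:
        raise FileNotFoundError(f"missing or empty signature: {sig_path}")
    # Stream the image so large binaries are not loaded into memory at once.
    digest = hashlib.sha256()
    with bin_path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return sig_path, digest.hexdigest()
```

Running this immediately before the copy step, and uploading both returned paths in the same operation, reduces the window in which the server holds a mismatched pair.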
## Hardening added this session
### Firmware logging (`firmware/lib/ota_updater/ota_updater.cpp`, `firmware/src/main.cpp`)
The previous `log_i/w/e` macros were silenced by the default `CORE_DEBUG_LEVEL`. Replaced with `Serial.printf` so output appears regardless of log level. Now logs at every step:
- `[OTA] task started, interval=N ms`
- Per-tick WiFi status
- Full check URL + HMAC header preview (device id, ts, sig prefix)
- HTTP response code + error body on non-200
- JSON parse errors
- "Up to date" decision
- Partition labels and offsets (running + target)
- Per-128 KB download progress
- Total bytes + elapsed ms
- Computed sha256 of the downloaded image (compare against server `X-SHA256`)
- Signature verify result
- `esp_ota_end` / `esp_ota_set_boot_partition` errors by name
- Final reboot notice, followed by `Serial.flush()` and a 500 ms `delay()` before `esp_restart()` so the last log line escapes the UART
### Boot-time partition state (`firmware/src/main.cpp`)
Logs `running partition '<label>' (off=0x…) state=N fw=…` at every boot. If `state == ESP_OTA_IMG_PENDING_VERIFY` (`0x1`), calls `esp_ota_mark_app_valid_cancel_rollback()` to prevent the bootloader from reverting on the next reboot. This is a harmless no-op when rollback isn't enabled, but it eliminates a class of silent OTA failures.
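When reading the `state=N` field from a captured boot log, a small host-side decoder can help. The numeric values below follow `esp_ota_img_states_t` in ESP-IDF's `esp_ota_ops.h`; the helper names are hypothetical, and the handling of `-1` assumes the firmware may print the undefined state as a signed integer.

```python
from enum import IntEnum


class OtaImgState(IntEnum):
    """Mirror of ESP-IDF's esp_ota_img_states_t values."""
    NEW = 0x0
    PENDING_VERIFY = 0x1
    VALID = 0x2
    INVALID = 0x3
    ABORTED = 0x4
    UNDEFINED = 0xFFFFFFFF


def describe_state(n: int) -> str:
    """Map a logged state number to its symbolic name."""
    # Assumption: the undefined state may be logged as -1 if printed signed.
    n &= 0xFFFFFFFF
    try:
        return OtaImgState(n).name
    except ValueError:
        return f"UNKNOWN({n})"


def needs_mark_valid(n: int) -> bool:
    """True when the boot code would call esp_ota_mark_app_valid_cancel_rollback()."""
    return (n & 0xFFFFFFFF) == OtaImgState.PENDING_VERIFY
```

For instance, a boot line reporting `state=1` decodes to `PENDING_VERIFY`, which is the only state in which the mark-valid call is made.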
### `esp_ota_write` return value (`firmware/lib/ota_updater/ota_updater.cpp`)
Previously ignored: a failed write would silently corrupt the new partition and the device would still try to boot from it. The return value is now checked; on failure the updater aborts the OTA cleanly and logs the failing offset.
### Partition size pre-check
Reject the update before `esp_ota_begin` if `expected_size > target->size`.
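The control flow of the two checks above can be modeled off-device. The following Python sketch is not the firmware code (which is C++ against the `esp_ota_*` API); `apply_image`, `OtaAbort`, and the `write` callback are hypothetical stand-ins, and the final total-bytes check is an added assumption rather than documented behavior.

```python
from typing import Callable, Iterable


class OtaAbort(Exception):
    """Raised when the modeled OTA must stop without switching boot partitions."""


def apply_image(chunks: Iterable[bytes], expected_size: int, partition_size: int,
                write: Callable[[int, bytes], bool]) -> int:
    """Model of the apply loop: size pre-check, then checked writes."""
    # Pre-check: refuse before any flash is touched (before esp_ota_begin).
    if expected_size > partition_size:
        raise OtaAbort(f"image of {expected_size} B exceeds partition of {partition_size} B")
    offset = 0
    for chunk in chunks:
        # write() stands in for esp_ota_write; False models a non-OK return.
        if not write(offset, chunk):
            raise OtaAbort(f"write failed at offset {offset}")
        offset += len(chunk)
    # Assumption: a short or long download is also treated as a failure.
    if offset != expected_size:
        raise OtaAbort(f"received {offset} B, expected {expected_size} B")
    return offset
```

The important property is that every failure path raises before the boot partition would be switched, so a bad download never becomes the next boot image.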
## Verifying a deployment
After a server push, watch the device's serial output on the next OTA tick:
```
[OTA] tick: WiFi connected, running check
[OTA] check → GET http://logs.research.bike:80/ota/check?version=X.Y.Z
[OTA] check response: HTTP 200
[OTA] Update: X.Y.Z → A.B.C (N bytes)
[OTA] running='app0' (off=…), target='app1' (off=…)
[OTA] progress: N/N bytes
[OTA] sha256(image)=<hex> ← must match server X-SHA256
[OTA] signature OK
[OTA] boot partition set to 'app1' — rebooting in 500 ms
```
Then on reboot:
```
[BOOT] running partition 'app1' (off=…) state=N fw=A.B.C
```
The `fw=A.B.C` line is the success signal — it reflects the `FW_VERSION` macro baked into the freshly-booted image, not just what the device claims to be running.
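If serial output is captured to a file, the success signal can be checked mechanically. The sketch below assumes the boot line matches the format shown above exactly; `BOOT_RE`, `booted_version`, and `ota_succeeded` are hypothetical names, and the regular expression would need adjusting if the firmware's log format changes.

```python
import re
from typing import Optional

# Matches: [BOOT] running partition 'app1' (off=...) state=N fw=A.B.C
BOOT_RE = re.compile(
    r"\[BOOT\] running partition '(?P<label>[^']+)' \(off=(?P<off>[^)]*)\) "
    r"state=(?P<state>-?\d+) fw=(?P<fw>\S+)"
)


def booted_version(serial_log: str) -> Optional[str]:
    """Return the fw= value from the most recent boot line, or None if absent."""
    last = None
    for match in BOOT_RE.finditer(serial_log):
        last = match.group("fw")
    return last


def ota_succeeded(serial_log: str, expected_fw: str) -> bool:
    """True when the latest boot line reports the expected firmware version."""
    return booted_version(serial_log) == expected_fw
```

Using the last boot line rather than the first matters because a capture that spans the OTA contains the pre-update boot as well.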
## Quick reference
- Plan: `docs/superpowers/plans/2026-05-10-pull-ota-code-signing.md`
- Firmware version: `firmware/include/version.h`
- OTA library: `firmware/lib/ota_updater/`
- HMAC implementation: `firmware/lib/hmac/hmac.cpp`
- Provisioning tool: `tools/flash_device.py`
- Signing tools: `tools/gen_signing_key.py`, `tools/sign_firmware.py`, `tools/deploy_firmware.py`
- Server deploy path: `root@nginx:/root/engagement-api/firmware/` (per server team runbook)