From d2c2d97fb7ffff93bb25d78d655bc59fabe7ddb8 Mon Sep 17 00:00:00 2001
From: Peter Woolery
Date: Thu, 14 May 2026 12:21:52 -0700
Subject: [PATCH] feat(ota): harden OTA apply flow + bump firmware to 1.0.1

End-to-end OTA verified on dc-0002 after resolving server-side schema mismatch (server now emits update/size/sig_b64 alongside existing fields).

Firmware changes:
- Bump FW_VERSION 1.0.0 -> 1.0.1
- Replace log_i/w/e with Serial.printf in ota_updater so output appears regardless of CORE_DEBUG_LEVEL (the prior macros were silent in prod)
- Log partition labels/offsets, per-128KB progress, computed sha256, HTTP errors with body, esp_ota_* errors by name, Content-Length vs expected size
- Check esp_ota_write return value (previously ignored -- silent partition corruption on write failure) and abort cleanly on error
- Reject update if expected_size > target partition size
- Serial.flush() + 500ms delay before esp_restart() so the final log line escapes the UART
- Boot-time: log running partition label/offset/state + FW_VERSION, and call esp_ota_mark_app_valid_cancel_rollback() on PENDING_VERIFY to prevent silent rollback after a successful OTA

Docs:
- Rewrite docs/ota-deployment-status.md to reflect resolved state, document the schema fix and the .bin/.sig co-deploy invariant
---
 docs/ota-deployment-status.md            |  88 ++++++++++++++
 firmware/include/version.h               |   2 +-
 firmware/lib/ota_updater/ota_updater.cpp | 141 ++++++++++++++++++-----
 firmware/src/main.cpp                    |  33 +++++-
 4 files changed, 232 insertions(+), 32 deletions(-)
 create mode 100644 docs/ota-deployment-status.md

diff --git a/docs/ota-deployment-status.md b/docs/ota-deployment-status.md
new file mode 100644
index 0000000..fd52856
--- /dev/null
+++ b/docs/ota-deployment-status.md
@@ -0,0 +1,88 @@
+# OTA Deployment — Status
+
+## Current state (2026-05-14)
+
+**End-to-end OTA verified working on `dc-0002`.** Device polled `engagement-api-1`, received a signed manifest, downloaded and verified firmware 1.0.1, set the alternate boot partition, rebooted, and came up reporting `fw=1.0.1`.
+
+## What's deployed
+
+- **Branch `feat/pull-ota-code-signing`** merged to `main` (13 commits, 17 new files, 936 LOC).
+- **Signing toolchain**: `tools/gen_signing_key.py`, `tools/sign_firmware.py`, `tools/deploy_firmware.py`.
+- **Firmware OTA library**: `firmware/lib/ota_updater/`.
+- **Signing key**: `secrets/firmware_signing_key.pem` (gitignored). Public key committed at `firmware/lib/ota_updater/ota_pubkey.h`.
+- **Live OTA handler**: served by `engagement-api-1` Docker service (source not in this repo). The stub at `server/ota_endpoint.py` is unwired and not the one responding to devices.
+- **Configurable poll interval** via NVS key `ota_interval`. Provision with `flash_device.py --ota-interval-seconds N`. Min 10 s, default 21600 (6 h).
+
+## Issues resolved
+
+### 1. HMAC format mismatch (resolved 2026-05-13)
+Firmware OTA updater was using `X-HMAC-Signature` header + `millis()`-derived timestamp; the reporter component used `X-Signature` + `time(nullptr)`. Server expected the reporter format. Fixed by aligning the OTA updater to the same canonical scheme as the reporter (`firmware/lib/ota_updater/ota_updater.cpp` `add_hmac_headers`).
+
+### 2. `/ota/check` JSON schema mismatch (resolved 2026-05-14)
+Server was emitting `{update_available, sha256, url}` but firmware reads `{update, size, sig_b64}`. Device silently decided "up to date" every poll because `doc["update"]` defaulted to `false`. Fixed server-side: the `/ota/check` response now also includes the fields the firmware needs. Firmware schema remains the source of truth.
+
+### 3. Signed firmware artifact pipeline (resolved 2026-05-14)
+Deploy flow now bumps `FW_VERSION` → builds → copies `.pio/build/timercam/firmware.bin` to `firmware-.bin` → signs with `tools/sign_firmware.py` → SCPs both `.bin` and `.bin.sig` to `root@nginx:/root/engagement-api/firmware/`. Server team updates `firmware_releases.sha256` to match the new binary.
+
+**Gotcha:** the `.bin` and `.sig` must always be deployed together. The signature is over the bytes; replacing one without the other puts the server in an inconsistent state and devices will reject the update with `SIGNATURE INVALID`.
+
+## Hardening added this session
+
+### Firmware logging (`firmware/lib/ota_updater/ota_updater.cpp`, `firmware/src/main.cpp`)
+The previous `log_i/w/e` macros were silenced by the default `CORE_DEBUG_LEVEL`. Replaced with `Serial.printf` so output appears regardless of log level. Now logs at every step:
+- `[OTA] task started, interval=N ms`
+- Per-tick WiFi status
+- Full check URL + HMAC header preview (device id, ts, sig prefix)
+- HTTP response code + error body on non-200
+- JSON parse errors
+- "Up to date" decision
+- Partition labels and offsets (running + target)
+- Per-128 KB download progress
+- Total bytes + elapsed ms
+- Computed sha256 of the downloaded image (compare against server `X-SHA256`)
+- Signature verify result
+- `esp_ota_end` / `esp_ota_set_boot_partition` errors by name
+- 500 ms `Serial.flush()` + `delay()` before `esp_restart()` so the final log line escapes the UART
+
+### Boot-time partition state (`firmware/src/main.cpp`)
+Logs `running partition '