End-to-end OTA verified on dc-0002 after resolving server-side schema mismatch (server now emits update/size/sig_b64 alongside existing fields).

Firmware changes:
- Bump FW_VERSION 1.0.0 -> 1.0.1
- Replace log_i/w/e with Serial.printf in ota_updater so output appears regardless of CORE_DEBUG_LEVEL (the prior macros were silent in prod)
- Log partition labels/offsets, per-128KB progress, computed sha256, HTTP errors with body, esp_ota_* errors by name, Content-Length vs expected size
- Check esp_ota_write return value (previously ignored -- silent partition corruption on write failure) and abort cleanly on error
- Reject update if expected_size > target partition size
- Serial.flush() + 500ms delay before esp_restart() so the final log line escapes the UART
- Boot-time: log running partition label/offset/state + FW_VERSION, and call esp_ota_mark_app_valid_cancel_rollback() on PENDING_VERIFY to prevent silent rollback after a successful OTA

Docs:
- Rewrite docs/ota-deployment-status.md to reflect resolved state, document the schema fix and the .bin/.sig co-deploy invariant
89 lines
5.2 KiB
Markdown
# OTA Deployment — Status
## Current state (2026-05-14)
**End-to-end OTA verified working on `dc-0002`.** Device polled `engagement-api-1`, received a signed manifest, downloaded and verified firmware 1.0.1, set the alternate boot partition, rebooted, and came up reporting `fw=1.0.1`.
## What's deployed
- **Branch `feat/pull-ota-code-signing`** merged to `main` (13 commits, 17 new files, 936 LOC).
- **Signing toolchain**: `tools/gen_signing_key.py`, `tools/sign_firmware.py`, `tools/deploy_firmware.py`.
- **Firmware OTA library**: `firmware/lib/ota_updater/`.
- **Signing key**: `secrets/firmware_signing_key.pem` (gitignored). Public key committed at `firmware/lib/ota_updater/ota_pubkey.h`.
- **Live OTA handler**: served by the `engagement-api-1` Docker service (source not in this repo). The stub at `server/ota_endpoint.py` is unwired and is not the one responding to devices.
- **Configurable poll interval** via NVS key `ota_interval`. Provision with `flash_device.py --ota-interval-seconds N`. Minimum 10 s, default 21600 s (6 h).
## Issues resolved
### 1. HMAC format mismatch (resolved 2026-05-13)
The firmware OTA updater was sending an `X-HMAC-Signature` header with a `millis()`-derived (uptime) timestamp, while the reporter component sends `X-Signature` with a `time(nullptr)` (wall-clock) timestamp. The server expects the reporter format. Fixed by aligning the OTA updater with the same canonical scheme as the reporter (`add_hmac_headers` in `firmware/lib/ota_updater/ota_updater.cpp`).
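
To illustrate why both sides must agree on the header name and the timestamp source, here is a minimal Python sketch of a signer and its matching verifier. Only the `X-Signature` header name and the use of wall-clock time come from this document. The `X-Device-Id` and `X-Timestamp` header names, the `<device_id>:<timestamp>` message layout, and the choice of HMAC-SHA256 with hex encoding are assumptions for illustration, not the actual canonical scheme.

```python
import hashlib
import hmac
import time
from typing import Dict, Optional


def build_hmac_headers(device_id: str, secret: bytes,
                       now: Optional[int] = None) -> Dict[str, str]:
    """Build signed request headers.

    Assumed layout (hypothetical): HMAC-SHA256 over "<device_id>:<unix_ts>".
    """
    # Wall-clock seconds since the epoch, the analogue of time(nullptr).
    # An uptime counter such as millis() would not match the server's clock.
    ts = int(time.time()) if now is None else now
    message = f"{device_id}:{ts}".encode()
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return {
        "X-Device-Id": device_id,  # hypothetical header name
        "X-Timestamp": str(ts),    # hypothetical header name
        "X-Signature": signature,  # header name taken from the text above
    }


def verify_hmac_headers(headers: Dict[str, str], secret: bytes) -> bool:
    """Server-side check that mirrors build_hmac_headers."""
    message = f"{headers['X-Device-Id']}:{headers['X-Timestamp']}".encode()
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, headers["X-Signature"])
```

A verifier that looks for `X-Signature` never sees a signature sent under a different header name, which is one way a mismatch like the one above shows up as a rejected request.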
### 2. `/ota/check` JSON schema mismatch (resolved 2026-05-14)
Server was emitting `{update_available, sha256, url}` but firmware reads `{update, size, sig_b64}`. Device silently decided "up to date" every poll because `doc["update"]` defaulted to `false`. Fixed server-side: the `/ota/check` response now also includes the fields the firmware needs. Firmware schema remains the source of truth.
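
The failure mode is easy to reproduce off-device. The sketch below is a Python model, not the firmware's C++ parser: `naive_update_available` mimics a lookup that falls back to `false` when the key is absent, and `strict_parse_manifest` is a hypothetical stricter variant that treats a missing key as an error instead of "up to date".

```python
import json
from typing import Any, Dict


def naive_update_available(body: str) -> bool:
    """Model of the failure mode: a missing "update" key reads as False."""
    doc: Dict[str, Any] = json.loads(body)
    return bool(doc.get("update", False))


def strict_parse_manifest(body: str) -> Dict[str, Any]:
    """Raise instead of silently reporting "up to date" when keys are missing."""
    doc: Dict[str, Any] = json.loads(body)
    if "update" not in doc:
        raise ValueError("manifest has no 'update' key")
    if doc["update"]:
        missing = [k for k in ("size", "sig_b64") if k not in doc]
        if missing:
            raise ValueError(f"manifest missing keys: {missing}")
    return doc


# Shape of the old server response described above (values are placeholders).
old_response = '{"update_available": true, "sha256": "ab", "url": "/fw.bin"}'
print(naive_update_available(old_response))  # False
```

Logging the parse error in the strict case would have surfaced the mismatch on the first poll.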
### 3. Signed firmware artifact pipeline (resolved 2026-05-14)
Deploy flow now bumps `FW_VERSION` → builds → copies `.pio/build/timercam/firmware.bin` to `firmware-<version>.bin` → signs with `tools/sign_firmware.py` → SCPs both `.bin` and `.bin.sig` to `root@nginx:/root/engagement-api/firmware/`. Server team updates `firmware_releases.sha256` to match the new binary.
**Gotcha:** the `.bin` and `.bin.sig` must always be deployed together. The signature covers the exact bytes of the binary, so replacing one file without the other leaves the server in an inconsistent state and devices will reject the update with `SIGNATURE INVALID`.
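
A pre-deploy check can catch the most common way this invariant breaks. The following is a minimal sketch, assuming the signature file sits next to the binary under the name `<binary>.sig`. The function `check_release_pair` is hypothetical and is not part of `tools/deploy_firmware.py`. It does not verify the signature itself; it only confirms the pair exists, flags a signature older than the binary, and returns the sha256 that the server-side release record should match.

```python
import hashlib
from pathlib import Path


def check_release_pair(bin_path: Path) -> str:
    """Return the hex sha256 of bin_path after sanity-checking its .sig sibling."""
    sig_path = bin_path.with_name(bin_path.name + ".sig")
    if not bin_path.is_file():
        raise FileNotFoundError(f"missing firmware binary: {bin_path}")
    if not sig_path.is_file() or sig_path.stat().st_size == 0:
        raise FileNotFoundError(f"missing or empty signature: {sig_path}")
    if sig_path.stat().st_mtime < bin_path.stat().st_mtime:
        # A signature older than the binary suggests the .bin was rebuilt
        # without re-signing. This is a heuristic, not a cryptographic check.
        raise RuntimeError(f"{sig_path.name} is older than {bin_path.name}")
    return hashlib.sha256(bin_path.read_bytes()).hexdigest()
```

Running this immediately before the copy step, and again on the server after it, would show whether the two files and the recorded hash still describe the same bytes.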
## Hardening added this session
### Firmware logging (`firmware/lib/ota_updater/ota_updater.cpp`, `firmware/src/main.cpp`)
The previous `log_i/w/e` macros were silenced by the default `CORE_DEBUG_LEVEL`. Replaced with `Serial.printf` so output appears regardless of log level. Now logs at every step:
- `[OTA] task started, interval=N ms`
- Per-tick WiFi status
- Full check URL + HMAC header preview (device id, ts, sig prefix)
- HTTP response code + error body on non-200
- JSON parse errors
- "Up to date" decision
- Partition labels and offsets (running + target)
- Per-128 KB download progress
- Total bytes + elapsed ms
- Computed sha256 of the downloaded image (compare against server `X-SHA256`)
- Signature verify result
- `esp_ota_end` / `esp_ota_set_boot_partition` errors by name
- `Serial.flush()` followed by a 500 ms `delay()` before `esp_restart()`, so the final log line escapes the UART
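
The progress and sha256 items in the list above can be modelled on a host machine. This is a Python sketch of the bookkeeping only, not the firmware's C++ download loop: `consume_image` is a hypothetical name, and the two log formats are copied from the sample output in this document.

```python
import hashlib
from typing import Iterable, List, Tuple

PROGRESS_STEP = 128 * 1024  # log roughly once per 128 KiB


def consume_image(chunks: Iterable[bytes], expected_size: int) -> Tuple[str, List[int]]:
    """Hash a streamed image and record the byte counts at which progress is logged."""
    digest = hashlib.sha256()
    total = 0
    next_mark = PROGRESS_STEP
    marks: List[int] = []
    for chunk in chunks:
        digest.update(chunk)
        total += len(chunk)
        if total >= next_mark:
            marks.append(total)
            print(f"[OTA] progress: {total}/{expected_size} bytes")
            # Advance to the next 128 KiB boundary above the current total.
            next_mark = (total // PROGRESS_STEP + 1) * PROGRESS_STEP
    if total != expected_size:
        raise ValueError(f"size mismatch: got {total}, expected {expected_size}")
    hex_digest = digest.hexdigest()
    print(f"[OTA] sha256(image)={hex_digest}")
    return hex_digest, marks
```

Hashing incrementally while the image streams in means the digest is available as soon as the last chunk arrives, without holding the whole image in memory.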
### Boot-time partition state (`firmware/src/main.cpp`)
Logs `running partition '<label>' (off=0x…) state=N fw=…` at every boot. If `state == ESP_OTA_IMG_PENDING_VERIFY` (1), calls `esp_ota_mark_app_valid_cancel_rollback()` to prevent the bootloader from reverting on the next reboot. Harmless no-op when rollback isn't enabled, but eliminates a class of silent OTA failures.
### `esp_ota_write` return value (`firmware/lib/ota_updater/ota_updater.cpp`)
Previously the return value was ignored: a failed write would silently corrupt the new partition and the device would still try to boot from it. It is now checked; on failure the OTA aborts cleanly and logs the failing offset.
### Partition size pre-check
Reject the update before `esp_ota_begin` if `expected_size > target->size`.
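
The size pre-check and the write-result check fit together as shown in this Python model. It is a sketch of the control flow only: `write_image`, `OtaAbort`, and the `write(offset, data)` callback are hypothetical stand-ins, with the callback playing the role of `esp_ota_write` and returning `True` on success.

```python
from typing import Callable, Iterable


class OtaAbort(Exception):
    """Raised when the update is abandoned before the boot partition is switched."""


def write_image(chunks: Iterable[bytes], expected_size: int, partition_size: int,
                write: Callable[[int, bytes], bool]) -> int:
    """Stream chunks to a partition, stopping at the first failed write."""
    # Pre-check: refuse images that cannot fit before touching the partition.
    if expected_size > partition_size:
        raise OtaAbort(f"image {expected_size} B exceeds partition {partition_size} B")
    offset = 0
    for chunk in chunks:
        if not write(offset, chunk):
            # Report where the write failed; the caller must not switch
            # the boot partition after this point.
            raise OtaAbort(f"write failed at offset {offset}")
        offset += len(chunk)
    return offset
```

Raising on the first failure keeps the rest of the update sequence from running against a partially written partition.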
## Verifying a deployment
After a server push, watch the device's serial output on the next OTA tick:
```
[OTA] tick: WiFi connected, running check
[OTA] check → GET http://logs.research.bike:80/ota/check?version=X.Y.Z
[OTA] check response: HTTP 200
[OTA] Update: X.Y.Z → A.B.C (N bytes)
[OTA] running='app0' (off=…), target='app1' (off=…)
[OTA] progress: N/N bytes
[OTA] sha256(image)=<hex> ← must match server X-SHA256
[OTA] signature OK
[OTA] boot partition set to 'app1' — rebooting in 500 ms
```
Then on reboot:
```
[BOOT] running partition 'app1' (off=…) state=N fw=A.B.C
```
The `fw=A.B.C` line is the success signal — it reflects the `FW_VERSION` macro baked into the freshly-booted image, not just what the device claims to be running.
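
When the serial output is captured to a file, the success signal can be checked mechanically. This is a minimal Python sketch, assuming the boot line has exactly the shape shown in the sample above. The regular expression and the helper names `booted_version` and `ota_succeeded` are illustrative and not part of the repository's tooling.

```python
import re
from typing import Optional

# Matches the [BOOT] line shown above; spacing is assumed from that sample.
BOOT_RE = re.compile(
    r"\[BOOT\] running partition '(?P<label>[^']+)' \(off=(?P<off>[^)]*)\) "
    r"state=(?P<state>\S+) fw=(?P<fw>\S+)"
)


def booted_version(serial_log: str) -> Optional[str]:
    """Return the fw= value from the last [BOOT] line, or None if there is none."""
    matches = list(BOOT_RE.finditer(serial_log))
    return matches[-1].group("fw") if matches else None


def ota_succeeded(serial_log: str, expected_version: str) -> bool:
    """True only if the device rebooted and reports the expected firmware version."""
    return booted_version(serial_log) == expected_version
```

Taking the last `[BOOT]` line matters because a capture that spans the update contains the boot line of the old image as well.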
## Quick reference
- Plan: `docs/superpowers/plans/2026-05-10-pull-ota-code-signing.md`
- Firmware version: `firmware/include/version.h`
- OTA library: `firmware/lib/ota_updater/`
- HMAC implementation: `firmware/lib/hmac/hmac.cpp`
- Provisioning tool: `tools/flash_device.py`
- Signing tools: `tools/gen_signing_key.py`, `tools/sign_firmware.py`, `tools/deploy_firmware.py`
- Server deploy path: `root@nginx:/root/engagement-api/firmware/` (per server team runbook)