End-to-end OTA verified on dc-0002 after resolving server-side schema mismatch (server now emits update/size/sig_b64 alongside existing fields).

Firmware changes:
- Bump FW_VERSION 1.0.0 -> 1.0.1
- Replace log_i/w/e with Serial.printf in ota_updater so output appears regardless of CORE_DEBUG_LEVEL (the prior macros were silent in prod)
- Log partition labels/offsets, per-128KB progress, computed sha256, HTTP errors with body, esp_ota_* errors by name, Content-Length vs expected size
- Check esp_ota_write return value (previously ignored -- silent partition corruption on write failure) and abort cleanly on error
- Reject update if expected_size > target partition size
- Serial.flush() + 500ms delay before esp_restart() so the final log line escapes the UART
- Boot-time: log running partition label/offset/state + FW_VERSION, and call esp_ota_mark_app_valid_cancel_rollback() on PENDING_VERIFY to prevent silent rollback after a successful OTA

Docs:
- Rewrite docs/ota-deployment-status.md to reflect resolved state, document the schema fix and the .bin/.sig co-deploy invariant
OTA Deployment — Status
Current state (2026-05-14)
End-to-end OTA verified working on dc-0002. Device polled engagement-api-1, received a signed manifest, downloaded and verified firmware 1.0.1, set the alternate boot partition, rebooted, and came up reporting fw=1.0.1.
What's deployed
- Branch feat/pull-ota-code-signing merged to main (13 commits, 17 new files, 936 LOC).
- Signing toolchain: tools/gen_signing_key.py, tools/sign_firmware.py, tools/deploy_firmware.py.
- Firmware OTA library: firmware/lib/ota_updater/.
- Signing key: secrets/firmware_signing_key.pem (gitignored). Public key committed at firmware/lib/ota_updater/ota_pubkey.h.
- Live OTA handler: served by the engagement-api-1 Docker service (source not in this repo). The stub at server/ota_endpoint.py is unwired and not the one responding to devices.
- Configurable poll interval via NVS key ota_interval. Provision with flash_device.py --ota-interval-seconds N. Min 10 s, default 21600 (6 h).
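The poll-interval limits above can be sketched as a small validation step. This is only an illustration: normalize_ota_interval is a hypothetical helper, not a function known to exist in flash_device.py, and clamping values below the minimum (rather than rejecting them) is an assumption.

```python
from typing import Optional

OTA_INTERVAL_MIN_S = 10          # minimum stated in this document
OTA_INTERVAL_DEFAULT_S = 21600   # 6 h default stated in this document


def normalize_ota_interval(seconds: Optional[int]) -> int:
    """Return the value to store under the NVS key ota_interval.

    Hypothetical helper: None means the flag was not given. Values below
    the minimum are clamped up (assumption; the real tool may reject them).
    """
    if seconds is None:
        return OTA_INTERVAL_DEFAULT_S
    return max(OTA_INTERVAL_MIN_S, int(seconds))
```

For example, normalize_ota_interval(3) yields 10 and normalize_ota_interval(None) yields 21600.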
Issues resolved
1. HMAC format mismatch (resolved 2026-05-13)
Firmware OTA updater was using X-HMAC-Signature header + millis()-derived timestamp; the reporter component used X-Signature + time(nullptr). Server expected the reporter format. Fixed by aligning the OTA updater to the same canonical scheme as the reporter (firmware/lib/ota_updater/ota_updater.cpp add_hmac_headers).
2. /ota/check JSON schema mismatch (resolved 2026-05-14)
Server was emitting {update_available, sha256, url} but firmware reads {update, size, sig_b64}. Device silently decided "up to date" every poll because doc["update"] defaulted to false. Fixed server-side: the /ota/check response now also includes the fields the firmware needs. Firmware schema remains the source of truth.
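The failure mode is easy to reproduce off-device. The sketch below mimics the firmware's read of the "update" key with a default of false. The field names come from this document; the field values (including the size) are placeholders, and firmware_sees_update is a hypothetical stand-in for the firmware's JSON lookup.

```python
import json

# Shape of the old server response (values are placeholders).
OLD_RESPONSE = '{"update_available": true, "sha256": "<hex>", "url": "<url>"}'

# Shape after the server-side fix: old fields plus the ones the firmware reads.
NEW_RESPONSE = (
    '{"update_available": true, "sha256": "<hex>", "url": "<url>", '
    '"update": true, "size": 123456, "sig_b64": "<base64>"}'
)


def firmware_sees_update(body: str) -> bool:
    """Mimic the firmware's read: a missing "update" key counts as False."""
    doc = json.loads(body)
    return bool(doc.get("update", False))


print(firmware_sees_update(OLD_RESPONSE))  # False: looks "up to date"
print(firmware_sees_update(NEW_RESPONSE))  # True: update is picked up
```

Because the missing key produces no error, only logging the "up to date" decision on the device made the mismatch visible.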
3. Signed firmware artifact pipeline (resolved 2026-05-14)
Deploy flow now bumps FW_VERSION → builds → copies .pio/build/timercam/firmware.bin to firmware-<version>.bin → signs with tools/sign_firmware.py → SCPs both .bin and .bin.sig to root@nginx:/root/engagement-api/firmware/. Server team updates firmware_releases.sha256 to match the new binary.
Gotcha: the .bin and .bin.sig must always be deployed together. The signature covers the exact bytes of the .bin, so replacing one file without the other leaves the server in an inconsistent state, and devices will reject the update with SIGNATURE INVALID.
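A pre-upload guard can catch the most common way this invariant breaks. The sketch below is a hypothetical helper (check_release_pair is not part of tools/deploy_firmware.py as far as this document states). It refuses to continue when the signature file is missing or empty, and returns the SHA-256 of the image so it can be handed to the server team for firmware_releases.sha256. It does not verify the signature itself.

```python
import hashlib
from pathlib import Path


def check_release_pair(bin_path: Path) -> str:
    """Return the hex SHA-256 of the .bin, refusing to proceed without a .sig.

    Hypothetical pre-upload check: it only catches a missing or empty
    signature file; it does NOT verify the signature against the image.
    """
    sig_path = bin_path.with_name(bin_path.name + ".sig")
    if not bin_path.is_file():
        raise FileNotFoundError(f"missing firmware image: {bin_path}")
    if not sig_path.is_file() or sig_path.stat().st_size == 0:
        raise FileNotFoundError(f"missing or empty signature: {sig_path}")
    return hashlib.sha256(bin_path.read_bytes()).hexdigest()
```

Running this immediately before the SCP step, on the same two files that will be copied, keeps the digest, the image, and the signature from drifting apart.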
Hardening added this session
Firmware logging (firmware/lib/ota_updater/ota_updater.cpp, firmware/src/main.cpp)
The previous log_i/w/e macros were silenced by the default CORE_DEBUG_LEVEL. Replaced with Serial.printf so output appears regardless of log level. Now logs at every step:
- [OTA] task started, interval=N ms
- Per-tick WiFi status
- Full check URL + HMAC header preview (device id, ts, sig prefix)
- HTTP response code + error body on non-200
- JSON parse errors
- "Up to date" decision
- Partition labels and offsets (running + target)
- Per-128 KB download progress
- Total bytes + elapsed ms
- Computed sha256 of the downloaded image (compare against server X-SHA256)
- Signature verify result
- esp_ota_end / esp_ota_set_boot_partition errors by name
- 500 ms Serial.flush() + delay() before esp_restart() so the final log line escapes the UART
Boot-time partition state (firmware/src/main.cpp)
Logs running partition '<label>' (off=0x…) state=N fw=… at every boot. If state == ESP_OTA_IMG_PENDING_VERIFY (1), calls esp_ota_mark_app_valid_cancel_rollback() to prevent the bootloader from reverting on the next reboot. This is a harmless no-op when rollback isn't enabled, and it eliminates a class of silent OTA failures.
esp_ota_write return value (firmware/lib/ota_updater/ota_updater.cpp)
Previously ignored: a failed write would silently corrupt the new partition and the device would still try to boot from it. The return value is now checked; on failure the updater aborts the OTA cleanly and logs the failing offset.
Partition size pre-check
Reject the update before esp_ota_begin if expected_size > target->size.
Verifying a deployment
After a server push, watch the device's serial output on the next OTA tick:
[OTA] tick: WiFi connected, running check
[OTA] check → GET http://logs.research.bike:80/ota/check?version=X.Y.Z
[OTA] check response: HTTP 200
[OTA] Update: X.Y.Z → A.B.C (N bytes)
[OTA] running='app0' (off=…), target='app1' (off=…)
[OTA] progress: N/N bytes
[OTA] sha256(image)=<hex> ← must match server X-SHA256
[OTA] signature OK
[OTA] boot partition set to 'app1' — rebooting in 500 ms
Then on reboot:
[BOOT] running partition 'app1' (off=…) state=N fw=A.B.C
The fw=A.B.C line is the success signal — it reflects the FW_VERSION macro baked into the freshly-booted image, not just what the device claims to be running.
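The same checks can be scripted against a captured serial log. The sketch below is a hypothetical host-side helper (ota_succeeded is not an existing tool in this repo). It assumes the log line formats shown above and that the expected SHA-256 and version are supplied by whoever pushed the release.

```python
import re

# Patterns follow the log lines shown in this section.
SHA_RE = re.compile(r"\[OTA\] sha256\(image\)=([0-9a-fA-F]{64})")
BOOT_RE = re.compile(r"\[BOOT\] running partition '([^']+)' .*\bfw=(\S+)")


def ota_succeeded(log: str, expected_sha256: str, expected_fw: str) -> bool:
    """True if the log shows the expected image hash, a good signature,
    and a later boot line reporting the expected firmware version."""
    sha = SHA_RE.search(log)
    if sha is None or sha.group(1).lower() != expected_sha256.lower():
        return False
    if "[OTA] signature OK" not in log:
        return False
    # Only boot lines after the hash line count; take the most recent one.
    boots = BOOT_RE.findall(log[sha.end():])
    return bool(boots) and boots[-1][1] == expected_fw
```

Requiring the boot line to appear after the hash line avoids mistaking the pre-update boot banner for the post-update one.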
Quick reference
- Plan: docs/superpowers/plans/2026-05-10-pull-ota-code-signing.md
- Firmware version: firmware/include/version.h
- OTA library: firmware/lib/ota_updater/
- HMAC implementation: firmware/lib/hmac/hmac.cpp
- Provisioning tool: tools/flash_device.py
- Signing tools: tools/gen_signing_key.py, tools/sign_firmware.py, tools/deploy_firmware.py
- Server deploy path: root@nginx:/root/engagement-api/firmware/ (per server team runbook)