fix(firmware): upgrade NimBLE to 2.x + DNS fallback for unreliable resolvers
NimBLE-Arduino 1.4.2 had an init/fire race in its FreeRTOS callout porting layer where os_callout_timer_cb dispatched a queued TimerHandle expiry against a not-yet-initialized event (NULL fn pointer), causing PC=0 InstrFetchProhibited within ~1s of boot when the camera task starved the timer service. Confirmed by ets_printf instrumentation. Upgrading to ^2.0.0 rewrites the porting layer and eliminates the race; verified clean on the customer network for 1+ hour.

Also rolls in DNS-resilience work that surfaced the BLE crash during provisioning: pin lwIP/esp-netif resolvers to 1.1.1.1/8.8.8.8 across DHCP renewals, add three-tier resolver fallback in reporter with a hardcoded IP of last resort, and switch to raw WiFiClient with manual Host header to bypass HTTPClient's brittle DNS path.

Migration touches for NimBLE 2.x:
- NimBLEAdvertisedDeviceCallbacks -> NimBLEScanCallbacks
- onResult signature now takes const NimBLEAdvertisedDevice*
- setAdvertisedDeviceCallbacks -> setScanCallbacks
- start(0, nullptr, false) -> start(0, false, false)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1032
docs/superpowers/plans/2026-04-23-network-resilience.md
Normal file
File diff suppressed because it is too large
189
docs/superpowers/plans/2026-05-01-ble-nimble-crash.md
Normal file
@@ -0,0 +1,189 @@
# BLE / NimBLE Timer-Callout Crash: Handoff

**Date opened:** 2026-05-01
**Status:** Resolved 2026-05-01 by upgrading `h2zero/NimBLE-Arduino` from `^1.4.2` to `^2.0.0` (`firmware/platformio.ini:24`). BLE scanning re-enabled via `BLE_SCANNING_ENABLED 1` (`firmware/src/main.cpp:30`). Verified clean on customer network for 1+ hour with no panics.
**Goal:** Re-enable BLE scanning without the device crashing within ~1s of boot.

**Confirmed root cause:** Instrumented `os_callout_timer_cb` with `ets_printf` and observed the very first callout fire on the direct-call path with both `evq=NULL` and `fn=NULL`, while the same `co` address later (post-init) showed valid `evq` and `fn`. The same callout struct was in use both times, which matches the NimBLE 1.x callout init/fire race: the FreeRTOS `TimerHandle_t` had a queued expiry against a not-yet-initialized event. NimBLE 2.x rewrote the porting layer; the race is gone.

**Migration touches (NimBLE 1.x → 2.x):**
- `NimBLEAdvertisedDeviceCallbacks` → `NimBLEScanCallbacks`
- `onResult(NimBLEAdvertisedDevice*)` → `onResult(const NimBLEAdvertisedDevice*)`
- `setAdvertisedDeviceCallbacks(cb, true)` → `setScanCallbacks(cb, true)`
- `start(0, nullptr, false)` → `start(0, false, false)` (signature: `duration, isContinue, restart`)

BLE was working before today's customer-site provisioning trip. On the NimBLE 1.4.2 build, the crash is reliably reproducible at the customer location whenever `BLE_SCANNING_ENABLED` is set back to `1`. It may or may not reproduce on a quieter network: the camera task's CPU-starvation pattern is shared, but the crash window's exact trigger is unconfirmed.

---

## Symptom

Within ~1s of boot, after several `cam_hal: EV-VSYNC-OVF` lines from the camera driver:

```
Guru Meditation Error: Core 0 panic'ed (InstrFetchProhibited). Exception was unhandled.

Core 0 register dump:
PC : 0x00000000 PS : 0x00060630 A0 : 0x8009a9af A1 : 0x3ffbd6e0
A2 : 0x3fff1ef8 A3 : 0x00000001 ...
A8 : 0x800f2ebc ...
EXCCAUSE: 0x00000014 EXCVADDR: 0x00000000

Backtrace: 0xfffffffd:0x3ffbd6e0 0x4009a9ac:0x3ffbd700
```

Decoded with `~/.platformio/packages/toolchain-xtensa-esp32/bin/xtensa-esp32-elf-addr2line -e .pio/build/timercam/firmware.elf -pfiC 0x4009a9ac 0x400f2ebc`:

```
prvProcessReceivedCommands at freertos/timers.c:852
(inlined by) prvTimerTask at freertos/timers.c:600
os_callout_timer_cb at NimBLE-Arduino/.../npl_os_freertos.c:1742
```

`PC=0` + `EXCCAUSE=0x14` (InstrFetchProhibited) = jump-to-NULL. The FreeRTOS timer-service task is dispatching a NimBLE callout whose callback function pointer is NULL.

The relevant NimBLE source:

```c
// firmware/.pio/libdeps/timercam/NimBLE-Arduino/src/nimble/porting/npl/freertos/src/npl_os_freertos.c:1729-1742
static void
os_callout_timer_cb(TimerHandle_t timer)
{
    struct ble_npl_callout *co;

    co = pvTimerGetTimerID(timer);
    assert(co);

    if (co->evq) {
        ble_npl_eventq_put(co->evq, &co->ev);
    } else {
        co->ev.fn(&co->ev); // <-- co->ev.fn is NULL
    }
}
```

Either `co->ev.fn` is genuinely NULL on the direct-call path, or (given that the addr2line frame is a few lines off and the call site is ambiguous) the FreeRTOS timer's own callback pointer (`pxTimer->pxCallbackFunction`) is NULL inside `prvProcessReceivedCommands`. Both indicate a callout/timer being freed or zeroed while the FreeRTOS timer service still has a command queued for it.
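
To make the first hypothesis concrete, here is a minimal host-side sketch of the direct-call path. It is not NimBLE code: `Event`, `Callout`, `on_expiry`, and `dispatch_guarded` are hypothetical stand-ins for the `ble_npl_*` types, and the event queue is reduced to a flag. Under those assumptions it shows why an expiry processed before the callout is initialized ends in a jump to NULL, and what a guard changes.

```cpp
// Hypothetical stand-ins for ble_npl_event / ble_npl_callout (not the real NimBLE types).
struct Event {
    void (*fn)(Event*) = nullptr;  // handler; stays NULL until the callout is initialized
    int fired = 0;                 // counts how many times the handler ran
};

struct Callout {
    bool has_queue = false;        // stands in for co->evq != NULL
    Event ev;
};

// Example handler installed once initialization completes.
void on_expiry(Event* ev) { ev->fired++; }

// Same shape as the direct-call branch of os_callout_timer_cb, but guarded:
// returns false at the point where the unguarded original would fetch from address 0.
bool dispatch_guarded(Callout& co) {
    if (co.has_queue) return true;          // real code would enqueue the event instead
    if (co.ev.fn == nullptr) return false;  // unguarded code faults here (PC=0)
    co.ev.fn(&co.ev);
    return true;
}
```

Calling `dispatch_guarded` on a default-constructed `Callout` returns `false`; after `co.ev.fn = on_expiry` the same struct dispatches normally. That mirrors the instrumentation result: one `co` address, NULL fields on the first fire, valid fields later.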

---

## Environment

- Board: M5Stack TimerCam-F (ESP32-D0WDQ6-V3, dual-core 240 MHz, 4MB flash).
- BLE library: `h2zero/NimBLE-Arduino@^1.4.2` (`firmware/platformio.ini`). 1.4.2 is end-of-life on the 1.x branch; 2.x exists with breaking API changes.
- Camera: OV3660 via `esp32-camera` driver, 96×96 grayscale @ 5 FPS.
- BLE scan: passive, low-overhead, hash-collected by `firmware/src/ble_scanner.cpp`.
- Tasks: `task_camera` (core 1, prio 2, 8KB stack), `task_reporter` (core 0, prio 1, 8KB stack), Arduino loop (default).
- The camera task triggers `cam_hal: EV-VSYNC-OVF` whenever frame capture overlaps another long operation; this consistently precedes the crash in logs.

---

## What's been ruled out

1. **DNS / network code**: entirely unrelated. DNS path works in production via the new fallback-IP machinery (`firmware/src/reporter.cpp` `resolve_api_ip` and `firmware/src/reporter.h` `REPORTER_API_FALLBACK_IP`). Do not regress this; it shipped with reports working at the customer site.
2. **Our BLE app code**: the backtrace stays inside the FreeRTOS timer service and NimBLE's own porting layer; nothing in `ble_scanner.cpp` is on the call stack. The bug is in vendored NimBLE.
3. **Memory corruption from our side**: `A2 = 0x3fff1ef8` is a normal heap address, no obvious overrun pattern. Heap is healthy at the time (we'd see a different fault otherwise).
4. **Stack overflow**: `A1 = 0x3ffbd6e0` is well within the FreeRTOS timer-service task's stack range; no canary smash log.

---

## What changed today

| File | Change | Keep? |
|---|---|---|
| `firmware/src/main.cpp` | Added `BLE_SCANNING_ENABLED 0` gate; all `ble_scanner_*` callsites compile out; `BLEHourlyRecord` zero-stubbed when off | Keep until crash fixed; flip to `1` to reproduce |
| `firmware/src/main.cpp` | Removed verbose `[F]`/`[CV] spawn` per-frame logging; kept entry/exit + heartbeat | Keep |
| `firmware/src/ble_scanner.cpp` | Removed `[BLE] new device:` per-discovery log | Keep |
| `firmware/src/reporter.{h,cpp}` | DNS resolution with fallback IP, raw `WiFiClient` HTTP, manual `Host:` header | Keep (production fix) |
| `firmware/lib/net_guard/net_guard.{h,cpp}` | DNS pin to 1.1.1.1/8.8.8.8 at lwIP + esp-netif layers; `net_guard_dump_dns` diagnostic | Keep |

---

## Reproduction

1. `cd firmware && pio run -e timercam`.
2. Edit `firmware/src/main.cpp`, set `#define BLE_SCANNING_ENABLED 1`. Rebuild.
3. Flash a TimerCam: `python tools/flash_device.py --port /dev/ttyUSB0 --device-id dc-XXXX --location-id <loc> --hmac-secret <secret> --wifi-ssid "<ssid>" --wifi-password "<pw>"`.
4. `pio device monitor --port /dev/ttyUSB0 --baud 115200`.
5. Wait ≤30s. Expect the `Guru Meditation Error: Core 0 panic'ed (InstrFetchProhibited)` traceback above.

The crash is **deterministic** on the customer's network (Elly-Fi). It is worth retesting on a quiet desk network: if it doesn't repro there, the trigger is camera-task starvation interacting with NimBLE timers, not a pure NimBLE bug.

To decode any future crash backtrace:

```sh
~/.platformio/packages/toolchain-xtensa-esp32/bin/xtensa-esp32-elf-addr2line \
  -e firmware/.pio/build/timercam/firmware.elf -pfiC <addr1> <addr2> ...
```
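
The addresses to pass are the PC half of each `PC:SP` pair on the `Backtrace:` line. As a rough illustration of that extraction, here is a small host-side helper; `backtrace_pcs` is a hypothetical name and is not part of this repository or of any ESP-IDF tool.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Extracts the PC half of each "PC:SP" pair from an ESP32 "Backtrace:" line,
// ready to paste after `addr2line ... -pfiC`. Hypothetical helper for illustration.
std::vector<std::string> backtrace_pcs(const std::string& line) {
    std::vector<std::string> pcs;
    std::istringstream in(line);
    std::string tok;
    while (in >> tok) {
        std::size_t colon = tok.find(':');
        // Skip tokens without a pair, and the "Backtrace:" label (colon is its last character).
        if (colon == std::string::npos || colon + 1 == tok.size()) continue;
        // Only keep tokens that look like hex addresses.
        if (tok.compare(0, 2, "0x") != 0) continue;
        pcs.push_back(tok.substr(0, colon));
    }
    return pcs;
}
```

For the dump in the Symptom section this yields `0xfffffffd` and `0x4009a9ac`. The first is not a valid code address, so only the second decodes to something useful.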

---

## Investigation paths, in order of effort/confidence

### 1. Confirm the failing call site (cheap, do this first)

The addr2line line numbers can be off by ±3 due to inlining. Add a temporary `ets_printf` patch to `os_callout_timer_cb` in `npl_os_freertos.c` (a C file, so `Serial.printf` is not available there) that logs `co`, `co->evq`, and `co->ev.fn` on entry, then reproduce. That tells us with certainty whether `co->ev.fn` is NULL on the direct-call path or whether this is a FreeRTOS-level issue (a queued command for a deleted timer).

If `evq != NULL` and we still crash, the NULL is in the queued event dispatcher (a different code path; pivot the investigation).

### 2. Try upgrading NimBLE-Arduino to 2.x (medium effort, likely-fix)

`platformio.ini` has `h2zero/NimBLE-Arduino@^1.4.2`. 2.x rewrote the porting layer significantly and has breaking API changes: `NimBLEAdvertisedDeviceCallbacks` was renamed/restructured. Touch points: `firmware/src/ble_scanner.cpp` (the only file that uses NimBLE).

Try: pin `^2.0.0`, fix the API breakage in `ble_scanner.cpp` (it's <100 lines). If 2.x crashes too, the issue is independent of NimBLE version → pivot to (3) or (4).

### 3. Reduce camera-task starvation (cheap, may be sufficient)

The `EV-VSYNC-OVF` lines are the canary. The camera task is pinned to core 1 at priority 2 and does CV processing every 200ms. The NimBLE host task runs on core 0 by default, but the FreeRTOS timer service task is core-agnostic and may be starved during long CV passes that hold a mutex.

Things to try in `firmware/src/main.cpp`:
- Lower `CAM_FPS` from 5 to 3, see if VSYNC-OVF still appears.
- Move CV processing off the capture path (capture into a queue, process at lower priority).
- Raise FreeRTOS timer-service task priority via `configTIMER_TASK_PRIORITY` (`CONFIG_FREERTOS_TIMER_TASK_PRIORITY` in sdkconfig).
- Confirm NimBLE host task pinning: `CONFIG_BT_NIMBLE_PINNED_TO_CORE` should be 0 or 1 (not unpinned).

### 4. Local NULL-guard patch (last resort, masks the bug)

If upgrade is blocked and starvation reduction isn't enough, patch the vendored source:

```c
// npl_os_freertos.c:1740
} else {
    if (co->ev.fn) co->ev.fn(&co->ev);
}
```

This silences the crash but silently drops any event whose handler is still NULL. The dropped events are likely scan-result deliveries; we'd undercount BLE devices but not crash. Acceptable as a stopgap with a `// TODO: remove when NimBLE upgraded` and a note in this doc.

Caveat: vendored library files in `.pio/libdeps/` get blown away by clean builds. Either copy NimBLE into `firmware/lib/` and pin it (vendored), or use `lib_archive` + a post-install script. Don't ship a build that depends on an unpinned hand-edit.

### 5. Replace BLE stack (high effort)

If 2.x also crashes and starvation reduction doesn't help, switch to the IDF-native Bluedroid stack via the Arduino-ESP32 `BLEDevice` API. It has a larger memory footprint (~30KB more heap) but a different lifecycle model, so it won't share NimBLE's bug.

---

## Constraints / things not to break

- `firmware/src/reporter.cpp` DNS path with `REPORTER_API_FALLBACK_IP`: production fix, must keep working. Do not regress to `HTTPClient`.
- `BLE_SCANNING_ENABLED 0` is the **shipping default** until this is resolved. Devices in the field rely on this; flip to `1` only in dev builds.
- `firmware/lib/net_guard/net_guard.cpp` `net_guard_pin_dns()` is called both at boot and on every WiFi reconnect; if reorganizing net_guard, preserve both call sites.
- The `ble_scanner` module supports `ble_scanner_pause`/`resume` for OTA; verify it still works after any NimBLE upgrade (`ArduinoOTA.onStart` hook in `main.cpp:248`).
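
The resolver order that the first constraint protects (fresh lookup, then last good address, then hardcoded fallback) can be sketched on the host as follows. This is a simplified model, not the firmware code: `ApiResolver` and its members are hypothetical names, `lookup` stands in for `WiFi.hostByName`, and addresses are plain strings rather than `IPAddress`. It assumes a C++17 compiler for `std::optional`.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical host-side model of the three-tier resolver order in reporter.cpp.
struct ApiResolver {
    std::function<std::optional<std::string>()> lookup;  // stands in for WiFi.hostByName
    std::string fallback_ip;                              // plays the role of REPORTER_API_FALLBACK_IP
    std::optional<std::string> cached;                    // last successful lookup, if any

    // Returns the address to use and writes its origin into `source` for logging.
    std::optional<std::string> resolve(std::string& source) {
        if (auto ip = lookup()) {        // tier 1: a fresh DNS answer always wins
            cached = ip;
            source = "dns";
            return ip;
        }
        if (cached) {                    // tier 2: last good answer
            source = "cache";
            return cached;
        }
        if (!fallback_ip.empty()) {      // tier 3: hardcoded address of last resort
            source = "fallback";
            return fallback_ip;
        }
        return std::nullopt;             // nothing usable; caller logs a resolve failure
    }
};
```

The property worth preserving in any refactor is that the cache never takes precedence over a fresh successful lookup, and the hardcoded address is only reached when both earlier tiers are empty.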

---

## Open questions

- Does the crash repro on a quiet network with no `EV-VSYNC-OVF`? (Determines whether starvation is necessary vs sufficient.)
- Was BLE working in a previous build, and on which NimBLE version? The earliest BLE-related commit we traced is well before today; a binary search across firmware commits with BLE enabled would identify the regression boundary if it's our code.
- Does the customer site have an unusual RF environment (very dense BLE) that increases the callout-churn rate, making the race more likely? Worth a `nimble_scan_event` count log during a 60s capture window.

---

## Quick verification once you think it's fixed

1. Set `BLE_SCANNING_ENABLED 1`, rebuild, flash.
2. Run for at least 10 minutes on the customer network. The original crash hit within ~1s, so 10 min with no panic is strong evidence.
3. Confirm a successful hourly cycle: `[CV] entry/exit`, then `[HTTP] POST .../events/batch ... -> 200`, BLE record with non-zero `unique_devices`.
4. Run a second device side-by-side; confirm no cross-device interference.

When done, set `BLE_SCANNING_ENABLED 1` as the default and remove the gate (keep the comment block as institutional memory of the bug).
@@ -9,8 +9,66 @@ uint32_t net_guard_next_backoff_ms(uint32_t attempt) {
 #ifdef ARDUINO
 #include "config.h"
 #include <WiFi.h>
+#include <Arduino.h>
+#include <lwip/dns.h>
+#include <esp_netif.h>
 #include "event_log.h"
 
+// Both lwIP's ip_addr_t and esp-netif's esp_ip_addr_t alias the same on-disk
+// layout for IPv4, but the C++ types differ. Take the raw u32 to sidestep it.
+static String fmt_v4(uint32_t addr_be) {
+    if (addr_be == 0) return String("0.0.0.0");
+    char b[16];
+    snprintf(b, sizeof(b), "%u.%u.%u.%u",
+             (unsigned)((addr_be >> 0) & 0xFF),
+             (unsigned)((addr_be >> 8) & 0xFF),
+             (unsigned)((addr_be >> 16) & 0xFF),
+             (unsigned)((addr_be >> 24) & 0xFF));
+    return String(b);
+}
+
+void net_guard_dump_dns(const char* tag) {
+    const ip_addr_t* d0 = dns_getserver(0);
+    const ip_addr_t* d1 = dns_getserver(1);
+    Serial.printf("[DNS] %s lwip: %s , %s\n", tag,
+                  fmt_v4(d0 ? ip_2_ip4(d0)->addr : 0).c_str(),
+                  fmt_v4(d1 ? ip_2_ip4(d1)->addr : 0).c_str());
+
+    esp_netif_t* sta = esp_netif_get_handle_from_ifkey("WIFI_STA_DEF");
+    if (sta) {
+        esp_netif_dns_info_t main_dns{}, backup_dns{};
+        esp_netif_get_dns_info(sta, ESP_NETIF_DNS_MAIN, &main_dns);
+        esp_netif_get_dns_info(sta, ESP_NETIF_DNS_BACKUP, &backup_dns);
+        Serial.printf("[DNS] %s netif: %s , %s\n", tag,
+                      fmt_v4(main_dns.ip.u_addr.ip4.addr).c_str(),
+                      fmt_v4(backup_dns.ip.u_addr.ip4.addr).c_str());
+    } else {
+        Serial.printf("[DNS] %s netif: <no STA handle>\n", tag);
+    }
+}
+
+void net_guard_pin_dns() {
+    ip_addr_t d1, d2;
+    IP_ADDR4(&d1, 1, 1, 1, 1);
+    IP_ADDR4(&d2, 8, 8, 8, 8);
+    dns_setserver(0, &d1);
+    dns_setserver(1, &d2);
+
+    // Also push through the esp_netif layer. dns_setserver() writes the
+    // global lwIP table directly; esp_netif_set_dns_info() is what the
+    // DHCP client itself calls, so writing here prevents the next DHCP
+    // event from silently overwriting our pin.
+    esp_netif_t* sta = esp_netif_get_handle_from_ifkey("WIFI_STA_DEF");
+    if (sta) {
+        esp_netif_dns_info_t info{};
+        IP_ADDR4(&info.ip, 1, 1, 1, 1);
+        esp_netif_set_dns_info(sta, ESP_NETIF_DNS_MAIN, &info);
+        IP_ADDR4(&info.ip, 8, 8, 8, 8);
+        esp_netif_set_dns_info(sta, ESP_NETIF_DNS_BACKUP, &info);
+    }
+    net_guard_dump_dns("pinned");
+}
+
 // Shared with the WiFi event task. 32-bit aligned loads/stores are atomic on
 // Xtensa; volatile suffices. Tick re-evaluates every loop iteration, so stale
 // reads self-correct within ~200ms.
@@ -23,6 +81,11 @@ static volatile uint32_t s_next_retry_ms = 0;
 static void on_wifi_event(WiFiEvent_t event, WiFiEventInfo_t info) {
     switch (event) {
         case ARDUINO_EVENT_WIFI_STA_GOT_IP:
+            // Override DHCP-supplied DNS. Some routers return TC=1 for short
+            // answers (forcing TCP fallback that lwIP can't follow), or hand
+            // out an unreachable resolver. Pin to public resolvers so
+            // hostByName() never depends on the local network's DNS quality.
+            net_guard_pin_dns();
             s_up = true;
             s_attempts = 0;
             s_next_retry_ms = 0;
@@ -21,4 +21,13 @@ uint8_t net_guard_last_disconnect_reason();
 
 // Non-blocking tick called from loop(); kicks reconnect if due.
 extern "C" void net_guard_tick();
+
+// Override DHCP-supplied DNS with public resolvers (1.1.1.1, 8.8.8.8).
+// Idempotent; safe to call repeatedly. net_guard re-applies on every GOT_IP,
+// but main.cpp must call it once for the boot association (which completes
+// before net_guard_start() registers its event handler).
+void net_guard_pin_dns();
+
+// Diagnostic: print current DNS table state from both lwIP and esp_netif.
+void net_guard_dump_dns(const char* tag);
 #endif
@@ -21,7 +21,7 @@ upload_flags = --no-stub
 lib_deps =
     tzapu/WiFiManager@^2.0.17
     bblanchon/ArduinoJson@^7.0.0
-    h2zero/NimBLE-Arduino@^1.4.2
+    h2zero/NimBLE-Arduino@^2.0.0
     espressif/esp32-camera
 
 ; Frame-capture build. Strips WiFi/BLE/CV/reporter; streams raw 96x96 frames
@@ -42,8 +42,8 @@ static String sha256_prefix(const String& input) {
     return hex;
 }
 
-class ScanCallback : public NimBLEAdvertisedDeviceCallbacks {
-    void onResult(NimBLEAdvertisedDevice* dev) override {
+class ScanCallback : public NimBLEScanCallbacks {
+    void onResult(const NimBLEAdvertisedDevice* dev) override {
         String mac = String(dev->getAddress().toString().c_str());
         String hash = sha256_prefix(mac);
         int rssi = dev->getRSSI();
@@ -51,7 +51,6 @@ class ScanCallback : public NimBLEAdvertisedDeviceCallbacks {
         std::lock_guard<std::mutex> lock(s_mutex);
         auto it = s_seen.find(hash);
         if (it == s_seen.end()) {
-            Serial.printf("[BLE] new device: %s (rssi %d)\n", hash.c_str(), rssi);
             s_seen[hash] = {rssi, 1};
         } else {
             it->second.rssi_sum += rssi;
@@ -68,16 +67,16 @@ static NimBLEScan* s_scan = nullptr;
 void ble_scanner_start() {
     NimBLEDevice::init("");
     s_scan = NimBLEDevice::getScan();
-    s_scan->setAdvertisedDeviceCallbacks(&s_callback, true); // true = allow duplicates
+    s_scan->setScanCallbacks(&s_callback, true); // true = allow duplicates
     s_scan->setActiveScan(false); // passive
     s_scan->setInterval(100);
     s_scan->setWindow(99);
     s_scan->setMaxResults(0); // don't store results — callback-only
-    s_scan->start(0, nullptr, false); // 0 = continuous
+    s_scan->start(0, false, false); // duration=0 (forever), isContinue=false, restart=false
 }
 
 void ble_scanner_pause() { if (s_scan) s_scan->stop(); }
-void ble_scanner_resume() { if (s_scan) s_scan->start(0, nullptr, false); }
+void ble_scanner_resume() { if (s_scan) s_scan->start(0, false, false); }
 
 void ble_scanner_deinit() {
     if (s_scan) s_scan->stop();
@@ -19,6 +19,15 @@
 #define BUTTON_PIN 37
 #define FACTORY_RESET_HOLD_MS 5000
 
+// BLE scanning disabled in production until the NimBLE-Arduino 1.4.2 timer
+// race is resolved. Symptom: FreeRTOS timer task dispatches an
+// os_callout_timer_cb whose callback fn is NULL, causing PC=0 fetch and
+// Historical note: NimBLE-Arduino 1.4.2 had an init/fire race in its FreeRTOS
+// callout porting layer that caused a NULL-fn dispatch (PC=0,
+// InstrFetchProhibited) within ~1s of boot when the camera task starved the
+// timer service. Fixed by upgrading to 2.x (see platformio.ini).
+#define BLE_SCANNING_ENABLED 1
+
 #define CAM_FPS 5
 #define CAM_INTERVAL_MS (1000 / CAM_FPS)
 #define REPORT_INTERVAL_S 3600
@@ -67,16 +76,7 @@ static void task_camera(void*) {
         if (camera_capture_96(frame)) {
             if (xSemaphoreTake(s_cv_mutex, pdMS_TO_TICKS(100)) == pdTRUE) {
                 CVResult r = cv_process(g_cv, frame, g_cfg.line_offset);
-                for (const auto& t : g_cv.tracks) {
-                    if (t.id > last_logged_track_id) {
-                        last_logged_track_id = t.id;
-                        Serial.printf("[CV] spawn id=%d y=%.1f\n", t.id, t.spawn_y);
-                    }
-                }
-                if (r.fg_count > 0) {
-                    Serial.printf("[F] n=%d y=%d..%d c=%.1f\n",
-                                  r.fg_count, r.fg_min_y, r.fg_max_y, r.fg_centroid_y);
-                }
+                (void)last_logged_track_id;
                 if (r.entries_delta) Serial.printf("[CV] entry +%d (total %d) first=%.1f min=%.1f max=%.1f last=%.1f dur=%d\n",
                     r.entries_delta, g_cv.entries,
                     r.fire_first_c, r.fire_min_c, r.fire_max_c, r.fire_last_c, r.fire_duration);
@@ -119,7 +119,9 @@ static void task_reporter(void*) {
         last_report_ts = now;
 
         // Deinit BLE to free ~25KB heap for SSL handshakes
+#if BLE_SCANNING_ENABLED
         ble_scanner_deinit();
+#endif
         led_set(true); // on = uploading
 
         CameraHourlyRecord cam_rec;
@@ -129,18 +131,26 @@ static void task_reporter(void*) {
             xSemaphoreGive(s_cv_mutex);
         } else {
             // Failed to acquire — skip this cycle, will report next hour
+#if BLE_SCANNING_ENABLED
             ble_scanner_reinit();
+#endif
             led_set(false);
             continue;
         }
 
+#if !BLE_SCANNING_ENABLED
+        BLEHourlyRecord ble_rec = {period_start, period_end, 0, 0};
+#else
         BLEHourlyRecord ble_rec = ble_scanner_collect(period_start, period_end);
+#endif
 
         reporter_submit_camera(g_cfg, cam_rec);
         reporter_submit_ble(g_cfg, ble_rec);
         bool hb_ok = reporter_heartbeat(g_cfg, millis() / 1000, WiFi.RSSI());
 
+#if BLE_SCANNING_ENABLED
         ble_scanner_reinit();
+#endif
         led_set(false);
 
         static uint8_t consecutive_misses = 0;
@@ -202,6 +212,11 @@ void setup() {
         ESP.restart();
     }
 
+    // Boot connect happens before net_guard registers its WiFi event handler,
+    // so the GOT_IP-driven DNS override there won't fire for this association.
+    // Pin DNS now; net_guard re-applies it on every subsequent reconnect.
+    net_guard_pin_dns();
+
     net_guard_start(g_cfg);
     led_set(false); // off = connected
 
@@ -220,17 +235,29 @@ void setup() {
 
     reporter_init();
 
+#if BLE_SCANNING_ENABLED
     ble_scanner_start();
+#endif
 
     // OTA update support
     ArduinoOTA.setHostname(g_cfg.device_id.c_str());
+#if !BLE_SCANNING_ENABLED
+    ArduinoOTA.onStart([]() { });
+#else
     ArduinoOTA.onStart([]() { ble_scanner_pause(); });
+#endif
     ArduinoOTA.onEnd([]() {
+#if BLE_SCANNING_ENABLED
         ble_scanner_resume();
+#endif
         event_log_write(EVT_REBOOT, REBOOT_OTA, 0);
         ESP.restart();
     });
+#if !BLE_SCANNING_ENABLED
+    ArduinoOTA.onError([](ota_error_t e) { });
+#else
     ArduinoOTA.onError([](ota_error_t e) { ble_scanner_resume(); });
+#endif
     ArduinoOTA.begin();
 
     s_cv_mutex = xSemaphoreCreateMutex();
@@ -26,28 +26,107 @@ static uint32_t now_ts() {
     return (uint32_t)time(nullptr);
 }
 
+// Last successfully resolved IP — used as a warm fallback if a subsequent
+// resolution fails. Never takes precedence over a fresh successful resolve.
+static IPAddress s_cached_api_ip;
+
+// Resolve the API host. Tries hostByName first; on failure falls back to the
+// last good resolution, then to the hardcoded fallback IP. Returns the IP via
+// out-param and a label describing where it came from for logging.
+static bool resolve_api_ip(IPAddress& out, const char*& source) {
+    IPAddress ip;
+    uint32_t r0 = millis();
+    bool ok = WiFi.hostByName(REPORTER_API_HOST_NAME, ip);
+    uint32_t elapsed = millis() - r0;
+    if (ok) {
+        s_cached_api_ip = ip;
+        out = ip;
+        source = "dns";
+        Serial.printf("[DNS] %s -> %s (%u ms)\n",
+                      REPORTER_API_HOST_NAME, ip.toString().c_str(), (unsigned)elapsed);
+        return true;
+    }
+    Serial.printf("[DNS] %s -> FAIL (%u ms)\n",
+                  REPORTER_API_HOST_NAME, (unsigned)elapsed);
+    net_guard_dump_dns("on-fail");
+    net_guard_pin_dns(); // re-assert in case something overwrote the table
+
+    if ((uint32_t)s_cached_api_ip != 0) {
+        out = s_cached_api_ip;
+        source = "cache";
+        return true;
+    }
+    if (out.fromString(REPORTER_API_FALLBACK_IP)) {
+        source = "fallback";
+        return true;
+    }
+    return false;
+}
+
+// Drains and parses the HTTP response status line. Returns the numeric status
+// code, or -1 on read timeout / malformed response.
+static int read_http_status(WiFiClient& client, uint32_t timeout_ms) {
+    uint32_t deadline = millis() + timeout_ms;
+    while (!client.available() && millis() < deadline) vTaskDelay(pdMS_TO_TICKS(10));
+    if (!client.available()) return -1;
+    String line = client.readStringUntil('\n');
+    line.trim();
+    // Format: "HTTP/1.1 200 OK"
+    int sp1 = line.indexOf(' ');
+    if (sp1 < 0) return -1;
+    int sp2 = line.indexOf(' ', sp1 + 1);
+    String code_str = (sp2 > 0) ? line.substring(sp1 + 1, sp2) : line.substring(sp1 + 1);
+    return code_str.toInt();
+}
+
 static bool post_json_once(const DeviceConfig& cfg, const char* path, const String& body) {
     uint32_t ts = now_ts();
     if (ts < 1700000000UL) return false;
     String sig = hmac_sign(cfg.hmac_secret, "POST", path, ts, body);
     if (sig.isEmpty()) return false;
 
-    HTTPClient http;
-    String url = String(REPORTER_API_HOST) + path;
-    http.begin(url);
-    http.setConnectTimeout(5000); // DNS + TCP connect
-    http.setTimeout(10000); // per-transaction response timeout
-    http.addHeader("Content-Type", "application/json");
-    http.addHeader("X-Device-Id", cfg.device_id);
-    http.addHeader("X-Timestamp", String(ts));
-    http.addHeader("X-Signature", sig);
+    IPAddress ip;
+    const char* ip_source = "?";
+    if (!resolve_api_ip(ip, ip_source)) {
+        Serial.printf("[HTTP] POST %s -> resolve-fail\n", path);
+        event_log_write(EVT_HTTP_FAIL, event_log_path_hash(path), (uint16_t)-1);
+        return false;
+    }
 
     uint32_t t0 = millis();
-    int code = http.POST(body);
+    WiFiClient client;
+    client.setTimeout(10); // seconds — read timeout
+    if (!client.connect(ip, REPORTER_API_PORT, 5000 /*ms connect timeout*/)) {
+        uint32_t elapsed = millis() - t0;
+        Serial.printf("[HTTP] connect %s:%u (%s) -> failed (%u ms)\n",
+                      ip.toString().c_str(), REPORTER_API_PORT, ip_source, (unsigned)elapsed);
+        event_log_write(EVT_HTTP_FAIL, event_log_path_hash(path), (uint16_t)-1);
+        return false;
+    }
+
+    // Manual HTTP/1.1 — gives us full control over the Host header so the
+    // server's vhost routing works even when we connect by IP.
+    client.printf("POST %s HTTP/1.1\r\n", path);
+    client.printf("Host: %s\r\n", REPORTER_API_HOST_NAME);
+    client.print ("Connection: close\r\n");
+    client.print ("Content-Type: application/json\r\n");
|
||||||
|
client.printf("Content-Length: %u\r\n", (unsigned)body.length());
|
||||||
|
client.printf("X-Device-Id: %s\r\n", cfg.device_id.c_str());
|
||||||
|
client.printf("X-Timestamp: %u\r\n", (unsigned)ts);
|
||||||
|
client.printf("X-Signature: %s\r\n", sig.c_str());
|
||||||
|
client.print ("\r\n");
|
||||||
|
client.print(body);
|
||||||
|
|
||||||
|
int code = read_http_status(client, 10000);
|
||||||
|
// Drain so the server can close cleanly.
|
||||||
|
while (client.connected() && client.available()) client.read();
|
||||||
|
client.stop();
|
||||||
|
|
||||||
uint32_t elapsed = millis() - t0;
|
uint32_t elapsed = millis() - t0;
|
||||||
http.end();
|
|
||||||
uint16_t phash = event_log_path_hash(path);
|
uint16_t phash = event_log_path_hash(path);
|
||||||
Serial.printf("[HTTP] POST %s -> %d (%u ms)\n", url.c_str(), code, (unsigned)elapsed);
|
Serial.printf("[HTTP] POST %s%s (%s %s) -> %d (%u ms)\n",
|
||||||
|
REPORTER_API_HOST_NAME, path, ip_source, ip.toString().c_str(),
|
||||||
|
code, (unsigned)elapsed);
|
||||||
if (code == 200) {
|
if (code == 200) {
|
||||||
event_log_write(EVT_HTTP_OK, phash, (uint16_t)((elapsed > 65535) ? 65535 : elapsed));
|
event_log_write(EVT_HTTP_OK, phash, (uint16_t)((elapsed > 65535) ? 65535 : elapsed));
|
||||||
return true;
|
return true;
|
||||||
|
|||||||
@@ -12,7 +12,12 @@ struct CameraHourlyRecord {
|
|||||||
};
|
};
|
||||||
|
|
||||||
static const int REPORTER_MAX_BUFFER = 24;
|
static const int REPORTER_MAX_BUFFER = 24;
|
||||||
static const char* REPORTER_API_HOST = "http://logs.research.bike";
|
static const char* REPORTER_API_HOST_NAME = "logs.research.bike";
|
||||||
|
static const uint16_t REPORTER_API_PORT = 80;
|
||||||
|
// Hardcoded fallback used when DNS fails (some customer networks intercept
|
||||||
|
// :53 with a transparent proxy that mangles responses). Update if the
|
||||||
|
// server's IP changes — but a successful hostByName() always wins over this.
|
||||||
|
static const char* REPORTER_API_FALLBACK_IP = "5.78.114.131";
|
||||||
|
|
||||||
void reporter_init();
|
void reporter_init();
|
||||||
void reporter_submit_camera(const DeviceConfig& cfg, const CameraHourlyRecord& rec);
|
void reporter_submit_camera(const DeviceConfig& cfg, const CameraHourlyRecord& rec);
|
||||||