fix(firmware): upgrade NimBLE to 2.x + DNS fallback for unreliable resolvers
NimBLE-Arduino 1.4.2 had an init/fire race in its FreeRTOS callout porting layer where os_callout_timer_cb dispatched a queued TimerHandle expiry against a not-yet-initialized event (NULL fn pointer), causing PC=0 InstrFetchProhibited within ~1s of boot when the camera task starved the timer service. Confirmed by ets_printf instrumentation. Upgrading to ^2.0.0 rewrites the porting layer and eliminates the race; verified clean on the customer network for 1+ hour. Also rolls in DNS-resilience work that surfaced the BLE crash during provisioning: pin lwIP/esp-netif resolvers to 1.1.1.1/8.8.8.8 across DHCP renewals, add three-tier resolver fallback in reporter with a hardcoded IP of last resort, and switch to raw WiFiClient with manual Host header to bypass HTTPClient's brittle DNS path. Migration touches for NimBLE 2.x: - NimBLEAdvertisedDeviceCallbacks -> NimBLEScanCallbacks - onResult signature now takes const NimBLEAdvertisedDevice* - setAdvertisedDeviceCallbacks -> setScanCallbacks - start(0, nullptr, false) -> start(0, false, false) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
1032
docs/superpowers/plans/2026-04-23-network-resilience.md
Normal file
1032
docs/superpowers/plans/2026-04-23-network-resilience.md
Normal file
File diff suppressed because it is too large
Load Diff
189
docs/superpowers/plans/2026-05-01-ble-nimble-crash.md
Normal file
189
docs/superpowers/plans/2026-05-01-ble-nimble-crash.md
Normal file
@@ -0,0 +1,189 @@
|
||||
# BLE / NimBLE Timer-Callout Crash — Handoff
|
||||
|
||||
**Date opened:** 2026-05-01
|
||||
**Status:** Resolved 2026-05-01 by upgrading `h2zero/NimBLE-Arduino` from `^1.4.2` to `^2.0.0` (`firmware/platformio.ini:24`). BLE scanning re-enabled via `BLE_SCANNING_ENABLED 1` (`firmware/src/main.cpp:30`). Verified clean on customer network for 1+ hour with no panics.
|
||||
**Goal:** Re-enable BLE scanning without the device crashing within ~1s of boot.
|
||||
|
||||
**Confirmed root cause:** Instrumented `os_callout_timer_cb` with `ets_printf` and observed the very first callout fire on the direct-call path with both `evq=NULL` and `fn=NULL`, while the same `co` address later (post-init) showed valid `evq` and `fn`. Same callout struct reused — classic NimBLE 1.x callout init/fire race where the FreeRTOS `TimerHandle_t` had a queued expiry against a not-yet-initialized event. NimBLE 2.x rewrote the porting layer; the race is gone.
|
||||
|
||||
**Migration touches (NimBLE 1.x → 2.x):**
|
||||
- `NimBLEAdvertisedDeviceCallbacks` → `NimBLEScanCallbacks`
|
||||
- `onResult(NimBLEAdvertisedDevice*)` → `onResult(const NimBLEAdvertisedDevice*)`
|
||||
- `setAdvertisedDeviceCallbacks(cb, true)` → `setScanCallbacks(cb, true)`
|
||||
- `start(0, nullptr, false)` → `start(0, false, false)` (signature: `duration, isContinue, restart`)
|
||||
|
||||
BLE was working before today's customer-site provisioning trip. The crash is reliably reproducible on the current build at the customer location whenever `BLE_SCANNING_ENABLED` is set back to `1`. It may or may not reproduce on a quieter network — the camera task's CPU-starvation pattern is shared, but the crash window's exact trigger is unconfirmed.
|
||||
|
||||
---
|
||||
|
||||
## Symptom
|
||||
|
||||
Within ~1s of boot, after several `cam_hal: EV-VSYNC-OVF` lines from the camera driver:
|
||||
|
||||
```
|
||||
Guru Meditation Error: Core 0 panic'ed (InstrFetchProhibited). Exception was unhandled.
|
||||
|
||||
Core 0 register dump:
|
||||
PC : 0x00000000 PS : 0x00060630 A0 : 0x8009a9af A1 : 0x3ffbd6e0
|
||||
A2 : 0x3fff1ef8 A3 : 0x00000001 ...
|
||||
A8 : 0x800f2ebc ...
|
||||
EXCCAUSE: 0x00000014 EXCVADDR: 0x00000000
|
||||
|
||||
Backtrace: 0xfffffffd:0x3ffbd6e0 0x4009a9ac:0x3ffbd700
|
||||
```
|
||||
|
||||
Decoded with `~/.platformio/packages/toolchain-xtensa-esp32/bin/xtensa-esp32-elf-addr2line -e .pio/build/timercam/firmware.elf -pfiC 0x4009a9ac 0x400f2ebc`:
|
||||
|
||||
```
|
||||
prvProcessReceivedCommands at freertos/timers.c:852
|
||||
(inlined by) prvTimerTask at freertos/timers.c:600
|
||||
os_callout_timer_cb at NimBLE-Arduino/.../npl_os_freertos.c:1742
|
||||
```
|
||||
|
||||
`PC=0` + `EXCCAUSE=0x14` (InstrFetchProhibited) = jump-to-NULL. The FreeRTOS timer-service task is dispatching a NimBLE callout whose callback function pointer is NULL.
|
||||
|
||||
The relevant NimBLE source:
|
||||
|
||||
```c
|
||||
// firmware/.pio/libdeps/timercam/NimBLE-Arduino/src/nimble/porting/npl/freertos/src/npl_os_freertos.c:1729-1742
|
||||
static void
|
||||
os_callout_timer_cb(TimerHandle_t timer)
|
||||
{
|
||||
struct ble_npl_callout *co;
|
||||
|
||||
co = pvTimerGetTimerID(timer);
|
||||
assert(co);
|
||||
|
||||
if (co->evq) {
|
||||
ble_npl_eventq_put(co->evq, &co->ev);
|
||||
} else {
|
||||
co->ev.fn(&co->ev); // <-- co->ev.fn is NULL
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Either `co->ev.fn` is genuinely NULL on the direct-call path, OR — given the addr2line frame is a few lines off and the callsite is ambiguous — the FreeRTOS timer's own callback pointer (`pxTimer->pxCallbackFunction`) is NULL inside `prvProcessReceivedCommands`. Both indicate a callout/timer being freed or zeroed while the FreeRTOS timer service still has a command queued for it.
|
||||
|
||||
---
|
||||
|
||||
## Environment
|
||||
|
||||
- Board: M5Stack TimerCam-F (ESP32-D0WDQ6-V3, dual-core 240 MHz, 4MB flash).
|
||||
- BLE library: `h2zero/NimBLE-Arduino@^1.4.2` (`firmware/platformio.ini`). 1.4.2 is end-of-life on the 1.x branch; 2.x exists with breaking API changes.
|
||||
- Camera: OV3660 via `esp32-camera` driver, 96×96 grayscale @ 5 FPS.
|
||||
- BLE scan: passive, low-overhead, hash-collected by `firmware/src/ble_scanner.cpp`.
|
||||
- Tasks: `task_camera` (core 1, prio 2, 8KB stack), `task_reporter` (core 0, prio 1, 8KB stack), Arduino loop (default).
|
||||
- The camera task triggers `cam_hal: EV-VSYNC-OVF` whenever frame capture overlaps another long operation — this consistently precedes the crash in logs.
|
||||
|
||||
---
|
||||
|
||||
## What's been ruled out
|
||||
|
||||
1. **DNS / network code** — entirely unrelated. DNS path works in production via the new fallback-IP machinery (`firmware/src/reporter.cpp` `resolve_api_ip` and `firmware/src/reporter.h` `REPORTER_API_FALLBACK_IP`). Do not regress this; it shipped with reports working at the customer site.
|
||||
2. **Our BLE app code** — the backtrace stays inside the FreeRTOS timer service and NimBLE's own porting layer; nothing in `ble_scanner.cpp` is on the call stack. The bug is in vendored NimBLE.
|
||||
3. **Memory corruption from our side** — `A2 = 0x3fff1ef8` is a normal heap address, no obvious overrun pattern. Heap is healthy at the time (we'd see a different fault otherwise).
|
||||
4. **Stack overflow** — A1 = 0x3ffbd6e0 is well within the FreeRTOS timer-service task's stack range; no canary smash log.
|
||||
|
||||
---
|
||||
|
||||
## What changed today
|
||||
|
||||
| File | Change | Keep? |
|
||||
|---|---|---|
|
||||
| `firmware/src/main.cpp` | Added `BLE_SCANNING_ENABLED 0` gate; all `ble_scanner_*` callsites compile out; `BLEHourlyRecord` zero-stubbed when off | Keep until crash fixed; flip to `1` to reproduce |
|
||||
| `firmware/src/main.cpp` | Removed verbose `[F]`/`[CV] spawn` per-frame logging; kept entry/exit + heartbeat | Keep |
|
||||
| `firmware/src/ble_scanner.cpp` | Removed `[BLE] new device:` per-discovery log | Keep |
|
||||
| `firmware/src/reporter.{h,cpp}` | DNS resolution with fallback IP, raw `WiFiClient` HTTP, manual `Host:` header | Keep — production fix |
|
||||
| `firmware/lib/net_guard/net_guard.{h,cpp}` | DNS pin to 1.1.1.1/8.8.8.8 at lwIP + esp-netif layers; `net_guard_dump_dns` diagnostic | Keep |
|
||||
|
||||
---
|
||||
|
||||
## Reproduction
|
||||
|
||||
1. `cd firmware && pio run -e timercam`.
|
||||
2. Edit `firmware/src/main.cpp`, set `#define BLE_SCANNING_ENABLED 1`. Rebuild.
|
||||
3. Flash a TimerCam: `python tools/flash_device.py --port /dev/ttyUSB0 --device-id dc-XXXX --location-id <loc> --hmac-secret <secret> --wifi-ssid "<ssid>" --wifi-password "<pw>"`.
|
||||
4. `pio device monitor --port /dev/ttyUSB0 --baud 115200`.
|
||||
5. Wait ≤30s. Expect the `Guru Meditation Error: Core 0 panic'ed (InstrFetchProhibited)` traceback above.
|
||||
|
||||
Crash is **deterministic** on the customer's network (Elly-Fi). Worth retesting on a quiet desk network — if it doesn't repro there, the trigger is camera-task starvation interacting with NimBLE timers, not a pure NimBLE bug.
|
||||
|
||||
To decode any future crash backtrace:
|
||||
|
||||
```sh
|
||||
~/.platformio/packages/toolchain-xtensa-esp32/bin/xtensa-esp32-elf-addr2line \
|
||||
-e firmware/.pio/build/timercam/firmware.elf -pfiC <addr1> <addr2> ...
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Investigation paths, in order of effort/confidence
|
||||
|
||||
### 1. Confirm the failing call site (cheap, do this first)
|
||||
|
||||
The addr2line line numbers can be off by ±3 due to inlining. Add a temporary `Serial.printf` patch to `npl_os_freertos.c` `os_callout_timer_cb` to log `co`, `co->evq`, `co->ev.fn` on entry. Reproduce. Then we know with certainty whether `co->ev.fn` is NULL on the direct-call path or whether this is an FreeRTOS-level issue (queued command for a deleted timer).
|
||||
|
||||
If `evq != NULL` and we still crash, the NULL is in the queued event dispatcher (a different code path; pivot the investigation).
|
||||
|
||||
### 2. Try upgrading NimBLE-Arduino to 2.x (medium effort, likely-fix)
|
||||
|
||||
`platformio.ini` has `h2zero/NimBLE-Arduino@^1.4.2`. 2.x rewrote the porting layer significantly. Breaking API changes — `NimBLEAdvertisedDeviceCallbacks` was renamed/restructured. Touch points: `firmware/src/ble_scanner.cpp` (the only file that uses NimBLE).
|
||||
|
||||
Try: pin `^2.0.0`, fix the API breakage in `ble_scanner.cpp` (it's <100 lines). If 2.x crashes too, the issue is independent of NimBLE version → pivot to (3) or (4).
|
||||
|
||||
### 3. Reduce camera-task starvation (cheap, may be sufficient)
|
||||
|
||||
The `EV-VSYNC-OVF` lines are the canary. The camera task pins core 1 at priority 2 doing CV processing every 200ms. NimBLE host task runs on core 0 by default but the FreeRTOS timer service task is core-agnostic and may be starved during long CV passes that hold a mutex.
|
||||
|
||||
Things to try in `firmware/src/main.cpp`:
|
||||
- Lower `CAM_FPS` from 5 to 3, see if VSYNC-OVF still appears.
|
||||
- Move CV processing off the capture path (capture into a queue, process at lower priority).
|
||||
- Raise FreeRTOS timer-service task priority via `configTIMER_TASK_PRIORITY` (sdkconfig).
|
||||
- Confirm NimBLE host task pinning — `CONFIG_BT_NIMBLE_PINNED_TO_CORE` should be 0 or 1 (not unpinned).
|
||||
|
||||
### 4. Local NULL-guard patch (last resort, masks the bug)
|
||||
|
||||
If upgrade is blocked and starvation reduction isn't enough, patch the vendored source:
|
||||
|
||||
```c
|
||||
// npl_os_freertos.c:1740
|
||||
} else {
|
||||
if (co->ev.fn) co->ev.fn(&co->ev);
|
||||
}
|
||||
```
|
||||
|
||||
This silences the crash but drops the dropped event. The dropped events are likely scan-result deliveries; we'd undercount BLE devices but not crash. Acceptable as a stopgap with a `// TODO: remove when NimBLE upgraded` and a note in this doc.
|
||||
|
||||
Caveat: vendored library files in `.pio/libdeps/` get blown away by clean builds. Either copy NimBLE into `firmware/lib/` and pin it (vendored), or use `lib_archive` + a post-install script. Don't ship a build that depends on an unpinned hand-edit.
|
||||
|
||||
### 5. Replace BLE stack (high effort)
|
||||
|
||||
If 2.x also crashes and starvation reduction doesn't help, switch to the IDF-native bluedroid stack via the Arduino-ESP32 `BLEDevice` API. Larger memory footprint (~30KB more heap) but a different lifecycle model — won't share NimBLE's bug.
|
||||
|
||||
---
|
||||
|
||||
## Constraints / things not to break
|
||||
|
||||
- `firmware/src/reporter.cpp` DNS path with `REPORTER_API_FALLBACK_IP` — production fix, must keep working. Do not regress to `HTTPClient`.
|
||||
- `BLE_SCANNING_ENABLED 0` is the **shipping default** until this is resolved. Devices in the field rely on this; flip to `1` only in dev builds.
|
||||
- `firmware/lib/net_guard/net_guard.cpp` `net_guard_pin_dns()` is called both at boot and on every WiFi reconnect; if reorganizing net_guard, preserve both call sites.
|
||||
- The `ble_scanner` module supports `ble_scanner_pause`/`resume` for OTA — verify it still works after any NimBLE upgrade (`ArduinoOTA.onStart` hook in `main.cpp:248`).
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
- Does the crash repro on a quiet network with no `EV-VSYNC-OVF`? (Determines whether starvation is necessary vs sufficient.)
|
||||
- Was BLE working in a previous build, and on which NimBLE version? Earliest BLE-related commit traced to is well before today; binary search across firmware commits with BLE enabled would identify the regression boundary if it's our code.
|
||||
- Does the customer site have an unusual RF environment (very dense BLE) that increases the callout-churn rate, making the race more likely? Worth a `nimble_scan_event` count log during a 60s capture window.
|
||||
|
||||
---
|
||||
|
||||
## Quick verification once you think it's fixed
|
||||
|
||||
1. Set `BLE_SCANNING_ENABLED 1`, rebuild, flash.
|
||||
2. Run for at least 10 minutes on the customer network — the original crash hit within ~1s, so 10 min with no panic is strong evidence.
|
||||
3. Confirm a successful hourly cycle: `[CV] entry/exit`, then `[HTTP] POST .../events/batch ... -> 200`, BLE record with non-zero `unique_devices`.
|
||||
4. Run a second device side-by-side; confirm no cross-device interference.
|
||||
|
||||
When done, set `BLE_SCANNING_ENABLED 1` as the default and remove the gate (keep the comment block as institutional memory of the bug).
|
||||
Reference in New Issue
Block a user