Files
DoorCounter/docs/superpowers/plans/2026-05-01-ble-nimble-crash.md
Peter Woolery a585a56cff fix(firmware): upgrade NimBLE to 2.x + DNS fallback for unreliable resolvers
NimBLE-Arduino 1.4.2 had an init/fire race in its FreeRTOS callout porting
layer where os_callout_timer_cb dispatched a queued TimerHandle expiry
against a not-yet-initialized event (NULL fn pointer), causing PC=0
InstrFetchProhibited within ~1s of boot when the camera task starved the
timer service. Confirmed by ets_printf instrumentation. Upgrading to
^2.0.0 rewrites the porting layer and eliminates the race; verified clean
on the customer network for 1+ hour.

Also rolls in DNS-resilience work that surfaced the BLE crash during
provisioning: pin lwIP/esp-netif resolvers to 1.1.1.1/8.8.8.8 across DHCP
renewals, add three-tier resolver fallback in reporter with a hardcoded
IP of last resort, and switch to raw WiFiClient with manual Host header
to bypass HTTPClient's brittle DNS path.

Migration touches for NimBLE 2.x:
- NimBLEAdvertisedDeviceCallbacks -> NimBLEScanCallbacks
- onResult signature now takes const NimBLEAdvertisedDevice*
- setAdvertisedDeviceCallbacks -> setScanCallbacks
- start(0, nullptr, false) -> start(0, false, false)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:34:17 -07:00

11 KiB
Raw Blame History

BLE / NimBLE Timer-Callout Crash — Handoff

Date opened: 2026-05-01 Status: Resolved 2026-05-01 by upgrading h2zero/NimBLE-Arduino from ^1.4.2 to ^2.0.0 (firmware/platformio.ini:24). BLE scanning re-enabled via BLE_SCANNING_ENABLED 1 (firmware/src/main.cpp:30). Verified clean on customer network for 1+ hour with no panics. Goal: Re-enable BLE scanning without the device crashing within ~1s of boot.

Confirmed root cause: Instrumented os_callout_timer_cb with ets_printf and observed the very first callout fire on the direct-call path with both evq=NULL and fn=NULL, while the same co address later (post-init) showed valid evq and fn. Same callout struct reused — classic NimBLE 1.x callout init/fire race where the FreeRTOS TimerHandle_t had a queued expiry against a not-yet-initialized event. NimBLE 2.x rewrote the porting layer; the race is gone.

Migration touches (NimBLE 1.x → 2.x):

  • NimBLEAdvertisedDeviceCallbacksNimBLEScanCallbacks
  • onResult(NimBLEAdvertisedDevice*)onResult(const NimBLEAdvertisedDevice*)
  • setAdvertisedDeviceCallbacks(cb, true)setScanCallbacks(cb, true)
  • start(0, nullptr, false)start(0, false, false) (signature: duration, isContinue, restart)

BLE was working before today's customer-site provisioning trip. The crash is reliably reproducible on the current build at the customer location whenever BLE_SCANNING_ENABLED is set back to 1. It may or may not reproduce on a quieter network — the camera task's CPU-starvation pattern is shared, but the crash window's exact trigger is unconfirmed.


Symptom

Within ~1s of boot, after several cam_hal: EV-VSYNC-OVF lines from the camera driver:

Guru Meditation Error: Core 0 panic'ed (InstrFetchProhibited). Exception was unhandled.

Core 0 register dump:
PC      : 0x00000000   PS      : 0x00060630   A0      : 0x8009a9af   A1      : 0x3ffbd6e0
A2      : 0x3fff1ef8   A3      : 0x00000001   ...
A8      : 0x800f2ebc   ...
EXCCAUSE: 0x00000014   EXCVADDR: 0x00000000

Backtrace: 0xfffffffd:0x3ffbd6e0 0x4009a9ac:0x3ffbd700

Decoded with ~/.platformio/packages/toolchain-xtensa-esp32/bin/xtensa-esp32-elf-addr2line -e .pio/build/timercam/firmware.elf -pfiC 0x4009a9ac 0x400f2ebc:

prvProcessReceivedCommands at freertos/timers.c:852
  (inlined by) prvTimerTask at freertos/timers.c:600
os_callout_timer_cb at NimBLE-Arduino/.../npl_os_freertos.c:1742

PC=0 + EXCCAUSE=0x14 (InstrFetchProhibited) = jump-to-NULL. The FreeRTOS timer-service task is dispatching a NimBLE callout whose callback function pointer is NULL.

The relevant NimBLE source:

// firmware/.pio/libdeps/timercam/NimBLE-Arduino/src/nimble/porting/npl/freertos/src/npl_os_freertos.c:1729-1742
static void
os_callout_timer_cb(TimerHandle_t timer)
{
    struct ble_npl_callout *co;

    co = pvTimerGetTimerID(timer);
    assert(co);

    if (co->evq) {
        ble_npl_eventq_put(co->evq, &co->ev);
    } else {
        co->ev.fn(&co->ev);   // <-- co->ev.fn is NULL
    }
}

Either co->ev.fn is genuinely NULL on the direct-call path, OR — given the addr2line frame is a few lines off and the callsite is ambiguous — the FreeRTOS timer's own callback pointer (pxTimer->pxCallbackFunction) is NULL inside prvProcessReceivedCommands. Both indicate a callout/timer being freed or zeroed while the FreeRTOS timer service still has a command queued for it.


Environment

  • Board: M5Stack TimerCam-F (ESP32-D0WDQ6-V3, dual-core 240 MHz, 4MB flash).
  • BLE library: h2zero/NimBLE-Arduino@^1.4.2 (firmware/platformio.ini). 1.4.2 is end-of-life on the 1.x branch; 2.x exists with breaking API changes.
  • Camera: OV3660 via esp32-camera driver, 96×96 grayscale @ 5 FPS.
  • BLE scan: passive, low-overhead, hash-collected by firmware/src/ble_scanner.cpp.
  • Tasks: task_camera (core 1, prio 2, 8KB stack), task_reporter (core 0, prio 1, 8KB stack), Arduino loop (default).
  • The camera task triggers cam_hal: EV-VSYNC-OVF whenever frame capture overlaps another long operation — this consistently precedes the crash in logs.

What's been ruled out

  1. DNS / network code — entirely unrelated. DNS path works in production via the new fallback-IP machinery (firmware/src/reporter.cpp resolve_api_ip and firmware/src/reporter.h REPORTER_API_FALLBACK_IP). Do not regress this; it shipped with reports working at the customer site.
  2. Our BLE app code — the backtrace stays inside the FreeRTOS timer service and NimBLE's own porting layer; nothing in ble_scanner.cpp is on the call stack. The bug is in vendored NimBLE.
  3. Memory corruption from our sideA2 = 0x3fff1ef8 is a normal heap address, no obvious overrun pattern. Heap is healthy at the time (we'd see a different fault otherwise).
  4. Stack overflow — A1 = 0x3ffbd6e0 is well within the FreeRTOS timer-service task's stack range; no canary smash log.

What changed today

File Change Keep?
firmware/src/main.cpp Added BLE_SCANNING_ENABLED 0 gate; all ble_scanner_* callsites compile out; BLEHourlyRecord zero-stubbed when off Keep until crash fixed; flip to 1 to reproduce
firmware/src/main.cpp Removed verbose [F]/[CV] spawn per-frame logging; kept entry/exit + heartbeat Keep
firmware/src/ble_scanner.cpp Removed [BLE] new device: per-discovery log Keep
firmware/src/reporter.{h,cpp} DNS resolution with fallback IP, raw WiFiClient HTTP, manual Host: header Keep — production fix
firmware/lib/net_guard/net_guard.{h,cpp} DNS pin to 1.1.1.1/8.8.8.8 at lwIP + esp-netif layers; net_guard_dump_dns diagnostic Keep

Reproduction

  1. cd firmware && pio run -e timercam.
  2. Edit firmware/src/main.cpp, set #define BLE_SCANNING_ENABLED 1. Rebuild.
  3. Flash a TimerCam: python tools/flash_device.py --port /dev/ttyUSB0 --device-id dc-XXXX --location-id <loc> --hmac-secret <secret> --wifi-ssid "<ssid>" --wifi-password "<pw>".
  4. pio device monitor --port /dev/ttyUSB0 --baud 115200.
  5. Wait ≤30s. Expect the Guru Meditation Error: Core 0 panic'ed (InstrFetchProhibited) traceback above.

Crash is deterministic on the customer's network (Elly-Fi). Worth retesting on a quiet desk network — if it doesn't repro there, the trigger is camera-task starvation interacting with NimBLE timers, not a pure NimBLE bug.

To decode any future crash backtrace:

~/.platformio/packages/toolchain-xtensa-esp32/bin/xtensa-esp32-elf-addr2line \
  -e firmware/.pio/build/timercam/firmware.elf -pfiC <addr1> <addr2> ...

Investigation paths, in order of effort/confidence

1. Confirm the failing call site (cheap, do this first)

The addr2line line numbers can be off by ±3 due to inlining. Add a temporary Serial.printf patch to npl_os_freertos.c os_callout_timer_cb to log co, co->evq, co->ev.fn on entry. Reproduce. Then we know with certainty whether co->ev.fn is NULL on the direct-call path or whether this is an FreeRTOS-level issue (queued command for a deleted timer).

If evq != NULL and we still crash, the NULL is in the queued event dispatcher (a different code path; pivot the investigation).

2. Try upgrading NimBLE-Arduino to 2.x (medium effort, likely-fix)

platformio.ini has h2zero/NimBLE-Arduino@^1.4.2. 2.x rewrote the porting layer significantly. Breaking API changes — NimBLEAdvertisedDeviceCallbacks was renamed/restructured. Touch points: firmware/src/ble_scanner.cpp (the only file that uses NimBLE).

Try: pin ^2.0.0, fix the API breakage in ble_scanner.cpp (it's <100 lines). If 2.x crashes too, the issue is independent of NimBLE version → pivot to (3) or (4).

3. Reduce camera-task starvation (cheap, may be sufficient)

The EV-VSYNC-OVF lines are the canary. The camera task pins core 1 at priority 2 doing CV processing every 200ms. NimBLE host task runs on core 0 by default but the FreeRTOS timer service task is core-agnostic and may be starved during long CV passes that hold a mutex.

Things to try in firmware/src/main.cpp:

  • Lower CAM_FPS from 5 to 3, see if VSYNC-OVF still appears.
  • Move CV processing off the capture path (capture into a queue, process at lower priority).
  • Raise FreeRTOS timer-service task priority via configTIMER_TASK_PRIORITY (sdkconfig).
  • Confirm NimBLE host task pinning — CONFIG_BT_NIMBLE_PINNED_TO_CORE should be 0 or 1 (not unpinned).

4. Local NULL-guard patch (last resort, masks the bug)

If upgrade is blocked and starvation reduction isn't enough, patch the vendored source:

// npl_os_freertos.c:1740
} else {
    if (co->ev.fn) co->ev.fn(&co->ev);
}

This silences the crash but drops the dropped event. The dropped events are likely scan-result deliveries; we'd undercount BLE devices but not crash. Acceptable as a stopgap with a // TODO: remove when NimBLE upgraded and a note in this doc.

Caveat: vendored library files in .pio/libdeps/ get blown away by clean builds. Either copy NimBLE into firmware/lib/ and pin it (vendored), or use lib_archive + a post-install script. Don't ship a build that depends on an unpinned hand-edit.

5. Replace BLE stack (high effort)

If 2.x also crashes and starvation reduction doesn't help, switch to the IDF-native bluedroid stack via the Arduino-ESP32 BLEDevice API. Larger memory footprint (~30KB more heap) but a different lifecycle model — won't share NimBLE's bug.


Constraints / things not to break

  • firmware/src/reporter.cpp DNS path with REPORTER_API_FALLBACK_IP — production fix, must keep working. Do not regress to HTTPClient.
  • BLE_SCANNING_ENABLED 0 is the shipping default until this is resolved. Devices in the field rely on this; flip to 1 only in dev builds.
  • firmware/lib/net_guard/net_guard.cpp net_guard_pin_dns() is called both at boot and on every WiFi reconnect; if reorganizing net_guard, preserve both call sites.
  • The ble_scanner module supports ble_scanner_pause/resume for OTA — verify it still works after any NimBLE upgrade (ArduinoOTA.onStart hook in main.cpp:248).

Open questions

  • Does the crash repro on a quiet network with no EV-VSYNC-OVF? (Determines whether starvation is necessary vs sufficient.)
  • Was BLE working in a previous build, and on which NimBLE version? Earliest BLE-related commit traced to is well before today; binary search across firmware commits with BLE enabled would identify the regression boundary if it's our code.
  • Does the customer site have an unusual RF environment (very dense BLE) that increases the callout-churn rate, making the race more likely? Worth a nimble_scan_event count log during a 60s capture window.

Quick verification once you think it's fixed

  1. Set BLE_SCANNING_ENABLED 1, rebuild, flash.
  2. Run for at least 10 minutes on the customer network — the original crash hit within ~1s, so 10 min with no panic is strong evidence.
  3. Confirm a successful hourly cycle: [CV] entry/exit, then [HTTP] POST .../events/batch ... -> 200, BLE record with non-zero unique_devices.
  4. Run a second device side-by-side; confirm no cross-device interference.

When done, set BLE_SCANNING_ENABLED 1 as the default and remove the gate (keep the comment block as institutional memory of the bug).