NimBLE-Arduino 1.4.2 had an init/fire race in its FreeRTOS callout porting layer where os_callout_timer_cb dispatched a queued TimerHandle expiry against a not-yet-initialized event (NULL fn pointer), causing PC=0 InstrFetchProhibited within ~1s of boot when the camera task starved the timer service. Confirmed by ets_printf instrumentation. Upgrading to ^2.0.0 rewrites the porting layer and eliminates the race; verified clean on the customer network for 1+ hour. Also rolls in DNS-resilience work that surfaced the BLE crash during provisioning: pin lwIP/esp-netif resolvers to 1.1.1.1/8.8.8.8 across DHCP renewals, add three-tier resolver fallback in reporter with a hardcoded IP of last resort, and switch to raw WiFiClient with manual Host header to bypass HTTPClient's brittle DNS path. Migration touches for NimBLE 2.x: - NimBLEAdvertisedDeviceCallbacks -> NimBLEScanCallbacks - onResult signature now takes const NimBLEAdvertisedDevice* - setAdvertisedDeviceCallbacks -> setScanCallbacks - start(0, nullptr, false) -> start(0, false, false) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
11 KiB
BLE / NimBLE Timer-Callout Crash — Handoff
Date opened: 2026-05-01
Status: Resolved 2026-05-01 by upgrading h2zero/NimBLE-Arduino from ^1.4.2 to ^2.0.0 (firmware/platformio.ini:24). BLE scanning re-enabled via BLE_SCANNING_ENABLED 1 (firmware/src/main.cpp:30). Verified clean on customer network for 1+ hour with no panics.
Goal: Re-enable BLE scanning without the device crashing within ~1s of boot.
Confirmed root cause: Instrumented os_callout_timer_cb with ets_printf and observed the very first callout fire on the direct-call path with both evq=NULL and fn=NULL, while the same co address later (post-init) showed valid evq and fn. Same callout struct reused — classic NimBLE 1.x callout init/fire race where the FreeRTOS TimerHandle_t had a queued expiry against a not-yet-initialized event. NimBLE 2.x rewrote the porting layer; the race is gone.
Migration touches (NimBLE 1.x → 2.x):
NimBLEAdvertisedDeviceCallbacks→NimBLEScanCallbacksonResult(NimBLEAdvertisedDevice*)→onResult(const NimBLEAdvertisedDevice*)setAdvertisedDeviceCallbacks(cb, true)→setScanCallbacks(cb, true)start(0, nullptr, false)→start(0, false, false)(signature:duration, isContinue, restart)
BLE was working before today's customer-site provisioning trip. The crash is reliably reproducible on the current build at the customer location whenever BLE_SCANNING_ENABLED is set back to 1. It may or may not reproduce on a quieter network — the camera task's CPU-starvation pattern is shared, but the crash window's exact trigger is unconfirmed.
Symptom
Within ~1s of boot, after several cam_hal: EV-VSYNC-OVF lines from the camera driver:
Guru Meditation Error: Core 0 panic'ed (InstrFetchProhibited). Exception was unhandled.
Core 0 register dump:
PC : 0x00000000 PS : 0x00060630 A0 : 0x8009a9af A1 : 0x3ffbd6e0
A2 : 0x3fff1ef8 A3 : 0x00000001 ...
A8 : 0x800f2ebc ...
EXCCAUSE: 0x00000014 EXCVADDR: 0x00000000
Backtrace: 0xfffffffd:0x3ffbd6e0 0x4009a9ac:0x3ffbd700
Decoded with ~/.platformio/packages/toolchain-xtensa-esp32/bin/xtensa-esp32-elf-addr2line -e .pio/build/timercam/firmware.elf -pfiC 0x4009a9ac 0x400f2ebc:
prvProcessReceivedCommands at freertos/timers.c:852
(inlined by) prvTimerTask at freertos/timers.c:600
os_callout_timer_cb at NimBLE-Arduino/.../npl_os_freertos.c:1742
PC=0 + EXCCAUSE=0x14 (InstrFetchProhibited) = jump-to-NULL. The FreeRTOS timer-service task is dispatching a NimBLE callout whose callback function pointer is NULL.
The relevant NimBLE source:
// firmware/.pio/libdeps/timercam/NimBLE-Arduino/src/nimble/porting/npl/freertos/src/npl_os_freertos.c:1729-1742
static void
os_callout_timer_cb(TimerHandle_t timer)
{
struct ble_npl_callout *co;
co = pvTimerGetTimerID(timer);
assert(co);
if (co->evq) {
ble_npl_eventq_put(co->evq, &co->ev);
} else {
co->ev.fn(&co->ev); // <-- co->ev.fn is NULL
}
}
Either co->ev.fn is genuinely NULL on the direct-call path, OR — given the addr2line frame is a few lines off and the callsite is ambiguous — the FreeRTOS timer's own callback pointer (pxTimer->pxCallbackFunction) is NULL inside prvProcessReceivedCommands. Both indicate a callout/timer being freed or zeroed while the FreeRTOS timer service still has a command queued for it.
Environment
- Board: M5Stack TimerCam-F (ESP32-D0WDQ6-V3, dual-core 240 MHz, 4MB flash).
- BLE library:
h2zero/NimBLE-Arduino@^1.4.2(firmware/platformio.ini). 1.4.2 is end-of-life on the 1.x branch; 2.x exists with breaking API changes. - Camera: OV3660 via
esp32-cameradriver, 96×96 grayscale @ 5 FPS. - BLE scan: passive, low-overhead, hash-collected by
firmware/src/ble_scanner.cpp. - Tasks:
task_camera(core 1, prio 2, 8KB stack),task_reporter(core 0, prio 1, 8KB stack), Arduino loop (default). - The camera task triggers
cam_hal: EV-VSYNC-OVFwhenever frame capture overlaps another long operation — this consistently precedes the crash in logs.
What's been ruled out
- DNS / network code — entirely unrelated. DNS path works in production via the new fallback-IP machinery (
firmware/src/reporter.cppresolve_api_ipandfirmware/src/reporter.hREPORTER_API_FALLBACK_IP). Do not regress this; it shipped with reports working at the customer site. - Our BLE app code — the backtrace stays inside the FreeRTOS timer service and NimBLE's own porting layer; nothing in
ble_scanner.cppis on the call stack. The bug is in vendored NimBLE. - Memory corruption from our side —
A2 = 0x3fff1ef8is a normal heap address, no obvious overrun pattern. Heap is healthy at the time (we'd see a different fault otherwise). - Stack overflow — A1 = 0x3ffbd6e0 is well within the FreeRTOS timer-service task's stack range; no canary smash log.
What changed today
| File | Change | Keep? |
|---|---|---|
firmware/src/main.cpp |
Added BLE_SCANNING_ENABLED 0 gate; all ble_scanner_* callsites compile out; BLEHourlyRecord zero-stubbed when off |
Keep until crash fixed; flip to 1 to reproduce |
firmware/src/main.cpp |
Removed verbose [F]/[CV] spawn per-frame logging; kept entry/exit + heartbeat |
Keep |
firmware/src/ble_scanner.cpp |
Removed [BLE] new device: per-discovery log |
Keep |
firmware/src/reporter.{h,cpp} |
DNS resolution with fallback IP, raw WiFiClient HTTP, manual Host: header |
Keep — production fix |
firmware/lib/net_guard/net_guard.{h,cpp} |
DNS pin to 1.1.1.1/8.8.8.8 at lwIP + esp-netif layers; net_guard_dump_dns diagnostic |
Keep |
Reproduction
cd firmware && pio run -e timercam.- Edit
firmware/src/main.cpp, set#define BLE_SCANNING_ENABLED 1. Rebuild. - Flash a TimerCam:
python tools/flash_device.py --port /dev/ttyUSB0 --device-id dc-XXXX --location-id <loc> --hmac-secret <secret> --wifi-ssid "<ssid>" --wifi-password "<pw>". pio device monitor --port /dev/ttyUSB0 --baud 115200.- Wait ≤30s. Expect the
Guru Meditation Error: Core 0 panic'ed (InstrFetchProhibited)traceback above.
Crash is deterministic on the customer's network (Elly-Fi). Worth retesting on a quiet desk network — if it doesn't repro there, the trigger is camera-task starvation interacting with NimBLE timers, not a pure NimBLE bug.
To decode any future crash backtrace:
~/.platformio/packages/toolchain-xtensa-esp32/bin/xtensa-esp32-elf-addr2line \
-e firmware/.pio/build/timercam/firmware.elf -pfiC <addr1> <addr2> ...
Investigation paths, in order of effort/confidence
1. Confirm the failing call site (cheap, do this first)
The addr2line line numbers can be off by ±3 due to inlining. Add a temporary Serial.printf patch to npl_os_freertos.c os_callout_timer_cb to log co, co->evq, co->ev.fn on entry. Reproduce. Then we know with certainty whether co->ev.fn is NULL on the direct-call path or whether this is an FreeRTOS-level issue (queued command for a deleted timer).
If evq != NULL and we still crash, the NULL is in the queued event dispatcher (a different code path; pivot the investigation).
2. Try upgrading NimBLE-Arduino to 2.x (medium effort, likely-fix)
platformio.ini has h2zero/NimBLE-Arduino@^1.4.2. 2.x rewrote the porting layer significantly. Breaking API changes — NimBLEAdvertisedDeviceCallbacks was renamed/restructured. Touch points: firmware/src/ble_scanner.cpp (the only file that uses NimBLE).
Try: pin ^2.0.0, fix the API breakage in ble_scanner.cpp (it's <100 lines). If 2.x crashes too, the issue is independent of NimBLE version → pivot to (3) or (4).
3. Reduce camera-task starvation (cheap, may be sufficient)
The EV-VSYNC-OVF lines are the canary. The camera task pins core 1 at priority 2 doing CV processing every 200ms. NimBLE host task runs on core 0 by default but the FreeRTOS timer service task is core-agnostic and may be starved during long CV passes that hold a mutex.
Things to try in firmware/src/main.cpp:
- Lower
CAM_FPSfrom 5 to 3, see if VSYNC-OVF still appears. - Move CV processing off the capture path (capture into a queue, process at lower priority).
- Raise FreeRTOS timer-service task priority via
configTIMER_TASK_PRIORITY(sdkconfig). - Confirm NimBLE host task pinning —
CONFIG_BT_NIMBLE_PINNED_TO_COREshould be 0 or 1 (not unpinned).
4. Local NULL-guard patch (last resort, masks the bug)
If upgrade is blocked and starvation reduction isn't enough, patch the vendored source:
// npl_os_freertos.c:1740
} else {
if (co->ev.fn) co->ev.fn(&co->ev);
}
This silences the crash but drops the dropped event. The dropped events are likely scan-result deliveries; we'd undercount BLE devices but not crash. Acceptable as a stopgap with a // TODO: remove when NimBLE upgraded and a note in this doc.
Caveat: vendored library files in .pio/libdeps/ get blown away by clean builds. Either copy NimBLE into firmware/lib/ and pin it (vendored), or use lib_archive + a post-install script. Don't ship a build that depends on an unpinned hand-edit.
5. Replace BLE stack (high effort)
If 2.x also crashes and starvation reduction doesn't help, switch to the IDF-native bluedroid stack via the Arduino-ESP32 BLEDevice API. Larger memory footprint (~30KB more heap) but a different lifecycle model — won't share NimBLE's bug.
Constraints / things not to break
firmware/src/reporter.cppDNS path withREPORTER_API_FALLBACK_IP— production fix, must keep working. Do not regress toHTTPClient.BLE_SCANNING_ENABLED 0is the shipping default until this is resolved. Devices in the field rely on this; flip to1only in dev builds.firmware/lib/net_guard/net_guard.cppnet_guard_pin_dns()is called both at boot and on every WiFi reconnect; if reorganizing net_guard, preserve both call sites.- The
ble_scannermodule supportsble_scanner_pause/resumefor OTA — verify it still works after any NimBLE upgrade (ArduinoOTA.onStarthook inmain.cpp:248).
Open questions
- Does the crash repro on a quiet network with no
EV-VSYNC-OVF? (Determines whether starvation is necessary vs sufficient.) - Was BLE working in a previous build, and on which NimBLE version? Earliest BLE-related commit traced to is well before today; binary search across firmware commits with BLE enabled would identify the regression boundary if it's our code.
- Does the customer site have an unusual RF environment (very dense BLE) that increases the callout-churn rate, making the race more likely? Worth a
nimble_scan_eventcount log during a 60s capture window.
Quick verification once you think it's fixed
- Set
BLE_SCANNING_ENABLED 1, rebuild, flash. - Run for at least 10 minutes on the customer network — the original crash hit within ~1s, so 10 min with no panic is strong evidence.
- Confirm a successful hourly cycle:
[CV] entry/exit, then[HTTP] POST .../events/batch ... -> 200, BLE record with non-zerounique_devices. - Run a second device side-by-side; confirm no cross-device interference.
When done, set BLE_SCANNING_ENABLED 1 as the default and remove the gate (keep the comment block as institutional memory of the bug).