net_guard_tick() compared absolute uint32_t millis() values:
    if (millis() < s_next_retry_ms) return;
This breaks across the ~49.7-day millis() wrap. If s_next_retry_ms wraps
to a small value while millis() is still near UINT32_MAX, the gate never
blocks and retries tight-loop; if millis() wraps before it reaches a
deadline near UINT32_MAX, retries stall for up to ~49.7 days. The device
is designed for multi-month uptime, so this is a real production case,
not a theoretical one.
Replace with the standard wrap-safe pattern using a signed difference.
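
For reference, a minimal sketch of the wrap-safe comparison, assuming the
deadline is always scheduled less than 2^31 ms (~24.8 days) ahead. The
helper name deadline_reached is hypothetical; only millis() and
s_next_retry_ms come from this change. The unsigned-to-signed cast is
modular on two's-complement targets (and guaranteed so from C++20).

```cpp
#include <cstdint>

// Returns true once `now` has reached or passed `deadline`, even if the
// 32-bit millisecond counter wrapped between scheduling and checking.
// Only valid while the two values are less than 2^31 ms apart.
static bool deadline_reached(uint32_t now, uint32_t deadline) {
    // Unsigned subtraction wraps modulo 2^32; reinterpreting the result as
    // signed yields the shortest distance between the two timestamps.
    return static_cast<int32_t>(now - deadline) >= 0;
}

// In the tick function the gate would then read (hypothetical usage):
//     if (!deadline_reached(millis(), s_next_retry_ms)) return;
```

The gate depends only on the distance between the two timestamps, so it
behaves the same on either side of the wrap.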
Found via adversarial review (run 2026-05-01-202910, gpt-5.5 reviewer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NimBLE-Arduino 1.4.2 has an init/fire race in its FreeRTOS callout porting
layer: os_callout_timer_cb can dispatch a queued TimerHandle expiry
against a not-yet-initialized event (NULL fn pointer). When the camera
task starved the timer service, this caused a PC=0 InstrFetchProhibited
crash within ~1s of boot. Confirmed by ets_printf instrumentation.
Upgrading to ^2.0.0, which rewrites the porting layer, eliminates the
race; verified clean on the customer network for over an hour.
Also rolls in the DNS-resilience work that surfaced the BLE crash during
provisioning: pin the lwIP/esp-netif resolvers to 1.1.1.1/8.8.8.8 across
DHCP renewals, add a three-tier resolver fallback in the reporter with a
hardcoded IP of last resort, and switch to a raw WiFiClient with a manual
Host header to bypass HTTPClient's brittle DNS path.
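
A rough sketch of the fallback shape and the hand-built request. It
assumes the tiers are "primary resolver, secondary resolver, hardcoded
IP", which this message does not spell out. Every name here (Resolver,
resolve_with_fallback, build_post) and the JSON content type are
hypothetical, not the reporter's actual code.

```cpp
#include <array>
#include <cstdint>
#include <string>

using Ipv4 = std::array<uint8_t, 4>;

// A resolver fills `out` and returns true on success (assumed signature).
using Resolver = bool (*)(const std::string& host, Ipv4& out);

struct Resolved {
    Ipv4 addr;
    int tier;  // 1 = primary, 2 = secondary, 3 = hardcoded last resort
};

// Tries each tier in order and never fails: the last tier is a fixed IP.
Resolved resolve_with_fallback(const std::string& host, Resolver primary,
                               Resolver secondary, const Ipv4& last_resort) {
    Ipv4 addr{};
    if (primary(host, addr)) return {addr, 1};
    if (secondary(host, addr)) return {addr, 2};
    return {last_resort, 3};
}

// Builds the request by hand so the Host header still carries the server
// name even though the socket was opened to a bare IP address.
std::string build_post(const std::string& host, const std::string& path,
                       const std::string& body) {
    std::string req;
    req += "POST " + path + " HTTP/1.1\r\n";
    req += "Host: " + host + "\r\n";
    req += "Content-Type: application/json\r\n";  // assumed payload type
    req += "Content-Length: " + std::to_string(body.size()) + "\r\n";
    req += "Connection: close\r\n\r\n";
    req += body;
    return req;
}
```

On the device, the resolved address would be passed to the client's
connect call and the returned string written to the socket as-is.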
Migration touches for NimBLE 2.x:
- NimBLEAdvertisedDeviceCallbacks -> NimBLEScanCallbacks
- onResult signature now takes const NimBLEAdvertisedDevice*
- setAdvertisedDeviceCallbacks -> setScanCallbacks
- start(0, nullptr, false) -> start(0, false, false)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- net_guard_tick now detects status-vs-event divergence. If s_up is
  true but WiFi.status() says otherwise (rare: driver wedge, silent
  RF failure), force the DOWN state and schedule a reconnect. Uses 0xFF
  as the disconnect reason so the event log distinguishes this path.
- Forward-declare DeviceConfig in net_guard.h so consumers that don't
  call net_guard_start don't transitively pull in config.h.
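
The divergence check can be sketched roughly as below. This is a
host-side model, not the firmware code: GuardState,
reconcile_with_driver and kReasonStatusMismatch are hypothetical names,
and driver_connected stands in for WiFi.status() == WL_CONNECTED. Only
s_up and the 0xFF reason value come from this change.

```cpp
#include <cstdint>

// Sentinel reason recorded when the driver status contradicts our state.
constexpr uint8_t kReasonStatusMismatch = 0xFF;

struct GuardState {
    bool up = false;               // mirrors s_up, normally set by events
    uint8_t last_down_reason = 0;  // reason code written to the event log
    uint32_t next_retry_ms = 0;    // when the next reconnect may run
    bool reconnect_pending = false;
};

// Returns true if a divergence was found and the state was forced DOWN.
bool reconcile_with_driver(GuardState& s, bool driver_connected,
                           uint32_t now_ms, uint32_t retry_delay_ms) {
    // Consistent cases: already DOWN, or UP and the driver agrees.
    if (!s.up || driver_connected) return false;
    s.up = false;
    s.last_down_reason = kReasonStatusMismatch;
    s.next_retry_ms = now_ms + retry_delay_ms;  // unsigned add may wrap
    s.reconnect_pending = true;
    return true;
}
```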
loop() no longer blocks for 5s after a disconnect; reconnect is
scheduled from the WiFi event handler with exponential backoff.
Buffered reports flush on every clean UP transition.
- Seed s_up from WiFi.status() in net_guard_start so the first
  STA_GOT_IP (fired during setup's busy-wait, before onEvent was
  registered) is not missed. This prevents a reconnect flap on every boot.
- Drop WiFi.disconnect() from net_guard_tick; WiFi.begin() alone
  re-associates cleanly and avoids a spurious STA_DISCONNECTED that
  was double-logging EVT_WIFI_DOWN on every retry.
- Re-check s_up after the millis() timing gate to close the
  GOT_IP-vs-tick race.
- Document the volatile-only shared-state contract.
net_guard registers WiFi.onEvent() so disconnects are handled
immediately instead of being polled every 1s. Backoff doubles from 1s
(1s->2s->4s->...) up to a 60s cap.
Every up/down transition is logged to the event log with the disconnect
reason code, so field failures are diagnosable.
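
A minimal sketch of the backoff step, assuming plain doubling that
saturates at the cap. The constant and function names are hypothetical;
only the 1s start and 60s cap come from this message.

```cpp
#include <cstdint>

constexpr uint32_t kBackoffMinMs = 1000;   // first retry delay: 1s
constexpr uint32_t kBackoffMaxMs = 60000;  // cap: 60s

// Returns the delay to use after another failed attempt.
static uint32_t next_backoff_ms(uint32_t current_ms) {
    if (current_ms < kBackoffMinMs) return kBackoffMinMs;
    // Saturate instead of doubling past the cap.
    if (current_ms >= kBackoffMaxMs / 2) return kBackoffMaxMs;
    return current_ms * 2;
}
```

Starting from kBackoffMinMs this yields 1s, 2s, 4s, 8s, 16s, 32s, 60s,
and then stays at 60s. Resetting the delay to kBackoffMinMs on a clean UP
transition is assumed here.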