Peter Woolery d943b3df5a feat(firmware): log reason before FATAL hang loops
Two FATAL while(true) hangs in main.cpp (config load fail, camera init
fail) previously relied on the hardware watchdog to reboot the device,
leaving the cause invisible beyond a generic TWDT reset reason. Now
each path logs EVT_REBOOT with REBOOT_FATAL_CONFIG or REBOOT_FATAL_CAMERA
before hanging, so the next heartbeat's recent_events surfaces which
branch hung. Server-side decoder updated for the two new enum values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 14:03:57 -07:00

DoorCounter

Retail door traffic counter using M5Stack TimerCamera-F (ESP32 + OV3660). Counts walker traversals via overhead camera CV, passively scans BLE foot traffic, and reports hourly to logs.research.bike.

Known limitation — directional accuracy. This firmware reports counts as {entries, exits} for API compatibility, but per-walk direction labelling is not reliable at the current mount (7' overhead, straight down). In bench testing, event detection was 100% (8/8 walks detected) while per-walk direction matched the physical walk only ~50% of the time — the centroid trajectories produced by entries and exits were nearly indistinguishable. The number to trust is gross traffic: entries + exits ≈ total walkers through the doorway. The directional split is an unreliable best-effort heuristic. See Directional counting for why.

Hardware

  • Device: M5Stack TimerCamera-F (ESP32-S, OV3660, PSRAM, WiFi/BLE)
  • Mount: Overhead, camera pointing straight down, centered above doorway
  • Power: USB (any phone charger)

Firmware

Built with PlatformIO. Target: timercam.

cd firmware
pio run -t upload --upload-port /dev/ttyUSB0

What it does

Module | Behavior
CV pipeline | 5 fps, 96×96 grayscale, event-based walker detector (foreground-count state machine; centroid-trajectory direction heuristic) with post-fire refractory period
Detection LED | Single blink on entry, double blink on exit (preserves upload/no-WiFi status LED)
BLE scanner | Continuous passive scan; deinits during hourly upload to free heap
Reporter | Hourly HMAC-signed POST; 60s boot report for fast connectivity check
Provisioning | Captive portal AP on first boot for WiFi setup
OTA | Arduino OTA; operator push via ota_push.py

Reporting intervals

  • First report: 60 seconds after NTP sync (connectivity check)
  • Subsequent reports: every 3600 seconds
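The cadence above can be sketched as a small schedule calculation. This is an illustration only: it assumes the hourly interval is counted from the first report, and `REPORT_INTERVAL_S` is a hypothetical name (only `BOOT_REPORT_DELAY_S` appears elsewhere in this README).

```python
BOOT_REPORT_DELAY_S = 60    # first report, measured from NTP sync
REPORT_INTERVAL_S = 3600    # hypothetical name for the hourly cadence


def report_times(ntp_sync_s, count):
    """Return the first `count` report times, in seconds on the same clock as ntp_sync_s."""
    first = ntp_sync_s + BOOT_REPORT_DELAY_S
    return [first + i * REPORT_INTERVAL_S for i in range(count)]


print(report_times(0, 3))  # [60, 3660, 7260]
```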

Counting model — event-based walker detector

The CV pipeline is a single event state machine (no per-blob tracking for counting). Per-frame foreground pixel count gates event start and end; centroid trajectory within the active event decides direction.

Event lifecycle:

  1. Idle → Active: fg_count ≥ CV_EVENT_ENTER_THRESH (250 px) fires event start. Background updates freeze while the event is active so the walker does not get absorbed into the baseline.
  2. Active accumulation: every frame updates first_c (once), min_c, max_c, last_c, min_y_seen, max_y_seen, and the frame count.
  3. Active → End (either):
    • Quiet exit: fg_count < CV_EVENT_EXIT_THRESH (150 px) for CV_EVENT_QUIET_FRAMES (3) consecutive frames — walker has left.
    • Timeout: event_frame_count > CV_EVENT_MAX_FRAMES (25 frames ≈ 5s).
  4. On end, the event is finalized: gated by minimum duration, vertical extent (must span a large fraction of the frame), and minimum centroid trajectory magnitude. Background snaps to the current frame.
  5. A refractory period (CV_EVENT_REFRACTORY_FRAMES = 10 ≈ 2s) after a fire blocks a new event from starting — absorbs residual lingering motion that would otherwise double-count.
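The lifecycle above can be modelled in a few lines of Python. This is a simplified sketch, not the firmware's cv.cpp: it uses the thresholds quoted above, but leaves out centroid accumulation, the finalize gates, and background handling, and it assumes the refractory period starts after every event end rather than only after a fire. The `EventDetector` class and its field names are hypothetical.

```python
from dataclasses import dataclass

CV_EVENT_ENTER_THRESH = 250
CV_EVENT_EXIT_THRESH = 150
CV_EVENT_QUIET_FRAMES = 3
CV_EVENT_MAX_FRAMES = 25
CV_EVENT_REFRACTORY_FRAMES = 10


@dataclass
class EventDetector:
    active: bool = False
    frames: int = 0       # frames seen in the current event
    quiet: int = 0        # consecutive frames below the exit threshold
    refractory: int = 0   # frames left before a new event may start

    def step(self, fg_count):
        """Feed one frame's foreground pixel count; return 'quiet', 'timeout', or None."""
        if not self.active:
            if self.refractory > 0:
                self.refractory -= 1          # still blocked after the last event
                return None
            if fg_count >= CV_EVENT_ENTER_THRESH:
                self.active, self.frames, self.quiet = True, 1, 0
            return None
        self.frames += 1
        self.quiet = self.quiet + 1 if fg_count < CV_EVENT_EXIT_THRESH else 0
        if self.quiet >= CV_EVENT_QUIET_FRAMES:
            end = "quiet"
        elif self.frames > CV_EVENT_MAX_FRAMES:
            end = "timeout"
        else:
            return None
        self.active = False
        self.refractory = CV_EVENT_REFRACTORY_FRAMES
        return end


# Two idle frames, a six-frame walk, then an empty doorway.
det = EventDetector()
trace = [0] * 2 + [300] * 6 + [0] * 20
ends = [det.step(c) for c in trace]
print([e for e in ends if e])  # ['quiet']
```

Feeding thirty consecutive frames of 300 instead ends the event with 'timeout' on the 26th frame, since the check is event_frame_count > CV_EVENT_MAX_FRAMES.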

Direction heuristic (applied only if the event passes all gates):

  • up_score = first_c - min_c (how far centroid excursed upward)
  • down_score = max_c - first_c (how far it excursed downward)
  • Quiet-exit events: is_entry = (up_score ≥ down_score)
  • Timeout events: is_entry = (last_c < first_c) — net displacement is more reliable than excursion when the walker is still in frame at timeout.

Per-mount convention: centroid moving up through the frame (y decreasing) = entry into the store.
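As a compact restatement of the rules above, here is a Python sketch of the direction decision. The function name and argument order are hypothetical; the centroid values are y coordinates with 0 at the top of the frame, so "up" means y decreasing. It assumes the event has already passed all gates.

```python
def classify_direction(first_c, min_c, max_c, last_c, timed_out):
    """Return True for an entry, False for an exit."""
    if timed_out:
        # Walker still in frame at timeout: use net displacement.
        return last_c < first_c
    up_score = first_c - min_c      # upward excursion (toward y = 0)
    down_score = max_c - first_c    # downward excursion
    return up_score >= down_score   # ties resolve to entry


print(classify_direction(48, 20, 50, 45, timed_out=False))  # True
print(classify_direction(48, 40, 80, 70, timed_out=False))  # False
```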

Directional counting — known limitation

Per-walk direction labelling is unreliable at the current mount. In bench testing (8 alternating entry/exit walks at 4s intervals, 7' overhead mount pointing straight down):

  • Event detection: 8/8 (100%) — every walk produced exactly one event.
  • Aggregate split: 4 entries + 4 exits — matches the 4+4 ground truth.
  • Per-walk direction: 4/8 (50%) — essentially a coin flip.

At this mount, entries and exits produce nearly identical centroid trajectories: both begin near mid-frame (walker is already large when fg_count crosses 250), both reach a peak excursion toward the top, and both end near mid-frame (walker's tail is still visible when fg_count drops below 150). No heuristic over the recorded centroid statistics separates them with better than ~50% accuracy on alternating walks.

What we ship, and what the server should trust:

  • Gross traffic (entries + exits) is accurate. This is the number downstream analytics should use as "people through the door this hour."
  • Directional split is reported but unreliable. Treat individual entries and exits values as a best-effort labelling. Do not infer net flow or dwell from them.
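On the server side, the rule above amounts to summing the two fields and ignoring the split. A minimal sketch, assuming each hourly record is a dict with `entries` and `exits` keys (the exact payload shape is defined in the design spec, not here):

```python
def gross_traffic(hourly_record):
    """People through the door this hour, ignoring the unreliable direction split."""
    return hourly_record["entries"] + hourly_record["exits"]


print(gross_traffic({"entries": 4, "exits": 4}))  # 8
```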

Recovering per-walk direction would require either a physical change (raise or tilt the camera so walkers enter/leave through the frame edges) or a richer signal than centroid statistics (e.g. time-resolved optical flow, or a second sensor). That work is out of scope for v1.

See firmware/lib/cv/cv.h for tuning constants and cv.cpp for the finalize logic.

Operator Setup

1. Flash firmware

cd firmware
pio run -t upload --upload-port /dev/ttyUSB0

2. Provision device identity

python tools/flash_device.py \
  --port /dev/ttyUSB0 \
  --device-id dc-0042 \
  --location-id retailer-123 \
  --hmac-secret <32-byte-hex> \
  --wifi-ssid "StoreWiFi" \
  --wifi-password "secret"

WiFi credentials are optional. If they are omitted, the device starts the captive portal on boot.

Re-provision after firmware uploads. Flashing firmware via pio run -t upload may clear the NVS partition on this board. If the device boots into a ~1 Hz LED blink (the "not provisioned" fatal state) after a firmware update, re-run flash_device.py with the same credentials. See Troubleshooting.
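One way to produce a value for --hmac-secret is Python's secrets module. This sketch assumes "32-byte-hex" means 32 random bytes written as 64 hex characters; check flash_device.py for the format it actually validates. The helper name is hypothetical.

```python
import secrets


def new_hmac_secret():
    """Return 32 random bytes encoded as 64 lowercase hex characters."""
    return secrets.token_hex(32)


secret = new_hmac_secret()
print(len(secret))  # 64
```

Store the generated value somewhere safe: re-provisioning after an NVS wipe needs the same secret.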

3. OTA updates

python tools/ota_push.py \
  --host dc-0042.local \
  --firmware firmware/.pio/build/timercam/firmware.bin

End User Setup

  1. Mount device overhead, camera pointing straight down
  2. Plug into USB power
  3. Connect phone to DoorCounter-Setup WiFi
  4. Browser opens automatically → enter store WiFi password → done

LED indicators: Red = no WiFi · Blue = counting · Yellow = uploading · Brief flash (×1) on entry · Brief flash (×2) on exit

API

Endpoint: http://logs.research.bike

Endpoint | Data
POST /api/v1/camera/events/batch | Hourly entry/exit counts
POST /api/v1/events/batch | Hourly BLE proximity records
POST /api/v1/heartbeat | Device health (uptime, RSSI, pending records)

All requests are HMAC-SHA256 signed. See design spec for full API shapes and auth scheme.
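For orientation, this is what HMAC-SHA256 signing and verification look like with the Python standard library. It is a generic sketch, not this project's auth scheme: which bytes are signed, how the signature is transported, and the payload field names are all assumptions here, and the design spec is authoritative.

```python
import hashlib
import hmac
import json


def sign_body(secret_hex, body):
    """Return the hex HMAC-SHA256 of `body` (bytes) under a hex-encoded key."""
    key = bytes.fromhex(secret_hex)
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def verify_body(secret_hex, body, signature_hex):
    """Constant-time comparison of the expected and received signatures."""
    return hmac.compare_digest(sign_body(secret_hex, body), signature_hex)


demo_secret = "00" * 32  # placeholder key for illustration only
demo_body = json.dumps({"device_id": "dc-0042", "entries": 4, "exits": 4}).encode()
demo_sig = sign_body(demo_secret, demo_body)
print(verify_body(demo_secret, demo_body, demo_sig))  # True
```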

Project Structure

DoorCounter/
├── firmware/
│   ├── platformio.ini
│   ├── lib/
│   │   ├── cv/            — CV pipeline (event state machine, centroid-trajectory direction)
│   │   └── hmac/          — HMAC-SHA256 signing library
│   └── src/
│       ├── main.cpp       — FreeRTOS tasks, boot sequence
│       ├── config.*       — NVS read/write
│       ├── provisioning.* — captive portal
│       ├── camera.*       — frame capture + CV pipeline
│       ├── ble_scanner.*  — BLE passive scan
│       └── reporter.*     — hourly batch POST + local buffer
├── tools/
│   ├── flash_device.py    — NVS provisioning script
│   ├── ota_push.py        — OTA push script
│   └── serial_monitor.py  — reset + read serial with timestamps (diagnostic)
├── docs/
│   ├── server-prompt-crossing-cooldown.md — server-side coordination notes
│   └── superpowers/specs/2026-04-13-door-counter-design.md
└── server/                — API server (separate deployment)

Troubleshooting

Symptom | Likely cause | Remedy
~1 Hz LED blink after boot, no serial beyond esp_core_dump_flash: No core dump partition found! | NVS missing device_id / location_id / hmac_secret. Commonly triggered by a firmware upload wiping NVS. | Re-run flash_device.py with the device's known credentials.
Device stays on DoorCounter-Setup AP instead of joining customer WiFi | SSID/password in NVS wrong, or network out of range. | Connect phone to DoorCounter-Setup → captive portal → re-enter WiFi. Or reflash NVS with correct --wifi-ssid / --wifi-password.
No entries/exits counted for a known-walking doorway | WiFi captive portal still up (camera task starts only after connect); or camera blocked/unfocused. | Check LED: solid on = booting/uploading, off = counting. Run serial_monitor.py to see [CV] entry/exit log lines.

Capture a boot log with timestamps:

python tools/serial_monitor.py --port /dev/ttyUSB0 --reset --timestamp --seconds 30

Deploying firmware 1.1 (network resilience)

Flash command

cd firmware && pio run -e timercam -t upload

Expected first boot

On the serial log (115200 baud), the device prints the boot banner, then initializes event_log, then records the reset reason via EVT_BOOT. The first heartbeat fires roughly 60-70s after power-on when WiFi joins quickly; the budget is the 15s WiFi busy-wait + NTP sync + 60s BOOT_REPORT_DELAY_S, so expect 75s or more if the full WiFi wait is used. Monitor with pio device monitor or:

python tools/serial_monitor.py --port /dev/ttyUSB0 --reset --timestamp --seconds 90

What's new in 1.1

  • Event-driven WiFi reconnect with 1s→60s exponential backoff (net_guard module); disconnect reasons logged.
  • HTTP timeouts (5s connect / 10s response) + 3-try retry on every POST.
  • ESP-IDF Task Watchdog (30s) on camera, reporter, and loop tasks; panic → reboot → reason surfaces in the next heartbeat.
  • Software heartbeat-miss watchdog: 6 consecutive missed heartbeats (~6 h) triggers a clean reboot.
  • Persistent NVS event-log ring buffer (32 entries) surfaced in the heartbeat's recent_events field.
  • New heartbeat fields: reset_reason, heap_free, heap_min_free, last_disconnect_code, recent_events.
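The 1s→60s reconnect backoff from the first bullet can be illustrated as a capped geometric sequence. This is a sketch, not the net_guard implementation: the doubling factor is an assumption (the README only says "exponential"), and the function name is hypothetical.

```python
from itertools import islice


def backoff_delays(initial_s=1, cap_s=60, factor=2):
    """Yield reconnect delays in seconds: grow by `factor`, never exceed `cap_s`."""
    delay = initial_s
    while True:
        yield delay
        delay = min(delay * factor, cap_s)


print(list(islice(backoff_delays(), 8)))  # [1, 2, 4, 8, 16, 32, 60, 60]
```

With these assumed parameters the cap is reached on the seventh attempt, after about a minute of cumulative waiting.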

24-hour field checks

After deploying a device, run through this checklist against the server's heartbeat records at the 24-hour mark:

  • Heartbeat count ≥ 22 — ≥ 92% uptime across 24 h at the hourly cadence.
  • No sustained t=6 (EVT_HEARTBEAT_MISS) entries in recent_events — transient singletons are expected; repeated misses indicate a sticky network problem worth investigating.
  • heap_min_free stable day over day — a downward drift indicates a leak. Alert threshold: min-free drops by more than 20% vs baseline.
  • last_disconnect_code matches known AP behavior — reason 8 (assoc lost) and reason 15 (4-way handshake timeout) are common on busy APs; recurring reason 200+ indicates a firmware bug.
  • reset_reason has no unexpected values — see table below.
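
reset_reason | Meaning | Expected?
1 | Power-on | Normal immediately after a deployment.
4 | Software reset (our ESP.restart()) | Correlate with EVT_REBOOT in recent_events.
6 | Task watchdog | Investigate: a task hung for 30s.
7 | Brownout | Investigate power supply / USB cable.
8 | SDIO reset | Unusual; investigate.

Verify these codes against the firmware before alerting on them. If reset_reason is the raw esp_reset_reason() value (as EVT_BOOT's d0 is), ESP-IDF's esp_reset_reason_t numbers software reset as 3 (ESP_RST_SW), panic as 4 (ESP_RST_PANIC), task watchdog as 6 (ESP_RST_TASK_WDT), other watchdogs as 7 (ESP_RST_WDT), brownout as 9 (ESP_RST_BROWNOUT), and SDIO as 10 (ESP_RST_SDIO).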
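Two of the checklist items are simple enough to automate. The sketch below covers only the heartbeat count and the heap_min_free threshold. It assumes heartbeats arrive as a list of dicts keyed by the field names listed under "What's new in 1.1"; the function name and the baseline argument are hypothetical.

```python
def check_24h(heartbeats, baseline_heap_min_free):
    """Return a list of problem strings for one device's last 24 h of heartbeats."""
    problems = []
    if len(heartbeats) < 22:
        problems.append(f"only {len(heartbeats)} heartbeats (expected >= 22)")
    heap_values = [hb["heap_min_free"] for hb in heartbeats if "heap_min_free" in hb]
    # Alert when the lowest value is more than 20% below the baseline.
    if heap_values and min(heap_values) < 0.8 * baseline_heap_min_free:
        problems.append("heap_min_free dropped more than 20% below baseline")
    return problems


healthy = [{"heap_min_free": 100_000}] * 24
print(check_24h(healthy, baseline_heap_min_free=100_000))  # []
```

The "no sustained EVT_HEARTBEAT_MISS" check is left out because recent_events is a ring buffer that repeats across heartbeats, so counting it naively would double-count.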

Decoding recent_events

The recent_events array is a ring buffer of {t, d0, d1, ts} entries. Tag definitions live in firmware/lib/event_log/event_log.h:

t | Event | d0 | d1
1 | EVT_BOOT | esp_reset_reason() | -
2 | EVT_WIFI_UP | RSSI | -
3 | EVT_WIFI_DOWN | disconnect reason code; 0xFF = silent-death fallback | -
4 | EVT_HTTP_OK | fnv1a-16 path hash | elapsed ms (capped at 65535)
5 | EVT_HTTP_FAIL | path hash | HTTP status or negative errno cast to uint16
6 | EVT_HEARTBEAT_MISS | consecutive miss count | -
7 | EVT_NTP_SYNC | reserved | -
8 | EVT_REBOOT | RebootReason: 1=HEARTBEAT_MISS, 2=FACTORY_RESET, 3=OTA, 4=WIFI_REPROV | -
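To match a path hash in d0 back to an endpoint, the server needs the same hash function as the firmware. The table only names it "fnv1a-16", so the sketch below is a guess at one common construction: standard 32-bit FNV-1a, XOR-folded to 16 bits, over the UTF-8 request path. The firmware may fold or truncate differently, so confirm against firmware/lib/event_log before relying on the values.

```python
FNV32_OFFSET = 0x811C9DC5
FNV32_PRIME = 0x01000193


def fnv1a_32(data):
    """Standard 32-bit FNV-1a over a bytes object."""
    h = FNV32_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF
    return h


def fnv1a_16(path):
    """ASSUMPTION: XOR-fold the 32-bit hash of the path down to 16 bits."""
    h = fnv1a_32(path.encode("utf-8"))
    return (h >> 16) ^ (h & 0xFFFF)


for p in ("/api/v1/camera/events/batch", "/api/v1/events/batch", "/api/v1/heartbeat"):
    print(p, hex(fnv1a_16(p)))
```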

Server-side decoder tables (EVENT_TAG_DECODER, REBOOT_REASON_DECODER) live in server/heartbeat_diagnostics_stub.py.
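For readers without the server repo at hand, here is a standalone sketch of decoding one recent_events entry from the table above. It is not the contents of heartbeat_diagnostics_stub.py: the dictionary and function names are hypothetical, and reboot reasons missing from the table are printed as raw numbers rather than guessed.

```python
EVENT_TAGS = {
    1: "EVT_BOOT", 2: "EVT_WIFI_UP", 3: "EVT_WIFI_DOWN", 4: "EVT_HTTP_OK",
    5: "EVT_HTTP_FAIL", 6: "EVT_HEARTBEAT_MISS", 7: "EVT_NTP_SYNC", 8: "EVT_REBOOT",
}
REBOOT_REASONS = {1: "HEARTBEAT_MISS", 2: "FACTORY_RESET", 3: "OTA", 4: "WIFI_REPROV"}


def as_int16(value):
    """Reinterpret a uint16 as signed, recovering a negative errno."""
    return value - 0x10000 if value >= 0x8000 else value


def describe_event(ev):
    """Render one {t, d0, d1, ts} entry as a short human-readable string."""
    tag = EVENT_TAGS.get(ev["t"], f"UNKNOWN({ev['t']})")
    d0, d1 = ev.get("d0", 0), ev.get("d1", 0)
    if tag == "EVT_REBOOT":
        return f"{tag} reason={REBOOT_REASONS.get(d0, d0)}"
    if tag == "EVT_HTTP_FAIL":
        return f"{tag} path_hash=0x{d0:04x} status={as_int16(d1)}"
    if tag == "EVT_HTTP_OK":
        return f"{tag} path_hash=0x{d0:04x} elapsed_ms={d1}"
    return f"{tag} d0={d0} d1={d1}"


print(describe_event({"t": 8, "d0": 1, "d1": 0, "ts": 0}))
# EVT_REBOOT reason=HEARTBEAT_MISS
print(describe_event({"t": 5, "d0": 0x1CD9, "d1": 65423, "ts": 0}))
# EVT_HTTP_FAIL path_hash=0x1cd9 status=-113
```

Because HTTP status codes are far below 0x8000, reading d1 as a signed 16-bit value cleanly separates them from negative errno values.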
