Peter Woolery a795cfa0ad fix(firmware): reboot on FATAL failures + emit NTP_SYNC + server-coord warning
- Config-load and camera-init FATAL branches now reboot (3s LED signal
  before restart) instead of hanging forever. Matches the enum name
  REBOOT_FATAL_* and makes camera-init failures diagnosable via the
  next boot's heartbeat recent_events. Config failures produce a
  visible reboot loop rather than a silent hang.
- Emit EVT_NTP_SYNC(seconds_since_boot) on the first NTP-synced
  reporter iteration so slow / failed NTP sync is a visible signal in
  the heartbeat's recent_events window.
- README "Deploying firmware 1.1" now opens with a "Before you flash"
  warning directing the operator to land server-side heartbeat
  schema changes first (migration 005 + stub integration) to avoid a
  strict-schema 4xx reboot loop after deployment.
2026-04-23 14:10:32 -07:00
# DoorCounter
Retail door traffic counter using M5Stack TimerCamera-F (ESP32 + OV3660). Counts walker traversals via overhead camera CV, passively scans BLE foot traffic, and reports hourly to `logs.research.bike`.
> **Known limitation — directional accuracy.** This firmware reports counts as `{entries, exits}` for API compatibility, but **per-walk direction labelling is not reliable at the current mount (7' overhead, straight down).** In bench testing, event detection was 100% (8/8 walks detected) while per-walk direction matched the physical walk only ~50% of the time — the centroid trajectories produced by entries and exits were nearly indistinguishable. **The number to trust is gross traffic: `entries + exits` ≈ total walkers through the doorway.** The directional split is an unreliable best-effort heuristic. See [Directional counting](#directional-counting) for why.
## Hardware
- **Device**: M5Stack TimerCamera-F (ESP32-S, OV3660, PSRAM, WiFi/BLE)
- **Mount**: Overhead, camera pointing straight down, centered above doorway
- **Power**: USB (any phone charger)
## Firmware
Built with PlatformIO. Target: `timercam`.
```bash
cd firmware
pio run -t upload --upload-port /dev/ttyUSB0
```
### What it does
| Module | Behavior |
|--------|----------|
| CV pipeline | 5 fps, 96×96 grayscale, event-based walker detector (foreground-count state machine; centroid-trajectory direction heuristic) with post-fire refractory period |
| Detection LED | Single blink on entry, double blink on exit (preserves upload/no-WiFi status LED) |
| BLE scanner | Continuous passive scan; deinits during hourly upload to free heap |
| Reporter | Hourly HMAC-signed POST; 60s boot report for fast connectivity check |
| Provisioning | Captive portal AP on first boot for WiFi setup |
| OTA | Arduino OTA; operator push via `ota_push.py` |
### Reporting intervals
- **First report**: 60 seconds after NTP sync (connectivity check)
- **Subsequent reports**: every 3600 seconds
### Counting model — event-based walker detector
The CV pipeline is a **single event state machine** (no per-blob tracking
for counting). Per-frame foreground pixel count gates event start and end;
centroid trajectory within the active event decides direction.
**Event lifecycle:**
1. **Idle → Active**: `fg_count ≥ CV_EVENT_ENTER_THRESH` (250 px) fires event start.
   Background updates freeze while the event is active so the walker does
   not get absorbed into the baseline.
2. **Active accumulation**: every frame updates `first_c` (once), `min_c`,
   `max_c`, `last_c`, `min_y_seen`, `max_y_seen`, and the frame count.
3. **Active → End** (either):
   - **Quiet exit**: `fg_count < CV_EVENT_EXIT_THRESH` (150 px) for
     `CV_EVENT_QUIET_FRAMES` (3) consecutive frames, meaning the walker has left.
   - **Timeout**: `event_frame_count > CV_EVENT_MAX_FRAMES` (25 frames ≈ 5s).
4. On end, the event is finalized: gated by minimum duration, vertical
   extent (must span a large fraction of the frame), and minimum centroid
   trajectory magnitude. Background snaps to the current frame.
5. A **refractory period** (`CV_EVENT_REFRACTORY_FRAMES` = 10 ≈ 2s) after
   a fire blocks a new event from starting, absorbing residual lingering
   motion that would otherwise double-count.
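
The lifecycle above can be modelled off-device in a few lines of Python. This is a minimal sketch, not the firmware's code: the class name `EventDetector` is hypothetical, the finalize gates from step 4 are omitted, the start frame is assumed to count as frame 1, and the refractory period is applied after every event end rather than only after a fire.

```python
from dataclasses import dataclass

# Thresholds quoted in the lifecycle above; the authoritative values live in cv.h.
CV_EVENT_ENTER_THRESH = 250
CV_EVENT_EXIT_THRESH = 150
CV_EVENT_QUIET_FRAMES = 3
CV_EVENT_MAX_FRAMES = 25
CV_EVENT_REFRACTORY_FRAMES = 10


@dataclass
class EventDetector:
    """Hypothetical model of the event lifecycle (finalize gates omitted)."""
    active: bool = False
    frames: int = 0        # frames seen in the current event
    quiet: int = 0         # consecutive frames below the exit threshold
    refractory: int = 0    # frames left before a new event may start

    def step(self, fg_count: int):
        """Feed one frame's foreground count; return 'quiet', 'timeout', or None."""
        if not self.active:
            if self.refractory > 0:
                self.refractory -= 1      # still blocked after the last event
                return None
            if fg_count >= CV_EVENT_ENTER_THRESH:
                self.active, self.frames, self.quiet = True, 1, 0
            return None
        self.frames += 1
        self.quiet = self.quiet + 1 if fg_count < CV_EVENT_EXIT_THRESH else 0
        if self.quiet >= CV_EVENT_QUIET_FRAMES:
            return self._end("quiet")
        if self.frames > CV_EVENT_MAX_FRAMES:
            return self._end("timeout")
        return None

    def _end(self, how: str) -> str:
        self.active = False
        self.refractory = CV_EVENT_REFRACTORY_FRAMES
        return how
```

Feeding it a recorded `fg_count` trace is a cheap way to try threshold changes before reflashing.
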
**Direction heuristic** (applied only if the event passes all gates):
- `up_score = first_c - min_c` (how far centroid excursed upward)
- `down_score = max_c - first_c` (how far it excursed downward)
- Quiet-exit events: `is_entry = (up_score ≥ down_score)`
- Timeout events: `is_entry = (last_c < first_c)` — net displacement is
more reliable than excursion when the walker is still in frame at timeout.
Per-mount convention: centroid moving **up through the frame** (y decreasing)
= **entry** into the store.
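
As a sketch of the heuristic, the decision reduces to one small function. It reads the two scores as differences of centroid y-coordinates (upward excursion is `first_c` minus `min_c`, downward is `max_c` minus `first_c`); the function name and the `timed_out` flag are illustrative rather than taken from `cv.cpp`.

```python
def is_entry(first_c: float, min_c: float, max_c: float,
             last_c: float, timed_out: bool) -> bool:
    """Label a finalized event as entry (True) or exit (False).

    Centroid values are y-coordinates, so smaller means higher in the frame.
    """
    if timed_out:
        # Walker still in frame at timeout: use net displacement.
        return last_c < first_c
    up_score = first_c - min_c      # excursion toward the top of the frame
    down_score = max_c - first_c    # excursion toward the bottom
    return up_score >= down_score


# Quiet-exit event whose centroid rose 28 px and dipped only 7 px.
print(is_entry(first_c=48, min_c=20, max_c=55, last_c=50, timed_out=False))
```
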
### Directional counting — known limitation
**Per-walk direction labelling is unreliable at the current mount.** In
bench testing (8 alternating entry/exit walks at 4s intervals, 7' overhead
mount pointing straight down):
- **Event detection**: 8/8 (100%) — every walk produced exactly one event.
- **Aggregate split**: 4 entries + 4 exits — matches the 4+4 ground truth.
- **Per-walk direction**: 4/8 (50%) — essentially a coin flip.
At this mount, entries and exits produce nearly identical centroid
trajectories: both begin near mid-frame (walker is already large when
`fg_count` crosses 250), both reach a peak excursion toward the top, and
both end near mid-frame (walker's tail is still visible when `fg_count`
drops below 150). No heuristic over the recorded centroid statistics
separates them with better than ~50% accuracy on alternating walks.
**What we ship, and what the server should trust:**
- **Gross traffic (`entries + exits`) is accurate.** This is the number
downstream analytics should use as "people through the door this hour."
- **Directional split is reported but unreliable.** Treat individual
`entries` and `exits` values as a best-effort labelling. Do not infer
net flow or dwell from them.
To actually recover per-walk direction would require either a physical
change (raise or tilt the camera so walkers enter/leave through the frame
edges) or a richer signal than centroid statistics (e.g. time-resolved
optical flow, or a second sensor). That work is out of scope for v1.
See `firmware/lib/cv/cv.h` for tuning constants and `cv.cpp` for the
finalize logic.
## Operator Setup
### 1. Flash firmware
```bash
cd firmware
pio run -t upload --upload-port /dev/ttyUSB0
```
### 2. Provision device identity
```bash
python tools/flash_device.py \
--port /dev/ttyUSB0 \
--device-id dc-0042 \
--location-id retailer-123 \
--hmac-secret <32-byte-hex> \
--wifi-ssid "StoreWiFi" \
--wifi-password "secret"
```
WiFi credentials are optional — if omitted, device starts captive portal on boot.
> **Re-provision after firmware uploads.** Flashing firmware via
> `pio run -t upload` may clear the NVS partition on this board. If the device
> boots into a ~1 Hz LED blink (the "not provisioned" fatal state) after a
> firmware update, re-run `flash_device.py` with the same credentials. See
> [Troubleshooting](#troubleshooting).
### 3. OTA updates
```bash
python tools/ota_push.py \
--host dc-0042.local \
--firmware firmware/.pio/build/timercam/firmware.bin
```
## End User Setup
1. Mount device overhead, camera pointing straight down
2. Plug into USB power
3. Connect phone to `DoorCounter-Setup` WiFi
4. Browser opens automatically → enter store WiFi password → done
**LED indicators**: Red = no WiFi · Blue = counting · Yellow = uploading · Brief flash (×1) on entry · Brief flash (×2) on exit
## API
Base URL: `http://logs.research.bike`
| Endpoint | Data |
|----------|------|
| `POST /api/v1/camera/events/batch` | Hourly entry/exit counts |
| `POST /api/v1/events/batch` | Hourly BLE proximity records |
| `POST /api/v1/heartbeat` | Device health (uptime, RSSI, pending records) |
All requests are HMAC-SHA256 signed. See [design spec](docs/superpowers/specs/2026-04-13-door-counter-design.md) for full API shapes and auth scheme.
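
For orientation, a generic HMAC-SHA256 signature can be computed and checked as below. This is only a sketch under stated assumptions: that the signature covers the raw request body and that the secret is the hex string passed to `--hmac-secret`. The header name, canonicalization, and any timestamp or nonce handling are defined in the design spec and are not reproduced here; `sign_body` and `verify_body` are hypothetical names.

```python
import hashlib
import hmac


def sign_body(secret_hex: str, body: bytes) -> str:
    """Return the hex HMAC-SHA256 of `body` keyed with a hex-encoded secret."""
    key = bytes.fromhex(secret_hex)
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def verify_body(secret_hex: str, body: bytes, signature_hex: str) -> bool:
    """Constant-time comparison of a received signature against our own."""
    return hmac.compare_digest(sign_body(secret_hex, body), signature_hex)


secret = "00" * 32                       # placeholder 32-byte secret
body = b'{"entries": 3, "exits": 2}'     # illustrative payload, not the real shape
signature = sign_body(secret, body)
print(verify_body(secret, body, signature))
```
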
## Project Structure
```
DoorCounter/
├── firmware/
│ ├── platformio.ini
│ ├── lib/
│ │ ├── cv/ — CV pipeline (event state machine, centroid-trajectory direction)
│ │ └── hmac/ — HMAC-SHA256 signing library
│ └── src/
│ ├── main.cpp — FreeRTOS tasks, boot sequence
│ ├── config.* — NVS read/write
│ ├── provisioning.* — captive portal
│ ├── camera.* — frame capture + CV pipeline
│ ├── ble_scanner.* — BLE passive scan
│ └── reporter.* — hourly batch POST + local buffer
├── tools/
│ ├── flash_device.py — NVS provisioning script
│ ├── ota_push.py — OTA push script
│ └── serial_monitor.py — reset + read serial with timestamps (diagnostic)
├── docs/
│ ├── server-prompt-crossing-cooldown.md — server-side coordination notes
│ └── superpowers/specs/2026-04-13-door-counter-design.md
└── server/ — API server (separate deployment)
```
## Troubleshooting
| Symptom | Likely cause | Remedy |
|---------|--------------|--------|
| ~1 Hz LED blink after boot, no serial beyond `esp_core_dump_flash: No core dump partition found!` | NVS missing `device_id` / `location_id` / `hmac_secret`. Commonly triggered by a firmware upload wiping NVS. | Re-run `flash_device.py` with the device's known credentials. |
| Device stays on `DoorCounter-Setup` AP instead of joining customer WiFi | SSID/password in NVS wrong, or network out of range. | Connect phone to `DoorCounter-Setup` → captive portal → re-enter WiFi. Or reflash NVS with correct `--wifi-ssid` / `--wifi-password`. |
| No entries/exits counted for a known-walking doorway | WiFi captive portal still up (camera task starts only after connect); or camera blocked/unfocused. | Check LED: solid on = booting/uploading, off = counting. Run `serial_monitor.py` to see `[CV] entry/exit` log lines. |
Capture a boot log with timestamps:
```bash
python tools/serial_monitor.py --port /dev/ttyUSB0 --reset --timestamp --seconds 30
```
## Deploying firmware 1.1 (network resilience)
### Before you flash
Firmware 1.1 adds five new fields to the `POST /api/v1/heartbeat` payload
(`reset_reason`, `heap_free`, `heap_min_free`, `last_disconnect_code`,
`recent_events`). **The real server must accept these optional fields before
you deploy firmware 1.1**, or strict-schema validation will 4xx every
heartbeat; after 6 consecutive misses (~6h) the heartbeat-miss watchdog
will reboot the device, producing a reboot loop.
Reference migration and handler code for the real server are in this repo:
- `server/heartbeat_diagnostics_stub.py` — Pydantic model extensions,
`store_heartbeat_diagnostics()` helper, and `EVENT_TAG_DECODER` /
`REBOOT_REASON_DECODER` reference tables.
- `server/migrations/005_heartbeat_diagnostics.sql` — adds five nullable
columns to the `heartbeats` table (adjust table name to match the real
server's schema).
Copy the stub additions into the production server repo, run the
migration, and confirm a v1.1.0-shape heartbeat returns 200 before you
flash any device.
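
One way to reason about the strict-schema risk is to list which keys a validator would still reject once the five new fields are allowed. The sketch below is not the stub's code: `unexpected_fields` is a hypothetical helper, and the base field names in the example are placeholders, since the v1.0 heartbeat shape is defined in the design spec.

```python
# The five optional fields firmware 1.1 adds to the heartbeat payload.
V11_OPTIONAL_FIELDS = frozenset({
    "reset_reason",
    "heap_free",
    "heap_min_free",
    "last_disconnect_code",
    "recent_events",
})


def unexpected_fields(payload: dict, base_fields: set) -> list:
    """Keys a strict schema would reject after the v1.1 fields are accepted."""
    return sorted(set(payload) - set(base_fields) - V11_OPTIONAL_FIELDS)


# Placeholder base schema; substitute the real server's v1.0 field names.
base = {"device_id", "uptime_s"}
v11_payload = {"device_id": "dc-0042", "uptime_s": 75,
               "reset_reason": 1, "heap_free": 120000, "recent_events": []}
print(unexpected_fields(v11_payload, base))   # an empty list means it would pass
```
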
### Flash command
```bash
cd firmware && pio run -e timercam -t upload
```
### Expected first boot
On the serial log (115200 baud), the device prints the boot banner, then
initializes `event_log`, then records the reset reason via `EVT_BOOT`.
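The first heartbeat fires a little over a minute after power-on: the 15s WiFi
busy-wait and NTP sync come first, then the 60s `BOOT_REPORT_DELAY_S`, so
expect roughly 75s or more if the busy-wait runs its full length. Monitor with
`pio device monitor` or: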
```bash
python tools/serial_monitor.py --port /dev/ttyUSB0 --reset --timestamp --seconds 90
```
### What's new in 1.1
- Event-driven WiFi reconnect with 1s→60s exponential backoff (`net_guard` module); disconnect reasons logged.
- HTTP timeouts (5s connect / 10s response) + 3-try retry on every POST.
- ESP-IDF Task Watchdog (30s) on camera, reporter, and loop tasks; panic → reboot → reason surfaces in the next heartbeat.
- Software heartbeat-miss watchdog: 6 consecutive missed heartbeats (~6 h) triggers a clean reboot.
- Persistent NVS event-log ring buffer (32 entries) surfaced in the heartbeat's `recent_events` field.
- New heartbeat fields: `reset_reason`, `heap_free`, `heap_min_free`, `last_disconnect_code`, `recent_events`.
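
To make the 1s→60s reconnect backoff concrete, the sketch below prints one possible delay schedule. The doubling factor is an assumption (the list above only says the backoff is exponential and runs from 1s to 60s), and `backoff_delays` is a hypothetical name, not a `net_guard` function.

```python
def backoff_delays(attempts: int, base_s: float = 1.0, cap_s: float = 60.0) -> list:
    """Delay before each reconnect attempt: doubles from base_s, capped at cap_s."""
    return [min(cap_s, base_s * 2 ** n) for n in range(attempts)]


print(backoff_delays(8))   # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

Under that assumption the cap is reached on the seventh attempt, about a minute after the first disconnect.
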
### 24-hour field checks
After deploying a device, run through this checklist against the server's
heartbeat records at the 24-hour mark:
- **Heartbeat count ≥ 22** — ≥ 92% uptime across 24 h at the hourly cadence.
- **No sustained `t=6` (EVT_HEARTBEAT_MISS) entries in `recent_events`** — transient singletons are expected; repeated misses indicate a sticky network problem worth investigating.
- **`heap_min_free` stable day over day** — a downward drift indicates a leak. Alert threshold: min-free drops by more than 20% vs baseline.
- **`last_disconnect_code` matches known AP behavior**: reason 8 (`WIFI_REASON_ASSOC_LEAVE`) and reason 15 (`WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT`) are common on busy APs; recurring reason 200+ indicates a firmware bug.
- **`reset_reason` has no unexpected values** — see table below.
| `reset_reason` (`esp_reset_reason()` value) | Meaning | Expected? |
|----------------|---------|-----------|
| 1 (`ESP_RST_POWERON`) | Power-on | Normal immediately after a deployment. |
| 3 (`ESP_RST_SW`) | Software reset (our `ESP.restart()`) | Correlate with `EVT_REBOOT` in `recent_events`. |
| 6 (`ESP_RST_TASK_WDT`) | Task watchdog | Investigate: a task hung for 30s. |
| 9 (`ESP_RST_BROWNOUT`) | Brownout | Investigate power supply / USB cable. |
| 10 (`ESP_RST_SDIO`) | SDIO reset | Unusual; investigate. |
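
The two numeric checks in the list above (heartbeat count and `heap_min_free` drift) are easy to automate. The following is a sketch only: it assumes heartbeat rows are available as dicts keyed by the payload field names, and `field_check` and `baseline_min_free` are hypothetical names rather than part of the server stub.

```python
MIN_HEARTBEATS_24H = 22   # checklist: at least 22 of 24 hourly heartbeats
MAX_HEAP_DROP = 0.20      # checklist: alert when min-free drops more than 20%


def field_check(heartbeats: list, baseline_min_free: int) -> list:
    """Return human-readable issues for one device's last 24 h of heartbeats.

    `baseline_min_free` is a previously recorded healthy `heap_min_free`
    value for the same device.
    """
    issues = []
    if len(heartbeats) < MIN_HEARTBEATS_24H:
        issues.append(f"only {len(heartbeats)} heartbeats in 24 h")
    lows = [hb["heap_min_free"] for hb in heartbeats
            if hb.get("heap_min_free") is not None]
    if lows and min(lows) < (1 - MAX_HEAP_DROP) * baseline_min_free:
        issues.append(f"heap_min_free fell to {min(lows)}")
    return issues


day = [{"heap_min_free": 80000}] * 20
print(field_check(day, baseline_min_free=110000))
```
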
### Decoding recent_events
The `recent_events` array is a ring buffer of `{t, d0, d1, ts}` entries.
Tag definitions live in `firmware/lib/event_log/event_log.h`:
| `t` | Event | `d0` | `d1` |
|-----|-------|------|------|
| 1 | `EVT_BOOT` | `esp_reset_reason()` | — |
| 2 | `EVT_WIFI_UP` | RSSI | — |
| 3 | `EVT_WIFI_DOWN` | disconnect reason code; `0xFF` = silent-death fallback | — |
| 4 | `EVT_HTTP_OK` | fnv1a-16 path hash | elapsed ms (capped at 65535) |
| 5 | `EVT_HTTP_FAIL` | path hash | HTTP status or negative errno cast to `uint16` |
| 6 | `EVT_HEARTBEAT_MISS` | consecutive miss count | — |
| 7 | `EVT_NTP_SYNC` | seconds since boot at the first NTP-synced reporter iteration | — |
| 8 | `EVT_REBOOT` | `RebootReason`: 1=HEARTBEAT_MISS, 2=FACTORY_RESET, 3=OTA, 4=WIFI_REPROV | — |
Server-side decoder tables (`EVENT_TAG_DECODER`, `REBOOT_REASON_DECODER`)
live in `server/heartbeat_diagnostics_stub.py`.
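
As an illustration of how the table can be applied, the sketch below turns one `recent_events` entry into a readable line. It assumes entries arrive as JSON objects with keys `t`, `d0`, `d1`, and `ts`; the dictionaries and the `describe` function are hypothetical and separate from the stub's `EVENT_TAG_DECODER` and `REBOOT_REASON_DECODER`.

```python
# Tag and reboot-reason names copied from the table above.
EVENT_NAMES = {
    1: "EVT_BOOT", 2: "EVT_WIFI_UP", 3: "EVT_WIFI_DOWN", 4: "EVT_HTTP_OK",
    5: "EVT_HTTP_FAIL", 6: "EVT_HEARTBEAT_MISS", 7: "EVT_NTP_SYNC", 8: "EVT_REBOOT",
}
REBOOT_REASONS = {1: "HEARTBEAT_MISS", 2: "FACTORY_RESET", 3: "OTA", 4: "WIFI_REPROV"}


def describe(entry: dict) -> str:
    """Render one {t, d0, d1, ts} entry as 'ts NAME detail'."""
    tag, d0 = entry["t"], entry.get("d0")
    name = EVENT_NAMES.get(tag, f"UNKNOWN({tag})")
    if tag == 8:
        detail = REBOOT_REASONS.get(d0, f"reason {d0}")
    elif tag == 3 and d0 == 0xFF:
        detail = "silent-death fallback"
    else:
        detail = f"d0={d0}"
    return f"{entry.get('ts')} {name} {detail}"


print(describe({"t": 8, "d0": 1, "d1": 0, "ts": 1700000000}))
```

Unknown tags are rendered rather than rejected, so a server running an older decoder still shows events from newer firmware.
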