
Indexing a Year of Video Locally: Leveraging Gemma 4 and M1 Max Hardware

May 21, 2026

For photographers and videographers, the archive is often a growing liability. The volume of raw footage—captured across iPhones, drones, and professional cameras—typically grows far faster than the capacity to edit it. The bottleneck isn't usually the editing software itself, but the index: the ability to find a specific moment, like "the elephant on the hill at golden hour," without scrubbing through hundreds of hours of unlabeled IMG_*.mov files.

A recent project demonstrated that the solution to this problem isn't a high-level AI video editor but a robust, local-first indexing pipeline. Using a 2021 M1 Max MacBook Pro and the Gemma 4 31B model, it turned a year of raw video into a searchable, English-language database without uploading a single gigabyte to the cloud.

The Architecture of a Local Index

Most AI video tools assume footage is already labeled. To solve the "first problem" (the index), a pipeline was constructed to generate sidecar files—plain text .description.md files that live alongside each clip. This ensures the data remains grep-able and portable across different drives.
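As a minimal sketch of why sidecars keep the archive grep-able, the snippet below derives a sidecar path for a clip and runs a plain substring search across a drive. The exact naming rule (replacing the clip's extension with .description.md) and both function names are assumptions made for illustration; the article only states that the files end in .description.md and live next to each clip.

```python
from pathlib import Path

SIDECAR_SUFFIX = ".description.md"  # suffix taken from the article


def sidecar_path(clip: Path) -> Path:
    """Return the sidecar next to a clip.

    Assumption: IMG_0001.mov maps to IMG_0001.description.md.
    """
    return clip.with_name(clip.stem + SIDECAR_SUFFIX)


def search_sidecars(root: Path, term: str) -> list[Path]:
    """Case-insensitive substring search over every sidecar under root."""
    needle = term.lower()
    hits = []
    for path in sorted(root.rglob("*" + SIDECAR_SUFFIX)):
        text = path.read_text(encoding="utf-8", errors="replace")
        if needle in text.lower():
            hits.append(path)
    return hits
```

Because the index is just text files, the same query also works with grep -ril "elephant" on any drive the archive is copied to.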

The Per-Clip Pipeline

Each clip goes through a fixed sequence of steps, so that all available metadata is gathered first and then handed to the model in a single vision pass:

  1. Metadata Extraction: ffprobe handles technical metadata, while exiftool extracts GPS coordinates (latitude, longitude, and altitude) from iPhone, DJI, and drone footage.
  2. Geocoding: GPS coordinates are converted to human-readable locations via Nominatim.
  3. Frame Extraction: ffmpeg extracts five evenly-spaced frames at 1920px to provide the vision model with a representative sample of the clip.
  4. Transcription: WhisperX provides word-level alignment and speaker diarization across 97 languages.
  5. Facial Recognition: insightface detects faces and stores 512-dimensional ArcFace embeddings in a centralized SQLite database for cross-archive person queries.
  6. Vision Analysis: A vision model (Gemma 4 31B via LM Studio) analyzes the frames, transcript, and folder context to produce a YAML frontmatter block and a prose description.
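The frame extraction step above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: it assumes "evenly-spaced" means the midpoints of five equal segments, that 1920px refers to frame width, and that one ffmpeg invocation is used per frame. The function names are hypothetical; the ffmpeg options used (-ss, -i, -frames:v, -vf scale) are standard.

```python
def frame_timestamps(duration_s: float, count: int = 5) -> list[float]:
    """Return the midpoints of `count` equal segments of the clip.

    Using midpoints avoids the very first and last frames, which are
    often black or shaky.
    """
    if duration_s <= 0 or count <= 0:
        raise ValueError("duration_s and count must be positive")
    step = duration_s / count
    return [round(step * (i + 0.5), 3) for i in range(count)]


def ffmpeg_frame_cmd(clip: str, timestamp_s: float, out_path: str,
                     width: int = 1920) -> list[str]:
    """Build an ffmpeg argument list that grabs one scaled frame."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{timestamp_s:.3f}",  # seek before the input for speed
        "-i", clip,
        "-frames:v", "1",             # write exactly one video frame
        "-vf", f"scale={width}:-2",   # keep aspect ratio, even height
        out_path,
    ]


# A 10-second clip yields timestamps at 1, 3, 5, 7 and 9 seconds.
print(frame_timestamps(10.0))
```

Each command list can be passed to subprocess.run(cmd, check=True) on a machine where ffmpeg is installed.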

Hardware Constraints and the "Swap" Strategy

One of the most surprising revelations of this build was the longevity of the M1 Max hardware. Running a 31B-parameter model (Q4 quantization) on a machine with 64GB of RAM pushed the system to its limits. During bulk runs, Activity Monitor reported over 50GB of swap usage, pushing memory pressure into the yellow zone.

While running a machine in a high-swap state is generally discouraged for daily use, the project found that for short-term, intensive batch processing, the M1 Max's unified memory architecture handles the load effectively. This suggests that local LLMs are becoming efficient enough that hardware from roughly five years ago can still serve as a powerful foundation for modern AI workloads.

Technical Lessons from the Build

Building the pipeline using Claude Code revealed several critical lessons in AI orchestration and schema design:

1. Enum Constraints vs. Open Prose

Open-ended prompts are prone to confabulation. For example, a model might describe a nighttime scene as "brightly lit" if it misinterprets reflections. Forcing the model to choose from a strict enum (e.g., golden_hour | bright_daylight | nighttime | unclear) restricts its answer to a small set of valid options, including an explicit unclear value for when it cannot tell, which significantly reduces hallucinations.
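On the consuming side, the enum can be enforced when the model's output is parsed. The sketch below uses the four values quoted above; the class name Lighting, the function parse_lighting, and the choice to fall back to unclear for unexpected output are assumptions made for illustration, not details from the project.

```python
from enum import Enum


class Lighting(str, Enum):
    """Allowed values, as listed in the article."""
    GOLDEN_HOUR = "golden_hour"
    BRIGHT_DAYLIGHT = "bright_daylight"
    NIGHTTIME = "nighttime"
    UNCLEAR = "unclear"


def parse_lighting(raw: str) -> Lighting:
    """Map model output to an enum member.

    Anything outside the allowed set (for example free prose such as
    "brightly lit") is treated as UNCLEAR rather than stored as-is.
    """
    try:
        return Lighting(raw.strip().lower())
    except ValueError:
        return Lighting.UNCLEAR
```

Falling back to unclear keeps bad values out of the index; a stricter pipeline might instead reject the response and re-prompt the model.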

2. Defensive API Integration

Fast-moving AI libraries like WhisperX change their API signatures often. The project implemented what it called "signature introspection": the script first tries the new keyword argument (e.g., token=) and, if a TypeError is raised, falls back to the old one (use_auth_token=), so the pipeline doesn't break during library updates.
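A minimal sketch of that fallback is shown below. The helper call_with_token and the two stand-in loader functions are hypothetical and exist only to make the example runnable; they are not WhisperX functions. Only the keyword names token and use_auth_token come from the article.

```python
from typing import Any, Callable


def call_with_token(factory: Callable[..., Any], token: str,
                    **kwargs: Any) -> Any:
    """Call `factory` with the new keyword, then retry with the old one."""
    try:
        return factory(token=token, **kwargs)
    except TypeError:
        # Older releases do not accept `token=`; retry with the old name.
        return factory(use_auth_token=token, **kwargs)


# Stand-ins for a newer and an older library version (hypothetical).
def load_pipeline_new(token: str, device: str = "cpu") -> tuple:
    return ("new", token, device)


def load_pipeline_old(use_auth_token: str, device: str = "cpu") -> tuple:
    return ("old", use_auth_token, device)


print(call_with_token(load_pipeline_new, "hf_xxx", device="mps"))
print(call_with_token(load_pipeline_old, "hf_xxx"))
```

One caveat of this design: a TypeError raised inside the callee for an unrelated reason also triggers the retry. Checking inspect.signature(factory).parameters before calling would avoid that, at the cost of failing on callables that hide their signature behind **kwargs.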

3. The Danger of Union-Type Schemas

Allowing a field to be either an integer or a specific string (e.g., people_count: 5 or people_count: "many") creates friction for downstream consumers. The lesson learned was to stick to a single type—always an integer—and use explicit guidance for estimations.
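A single-type rule is easy to enforce at write time. The validator below is a sketch under assumptions: the function name is hypothetical, and rejecting bad values outright (rather than coercing them) is one possible policy, not necessarily the project's.

```python
def validate_people_count(value: object) -> int:
    """Accept only non-negative integers for people_count."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"people_count must be an integer, got {value!r}")
    if value < 0:
        raise ValueError("people_count cannot be negative")
    return value
```

With this in place, a response such as people_count: "many" fails fast instead of reaching downstream consumers, and the estimation guidance lives in the prompt rather than in the schema.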

4. Permissive Culling for Memories

While professional photography requires aggressive culling (removing blur or jitter), personal archives require a more permissive approach. A handheld, blurry clip of a motorcycle ride might be technically "bad" but emotionally valuable. The culling criteria were reframed to remove only "non-recordings" (e.g., lens caps or pocket footage).

The Bigger Picture: The Index as the Prerequisite

The core insight of this project is that the current AI video editing market is pitched one layer too high. Most tools compete on the "surface"—the editing and assembly—while skipping the prerequisite: the index.

Once an archive is queryable in plain English, the editor becomes a thin orchestration layer. This allows for a two-tier scaling strategy: bulk indexing runs locally for cost and privacy, while high-end cloud models (like Claude 3.5/4.6) are used only for a final "re-rate" pass on clips flagged for review.
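The two-tier split can be pictured as a simple partition over the indexed records. In this sketch, the needs_review key and the dictionary-based record shape are hypothetical; the article only says that clips are "flagged for review" before the cloud re-rate pass.

```python
from typing import Iterable


def split_for_rerate(clips: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Partition clip records into (send_to_cloud, keep_local)."""
    to_cloud: list[dict] = []
    keep_local: list[dict] = []
    for clip in clips:
        # Records without the flag are treated as not needing review.
        if clip.get("needs_review", False):
            to_cloud.append(clip)
        else:
            keep_local.append(clip)
    return to_cloud, keep_local
```

Only the first list would ever leave the machine, which keeps cloud cost proportional to the number of uncertain clips rather than the size of the archive.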

By shifting the focus from the editor to the index, massive archives of raw footage are transformed from a digital graveyard into a functional, searchable asset.
