
Semantic Video Search: From Problem to Prototype - 11 Sep 2025

How I built a tool to help creators quickly find moments inside their videos

Why I Built This

This started with a simple but real problem faced by my friend Jay, who is also a content creator.

He spends hours digging through his old travel videos (gigabytes of them stored on Google Drive) just to find short clips that match his script. What he really needs is a way to search his own videos semantically, instead of scrubbing through timelines manually.

That’s where the idea clicked: can we build a tool that saves him hours by letting him search through his video repository as easily as searching text?



Researching Solutions

The problem isn’t unique to him; hundreds of creators face the same thing. So I started researching whether there were already tools that solved it.

I came across Twelve Labs, a platform that offers video embeddings and semantic search. It looked promising but came with usage limits and a high price tag. That made me curious: could I build a lighter, cheaper version of it for Jay?

I dove into reading about vision-language models and tested a few open-source ones, like LLaVA on Hugging Face.



My First Approach

After hours of tinkering, I came up with this workflow:

  1. Split the video into 5-second clips.
  2. Extract frames (1 frame per second) from each clip and run them through LLaVA to get image descriptions.
  3. Combine frame descriptions and feed them to an LLM to summarize the 5-second clip (since LLaVA only works on images, not full videos).
  4. Store embeddings of these clip descriptions in a vector database along with timestamps and the original video URL.
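To make steps 1 and 2 concrete, here is a minimal sketch of the frame sampling and captioning. The post doesn't pin down a specific toolchain, so this assumes OpenCV for reading frames and the llava-hf/llava-1.5-7b-hf checkpoint on a GPU; the function names (sample_frames, describe_frame, caption_clips) are mine, purely for illustration.

    import cv2
    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint; any LLaVA variant would do
    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = LlavaForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16, device_map="auto"  # assumes a GPU is available
    )

    def sample_frames(video_path: str, seconds_per_frame: float = 1.0):
        """Yield (timestamp_in_seconds, PIL.Image) at roughly one frame per second."""
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
        step = max(1, round(fps * seconds_per_frame))
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                yield idx / fps, Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            idx += 1
        cap.release()

    def describe_frame(image: Image.Image) -> str:
        """Ask LLaVA for a short description of a single frame."""
        prompt = "USER: <image>\nDescribe what is happening in this frame in one sentence. ASSISTANT:"
        inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
        output = model.generate(**inputs, max_new_tokens=60)
        text = processor.decode(output[0], skip_special_tokens=True)
        return text.split("ASSISTANT:")[-1].strip()

    def caption_clips(video_path: str, clip_seconds: int = 5) -> dict[int, list[str]]:
        """Group per-frame captions into consecutive 5-second windows."""
        clips: dict[int, list[str]] = {}
        for ts, frame in sample_frames(video_path):
            clips.setdefault(int(ts // clip_seconds), []).append(describe_frame(frame))
        return clips  # clip index -> list of frame captions, condensed in step 3
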

That covered the video embedding and ingestion step.
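Steps 3 and 4 then run over those grouped captions. Again, the exact components aren't the point: the sketch below assumes an OpenAI chat model for the summarization, sentence-transformers for embeddings, and ChromaDB as the vector store, all as stand-ins for whatever you have at hand.

    import chromadb
    from openai import OpenAI
    from sentence_transformers import SentenceTransformer

    llm = OpenAI()  # assumes OPENAI_API_KEY is set; any chat-capable LLM works here
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    db = chromadb.PersistentClient(path="./clip_index")
    collection = db.get_or_create_collection(
        "video_clips", metadata={"hnsw:space": "cosine"}  # cosine similarity at query time
    )

    def summarize_clip(frame_captions: list[str]) -> str:
        """Step 3: condense per-frame captions into one description of the 5-second clip."""
        prompt = (
            "These are captions of consecutive frames from a 5-second video clip. "
            "Summarize what happens in the clip in one or two sentences:\n"
            + "\n".join(frame_captions)
        )
        response = llm.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model choice
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content.strip()

    def index_clip(clip_id: str, frame_captions: list[str], video_url: str, start_s: float) -> None:
        """Step 4: embed the clip summary and store it with its timestamp and source URL."""
        summary = summarize_clip(frame_captions)
        collection.add(
            ids=[clip_id],
            embeddings=[embedder.encode(summary).tolist()],
            documents=[summary],
            metadatas=[{"video_url": video_url, "start_s": start_s, "end_s": start_s + 5}],
        )

At ingestion time, each 5-second window produced by caption_clips in the first sketch would feed straight into index_clip, one call per window.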



Semantic Search

Once the clips were embedded, searching became straightforward:

  1. Convert the query into an embedding.
  2. Search against the vector DB using cosine similarity.
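In code, that is only a few lines on top of the same stores used during ingestion. As before, the embedding model and ChromaDB are my stand-ins, and search_clips is a made-up name for illustration.

    import chromadb
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # must match the model used at ingestion
    collection = chromadb.PersistentClient(path="./clip_index").get_or_create_collection(
        "video_clips", metadata={"hnsw:space": "cosine"}
    )

    def search_clips(query: str, top_k: int = 5) -> list[dict]:
        """Embed the query and return the closest clips by cosine similarity."""
        result = collection.query(
            query_embeddings=[embedder.encode(query).tolist()], n_results=top_k
        )
        hits = []
        for summary, meta, dist in zip(
            result["documents"][0], result["metadatas"][0], result["distances"][0]
        ):
            hits.append({
                "summary": summary,
                "video_url": meta["video_url"],
                "start_s": meta["start_s"],
                "similarity": 1 - dist,  # Chroma returns cosine distance
            })
        return hits

    # e.g. search_clips("drone shot of a beach at sunset")
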

To my surprise, it worked! The results weren’t perfect, but they were good enough to prove the idea. Jay could finally search semantically instead of wasting time scrubbing through hours of footage.

Of course, there are inefficiencies and edge cases I haven’t solved yet, but this was a solid start. The important part was that it saved manual effort.



What’s Next

I’m now refactoring the codebase to make it usable for Jay beyond static test videos. Once the platform/server is ready, I’ll share another blog post with usage stats---and maybe even open-source the project.


Let's see how it goes.

Till then, happy coding!

Get in touch

Email me at sharmahritik2002@gmail.com or follow me via my social links.