Scene detection with timestamps from videos using Gemini
I had the privilege of going to the Google AI Hackathon at AGI House
recently and building a Gemini-powered project.
My team and I built a tool that uses AI
to dynamically create GIFs from any YouTube video. That means no more relying on Giphy to have
the exact GIF you're looking for in its library.
GemGIF
Here I'll share what I learned about working with video data using Gemini's 2.0 Flash model.
Video understanding
Gemini has built-in video understanding capabilities, which is a pretty huge thing in my opinion.
Before this, you might have done something like feed the video transcript into the model
and infer moments from the transcript data. Native video understanding is a big step up from that,
and it opens the door to some really cool products.
Getting video data from Gemini
This part isn't well documented at the time of writing.
In the Gemini web app, you can add YouTube videos and Gemini will understand them. I figured there'd be
an equivalent API, but there isn't one yet, so don't waste as much time as I did trying to track it down.
For now, the solution is to download the video with a tool like youtube-dl
if you're taking it from the web.
Once you have the video data, you can upload it directly to Gemini. Here's the code to do that:
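A sketch of that upload step, using the Files API from the google-genai Python SDK. The helper name upload_video and the polling details are my own; the client is assumed to be a genai.Client created elsewhere.

```python
import time

# Sketch, assuming the `google-genai` SDK: `client` is a genai.Client
# created elsewhere, e.g. client = genai.Client(api_key=...).
def upload_video(client, path):
    """Upload a local video to Gemini's Files API and wait for processing."""
    video_file = client.files.upload(file=path)
    # Uploads are processed asynchronously; poll until the file is ready.
    while video_file.state.name == "PROCESSING":
        time.sleep(2)
        video_file = client.files.get(name=video_file.name)
    if video_file.state.name == "FAILED":
        raise RuntimeError(f"Gemini failed to process {video_file.name}")
    return video_file
```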
This function uploads the video to Gemini servers and returns a reference to it.
There is some processing time involved, so you'll need to do something like poll and wait for the
processing to complete.
Once you have done this and Gemini finishes processing it, you can start using it.
This is the function we used to grab GIFs from the input video that match the input description:
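Here's a sketch of what such a function can look like with the google-genai SDK. The model id, prompt wording, and helper names are my assumptions, not the team's exact code.

```python
def build_prompt(description):
    # Asking explicitly for timestamps is the key trick: Gemini will ground
    # its answer in the video's timeline.
    return (
        f'Find every scene in this video that matches: "{description}". '
        "For each match, return start_timestamp and end_timestamp in MM:SS "
        "format, plus a one-line description of the scene."
    )

def find_scenes(client, video, description):
    """Ask Gemini for timestamped scenes in `video` matching `description`."""
    response = client.models.generate_content(
        model="gemini-2.0-flash",  # assumed model id
        contents=[video, build_prompt(description)],  # video rides along as context
        config={"response_mime_type": "application/json"},
    )
    return response.text  # JSON string of timestamped scenes
```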
Breaking this down a little bit, in the first section we construct the prompt. You can
see we ask Gemini to give us timestamps in the video. Gemini understands this and will
include timestamps as a result:
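For instance, a response to a request like "find the moment the speaker celebrates" might come back shaped like this (an illustrative example, not real model output):

```json
[
  {
    "start_timestamp": "01:12",
    "end_timestamp": "01:17",
    "description": "The speaker pumps both fists and celebrates on stage"
  }
]
```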
It's really that simple! In the past you'd have needed a fairly robust pipeline to do this;
now it's just a function call.
Note that video is the return value from the upload_video function above.
We can take that video and attach it to our Gemini conversation as context.
The config parameter
directs Gemini to output structured JSON adhering to our expected format.
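One way to express such a config is a JSON mime type plus an OpenAPI-style response_schema; the field names below are assumptions for illustration.

```python
# Sketch of a Gemini generation config that enforces structured output.
# The schema uses the API's OpenAPI-style type names; field names are my own.
scene_schema = {
    "type": "ARRAY",
    "items": {
        "type": "OBJECT",
        "properties": {
            "start_timestamp": {"type": "STRING"},
            "end_timestamp": {"type": "STRING"},
            "description": {"type": "STRING"},
        },
        "required": ["start_timestamp", "end_timestamp", "description"],
    },
}

config = {
    "response_mime_type": "application/json",
    "response_schema": scene_schema,
}
```

Passed as the config argument to generate_content, this constrains the response to parse as a list of scene objects.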
That's it
I really feel like there are several lucrative projects that could be built with this Gemini feature
while it's still fresh. From my tests, it works really well on video content.
If you want to see the full code for the project my team made, you can find it on GitHub here.