Scene detection with timestamps from videos using Gemini

I had the privilege of going to the Google AI Hackathon at AGI House recently and building a Gemini-powered project.

My team and I built a tool that uses AI to dynamically create GIFs from any YouTube video. That means no more relying on Giphy to have the exact GIF you're looking for in its library.

GemGIF

Here I'll share what I learned about working with video data using Gemini's 2.0 Flash model.

Video understanding

Gemini has built-in video understanding capabilities, which I think is a pretty huge deal. Before this, you might have done something like feed the video transcript into the model and infer moments from the transcript data. Having native video understanding is a big step up from that.

This opens the door up to some really cool products.

Getting video data from Gemini

This is a part that isn't well documented at the time of writing.

In the Gemini web app, you can add YouTube videos and Gemini will understand them. I figured there'd be an equivalent API, but there isn't yet, so don't waste as much time as I did trying to track one down. For now, if your source video is on the web, the workaround is to download it yourself with a tool like youtube-dl (or its actively maintained fork, yt-dlp).
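
For example, here's a minimal sketch of that download step using yt-dlp's Python API (the function name and output path are placeholders for illustration, not part of our project's code):

import yt_dlp

def download_video(url: str, output_path: str = "input_video.mp4") -> str:
    # Restrict the download to a single mp4 file so the Gemini Files API
    # gets a container format it recognizes.
    ydl_opts = {"format": "mp4", "outtmpl": output_path}
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])
    return output_path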

Once you have the video data, you can upload it directly to Gemini. Here's the code to do that:

import time

from google import genai

# Not shown in the original snippet: create a client with your API key first.
client = genai.Client(api_key="YOUR_API_KEY")

def upload_video(video_file_name):
    # Upload the local video file to the Gemini Files API.
    video_file = client.files.upload(path=video_file_name)

    # The file can't be used until Gemini finishes processing it, so poll.
    while video_file.state == "PROCESSING":
        print("Waiting for video to be processed.")
        time.sleep(10)
        video_file = client.files.get(name=video_file.name)

    if video_file.state == "FAILED":
        raise ValueError(f"Video processing failed: {video_file.state}")
    print(f"Video processing complete: {video_file.uri}")

    return video_file

This function uploads the video to Gemini's servers and returns a reference to it. There's some processing time involved, so you'll need to poll and wait for processing to complete; once it does, you can start using the file.
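
Putting that together with the download sketch from earlier (the URL is a placeholder):

video_path = download_video("https://www.youtube.com/watch?v=...")
video = upload_video(video_path)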

This is the function we used to find GIF-worthy scenes in the input video that match the user's description:

from google.genai import types

def call_gemini_model(video, user_query: str):
    model_name = "gemini-2.0-flash-exp"
    prompt = f"""
    Task: Generate Multiple Engaging GIF Scenes
 
    Creative Brief: {user_query}
 
    Scene Selection Guidelines:
    - Identify 3-5 distinct scenes that comprehensively capture the user's intent
    - Each scene should be a 2-5 second segment with:
      1. Unique, dynamic visual moment
      2. High entertainment or informative value
      3. Clear alignment with the creative description
 
    Output Requirements:
    - Provide a JSON array of scene descriptions
    - Ensure scenes are diverse and complementary
    - Scenes should work well as individual GIFs
 
    Output Format:
    ```json
    [
      {{
        "start_time": "MM:SS",
        "end_time": "MM:SS",
        "caption": "Descriptive GIF scene explanation"
      }},
      {{
        "start_time": "MM:SS",
        "end_time": "MM:SS",
        "caption": "Another unique scene description"
      }}
    ]
    ```"""
 
    response = client.models.generate_content(
        model=model_name,
        contents=[
            types.Content(
                role="user",
                parts=[
                    types.Part.from_uri(file_uri=video.uri, mime_type=video.mime_type),
                ],
            ),
            prompt,
        ],
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema={
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "start_time": {"type": "string"},
                        "end_time": {"type": "string"},
                        "caption": {"type": "string"},
                    },
                    "required": ["start_time", "end_time", "caption"],
                },
            },
        ),
    )
    return response

Breaking this down a little: in the first section we construct the prompt. You can see we ask Gemini to give us timestamps into the video; Gemini understands this and includes them in its response:

prompt = f"""
    Task: Generate Multiple Engaging GIF Scenes
 
    Creative Brief: {user_query}
 
    Scene Selection Guidelines:
    - Identify 3-5 distinct scenes that comprehensively capture the user's intent
    - Each scene should be a 2-5 second segment with:
      1. Unique, dynamic visual moment
      2. High entertainment or informative value
      3. Clear alignment with the creative description
 
    Output Requirements:
    - Provide a JSON array of scene descriptions
    - Ensure scenes are diverse and complementary
    - Scenes should work well as individual GIFs
 
    Output Format:
    ```json
    [
      {{
        "start_time": "MM:SS",
        "end_time": "MM:SS",
        "caption": "Descriptive GIF scene explanation"
      }},
      {{
        "start_time": "MM:SS",
        "end_time": "MM:SS",
        "caption": "Another unique scene description"
      }}
    ]
    ```"""

It's really that simple! In the past you'd have needed a robust pipeline to do this, but now it's just a function call:

response = client.models.generate_content(
    model=model_name,
    contents=[
        types.Content(
            role="user",
            parts=[
                types.Part.from_uri(file_uri=video.uri, mime_type=video.mime_type),
            ],
        ),
        prompt,
    ],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema={
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "start_time": {"type": "string"},
                    "end_time": {"type": "string"},
                    "caption": {"type": "string"},
                },
                "required": ["start_time", "end_time", "caption"],
            },
        },
    ),
)

Note that video is the return value of the upload_video function above. We attach that video to our Gemini conversation as context, and the config parameter directs Gemini to output structured JSON that adheres to our expected schema.
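
From here it's plain Python. As a hedged sketch (the article's code doesn't cover this step, and moviepy is just one illustrative choice for the clipping), you might parse the response and cut each scene into a GIF like this:

import json

from moviepy.editor import VideoFileClip  # moviepy 1.x API

def mmss_to_seconds(ts: str) -> int:
    # Convert the "MM:SS" strings from our response schema into seconds.
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + int(seconds)

# Because response_mime_type is "application/json", response.text is parseable JSON.
scenes = json.loads(response.text)

with VideoFileClip("input_video.mp4") as clip:
    for i, scene in enumerate(scenes):
        start = mmss_to_seconds(scene["start_time"])
        end = mmss_to_seconds(scene["end_time"])
        clip.subclip(start, end).write_gif(f"scene_{i}.gif", fps=10)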

That's it

I really feel like there are several lucrative projects that could be built on this Gemini feature while it's still fresh. From my tests, it works really well on video content.

If you want to see the full code for the project my team made, you can find it on GitHub here.