Google Gemini: Multimodal AI for Next-Generation Applications

The Gemini Family

┌──────────────────────────────────────────────────────────────┐
│              Google Gemini Model Lineup (2025)               │
├──────────────────┬─────────────┬────────────┬───────────────┤
│  Model           │  Context    │  Modalities│  Strength     │
├──────────────────┼─────────────┼────────────┼───────────────┤
│  Gemini 2.5 Pro  │  1M tokens  │ Everything │  Best overall │
│  Gemini 2.5 Flash│  1M tokens  │ Everything │  Speed+cost   │
│  Gemini 2.0 Flash│  1M tokens  │ Everything │  Production   │
│  Gemini 1.5 Pro  │  2M tokens  │ Text+Vision│  Long context │
│  Gemma 3 (OSS)   │  128K       │ Text+Vision│  Self-hosted  │
└──────────────────┴─────────────┴────────────┴───────────────┘
  Modalities: Text, Images, Audio, Video, Code, Documents

Setup and Authentication

bash
pip install google-generativeai

python
import google.generativeai as genai
import os

# Read the API key from the environment rather than hard-coding it
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel(
    model_name="gemini-2.5-flash",
    system_instruction="You are a concise, expert technical assistant.",
    generation_config=genai.GenerationConfig(
        temperature=0.3,
        max_output_tokens=2048,
        response_mime_type="text/plain"
    )
)

response = model.generate_content("Explain transformer attention in 3 sentences.")
print(response.text)
print(f"Input tokens: {response.usage_metadata.prompt_token_count}")
print(f"Output tokens: {response.usage_metadata.candidates_token_count}")
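The setup above fails with a bare KeyError when GOOGLE_API_KEY is missing. Here is a minimal sketch of a friendlier check. Note that `load_api_key` is a hypothetical helper, not part of the SDK, and it accepts the environment as a parameter only so it can be exercised without touching real settings:

```python
import os
from typing import Mapping, Optional


def load_api_key(
    env: Optional[Mapping[str, str]] = None,
    name: str = "GOOGLE_API_KEY",
) -> str:
    """Return the API key from the environment, or raise a descriptive error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; export it before calling genai.configure()."
        )
    return key
```

With this in place, the configure call would read `genai.configure(api_key=load_api_key())`.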

Multimodal: Text + Images + Video

python
import time

import google.generativeai as genai
import PIL.Image

model = genai.GenerativeModel("gemini-2.5-pro")

# Image analysis
def analyze_image(image_path: str, prompt: str) -> str:
    img = PIL.Image.open(image_path)
    response = model.generate_content([img, prompt])
    return response.text

result = analyze_image(
    "screenshot.png",
    "Identify all UI components and suggest accessibility improvements."
)

# Video understanding (unique to Gemini)
def analyze_video(video_path: str, prompt: str) -> str:
    video_file = genai.upload_file(video_path, mime_type="video/mp4")

    # Wait for server-side processing to finish
    while video_file.state.name == "PROCESSING":
        time.sleep(2)
        video_file = genai.get_file(video_file.name)

    # Stop early if processing failed instead of sending an unusable file
    if video_file.state.name == "FAILED":
        raise ValueError(f"Video processing failed for {video_path}")

    response = model.generate_content([video_file, prompt])
    return response.text

# Analyze a recorded user testing session
insights = analyze_video(
    "user-session.mp4",
    "Identify usability issues: where does the user hesitate, look confused, or make errors?"
)

# PDF / document processing: Gemini natively understands PDFs
def process_pdf(pdf_path: str, query: str) -> str:
    pdf_file = genai.upload_file(pdf_path, mime_type="application/pdf")
    response = model.generate_content([pdf_file, query])
    return response.text
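The polling loop in `analyze_video` waits indefinitely if a file never leaves the PROCESSING state. One way to bound the wait is sketched below. `wait_until_ready` is a hypothetical helper, and the timeout and interval defaults are assumptions rather than values recommended by the SDK:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until_ready(
    fetch: Callable[[], T],
    is_processing: Callable[[T], bool],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Poll fetch() until is_processing() is False, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    current = fetch()
    while is_processing(current):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still processing after {timeout_s} seconds")
        sleep(interval_s)
        current = fetch()
    return current
```

In the video example this could be called as `wait_until_ready(lambda: genai.get_file(video_file.name), lambda f: f.state.name == "PROCESSING")`. Passing `sleep` as a parameter keeps the helper easy to test without real delays.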

Structured Output with Response Schema

python
import json

import google.generativeai as genai
import typing_extensions as typing

class TechArticle(typing.TypedDict):
    title: str
    summary: str
    key_concepts: list[str]
    difficulty: typing.Literal["BEGINNER", "INTERMEDIATE", "ADVANCED"]
    estimated_read_time_minutes: int

model = genai.GenerativeModel(
    model_name="gemini-2.5-flash",
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=TechArticle
    )
)

# Placeholder: supply the text of the article you want analyzed
article_text = "..."

response = model.generate_content(
    f"Analyze this article about RAG systems and extract metadata.\n\n{article_text}"
)

# The response body is a JSON string that follows the schema
article: TechArticle = json.loads(response.text)
print(f"Difficulty: {article['difficulty']}, Read time: {article['estimated_read_time_minutes']} min")
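Note that `json.loads` only parses the response; it does not confirm that the fields match `TechArticle`. A minimal sketch of a defensive check follows, using only the standard library. `validate_article` is a hypothetical helper, and the field list simply mirrors the TypedDict above:

```python
import json
from typing import Any

ALLOWED_DIFFICULTY = {"BEGINNER", "INTERMEDIATE", "ADVANCED"}
EXPECTED_FIELDS = {
    "title": str,
    "summary": str,
    "key_concepts": list,
    "difficulty": str,
    "estimated_read_time_minutes": int,
}


def validate_article(raw: str) -> dict[str, Any]:
    """Parse a JSON string and check it against the TechArticle shape."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        # bool is a subclass of int, so reject it explicitly for int fields
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"wrong type for field: {field}")
    if not all(isinstance(c, str) for c in data["key_concepts"]):
        raise ValueError("key_concepts must contain only strings")
    if data["difficulty"] not in ALLOWED_DIFFICULTY:
        raise ValueError(f"unexpected difficulty: {data['difficulty']}")
    return data
```

Calling `validate_article(response.text)` in place of `json.loads(response.text)` would surface a malformed response as a clear error rather than a KeyError later on.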

Grounding with Google Search

Gemini can search the web in real time and ground its answers in current information:

python
import google.generativeai as genai

model = genai.GenerativeModel("gemini-2.5-pro")

tool = genai.protos.Tool(
    google_search=genai.protos.GoogleSearch()
)

response = model.generate_content(
    "What are the latest AI model releases in the last week?",
    tools=[tool]
)

print(response.text)
# The grounding metadata lists the web sources behind the answer
for chunk in response.candidates[0].grounding_metadata.grounding_chunks:
    print(f"Source: {chunk.web.title} ({chunk.web.uri})")

Code Execution

Gemini 2.5 Pro can write AND run Python code, returning actual computed results:

python
import google.generativeai as genai

model = genai.GenerativeModel(
    model_name="gemini-2.5-pro",
    tools=["code_execution"]
)

response = model.generate_content(
    """Analyze this dataset and find statistical outliers:
    [23, 45, 12, 67, 234, 34, 56, 11, 890, 45, 23, 67]
    Plot a box plot and return the outlier values."""
)

# Every part exposes all three attributes, so hasattr() is always True;
# check whether each field actually holds content instead.
for part in response.candidates[0].content.parts:
    if part.executable_code.code:
        print(f"Code executed:\n{part.executable_code.code}")
    if part.code_execution_result.output:
        print(f"Output:\n{part.code_execution_result.output}")
    if part.text:
        print(f"Analysis:\n{part.text}")
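It is worth sanity-checking the model's computed answer locally. Here is a minimal sketch that applies the common 1.5 × IQR rule to the same dataset. The choice of rule is an assumption, since the prompt does not say which outlier definition the model should use:

```python
import statistics


def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], in input order."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


data = [23, 45, 12, 67, 234, 34, 56, 11, 890, 45, 23, 67]
print(iqr_outliers(data))  # → [234, 890]
```

For this dataset Q1 is 23 and Q3 is 67, so the fences sit at -43 and 133, and only 234 and 890 fall outside them.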

Multi-Turn Conversations

python
chat = model.start_chat(history=[])

# Maintains full conversation history automatically
response1 = chat.send_message("I'm building a recommendation engine. What algorithm should I use?")
print(response1.text)

response2 = chat.send_message("My dataset has 50M users and 1M items. Does that change your recommendation?")
print(response2.text)  # References previous context

response3 = chat.send_message("Show me a Python implementation of the approach you suggested.")
print(response3.text)  # Builds on both previous messages

# Inspect full history
for message in chat.history:
    print(f"{message.role}: {message.parts[0].text[:100]}...")
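The history loop above appends "..." even to messages shorter than 100 characters, and multi-line replies break the one-line layout. A small sketch of a tidier preview is shown here. `preview` is a hypothetical helper, not part of the SDK:

```python
def preview(text: str, limit: int = 100) -> str:
    """Collapse whitespace and truncate, adding an ellipsis only if text was cut."""
    flat = " ".join(text.split())
    if len(flat) <= limit:
        return flat
    return flat[:limit].rstrip() + "..."
```

Inside the loop it would be used as `print(f"{message.role}: {preview(message.parts[0].text)}")`.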

Long-Context: 1M Token Window

Gemini's 1M to 2M token context window is large enough to hold many codebases in a single prompt, which makes whole-project analysis practical:

python
import os
from pathlib import Path

import google.generativeai as genai

def analyze_entire_codebase(project_dir: str, question: str) -> str:
    """Feed an entire codebase to Gemini for analysis."""
    code_content = []
    total_chars = 0

    for root, dirs, files in os.walk(project_dir):
        # Skip node_modules, .git, etc. (pruning dirs in place stops os.walk descending)
        dirs[:] = [d for d in dirs if d not in {".git", "node_modules", "dist", ".next"}]

        for file in files:
            if file.endswith((".ts", ".tsx", ".py", ".go", ".rs")):
                path = os.path.join(root, file)
                content = Path(path).read_text(errors="ignore")
                rel_path = os.path.relpath(path, project_dir)
                code_content.append(f"### {rel_path}\n```\n{content}\n```")
                total_chars += len(content)

    full_context = "\n\n".join(code_content)
    print(f"Feeding {len(code_content)} files ({total_chars:,} chars) to Gemini")

    model = genai.GenerativeModel("gemini-1.5-pro")  # 2M context
    response = model.generate_content(
        f"Codebase:\n{full_context}\n\nQuestion: {question}"
    )
    return response.text

# Ask architectural questions across the entire codebase
analysis = analyze_entire_codebase(
    "./my-app",
    "Find all security vulnerabilities (XSS, SQLi, auth issues) in this codebase."
)
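Before sending a whole codebase, it helps to check that it is likely to fit. The sketch below uses a rough rule of thumb of about 4 characters per token. That ratio and the reserve for the model's reply are assumptions, and real token counts vary by language and content:

```python
import math


def estimate_tokens(total_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from a character count (heuristic, not exact)."""
    return math.ceil(total_chars / chars_per_token)


def fits_in_context(
    total_chars: int,
    context_tokens: int = 1_000_000,
    reserve_tokens: int = 8_192,
    chars_per_token: float = 4.0,
) -> bool:
    """True if the estimated prompt plus a reply reserve fits the window."""
    needed = estimate_tokens(total_chars, chars_per_token) + reserve_tokens
    return needed <= context_tokens
```

The `total_chars` value already tracked in `analyze_entire_codebase` can be passed straight in. For an exact figure, the SDK's `model.count_tokens(...)` call reports the real count, at the cost of an extra request.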

Vertex AI Integration (Production)

python
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="my-gcp-project", location="us-central1")

model = GenerativeModel(
    "gemini-2.5-pro",
    system_instruction="You are a production AI assistant."
)

# Vertex AI provides: enterprise SLAs, VPC, audit logs, IAM, no data training
response = model.generate_content(
    ["Explain cloud architecture best practices for fintech applications."],
    generation_config={
        "max_output_tokens": 2048,
        "temperature": 0.1
    }
)

Gemini vs Claude vs GPT-4o at a Glance

┌───────────────┬──────────────┬──────────────┬──────────────┐
│ Feature       │ Gemini 2.5   │ Claude 3.7   │ GPT-4o       │
├───────────────┼──────────────┼──────────────┼──────────────┤
│ Context       │ 1M tokens    │ 200K tokens  │ 128K tokens  │
│ Video input   │ ✅ Native    │ ❌           │ ❌           │
│ Audio input   │ ✅ Native    │ ❌           │ ✅ (Whisper) │
│ Code execution│ ✅ Built-in  │ ❌           │ ✅ (sandbox) │
│ Web search    │ ✅ Built-in  │ ❌           │ ✅ (plugin)  │
│ Cost/MTok out │ ~$3.50       │ ~$15         │ ~$10         │
│ Safety focus  │ Medium       │ High (Const.)│ Medium       │
└───────────────┴──────────────┴──────────────┴──────────────┘

Sumit Kumar Pandey
