Pydantic AI with Gemini Pro

Update 6.11.2025: I have updated the code to use the latest version of Pydantic AI and added the GitHub repo, which was missing from the original post.

Building an AI-Powered SRE Agent with Pydantic AI: A Tool-Based Approach

Full disclosure: this is adapted from examples, and AI has been used to help me write this post.

Imagine being on-call when a flood of support tickets, alerts, SMS messages and calls suddenly hits your phone. Questions are coming in from all directions: “Is the system down?” “Why is everything so slow?” “Why are we getting so many error logs?” “We didn’t deploy anything new recently, why is this happening?”

As an SRE, you need to quickly assess the situation, check metrics, and communicate with stakeholders. But what will you say? How quickly can you assess the situation and provide a clear answer? What if an AI agent could help you do this instantly while you focus on solving the problem?

That’s exactly what this project demonstrates. It is a bare-bones proof of concept that shows how to build an intelligent incident analysis system using Pydantic AI together with Gemini Pro. Pydantic AI is a framework that makes AI agents not just conversational, but actually capable of interacting with your systems programmatically.

The key benefit? Instead of making the AI guess or hallucinate information, we give it tools to fetch real data and programmatically inject context about ongoing incidents. The result is an agent that can analyze production issues with actual metrics, not assumptions. Here we focus on SRE incident analysis, but the same principles apply to other domains; SRE just happens to be where I work and where I have the most experience.

Project Overview

The tool simulates an SRE incident response system powered by Google’s Gemini 2.5 Pro model. Yes, Gemini Pro is overkill for this proof of concept, but that’s what I’ll be using today. When you ask it about system behavior (e.g., “There’s a sudden increase in support requests. What’s going on?”), it:

  1. Receives incident context (title and description) through dynamic system prompts
  2. Uses its incident_metrics tool to fetch real-time system metrics (in this case mocked data about CPU, memory and disk usage)
  3. Analyzes the data with an SRE mindset (Our system prompt)
  4. Returns a structured assessment including severity classification

The system is built with Pydantic and type safety in mind, leveraging Pydantic’s validation to ensure both inputs and outputs are well-defined.

Example output

Here is an example output from the agent:

{
  "response_text": "There is an ongoing incident with a cloud provider, causing a major outage in one of the regions. This has resulted in a loss of 2/3 of our capacity, leading to severe degradation of our services. The remaining systems are under extreme load, with CPU, memory, and disk pressure at 90%. This is the root cause for the increase in support requests.",
  "degradation": true,
  "emergency": true,
  "criticality": 5
}

Architecture: File by File

Let’s break down how this elegant system works by examining each component.

incident_dependencies.py - The Context Container

from dataclasses import dataclass
from database import DatasourceConnection


@dataclass
class IncidentDependencies:
    incident_id: int
    db: DatasourceConnection

This simple file defines the dependency injection pattern used throughout the application. Every time the agent runs, it receives:

  • An incident_id to know which incident to analyze
  • A db connection to fetch data

This pattern is what enables the agent to be contextually aware without hardcoding any specific incident details. We use a mock database (a Python class backed by a dictionary, defined in database.py) to fetch data, but in production you would use an actual database or API.

database.py - The Mock Data Layer

This file simulates a production incident database with two sample incidents:

Incident 1: “System has somewhat constrained CPU resources”

  • CPU: 56%, Memory Pressure: 10%, Disk Pressure: 10%
  • Represents a moderate resource constraint scenario

Incident 2: “Cloud provider outage”

  • CPU: 90%, Memory Pressure: 90%, Disk Pressure: 90%
  • Represents an obvious critical infrastructure failure

The DatasourceConnection class provides async methods to fetch data from the mock database:

  • incident_title() - The incident headline (e.g. “System has somewhat constrained CPU resources”)
  • incident_description() - Detailed context about what’s happening (e.g. “The system is experiencing a moderate resource constraint scenario”)
  • incident_metrics() - Real-time system metrics (e.g. CPU: 56%, Memory Pressure: 10%, Disk Pressure: 10%)

In a production system, these would connect to actual monitoring systems (Prometheus, Datadog, CloudWatch), ticketing systems (PagerDuty, Opsgenie), and incident management platforms. This is the beauty of the tool-based approach. You can easily integrate with other systems by simply adding a new tool.

The code for this is in the database.py file.

from dataclasses import dataclass
from typing import Any

@dataclass
class Incident:
    id: int
    title: str
    description: str
    metrics: dict[str, Any]

# Mock database
INCIDENTS = {
    1: Incident(id=1, title="System has somewhat constrained CPU resources", description="Underlying K8S cluster is running low on CPU resource and average CPU load has increased beyond normal level", metrics={"CPU": 56, "MEMORY_PRESSURE": 10, "DISK_PRESSURE": 10}),
    2: Incident(id=2, title="Cloud provider outage", description="Cloud provider region failure. Unable to provision more capacity. Lost 2/3 of existing capacity.", metrics={"CPU": 90, "MEMORY_PRESSURE": 90, "DISK_PRESSURE": 90}),
}

class DatasourceConnection:
    async def incident_title(self, id: int) -> str:
        incident = INCIDENTS.get(id)
        return incident.title if incident else "Unknown Incident"

    async def incident_description(self, id: int) -> str:
        incident = INCIDENTS.get(id)
        return incident.description if incident else "Unknown Incident - No Description"

    async def incident_metrics(self, id: int) -> dict[str, Any]:
        incident = INCIDENTS.get(id)
        return incident.metrics if incident else {"CPU": 0, "MEMORY_PRESSURE": 0, "DISK_PRESSURE": 0}

main.py - Where the Magic Happens

This is the main entry point for the program. It initializes the agent and runs it.

import asyncio
import json
import os
from typing import Any

from dotenv import load_dotenv
from pydantic import BaseModel, Field
from pydantic_ai import Agent, RunContext

from database import DatasourceConnection
from incident_dependencies import IncidentDependencies

load_dotenv()
model_name = os.getenv("MODEL_NAME", "gemini-2.5-pro")


class IncidentOutput(BaseModel):
    response_text: str = Field(description="Description of the incident status")
    degradation: bool = Field(description="Is production service degraded")
    emergency: bool = Field(description="Is there production outage")
    criticality: int = Field(description="Criticality level", ge=0, le=5)


incident_agent = Agent(
    model=model_name,
    deps_type=IncidentDependencies,
    output_type=IncidentOutput,
    system_prompt=(
        "You are an SRE engineer. Your task is to provide analysis of the incident reports based on the metrics. "
        "Provide clear and concise advice, evaluate possible effect to production and assess urgency."
    ),
)


@incident_agent.system_prompt
async def add_incident_title(ctx: RunContext[IncidentDependencies]) -> str:
    incident_title = await ctx.deps.db.incident_title(id=ctx.deps.incident_id)
    return f"Incident title: {incident_title!r}."


@incident_agent.system_prompt
async def add_incident_description(ctx: RunContext[IncidentDependencies]) -> str:
    incident_description = await ctx.deps.db.incident_description(id=ctx.deps.incident_id)
    return f"Incident description: {incident_description!r}."


@incident_agent.tool
async def incident_metrics(ctx: RunContext[IncidentDependencies]) -> dict[str, Any]:
    """Returns latest metrics for the incident"""
    return await ctx.deps.db.incident_metrics(id=ctx.deps.incident_id)


async def main() -> None:
    deps = IncidentDependencies(incident_id=2, db=DatasourceConnection())

    result = await incident_agent.run(
        # "System is giving errors and is not responding to my requests.",
        "There is sudden increase in support requests. What is going on with the system?",
        deps=deps,
    )
    print(json.dumps(result.output.model_dump(), indent=2))
    """
    Example result:
    response_text='There is major incident, operations team is working on it and relevant people are informed.'
    degradation=True
    emergency=True
    criticality=5
    """


if __name__ == "__main__":
    asyncio.run(main())

1. Structured Output Definition

class IncidentOutput(BaseModel):
    response_text: str = Field(description="Description of the incident status")
    degradation: bool = Field(description="Is production service degraded")
    emergency: bool = Field(description="Is there production outage")
    criticality: int = Field(description="Criticality level", ge=0, le=5)

The agent doesn’t just return free-form text. It returns a structured, validated response that your code can immediately act upon. Need to trigger a PagerDuty escalation if criticality >= 4? You have that data reliably.
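For instance, downstream automation can branch on these fields directly. Here is a minimal sketch; the plain dataclass stands in for the Pydantic model, and `should_escalate` with its threshold is hypothetical, not part of the project:

```python
from dataclasses import dataclass


@dataclass
class IncidentOutput:
    response_text: str
    degradation: bool
    emergency: bool
    criticality: int  # 0-5, validated by the real Pydantic model


def should_escalate(output: IncidentOutput, threshold: int = 4) -> bool:
    # Page the on-call if there is an outage or criticality reaches the threshold
    return output.emergency or output.criticality >= threshold


major = IncidentOutput("Cloud provider outage", True, True, 5)
minor = IncidentOutput("Constrained CPU", True, False, 2)
print(should_escalate(major), should_escalate(minor))  # → True False
```

Because the fields are validated before your code sees them, this branch never has to defend against a missing key or a criticality of 7.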

2. Dynamic System Prompt Construction

@incident_agent.system_prompt
async def add_incident_title(ctx: RunContext[IncidentDependencies]) -> str:
    incident_title = await ctx.deps.db.incident_title(id=ctx.deps.incident_id)
    return f"Incident title: {incident_title!r}."

@incident_agent.system_prompt
async def add_incident_description(ctx: RunContext[IncidentDependencies]) -> str:
    incident_description = await ctx.deps.db.incident_description(id=ctx.deps.incident_id)
    return f"Incident description: {incident_description!r}."

Here’s where it gets interesting. The system prompt isn’t static — it’s dynamically constructed from your database. Before the agent even sees the user’s question, it already knows:

  • What incident it’s dealing with
  • The context and background of the issue

You’re not asking the user to copy-paste incident details; you’re seamlessly providing them from your systems.

3. Tool-Based Information Retrieval

@incident_agent.tool
async def incident_metrics(ctx: RunContext[IncidentDependencies]) -> dict[str, Any]:
    """Returns latest metrics for the incident"""
    return await ctx.deps.db.incident_metrics(id=ctx.deps.incident_id)

Instead of dumping all metrics into the prompt upfront, you give the agent a tool to fetch them on-demand.

Why does this matter?

Traditional approach: “Here’s the incident title, description, and 50 metrics. Now answer my question.”

  • Token-heavy
  • Overwhelming context
  • Agent may focus on irrelevant data

Tool-based approach: “Here’s the incident context. You have access to a metrics tool. Use it if you need it.”

  • Token-efficient
  • Agent decides what information it needs
  • Mirrors how real SREs work (checking specific metrics when hypothesizing)

When you ask “What’s going on with the system?”, the agent recognizes it needs metrics data to give an informed answer. It calls incident_metrics(), receives the data, and incorporates it into its analysis — all automatically.

The Tool-Based Paradigm: Why It Matters

The tool-based approach fundamentally changes how AI agents interact with your infrastructure. Let’s explore why this is powerful:

1. Separation of Concerns

The AI doesn’t need to know how to fetch metrics—just when to fetch them. Your database layer handles the how. This means you can:

  • Swap data sources without touching the agent code
  • Add caching layers transparently
  • Implement rate limiting or authentication

2. Cost and Context Efficiency

By fetching data only when needed, you save on token costs and keep the context window focused on relevant information. This matters little for this proof of concept, but it becomes important when your agent works in a domain with many metrics or other information that is irrelevant to the current question.

3. Auditability and Control

Every tool call is explicit and logged. You can see exactly what data the agent accessed, when, and why. This is essential for production systems where data access needs to be tracked; if you work in a domain where audit trails and compliance matter, this is a must-have.

4. Extensibility

Want to add more capabilities? Just add more tools:

  • get_recent_deployments() - Check if recent changes caused the issue
  • query_logs() - Fetch error logs for the timeframe
  • get_affected_customers() - Identify blast radius
  • create_incident_timeline() - Generate a timeline of events

Each tool is a new capability the agent can intelligently use.
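As a sketch of what adding one of these could look like, here is a hypothetical deployment data source, with the tool registration that would mirror `incident_metrics` shown as a comment; the deployment records are invented for illustration:

```python
import asyncio

# Hypothetical mock data: deployments in the hour before each incident.
RECENT_DEPLOYMENTS = {
    1: [{"service": "api", "version": "v1.4.2", "minutes_ago": 35}],
    2: [],  # no recent deployments -- points away from a bad release
}


class DeploymentSource:
    async def recent_deployments(self, incident_id: int) -> list[dict]:
        return RECENT_DEPLOYMENTS.get(incident_id, [])


# Registering it as a tool would mirror incident_metrics:
#
# @incident_agent.tool
# async def get_recent_deployments(ctx: RunContext[IncidentDependencies]) -> list[dict]:
#     """Returns deployments shortly before the incident started."""
#     return await ctx.deps.db.recent_deployments(incident_id=ctx.deps.incident_id)


async def demo() -> None:
    src = DeploymentSource()
    print(await src.recent_deployments(1))


if __name__ == "__main__":
    asyncio.run(demo())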

How Programmatic Context Makes the Agent Smarter

The agent never has to ask “What incident are you referring to?” or “Can you give me more details?” It already knows, because we programmatically provided:

  1. The base system prompt - “You are an SRE engineer…”
  2. Dynamic incident context - Title and description fetched from the database
  3. Tools for on-demand data - Metrics available when needed

This creates an experience where the agent feels a bit like a teammate who has access to the same dashboards and tools you do. You can ask natural questions like:

  • “There is sudden increase in support requests. What is going on with the system?”
  • “Should we wake up the infrastructure team?”
  • “Is this customer-facing?”

The agent can answer confidently because it has real data, not hallucinations. This can be combined with other tools like text-to-speech so automation can be used to answer questions and provide updates to stakeholders.

Ideas for Expansion

This proof-of-concept is for demonstration purposes only. In production you would want to extend it considerably; here are some directions you could take it.

1. Multi-Source Data Integration

Connect to real monitoring and observability platforms:

  • Metrics: Prometheus, Datadog, New Relic
  • Logs: Elasticsearch, Splunk, Loki
  • Traces: Jaeger, Zipkin
  • Incidents: PagerDuty, Opsgenie, Incident.io

2. Historical Analysis Tools

Add tools that provide temporal context:

  • get_similar_incidents() - Find past incidents with similar symptoms
  • get_recent_deployments() - Check what changed recently
  • get_incident_timeline() - Build a chronological event sequence

3. Automated Response Actions

Give the agent tools to take action (with appropriate safeguards):

  • scale_deployment() - Increase replica counts
  • restart_service() - Bounce a problematic service
  • rollback_deployment() - Revert to previous version
  • create_war_room() - Spin up a Slack channel and invite relevant people

4. Multi-Agent Collaboration

Create specialized agents for different domains:

  • Database SRE Agent - Specializes in database incidents
  • Network SRE Agent - Focuses on networking issues
  • Application SRE Agent - Handles application-level problems
  • Coordinator Agent - Routes questions to the right specialist

5. Learning from Resolutions

Store incident analyses and resolutions:

  • Build a knowledge base of “incident fingerprints”
  • Learn from past resolutions
  • Suggest playbooks based on similar historical incidents
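A toy sketch of fingerprint matching: represent each past incident by its metric snapshot and return the nearest one. The `HISTORY` entries and the plain Euclidean distance are illustrative assumptions; a real system would use richer features:

```python
import math

# Hypothetical "fingerprints": past incidents keyed by their metric snapshot
HISTORY = {
    "constrained-cpu": {"CPU": 56, "MEMORY_PRESSURE": 10, "DISK_PRESSURE": 10},
    "provider-outage": {"CPU": 90, "MEMORY_PRESSURE": 90, "DISK_PRESSURE": 90},
}


def closest_incident(metrics: dict[str, float]) -> str:
    # Compare metric vectors in a fixed key order so distances are well-defined
    def distance(past: dict[str, float]) -> float:
        return math.dist([metrics[k] for k in sorted(metrics)],
                         [past[k] for k in sorted(past)])

    return min(HISTORY, key=lambda name: distance(HISTORY[name]))


print(closest_incident({"CPU": 85, "MEMORY_PRESSURE": 80, "DISK_PRESSURE": 95}))
# → provider-outage
```

Exposed as a tool, this would let the agent say “this looks like the provider outage from last quarter” and pull up that incident's resolution.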

6. Proactive Monitoring

Instead of reactive analysis, run the agent continuously:

  • Monitor metrics streams
  • Detect anomalies before they become incidents
  • Generate early warnings: “CPU trending upward, may hit threshold in 15 minutes”

7. Compliance and Reporting

Generate structured outputs for compliance:

  • Incident post-mortems with analysis of what triggered the incident and how it was resolved
  • MTTR (Mean Time To Resolution) tracking
  • SLA impact assessments
  • Stakeholder communication templates

8. Integration with Chat Platforms

Deploy as a Slack/Teams bot:

  • Answer questions in incident channels (e.g. Slack, Teams, Discord)
  • Provide real-time metric updates
  • Automatically update incident status
  • Generate executive summaries for leadership

Technical Considerations for Production

If you’re building on this pattern in production, keep the following in mind:

Type Safety

The use of Pydantic models throughout ensures:

  • Input validation (incident IDs must be valid)
  • Output validation (for example criticality must be 0-5)
  • Runtime error prevention

Async/Await

The entire system is async, which means:

  • Non-blocking I/O for database calls
  • Concurrent tool executions possible
  • Scalable to handle multiple simultaneous incidents
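For example, the two system-prompt lookups (title and description) are independent, so a data layer could overlap them with `asyncio.gather`. The fake coroutines and latencies below are illustrative stand-ins for real database calls:

```python
import asyncio
import time


async def fake_title(incident_id: int) -> str:
    await asyncio.sleep(0.05)  # simulated I/O latency
    return "Cloud provider outage"


async def fake_description(incident_id: int) -> str:
    await asyncio.sleep(0.05)
    return "Region failure. Lost 2/3 of existing capacity."


async def gather_context(incident_id: int) -> list[str]:
    # gather() overlaps the two awaits instead of running them back to back
    return await asyncio.gather(fake_title(incident_id), fake_description(incident_id))


start = time.perf_counter()
title, description = asyncio.run(gather_context(2))
elapsed = time.perf_counter() - start
print(title, f"({elapsed:.2f}s)")  # roughly 0.05s total, not 0.10s
```

The same pattern scales to many concurrent incidents: each analysis is just another coroutine.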

Dependency Injection

The IncidentDependencies pattern allows:

  • Easy testing with mock databases
  • Different configurations per environment
  • Clean separation of concerns
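A sketch of what that testing story looks like: swap in an in-memory fake with the same async interface. The `FakeDatasource` values are made up, and `IncidentDependencies` is redeclared here only to keep the snippet self-contained:

```python
import asyncio
from dataclasses import dataclass
from typing import Any


class FakeDatasource:
    """In-memory stand-in for DatasourceConnection, handy in unit tests."""

    async def incident_title(self, id: int) -> str:
        return "Test incident"

    async def incident_description(self, id: int) -> str:
        return "Synthetic incident used only in tests"

    async def incident_metrics(self, id: int) -> dict[str, Any]:
        return {"CPU": 99, "MEMORY_PRESSURE": 99, "DISK_PRESSURE": 99}


@dataclass
class IncidentDependencies:  # same shape as incident_dependencies.py
    incident_id: int
    db: Any


async def demo() -> None:
    deps = IncidentDependencies(incident_id=1, db=FakeDatasource())
    metrics = await deps.db.incident_metrics(deps.incident_id)
    print(metrics["CPU"])


asyncio.run(demo())
```

Because the agent only ever touches `ctx.deps.db`, nothing in the agent code changes between this fake and the real connection.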

Error Handling

Production systems would need:

  • Graceful degradation when tools fail
  • Retry logic for transient failures
  • Fallback responses when data is unavailable (e.g. if metrics are not available, the agent can still answer the question with the available information)
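One way to sketch the retry-plus-fallback idea for a tool like `incident_metrics`; the backoff timings and the `{"error": ...}` fallback shape are arbitrary choices, not part of the project:

```python
import asyncio
from typing import Any, Awaitable, Callable


async def metrics_with_fallback(
    fetch: Callable[[int], Awaitable[dict[str, Any]]],
    incident_id: int,
    retries: int = 2,
) -> dict[str, Any]:
    for attempt in range(retries + 1):
        try:
            return await fetch(incident_id)
        except Exception:
            if attempt < retries:
                await asyncio.sleep(0.01 * 2 ** attempt)  # exponential backoff
    # Fallback: tell the agent explicitly that metrics are unavailable,
    # so it can still answer from the title and description alone.
    return {"error": "metrics unavailable"}


calls = {"n": 0}


async def flaky_fetch(incident_id: int) -> dict[str, Any]:
    calls["n"] += 1
    if calls["n"] < 2:  # fail on the first call, succeed after a retry
        raise ConnectionError("transient blip")
    return {"CPU": 56}


print(asyncio.run(metrics_with_fallback(flaky_fetch, 1)))  # → {'CPU': 56}
```

Returning an explicit error payload instead of raising keeps the tool call well-formed, and the agent can acknowledge the missing data in its answer.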

Security

Consider:

  • Authentication for tool access
  • Rate limiting to prevent abuse
  • Audit logging for all actions
  • Read-only vs. write access controls

Conclusion

This proof-of-concept demonstrates that modern AI agents become far more powerful when they can interact with your systems programmatically. The key ingredients:

  • Dynamic context injection (system prompts with real data)
  • Tool-based architecture (letting the agent fetch what it needs and when it needs it)
  • Structured outputs (type-safe responses you can act on)
  • Dependency injection (clean, testable design with mock data for testing)

This is just the beginning. As AI models improve and tool-based frameworks mature, we’ll see agents that can not just analyze incidents, but help resolve them.

Ready to build your own? Check out Pydantic AI and start experimenting with tool-based agents in your domain. The pattern shown here applies far beyond SRE: customer support, DevOps automation, data analysis, and anywhere you need AI to work with real data from your systems.

Source code

You can find the source code for this proof-of-concept in my GitHub repository: https://github.com/ristkari/pydantic-ai-sre-agent

In addition, you will need access to Gemini Pro to run this code. You can get access to Gemini Pro by applying for it, or if you already have an API key, use it by placing it in the .env file under the GOOGLE_API_KEY variable.

It is also possible to use other models than Gemini Pro. You can find the list of models in the Pydantic AI documentation.

To change the model, set the MODEL_NAME variable in your .env file; the code defaults to gemini-2.5-pro.