BabyAGI started as a weekend experiment: 105 lines of Python that looped forever, chaining LLM calls to execute tasks. Over three years and nine iterations, it became a testing ground for every major idea in autonomous agent design — task graphs, parallel execution, plugin systems, self-extending tools, persistent memory, multi-channel I/O. Each version asked a different question about what it takes to build an agent that can actually do things on its own.
The journey mirrors the broader evolution of the agent ecosystem. The classic era (Apr–Sep 2023) explored task list architectures during the initial burst of excitement after GPT-4. The framework era (Sep–Oct 2024) reconsidered the problem from first principles as tool-calling APIs matured. The assistant era (Feb 2026) confronted everything the earlier versions ignored: error recovery, context management, concurrency, and real-world I/O — building not just an agent but an autonomous assistant.
| | BabyAGI | Bee | Cat | Deer | Elf | Fox | v2 | 2o | v3 |
|---|---|---|---|---|---|---|---|---|---|
| Date | Apr '23 | Apr '23 | May '23 | Jun '23 | Jul '23 | Sep '23 | Sep '24 | Oct '24 | Feb '26 |
| Lines | 105 | 300 | 320 | 354 | 887 | 2,299 | 5,962 | 174 | 33,506 |
| Files | 1 | 1 | 1 | 1 | 11 | 25+ | 40+ | 1 | 70+ |
| Model | davinci-003 | GPT-4 | GPT-4 + 3.5 | GPT-3.5 | GPT-3.5 | GPT-3.5-16k | any | any (litellm) | any (litellm) |
| Planning | per-loop | dynamic replan | upfront | upfront | upfront + reflect | upfront + reflect | — | implicit (LLM) | implicit (LLM) |
| Execution | sequential | sequential | sequential | parallel | parallel | parallel | sequential | sequential | async + pool |
| Task deps | none | single | multi | multi | multi | multi | fn graph | none | objective tree |
| Termination | never | all complete | all complete | all complete | all complete | all complete | manual | tool signal | end_turn |
| Memory | Pinecone | session str | dep chain | dep chain | embeddings | ndjson + summary | exec logs | messages[] | SQLite + KG |
| Tools | 0 | 3 | 3 | 4 | 7 | 15+ | fn packs | self-creating | self-creating + persist |
| Extensibility | edit source | edit source | edit source | edit source | skill plugins | skill plugins | fn registry | runtime exec() | register + DB |
| I/O | CLI print | CLI print | CLI print | CLI + file | CLI + file | Flask web UI | Flask dashboard | CLI | multi-channel |
| Concurrency | — | — | — | threads | threads | threads | — | — | async + semaphore |
| Error handling | — | try/except | try/except | try/except | try/except | try/except | logged | try/except | retry + backoff + repair |
| Key insight | LLMs can chain tasks | tasks need structure | plan upfront | parallelize DAG | plugin architecture | chat + reflection | functions as atoms | LLM is the planner | autonomous assistant |
Each column corresponds to a version analyzed in detail below.
The original BabyAGI is a deceptively simple infinite loop: execute a task, store the result, create new tasks, reprioritize, repeat. The entire system is three LLM calls chained inside a while True loop.
# The three-agent loop — entire system logic
result = execution_agent(OBJECTIVE, task["task_name"])
# Store in Pinecone with ada-002 embeddings
index.upsert([(result_id, get_ada_embedding(vector),
{"task": task['task_name'], "result": result})])
# Generate new tasks from result
new_tasks = task_creation_agent(OBJECTIVE, enriched_result,
task["task_name"], [t["task_name"] for t in task_list])
# Reprioritize everything
prioritization_agent(this_task_id)
Uses text-davinci-003 (completion API, not chat). Pinecone for vector memory via text-embedding-ada-002. Task list is a Python deque. The context_agent retrieves the top-5 nearest results from Pinecone by cosine similarity but its output is never actually passed to the execution prompt — a telling sign this was a proof of concept.
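Roughly, that retrieval step looks like this (a sketch patterned on the snippet above; get_ada_embedding and index are the same helpers used there, and the exact call signature may differ):
# Sketch of the retrieval step (helper names from the snippet above).
def context_agent(query: str, n: int = 5):
    query_embedding = get_ada_embedding(query)        # ada-002 embedding of the objective
    results = index.query(query_embedding, top_k=n,   # top-n nearest stored task results
                          include_metadata=True)
    matches = sorted(results.matches, key=lambda m: m.score, reverse=True)
    return [m.metadata["task"] for m in matches]      # returned, but never reaches the execution prompt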
BabyBeeAGI introduces two foundational concepts: task dependencies and tools. Tasks are no longer a flat queue — each task carries a dependent_task_id, a tool specifier, and a status. Pinecone is dropped entirely; context is managed through a session_summary.
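For concreteness, a Bee task entry looks roughly like this (field names come from the description above; the values are illustrative):
# Illustrative BabyBeeAGI task entry (values made up; field names from the text).
task = {
    "id": 3,
    "task": "Summarize the scraped article",
    "tool": "text-completion",      # routes execution (see the dispatch below)
    "dependent_task_id": 2,         # must wait for task 2's result
    "status": "incomplete",         # flips to "complete" after execution
    "result": None,
}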
The single task_creation_agent becomes a task_manager_agent that both creates and reprioritizes tasks — and it uses GPT-4 (chat API, not completion). Tasks now have explicit tool routing:
if task["tool"] == "text-completion":
result = text_completion_tool(task_prompt)
elif task["tool"] == "web-search":
result = web_search_tool(task_prompt)
elif task["tool"] == "web-scrape":
result = web_scrape_tool(str(task['task']))
The task manager receives the entire task list (minus results, to fit context) and outputs a new JSON task list. Tasks are capped at 7 items. The loop now terminates when all tasks have status: "complete".
BabyCatAGI refactors BabyBeeAGI in two important ways. First, task creation is extracted into its own dedicated task_creation_agent that runs once at startup to produce the entire task plan upfront. Second, dependent_task_id becomes dependent_task_ids (plural) — tasks can now depend on multiple predecessors.
# BabyBeeAGI: single dependency
"dependent_task_id": 2
# BabyCatAGI: multiple dependencies
"dependent_task_ids": [1, 3, 4]
The web search tool now automatically scrapes and extracts from each result URL. The extraction uses chunked processing with overlap — a primitive RAG pattern:
chunk_size = 3000
overlap = 500
for i in range(0, len(large_string), chunk_size - overlap):
chunk = large_string[i:i + chunk_size]
# LLM extracts relevant info, appends to notes
The session summary agent is removed. Context flows through task outputs via dependency chains. The task manager agent is also removed — the plan is fixed at creation time.
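In other words, each task's prompt is assembled from the results of the tasks it depends on, roughly like this hypothetical sketch (not Cat's exact code):
# Hypothetical sketch: dependency outputs become the context for the next task.
def gather_dependent_outputs(task, task_list):
    by_id = {t["id"]: t for t in task_list}
    # the joined string is prepended to the execution prompt
    return "\n\n".join(by_id[dep_id].get("result", "")
                       for dep_id in task.get("dependent_task_ids", []))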
BabyDeerAGI's contribution is parallel execution. Tasks that have no mutual dependencies can now run concurrently using ThreadPoolExecutor.
from concurrent.futures import ThreadPoolExecutor
with ThreadPoolExecutor() as executor:
while True:
for task in task_list:
if task["status"] == "incomplete" \
and task_ready_to_run(task, task_list):
future = executor.submit(
execute_task, task, task_list, OBJECTIVE)
task["status"] = "running"
New additions: a user-input tool for interactive queries (human-in-the-loop), task creation downgraded from GPT-4 to GPT-3.5-turbo (cost optimization), and a smarter web search that uses dependent task outputs to refine search queries:
query = text_completion_tool(
"Generate a Google search query based on the following task: "
+ query + ". " + dependent_task + "\nSearch Query:")
Results are saved to timestamped files. The session summary is replaced by a concatenated output log.
BabyElfAGI is the first multi-file architecture. The monolithic script splits into a SkillRegistry, a TaskRegistry, and individual skill modules. This is the birth of the plugin system.
class Skill:
name = 'base skill'
description = 'This is the base skill.'
api_keys_required = []
def __init__(self, api_keys):
missing_keys = self.check_required_keys(api_keys)
self.valid = not missing_keys
def execute(self, params, dependent_task_outputs, objective):
raise NotImplementedError
Skills are discovered at runtime via filesystem scan and importlib. The registry filters by available API keys — skills with missing keys are silently skipped.
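The discovery pass is a directory scan plus importlib, roughly as follows (a sketch; the module path and registry details are assumptions, and Skill is the base class above):
# Sketch of runtime skill discovery (module path "skills" is an assumption).
import importlib
import os

def discover_skills(skill_dir, api_keys):
    skills = {}
    for fname in os.listdir(skill_dir):
        if not fname.endswith(".py") or fname.startswith("_"):
            continue
        module = importlib.import_module(f"skills.{fname[:-3]}")
        for obj in vars(module).values():
            if isinstance(obj, type) and issubclass(obj, Skill) and obj is not Skill:
                skill = obj(api_keys)
                if skill.valid:               # skills missing API keys are silently skipped
                    skills[skill.name] = skill
    return skills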
Task creation now uses few-shot example matching: example objective/tasklist pairs are stored as JSON files, and the most relevant example is selected using cosine similarity on ada-002 embeddings. An experimental reflection step can modify the task list after each execution.
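Selecting the example is plain cosine similarity over those stored embeddings, along these lines (a sketch; the helper name and data layout are assumptions):
import numpy as np

# Sketch: pick the stored objective/task-list example nearest to the new objective.
def pick_example(objective_embedding, examples):
    """examples: list of (embedding, tasklist) pairs loaded from the JSON example files."""
    def cosine(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(examples, key=lambda ex: cosine(objective_embedding, ex[0]))[1]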
The api_keys_required validation and dynamic discovery via importlib are patterns that persist through every subsequent version. This is where BabyAGI becomes extensible rather than just editable.

BabyFoxAGI adds a web UI and chat interface, powered by Flask. The agent is no longer a batch script — it's an interactive application. A user sends a message, and the system routes it to one of three paths: direct chat response, single skill execution, or full task list generation.
# GPT determines the path using OpenAI function calling
functions=[{
"name": "determine_response_type",
"parameters": {
"properties": {
"path": {
"enum": ["ChatCompletion", "Skill", "TaskList"]
},
"skill_used": { ... },
"objective": { ... },
"message_to_user": { ... }
}
}
}]
Background tasks execute in threads. A forever_cache.ndjson file stores the full conversation history. A rolling summary is maintained by combining the latest 20 messages with a running overall summary — a two-tier memory system. Skills expand significantly: image_generation, play_music, game_generation, startup_analysis, airtable_search, google_jobs_api_search, and drawing.
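The two-tier scheme is simple to sketch (the prompt wording and function name here are assumptions):
# Sketch of the two-tier memory update: a recent-message window plus a rolling summary.
def update_overall_summary(overall_summary, messages, llm):
    recent = "\n".join(f"{m['role']}: {m['content']}" for m in messages[-20:])
    prompt = ("Update the running summary of this conversation.\n\n"
              f"Summary so far:\n{overall_summary}\n\n"
              f"Latest messages:\n{recent}\n\nUpdated summary:")
    return llm(prompt)   # the result replaces the old overall summary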
The task registry adds objective reflection: before creating a task list, the system reflects on the objective to generate "helpful notes" that guide the task creation agent. After execution, it reflects on both the result (self-analysis) and the task list (generating improved versions saved as new example objectives).
A reflect_skills() method analyzes whether the available skills were sufficient — a proto-form of the self-extending agent that arrives in 2o and v3.

Between BabyFoxAGI and BabyAGI 2, three standalone projects explored a parallel question: can LLMs build and maintain structured knowledge graphs? The ideas developed here — schema-driven extraction, plugin architectures, persistent graph storage — directly influenced BabyAGI 2's function registration system and BabyAGI 3's memory architecture.
Instagraph: text in, knowledge graph out. A single API call to GPT-4 with a prompt asking it to identify entities and relationships, rendered with Cytoscape.js. No persistence, no schema, no deduplication — just postData("/get_response_data", payload) and a cose layout.
The proof of concept: LLMs can extract structured graphs from unstructured text without custom NLP pipelines.
MindGraph: a Flask app with a plugin integration system, schema-driven entity types, CRUD operations, and search. It uses OpenAI's function calling API with a knowledge_graph function definition to get structured extraction — entities, relationships, and source snippets — rather than hoping the LLM returns valid JSON.
The integration manager pattern and schema-as-function-signature approach both reappear in BabyAGI 2's function registration and BabyAGI 3's tool system.
Graphista: a full graph-based memory system. Two LLM loops: ingest() processes text through a SmartNodeProcessor (entity extraction, deduplication, graph updates via chain-of-thought) and ask() answers questions via a SmartRetrievalTool with multi-step reasoning. It supports multiple backends (local JSON, Neo4j, FalkorDB), ontology-driven schemas, embeddings, and batch operations.
The memory architecture that BabyAGI 3 absorbed: persistent graph storage, LLM-powered entity extraction, and embedding-based retrieval — all wrapped in a clean Memory class.
The progression from "ask the LLM once" (Instagraph) to "let the LLM manage a persistent graph through structured tool calls" (MindGraph → Graphista) traces the same arc as BabyAGI itself. The graph became the agent's memory layer — BabyAGI 3's SQLite-backed knowledge graph with entity extraction is the direct descendant of this work.
BabyAGI 2 is a complete rewrite, informed by the graph experiments above. The concept shifts from "task list execution" to functions as first-class entities. The core abstraction is Functionz — a framework where every capability is a registered function with versioning, dependency resolution, logging, triggers, and a database backend. MindGraph's integration manager pattern reappears here as a general-purpose function registry.
@func.register_function(
metadata={"description": "Search the web using SerpAPI"},
imports=["serpapi"],
key_dependencies=["SERPAPI_API_KEY"],
triggers=["log_search_result"]
)
def search_web(query: str) -> dict:
...
# Functions are stored in DB with full metadata:
# - versioned code (rollback support)
# - parsed input/output parameters via AST
# - dependency graph (auto-resolution at exec time)
# - triggers (functions that fire after execution)
# - execution logs with timing
The executor resolves the full dependency graph at runtime: it loads function code from the database, exec()s it into a local scope, resolves imports (auto-installing missing packages via pip), injects secret keys, and wraps dependent functions so they're logged when called. Every execution is logged with parameters, output, timing, parent log ID, and trigger chain.
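In outline, a single execution looks something like this (a simplified sketch; the db object and record fields stand in for the real Functionz internals):
import importlib
import subprocess
import sys

# Simplified sketch of running a DB-stored function with auto-installed imports.
def run_function(db, name, **kwargs):
    record = db.get_function(name)                # code, imports, key deps, triggers
    for pkg in record["imports"]:
        try:
            importlib.import_module(pkg)
        except ImportError:
            subprocess.check_call([sys.executable, "-m", "pip", "install", pkg])
    scope = {key: db.get_secret(key) for key in record["key_dependencies"]}
    exec(record["code"], scope)                   # defines the function in a fresh scope
    result = scope[name](**kwargs)
    db.log_execution(name, kwargs, result)        # every run is logged
    for trigger in record["triggers"]:            # fire downstream functions
        run_function(db, trigger, result=result)
    return result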
The self_build draft shows the ultimate goal: an LLM that can write, register, and execute its own functions. The react_agent draft implements a ReAct loop on top of Functionz.
BabyAGI 2o is a radical compression. The entire agent fits in 174 lines. It uses LiteLLM for model-agnostic inference and the native tool calling API. The agent starts with exactly three tools: create_or_update_tool, install_package, and task_completed.
# The agent builds its own tools. This is the entire system:
tools = [] # starts empty (plus 3 built-ins)
def create_or_update_tool(name, code, description, parameters):
exec(code, globals())
register_tool(name, globals()[name], description, parameters)
# Main loop: LLM calls tools, tools create more tools
while iteration < max_iterations:
    response = completion(model=MODEL_NAME, messages=messages,
                          tools=tools, tool_choice="auto")
    response_message = response.choices[0].message
    for tool_call in response_message.tool_calls or []:
        args = json.loads(tool_call.function.arguments)
        result = call_tool(tool_call.function.name, args)
    if any(tc.function.name == "task_completed"
           for tc in response_message.tool_calls or []):
        break
The system prompt tells the LLM to be self-sufficient: if information is needed, create a tool to find it. If a package is needed, install it. Auto-detect available API keys from environment variables. The LLM orchestrates everything through tool use — no explicit task list, no planner, no skill registry. Just a conversation with tool calls.
tool_choice="auto". The LLM is the planner.BabyAGI 3 is an autonomous assistant — not a task executor but a persistent agent that listens, remembers, and acts across channels. The core insight from the codebase docstring: "Everything is still a message." User input, tool execution, background objectives, scheduled tasks — all messages in conversation threads. The architecture extends the BabyAGI 2o pattern (LLM + tool loop) but adds every production concern: memory, multi-channel I/O, scheduling, budget tracking, context management, self-improvement, and error recovery.
The fundamental pattern is identical to 2o — an LLM message loop with tool calling — but wrapped in production infrastructure:
async def run_async(self, user_input, thread_id="main", context=None):
async with self._get_thread_lock(thread_id):
thread = self.threads.setdefault(thread_id, [])
self.repair_thread(thread_id) # fix orphaned tool_use
thread.append({"role": "user", "content": user_input})
self._refresh_tool_selection(user_input, context)
while True:
# Trim thread to fit context window
thread = self._context_budget.trim_thread(thread, ...)
# LLM call with 3-stage overflow recovery
try:
response = await self.client.messages.create(...)
except ContextOverflow:
# Stage 1: aggressive trim
# Stage 2: minimal thread + core tools
# Stage 3: clear thread, inform user
if response.stop_reason == "end_turn":
return self._extract_text(response)
# Execute tools (async in thread pool)
for block in response.content:
if block.type == "tool_use":
result = await asyncio.to_thread(
self.tools[block.name].execute, ...)
# Large results → LLM summarization
Background objectives run as separate agent loops in their own conversation threads with full concurrency control:
@dataclass
class Objective:
id: str
goal: str
status: str # pending → running → completed/failed/cancelled
priority: int # 1-10, lower = higher
budget_usd: float # cost cap (None = unlimited)
token_limit: int # token cap
retry_count: int # auto-retry with exponential backoff
error_history: list # fed to retries for adaptive strategy
Max 5 concurrent objectives via semaphore. Failed objectives retry with exponential backoff (2s, 4s, 8s). Each retry receives the full error history so the LLM can adapt its approach. Budget/token limits halt execution when exceeded.
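The concurrency and retry plumbing is standard asyncio, roughly as below (a sketch; run_async and the Objective fields appear above, but the surrounding method names are assumptions):
import asyncio

objective_semaphore = asyncio.Semaphore(5)        # max 5 concurrent objectives

# Sketch: run one objective with bounded concurrency and exponential-backoff retries.
async def run_objective(agent, objective, max_retries=3):
    async with objective_semaphore:
        for attempt in range(max_retries + 1):
            try:
                objective.status = "running"
                return await agent.run_async(objective.goal,
                                             thread_id=f"objective:{objective.id}")
            except Exception as exc:
                objective.error_history.append(str(exc))   # fed to the next attempt
                if attempt == max_retries:
                    objective.status = "failed"
                    raise
                await asyncio.sleep(2 ** (attempt + 1))     # 2s, 4s, 8s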
SQLite-backed persistent memory with event logging, entity extraction, knowledge graph, embeddings search, and context assembly — the direct descendant of the Instagraph → MindGraph → Graphista line. Memory is assembled into the system prompt dynamically. The ToolContextBuilder selects which tools to include in each API call based on query relevance, usage patterns, and current channel.
Tools created at runtime via register_tool are persisted to the database and reloaded on startup. Three tool types: executable (Python code), skill (behavioral instructions), and composio (third-party wrappers). External packages detected via AST analysis are sandboxed in e2b. Tools track execution statistics for monitoring.
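Persisting a runtime-created tool can be as simple as storing its source next to its metadata and re-exec()ing it at startup; a sketch, with an assumed SQLite schema:
import sqlite3

# Sketch: persist a runtime-created tool so it survives restarts (schema is illustrative).
def register_tool(db_path, name, code, description, tool_type="executable"):
    namespace = {}
    exec(code, namespace)                          # define the function now
    func = namespace[name]
    with sqlite3.connect(db_path) as conn:
        conn.execute("INSERT OR REPLACE INTO tools (name, code, description, type) "
                     "VALUES (?, ?, ?, ?)", (name, code, description, tool_type))
    return func                                    # caller adds it to the live tool registry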
The listener/sender architecture decouples I/O. Each message carries context: channel, is_owner, sender. The system prompt adapts — iMessage gets terse responses with no markdown; emails get structured prose; external contacts get privacy-respecting replies. Owner vs. non-owner access controls are enforced at the prompt level.
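Channel adaptation happens at prompt-assembly time; a minimal sketch, with illustrative channel names and wording:
# Sketch: channel- and sender-aware system prompt assembly (contents illustrative).
CHANNEL_STYLE = {
    "imessage": "Reply in one or two short sentences. No markdown.",
    "email":    "Reply in structured prose with a greeting and sign-off.",
    "cli":      "Reply in plain text; code blocks are fine.",
}

def build_system_prompt(base_prompt, channel, is_owner):
    style = CHANNEL_STYLE.get(channel, "")
    privacy = "" if is_owner else ("You are talking to an external contact; "
                                   "do not share the owner's private information.")
    return "\n\n".join(part for part in (base_prompt, style, privacy) if part)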
Thread repair fixes orphaned tool_use blocks (a common crash mode). 3-stage context overflow recovery. Per-thread locks prevent race conditions between concurrent sources. Large tool results are summarized by a fast model before entering the context. Graceful fallbacks at every layer — broken tools are disabled rather than crashing startup.
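Thread repair is worth sketching because orphaned tool_use blocks are such a common crash mode: an assistant turn containing tool_use blocks must be followed by matching tool_result blocks or the next API call is rejected. A rough sketch against the Anthropic-style message format used above (the exact repair strategy is an assumption):
# Sketch of thread repair: synthesize a stub tool_result for any orphaned tool_use.
def repair_thread(thread):
    for i, msg in enumerate(thread):
        if msg["role"] != "assistant" or not isinstance(msg.get("content"), list):
            continue
        tool_ids = [b["id"] for b in msg["content"]
                    if isinstance(b, dict) and b.get("type") == "tool_use"]
        if not tool_ids:
            continue
        nxt = thread[i + 1] if i + 1 < len(thread) else None
        answered = set()
        if nxt and isinstance(nxt.get("content"), list):
            answered = {b.get("tool_use_id") for b in nxt["content"]
                        if isinstance(b, dict) and b.get("type") == "tool_result"}
        stubs = [{"type": "tool_result", "tool_use_id": tid,
                  "content": "Tool execution was interrupted."}
                 for tid in tool_ids if tid not in answered]
        if not stubs:
            continue
        if nxt and nxt.get("role") == "user" and isinstance(nxt.get("content"), list):
            nxt["content"][:0] = stubs             # prepend to the existing user turn
        else:
            thread.insert(i + 1, {"role": "user", "content": stubs})
    return thread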
A capability matrix of the versions shows when each capability first appeared and how sophisticated its implementation became.
An approximate breakdown of the ~33.5k lines by function shows that production concerns dwarf the actual agent logic — the core loop that decides what to do next is under 200 lines.
The codebase progression reveals several non-obvious lessons for agent builders:
The original BabyAGI proved that 105 lines could produce emergent multi-step behavior. BabyAGI 2o later showed that even with tool calling and self-extension, the core agent fits in a few hundred lines. Everything beyond that is production concerns — the same concerns that dominate any real system.
The classic-era task list wasn't wrong — it was too simple. Long-running objectives shouldn't block chat. Background tasks need their own threads, budgets, and retry logic. BabyAGI 3's objectives system is far more sophisticated than a list, but the core idea of structured task tracking persists.
BabyFoxAGI's reflect_skills() identified missing skills but couldn't act on it. BabyAGI 2 stored improvements but lacked the runtime. BabyAGI 3 closes the loop: create tool → persist to DB → reload on startup.
BabyAGI 3 dedicates more code to context budgeting, thread trimming, tool result summarization, and overflow recovery than to its actual agent logic. Infrastructure is ~11,500 lines; the core loop is ~500.
The parallel Instagraph → MindGraph → Graphista line proved that LLMs can maintain structured knowledge graphs through tool calls. BabyAGI 3's memory absorbed this: entity extraction, knowledge graph queries, and embedding search — not just a rolling summary.
BabyAGI didn't evolve in isolation. The timeline below maps each version against the industry milestones that shaped it — organized by the concepts that matter to agent builders rather than by calendar.
| | Q1 '23 | Q2 '23 | Q3 '23 | Q4 '23 | Q1 '24 | Q2 '24 | Q3 '24 | Q4 '24 | Q1 '25 | Q1 '26 |
|---|---|---|---|---|---|---|---|---|---|---|
| BabyAGI | v1 | Bee · Cat · Deer | Elf · Fox | | | | v2 | 2o | | v3 |
| Models | GPT-4 | GPT-3.5 16k | | GPT-4 Turbo | | GPT-4o · Claude 3.5 Sonnet | o1 | | DeepSeek R1 | |
| Tool calling | | fn calling API | | Assistants API | | Claude tool use | | MCP | | |
| Structured output | | | | JSON mode | | | Structured Outputs | | | |
| Agent frameworks | AutoGPT · LangChain agents | LangChain Plan-and-Execute | | CrewAI | LangGraph | | | AutoGen + SK merge | OpenAI Agents SDK | OpenClaw |
| Graph projects | | | Instagraph | | MindGraph | GraphRAG | Graphista | | | |
| BabyAGI insight | LLMs can chain tasks | plan upfront · parallelize | plugin arch · reflect | | | | fns as atoms | LLM is planner | | autonomous assistant |
The evolution of tool calling directly shaped BabyAGI's architecture. The classic era (v1–Fox) predated OpenAI's function calling API — every version had to invent its own mechanism for getting an LLM to choose and invoke tools, typically via fragile JSON parsing from chat completions. When function calling launched in June '23, it formalized the pattern the classic versions had been improvising, and BabyFoxAGI adopted it directly for message routing. It also made the explicit task list look unnecessary: if the model can natively call functions, why maintain a separate planner?
That insight took more than a year to crystallize. The gap between Fox (Sep '23) and v2 (Sep '24) spans the arrival of the Assistants API, Claude tool use, GPT-4o, CrewAI, LangGraph, and structured outputs — a complete rewiring of the agent infrastructure stack. When BabyAGI returned, it had absorbed all of it: v2 replaced the task list with a function runtime, and 2o compressed the entire agent to 174 lines by delegating planning to native tool calling.
The second gap (Oct '24 → Feb '26) coincided with MCP and the OpenAI Agents SDK — the industry's shift from "how do agents call tools" to "how do agents interoperate." BabyAGI 3 reflects this: multi-channel I/O, standardized tool registration, and the production infrastructure (context management, budgeting, error recovery) that none of the earlier versions or frameworks addressed.
Structured outputs (Aug '24) are the quiet revolution in this timeline. Before them, every agent framework spent significant code parsing LLM responses into actionable data — BabyAGI's classic era is littered with json.loads() wrapped in try/except. Structured outputs eliminated that entire class of failure, and BabyAGI 2o's radical simplicity (174 lines) is partly possible because it doesn't need response parsing infrastructure.