Overview
Binary Explorer is my Master's thesis project at the University of Calabria. The goal: build an agentic system that can autonomously analyse compiled binaries for security vulnerabilities — without requiring the analyst to manually run Ghidra, parse assembly, or write queries.
The system chains together:
- Ghidra (headless) for decompilation
- FAISS for semantic indexing of decompiled functions
- An LLM agent (via MCP) that reasons over the index and produces structured vulnerability reports
Architecture
```
Binary (ELF/PE/Mach-O)
        │
        ▼
Ghidra Headless      ← decompiles to pseudo-C + extracts function metadata
        │
        ▼
Chunking + Embedding ← each function → vector via sentence-transformers
        │
        ▼
FAISS Index          ← persisted on disk, queryable by semantic similarity
        │
        ▼
MCP Server           ← exposes tools: search_functions, get_xrefs, run_gdb
        │
        ▼
LLM Agent            ← Claude/GPT-4o with tool use, produces final report
```
Why MCP?
The Model Context Protocol lets the LLM call structured tools rather than receiving a giant context dump. Instead of feeding 200 decompiled functions into the prompt, the agent:
- Asks `search_functions("buffer overflow")` → gets the top-10 semantically similar functions
- Calls `get_xrefs(func_name)` → traces call chains
- Optionally calls `run_gdb(breakpoint, input)` → dynamic validation
This keeps token usage low and makes the reasoning traceable.
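The tool-calling loop can be sketched with a plain dispatch table standing in for the MCP transport. The real server speaks JSON-RPC per the MCP spec; the tool names below match the ones listed above, but the bodies are hypothetical stubs:

```python
# Hypothetical stand-in for the MCP server's tool dispatch.
# Real MCP runs over stdio/HTTP; this only shows the tool surface.

def search_functions(query: str, k: int = 10) -> list[str]:
    # Would query the FAISS index; stubbed here.
    return [f"func_{i}" for i in range(k)]

def get_xrefs(func_name: str) -> dict:
    # Would return callers/callees from Ghidra metadata; stubbed here.
    return {"calls": [], "called_by": []}

TOOLS = {
    "search_functions": search_functions,
    "get_xrefs": get_xrefs,
}

def handle_tool_call(name: str, **kwargs):
    # The agent sends (tool name, arguments); the server dispatches.
    return TOOLS[name](**kwargs)

hits = handle_tool_call("search_functions", query="buffer overflow", k=3)
```

Because every interaction is a named tool call with typed arguments, the agent's trajectory doubles as an audit log of how it reached each finding.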
Key Technical Decisions
Chunking strategy
Splitting at function boundaries (not arbitrary token counts) preserves semantic coherence. A function is the natural unit of analysis in binary reversing — splitting mid-function destroys context.
```python
def extract_functions(ghidra_output: str) -> list[Function]:
    # Parse Ghidra's pseudo-C output into discrete function objects.
    # Each Function has: name, address, pseudo_c, calls[], called_by[]
    ...
```
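One way to implement the function split is brace matching over the decompiled text. This is a minimal sketch that assumes one function per top-level `{ ... }` block — an assumption about the shape of the pseudo-C, not Ghidra's documented format:

```python
import re

def split_functions(pseudo_c: str) -> list[tuple[str, str]]:
    # Split decompiled pseudo-C at function boundaries by matching
    # signatures followed by balanced top-level brace blocks.
    # Returns (name, full_text) pairs; format assumptions are hypothetical.
    sig = re.compile(r"(\w+)\s*\([^)]*\)\s*\{")
    out, pos = [], 0
    while (m := sig.search(pseudo_c, pos)):
        depth, i = 1, m.end()
        while i < len(pseudo_c) and depth:
            depth += {"{": 1, "}": -1}.get(pseudo_c[i], 0)
            i += 1
        out.append((m.group(1), pseudo_c[m.start():i]))
        pos = i
    return out

sample = "int main(int argc) { if (x) { y(); } return 0; }\nvoid helper(void) { z(); }"
```

Splitting here rather than at fixed token counts means each vector in the index corresponds to exactly one unit the agent can reason about.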
Embedding model choice
I tested three models for the function → vector step:
| Model | Recall@10 | Latency | Notes |
|---|---|---|---|
| `all-MiniLM-L6-v2` | 71% | 12ms | Too generic for code |
| `code-search-net` | 84% | 18ms | Better, trained on code |
| `nomic-embed-code` | 89% | 22ms | Best results, used in final |
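The index-and-query step looks roughly like this. Random vectors stand in for the real `nomic-embed-code` embeddings, and a brute-force cosine search stands in for FAISS, so the sketch stays runnable without either dependency:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings: one 384-dim vector per decompiled function.
# The real system embeds pseudo-C with sentence-transformers and
# stores the vectors in a FAISS index persisted on disk.
function_names = ["parse_header", "copy_input", "free_session"]
embeddings = rng.normal(size=(len(function_names), 384)).astype("float32")
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def search(query_vec: np.ndarray, k: int = 2) -> list[str]:
    # Cosine similarity is an inner product on normalised vectors,
    # which is what an inner-product FAISS index computes over the
    # same data.
    q = query_vec / np.linalg.norm(query_vec)
    scores = embeddings @ q
    top = np.argsort(-scores)[:k]
    return [function_names[i] for i in top]

hits = search(rng.normal(size=384).astype("float32"))
```

Normalising at insert time means ranking reduces to a single matrix-vector product, which is why retrieval latency stays in the tens of milliseconds even per the table above.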
Vulnerability patterns
The agent uses a prompt that encodes known vulnerability patterns as search intents:
```python
VULN_QUERIES = [
    "strcpy strcat unbounded copy",
    "malloc free use after free",
    "integer overflow arithmetic check",
    "format string printf user input",
]
```
Each query retrieves candidate functions; the agent then reasons about whether the pattern is actually exploitable given the context.
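The retrieval fan-out over those query intents can be sketched as follows, with a hypothetical `search_functions` stub and a tiny hard-coded index standing in for the real FAISS-backed tool:

```python
VULN_QUERIES = [
    "strcpy strcat unbounded copy",
    "malloc free use after free",
]

def search_functions(query: str, k: int = 3) -> list[str]:
    # Hypothetical stub; the real tool queries the FAISS index.
    fake_index = {
        "strcpy strcat unbounded copy": ["copy_input", "build_path"],
        "malloc free use after free": ["free_session", "copy_input"],
    }
    return fake_index.get(query, [])[:k]

def gather_candidates(queries: list[str]) -> list[str]:
    # Fan out each vulnerability intent, dedupe while preserving
    # order; the agent then inspects each candidate's pseudo-C and
    # xrefs to decide whether the pattern is actually exploitable.
    seen, out = set(), []
    for q in queries:
        for fn in search_functions(q):
            if fn not in seen:
                seen.add(fn)
                out.append(fn)
    return out

candidates = gather_candidates(VULN_QUERIES)
```

Deduplication matters because the same function (here `copy_input`) often surfaces under several intents, and re-analysing it per query would waste agent turns.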
Results
Tested on a corpus of 40 intentionally vulnerable binaries (DVWA native, custom CTF binaries):
- True positive rate: 78% for stack-based buffer overflows
- False positive rate: 12% (mostly flagging safe uses of `strcpy` with bounded input)
- Mean time to report: 47 seconds per binary (vs. ~2 hours manual analysis)
The system performs best on well-structured C code with standard vulnerability patterns. It struggles with obfuscated code or custom allocators.
What I'd do differently
1. Use a code-specific LLM for the reasoning step. GPT-4o is good but not trained on assembly or decompiled C; models like deepseek-coder or a fine-tuned CodeLlama would likely improve precision.
2. Add a feedback loop. Right now the agent produces a report and stops. A human-in-the-loop step where the analyst validates/rejects findings could feed back into the retrieval weights over time.
3. Persistent cross-binary index. Each binary is indexed in isolation. Indexing a corpus of malware families together would enable cross-binary similarity queries — useful for variant detection.
Repository
The full implementation is on GitHub. The README includes setup instructions for Ghidra headless mode, which is the most painful part of the stack.