Overview
Binary Explorer is my Master's thesis project at the University of Calabria. The goal: build an agentic system that can autonomously analyse compiled binaries for security vulnerabilities — without requiring the analyst to manually run Ghidra, parse assembly, or write queries.
The system chains together:
- Ghidra (headless) for decompilation
- FAISS for semantic indexing of decompiled functions
- An LLM agent (via MCP) that reasons over the index and produces structured vulnerability reports
Architecture
```
Binary (ELF/PE/Mach-O)
        │
        ▼
Ghidra Headless      ← decompiles to pseudo-C + extracts function metadata
        │
        ▼
Chunking + Embedding ← each function → vector via sentence-transformers
        │
        ▼
FAISS Index          ← persisted on disk, queryable by semantic similarity
        │
        ▼
MCP Server           ← exposes tools: search_functions, get_xrefs, run_gdb
        │
        ▼
LLM Agent            ← Claude/GPT-4o with tool use, produces final report
```
Why MCP?
The Model Context Protocol lets the LLM call structured tools rather than receiving a giant context dump. Instead of feeding 200 decompiled functions into the prompt, the agent:
- Asks `search_functions("buffer overflow")` → gets the top-10 semantically similar functions
- Calls `get_xrefs(func_name)` → traces call chains
- Optionally calls `run_gdb(breakpoint, input)` → dynamic validation
This keeps token usage low and makes the reasoning traceable.
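The tool-calling loop can be sketched with a plain dispatch table standing in for the MCP transport. The real server speaks JSON-RPC per the MCP spec; the tool names below match the ones listed above, but the bodies are hypothetical stubs:

```python
# Hypothetical stand-in for the MCP server's tool dispatch.
# Real MCP runs over stdio/HTTP; this only shows the tool surface.

def search_functions(query: str, k: int = 10) -> list[str]:
    # Would query the FAISS index; stubbed here.
    return [f"func_{i}" for i in range(k)]

def get_xrefs(func_name: str) -> dict:
    # Would return callers/callees from Ghidra metadata; stubbed here.
    return {"calls": [], "called_by": []}

TOOLS = {
    "search_functions": search_functions,
    "get_xrefs": get_xrefs,
}

def handle_tool_call(name: str, **kwargs):
    # The agent sends (tool name, arguments); the server dispatches.
    return TOOLS[name](**kwargs)

hits = handle_tool_call("search_functions", query="buffer overflow", k=3)
```

Because every interaction is a named tool call with typed arguments, the agent's trajectory doubles as an audit log of how it reached each finding.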
Key Technical Decisions
Chunking strategy
Splitting at function boundaries (not arbitrary token counts) preserves semantic coherence. A function is the natural unit of analysis in binary reversing — splitting mid-function destroys context.
```python
def extract_functions(ghidra_output: str) -> list[Function]:
    # Parse Ghidra's pseudo-C output into discrete function objects.
    # Each Function has: name, address, pseudo_c, calls[], called_by[]
    ...
```
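One way to implement the function split is brace matching over the decompiled text. This is a minimal sketch that assumes one function per top-level `{ ... }` block — an assumption about the shape of the pseudo-C, not Ghidra's documented format:

```python
import re

def split_functions(pseudo_c: str) -> list[tuple[str, str]]:
    # Split decompiled pseudo-C at function boundaries by matching
    # signatures followed by balanced top-level brace blocks.
    # Returns (name, full_text) pairs; format assumptions are hypothetical.
    sig = re.compile(r"(\w+)\s*\([^)]*\)\s*\{")
    out, pos = [], 0
    while (m := sig.search(pseudo_c, pos)):
        depth, i = 1, m.end()
        while i < len(pseudo_c) and depth:
            depth += {"{": 1, "}": -1}.get(pseudo_c[i], 0)
            i += 1
        out.append((m.group(1), pseudo_c[m.start():i]))
        pos = i
    return out

sample = "int main(int argc) { if (x) { y(); } return 0; }\nvoid helper(void) { z(); }"
```

Splitting here rather than at fixed token counts means each vector in the index corresponds to exactly one unit the agent can reason about.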
Embedding model choice
I tested three models for the function → vector step:
| Model | Recall@10 | Latency | Notes |
|---|---|---|---|
| `all-MiniLM-L6-v2` | 71% | 12ms | Too generic for code |
| `code-search-net` | 84% | 18ms | Better, trained on code |
| `nomic-embed-code` | 89% | 22ms | Best results, used in final |
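The index-and-query step looks roughly like this. Random vectors stand in for the real `nomic-embed-code` embeddings, and a brute-force cosine search stands in for FAISS, so the sketch stays runnable without either dependency:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings: one 384-dim vector per decompiled function.
# The real system embeds pseudo-C with sentence-transformers and
# stores the vectors in a FAISS index persisted on disk.
function_names = ["parse_header", "copy_input", "free_session"]
embeddings = rng.normal(size=(len(function_names), 384)).astype("float32")
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def search(query_vec: np.ndarray, k: int = 2) -> list[str]:
    # Cosine similarity is an inner product on normalised vectors,
    # which is what an inner-product FAISS index computes over the
    # same data.
    q = query_vec / np.linalg.norm(query_vec)
    scores = embeddings @ q
    top = np.argsort(-scores)[:k]
    return [function_names[i] for i in top]

hits = search(rng.normal(size=384).astype("float32"))
```

Normalising at insert time means ranking reduces to a single matrix-vector product, which is why retrieval latency stays in the tens of milliseconds even per the table above.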
Vulnerability patterns
The agent uses a prompt that encodes known vulnerability patterns as search intents:
```python
VULN_QUERIES = [
    "strcpy strcat unbounded copy",
    "malloc free use after free",
    "integer overflow arithmetic check",
    "format string printf user input",
]
```
Each query retrieves candidate functions; the agent then reasons about whether the pattern is actually exploitable given the context.
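The retrieval fan-out over those query intents can be sketched as follows, with a hypothetical `search_functions` stub and a tiny hard-coded index standing in for the real FAISS-backed tool:

```python
VULN_QUERIES = [
    "strcpy strcat unbounded copy",
    "malloc free use after free",
]

def search_functions(query: str, k: int = 3) -> list[str]:
    # Hypothetical stub; the real tool queries the FAISS index.
    fake_index = {
        "strcpy strcat unbounded copy": ["copy_input", "build_path"],
        "malloc free use after free": ["free_session", "copy_input"],
    }
    return fake_index.get(query, [])[:k]

def gather_candidates(queries: list[str]) -> list[str]:
    # Fan out each vulnerability intent, dedupe while preserving
    # order; the agent then inspects each candidate's pseudo-C and
    # xrefs to decide whether the pattern is actually exploitable.
    seen, out = set(), []
    for q in queries:
        for fn in search_functions(q):
            if fn not in seen:
                seen.add(fn)
                out.append(fn)
    return out

candidates = gather_candidates(VULN_QUERIES)
```

Deduplication matters because the same function (here `copy_input`) often surfaces under several intents, and re-analysing it per query would waste agent turns.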
Results
Tested on a corpus of 40 intentionally vulnerable binaries (DVWA native, custom CTF binaries):
- True positive rate: 78% for stack-based buffer overflows
- False positive rate: 12% (mostly flagging safe uses of `strcpy` with bounded input)
- Mean time to report: 47 seconds per binary (vs. ~2 hours manual analysis)
The system performs best on well-structured C code with standard vulnerability patterns. It struggles with obfuscated code or custom allocators.
What I'd do differently
1. Use a code-specific LLM for the reasoning step. GPT-4o is good but not trained on assembly or decompiled C; models like deepseek-coder or a fine-tuned CodeLlama would likely improve precision.
2. Add a feedback loop. Right now the agent produces a report and stops. A human-in-the-loop step where the analyst validates/rejects findings could feed back into the retrieval weights over time.
3. Persistent cross-binary index. Each binary is indexed in isolation. Indexing a corpus of malware families together would enable cross-binary similarity queries — useful for variant detection.
Repository
The full implementation is on GitHub. The README includes setup instructions for Ghidra headless mode, which is the most painful part of the stack.