Agentic Network Forensic — Autonomous Threat Hunting with LLMs
Context
This is my final project for SC4063 Network Forensics. The brief was straightforward: given a massive PCAP dataset, produce a forensic investigation report. Part 1 was done manually. For Part 2, I built an autonomous AI agent that does the entire investigation on its own — from ingesting raw PCAPs to generating a PDF report with MITRE ATT&CK mappings.
The Problem
Network forensic investigations are tedious. You’re staring at Wireshark, grepping through Zeek logs, cross-referencing Suricata alerts, and trying to piece together what happened. The dataset I was working with was 52 GB of PCAPs, 600K+ Suricata alerts, and 30 GB of Zeek logs. Doing this manually takes days. I wanted to see if an LLM agent could do it autonomously.
What I Built
Agentic Network Forensic is a tool-calling LLM agent that follows a strict 13-step investigation protocol. It streams through massive files without loading them into memory, calls specialized analysis tools, cross-references findings, and generates comprehensive forensic reports.
The agent isn’t just a wrapper around an LLM — it has guardrails against hallucination, mandatory tool-call requirements, and a findings pipeline that reconstructs results directly from raw evidence (bypassing LLM context window truncation).
Tech Stack
LLM / Agent Layer:
- Google Gemini (gemini-2.0-flash) — primary LLM, 1M token context
- Groq (LLaMA 3.3 70B) — alternative, 128K tokens, faster inference
- Ollama (Qwen3:14b / LLaMA 3.1) — fully local, no API key needed
Network Analysis & PCAP Processing:
- tshark — streaming packet field extraction from 20 GB+ PCAPs
- nfstream — C-based flow engine for beaconing and exfiltration detection
- dpkt — lightweight packet parsing
- scapy — packet crafting and deep inspection
- pyshark — Python wrapper around tshark
Log Parsing & Data Processing:
- Custom streaming Zeek JSON parser (handles 30 GB ECS-wrapped logs)
- Custom streaming Suricata alert parser (600K+ alerts, ET/ETPRO deduplication)
- pandas / numpy — data aggregation
- networkx — connection graph analysis
Threat Intelligence & Detection:
- yara-python — malware signature scanning
- VirusTotal API — hash and IP reputation lookups
- dnspython — DNS resolution and analysis
- ipwhois — IP geolocation and ASN lookups
- Built-in IOC lists (ThreatFox, Cobalt Strike, SocGholish, cryptominers)
Web Dashboard:
- Flask + Flask-SocketIO — real-time WebSocket communication
- Chart.js — protocol charts, top talkers, MITRE ATT&CK visualization
Report Generation:
- ReportLab — PDF with formatted sections, tables, and timelines
- Jinja2 — template-based rendering
- Markdown + JSON output for dashboard consumption
CLI & Utilities:
- click — command-line interface
- rich — terminal output with colors and progress bars
- pydantic — structured data validation
Why These Choices
Google Gemini — The 1M token context window was critical. With 600K+ alerts and 30 GB of logs, I needed a model that could hold substantial context. Gemini’s free tier made it practical for repeated runs. Groq was added for speed, Ollama for fully offline analysis where sensitive network data shouldn’t leave the machine.
tshark + nfstream over loading PCAPs into memory — 52 GB of PCAPs can’t fit in RAM. tshark streams packet fields with -c count limits for chunked extraction. nfstream computes flow-level features via its C engine in a streaming fashion — both run with constant memory regardless of input size.
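The chunked-extraction pattern is easy to sketch. The helper below is illustrative (function names and field choices are mine, not the project's actual code), but the tshark flags — `-r` to read a capture, `-T fields` with `-e` selectors for tab-separated field output, and `-c` to cap the packet count per chunk — are real:

```python
import subprocess

def tshark_cmd(pcap_path, fields, packet_limit=None):
    """Build a tshark command that streams selected fields as TSV."""
    cmd = ["tshark", "-r", pcap_path, "-T", "fields"]
    for field in fields:
        cmd += ["-e", field]
    if packet_limit:  # chunked extraction via -c
        cmd += ["-c", str(packet_limit)]
    return cmd

def stream_fields(pcap_path, fields, packet_limit=None):
    """Yield one record per packet without ever buffering the capture."""
    proc = subprocess.Popen(
        tshark_cmd(pcap_path, fields, packet_limit),
        stdout=subprocess.PIPE, text=True,
    )
    try:
        for line in proc.stdout:  # constant memory: one line at a time
            yield dict(zip(fields, line.rstrip("\n").split("\t")))
    finally:
        proc.stdout.close()
        proc.wait()
```

Because the consumer is a generator, a 52 GB capture costs the same memory as a 52 KB one — the OS pipe buffer is the only thing held in RAM.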
Custom parsers over off-the-shelf — The Zeek logs were 30 GB of ECS-wrapped JSON from Filebeat. Existing parsers would try to load the entire file. My streaming parsers process line-by-line, bucketing by severity, deduplicating rule pairs, and tracking per-source-IP stats — all with constant memory.
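The core of such a parser is small. This is a simplified sketch, not the project's parser — it assumes Filebeat nested the original Zeek JSON as a string under a `message` field, which varies by pipeline:

```python
import json
from collections import Counter, defaultdict

def stream_zeek_conn(path):
    """Stream ECS-wrapped Zeek conn records line-by-line (constant memory).

    Assumes Filebeat put the original Zeek JSON in the `message` field;
    adjust the unwrapping for your pipeline.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                doc = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated/corrupt lines rather than abort
            record = doc.get("message")
            if isinstance(record, str):  # Zeek JSON nested as a string
                try:
                    record = json.loads(record)
                except json.JSONDecodeError:
                    continue
            if isinstance(record, dict):
                yield record

def per_source_stats(records):
    """Aggregate per-source-IP connection and byte counts on the fly."""
    stats = defaultdict(Counter)
    for r in records:
        src = r.get("id.orig_h")
        if src:
            stats[src]["conns"] += 1
            stats[src]["bytes"] += int(r.get("orig_bytes") or 0)
    return stats
```

Nothing is ever held beyond one line and the running counters, so the 30 GB input is irrelevant to the memory footprint.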
YARA + embedded IOCs — For instant matching against known-bad infrastructure. The agent can flag Cobalt Strike, SocGholish, and cryptominer indicators without any external API call.
Flask + SocketIO — The analysis takes 30–60 minutes. The real-time dashboard streams agent logs, findings, and visualizations as they’re produced so you don’t stare at a blank screen.
System Architecture
```mermaid
graph TB
    subgraph Input["Input Data"]
        PCAP["PCAPs<br/>52 GB"]
        Zeek["Zeek Logs<br/>30 GB"]
        Suricata["Suricata Alerts<br/>600K+"]
    end
    subgraph StreamingLayer["Streaming Parsers (Constant Memory)"]
        TS["tshark<br/>Chunked Extraction"]
        NF["nfstream<br/>C Flow Engine"]
        ZP["Custom Zeek<br/>JSON Parser"]
        SP["Custom Suricata<br/>Alert Parser"]
    end
    subgraph Agent["LLM Agent (Tool-Calling)"]
        LLM["Gemini / Groq / Ollama"]
        Protocol["13-Step Protocol<br/>Mandatory Tool Calls"]
        Guardrails["Hallucination<br/>Guardrails"]
        Checkpoint["Checkpoint<br/>System"]
    end
    subgraph ThreatIntel["Threat Intelligence"]
        YARA["YARA Rules"]
        VT["VirusTotal API"]
        IOC["Built-in IOCs<br/>ThreatFox, Cobalt Strike"]
    end
    subgraph Output["Output"]
        Dashboard["Flask + SocketIO<br/>Real-time Dashboard"]
        PDF["ReportLab<br/>PDF Report"]
        JSON["Structured JSON<br/>Findings"]
        MITRE["MITRE ATT&CK<br/>Mapping"]
    end
    PCAP --> TS & NF
    Zeek --> ZP
    Suricata --> SP
    TS & NF & ZP & SP --> Agent
    Agent --> ThreatIntel
    ThreatIntel --> Agent
    Agent --> Output
```
How the Agent Works
The agent follows a mandatory 13-step investigation protocol. It cannot finish until all required tools are called — this prevents the LLM from jumping to conclusions.
1. Parse Suricata alert summary (severity breakdown, top rules)
2. Extract C2 IOCs from Suricata alerts
3. Identify affected hosts
4. Deep-dive Suricata alerts (Cobalt Strike, exploits, credentials, DGA, cryptominers)
5. Analyze Zeek DCE/RPC for lateral movement
6. Parse Zeek connections
7. DNS analysis (beaconing detection, DGA campaigns)
8. HTTP analysis
9. TLS/SSL analysis
10. File transfer analysis
11. Generate MITRE ATT&CK mapping
12. Save structured JSON findings
13. Generate PDF report
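The enforcement idea boils down to bookkeeping: track which tools have run and refuse to accept "investigation complete" until the required set is covered. A minimal sketch — the tool names here are illustrative placeholders, not the project's actual registry:

```python
REQUIRED_TOOLS = [
    "suricata_summary", "extract_c2_iocs", "identify_hosts",
    "suricata_deep_dive", "zeek_dcerpc", "zeek_connections",
    "dns_analysis", "http_analysis", "tls_analysis",
    "file_transfer_analysis", "mitre_mapping", "save_findings",
    "generate_pdf",
]  # one hypothetical tool per protocol step

class ProtocolEnforcer:
    """Block 'investigation complete' until every required tool has run."""

    def __init__(self, required=REQUIRED_TOOLS):
        self.required = set(required)
        self.called = set()

    def record(self, tool_name):
        self.called.add(tool_name)

    def missing(self):
        return sorted(self.required - self.called)

    def may_finish(self):
        return not self.missing()

    def nudge(self):
        """Message injected back into the conversation when the model
        tries to stop early."""
        return ("Investigation is NOT complete. You must still call: "
                + ", ".join(self.missing()))
```

When the LLM emits a final answer prematurely, the agent loop discards it and feeds `nudge()` back as the next user turn, steering the model to the unfinished steps.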
After every tool call, a checkpoint is saved — so if the agent hits a rate limit or crashes, it resumes without re-running the entire analysis.
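The checkpoint mechanics can be sketched like this (the file layout is illustrative — a single JSON object of completed steps and their outputs — not necessarily what the project writes):

```python
import json
import os

class Checkpoint:
    """Persist progress after every tool call so a crashed or
    rate-limited run resumes instead of restarting from scratch."""

    def __init__(self, path):
        self.path = path
        self.state = {"completed": [], "outputs": {}}
        if os.path.exists(path):  # resume from a previous run
            with open(path) as fh:
                self.state = json.load(fh)

    def done(self, step):
        return step in self.state["completed"]

    def save(self, step, output):
        self.state["completed"].append(step)
        self.state["outputs"][step] = output
        tmp = self.path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump(self.state, fh)
        os.replace(tmp, self.path)  # atomic swap: never leave a torn file
```

On restart, the agent loop checks `done(step)` before each tool call and replays the cached output instead of re-running the tool.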
Investigation Protocol Flow
```mermaid
graph TD
    S1["1. Parse Suricata<br/>Alert Summary"] --> S2["2. Extract C2<br/>IOCs"]
    S2 --> S3["3. Identify<br/>Affected Hosts"]
    S3 --> S4["4. Deep-Dive Alerts<br/>Cobalt Strike, DGA, Exploits"]
    S4 --> S5["5. Analyze Zeek<br/>DCE/RPC"]
    S5 --> S6["6. Parse Zeek<br/>Connections"]
    S6 --> S7["7. DNS Analysis<br/>Beaconing, DGA"]
    S7 --> S8["8. HTTP<br/>Analysis"]
    S8 --> S9["9. TLS/SSL<br/>Analysis"]
    S9 --> S10["10. File Transfer<br/>Analysis"]
    S10 --> S11["11. MITRE ATT&CK<br/>Mapping"]
    S11 --> S12["12. Save JSON<br/>Findings"]
    S12 --> S13["13. Generate<br/>PDF Report"]
    style S4 fill:#7f1d1d,stroke:#ef4444,color:#fca5a5
    style S7 fill:#7f1d1d,stroke:#ef4444,color:#fca5a5
    style S11 fill:#164e63,stroke:#22d3ee,color:#22d3ee
    style S13 fill:#14532d,stroke:#4ade80,color:#4ade80
```
Detection Highlights
On the test dataset (52 GB PCAPs, 600K alerts, 30 GB Zeek logs), the agent found:
- Cobalt Strike beacon at 10.128.239.57 with hourly callbacks to 179.60.146.34:3389
- .click DGA campaign — 197,684 NXDOMAIN queries across 287 unique domains from a single host
- TacticalRMM compromise — 27+ internal hosts querying icanhazip.tacticalrmm.io (45,482 queries)
- WebLogic exploits (CVE-2020-2551, CVE-2018-2893) targeting a domain controller
- Virut DGA + cryptominer on 10.128.239.98 connecting to herominers pools
- WPAD poisoning surface — 314,095 unanswered WPAD queries across 20+ hosts
- Active Directory enumeration — 22 workstations making 90–307 DRSCrackNames calls each (BloodHound-style recon)
Interesting Technical Challenges
Beaconing detection with coefficient of variation. Detecting C2 beacons means finding regular timing patterns in DNS queries. I track all query timestamps per (source IP, domain) pair, compute the inter-query intervals, then check if the coefficient of variation (std dev / mean) is below 0.05. A CV that low means the timing is almost perfectly regular — a dead giveaway for automated C2 callbacks. This flagged wallhaven.ufcfan.org at 78.1s ± 0.1s mean interval over 3 days.
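The whole check fits in a few lines. A minimal sketch of the idea (thresholds match the description above; the function name and minimum-query guard are mine):

```python
from statistics import mean, stdev

def is_beaconing(timestamps, cv_threshold=0.05, min_queries=10):
    """Flag near-perfectly regular query timing for one (src IP, domain) pair.

    timestamps: sorted epoch seconds, one per DNS query.
    """
    if len(timestamps) < min_queries:
        return False  # too few samples to call the timing "regular"
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    m = mean(intervals)
    if m == 0:
        return False
    cv = stdev(intervals) / m  # coefficient of variation
    return cv < cv_threshold
```

Human-driven traffic produces bursty intervals (CV well above 1); a sleep-loop beacon with light jitter sits near zero, so the 0.05 cutoff separates them cleanly.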
Suricata rule deduplication. Suricata’s ET and ETPRO rule sets often have mirrored rules for the same event (e.g., “ET Trojan.X” and “ETPRO Trojan.X”). Without deduplication, alert counts double. I wrote a normalizer that collapses these pairs so the severity counts and top-rule rankings reflect reality.
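The normalizer can be sketched like this — the flat field names (`src_ip`, `dest_ip`, `timestamp`) are a simplification of Suricata's eve.json layout, and the dedup key is illustrative:

```python
import re
from collections import Counter

def normalize_sig(signature):
    """'ET Trojan.X' and 'ETPRO Trojan.X' describe the same event;
    strip the ruleset prefix so both map to one canonical key."""
    return re.sub(r"^ET(?:PRO)?\s+", "", signature.strip(), flags=re.IGNORECASE)

def dedup_alerts(alerts):
    """Drop mirrored ET/ETPRO firings for the same flow and timestamp."""
    seen = set()
    out = []
    for a in alerts:
        key = (normalize_sig(a["signature"]),
               a["src_ip"], a["dest_ip"], a["timestamp"])
        if key not in seen:
            seen.add(key)
            out.append(a)
    return out

def top_rules(alerts, n=10):
    """Rank signatures after deduplication, under canonical names."""
    return Counter(
        normalize_sig(a["signature"]) for a in dedup_alerts(alerts)
    ).most_common(n)
```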
Hallucination prevention. LLMs hallucinate IPs, invent alerts, and claim certainty where none exists. The guardrails layer extracts all IPs from LLM output and cross-references them against raw evidence. It also flags overconfident language (“definitely”, “confirmed attack”) and downgrades findings below 0.6 confidence from critical/high to medium. The findings pipeline reconstructs the final report directly from raw tool outputs, so even if the LLM’s context window truncates data, the report stays accurate.
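The cross-referencing step is mechanical once phrased as set membership. A simplified sketch — field names, the phrase list, and the flag structure are illustrative, not the project's exact schema:

```python
import re

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
OVERCONFIDENT = ("definitely", "confirmed attack", "certainly", "proven")

def check_finding(finding, evidence_ips):
    """Cross-reference an LLM finding against raw evidence.

    finding: dict with 'summary', 'confidence' (0-1), 'severity'.
    evidence_ips: set of every IP actually seen in tool output.
    """
    text = finding["summary"]
    flags = {}
    # Any IP the LLM mentions must exist in the raw evidence.
    unverified = [ip for ip in IP_RE.findall(text) if ip not in evidence_ips]
    if unverified:
        flags["unverified_ips"] = unverified
    # Flag certainty language a forensic report should never use.
    if any(p in text.lower() for p in OVERCONFIDENT):
        flags["overconfident_language"] = True
    if flags:
        finding["flags"] = flags
    # Low-confidence findings can't stay critical/high.
    if (finding.get("confidence", 1.0) < 0.6
            and finding.get("severity") in ("critical", "high")):
        finding["severity"] = "medium"  # downgrade, don't discard
    return finding
```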
Memory management at scale. 52 GB of PCAPs can’t fit in memory. Every tool in the pipeline streams: tshark extracts fields in chunks, nfstream processes flows via its C engine, Zeek logs are parsed line-by-line, Suricata alerts are bucketed on-the-fly. The entire analysis runs with constant memory regardless of input size.
LLM context window limits. Even with Gemini’s 1M tokens, 600K alerts don’t fit. The solution: specialized tools pre-aggregate and summarize before the LLM sees anything. The agent gets statistics, top-N lists, and flagged anomalies — not raw data. The full evidence is preserved separately for report generation.
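The shape of that pre-aggregation looks roughly like this (the digest fields are illustrative; the real tools produce richer summaries):

```python
from collections import Counter

def summarize_for_llm(alerts, top_n=20):
    """Reduce hundreds of thousands of alerts to a prompt-sized digest:
    totals, severity breakdown, and top-N lists — never raw rows."""
    severity, rules, sources = Counter(), Counter(), Counter()
    total = 0
    for a in alerts:  # alerts may be a generator — single streaming pass
        total += 1
        severity[a.get("severity", "unknown")] += 1
        rules[a["signature"]] += 1
        sources[a["src_ip"]] += 1
    return {
        "total_alerts": total,
        "by_severity": dict(severity),
        "top_rules": rules.most_common(top_n),
        "top_sources": sources.most_common(top_n),
    }
```

The digest is what goes into the prompt; the full alert stream is written to disk untouched, which is what lets the findings pipeline rebuild the report from raw evidence later.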
What I Learned
Building an agentic system for security forensics taught me that the hard part isn’t the LLM — it’s the tooling around it. The parsers, streaming architecture, guardrails, and evidence pipeline are where the real engineering lives. The LLM is just the orchestration layer that decides what to look at next.
The mandatory tool-call protocol was essential. Without it, the agent would run two or three tools and declare “investigation complete.” Forcing it through all 13 steps consistently produced findings it would have otherwise missed.
The code is available on GitHub.