Automated Program Repair Using LLM
Integrating Coverage-Guided Fuzzing and LLM Reasoning for Automated Repair of Crash-Inducing Bugs
Motivation
By combining fuzzing to discover crashes, GDB to extract stack traces, and LLM reasoning with execution context, we create an execution-aware repair loop that significantly improves bug localization and repair accuracy.
Solution
A five-stage pipeline: (1) preprocess code to remove hints, (2) generate seed inputs via LLM, (3) fuzz with AFL to discover and minimize crashes, (4) extract stack traces with GDB, (5) iteratively repair with LLM using execution context. Crash deduplication and patch scoring optimize the process.
- Fuzzing & Crash Discovery – AFL mutates inputs to find crashes, then minimizes them to create focused test cases that capture exact failure conditions.
- Stack Trace Extraction – GDB extracts execution paths for each crash, with FNV-1a hashing to deduplicate and preserve only unique failure signatures.
- Execution-Aware Prompting – LLM receives sanitized code, minimized crash inputs, and stack traces (not just code), enabling precise bug localization.
- Iterative Repair Loop – Each patch is compiled and tested. The model refines repairs across iterations, with persistent memory of previous attempts.
- Patch Scoring – Ranks candidates by compilation success, crash elimination, test passing, and edit distance to select optimal fixes.
- 85% Success Rate – 11 of 13 programs fixed within 3 iterations
- Execution-Aware Prompting – Crash inputs + stack traces improve repair from 38% to 85%
- Iterative Repair Loop – Persistent memory across iterations refines patches
- Crash Deduplication – FNV-1a hashing preserves unique failure signatures
- Model-Agnostic – Works with various LLM backends
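The five stages above can be sketched as a single driver loop. This is a minimal illustration, not the project's actual API: `run_stage` is a hypothetical placeholder for the real AFL, GDB, and LLM integrations, and `MAX_ITERATIONS` reflects the 3-iteration budget reported in the results.

```python
MAX_ITERATIONS = 3  # matches the reported "fixed within 3 iterations" budget

def repair_pipeline(source: str, run_stage) -> tuple[str, bool]:
    """Drive the five stages until no crashes remain or the budget runs out.
    `run_stage(name, payload)` is a placeholder hook; the real pipeline
    invokes AFL, afl-tmin, GDB, and an LLM backend at each stage."""
    code = run_stage("preprocess", source)           # 1. strip hints
    seeds = run_stage("generate_seeds", code)        # 2. LLM-written seed inputs
    for attempt in range(1, MAX_ITERATIONS + 1):
        crashes = run_stage("fuzz", (code, seeds))   # 3. AFL + crash minimization
        if not crashes:
            return code, True                        # no crashes left: repaired
        traces = run_stage("extract_traces", crashes)        # 4. GDB + dedup
        code = run_stage("repair", (code, crashes, traces))  # 5. LLM patch attempt
    return code, False                               # budget exhausted
```

The loop terminates early as soon as fuzzing finds no crashes, which is what keeps the median attempt count low when the execution context is rich.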
Design Process
The pipeline integrates fuzzing, debugging, and LLM reasoning into a cohesive workflow. Key design decisions: preprocessing removes hints, the LLM generates seed inputs, AFL discovers and minimizes crashes, GDB extracts traces, and iterative LLM repair uses execution context. Evaluation compared three conditions: code-only (38% success), +traces (69%), and +traces+crashes (85%).
Preprocessing
Strips comments and annotations to force the LLM to reason from code structure and execution evidence alone. In later iterations, the model's own comments are retained as self-generated memory.
Preprocessing creates a clean baseline by removing hints, then builds memory by retaining the model's own comments across iterations.
- Removes human-written hints before first repair attempt
- Retains LLM-generated comments in subsequent cycles for memory
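The hint-removal step can be sketched as a small comment stripper. This is a simplified illustration (it does not guard against comment markers inside string literals); the function name `strip_hints` is ours, not the project's:

```python
import re

def strip_hints(source: str) -> str:
    """Remove /* */ and // comments so the model must reason from
    code structure and execution evidence alone."""
    no_block = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)  # block comments
    return re.sub(r"//[^\n]*", "", no_block)                      # line comments
```

In later iterations the same pass is skipped for model-generated comments, which are retained as self-generated memory.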
Seed Input Generation
The LLM writes Python scripts to generate syntactically valid seed inputs for AFL. This automates seed creation and ensures the fuzzer starts with diverse test cases that exercise different code paths.
Automates seed input generation, giving AFL high-quality starting points for efficient crash discovery.
- LLM generates Python scripts for seed input creation
- Identifies input method (stdin vs. file) for AFL configuration
- Populates AFL seed corpus automatically
Fuzzing & Crash Discovery
AFL performs coverage-guided fuzzing, mutating inputs to explore execution paths and discover crashes. When crashes occur, afl-tmin shrinks inputs while keeping them reproducible. These minimized inputs become focused test cases that show exactly what triggers the bug.
Fuzzing discovers crashes and minimizes inputs to create precise test cases that help the LLM understand and fix the bug.
- Coverage-guided mutation explores new execution paths
- Minimizes crash inputs to capture exact failure conditions
- Provides concrete examples for LLM repair
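The fuzzing and minimization steps boil down to two standard AFL invocations. A sketch of how the pipeline might assemble them (the helper names are ours; the flags are standard AFL usage, where `@@` marks where AFL substitutes the input file path and is omitted for stdin targets):

```python
def afl_fuzz_cmd(seeds: str, findings: str, target: str,
                 file_input: bool = True) -> list[str]:
    """Build the afl-fuzz argv: -i seed corpus, -o findings directory."""
    cmd = ["afl-fuzz", "-i", seeds, "-o", findings, "--", target]
    if file_input:
        cmd.append("@@")
    return cmd

def afl_tmin_cmd(crash: str, minimized: str, target: str,
                 file_input: bool = True) -> list[str]:
    """Build the afl-tmin argv to shrink a crash while keeping it reproducible."""
    cmd = ["afl-tmin", "-i", crash, "-o", minimized, "--", target]
    if file_input:
        cmd.append("@@")
    return cmd
```

The pipeline would hand these argv lists to `subprocess.run`; building them as lists avoids shell-quoting issues with crash file paths.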
Stack Trace Extraction
GDB extracts stack traces for each crash, showing the execution path that led to failure. FNV-1a hashing deduplicates crashes, preserving only the five most representative traces. This pinpoints the exact function and line where the bug occurs.
Stack traces provide precise fault localization, showing where crashes occur so the LLM can target repairs accurately.
- Extracts execution paths via GDB batch mode
- Deduplicates using FNV-1a hashing to preserve unique signatures
- Identifies exact function and line of failure
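The deduplication step can be sketched directly: hash each trace (as produced by a GDB batch run such as `gdb --batch -ex run -ex bt --args ./target input`) with FNV-1a and keep only the first trace per unique signature, capped at five. The function names are illustrative; the FNV-1a constants are the standard 64-bit ones:

```python
def fnv1a64(data: bytes) -> int:
    """64-bit FNV-1a hash, used to fingerprint stack traces."""
    h = 0xCBF29CE484B256A5          # FNV offset basis
    for b in data:
        h ^= b
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF  # FNV prime, mod 2^64
    return h

def dedupe_traces(traces: list[str], keep: int = 5) -> list[str]:
    """Keep at most `keep` traces with unique FNV-1a signatures."""
    seen, unique = set(), []
    for trace in traces:
        sig = fnv1a64(trace.encode())
        if sig not in seen:
            seen.add(sig)
            unique.append(trace)
        if len(unique) == keep:
            break
    return unique
```

Hashing whole traces means two crashes reaching the same faulty line through the same call chain collapse into one signature, so the prompt is not padded with redundant failures.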
LLM Repair
The LLM receives the sanitized code together with minimized crash inputs and stack traces, not code alone. It generates patches that are compiled and tested iteratively. Execution context (traces + crashes) improves success from 38% (code-only) to 85% (+traces+crashes) with fewer attempts.
Execution-aware prompting with concrete crash data enables precise bug localization and repair, achieving 85% success rate.
- Structured prompt includes code, crash inputs, and stack traces
- Iterative compilation and testing validates each patch
- 85% success rate with execution context vs 38% without
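The patch-scoring stage that ranks candidates can be sketched as below. The weights are illustrative assumptions, not the project's tuned values; only the ranking criteria (compilation, crash elimination, test passing, edit distance) come from the design:

```python
def score_patch(compiles: bool, crash_eliminated: bool,
                tests_passed: int, total_tests: int,
                edit_distance: int) -> float:
    """Rank a candidate patch: hard-gate on compilation, reward crash
    elimination and test passing, and prefer minimal edits as a tie-break.
    Weights here are illustrative placeholders."""
    if not compiles:
        return 0.0                          # non-compiling patches are rejected
    score = 1.0                             # base credit for compiling
    if crash_eliminated:
        score += 2.0                        # the primary repair objective
    score += tests_passed / max(total_tests, 1)
    score -= 0.01 * edit_distance           # smaller diffs win ties
    return score
```

Gating on compilation before any other criterion keeps unbuildable candidates out of the ranking entirely, and the edit-distance penalty biases selection toward patches that change as little of the original program as possible.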
Results
Tested on 13 crash-inducing programs. Results: code-only (38% success, 4 median attempts), +traces (69%, 3), +traces+crashes (85%, 2). Execution context halves the number of attempts and more than doubles the success rate.
Execution-aware prompting more than doubles the success rate (38% → 85%) and halves repair attempts, showing that execution context is key to automated repair.
- 13 programs tested: 10 synthetic + 3 from AFL demos
- Best: 85% success with 2 median attempts (+traces+crashes)
- Baseline: 38% success with 4 median attempts (code-only)
- 11 of 13 programs fixed within 3 iterations
Reflection
Outcomes
Execution-aware prompting more than doubles repair success (38% → 85%) and halves the attempts needed. Key insight: LLMs need concrete execution context (crashes + traces), not just code. The pipeline demonstrates that combining fuzzing, debugging, and LLM reasoning creates a practical solution for automated repair of runtime defects.
If I had more time
Future work: extend to multi-file projects, integrate symbolic execution for richer fault localization, evaluate on SARD/Juliet benchmarks, and explore parallel fuzzing for faster crash discovery.