Autonomous fuzzing process under LLM supervision

The CCN project is co-financed by the European Regional Development Fund and the State Budget under the European Funds for Digital Development Programme 2021-2027.

Fuzzing is an automated software testing technique that feeds a program random or deliberately malformed input data in order to detect bugs and security vulnerabilities. For years it has remained one of the most effective ways of finding them, but it has one drawback - it requires many hours of preparation before each run.

At the NASK Cybersecurity Center, we are building a system that performs this work on its own: from code analysis, through test generation, all the way to classifying findings and preparing a ready-to-submit report for the code author. The first campaigns have already helped identify real vulnerabilities in widely used open-source software.

Fuzzing under LLM supervision

Preparing a fuzzing campaign follows a similar pattern in every project. For each new piece of software being tested, you need to:

read the documentation and understand how to correctly call the API functions,
write a short test program that feeds them random data,
configure tools that detect memory errors,
after a crash is found - analyze the trace, check whether it is a new vulnerability or a duplicate, and write it up in a report for the author.

These are repetitive, formulaic tasks, exactly the kind that a large language model (LLM) can now perform faster, and often more effectively, than a human. On one condition: it must be guided through a clear procedure, rather than left to improvise (or, possibly, hallucinate).

fuzzlab is a research project developed within the Fuzzing and Malware Analysis Laboratory (FUMAL). It consists of four cooperating Python modules, a library of nearly three thousand test programs covering libraries written in C, C++, Python, and Go, and a database storing the results and metadata of all campaigns to date. All the decision logic - when to retrain the model, when to invoke the LLM, when to end a campaign - runs locally in Python.

In this architecture, the LLM plays two specialized roles. First, it operates on specific segments of the process: it pre-filters data, generates test programs, and classifies the crashes that are found. Second, it acts as a supervisor of the entire pipeline - it observes whether individual stages behave normally and detects anomalies (for example, a sudden drop in code coverage, unusual patterns in the logs, recurring build errors in the tested project, or a misconfiguration blocking access to specific execution paths). It then either tries to fix the problem itself or proposes an improvement to the procedure. As a result, tests can run uninterrupted, without waiting for human intervention. This is still not a general-purpose agent that decides for itself what to do next - it is more of a specialized operator working within clearly defined boundaries, and it is precisely those boundaries that make it reliable.

The solution has been designed to be independent of the AI model provider - any LLM (local or cloud-based) or external agent can take over process control. Communication takes place through a standardized interface with structured input and output data, so swapping the model or connecting a new agent does not require any changes to the process flow itself.

The first pilot campaigns confirm that this approach works. Real vulnerabilities were found, among others, in ModSecurity (the WAF rules engine used with Apache, nginx, and Microsoft IIS) and in Oracle VirtualBox. We describe both cases in detail later in the article. It is worth noting that fuzzlab is currently at the proof-of-concept stage - a research project in which we are testing which elements of this approach work in practice. The architecture, the choice of predictive signals, and the way fuzzlab integrates with the LLM and agents may all still change substantially. This article describes the state as of the date of publication.

The test pipeline "in a nutshell"

Below is a bird's-eye view of a single campaign - from the operator's command to the final report.

Start. The operator types a single command in the terminal, providing the project name and the path to the source code, and optionally selecting one of the predefined configuration profiles.
Preflight - pre-launch checklist. Before the system engages any of the fuzzers, it verifies the runtime environment: whether the project name is correct, whether the source directory exists, whether the necessary components are installed, whether the database has the proper structure, whether there is enough disk space, and whether a previous failed run left any hint about the cause of the failure. Any critical error stops the campaign before launch. Warnings are reported but do not block startup.
Test preparation. A phase launched at the start of the campaign. First, static analyzers go through all the project's functions and compute a set of features for each one: how densely it uses dangerous memory operations, how complex its structure is, which other functions it is adjacent to, how often it has been changed in the project's recent history, and what other security tools say about it. Next, the machine learning model is either loaded from disk (if it already exists) or trained from scratch on fresh data. The LLM can also be used to generate missing test programs for functions assessed as the highest risk.
Working cycle - the heart of the campaign. A campaign consists of ten cycles (the default value), each of which proceeds through the same sequence of phases described below.
Campaign productivity safeguards. After each cycle, four independent conditions decide whether it makes sense to continue the campaign. They check whether enough crashes have been found, whether the function ranking has stabilized, whether code coverage has already saturated, and whether recent cycles included enough "productive" runs. An unmet condition ends the campaign - and the reason is explicitly recorded in the report. This guards against wasting time on something that no longer yields results.
Post-mortem. If the campaign ends with a weak result - no new crashes, many errors, or only a small number of productive cycles - the analyzer classifies the outcome into one of four categories and saves a hint for the next run. The next preflight stage reads this hint and warns the operator before the system starts again. This is the mechanism for learning from the missteps of previous campaigns.
Final report. After the campaign ends, the operator receives a summary containing the number of unique crashes with their CWE classification and error category, the decisions of all early-stopping conditions, the change in code coverage compared with the previous run, telemetry of LLM usage broken down by integration type, and a list of changes in the test set.
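The productivity safeguards described above can be pictured as independent gate functions evaluated after each cycle. The sketch below is only an illustration under stated assumptions: the `CycleStats` fields, the gate names, the thresholds, and the direction of each check (for example, that a stabilized ranking means "stop") are hypothetical and are not taken from fuzzlab itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class CycleStats:
    recent_new_crashes: int   # unique crashes in the recent window of cycles
    ranking_change: float     # share of top-ranked functions that changed (0..1)
    coverage_gain: float      # relative coverage growth in the last cycle (0..1)
    productive_recent: int    # productive cycles in the recent window


# Hypothetical gates: each returns True while continuing still looks worthwhile.
GATES: Dict[str, Callable[[CycleStats], bool]] = {
    "crash_yield": lambda s: s.recent_new_crashes >= 1,
    "ranking_stability": lambda s: s.ranking_change > 0.01,
    "coverage_saturation": lambda s: s.coverage_gain > 0.001,
    "productive_cycles": lambda s: s.productive_recent >= 1,
}


def first_unmet_gate(stats: CycleStats) -> Optional[str]:
    """Return the name of the first unmet gate, or None if the campaign continues."""
    for name, still_worthwhile in GATES.items():
        if not still_worthwhile(stats):
            return name
    return None
```

Returning the gate name rather than a bare boolean mirrors the requirement that the reason for ending a campaign is recorded explicitly in the report.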

Phases of the working cycle

Each of the ten cycles goes through the same fixed sequence of nine phases.

Fuzzing. The selected engine (libFuzzer, AFL++, honggfuzz, or Centipede) runs for a fixed period, triggering crashes and producing a map of the code paths it has visited. A rotational resource-allocation algorithm, modeled on the classic multi-armed bandit problem, decides which harnesses receive a larger budget in the next cycle. The decision is based on their previous results: programs that consistently discover new paths in the code get more time; those that have "run out of breath" get less.
Coverage measurement. The system records how code coverage grew over time and estimates how many more unvisited paths could theoretically be discovered under the current configuration. This makes it possible to answer the question: is it worth digging further here, or has nearly everything findable already been found?
Basic crash triage. For each new crash, the system does four things. First, it normalizes the error trace to a unified signature and checks whether it has already seen an identical one. Then the LLM evaluates whether this is a real bug in the library or merely an artifact of a flawed test program. Next, it groups crashes that share the same underlying root cause. Finally, it generates a report for submission to the software's author.
Test program rotation. Test programs that have stopped discovering new paths in the code receive a lower priority in the next cycle. The system may also change the configuration of error detectors if the current one has hit a dead end.
Model retraining. The machine learning model is trained on fresh data. Functions that caused a crash despite the model having previously rated them as safe are given increased weight - this forces the model not to repeat the same mistake in the next iteration.
Reorganization of the test file set. Duplicates are removed, and a three-tier hot / warm / cold corpus system eliminates old and unhelpful samples.
Deep exploration of new targets. For functions rated as the highest risk, the system runs an additional pass with a fuzzer targeting specific code fragments. This is the phase in which the machine learning model's decisions actually translate into where CPU time is allocated.
Solver for structured data. The LLM generates values that both satisfy the grammar rules of a given format and steer the program into the desired paths. It is only enabled in late cycles, once classic random mutation stops producing results.
Cycle report. Measurements are written to the database, an event stream flows to the operator in real time, and all decisions concerning the continuation of the campaign are recorded for later review.
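To make the budget allocation from the fuzzing phase more concrete, here is a minimal sketch that splits a cycle's time budget in proportion to UCB1 scores. The text does not say which bandit variant fuzzlab uses, so UCB1 is a stand-in; the reward definition (a normalized "new paths per cycle" figure in the range 0 to 1), the minimum share, and the function name are assumptions. Classic UCB1 picks a single best arm, whereas this sketch distributes time proportionally.

```python
import math
from typing import Dict, Tuple


def allocate_budget(
    history: Dict[str, Tuple[int, float]],
    total_seconds: float,
    exploration: float = 1.0,
    floor_share: float = 0.05,
) -> Dict[str, float]:
    """Split `total_seconds` across harnesses in proportion to their UCB1 score.

    `history` maps harness name -> (cycles_run, mean_reward). Every harness
    is assumed to have run at least once (cycles_run >= 1).
    """
    reserved = floor_share * len(history)
    if reserved >= 1.0:
        raise ValueError("floor_share is too large for this many harnesses")

    total_runs = sum(runs for runs, _ in history.values())
    scores = {
        # Mean reward plus an exploration bonus that shrinks with more runs.
        name: mean + exploration * math.sqrt(math.log(total_runs) / runs)
        for name, (runs, mean) in history.items()
    }
    score_sum = sum(scores.values())
    if score_sum == 0:
        # Nothing to distinguish the harnesses: fall back to an equal split.
        return {name: total_seconds / len(history) for name in history}

    # Every harness keeps a minimum share so that none is starved completely.
    return {
        name: total_seconds * (floor_share + (1.0 - reserved) * score / score_sum)
        for name, score in scores.items()
    }
```

With `allocate_budget({"parser": (5, 0.8), "codec": (5, 0.1)}, 600.0)`, the harness with the better track record receives the larger part of the ten minutes, while the weaker one still keeps at least its floor share.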

Fig. 1 Diagram of the test process flow.

High-level architecture

Four modules implement the successive stages of the process - each responsible for its own slice of work (test set management, function analysis and ranking, test program generation, and supervision). They operate independently but exchange information through a shared data layer. The files themselves (crash-inducing inputs and test corpora) are kept separately on disk, where each file is identified by the SHA256 hash of its content. The exchange layer holds only metadata that points to where the original file is stored.

One-way data flow

Data passes through the system sequentially, module by module:

kura (Japanese: storehouse) manages the entire set of test files, which today numbers around 35 million entries. It indexes files on the fly and removes duplicates based on a cryptographic hash of their content (two files differing by even a single byte are treated as separate entries, while identical copies of the same material are merged into a single entry with a reference counter). Each project has its own pool of test files selected for code coverage, aimed at maximizing the chance of discovering new execution paths while keeping file sizes as small as possible.
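The deduplication rule described above (identical content merges into one entry with a reference counter, a single differing byte yields a separate entry) can be sketched with a content-addressed index. This is an in-memory illustration only; the `CorpusIndex` class is hypothetical, and the real module works against a database and on-disk storage.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CorpusIndex:
    """In-memory stand-in for a content-addressed corpus index."""

    refcounts: Dict[str, int] = field(default_factory=dict)

    def add(self, content: bytes) -> str:
        """Register a test file and return its SHA-256 identifier."""
        digest = hashlib.sha256(content).hexdigest()
        # Identical content maps to the same key, so only the counter grows.
        self.refcounts[digest] = self.refcounts.get(digest, 0) + 1
        return digest
```

Adding the same bytes twice leaves one entry with a counter of 2, while a one-byte change produces a new digest and therefore a new entry.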

The old, no-longer-useful test corpus is systematically retired through a three-tier elimination mechanism: fresh, active samples remain on the "hot" fast-access tier; those that no longer discover new paths drop to the "warm" tier; and after a longer period of inactivity they move to the archive, from which they can be restored if needed without weighing on day-to-day operations. With this structure in place, the growth of the set does not degrade search performance, and metadata (sample origin, source project, usage history, last cycle result) is available almost instantly.
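The three-tier retirement rule can be expressed as a simple function of how long a sample has gone without contributing a new path. The thresholds and the choice of "cycles since last new path" as the measure of inactivity are assumptions made for this sketch.

```python
def assign_tier(
    cycles_since_new_path: int,
    warm_after: int = 3,
    archive_after: int = 30,
) -> str:
    """Map a sample's inactivity (in cycles) to a storage tier."""
    if cycles_since_new_path < warm_after:
        return "hot"    # fresh, active sample on the fast-access tier
    if cycles_since_new_path < archive_after:
        return "warm"   # no longer discovering paths, still readily available
    return "cold"       # archived, restorable on demand
```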

kiri (Japanese: blade) is the analytical layer of the pipeline and, at the same time, its most complex component. Its job is to answer one question: which functions in the analyzed code most likely contain security bugs? To answer it, kiri extracts several dozen signals from the source code using a hybrid approach - combining static analysis methods (examining code without running it: looking for known patterns of dangerous constructs, inspecting graph structure) with dynamic ones (observing program behavior during execution and gathering code coverage profiles).

In addition, there are contextual signals: version control history (who has changed a given function and how often, whether it contains recent changes) and data from external sources such as Fuzz Introspector¹. From these signals, kiri trains several machine learning models in parallel, whose votes are combined by a meta-model into a single, coherent ranking of functions, from most to least likely to contain bugs. This ranking is the key decision signal for the next module - it determines where the fuzzer's computational budget will go in the current cycle, and consequently where the pipeline has a real chance of finding a vulnerability and where it is wasting time.

kata (Japanese: pattern) automates what is traditionally written by a human after many hours of reading documentation and source code: it generates ready-to-run test programs (so-called harnesses) for libraries written in C, C++, Python, and Go. Based on the ranking received from kiri, the module identifies the functions most worth covering with fuzzing first, and then creates harnesses for them that follow the conventions of the particular project. Each project can have its own set of plugins describing how to properly call its API: what arguments are required, how to initialize input structures, in what context the function operates (e.g. after a prior call to a function that sets the appropriate fields in required structures), and what boundary values are worth testing.

As a result, the generated harnesses are not generic - they are tailored to how the functions of a given library are actually called, which significantly increases the effectiveness of fuzzing (a generic harness often "bounces off" the first assertion that requires properly initialized state, or produces false positives). In addition, kata uses the LLM as an iterative "proofreader": after generating an initial version of a harness, it compiles the harness, tries to run it under the fuzzer, and in the event of build errors or premature coverage saturation (when the fuzzer quickly stops discovering new paths) asks the model for a corrected version, with concrete feedback. Fuzzing thus targets precisely the code fragments flagged by the model as risky, rather than randomly bombarding the entire attack surface.
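The "proofreader" loop described above can be outlined as follows. This is a sketch, not kata's implementation: `build_and_run` and `ask_model` are hypothetical callables standing in for the compile-and-fuzz step and for the LLM call, and the round limit is an assumed parameter.

```python
from typing import Callable, Optional, Tuple


def refine_harness(
    source: str,
    build_and_run: Callable[[str], Tuple[bool, str]],
    ask_model: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Optional[str]:
    """Iteratively repair a harness until it builds and fuzzes acceptably.

    `build_and_run` returns (ok, feedback), where feedback carries build errors
    or a note about premature coverage saturation. `ask_model` receives the
    current source plus that feedback and returns a corrected version.
    """
    for attempt in range(max_rounds + 1):
        ok, feedback = build_and_run(source)
        if ok:
            return source
        if attempt < max_rounds:
            source = ask_model(source, feedback)
    return None  # give up after the allowed number of corrections
```

Bounding the number of rounds keeps a persistently failing harness from consuming model calls indefinitely, and every revision is checked before it is accepted.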

Nowhere in this process is there a "feedback loop" in the sense of writing results directly back into the model. Feedback only reaches the machine learning layer in the next cycle - when kiri once again extracts signals from the now-enriched set. As a result, nothing modifies the model "on the fly" in a way that would be hard to trace.

Fig. 2 Diagram of data flow between the kura, kiri, and kata modules.

Machine learning

The heart of the machine learning layer is an ensemble of three predictive models - XGBoost, LightGBM, and CatBoost - working together rather than separately. This is a committee-of-experts approach, in which each model has slightly different strengths: XGBoost copes well with numerical signals, LightGBM scales better when there are many features, and CatBoost natively handles categorical features such as function type and CWE error class.

The predictions of the three models feed into a second-level meta-model - a simple logistic regression that learns to optimally weight their votes. Two techniques are worth a brief explanation:

Out-of-fold stacking - the base models are trained in a 5-fold cross-validation scheme, so that the meta-model learns on predictions of models that had not previously seen the data in question. Without this discipline, the meta-model would overestimate the accuracy of the base models, and the function ranking would lead the fuzzer to the wrong places.
Platt calibration - the raw scores of the models make sense in ordinal terms (higher = more suspicious), but not in probabilistic terms. Calibration converts them into actual probabilities: instead of "function A is more suspicious than B", we can say "function A has a 73% probability of containing a bug".
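The out-of-fold discipline can be shown without any machine learning library. In the sketch below, `fit` is a hypothetical callable that trains a base model on the given rows and returns a prediction function; the interleaved fold assignment is a simplification (a real setup would typically shuffle or stratify the data).

```python
from typing import Callable, List, Sequence


def out_of_fold_predictions(
    X: Sequence,
    y: Sequence,
    fit: Callable[[List, List], Callable],
    n_folds: int = 5,
) -> List[float]:
    """Score every row with a model that was trained without that row."""
    n = len(X)
    oof = [0.0] * n
    for k in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == k]
        train = [i for i in range(n) if i % n_folds != k]
        # Train only on the other folds, then predict the held-out fold.
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            oof[i] = predict(X[i])
    return oof
```

The resulting list is what a meta-model would be trained on: each entry comes from a base model that never saw the corresponding row, so the meta-model cannot learn to trust memorized answers.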

The main evaluation metric is Precision@K for K = 10, 50, 100 - what share of the K functions ranked as the riskiest actually contain a bug. AUC² may be globally high, but it is the Precision@K metric that decides whether the ranking is suitable for steering the fuzzer's budget. Each trained model is saved together with its hyperparameters, the features used, and a hash of the training set, so that every prediction can be reproduced when investigating bug-discovery effectiveness.
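For reference, Precision@K can be computed in a few lines. The function name and the input layout (parallel lists of scores and 0/1 labels) are choices made for this sketch.

```python
from typing import Sequence


def precision_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Share of the k highest-scored functions that are truly buggy (label 1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]  # shorter than k if fewer functions are available
    return sum(label for _, label in top) / len(top)
```

For scores `[0.9, 0.8, 0.7, 0.1]` with labels `[1, 0, 1, 0]`, Precision@2 is 0.5: of the two highest-ranked functions, one really contains a bug.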

Where do the signals come from?

The features on which the models operate come from several dozen signals collected at three levels - from the simplest to the most complex:

Level 1: quick scans - simple searches of the code for known patterns, function by function.
Level 2: structural analyses - these require building a complete program call graph and comparing the results of several static analysis tools.
Level 3: external context - drawing on version control history (who changed a given function and when) and outside sources such as Fuzz Introspector.

Each of these signals has documented support in the scientific literature. The "probability of reaching a function" signal originates from Lee & Böhme (FSE 2023, "Statistical Reachability Analysis"), the estimate of code coverage saturation from the "jackknife" method described by Liyanage et al. (ICSE 2023, "Reachable Coverage"), and the identification of unsafe pointer operations from the work of Vital (TOSEM 2025, "MCTS-Guided Symbolic Execution Toward Unsafe Pointers").
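The idea behind a jackknife-style saturation estimate can be illustrated with the textbook first-order jackknife estimator from species-richness statistics: the estimated total equals the number of observed elements plus f1 * (n - 1) / n, where n is the number of sampling units and f1 is the number of elements seen in exactly one unit. This is a generic sketch and is not claimed to be the exact estimator used by Liyanage et al. or by fuzzlab; treating each fuzzing time window as a sampling unit and each coverage edge as a "species" is an assumption made here.

```python
from collections import Counter
from typing import Sequence, Set


def jackknife1_estimate(samples: Sequence[Set[str]]) -> float:
    """First-order jackknife estimate of the total number of reachable elements.

    `samples` holds one set of covered elements per sampling unit.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sampling unit is required")
    # In how many sampling units does each element occur?
    incidence = Counter(element for sample in samples for element in sample)
    observed = len(incidence)
    uniques = sum(1 for count in incidence.values() if count == 1)
    return observed + uniques * (n - 1) / n
```

When many elements have been seen only once, the estimate sits well above the observed count, which suggests that exploration has not saturated; when almost nothing is unique to a single unit, the estimate approaches what has already been covered.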

Telemetry

Every significant decision "made" by the pipeline leaves a trace, either in the database or in the event stream. Each cycle phase produces a structured event with a label drawn from a closed, predefined list. The operator (or AI agent) never has to guess what happened inside - the answer is always available in the appropriate format.

Layer zero: transactional record

A dedicated set of tables was created from the outset to store the pipeline's working data: static and dynamic signal values for each function, aggregated feature vectors feeding the model, output predictions, registered models together with their training metadata, the task queue with the status and execution time of each entry, metrics for each test program in each cycle, the trajectory of code coverage over time, sets of visited paths for each input (in compressed form), crash reports with deduplication by hash, results of preflight scans with severity levels, and feature stability reports. This is the foundational record of every campaign.
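Deduplicating crash reports by hash presupposes that the trace has first been reduced to a stable signature. A minimal sketch of that idea, assuming sanitizer-style frames of the form `#0 0x4f2a31 in function_name file:line`, is shown below; the regular expression, the number of frames kept, and the truncated digest length are illustrative choices, not fuzzlab's actual rules.

```python
import hashlib
import re

# Matches frames such as "#0 0x4f2a31 in parse_header /src/parse.c:42".
_FRAME = re.compile(r"^\s*#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)")


def crash_signature(trace: str, top_frames: int = 3) -> str:
    """Hash the function names of the top frames, ignoring addresses."""
    functions = []
    for line in trace.splitlines():
        match = _FRAME.match(line)
        if match:
            functions.append(match.group(1))
    key = "|".join(functions[:top_frames])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Two runs of the same crash usually differ in load addresses but not in the sequence of functions on the stack, so they collapse to the same signature, while a crash in a different function yields a different one.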

Fig. 3 Verification of machine learning model parameters.

Layer one: LLM cache and integration telemetry

This layer consists of telemetry from individual integrations. Each LLM-using feature has its own key prefix and its own counter in the final report. The list of these features is long - it includes, among others: test program generation, semantic review of the generated code, iterative correction, dictionary synthesis, triage of static analysis results, clustering of crashes and of the entire test set, hints about coverage gaps, escaping from stagnation in code coverage growth, diversification of test files, re-ranking of functions, generation of bug reports, assessment of harness condition, CWE classification, validation consensus, and feasibility checks. In total, several dozen categories - each with its own counter visible in the campaign report and its own version of the cache.

Layer two: real-time event stream

The system's logs are captured and transformed into events: cycle start, phase start, metric, error, decision (in detailed tracing mode), early stop, campaign end, cycle result (with productivity deltas), post-mortem hint, and test program metrics. The data can be consumed through three channels: richly formatted in the terminal, in JSONL (JSON Lines) format to a file, or on a web dashboard in the browser.

Layer three: decision trail

Once the appropriate flag is enabled, the process starts emitting, in each cycle, over a dozen additional events describing every decision: four during initialization, five concerning the main phases (when to enable deep exploration, when to launch the solver for structured data), five operational ones (how to divide resources among various modes, how to filter test files, how to rank functions), and three early-stopping conditions. Each event follows the same format, with three fields: gate name, condition, and result. An agent reading the stream sees not only what happened, but also why.
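A decision event with the three fields named above could be serialized as one JSON line per decision. The key names (`gate`, `condition`, `result`) and the example values are assumptions for illustration; the actual field names in fuzzlab's stream are not given in this article.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecisionEvent:
    gate: str        # which decision point was evaluated
    condition: str   # the condition that was checked
    result: str      # the outcome of the check


def to_jsonl(event: DecisionEvent) -> str:
    """Serialize one event as a single JSON Lines record (no trailing newline)."""
    return json.dumps(asdict(event), sort_keys=True)


# Hypothetical example of a decision about enabling deep exploration.
example = to_jsonl(DecisionEvent("deep_exploration", "cycle >= 3", "enabled"))
```

Because every event shares the same shape, an agent can filter the stream by gate name and read the condition next to the result, instead of parsing free-form log messages.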

Layer four: operational telemetry

The last layer is telemetry for day-to-day operations: each task has a status, phase, and timestamps; each crash has a stack signature, CWE class, and error category; each campaign ends with a report containing LLM integration counters, a list of cycles with productivity metrics, a dump of environment variables at startup, and a post-mortem hint if the campaign was unproductive.

Self-healing pipeline

The pipeline has been built on the assumption that "something can go wrong at every stage": the LLM may return a response that cannot be parsed; a fuzzer run may end with an error; model training may end with a worse result than the previous one, and so on. Below are some example mechanisms used for the process's self-diagnosis.

Detecting problems with harnesses

The pipeline continuously monitors the "health" of each test program - it tracks the rate of code coverage growth, execution time, and the moment at which the harness stops discovering new paths in the code. From these signals it detects typical problems: harnesses that crash immediately after starting (a compilation or initialization problem), harnesses that run but discover no paths (a broken wrapper), and harnesses that fall into stagnation prematurely - within the first thirty percent of the target execution time.
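The three failure patterns can be told apart with a handful of measurements per run. In the sketch below, the `HarnessRun` fields, the one-second startup grace period, and the label strings are hypothetical; only the thirty-percent threshold for premature stagnation comes from the description above. The function is meant to be called after a run has used up its time budget.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HarnessRun:
    seconds_alive: float               # how long the harness ran before exiting
    paths_found: int                   # new code paths discovered in the run
    last_new_path_at: Optional[float]  # seconds from start; None if none found
    target_seconds: float              # planned execution time


def diagnose(
    run: HarnessRun,
    startup_grace: float = 1.0,
    early_fraction: float = 0.30,
) -> str:
    """Classify a finished harness run into one of the described problem types."""
    if run.seconds_alive < startup_grace:
        return "startup_crash"          # compilation or initialization problem
    if run.paths_found == 0:
        return "no_coverage"            # runs, but the wrapper reaches nothing
    if (
        run.last_new_path_at is not None
        and run.last_new_path_at < early_fraction * run.target_seconds
    ):
        return "premature_stagnation"   # stopped discovering paths too early
    return "healthy"
```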

In each of these cases, an iterative fix is triggered. The LLM receives the problem context - the coverage trajectory, feedback from the error detectors, and, if needed, build error logs. It then rewrites the harness from scratch, eliminating the constraints that were blocking further exploration. The result is stored in the cache, and the next campaign initialization picks up the corrected test program.

Fig. 4 Quality statistics of harnesses used during testing.

The crash validator protects the pipeline from false positives

Crashes classified as unreachable are flagged in the database and skipped by all subsequent steps that use the LLM. Without this filter, the modules responsible for generating bug reports, ranking priorities, and proposing fixes would spend their time on defects in test programs rather than on real bugs.

The post-mortem report closes the loop of unproductive reruns

When a campaign ends without producing the expected results, the analyzer classifies its outcome into one of four categories (early failure, error dominance, no results, low performance) and saves a hint with a specific suggested action. The next preflight of the same campaign reads this hint and emits a warning before the process starts - the operator sees a message such as "campaign X previously ended with no results, gate configuration is missing" and decides whether to change the settings or to deliberately ignore the hint.
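The four outcome categories can be sketched as an ordered series of checks. The category names follow the text above, but the input measurements, the thresholds, and the order in which the checks are applied are assumptions for illustration.

```python
from typing import Optional


def classify_outcome(
    cycles_done: int,
    error_events: int,
    total_events: int,
    new_crashes: int,
    productive_cycles: int,
) -> Optional[str]:
    """Return a post-mortem category, or None if the campaign was productive."""
    if cycles_done == 0:
        return "early_failure"      # the campaign stopped before any full cycle
    if total_events > 0 and error_events / total_events > 0.5:
        return "error_dominance"    # most of what happened was errors
    if new_crashes == 0 and productive_cycles == 0:
        return "no_results"         # it ran, but found nothing at all
    if productive_cycles < cycles_done / 2:
        return "low_performance"    # some results, but mostly idle cycles
    return None
```

A preflight step could then look up the stored category for the same campaign and turn it into a warning for the operator before the next run starts.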

Protecting the model against regression

A model test runs in every cycle. Features with unstable predictions are automatically disabled during retraining. The model is retired not only when its accuracy drops, but also when the stability of its predictions deteriorates. Without this mechanism, the model ensemble would, over time, start learning superficial textual patterns instead of actual structural signals.

Interface for AI agents

From the start, fuzzlab has been designed for control by an AI agent - every entry point returns data that is both human-readable and easy to handle programmatically. CLI commands return results in a structured JSON format, a real-time event stream is available, and all typical campaign scenarios are accessible through named presets that the agent can list on demand. As a result, an agent can run a campaign from start to finish, basing its decisions on structured data rather than on interpreting textual messages.

Fig. 5 Example event stream generated for AI agents in graphical form.

Selected publicly disclosed findings

CVE-2026-35251 - Privilege escalation in VirtualBox Core and possible virtual machine escape

The vulnerability lies in the emulation of Intel's DMAR mechanism - an IOMMU component whose role is to prevent DMA from virtual devices from reaching host memory. The component that validates entries in the IOMMU context table should reject those with invalid values in the "translation type" field (outside the 0-2 range defined in the specification). In version 7.2.6, validation exists at the reporting level (a warning about an incorrect translation type is written to the log), but does not block the subsequent execution path. The code continues processing the entry despite the detected error. The guest kernel can exploit this flaw by writing a crafted entry with a disallowed translation-type value and a host memory address as the page-table base, and then performing DMA from a virtual device. The IOMMU emulator, despite its internal warning, treats the request as valid, which, in a successful attack, enables escape from the virtual machine. The patch was released in VirtualBox 7.2.8.

CVE-2026-42268 - Integer overflow in validation operators of ModSecurity v3

A vulnerability in three operators verifying sensitive identification numbers - @verifySSN, @verifyCPF, and @verifySVNR - in the ModSecurity v3 project allowed a remote attacker to halt the web server process with a single HTTP request containing an empty parameter. A simple query was enough - curl "http://TARGET/path?x=" - to make the WAF rule hang the entire worker and, as a result, stop serving web clients.

The most interesting aspect of this vulnerability, however, is not the bug itself, but the fact that an identical pattern occurs in three separate files - verify_ssn.cc, verify_cpf.cc, and verify_svnr.cc - which suggests copy-paste from one operator to the next, without being caught either by code review or by the tests. The very fact that this bug survived in the library until 2026 is itself an argument that even mature security projects benefit from systematic, automated testing.

Statistics from the last 21 days (21 April 2026 to 12 May 2026)

50 open-source projects tested continuously (38 with at least one crash)
16 million new corpus files
2,057 completed cycles
over 100,000 "raw" crashes
696 unique crashes (550 real, 141 false positives)
2,786 test programs in rotation
Machine learning model quality:
Per-project model - 16,805 training sessions, average AUC 0.981, half a million samples
Global model - 1,191 training sessions, average AUC 0.947, five million samples

Conclusion

"Classic" fuzzing is cheap; preparing for it, however, is resource-intensive. We built fuzzlab to investigate how much of that preparation can be automated, treating AI as a specialized tool working within clearly defined boundaries, rather than as a general-purpose agent. The architecture described is the state as of the date of publication. The project remains in the proof-of-concept phase, and many decisions may still change as further experiments are conducted.

In this approach, fuzzing ceases to be an activity in its own right and becomes one element of a broader, continuous process, alongside static analysis, risk prediction, classification of findings, and reporting. The value lies not in any single campaign, but in a loop that learns by itself, repairs itself, and evolves between iterations.

The results observed at the current stage suggest that such an approach - combining classic fuzzing techniques with a decision layer based on machine learning and specialized LLM integrations - can significantly shorten the cycle from identifying new security vulnerabilities to their documented disclosure, provided that the automation preserves the researcher's control, observability, and reproducibility of results.

¹ Fuzz Introspector
² AUC (Area Under the ROC Curve) - a popular measure of classifier quality, describing how well the model ranks examples: whether suspicious functions really receive higher scores than safe ones. A value of 0.5 means a random result, 1.0 - perfect. AUC describes the overall quality of the ranking, but says nothing about how well the model performs at the very top - and it is precisely the top that decides where the fuzzer's computational budget will go.