It's impossible not to notice that we live in an age of technological
wonders, stretching back to the primitive hominids who dared to ask "Why?"
but also continually accelerating and pulling everything apart as it does,
in exactly the same manner as the Universe at large. It is why all the
hackers you know are invested so heavily in Deep Learning right now, as if
someone got on a megaphone at Chaos Communication Camp and said "ACHTUNG.
DID YOU KNOW THAT YOU CAN CALCULATE UNIVERSAL TRUTHS BACKWARDS ACROSS A
MATRIX USING BASIC CALCULUS. VIELEN DANK UND AUF WIEDERSEHEN!".
Hackers are nothing if not obsessed with the theoretical edges of
computation, the way a 1939 Niels Bohr was obsessed with the boundary
between physics and philosophy, and with how far you could push that line
using math, such that you could find complementary pairs of truths lying
about everywhere. And of course one aspect of his work is that, almost a
century later, you can deter an adversary from attacking you by threatening
to end the world, and another aspect is that you can study the stars. As a
bureaucrat you would call this "dual-use", although governments tend to
have a heck of a lot of use for the former and almost no use at all for the
latter.
All of which is to say that while NO HAT <http://nohat.it> is a very good
conference, with a lot of great talks, the one I have enjoyed the most so
far is "LLMs FOR VULNERABILITY DISCOVERY: FINDING 0DAYS AT SCALE WITH A
CLICK OF A BUTTON <https://www.youtube.com/watch?v=Z5LMRS3AF1k>" (by
Marcello Salvati and Dan McInerney). This talk goes over the lessons they
learned developing Vulnhuntr <https://github.com/protectai/vulnhuntr>,
which finds real vulns in Python apps given just the source directory, and
those lessons are roughly as follows:
1. Focus on Specific Tasks and Structured Outputs: LLMs can be
unreliable when given open-ended or overly broad tasks. To mitigate
hallucinations and ensure accurate results, it's crucial to provide highly
specific instructions and enforce structured outputs (aka, the naive
metrics people are providing are probably not useful). There's a rough
sketch of what this looks like below the list.
2. Manage Context Windows Effectively: While larger context windows are
beneficial, strategically feeding the LLM code in smaller, relevant
chunks, like focusing on the call chain from user input to output, is key
to maximizing efficiency and accuracy. They did a great job here, and this
matters even if you have a huge context window to play with (e.g.,
Gemini). The chunking idea is also sketched below the list.
3. Leverage Existing Libraries for Code Parsing: Dynamically typed
languages like Python present unique challenges for static analysis.
Using a library like Jedi, which was built for Python autocompletion,
can significantly streamline identifying and retrieving the relevant code
segments for the LLM (there's a small Jedi sketch below the list). They
themselves recommend a rewrite using tree-sitter to handle C/C++, although
I would probably have used an IDE plugin to handle Python (which would
also give you debugger access).
4. Prompt Engineering is Essential: The way you structure prompts has a
huge impact on the LLM's performance. Clear instructions, well-defined
tasks, and even the use of XML tags for clarity can make a significant
difference in the LLM's ability to find vulnerabilities (an example of
that tag scaffolding is below the list). But of course, in my experience,
the better (really, larger) your LLM, the less prompt engineering matters.
And when you are testing against multiple LLMs, you don't want to
introduce prompt engineering as a variable.
5. Bypass Identification and Multi-Step Vulnerability Analysis: LLMs can
be remarkably effective at identifying security bypass techniques and
understanding complex, multi-step vulnerabilities that traditional static
code analyzers might miss. There's a ton of future work to be done in
analyzing how this happens and where the boundaries are.
6. Avoid Over-Reliance on Fine-Tuning and RAG: While seemingly
promising, fine-tuning LLMs on vulnerability datasets can lead to
oversensitivity and an abundance of false positives. Similarly, retrieval
augmented generation (RAG) may not be precise enough to pinpoint the
specific code snippets required for comprehensive analysis. Knowing that
everyone is having problems with these techniques is actually useful,
because it goes against the common understanding of how you would build
something like this.
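To make lesson 1 concrete, here is a minimal sketch (in Python, not taken
from Vulnhuntr) of what "one narrow task plus a structured output" can look
like. The VulnReport model and the ask_llm callable are my own inventions
for illustration; the point is just that the model gets one specific
question and its answer is validated against a schema instead of being
trusted as free text.

    # Hypothetical sketch: ask one narrow question, validate the answer
    # against a schema instead of accepting free-form prose.
    from pydantic import BaseModel, Field, ValidationError

    class VulnReport(BaseModel):
        vulnerability_class: str = Field(description="e.g. SQLI, SSRF, LFI, RCE")
        confidence: int = Field(ge=0, le=10)
        vulnerable_function: str
        proof_of_concept: str
        analysis: str

    PROMPT = """Analyze ONLY the function below for server-side request forgery.
    Respond with a single JSON object matching this schema:
    {schema}

    <code>
    {code}
    </code>"""

    def analyze_function(code: str, ask_llm) -> VulnReport | None:
        """ask_llm is a stand-in for whatever chat-completion call you use."""
        raw = ask_llm(PROMPT.format(schema=VulnReport.model_json_schema(), code=code))
        try:
            return VulnReport.model_validate_json(raw)
        except ValidationError:
            # A malformed answer gets rejected rather than silently accepted;
            # in practice you would retry with the validation error appended.
            return None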
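For lesson 2, here's the shape of the chunking idea, again only a sketch
under my own assumptions: Vulnhuntr itself builds the chain from user input
to output (the talk covers how), whereas the call_chain list and the
read_function helper here are hypothetical stand-ins.

    # Sketch: instead of dumping the whole repo into the context window,
    # build the prompt from only the functions on the path from a
    # user-controlled input (e.g. a web route) toward the final sink.
    def build_call_chain_prompt(call_chain: list[str], read_function) -> str:
        """call_chain: fully-qualified function names from source to sink,
        e.g. ["app.routes.upload", "app.storage.save", "app.utils.run_cmd"].
        read_function: callable returning the source text of a named function."""
        parts = [
            "Trace user input through the functions below, in order, and decide "
            "whether attacker-controlled data reaches the final call.\n"
        ]
        for name in call_chain:
            parts.append(f'<function name="{name}">\n{read_function(name)}\n</function>')
        return "\n".join(parts)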
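For lesson 3, the Jedi calls below (jedi.Project, jedi.Script, goto) are
the real library API, but the way I'm stringing them together is only an
assumption about how you might resolve a symbol and pull back its
definition for the LLM; it is not lifted from Vulnhuntr.

    # Sketch: use Jedi, a Python autocompletion/static-analysis library,
    # to resolve a symbol and find where it is defined.
    import jedi

    def resolve_symbol(project_root: str, file_path: str, line: int, column: int):
        """Return (module_path, line, code line) for the definition of the
        symbol at (line, column), so its source can be fed to the LLM."""
        project = jedi.Project(project_root)
        with open(file_path) as f:
            script = jedi.Script(code=f.read(), path=file_path, project=project)
        definitions = script.goto(line, column, follow_imports=True)
        if not definitions:
            return None
        d = definitions[0]
        # get_line_code() returns only the definition's first line; to hand
        # the LLM the whole function you would slice the module around d.line.
        return d.module_path, d.line, d.get_line_code()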
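And for lesson 4, the exact wording below is mine rather than theirs; it
just illustrates the kind of XML-tag scaffolding they describe, where
instructions, code under analysis, extra context, and the expected output
format each live in their own tag so the model can't confuse one for
another.

    # Sketch of an XML-tagged prompt skeleton: each input section is wrapped
    # in its own tag so instructions, code, and output format stay separable.
    PROMPT_TEMPLATE = """<instructions>
    You are auditing Python code for {vuln_class} vulnerabilities.
    Only report a finding if user-controlled data can reach the sink.
    </instructions>

    <code>
    {code}
    </code>

    <context>
    {call_chain_context}
    </context>

    <response_format>
    Reply with JSON: {{"confidence": 0-10, "poc": "...", "analysis": "..."}}
    </response_format>"""

    def render_prompt(vuln_class: str, code: str, call_chain_context: str) -> str:
        return PROMPT_TEMPLATE.format(
            vuln_class=vuln_class, code=code, call_chain_context=call_chain_context
        )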
At its core, vulnerability discovery is as much about understanding as it
is about finding flaws. To find a vulnerability, one has to unravel the
code, decipher its intent, and even uncover the assumptions of its creator.
This process mirrors the deeper question of what it means to truly
understand code: seeing beyond syntax and function to grasp the logic,
intention, and potential points of failure embedded within. Like Bohr's
exploration of
complementary truths in physics, understanding code vulnerabilities
requires seeing both what the code does and what it could do under
different conditions. In this way, the act of discovering vulnerabilities
is itself a study in comprehension, one that goes beyond detection to touch
on the very nature of insight.
-dave