It's impossible not to notice that we live in an age of technological wonders, stretching back to the primitive hominids who dared to ask "Why?" but also continually accelerating and pulling everything apart while it does, in the exact same manner as the Universe at large. It is why all the hackers you know are invested so heavily in Deep Learning right now, as if someone got on a megaphone at Chaos Communication Camp and said "ACHTUNG. DID YOU KNOW THAT YOU CAN CALCULATE UNIVERSAL TRUTHS BACKWARDS ACROSS A MATRIX USING BASIC CALCULUS. VIELEN DANK UND AUF WIEDERSEHEN!". 

Hackers are nothing if not obsessed with the theoretical edges of computation, the way a 1939 Niels Bohr was obsessed with the boundary between physics and philosophy, and with how far you could push that line using math until you found complementary pairs of truths lying about everywhere. One aspect of his work is that, almost a century later, you can deter an adversary from attacking you by threatening to end the world; another aspect is that you can study the stars. A bureaucrat would call this "dual-use", although governments tend to have a heck of a lot of use for the former and almost no use at all for the latter.

All of which is to say that while NO HAT is a very good conference, with a lot of great talks, the one I have enjoyed the most so far is "LLMs FOR VULNERABILITY DISCOVERY: FINDING 0DAYS AT SCALE WITH A CLICK OF A BUTTON" (by Marcello Salvati and Dan McInerney). This talk goes over their lessons learned developing VulnHunter, which finds real vulns in Python apps given just the source directory, and those lessons are roughly as follows:

  1. Focus on Specific Tasks and Structured Outputs: LLMs can be unreliable when given open-ended or overly broad tasks. To mitigate hallucinations and ensure accurate results, it's crucial to provide highly specific instructions and enforce structured outputs (aka, the naive metrics people are providing are probably not useful). There's a minimal sketch of what that can look like after this list.
  2. Manage Context Windows Effectively: While larger context windows are beneficial, strategically feeding code to the LLM in smaller, relevant chunks, like focusing on the call chain from user input to output, is key to maximizing efficiency and accuracy. They did a great job here, and this is important even if you have a huge context window to play with (aka, Gemini).
  3. Leverage Existing Libraries for Code Parsing: Dynamically typed languages like Python present unique challenges for static analysis. Utilizing libraries like Jedi, which is designed for Python autocompletion, can significantly streamline the process of identifying and retrieving relevant code segments for the LLM (there's a rough sketch of that retrieval after this list). They themselves recommend rewriting this part with treesitter to cover C/C++, although personally I would probably have used an IDE plugin to handle Python (which also gives you debugger access).
  4. Prompt Engineering is Essential: The way you structure prompts has a huge impact on the LLM's performance. Clear instructions, well-defined tasks, and even the use of XML tags for clarity can make a significant difference in the LLM's ability to find vulnerabilities (an example of that tag structure follows this list). But of course, in my experience, the better your LLM (larger, really), the less prompt engineering matters. And when you are testing against multiple LLMs, you don't want to introduce prompt engineering as a variable.
  5. Bypass Identification and Multi-Step Vulnerability Analysis: LLMs can be remarkably effective at identifying security bypass techniques and understanding complex, multi-step vulnerabilities that traditional static code analyzers might miss. There's a ton of future work to be done analyzing how this happens and where the boundaries are; one way to drive that kind of multi-step loop is sketched after this list.
  6. Avoid Over-Reliance on Fine-Tuning and RAG: While seemingly promising, fine-tuning LLMs on vulnerability datasets can lead to oversensitivity and an abundance of false positives. Similarly, retrieval augmented generation (RAG) may not be sufficiently precise for pinpointing the specific code snippets required for comprehensive analysis. Knowing that everyone is having problems with these techniques is actually useful, because it goes against the common understanding of how you would build something like this.
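
To make the first lesson concrete, here's a minimal sketch of the structured-output side using Pydantic. The class and field names are mine for illustration, not VulnHunter's actual schema: you ask one narrow question and you refuse to accept anything that doesn't come back in the shape you asked for.

```python
# A minimal sketch, not VulnHunter's schema: pin the model to one narrow
# question ("is there a vuln of this class on this path?") and make it answer
# in a fixed structure you can validate, instead of trusting free-form prose.
import json
from enum import Enum

from pydantic import BaseModel, ValidationError


class VulnClass(str, Enum):
    SQLI = "sqli"
    SSRF = "ssrf"
    RCE = "rce"
    LFI = "lfi"


class Finding(BaseModel):
    vuln_class: VulnClass   # which class, if any
    confidence: float       # 0.0 - 1.0, so you can threshold out noise
    source: str             # where user-controlled input enters
    sink: str               # where it becomes dangerous
    poc: str                # a concrete proof-of-concept input


def parse_findings(raw: str) -> list[Finding]:
    """Anything that doesn't match the schema is dropped, not guessed at."""
    try:
        return [Finding(**item) for item in json.loads(raw)]
    except (ValidationError, ValueError, TypeError):
        return []
```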
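
For the call-chain and Jedi lessons, the retrieval step might look roughly like this. Again a sketch, not their code: the fixed-size chunking heuristic and the function name are my own assumptions.

```python
# Sketch: resolve a call site inside a request handler to its definition with
# Jedi, and hand the LLM just that chunk instead of the whole repository.
# The 40-line chunk size is an assumption for illustration, not their approach.
from pathlib import Path

import jedi


def definition_chunks(path: str, line: int, column: int, context_lines: int = 40):
    """Follow a call site at (line, column) to its definition(s) and return source chunks."""
    code = Path(path).read_text()
    script = jedi.Script(code, path=path)

    chunks = []
    for name in script.goto(line, column, follow_imports=True):
        if name.module_path is None or name.line is None:
            continue  # builtin or compiled module: nothing useful to show the model
        target_lines = Path(name.module_path).read_text().splitlines()
        start = name.line - 1
        chunks.append("\n".join(target_lines[start:start + context_lines]))
    return chunks
```

You would call something like this for each hop on the path from the route handler down to the sink, and concatenate the chunks into the prompt rather than dumping the whole source directory in.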
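
And for the prompt engineering point, the XML-tag idea is roughly this kind of structure (the tags and wording are mine, not their actual prompt):

```python
# Illustrative only: the tag names and instructions are my own, not the
# speakers' prompt. The point is a narrow task, clearly delimited sections,
# and an explicit output contract the parser above can validate.
PROMPT_TEMPLATE = """You are auditing a Python web application for exactly one
vulnerability class: {vuln_class}.

<source_code>
{call_chain_code}
</source_code>

<instructions>
Trace user-controlled input from the request handler to any dangerous sink in
the code above. Report only {vuln_class} findings. If there are none, return [].
</instructions>

<output_format>
Return a JSON list of objects with keys: vuln_class, confidence, source, sink, poc.
</output_format>"""

prompt = PROMPT_TEMPLATE.format(vuln_class="sqli", call_chain_code="...")
```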
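
Finally, one way to drive the multi-step analysis, in my framing rather than necessarily theirs, is to let the model ask for the next definition it needs instead of reasoning over the whole app at once. `ask_llm` and `lookup_definition` here are hypothetical helpers standing in for your model call and the Jedi retrieval above:

```python
# Sketch of an iterative analysis loop; `ask_llm` and `lookup_definition` are
# hypothetical stand-ins for a model API call and the Jedi retrieval above.
import json


def analyze(entry_point_code: str, ask_llm, lookup_definition, max_rounds: int = 5):
    context = [entry_point_code]
    for _ in range(max_rounds):
        reply = ask_llm(
            code="\n\n".join(context),
            question=(
                "Either report your findings, or name the one function whose "
                "definition you still need. Answer as JSON: "
                '{"need": "<function name>"} or {"findings": [...]}'
            ),
        )
        answer = json.loads(reply)
        if "findings" in answer:
            return answer["findings"]
        context.append(lookup_definition(answer["need"]))
    return []  # ran out of rounds without a definitive answer
```
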
At its core, vulnerability discovery is as much about understanding as it is about finding flaws. To find a vulnerability, one has to unravel the code, decipher its intent, and even the assumptions of its creator. This process mirrors the deeper question of what it means to truly understand code—of seeing beyond syntax and function to grasp the logic, intention, and potential points of failure embedded within. Like Bohr’s exploration of complementary truths in physics, understanding code vulnerabilities requires seeing both what the code does and what it could do under different conditions. In this way, the act of discovering vulnerabilities is itself a study in comprehension, one that goes beyond detection to touch on the very nature of insight.

-dave