I've been working with LLMs for a bit, and also looking at the DARPA AI
Cyber Challenge <https://www.darpa.mil/news-events/2023-08-09>. And to that
end I put together CORVIDCALL, which uses various LLMs to find and patch
essentially 100% of the bug examples I can throw at it from the various
GitHub repos that store these things (see below).
So I learned a lot of things doing this. One article that came out talked
about the future of LLMs, and if you're doing this challenge you really are
building for future LLMs, not the ones available right now. One thing that
article (which I highly recommend reading) pointed out is that Hugging Face
is basically doing a disservice with their
leaderboard - but the truth is more complicated. It's nice to know which
models do better than other models, but the comparison between them is not
a simple number any more than the comparison between people is a simple
number. There's no useful IQ score for models or for people.
For example, one of the hardest things to measure is how well a model can
handle interleaved and recursive problems. If you have an SQL query inside
your Python code being sent to a server, does the model notice errors in
that query, or do they fly under the radar as "just a string"?
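To make that concrete, here is the kind of probe I mean (an illustrative
snippet of my own, not from any benchmark): the bugs live entirely inside a
string, and the question is whether the model flags them.

```python
import sqlite3

def get_user(conn: sqlite3.Connection, username: str):
    # Two problems hide inside the string: the column name is
    # misspelled ("usrname"), and the query is built with string
    # interpolation, which is a textbook SQL injection. A model that
    # treats the SQL as "just a string" will report neither.
    query = f"SELECT id, email FROM users WHERE usrname = '{username}'"
    return conn.execute(query).fetchone()
```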
Can the LLM handle optimization problems, indicating it understands
performance implications of a system?
Can the LLM handle LARGER problems? People are obsessed with context window
sizes, but what you find is a huge degradation in instruction-following
accuracy when you hit even 1/8th of the context window size for any of the
leading models. This means you have to know how to compress your tasks to
fit basically into a teacup. And for smaller models, this degradation is
even more severe.
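To give a flavor of what "compressing into a teacup" means in practice,
here is a rough sketch (my own illustration, not anyone's published method),
with a placeholder summarize() step and a crude characters-per-token guess:

```python
# Assumed numbers: an 8k-token model and the 1/8th rule of thumb above.
CONTEXT_WINDOW = 8192
BUDGET = CONTEXT_WINDOW // 8

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text and code.
    return len(text) // 4

def fit_to_budget(chunks: list[str], summarize) -> list[str]:
    """Compress chunks until the whole task fits well inside the window.

    summarize() is a placeholder for whatever compression you use: an
    LLM summarization pass, code outlining, stripping comments, etc.
    """
    while sum(estimate_tokens(c) for c in chunks) > BUDGET:
        biggest = max(range(len(chunks)), key=lambda i: estimate_tokens(chunks[i]))
        shorter = summarize(chunks[biggest])
        if estimate_tokens(shorter) >= estimate_tokens(chunks[biggest]):
            break  # compression stalled; stop rather than loop forever
        chunks[biggest] = shorter
    return chunks
```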
People in the graph database world are obsessed with getting "Knowledge
graphs" out of unstructured data + a graph database. I think "Knowledge
graphs" are pretty useless, but what is not useless is connecting
unstructured data by topic in your graph database, and using that to make
larger community detection-based decisions. And the easiest way to do this
is to pass your data into an LLM and ask it to generate the topics for you,
typically in the form of a Twitter hashtag. Code is unstructured data.
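Here is roughly what that pipeline looks like, sketched with a placeholder
llm() callable and networkx standing in for whatever graph database you
actually use:

```python
import networkx as nx

def hashtags_for(text: str, llm) -> set[str]:
    # llm() is a placeholder for whatever model call you use.
    reply = llm(f"Give me 5 twitter hashtags for this, comma separated:\n{text}")
    return {t.strip().lower() for t in reply.split(",") if t.strip().startswith("#")}

def topic_graph(documents: dict[str, str], llm) -> nx.Graph:
    """Connect documents (or files, or functions) that share a hashtag."""
    graph = nx.Graph()
    tags = {name: hashtags_for(text, llm) for name, text in documents.items()}
    names = list(tags)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = tags[a] & tags[b]
            if shared:
                graph.add_edge(a, b, topics=shared)
    return graph

# Community detection over the shared-topic graph is what drives the
# "larger decisions" mentioned above, e.g.:
#   from networkx.algorithms.community import greedy_modularity_communities
#   communities = greedy_modularity_communities(topic_graph(docs, llm))
```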
If you want to measure your LLM, you can do some fun things. Asking a good
LLM for 5 Twitter hashtags in comma-separated format will work MOST of the
time. But the smaller and worse the LLM, the more likely it is to go off the
rails and fail when faced with larger data, or more complicated data, or
data in a different language that it first has to translate. To be fair,
most of them will fail to produce the right number of hashtags. You can try
this yourself on various models that are otherwise
at the top of a leaderboard, within "striking distance" on the benchmarks
against Bard, Claude, or GPT-4. (#theyarenowhereclose, #lol)
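That manual evaluation is trivial to automate, because the only thing being
measured is whether the model did what it was told. Something like this
(assuming you already have each model's raw reply in hand):

```python
import re

def follows_format(reply: str, expected: int = 5) -> bool:
    """Did the model return exactly `expected` comma-separated hashtags?"""
    parts = [p.strip() for p in reply.strip().split(",")]
    return (len(parts) == expected
            and all(re.fullmatch(r"#\w+", p) for p in parts))

# Run the same prompt across models, input sizes, and languages, and
# count the failures. That is the entire "benchmark".
```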
Obviously the more neurons you have making sure you don't say naughty
things, the worse you are at doing anything useful, and you can see that in
the difference between StableBeluga and LLAMA2-chat, for example, with
these simple manual evaluations.
And this matters a lot when you need your LLM to output structured data
<https://twitter.com/RLanceMartin/status/1696231512029777995?s=20> based on
unstructured inputs.
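When the consumer is a program rather than a person, "MOST of the time"
isn't good enough, so you end up wrapping every call in a validate-and-retry
loop. A minimal sketch, again with a placeholder llm() callable:

```python
import json

def structured_call(llm, prompt: str, retries: int = 3) -> dict:
    """Ask for JSON, validate it, and retry when the model drifts.

    The retry-on-parse-failure loop is the point here, not the
    specific model API, which is assumed.
    """
    ask = prompt + "\nReply with a single JSON object and nothing else."
    for _ in range(retries):
        reply = llm(ask)
        try:
            return json.loads(reply)
        except json.JSONDecodeError:
            ask = prompt + "\nThat was not valid JSON. Reply with only a JSON object."
    raise ValueError("model never produced valid JSON")
```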
So we can divide up the problem of automating finding and patching bugs in
source code in a lot of ways, but one way is to notice the process real
auditors take, and just replicate it by passing data flow diagrams and
various other summaries into the models. Right now hundreds of academics
are "inventing" new ways to use LLMs. For example, "Reason and Act" (ReAct).
I've never seen so much hilarity as people put obvious computing patterns
into papers and try to invent some terminology to hang their careers on.
And of course when it comes to a real codebase, say, libjpeg, or a real web
app, following the data through a system is important. Understanding code
flaws is important. But also building test triggers and doing debugging is
important to test your assumptions. And coalescing this information in, for
example, the big graph database that is your head is how you make it all
work.
But what you want with bug finding is not to mechanistically re-invent
source-sink static analysis with LLMs. You want intuition. You want flashes
of insight.
It's a hard and fun problem at the bigger end of the scale. We may have to
give our bug finding systems the machine equivalent of serotonin. :)
The authors claim in their conclusion: "We want to stress the importance of
books as journeys to explore and experience topics from the unique
viewpoint of the authors."
And in this they succeeded. This book works best as a proposed curriculum
for a five-day workshop for experts to reproduce fuzzing frameworks that
target embedded platforms - including Android and iOS. Largely this is done
by figuring out how to get various emulation frameworks (QEMU in
particular) to carry the weight of virtualizing a platform, getting
snapshots out of it, and pushing data into it.
Fuzzing is a childishly easy concept that is composed of devilishly hard
problems in practice (7 and 8 being the ones this book covers in depth -
the fuzzers themselves are simplistic other than those topics):
1. Managing scale
2. Getting decent per-iteration performance
3. Triaging crashes
4. Building useful harnesses
5. Knowing when you have fuzzed enough, vs. being in a local minimum
6. Figuring out root causes
7. *Getting your fuzzer to properly instrument your target so you can
have coverage-guided fuzzing*
8. *Handling weird architectures*
9. Generating useful starting points for your fuzzer (or input grammars)
All of these things are basically impossible in the real world. Your
typical experience with a new fuzzing framework is that you install it on a
fresh Linux, pick a target, and then watch as it fails to instrument or
even run it.
In other words, just knowing which fuzzer versions to use, and on what, is
half the battle.
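For reference, here is what the harness half of the problem (item 4 on the
list) even looks like, sketched in Python with Google's Atheris rather than
with anything from the book, and with a bug planted so the fuzzer has
something to find:

```python
import sys
import atheris

def parse_record(data: bytes) -> tuple[str, int]:
    # Toy target: "name:count" records. int() raising on junk is fine;
    # the unchecked index below is the planted bug.
    fields = data.decode("utf-8", errors="replace").split(":")
    return fields[0], int(fields[1])  # IndexError when ':' is missing

def TestOneInput(data: bytes):
    try:
        parse_record(data)
    except ValueError:
        pass  # malformed counts are expected; anything else is a crash

atheris.instrument_all()  # coverage instrumentation: problem 7 on the list
atheris.Setup(sys.argv, TestOneInput)
atheris.Fuzz()
```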
When I read a book on security, a good one, I want it to feel like I'm
putting on a brand new powersuit, ready to march into the wilderness with a
flamethrower and a mindset of extreme violence. This book delivers that
feeling. Because while my current business practices have nothing to do
with fuzzing the Shannon baseband, that doesn't mean some small part of me
doesn't want to. We all have the dark urge. We crave SIGSEGV in things
people rely on.
So in summary: 10/10, great book. Would recommend buying 10, setting up a
class, and going over it all together. Of course, this field is RAPIDLY
EVOLVING and you're going to want to get it updated, perhaps with the fancy
new PCODE fuzzer Airbus released earlier today.
The Vegas security conferences used to feel like diving into a river. While
yes, you networked and made deals and talked about exploits, you also felt
for currents and tried to get a prediction of what the future held. A lot
of this was what the talks were about. But you went to booths to see what
was selling, or what people thought was selling, at least.
But it doesn't matter anymore what the talks are about. The talks are about
everything. There's a million of them and they cover every possible topic
under the sun. And the big corpo booths are all the same. People want to
sell you XDR, and what that means for them is a per-seat or per-IP charge.
When there's no differentiation in billing, there's no differentiation in
product.
That doesn't mean there aren't a million smaller start-ups with tiny
cubicles in the booth-space, like pebbles on a beach. Hunting through them
is like searching for shells - for every Thinkst Canary there's a hundred
newly AI-enabled compliance engines.
DefCon and Blackhat in some ways used to be more international as well -
but a lot of the more interesting speakers can't get visas anymore or
aren't allowed to talk publicly by their home countries.
If you've been in this business for a while, you have a dreadful fear of
being in your own bubble. To not swim forward is to suffocate. This is what
drove you to sit in the front row of as many talks as possible at these two
huge conferences, hung over, dehydrated, confused by foreign terminology in
a difficult accent.
But now you can't dive in to make forward progress. Vegas is even more of a
forbidding dystopia, overloaded with crowds so heavy it can no longer feed
them or even provide a contiguous space for the amoeba-like host to gather.
Talks echo and muddle in cavernous rooms with the general acoustics of a
high school gymnasium. You are left with snapshots and fragmented memories
instead of a whole picture.
For me, one such moment was a Senate Staffer, full of enthusiasm, crowing
about how smart the other people working on policy and walking the halls of
Congress were - experts and geniuses at healthcare, for example! But if our
cyber security policy is only as successful as our health system, we are doomed.
I brought my kids this year and it helps to be able to see through the
chaos with new eyes. What's "cool"? I asked, in the most boomery way
possible. Because I know jailbreaking an AI to say bad things is not it,
even though it had all the political spotlights in the world focused on
examining the "issue".
The more crowded the field gets, the less immersion you have. Instead of
diving in you are holding your palm against the surface of the water,
hoping to sense the primordial tube worms at the sea vents feeding on raw
data leagues below you. "Take me to the beginning, again" you say to them,
through whatever connection you can muster.