I've been working with LLMs for a bit, and also looking at the DARPA Cyber
AI Challenge <https://www.darpa.mil/news-events/2023-08-09>. And to that
end I put together CORVIDCALL, which uses various LLMs to find and patch
essentially 100% of the bug examples I can throw at it from the various
GitHub repos that store these things (see below).
[image: image.png]
So I learned a lot of things doing this, and one article that came out
recently (https://www.semianalysis.com/p/google-gemini-eats-the-world-gemini)
talked about the future of LLMs, and if you're doing this challenge you
really are building for future LLMs and not the ones available right now.
One thing they pointed out in that article (which I highly recommend
reading) is that huggingface is basically doing a disservice with their
leaderboard - but the truth is more complicated. It's nice to know which
models do better than other models, but the comparison between them is not
a simple number any more than the comparison between people is a simple
number. There's no useful IQ score for models or for people.
For example, one of the hardest things to measure is how well a model can
handle interleaved and recursive problems. If you have an SQL query inside
your Python code being sent to a server, does the model notice errors in that
query, or do they fly under the radar as "just a string"?
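Here's the kind of test case I mean (a toy I made up, not something from
CORVIDCALL's corpus): the Python runs fine, and the actual flaw lives inside
the string, which is exactly what a weaker model will skate right past.

import sqlite3

def get_user(conn, username):
    # The Python is "correct"; the bug is the SQL hiding in the f-string.
    # A username like "' OR '1'='1" returns every row - classic injection -
    # and a model treating the query as "just a string" never sees it.
    query = f"SELECT id, username, role FROM users WHERE username = '{username}'"
    return conn.execute(query).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, username TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice', 'admin'), (2, 'bob', 'user')")
print(get_user(conn, "bob"))          # just bob
print(get_user(conn, "' OR '1'='1"))  # everyone

A good model flags the f-string; a weak one summarizes the function and moves on.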
Can the LLM handle optimization problems, indicating it understands
performance implications of a system?
Can the LLM handle LARGER problems? People are obsessed with context window
sizes, but what you find is a huge degradation in accuracy at following
instructions when you hit even 1/8th of the context window size for any of the
leading models. This means you have to know how to compress your tasks
to fit basically into a teacup. And for smaller models, this degradation is
even more severe.
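What "fitting into a teacup" looks like in practice is something like this
sketch - the 1/8th budget and the four-characters-per-token estimate are my
own rules of thumb, not numbers any vendor publishes:

CONTEXT_WINDOW_TOKENS = 8192   # whatever your model claims
USABLE_FRACTION = 1 / 8        # where instruction-following starts to degrade
CHARS_PER_TOKEN = 4            # crude estimate; use a real tokenizer if you have one

def chunk_for_model(text: str) -> list[str]:
    # Split work into chunks that stay well under the "claimed" context size.
    budget = int(CONTEXT_WINDOW_TOKENS * USABLE_FRACTION * CHARS_PER_TOKEN)
    chunks, current = [], ""
    for para in text.split("\n\n"):   # split on paragraph boundaries
        if current and len(current) + len(para) > budget:
            chunks.append(current)
            current = ""
        current += para + "\n\n"
    if current:
        chunks.append(current)
    return chunks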
People in the graph database world are obsessed with getting "Knowledge
graphs" out of unstructured data + a graph database. I think "Knowledge
graphs" are pretty useless, but what is not useless is connecting
unstructured data by topic in your graph database, and using that to make
larger community detection-based decisions. And the easiest way to do this
is to pass your data into an LLM and ask it to generate the topics for you,
typically in the form of a Twitter hashtag. Code is unstructured data.
If you want to measure your LLM you can do some fun things. Asking a good
LLM for five Twitter hashtags in comma-separated format will work MOST
of the time. But the smaller and worse the LLM, the more likely it is to go
off the rails and fail when faced with larger data, or more
complicated data, or data in a different language which it first has to
translate. To be fair, most of them will fail to produce the right number of
hashtags. You can try this yourself on various models which otherwise sit
at the top of a leaderboard, within "striking distance" of Bard, Claude, or
GPT-4 on the benchmarks. (#theyarenowhereclose, #lol)
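If you want to make that manual poke repeatable, the scoring side is trivial -
ask_model() below is a stand-in for whichever API or local model you're
kicking the tires on:

import re

def score_hashtag_response(response: str, expected: int = 5) -> dict:
    # Score one response to "give me 5 hashtags, comma separated".
    tags = [t.strip() for t in response.split(",") if t.strip()]
    return {
        "right_count": len(tags) == expected,
        "all_hashtags": all(re.fullmatch(r"#\w+", t) for t in tags),
        "no_extra_prose": not re.search(r"(?i)sure|here are|as an ai", response),
    }

# ask_model() is whatever client you're testing:
# results = [score_hashtag_response(ask_model(PROMPT + doc)) for doc in corpus]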
Obviously the more neurons you have making sure you don't say naughty
things, the worse you are at doing anything useful, and you can see that in
the difference between StableBeluga and LLAMA2-chat, for example, with
these simple manual evaluations.
And this matters a lot when you need your LLM to output structured data
<https://twitter.com/RLanceMartin/status/1696231512029777995?s=20> based on
your input.
So we can divide up the problem of automating finding and patching bugs in
source code in a lot of ways, but one way is to notice the process real
auditors follow and just replicate it, by passing data flow diagrams and
various other summaries into the models. Right now hundreds of academics
are "inventing" new ways to use LLMs. For example "Reason and Act
<https://blog.research.google/2022/11/react-synergizing-reasoning-and-acting…>".
I've never seen so much hilarity as people put obvious computing patterns
into papers and try to invent some terminology to hang their career on.
And of course when it comes to a real codebase, say, libjpeg, or a real web
app, following the data through a system is important. Understanding code
flaws is important. But also building test triggers and doing debugging is
important to test your assumptions. And coalescing this information in, for
example, the big graph database that is your head is how you make it all
pay off.
But what you want with bug finding is not to mechanistically re-invent
source-sink static analysis with LLMs. You want intuition. You want flashes
of insight.
It's a hard and fun problem at the bigger end of the scale. We may have to
give our bug finding systems the machine equivalent of serotonin. :)
[image: image.png]
-dave
https://www.packtpub.com/product/fuzzing-against-the-machine/9781804614976
The authors claim in their conclusion: "We want to stress the importance of
books as journeys to explore and experience topics from the unique
viewpoint of the authors."
And in this they succeeded. This book works best as a proposed curriculum
for a five day workshop for experts to reproduce fuzzing frameworks that
target embedded platforms - including Android and iOS. Largely this is done
by figuring out how to get various emulation frameworks (QEMU in
particular) to carry the weight of virtualizing a platform and getting
snapshots out of it and pushing data into it.
Fuzzing is a childishly easy concept that is composed of devilishly hard
problems in practice (7 and 8 being the ones this book covers in depth -
the fuzzers themselves are simplistic other than those topics):
1. Managing scale
2. Getting decent per-iteration performance
3. Triaging crashes
4. Building useful harnesses
   5. Knowing when you have fuzzed enough, vs. being stuck in a local minimum
6. Figuring out root causes
7. *Getting your fuzzer to properly instrument your target so you can
have coverage-guided fuzzing*
8. *Handling weird architectures*
9. Generating useful starting points for your fuzzer (or input grammars)
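The "childishly easy" part is just the core loop - something like the sketch
below, where run_with_coverage() is a hypothetical harness that runs the
instrumented target and hands back coverage. Items 1 through 9 above are what
it takes to make that one function real.

import random

def fuzz(seed_corpus, run_with_coverage, iterations=100_000):
    # run_with_coverage(data) -> (crashed, coverage_set) is the hard part.
    corpus, seen, crashes = list(seed_corpus), set(), []
    for _ in range(iterations):
        sample = bytearray(random.choice(corpus))
        if not sample:
            continue
        for _ in range(random.randint(1, 8)):          # dumb byte mutations
            sample[random.randrange(len(sample))] = random.randrange(256)
        crashed, coverage = run_with_coverage(bytes(sample))
        if crashed:
            crashes.append(bytes(sample))
        elif coverage - seen:                          # new edges? keep it
            corpus.append(bytes(sample))
            seen |= coverage
    return crashes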
All of these things are basically impossible in the real world. Your
typical experience with a new fuzzing framework is that you install it on a
fresh Linux, pick a target, and then watch as it fails to instrument or
even run.
In other words, just knowing which fuzzer versions to use, and on what, is
valuable information.
When I read a book on security, a good one, I want it to feel like I'm
putting on a brand new powersuit, ready to march into the wilderness with a
flamethrower and a mindset of extreme violence. This book delivers that
feeling. Because while my current business practices have nothing to do
with fuzzing the Shannon baseband, that doesn't mean some small part of me
doesn't want to. We all have the dark urge. We crave SIGSEGV in things
people rely on.
So in summary: 10/10, great book. Would recommend buying 10, setting up a
class, and going over it all together. Of course, this field is RAPIDLY
EVOLVING and you're going to want to get it updated, perhaps with the fancy
new PCODE fuzzer Airbus released earlier today. (
https://github.com/airbus-cyber/ghidralligator)
-dave
The Vegas security conferences used to feel like diving into a river. While
yes, you networked and made deals and talked about exploits, you also felt
for currents and tried to get a prediction of what the future held. A lot
of this was what the talks were about. But you went to booths to see what
was selling, or what people thought was selling, at least.
But it doesn't matter anymore what the talks are about. The talks are about
everything. There's a million of them and they cover every possible topic
under the sun. And the big corpo booths are all the same. People want to
sell you XDR, and what that means for them is a per-seat or per-IP charge.
When there's no differentiation in billing, there's no differentiation in
product.
That doesn't mean there aren't a million smaller start-ups with tiny
cubicles in the booth-space, like pebbles on a beach. Hunting through them
is like searching for shells - for every Thinkst Canary there's a hundred
newly AI-enabled compliance engines.
DefCon and Blackhat in some ways used to be more international as well -
but a lot of the more interesting speakers can't get visas anymore or
aren't allowed to talk publicly by their home countries.
If you've been in this business for a while, you have a dreadful fear of
being in your own bubble. To not swim forward is to suffocate. This is what
drove you to sit in the front row of as many talks as possible at these two
huge conferences, hung over, dehydrated, confused by foreign terminology in
a difficult accent.
But now you can't dive in to make forward progress. Vegas is even more of a
forbidding dystopia, overloaded with crowds so heavy it can no longer feed
them or even provide a contiguous space for the ameba-like host to gather.
Talks echo and muddle in cavernous rooms with the general acoustics of a
high school gymnasium. You are left with snapshots and fragmented memories
instead of a whole picture.
For me, one such moment was a Senate Staffer, full of enthusiasm, crowing
about how smart the other people working on policy and walking the halls of
Congress were - experts and geniuses at healthcare, for example! But if our
cyber security policy matches our success at running a health system, we are doomed.
I brought my kids this year and it helps to be able to see through the
chaos with new eyes. What's "cool"? I asked, in the most boomery way
possible. Because I know jailbreaking an AI to say bad things is not it,
even though it had all the political spotlights in the world focused on
examining the "issue".
The more crowded the field gets, the less immersion you have. Instead of
diving in you are holding your palm against the surface of the water,
hoping to sense the primordial tube worms at the sea vents feeding on raw
data leagues below you. "Take me to the beginning, again" you say to them,
through whatever connection you can muster.
-dave
https://www.youtube.com/live/YY-ugAHPu4M?feature=share&t=1057
I have on my todo list to reply to our thread last month, but in the
meantime, here is a video that goes over all the lessons learned from my
last couple years doing Neo4j.
But as a reminder, we should still port Ghidra to Neo4j. :)
-dave
There was a new Ghidra release last week! Lots of improvements to the
debugger, which is awesome. But this brings up some thoughts that have been
triggering my vulnerability-and-exploitation-specific OCD for some time now.
Behind every good RE tool is a crappy crappy database. Implicitly we, as a
community, understand there is no good reason that every reverse
engineering project needs to implement a key-value store, or a B-Tree
<https://github.com/NationalSecurityAgency/ghidra/tree/master/Ghidra/Framewo…>,
or partner with a colony of bees which maintain tool state by
various wiggly dances. And yet each and every tool has a developer with
decades of reverse engineering experience on rare embedded platforms either
building custom indexes in a pale imitation of a real DB structure or
engaging in insect-based diplomacy efforts.
I think the Ghidra team (and Binja/IDA teams!) are geniuses, but they are
probably NOT geniuses at building database engines. And reading through the
issues <https://github.com/NationalSecurityAgency/ghidra/issues/985> with
ANY reverse engineering product you find that performance even for the base
feature-set is a difficult ask.
My plea is this: We need to port Ghidra to Neo4j as soon as possible.
Having a real Graph DB store underneath Ghidra solves the scalability
issues. I understand the difficulty here is: There are few engineers who
understand both Neo4j and reverse engineering to the point where this can
be done. I mean, why do it in Neo4j and not Postgres? An argument can be
made for both, in the sense that Postgres is truly Free and the most solid
DB on the market. The plus for Neo4j is that RE data is typically
more graph-shaped than linear.
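To make the "graph-shaped" claim concrete, the core of it with the standard
Neo4j Python driver looks roughly like this - the Function/CALLS schema is my
own strawman, not anything the Ghidra team has signed off on:

from neo4j import GraphDatabase

# Placeholder connection details; the schema below is a strawman.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_call_graph(functions, calls):
    # functions: [{"addr": "0x401000", "name": "parse_header"}, ...]
    # calls:     [("0x401000", "0x402340"), ...]
    with driver.session() as session:
        for f in functions:
            session.run(
                "MERGE (fn:Function {addr: $addr}) SET fn.name = $name",
                addr=f["addr"], name=f["name"],
            )
        for src, dst in calls:
            session.run(
                "MATCH (a:Function {addr: $src}), (b:Function {addr: $dst}) "
                "MERGE (a)-[:CALLS]->(b)",
                src=src, dst=dst,
            )

Once xrefs, strings, and types are edges like this, "show me every path from a
parser to a memcpy" becomes a short Cypher query instead of a plugin.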
I spent the last two years learning graph dbs, out of some masochistic
desire and ended up getting certified - and I can still RE a little bit. I
will manage the team porting Ghidra to Neo4j if someone funds it. :)
Either way, sooner is better than later. There are so many companies and
people relying on these tools that it seems silly to do anything else.
-dave
P.S. Yes, I remember BinNavi used MsSQL installs for its data, and this was
annoying to install but ... I get why Halvar did it at the time. It's
because he had real work to do and building a DB was not it. I can only
assume Reven doesn't use their own DB? I mean the benefits for
interoperability would be huge between tools. . . like literally everything
you want to do with these tools is better with a real DB underneath.
Lately I've been watching a lot of online security talks - the new thing
for conferences to do is publish them almost immediately, which is amazing.
So like, today I watched Chompie's talk:
https://www.sstic.org/2023/presentation/deep_attack_surfaces_shallow_bugs/
(I was honestly hoping it went from RCE to logic bug and allowed you to log
in, but maybe left as an exercise for the reader).
And yesterday I watched Natalie's talk:
https://www.youtube.com/watch?v=quw8SnmMWg4&t=663s&ab_channel=OffensiveCon
(I'm still a bit confused as to how you connect to a phone's baseband with
SIP, but maybe I will ask later at some point). Does the baseband just have
some TCP ports open for RTC shenanigans? If so, that's great, #blessed, etc.
I actually forgot to post my own talk here, so if you want to watch that,
it's here:
https://www.youtube.com/watch?v=BarJCn4yChA&t=1669s&ab_channel=OffensiveCon
My talk is not actually a call to hack enterprise products - which y'all
are clearly already doing a lot of (I'm just assuming there are members of
the FIN11 ransomware crew on this list somewhere). It's more about
understanding which business models lead to easy bugs - enterprise software
obviously being one of them, but, for example, DRM components are another
one. There is an endless supply really.
Today I noticed Barracuda is saying that if your appliance gets hacked
<https://www.bleepingcomputer.com/news/security/barracuda-says-hacked-esg-ap…>,
you should replace it immediately. It is now trash, or "e-waste" if you
prefer. This is a surprisingly honest thing to say. Previously, appliance
companies that got hacked would say things like "Meh? Upgrade pls! Don't
forget to change your passwords!" because, as we all know, the firmware
<https://arstechnica.com/information-technology/2023/03/malware-infecting-wi…>
and
boot partitions inside expensive security appliances are all protected by
angry leprechauns, which is why it's still ok in 2023 to have Perl
installed on them, even if you don't know what a ../.../ does in a tar
file.
This honesty would be nice if it also applied to our government agencies -
like instead of this very long report CISA
<https://www.cisa.gov/news-events/cybersecurity-advisories/aa21-110a#:~:text….>
put out about what to do if you think your Pulse Secure VPN was hacked,
which recommends performing a factory reset, updating your appliance to the
very latest version, and then calling your therapist to have a good cry
about it, they should have instead said: "Yes, your appliance did at one
point control authentication for everyone accessing your network, but
because it had issues with gzip files and opening URIs, it is now e-waste."
Crap, I forgot to write about the graph disassembler I want and why.
Tomorrow, for sure.
-dave
[image: image.png]
https://twitter.com/thezdi/status/1638617627626176513
[image: image.png]
Yawps
So one thing I have as a "lessons learned" from the past 20 years is that
security is not a proactive sport. In fact, we are all experts at running
to where the ball _was_ as opposed to where it is _going_.
Like, if you listen to Risky Biz this week, Patrick asks Metlstorm whether
it's time to go out and replace all the old enterprise file sharing systems
<https://twitter.com/vxunderground/status/1641629743534559233?s=20> you
have around, proactively. And the answer, from Metl, who's hacked into
every org in Oceania for the past 20 years, is "yeah, this is generating
huge return on investment for the ransomware crews so they're just going to
keep doing it, and being proactive might be a great idea." But what he
didn't say, but clearly had in his head was "but lol, nobody is going to
actually do that. So good luck out there chooms!"
At some level, STIX and TAXII and the whole CTI market are about passing
around information on what someone _might_ have used to hack something, at
some point in the _distant past_. It's a paleontology of hackers past - XML
schemas about huge ancient reptiles swimming in the tropical seas of
your networks, the taxonomies of extinct orders we now know only through a
delicate finger-like flipper bone or a clever piece of shellcode.
-dave
So last week at OffensiveCon I watched a talk on Fuzzilli (
https://github.com/googleprojectzero/fuzzilli) which, I have to admit, I had
no idea what it was. Obviously I knew it was a Googley Javascript fuzzer,
finding bugs. But I did not realize that it was applying mutations to its
own intermediate language, which it then compiled to Javascript. I just
assumed it was, like most fuzzers, mutating the Javascript directly (e.g.
https://sean.heelan.io/2016/04/26/fuzzing-language-interpreters-using-regre…
).
But having an IL designed for fuzzing-related mutations is clearly a great
idea! And this year, they've expanded on that to build a
Javascript->Fuzzilli compiler/translation layer. So you can pass in sample
Javascript and then it will create the IL and then it will mutate the IL.
The reason this is necessary is that Javascript is, like almost all modern
languages, extremely complicated underneath the covers, so in order to
generate crashes you may need to have a lot of different fields set
properly, in a particular order, in a structure. They try to do some
introspection on objects and generate their samples from that as well, but
there's no beating "real user code" for learning how an object needs to be
created and used.
These advances generate a lot more bugs! In theory none of these bugs
matter in the future because of the mitigations (no pointers outside the
Javascript gigacage!) being put into place by the very authors of the fuzzer?
(I have my doubts, but we all will live and learn?)
It would be...very cool, I think, if Bard or another LLM was the one doing
the Javascript sample generation as well. If you think about it, these LLMs
all have a good understanding of Javascript and you can give them various
weird tasks to do, and let them generate your samples, and then when a
crash happens you can have them mutate around that crash, or if you have a
sample not getting any more code coverage you can have them mutate that
sample to attempt to make it weirder. :)
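A rough sketch of the loop I'm imagining - ask_llm() and the prompts are
stand-ins for whatever model you'd actually wire in, and the "crash" check is
just "did the engine die with a signal":

import subprocess

JS_ENGINE = "./d8"   # or jsc, or whatever shell wraps your target

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for your model of choice")

def engine_crashes(js_code: str) -> bool:
    proc = subprocess.run([JS_ENGINE, "-e", js_code],
                          capture_output=True, timeout=10)
    return proc.returncode < 0   # negative return code == killed by a signal

def llm_mutate(sample: str, hint: str) -> str:
    return ask_llm(
        f"Here is a JavaScript program:\n{sample}\n"
        f"Rewrite it so that it {hint}, keeping it syntactically valid."
    )

# hints like "stresses the garbage collector with huge TypedArrays" or
# "keeps using the object that was live when the last crash happened"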
-dave
*Logical Operations on Knowledge Graphs:*
So if you've spent enough time looking at graph databases, you will
invariably run into people who are obsessed with "ontologies". The basic
theory being that if you've already organized your data in some sort of
directed graph, maybe you can then apply a logical ruleset to those
relationships and infer knowledge about things you didn't already know,
which could be useful? There are a lot of people trying to do this with
SBOMs and I...wish them the best.
In real life I think it's basically impossible to create a large, useful,
graph database+inference engine that has data clean enough for anything
useful. Also the word *ontology* is itself very annoying.
But philosophically, while any complex enough data set will have to embrace
paradoxes, you can get a lot of value out of putting some higher-level
structure on your data based on the text it contains.
And this is where modern AI comes in - specifically, the branch of "Natural
Language Understanding" that broke off from Chomsky-esque-and-wrong
"Natural Language Processing" some time ago.
One article covering this topic is here
<https://medium.com/@anthony.mensier/gpt-4-for-defense-specific-named-entity…>,
which combines entity extraction and classification in order to find
military topics in an article.
But these techniques can be abstracted and broadened as a general purpose
and very useful algorithm: Essentially you want to extract keywords from
text fields within your graph data, then relate those keywords to each
other, which gives you conceptual groupings and allows you to make further
queries that uncover insights about those groups.
*Our Solution:*
One of the team-members over at Margin Research working on SocialCyber
<https://www.darpa.mil/program/hybrid-ai-to-protect-integrity-of-open-source…>
with me, Matt Filbert, came up with the idea of using OpenAI's GPT to get
hashtags from text, which it does very very well. If you store these as
nodes you get something like the picture below (note that hashtags are
BASED on the text, but they are NOT keywords and may not be in the text
itself):
[image: image.png]
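The storage side of this is pleasantly boring - something like the sketch
below against a Neo4j session, where get_hashtags() stands in for the GPT call
(my approximation of the prompt, not Matt's actual code):

def get_hashtags(text: str) -> list[str]:
    # Stand-in for the GPT call: "give me 5 twitter-style hashtags for this
    # text, comma separated", then split and strip the response.
    raise NotImplementedError

def tag_node(session, node_id: str, text: str):
    # session is a Neo4j session; the labels here are a strawman schema.
    for tag in get_hashtags(text):
        session.run(
            "MATCH (n {id: $node_id}) "
            "MERGE (t:Topic {name: $tag}) "
            "MERGE (n)-[:HAS_TOPIC]->(t)",
            node_id=node_id, tag=tag.lower(),
        )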
Next you want to figure out how topics are related to each other! Which you
can do in a thousand different ways - the code words to search on are "Node
Similarity" - but almost all those ways will either not work or create bad
results because you have a very limited directional graph of just
"Things->Topics".
In the end we used a modified Jaccardian algo (aka, you are similar if you
have shared parents), which I call Daccardian because it creates weighted
directed graphs (which comes in handy later):
[image: image.png]
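My reconstruction of the idea in Cypher (not the actual query from the
screenshot): the overlap gets divided by each topic's own degree, which is
what makes the weights directional.

# "Daccardian": shared parents divided by *this* topic's parent count, so
# weight(a->b) != weight(b->a) and you get a weighted directed graph.
DACCARD = """
MATCH (a:Topic)<-[:HAS_TOPIC]-(parent)-[:HAS_TOPIC]->(b:Topic)
WHERE a <> b
WITH a, b, count(parent) AS shared
MATCH (a)<-[:HAS_TOPIC]-(pa)
WITH a, b, shared, count(pa) AS degree_a
MERGE (a)-[r:RELATED_TO]->(b)
SET r.weight = toFloat(shared) / degree_a
"""

def build_topic_similarity(session):
    session.run(DACCARD)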
So once you've done that, you get this kind of directed graph:
[image: image.png]
From here you could build communities of related topics using any community
detection algorithm, but even just being able to query against them is
extremely useful. In theory you could query just against one topic at a
time, but because of the way your data is probably structured, you want
both that topic, and any closely related topics to be included.
So, for example, looking for Repos whose topics either are "#UI" or are
closely related can be queried like this (not all topics are shown because
of the LIMIT clause):
[image: image.png]
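For reference, my guess at the shape of that query, against the same strawman
schema as the earlier sketches (the 0.3 weight cutoff is invented):

UI_QUERY = """
MATCH (t:Topic {name: '#ui'})
OPTIONAL MATCH (t)-[r:RELATED_TO]->(near:Topic)
WHERE r.weight > 0.3
WITH t, collect(near) AS related
UNWIND related + [t] AS topic
MATCH (repo:Repo)-[:HAS_TOPIC]->(topic)
RETURN DISTINCT repo.name
LIMIT 25
"""

def related_repos(session):
    return [record["repo.name"] for record in session.run(UI_QUERY)]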
*Some notes on AI Models:*
OpenAI's APIs are a bit slow, and often throw errors randomly, which is fun
to handle. And of course, when doing massive amounts of inference, it's
probably cheaper to run your own equipment, which will leave you casting
about on Huggingface like a glow worm in a dark cave for a model that can
do this and is open source. I've tried basically all of them and they've
all started out promising but then been a lot worse than ChatGPT 3.5.
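"Fun to handle" mostly means wrapping every call in something like this - a
generic backoff sketch where call_model() stands in for whichever client
you're using:

import random
import time

def with_retries(call_model, prompt, max_attempts=5):
    # call_model() is a stand-in; we assume any exception it raises
    # (rate limit, timeout, random 5xx) is worth retrying.
    for attempt in range(max_attempts):
        try:
            return call_model(prompt)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt + random.random())   # backoff with jitter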
You do want one that is multilingual, and Bard might be an option when they
open their API access up. There's a significant difference between the
results from the big models and the little ones, in contrast to the paper
that just "leaked" from Google
<https://www.semianalysis.com/p/google-we-have-no-moat-and-neither> about
how small tuned models are going to be just as good as bigger models (which
they are not!).
One minor exception is the new Mosaic model (
https://huggingface.co/spaces/mosaicml/mpt-7b-instruct) which is
multilingual and four times cheaper than OpenAI but it's also about 1/4th
as good. It may be the first "semi-workable" open model though, which is a
promising sign and it may be worth moving to this if you have data you
can't run through an open API for some reason.
*Conclusion:*
If you have a big pile of structured data, you almost certainly have
natural language data that you want to capture as PART of that structure,
but this would have been literally impossible six months ago before LLMs.
Using this AI-tagging technique and doing some basic graph algorithms can
really open up the power of your queries to take the most valuable part of
your information into account, while not losing the performance and
scalability of having it in a database in the first place.
Thanks for reading,
Dave Aitel
PS: I have a dream, and that dream is to convert Ghidra into a Graph
Database as a native format so we can use some of these techniques (and
code embeddings) as a native feature. If you sit next to the Ghidra team,
and you read this whole post, give them a poke for me. :)
So my first thought is that performance measurement tools seem exactly
aimed at a lot of security problems but performance people are extremely
reluctant <https://aus.social/@brendangregg/110276319669838295> to admit
that because of the drama involved in the security market. Which is very
smart of them! :)
Secondly, I wanted to re-link to Halvar's QCon keynote
<https://docs.google.com/presentation/d/1wOT5kOWkQybVTHzB7uLXpU39ctYzXpOs2xV…>.
He has a section on the difficulties of getting good performance
benchmarks, which typically you would do as part of your build chain. So in
theory, you have a lot of compilation features you can twiddle when
compiling and you want to change those values, compile your program, and
get a number for how fast it is. But this turns out to basically be
impossible in the real world for reasons I'll let him explain in his
presentation (see below).
[image: image.png]
A lot of these problems with performance seem only solvable by a continuous
process of evolutionary algorithms - where you have a population of
different compilation variables, and you probably introduce new ones over
time, and you kill off the cloud VMs where you're getting terrible
performance under real-world situations and let the ones getting good or
average performance thrive.
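A toy version of what I mean, where measure_fitness() is the hypothetical
"build it, push it to a VM, run the real workload, give me a number" step that
Halvar's talk explains is so hard to get right:

import random

FLAG_CHOICES = {                      # a tiny, made-up search space
    "-O": ["1", "2", "3", "s"],
    "-march=": ["x86-64", "x86-64-v2", "x86-64-v3"],
    "-flto": ["", "=thin"],
}

def random_flags():
    return {k: random.choice(v) for k, v in FLAG_CHOICES.items()}

def evolve(measure_fitness, population_size=20, generations=50):
    population = [random_flags() for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=measure_fitness, reverse=True)
        survivors = ranked[: population_size // 2]       # kill the slow half
        children = []
        for parent in survivors:
            child = dict(parent)                         # mutate one flag
            key = random.choice(list(FLAG_CHOICES))
            child[key] = random.choice(FLAG_CHOICES[key])
            children.append(child)
        population = survivors + children
    return max(population, key=measure_fitness)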
I'm sure this is being done, and probably if I listened to more of Dino dai
Zovi's talks I'd know where and how, but aside from having performance
implications, it also has security implications, because it will tend
to reward offensive implants for becoming less parasitic and
more symbiotic.
-dave