https://twitter.com/thezdi/status/1638617627626176513
Yawps
So one thing I have as a "lessons learned" from the past 20 years is that
security is not a proactive sport. In fact, we are all experts at running
to where the ball _was_ as opposed to where it is _going_.
Like, if you listen to Risky Biz this week, Patrick asks Metlstorm whether
it's time to go out and replace all the old enterprise file sharing systems
<https://twitter.com/vxunderground/status/1641629743534559233?s=20> you
have around, proactively. And the answer, from Metl, who's hacked into
every org in Oceania for the past 20 years, is "yeah, this is generating
huge return on investment for the ransomware crews so they're just going to
keep doing it, and being proactive might be a great idea." But what he
didn't say, though he clearly had it in his head, was "but lol, nobody is
going to actually do that. So good luck out there, chooms!"
At some level, STIX and TAXII and the whole CTI market are about passing
around information on what someone _might_ have used to hack something, at
some point in the _distant past_. It's a paleontology of hackers past - XML
schemas about huge ancient reptiles swimming in the tropical seas of
your networks, the taxonomies of extinct orders we now know only through a
delicate finger-like flipper bone or a clever piece of shellcode.
-dave
So last week at OffensiveCon I watched a talk on Fuzzilli (
https://github.com/googleprojectzero/fuzzilli) which, I have to admit, I
knew very little about going in. Obviously I knew it was a Googley
Javascript fuzzer, finding bugs. But I did not realize that it was applying
mutations to its own intermediate language, which it then compiled to
Javascript. I just assumed it was, like most fuzzers, mutating the
Javascript directly (e.g.
https://sean.heelan.io/2016/04/26/fuzzing-language-interpreters-using-regre…
).
But having an IL designed for fuzzing-related mutations is clearly a great
idea! And this year, they've expanded on that to build a
Javascript->Fuzzilli compiler/translation layer. So you can pass in sample
Javascript, it will translate it into the IL, and then mutate that IL.
The reason this is necessary is that Javascript is, like almost all modern
languages, extremely complicated underneath the covers, so in order to
generate crashes you may need to have a lot of different fields set
properly in a particular order in a structure. They try to do some
introspection on objects and generate their samples from that as well, but
there's no beating "real user code" for learning how an object needs to be
created and used.
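To make the IL idea concrete, here's a toy sketch in Python (this is not
Fuzzilli's actual IL or API, just an illustration of the principle): the
program is a list of structured instructions, mutations happen on that
structure, and the result is lifted back to Javascript, so every mutant is
syntactically valid by construction.

import random

# Toy IL: each instruction is a list like ["load_int", dest, value],
# ["binop", dest, operator, lhs, rhs] or ["call", fn, arg, ...].
def mutate(program):
    """Perturb one instruction while keeping the IL well-formed."""
    prog = [list(instr) for instr in program]
    idx = random.randrange(len(prog))
    op, *args = prog[idx]
    if op == "load_int":        # tweak the constant
        prog[idx] = ["load_int", args[0],
                     random.choice([0, 1, -1, 2**31 - 1, args[1] * 2])]
    elif op == "binop":         # swap the operator
        prog[idx] = ["binop", args[0],
                     random.choice(["+", "-", "*", "%", "**"]),
                     args[2], args[3]]
    return prog

def lift_to_js(program):
    """Compile the toy IL back down to Javascript source."""
    lines = []
    for op, *args in program:
        if op == "load_int":
            lines.append(f"let {args[0]} = {args[1]};")
        elif op == "binop":
            lines.append(f"let {args[0]} = {args[2]} {args[1]} {args[3]};")
        elif op == "call":
            lines.append(f"{args[0]}({', '.join(args[1:])});")
    return "\n".join(lines)

seed = [["load_int", "v0", 42],
        ["load_int", "v1", 7],
        ["binop", "v2", "+", "v0", "v1"],
        ["call", "console.log", "v2"]]
print(lift_to_js(mutate(seed)))

Fuzzilli's real IL is far richer than this (it tracks variables and types so
it can do much smarter rewrites), but the shape of the trick is the same.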
These advances generate a lot more bugs! In theory none of these bugs
matter in the future because of the mitigations (no pointers outside the
Javascript gigacage!) being put into place by the very authors of the fuzzer?
(I have my doubts, but we all will live and learn?)
It would be...very cool, I think, if Bard or another LLM was the one doing
the Javascript sample generation as well. If you think about it, these LLMs
all have a good understanding of Javascript and you can give them various
weird tasks to do, and let them generate your samples, and then when a
crash happens you can have them mutate around that crash, or if you have a
sample not getting any more code coverage you can have them mutate that
sample to attempt to make it weirder. :)
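For what it's worth, a minimal sketch of that loop might look like the
following (Python, using the chat completion API; run_target and
coverage_increased are hypothetical hooks into whatever harness you already
have, not real APIs):

import openai

def ask_llm(prompt):
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    return resp["choices"][0]["message"]["content"]

def llm_fuzz_loop(run_target, coverage_increased, iterations=100):
    """run_target and coverage_increased are placeholders for your own
    harness: execute the JS engine on a sample, and report new coverage."""
    sample = ask_llm("Write a short, strange Javascript snippet that "
                     "stresses TypedArrays and property getters.")
    for _ in range(iterations):
        result = run_target(sample)
        if result.crashed:
            # mutate around the crash
            sample = ask_llm("This Javascript crashed an engine:\n" + sample
                             + "\nWrite a close variation of it.")
        elif not coverage_increased(result):
            # coverage plateaued: ask for something weirder
            sample = ask_llm("Make this Javascript weirder, but keep it "
                             "valid:\n" + sample)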
-dave
*Logical Operations on Knowledge Graphs:*
So if you've spent enough time looking at graph databases, you will
invariably run into people who are obsessed with "ontologies". The basic
theory being that if you've already organized your data in some sort of
directed graph, maybe you can then apply a logical ruleset to those
relationships and infer knowledge about things you didn't already know,
which could be useful? There are a lot of people trying to do this with
SBOMs and I...wish them the best.
In real life I think it's basically impossible to create a large graph
database+inference engine with data clean enough to produce anything
useful. Also the word *ontology* is itself very annoying.
But philosophically, while any sufficiently complex data set will have to
embrace paradoxes, you can get a lot of value out of layering some higher
level structure on top of the text in your data.
And this is where modern AI comes in - in specific, the tree of "Natural
Language Understanding" that broke off from Chomsky-esque-and-wrong
"Natural Language Processing" some time ago.
One article covering this topic is here
<https://medium.com/@anthony.mensier/gpt-4-for-defense-specific-named-entity…>,
which combines entity extraction and classification to find military topics
in an article.
But these techniques can be abstracted and broadened as a general purpose
and very useful algorithm: Essentially you want to extract keywords from
text fields within your graph data, then relate those keywords to each
other, which gives you conceptual groupings and allows you to make further
queries that uncover insights about those groups.
*Our Solution:*
One of the team-members over at Margin Research working on SocialCyber
<https://www.darpa.mil/program/hybrid-ai-to-protect-integrity-of-open-source…>
with me, Matt Filbert, came up with the idea of using OpenAI's GPT to get
hashtags from text, which it does very very well. If you store these as
nodes you get something like the picture below (note that hashtags are
BASED on the text, but they are NOT keywords and may not be in the text
itself):
[image: image.png]
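A rough sketch of that tagging step (my guess at the shape of it, not the
actual SocialCyber code), using the OpenAI Python client and networkx
standing in for whatever graph database you actually run:

import openai
import networkx as nx

def hashtags_for(text):
    """Ask the model to summarize a blob of text as a few hashtags.
    The tags are BASED on the text but may not appear in it verbatim."""
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "Summarize the following as 3-5 hashtags, "
                              "one per line:\n" + text}],
    )
    content = resp["choices"][0]["message"]["content"]
    return [t.strip() for t in content.splitlines() if t.strip().startswith("#")]

# Things (repos, commits, whatever) get an edge to each Topic node.
G = nx.DiGraph()
things = {
    "repo/ui-widgets": "A component library for building user interfaces...",
    "repo/net-tools": "Low-level networking utilities and packet parsers...",
}
for thing, description in things.items():
    G.add_node(thing, kind="thing")
    for tag in hashtags_for(description):
        G.add_node(tag, kind="topic")
        G.add_edge(thing, tag)    # Thing -> Topic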
Next you want to figure out how topics are related to each other! Which you
can do in a thousand different ways - the code words to search on are "Node
Similarity" - but almost all those ways will either not work or create bad
results because you have a very limited directed graph of just
"Things->Topics".
In the end we used a modified Jaccardian algo (aka, you are similar if you
have shared parents), which I call Daccardian because it creates weighted
directed graphs (which comes in handy later):
[image: image.png]
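In code form, my reading of that (a sketch, not Margin's actual
implementation) is: score topic A against topic B by counting the Things
they share and normalizing by A's own parent count, which makes the score
asymmetric and therefore a weighted, directed edge:

# Continues from the Thing -> Topic graph G built above.
# daccard(a -> b) = |parents(a) & parents(b)| / |parents(a)|
def daccard(G, a, b):
    parents_a = set(G.predecessors(a))
    parents_b = set(G.predecessors(b))
    if not parents_a:
        return 0.0
    return len(parents_a & parents_b) / len(parents_a)

topics = [n for n, d in G.nodes(data=True) if d.get("kind") == "topic"]
T = nx.DiGraph()    # the topic-similarity graph
for a in topics:
    for b in topics:
        if a != b:
            w = daccard(G, a, b)
            if w > 0:
                T.add_edge(a, b, weight=w)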
So once you've done that, you get this kind of directed graph:
[image: image.png]
From here you could build communities of related topics using any community
detection algorithm, but even just being able to query against them is
extremely useful. In theory you could query just against one topic at a
time, but because of the way your data is probably structured, you want
both that topic, and any closely related topics to be included.
So for example, Repos that have topics which are either "#UI" or closely
related to it can be queried like this (not all topics are shown because
of the LIMIT clause):
[image: image.png]
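In Python/networkx terms (standing in for the Cypher you'd actually run
against the database, and reusing the G and T graphs from the sketches
above), that "this topic plus its close neighbors" query looks something
like:

# Repos tagged with "#UI" or with any topic closely related to it.
SEED = "#UI"
THRESHOLD = 0.5    # arbitrary cutoff for "closely related"

related = {SEED} | {b for _, b, w in T.out_edges(SEED, data="weight")
                    if w >= THRESHOLD}
repos = {thing
         for topic in related if topic in G
         for thing in G.predecessors(topic)}    # Things pointing at these Topics
print(sorted(repos)[:20])    # rough analogue of the LIMIT clause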
*Some notes on AI Models:*
OpenAI's APIs are a bit slow, and often throw errors randomly, which is fun
to handle. And of course, when doing massive amounts of inference, it's
probably cheaper to run your own equipment, which will leave you casting
about on Huggingface like a glow worm in a dark cave for a model that can
do this and is open source. I've tried basically all of them and they've
all started out promising but then been a lot worse than ChatGPT 3.5.
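If you're handling those random errors yourself, a plain retry with
exponential backoff around each call is usually enough (a sketch against
the pre-1.0 openai client):

import time
import openai

def chat_with_retry(messages, model="gpt-3.5-turbo", attempts=5):
    """Call the chat API, backing off exponentially on transient failures."""
    for i in range(attempts):
        try:
            return openai.ChatCompletion.create(model=model, messages=messages)
        except openai.error.OpenAIError:    # rate limits, 5xx, timeouts...
            if i == attempts - 1:
                raise
            time.sleep(2 ** i)              # 1s, 2s, 4s, ...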
You do want one that is multilingual, and Bard might be an option when they
open their API access up. There's a significant difference between the
results from the big models and the little ones, in contrast to the paper
that just "leaked" from Google
<https://www.semianalysis.com/p/google-we-have-no-moat-and-neither> about
how small tuned models are going to be just as good as bigger models (which
they are not!).
One minor exception is the new Mosaic model (
https://huggingface.co/spaces/mosaicml/mpt-7b-instruct) which is
multilingual and four times cheaper than OpenAI but it's also about 1/4th
as good. It may be the first "semi-workable" open model though, which is a
promising sign and it may be worth moving to this if you have data you
can't run through an open API for some reason.
*Conclusion:*
If you have a big pile of structured data, you almost certainly have
natural language data that you want to capture as PART of that structure,
but this would have been literally impossible six months ago before LLMs.
Using this AI-tagging technique and doing some basic graph algorithms can
really open up the power of your queries to take the most valuable part of
your information into account, while not losing the performance and
scalability of having it in a database in the first place.
Thanks for reading,
Dave Aitel
PS: I have a dream, and that dream is to convert Ghidra into a Graph
Database as a native format so we can use some of these techniques (and
code embeddings) as a native feature. If you sit next to the Ghidra team,
and you read this whole post, give them a poke for me. :)