Dailydave May 2023

dailydave@lists.aitelfoundation.org

3 participants
3 discussions

Yawps from the rooftops
by Dave Aitel 06 Jun '23


[image: image.png]
https://twitter.com/thezdi/status/1638617627626176513

[image: image.png]

Yawps

So one thing I have as a "lessons learned" from the past 20 years is that security is not a proactive sport. In fact, we are all experts at running to where the ball _was_ as opposed to where it is _going_.

Like, if you listen to Risky Biz this week, Patrick asks Metlstorm whether it's time to go out and proactively replace all the old enterprise file sharing systems <https://twitter.com/vxunderground/status/1641629743534559233?s=20> you have around. And the answer, from Metl, who's hacked into every org in Oceania for the past 20 years, is "yeah, this is generating huge return on investment for the ransomware crews so they're just going to keep doing it, and being proactive might be a great idea." But what he didn't say, though it was clearly in his head, was "but lol, nobody is going to actually do that. So good luck out there chooms!"

At some level, STIX and TAXII and the whole CTI market are about passing around information on what someone _might_ have used to hack something, at some point in the _distant past_. It's a paleontology of hackers past - XML schemas about huge ancient reptiles swimming in the tropical seas of your networks, the taxonomies of extinct orders we now know only through a delicate finger-like flipper bone or a clever piece of shellcode.

-dave


Fussing about with fuzzers
by Dave Aitel 23 May '23


So last week at OffensiveCon I watched a talk on Fuzzilli ( https://github.com/googleprojectzero/fuzzilli), and I have to admit I had no idea what it really was. Obviously I knew it was a Googlely JavaScript fuzzer, finding bugs. But I did not realize that it was applying mutations to its own intermediate language, which it then compiled to JavaScript. I just assumed it was, like most fuzzers, mutating the JavaScript directly (e.g. https://sean.heelan.io/2016/04/26/fuzzing-language-interpreters-using-regre… ). But having an IL designed for fuzzing-related mutations is clearly a great idea!

And this year, they've expanded on that to build a JavaScript->Fuzzilli compiler/translation layer. So you can pass in sample JavaScript and then it will create the IL and then it will mutate the IL. The reason this is necessary is that JavaScript is, like almost all modern languages, extremely complicated underneath the covers, so in order to generate crashes you may need to have a lot of different fields set properly in a particular order in a structure. They try to do some introspection on objects and generate their samples from that as well, but there's no beating "real user code" for learning how an object needs to be created and used.

These advances generate a lot more bugs! In theory none of these bugs will matter in the future, because of the mitigations (no pointers outside the JavaScript gigacage!) being put in place by the very authors of the fuzzer? (I have my doubts, but we will all live and learn?)

It would be...very cool, I think, if Bard or another LLM was the one doing the JavaScript sample generation as well. If you think about it, these LLMs all have a good understanding of JavaScript and you can give them various weird tasks to do, and let them generate your samples, and then when a crash happens you can have them mutate around that crash, or if you have a sample that is not getting any more code coverage you can have them mutate that sample to attempt to make it weirder. :)

-dave
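To make the "mutate the IL, then compile it to JavaScript" idea a bit more concrete, here is a toy sketch in Python. It is not Fuzzilli's real IL or its mutators (Fuzzilli itself is written in Swift); the `Instr` type, the four ops, `INTERESTING_INTS`, and the `SEED` program are all made up for illustration. The point is only that every mutation happens on a structured program, so whatever comes out the other end is still syntactically valid JavaScript.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Instr:
    """One toy IL instruction; its output is named v<index> when lifted."""
    op: str              # one of "int", "array", "binop", "call"
    inputs: tuple = ()   # indices of EARLIER instructions used as operands
    imm: object = None   # immediate: int value, operator, or method name


# Hypothetical list of boundary-ish integers, chosen for illustration.
INTERESTING_INTS = [0, -1, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF, 2**53]


def mutate(program, rng):
    """Return a copy of `program` with one instruction changed.

    Constants get swapped for an "interesting" value; other instructions
    get one operand rewired to some earlier value. Operands always point
    backwards, so the result is still a well-formed program.
    """
    prog = list(program)
    i = rng.randrange(len(prog))
    ins = prog[i]
    if ins.op == "int":
        prog[i] = replace(ins, imm=rng.choice(INTERESTING_INTS))
    elif ins.inputs and i > 0:
        slot = rng.randrange(len(ins.inputs))
        new_inputs = list(ins.inputs)
        new_inputs[slot] = rng.randrange(i)  # any earlier instruction
        prog[i] = replace(ins, inputs=tuple(new_inputs))
    return prog


def lift(program):
    """Translate the toy IL into JavaScript source text."""
    lines = []
    for i, ins in enumerate(program):
        args = [f"v{j}" for j in ins.inputs]
        if ins.op == "int":
            expr = str(ins.imm)
        elif ins.op == "array":
            expr = "[" + ", ".join(args) + "]"
        elif ins.op == "binop":
            expr = f"{args[0]} {ins.imm} {args[1]}"
        elif ins.op == "call":
            # first input is the receiver, the rest are call arguments
            expr = f"{args[0]}.{ins.imm}(" + ", ".join(args[1:]) + ")"
        else:
            raise ValueError(f"unknown op: {ins.op}")
        lines.append(f"let v{i} = {expr};")
    return "\n".join(lines)


# Seed program: roughly `[1, 2].push(1 + 2)`.
SEED = [
    Instr("int", imm=1),
    Instr("int", imm=2),
    Instr("array", inputs=(0, 1)),
    Instr("binop", inputs=(0, 1), imm="+"),
    Instr("call", inputs=(2, 3), imm="push"),
]

if __name__ == "__main__":
    print(lift(mutate(SEED, random.Random())))
```

A mutant can still throw at runtime (for example if the receiver of `push` gets rewired to an integer), which is a different problem from producing source that doesn't parse at all.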


Knowledge Graph + AI = ?
by Dave Aitel 09 May '23


*Logical Operations on Knowledge Graphs:*

So if you've spent enough time looking at graph databases, you will invariably run into people who are obsessed with "ontologies". The basic theory is that if you've already organized your data in some sort of directed graph, maybe you can then apply a logical ruleset to those relationships and infer knowledge about things you didn't already know, which could be useful? There are a lot of people trying to do this with SBOMs and I...wish them the best.

In real life I think it's basically impossible to create a large, useful graph database + inference engine that has data clean enough for anything useful. Also the word *ontology* is itself very annoying.

But philosophically, while any complex enough data set will have to embrace paradoxes, you can get a lot of value out of building some higher level structure based on the text in your data. And this is where modern AI comes in - specifically, the branch of "Natural Language Understanding" that broke off from Chomsky-esque-and-wrong "Natural Language Processing" some time ago.

One article covering this topic is here <https://medium.com/@anthony.mensier/gpt-4-for-defense-specific-named-entity…>, which combines entity extraction and classification in order to find military topics in an article. But these techniques can be abstracted and broadened into a general purpose and very useful algorithm: essentially you want to extract keywords from text fields within your graph data, then relate those keywords to each other, which gives you conceptual groupings and allows you to make further queries that uncover insights about those groups.

*Our Solution:*

One of the team members over at Margin Research working on SocialCyber <https://www.darpa.mil/program/hybrid-ai-to-protect-integrity-of-open-source…> with me, Matt Filbert, came up with the idea of using OpenAI's GPT to get hashtags from text, which it does very very well.
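A minimal sketch of that hashtag step, in Python. The prompt wording, the function names, and the lowercase normalization are all assumptions made for illustration, not what the SocialCyber code actually does. The model call is passed in as a plain callable (`ask_model`, hypothetical) so the sketch does not pretend to know any particular vendor's client API; you would wrap whatever client you use in a function that takes a prompt string and returns the reply text.

```python
from __future__ import annotations

import re
from typing import Callable

# Hypothetical prompt; tune to taste.
PROMPT = (
    "Give me up to {n} hashtags that describe the following text. "
    "Reply with hashtags only.\n\n{text}"
)

# \w is Unicode-aware in Python 3, so non-English hashtags also match.
HASHTAG_RE = re.compile(r"#\w+")


def parse_hashtags(reply: str) -> list[str]:
    """Pull '#tags' out of a free-form reply, lowercased, de-duplicated, in order."""
    seen: dict[str, None] = {}
    for tag in HASHTAG_RE.findall(reply):
        seen.setdefault(tag.lower(), None)
    return list(seen)


def tag_things(
    things: dict[str, str],
    ask_model: Callable[[str], str],
    n: int = 5,
) -> list[tuple[str, str]]:
    """Return (thing, topic) edges, i.e. the raw material for a Thing->Topic graph."""
    edges = []
    for name, text in things.items():
        reply = ask_model(PROMPT.format(n=n, text=text))
        edges.extend((name, tag) for tag in parse_hashtags(reply)[:n])
    return edges
```

Lowercasing is there so that `#UI` and `#ui` end up as one topic node rather than two; drop it if case matters in your data.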
If you store these as nodes you get something like the picture below (note that hashtags are BASED on the text, but they are NOT keywords and may not be in the text itself):

[image: image.png]

Next you want to figure out how topics are related to each other! Which you can do in a thousand different ways - the code words to search on are "Node Similarity" - but almost all those ways will either not work or create bad results because you have a very limited directed graph of just "Things->Topics". In the end we used a modified Jaccard algo (aka, you are similar if you have shared parents), which I call Daccardian because it creates weighted directed graphs (which comes in handy later):

[image: image.png]

So once you've done that, you get this kind of directed graph:

[image: image.png]

From here you could build communities of related topics using any community detection algorithm, but even just being able to query against them is extremely useful. In theory you could query against just one topic at a time, but because of the way your data is probably structured, you want both that topic and any closely related topics to be included. So for example, looking for Repos that have topics that are either "#UI" or closely related can be queried like this (not all topics are shown because of the LIMIT clause):

[image: image.png]

*Some notes on AI Models:*

OpenAI's APIs are a bit slow, and often throw errors randomly, which is fun to handle. And of course, when doing massive amounts of inference, it's probably cheaper to run your own equipment, which will leave you casting about on Hugging Face like a glow worm in a dark cave for a model that can do this and is open source. I've tried basically all of them and they've all started out promising but then been a lot worse than ChatGPT 3.5. You do want one that is multilingual, and Bard might be an option when they open their API access up.
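On the "throws errors randomly" front, one common way to cope is a retry loop with exponential backoff and a little jitter. The helper below is a generic sketch, not tied to any particular client library; the name `call_with_retries` and all of its parameters are made up for illustration, and in practice you would narrow `retry_on` to whatever transient error types your client actually raises.

```python
import random
import time


def call_with_retries(fn, *args, attempts=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying on `retry_on` with backoff.

    The wait doubles after each failure (capped at max_delay) and is
    scaled by a random factor in [0.5, 1.0] so that many workers do not
    all retry at the same instant. The last failure is re-raised.
    """
    for attempt in range(attempts):
        try:
            return fn(*args, **kwargs)
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
```

The `sleep` parameter exists only so the waiting can be swapped out in tests.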
There's a significant difference between the results from the big models and the little ones, in contrast to the paper that just "leaked" from Google <https://www.semianalysis.com/p/google-we-have-no-moat-and-neither> about how small tuned models are going to be just as good as bigger models (which they are not!). One minor exception is the new Mosaic model ( https://huggingface.co/spaces/mosaicml/mpt-7b-instruct) which is multilingual and four times cheaper than OpenAI, but it's also about 1/4th as good. It may be the first "semi-workable" open model though, which is a promising sign, and it may be worth moving to this if you have data you can't run through an open API for some reason.

*Conclusion:*

If you have a big pile of structured data, you almost certainly have natural language data that you want to capture as PART of that structure, but this would have been literally impossible six months ago, before LLMs. Using this AI-tagging technique and doing some basic graph algorithms can really open up the power of your queries to take the most valuable part of your information into account, while not losing the performance and scalability of having it in a database in the first place.

Thanks for reading,
Dave Aitel

PS: I have a dream, and that dream is to convert Ghidra into a Graph Database as a native format so we can use some of these techniques (and code embeddings) as a native feature. If you sit next to the Ghidra team, and you read this whole post, give them a poke for me. :)
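For anyone who wants to play with the "similar if you have shared parents" step and the "#UI or closely related" query from the post above, here is a rough in-memory Python sketch. The actual algorithm and query were only shown as pictures, so the formula here is a guess at one directed variant and not necessarily the one the post used: the weight of the edge a->b is the fraction of a's parents that are also parents of b (plain Jaccard would divide by the union instead, which makes it symmetric). The function names, the thresholds, and the `EDGES` sample data are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical (thing, topic) edges, e.g. produced by an LLM tagging pass.
EDGES = [
    ("repo_a", "#UI"), ("repo_a", "#React"),
    ("repo_b", "#UI"), ("repo_b", "#React"), ("repo_b", "#CSS"),
    ("repo_c", "#React"),
    ("repo_d", "#Fuzzing"),
    ("repo_e", "#CSS"),
]


def topic_parents(edges):
    """Map each topic to the set of things ("parents") pointing at it."""
    parents = defaultdict(set)
    for thing, topic in edges:
        parents[topic].add(thing)
    return parents


def directed_similarity(edges):
    """Return weights[a][b] = |parents(a) & parents(b)| / |parents(a)|.

    Asymmetric on purpose: a narrow topic can point strongly at a broad
    one without the reverse being true. Pairs with no shared parent get
    no edge. This is O(topics^2), fine for a sketch only.
    """
    parents = topic_parents(edges)
    weights = defaultdict(dict)
    for a, pa in parents.items():
        for b, pb in parents.items():
            if a == b:
                continue
            shared = len(pa & pb)
            if shared:
                weights[a][b] = shared / len(pa)
    return weights


def things_for_topic(edges, weights, topic, threshold=0.5, limit=None):
    """Things tagged with `topic` or any topic it points at with weight >= threshold."""
    related = {topic} | {
        t for t, w in weights.get(topic, {}).items() if w >= threshold
    }
    hits = sorted({thing for thing, t in edges if t in related})
    return hits[:limit]


if __name__ == "__main__":
    w = directed_similarity(EDGES)
    print(dict(w["#UI"]))
    print(things_for_topic(EDGES, w, "#UI", threshold=0.6))
```

In a real graph database you would keep the weighted topic->topic edges as relationships and do the expansion in the query language; the Python version is just to show the shape of the computation.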

