Logical Operations on Knowledge Graphs:

So if you've spent enough time looking at graph databases, you will invariably run into people who are obsessed with "ontologies". The basic theory is that if you've already organized your data into some sort of directed graph, maybe you can apply a logical ruleset to those relationships and infer knowledge about things you didn't already know, which could be useful? There are a lot of people trying to do this with SBOMs and I...wish them the best.

In real life I think it's basically impossible to create a large graph database + inference engine with data clean enough to be useful. Also, the word ontology is itself very annoying.

But philosophically, while any complex enough data set will have to embrace paradoxes, you can get a lot of value out of imposing some higher-level structure based on the text in your data.

And this is where modern AI comes in - specifically, the tree of "Natural Language Understanding" that broke off from Chomsky-esque-and-wrong "Natural Language Processing" some time ago.
 
One article covering this topic is here; it combines entity extraction and classification to find military topics in an article.

But these techniques can be abstracted and broadened into a general-purpose and very useful algorithm: essentially, you extract keywords from the text fields within your graph data, then relate those keywords to each other, which gives you conceptual groupings and allows you to make further queries that uncover insights about those groups.

Our Solution:

One of the team members over at Margin Research working on SocialCyber with me, Matt Filbert, came up with the idea of using OpenAI's GPT to get hashtags from text, which it does very, very well. If you store these as nodes you get something like the picture below (note that hashtags are BASED on the text, but they are NOT keywords and may not appear in the text itself):
[Image: hashtag topic nodes generated from text, stored in the graph]
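
To make that concrete, here's a rough sketch of what the extraction step can look like - the prompt wording, model choice, and output parsing here are my own illustrative assumptions, not the exact SocialCyber code:

```python
# Sketch: ask a chat model for hashtags summarizing a piece of text.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_hashtags(text: str) -> list[str]:
    """Return a list of hashtags (e.g. '#UI') describing `text`."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Summarize the user's text as 5-10 topical hashtags. "
                        "Reply with hashtags only, separated by spaces."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    content = response.choices[0].message.content or ""
    return [token for token in content.split() if token.startswith("#")]

# Each hashtag becomes a Topic node, with an edge from the thing
# (repo, commit, README...) whose text produced it.
```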

Next you want to figure out how topics are related to each other! You can do this in a thousand different ways - the code words to search on are "Node Similarity" - but almost all of those ways will either not work or produce bad results, because you have a very limited directed graph of just "Things->Topics".

In the end we used a modified Jaccard algorithm (i.e., you are similar if you have shared parents), which I call Daccardian because it creates weighted directed graphs (which comes in handy later):
[Image: the Daccardian similarity calculation]
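
In other words (and this is my reading of "shared parents"): the weight from topic A to topic B is the fraction of A's parents that also point at B, which is not symmetric, and that asymmetry is what gives you a directed edge. A minimal sketch under that assumption:

```python
# Sketch of a "Daccardian" (directed, weighted Jaccard-style) similarity.
# Assumption: parents[t] is the set of things (repos, commits...) that
# point at topic t; the weight is asymmetric, hence a directed edge.
from itertools import combinations

def daccardian(parents: dict[str, set[str]], min_weight: float = 0.1):
    """Yield (a, b, weight) edges where weight = |P(a) & P(b)| / |P(a)|."""
    for a, b in combinations(parents, 2):
        shared = len(parents[a] & parents[b])
        if not shared:
            continue
        w_ab = shared / len(parents[a])  # how much of a's support b covers
        w_ba = shared / len(parents[b])  # and vice versa - not symmetric!
        if w_ab >= min_weight:
            yield (a, b, w_ab)
        if w_ba >= min_weight:
            yield (b, a, w_ba)

# Example: "#frontend" shares one of "#UI"'s two parents
edges = list(daccardian({"#UI": {"repo1", "repo2"}, "#frontend": {"repo2"}}))
# -> [('#UI', '#frontend', 0.5), ('#frontend', '#UI', 1.0)]
```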

So once you've done that, you get this kind of directed graph:

[Image: directed graph of topic-to-topic similarity edges]

From here you could build communities of related topics using any community detection algorithm, but even just being able to query against them is extremely useful. In theory you could query against just one topic at a time, but because of the way your data is probably structured, you want both that topic and any closely related topics to be included.
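
For the community-building step, any off-the-shelf algorithm will do; as a sketch, here's greedy modularity from networkx run over an undirected projection of the topic graph (the algorithm choice here is mine, purely for illustration):

```python
# Sketch: cluster topics using the weighted similarity edges. Greedy
# modularity is just one choice of community algorithm; it wants an
# undirected graph, so we fold the two edge directions together.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def topic_communities(edges):
    """edges: iterable of (topic_a, topic_b, weight) directed edges."""
    g = nx.Graph()
    for a, b, w in edges:
        if g.has_edge(a, b):
            g[a][b]["weight"] += w  # sum the a->b and b->a weights
        else:
            g.add_edge(a, b, weight=w)
    return greedy_modularity_communities(g, weight="weight")

sample = [("#UI", "#frontend", 0.5), ("#frontend", "#UI", 1.0),
          ("#crypto", "#tls", 0.4), ("#tls", "#crypto", 0.6)]
for i, community in enumerate(topic_communities(sample)):
    print(i, sorted(community))
```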

So, for example, finding repos whose topics either are "#UI" or are closely related to it can be queried like this (not all topics are shown because of the LIMIT clause):
[Image: example query and results for "#UI" and closely related topics]
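
I can't show the real schema here, so the node and relationship names below ((:Repo)-[:HAS_TOPIC]->(:Topic), [:SIMILAR_TO]) are hypothetical stand-ins for whatever your graph actually uses; this is a Cypher-flavored sketch of that kind of query, run from Python with the Neo4j driver:

```python
# Sketch: find repos tagged with "#UI" or any closely related topic.
# Schema names (Repo, Topic, HAS_TOPIC, SIMILAR_TO) are hypothetical.
from neo4j import GraphDatabase

QUERY = """
MATCH (t:Topic {name: $topic})
OPTIONAL MATCH (t)-[s:SIMILAR_TO]->(related:Topic) WHERE s.weight >= $min_weight
WITH t, collect(related) AS neighbors
UNWIND neighbors + t AS topic
MATCH (repo:Repo)-[:HAS_TOPIC]->(topic)
RETURN DISTINCT repo.name AS repo
LIMIT 25
"""

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(QUERY, topic="#UI", min_weight=0.3):
        print(record["repo"])
```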

Some notes on AI Models:

OpenAI's APIs are a bit slow and often throw errors randomly, which is fun to handle. And of course, when doing massive amounts of inference, it's probably cheaper to run your own equipment, which will leave you casting about on Huggingface like a glow worm in a dark cave for an open source model that can do this. I've tried basically all of them, and they've all started out promising but ended up a lot worse than GPT-3.5.

You do want one that is multilingual, and Bard might be an option when Google opens up API access. There's a significant difference between the results from the big models and the little ones, in contrast to the memo that just "leaked" from Google claiming that small tuned models are going to be just as good as bigger models (they are not!).

One minor exception is the new MosaicML MPT-7B-Instruct model (https://huggingface.co/spaces/mosaicml/mpt-7b-instruct), which is multilingual and four times cheaper than OpenAI, but also about a quarter as good. It may be the first "semi-workable" open model though, which is a promising sign, and it may be worth moving to it if you have data you can't send through an external API for some reason.

Conclusion:

If you have a big pile of structured data, you almost certainly have natural language data that you want to capture as PART of that structure - something that was literally impossible six months ago, before LLMs. Using this AI-tagging technique along with some basic graph algorithms can really open up the power of your queries, taking the most valuable part of your information into account without losing the performance and scalability of having it in a database in the first place.

Thanks for reading,
Dave Aitel
PS: I have a dream, and that dream is to convert Ghidra into a Graph Database as a native format so we can use some of these techniques (and code embeddings) as a native feature. If you sit next to the Ghidra team, and you read this whole post, give them a poke for me. :)