I've been working with LLMs for a bit, and also looking at the DARPA AI Cyber Challenge. To that end I put together CORVIDCALL, which uses various LLMs to find and patch essentially 100% of the bug examples I can throw at it from the various GitHub repos that collect these things (see below).
One thing they pointed out in that article (which I highly recommend reading) is that Hugging Face is basically doing a disservice with their leaderboard - but the truth is more complicated. It's nice to know which models do better than others, but the comparison between them is not a simple number, any more than the comparison between people is a simple number. There's no useful IQ score for models or for people.
For example, one of the hardest things to measure is how well a model handles interleaved and recursive problems. If you have an SQL query inside your Python code being sent to a server, does the model notice errors in that query, or do they fly under the radar as "just a string"?
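Here's a toy example of the kind of thing I mean (the function and schema are invented for illustration):

```python
import sqlite3

def get_active_user(conn: sqlite3.Connection, name: str):
    # Two bugs hide inside the string: the f-string makes this
    # injectable, and the WHERE clause selects *deleted* users.
    query = f"SELECT id, name FROM users WHERE name = '{name}' AND deleted = 1"
    return conn.execute(query).fetchone()
```

A model that treats the query as an opaque string will happily miss both problems.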
Can the LLM handle optimization problems, indicating it understands the performance implications of a system?
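A cheap manual probe for this is whether the model flags an accidentally quadratic loop and proposes the linear fix. A made-up example:

```python
# The pattern the model should flag:
def has_duplicates_slow(items: list[str]) -> bool:
    return any(items.count(x) > 1 for x in items)  # count() inside any(): O(n^2)

# The fix it should suggest:
def has_duplicates_fast(items: list[str]) -> bool:
    return len(set(items)) != len(items)  # one pass: O(n)
```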
Can the LLM handle LARGER problems? People are obsessed with context window sizes, but what you find is a huge degradation in instruction-following accuracy once you hit even 1/8th of the context window size on any of the leading models. This means you have to know how to compress your tasks to fit basically into a teacup. And for smaller models, this degradation is even more severe.
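In practice that means packing the work into prompt-sized batches yourself. A minimal sketch, assuming the 1/8 rule of thumb above and a crude whitespace-based token estimate:

```python
CONTEXT_WINDOW = 8192          # tokens; model-dependent
BUDGET = CONTEXT_WINDOW // 8   # stay well under the accuracy cliff

def rough_tokens(text: str) -> int:
    return int(len(text.split()) * 1.3)  # crude approximation

def chunk_for_model(units: list[str]) -> list[list[str]]:
    """Pack logical units (functions, files) into prompt-sized batches."""
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for unit in units:
        cost = rough_tokens(unit)
        if current and used + cost > BUDGET:
            batches.append(current)
            current, used = [], 0
        current.append(unit)  # an oversized unit still gets its own batch
        used += cost
    if current:
        batches.append(current)
    return batches
```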
People in the graph database world are obsessed with getting "knowledge graphs" out of unstructured data + a graph database. I think "knowledge graphs" are pretty useless, but what is not useless is connecting unstructured data by topic in your graph database, and using that to make larger community detection-based decisions. And the easiest way to do this is to pass your data into an LLM and ask it to generate the topics for you, typically in the form of Twitter hashtags. Code is unstructured data.
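A sketch of that topic-linking idea, using networkx in place of a real graph database (ask_llm_for_hashtags is a stand-in for whatever model call you use):

```python
import itertools
import networkx as nx

def ask_llm_for_hashtags(text: str) -> set[str]:
    raise NotImplementedError  # your model call goes here

def build_topic_graph(documents: dict[str, str]) -> nx.Graph:
    graph = nx.Graph()
    tags = {name: ask_llm_for_hashtags(body) for name, body in documents.items()}
    graph.add_nodes_from(tags)
    for a, b in itertools.combinations(tags, 2):
        shared = tags[a] & tags[b]
        if shared:
            graph.add_edge(a, b, topics=shared)  # documents sharing topics get an edge
    return graph
```

Once you have that graph, networkx's community functions (louvain_communities, for example) give you the clusters to make decisions over.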
If you want to measure your LLM you can do some fun things. Asking a good LLM for five Twitter hashtags in comma-separated format will work MOST of the time. But the smaller and worse the LLM, the more likely it is to go off the rails and fail when faced with larger data, or more complicated data, or data in a different language that it first has to translate. To be fair, most of them will fail to produce the right number of hashtags. You can try this yourself on various models that otherwise sit at the top of a leaderboard, within "striking distance" on the benchmarks of Bard, Claude, or GPT-4. (#theyarenowhereclose, #lol)
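You can score this mechanically, too. A minimal checker (the prompt and the pile of responses are up to you):

```python
import re

def valid_hashtag_line(line: str, expected: int = 5) -> bool:
    """True only if the response is exactly N comma-separated hashtags."""
    parts = [p.strip() for p in line.strip().split(",")]
    return (len(parts) == expected
            and all(re.fullmatch(r"#\w+", p) for p in parts))

# Run it over many responses and the pass rate is your benchmark:
# pass_rate = sum(valid_hashtag_line(r) for r in responses) / len(responses)
```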
Obviously the more neurons you spend making sure you don't say naughty things, the worse you are at doing anything useful, and you can see that in the difference between StableBeluga and LLAMA2-chat, for example, with these simple manual evaluations.
And this matters a lot when you need your LLM to output structured data based on your input.
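For CORVIDCALL-style automation that means strict parsing of every response, so a malformed answer gets rejected instead of poisoning the pipeline. A sketch, with a hypothetical finding schema:

```python
import json

REQUIRED_KEYS = {"file", "line", "bug_class", "severity"}  # hypothetical schema

def parse_finding(raw: str) -> dict | None:
    """Reject anything that isn't exactly the JSON object we asked for."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or set(obj) != REQUIRED_KEYS:
        return None
    return obj
```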
So we can divide up the problem of automating finding and patching bugs in source code in a lot of ways, but one way is to notice the process real auditors take, and just replicate it by passing data flow diagrams and various other summaries into the models. Right now hundreds of academics are "inventing" new ways to use LLMs, for example "Reason and Act" (ReAct). I've never seen so much hilarity as people put obvious computing patterns into papers and try to invent some terminology to hang their career on.
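To illustrate the joke, the entire "Reason and Act" pattern is a while loop around a model call and a tool table. A sketch (Step, call_model, and TOOLS are all stand-ins, not anyone's real API):

```python
from dataclasses import dataclass

@dataclass
class Step:
    thought: str
    action: str    # a tool name, or "finish"
    argument: str

def call_model(transcript: str) -> Step:
    raise NotImplementedError  # your LLM call, parsed into a Step

TOOLS = {
    # hypothetical tools; a real auditor loop would have grep,
    # run_test, a debugger, and so on
    "read_file": lambda path: open(path).read(),
}

def react_loop(task: str, max_steps: int = 10) -> str:
    """Think, act, observe, repeat. That's the whole paper."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = call_model(transcript)
        if step.action == "finish":
            return step.argument
        observation = TOOLS[step.action](step.argument)
        transcript += (f"Thought: {step.thought}\nAction: {step.action}"
                       f"({step.argument})\nObservation: {observation}\n")
    return "gave up"
```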
And of course when it comes to a real codebase, say libjpeg, or a real web app, following the data through the system is important. Understanding code flaws is important. But building test triggers and debugging to check your assumptions is important too. And coalescing all this information in, for example, the big graph database that is your head, is how you make it all pay off.
But what you want with bug finding is not to mechanistically re-invent source-sink static analysis with LLMs. You want intuition. You want flashes of insight.
It's a hard and fun problem at the bigger end of the scale. We may have to give our bug finding systems the machine equivalent of serotonin. :)
-dave