I've been using Joern with Ghidra for a couple years. For my needs, the
ghidra2cpg project is too limiting. It works at the disassembly level by
mapping various CPU instructions for a few common archs via a
pre-patched bundled Ghidra build (i.e., no custom loaders, processor
definitions, etc.). As a workaround, I've hacked together an alternative
(worse) approach that operates at the decompiler level: a Ghidra script
that just enumerates every function and adds the decompiler pseudo-C to a
Joern project, using Joern's fuzzy C parser. It's a lossy hack, but it
mostly just works. But both options undermine a lot of the value Ghidra
provides.
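
For anyone who wants to try the decompiler-level route, here is a minimal
sketch of the export loop. It is not my actual script: `safe_name`,
`export_pseudo_c`, and the (name, address) pair shape are hypothetical, and
the `decompile` callable is a stand-in. Inside a real Ghidra script it would
wrap the decompiler API (DecompInterface, decompileFunction, then
getDecompiledFunction().getC()); here it is passed in so the loop runs on
its own.

```python
import os
import re


def safe_name(name, address):
    """Build a filesystem-safe file name from a function name and address."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)
    return "%s_%s.c" % (cleaned, address)


def export_pseudo_c(functions, decompile, out_dir):
    """Write one .c file per function and return the paths written.

    functions: iterable of (name, address) pairs (hypothetical shape).
    decompile: callable(name, address) returning pseudo-C text, or None
               when decompilation fails (stand-in for the Ghidra call).
    """
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    written = []
    for name, address in functions:
        text = decompile(name, address)
        if not text:
            continue  # decompilation failed or timed out; skip this function
        path = os.path.join(out_dir, safe_name(name, address))
        with open(path, "w") as handle:
            handle.write(text)
        written.append(path)
    return written
```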

A better yet still reachable solution that I've been toying with is to
write a proper Joern CPG frontend parser for PCODE. This could provide
graph DB representations of the AST, CFG, CDG, DDG, and PDG, plus the CPG
glue tying them (and the parent PCODE) all together.
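
To make the idea concrete, here is a toy sketch of deriving CFG and DDG
edges from a flat list of PCODE-like ops. Everything in it is an
assumption for illustration: the `Op` tuple, the `PCODE_OP` label, and the
varnode names are made up, and the data-dependence pass only links each
input to its most recent writer in listing order, which is not a real
reaching-definitions analysis (it misses definitions arriving along the
branch path). A real frontend would emit nodes and edges following Joern's
CPG schema instead.

```python
from collections import namedtuple

# Simplified stand-in for a PCODE op; `target` is an index into the op list.
Op = namedtuple("Op", "mnemonic output inputs target")


def build_graph(ops):
    """Return (nodes, edges) for a flat list of Op tuples."""
    nodes = [
        {"id": i, "label": "PCODE_OP", "code": op.mnemonic}
        for i, op in enumerate(ops)
    ]
    edges = []
    last_def = {}  # varnode name -> index of the op that last wrote it
    for i, op in enumerate(ops):
        # Data dependence: link each input to its most recent writer.
        for varnode in op.inputs:
            if varnode in last_def:
                edges.append((last_def[varnode], i, "DDG"))
        if op.output is not None:
            last_def[op.output] = i
        # Control flow: explicit branch target, plus fallthrough if any.
        if op.mnemonic in ("BRANCH", "CBRANCH"):
            edges.append((i, op.target, "CFG"))
        if op.mnemonic not in ("BRANCH", "RETURN") and i + 1 < len(ops):
            edges.append((i, i + 1, "CFG"))
    return nodes, edges


ops = [
    Op("COPY", "r0", ("const_0",), None),
    Op("INT_EQUAL", "flag", ("r0", "r1"), None),
    Op("CBRANCH", None, ("flag",), 4),
    Op("INT_ADD", "r0", ("r0", "r1"), None),
    Op("RETURN", None, ("r0",), None),
]
nodes, edges = build_graph(ops)
for src, dst, kind in edges:
    print("%s: %d -> %d" % (kind, src, dst))
```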

Joern switched away from using native neo4j a while back for performance
reasons, but they do provide an export utility that can generate
neo4j-csv, graphviz, graphml, etc., so you can still import the graph
into neo4j or other tools.
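
As a rough illustration of what a neo4j-bound CSV looks like, the sketch
below writes nodes and relationships using the generic neo4j bulk-import
header convention (`:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE`). This
is not the exact layout Joern's exporter produces, which I have not
reproduced here; `nodes_csv`, `edges_csv`, and the sample rows are
hypothetical.

```python
import csv
import io


def nodes_csv(nodes):
    """Render node dicts with `id`, `label`, `code` keys as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id:ID", ":LABEL", "code"])
    for node in nodes:
        writer.writerow([node["id"], node["label"], node["code"]])
    return buf.getvalue()


def edges_csv(edges):
    """Render (source id, destination id, type) triples as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for src, dst, kind in edges:
        writer.writerow([src, dst, kind])
    return buf.getvalue()


sample_nodes = [
    {"id": 0, "label": "PCODE_OP", "code": "COPY"},
    {"id": 1, "label": "PCODE_OP", "code": "RETURN"},
]
sample_edges = [(0, 1, "CFG")]
print(nodes_csv(sample_nodes))
print(edges_csv(sample_edges))
```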

The big question: is the CPG format an ideal graph representation? Has
anyone besides the authors' companies been working with it at a high level?
What known limitations does it present? I'd be interested in seeing what
changes ShiftLeft and QwietAI have made. For reference, here's the
original academic research:
https://www.sec.cs.tu-bs.de/pubs/2014-ieeesp.pdf

On 6/27/23 10:08, Shane Macaulay via Dailydave wrote:
> There is joernio's ghidra2cpg, not sure why they now seem to be
> pushing a forked set of patches https://github.com/joernio/ghidra,
> probably the DB format changes too rapidly or some other "we
> automatically intake unknown relationships lost statically". That
> might get part of what you're looking for, even though, it isn't an
> exact fit, bringing in some higher level tooling, like all the graphql
> UI's that contextualize queries with type context are so helpful,
> whenever I don't have context aware syntax support, the barrier to
> actually do anything limits my enthusiasm so that only the most
> impactful (perceived before getting too far) get my attention (and I'm
> often wrong so :). I forget if joern still uses Neo4j, I am confident
> that it's the best FOSS available for describing code/binaries right now.
>
> Getting more tools in this space is a great initiative that deserves
> attention. Being able to communicate so expressively, codifying
> knowledge for bugs some helpers around supporting guided generation of
> queries for arbitrary conditions, the benefits for invariant analysis
> (as can be seen with Semmle/CodeQL) are extreme.
>
> On Mon, Jun 26, 2023 at 3:46 PM Dave Aitel via Dailydave
> <dailydave(a)lists.aitelfoundation.org> wrote:
>
> There's a new Ghidra release last week! Lots of improvements to
> the debugger, which is awesome. But this brings up some thoughts
> that have been triggering my
> vulnerability-and-exploitation-specific OCD for some time now.
>
> Behind every good RE tool is a crappy crappy database. Implicitly
> we, as a community, understand there is no good reason that every
> reverse engineering project needs to implement a key-value store,
> or a B-Tree
> <https://github.com/NationalSecurityAgency/ghidra/tree/master/Ghidra/Framework/DB/src/main/java/db>,
> or partner with a colony of bees which maintain tool state by
> various wiggly dances. But yet each and every tool has a developer
> with decades of reverse engineering experience on rare embedded
> platforms either building custom indexes in a pale imitation of a
> real DB structure or engaging in insect-based diplomacy efforts.
>
> I think the Ghidra team (and Binja/IDA teams!) are geniuses, but
> they are probably NOT geniuses at building database engines. And
> reading through the issues
> <https://github.com/NationalSecurityAgency/ghidra/issues/985> with
> ANY reverse engineering product you find that performance even for
> the base feature-set is a difficult ask.
>
> My plea is this: We need to port Ghidra to Neo4j as soon as
> possible. Having a real Graph DB store underneath Ghidra solves
> the scalability issues. I understand the difficulty here is: There
> are few engineers who understand both Neo4j and reverse
> engineering to the point where this can be done. I mean, why do it
> in Neo4j and not PostGres? An argument can be made for both, in
> the sense that PostGres is truly Free and the most solid DB on the
> market. The pluses for Neo4j are that RE data is typically
> graph-based more than linear.
>
> I spent the last two years learning graph dbs, out of some
> masochistic desire and ended up getting certified - and I can
> still RE a little bit. I will manage the team porting Ghidra to
> Neo4j if someone funds it. :)
>
> Either way, sooner is better than later. There are so many
> companies and people relying on these tools that it seems silly to
> do anything else.
>
> -dave
> P.S. Yes, I remember BinNavi used MsSQL installs for its data, and
> this was annoying to install but ... I get why Halvar did it at
> the time. It's because he had real work to do and building a DB
> was not it. I can only assume Reven doesn't use their own DB? I
> mean the benefits for interoperability would be huge between
> tools. . . like literally everything you want to do with these
> tools is better with a real DB underneath.
>
> _______________________________________________
> Dailydave mailing list -- dailydave(a)lists.aitelfoundation.org
> To unsubscribe send an email to
> dailydave-leave(a)lists.aitelfoundation.org