[Dailydave] Re: Tooling, Graph Databases, etc.

27 Jun 2023

I've been using Joern with Ghidra for a couple of years. For my needs, the 
ghidra2cpg project is too limiting. It works at the disassembly level by 
mapping various CPU instructions for a few common archs via a 
pre-patched bundled Ghidra build (i.e., no custom loaders, processor 
definitions, etc.). As a workaround, I've hacked together an alternative 
(worse) approach that operates at the decompiler level: a Ghidra script 
that just enumerates every function and adds the decompiler pseudo-C to a 
Joern project, using Joern's fuzzy C parser. It's a lossy hack, but it 
mostly just works. Still, both options undermine a lot of the value Ghidra 
provides.
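
For anyone wanting to try the decompiler-level route, here is a minimal sketch of that kind of script. It is not the actual script described above. It assumes it runs inside Ghidra's Python scripting environment, and the names `safe_filename`, `dump_pseudo_c`, and the output path are made up for illustration. It only writes one .c file per function; feeding that directory to Joern's C frontend is a separate step.

```python
# Sketch: dump Ghidra decompiler output as one .c file per function.
# Helper names and the output path are hypothetical.
import os
import re


def safe_filename(name, address):
    """Build a filesystem-safe file name from a function name and entry address."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", "%s_%s" % (name, address))
    return cleaned + ".c"


def dump_pseudo_c(program, out_dir, timeout_secs=60):
    """Decompile every function in `program` and write its pseudo-C to out_dir."""
    # Ghidra imports stay inside the function so the helper above can be
    # exercised outside of Ghidra.
    from ghidra.app.decompiler import DecompInterface
    from ghidra.util.task import ConsoleTaskMonitor

    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)

    decompiler = DecompInterface()
    decompiler.openProgram(program)
    monitor = ConsoleTaskMonitor()
    written = 0
    try:
        for func in program.getFunctionManager().getFunctions(True):
            result = decompiler.decompileFunction(func, timeout_secs, monitor)
            if not result.decompileCompleted():
                continue  # lossy by design: skip functions that fail to decompile
            pseudo_c = result.getDecompiledFunction().getC()
            file_name = safe_filename(func.getName(), func.getEntryPoint())
            with open(os.path.join(out_dir, file_name), "w") as handle:
                handle.write(pseudo_c)
            written += 1
    finally:
        decompiler.dispose()
    return written


# Inside Ghidra's script manager, `currentProgram` is supplied by the runtime:
# dump_pseudo_c(currentProgram, "/tmp/pseudo_c")  # hypothetical output path
```

The lossy part is visible right in the loop: types, cross-references, and anything the decompiler gives up on never make it into the text that the C parser sees.
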
A more ideal yet reachable solution that I've been toying with is to 
write a proper Joern CPG frontend parser for PCODE. This could provide 
graph DB representations of the AST, CFG, CDG, DDG, and PDG, plus the CPG 
glue tying them (and the parent PCODE) all together.
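
To make the "several graphs over the same nodes" idea concrete, here is a toy sketch. It is not Joern's schema and not real P-code semantics: `PcodeOp` is a simplified stand-in (output, mnemonic, inputs), the edge labels are illustrative, and it only handles a single straight-line basic block. It emits CFG edges between consecutive ops and DDG edges from the most recent writer of a varnode to each reader.

```python
from collections import namedtuple

# Toy stand-in for a P-code op: output varnode, mnemonic, input varnodes.
PcodeOp = namedtuple("PcodeOp", ["output", "mnemonic", "inputs"])
# `via` names the varnode that carries a data dependence (None for CFG edges).
Edge = namedtuple("Edge", ["src", "dst", "label", "via"])


def build_block_graph(ops):
    """Return labeled CFG and DDG edges over op indices for one basic block."""
    edges = []
    last_def = {}  # varnode name -> index of the op that most recently wrote it
    for index, op in enumerate(ops):
        if index > 0:
            # Straight-line code: each op falls through to the next one.
            edges.append(Edge(index - 1, index, "CFG", None))
        for varnode in op.inputs:
            if varnode in last_def:
                edges.append(Edge(last_def[varnode], index, "DDG", varnode))
        if op.output is not None:
            last_def[op.output] = index
    return edges


def query(edges, label):
    """Select one of the overlaid graphs by edge label."""
    return [(e.src, e.dst, e.via) for e in edges if e.label == label]


# Simplified operands; real LOAD/STORE ops also carry an address-space input.
block = [
    PcodeOp("u1", "LOAD", ["RSP"]),
    PcodeOp("RAX", "INT_ADD", ["u1", "RBX"]),
    PcodeOp(None, "STORE", ["RDI", "RAX"]),
]
edges = build_block_graph(block)
ddg = query(edges, "DDG")
cfg = query(edges, "CFG")
```

Both queries walk the same node set; only the edge label changes. A real frontend would also need branches, calls, and the AST/CDG layers, which this sketch leaves out.
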
Joern switched away from native neo4j a while back for performance 
reasons, but they do provide an export utility that generates neo4j-csv 
for import into neo4j, along with graphviz, graphml, etc.
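
As a rough illustration of what a CSV-based handoff to neo4j looks like, the sketch below writes a node table and an edge table using the generic header convention of Neo4j's bulk importer (`:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE`). This is an assumption-laden toy, not the layout Joern's exporter actually writes: the `code` column, the sample nodes, and the function name `to_neo4j_csv` are all made up.

```python
import csv
import io


def to_neo4j_csv(nodes, edges):
    """Render nodes and edges as two CSV strings with Neo4j bulk-import headers."""
    node_buf, edge_buf = io.StringIO(), io.StringIO()

    node_writer = csv.writer(node_buf, lineterminator="\n")
    node_writer.writerow(["id:ID", "code", ":LABEL"])
    for node_id, label, code in nodes:
        node_writer.writerow([node_id, code, label])

    edge_writer = csv.writer(edge_buf, lineterminator="\n")
    edge_writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for src, dst, edge_type in edges:
        edge_writer.writerow([src, dst, edge_type])

    return node_buf.getvalue(), edge_buf.getvalue()


# Hypothetical sample data: a call node and one of its argument identifiers.
nodes = [(1, "CALL", "memcpy(dst, src, n)"), (2, "IDENTIFIER", "n")]
edges = [(1, 2, "AST")]
nodes_csv, edges_csv = to_neo4j_csv(nodes, edges)
```

The csv module quotes the field containing commas, so the code snippet survives a round trip through any standard CSV reader.
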
The big question: is the CPG format an ideal graph representation? Has 
anyone besides the authors' companies been working with it at a high level? 
What known limitations does it present? I'd be interested in seeing what 
changes ShiftLeft and QwietAI have made. For reference, here's the 
original academic research: https://www.sec.cs.tu-bs.de/pubs/2014-ieeesp.pdf
On 6/27/23 10:08, Shane Macaulay via Dailydave wrote:
...
There is joernio's ghidra2cpg; not sure why they now seem to be 
pushing a forked set of patches https://github.com/joernio/ghidra, 
probably the DB format changes too rapidly or some other "we 
automatically intake unknown relationships lost statically". That 
might get part of what you're looking for, even though it isn't an 
exact fit. Bringing in some higher level tooling helps: all the graphql 
UIs that contextualize queries with type context are so helpful. 
Whenever I don't have context aware syntax support, the barrier to 
actually doing anything limits my enthusiasm, so that only the most 
impactful (perceived before getting too far) get my attention (and I'm 
often wrong so :).  I forget if joern still uses Neo4j; I am confident 
that it's the best FOSS available for describing code/binaries right now.
Getting more tools in this space is a great initiative that deserves 
attention.  Being able to communicate so expressively, codifying 
knowledge for bugs, with some helpers around supporting guided generation 
of queries for arbitrary conditions: the benefits for invariant analysis 
(as can be seen with Semmle/CodeQL) are extreme.
On Mon, Jun 26, 2023 at 3:46 PM Dave Aitel via Dailydave 
<dailydave@lists.aitelfoundation.org> wrote:
There's a new Ghidra release last week! Lots of improvements to
the debugger, which is awesome. But this brings up some thoughts
that have been triggering my
vulnerability-and-exploitation-specific OCD for some time now.

Behind every good RE tool is a crappy crappy database. Implicitly
we, as a community, understand there is no good reason that every
reverse engineering project needs to implement a key-value store,
or a B-Tree
<https://github.com/NationalSecurityAgency/ghidra/tree/master/Ghidra/Framework/DB/src/main/java/db>,
or partner with a colony of bees which maintain tool state by
various wiggly dances. But yet each and every tool has a developer
with decades of reverse engineering experience on rare embedded
platforms either building custom indexes in a pale imitation of a
real DB structure or engaging in insect-based diplomacy efforts.

I think the Ghidra team (and Binja/IDA teams!) are geniuses, but
they are probably NOT geniuses at building database engines. And
reading through the issues
<https://github.com/NationalSecurityAgency/ghidra/issues/985> with
ANY reverse engineering product you find that performance even for
the base feature-set is a difficult ask.

My plea is this: We need to port Ghidra to Neo4j as soon as
possible. Having a real Graph DB store underneath Ghidra solves
the scalability issues. I understand the difficulty here is: There
are few engineers who understand both Neo4j and reverse
engineering to the point where this can be done. I mean, why do it
in Neo4j and not PostGres? An argument can be made for both, in
the sense that PostGres is truly Free and the most solid DB on the
market. The pluses for Neo4j are that RE data is typically
graph-based more than linear.

I spent the last two years learning graph dbs, out of some
masochistic desire and ended up getting certified - and I can
still RE a little bit. I will manage the team porting Ghidra to
Neo4j if someone funds it. :)

Either way, sooner is better than later. There are so many
companies and people relying on these tools that it seems silly to
do anything else.

-dave
P.S. Yes, I remember BinNavi used MsSQL installs for its data, and
this was annoying to install but ... I get why Halvar did it at
the time. It's because he had real work to do and building a DB
was not it. I can only assume Reven doesn't use their own DB? I
mean the benefits for interoperability would be huge between
tools. . . like literally everything you want to do with these
tools is better with a real DB underneath.

_______________________________________________
Dailydave mailing list -- dailydave@lists.aitelfoundation.org
To unsubscribe send an email to
dailydave-leave@lists.aitelfoundation.org
