Welcome to the Decompiler home page!
Introduction
This is the home page of (yet another) open source machine code decompiler project. The goal of a machine
code decompiler is to analyze executable files (like .EXE or .DLL files in Windows or ELF files in Unix-like environments)
and attempt to create a high-level representation of the machine code in the executable file:
the decompiler tries to reconstruct the source code from which the executable was compiled in the first place.
To download the Decompiler, go to the project page:
https://sourceforge.net/projects/decompiler
Since compilation is a non-reversible process (information such as comments and variable data types is irretrievably lost), decompilation can never completely recover the
source code of a machine code executable. However, with some oracular (read "human") assistance, it can go a long way
towards this goal. An oracle can provide function parameter types, the locations of otherwise unreachable code, and
user-specified comments.
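The information loss can be seen in miniature with any compiler. The sketch below is only an analogy: it uses Python's built-in `compile` and Python bytecode as a stand-in for machine code, and the two source strings are made up for illustration. It compiles two texts that differ only in a comment and compares the results.

```python
# Two source texts that differ only in a comment (hypothetical example strings).
WITH_COMMENT = "total = price * 2  # double the price for the bundle\n"
WITHOUT_COMMENT = "total = price * 2\n"


def compiled_form(source):
    """Compile source text and return the parts of the result that execution depends on."""
    code = compile(source, "<example>", "exec")
    return code.co_code, code.co_consts, code.co_names


# The comment leaves no trace in the compiled form, so nothing working
# from the compiled form alone can bring it back.
same = compiled_form(WITH_COMMENT) == compiled_form(WITHOUT_COMMENT)
print(same)
```

Since both texts compile to the same result, the comment can only come back from outside the executable, which is the role the oracle plays.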
The decompiler is designed to be processor- and platform-agnostic. The intent is that you should be able to use
it to decompile executables for any processor architecture and not be tied to a particular instruction set. Although currently only
an x86 front end is implemented, there is nothing preventing you from implementing a 68K, SPARC, or VAX front end if you
need one.
The decompiler can be run as a command-line tool, in which case it can be given either a plain executable file or a
decompiler project file. A project file specifies not only the executable file to decompile but also any oracular
information that assists the decompiler's work. The decompiler also has a graphical front end, which lets an operator
specify oracular information while examining the decompiled executable.
The outputs of the decompiler are a C source code file containing all the decompiled
code and a header file containing the reconstructed data types.
Design
The decompiler consists of several phases.
- The loading phase loads the executable into memory
and determines what kind of executable is being decompiled. The
executable format usually defines the processor format and the
expected operating system environment. For older formats, such
as plain MS-DOS .EXE files, the processor (x86 real mode) and
operating system environment (MS-DOS) are implicit. Once the
format is determined, the binary is loaded into memory
(uncompressing it if necessary) and pointer or segment
relocations are carried out. These relocations are also helpful
in later stages of the decompiler, as each relocated pointer value
can be given a preliminary type pointer-to(<unknown>) and
each relocated segment selector the type segment-selector.
- The scanning phase follows the loading phase. The executable
will usually have one or more entry points, addresses pointing to
executable code. The code at the entry points is disassembled and traced,
looking in particular for branch, call,
and return instructions. As tracing proceeds, individual procedures
are discovered and a call graph is built up, whose edges represent
calls between procedures.
- The rewriting phase rewrites all machine-specific instructions into
low-level machine-independent instructions. Idiomatic instruction sequences
are rewritten to expressions. From this point on, the decompilation process is processor independent.
- The analysis phase first performs an interprocedural reaching definitions analysis.
This is done to determine, for each procedure proc of the program, which
processor registers
are preserved and which processor registers are modified after
a call to proc. A subsequent interprocedural liveness analysis, combined
with the results of the reaching definitions analysis, determines which processor
registers are used as parameters and return value registers for each procedure. Note
that this analysis avoids depending on a specific processor/platform ABI or calling
convention. Once the two interprocedural analyses are complete, the procedures can
be rewritten with their explicit arguments. Subsequent analyses are then performed
on a procedure-by-procedure basis. Procedures are converted into SSA Form,
condition code flags are eliminated and expressions are simplified. Finally the
procedures are converted
out of SSA Form.
- The interprocedural type analysis phase attempts to recover the
data types used in the program by analyzing the way in which values are used by
the program code, incorporating clues obtained from the relocation data
as well as any "oracular"
information provided by the user. Memory access expressions
are converted into their C equivalents: pointer dereferences (*foo), member access
expressions (foo->bar), and array references (foo[bar]).
- Finally, a structure analysis rewrites the control structures from
unstructured goto spaghetti code into C-language if statements, while and do loops,
and switch statements.
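To make the scanning phase more concrete, here is a minimal sketch of tracing code from an entry point while building a call graph. It is not the decompiler's actual implementation (which is written in C#). The toy instruction set (`op`, `br`, `jmp`, `call`, `ret`), the one-address-per-instruction encoding, and the names `PROGRAM` and `scan` are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy program: address -> (mnemonic, operand).
# Every instruction occupies one address; a real front end decodes bytes instead.
PROGRAM = {
    0: ("call", 10),
    1: ("call", 20),
    2: ("ret", None),
    10: ("op", None),
    11: ("br", 13),     # conditional branch: target and fall-through are both traced
    12: ("call", 20),
    13: ("ret", None),
    20: ("op", None),
    21: ("ret", None),
    30: ("op", None),   # never reached from the entry point at address 0
    31: ("ret", None),
}


def scan(program, entry_points):
    """Trace code from the entry points and return (procedures, call_graph).

    procedures maps each procedure's entry address to the set of instruction
    addresses traced for it; call_graph maps a caller's entry address to the
    set of callee entry addresses.
    """
    procedures = {}
    call_graph = {}
    pending = deque(entry_points)      # procedure entries still to be traced
    while pending:
        proc = pending.popleft()
        if proc in procedures:
            continue                   # this procedure was already traced
        visited = procedures[proc] = set()
        call_graph[proc] = set()
        work = [proc]                  # addresses to trace inside this procedure
        while work:
            addr = work.pop()
            if addr in visited or addr not in program:
                continue
            visited.add(addr)
            mnemonic, target = program[addr]
            if mnemonic == "ret":
                continue               # control leaves the procedure here
            if mnemonic == "jmp":
                work.append(target)    # unconditional jump: no fall-through
                continue
            if mnemonic == "br":
                work.append(target)
            elif mnemonic == "call":
                call_graph[proc].add(target)   # edge: proc calls target
                pending.append(target)         # the callee is a new procedure
            work.append(addr + 1)      # fall through to the next instruction
    return procedures, call_graph


procedures, call_graph = scan(PROGRAM, [0])
print(sorted(procedures))   # entry addresses of the discovered procedures
print(call_graph)
```

Starting from address 0, the sketch discovers the procedures at 0, 10, and 20 but never reaches the code at address 30. Passing it as an extra entry point, as in `scan(PROGRAM, [0, 30])`, is the kind of information about otherwise unreachable code that an oracle can supply.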
Development
The decompiler is written in C# and currently targets CLR version 2.0.
It's currently developed with Visual Studio 2005, but the plan
is to have a working MonoDevelop project soon (wanna pitch in?).
The project follows the test-driven development methodology, with heavy emphasis on unit testing.
No new code is allowed into the project unless it has one or more associated tests written for it. Developing a
decompiler is notoriously tricky work with lots of special cases. Not having unit tests would make development an
eternal bug hunt as fixes for one bug introduce other bugs. Unit tests are developed using NUnit v2.2.
Subversion is used for source control.
Status
The decompiler project is in a pre-alpha stage. As it stands, it is able to load MS-DOS and PE binary files,
disassemble their contents, rewrite the disassembled instructions into intermediate code, and perform the analysis phase
mentioned above. Currently, work is focused on type analysis, while code structuring is on the back burner, as it's considerably
less complex than type recovery. If you'd like to chip in, feel free to contact us!