Welcome to the Decompiler home page!

Introduction

This is the home page of (yet another) open source machine code decompiler project. The goal of a machine code decompiler is to analyze executable files (like .EXE or .DLL files in Windows or ELF files in Unix-like environments) and attempt to create a high-level representation of the machine code they contain: the decompiler tries to reconstruct the source code from which the executable was compiled in the first place.

To download the Decompiler, go to the project page:
https://sourceforge.net/projects/decompiler

Since compilation is a non-reversible process (information such as comments and variable data types is irretrievably lost), decompilation can never completely recover the source code of a machine code executable. However, with some oracular (read "human") assistance, it can go a long way towards this goal. An oracle can provide function parameter types, the locations of otherwise unreachable code, and user-specified comments.

The decompiler is designed to be processor- and platform-agnostic. The intent is that you should be able to use it to decompile executables for any processor architecture and not be tied to a particular instruction set. Although only an x86 front end is currently implemented, nothing prevents you from implementing a 68K, SPARC, or VAX front end if you need one.

The decompiler can be run as a command-line tool, in which case it accepts either a plain executable file or a decompiler project file. A project file specifies not only the executable file to decompile but also any oracular information that assists the decompiler's work. The decompiler also has a graphical front end, which lets an operator specify oracular information while examining the decompiled executable.

The decompiler produces two outputs: a C source code file containing all the decompiled code, and a header file containing the reconstructed data types.

Design

The decompiler consists of several phases.

The loading phase loads the executable into memory and determines what kind of executable is being decompiled. The executable format usually defines the processor format and the expected operating system environment. For older formats, such as plain MS-DOS .EXE files, the processor (x86 real mode) and operating system environment (MS-DOS) are implicit. Once the format is determined, the binary is loaded into memory (decompressing it if necessary) and pointer or segment relocations are carried out. These relocations are also helpful in later stages of the decompiler, as each relocated pointer value can be given a preliminary type pointer-to(<unknown>) and each relocated segment selector the type segment-selector.
The scanning phase follows the loading phase. The executable will usually have one or more entry points, which are addresses pointing to executable code. The code at the entry points is disassembled and traced, looking in particular for branch, call, and return instructions. One by one, individual procedures are discovered, and a call graph is built up whose edges represent calls between procedures.
The rewriting phase rewrites all machine-specific instructions into low-level machine-independent instructions. Idiomatic instruction sequences are rewritten to expressions. From this point on, the decompilation process is processor independent.
The analysis phase first performs an interprocedural reaching definitions analysis. This is done to determine, for each procedure proc of the program, which processor registers are preserved and which processor registers are modified after a call to proc. A subsequent interprocedural liveness analysis, combined with the results of the reaching definitions analysis, determines which processor registers are used as parameters and return value registers for each procedure. Note that this analysis avoids depending on a specific processor/platform ABI or calling convention. Once the two interprocedural analyses are complete, the procedures can be rewritten with their explicit arguments. Subsequent analyses are then performed on a procedure-by-procedure basis. Procedures are converted into SSA form, condition code flags are eliminated, and expressions are simplified. Finally, the procedures are converted out of SSA form.
The interprocedural type analysis phase attempts to recover the data types used in the program by analyzing the way in which values are used by the program code, incorporating clues obtained from the relocation data as well as any "oracular" information provided by the user. Memory access expressions are converted into their C equivalents: pointer dereferences (*foo), member access expressions (foo->bar), and array references (foo[bar]).
Finally, a structure analysis phase rewrites the control flow from unstructured goto spaghetti code into C-language if statements, while and do loops, and switch statements.
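
To make the scanning phase described above more concrete, here is a minimal sketch in Python (not the project's C# code). It traces a toy program from its entry points, treating call targets as new procedures and recording call graph edges. The instruction set, the PROGRAM table, and the scan function are all hypothetical and exist only for illustration; a real scanner works on disassembled machine instructions of varying length.

```python
from collections import deque

# Toy program: address -> (mnemonic, target address or None).
# This instruction set is hypothetical; every instruction is one unit long.
PROGRAM = {
    0: ("call", 10),
    1: ("jz", 3),
    2: ("call", 20),
    3: ("ret", None),
    10: ("call", 20),
    11: ("ret", None),
    20: ("nop", None),
    21: ("ret", None),
    30: ("nop", None),   # never reached from the entry point
    31: ("ret", None),
}


def scan(program, entry_points):
    """Trace code from the entry points; return (procedures, call_graph)."""
    procedures = {}   # procedure entry address -> set of instruction addresses
    call_graph = {}   # procedure entry address -> set of callee entry addresses
    pending = deque(entry_points)
    while pending:
        entry = pending.popleft()
        if entry in procedures:
            continue                      # already scanned this procedure
        body, callees = set(), set()
        work = [entry]
        while work:
            addr = work.pop()
            if addr in body or addr not in program:
                continue
            body.add(addr)
            mnemonic, target = program[addr]
            if mnemonic == "ret":
                continue                  # this path ends here
            if mnemonic == "call":
                callees.add(target)       # edge in the call graph
                pending.append(target)    # the callee is a separate procedure
            elif mnemonic in ("jmp", "jz"):
                work.append(target)       # branch target, same procedure
            if mnemonic != "jmp":
                work.append(addr + 1)     # fall through to the next instruction
        procedures[entry] = body
        call_graph[entry] = callees
    return procedures, call_graph
```

Calling scan(PROGRAM, [0]) discovers procedures at addresses 0, 10, and 20. The code at address 30 is never found because nothing reachable refers to it, which is the kind of gap that oracular information is meant to fill.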
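
The liveness step of the analysis phase can be illustrated with a heavily simplified sketch in Python. It handles only a single straight-line procedure, whereas the analysis described above is interprocedural and must cope with branches and calls. The summarize function, the PROC table, and the (destination, sources) instruction encoding are assumptions made for this example. Registers that are read before being written are candidate parameters, and registers that are written are reported as modified.

```python
# Each instruction is (destination register or None, tuple of source registers).
# This encoding is hypothetical and stands in for the intermediate code.
PROC = [
    ("ax", ("bx", "cx")),   # ax = bx + cx
    ("dx", ("ax",)),        # dx = ax
    ("ax", ("dx", "si")),   # ax = dx + si
]


def summarize(instructions):
    """Return (live_in, modified) for a straight-line instruction list."""
    live = set()       # registers whose incoming value is still needed
    modified = set()   # registers written anywhere in the procedure
    # Walk backwards: live_in = (live_out - defined) | used.
    for dest, sources in reversed(instructions):
        if dest is not None:
            modified.add(dest)
            live.discard(dest)   # a definition ends liveness above it
        live.update(sources)     # a use makes the register live above it
    return live, modified
```

For PROC, summarize reports bx, cx, and si as live on entry (candidate parameters) and ax and dx as modified. No calling convention is consulted, which matches the goal of not depending on a specific ABI.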

Development

The decompiler is written in C# and targets CLR version 2.0. It's currently developed with Visual Studio 2005, but the plan is to have a working MonoDevelop project soon (wanna pitch in?).

The project follows the Test Driven Development methodology, with heavy emphasis on unit testing. No new code is allowed into the project unless it has one or more associated tests written for it. Developing a decompiler is notoriously tricky work with lots of special cases. Without unit tests, development would become an eternal bug hunt, as fixes for one bug introduce other bugs. Unit tests are developed using NUnit v2.2.

Subversion is used for source control.

Status

The decompiler project is in a pre-alpha stage. As it stands, it is able to load MS-DOS and PE binary files, disassemble their contents, rewrite the disassembled instructions into intermediate code, and perform the analysis phase mentioned above. Currently, work is focussed on type analysis, while code structuring is on the back burner as it's considerably less complex than type recovery. If you'd like to chip in, feel free to contact us!