We intend to present a method of code obfuscation, in combination with the use of genetic algorithms, which may help make the challenge of breaking software less appealing to malicious users, without forcing developers to spend too much time and effort on the obfuscation process.
Before describing the details of our theory, we give a brief background introduction to software copy protection, as well as to code obfuscation and genetic algorithms.
Introduction
Looking back at software development over the years, there has been many different ways in which the software developers have tried to protect their software from piracy.
In the older days, preventing piracy could be attempted by marking a tape as copy protected, and when marked as such, the computer would refuse to copy it using the tape player / recorder. It soon became obvious that if you had a double deck regular tape player, you could produce a perfect copy of the tape there, thus getting past the copy protection without hassle. Other methods included extracting the program from the original tape and moving it onto another tape where it could be loaded using generic loader programs, or even moving it to floppy discs. Some of the methods used to fool the copy protection were indeed so easy that anyone, even people without any computer knowledge, could use them. All you needed to do was own a good stereo, and your problems were gone already.
Later on, more advanced copyright protection measures were developed to stop piracy. In some cases, bad sectors on disc were added, and checked at start–up, knowing that even though the readable content may be copied correctly, the bad sectors would not. Checking these sectors would be a dead give–away to if the disc had been copied or not.
In other cases, advanced paper wheel constructions were included in the software package, where the user was prompted to give the correct answer to a given challenge by turning the wheels around and enter the given output.
Yet another common copy protection method included asking the user to open a given page in the software manual, find a specific paragraph and enter one or more words found there.
All of these methods should have been rather effective protection measures against piracy, since they were not easily solvable by the regular user. Copying entire manuals or trying to memorize or reproduce paper wheels meant a lot of work and trouble, and in the end it may be cheaper and more convenient to just buy the original product. Yet piracy flourished. Why? Because there were users who managed to alter the program itself and remove the copy protection totally. Adding a very advanced anti piracy function to your software to see if the user has actually purchased the product or not, means nothing in the end if the method is never actually executed. The key to avoiding protection is simply to remove the part of the program which does the check, or to make sure that it always returns true ( tell the rest of the program that the user did indeed manage to enter the secret word from page 43, paragraph 5 ).
So how is it possible? The reason that software can so easily be cracked and copy protection removed, is that in order for the computer to be able to run the program, the program executable has to contain all the operations to be carried out during the execution. This is rather straightforward, as the program cannot be run without them. However, if the computer can read the instructions, then is it not possible for the user to do so as well? Yes and no. Most people would have a hard time reading the instructions from the file since they are at a very low level. What they can do is to try to reverse engineer the instructions inside the executable file into a more readable list of instructions, which may be understood by the user. Normally, reverse engineering will transform the executable code 'one step back' into assembly language instructions. As high level languages are becoming more and more frequently used, most people never bother to learn the assembly language. There is simply no use anymore.
Many years ago, using the assembly language for software developing was the only actual way to go, if you wanted to make sure that your program would not be a lot slower than needed be. Back then, the difference between executing a program generated by a higher language compiler and one written by hand in assembler, in terms of execution time, were tremendous. CPU cycles were very limited, so saving a few instructions here, and a few instructions there, would have a large impact on performance. Also, the uniformity of the computing environment (as all computers contained the same chipsets, memory size, CPU model, etc) would allow the software developers to directly access the hardware in their assembly code to achieve the results they wanted in a single instruction or two, rather than the 25 or 50 or even more than 100 instructions which may be needed if it were done through higher level library calls. Interestingly enough, this kind of "dirty programming", was in fact industry standard, as computers were mostly used as game machines, and games would require as good performance as possible to look good, feel new and run smoothly.
However, despite the fact that fewer and fewer people are nowadays learning the assembly language, this does not mean that piracy has ended, or that reverse engineering is not a threat to software piracy prevention methods. Rather on the contrary. The globalization of the internet, in combination with higher connection speeds, leads to that few people actually need to know assembler. It is enough that a few people know it, reverse engineer the software and remove the copy protection function, then release it on the internet for everyone to enjoy. Most of them have little actual interest in the software product itself, but work for fame and glory, or possibly for the challenge of breaking others software. The key problem with placing any kind of copy protection inside the software is that it will always be possible to see exactly what the program does, and as long as you can see what the program does, you can either alter the program to remove unwanted parts (cracks), or figure out other ways to get past it (key generators, etc). An executable file is simply not as secret and well protected as some people may believe, and many may wish.
Protecting Software
So how is it possible to protect software, knowing that everything placed inside the program will be accessible to smart end–side users? There are many different ways to try to address the problem, some of which include:
- Hardware challenges, where the user must have a piece of hardware installed to use the software correctly.
- Serial key protection mechanisms and license key file protection.
- Program encryption, with run–time decryption, which effectively prevents anyone in hold of the executable file to reverse engineer it into assembly code.
- Server–sided client copy protection, where the client 'calls home' to check if a license key is correct, or if the software's file integrity is still intact.
All copy protection methods have drawbacks. Hardware challenges require that the developer creates custom made hardware, preferably a new version for each program / version of program, adding a new dimension of work and expense to the development, often raising the price while lowering user freedom, as the software cannot be run if the hardware is missing or broken.
Serial key protection and license files are not truly safe, as the validation of serial numbers must be located somewhere inside the software, and as such, can be found by anyone who decides to reverse engineer the program to look for it.
Program encryption is a very good way of preventing hackers from seeing the code easily, but the execution time of the program is greatly affected by the need of runtime decryption, and if a kernel level debugger is used, it may still be possible to see exactly what the program is doing at any given time, as this shows the actual instructions being executed.
Server–sided client security is becoming increasingly common as more and more computers can be expected to have an internet connection. However, there are two different kinds of scenarios possible when it comes to this solution. The first scenario being that the user has a firewall, and simply blocks the program. Experienced users will probably see no reason to let any program access the internet unless it is part of the programs key functionality to do so. The second case scenario is when the program must be allowed to access the internet for one reason or another, and here the server sided security may actually work. But that brings us back to our previous conclusion, that as long as the user has the executable file, it is possible to simply remove the protection completely. If the client software does not call home to see if it is pirated or not, it will never discover the truth.
Throughout the years, software developers have made many brave attempts to thwart pirates, but there are few examples of cases where it has actually been successful for any longer period of time. Preventing users from finding out exactly what software does is in fact quite impossible. As long as the end user of a program should be able to run the software, there will always be ways to reverse engineer it.
The growth of the internet has also helped piracy become more widely spread, as it is possible to quickly distribute pirated software across the planet using high speed connections, peer to peer networks, and so on.
Code Obfuscation
A common saying is that "there is no such thing as clientside security". As we have demonstrated in the introduction above, this is in many ways true. There is no way to prevent users who have gotten in hold of the executable file to find out exactly what the file does and how to alter it to remove copy protection functions, or to create solutions to get past them. Instead, we decided to take a look a one method for making it harder for the malicious user to understand the program after it has been reverse engineered, namely code obfuscation.
Code obfuscation is the process of facing the fact that end users will be able to reverse engineer your software but deciding to give them a hard time understanding exactly what it does. Basic code obfuscation methods may include:
- Adding code that does nothing or is never actually executed.
- Find different, more complex and less readable, ways to carry out the same operations.
- Move the code around, placing related functions as far away from each other as possible, so it becomes harder to follow the execution of the program.
- Encode data in different ways. For instance, encrypt all data that is stored in memory to make it more difficult to see them when debugging the program
Code obfuscation does not focus on preventing users from seeing what the program does. Instead it takes a different approach and creates a needle–in–the–haystack kind of scenario for the would–be hacker. If the code is obfuscated enough, it may take days or weeks to understand what you are looking for and where to find it inside all the code, and perhaps even then, the code contains too many trick operations to be able to make sense of it.
Naturally, it is not possible to prevent the most dedicated reverse engineers from eventually finding some way to break the software, but if the work becomes too slow and tiresome, most of them just may give up before finding their way.
There are, of course, drawbacks with using code obfuscation. If we add too much nonsense code into the program, it may damage the execution time, making the program a lot slower than it was intended to be. Another drawback would be the file size of the executable file.
Adding too many instructions into it may make the size of the executable grow to an undesirable size, but this problem is not as urgent, as HD producers are constantly making bigger and bigger discs while prices are remaining stable or even dropping.
Code obfuscation focuses on making it harder for the malicious user to make sense of what the program actually does, by adding code, encoding strings, and so on. Drawbacks may include performance issues and file size.
Genetic Algorithms
Genetic algorithms are a commonly used optimization technique, which can be considered as a random beam search of the search space. It aims to find an optimal (or locally optimal) solution to a problem as quickly as possible.
In genetic algorithms, the possible solutions are made up from a collection of individuals, called the population. Each individual in the population holds one possible solution, and the initial population consists of randomly generated individuals. Normally the solution will be stored inside the individual as a bit string value, which will be translated into actual values at a later stage, for instance when trying to determine how well an individual performs.
For each individual and generation, a fitness value is calculated. The fitness value is a measure of how well the individual's solution solves the problem in question, and is calculated using a fitness function. The fitness function must, of course, be individual for each problem, and may be some kind measure of performance, solution size, and so on.
After calculating the fitness value of each individual for the generation, different methods are used to copy the best individuals into the next generation, and to let two good individuals create an offspring (also called crossover), which hopefully will be even better than the individual parents. Crossover is normally carried out by copying parts of the bit string from each parent into the child, using a policy which determines how to choose which bits to copy. Please see figure 1 below for a very simple example of single point crossover.
Figure 1. Single Point Crossover Example
There are many different ways of selecting individuals for crossover, but as genetic algorithms are inspired by biology, and the goal is to create the best possible individual, most of the selection models somehow involve the previously mentioned fitness value of the individuals, trying to make sure that the chance to reproduce is higher for individuals who represent betters solutions to the given problem.
Normally, around 40 percent of the individuals, namely the best ones, will be copied directly into the next generation, while the remaining 60 percent will be created using crossover. Something called mutation will also be used to randomly change some value for less than 1% of the population. As mentioned above, most selection methods, both for crossover and for copying, are strongly linked to the fitness of the individuals, so the reason for using mutation is to prevent premature convergence and to help increase the diversity of the population. Normally it involves the inversion of the value of one (or more) bits, as shown in figure 2.
Figure 2. Single Bit Mutation Example
Genetic algorithms are an efficient way of finding a good solution to a given problem within in a short period of time. Starting out with a random population, and using fitness proportionate selection methods, genetic algorithms are able to quickly home in on a good solution, and use random mutation as a safe–guard to prevent the search from getting stuck in a local optima, with too many individuals converging around a single solution.
Anyone interested in reading more about genetic algorithms, and how to implement them, should take a look at [1]. This paper merely gives a very brief introduction, in order for the reader who has not yet come across genetic algorithms to get a general idea of what they are.
Genetic Algorithms and Code Obfuscation
As we mentioned above, code obfuscation is when software developers intentionally change the code to make it more complex than it needs to be, in order to make it less readable for anyone trying to reverse engineer their product.
Genetic algorithms, on the other hand, are an automated search / optimization function for finding good solutions to a problem in a short time.
We decided to take a closer look at the possibilities of combining the two into an application that would automatically help create code obfuscation using genetic algorithms. We mostly focused on the first two versions of code obfuscation, namely adding code that does nothing or is never run, and finding different, more complex (and less readable) ways to carry out the same operations. In our study, we had in mind the option of extracting a section of the source code, obfuscating it, and then inserting the obfuscated code in the stead of the original code. However, other obfuscation techniques, such as moving code around, creating new functions that may or may not be run at some point in the execution, etc, should be far from impossible to implement using the same kind of obfuscation engine that we discuss in the rest of this paper.
General Aspects
First of all, one would have to consider the goal of the task at hand. We start out with a section of source code, and we would like to create a new section of source code which does the same but which is harder to understand, and possibly carries out some operations that are not actually needed. Another goal would be to avoid letting the code obfuscation slow down the final application too much. In table 1 below, we have listed a few of the goals we would like to achieve, as well as things we would like to avoid.
| Outcome | Result |
|---|---|
| More code | Good |
| More complex code | Good |
| Longer execution time | Bad |
| More functions | Good |
| Large output file | Bad |
| Losing original functionality | Very, very bad |
Naturally, some of the different possible outcomes are in conflict with each other. For instance, we would like to have more code, and more complex code, but we would probably not like to have code that is infinitely longer and more complex than the original code, or the program will no longer be able to function properly.
Also, one very important aspect is that the original functionality of the code that is being obfuscated must be preserved. In case the functionality is altered, the obfuscation becomes a problem. Thus, the method for obfuscating the code must be well tested and flawless in its execution.
Given all the information above, one could draw the conclusion that the following aspects would be important to the design of the obfuscation engine:
- Computational effort must be calculated and compared to the original computational effort for the code being replaced. If the difference is too large, the solution is inadequate.
- Original functionality must be preserved at all times. It is of paramount importance that the code obfuscation does not introduce bugs or altered behavior to the application, or the obfuscation will end up creating useless code that is neither readable nor useable.
- The relationship between the original code size and the output code size must not be too large. Although we want more code, to be able to better hide our original – important – code, we do not want it to be grossly oversized.
One more thing to consider would be that the solution produced by the obfuscation engine would have to be created in such a way that it would not be overly obvious to anyone reading the output code that it was created by the obfuscator, lest the reader may choose to simply ignore it and move on. However, in some cases this may actually be a positive effect, in case the code created by the obfuscating engine actually contains both valid and invalid code, as the user may miss important code, judging it to be automated code rather than real.
Representation Scheme
The representation scheme of the genetic algorithms is what determines how the bit string should be interpreted into an actual solution. In our case, it would be feasible to use a rather straightforward representation, where the bit strings are translated into different programming statements and operations, using both randomly generated variables and variables that have been found by analyzing the given code.
However, we feel that in this case, a bit string would be insufficient to use to try to contain the entire solution. Instead we would suggest using an object oriented design, where each individual keeps a list of "lines of code". It would not be impossible to use bit strings but for functional reasons, as well as programming simplicity, it would probably be better to use a more complex representation. Translation between real code and the individual's data, as well as verification of code functionality and calculation of computational effort should be more intuitive and easily implemented that way.
Creating the Fitness Function
When designing the fitness function for the code obfuscator, one would have to take great care to the aspects we have mentioned earlier. First of all, the most important criteria must be that the code contains the original functionality. It must in no way alter the values of variables in such a way that the output given by the obfuscated code would differ from the output given by the original code. This is rather complex, given that one of the points with obfuscating the code must be to make it harder for a hacker to understand, and as such, it should preferably be working with variables and data that is important in the real program. Regardless of complexity, any individual who contains a solution that does not fully represent the original source code, must have a fitness value of 0. It must be deemed as unsuitable and either be removed, to be replaced by a new individual, or allowed to remain in the hope that crossover or mutation would once again restore the functionality.
Another important characteristic of the individual that should affect the fitness value would be faulty code, which cannot be run or exhibits faulty behavior. Same as above, it should be given a fitness value of 0, since our goal is to create obfuscated code, not defect code.
Apart from these two important safeguards against getting useless output, the fitness function should try to calculate the computational effort needed to run the code, as well as the code size and code complexity of the output. Longer, more complex code would give a better fitness value, as long as the time needed to execute it would not have grown too much, or the code size would not be by far too big. It would be important for the user of the obfuscation engine to be able to tweak these kinds of settings to make it better suit his needs. In some cases, execution time may be of great importance, while file size is not, and so on.
Code Analysis and Verification
One very important part of the obfuscation engine would be the code analysis. It must be able to correctly read and interpret the code it is given to be able to create new code that does exactly the same thing but in a different way. If the code analysis is not completely correct, slight variances may occur, and the output code will lead to unpredictable behavior.
Problems with the code analysis would also lead to problems during the verification phase: if the code analysis failed to properly interpret the given code, it may believe that the newly created code is functionally equal to the original code, while in fact, it is only equal to its own interpretation of the original code.
Challenges
There are a number of serious challenges that would arise when using automated code obfuscation. First of all, when calculating the computational effort, the system would have to either generalize to estimate the time needed for execution of the given set of instructions, or it would have to be able to run and measure the real time actually needed. In case of the latter, either by being able to run the original program and injecting the code into the expected location, or by creating a new program where the code has been replaced by the obfuscated code, then comparing the time of the two. Even though estimating the computational effort may be more time saving, to actually execute the new code would possibly give other advantages.
For instance, newly created bugs would be clearly visible if the program would be unable to execute the newly created part of the code. Another benefit of running the code would be the possibility of functionality verification, where the new code could be run, along with the original version, and the exact outputs from the two could be compared.
One way of solving all these things would be if the obfuscation engine contained a debugger or similar that would be able to run the original program liked during normal execution, to a breakpoint, namely where the code to be obfuscated starts. It could then save all the relevant values, run to the next breakpoint, where the code to be obfuscated ends, save all the relevant values once more, go back to the first breakpoint and then restore the values contained there. Then for every individual to validate, proceed with the execution from the first breakpoint using the new code stored in the individual currently being evaluated. Once the execution of the newly created code was complete, the computational effort would be known by comparing the new execution time with the original execution time, the number of bugs would be known, hopefully none, and the number of changes in the output compared to the original code would be known, by simply comparing the new state with that obtained during the initial run.
Even though the implementation of such a debugger would be a considerable task in itself, it would probably help save time in the end, as well as add reliability to the obfuscator. For instance, time would no longer be estimated but measured, and changes to the environment during the run of the obfuscated code would be clearly visible rather than predicted.
Conclusions
Code obfuscation is a good way of deterring hackers from bothering to hack your software. It does not try to prevent them from seeing what is inside the executable but instead it makes the task harder and more time consuming by adding more code and more complexity.
Manual code obfuscation is rather time consuming, and it is hard to verify if the manually created code is effective or not, as we cannot tell for sure exactly how it affects the execution time of the program, or the state of the system, compared to the original code.
Using genetic algorithms to find solutions that are big enough, yet not too big, complex enough, yet not too slow, and 100% surely do not alter the execution of the important parts of the original software, would be a big challenge but far from impossible, and very useful indeed. It is very difficult and time consuming to manually obfuscate the code, but with an automated code obfuscating engine that actually runs the newly created code, it should be possible to make it a lot harder for malicious users to remove for instance copy protection functions from software, and yet be able to know for sure that the new code is functioning properly. Given enough thought to the design of the obfuscator, the software developers using it should be able to finish all their work on their software, then feed the output to the obfuscation engine, which in turn creates a new application which is verified to function as the original did, contains a lot more complex code, yet runs at an acceptable speed, and is a lot harder to reverse engineer.
Considerable challenges in creating such an obfuscation engine would involve:
- The analysis of the original source code.
- Selecting a good representation scheme for the individuals.
- Finding good ways to create the new random code, given the results from the code analysis.
- Creating good functions for crossover of individuals.
- Implementing a reliable debugger in which to run the newly created code.
- Verifying the output and comparing it to the original results.
Although there are a number of difficulties to address, and pitfalls to avoid, we believe that it should be possible to create a good code obfuscation engine, using genetic algorithms, and that there may be industrial interest in such a product. Protecting software has been a battle between hackers and developers going on for many years, and in the end the hackers always find a way to break that new, 'unbreakable' protection system. By obfuscating the code enough, it does not prevent hackers from reading it, but it makes the task less interesting. If much enough of the code has been obfuscated, it is no longer amusing for the hacker to try to find what he is looking for.
The reason why we believe that genetic algorithms would be a good choice for this task is because of the flexibility and speed. Since the initial generation is randomly created, and the size of the possible search space is very large, then if the creation function is good, the genetic algorithms should be able to find good obfuscated versions of the source code in a rather short time. Using a good debugging engine to test it should ensure that the resulting code is reliable and functional.
In the end, the obfuscation engine could both save time for the developers, who no longer need to manually obfuscate their code, or spend time on extensive testing to verify the new code once it is done, as well as lower the risk of that people who reverse engineering the code find anything useful unless spending a lot of time and effort on it, thus raising the security of the software.
References
- Reeves, Colin R.and Rowe, Jonathan E., Genetic Algorithms — Principles and Perspectives : A Guide to GA Theory. Kluwer Academic Publishers, ISBN 1–40–207240–6 (2002)
0 comments:
Post a Comment