Getting Started With Genetic Programming
If you’ve never heard of Genetic Programming or GP, then let me introduce you.
GP is a really simple idea to solve complex problems using an approach found in nature.
GP attempts to solve problems by modeling the evolutionary process. In GP you evolve a population of individuals (programs) at random, then test their suitability (fitness) for solving the problem. If a suitable individual is not found, you breed those that are the most fit. Then you test the new generation for fitness and the process continues until you find a solution. Sounds simple right? And it really is at its root.
You might wonder how you breed programs… Well, there are two methods. The first, called crossover takes some of the genes of two individuals and randomly combines them to create a new offspring. The second method is mutation. Mutation creates offspring by randomly altering a randomly chosen portion of the selected parent’s DNA.
We all know what DNA is in a broad sense. We have all heard that DNA is the blue-print for life. We know that DNA is made up of a sequence of genes, each having their own function or purpose. DNA for a GP program is no different. Digital DNA is just a blue-print for a program. Now programs are made up of functions. Some functions are complex, some very simple. In GP we create programs by building a blue-print of functions, variables, and constants. All the parts of most working programs…
GP programs don’t usually build the actual program code. Instead they build a syntax tree that represents the function to be called and when. The variables and constants in the program are leaves of the tree and in GP they are called terminals. The functions are internal nodes of the tree and are called, well, functions. The set of functions and terminals together form a very primitive system.
This simple approach can be built up to include more complex functions or subroutines. In which case the program is usually represented as a set of trees. One for each subroutine and one for the main program that links them all together under a root node. GP researchers refer to these sub-trees as branches.
In this article I can’t hope to explain every aspect of GP, all it’s terminology, or techniques. I only want to give you a basic understanding and one or two crude samples to get you started.
GP was first explored using earlier languages like Lisp. One of the reasons was because Lisp uses a prefix notation which makes parsing easier. For example, in Lisp you type (+ 1 3) to add the digits 1 and 3 together. This approach is typical of stack based languages like lisp. While this may seem confusing for some humans, it is a very powerful technique that reduces complexity in language parsing.
If you consider that all computer programs are simply an abstraction of a hand-full of primitive instructions, then you can consider the GP syntax tree to be simply the source code for a program. The GP DNA is really nothing more than a high level language for creating another program. So using a language that allowed for the ease of parsing was natural. However, today GP research is being done in almost every popular language.
In most cases GP function all take the same number of arguments. However, in some cases you may need functions to take a variable number of arguments. This is a property known as arity in GP. If your GP function all take the same number of arguments then the tree structure becomes unnecessary and a simple sequential list can be used. Trees seem to remain the most popular representation however due to the fact that it is often required to include complex features in the primitives. This is often more easily done using trees.
You can’t start producing individuals without parents to breed or mutate. So where do these parents come from? The initial set of parent(s) are usually randomly generated. The two most popular methods for generating the initial parent(s) are the full and grow methods. They are also perhaps the simplest.
In both cases the random generation algorithm limits the complexity of the parent to a maximum depth. The depth is the distance of a node from the root node which is placed at a depth of zero. The distance is calculated by the number of edges (connecting lines) you must traverse to arrive at a particular node. The depth of the entire tree is noted as the depth of the deepest leaf. If you use a sequential list instead, then the depth is simply the length of the sequence.
In the full method all leaves in the final tree are at the same depth. Nodes are randomly selected from the function set and then never reused. The range of trees produced by the full method is limited. Thus, the number, size, and shape of the resultant programs can be limited.
To solve this, the grow method was introduced. The grow method produces trees of more varied shape and size. The grow method never removes a function from the list of available functions so a function may be selected multiple times and placed at different locations in the tree. This greatly increases the number of possible trees and programs that can be produced with the given set of primitives.
There are other methods as well, and even some combinations of these and other methods for generating the initial population (parents). Some work better than others for a given type of problem. Choosing an effective method for any particular problem is often a non-trivial task and sometimes best solved by experimentation.
So now let us move on to the selection process. How do we go about calculating the fitness of an individual?
There are many methods for selecting individuals for propagation. One of the simplest is the tournament selection process. This method selects a number of individuals at random and from the current generation. Their fitness value is computed and the best individual(s) are selected as the parents of the next generation. If you use the mutation method you only need a single parent. The crossover method requires two parents.
Once you have selected parent(s) for the next generation, how do you mate them to produce the next generations of offspring? If you’ve studied biology you may know that all humans and most mammals begin life as a female. That’s right, all male humans are inherently female for the first few days of life. It isn’t until the fetus gets a dose of testosterone that the fetus begins to take on male traits. This transition is almost immediate once the testosterone is released in the developing fetus.
This fact gives us a clue as to how we can combine our parents into a single but distinct individual. To do this, we choose one of our most fit individuals as the female parent and create a copy. Then we randomly select a node in the copy’s tree to replace with a randomly selected node from the father. The result, a unique individual having traits of both parents. This process is sometimes repeated swapping the gender roles of the two parents, thus giving us two new individuals for the next generation. Copies of the parents are used for that each time a new individual is created they are created from the original parents and not perturbed derivations. Typically, you want the crossover or mutation process to only change a small amount of the DNA. In this way, small changes propagate over many generation until a fit candidate if found for the solution.
One of the most popular approaches to mutation randomly selects a sub-tree of the parent and randomly generates a new sub-tree at that location. There are many forms of mutation and no one seems to produce consistently better results in every situation. Each have their own place and application.
In this article we have just skimmed the surface of a fascinating branch of computer science. I have attempted to explain GP in as simplistic terms as possible and provide more of an introduction, giving general ideals rather than provide specifics for implementation. In the next installment we will try our hand at some very simple GP programs and introduce some more advanced topics in GP. Until then, you may find the following books helpful.
- Genetic Programming
- Authors: Moraglio, A., Silva, S., Krawiec, K., Machado, P., Cotta, C.
- ISBN: 978-3-642-29138-8
- Cartesian Genetic Programming
- Author: Miller, Julian F. (Ed.)
- ISBN: 978-3-642-17310-3
- Genetic Programming, An Introduction
- Authors: Banzhaf & Nordin & Keller & Francone
- ISBN: 9781558605107
- Genetic Programming: On the Programming of Computers by Means of Natural Selection
- URL: Genetic Programming