Team:Waterloo/Modeling
From 2009.igem.org
Abstract
The primary goal of our software was to model integrase-mediated DNA rearrangement. Once the software could simulate a biological system to steady state, our secondary goal was to generate the initial reactants needed to arrive at a given solution.
Introduction
The mathematical modeling component of this year's project consisted of a computational simulation of DNA recombination as mediated by the ΦC31 integrase enzyme. The need for this simulation arose directly from challenges the design team faced in trying to create a recursively repeatable technique for inserting sequences of interest into chromosomes. Specifically, manually examining the possible results of interactions between DNA strands quickly becomes infeasible because of the number of potential reaction pathways.
The first stage of the modeling project was therefore to formally codify the reaction rules employed by the design team, so that computational power could be applied to the problem. The main challenge at this stage was to abstract DNA strands into a computationally workable form and to develop mathematically rigorous definitions of the behaviour of reaction sites.
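One way such an abstraction might look is sketched below. This is an illustration only, not the team's actual data model: the `Strand` class, its fields, and the element names (`attB`, `geneX`, `ori`) are all hypothetical. It treats a strand as an ordered tuple of named elements and gives circular strands a canonical rotation, so that two descriptions of the same plasmid compare as equal.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Strand:
    """Hypothetical abstraction: a DNA strand as an ordered tuple of named elements."""
    elements: Tuple[str, ...]
    circular: bool = True

    def canonical(self) -> "Strand":
        """Return a rotation-independent form of a circular strand.

        A circular plasmid has no natural starting point, so we pick the
        lexicographically smallest rotation. Linear strands are unchanged.
        """
        if not self.circular or not self.elements:
            return self
        n = len(self.elements)
        rotations = [self.elements[i:] + self.elements[:i] for i in range(n)]
        return Strand(min(rotations), True)


# Two ways of writing down the same circular plasmid.
plasmid_a = Strand(("attB", "geneX", "attB", "ori"))
plasmid_b = Strand(("geneX", "attB", "ori", "attB"))
same_plasmid = plasmid_a.canonical() == plasmid_b.canonical()
```

Because the class is frozen and therefore hashable, canonical strands can be stored in a set, which makes it cheap to recognise a product that has already been generated. The sketch ignores site orientation and strand direction, both of which a real model would need.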
Formally, the overall objective of the modeling project was to determine a finite, deterministic sequence of att sites and their enclosed operators that would allow one to predictably insert all of the desired sequences into a chromosome.
We postulated that the solution would take the form of a sequence of two or three plasmids, each containing at least one matching set of att sites together with several incomplete att sites.
Two general approaches were used in modeling. Software development toward a top-down (inductive, brute force) solver has finished. Characterization of the algorithm underpinning the solver, however, indicated that the problem is at least NP-hard and possibly NP-complete. An auxiliary approach was also attempted, in which we tried to map the sequence problem onto a mathematical problem with known solutions.
Software
To run the solver, we had to make a few assumptions. First, because we do not know which combination of sequences with att sites is part of the solution, we assume that any product in our search space is fair game for the next generation of reactions. We further assume that any plasmid with valid att sites and complementary operators is capable of self-reacting and also of reacting with any other plasmid in the history of the modeled cell. Second, because the search space grows exponentially, we assume that the smallest existing solution can be found within the search space generated after reacting 10E7 plasmids. This second assumption gives the search a sane termination condition.
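The two assumptions above can be sketched as a generation-by-generation search. This is a minimal illustration, not the team's solver: `explore`, `toy_react`, and the `max_species` cap are hypothetical names, and the reaction rule is a deliberately crude stand-in that fuses two circular strands at the first attB of one and the first attP of the other.

```python
def toy_react(a, b):
    """Hypothetical rule: fuse circle `a` (at its first attB) with circle `b`
    (at its first attP) into one circle carrying attL and attR."""
    if "attB" not in a or "attP" not in b:
        return []
    i, j = a.index("attB"), b.index("attP")
    rest_a = a[i + 1:] + a[:i]  # `a` opened at its attB, site removed
    rest_b = b[j + 1:] + b[:j]  # `b` opened at its attP, site removed
    return [("attL",) + rest_a + ("attR",) + rest_b]


def explore(initial, react, max_species=10_000):
    """Brute-force search: every new product may react with itself and with
    every species seen so far. Returns (species, reached_steady_state)."""
    seen = set(initial)
    frontier = list(seen)
    while frontier:
        new = []
        for a in frontier:
            for b in list(seen):  # copy, because `seen` grows inside the loop
                for product in react(a, b) + react(b, a):
                    if product not in seen:
                        seen.add(product)
                        new.append(product)
                        if len(seen) >= max_species:
                            return seen, False  # cap hit before steady state
        frontier = new
    return seen, True  # no new products: steady state


# One site on each strand: the search settles after a single fusion.
small, small_done = explore([("attB", "geneX"), ("attP", "marker")], toy_react)

# Several sites on a self-reacting strand: products keep growing until the cap.
big, big_done = explore([("attB", "attP") * 3], toy_react, max_species=50)
```

With the single-site inputs the search stops at three species. With the multi-site input every product still carries reactive sites, so the species count grows until the cap ends the run, which is the role the termination assumption plays in the text.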
Math
An ancillary branch of the project grew out of the need to tame the exponential behaviour of the problem. There may exist some mathematics that inherently facilitates modeling and solving this problem. We explored areas dealing mainly with topology (knot theory) and functional reasoning (lambda calculus, combinatory logic), but in the end could not identify a good candidate to serve as a scaffold for our solution.
Results
To test the program, we ensured that it was capable of performing cassette exchange. In cassette exchange, a plasmid has a gene of interest flanked by two attB sites and the chromosome has a marker flanked by two attP sites. After enzyme-mediated recombination, the gene of interest should be in the chromosome, where the flanking sites will be changed to attL, and the marker will be in the plasmid, where the flanking sites will be changed to attR. No other products are possible after this reaction has run to completion. Our program was able to correctly perform the recombination and emulate the selection process on various types of selectable media.
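The cassette exchange test can be illustrated with a toy version. This is a sketch under stated assumptions rather than the program's real logic: `flanked_segment` and `cassette_exchange` are hypothetical helpers, strands are plain lists of element names, site orientation is ignored, and the product sites are relabelled exactly as described in the paragraph above (attL on the chromosome, attR on the plasmid).

```python
def flanked_segment(strand, site):
    """Return the indices of the first two occurrences of `site` in `strand`."""
    positions = [k for k, element in enumerate(strand) if element == site]
    if len(positions) < 2:
        raise ValueError(f"need two {site} sites, found {len(positions)}")
    return positions[0], positions[1]


def cassette_exchange(chromosome, plasmid):
    """Swap the attP-flanked chromosome cassette with the attB-flanked
    plasmid cassette, relabelling the sites as stated in the text."""
    ci, cj = flanked_segment(chromosome, "attP")
    pi, pj = flanked_segment(plasmid, "attB")
    chromosome_cassette = chromosome[ci + 1:cj]
    plasmid_cassette = plasmid[pi + 1:pj]
    new_chromosome = (chromosome[:ci] + ["attL"] + plasmid_cassette
                      + ["attL"] + chromosome[cj + 1:])
    new_plasmid = (plasmid[:pi] + ["attR"] + chromosome_cassette
                   + ["attR"] + plasmid[pj + 1:])
    return new_chromosome, new_plasmid


# Element names other than the att sites are placeholders.
chromosome = ["chrA", "attP", "marker", "attP", "chrB"]
plasmid = ["attB", "gene", "attB", "ori"]
new_chromosome, new_plasmid = cassette_exchange(chromosome, plasmid)
```

After the call, `new_chromosome` holds the gene between two attL sites and `new_plasmid` holds the marker between two attR sites. Neither product contains an attB or attP site, which matches the statement that no further products are possible once the reaction has run to completion.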
One of our team members designed a theoretical stackable recombination system. We converted this into a format usable by the program and ran it. Unfortunately, the program ran out of memory before reaching steady state. Even after several rounds of optimization, and after running the program on a computer with 12 GB of RAM provided by Dr. Moreno's lab at Wilfrid Laurier University, we were still unable to run the program to steady state.
The combinatorial explosion of reaction products of the integrase reaction was far greater than anticipated. Even increasing the selection pressure beyond the point of biological possibility failed to control the combinatorial explosion.
Conclusion
Because of the sheer number of permutations arising in our biological system, the initial assumptions on which we built our software proved incorrect. That is, given our hardware resources, it was not feasible for a brute force algorithm to reach steady state.
From the beginning of the project until our actual simulations, emphasis was placed on algorithm design rather than on the choice of language or the underlying implementation of data structures. Once it had been shown empirically that our solution was not sufficient, a C++ port was started to maximize efficiency given a finite amount of computing power. At the time of writing, however, the port has yet to be finished. Given the combinatorial explosion seen in the Python version, we hope that the C++ version will have better memory efficiency, but this may not be the case. If it does not, effort must be put into devising a new algorithm rather than into optimizing the program.
Steps the team could have taken that might have led to our simulations reaching steady state include:
- Optimizing at the lowest level from start to finish.
- Using distributed computing techniques.
- Using a field programmable gate array (FPGA) rather than a conventional computing solution.
- Exploring non-brute force algorithms further.
Following the Jamboree, the modeling team is likely to pursue solutions along two avenues: optimizing our current algorithm, be it in Python or otherwise, and continuing to search for non-obvious, rigorous methods of computing a biological system's steady state.