Team:Alberta/Project/Modeling

From 2009.igem.org

Revision as of 08:21, 18 October 2009 by Ebennett

University of Alberta - BioBytes

Modeling

Literature information is insufficient for making good predictions about which genes are required for a minimal E. coli metabolism

Many of the essential gene lists in the literature are based either on single gene deletions or on the metabolic pathways of Mycoplasma rather than E. coli. Because of pathway redundancy, many metabolic genes that would be essential in a minimal genome do not appear essential in single gene knockout experiments. And because E. coli possesses structures such as the cell wall that are not present in Mycoplasma, minimal genome proposals based on Mycoplasma are not sufficient.

To test which metabolic genes are essential for E. coli, we developed a computer model.

A genome-scale metabolic model of E. coli MG1655, built by the Palsson group at the University of California, San Diego, was used for our modeling experiments. Additionally, the Cobra Toolbox developed by the Systems Biology Research Group was used to interface the model with MATLAB.

A series of multiple gene deletions was performed in the model to determine the essential metabolic genes. These were compared with the genes selected from the literature, and each gene was either added to our master essential gene list or removed from it, based on its function and its involvement in various metabolic pathways. Additionally, the media conditions of the cell's environment were altered, allowing us to predict the conditions that should be applied to the minimal cell once it is developed. Once completed, our MATLAB model contributed 116 genes to our master gene list.

Metabolic modeling allows for computational analysis of entire genomes, which would be impractical to accomplish by experiment alone. The various sources and methods used to collect data have allowed for a unique gene list which we believe has the best chance of producing a minimal genome.

The MATLAB protocols demonstrated in the modeling section can be used to identify any organism’s essential genes provided a model is available.

Constraint Based Flux Analysis – Cobra Toolbox and SBML

Constraint Based Flux Analysis is a mathematical way of representing biological system information that allows for easy manipulation of this data. It is based on a stoichiometric matrix in which each entry belongs to an individual enzymatic or transport reaction that has been characterized inside the organism. In other words, it computationally represents each reaction using a linear array of numbers, namely the stoichiometric coefficients of the compounds the reaction consumes and produces (see Figure 1).

The flux can be defined as the amount of substrate converted to product per unit time for each individual reaction. The model assumes that the system is at steady state; therefore, the net production of every internal compound is zero, since each product becomes a substrate of another reaction. Everything that enters the system must also leave it. A compound can leave the system across its boundary (changing the compound's boundary condition can make it unavailable to the system) or by being used to produce growth of the organism. A master growth equation determines which products are required for the cell to grow, and growth is expressed in units of dry weight (DW) per unit time. Systems Biology Markup Language (SBML), a standard file format for such models, and the Cobra Toolbox (produced by the Systems Biology Research Group) allow flux analysis to be performed in MATLAB.
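The steady-state assumption can be illustrated with a minimal sketch. It is written in Python rather than MATLAB so that it runs without the Cobra Toolbox, and the three-reaction network, its names, and its flux values are all hypothetical, not taken from the E. coli model. It only checks that a flux vector leaves every internal compound balanced.

```python
# Hypothetical toy network: A_ext -> A -> B -> growth.
# Rows are internal compounds, columns are reactions.
compounds = ["A", "B"]
reactions = ["uptake_A", "A_to_B", "growth"]
S = [
    [1, -1, 0],   # A: produced by uptake_A, consumed by A_to_B
    [0, 1, -1],   # B: produced by A_to_B, consumed by growth
]

def net_production(S, v):
    """Return S*v, the net production rate of each internal compound."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]

def is_steady_state(S, v, tol=1e-9):
    """A flux vector is at steady state when no internal compound accumulates."""
    return all(abs(rate) <= tol for rate in net_production(S, v))

balanced = [10.0, 10.0, 10.0]    # everything taken up ends in growth
unbalanced = [10.0, 8.0, 8.0]    # compound A would accumulate

print(is_steady_state(S, balanced), is_steady_state(S, unbalanced))
```

A flux analysis then searches among the balanced flux vectors for the one with the largest growth flux; that optimization step is left out of this sketch.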

Finding Essential Genes

We can perform single gene knockouts in the E. coli metabolic gene model and determine the resulting growth rate. Knockouts that result in a growth rate of approximately zero are put in a list for the researcher to study later. Deleting any one of these genes on its own prevents growth in the model, so these genes are clearly essential under the simulated conditions.

FindEssentialGenes.m

The Cobra Toolbox makes this single gene deletion analysis for every gene very easy. We found 187 of the model's genes to be essential with this code.
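The idea behind a single gene deletion scan can be sketched as follows. This is not the FindEssentialGenes.m code; it is a Python illustration on a hypothetical toy model in which growth is simply "on" (1.0) or "off" (0.0), standing in for the growth rate that a flux analysis would return. The gene names, precursor names, and tolerance are assumptions.

```python
# Hypothetical toy model: each biomass precursor maps to the alternative
# gene sets (isozymes or parallel routes) able to supply it.
TOY_MODEL = {
    "precursor_1": [{"g1"}],            # one route only
    "precursor_2": [{"g2"}, {"g3"}],    # two redundant routes
    "precursor_3": [{"g4", "g5"}],      # one route needing both genes
}

def growth_rate(model, deleted):
    """Toy stand-in for a flux analysis: 1.0 if every precursor still
    has an intact route, otherwise 0.0."""
    deleted = set(deleted)
    for routes in model.values():
        if not any(route.isdisjoint(deleted) for route in routes):
            return 0.0
    return 1.0

def find_essential_genes(model, tol=1e-6):
    """Delete each gene on its own and keep those that abolish growth."""
    genes = sorted(set().union(*(r for routes in model.values() for r in routes)))
    return [g for g in genes if growth_rate(model, [g]) <= tol]

essential = find_essential_genes(TOY_MODEL)
print(essential)
```

Genes g2 and g3 are not reported because each can cover for the other, which is exactly the redundancy that single deletions cannot see.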

After enriching the in silico media with various substances, the code was run again, and a list of 112 genes resulted. These results were analyzed, and another round of media enrichment followed, resulting in a list of 98 genes. Clearly, enriching the media relieves some genes of their function, perhaps by allowing the cell to bypass a reaction catalyzed by the protein that the gene codes for. Comparing this list of 98 genes to our previously compiled list of essential genes from the literature, we found many differences, which proved very helpful. Many of the genes missing from the literature list were added to our master list of essential genes.
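Why enrichment shortens the essential gene list can be shown with a small sketch, assuming the simplest possible case: a precursor is either synthesised by the cell or taken up from the medium. The precursor and gene names are hypothetical placeholders, and the code is Python, not part of the MATLAB tools described here.

```python
# Hypothetical toy data: gene(s) needed to synthesise each biomass precursor.
BIOSYNTHESIS = {
    "amino_acid_X": {"geneA"},
    "vitamin_Y": {"geneB"},
    "envelope_Z": {"geneC"},   # assumed not obtainable from any medium
}

def essential_genes(medium):
    """Genes stay essential only if the medium does not supply their product."""
    needed = set()
    for precursor, genes in BIOSYNTHESIS.items():
        if precursor not in medium:
            needed |= genes
    return sorted(needed)

on_minimal_medium = essential_genes(medium=set())
on_enriched_medium = essential_genes(medium={"amino_acid_X", "vitamin_Y"})
print(on_minimal_medium, on_enriched_medium)
```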

Obviously, since the code only analyzes single gene deletions, the E. coli cell would not survive with just these 98 genes (these 98 genes are not the minimal genome). Multiple gene deletions were not taken into account. For instance, consider 2 genes that at first appear inessential. When one is knocked out, the cell lives. When the other is knocked out, the cell lives. But suppose that once both genes in the pair are deleted, the cell dies. The single gene deletion analysis would not detect this.

Another code was written to find out how many pairs of genes are essential together in this way. We found that there are 81 of these pairs in the model (assuming glucose minimal media). To validate the model's claim of paired essentiality, we performed single and double gene knockouts in the lab for these essential pairs. See the Model Validation section for details.
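A search for such pairs can be sketched as below. This is an illustrative Python version on a hypothetical toy model with on/off growth, not the code used on the E. coli model; the gene names and tolerance are assumptions. Genes that are already essential on their own are excluded, so only pairs whose members are individually dispensable are reported.

```python
from itertools import combinations

# Hypothetical toy model: precursor -> alternative gene sets that can supply it.
TOY_MODEL = {
    "precursor_1": [{"g1"}],
    "precursor_2": [{"g2"}, {"g3"}],
    "precursor_3": [{"g4"}, {"g5"}, {"g6"}],
}

def growth_rate(model, deleted):
    """Toy stand-in for a flux analysis: 1.0 if every precursor keeps a route."""
    deleted = set(deleted)
    return 1.0 if all(
        any(route.isdisjoint(deleted) for route in routes)
        for routes in model.values()
    ) else 0.0

def find_essential_pairs(model, tol=1e-6):
    """Pairs of individually dispensable genes whose joint deletion stops growth."""
    genes = sorted(set().union(*(r for routes in model.values() for r in routes)))
    viable_alone = [g for g in genes if growth_rate(model, [g]) > tol]
    return [
        (a, b)
        for a, b in combinations(viable_alone, 2)
        if growth_rate(model, [a, b]) <= tol
    ]

pairs = find_essential_pairs(TOY_MODEL)
print(pairs)
```

No pair from g4, g5 and g6 is reported, because a third route still remains after any two of them are deleted.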

Finding the Model's Minimal Genome

It was our goal to find the model's minimal genome. It is clear that single gene knockout analysis would not suffice. Since performing every possible combination of multiple gene knockouts in the model would take most computers many years to accomplish, a different strategy was taken.
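A rough count shows why an exhaustive search is out of reach. The sketch below assumes the figure of 1260 metabolic genes quoted elsewhere on this page and simply counts the deletion combinations; it says nothing about how long a single growth calculation takes.

```python
from math import comb

n_genes = 1260                      # metabolic genes in the model (from this page)

n_pairs = comb(n_genes, 2)          # all double knockouts
n_subsets = 2 ** n_genes            # every possible combination of knockouts
n_digits = len(str(n_subsets))      # size of that number in decimal digits

print(n_pairs)      # 793170
print(n_digits)     # 380
```

All double knockouts are still feasible to enumerate, but the number of arbitrary combinations has hundreds of digits.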

There are a few factors that will affect the size of a minimal genome found with our code. One of them is the order of gene deletions. For example, deleting a particular gene may cause 3 other genes that were previously inessential to become essential. So depending on which gene was deleted first, the size of the resulting minimal genome after further deletions will change. Another factor to consider is the final growth rate. If you want a growth rate that is 90% of what it was originally, your minimal genome may be quite large. If you want a final growth rate of 10% of what it was originally, then your minimal genome can afford to become much smaller. Another factor is the growth medium. Highly enriched media will aid in achieving a smaller minimal genome. For these reasons, we wrote a code that accommodates these factors and allows the researcher to find the minimal genomes of any in silico organism (with an appropriate model) at a final growth rate of their choosing.

FindMinimalGenomes.m

The code finds the minimal genome by first starting with one random model gene. The gene is deleted from the model and the resulting growth rate is assessed. If the growth rate is acceptable (based on a threshold input argument from the researcher), the gene is permanently deleted. If not, the gene is put back into the model and is not touched again until the next iteration. The code repeats the process for the next random gene. The growth rate calculated at this point reflects not only this one gene deletion but also all previously permanently deleted genes. If the growth rate is deemed acceptable, that gene is also permanently deleted. This continues until every gene has been tried.

After every gene has been either permanently deleted or kept in the model, the whole process starts over with some important differences: the threshold growth rate (for assessing gene deletions) is lowered, and the permanently deleted genes stay deleted. After this iteration is over, the threshold growth rate is lowered again, still keeping the deleted genes out of the model. The iterations continue until the final growth rate (specified by the researcher) is reached. It may seem reasonable to skip the iterations and do the whole thing in one step, but it soon became obvious that these iterations were necessary to make the resulting minimal genome smaller. After the algorithm performs the last assessment at the specified final growth rate, the permanently deleted genes are compared with the original model's genes, and the genes that remain make up a minimal genome. The whole process is repeated again and again to find more minimal genomes with the same final growth rate.
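The procedure described above can be sketched in Python as follows. This is not FindMinimalGenomes.m: the toy model, the bottleneck growth function that stands in for a flux analysis, the function and parameter names, and the linear schedule of thresholds are all assumptions made for illustration.

```python
import random

# Hypothetical toy model: precursor -> {gene: capacity of the route it encodes}.
TOY_MODEL = {
    "p1": {"g1": 10},
    "p2": {"g2": 6, "g3": 4},
    "p3": {"g4": 5, "g5": 3, "g6": 2},
}

def growth_rate(model, deleted):
    """Toy stand-in for a flux analysis: growth is limited by the precursor
    with the least remaining supply capacity."""
    return min(
        sum(cap for gene, cap in routes.items() if gene not in deleted)
        for routes in model.values()
    )

def find_minimal_genome(model, final_fraction, n_steps=3, seed=None):
    """Randomised deletion passes with a threshold lowered at each pass."""
    rng = random.Random(seed)
    genes = sorted({g for routes in model.values() for g in routes})
    wild_type = growth_rate(model, set())
    deleted = set()
    for step in range(1, n_steps + 1):
        # Assumed schedule: thresholds fall linearly to final_fraction.
        fraction = 1.0 - (1.0 - final_fraction) * step / n_steps
        candidates = [g for g in genes if g not in deleted]
        rng.shuffle(candidates)
        for gene in candidates:
            trial = deleted | {gene}
            # Growth is assessed with all earlier permanent deletions in place.
            if growth_rate(model, trial) >= fraction * wild_type:
                deleted = trial          # permanent deletion
            # otherwise the gene is restored and retried in the next pass
    genome = [g for g in genes if g not in deleted]
    return genome, sorted(deleted)

# Repeat with different random orders to collect several minimal genomes.
genomes = {
    tuple(find_minimal_genome(TOY_MODEL, 0.25, n_steps=1, seed=s)[0])
    for s in range(20)
}
print(sorted(genomes))
```

Because growth can only fall as more genes are removed, every gene left at the end has failed a deletion test at the final threshold, so no single further deletion is acceptable. The toy is too small to show whether several passes give smaller genomes than one pass; it only shows the mechanics.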

After running this code, we found minimal genomes of varying size that averaged around 300 genes (around 24% of the model's original size), at a growth rate 73% lower than the original. After making many growth media changes, the minimal genomes that we found had an average size of around 190 genes (~15% of the model's original genome size). That's over 1050 gene knockouts out of 1260 metabolic genes with the cell still surviving! The smallest minimal genome found was made up of 178 genes.
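The percentages quoted above follow directly from the gene counts, as this short Python check shows. The helper name is hypothetical; the numbers are the ones given on this page.

```python
MODEL_GENES = 1260   # metabolic genes in the model

def summarize(genome_size, total=MODEL_GENES):
    """Express a minimal genome size as a share of the model and as knockouts."""
    return {
        "percent_of_model": round(100 * genome_size / total, 1),
        "knockouts": total - genome_size,
    }

for size in (300, 190, 178):
    print(size, summarize(size))
```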

Throughout the processes described above, we were constantly crosschecking our previously compiled list of researched genes from the literature against these model genes. Each gene in the model's minimal genome was researched further to judge whether or not it truly needed to be included in a minimal genome. Consequently, most of these model genes were included in our master list of essential genes needed for the minimal E. coli genome.

Modeling in MATLAB has proven to be an essential aspect of our project. In the process, we have created tools that other researchers can use as an aid in their search for an organism's essential genes and minimal genome.

Our Codes & How We Used Them

Our two most important codes are:

FindMinimalGenomes.m and

FindEssentialGenes.m

The rest are handy tools to help analyze and compare large gene lists created within the model, or imported from a database such as PEC.

Start.m - This sets the solver to glpk (the GNU Linear Programming Kit, which solves the linear optimization problem behind each growth calculation) and loads the model that the researcher will be working with.

FindEssentialGenes.m - Uses single gene deletion analysis to find a list of essential genes. A deletion of just one of these genes will cause cell death.

CheckForRepeats.m - Checks for repeats in a gene list (and can also remove the repeats if wanted). We use this to check our imported list (our researched literature essential gene list).

InvertList.m - "Inverts" the input List1 with respect to the input List2. The output is a list of genes in List2 that did not appear in List1. We use this to find the model genes that are absent from our imported literature list and that we would therefore be deleting from the model. This helps us analyze the effects of deleting genes that we previously thought to be inessential.

FindMatches.m - Finds the matches between two lists of genes. We use this to compare our researched literature essential genes with our model's predicted essential genes. It outputs the number of matches and the list of matches.
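The three list tools above amount to simple set-style operations. The sketch below is a Python illustration of what they are described as doing, not a translation of the MATLAB files; the function names, return shapes, and gene names are assumptions.

```python
def check_for_repeats(genes):
    """Return (genes with repeats removed, sorted list of the repeated genes)."""
    seen, repeats, unique = set(), set(), []
    for gene in genes:
        if gene in seen:
            repeats.add(gene)
        else:
            seen.add(gene)
            unique.append(gene)
    return unique, sorted(repeats)

def invert_list(list1, list2):
    """Genes in list2 that do not appear in list1."""
    exclude = set(list1)
    return [gene for gene in list2 if gene not in exclude]

def find_matches(list1, list2):
    """Return (number of matches, genes of list1 that also appear in list2)."""
    other = set(list2)
    matches = [gene for gene in list1 if gene in other]
    return len(matches), matches

# Hypothetical gene names for illustration.
literature = ["gA", "gB", "gB", "gC"]
model_genes = ["gB", "gC", "gD"]

literature, repeated = check_for_repeats(literature)
print(repeated)                               # ['gB']
print(invert_list(literature, model_genes))   # ['gD']
print(find_matches(literature, model_genes))  # (2, ['gB', 'gC'])
```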

MediaChange.m - We could reduce the number of genes that were essential by changing the in silico media. We added to the glucose minimal media:

  • 20 Standard Amino Acids
  • Adenosine, Guanosine, Uridine, Thymidine, Cytidine
  • Thiamine
  • Pantothenic Acid
  • Xylose
  • Shikimate
  • NMN
  • Meso Diaminopimelic Acid
  • Biotin

FindMinimalGenomes.m - This is the code that finds the model's minimal genome(s) through iterative semi-random multiple gene deletions. Because many genes code for the same reaction, many reactions perform the same function, and one gene may control a few reactions, there are easily thousands of potential minimal genomes. This code finds many of them. We are not necessarily interested in the smallest minimal genome; we want to use the resulting lists to gain a better understanding of which genes should be included in an actual minimal genome that could be created in a real lab environment. You can specify the number of minimal genomes you want (the more you specify, the longer it will take). You can also specify the final growth rate wanted, and 2 other factors relating to how the algorithm is carried out, which affect the time it takes and how small the minimal genomes will be. It outputs an array of minimal genome lists, and can also output the list of genes that are deleted for each minimal genome.

MaxMinAvg.m - This is a simple code which lets the user know the largest minimal genome, the smallest minimal genome, and the average size of the minimal genomes created in the FindMinimalGenomes.m code. It can be used on other gene lists (for instance, on an array of gene lists that represent what genes are missing from the researched literature list with respect to each minimal genome found).
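A summary of this kind takes only a few lines. The Python sketch below assumes the input is a list of gene lists and returns the three sizes as a tuple; the function name and return shape are illustrative, not those of the MATLAB file.

```python
def max_min_avg(gene_lists):
    """Return (largest size, smallest size, average size) of the gene lists."""
    sizes = [len(genes) for genes in gene_lists]
    return max(sizes), min(sizes), sum(sizes) / len(sizes)

# Hypothetical minimal genomes for illustration.
genomes = [["gA", "gB", "gC"], ["gA", "gB"], ["gA", "gB", "gC", "gD"]]
print(max_min_avg(genomes))   # (4, 2, 3.0)
```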

InvertListArray.m - "Inverts" input1 with respect to every gene list in input2. input2 may include any number of gene lists. This code is typically used when comparing your researched literature gene list with each of the model's minimal genome lists (produced by FindMinimalGenomes).

FindMatchesArray.m - Typically only used with the array of minimal genomes and the researched literature essential genes. Outputs a list of matches for every minimal genome creating the output MatchListArray. The code also outputs the number of matches for each list and the position of the smallest difference between the model's minimal genome and your designed minimal genome (designed from research/literature). This is very helpful when trying to find what genes you are missing in your researched literature essential genes list when comparing them with the many minimal genomes produced by our code.
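The two array tools apply the single-list operations to every minimal genome in turn. The Python sketch below makes one assumption worth stating: the "difference" between a minimal genome and the literature list is taken to be the number of genome genes missing from the literature list, which the description above does not spell out. Positions are counted from 0 here, whereas MATLAB counts from 1. All names are hypothetical.

```python
def invert_list_array(list1, list_array):
    """For every list in list_array, the genes that do not appear in list1."""
    exclude = set(list1)
    return [[g for g in genes if g not in exclude] for genes in list_array]

def find_matches_array(literature, genome_array):
    """Return (match lists, match counts, position of the closest genome)."""
    known = set(literature)
    match_lists = [[g for g in genome if g in known] for genome in genome_array]
    counts = [len(matches) for matches in match_lists]
    # Assumed definition: genome genes that the literature list lacks.
    differences = [len(genome) - n for genome, n in zip(genome_array, counts)]
    closest = differences.index(min(differences))
    return match_lists, counts, closest

# Hypothetical data for illustration.
literature = ["gA", "gB", "gC"]
genomes = [["gA", "gD", "gE"], ["gA", "gB", "gD"], ["gB", "gC"]]

missing = invert_list_array(literature, genomes)
match_lists, counts, closest = find_matches_array(literature, genomes)
print(missing)    # [['gD', 'gE'], ['gD'], []]
print(counts)     # [1, 2, 2]
print(closest)    # 2
```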

CellToMatrix.m - Code that makes it more convenient for you to copy and paste the array of lists into Excel. Without this step, you'd have to copy and paste the hundreds of unique minimal genome lists one at a time.
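The same convenience can be sketched in Python: lists of unequal length are padded into a rectangular block and written as tab-separated text, which spreadsheet programs accept on paste. Placing one genome per row and padding with empty cells are assumptions; the description above does not say how the MATLAB code lays out the matrix.

```python
import csv
import io

def lists_to_matrix(gene_lists, fill=""):
    """Pad every gene list to the length of the longest one."""
    width = max(len(genes) for genes in gene_lists)
    return [list(genes) + [fill] * (width - len(genes)) for genes in gene_lists]

def matrix_to_tsv(matrix):
    """Render the matrix as tab-separated text, one gene list per row."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(matrix)
    return buffer.getvalue()

# Hypothetical minimal genomes for illustration.
genomes = [["gA", "gB", "gC"], ["gA", "gD"]]
matrix = lists_to_matrix(genomes)
print(matrix_to_tsv(matrix), end="")
```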