University of Alberta - BioBytes

Our Modeling Revolution

Using our novel modeling system, we’ve identified 116 E.coli metabolic genes never before predicted to be essential in a minimal genome. Moreover we’ve determined optimal media conditions for the growth of our proposed minimal genome E.coli. We provide a cutting edge, streamlined tool for determining essential genes in different media conditions, a tool essential for the genome scale construction projects that synthetic biology aspires to. Using our model advances, the development of viable synthetic genomes is now within reach – one of the most important modeling advances within the field of synthetic biology.

Why Use Modeling?

Literature information is insufficient for making good predictions about what genes are required for a minimal E.coli metabolism

Many of the literature essential gene lists are either based on single gene deletions or the metabolic pathways of mycoplasma rather than E.coli. Due to pathway redundancy, many metabolic genes that would be essential in a minimal genome would not appear essential in single gene knock out experiments. As E.coli possesses structures such as the cell wall that are not present in mycoplasma, minimal mycoplasma genome propositions are not sufficient.

To test which metabolic genes are essential for E.coli, we developed a computer model.

The E.coli MG1655 genome was modeled by the Palsson group at the University of San Diego and was used for our modeling experiments. Additionally, the Cobra Toolbox developed by the System`s Biology Research Group was used to interface the model with the MATLAB program.

A series of multiple gene deletions were performed using the model in order to determine the essential metabolic genes. These were compared to the literature genes selected and based on the individual gene`s function and involvement in various metabolic pathways, the gene was either added to our master essential gene list or removed. Additionally, media conditions were altered for the cells environment allowing for predictions for the conditions which should be applied to the minimal cell once it is developed. Once completed our MATLAB model contributed 116 genes to our master gene list.

Metabolic modeling allows for computational analysis of entire genomes which would be impossible to accomplish any other way. The various sources and methods used to collect data has allowed for an unique gene list which has the best possible chance of producing a minimal genome.

The MATLAB protocols demonstrated in the modeling section can be used to identify any organism's essential genes provided a model is available.

Constraint Based Flux Analysis – Cobra Toolbox and SBML

Constraint Based Flux Analysis is a mathematical way of representing biological system information and allows for easy manipulation of this data. It is based on a stoichiometric matrix of reactions which correspond to individual enzymatic or transport reactions which have been characterized inside of the organism. In other words, it computationally represents each reaction using a linear array of numbers.

The flux can be defined as the amount of substrate moving to product for each individual reaction. The model assumes that the system is at steady state therefore, the overall flux is zero since each product becomes a substrate of another reaction. All substrates entering the system will have the same amount leaving the system. The products leaving the system can be removed by changing the boundary condition of the compound (that is making it unavailable to the system) or by using it to produce growth of the organism. A master growth equation determines which products are required for the cell to grow and this represented in units of DW/unit time. Systems Biology Markup Language (SBML) and the Cobra Toolbox (both produced from System’s Biology Research Group) allows for flux analysis to be performed in the MATLAB program.

Finding Essential Genes

We can perform single gene knockouts in the E. coli metabolic gene model and determine the resulting growth rate. Knockouts that result in a growth rate of approximately zero are put in a list for the researcher to study later. Deletion of any one of these genes will result in no growth. These genes are clearly essential.


The Cobra Toolbox makes this single gene deletion analysis for every gene very easy. We found 187 of the model's genes to be essential with this code.

After enriching the in silico media with various substances, the code was run again, and a list of 112 genes resulted. These results were analyzed, and another iteration of media enriching occurred, resulting in a list of 98 genes. Clearly, enriching the media relieves some genes of their function, perhaps by allowing the bypassing of a reaction catalyzed by a protein coded for by that gene. Comparing this list of 98 genes to our previously complied list of researched literature essential genes, we found many differences which we found to be very helpful. Many of these missing genes were added to our master list of essential genes.

Obviously, since the code only analyses single gene deletions, the E. coli cell would not survive with just these 98 genes (these 98 genes are not the minimal genome). Multiple gene deletions were not taken into account. For instance, consider 2 genes that at first appear inessential. When one is knocked out, the cell lives. When the other is knocked out, the cell lives. But lets say that once both in the pair of genes is deleted, the cell dies. The single gene deletion analysis would not detect this.

Another code was written to find out how many pair of genes are essentially entangled in this way. We found that there are 81 of these pairs in the model (assuming glucose minimal media). To validate this model's claim of paired essentiality, we performed single and double gene knockouts in the lab for these essential pairs. See the Model Validation section for details.

Finding the Model's Minimal Genome

It was our goal to find the model's minimal genome. It is clear that single gene knockout analysis would not suffice. Since performing every possible combination of multiple gene knockouts in the model would take most computers many years to accomplish, a different strategy was taken.

There are a few factors that will affect the size of a minimal genome found with our code. One of them is the order of gene deletions. For example, once you delete a particular gene, it may cause 3 other genes that were previously inessential to become essential. So depending on which gene was deleted first, the size of the resulting minimal genome after further deletions will change. Another factor to consider is the final growth rate. If you want a growth rate that is 90% what it was originally, your minimal genome may be quite large. If you want a final growth rate of 10% of what it was originally, then your minimal genome can afford to become much smaller. Another factor is the growth medium. Highly enriched media will aid in achieving a smaller minimal genome. For these reasons, we wrote a code that will accommodate these factors, and allow the researcher to find the minimal genomes of any in silico organism (with an appropriate model), having the ability to choose final growth rates.


The code finds the minimal genome by first starting with one random model gene. It is deleted from the model and the resulting growth rate is assessed. If the growth rate is acceptable (based on a threshold input argument from the researcher), the gene is permanently deleted. If not, the gene is put back into the model, not to be touched. The code repeats the process for the next random gene. When the resulting growth rate is calculated, it is not only assessing the impact of this one gene deletion. It assess the impact of all previously permanently deleted genes. If the growth rate is deemed acceptable, then that gene is also permanently deleted. It repeats this process until it has gone through every gene. After every gene has been either permanently deleted or kept in the model, the whole thing starts over again with some important differences. This time, the threshold growth rate (for assessing gene deletions) is lowered, and there are permanently deleted to ignore. After this iteration of deletions is over, the threshold growth rate is again lowered while keeping the deleted genes out of the model. The iterations (deletion cycles) continue until the final growth rate (specified by the researcher) is reached. It may seem reasonable to skip the multiple deletion cycles and do the whole thing in one step (one iteration), but it soon became obvious that multiple deletion cycles at decreasing growth rate thresholds were necessary to make the resulting minimal genome smaller. After the algorithm performs the last assessment at the specified final growth rate, the permanently deleted genes are subtracted from the original model genes, and a minimal genome is found. The whole process is repeated again and again to find more minimal genomes with the same final growth rate.


After running this code for the glucose minimal media model, we found minimal genomes varying in size that averaged around 300 genes (which is around 24% of the original size) with a 73% decrease from its original growth rate. After making many growth media changes (see the MediaChange code near the bottom of the page), the minimal genomes that we found had an average size of around 190 (~15% of the model's original genome size). That's over 1050 gene knockouts out of 1260 metabolic genes with the cell still surviving! The smallest minimal genome was found to be made up of 178 genes.

Throughout the processes described above, we were constantly crosschecking our previously compiled list of researched genes from the literature with these model genes. Each gene in the model's minimal genome were researched further to judge whether or not it truly needed to be included in a minimal genome. Consequently, most of these model genes were included in our master list of essential genes needed for the minimal E. coli genome.

Algorithm details for these results:

The initial growth rate threshold was set to 95% of the initial growth rate. This threshold was decreased by 17% every iteration (deletion cycle) until the final growth rate that was 27.15% of the initial growth rate. 76 minimal genomes were found. The smallest was 282 genes in size and and the biggest was 312. (Every list differed from every other list by an average of around 50 genes - even lists of the same size.) After media changes, this process was repeated again. The initial growth rate threshold was set to 91.2% of the initial growth rate. This threshold decreased by about 15.1% for every deletion cycle until the final growth rate was 0.57% of the initial growth rate. The smallest genome found after this was executed was 178 genes in size. The largest was 214.

Modeling in MATLAB has proven to be an essential aspect of our project. And in the process, we have created tools for other researchers to use as aid in their search for organism's essential genes and minimal genome.

Our Codes & How We Used Them

Our two most important codes are:



The rest are handy tools to help analyze and compare large gene lists created within the model, or imported from a database such as the PEC database.

Start.m - this will set the solver to glpk (a way of solving the vast amounts of matrix equations involved with flux balance analysis), and load the model that the researcher will be working with.

FindEssentialGenes.m - Uses single gene deletion analysis to find a list of essential genes. A deletion of just one of these genes will cause cell death.

CheckForRepeats.m - Checks for repeats in a gene list (and can also remove the repeats if wanted).

InvertList.m - "Inverts" the input List1 with respect to the input List2. The output is a list of genes in List2 that did not appear in List1. It can be used if one wants to find the inessential genes that need to be deleted from a model given an essential gene list (either from a database or compiled from many literature sources). This helps the researcher analyze the effects of deleting genes that were previously thought to be inessential.

FindMatches.m - Finds the matches between two lists of genes. Can be used to compare the researched literature essential genes with the model's predicted essential genes. It outputs the number of matches and the list of matches.

MediaChange.m We reduced the amount of genes that were essential by changing the in silico media. In the end, we added to the glucose minimal media:

  • 20 Standard Amino Acids
  • Adenosine, Guanosine, Uridine, Thymidine, Cytidine
  • Thiamine
  • Pantothenic Acid
  • Xylose
  • Shikimate
  • NMN
  • Meso Diaminopimelic Acid
  • Biotin

FindMinimalGenomes.m - This code outputs an array of minimal genome lists, the genes that were deleted for every list, and the resulting models with the genes deleted. Input arguments include the number of minimal genomes to be found, the final growth rate wanted, and two other factors relating to how the algorithm is carried out which will affect the time it takes, and how small the minimal genomes will be.

MaxMinAvg.m - This is a simple code which lets the user know the largest minimal genome, the smallest minimal genome, and the average size of the minimal genomes created in the FindMinimalGenomes.m code. It can be used on other gene lists (for instance, on an array of gene lists that represent what genes are missing from the researched literature list with respect to each minimal genome found).

InvertListArray.m - "Inverts" input2 with respect to every gene list in input1. input1 may include any number of gene lists. This code is typically used when comparing the researched literature gene list with each of the model's minimal genome lists (produced by FindMinimalGenomes).

FindMatchesArray.m - Typically only used with the array of minimal genomes and the researched literature essential genes. Outputs a list of matches for every minimal genome creating the output MatchListArray. The code also outputs the number of matches for each list and the position of the smallest difference between the model's minimal genome and the designed minimal genome (designed from research/literature). This is very helpful when trying to find what genes are missing in the researched literature essential genes list when compared to many minimal genomes produced by our code.

CellToMatrix.m - Code that makes it more convenient for you to copy and paste the cell array data into Excel.

Modeling Validation

In order to ensure that the information being provided by the model is accurate, a series of dual deletions are being tested in the lab in order to ensure that what is being observed in silico is seen in vivo. These dual deletions are predicted to kill the cell only in combination. From this experiment, deleting the genes individually should still result in cell viability while deleting both should result in the death of the cell.

Click here for more...