Team:Heidelberg/HEARTBEAT gui

From 2009.igem.org

(Difference between revisions)
(Do you speak "subparts"?)
(HEARTBEAT-GUI: Your promoters your rules)
Line 38: Line 38:
Obviously, a linguistic representation of DNA-code on a lower level of abstraction – the level of subparts – is needed. To stick with the example above: If we describe the same promoter with subpart-notions, we achieve a balance of abstraction and unambiguousness intelligible for both computers and humans: “A promoter with only the following subparts: core promoter: TATA-Box; proximal promoter: "1 x TFBS for HIF (hypoxia induced factor) at optimal position, 1 x TFBS for NF".
Obviously, a linguistic representation of DNA-code on a lower level of abstraction – the level of subparts – is needed. To stick with the example above: If we describe the same promoter with subpart-notions, we achieve a balance of abstraction and unambiguousness intelligible for both computers and humans: “A promoter with only the following subparts: core promoter: TATA-Box; proximal promoter: "1 x TFBS for HIF (hypoxia induced factor) at optimal position, 1 x TFBS for NF".
This subpart-based linguistic representation of genetic information achieves a balance of abstraction and unambiguousness intelligible for both computers and humans. We therefore defined the subpart notion as fundament of the input format for a DNA-code – human language interface.
This subpart-based linguistic representation of genetic information achieves a balance of abstraction and unambiguousness intelligible for both computers and humans. We therefore defined the subpart notion as fundament of the input format for a DNA-code – human language interface.
 +
 +
==Practical implementation==
 +
===(inter)action guarenteed===
 +
 +
The straight forward solution would be to turn this input format into a full programming language. For this purpose the subpart-based input logic would have to be extended with a collection of functions and commands related to the creation and processing of genetic information. Furthermore a compiler which is able to directly translate the input format into DNA-code would be necessary.
 +
To design a promoter scientists would simply have to write a program code which could start like this:
 +
*Create $promoter_sequence (600 bp);
 +
*Insert TFBS (Sp1, opt);
 +
*…
 +
This textfile would subsequently be interpreted by the compiler and the corresponding quaternary DNA code would be displayed.
 +
 +
Second, the interface could be implemented as a graphical user interface (GUI) which would collect the necessary information through interactive formulars. This solution has major advantages over the use of a mere DNA programming language:
 +
 +
The syntax for the assembly of subparts to valid genetic information is highly complex. The final function of a concatenation of subparts can for example be influenced by the positioning of or the interaction between different subparts. In addition, biological processes exhibit a variety of context dependent exceptions. An illustrative example is the context sensitivity of the biobrick alpha cloning standard [http://partsregistry.org/Assembly:RBS-CDS_issues].
 +
Because of these difficulties, the work with a mere DNA programming language becomes very unhandy. Interactive graphical user interfaces can reduce this complexity:
 +
They can provide real-time control of user inputs. Moreover, the interface can display graphical information in order to support the current design step. Moreover they can compute the implications of the input parameters that the user has already supplied.
 +
The final compilation of the input data into DNA-code can be carried out in separated steps allowing the user to control the process.
= HEARTBEAT Features =
= HEARTBEAT Features =

Revision as of 00:02, 22 October 2009

HEARTBEAT-GUI: Your promoters your rules

Introduction

Why you spend too much time reading quaternary code?

Computers save and process information in binary code whereas humans use completely different semantic systems comprising speech, alphanumerical texts and graphics. Still communication with computers is fairly easy because of interfaces which can translate between binary code and human language. Only after this abstraction of binary code was accomplished computers could unfold their full potential. We believe that synthetic biology would greatly benefit from an analogue paradigm shift in the handling of genetic information: Cells save and process genetic information in the form of DNA and RNA representing i.e. quaternary code. Because this code is as unintelligible to us as binary code we need interfaces to translate between human language and DNA-code. Based on our proposition for the subpart-based modeling of the properties of biological parts, we have developed a concept to meet this challenge. Although this concept is versatilely applicable we will explain our approach on the basis of our implementation of this concept in the HEARTBEAT (Heidelberg Artificial Transcription Factor Binding Site Engineering and Assembly Tool) project.

Point of departure

The secret romance between biologists and text editors

The conventional way to make the HEARTBEAT project accessible for the synthetic biology community would be, to publish the modeling part and to create an interface to access the parameters in the HEARTBEAT-database (DB). In order to be able to design a synthetic promoter, researchers would then need to acquaint themselves with the model, retrieve the data they are interested in and open their favorite text editor to implement the following algorithm:

1. Insert a number of N's (random base pair) corresponding to your promoter length
2. Use the character count function of your text editor to copy and paste the binding motives at the right position in the promoter.
3. Copy and Paste the sequence in a cloning software to find unwanted restriction sites. Remove the restrictions sites manually.
4. Copy and Paste the sequence into a transcription factor binding site (TFBS) detection tool. Remove the TFBS manually.
5. Have you thought of placing the TFBS relative to the TSS and not relative to the end of the sequence in your text editor?

Why is the design of synthetic promoters so complicated although we can resort to an elaborate model of their functional structure? Because in this example, the genetic information has mainly been processed in the form of quaternary DNA-Code.



The theoretical solution

Do you speak "subparts"?

The example given above highlights the need for a representation of genetic information which can be better understood and easier edited than ACTG-code. The key to this problem is the ability to translate between a human semantic system and DNA-code.

Bioinformatics have partially succeeded in the translation of DNA-code to human language. An example in the context of our project is the online tool Match [1] which receives a DNA-sequence as input and displays the TFBS in the sequence represented in alphanumeric code. But so far, there are only a very few tools to translate a linguistic representation of genetic information into DNA-code. One reason for this might be that the terms used to describe genetic informations have not provided the right level of abstraction: “A promoter which is active under hypoxic conditions and apoptotic cellular behavior” is an abstract representation of a functional unit of genetic information. But it is not possible to compile this information into DNA-code. It is too ambiguous – although this information is already described on the level of biological parts. Obviously, a linguistic representation of DNA-code on a lower level of abstraction – the level of subparts – is needed. To stick with the example above: If we describe the same promoter with subpart-notions, we achieve a balance of abstraction and unambiguousness intelligible for both computers and humans: “A promoter with only the following subparts: core promoter: TATA-Box; proximal promoter: "1 x TFBS for HIF (hypoxia induced factor) at optimal position, 1 x TFBS for NF". This subpart-based linguistic representation of genetic information achieves a balance of abstraction and unambiguousness intelligible for both computers and humans. We therefore defined the subpart notion as fundament of the input format for a DNA-code – human language interface.

Practical implementation

(inter)action guarenteed

The straight forward solution would be to turn this input format into a full programming language. For this purpose the subpart-based input logic would have to be extended with a collection of functions and commands related to the creation and processing of genetic information. Furthermore a compiler which is able to directly translate the input format into DNA-code would be necessary. To design a promoter scientists would simply have to write a program code which could start like this:

  • Create $promoter_sequence (600 bp);
  • Insert TFBS (Sp1, opt);

This textfile would subsequently be interpreted by the compiler and the corresponding quaternary DNA code would be displayed.

Second, the interface could be implemented as a graphical user interface (GUI) which would collect the necessary information through interactive formulars. This solution has major advantages over the use of a mere DNA programming language:

The syntax for the assembly of subparts to valid genetic information is highly complex. The final function of a concatenation of subparts can for example be influenced by the positioning of or the interaction between different subparts. In addition, biological processes exhibit a variety of context dependent exceptions. An illustrative example is the context sensitivity of the biobrick alpha cloning standard [http://partsregistry.org/Assembly:RBS-CDS_issues]. Because of these difficulties, the work with a mere DNA programming language becomes very unhandy. Interactive graphical user interfaces can reduce this complexity: They can provide real-time control of user inputs. Moreover, the interface can display graphical information in order to support the current design step. Moreover they can compute the implications of the input parameters that the user has already supplied. The final compilation of the input data into DNA-code can be carried out in separated steps allowing the user to control the process.

HEARTBEAT Features

Access the GUI here

Selecting initial parameters

On the first page of our GUI the users are requested to enter basic parameters determining the primary assembly of the promoter. The users only have to provide the final length of the construct assuming that the users will take the CMV core promoter from Part:BBa K203113 to complete their construct. In this case the length of the core is 80 bases and the transcription start site is 20 bases upstream regarding the end of the CMV promoter. Users who already possess a preferred promoter have the opportunity to specify its length and the location of the TSS in the advanced mode. The users are not able to enter values over 1000 base, since the frequency distributions for each transcription factor binding site (TFBS) stored in the HEARTBEAT-database (DB) range from 0 to 1000 bases upstream of the TSS.

Selecting main and auxiliary transcription factors

As soon as the users submit their selected parameters, they are forwarded to the selection of the main transcription factor (TF), whose binding site is supposed to be integrated into the construct. The users have the possibility to choose between 144 different binding sites (respectively 144 TFs) which they can build in into their promoter. Before designing the promoter, one should be aware of a mechanism to exclusively induce the pathway where the TF is involved in. For further recommendations and information on how to create an experiment fulfilling this requirement, please read Induction. Once a primary TF is chosen two plots become visible. The first one shows the frequency histogram deduced from the HEARTBEAT-DB. The solid red line depicts the probability density function ( pdf) smoothing the TFBS distribution. The position of the vertical red line is stored in the background and later used in the final promoter assembly. After the sequence is automatically assembled the users will be able to modify the sequence by their own account. In case of multiple maxima the users can manually introduce further binding sites influenced by the pdf. In the second plot the frequencies of all co-occurring TFBS can be seen. The five most frequent occurring TFBS can be chosen in the next frames. Again the maxima of the respective pdf are stored by the program.

Selecting or entering a consensus motive

When all favored TFBS are chosen, the users are asked to either enter a known binding motive or select one out of the calculated possibilities. The selection comprises one possible binding motive of the chosen TFBS for each consensus matrix found in the Transfac database. Therefore all bases, which have only been defined by excluding other bases at that position (see [http://insilico.ehu.es/restriction/Nucleotide_ambiguity_code.html Ambiguity Code] e.g. M, R, W etc.) are replaced by either A, C, G or T in random process. This might lead to different selection each time this function is used. In order to determine the quality of a particular TF binding motive, we recommend to test the sequence with the open-source program TRAP and to use only binding motives with a high binding affinity. The selected or entered motive is together with the other earlier defined parameters finally transmitted to the crucial sequence assembly algorithm.

The assembly algorithm

The assembly algorithm represents the central program of the HEARTBEAT-GUI. As input parameters the algorithm needs the absolute promoter length, the core-promoter length, the TSS position relative to the end of the core-promoter, the chosen binding motives and their distance to the TSS. In an iterative process the algorithm attempts to locate first the main TFBS and then the auxiliary TFBSs at the position of their pdf-maxima. If the consecutive TFBS overlaps with the preceding one, the consecutive TFBS is randomly moved 1 base either to the left or to the right until a valid position is found. Every repositioning done by the algorithm can be read in a warning message in the output window. The final sequence is transferred into a TyneMCE open source web-editor in the end. Within this editor, the users can manually modify their sequence. If the users want to introduce new binding sites, they have to highlight the position in a different color in order to enable the program to recognize the position that should be changed.

Adding spacer sequence to the construct

By pushing the button add spacer sequence a small perl-based program introduces every chosen TFBS into a deposited random sequence at the pre-defined positions. From this sequence every restriction site used in any BioBrick standard has been removed. Furthermore we also eliminated every TFBS detected by the Transfac Match tool assuming the sequence to be TFBS free.

Test for restriction sites

With this feature the users are enabled to test their final sequence for every restriction site used in BioBrick format. This program addresses the problem of newly evolving restriction sites at the cutting sites between random sequence and biding motive. The restriction site is manipulated systematically by altering single bases without touching the original binding motive.