Team:Heidelberg/HEARTBEAT database

From 2009.igem.org


Revision as of 16:04, 19 October 2009


HEARTBEAT: Design of rational promoter sequences

Introduction

In parallel to the biochemical method RA-PCR, our team worked on a computational approach for the rational design of promoter libraries. This strategy focuses on spatial dependencies between pairs of binding motifs and on how the distance of a particular binding motif affects transcriptional activity. Similar to existing methods that predict spatial preferences of transcription factor binding sites (TFBS) by detecting statistically overrepresented motifs [], we used Promotersweep (http://genome.dkfz-heidelberg.de/) [] to analyse and process the information of over 4000 human promoter sequences. The data for a total of 29966 TF-binding sites was then stored in a MySQL database (DB), which we termed HEARTBEAT (Heidelberg Artificial Transcription factor Binding site Engineering and Assembly Tool). Based on frequency distributions (see Results) for SREBP and VDR derived from HEARTBEAT, we designed 25 promoter sequences with different arrangements of TF-binding sites (see Promoter design).

VDR- and SREBP-responsive promoter constructs were first cloned into a reporter plasmid (BBa_K203100) and subsequently transfected into MCF-7 and HeLa cells, respectively. The final screening was carried out by TECAN and flow cytometry measurements (results). According to our results, we think that our way of combining rational generation with experimental validation of synthetic constructs provides a novel and effective strategy in synthetic biology.

To enable the construction of HEARTBEAT-based rational promoter sequences, which comprises sequence assembly with TF-binding sites of choice, addition of spacer sequences, and checks for restriction sites and unintended TFBS, we developed the HEARTBEAT graphical user interface (GUI). Furthermore, we developed a computer model based on fuzzy logic that simulates the activity of the designed rational and random synthetic promoter sequences. In a reverse approach that starts from the output, the model helps the user optimize the input sequence. Altogether, HEARTBEAT, which comprises data analysis (HEARTBEAT-DB), a graphical user interface (HEARTBEAT-GUI) and network modeling (HEARTBEAT fuzzy network (FN)), provides a powerful and promising instrument for the synthetic biology community.

Methods

To create a reliable database with sufficient data, the first task was the selection of promoters and genes. For this purpose, we defined a promoter sequence as the 1000 bp upstream of the transcription start site (TSS). The UCSC Genome Browser (http://genome.ucsc.edu/cite.html) [refA] provides reference sequences and working draft assemblies for a large collection of genomes, including the human genome. Text- and sequence-based searches provide quick and precise access to any region of interest. We used the provided dataset containing the 1000 bases upstream of the annotated TSS of each RefSeq gene [refB]. Following this pre-selection, we further narrowed our promoter set by selecting distinct pathways. KEGG (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.jp/kegg/) [refC,D,E] provides a database of biological systems consisting of several building blocks, including genes and proteins (KEGG GENES) and hierarchies and relationships of various biological objects (KEGG BRITE). Here, KEGG PATHWAY, which comprises molecular interaction and reaction networks for metabolism, various cellular processes and human diseases, was of particular interest. From this collection of molecular wiring maps, we chose all physiologically relevant pathways for our project, thereby ruling out tissue-specific pathways, highly specialised pathways such as olfactory and taste transduction, and several pathways related to human diseases.
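The promoter definition above can be illustrated with a minimal sketch. We used the precomputed UCSC dataset rather than code like this; the function names and the toy sequence below are hypothetical and only show what "1000 bp upstream of the TSS" means on either strand.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def upstream_region(chromosome, tss, strand, length=1000):
    """Return the `length` bases immediately upstream of the TSS.

    `chromosome` is the forward-strand sequence and `tss` is the 0-based
    index of the first transcribed base. Regions that would run past the
    ends of the sequence are truncated.
    """
    if strand == "+":
        start = max(0, tss - length)
        return chromosome[start:tss].upper()
    if strand == "-":
        # On the minus strand, "upstream" lies at higher forward coordinates,
        # so the slice is taken after the TSS and then reverse-complemented.
        end = min(len(chromosome), tss + 1 + length)
        return reverse_complement(chromosome[tss + 1:end])
    raise ValueError("strand must be '+' or '-'")


# Toy example with a 16-base "chromosome" and a 4-base promoter window.
toy = "ACGTACGTAAGGCCTT"
plus_promoter = upstream_region(toy, tss=8, strand="+", length=4)
minus_promoter = upstream_region(toy, tss=7, strand="-", length=4)
```

Reading the minus-strand promoter as a reverse complement keeps every promoter in the same 5' to 3' orientation relative to its gene, so positions relative to the TSS stay comparable across genes.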

Promotersweep

One of the most challenging problems in bioinformatics remains the computational localisation of TFBS and transcription start sites (TSS) as well as the determination of the core promoter. Many of the available motif discovery tools suffer from a high rate of false-positive predictions. The Promotersweep web tool improves the accuracy of this analysis by combining a large number of different algorithms and methods. It integrates information from three homology databases (EnsEMBL Compara, NCBI HomoloGene, DoOP database), five promoter databases (e.g. EPD, DBTSS), six sequence motif identification tools (e.g. Meme, Gibbs MotifSampler) and two matrix profile databases (Jaspar Core Library, Transfac Professional Library) to identify and annotate TFBS. The Promotersweep pipeline is started by entering a sequence and selecting human or mouse as its origin. Initially, a homology search is performed using different BLAST algorithms. As a result, orthologous promoter regions are deduced from EnsEMBL, HomoloGene or DoOP, respectively. Subsequently, several motif discovery tools determine shared motifs of orthologous or co-regulated sequences. In the last step, each TFBS is identified and evaluated with the help of the Transfac and Jaspar Core libraries. Every identified TFBS is classified as weak, conserved or reliable according to how well the predictions of the different algorithms agree. In Fig. [] the result for the Heat shock cognate 71 kDa protein (NM_153201) promoter is shown. For this promoter, four different binding sites were discovered; three of them were classified as reliable and one as conserved. For each hit, the output of Promotersweep contains the position of the motif relative to the TSS. So far we were able to analyse 4395 different promoter sequences, which hold 29966 TFBS in total.

The HEARTBEAT-database

To retrieve the information computed by Promotersweep as fast as possible, we decided to develop a database structure based on MySQL. MySQL is one of the most popular relational database management systems and is queried with SQL (Structured Query Language). It offers not only a language to set up a relational database but also an interface for easy manipulation of data. Its advantages are an intuitive command language and a table structure that helps to minimize redundant data. Simple queries are written in a "SELECT - FROM - WHERE" format: SELECT specifies the requested columns, FROM names the corresponding table and WHERE restricts the selection to matching rows. For our database the average query duration is below 200 ms. This enabled us to provide fluent online access to HEARTBEAT through the HEARTBEAT-GUI. Our data is stored in the tables "Main_Info" and "Gene_Info". Main_Info contains all data necessary to define the location, binding motif and quality of a TFBS, whereas Gene_Info offers additional information and several annotations for the gene on which the TFBS is located. In Fig. [] the table structure is shown for Main_Info and Gene_Info.
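The SELECT - FROM - WHERE pattern can be sketched as follows. This is not the actual HEARTBEAT schema: only the table names Main_Info and Gene_Info come from the text, while all column names and rows are made up for illustration. Python's built-in sqlite3 module stands in for a MySQL server so that the sketch runs without any setup.

```python
import sqlite3

# In-memory stand-in for the MySQL database; the columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Gene_Info (
    gene_id   INTEGER PRIMARY KEY,
    refseq_id TEXT,
    gene_name TEXT
);
CREATE TABLE Main_Info (
    tfbs_id  INTEGER PRIMARY KEY,
    gene_id  INTEGER REFERENCES Gene_Info(gene_id),
    tf_name  TEXT,
    position INTEGER,   -- position relative to the TSS
    quality  TEXT       -- 'weak', 'conserved' or 'reliable'
);
""")

# Made-up rows, purely for illustration.
conn.executemany("INSERT INTO Gene_Info VALUES (?, ?, ?)", [
    (1, "NM_EXAMPLE1", "GENE_A"),
    (2, "NM_EXAMPLE2", "GENE_B"),
])
conn.executemany("INSERT INTO Main_Info VALUES (?, ?, ?, ?, ?)", [
    (1, 1, "SP1", -60, "reliable"),
    (2, 1, "AP-2", -120, "conserved"),
    (3, 2, "SP1", -45, "reliable"),
    (4, 2, "SP1", -300, "weak"),
])


def positions_for_factor(connection, tf_name):
    """Simple SELECT - FROM - WHERE query on a single table."""
    query = "SELECT position FROM Main_Info WHERE tf_name = ? ORDER BY position"
    return [row[0] for row in connection.execute(query, (tf_name,))]


def reliable_sites_with_gene(connection, tf_name):
    """Combine both tables to attach the gene annotation to each site."""
    query = """
        SELECT g.refseq_id, m.position
        FROM Main_Info AS m
        JOIN Gene_Info AS g ON g.gene_id = m.gene_id
        WHERE m.tf_name = ? AND m.quality = 'reliable'
        ORDER BY m.position
    """
    return connection.execute(query, (tf_name,)).fetchall()


sp1_positions = positions_for_factor(conn, "SP1")
sp1_reliable = reliable_sites_with_gene(conn, "SP1")
```

Keeping the gene annotation in its own table and referring to it by an identifier is what avoids storing the same gene information once per binding site.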

Results

For the statistical analysis we plotted, for each TF-binding site, a histogram of the absolute frequency of occurrence against the position relative to the TSS, where the TSS is located at bases 1001-1003. Each bin comprises 20 bases, analogous to low-resolution approaches that analysed the spatial distribution of TFBS with a sliding window of 20-25 bp [Daigoro]. Of the 356 different TFBS for which Transfac contains at least one binding matrix, 144 occurred at least 50 times within the 4390 natural promoters. TFBS with fewer than 50 counts were removed from the selection and not considered for further analysis. In Fig. [] the spatial distributions of SP1, AP-2, IPF1 (Insulin promoter factor 1) and Kid3 binding sites are shown. The red solid line represents the re-scaled probability density function (pdf). We introduced this function for two reasons. First, the pdf is more robust with respect to outliers than a plain histogram. Second, we used the re-scaled area under the curve within a sliding window of 20 bases as a measure of the significance of a particular TFBS occurrence. The vertical red line in each plot marks the maximum of the pdf; most binding motifs in the natural promoters lie around this base position. In the following, the maximum of the pdf serves as the position at which binding sites are introduced into our rationally designed promoter sequences.
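The analysis steps described above can be sketched in a few lines. The text does not state how the pdf was estimated, so the Gaussian kernel density estimate and its bandwidth below are assumptions, as are all function names and the example positions.

```python
import math


def histogram(positions, bin_width=20, region_length=1000):
    """Count binding-site positions (1-based, within the region) per bin."""
    counts = [0] * math.ceil(region_length / bin_width)
    for p in positions:
        if 1 <= p <= region_length:
            counts[(p - 1) // bin_width] += 1
    return counts


def kernel_density(positions, bandwidth=15.0, region_length=1000):
    """Gaussian kernel density estimate at every base 1..region_length.

    Assumption: the smooth pdf in the text is approximated by a KDE.
    """
    norm = 1.0 / (len(positions) * bandwidth * math.sqrt(2.0 * math.pi))
    density = []
    for x in range(1, region_length + 1):
        total = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in positions)
        density.append(norm * total)
    return density


def peak_position(density):
    """Return the 1-based base position at which the density is maximal."""
    return max(range(len(density)), key=density.__getitem__) + 1


def window_mass(density, start, width=20):
    """Approximate area under the density over bases start..start+width-1."""
    return sum(density[start - 1:start - 1 + width])


# Made-up positions: a cluster close to the TSS plus one distant outlier.
example_positions = [940, 945, 950, 950, 955, 960, 100]
counts = histogram(example_positions)
density = kernel_density(example_positions)
best_position = peak_position(density)
```

In this toy example the single outlier at base 100 fills a histogram bin of its own but barely influences where the density peaks, which is the robustness argument made above.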

Background / Motivation

We present two different approaches for promoter design, resulting in three different types of synthetic promoters: randomly assembled constitutive promoters, randomly assembled inducible promoters, and rationally designed promoters. As an additional type, naturally occurring promoters can be integrated into vector systems. This heterogeneous cocktail of promoters can be combined for a precise regulation of pathways, which represents the power of our entire HEARTBEAT project. Synthesized promoters can then be used, for example, in combinatorial gene therapy, i.e. several promoters of different types and/or different strengths are applied as treatment agents. Therefore, a model that not only simulates the activity of a single promoter and the resulting gene expression but also accurately predicts gene expression from combined promoter sequences is indispensable.

We constructed a Fuzzy Logic model to provide a formal mathematical framework for prediction of combined activity of multiple promoters upon several stimuli and to gain insight into the mechanisms that generate diverse expression levels.

A Short Introduction into Fuzzy Logic Modeling

Fuzzy Logic is a rule-based approximate reasoning method developed by Lotfi Zadeh (http://en.wikipedia.org/wiki/Lofti_Zadeh) in 1965. Its motivation is the observation that humans often think and communicate in a vague way, and yet can make precise decisions [10]. It has been widely used in engineering and Artificial Intelligence approaches such as Fuzzy Controllers and Fuzzy Expert Systems. Fuzzy Logic has also been used for the modeling of biological pathways [11] and very recently to analyze gene regulatory networks [12]. Key advantages of Fuzzy Logic-based approaches are that (i) models can be constructed from prior knowledge of the system and from experimental data, (ii) intermediate states can be encoded for inputs and outputs, which improves on other logic approaches such as Boolean models that can only deal with ON/OFF states [13], and (iii) simulations can be derived from both qualitative and quantitative data, both of which can be cast into the form of IF-THEN rules. Thus, Fuzzy Logic (FL) constitutes a powerful approach for the understanding of heterogeneous datasets.

A Model combining XXX (2 nice words for prediction of HB DB) with experimental data

In our project, the complete set of rules will capture the behavior of each promoter in a Multiple-Input Single-Output (MISO) Fuzzy Logic model. Combining the MISO models in a network of all promoters will constitute the final Multiple-Input Multiple-Output (MIMO) model, allowing for the simulation and prediction of the combined activation of pathways regulated by our promoters. A key advantage of this methodology for understanding the exclusive pathway activation of our promoters of interest is the possibility to study not only the individual activity of each promoter but also the combined activity, as the signal progresses from one MISO to another.
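To make the MISO idea concrete, here is a minimal generic sketch and not our actual model: two inputs in the range 0 to 1 are each described by the fuzzy sets "low" and "high", four IF-THEN rules map them to an output level, AND is taken as the minimum, and the output is a weighted average of the rule outputs. The membership functions, rules and output levels are all illustrative assumptions.

```python
def clamp(x):
    """Restrict a value to the interval [0, 1]."""
    return max(0.0, min(1.0, x))


def low(x):
    """Degree to which an input is 'low' (1 at x = 0, 0 at x = 1)."""
    return 1.0 - clamp(x)


def high(x):
    """Degree to which an input is 'high' (0 at x = 0, 1 at x = 1)."""
    return clamp(x)


# Hypothetical rule base: (memberships of input A and input B, output level).
# Example reading of the first rule:
# IF A is high AND B is high THEN promoter activity is 1.0.
RULES = [
    ((high, high), 1.0),
    ((high, low), 0.5),
    ((low, high), 0.5),
    ((low, low), 0.0),
]


def miso_activity(inputs, rules=RULES):
    """Evaluate a Multiple-Input Single-Output fuzzy model.

    Each rule fires with the minimum of its membership degrees (fuzzy AND);
    the result is the firing-strength-weighted average of the rule outputs.
    """
    weighted_sum = 0.0
    total_strength = 0.0
    for memberships, output_level in rules:
        strength = min(m(x) for m, x in zip(memberships, inputs))
        weighted_sum += strength * output_level
        total_strength += strength
    return weighted_sum / total_strength if total_strength > 0 else 0.0


# Chaining two MISO blocks: the output of the first feeds the second,
# as a toy version of a signal progressing from one MISO to another.
first_stage = miso_activity((0.5, 0.5))
second_stage = miso_activity((first_stage, 1.0))
```

Because inputs such as 0.5 fire several rules partially, the output varies gradually between the rule outputs, which is the intermediate-state behaviour that a purely Boolean ON/OFF model cannot represent.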

Achievements

Model description

blah

References

[1] Harbison, C. T. et al. Transcriptional regulatory code of a eukaryotic genome. Nature 431, 99-104 (2004).

[2] Hu, Z., Killion, P. J. & Iyer, V. R. Genetic reconstruction of a functional transcriptional regulatory network. Nature Genet. 39, 683-687 (2007).

[3] Gertz, J., Siggia, E. D. & Cohen, B. A. Analysis of combinatorial cis-regulation in synthetic and genomic promoters. Nature 457, 215-218 (2009).

[4] Roider, H. G. et al. Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics 23, 134-141 (2006)

[5] ref to come

[6] Andrianantoandro, E. et al. Synthetic biology: new engineering rules for an emerging discipline. Mol Syst Biol (2006).

[7] Alberts, B. et al. Molecular Biology of the Cell, 5th edition. Garland Science, 2008, Chapter 6

[8] Vardhanabhuti, S., Wang, J. & Hannenhalli, S. Position and distance specificity are important determinants of cis-regulatory motifs in addition to evolutionary conservation. Nucl Acid Res 35, 3203-3213 (2007).

[9] Yokoyama, K. D., Ohler, U. & Wray, G. A. Measuring spatial preferences at fine-scale resolution identifies known and novel cis-regulatory element candidates and functional motif-pair relationships. Nucl Acid Res, 1-21 (2009)

[10] Nelles, O. Nonlinear System Identification. Springer, 2000.

[11] Bosl, W. J. BMC systems biology 1, 13 (2007).

[12] Mathematical modeling of the lambda switch:a fuzzy logic approach.

[13] B. B. Aldridge, J. Saez-Rodriguez, J. L. Muhlich et al., PLoS computational biology 5 (4), e1000340 (2009).


