Team:Heidelberg/HEARTBEAT database

From 2009.igem.org

(Difference between revisions)
m (Methods)
(Contents)
 
(156 intermediate revisions not shown)
Line 3: Line 3:
<html><body id="modeling"></body></html>
<html><body id="modeling"></body></html>
{|
{|
-
|-valign="top" border="0"
+
|-valign="top" border="0" style="margin-left: 2px;"
-
|width="650px" style="padding: 0 15px 15px 20px; background-color:#ede8e2"|
+
|width="750px" style="padding: 0 15px 15px 20px; background-color:#ede8e2"|
__NOTOC__
__NOTOC__
-
= HEARTBEAT: Design of rational promoter sequences =
+
= HEARTBEAT: Rational design of promoter sequences =
 +
 
 +
== Contents ==
 +
*[[Team:Heidelberg/HEARTBEAT_database#Introduction|Introduction]]
 +
*[[Team:Heidelberg/HEARTBEAT_database#Methods|Methods]]
 +
*[[Team:Heidelberg/HEARTBEAT_database#Results|Results]]
 +
*[[Team:Heidelberg/HEARTBEAT_database#Discussion|Discussion]]
 +
*[[Team:Heidelberg/HEARTBEAT_database#References|References]]
== Introduction ==
== Introduction ==
-
In parallel to the biochemical method [[Team:Heidelberg/Project_Synthetic_promoters#Results|RA-PCR]] our team worked on a computational approach for the  rational design of promoter libraries. This strategy focused on spatial dependencies between pairs of binding motives and how the distance of a particular binding motive affects transcriptional activity. Similar to existing methods which predict spatial preferences of transcription factorbinding sites (TFBS) by detecting statistically overrepresented motives [] we used [http://genome.dkfz-heidelberg.de/| Promotersweep] [] to analyse and process the information of over 4000 human promoter sequences. The data of a total of 29966 TF-binding sites was then stored in a MySQL database (DB) which we named '''He'''idelberg '''Ar'''tificial '''T'''ranscription factor '''B'''inding site '''E'''ngineering and '''A'''ssembly '''T'''ool,  [[Team:Heidelberg/HEARTBEAT_gui|HEARTBEAT]]. Based on frequency distributions (see [[Team:Heidelberg/HEARTBEAT_database#Results|results]]) for [[Team:Heidelberg/Eucaryopedia#SREBP|SREBP]] and [[Team:Heidelberg/Eucaryopedia#VDR|VDR]] derived from HEARTBEAT, we were able to design 25 promoter sequences with different arrangement of TF-binding sites (see [[Team:Heidelberg/HEARTBEAT_database#Promoter design|Promoter design]]).
 
-
VDR- as well as SREBP-responsive promoter constructs were first cloned into a reporter plasmid ([[Team:Heidelberg/Parts#Core promoters|BBa_K203100]]) and consequently transfected into [[Team:Heidelberg/Eucaryopedia#MCF-7|MCF-7]] and [[Team:Heidelberg/Eucaryopedia#Hela|Hela]] cells, respectively. The final screening was accomplished by [[Team:Heidelberg/Notebook_MaM#Screening by TECAN|TECAN]] and [[Team:Heidelberg/Notebook_MaM#Flow Cytometry|Flow Cytometry]] measurements (see [[Team:Heidelberg/HEARTBEAT_database#Results|results]]). '''Sentence about first results'''. According to our results, we think that our way of combining rational generation and experimental validation of synthetic constructs provides a novel and effective strategy in synthetic biology.
+
In the field of Synthetic Biology mathematical models based on bioinformatic methods are mainly used in order to describe the systems behavior of the involved synthetic constructs. To conduct the orchestra of these parts systems biologists try out different modeling approaches for example analytical or stochastic modeling depending on the underlying system. The introduced parameters  in these models are naturally fitted to experimental data acquired ''in vivo'' [[Team:Heidelberg/HEARTBEAT_database#References|[1]]]. But the interests in synthetic biology will not only be concerned about systems of existing parts, the focus of research in systems biology will rather  be in the design of totally rational assembled constructs not described in nature so far. For that purpose we need models which predict and optimize the properties of these ''in silico'' designed constructs. 
-
 
+
 
-
To enable the construction of HEARTBEAT-based rational promoter sequences which comprises sequence assembly with TFBS of choice, adding spacer sequences, checking for restriction sites and unintentional TFBS etc., we developed the HEARTBEAT graphical user interface [[Team:Heidelberg/HEARTBEAT_gui|HEARTBEAT GUI]]. Furthermore, we developed a computer model based on fuzzy logic, which is able to simulate the activity of the designed rational and random synthetic promoter sequences. In a reverse approach considering the output, the model helps the user in optimizing the input sequence. Altogether HEARTBEAT which comprises data analysis ([[Team:Heidelberg/HEARTBEAT_database|HEARTBEAT-DB]]), a graphical user interface [[Team:Heidelberg/HEARTBEAT_gui|HEARTBEAT GUI]] and network modeling ([[Team:Heidelberg/HEARTBEAT_network|HEARTBEAT fuzzy network (FN)]]) provides a powerful and promising instrument for the synthetic biology community.
+
In order to enable this modeling approach it is useful to define a part as a basic biological function encoded as genetic material ([http://blip.tv/file/548184 listen to Drew Endy, President of the BioBricks Foundation]). Starting from this abstract concept we propose to devide up the "parental" part into smaller functional units which we define  “subparts” that exhibit their own biological function. However, this function can only be exerted if all subparts are exactly assembled together in the “parental” part. If all system related parameters for every subpart are known, one will be able to model the characteristics of the "parental" part.
 +
 
 +
In this work we propose a model which describes the rational design of a certain promoter of interest. Furthermore, the model is able to predict the functional outcome of the created sequence and it can be used to improve the input sequence, both based on fuzzy logic modeling ([[Team:Heidelberg/HEARTBEAT_network|HEARTBEAT fuzzy network (FN)]]). For the development of the model understanding of the functionality and composition of a promoter was essential: it consists of a core-promoter comprising one or more subparts which can work as the binding site for the basal transcription machinery, i.e. RNA polymerase, TATA-box and the Inr-element. Transcription factors regulating the transcriptional activity interact preferentially with the proximal promoter  [[Team:Heidelberg/HEARTBEAT_database#References|[2]]]. The intensity of the regulation is further defined by the position of the transcription factor binding site (TFBS) relative to the transcriptional start site (TSS), the binding affinity of the TF to a particular binding motive and also by synergistic effects evolving from interactions between the simultaneously binding of multiple Tfs. [[Team:Heidelberg/HEARTBEAT_database#References|[3]]],[[Team:Heidelberg/HEARTBEAT_database#References|[4]]] In addition the spacer sequences are directly influencing  the promoter strength [[Team:Heidelberg/HEARTBEAT_database#References|[5]]].  
 +
 
 +
Based on these assumptions the optimal synthetic promoter sequence can be designed using the following parameters (Fig. 1):
 +
 
 +
* all TFBS selected to be in the sequence
 +
* auxiliary TFBS, which are supposed to have a synergistic effect
 +
* optimal distance of all TFBS to the TSS
 +
* maximal spacer sequence quality between the TFBS
 +
* optimal distance between TFBSs
 +
 
 +
In the following articles we propose a standardized strategy to specify these parameters with computerized methods based on HEARTBEAT ('''He'''idelberg '''Ar'''tificial '''T'''ranscription Factor '''B'''inding Site '''E'''ngineering and '''A'''ssembly '''T'''ool). We will further explain the manual design of promoter sequences using these parameters  for [[Team:Heidelberg/Eukaryopedia#SREBP|SREBP]] and [[Team:Heidelberg/Eukaryopedia#VDR|VDR]]. Initital measurements show the potential of HEARTBEAT-designed sequences to function ''in vivo'' (see [[Team:Heidelberg/HEARTBEAT_database#Results|results]]). On the basis of our observations, we claim that our way of combining rational generation and experimental validation of synthetic constructs provides a novel and effective strategy in synthetic biology.
 +
 
 +
[[https://2009.igem.org/Team:Heidelberg/HEARTBEAT_database TOP]]
 +
 
 +
[[Image:Heartbeat_flow2.png|thumb|none|500px|'''Figure 1: The generation of promoter sequences- from genome scanning to sequence design''']]
 +
 
 +
[[Team:Heidelberg/HEARTBEAT_database#Introduction|[TOP]]]
==Methods==
==Methods==
-
In order to create a reliable database providing sufficient data, the very first task we faced was the
+
Our promoter sequence
-
promoter / gene selection.  For this purpose our promoter sequence
+
was defined to be 1000 bp upstream of the TSS. The  
was defined to be 1000 bp upstream of the TSS. The  
-
[http://genome.ucsc.edu/cite.html| UCSC Genome Browser] [refA] provides reference sequence and working draft assemblies
+
[http://genome.ucsc.edu/cite.html| UCSC Genome Browser] [[Team:Heidelberg/HEARTBEAT_database#References|[6]]] provides reference sequence and working draft assemblies
-
for a large collection of genomes including the human genome. Text and sequence-based searches
+
for a large collection of genomes including the human genome. We derived the
-
provide quick and precise access to any region of specific interest. In our case we derived the
+
provided sequences containing the 1000 bases upstream of annotated TSS for each RefSeq gene
-
provided dataset which included the 1000 bases upstream of annotated TSS for each RefSeq genes
+
[[Team:Heidelberg/HEARTBEAT_database#References|[7]]].
-
[refB].
+
Upon this pre-selection, we further narrowed our choice of promoter set by selection of distinct
Upon this pre-selection, we further narrowed our choice of promoter set by selection of distinct
pathways. KEGG ([http://www.genome.jp/kegg/Kyoto| Kyoto Encyclopedia of Genes and Genomes])
pathways. KEGG ([http://www.genome.jp/kegg/Kyoto| Kyoto Encyclopedia of Genes and Genomes])
-
[refC,D,E] provides a database of biological systems, consisting of several building blocks
+
[[Team:Heidelberg/HEARTBEAT_database#References|[8]]], [[Team:Heidelberg/HEARTBEAT_database#References|[9]]], [[Team:Heidelberg/HEARTBEAT_database#References|[10]]] provides a database of biological systems, consisting of several building blocks
-
including e.g. genes and proteins (KEGG GENES) or hierarchies and relationships of various
+
including for instance genes and proteins (KEGG GENES) or hierarchies and relationships of various
biological objects (KEGG BRITE). Here, KEGG PATHWAY which comprises molecular
biological objects (KEGG BRITE). Here, KEGG PATHWAY which comprises molecular
interaction and reaction networks for metabolism, various cellular processes and human diseases
interaction and reaction networks for metabolism, various cellular processes and human diseases
was of particular interest. From this collection of molecular wiring maps, we chose all
was of particular interest. From this collection of molecular wiring maps, we chose all
-
physiologically relevant pathways for our project, thereby ruling out tissue specific pathways,
+
physiologically relevant pathways for our project, thereby ruling out tissue or other highly specific pathways,
-
highly specific pathways like olfactory / taste transduction as well as several pathways related to
+
like olfactory / taste transduction as well as several pathways related to human diseases.
-
human diseases.
+
 
 +
[[https://2009.igem.org/Team:Heidelberg/HEARTBEAT_database TOP]]
=== Promotersweep ===
=== Promotersweep ===
-
On of the most challenging problems in bioinformatics remains the computerised localisation of
+
One of the most challenging problems in bioinformatics remains the computerized localization of
TFBS and the transcriptional start sites (TSS) as well as the determination of the core promoter.
TFBS and the transcriptional start sites (TSS) as well as the determination of the core promoter.
-
Many of the available motive discovery tools exhibit the problem, of extended false positive
+
Many of the available motive discovery tools exhibit the problem of extended false positive
predictions. The Promotersweep web-tool improves the accuracy of this analysis by combining a
predictions. The Promotersweep web-tool improves the accuracy of this analysis by combining a
-
vast number of different algorithms and methods simultaneously. It integrates information from
+
vaste number of different algorithms and methods simultaneously [[Team:Heidelberg/HEARTBEAT_database#References|[11]]]. It integrates information from three homology databases (EnsEMBL Compara [[Team:Heidelberg/HEARTBEAT_database#References|[12]]], NCBI HomoloGene [[Team:Heidelberg/HEARTBEAT_database#References|[13]]],DoOP database [[Team:Heidelberg/HEARTBEAT_database#References|[14]]]), five
-
three homology databases (EnsEMBL Compara, NCBI HomoloGene,DoOP database), five
+
promoter databases(EPD [[Team:Heidelberg/HEARTBEAT_database#References|[15]]], DBTSS [[Team:Heidelberg/HEARTBEAT_database#References|[16]]]), six sequence motive identification tools (e.g. Gibbs
-
promoter databases(EPD, DBTSS), six sequence motive identification tools (e.g. Meme, Gibbs
+
MotifSampler [[Team:Heidelberg/HEARTBEAT_database#References|[17]]]) and two matrix profile databases (Jaspar Core Library [[Team:Heidelberg/HEARTBEAT_database#References|[18]]], Transfac
-
MotifSampler) and two matrix profile databases (Jaspar Core Library, Transfac
+
ProfessionalLibrary [[Team:Heidelberg/HEARTBEAT_database#References|[19]]]) to identify and annotate TFBS. The Promotersweep pipeline is started by entering a sequence, chosen between human or mouse as origin. Initially a homology search is performed by using different BLAST algorithms. As a result orthologous promoter regions are
-
ProfessionalLibrary) to identify and annotate TFBS. The Promotersweep pipeline is started by
+
deduced from EnsEMBL, Homologene or DoOP, respectively. Subsequently, several motive discovery tools determine shared motives of orthologous or co-regulated sequences. In the last step each TFBS is identified and evaluated with the help of the Transfac - and the Jaspar Core library. Every identified TFBS is classified as weak, conserved or reliable - according to the similarity of the predictions of the different algorithms. In Fig. 2 the result for the Heat shock
-
entering a sequence, chosen between human or mouse as origin. Initially a homology search is
+
-
performed by using different BLAST algorithms. As a result orthologous promoter regions are
+
-
deduced from EnsEMBL, Homologene or DoOP, respectively. Subsequently several motive
+
-
discovery tools determine shared motives of orthologous or co-regulated sequences. In the last step
+
-
each TFBS is identified and evaluated with the help of the Transfac core library and the Jaspar Core
+
-
library. Every identified TFBS is classified as weak, conserved or reliable according to the
+
-
similarity of the predictions of the different algorithms. In Fig. [] the result for the Heat shock
+
cognate 71 kDa protein (NM_153201) promoter is shown. For this promoter four different binding
cognate 71 kDa protein (NM_153201) promoter is shown. For this promoter four different binding
-
sites were discovered. Three of them were classified as reliable and one as conserved. For each hit
+
sites were discovered. Three of them were classified as reliable and one as conserved. For each hit,
-
the output of promotersweep contains the position of the motive relative to the TSS. So far we were
+
the output of Promotersweep contains the position of the motive relative to the TSS. So far we were
able to analyse 4395 different promoter sequences, which hold 29966 TFBS in total.
able to analyse 4395 different promoter sequences, which hold 29966 TFBS in total.
 +
{|
 +
|[[Image:Promotersweep.png|thumb|none|600px|'''Figure 2: Promotersweep output for the Heat shock
 +
cognate 71 kDa protein (NM_153201).''' Four different binding sites were discovered.]]
 +
|}
 +
 +
[[https://2009.igem.org/Team:Heidelberg/HEARTBEAT_database TOP]]
=== The HEARTBEAT-database ===
=== The HEARTBEAT-database ===
In order to retrieve the information computed by promotersweep as fast as possible we decided to
In order to retrieve the information computed by promotersweep as fast as possible we decided to
-
develop a database structure based on MySQL (My systems query language). MySQL is one of the
+
develop a database structure based on MySQL (My systems query language) [[Team:Heidelberg/HEARTBEAT_database#References|[20]]]. MySQL is one of the
most popular relational database management systems. It offers not only a language to set up a
most popular relational database management systems. It offers not only a language to set up a
hierarchical database but also an interface for easy manipulation of data. The advantage of MySQL
hierarchical database but also an interface for easy manipulation of data. The advantage of MySQL
Line 70: Line 92:
data to define the location, binding motive and quality of a TFBS, whereas Gene_Info offers
data to define the location, binding motive and quality of a TFBS, whereas Gene_Info offers
additional information for the gene as well as several gene annotations, where the TFBS is located
additional information for the gene as well as several gene annotations, where the TFBS is located
-
on. In Fig. [] the table structure is shown for Main_Info and Gene_Info.
+
on. In table 1 and 2 the table structure is shown for Main_Info and Gene_Info.
-
==Results==
+
'''Table 1: Database structure: Main Info'''
-
For the statistical analysis we plotted the absolute frequency of occurrence for each TF-binding site
+
-
in a histogram against the position relative to the TSS where the TSS is located at base 1001 -1003.
+
-
Each bin comprises 20 bases analogue to different low resolution approaches which analysed the
+
-
spatial distribution of TFBS with a sliding window of 20-25 bp '''[Daigoro]?'''. From 356 different
+
-
TFBS for which Transfac contains at least one binding matrix 144 TFBS occurred at least within 50
+
-
from 4390 natural promoters. TFBS with less than 50 counts were removed from the selection and
+
-
not considered for further analysis. In Fig. '''[]''' the spatial distributions of [[Team:Heidelberg/Eucaryopedia#Sp1|Sp1]], [[Team:Heidelberg/Eucaryopedia#AP-2|AP-2]], [[Team:Heidelberg/Eucaryopedia#IPF1|IPF1]] (Insulin
+
-
promoter factor 1) and [[Team:Heidelberg/Eucaryopedia#Kid3|Kid3]] binding sites are shown. The red solid line represents the re-scaled probability density function '''(pdf)'''. We introduced this function for two reasons. On the one hand the pdf is more robust with respect to outliers than a normal histogram. On the other hand we used the
+
-
rescaled area under the curve between a shifting frame of 20 bases as a measure for the significance of a particular TFBS occurrence. The vertical red line in each plot defines the maximum of the pdf. Around the respective base position the majority of binding motives are located within the natural promoters. The maximum of the pdf will serve in the following as the position where binding sites are introduced into our rational designed promoter sequences.
+
-
<div class="center">
 
-
[[Image:TimBild1.png|400px|center]]
 
-
 
-
[[Image:TimBild2.png|400px|center]]
 
-
 
-
[[Image:TimBild3.png|400px|center]]
 
-
 
-
[[Image:TimBild4.png|400px|center]]
 
-
</div>
 
-
 
-
==MySQL==
 
{| class="wikitable centered"  style="margin: 1em auto 1em auto"
{| class="wikitable centered"  style="margin: 1em auto 1em auto"
|- style="background-color:#99cccc;"  
|- style="background-color:#99cccc;"  
Line 129: Line 131:
|-
|-
|}
|}
 +
 +
'''Table 1: Database structure: Gene Info'''
{| class="wikitable centered"  style="margin: 1em auto 1em auto"
{| class="wikitable centered"  style="margin: 1em auto 1em auto"
Line 142: Line 146:
|}
|}
-
==Promoter design==
+
[[https://2009.igem.org/Team:Heidelberg/HEARTBEAT_database TOP]]
 +
 
 +
=== Promoter design ===
For the rational design of a responsive promoter construct several preliminary considerations have
For the rational design of a responsive promoter construct several preliminary considerations have
to be done. The first question which needs to be addressed concerns the inducibility of the pathway
to be done. The first question which needs to be addressed concerns the inducibility of the pathway
-
of interest. Preceding experiments revealed that for [[Team:Heidelberg/Eucaryopedia#VDR|VDR]] (vitamin D receptor) as well as for [[Team:Heidelberg/Eucaryopedia#SREBP|SREBP]] (sterol regulatory element binding protein), convenient conditions and treatments exist
+
of interest. Preceding experiments revealed that for [[Team:Heidelberg/Eukaryopedia#VDR|VDR]] (vitamin D receptor) as well as for [[Team:Heidelberg/Eukaryopedia#SREBP|SREBP]] (sterol regulatory element binding protein), convenient conditions and treatments exist
under which each pathway can be exclusively activated without killing the chassis, that is in our
under which each pathway can be exclusively activated without killing the chassis, that is in our
-
studies the transfected cells '''(for further information about the experimental set up see [])'''. After we
+
studies the transfected cells. For further reference about induction conditions refer to [[Team:Heidelberg/Notebook_MaM#Induction|Material and Methods]]. After deciding what kind of TFBS is to be included appropriate consensus motives where TFs would presumably bind on had to be selected. Reliable consensus motives can be deduced from matrices
-
decided what kind of TFBS we want to include we had to find appropriate consensus motives our
+
provided in the TRANSFAC database. In case of several different binding matrices, we chose the
-
chosen TFs would presumably bind on. Reliable consensus motives can be deduced from matrices
+
longest motive which contains the most definite bases [[Team:Heidelberg/HEARTBEAT_database#References|[19]]].
-
provided in the Transfac database '''[]'''. In case of several different binding matrices, we chose the
+
Focusing from now on only on VDR and SREBP we created the frequency distributions of the TFBS
-
longest motive which contains the most definite bases ('''further explanation in the eukarypedia)?'''.
+
occurrence for both TFs based on HEARTBEAT (see Fig. 3 and 4). As mentioned above the basic
-
Focusing from now on only VDR and SREBP we created the frequency distributions of the TFBS
+
-
occurrence for both TFs based on HEARTBEAT (see Fig []). As mentioned above the basic
+
assumption of our model is that most transcription factors exhibit a spatial preference for binding
assumption of our model is that most transcription factors exhibit a spatial preference for binding
to the DNA which includes not only the binding sequence and the distance to the TSS but also the
to the DNA which includes not only the binding sequence and the distance to the TSS but also the
-
mutual distance between potential TF pairs. Based on this concept we specified distance of the pdfmaxima
+
mutual distance between potential TF pairs. Based on this concept we specified distance of the pdf-maxima
to the TSS for both VDR and SREBP. Subsequently the binding motive is embedded into
to the TSS for both VDR and SREBP. Subsequently the binding motive is embedded into
the artificial promoter around the position where the majority of binding sites are located in natural
the artificial promoter around the position where the majority of binding sites are located in natural
promoters. With this idea we created first a series of synthetic promoters in which we differed only
promoters. With this idea we created first a series of synthetic promoters in which we differed only
-
the number of binding motives positioned around the pdf-maxima. Examples regarding this series
+
the number of binding motives positioned around the pdf-maxima. For all sequences designed for VDR, see [[Media:HEARTBEAT_VDR.pdf|HEARTBEAT_VDR.pdf]]. For all sequences designed for SREBP, see [[Media:HEARTBEAT_SREBP.pdf|HEARTBEAT_SREBP.pdf]].
-
can be found in Fig[].
+
 
With a second series of artificially designed promoter sequences we tried to answer to what extend
With a second series of artificially designed promoter sequences we tried to answer to what extend
further auxiliary TFBS affect the binding activity of VDR and SREBP. Therefore we plotted the
further auxiliary TFBS affect the binding activity of VDR and SREBP. Therefore we plotted the
frequencies of all TFBS which are co-occuring when VDR or SREBP is present in a natural
frequencies of all TFBS which are co-occuring when VDR or SREBP is present in a natural
-
promoter sequence as well(see Fig []). For SREBP ZF5 was also present with a relative frequency
+
promoter sequence as well (see Fig. 5 and 6). For SREBP [[Team:Heidelberg/Eukaryopedia#ZF5|ZF5]] was also present with a relative frequency
-
of 60%. In case of VDR, AP-2 co-exists in 48% and WT1 in 54% of all VDR-promoters. In the
+
of 60%. In case of VDR, [[Team:Heidelberg/Eukaryopedia#AP-2|AP-2]] co-exists in 48% and [[Team:Heidelberg/Eukaryopedia#WT1|WT1]] in 54% of all VDR-promoters. In the
-
following we proceeded analogue to series one. We created a variety of sequences where we
+
following we proceeded in analogy to series one. We created a variety of sequences where we
included TFBS in proximity to the pdf-maximum of their frequency distribution besides the VDR
included TFBS in proximity to the pdf-maximum of their frequency distribution besides the VDR
and SREBP binding sites. Depending on the number of species of TFBS we distinguish between a
and SREBP binding sites. Depending on the number of species of TFBS we distinguish between a
blue (1 TFs), green (2 TFs) and orange (3 different TFs) series. Finally all spacer sequences were
blue (1 TFs), green (2 TFs) and orange (3 different TFs) series. Finally all spacer sequences were
-
filled with a random sequence with equal A:T and C:G content.
+
filled with a random sequence with A:T and C:G content ratios being equal.
To make sure that our sequences are as specific as possible we iteratively checked and modified our
To make sure that our sequences are as specific as possible we iteratively checked and modified our
-
sequence with the Transfac match tool as long as no other TFBS expect for our chosen ones were
+
sequence with the Transfac match tool [[Team:Heidelberg/HEARTBEAT_database#References|[21]]] until no other TFBS expect for our chosen ones were
detected. Additionally we tested the sequence for every restriction site used in any Biobrick
detected. Additionally we tested the sequence for every restriction site used in any Biobrick
-
standard. Finally we added a HindIII at the 5' end and a SpeI restriction site at the 3' end to enable
+
standard. Finally we added a HindIII at the 5' end and a SpeI restriction site at the 3' end to allow
-
to clone the construct into the reporter plasmid.
+
for cloning the construct into the reporter plasmid.
 +
 
 +
[[Team:Heidelberg/HEARTBEAT_database#Introduction|[TOP]]]
 +
 
 +
==Results==
 +
 
 +
=== HEARTBEAT analyzes transcription factor binding preferences ===
 +
 
 +
For the statistical analysis we plotted the absolute frequency of occurrence for each TF-binding site
 +
in a histogram against the position relative to the TSS where the TSS is located at base 1001 -1003.
 +
Each bin comprises 20 bases analogous to different low resolution approaches which analysed the
 +
spatial distribution of TFBS with a sliding window of 20-25 bp [[Team:Heidelberg/HEARTBEAT_database#References|[22]]], [[Team:Heidelberg/HEARTBEAT_database#References|[3]]], [[Team:Heidelberg/HEARTBEAT_database#References|[23]]]. From 356 different TFBS for which Transfac contains at least one binding matrix 144 TFBS occurred at least within 50
 +
from 4390 natural promoters. TFBS with less than 50 counts were removed from the selection and
 +
not considered for further analysis. In Fig. 7-10 the spatial distributions of [[Team:Heidelberg/Eukaryopedia#Sp1|Sp1]], [[Team:Heidelberg/Eukaryopedia#AP-2|AP-2]], [[Team:Heidelberg/Eukaryopedia#IPF1|IPF1]] (Insulin
 +
promoter factor 1) and [[Team:Heidelberg/Eukaryopedia#Kid3|Kid3]] binding sites are shown. The red solid line represents the re-scaled probability density function (pdf). We introduced this function for two reasons. On the one hand the pdf is more robust with respect to outliers than a normal histogram. On the other hand we used the
 +
rescaled area under the curve between a shifting frame of 20 bases as a measurement for the significance of a particular TFBS occurrence. The vertical red line in each plot defines the maximum of the pdf. Around the respective base position the majority of binding motives are located within the natural promoters. The maximum of the pdf will serve in the following as the position where binding sites are introduced into our rational designed promoter sequences.
 +
 
 +
[[https://2009.igem.org/Team:Heidelberg/HEARTBEAT_database TOP]]
 +
 
 +
 
 +
=== An ''in vivo'' Test of HEARTBEAT predicted sequences ===
 +
 
 +
<div style="text-align:justify;"> SREBP-responsive sequences designed as [[Team:Heidelberg/HEARTBEAT_database#Promoter_design|outlined above]] were transfected in HeLa cells and SREBP was induced as [[Team:Heidelberg/Notebook_MaM#Induction|described]]. Promoter activity was then preliminarily analyzed by TECAN (automated fluorescence plate reader). Sequence HB9 showed significant induction (Fig. 11)</div>.
 +
 
 +
<div class="center">
 +
{| class="wikitable centered" border="2" rules="rows" style="border-color:white;"
 +
|[[Image:HD09_VDRspatial.png|thumb|200x200px|center|'''Figure 3: VDR binding site distribution''' X axis corresponds to base pair distance from the TSS, where 1000 is the TSS.]]
 +
|[[Image:HD09_SREBPspat.png|thumb|200x200px|center|'''Figure 4: SREBP binding site distribution''']]
 +
|[[Image:HD09_VDRcoin.png|thumb|200x200px|center|'''Figure 5: Transcription factors co-occurring with VDR.''' X axis corresponds to transcription factor ID, vertical axis corresponds to number of hits (transcription factor x occuring together with VDR)]]
 +
|-
 +
|[[Image:HD09_SREBPcoin.jpg|thumb|200x200px|center|'''Figure 6:  Transcription factors co-occurring with SREBP.''' X axis corresponds to transcription factor ID, vertical axis corresponds to number of hits (transcription factor x occurring together with SREBP)]]
 +
|[[Image:TimBild1.png|thumb|200x200px|center|'''Figure 7: Kid3 binding site distribution'''X axis corresponds to base pair distance from the TSS, where 1000 is the TSS.]]
 +
|[[Image:TimBild2.png|thumb|200x200px|center|'''Figure 8: IPF1 binding site distribution''' X axis corresponds to base pair distance from the TSS, where 1000 is the TSS.]]
 +
|-
 +
|[[Image:TimBild3.png|thumb|200x200px|center|'''Figure 9: Ap-2 binding site distribution'''X axis corresponds to base pair distance from the TSS, where 1000 is the TSS.]]
 +
|[[Image:TimBild4.png|thumb|200px|center|'''Figure 10: Sp1 binding site distribution'''X axis corresponds to base pair distance from the TSS, where 1000 is the TSS.]]
 +
|[[Image:HD09_HB9.png|thumb|200x300px|left|<div style="text-align:justify;">'''Figure 11: HEARTBEAT-predicted SREBP responsive promoter is induced by SREBP.''' Characterization by TECAN plate fluorescence reader. For Induction and inhibition conditions, refer to  [[Team:Heidelberg/Notebook_MaM#Induction|Material and Methods]]</div>]]
 +
|-
 +
|}
 +
</div>
 +
 
 +
[[Team:Heidelberg/HEARTBEAT_database#Introduction|[TOP]]]
==Discussion==
==Discussion==
-
During an intensive but at the same time extremely educative summer we seized the chance to
+
 
-
glance at several fundamental domains of the bioinformatic research. Thereby we followed one of
+
In predicting a functional promoter sequence, we followed one of the basic principles of synthetic biology: Studying nature quantitatively and applying the knowledge thus gained to the ''de novo'' creation of biological entities. Four points require further discussion:
-
the basic principles of synthetic biology: learning from the natural ideal before using it for our
+
# What is the scientific significance of the analysis created by HEARTBEAT?
-
own purposes. Starting from the initial idea to design synthetic promoter sequences in a totally
+
# How can the HEARTBEAT database be improved?
-
rational and computerised process we first screened the human genome for patterns in the structure
+
# What does the outcome of our ''in vivo'' test tell us?
-
of natural promoter sequences. After analysing over 4000 promoter sequences from the UCSC
+
# Can promoter design become fully automated?
-
database with the result of nearly 30000 detected transcription factor binding sites (TFBS) we
+
 
-
established the HEARTBEAT-database. Moreover upon statistical analysis we discovered 90 out of
+
[[https://2009.igem.org/Team:Heidelberg/HEARTBEAT_database TOP]]
-
356 TFBS distributions with one significant peak reflecting the occurrence of a particular binding
+
 
-
motive. 54 distributions contain two or equally high local maxima. For the remaining 212
+
 
-
transcription factors less than 50 TFBS could be detected by Promotersweep and hence , not in the
+
=== Transcription factors have a spatial binding preference ===
-
scope of our analysis . In order to overcome this problem in the future we plan to accomplish the
+
 
-
screening of human promoters and to systematically expand our promoter screening by including
+
Upon statistical analysis of over 4000 promoter sequences, we discovered 90 out of 356 TFBS distributions with one significant peak reflecting the spatial occurrence of a particular binding motive. 54 distributions contain two equally high local maxima. For the remaining 212 transcription factors less than 50 TFBS could be detected by Promotersweep and hence were not in the scope of our analysis. Transcription factors have been shown before to have certain spatial binding preferences [[Team:Heidelberg/HEARTBEAT_database#References|[3]]], [[Team:Heidelberg/HEARTBEAT_database#References|[4]]], but no analysis based upon multiple databases and gene orthologies was ever conducted. Our method therefore minimizes false positives, and still, our results confirm existing findings.
-
genomes from different mammalian species. With the increased variety of input sequences we hope
+
 
-
to decrease the influence of false positive hits in our statistics. Furthermore it could also help to
+
[[https://2009.igem.org/Team:Heidelberg/HEARTBEAT_database TOP]]
-
understand the 54 distributions with multiple maxima and answer the question about the occurance
+
 
-
of multiple potential binding sites of one single TF within a given promoter sequence. To make
+
=== Improving HEARTBEAT database ===
-
HEARTBEAT even more universal we plan to screen a broader range of the natural promoters
+
 
-
which will comprise the sequence downstream of the TSS as well. In parallel we want to include
+
As noted above, for 212 TFs, less than 50 TFBSs could be detected. In order to overcome this problem in the future we plan to accomplish the screening of the entire genome and to systematically expand our promoter screening by including genomes from different mammalian species. With the increased variety of input sequences we hope to decrease the influence of false positive hits in our statistics. Furthermore it could also help to understand the 54 distributions with multiple maxima and answer the question about the occurance of multiple potential binding sites of one single TF within a given promoter sequence. To make HEARTBEAT even more universal we plan to screen a broader range of the natural promoters which will comprise the sequence downstream of the TSS as well, as it is a well known fact that TF bind to introns and 5' UTR also [[Team:Heidelberg/HEARTBEAT_database#References|[24]]]. In parallel we want to focus on promoters involved in disease related pathways which promises deeper insight in the differences in their molecular regulation.
-
promoters involved in disease related pathways which promises deeper insight in the differences in
+
 
-
their molecular regulation. With this approach we will be able to model not only the physiological
+
[[https://2009.igem.org/Team:Heidelberg/HEARTBEAT_database TOP]]
-
state but contribute also to clinical research.
+
 
-
So far we only have been able to speculate about the biological consequences of the TF binding in a
+
 
-
particular distance to the TSS. Influenced by previous studies [] we assumed that not only the
+
=== ''In vivo'' test of predicted sequences shows that functional promoter can be predicted by HEARTBEAT ===
-
spatial preference but also the pairwise interaction of different TFs on one promoter determines a
+
 
-
cis-regulatory element. To prove our assumption, we developed a strategy first to design artificial
+
[[Image:HD09_hbdesign.jpg|thumb|left|380px|'''Figure 12: HEARTBEAT generates functional promoter sequences'''. Sequence HB 8 and 9 are inducible under  [[Team: Heidelberg/Notebook_MaM#Induction|appropriate conditions]] . Sequences marked red were introduced as a negative control]] <div style="text-align:justify;">Of all the designed sequences, HB8 and HB9 showed activity under [[Team: Heidelberg/Notebook_MaM#Induction|appropriate conditions]](compare Fig. 11, Results). These sequences are highly similar (see Fig. 12 for how SREs were placed) since they have Transcription Factor Binding Sites under ''both'' peaks of the distribution with at least one auxiliary Sp1 element. The other sequences either lack the second SREBP binding site or the Sp1 binding site. Sequences having another type of putative SREBP response element are not functional. '''We thus showed that the sequences which most closely reflect the expected distribution also work best!''' These results point out that promoters can be predicted by HEARTBEAT. It also becomes clear that from the study of synthetic promoters, very useful information about gene regulation can be deduced. In order to obtain the information it requires [[Team:Heidelberg/HEARTBEAT_network|elaborate models]], which we started to develop. In the combination with [[Team:Heidelberg/Project_Synthetic_promoters#Results|RA-PCR]], we provide two very powerful tools for the generation of synthetic promoters. We believe that especially by combining the two methods (see [[Team:Heidelberg/Project_Synthetic_promoters#M-RA-PCR|discussion M-RA-PCR]]), we are able to create ''any'' promoter desired in the foreseeable future, thereby accelerating the progress of virotherapy development and fundamental research alike.</div>
-
DNA constructs based on the outcome of HEARTBEAT and then to test these constructs in vivo. As
+
<br>
-
a proof of principle we focused on the transcription factors VDR and SREBP. We were able to
+
<br>
-
exclusively induce the pathways of VDR and SREBP in [[Team:Heidelberg/eucariopedia#MCF-7|MCF-7 ]]and [[Team:Heidelberg/eucariopedia#Hela|Hela]] cells, respectively (see [[Team:Heidelberg/Notebook_MaM#induction| induction]] Subsequently we assembled a total of 25 constructs divided into a blue (only VDR or
+
<br>
-
SREBP), a green (one auxiliary TFBS – e.g. AP-2) and an orange series (two auxiliary TFBS) based
+
<br>
-
on the pdf maxima of each TFBS-frequency distribution ([[Team:Heidelberg/HEARTBEAT_database#Results|Results]]). We further standardised and automated the process of promoter construction in our HEARTBEAT-GUI, which is from now on
+
<br>
-
online available for the iGEM community. By applying fuzzy logic, we furthermore purpose a network model which is capable not only of error checking as well as proving pathway functionality
+
<br>
-
but also of investigating exclusive pathway activation. We believe that the succeeding appliance of
+
 
-
the HEARTBEAT-DB, the HEARTBEAT-GUI [[Team:Heidelberg/HEARTBEAT_gui|HEARTBEAT GUI]] and finally the HEARTBEAT-FN will enable the
+
=== Automatization of promoter sequence generation ===
-
user to i) receive detailed information about the spatial occurrence of over 100 human TFBS in
+
 
-
natural promoter sequences ii) fast but precise design rational synthetic promoters and iii) predict
+
In this work, the design of promoter sequences was an ''ad-hoc'' process which took an entire day. We therefore developed a [[Team:Heidelberg/HEARTBEAT_gui|GUI]] which assists users  with sequence design and automates certain step. Tools such as this will make promoter design a very easy task in the future.
-
and check the experimental outcome and quality of his/her created sequence.
+
 
 +
[[Team:Heidelberg/HEARTBEAT_database#Introduction|[TOP]]]
== References ==
== References ==
-
[1] Harbison C. T., Gordon D. B., Lee T. I., Rinaldi N. J., Macisaac K. D., Danford T. W., Hannett N. M., Tagne J. B., Reynolds D. B., Yoo J., Jennings E. G., Zeitlinger J., Pokholok D. K., Kellis M., Rolfe P. A., Takusagawa K. T., Lander E. S., Gifford D. K., Fraenkel E., Young R. A. Transcriptional regulatory code of a eukaryotic genome. Nature 431, 99-104 (2004).<br>
+
[1] [https://2008.igem.org/Team:BCCS-Bristol/Modeling BCCS-Bristol: Best Model 2008]
-
[2] Hu, Z., Killion, P. J. & Iyer, V. R. Genetic reconstruction of a functional transcriptional regulatory network. Nature Genet. 39, 683-687 (2007).<br>
+
[2] Heintzman N. D. & Ren B. The gateway to transcription: identifying, characterizing and understanding promoters in the eukaryotic genome. ''Cellular and Molecular Life Science'' 64: 386-400 (2007).
-
[3] Gertz, J., Siggia E. D. & Cohen, B. A. Analysis of combinatorial ''cis''-regulation in synthetic and genomic promoters. Nature 457. 215-218 (2009)<br>
+
[3] Vardhanabhuti, S., Wang, J. & Hannenhalli, S. Position and distance specificity are important determinants of ''cis''-regulatory motifs in addition to evolutionary conservation. ''Nucl Acid Res'' 35(10): 3203-3213 (2007).<br>
-
[4] Roider H. G., Kanhere A., Manke T., Vingron M. Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics 23, 134-141 (2006)<br>
+
[4] Yokoyama, K. D., Ohler, U. & Wray, G. A. Measuring spatial preferences at fine-scale resolution identifies known and novel ''cis''-regulatory element candidates and functional motif-pair relationships. ''Nucl Acid Res'' 37(13): e92 (2009)<br>
-
[5] ref to come<br>
+
[5] Ellis T., Wang X. & Collins J. J. Diversity-based, model-guided construction of synthetic gene networks with predicted functions. ''Nature Biotechnology'' 27: 465-471 (2009).
-
[6] Andrianantoandro E., Basu S., Karig D. K., Weiss R. Synthetic biology: new engineering rules for an emerging discipline. Mol Sys Biol (2006)<br>
+
[6] Kent W. J., Sugnet C. W., Furey T. S., Roskin K. M., Pringle T. H., Zahler A. M. & Haussler D. The human genome browser at UCSC. ''Genome Res.'' 12(6): 996-1006 (2002).
-
[7] Alberts, B., Johnson A., Walter P., Lewis J. Molecular Biology of the Cell, 5th edition, 2008. Garland Science, Chapter 6<br>
+
[7] Kuhn R. M., Karolchik D., Zweig A. S., Wang T., Smith K. E., Rosenbloom K. R., Rhead B., Raney B. J., Pohl A., Pheasant M., Meyer L., Hsu F., Hinrichs A. S., Harte R. A., Giardine B., Fujita P., Diekhans M., Dreszer T., Clawson H., Barber G. P., Haussler D. & Kent W. J. The UCSC Genome Browser Database: update 2009. ''Nucleic Acids Res.'' 37(Database issue): D755-61 (2009).
-
[8] Vardhanabhuti, S., Wang, J. & Hannenhalli, S. Position and distance specificity are important determinants of ''cis''-regulatory motifs in addition to evolutionary conservation. Nucl Acid Res 35, 3203-3213 (2007).<br>
+
[8] Kanehisa M., Araki M., Goto S., Hattori M., Hirakawa M., Itoh M., Katayama T., Kawashima S., Okuda S., Tokimatsu T. & Yamanishi, Y. KEGG for linking genomes to life and the environment. ''Nucleic Acids Res.'' 36 (Database issue): D480-4 (2008).  
-
[9] Yokoyama, K. D., Ohler, U. & Wray, G. A. Measuring spatial preferences at fine-scale resolution identifies known and novel ''cis''-regulatory element candidates and functional motif-pair relationships. Nucl Acid Res, 1-21 (2009)<br>
+
[9] Kanehisa M., Goto S., Hattori M., Aoki-Kinoshita K.F., Itoh M., Kawashima S., Katayama T., Araki M. & Hirakawa M. From genomics to chemical genomics: new developments in KEGG. ''Nucleic Acids Res.'' 34 (Database issue): D354-7 (2006).
-
[10] Nelles, O. Nonlinear System Identification. Springer, 2000. <br>
+
[10] Kanehisa M. & Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. ''Nucleic Acids Res.'' 28(1): 27-30 (2000).  
-
[11] Bosl, W. J. BMC systems biology 1, 13 (2007).<br>
+
[11] del Val C., Pelz O., Glatting K.-H., Barta E. & Hotz-Wagenblatt, A. PromoterSweep: a tool for identification of transcription factor binding sites. Theor. Chem. Acc. DOI 10.1007: s00214-009-0643-8 (2009).
-
[12] Mathematical modeling of the lambda switch:a fuzzy logic approach.<br>
+
[12] [http://apr2006.archive.ensembl.org/info/software/compara/index.html Ensembl Compara (database)]
 +
Hubbard T., Andrews D., Caccamo M., Cameron G., Chen Y., Clamp M., Clarke L., Coates G., Cox T., Cunningham F., Curwen V., Cutts T., Down T., Durbin R., Fernandez-Suarez X. M., Gilbert J., Hammond M., Herrero J., Hotz H., Howe K., Iyer V., Jekosch K., Kahari A., Kasprzyk A., Keefe D., Keenan S., Kokocinsci F., London D., Longden I., McVicker G., Melsopp C., Meidl P., Potter S., Proctor G., Rae M., Rios D., Schuster M., Searle S., Severin J., Slater G., Smedley D., Smith J., Spooner W., Stabenau A., Stalker J., Storey R., Trevanion S., Ureta-Vidal A., Vogel J., White S., Woodwark C. & Birney E. Ensembl. ''Nucleic Acids Res.'' 33 (Database issue): D447-D453 (2005).  
-
[13] B. B. Aldridge, J. Saez-Rodriguez, J. L. Muhlich et al., PLoS
+
[13] [http://www.ncbi.nlm.nih.gov/homologene NCBI HomoloGene (database)]
-
computational biology 5 (4), e1000340 (2009).
+
-
<br>
+
 +
[14] [http://doop.abc.hu/ DoOP (database)]
 +
Barta E., Sebestyén E., Pálfy T. B., Tóth G., Ortutay C. P. & Patthy L. DoOP: Databases of Orthologous Promoters, collections of clusters of orthologous upstream sequences from chordates and plants. ''Nucleic Acids Res.'' 33 (Database issue): D86-D90 (2004).
 +
 +
[15] [http://www.epd.isb-sib.ch/ EPD (database)]
 +
Schmid C. D., Praz V., Delorenzi M., Périer R. & Bucher P. The Eukaryotic Promoter Database EPD: the impact of in silico primer extension. ''Nucleic Acids Res.'' 32 (Database issue): D82-5 (2004).
 +
 +
[16] [http://dbtss.hgc.jp/ DBTSS (database)] Wakaguri H., Yamashita R., Suzuki Y., Sugano S., Nakai K.
 +
DBTSS: database of transcription start sites, progress report 2008. ''Nucleic Acids Res''.36(Database issue): D97-101 (2008)
 +
 +
[17] [http://bayesweb.wadsworth.org/gibbs/gibbs.html Gibbs MotifSampler Lawrence (database)] C. E., Altschul S. F., Boguski M. S., Liu J. S., Neuwald A.F., Wootton J. C.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. ''Science'' 262:208-214(1993)
 +
 +
[18] [http://jaspar.genereg.net/ Jaspar Core Library (database)] Sandelin A., Alkema W., Engstrom P., Wasserman W. W., Lenhard B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles.
 +
''Nucleic Acids Res.'' 32(Database issue): D91-4 (2004).
 +
 +
[19] Matys V., Kel-Margoulis O. V., Fricke E., Liebich I., Land S., Barre-Dirrie A.,
 +
Reuter I., Chekmenev D., Krull M., Hornischer K., Voss N., Stegmaier P.,
 +
Lewicki-Potapov B., Saxel H., Kel A. E. & Wingender E. TRANSFAC'''®''' and its module TRANSCompel'''®''': transcriptional gene regulation in eukaryotes. ''Nucleic Acids Res.'' 34 (Database issue): D108-110 (2006).
 +
 +
[20] [http://www.mysql.com/?bydis_dis_index=1 Mysql]
 +
 +
[21] Kel A. E., Gössling E., Reuter I., Cheremushkin E., Kel-Margoulis O. V. & Wingender E. MATCH: A tool for searching transcription factor binding sites in DNA sequences.'' Nucleic Acids Res.'' 31: 3576-3579 (2003).
 +
 +
[22] Hannenhalli S. & Levy S. Predicting transcription factor synergism. ''Nucleic Acids Res.'' 30: 4278-84 (2002).
 +
 +
[23] FitzGerald P. C., Shlyakhtenko A., Mir A. A. & Vinson C. Clustering of DNA sequences in human promoters.'' Genome Res.'' 14: 1562-74 (2004).
 +
 +
[24] Alberts B., Johnson A., Walter P. & Lewis J. ''Molecular Biology of the Cell.'' 5th edition, 2008. Garland Science, Chapter 6
 +
 +
<br>
<!--
<!--
Journal
Journal
Line 263: Line 337:
-->
-->
-
|width="315px" style="background-color:#d8d5d0"|
+
[[Team:Heidelberg/HEARTBEAT_database#Introduction|[TOP]]]
 +
 
 +
|width="250px" style="padding: 0 20px 15px 15px; background-color:#d8d5d0"|
|}
|}

Latest revision as of 03:14, 22 October 2009

HEARTBEAT: Rational design of promoter sequences

Contents

Introduction

In the field of Synthetic Biology mathematical models based on bioinformatic methods are mainly used in order to describe the systems behavior of the involved synthetic constructs. To conduct the orchestra of these parts systems biologists try out different modeling approaches for example analytical or stochastic modeling depending on the underlying system. The introduced parameters in these models are naturally fitted to experimental data acquired in vivo [1]. But the interests in synthetic biology will not only be concerned about systems of existing parts, the focus of research in systems biology will rather be in the design of totally rational assembled constructs not described in nature so far. For that purpose we need models which predict and optimize the properties of these in silico designed constructs.

In order to enable this modeling approach it is useful to define a part as a basic biological function encoded as genetic material ([http://blip.tv/file/548184 listen to Drew Endy, President of the BioBricks Foundation]). Starting from this abstract concept we propose to devide up the "parental" part into smaller functional units which we define “subparts” that exhibit their own biological function. However, this function can only be exerted if all subparts are exactly assembled together in the “parental” part. If all system related parameters for every subpart are known, one will be able to model the characteristics of the "parental" part.

In this work we propose a model which describes the rational design of a certain promoter of interest. Furthermore, the model is able to predict the functional outcome of the created sequence and it can be used to improve the input sequence, both based on fuzzy logic modeling (HEARTBEAT fuzzy network (FN)). For the development of the model understanding of the functionality and composition of a promoter was essential: it consists of a core-promoter comprising one or more subparts which can work as the binding site for the basal transcription machinery, i.e. RNA polymerase, TATA-box and the Inr-element. Transcription factors regulating the transcriptional activity interact preferentially with the proximal promoter [2]. The intensity of the regulation is further defined by the position of the transcription factor binding site (TFBS) relative to the transcriptional start site (TSS), the binding affinity of the TF to a particular binding motive and also by synergistic effects evolving from interactions between the simultaneously binding of multiple Tfs. [3],[4] In addition the spacer sequences are directly influencing the promoter strength [5].

Based on these assumptions the optimal synthetic promoter sequence can be designed using the following parameters (Fig. 1):

  • all TFBS selected to be in the sequence
  • auxiliary TFBS, which are supposed to have a synergistic effect
  • optimal distance of all TFBS to the TSS
  • maximal spacer sequence quality between the TFBS
  • optimal distance between TFBSs

In the following articles we propose a standardized strategy to specify these parameters with computerized methods based on HEARTBEAT (Heidelberg Artificial Transcription Factor Binding Site Engineering and Assembly Tool). We will further explain the manual design of promoter sequences using these parameters for SREBP and VDR. Initital measurements show the potential of HEARTBEAT-designed sequences to function in vivo (see results). On the basis of our observations, we claim that our way of combining rational generation and experimental validation of synthetic constructs provides a novel and effective strategy in synthetic biology.

[TOP]

Figure 1: The generation of promoter sequences- from genome scanning to sequence design

[TOP]

Methods

Our promoter sequence was defined to be 1000 bp upstream of the TSS. The [http://genome.ucsc.edu/cite.html| UCSC Genome Browser] [6] provides reference sequence and working draft assemblies for a large collection of genomes including the human genome. We derived the provided sequences containing the 1000 bases upstream of annotated TSS for each RefSeq gene [7]. Upon this pre-selection, we further narrowed our choice of promoter set by selection of distinct pathways. KEGG ([http://www.genome.jp/kegg/Kyoto| Kyoto Encyclopedia of Genes and Genomes]) [8], [9], [10] provides a database of biological systems, consisting of several building blocks including for instance genes and proteins (KEGG GENES) or hierarchies and relationships of various biological objects (KEGG BRITE). Here, KEGG PATHWAY which comprises molecular interaction and reaction networks for metabolism, various cellular processes and human diseases was of particular interest. From this collection of molecular wiring maps, we chose all physiologically relevant pathways for our project, thereby ruling out tissue or other highly specific pathways, like olfactory / taste transduction as well as several pathways related to human diseases.

[TOP]

Promotersweep

One of the most challenging problems in bioinformatics remains the computerized localization of TFBS and the transcriptional start sites (TSS) as well as the determination of the core promoter. Many of the available motive discovery tools exhibit the problem of extended false positive predictions. The Promotersweep web-tool improves the accuracy of this analysis by combining a vaste number of different algorithms and methods simultaneously [11]. It integrates information from three homology databases (EnsEMBL Compara [12], NCBI HomoloGene [13],DoOP database [14]), five promoter databases(EPD [15], DBTSS [16]), six sequence motive identification tools (e.g. Gibbs MotifSampler [17]) and two matrix profile databases (Jaspar Core Library [18], Transfac ProfessionalLibrary [19]) to identify and annotate TFBS. The Promotersweep pipeline is started by entering a sequence, chosen between human or mouse as origin. Initially a homology search is performed by using different BLAST algorithms. As a result orthologous promoter regions are deduced from EnsEMBL, Homologene or DoOP, respectively. Subsequently, several motive discovery tools determine shared motives of orthologous or co-regulated sequences. In the last step each TFBS is identified and evaluated with the help of the Transfac - and the Jaspar Core library. Every identified TFBS is classified as weak, conserved or reliable - according to the similarity of the predictions of the different algorithms. In Fig. 2 the result for the Heat shock cognate 71 kDa protein (NM_153201) promoter is shown. For this promoter four different binding sites were discovered. Three of them were classified as reliable and one as conserved. For each hit, the output of Promotersweep contains the position of the motive relative to the TSS. So far we were able to analyse 4395 different promoter sequences, which hold 29966 TFBS in total.

Figure 2: Promotersweep output for the Heat shock cognate 71 kDa protein (NM_153201). Four different binding sites were discovered.

[TOP]

The HEARTBEAT-database

In order to retrieve the information computed by promotersweep as fast as possible we decided to develop a database structure based on MySQL (My systems query language) [20]. MySQL is one of the most popular relational database management systems. It offers not only a language to set up a hierarchical database but also an interface for easy manipulation of data. The advantage of MySQL is its very intuitive command language and the table structure which helps to minimize redundant data. Simple queries are written in a “SELECT - FROM - WHERE” format. With SELECT value all requested columns are specified. The FROM value calls the corresponding table and WHERE allows a more accurate selection. For our database the average query duration is below 200 ms. This enabled us to provide a fluent online access of HEARTBEAT through the HEARTBEAT GUI. Our data is stored in the tables “Main_Info” and “Gene_Info”. Main_Info contains all necessary data to define the location, binding motive and quality of a TFBS, whereas Gene_Info offers additional information for the gene as well as several gene annotations, where the TFBS is located on. In table 1 and 2 the table structure is shown for Main_Info and Gene_Info.

Table 1: Database structure: Main Info

RefseqID TF Name TF position start TF position end TF motive TF score BS quality TF matrix
NM_000201 VDR(V$VDR_Q3) 568 573 aagcga 0.906 conserved VDR
NM_000393 VDR(V$VDR_Q3) 825 832 tagggagg 0.955 conserved VDR
NM_000564 VDR(V$VDR_Q3) 235 243 tgggaaccc 0.908 conserved VDR
NM_000684 VDR(V$VDR_Q3) 660 665 ggggtg 0.900 reliable VDR
NM_000725 VDR(V$VDR_Q3) 911 916 gggtca 0.920 conserved VDR
NM_000525 SREBP(V$SREBP_Q6) 469 473 cgtga 0.991 conserved SREBP
NM_000817 SREBP(V$SREBP_Q3) 909 913 cccga 0.962 conserved SREBP
NM_000872 SREBP(V$SREBP_Q3) 352 357 acccca 0.989 conserved SREBP
NM_000905 SREBP(V$SREBP_Q3) 917 926 gagtcaccca 0.960 reliable SREBP
NM_000909 SREBP(V$SREBP_Q6) 526 532 gcgtgag 0.982 conserved SREBP
NM_001011551 SREBP(V$SREBP_Q3) 320 324 gaata 0.967 conserved SREBP
NM_001013620 SREBP(V$SREBP_Q6) 951 960 cactccagga 0.989 conserved SREBP
NM_001024 SREBP(V$SREBP_Q6) 974 978 acccg 0.987 reliable SREBP
NM_001025366 SREBP(V$SREBP_Q6) 556 561 ggggtc 0.983 reliable SREBP
NM_001025367 SREBP(V$SREBP_Q6) 556 561 ggggtc 0.983 reliable SREBP

Table 1: Database structure: Gene Info

RefseqID EntrezID Gene symbol EnsembleID TSS doop TSS DBTSS TSS EPD TSS MPromDB
NM_181537 342574 KRT27 ENSG00000171446 986 984 NA 984
NM_006522 7475 WNT6 ENSG00000115596 1118 1116 NA 1099
NM_013445 2571 GAD1 ENSG00000128683 NA 1199 NA 87

[TOP]

Promoter design

For the rational design of a responsive promoter construct several preliminary considerations have to be done. The first question which needs to be addressed concerns the inducibility of the pathway of interest. Preceding experiments revealed that for VDR (vitamin D receptor) as well as for SREBP (sterol regulatory element binding protein), convenient conditions and treatments exist under which each pathway can be exclusively activated without killing the chassis, that is in our studies the transfected cells. For further reference about induction conditions refer to Material and Methods. After deciding what kind of TFBS is to be included appropriate consensus motives where TFs would presumably bind on had to be selected. Reliable consensus motives can be deduced from matrices provided in the TRANSFAC database. In case of several different binding matrices, we chose the longest motive which contains the most definite bases [19]. Focusing from now on only on VDR and SREBP we created the frequency distributions of the TFBS occurrence for both TFs based on HEARTBEAT (see Fig. 3 and 4). As mentioned above the basic assumption of our model is that most transcription factors exhibit a spatial preference for binding to the DNA which includes not only the binding sequence and the distance to the TSS but also the mutual distance between potential TF pairs. Based on this concept we specified distance of the pdf-maxima to the TSS for both VDR and SREBP. Subsequently the binding motive is embedded into the artificial promoter around the position where the majority of binding sites are located in natural promoters. With this idea we created first a series of synthetic promoters in which we differed only the number of binding motives positioned around the pdf-maxima. For all sequences designed for VDR, see HEARTBEAT_VDR.pdf. For all sequences designed for SREBP, see HEARTBEAT_SREBP.pdf.

With a second series of artificially designed promoter sequences we tried to answer to what extend further auxiliary TFBS affect the binding activity of VDR and SREBP. Therefore we plotted the frequencies of all TFBS which are co-occuring when VDR or SREBP is present in a natural promoter sequence as well (see Fig. 5 and 6). For SREBP ZF5 was also present with a relative frequency of 60%. In case of VDR, AP-2 co-exists in 48% and WT1 in 54% of all VDR-promoters. In the following we proceeded in analogy to series one. We created a variety of sequences where we included TFBS in proximity to the pdf-maximum of their frequency distribution besides the VDR and SREBP binding sites. Depending on the number of species of TFBS we distinguish between a blue (1 TFs), green (2 TFs) and orange (3 different TFs) series. Finally all spacer sequences were filled with a random sequence with A:T and C:G content ratios being equal. To make sure that our sequences are as specific as possible we iteratively checked and modified our sequence with the Transfac match tool [21] until no other TFBS expect for our chosen ones were detected. Additionally we tested the sequence for every restriction site used in any Biobrick standard. Finally we added a HindIII at the 5' end and a SpeI restriction site at the 3' end to allow for cloning the construct into the reporter plasmid.

[TOP]

Results

HEARTBEAT analyzes transcription factor binding preferences

For the statistical analysis we plotted the absolute frequency of occurrence for each TF-binding site in a histogram against the position relative to the TSS where the TSS is located at base 1001 -1003. Each bin comprises 20 bases analogous to different low resolution approaches which analysed the spatial distribution of TFBS with a sliding window of 20-25 bp [22], [3], [23]. From 356 different TFBS for which Transfac contains at least one binding matrix 144 TFBS occurred at least within 50 from 4390 natural promoters. TFBS with less than 50 counts were removed from the selection and not considered for further analysis. In Fig. 7-10 the spatial distributions of Sp1, AP-2, IPF1 (Insulin promoter factor 1) and Kid3 binding sites are shown. The red solid line represents the re-scaled probability density function (pdf). We introduced this function for two reasons. On the one hand the pdf is more robust with respect to outliers than a normal histogram. On the other hand we used the rescaled area under the curve between a shifting frame of 20 bases as a measurement for the significance of a particular TFBS occurrence. The vertical red line in each plot defines the maximum of the pdf. Around the respective base position the majority of binding motives are located within the natural promoters. The maximum of the pdf will serve in the following as the position where binding sites are introduced into our rational designed promoter sequences.

[TOP]


An in vivo Test of HEARTBEAT predicted sequences

SREBP-responsive sequences designed as outlined above were transfected in HeLa cells and SREBP was induced as described. Promoter activity was then preliminarily analyzed by TECAN (automated fluorescence plate reader). Sequence HB9 showed significant induction (Fig. 11)
.
Figure 3: VDR binding site distribution X axis corresponds to base pair distance from the TSS, where 1000 is the TSS.
Figure 4: SREBP binding site distribution
Figure 5: Transcription factors co-occurring with VDR. X axis corresponds to transcription factor ID, vertical axis corresponds to number of hits (transcription factor x occuring together with VDR)
Figure 6: Transcription factors co-occurring with SREBP. X axis corresponds to transcription factor ID, vertical axis corresponds to number of hits (transcription factor x occurring together with SREBP)
Figure 7: Kid3 binding site distributionX axis corresponds to base pair distance from the TSS, where 1000 is the TSS.
Figure 8: IPF1 binding site distribution X axis corresponds to base pair distance from the TSS, where 1000 is the TSS.
Figure 9: Ap-2 binding site distributionX axis corresponds to base pair distance from the TSS, where 1000 is the TSS.
Figure 10: Sp1 binding site distributionX axis corresponds to base pair distance from the TSS, where 1000 is the TSS.
Figure 11: HEARTBEAT-predicted SREBP responsive promoter is induced by SREBP. Characterization by TECAN plate fluorescence reader. For Induction and inhibition conditions, refer to Material and Methods

[TOP]

Discussion

In predicting a functional promoter sequence, we followed one of the basic principles of synthetic biology: Studying nature quantitatively and applying the knowledge thus gained to the de novo creation of biological entities. Four points require further discussion:

  1. What is the scientific significance of the analysis created by HEARTBEAT?
  2. How can the HEARTBEAT database be improved?
  3. What does the outcome of our in vivo test tell us?
  4. Can promoter design become fully automated?

[TOP]


Transcription factors have a spatial binding preference

Upon statistical analysis of over 4000 promoter sequences, we discovered 90 out of 356 TFBS distributions with one significant peak reflecting the spatial occurrence of a particular binding motive. 54 distributions contain two equally high local maxima. For the remaining 212 transcription factors less than 50 TFBS could be detected by Promotersweep and hence were not in the scope of our analysis. Transcription factors have been shown before to have certain spatial binding preferences [3], [4], but no analysis based upon multiple databases and gene orthologies was ever conducted. Our method therefore minimizes false positives, and still, our results confirm existing findings.

[TOP]

Improving HEARTBEAT database

As noted above, for 212 TFs, less than 50 TFBSs could be detected. In order to overcome this problem in the future we plan to accomplish the screening of the entire genome and to systematically expand our promoter screening by including genomes from different mammalian species. With the increased variety of input sequences we hope to decrease the influence of false positive hits in our statistics. Furthermore it could also help to understand the 54 distributions with multiple maxima and answer the question about the occurance of multiple potential binding sites of one single TF within a given promoter sequence. To make HEARTBEAT even more universal we plan to screen a broader range of the natural promoters which will comprise the sequence downstream of the TSS as well, as it is a well known fact that TF bind to introns and 5' UTR also [24]. In parallel we want to focus on promoters involved in disease related pathways which promises deeper insight in the differences in their molecular regulation.

[TOP]


In vivo test of predicted sequences shows that functional promoter can be predicted by HEARTBEAT

Figure 12: HEARTBEAT generates functional promoter sequences. Sequence HB 8 and 9 are inducible under appropriate conditions . Sequences marked red were introduced as a negative control
Of all the designed sequences, HB8 and HB9 showed activity under appropriate conditions(compare Fig. 11, Results). These sequences are highly similar (see Fig. 12 for how SREs were placed) since they have Transcription Factor Binding Sites under both peaks of the distribution with at least one auxiliary Sp1 element. The other sequences either lack the second SREBP binding site or the Sp1 binding site. Sequences having another type of putative SREBP response element are not functional. We thus showed that the sequences which most closely reflect the expected distribution also work best! These results point out that promoters can be predicted by HEARTBEAT. It also becomes clear that from the study of synthetic promoters, very useful information about gene regulation can be deduced. In order to obtain the information it requires elaborate models, which we started to develop. In the combination with RA-PCR, we provide two very powerful tools for the generation of synthetic promoters. We believe that especially by combining the two methods (see discussion M-RA-PCR), we are able to create any promoter desired in the foreseeable future, thereby accelerating the progress of virotherapy development and fundamental research alike.







Automatization of promoter sequence generation

In this work, the design of promoter sequences was an ad-hoc process which took an entire day. We therefore developed a GUI which assists users with sequence design and automates certain step. Tools such as this will make promoter design a very easy task in the future.

[TOP]

References

[1] BCCS-Bristol: Best Model 2008

[2] Heintzman N. D. & Ren B. The gateway to transcription: identifying, characterizing and understanding promoters in the eukaryotic genome. Cellular and Molecular Life Science 64: 386-400 (2007).

[3] Vardhanabhuti, S., Wang, J. & Hannenhalli, S. Position and distance specificity are important determinants of cis-regulatory motifs in addition to evolutionary conservation. Nucl Acid Res 35(10): 3203-3213 (2007).

[4] Yokoyama, K. D., Ohler, U. & Wray, G. A. Measuring spatial preferences at fine-scale resolution identifies known and novel cis-regulatory element candidates and functional motif-pair relationships. Nucl Acid Res 37(13): e92 (2009)

[5] Ellis T., Wang X. & Collins J. J. Diversity-based, model-guided construction of synthetic gene networks with predicted functions. Nature Biotechnology 27: 465-471 (2009).

[6] Kent W. J., Sugnet C. W., Furey T. S., Roskin K. M., Pringle T. H., Zahler A. M. & Haussler D. The human genome browser at UCSC. Genome Res. 12(6): 996-1006 (2002).

[7] Kuhn R. M., Karolchik D., Zweig A. S., Wang T., Smith K. E., Rosenbloom K. R., Rhead B., Raney B. J., Pohl A., Pheasant M., Meyer L., Hsu F., Hinrichs A. S., Harte R. A., Giardine B., Fujita P., Diekhans M., Dreszer T., Clawson H., Barber G. P., Haussler D. & Kent W. J. The UCSC Genome Browser Database: update 2009. Nucleic Acids Res. 37(Database issue): D755-61 (2009).

[8] Kanehisa M., Araki M., Goto S., Hattori M., Hirakawa M., Itoh M., Katayama T., Kawashima S., Okuda S., Tokimatsu T. & Yamanishi, Y. KEGG for linking genomes to life and the environment. Nucleic Acids Res. 36 (Database issue): D480-4 (2008).

[9] Kanehisa M., Goto S., Hattori M., Aoki-Kinoshita K.F., Itoh M., Kawashima S., Katayama T., Araki M. & Hirakawa M. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 34 (Database issue): D354-7 (2006).

[10] Kanehisa M. & Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28(1): 27-30 (2000).

[11] del Val C., Pelz O., Glatting K.-H., Barta E. & Hotz-Wagenblatt, A. PromoterSweep: a tool for identification of transcription factor binding sites. Theor. Chem. Acc. DOI 10.1007: s00214-009-0643-8 (2009).

[12] [http://apr2006.archive.ensembl.org/info/software/compara/index.html Ensembl Compara (database)] Hubbard T., Andrews D., Caccamo M., Cameron G., Chen Y., Clamp M., Clarke L., Coates G., Cox T., Cunningham F., Curwen V., Cutts T., Down T., Durbin R., Fernandez-Suarez X. M., Gilbert J., Hammond M., Herrero J., Hotz H., Howe K., Iyer V., Jekosch K., Kahari A., Kasprzyk A., Keefe D., Keenan S., Kokocinsci F., London D., Longden I., McVicker G., Melsopp C., Meidl P., Potter S., Proctor G., Rae M., Rios D., Schuster M., Searle S., Severin J., Slater G., Smedley D., Smith J., Spooner W., Stabenau A., Stalker J., Storey R., Trevanion S., Ureta-Vidal A., Vogel J., White S., Woodwark C. & Birney E. Ensembl. Nucleic Acids Res. 33 (Database issue): D447-D453 (2005).

[13] [http://www.ncbi.nlm.nih.gov/homologene NCBI HomoloGene (database)]

[14] [http://doop.abc.hu/ DoOP (database)] Barta E., Sebestyén E., Pálfy T. B., Tóth G., Ortutay C. P. & Patthy L. DoOP: Databases of Orthologous Promoters, collections of clusters of orthologous upstream sequences from chordates and plants. Nucleic Acids Res. 33 (Database issue): D86-D90 (2004).

[15] [http://www.epd.isb-sib.ch/ EPD (database)] Schmid C. D., Praz V., Delorenzi M., Périer R. & Bucher P. The Eukaryotic Promoter Database EPD: the impact of in silico primer extension. Nucleic Acids Res. 32 (Database issue): D82-5 (2004).

[16] [http://dbtss.hgc.jp/ DBTSS (database)] Wakaguri H., Yamashita R., Suzuki Y., Sugano S., Nakai K. DBTSS: database of transcription start sites, progress report 2008. Nucleic Acids Res.36(Database issue): D97-101 (2008)

[17] [http://bayesweb.wadsworth.org/gibbs/gibbs.html Gibbs MotifSampler Lawrence (database)] C. E., Altschul S. F., Boguski M. S., Liu J. S., Neuwald A.F., Wootton J. C. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262:208-214(1993)

[18] [http://jaspar.genereg.net/ Jaspar Core Library (database)] Sandelin A., Alkema W., Engstrom P., Wasserman W. W., Lenhard B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 32(Database issue): D91-4 (2004).

[19] Matys V., Kel-Margoulis O. V., Fricke E., Liebich I., Land S., Barre-Dirrie A., Reuter I., Chekmenev D., Krull M., Hornischer K., Voss N., Stegmaier P., Lewicki-Potapov B., Saxel H., Kel A. E. & Wingender E. TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 34 (Database issue): D108-110 (2006).

[20] [http://www.mysql.com/?bydis_dis_index=1 Mysql]

[21] Kel A. E., Gössling E., Reuter I., Cheremushkin E., Kel-Margoulis O. V. & Wingender E. MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 31: 3576-3579 (2003).

[22] Hannenhalli S. & Levy S. Predicting transcription factor synergism. Nucleic Acids Res. 30: 4278-84 (2002).

[23] FitzGerald P. C., Shlyakhtenko A., Mir A. A. & Vinson C. Clustering of DNA sequences in human promoters. Genome Res. 14: 1562-74 (2004).

[24] Alberts B., Johnson A., Walter P. & Lewis J. Molecular Biology of the Cell. 5th edition, 2008. Garland Science, Chapter 6


[TOP]