HEARTBEAT: Design of rational promoter sequences

Introduction

In parallel to the biochemical method RA-PCR, our team worked on a computational approach for the rational design of promoter libraries. This strategy focuses on spatial dependencies between pairs of binding motifs and on how the distance of a particular binding motif to the transcription start site affects transcriptional activity. Similar to existing methods that predict spatial preferences of transcription factor binding sites (TFBS) by detecting statistically overrepresented motifs [], we used [http://genome.dkfz-heidelberg.de/| Promotersweep] [] to analyse and process the information of over 4000 human promoter sequences. The data for a total of 29966 TFBS were then stored in a MySQL database (DB), which we named Heidelberg Artificial Transcription factor Binding site Engineering and Assembly Tool, HEARTBEAT. Based on frequency distributions (see Results) for SREBP and VDR derived from HEARTBEAT, we designed 25 promoter sequences with different arrangements of TFBS (see Promoter design).

VDR- and SREBP-responsive promoter constructs were first cloned into a reporter plasmid (BBa_K203100) and subsequently transfected into MCF-7 and HeLa cells, respectively. The final screening was carried out by TECAN and flow cytometry measurements (see Results). Based on our results, we believe that combining rational generation with experimental validation of synthetic constructs provides a novel and effective strategy in synthetic biology.

The construction of HEARTBEAT-based rational promoter sequences comprises assembling the sequence with the TFBS of choice, adding spacer sequences, and checking for restriction sites and unintentional TFBS. To support these steps, we developed a graphical user interface, the HEARTBEAT GUI. Furthermore, we developed a computer model based on fuzzy logic that simulates the activity of the designed rational and random synthetic promoter sequences. Used in reverse, starting from the desired output, the model helps the user to optimise the input sequence. Altogether, HEARTBEAT comprises data analysis (HEARTBEAT-DB), a graphical user interface (HEARTBEAT GUI) and network modelling (HEARTBEAT fuzzy network, FN), and provides a powerful and promising instrument for the synthetic biology community.

Methods

In order to create a reliable database providing sufficient data, the very first task we faced was the selection of promoters and genes. For this purpose, we defined a promoter sequence as the 1000 bp upstream of the transcription start site (TSS). The [http://genome.ucsc.edu/cite.html| UCSC Genome Browser] [refA] provides reference sequences and working draft assemblies for a large collection of genomes, including the human genome. Text- and sequence-based searches provide quick and precise access to any region of interest. We obtained the provided dataset containing the 1000 bases upstream of the annotated TSS of each RefSeq gene [refB]. After this pre-selection, we narrowed our promoter set further by selecting distinct pathways. KEGG ([http://www.genome.jp/kegg/Kyoto| Kyoto Encyclopedia of Genes and Genomes]) [refC,D,E] provides a database of biological systems consisting of several building blocks, e.g. genes and proteins (KEGG GENES) or hierarchies and relationships of various biological objects (KEGG BRITE). Here, KEGG PATHWAY, which comprises molecular interaction and reaction networks for metabolism, various cellular processes and human diseases, was of particular interest. From this collection of molecular wiring maps, we chose all physiologically relevant pathways for our project, thereby excluding tissue-specific pathways, highly specific pathways such as olfactory and taste transduction, and several pathways related to human diseases.
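
The promoter definition used here can be illustrated with a minimal sketch. It assumes a chromosome sequence held in memory and a 0-based TSS index; the function names and inputs are hypothetical and are not part of the UCSC dataset, which already provides the upstream sequences ready-made.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def upstream_region(chromosome, tss_index, strand, length=1000):
    """Return up to `length` bases immediately upstream of the TSS.

    chromosome: forward-strand sequence (hypothetical in-memory input)
    tss_index:  0-based index of the TSS on the forward strand
    strand:     "+" or "-", the orientation of the gene
    """
    if strand == "+":
        # Upstream lies at lower forward-strand coordinates.
        start = max(0, tss_index - length)
        return chromosome[start:tss_index].upper()
    if strand == "-":
        # Upstream lies at higher coordinates and is read on the other strand.
        end = min(len(chromosome), tss_index + 1 + length)
        return reverse_complement(chromosome[tss_index + 1:end])
    raise ValueError("strand must be '+' or '-'")


# Toy example with a 4-base "promoter" instead of 1000 bases.
toy = "TTTTAACCGGGG"
plus_promoter = upstream_region(toy, 8, "+", length=4)
minus_promoter = upstream_region(toy, 3, "-", length=4)
```

For a gene on the minus strand the upstream bases are reverse-complemented so that every promoter is read in the direction of transcription, which keeps positions comparable across genes.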

Promotersweep

One of the most challenging problems in bioinformatics remains the computational localisation of TFBS and transcriptional start sites (TSS) as well as the determination of the core promoter. Many of the available motif discovery tools suffer from a high rate of false positive predictions. The Promotersweep web tool improves the accuracy of this analysis by combining a large number of different algorithms and methods. It integrates information from three homology databases (EnsEMBL Compara, NCBI HomoloGene, DoOP database), five promoter databases (e.g. EPD, DBTSS), six sequence motif identification tools (e.g. Meme, Gibbs MotifSampler) and two matrix profile databases (Jaspar Core Library, Transfac Professional Library) to identify and annotate TFBS. The Promotersweep pipeline is started by entering a sequence and selecting human or mouse as its origin. Initially, a homology search is performed using different BLAST algorithms, and orthologous promoter regions are deduced from EnsEMBL, HomoloGene or DoOP. Subsequently, several motif discovery tools determine motifs shared by orthologous or co-regulated sequences. In the last step, each TFBS is identified and evaluated with the help of the Transfac core library and the Jaspar Core library. Every identified TFBS is classified as weak, conserved or reliable according to how well the predictions of the different algorithms agree. Fig. [] shows the result for the promoter of the heat shock cognate 71 kDa protein (NM_153201). For this promoter, four different binding sites were discovered, three of which were classified as reliable and one as conserved. For each hit, the output of Promotersweep contains the position of the motif relative to the TSS. So far we have been able to analyse 4395 different promoter sequences, which contain 29966 TFBS in total.

The HEARTBEAT-database

In order to retrieve the information computed by Promotersweep as fast as possible, we decided to develop a database structure based on MySQL, one of the most popular relational database management systems, which uses the Structured Query Language (SQL). It offers not only a language for setting up a database but also an interface for easy manipulation of data. The advantages of MySQL are its very intuitive command language and its table structure, which helps to minimise redundant data. Simple queries are written in a “SELECT - FROM - WHERE” format: the SELECT clause specifies the requested columns, the FROM clause names the table, and the WHERE clause restricts the result to rows that satisfy a condition. For our database the average query duration is below 200 ms, which enabled us to provide smooth online access to HEARTBEAT through the HEARTBEAT GUI. Our data are stored in the tables “Main_Info” and “Gene_Info”. Main_Info contains all data necessary to define the location, binding motif and quality of a TFBS, whereas Gene_Info offers additional information and several annotations for the gene on which the TFBS is located. Fig. [] shows the table structure of Main_Info and Gene_Info.
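
A query in the SELECT - FROM - WHERE format could look like the following minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in so that it runs without a MySQL server; the column names are hypothetical, and the rows are a few entries copied from the Main_Info excerpt on this page.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server.
# Column names are illustrative assumptions, not the real HEARTBEAT schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE Main_Info (
           refseq_id TEXT, tf_name TEXT, tf_start INTEGER, tf_end INTEGER,
           tf_motif TEXT, tf_score REAL, bs_quality TEXT, tf_matrix TEXT)"""
)
rows = [
    ("NM_000201", "VDR(V$VDR_Q3)", 568, 573, "aagcga", 0.906, "conserved", "VDR"),
    ("NM_000684", "VDR(V$VDR_Q3)", 660, 665, "ggggtg", 0.900, "reliable", "VDR"),
    ("NM_000525", "SREBP(V$SREBP_Q6)", 469, 473, "cgtga", 0.991, "conserved", "SREBP"),
    ("NM_000905", "SREBP(V$SREBP_Q3)", 917, 926, "gagtcaccca", 0.960, "reliable", "SREBP"),
]
conn.executemany("INSERT INTO Main_Info VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)


def reliable_sites(connection, matrix):
    """Return location and motif of all 'reliable' sites for one TF matrix."""
    query = """SELECT refseq_id, tf_start, tf_end, tf_motif
               FROM Main_Info
               WHERE tf_matrix = ? AND bs_quality = 'reliable'
               ORDER BY tf_start"""
    return connection.execute(query, (matrix,)).fetchall()


vdr_hits = reliable_sites(conn, "VDR")
srebp_hits = reliable_sites(conn, "SREBP")
```

The `?` placeholder is SQLite's parameter syntax; the SELECT, FROM and WHERE clauses themselves are written the same way in MySQL.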

Results

For the statistical analysis, we plotted the absolute frequency of occurrence of each TFBS in a histogram against its position relative to the TSS, where the TSS is located at bases 1001-1003. Each bin comprises 20 bases, analogous to low-resolution approaches that analysed the spatial distribution of TFBS with a sliding window of 20-25 bp [Daigoro]. Of the 356 different TFBS for which Transfac contains at least one binding matrix, 144 occurred at least 50 times in the 4390 natural promoters. TFBS with fewer than 50 counts were excluded from further analysis. Fig. [] shows the spatial distributions of Sp1, AP-2, IPF1 (insulin promoter factor 1) and Kid3 binding sites. The red solid line represents the rescaled probability density function (pdf). We introduced this function for two reasons. First, the pdf is more robust with respect to outliers than a plain histogram. Second, we used the rescaled area under the curve within a sliding window of 20 bases as a measure of the significance of a particular TFBS occurrence. The vertical red line in each plot marks the maximum of the pdf, around which the majority of binding motifs are located in the natural promoters. In the following, the maximum of the pdf serves as the position at which binding sites are introduced into our rationally designed promoter sequences.
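
The binning, the count threshold and the search for the pdf maximum can be sketched as follows. This is only an illustration: the text does not state how the pdf was estimated, so a Gaussian kernel density estimate with an arbitrarily chosen bandwidth is assumed here, and the rescaling step is omitted. All function names and the example positions are hypothetical.

```python
import math
from collections import Counter

BIN_WIDTH = 20   # bases per histogram bin, as stated in the text
MIN_COUNT = 50   # TFBS with fewer hits are discarded, as stated in the text


def histogram(positions, bin_width=BIN_WIDTH):
    """Count hits per bin; bin 0 covers positions 1..bin_width."""
    return Counter((pos - 1) // bin_width for pos in positions)


def filter_frequent(hits_by_tf, min_count=MIN_COUNT):
    """Keep only TFs with at least `min_count` binding-site positions."""
    return {tf: pos for tf, pos in hits_by_tf.items() if len(pos) >= min_count}


def density(positions, x, bandwidth):
    """Gaussian kernel density estimate at x (assumed estimator)."""
    norm = len(positions) * bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in positions)
    return total / norm


def pdf_maximum(positions, length=1000, bandwidth=20.0):
    """Return the position in 1..length where the estimated pdf is highest."""
    return max(range(1, length + 1),
               key=lambda x: density(positions, x, bandwidth))


# Invented positions: a cluster near base 910 and one outlier at base 100.
example = [900, 905, 910, 915, 920, 100]
bins = histogram(example)
peak = pdf_maximum(example)
```

In this toy input the single hit at base 100 fills a histogram bin of its own but has no visible effect on the location of the density peak, which illustrates the robustness argument made above.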

MySQL

Main_Info (excerpt)

RefseqID	TF Name	TF position start	TF position end	TF motif	TF score	BS quality	TF matrix
NM_000201	VDR(V$VDR_Q3)	568	573	aagcga	0.906	conserved	VDR
NM_000393	VDR(V$VDR_Q3)	825	832	tagggagg	0.955	conserved	VDR
NM_000564	VDR(V$VDR_Q3)	235	243	tgggaaccc	0.908	conserved	VDR
NM_000684	VDR(V$VDR_Q3)	660	665	ggggtg	0.900	reliable	VDR
NM_000725	VDR(V$VDR_Q3)	911	916	gggtca	0.920	conserved	VDR
NM_000525	SREBP(V$SREBP_Q6)	469	473	cgtga	0.991	conserved	SREBP
NM_000817	SREBP(V$SREBP_Q3)	909	913	cccga	0.962	conserved	SREBP
NM_000872	SREBP(V$SREBP_Q3)	352	357	acccca	0.989	conserved	SREBP
NM_000905	SREBP(V$SREBP_Q3)	917	926	gagtcaccca	0.960	reliable	SREBP
NM_000909	SREBP(V$SREBP_Q6)	526	532	gcgtgag	0.982	conserved	SREBP
NM_001011551	SREBP(V$SREBP_Q3)	320	324	gaata	0.967	conserved	SREBP
NM_001013620	SREBP(V$SREBP_Q6)	951	960	cactccagga	0.989	conserved	SREBP
NM_001024	SREBP(V$SREBP_Q6)	974	978	acccg	0.987	reliable	SREBP
NM_001025366	SREBP(V$SREBP_Q6)	556	561	ggggtc	0.983	reliable	SREBP
NM_001025367	SREBP(V$SREBP_Q6)	556	561	ggggtc	0.983	reliable	SREBP

Gene_Info (excerpt)

RefseqID	EntrezID	Gene symbol	EnsemblID	TSS DoOP	TSS DBTSS	TSS EPD	TSS MPromDB
NM_181537	342574	KRT27	ENSG00000171446	986	984	NA	984
NM_006522	7475	WNT6	ENSG00000115596	1118	1116	NA	1099
NM_013445	2571	GAD1	ENSG00000128683	NA	1199	NA	87

Promoter design

For the rational design of a responsive promoter construct, several preliminary considerations have to be made. The first question concerns the inducibility of the pathway of interest. Preceding experiments revealed that for both VDR (vitamin D receptor) and SREBP (sterol regulatory element binding protein), convenient conditions and treatments exist under which each pathway can be activated exclusively without killing the chassis, which in our studies is the transfected cells (for further information about the experimental set-up see []). After deciding which TFBS to include, we had to find appropriate consensus motifs to which our chosen TFs would presumably bind. Reliable consensus motifs can be deduced from the matrices provided in the Transfac database []. Where several binding matrices were available, we chose the longest motif containing the most definite bases (further explanation in the eukarypedia). Focusing from now on only on VDR and SREBP, we created the frequency distributions of TFBS occurrence for both TFs based on HEARTBEAT (see Fig. []). As mentioned above, the basic assumption of our model is that most transcription factors exhibit a spatial preference for binding to the DNA, which involves not only the binding sequence and the distance to the TSS but also the mutual distance between potential TF pairs. Based on this concept, we determined the distance of the pdf maxima to the TSS for both VDR and SREBP. The binding motif is then embedded into the artificial promoter around the position where the majority of binding sites are located in natural promoters. With this idea, we first created a series of synthetic promoters in which we varied only the number of binding motifs positioned around the pdf maxima. Examples of this series can be found in Fig. [].
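
How a varying number of motif copies might be positioned around a pdf maximum can be sketched as follows. The exact layout rule is not described in the text, so this sketch assumes that the copies are centred on the peak and separated by a fixed spacer; the function name, the gap of 10 bases and the example peak position are illustrative assumptions.

```python
def motif_starts(peak, motif_length, copies, gap=10, promoter_length=1000):
    """Return 1-based start positions for `copies` motifs centred on `peak`.

    gap is the number of spacer bases between neighbouring copies
    (an illustrative value, not taken from the text).
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # Total span of all copies plus the spacers between them.
    block = copies * motif_length + (copies - 1) * gap
    first = peak - block // 2
    starts = [first + i * (motif_length + gap) for i in range(copies)]
    last_end = starts[-1] + motif_length - 1
    if first < 1 or last_end > promoter_length:
        raise ValueError("motifs do not fit into the promoter")
    return starts


# One and three copies of a 10-base motif around a hypothetical peak at 900.
single = motif_starts(900, 10, 1)
triple = motif_starts(900, 10, 3)
```

Keeping the block centred means that adding copies changes only the number of motifs and not the average distance of the motifs to the TSS.
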
With a second series of artificially designed promoter sequences, we tried to determine to what extent additional auxiliary TFBS affect the binding activity of VDR and SREBP. To this end, we plotted the frequencies of all TFBS that co-occur with VDR or SREBP in a natural promoter sequence (see Fig. []). For SREBP, ZF5 was also present with a relative frequency of 60%. In the case of VDR, AP-2 co-occurs in 48% and WT1 in 54% of all VDR promoters. We then proceeded analogously to series one: we created a variety of sequences in which, besides the VDR and SREBP binding sites, auxiliary TFBS were included close to the pdf maximum of their own frequency distribution. Depending on the number of different TFBS, we distinguish between a blue (1 TF), a green (2 TFs) and an orange (3 different TFs) series. All spacer regions were then filled with a random sequence with equal A:T and C:G content. To make our sequences as specific as possible, we iteratively checked and modified each sequence with the Transfac Match tool until no TFBS other than our chosen ones were detected. Additionally, we tested the sequence for every restriction site used in any BioBrick standard. Finally, we added a HindIII restriction site at the 5' end and a SpeI restriction site at the 3' end so that the construct can be cloned into the reporter plasmid.
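
The assembly steps described above (spacer filling, restriction site check, flanking sites) can be sketched as follows. This is a simplified illustration and not the HEARTBEAT GUI code: the check for unintended TFBS with the Transfac Match tool is omitted, the list of forbidden enzymes is a non-exhaustive assumption (the BioBrick RFC 10 enzymes plus HindIII), and all function names are hypothetical.

```python
import random

HINDIII = "AAGCTT"   # added at the 5' end
SPEI = "ACTAGT"      # added at the 3' end

# Recognition sites that must not occur inside the promoter itself.
# All of them are palindromic, so scanning one strand is sufficient.
FORBIDDEN_SITES = {
    "EcoRI": "GAATTC", "XbaI": "TCTAGA", "SpeI": SPEI,
    "PstI": "CTGCAG", "NotI": "GCGGCCGC", "HindIII": HINDIII,
}


def random_spacer(length, rng):
    """Random spacer with equal (or near-equal) A/T and C/G content."""
    if length < 0:
        raise ValueError("motifs overlap or exceed the promoter length")
    pool = list(("ACGT" * (length // 4 + 1))[:length])
    rng.shuffle(pool)
    return "".join(pool)


def find_sites(sequence, sites=FORBIDDEN_SITES):
    """Return the names of all enzymes whose site occurs in the sequence."""
    seq = sequence.upper()
    return sorted(name for name, site in sites.items() if site in seq)


def build_promoter(motifs, length, rng, max_tries=1000):
    """Embed motifs at 1-based starts, fill the rest with spacer, add flanks.

    motifs: list of (start, sequence) pairs, sorted and non-overlapping.
    """
    for _ in range(max_tries):
        parts, cursor = [], 1
        for start, motif in motifs:
            parts.append(random_spacer(start - cursor, rng))
            parts.append(motif.upper())
            cursor = start + len(motif)
        parts.append(random_spacer(length - cursor + 1, rng))
        core = "".join(parts)
        full = HINDIII + core + SPEI
        # Accept only if the core is clean and the flanks created no extra site.
        if (not find_sites(core) and find_sites(full) == ["HindIII", "SpeI"]
                and full.count(HINDIII) == 1 and full.count(SPEI) == 1):
            return full
    raise RuntimeError("no sequence without forbidden sites found")


# Toy construct: one 6-base motif at position 11 of a 40-base promoter.
rng = random.Random(0)
promoter = build_promoter([(11, "ggggtc")], 40, rng)
```

Drawing a fresh spacer whenever a forbidden site appears mirrors the iterative check-and-modify procedure described above; a real design would additionally re-run the TFBS scan after every change.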

Discussion

During an intensive but extremely instructive summer, we seized the chance to explore several fundamental domains of bioinformatics research. In doing so we followed one of the basic principles of synthetic biology: learning from the natural model before using it for our own purposes. Starting from the initial idea of designing synthetic promoter sequences in a fully rational and computerised process, we first screened the human genome for patterns in the structure of natural promoter sequences. After analysing over 4000 promoter sequences from the UCSC database, which yielded nearly 30000 detected transcription factor binding sites (TFBS), we established the HEARTBEAT-database. Statistical analysis then revealed that 90 of the 356 TFBS distributions show one significant peak reflecting the occurrence of a particular binding motif, and 54 distributions contain two or more equally high local maxima. For the remaining 212 transcription factors, Promotersweep detected fewer than 50 TFBS, so they were outside the scope of our analysis. To overcome this limitation, we plan to complete the screening of human promoters and to expand our promoter screening systematically by including genomes from other mammalian species. With the increased variety of input sequences we hope to reduce the influence of false positive hits on our statistics. It could also help us to understand the 54 distributions with multiple maxima and to answer the question of the occurrence of multiple potential binding sites of a single TF within a given promoter sequence. To make HEARTBEAT even more universal, we plan to screen a broader region of the natural promoters that also includes the sequence downstream of the TSS. In parallel, we want to include promoters involved in disease-related pathways, which promises deeper insight into the differences in their molecular regulation.
With this approach we will be able not only to model the physiological state but also to contribute to clinical research. So far we have only been able to speculate about the biological consequences of TF binding at a particular distance from the TSS. Influenced by previous studies [], we assumed that not only the spatial preference but also the pairwise interaction of different TFs on one promoter determines a cis-regulatory element. To test this assumption, we developed a strategy to first design artificial DNA constructs based on the outcome of HEARTBEAT and then test these constructs in vivo. As a proof of principle we focused on the transcription factors VDR and SREBP. We were able to induce the VDR and SREBP pathways exclusively in MCF-7 and HeLa cells, respectively (see induction). Subsequently we assembled a total of 25 constructs, divided into a blue series (only VDR or SREBP), a green series (one auxiliary TFBS, e.g. AP-2) and an orange series (two auxiliary TFBS), based on the pdf maxima of each TFBS frequency distribution (see Results). We further standardised and automated the process of promoter construction in our HEARTBEAT-GUI, which is now available online for the iGEM community. By applying fuzzy logic, we furthermore propose a network model that is capable not only of error checking and of verifying pathway functionality but also of investigating exclusive pathway activation. We believe that the successive application of the HEARTBEAT-DB, the HEARTBEAT-GUI and finally the HEARTBEAT-FN will enable the user to i) obtain detailed information about the spatial occurrence of over 100 human TFBS in natural promoter sequences, ii) design rational synthetic promoters quickly but precisely, and iii) predict and check the experimental outcome and quality of the designed sequence.

References

[1] Harbison, C. T. et al. Transcriptional regulatory code of a eukaryotic genome. Nature 431, 99-104 (2004).

[2] Hu, Z., Killion, P. J. & Iyer, V. R. Genetic reconstruction of a functional transcriptional regulatory network. Nature Genet. 39, 683-687 (2007).

[3] Gertz, J., Siggia, E. D. & Cohen, B. A. Analysis of combinatorial cis-regulation in synthetic and genomic promoters. Nature 457, 215-218 (2009).

[4] Roider, H. G. et al. Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics 23, 134-141 (2006).

[5] ref to come

[6] Andrianantoandro, E. et al. Synthetic biology: new engineering rules for an emerging discipline. Mol Sys Biol (2006).

[7] Alberts, B. et al. Molecular Biology of the Cell, 5th edition. Garland Science, 2008, Chapter 6

[8] Vardhanabhuti, S., Wang, J. & Hannenhalli, S. Position and distance specificity are important determinants of cis-regulatory motifs in addition to evolutionary conservation. Nucleic Acids Res. 35, 3203-3213 (2007).

[9] Yokoyama, K. D., Ohler, U. & Wray, G. A. Measuring spatial preferences at fine-scale resolution identifies known and novel cis-regulatory element candidates and functional motif-pair relationships. Nucleic Acids Res., 1-21 (2009).

[10] Nelles, O. Nonlinear System Identification. Springer, 2000.

[11] Bosl, W. J. BMC systems biology 1, 13 (2007).

[12] Mathematical modeling of the lambda switch:a fuzzy logic approach.

[13] B. B. Aldridge, J. Saez-Rodriguez, J. L. Muhlich et al., PLoS computational biology 5 (4), e1000340 (2009).

Team:Heidelberg/HEARTBEAT database

From 2009.igem.org