Team:Heidelberg/Notebook modeling

From 2009.igem.org

Revision as of 00:43, 20 October 2009 by Naoiwamoto (Talk | contribs)

Notebook HEARTBEAT

Welcome to the notebook of the HEARTBEAT (Heidelberg Artificial Transcription Factor Binding Sites Assembly and Engineering Tool) project. This notebook covers the work on three sublanes: HEARTBEAT database (DB), HEARTBEAT graphical user interface (GUI) and HEARTBEAT fuzzy modeling (FN), plus some additional work on logo and wiki design. Have fun!

Contents

July

August

September

October

July

7-27-2009

Meeting with Oliver Pelz
- Discuss general ideas of our Database Structure and Content
- An introduction to PromoterSweep (LINK). PromoterSweep screens a given sequence for conserved regions, yielding consensus sequences, and then screens these for TFBS by database search (TRANSFAC, JASPAR) (LINK)
- Our new database should contain the following information: promoter sequence, TFs, TFBS, position of TFBS, number of binding TFBS, "host organism"
- We decided to use MySQL as the database system for this task; it will also allow a graphical representation of the database on the web later.
- GUI on wiki: which language? php? javascript?
- Problems: access to PromoterSweep (Husar Bioinformatics Group, DKFZ), choice of Promoter Database (DoOP, UCSC, EnsEMBL) (LINK)

aim: create the database by the end of August

August

Week	Days
	Mon	Tue	Wed	Thu	Fri	Sat	Sun
31	-	-	-	-	-	1	2
32	3	4	5	6	7	8	9
33	10	11	12	13	14	15	16
34	17	18	19	20	21	22	23
35	24	25	26	27	28	29	30
36	31	-	-	-	-	-	-

8-3-2009

First contact with MySQL
Start making an overview of other teams' projects
Configuring our Virtual Server

8-4-2009

Official Team Meeting (LINK) @ BQ seminar room 43: preparing presentation & writing meeting report
Start installing development environment on our internal server
- GNOME
- Mediawiki

8-5-2009

Meeting with Tobias Bauer & Anna-Lena Kranz (Theoretical Bioinformatics, DKFZ) @ TP3, DKFZ
- Integrating ideas of PromoterSweep, Transfac as well as DoOP/CisRED
- select "interesting" TFs (e.g. HIF, NFkB, c-myc, p53) for Wetlab
- select "interesting" pathways (e.g. cell cycle, inflammation, metabolism etc)
- future experimental validation: ChIP-on-Chip
  - for this we need a TFBS-free sequence
- idea: plot histogram of TFBS relative to TSS
  - problem: choice of sequence: upstream only? include downstream?
- new programming language: R and perl
- next meeting: Friday after team meeting

Meeting with Karl-Heinz Glatting (HUSAR, DKFZ) @ TP3, DKFZ
- An introduction to PromoterSweep
- Structure and analysis principles of PromoterSweep
- Output is stored in an XML file. This means we have to parse the XML code.
- Oliver Pelz will help us with the programming

Protocol of the meeting can be downloaded from here.

Start working with MySQL
request UNIX/HUSAR/HPC access at DKFZ (Nao)
first contact with several databases: EnsEMBL, Compara, cisRED, DoOP, TiProD, contra (LINKS)

8-6-2009

Meeting with Oliver Pelz
- defining workflow with PromoterSweep, Matrix Profile Search and introduction into different Motif Discovery Algorithms

installation of NX server for access to the internal server from Windows
configure development environment (printing from Linux, configure Mediawiki)
defining basic concept of database construction
- we select annotated promoter sequences in DoOP
- we make a selection of pathway of interest using KEGG
- narrow down the number of target promoter sequences to fewer than 10000.

8-7-2009

Official Team Meeting on Scheduling
Meeting with Anna-Lena and Tobias
- Introduction into R
- Tobias will give us access to their computing cluster (Group Roland Eils)
- Promoter Selection: DoOP, EnsEMBL, or UCSC?

HUSAR account arrived
installation of R, R editor and perl editor
further configuration of our internal server / mediawiki

8-10-2009

first contact with R and perl
playing around with R and perl
playing around with R library: Biobase
check working on DKFZ cluster

8-11-2009

defining programming languages: perl, R, MySQL
retrieving first Promotersweep output files

Meeting with Marti
- ideas for modeling
  - we will have at least three colors which overlap in their spectra.
  - a very nice approach will be Fuzzy Logic Modeling.
  - idea 1: error checking of affinity: compare expectation to experimental results and figure out where the error is hiding
  - idea 2: create & visualize fancy and fuzzy data from in silico simulation
- combine: promoter, output and graphic representation (GRAPHIC!)
- next meeting with Marti: end of next week.

extract NCBI Entrez Gene IDs with R and perl
MAC addresses registered for BioQuant network

8-12-2009

configure perl working environment
study structure of DoOP database
download DoOP and load DoOP database into MySQL

8-13-2009

trying out some DoOP queries
download fasta sequences from UCSC gene browser (LINK)
mapping of NCBI Entrez Gene IDs with RefSeq IDs
configure perl working environment on Windows XP
contact Endre Sebestyen concerning the perl module Bio-DoOP-DoOP (LINK)

8-14-2009

start PromoterSweep Analysis over Weekend

8-18-2009

Tim, Stephen, from here on you have to add your entries yourselves!

study outputfile of PromoterSweep. check out general structure and pick up useful information.
result is grouped in: General Info, Best Genomic Mapping, Promoter DB Search Result, Graphical Overview, Combined Binding Sites, TSS and Exon Info, Profile Matrices and Generated Output Files.
upon selection, sections of interest will be collected and made ready for entry into MySQL DB
discuss table structure of our database

What should our database be called? - Brainstorming -
- SHOULD contain: iGEM, Transcription Factor, Binding Site, Promoter, synthetic biology, Heidelberg
- MAY contain: position, heartbeat, prediction, assembly, eukaryotes
- and still more keywords to come

8-19-2009

parse Promotersweep xml file into tab-separated text file (PERL CODE?)
- the text file should contain: RefSeq ID, TF name, TFBS position, TF motif sequence, TFBS Quality, TSS, Entrez ID, EnsEMBL ID, further gene description.
- this confronted us with several programming problems concerning multiple arrays, hashes and their combinations (arrays of hashes, hashes of hashes, etc.), so we are
studying the structure and basic concepts of hashes & keys
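
The XML-to-table step described in this entry can be sketched as follows. This is only an illustration in Python, not the team's Perl script: the PromoterSweep XML schema is not shown in this notebook, so the element and attribute names (`promoter`, `site`, `refseq`, ...) and all values are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical XML layout and made-up values; the real PromoterSweep
# output format is not documented in this notebook.
EXAMPLE_XML = """
<promotersweep>
  <promoter refseq="NM_000000" entrez="0000" tss="1000">
    <site tf="Sp1" start="880" end="889" motif="GGGGCGGGGC" quality="weak"/>
    <site tf="HIF1" start="950" end="957" motif="TACGTGCG" quality="weak"/>
  </promoter>
</promotersweep>
"""

def xml_to_rows(xml_text):
    """Flatten each binding site into one row, repeating promoter-level fields."""
    rows = []
    root = ET.fromstring(xml_text.strip())
    for promoter in root.iter("promoter"):
        for site in promoter.iter("site"):
            rows.append([
                promoter.get("refseq"), site.get("tf"),
                site.get("start"), site.get("end"),
                site.get("motif"), site.get("quality"),
                promoter.get("tss"), promoter.get("entrez"),
            ])
    return rows

def rows_to_tsv(rows):
    """Serialize the rows as tab-separated text, one binding site per line."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()

rows = xml_to_rows(EXAMPLE_XML)
print(rows_to_tsv(rows), end="")
```

Repeating the promoter-level fields in every row avoids the nested hash-of-hashes bookkeeping, at the cost of some redundancy in the text file.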

8-20-2009

pre-decision for our table-structure
- Table: Main_Info
  - RefSeq ID, TF, TF motif start & end position, TFBS motif score, TFBS quality, TSS database info
- Table: Gene_Info
  - Ensembl_ID, Gene Symbol, Gene Description.
- we go for the RefSeq ID to be the key connecting these two tables.
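
The two-table layout above could be sketched as follows. This is a minimal illustration in Python, using the built-in sqlite3 module as a stand-in for MySQL; the column names, types and example rows are assumptions, not the actual HEARTBEAT schema.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server; all names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Gene_Info (
    refseq_id   TEXT PRIMARY KEY,
    ensembl_id  TEXT,
    gene_symbol TEXT,
    description TEXT
);
CREATE TABLE Main_Info (
    id           INTEGER PRIMARY KEY,
    refseq_id    TEXT NOT NULL REFERENCES Gene_Info(refseq_id),
    tf           TEXT NOT NULL,
    motif_start  INTEGER,
    motif_end    INTEGER,
    motif_score  REAL,
    tfbs_quality TEXT,
    tss_source   TEXT
);
""")

# Made-up rows, only to show the join on the RefSeq ID.
conn.execute(
    "INSERT INTO Gene_Info VALUES (?, ?, ?, ?)",
    ("NM_000000", "ENSG00000000000", "GENE1", "placeholder gene"),
)
conn.execute(
    "INSERT INTO Main_Info "
    "(refseq_id, tf, motif_start, motif_end, motif_score, tfbs_quality, tss_source) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("NM_000000", "Sp1", -120, -111, 0.93, "weak", "DoOP"),
)

rows = conn.execute("""
    SELECT g.gene_symbol, m.tf, m.motif_start
    FROM Main_Info AS m
    JOIN Gene_Info AS g ON g.refseq_id = m.refseq_id
""").fetchall()
print(rows)
```

Keeping one row per binding site in Main_Info and one row per gene in Gene_Info means gene descriptions are stored once, however many sites a promoter has.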

8-21-2009

update script for parsing the Promotersweep output files due to unexpected errors
we forgot to include "weak" as a category for the TFBS quality - added!
PromoterSweep result contains information about TSS derived from different promoter databases. On which should we rely, if they differ from each other?
- We give the DoOP database the highest priority, since its RefSeq ID results agree well with those of other databases (e.g. DBTSS).

order MATLAB (http://www.mathworks.com/) iGEM licence

search for a tool to use MySQL in R programming environment
wiki: write a short article about the German Cancer Research Center (DKFZ)

Meeting with Anna-Lena: once we established our database... then
- two strategies:
  - manually select interesting transcription factors and analyse them using database queries
  - plot histograms of TFBS occurrence within the target promoter sequence (TSS - 1000bp upstream) for each TF and make a systematic analysis
- we go for both!
- idea for the future: we can analyze combinatorial appearance of distinct TF pairs
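
The idea of analyzing the combinatorial appearance of TF pairs can be illustrated by counting how often two TFs have predicted sites in the same promoter. This is a hedged sketch in Python; the promoter IDs and the promoter-to-TF mapping are made up and would in practice come from a database query.

```python
from collections import Counter
from itertools import combinations

# Made-up mapping: promoter -> TFs with at least one predicted binding site.
promoter_tfs = {
    "NM_A": {"Sp1", "HIF1", "p53"},
    "NM_B": {"Sp1", "HIF1"},
    "NM_C": {"Sp1", "NFkB"},
}

def pair_counts(promoter_tfs):
    """Count, for every unordered TF pair, the promoters in which both occur."""
    counts = Counter()
    for tfs in promoter_tfs.values():
        # Sorting makes each pair appear in one canonical order.
        counts.update(combinations(sorted(tfs), 2))
    return counts

pairs = pair_counts(promoter_tfs)
for pair, n in pairs.most_common():
    print(pair, n)
```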

We have a name for our database - we call it -

- wait for it -

HEARTBEAT database (Heidelberg Artificial Transcription Factor Binding Site Engineering and Assembly Tool)

8-24-2009

Meeting with Marti: defining output modeling strategies
- "exclusive promoters"
  - a model for predicting the behaviour of activation of one, two, three... promoters at the same time.
  - the potential of this model lies in the possibility to model single as well as many pathways in combination and even check for synergistic effects
  - modeling logic: quantitative ODE VS. quantitative & qualitative fuzzy logic
- "error checking"
  - what to capture/measure: affinity of transcription factor binding to DNA
    - calculate score / reliability
    - phenotypic measurement
  - if we have time in the end: model/experiment optimization by wetlab-drylab-rounds (GRAPHIC)
  - if we do not have much time: figure out where the catch is
- modeling layers & final visualization
  - (i) capture affinity - (ii) model gene expression - (iii) pathway activity - (iv) fancy visualization (Mathworks Simulink?)
  - plot: time course, dynamic affinity
  - keep in mind the possible high amount of False Positives using promoter search/analysis

8-25-2009

official Team Meeting also with Mr. Kai Ludwig (LANGE + PFLANZ) as guest for Logo / Title Claim discussion

so far we have 1753 promoter sequences analyzed by PromoterSweep!

Meeting with Daniela (Nao): Cell Profiler for capturing biological images & data analysis based on MATLAB

working with the R package RMySQL (LINK) to set up the pipeline between R and MySQL
create a list of useful RMySQL commands

8-26-2009

Workflow for plotting histograms (SOURCE CODE/S?)
- make MySQL query using R
- make list of TFs, avoid duplicates using perl
- pick up each TF (perl/R) and plot histogram (R)
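
The three workflow steps above can be sketched in one place. The example below is in Python rather than the R/Perl combination named in the workflow, and the `hits` list is made-up data standing in for the result of a database query.

```python
from collections import Counter, defaultdict

BIN_WIDTH = 50  # bp; an arbitrary choice for this sketch

# Made-up (TF, position relative to TSS) pairs standing in for a query result.
hits = [("Sp1", -45), ("Sp1", -60), ("Sp1", -310),
        ("HIF1", -120), ("HIF1", -130), ("Sp1", -55)]

def histogram_per_tf(hits, bin_width=BIN_WIDTH):
    """Return {tf: Counter{bin_start: count}}; grouping by TF removes duplicates."""
    per_tf = defaultdict(Counter)
    for tf, pos in hits:
        # Floor division also bins negative (upstream) positions correctly.
        bin_start = (pos // bin_width) * bin_width
        per_tf[tf][bin_start] += 1
    return per_tf

hist = histogram_per_tf(hits)
for tf in sorted(hist):
    for bin_start in sorted(hist[tf]):
        bar = "#" * hist[tf][bin_start]
        print(f"{tf}\t{bin_start}..{bin_start + BIN_WIDTH - 1}\t{bar}")
```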

create MySQL command list including combinatorial queries

8-27-2009

check HEARTBEAT DB for duplicate entries
how should we plot the histogram?
- (a) histogram - how "wide" should be each bin? 100bp? 50bp? 20bp?
- (b) plot probability density
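
To make the two options concrete, the sketch below bins the same made-up positions with the three candidate widths and also evaluates a simple Gaussian kernel density estimate. It is written in Python for illustration; the bandwidth of 25 bp is an arbitrary assumption, and R's own `hist` and `density` functions would be the natural choice in the team's environment.

```python
import math

# Made-up TFBS positions relative to the TSS.
positions = [-45, -55, -60, -120, -130, -310]

def bin_counts(positions, width):
    """Histogram as {bin_start: count}, sorted by bin start."""
    counts = {}
    for p in positions:
        start = (p // width) * width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

def gaussian_kde(positions, x, bandwidth=25.0):
    """Average of Gaussian kernels centred on each position, evaluated at x."""
    norm = 1.0 / (len(positions) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in positions)

for width in (100, 50, 20):
    print(width, bin_counts(positions, width))

grid = range(-400, 1, 10)
peak = max(grid, key=lambda x: gaussian_kde(positions, x))
print("density maximum near", peak)
```

Narrow bins resolve individual sites but give noisy counts, while the density estimate trades that choice for a bandwidth parameter.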
study Transfac PWM (position weight matrices) for
- difference in consensus sequences (also ask Anna-Lena)
- different PWM types (vertebrates, plant, insect, fungi, bacteria, nematodes...)
- positive control: when histograms are generated and plotted, check distribution of Sp1 (LINK)

so far we have 3640 promoter sequences "sweeped"!

8-28-2009

8-31-2009

September

Week	Days
	Mon	Tue	Wed	Thu	Fri	Sat	Sun
36	-	1	2	3	4	5	6
37	7	8	9	10	11	12	13
38	14	15	16	17	18	19	20
39	21	22	23	24	25	26	27
40	28	29	30	-	-	-	-

9-1-2009

derive transcription factor data using R and MySQL
plot HEARTBEAT TF hit distribution as histograms & density functions for different PWM subsets (all, vertebrates only, single matrices and joined TFs)

9-2-2009

discussion on how to analyse the distributions we obtained statistically
- ideas: define maximum and variance -> Nao
look for motif sequences -> Tim

we have 4476 sequences analysed by Promotersweep so far!
- but we are expecting 4700 sequences - check missing ones!

9-3-2009

internal team meeting: Tim, Lars, Stephen, Nao
- select especially interesting TFs
  - criteria: (a) good hits in our distributions; (b) easy experimental handling
  - we go for HIF, SREBP and VDR to analyse and make synthetic promoter design
Transfac PWM: some matrices have annotation problems
which "spacer" sequences should we use in order to generate TFBS-free sequence parts?

rational design of synthetic promoters
- Tim: SREBP, Nao: VDR
- both go for a total number of 10 sequences
- strategies:
  - single TFs: search for density maxima
  - check combinatorial appearance and design promoter sequences with multiple binding TFs
- use spacer sequences generated by Lars and check for TFBS using Transfac
- sequence length: max. 1000bp

back-up idea: if synthesis does not work for a long (~1000bp) sequence then try to work out a protocol for a two-step promoter synthesis combining one empty (TFBS free) sequence with another which consists of many TF and activator binding sites.

9-4-2009

work with Transfac PWM: structure, description, and using consensus sequence
write script for generating consensus sequence based on Transfac PWM and replacing ambiguity code with A, C, G or T
```
Getconsensus.pl, MakeConsensus.pl
```
(CODE)?
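
As a placeholder for the missing scripts, the sketch below shows one way to derive a consensus from a count matrix and then replace ambiguity codes with concrete bases. It is written in Python, not Perl, and the frequency threshold, the random replacement and the example matrix are assumptions; the rules used in Getconsensus.pl and MakeConsensus.pl are not documented here.

```python
import random

# Standard IUPAC nucleotide codes and the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
BASES_TO_CODE = {frozenset(bases): code for code, bases in IUPAC.items()}

def consensus_from_counts(matrix, threshold=0.25):
    """matrix: one (A, C, G, T) count tuple per position.
    Every base with relative frequency >= threshold is kept (assumed rule)."""
    out = []
    for counts in matrix:
        total = sum(counts)
        kept = frozenset(b for b, c in zip("ACGT", counts) if c / total >= threshold)
        out.append(BASES_TO_CODE[kept])
    return "".join(out)

def resolve_ambiguity(consensus, rng):
    """Replace each IUPAC code with one randomly chosen compatible base."""
    return "".join(rng.choice(IUPAC[code]) for code in consensus)

# Made-up count matrix with four positions.
pwm = [(10, 0, 0, 0), (0, 5, 5, 0), (0, 0, 10, 0), (3, 3, 2, 2)]
cons = consensus_from_counts(pwm)
concrete = resolve_ambiguity(cons, random.Random(0))
print(cons, concrete)
```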

Wiki Meeting (Nao)
- Logo choice & modification
- choose header pics
- navigation layout
- develop a catchy, cool homepage

9-5-2009

Meeting with Tim, design synthetic promoter sequences
check spacer sequence (200bp) for TFBS: one TFBS found; remove it by cutting, which shortens the sequence to 190bp
Kid3 is a repressor!

9-6-2009

design more synthetic promoter sequences in a manual, iterative process consisting of (i) TFBS check and (ii) TFBS removal & filling up with random sequence

aim: create an automatic design tool for synthetic promoters that covers sequence design, Transfac search and filling the sequence up with spacer sequences.
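
The check-and-replace iteration could be automated roughly as sketched below. This is an assumption-laden illustration in Python: exact string matching against two illustrative motifs stands in for a real Transfac search, and hits are overwritten with random bases so that the sequence length stays constant.

```python
import random

def random_dna(n, rng):
    """Random DNA string of length n."""
    return "".join(rng.choice("ACGT") for _ in range(n))

def find_motif(seq, motifs):
    """Return (start, motif) of the leftmost hit, or None.
    Exact matching is a stand-in for a Transfac matrix search."""
    hits = [(seq.find(m), m) for m in motifs if m in seq]
    return min(hits) if hits else None

def make_tfbs_free(seq, motifs, rng, max_rounds=1000):
    """Repeat (i) check and (ii) overwrite the hit with random bases until clean."""
    for _ in range(max_rounds):
        hit = find_motif(seq, motifs)
        if hit is None:
            return seq
        start, motif = hit
        seq = seq[:start] + random_dna(len(motif), rng) + seq[start + len(motif):]
    raise RuntimeError("could not remove all motifs")

rng = random.Random(1)
motifs = ["GGGCGG", "TACGTG"]  # illustrative motif strings only
start_seq = "ACGT" * 10 + "GGGCGG" + random_dna(154, rng)
spacer = make_tfbs_free(start_seq, motifs, rng)
print(len(spacer), spacer[:50])
```

Overwriting instead of cutting keeps the spacer at its intended length, which matters when TFBS positions relative to the TSS are part of the design.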

9-7-2009

check designed sequences for restriction sites
```
CheckRestrictionsites.pl
```
CODE?
finish creating sequences
take the CMV core promoter into account when calculating the relative position of TFBS to the TSS
create sequences for negative control
- pure TFBS free sequence
- sequences with TFBS at minima of the density function
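
A restriction-site check of the kind mentioned in this entry might look like the sketch below. It is written in Python and is not the content of CheckRestrictionsites.pl; the enzyme list is an assumption (the notebook does not say which sites were checked), and the test sequence is made up. Non-palindromic sites are also searched on the reverse complementary strand.

```python
# Assumed enzyme set; recognition sequences are the standard ones.
SITES = {
    "EcoRI": "GAATTC", "XbaI": "TCTAGA", "SpeI": "ACTAGT",
    "PstI": "CTGCAG", "BsaI": "GGTCTC",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def find_sites(seq, sites=SITES):
    """Return (enzyme, strand, start) for every hit.
    '-' positions are given in reverse-complement coordinates."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    found = []
    for name, site in sites.items():
        for strand, s in (("+", seq), ("-", rc)):
            # A palindromic site on '-' would only repeat the '+' hit.
            if strand == "-" and site == reverse_complement(site):
                continue
            start = s.find(site)
            while start != -1:
                found.append((name, strand, start))
                start = s.find(site, start + 1)
    return found

test_seq = "AAGAATTCAAGAGACCTT"  # made-up sequence
found = find_sites(test_seq)
print(found)
```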

9-8-2009

check restriction sites for reverse complementary strand
add flanking sites with restriction sites and spacer nucleotides to our designed sequences
is there any possibility to automate Transfac queries?
work with combined / joined MySQL query structures
or solve this process by simply writing new temporary tables?

workflow summary (short) for manual designing of a synthetic promoter:
- (A) use random sequence
- (B) check TF-matrices
- (C) validate TFs (mouse? human? repressor?)
- (D) check Transfac and restriction sites

Phone conference with Kai Ludwig, Logo & Web Design (Nao)

official Team Meeting (LINK)

wiki closure on Oct 21st!

9-9-2009

modify synthetic promoter sequences to be ready for ordering
Sweep more promoter sequences using Promotersweep
start Modeling
revise and improve HEARTBEAT
discuss differences between PWMs

9-10-2009

still modifying synthetic sequences to be ready for shipping
we have altogether 25 designed promoter sequences! (ID: HB_0001 - HB_0025)

9-11-2009

Software Meeting (Stephen, Tim, Nao)
- compatibility with mediawiki: HTML, perl, php, R, java?
- GUI design
  - simple interface: single TF, auxiliary TFs, #TFBS, sequence length
  - "interactive": multiple TF, choosing auxiliary TFs, additional information (see Eukaryopedia), density function plot & histogram
  - "hyper-interactive" step-by-step design & creation

Modeling Meeting with Marti and Anna-Lena (Tim, Nao)
- aim: fancy visualization to show expectation & prediction providing pathway insights
- TODO/QUESTIONS
  - what is the stimulus? collect possible inputs!
  - measurable outcome: experiments & pathways
  - quality of synthetic sequence: error checking
    - we need to define the quality of our sequences
- LEVELS of modeling
  - (1) DNA (2) expression/transcriptional activity (3) output
  - each with corresponding measurement

general modeling scheme: input - "What we are affecting" - possible outcomes
how? We use fuzzy logic (LINK to short intro of fuzzy)

9-14-2009

collect input for inducing the system (e.g. p53: CPT, Pifithrin-alpha; NFkB: TNF-alpha etc.)
phone conference with Kai Ludwig

9-15-2009

create network picture for meeting tomorrow
Logo discussion
Read paper: Fuzzy Logic Modeling of Signaling Networks (Aldridge 2009) (see References)

9-16-2009

Modeling Meeting with Marti (Douaa, Tim, Nao)
- update on available drugs/sequences
- decide what to model: (A) error checking, and (B) differential expression?
- use natural promoters to build up model for prediction of activity of synthetic promoters
- Discussion of TF score
  - Transfac sequence alignment score
  - promotersweep binding site quality
  - relative position to TSS: How?
    - (A) peak width & amplitude, (B) distance to maximal peak & position, (C) number of PEAK, (D) "sliding window" and calculate area under curve, (E) #TFBS (also for comparison of different synthetic promoters)
  - biophysical affinity using TRAP (REFERENCE)
- first model: build up either on CMV or on JeT
- potential: integrate many stimuli -> find out crosstalks of pathways?
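
Option (D) from the score discussion, a sliding window over the TFBS distribution, can be sketched as follows. This is a Python illustration with made-up positions; window size, step and the region of 1000bp upstream of the TSS are assumptions. With one count per site, the number of sites inside a window equals the area under the histogram over that window.

```python
def window_scores(positions, window=100, step=10, start=-1000, end=0):
    """Return (left_edge, count) for every window [left, left + window)."""
    scores = []
    left = start
    while left + window <= end:
        n = sum(1 for p in positions if left <= p < left + window)
        scores.append((left, n))
        left += step
    return scores

# Made-up TFBS positions relative to the TSS.
positions = [-45, -55, -60, -120, -130, -310]
scores = window_scores(positions)
best_left, best_n = max(scores, key=lambda s: s[1])
print("densest window starts at", best_left, "with", best_n, "sites")
```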

TODO (meeting)
- collect data
- define WHAT we want to model
- summarize available sequences
- try to formulate IF ... THEN "sentences"
- check MATLAB & MATLAB Fuzzy Logic Toolbox availability
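
To show what IF ... THEN "sentences" can look like in executable form, here is a minimal fuzzy rule sketch in Python. It is not the MATLAB Fuzzy Logic Toolbox and not the team's model: the two inputs (`stimulus`, `affinity`, both scaled to 0..1), the linear membership functions and the two rules are hypothetical, and the output is a simple weighted average of rule consequents (zero-order Sugeno style).

```python
def high(x):
    """Membership in 'high': rises linearly from 0 at x=0 to 1 at x=1."""
    return max(0.0, min(1.0, x))

def low(x):
    """Membership in 'low': falls linearly from 1 at x=0 to 0 at x=1."""
    return max(0.0, min(1.0, 1.0 - x))

def expression(stimulus, affinity):
    """Fuzzy AND is min, fuzzy OR is max; output is the weighted rule average."""
    rules = [
        # IF stimulus is high AND affinity is high THEN expression is high (1.0)
        (min(high(stimulus), high(affinity)), 1.0),
        # IF stimulus is low OR affinity is low THEN expression is low (0.0)
        (max(low(stimulus), low(affinity)), 0.0),
    ]
    total = sum(weight for weight, _ in rules)
    return sum(weight * out for weight, out in rules) / total if total else 0.0

for s, a in [(1.0, 1.0), (0.0, 1.0), (0.5, 0.5), (0.8, 0.6)]:
    print(s, a, round(expression(s, a), 2))
```

Each rule stays readable as a sentence, which is the main attraction of the fuzzy approach when quantitative rate constants are not available.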

9-17-2009

internal Team Meeting

9-18-2009

9-21-2009

9-22-2009

Wiki Meeting (Dani, Cori, Nao)
- install image processing tool
- design wiki, brainstorming for possible navigation bars

9-23-2009

9-24-2009

9-25-2009

9-28-2009

9-29-2009

9-30-2009

October

Week	Days
	Mon	Tue	Wed	Thu	Fri	Sat	Sun
40	-	-	-	1	2	3	4
41	5	6	7	8	9	10	11
42	12	13	14	15	16	17	18
43	19	20	21	22	23	24	25
44	26	27	28	29	30	31	-

10-1-2009

10-2-2009

10-5-2009

10-6-2009

10-7-2009

10-8-2009

10-9-2009

10-12-2009

10-13-2009
