Team:Heidelberg/Notebook modeling

From 2009.igem.org

Revision as of 00:41, 20 October 2009 by Naoiwamoto (Talk | contribs)

Notebook HEARTBEAT

Welcome to the notebook of the HEARTBEAT (Heidelberg Artificial Transcription Factor Binding Site Engineering and Assembly Tool) project. This notebook covers the work on three subprojects: the HEARTBEAT database (DB), the HEARTBEAT graphical user interface (GUI) and HEARTBEAT fuzzy modeling (FN), plus some additional work on the logo and the wiki design. Have fun!



July

7-27-2009

  • Meeting with Oliver Pelz
    • Discuss general ideas of our Database Structure and Content
    • An introduction to PromoterSweep (LINK). PromoterSweep screens a given sequence for conserved regions, which gives us consensus sequences, and then screens these for TFBS by database search (TRANSFAC, JASPAR) (LINK)
    • Our new database should contain the following information: promoter sequence, TFs, TFBS, position of TFBS, number of binding TFBS, "host organism"
    • We choose MySQL as an appropriate database system for this task; it will also allow a graphical representation of the database on the web later.
    • GUI on wiki: which language? php? javascript?
    • Problems: access to PromoterSweep (Husar Bioinformatics Group, DKFZ), choice of Promoter Database (DoOP, UCSC, EnsEMBL) (LINK)
  • aim: create database until end of August


August


8-3-2009

  • First contact with MySQL
  • Start making an overview of other team's projects
  • Configuring our Virtual Server

8-4-2009

  • Official Team Meeting (LINK) @ BQ seminar room 43: preparing presentation & writing meeting report
  • Start installing development environment on our internal server
    • GNOME
    • Mediawiki

8-5-2009

  • Meeting with Tobias Bauer & Anna-Lena Kranz (Theoretical Bioinformatics, DKFZ) @ TP3, DKFZ
    • Integrating ideas of PromoterSweep, Transfac as well as DoOP/CisRED
    • select "interesting" TFs (e.g. HIF, NFkB, c-myc, p53) for Wetlab
    • select "interesting" pathways (e.g. cell cycle, inflammation, metabolism etc)
    • future experimental validation: ChIP-on-Chip
      • for this we need a TFBS-free sequence
    • idea: plot histogram of TFBS relative to TSS
      • problem: choice of sequence: upstream only? include downstream?
    • new programming languages: R and perl
    • next meeting: Friday after team meeting
  • Meeting with Karl-Heinz Glatting (HUSAR, DKFZ) @ TP3, DKFZ
    • An introduction to PromoterSweep
    • Structure and analysis principles of PromoterSweep
    • Output is stored in an XML file, so we have to parse the XML.
    • Oliver Pelz will help us with the programming
  • Protocol of the meeting can be downloaded from here.
  • Start working with MySQL
  • request UNIX/HUSAR/HPC access at DKFZ (Nao)
  • first contact with several databases: EnsEMBL, Compara, cisRED, DoOP, TiProD, contra (LINKS)

8-6-2009

  • Meeting with Oliver Pelz
    • defining workflow with PromoterSweep, Matrix Profile Search and introduction into different Motif Discovery Algorithms
  • installation of NX server for access to the internal server from Windows
  • configure development environment (printing from Linux, configure Mediawiki)
  • defining basic concept of database construction
    • we select annotated promoter sequences in DoOP
    • we make a selection of pathway of interest using KEGG
    • narrow down the number of target promoter sequences to fewer than 10000.

8-7-2009

  • Official Team Meeting on Scheduling
  • Meeting with Anna-Lena and Tobias
    • Introduction into R
    • Tobias will give us access to their computing cluster (Group Roland Eils)
    • Promoter Selection: DoOP, EnsEMBL, or UCSC?
  • HUSAR account arrived
  • installation of R, R editor and perl editor
  • further configuration of our internal server / mediawiki


8-10-2009

  • first contact with R and perl
  • playing around with R and perl
  • playing around with R library: Biobase
  • check working on DKFZ cluster

8-11-2009

  • defining programming languages: perl, R, MySQL
  • retrieving first Promotersweep output files
  • Meeting with Marti
    • ideas for modeling
      • we will have at least three colors which overlap in their spectra.
      • a very nice approach will be Fuzzy Logic Modeling.
      • idea 1: error checking of affinity: compare expectation to experimental results and figure out where the error is hiding
      • idea 2: create&visualize fancy and fuzzy data from in silico simulation
    • combine: promoter, output and graphic representation (FIGURE!)
    • next meeting with Marti: end of next week.
  • extract NCBI Entrez Gene IDs with R and perl
  • MAC addresses registered for the BioQuant network

8-12-2009

  • configure perl working environment
  • study structure of DoOP database
  • download DoOP and load DoOP database into MySQL

8-13-2009

  • trying out some DoOP queries
  • download fasta sequences from UCSC gene browser (LINK)
  • mapping of NCBI Entrez Gene IDs with RefSeq IDs
  • configure perl working environment on Windows XP
  • contact Endre Sebestyen concerning the perl module Bio-DoOP-DoOP (LINK)

8-14-2009

  • start PromoterSweep Analysis over Weekend


8-18-2009

Tim, Stephen, from here on you have to enter your own items!

  • study the output file of PromoterSweep: check its general structure and pick out useful information.
  • result is grouped in: General Info, Best Genomic Mapping, Promoter DB Search Result, Graphical Overview, Combined Binding Sites, TSS and Exon Info, Profile Matrices and Generated Output Files.
  • upon selection, sections of interest will be collected and made ready for entry into MySQL DB
  • discuss table structure of our database
  • How should our database be called? - Brainstorming -
    • SHOULD contain: iGEM, Transcription Factor, Binding Site, Promoter, synthetic biology, Heidelberg
    • MAY contain: position, heartbeat, prediction, assembly, eukaryotes
    • and still more keywords to come

8-19-2009

  • parse Promotersweep xml file into tab-separated text file (PERL CODE?)
    • the text file should contain: RefSeq ID, TF name, TFBS position, TF motif sequence, TFBS Quality, TSS, Entrez ID, EnsEMBL ID, further gene description.
    • this confronted us with several programming problems involving multiple arrays, hashes and their combinations (arrays of hashes, hashes of hashes, etc.)
  • therefore: studying the structure and basic concepts of hashes & keys
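
The parsing step described for 8-19 can be sketched roughly as follows. This is only an illustration, written in Python rather than the perl we actually used. The real PromoterSweep XML schema is not reproduced in this notebook, so every tag name, attribute name and value in SAMPLE_XML is a made-up placeholder. Only the list of output columns follows the entry above.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical PromoterSweep-like output. All tag names, attribute names and
# values are placeholders; the real schema is not shown in this notebook.
SAMPLE_XML = """<promotersweep>
  <gene refseq="NM_DEMO1" entrez="0001" ensembl="ENSG_DEMO1" tss="1000">
    <description>placeholder gene</description>
    <tfbs tf="TF_A" start="912" end="921" motif="ACGTACGTAC" quality="good"/>
    <tfbs tf="TF_B" start="640" end="649" motif="TTGACGTCAA" quality="weak"/>
  </gene>
</promotersweep>"""

# Column order follows the list in the 8-19 entry.
COLUMNS = ["refseq_id", "tf_name", "tfbs_start", "tfbs_end", "motif",
           "quality", "tss", "entrez_id", "ensembl_id", "description"]


def parse_hits(xml_text):
    """Return one dict per predicted binding site (a flat list of hashes)."""
    root = ET.fromstring(xml_text)
    rows = []
    for gene in root.iter("gene"):
        description = (gene.findtext("description") or "").strip()
        for site in gene.iter("tfbs"):
            rows.append({
                "refseq_id": gene.get("refseq"),
                "tf_name": site.get("tf"),
                "tfbs_start": int(site.get("start")),
                "tfbs_end": int(site.get("end")),
                "motif": site.get("motif"),
                "quality": site.get("quality"),
                "tss": int(gene.get("tss")),
                "entrez_id": gene.get("entrez"),
                "ensembl_id": gene.get("ensembl"),
                "description": description,
            })
    return rows


def to_tsv(rows):
    """Serialize the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Flattening gene-level and site-level fields into one row per binding site avoids most of the nested "hash of hashes" bookkeeping, at the price of repeating the gene information on every row.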

8-20-2009

  • pre-decision for our table-structure
    • Table: Main_Info
      • RefSeq ID, TF, TF motif start & end position, TFBS motif score, TFBS quality, TSS database info
    • Table: Gene_Info
      • Ensembl_ID, Gene Symbol, Gene Description.
    • we go for the RefSeq ID to be the key connecting these two tables.
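
A minimal sketch of this two-table layout, assuming column names of our own choosing and using Python's built-in sqlite3 module as a stand-in for MySQL. The inserted rows are placeholders, not real HEARTBEAT entries.

```python
import sqlite3


def build_demo_db():
    """Create both tables in memory and add a few placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Gene_Info (
            refseq_id        TEXT PRIMARY KEY,
            ensembl_id       TEXT,
            gene_symbol      TEXT,
            gene_description TEXT
        );
        CREATE TABLE Main_Info (
            refseq_id    TEXT NOT NULL REFERENCES Gene_Info(refseq_id),
            tf           TEXT NOT NULL,
            motif_start  INTEGER,
            motif_end    INTEGER,
            motif_score  REAL,
            tfbs_quality TEXT,
            tss_source   TEXT
        );
    """)
    conn.execute(
        "INSERT INTO Gene_Info VALUES (?, ?, ?, ?)",
        ("NM_DEMO1", "ENSG_DEMO1", "GENE_A", "placeholder gene"),
    )
    conn.executemany(
        "INSERT INTO Main_Info VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            ("NM_DEMO1", "TF_A", -120, -111, 0.92, "good", "DoOP"),
            ("NM_DEMO1", "TF_B", -640, -631, 0.71, "weak", "DoOP"),
        ],
    )
    return conn


def sites_for_gene(conn, gene_symbol):
    """Join the two tables on the RefSeq ID and list the sites of one gene."""
    query = """
        SELECT m.tf, m.motif_start, m.motif_end, m.tfbs_quality
        FROM Main_Info AS m
        JOIN Gene_Info AS g ON g.refseq_id = m.refseq_id
        WHERE g.gene_symbol = ?
        ORDER BY m.motif_start
    """
    return conn.execute(query, (gene_symbol,)).fetchall()
```

Keeping the gene annotation in its own table means each description is stored once per RefSeq ID instead of once per binding site.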

8-21-2009

  • update script for parsing the Promotersweep output files due to unexpected errors
  • we forgot to include "weak" as a category for the TFBS quality - added!
  • PromoterSweep result contains information about TSS derived from different promoter databases. On which should we rely, if they differ from each other?
    • We give the highest priority to the DoOP database, since its results for a given RefSeq ID agree well with those of other databases (e.g. DBTSS).
  • search for a tool to use MySQL in R programming environment
  • wiki: write a short article about the German Cancer Research Center (DKFZ)
  • Meeting with Anna-Lena: once we established our database... then
    • two strategies:
      • manually select interesting transcription factors and analyse them using database queries
      • plot histograms of TFBS occurrence within the target promoter sequence (from the TSS to 1000bp upstream) for each TF and analyse them systematically
    • we go for both!
    • idea for the future: we can analyze combinatorial appearance of distinct TF pairs
  • We have a name for our database - we call it -


- wait for it -


HEARTBEAT database (Heidelberg Artificial Transcription Factor Binding Site Engineering and Assembly Tool)



8-24-2009

  • Meeting with Marti: defining output modeling strategies
    • "exclusive promoters"
      • a model for predicting the behaviour of activation of one, two, three... promoters at the same time.
      • the potential of this model lies in the possibility to model single as well as many pathways in combination and even check for synergistic effects
      • modeling logic: quantitative ODE VS. quantitative & qualitative fuzzy logic
    • "error checking"
      • what to capture/measure: affinity of transcription factor binding to DNA
        • calculate score / reliability
        • phenotypic measurement
      • if we have time in the end: model/experiment optimization by wetlab-drylab rounds (FIGURE)
      • if we do not have much time: figure out where the catch is
    • modeling layers & final visualization
      • (i) capture affinity - (ii) model gene expression - (iii) pathway activity - (iv) fancy visualization (Mathworks Simulink?)
      • plot: time course, dynamic affinity
      • keep in mind the possible high amount of False Positives using promoter search/analysis

8-25-2009

  • official Team Meeting also with Mr. Kai Ludwig (LANGE + PFLANZ) as guest for Logo / Title Claim discussion
  • so far we have 1753 promoter sequences analyzed by PromoterSweep!
  • Meeting with Daniela (Nao): Cell Profiler for capturing biological images & data analysis based on MATLAB
  • working with R module RMySQL (LINK) for using the pipeline between R and MySQL
  • create a list of useful RMySQL commands

8-26-2009

  • Workflow for plotting histograms (SOURCE CODE/S?)
    • make MySQL query using R
    • make list of TFs, avoid duplicates using perl
    • pick up each TF (perl/R) and plot histogram (R)
  • create MySQL command list including combinatorial queries
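
The three workflow steps above can be sketched as follows. This is an illustration in Python, not the R/perl pipeline itself. The table name `hits`, its columns and the positions are hypothetical, sqlite3 stands in for MySQL, and positions are assumed to be given relative to the TSS (negative values are upstream).

```python
import sqlite3
from collections import Counter


def demo_connection():
    """In-memory stand-in for the HEARTBEAT DB (hypothetical table and values)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (tf TEXT, position INTEGER)")
    conn.executemany(
        "INSERT INTO hits VALUES (?, ?)",
        [("TF_A", -950), ("TF_A", -930), ("TF_A", -120),
         ("TF_B", -101), ("TF_B", -5)],
    )
    return conn


def positions_by_tf(conn):
    """Steps 1 and 2: query all hits and group them per TF.

    Using the TF name as a dictionary key removes duplicate names.
    """
    positions = {}
    for tf, position in conn.execute("SELECT tf, position FROM hits"):
        positions.setdefault(tf, []).append(position)
    return positions


def histogram(values, bin_width, lower=-1000, upper=0):
    """Step 3: count values per bin; each bin covers [start, start + bin_width)."""
    counts = Counter()
    for value in values:
        if lower <= value < upper:
            start = lower + ((value - lower) // bin_width) * bin_width
            counts[start] += 1
    return dict(sorted(counts.items()))
```

The returned dictionary maps the left edge of each non-empty bin to its count, which can be handed to any plotting routine.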

8-27-2009

  • check HEARTBEAT DB for duplicate entries
  • how should we plot the histogram?
    • (a) histogram - how "wide" should each bin be? 100bp? 50bp? 20bp?
    • (b) plot probability density
  • study Transfac PWM (position weight matrices) for
    • difference in consensus sequences (also ask Anna-Lena)
    • different PWM types (vertebrates, plant, insect, fungi, bacteria, nematodes...)
    • positive control: when histograms are generated and plotted, check distribution of Sp1 (LINK)
  • so far we have 3640 promoter sequences "sweeped"!
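
To compare options (a) and (b) from the plotting question above, here is a small Python sketch that bins the same positions with different bin widths and also evaluates a Gaussian kernel density estimate. The bandwidth, the position range and the example positions in the usage note are assumptions for illustration only.

```python
import math


def bin_counts(values, bin_width, lower=-1000, upper=0):
    """Option (a): counts per bin of the given width over [lower, upper)."""
    n_bins = math.ceil((upper - lower) / bin_width)
    counts = [0] * n_bins
    for value in values:
        if lower <= value < upper:
            counts[int((value - lower) // bin_width)] += 1
    return counts


def gaussian_kde(values, grid, bandwidth):
    """Option (b): Gaussian kernel density estimate evaluated on a grid."""
    if not values:
        raise ValueError("need at least one value")
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]
```

For example, `bin_counts([-600, -120, -110, -105], 20)` returns 50 bins, most of them empty, while `gaussian_kde(..., bandwidth=25)` gives a smooth curve whose maximum lies near the cluster around -110. Narrow bins resolve positions better but become noisy when a TF has few hits; the bandwidth of the density estimate plays the same role.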

8-28-2009


8-31-2009


September


9-1-2009

  • derive transcription factor data using R and MySQL
  • plot HEARTBEAT TF hit distribution as histograms & density functions for different PWM subsets (all, vertebrates only, single matrices and joined TFs)

9-2-2009

  • discussion on how to make statistical studies on our gained distributions
    • ideas: define maximum and variance -> Nao
  • look for motif sequences -> Tim
  • we have 4476 sequences analysed by Promotersweep so far!
    • but we are expecting 4700 sequences - check missing ones!

9-3-2009

  • internal team meeting: Tim, Lars, Stephen, Nao
    • select especially interesting TFs
      • criteria: (a) good hits in our distributions; (b) easy experimental handling
      • we go for HIF, SREBP and VDR to analyse and make synthetic promoter design
  • Transfac PWM: the annotation of some matrices is inconsistent
  • which "spacer" sequences should we use in order to generate TFBS-free sequence parts?
  • rational design of synthetic promoters
    • Tim: SREBP, Nao: VDR
    • both go for a total number of 10 sequences
    • strategies:
      • single TFs: search for density maxima
      • check combinatorial appearance and design promoter sequences with multiple binding TFs
    • use spacer sequences generated by Lars and check for TFBS using Transfac
    • sequence length: max. 1000bp
  • back-up idea: if synthesis does not work for a long (~1000bp) sequence then try to work out a protocol for a two-step promoter synthesis combining one empty (TFBS free) sequence with another which consists of many TF and activator binding sites.

9-4-2009

  • work with Transfac PWM: structure, description, and using consensus sequence
  • write script for generating consensus sequence based on Transfac PWM and replacing ambiguity code with A, C, G or T
    Getconsensus.pl, MakeConsensus.pl
    (CODE)?
  • Wiki Meeting (Nao)
    • Logo choice & modification
    • choose header pics
    • navigation layout
    • develop a catchy, cool homepage
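
The consensus scripts mentioned for 9-4 (Getconsensus.pl, MakeConsensus.pl) are not reproduced here. The following Python sketch only illustrates the idea under our own assumptions: the matrix is a list of per-position base counts, ties are broken in the order A, C, G, T, and an ambiguity code is resolved to the allowed base with the highest count. The IUPAC table itself is the standard nucleotide code.

```python
# Standard IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def concrete_consensus(matrix):
    """Most frequent base per matrix column (ties resolved as A < C < G < T)."""
    return "".join(max("ACGT", key=lambda base: column[base])
                   for column in matrix)


def resolve_ambiguity(consensus, matrix=None):
    """Replace every IUPAC ambiguity code by a single base A, C, G or T.

    With a matrix, the allowed base with the highest count at that position
    is chosen; without one, the first allowed base is taken.
    """
    resolved = []
    for index, code in enumerate(consensus.upper()):
        allowed = IUPAC[code]
        if matrix is None:
            resolved.append(allowed[0])
        else:
            resolved.append(max(allowed, key=lambda base: matrix[index][base]))
    return "".join(resolved)
```

With a hypothetical four-column matrix, `resolve_ambiguity("ARYN", matrix)` keeps the A and picks, for R, Y and N, whichever permitted base has the largest count in the corresponding column.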

9-5-2009

  • Meeting with Tim, design synthetic promoter sequences
  • check spacer sequence (200bp) for TFBS: one TFBS found; remove it by cutting it out, which shortens the sequence to 190bp
  • Kid3 is a repressor!

9-6-2009

  • design more synthetic promoter sequences by manual iteration process which consists of (i) TFBS check and (ii) TFBS removal & filling up random sequence
  • aim: create an automatic design tool for synthetic promoters that covers sequence design, the Transfac search and filling up the sequence with spacer sequences.
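
The manual iteration of (i) TFBS check and (ii) TFBS removal and refilling could be automated along these lines. This is a rough Python sketch with one big simplification: the real check is a Transfac matrix search, which is replaced here by exact string matching against a list of forbidden motifs. The motifs passed in are placeholders chosen by the caller.

```python
import random


def find_hits(sequence, motifs):
    """Return (start, end) for every exact occurrence of a forbidden motif."""
    hits = []
    for motif in motifs:
        start = sequence.find(motif)
        while start != -1:
            hits.append((start, start + len(motif)))
            start = sequence.find(motif, start + 1)
    return hits


def design_spacer(length, motifs, seed=0, max_rounds=1000):
    """Generate a random sequence and re-randomize every hit until none is left."""
    rng = random.Random(seed)
    bases = [rng.choice("ACGT") for _ in range(length)]
    for _ in range(max_rounds):
        hits = find_hits("".join(bases), motifs)
        if not hits:
            return "".join(bases)
        # (ii) removal: overwrite each hit region with new random bases
        for start, end in hits:
            for index in range(start, end):
                bases[index] = rng.choice("ACGT")
    raise RuntimeError("no motif-free sequence found within max_rounds")
```

Unlike the manual cut on 9-5, this variant keeps the sequence length constant because hit regions are refilled instead of deleted. It only scans the forward strand; a real tool would also need to check the reverse complement.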


9-7-2009

  • check designed sequences for restriction sites
    CheckRestrictionsites.pl
    CODE?
  • finish creating sequences
  • take the CMV core promoter into account when calculating the position of the TFBS relative to the TSS
  • create sequences for negative control
    • pure TFBS free sequence
    • sequences with TFBS at minima of the density function
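
A restriction-site check like the one done with CheckRestrictionsites.pl might look as follows. This is a Python sketch, not the original script. Which enzymes were actually checked is not recorded in this notebook; the list below is an assumption. The recognition sequences themselves are the standard ones for these enzymes. Non-palindromic sites are also searched as reverse complement, so that both strands are covered.

```python
# Assumed enzyme list; recognition sequences are the standard ones.
SITES = {
    "EcoRI": "GAATTC",
    "XbaI": "TCTAGA",
    "SpeI": "ACTAGT",
    "PstI": "CTGCAG",
    "BsaI": "GGTCTC",  # non-palindromic, so the minus strand matters
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Reverse complement of a DNA sequence given in A, C, G, T."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def find_restriction_sites(sequence, sites=SITES):
    """Return sorted (enzyme, strand, position) tuples.

    Positions are 0-based on the forward strand. Palindromic sites are
    reported once, with strand "+".
    """
    sequence = sequence.upper()
    found = []
    for enzyme, site in sites.items():
        patterns = {"+": site}
        flipped = reverse_complement(site)
        if flipped != site:
            patterns["-"] = flipped
        for strand, pattern in patterns.items():
            start = sequence.find(pattern)
            while start != -1:
                found.append((enzyme, strand, start))
                start = sequence.find(pattern, start + 1)
    return sorted(found)
```

An empty result means the designed sequence contains none of the listed sites on either strand.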

9-8-2009

  • check restriction sites for reverse complementary strand
  • add flanking sites with restriction sites and spacer nucleotides to our designed sequences
  • is there any way to automate Transfac queries?
  • work with combined / joined MySQL query structures
  • or solve this process by simply writing new temporary tables?
  • workflow summary (short) for manual designing of a synthetic promoter:
    • (A) use random sequence
    • (B) check TF-matrices
    • (C) validate TFs (mouse? human? repressor?)
    • (D) check Transfac and restriction sites
  • Phone conference with Kai Ludwig, Logo & Web Design (Nao)
  • official Team Meeting (LINK)
  • wiki closure on Oct 21st!

9-9-2009

  • modify synthetic promoter sequences to be ready for ordering
  • Sweep more promoter sequences using Promotersweep
  • start Modeling
  • revise and improve HEARTBEAT
  • discuss differences between PWMs

9-10-2009

  • still modifying synthetic sequences to be ready for shipping
  • we have altogether 25 designed promoter sequences! (ID: HB_0001 - HB_0025)

9-11-2009

  • Software Meeting (Stephen, Tim, Nao)
    • compatibility with mediawiki: HTML, perl, php, R, java?
    • GUI design
      • simple interface: single TF, auxiliary TFs, #TFBS, sequence length
      • "interactive": multiple TF, choosing auxiliary TFs, additional information (see Eukaryopedia), density function plot & histogram
      • "hyper-interactive" step-by-step design & creation
  • Modeling Meeting with Marti and Anna-Lena (Tim, Nao)
    • aim: fancy visualization to show expectation & prediction providing pathway insights
    • TODO/QUESTIONS
      • what is the stimulus? collect possible inputs!
      • measurable outcome: experiments & pathways
      • quality of synthetic sequence: error checking
        • we need to define the quality of our sequences
    • LEVELS of modeling
      • (1) DNA (2) expression/transcriptional activity (3) output
      • each with corresponding measurement
  • general modeling scheme: input - "What we are affecting" - possible outcomes
  • how? We use fuzzy logic (LINK to short intro of fuzzy)


9-14-2009

  • collect input for inducing the system (e.g. p53: CPT, Pifithrin-alpha; NFkB: TNF-alpha etc.)
  • phone conference with Kai Ludwig

9-15-2009

  • create network picture for meeting tomorrow
  • Logo discussion
  • Read paper: Fuzzy Logic Modeling of Signaling Networks (Aldridge 2009) (see References)

9-16-2009

  • Modeling Meeting with Marti (Douaa, Tim, Nao)
    • update on available drugs/sequences
    • decide what to model: (A) error checking, and (B) differential expression?
    • use natural promoters to build up model for prediction of activity of synthetic promoters
    • Discussion of TF score
      • Transfac sequence alignment score
      • promotersweep binding site quality
      • relative position to TSS: How?
        • (A) peak width & amplitude, (B) distance to maximal peak & position, (C) number of peaks, (D) "sliding window" and calculate area under curve, (E) #TFBS (also for comparison of different synthetic promoters)
      • biophysical affinity using TRAP (REFERENCE)
    • first model: build up either on CMV or on JeT
    • potential: integrate many stimuli -> find out crosstalks of pathways?
  • TODO (meeting)
    • collect data
    • define WHAT we want to model
    • summarize available sequences
    • try to formulate IF ... THEN "sentences"
    • check MATLAB & MATLAB Fuzzy Logic Toolbox availability
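
To make the IF ... THEN item more concrete, here is a tiny fuzzy rule base in Python. It is purely illustrative and not our model: the two inputs (number of TFBS, distance of the sites to the density peak), the membership thresholds and the rule outputs are all invented. It uses zero-order Sugeno-style inference, with min as the AND operator and a weighted average for defuzzification.

```python
def ramp_up(x, low, high):
    """Membership rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def ramp_down(x, low, high):
    """Membership falling linearly from 1 at `low` to 0 at `high`."""
    return 1.0 - ramp_up(x, low, high)


def predicted_activity(n_sites, distance_bp):
    """Relative promoter activity in [0, 1] from four IF ... THEN rules."""
    many = ramp_up(n_sites, 1, 6)        # hypothetical thresholds
    few = ramp_down(n_sites, 1, 6)
    near = ramp_down(distance_bp, 50, 300)
    far = ramp_up(distance_bp, 50, 300)

    # (rule strength, rule output); AND is modelled as min
    rules = [
        (min(many, near), 1.0),  # IF many sites AND near peak THEN high
        (min(many, far), 0.5),   # IF many sites AND far from peak THEN medium
        (min(few, near), 0.5),   # IF few sites AND near peak THEN medium
        (min(few, far), 0.0),    # IF few sites AND far from peak THEN low
    ]
    total = sum(strength for strength, _ in rules)
    return sum(strength * output for strength, output in rules) / total
```

Because the memberships are complementary, at least one rule always fires, so the weighted average is well defined. Rule outputs and thresholds would have to be fitted to measurements of natural promoters before such a model could say anything about synthetic ones.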

9-17-2009

  • internal Team Meeting

9-18-2009


9-21-2009

9-22-2009

9-23-2009

9-24-2009

9-25-2009


9-28-2009

9-29-2009

9-30-2009


October


10-1-2009

10-2-2009


10-5-2009

10-6-2009

10-7-2009

10-8-2009

10-9-2009


10-12-2009

10-13-2009
