From 2009.igem.org

Revision as of 23:02, 19 October 2009

Notebook HEARTBEAT

Welcome to the notebook of the HEARTBEAT (Heidelberg Artificial Transcription Factor Binding Site Engineering and Assembly Tool) project. This notebook covers the work on three sublanes: HEARTBEAT database (DB), HEARTBEAT graphical user interface (GUI) and HEARTBEAT fuzzy modeling (FN), plus some additional work on the logo and the wiki design. Have fun!

July

7-27-2009

Meeting with Oliver Pelz
- Discuss general ideas of our Database Structure and Content
- An introduction to PromoterSweep (LINK). PromoterSweep screens a given sequence for conserved regions, giving us consensus sequences, and then screens those for TFBS by database search (TRANSFAC, JASPAR) (LINK)
- Our new database should contain the following information: promoter sequence, TFs, TFBS, position of TFBS, number of binding TFBS, "host organism"
- We decide to use MySQL as an appropriate database system for this task; it will also allow us a graphical representation of the database on the web later.
- GUI on wiki: which language? php? javascript?
- Problems: access to PromoterSweep (Husar Bioinformatics Group, DKFZ), choice of Promoter Database (DoOP, UCSC, EnsEMBL) (LINK)

aim: create database by end of August

August

Week	Days
	Mon	Tue	Wed	Thu	Fri	Sat	Sun
31	-	-	-	-	-	1	2
32	3	4	5	6	7	8	9
33	10	11	12	13	14	15	16
34	17	18	19	20	21	22	23
35	24	25	26	27	28	29	30
36	31	-	-	-	-	-	-

8-3-2009

First contact with MySQL
Start making an overview of other team's projects
Configuring our Virtual Server

8-4-2009

Official Team Meeting (LINK) @ BQ seminar room 43: preparing presentation & writing meeting report
Start installing development environment on our internal server
- GNOME
- Mediawiki

8-5-2009

Meeting with Tobias Bauer & Anna-Lena Kranz (Theoretical Bioinformatics, DKFZ) @ TP3, DKFZ
- Integrating ideas of PromoterSweep, Transfac as well as DoOP/CisRED
- select "interesting" TFs (e.g. HIF, NFkB, c-myc, p53) for Wetlab
- select "interesting" pathways (e.g. cell cycle, inflammation, metabolism etc)
- future experimental validation: ChIP-on-Chip
  - for this we need a TFBS-free sequence
- idea: plot histogram of TFBS relative to TSS
  - problem: choice of sequence: upstream only? include downstream?
- new programming languages: R and perl
- next meeting: Friday after team meeting

Meeting with Karl-Heinz Glatting (HUSAR, DKFZ) @ TP3, DKFZ
- An introduction to PromoterSweep
- Structure and analysis principles of PromoterSweep
- Output is stored in an XML file, so we will have to parse the XML.
- Oliver Pelz will help us with the programming

Protocol of the meeting can be downloaded from here.

Start working with MySQL
request UNIX/HUSAR/HPC access at DKFZ (Nao)
first contact with several databases: EnsEMBL, Compara, cisRED, DoOP, TiProD, contra (LINKS)

8-6-2009

Meeting with Oliver Pelz
- defining workflow with PromoterSweep and Matrix Profile Search; introduction to different Motif Discovery Algorithms

installation of NX server for access to the internal server from Windows
configure development environment (printing from Linux, configure Mediawiki)
defining basic concept of database construction
- we select annotated promoter sequences in DoOP
- we make a selection of pathway of interest using KEGG
- narrow down the number of target promoter sequences to fewer than 10,000.

8-7-2009

Official Team Meeting on Scheduling
Meeting with Anna-Lena and Tobias
- Introduction to R
- Tobias will give us access to their computing cluster (Group Roland Eils)
- Promoter Selection: DoOP, EnsEMBL, or UCSC?

HUSAR account arrived
installation of R, R editor and perl editor
further configuration of our internal server / mediawiki

8-10-2009

first contact with R and perl
playing around with R and perl
playing around with R library: Biobase
check working on DKFZ cluster

8-11-2009

defining programming languages: perl, R, MySQL
retrieving first Promotersweep output files

Meeting with Marti
- ideas for modeling
  - we will have at least three colors which overlap in their spectra.
  - a very nice approach will be Fuzzy Logic Modeling.
  - idea 1: error checking of affinity: compare expectation to experimental results and figure out where the error is hiding
  - idea 2: create&visualize fancy and fuzzy data from in silico simulation
- combine: promoter, output and graphic representation (GRAPHIC!)
- next meeting with Marti: end of next week.

extract NCBI Entrez Gene IDs with R and perl
MAC addresses registered for BioQuant network

8-12-2009

configure perl working environment
study structure of DoOP database
download DoOP and load DoOP database into MySQL

8-13-2009

trying out some DoOP queries
download fasta sequences from UCSC gene browser (LINK)
mapping of NCBI Entrez Gene IDs with RefSeq IDs
configure perl working environment on Windows XP
contact Endre Sebestyen concerning the perl module Bio-DoOP-DoOP (LINK)

8-14-2009

start PromoterSweep Analysis over Weekend

8-18-2009

Tim, Stephen, from here on you have to enter your own work yourselves!

study output file of PromoterSweep. check out general structure and pick out useful information.
results are grouped into: General Info, Best Genomic Mapping, Promoter DB Search Result, Graphical Overview, Combined Binding Sites, TSS and Exon Info, Profile Matrices and Generated Output Files.
upon selection, sections of interest will be collected and made ready for entry into MySQL DB
discuss table structure of our database

What should our database be called? - Brainstorming -
- SHOULD contain: iGEM, Transcription Factor, Binding Site, Promoter, synthetic biology, Heidelberg
- MAY contain: position, heartbeat, prediction, assembly, eukaryotes
- and still more keywords to come

8-19-2009

parse Promotersweep xml file into tab-separated text file (PERL CODE?)
- the text file should contain: RefSeq ID, TF name, TFBS position, TF motif sequence, TFBS Quality, TSS, Entrez ID, EnsEMBL ID, further gene description.
- this raised several programming problems around working with multiple arrays, hashes and their combinations (arrays of hashes, hashes of hashes, etc.), so we are
studying the structure and basic concepts of hashes & keys
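
The parsing step can be sketched roughly as follows. This is only an illustration in Python, not the perl script we used. The XML layout (element names `promoter` and `site`, their attributes, the quality label "good") and the RefSeq ID are made up, since the real PromoterSweep output format is not reproduced in this notebook. The nested dictionary plays the role of a perl hash of hashes.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a PromoterSweep result; the real schema differs.
SAMPLE_XML = """
<result>
  <promoter refseq="NM_000001" tss="1000">
    <site tf="Sp1" start="940" end="949" quality="good"/>
    <site tf="HIF1" start="710" end="718" quality="weak"/>
  </promoter>
</result>
"""

COLUMNS = ["refseq", "tf", "start", "end", "quality", "tss"]


def parse_sites(xml_text):
    """Return a hash of hashes: refseq -> tf -> list of site records."""
    root = ET.fromstring(xml_text)
    sites = {}
    for promoter in root.iter("promoter"):
        refseq = promoter.get("refseq")
        for site in promoter.iter("site"):
            record = {
                "refseq": refseq,
                "tf": site.get("tf"),
                "start": int(site.get("start")),
                "end": int(site.get("end")),
                "quality": site.get("quality"),
                "tss": int(promoter.get("tss")),
            }
            sites.setdefault(refseq, {}).setdefault(record["tf"], []).append(record)
    return sites


def to_tsv(sites):
    """Flatten the nested structure into tab-separated text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for by_tf in sites.values():
        for records in by_tf.values():
            writer.writerows(records)
    return out.getvalue()


tsv_text = to_tsv(parse_sites(SAMPLE_XML))
```

`setdefault` creates the inner dictionary or list on first use, which avoids most of the bookkeeping that nested hashes otherwise need.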

8-20-2009

pre-decision for our table-structure
- Table: Main_Info
  - RefSeq ID, TF, TF motif start & end position, TFBS motif score, TFBS quality, TSS database info
- Table: Gene_Info
  - Ensembl_ID, Gene Symbol, Gene Description.
- we go for the RefSeq ID to be the key connecting these two tables.
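
A minimal sketch of this two-table layout, assuming nothing beyond the fields listed above. Python's built-in SQLite stands in for our MySQL server so the example runs on its own. The column names, the example gene and all values are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server; column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Gene_Info (
    refseq_id   TEXT PRIMARY KEY,
    ensembl_id  TEXT,
    gene_symbol TEXT,
    description TEXT
);
CREATE TABLE Main_Info (
    refseq_id    TEXT REFERENCES Gene_Info(refseq_id),
    tf           TEXT,
    motif_start  INTEGER,
    motif_end    INTEGER,
    motif_score  REAL,
    tfbs_quality TEXT,
    tss_source   TEXT
);
""")

# Made-up rows, for illustration only.
conn.execute("INSERT INTO Gene_Info VALUES (?, ?, ?, ?)",
             ("NM_000001", "ENSG00000000001", "GENE1", "made-up example gene"))
conn.executemany("INSERT INTO Main_Info VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("NM_000001", "Sp1", -60, -51, 0.95, "good", "DoOP"),
    ("NM_000001", "p53", -300, -281, 0.71, "weak", "DoOP"),
])

# Join the two tables on the shared RefSeq ID key.
rows = conn.execute("""
    SELECT g.gene_symbol, m.tf, m.motif_start
    FROM Main_Info AS m
    JOIN Gene_Info AS g ON g.refseq_id = m.refseq_id
    ORDER BY m.motif_start
""").fetchall()
```

Keeping the gene description in its own table means it is stored once per RefSeq ID rather than once per binding site.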

8-21-2009

update script for parsing the Promotersweep output files due to unexpected errors
we forgot to include "weak" as a category for the TFBS quality - added!
PromoterSweep result contains information about TSS derived from different promoter databases. On which should we rely, if they differ from each other?
- We give the DoOP database the highest priority, since it shows good agreement with the RefSeq ID results when compared to other databases (e.g. DBTSS).
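
The priority rule can be written down as a small lookup. A sketch in Python: only "DoOP first" comes from the decision above; the order of the remaining databases and the positions used are assumptions for illustration.

```python
# Priority order: DoOP first, as decided above; the later entries are assumed.
TSS_PRIORITY = ["DoOP", "DBTSS", "EnsEMBL"]


def pick_tss(tss_by_db):
    """Return (database, position) from the highest-priority database present."""
    for db in TSS_PRIORITY:
        if db in tss_by_db:
            return db, tss_by_db[db]
    raise ValueError("no TSS annotation from a known database")


# Made-up positions, for illustration only.
choice = pick_tss({"DBTSS": 1012, "DoOP": 1000})
fallback = pick_tss({"DBTSS": 1012})
```

When DoOP has no annotation for a promoter, the function falls back to the next database in the list instead of failing.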

order Matlab (http://www.mathworks.com/) iGEM licence

search for a tool to use MySQL in R programming environment
wiki: write a short article about the German Cancer Research Center (DKFZ)

Meeting with Anna-Lena: what to do once we have established our database
- two strategies:
  - manually select interesting transcription factors and analyse them using database queries
  - plot histograms of TFBS occurrence within the target promoter sequence (TSS - 1000bp upstream) for each TF and make a systematic analysis
- we go for both!
- idea for the future: we can analyze combinatorial appearance of distinct TF pairs
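
The pair idea could start from a simple co-occurrence count. A sketch in Python, assuming we already have the list of TF hits per promoter; the promoter names and hits are made up.

```python
from collections import Counter
from itertools import combinations


def count_tf_pairs(tfs_by_promoter):
    """Count in how many promoters each unordered TF pair occurs together."""
    pair_counts = Counter()
    for tfs in tfs_by_promoter.values():
        # set() drops repeated hits of one TF; sorted() fixes the pair order.
        for pair in combinations(sorted(set(tfs)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Made-up promoter IDs and hits, for illustration only.
hits = {
    "promoter_A": ["Sp1", "p53", "Sp1"],
    "promoter_B": ["Sp1", "p53", "NFkB"],
    "promoter_C": ["NFkB"],
}
pair_counts = count_tf_pairs(hits)
```

This counts each pair once per promoter, however many binding sites each TF has there. Whether a pair occurs more often than expected by chance would need a separate statistical test.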

We have a name for our database - we call it -

- wait for it -

HEARTBEAT database (Heidelberg Artificial Transcription Factor Binding Site Engineering and Assembly Tool)

8-24-2009

Meeting with Marti: defining output modeling strategies
- "exclusive promoters"
  - a model for predicting the behaviour of activation of one, two, three... promoters at the same time.
  - the potential of this model lies in the possibility to model single as well as many pathways in combination and even check for synergistic effects
  - modeling logic: quantitative ODE VS. quantitative & qualitative fuzzy logic
- "error checking"
  - what to capture/measure: affinity of transcription factor binding to DNA
    - calculate score / reliability
    - phenotypic measurement
  - if we have time in the end: model/experiment optimization by wetlab-drylab-rounds (GRAPHIC)
  - if we do not have much time: figure out where the catch is
- modeling layers & final visualization
  - (i) capture affinity - (ii) model gene expression - (iii) pathway activity - (iv) fancy visualization (Mathworks Simulink?)
  - plot: time course, dynamic affinity
  - keep in mind the possibly high number of false positives from promoter search/analysis
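
To make the fuzzy logic option more concrete, here is a very small sketch in Python. It is not our model: the membership thresholds, the single rule and the input values are all made up, and the minimum is just one common choice for a fuzzy AND.

```python
def ramp(x, low, high):
    """Membership in 'high': 0 below low, 1 above high, linear in between."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def fuzzy_and(a, b):
    """Minimum t-norm, a common choice for fuzzy AND."""
    return min(a, b)


def promoter_active(affinity_a, affinity_b):
    """Degree to which 'TF A is high AND TF B is high' holds (hypothetical rule)."""
    high_a = ramp(affinity_a, 0.2, 1.0)
    high_b = ramp(affinity_b, 0.2, 1.0)
    return fuzzy_and(high_a, high_b)


# Made-up affinities on an arbitrary scale, for illustration only.
activation = promoter_active(0.6, 1.0)
silent = promoter_active(0.1, 1.0)
```

Unlike a Boolean rule, the result is a degree between 0 and 1, so a partly bound transcription factor gives a partly active promoter.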

8-25-2009

official Team Meeting also with Mr. Kai Ludwig (LANGE + PFLANZ) as guest for Logo / Title Claim discussion

so far we have 1753 promoter sequences analyzed by PromoterSweep!

Meeting with Daniela (Nao): Cell Profiler for capturing biological images & data analysis based on MATLAB

working with R module RMySQL (LINK) to connect R and MySQL
create a list of useful RMySQL commands

8-26-2009

Workflow for plotting histograms (SOURCE CODE/S?)
- make MySQL query using R
- make list of TFs, avoid duplicates using perl
- pick up each TF (perl/R) and plot histogram (R)
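
The three steps above can be sketched in one short script. This is an illustration in Python, not our R/perl code: SQLite stands in for MySQL, the table has a reduced made-up layout, the rows are invented, and the bin width of 100bp is an assumption. Plotting is left out.

```python
import sqlite3
from collections import Counter, defaultdict

# In-memory SQLite with a reduced, illustrative table stands in for the real DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Main_Info (refseq_id TEXT, tf TEXT, position INTEGER)")
conn.executemany("INSERT INTO Main_Info VALUES (?, ?, ?)", [
    ("NM_000001", "Sp1", -55), ("NM_000002", "Sp1", -48),
    ("NM_000002", "Sp1", -430), ("NM_000001", "p53", -310),
])

# Step 1: query the database.
rows = conn.execute("SELECT tf, position FROM Main_Info").fetchall()

# Step 2: list of TFs without duplicates.
tf_names = sorted({tf for tf, _ in rows})

# Step 3: collect positions per TF and bin them (bin width is an assumption).
BIN_WIDTH = 100
positions = defaultdict(list)
for tf, pos in rows:
    positions[tf].append(pos)

# Each bin is labelled by its lower edge, e.g. -100 covers [-100, 0).
histograms = {
    tf: Counter((pos // BIN_WIDTH) * BIN_WIDTH for pos in positions[tf])
    for tf in tf_names
}
```

Floor division rounds towards minus infinity, so negative positions upstream of the TSS fall into the correct bin.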

create MySQL command list including combinatorial queries

8-27-2009

check HEARTBEAT DB for duplicate entries
how should we plot the histogram?
- (a) histogram - how "wide" should each bin be? 100bp? 50bp? 20bp?
- (b) plot probability density
study Transfac PWM (position weight matrices) for
- difference in consensus sequences (also ask Anna-Lena)
- different PWM types (vertebrates, plant, insect, fungi, bacteria, nematodes...)
- positive control: when histograms are generated and plotted, check distribution of Sp1 (LINK)
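
To compare options (a) and (b), a small sketch in Python that bins the same positions at the three candidate widths and rescales the counts to a density. The positions are made up and the window of 1000bp upstream of the TSS is assumed.

```python
def histogram(positions, bin_width, start=-1000, stop=0):
    """Count positions in equal-width bins covering [start, stop)."""
    n_bins = (stop - start) // bin_width
    counts = [0] * n_bins
    for pos in positions:
        if start <= pos < stop:
            counts[(pos - start) // bin_width] += 1
    return counts


def density(counts, bin_width):
    """Scale counts so that the bar areas sum to 1 (option b)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / (total * bin_width) for c in counts]


# Made-up positions relative to the TSS, for illustration only.
positions = [-980, -510, -120, -95, -60, -55, -40, -12]

for width in (100, 50, 20):
    counts = histogram(positions, width)
    # Print bin width, number of bins and the height of the tallest bin.
    print(width, len(counts), max(counts))
```

Narrower bins resolve the position of a peak better but leave fewer hits per bin, so the histogram gets noisier. The density scaling makes plots with different bin widths comparable.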

so far we have 3640 promoter sequences "sweeped"!

8-28-2009

8-31-2009

September

Week	Days
	Mon	Tue	Wed	Thu	Fri	Sat	Sun
36	-	1	2	3	4	5	6
37	7	8	9	10	11	12	13
38	14	15	16	17	18	19	20
39	21	22	23	24	25	26	27
40	28	29	30	-	-	-	-

9-1-2009

derive transcription factor data using R and MySQL
plot HEARTBEAT TF hit distribution as histograms & density functions for different PWM subsets (all, vertebrates only, single matrices and joined TFs)

9-2-2009

discussion on how to analyse the distributions we obtained statistically
- ideas: define maximum and variance -> Nao
look for motif sequences -> Tim
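
A first version of the "maximum and variance" idea, sketched in Python. It reads "maximum" as the fullest histogram bin, which is one possible interpretation. The positions and the bin width of 50bp are made up.

```python
from collections import Counter
from statistics import mean, pvariance


def summarize(positions, bin_width=50):
    """Return the lower edge of the fullest bin plus mean and variance."""
    bins = Counter((pos // bin_width) * bin_width for pos in positions)
    peak_bin, peak_count = max(bins.items(), key=lambda item: item[1])
    return {
        "peak_bin": peak_bin,
        "peak_count": peak_count,
        "mean": mean(positions),
        "variance": pvariance(positions),
    }


# Made-up positions relative to the TSS, for illustration only.
summary = summarize([-60, -55, -52, -300, -700])
```

A TF with a sharp positional preference should show a high peak count and a small variance, while a flat distribution gives the opposite.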

we have 4476 sequences analysed by Promotersweep so far!
- but we are expecting 4700 sequences - check missing ones!
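
With 4700 expected and 4476 analysed, 224 sequences are unaccounted for. Finding them is a set difference between the submitted IDs and the IDs that have a result. A sketch in Python with made-up IDs; where the two lists come from (input list, result file names) is an assumption.

```python
def missing_ids(expected, analysed):
    """Return the expected IDs that have no analysed result, in sorted order."""
    return sorted(set(expected) - set(analysed))


# Made-up IDs; in practice these would come from the input list and result files.
expected = ["NM_000001", "NM_000002", "NM_000003", "NM_000004"]
analysed = ["NM_000001", "NM_000003"]
todo = missing_ids(expected, analysed)
```

The resulting list can be resubmitted to PromoterSweep directly.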

9-3-2009

9-4-2009

9-7-2009

9-8-2009

9-9-2009

October

Week	Days
	Mon	Tue	Wed	Thu	Fri	Sat	Sun
40	-	-	-	1	2	3	4
41	5	6	7	8	9	10	11
42	12	13	14	15	16	17	18
43	19	20	21	22	23	24	25
44	26	27	28	29	30	31	-

10-1-2009

10-2-2009

10-5-2009

10-6-2009

10-7-2009

10-8-2009

10-9-2009

10-12-2009

10-13-2009

Team:Heidelberg/Notebook modeling
