This is a quick and easy guide to doing molecular biology by computer. There is also the regular lecture ("Molecular Biology", for all Master students of any semester) on Tue, Wed and Fri at 9:00 in A102.
To do molecular biology on the computer you first need data, so sample sequences are provided here: for each step of the analysis (each lecture in real life) you get some test data to process. Next you need the software. In the age of the internet, a link to a program that does the calculation is enough, and such links are given for each lecture. Now bring both together and get your own result!
Tutorial 1: BLAST
Take the DNA sequences from the Word files and run a Nucleotide BLAST search (for the first sequence select the Human genomic + transcript database): https://blast.ncbi.nlm.nih.gov/Blast.cgi
Which genes do the sequences match?
Click here to download the sequences:
To learn more about BLAST and how to use it well, have a look at the NCBI tutorials on YouTube:
Focus especially on the output parameters. What is the difference between the E-value and the p-value? How do you interpret a high E-value?
- a gene related to cancer
- a gene coding for a hypothetical protein
- a viral nucleotide sequence
For more detailed information open the Results file and follow the included links.
Also familiarize yourself with the new results format:
For further genomic online tools have a look at the UCSC Genome Browser https://genome.ucsc.edu/
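As a starting point for the E-value question in this tutorial, here is a minimal sketch of the relation usually quoted for BLAST statistics, p = 1 - exp(-E). It assumes that chance hits follow a Poisson distribution. The function name is chosen here for illustration and is not part of any BLAST tool.

```python
import math


def evalue_to_pvalue(e_value: float) -> float:
    """Probability of seeing at least one chance hit with this score or better.

    Assumes the number of chance hits is Poisson distributed with mean E,
    so that p = 1 - exp(-E).
    """
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (1e-50, 0.01, 1.0, 10.0):
    print(f"E = {e:g}  ->  p = {evalue_to_pvalue(e):.4g}")
```

For small values E and p are practically identical. For large E the p-value approaches 1 while E keeps growing, because E counts expected chance hits and is not a probability.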
Tutorial 2: RNA
1. lin-4 miRNA
The first microRNA to be discovered is the lin-4 miRNA. Search for this miRNA in the Rfam database. Does the lin-4 miRNA occur in humans or mice?
Take the RNA sequence from the human lin-4 miRNA and perform an analysis with the RNA Analyzer. Which structure does the miRNA show? What does a negative energy mean?
Take the sequence of the Hammerhead_1 from Clonorchis sinensis and run an analysis with RNAfold. Have a look at the dot plot containing the base pair probabilities and interpret it. What is a dot plot in the first place?
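RNAfold reports the predicted structure in dot-bracket notation, where matching parentheses mark paired bases and dots mark unpaired ones. The sketch below turns such a string into a list of base pairs, which may help when comparing the structure with the dot plot. The structure string is a made-up toy hairpin, not the Hammerhead_1 result.

```python
def parse_dot_bracket(structure: str) -> list[tuple[int, int]]:
    """Return the base pairs (1-based positions) encoded in a dot-bracket string."""
    open_positions = []  # stack of positions of '(' not yet closed
    pairs = []
    for position, symbol in enumerate(structure, start=1):
        if symbol == "(":
            open_positions.append(position)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {position}")
            pairs.append((open_positions.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {position}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


# Toy hairpin: a 4 bp stem closing a 3 nt loop (illustrative only).
toy_structure = "((((...))))"
for i, j in parse_dot_bracket(toy_structure):
    print(f"base {i} pairs with base {j}")
```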
3. Take the following unknown RNA sequence and analyze it regarding the appearance of riboswitches using the Riboswitch Finder
What is a riboswitch?
At which position does the riboswitch start?
Please read the appropriate paper for a more detailed description of the tool.
- Riboswitch Finder
- RNA Analyzer
Tutorial 3: Proteins
1. Take the unknown amino acid sequence and run a Protein BLAST search (Algorithm: Quick BLASTP) to identify the organism from which the protein sequence originates.
Secondly, run a PSI-BLAST with 3 iterations to identify the supposed protein function. Only select the top 100 sequences. (Hint: the "Run PSI-Blast iteration X" is placed under descriptions). What does the yellow highlighting mean?
2. Search for the proteins BRCA1_Human, BRCA2_Human and BRCA2_Mouse in the UniProt database. Use the Align tool to align BRCA2_Human with BRCA1_Human and with BRCA2_Mouse. Is human BRCA2 more closely related to human BRCA1 or to mouse BRCA2? What other information about the protein sequences do you get?
For this protein alignment have a look at the really short video tutorials!
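One simple way to put a number on "more related" is the percent identity of a pairwise alignment. The sketch below counts identical residues over the aligned, gap-free columns. This denominator is an assumption made here; alignment tools may use a different convention. The two sequences are short made-up examples, not BRCA fragments.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identical residues over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        if residue_a == gap or residue_b == gap:
            continue  # skip gapped columns
        compared += 1
        if residue_a == residue_b:
            identical += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * identical / compared


# Toy alignment (illustrative only): one gap column and one mismatch.
print(percent_identity("MKT-AYIAK", "MKTQAYLAK"))
```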
3. Take the protein sequence from BRCA1_Human and run a sequence analysis with SMART. Which domains do you get for BRCA1 from human? How many phosphorylations does the protein show?
4. Take this amino acid sequence from the lecture and run a BLAST search. You will get the histone acetyltransferase KAT2A from Homo sapiens as the result.
Verify your result by scanning the sequence using Prosite. Which motifs are uncovered and which functions do they have? Verify it a second time using AnDom. Note that you need the protein identifier for human KAT2A (Q92830 from UniProt).
Go to the Protein Data Bank and search for KAT2A. You will find 6 structures. Download the structure file of the 5TRL crystal structure in PDB format (or click here) and visualize the 3D structure using the RasMol software. Generate a Ramachandran plot for 5TRL using the Uppsala Ramachandran Server.
To interpret the Ramachandran plot have a look at this video.
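A Ramachandran plot shows the backbone dihedral angles phi (atoms C of the previous residue, N, CA, C) and psi (atoms N, CA, C, N of the next residue) for every residue. The sketch below computes one dihedral angle from four atom positions. The coordinates are made-up test points, not taken from 5TRL, and reading real coordinates from a PDB file is left out.

```python
import math

Vector = tuple[float, float, float]


def _sub(a: Vector, b: Vector) -> Vector:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vector, b: Vector) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vector, b: Vector) -> Vector:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _scale(a: Vector, factor: float) -> Vector:
    return (a[0] * factor, a[1] * factor, a[2] * factor)


def dihedral(p0: Vector, p1: Vector, p2: Vector, p3: Vector) -> float:
    """Dihedral angle in degrees (-180..180) around the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central bond.
    b1 = _scale(b1, 1.0 / math.sqrt(_dot(b1, b1)))
    # Components of the outer bonds perpendicular to the central bond.
    v = _sub(b0, _scale(b1, _dot(b0, b1)))
    w = _sub(b2, _scale(b1, _dot(b2, b1)))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Made-up test points: cis (0 degrees) and a quarter turn (90 degrees).
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```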
5. To identify a protein from an unknown sequence, several criteria have to fit. These include:
- intrinsic properties
- hydrophobic, hydrophilic, etc.
- secondary structure
- conserved domains
- gene expression
Take the following amino acid sequence and run a BLAST search.
Which protein do you get from the sequence alignment?
Verify your assumption using further tools like PFAM and InterPro.
Which domains do you get, and do they make biological sense?
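One of the intrinsic properties listed above, hydrophobicity, can be estimated directly from the sequence. The sketch below computes the average Kyte-Doolittle hydropathy (often called GRAVY): positive values point to a more hydrophobic sequence, negative values to a more hydrophilic one. The two peptides are made-up examples, and the function name is chosen here for illustration.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Average hydropathy of a protein sequence (one-letter code)."""
    residues = sequence.upper()
    if not residues:
        raise ValueError("empty sequence")
    unknown = sorted(set(residues) - set(KYTE_DOOLITTLE))
    if unknown:
        raise ValueError(f"unsupported residues: {unknown}")
    return sum(KYTE_DOOLITTLE[r] for r in residues) / len(residues)


# Made-up peptides: one hydrophobic, one charged.
print(f"ILVF: {gravy('ILVF'):+.3f}")
print(f"DEKR: {gravy('DEKR'):+.3f}")
```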
Tutorial 4: Gene expression
The Gene Expression Omnibus (GEO) is a database of the National Center for Biotechnology Information (NCBI) that stores gene expression data from microarray and RNA-Seq experiments.
1. Go to the GEO database and search for the dataset GSE54839 and get an overview.
What question does the underlying experiment address?
Which platform was used to analyse the molecular profile?
Which conditions are included in the dataset?
2. Watch the video tutorial on how to use GEO2R. Apply the steps from the tutorial to analyse the differentially expressed genes in the dataset GSE54839 between the two conditions.
Which gene has the most significant change?
In which condition is this gene more strongly expressed?
How do you interpret the logFC value?
What is the adj.P.Val?
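To get a feeling for these two columns, the sketch below converts a log2 fold change into a plain fold change and applies the Benjamini-Hochberg correction to a list of p-values. It assumes that logFC is on a log2 scale and that the adjusted p-values use the Benjamini-Hochberg method, which is the default in GEO2R. The numbers are made up and do not come from GSE54839.

```python
def fold_change(log_fc: float) -> float:
    """Convert a log2 fold change into a linear fold change."""
    return 2.0 ** log_fc


def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up example values.
print(fold_change(1.0), fold_change(-1.0))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A logFC of 1 thus means a doubling and a logFC of -1 a halving of expression between the two groups. Which group is the reference depends on how the groups were defined in GEO2R.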
Genevestigator consists of a gene expression database and tools to analyse the stored biomedical or plant data from microarray and RNA-Seq experiments. Download the application and create a free account using your student email address.
3. Watch the introductory video tutorial and familiarize yourself with the Genevestigator application.
What are the two types of compendium-wide analyses?
4. Perform a condition search for the insulin receptor (INSR) in Homo sapiens using the HS_AFFY_U133PLUS2-0 data collection.
What is the average expression value for INSR in the rectum?
How many cancer categories are listed involving INSR?
Using the Perturbations tool, set the filtering criteria to Fold-Change = |2| and p-value < 0.001. If you do not know how to apply the Perturbations tool, watch the corresponding video tutorial on the Genevestigator homepage.
How does INSR expression change under the influenza virus condition?
Tutorial 5: Transcriptional regulation
1. Perform a transcriptional regulation analysis using MotifMap after watching the corresponding video tutorial on the homepage. Run a Motif Search, selecting the Human (hg18 multiz28way_placental) species track, and search for the transcription factor E2F1.
How many entries do you get?
Select the entry MA0024.
How do you interpret the sequence logo?
Search for the binding sites and order them starting with the lowest FDR.
How many binding sites are listed for E2F1 (according to default criteria)?
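For the sequence logo question, it helps to know how the height of a logo column is usually computed: the information content in bits, 2 minus the Shannon entropy of the base frequencies at that position. The sketch below leaves out the small-sample correction that logo tools often apply, and the counts are made up rather than taken from MA0024.

```python
import math


def column_information(counts: dict[str, int]) -> float:
    """Information content (bits) of one DNA motif column: 2 - Shannon entropy."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column has no observations")
    entropy = 0.0
    for n in counts.values():
        if n > 0:
            frequency = n / total
            entropy -= frequency * math.log2(frequency)
    return 2.0 - entropy


# Made-up columns: fully conserved, two equally likely bases, no preference.
print(column_information({"A": 10, "C": 0, "G": 0, "T": 0}))
print(column_information({"A": 5, "C": 0, "G": 5, "T": 0}))
print(column_information({"A": 5, "C": 5, "G": 5, "T": 5}))
```

Tall columns are therefore strongly conserved positions, and within a column each letter is drawn in proportion to its frequency.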
2. Genomatix consists of comprehensive software tools for genome annotation, regulation analysis, the analysis of high-throughput genomic data, literature and pathway mining, and much more. Create a free account using your student email address and have a look at the MatInspector tool, with which you can run a promoter analysis and search for transcription factor binding sites.
Perform a promoter analysis for IL10 by repeating the steps from the video tutorial and interpret the results output.
To get an idea how the Genomatix software is used for a scientific investigation read this paper concerning the CA/C1 peptidases of malaria parasites.
3. Virtual Footprint is a software suite for recognizing single or composite DNA patterns, especially for analyzing regulatory and promoter regions in bacterial genomes. Run a promoter analysis for the gene anr: select Pseudomonas aeruginosa (strain ATCC 15692 / PAO1), set the upstream size to 1000 and select the position weight matrix "Anr_Dnr | Pseudomonas aeruginosa".
Which promoter do you find, and at which position?
For a better understanding first go through the guided tour.
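To illustrate what scanning a promoter region with a position weight matrix means, here is a minimal sketch that converts base counts into log-odds scores and slides the matrix along a sequence. It is a generic textbook version and not the scoring used by Virtual Footprint. The count matrix, the sequence, the pseudocount and the uniform background are all assumptions made for this example.

```python
import math

BACKGROUND = 0.25   # assumed uniform base frequencies
PSEUDOCOUNT = 1.0   # assumed pseudocount to avoid log(0)


def log_odds_matrix(counts: list[dict[str, int]]) -> list[dict[str, float]]:
    """Turn per-position base counts into log2 odds scores against the background."""
    matrix = []
    for column in counts:
        total = sum(column.get(base, 0) for base in "ACGT") + 4 * PSEUDOCOUNT
        matrix.append({
            base: math.log2(((column.get(base, 0) + PSEUDOCOUNT) / total) / BACKGROUND)
            for base in "ACGT"
        })
    return matrix


def scan(sequence: str, matrix: list[dict[str, float]]) -> list[tuple[int, float]]:
    """Score every window of the sequence; returns (0-based start, score) pairs."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        hits.append((start, sum(matrix[i][base] for i, base in enumerate(window))))
    return hits


# Made-up 4 bp motif (TTGA in all 8 toy binding sites) and a toy sequence.
toy_counts = [{"T": 8}, {"T": 8}, {"G": 8}, {"A": 8}]
toy_matrix = log_odds_matrix(toy_counts)
best_start, best_score = max(scan("ACGTTGACCA", toy_matrix), key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```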
4. The blue light- and temperature-regulated antirepressor BluF (ycgF) from Escherichia coli (strain K12) interacts with BluR. Run a STRING search for BluF in Escherichia coli.
How do the two proteins interact?
Which cellular process is affected?
- Virtual Footprint
Tutorial 6/7: Protein location
Protein localization is the accumulation of a protein at a given site (or compartment). For protein recruitment, the amino acid sequences carry specific signal peptides that guide them to the endoplasmic reticulum, nucleus, membrane, cytosol and so on. In bioinformatics, identifying known signal peptides enables the prediction of protein localization.
1. Take the amino acid sequence and run an analysis with the signal peptide database SPdb.
What is the annotation for the human hit of this sequence?
To which human protein family does this signal peptide belong?
What is the length of the signal sequence?
2. Go to the SignalP-5.0 Server and read the instructions.
What types of signal peptides are predicted by the tool?
Run a SignalP analysis with this amino acid sequence.
Which signal peptide shows the highest predicted probability and what is the likelihood for it?
At which position does the cleavage site appear?
Run a BLAST search to identify the protein.
3. The identification of GPI anchors and/or transmembrane components enables the prediction of membrane-associated proteins.
Analyse the protein sequence of human CD48 for the presence of a GPI anchor using the big-PI Predictor.
Is CD48 a GPI-anchored protein?
Run an analysis for predicting transmembrane helices in proteins using the TMHMM Server.
How many amino acids are predicted to lie in the membrane, inside the cell and outside the cell?
Does the result agree with the function of CD48?
- big-PI Predictor
- TMHMM Server
Tutorial 8/9: Functional Sites in Proteins
Recognition of functional regions of a protein, such as binding sites for other proteins, macromolecules or DNA/RNA, is crucial for the prediction of molecular interactions and for functional classification. Similarity in the binding patterns of proteins is closely related to similarity in the proteins' functions.
1. Short linear motifs (SLiMs) are small functional protein modules that mediate protein-protein interactions. The Eukaryotic Linear Motif resource (ELM) is one of the largest databases of such motifs. Search for ELMs of SMAD9_HUMAN using the ELM server.
How many ELMs do you get for SMAD9?
Compare the results with the annotations for SMAD9 from UniProt.
2. Take the sequence for SMAD9 from UniProt and run a prediction analysis with PredictProtein.
Does SMAD9 have protein binding sites?
What are the GO terms for the molecular function and biological process with the highest reliability?
Name the predicted localization.
3. Tubulin is the major constituent of microtubules. Get the sequence for Tubulin alpha-1B chain (TBA1B_HUMAN) from UniProt and search for internal sequence repeats using the REPRO server.
What are the length and the score of the first hit?
Visualize the 3D structure using the RasMol software.
Tutorial 10/11: Cellular Communication
1. The MAP kinase pathway is involved in embryogenesis, cell differentiation, cell growth and apoptosis. Have a look at the protein-protein interactions using the STRING database and searching for BRAF.
Which interactions are validated by experiments?
2. Open the Pathways in cancer - Homo sapiens pathway map on KEGG.
Which interactions do you get on the KEGG pathway and which cellular function does BRAF strengthen?
3. This paper by Schuster, Fell and Dandekar is about the analysis of metabolic networks. Summarize how the authors analyzed the interplay between the pentose phosphate pathway (PPP) and glycolysis.
4. This paper deals with a cancer network reconstruction combining in vitro tissue models with in silico analyses. Have a look at the applied software, especially SQUAD and CellDesigner.
See also the paper by Di Cara et al. on the algorithm behind SQUAD.