István Albert, Bioinformatics, Penn State

BMMB 852: Applied Bioinformatics (Fall, 2016)

Lecture Notes

Lectures will appear below as they are presented. Homework is specified in each handout.

Lecture 1

Course information, homework and project information, introduction to computing, setting up your computer, basic Unix command-line usage, organizing your projects. Homework 1.

Lecture 2

Data formats, data analysis concepts, basic Unix commands: wc, grep, cut, sort, redirecting input and output streams, piping commands, processing a tabular file with Unix tools. Homework 2.

Lecture 3

What do words mean? Sequence Ontology and Gene Ontology, Gene Set Enrichment. Homework 3.

Lecture 4

Accessing data from scientific publications, the GenBank format, automated download of data from NCBI, installing and using Entrez Direct. Homework 4.

Lecture 5

FASTA and FASTQ formats, Phred quality scores, encodings. Homework 5.
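As a rough illustration of the quality encoding topic, the sketch below decodes a FASTQ quality string into Phred scores and error probabilities. It assumes the Phred+33 offset (Sanger and Illumina 1.8+); older Illumina files may use an offset of 64. The function names and the sample quality string are hypothetical and not taken from the lecture handout.

```python
def phred33_to_scores(quality: str) -> list[int]:
    """Convert a quality string to integer Phred scores, assuming a +33 offset."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(score: int) -> float:
    """Probability that a base call is wrong for Phred score Q: 10^(-Q/10)."""
    return 10 ** (-score / 10)


# Hypothetical quality string, for illustration only.
quality = "I5+!"
for ch, q in zip(quality, phred33_to_scores(quality)):
    print(ch, q, f"{error_probability(q):.4f}")
```

A score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 40 to a 1 in 10,000 chance.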

Lecture 6

Handling compressed file archives, sequencing concepts, FASTQ quality control, the fastqc tool. Homework 6.

Lecture 7

Sequencing concepts, sequencing depth and coverages, more details on the fastq quality control plots. Homework 7.
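A minimal sketch of the usual back-of-the-envelope coverage estimate: average coverage equals the number of reads times the read length, divided by the genome size. The function name and the numbers are hypothetical, and the estimate assumes that all reads have the same length and that every read maps to the genome.

```python
def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average coverage C = N * L / G (reads * read length / genome size)."""
    return num_reads * read_length / genome_size


# Hypothetical run: one million 100 bp reads over a 5 Mb genome.
print(expected_coverage(1_000_000, 100, 5_000_000))
```

Real coverage varies along the genome, so this figure is best treated as a planning estimate rather than a measurement.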

Lecture 8

The Short Read Archive (SRA), downloading data for the Ebola project. Homework 8.

Lecture 9

Quality control of sequencing data. Trimming reads, removing adapters. Homework 9.

Lecture 10

Automating tasks, writing scripts, bash concepts. Homework 10.

Lecture 11

The basics of alignments, global, local and semiglobal alignments, scoring matrices, pairwise alignments, EMBOSS. Homework 11.

Lecture 12: Sequence patterns

Advanced pattern matching. K-mers. Catching up with materials presented in previous lectures: More on alignments, quality control and sequence adapters.
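To make the k-mer idea concrete, here is a small sketch that counts every overlapping substring of length k in a sequence. The function name and the example sequence are hypothetical. The sketch treats the sequence as a plain string and ignores reverse complements and ambiguous bases.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count all overlapping k-mers (substrings of length k) in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical sequence, for illustration only.
counts = count_kmers("ATGATG", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

A sequence of length n contains n - k + 1 overlapping k-mers, so the six-base example above yields four 3-mers.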

Lecture 13: Basic Local Alignment Search Tool, BLAST

Installing and using BLAST, search strategies, BLAST settings and configuration. Homework 13.

Lecture 14: BLAST Databases

Using BLAST. Interacting with BLAST databases. Homework 14.

Lecture 15: Short Read Aligners

Short read alignments: bwa, bowtie

Lecture 16: SAM Format

Sequence Alignment Maps: The SAM format

Lecture 17: Working with SAM/BAM files.

Working with SAM/BAM files, samtools

Lecture 18: Analyzing SAM files.

Analyzing SAM/BAM files with samtools.

Lecture 19: Some programming required

Programming skills, writing simple scripts with AWK.

Lecture 20: Data Visualization

Genomic data visualization, IGV, IGB.

Lecture 21: Visualizing Genomic Variation

Visualize large-scale genomic reorganization. Get your pen and paper ready. There will be drawing involved...

Lecture 22: Genomic Variation

Genomic variation, pileups, definition of SNPs, SNVs and other variant types.

Lecture 23: The Variant Call Format

What the Variant Call Format (VCF) is; understanding its fields and their meaning.

Lecture 24: Variant Calling in Practice

What makes variant calling difficult.

Lecture 25: Multi-sample variant calling

Multi-sample variant calling. Variant effect prediction.

Lecture 26: Introduction to RNA-Seq

Introduction to RNA-Seq analysis.

Lecture 27: Differential Expression with RNA-Seq

Differential expression with RNA-Seq data.

Lecture 27: Differential Expression with RNA-Seq

How to perform RNA-Seq data analysis.

Course Syllabus

Instructor: Istvan Albert

Course records: PSU ELion

Course registration: BMMB 852 - Applied Bioinformatics

The purpose of this course is to introduce students to the various applications of high-throughput sequencing, including ChIP-Seq, RNA-Seq, SNP calling, metagenomics, de novo assembly and others. The course material will concentrate on presenting complete data analysis scenarios for each of these application domains and will introduce students to a wide variety of existing tools and techniques. We expect that by the end of the coursework students will:

  • understand common bioinformatics data formats and standards
  • become familiar with the practice of analyzing short-read sequencing data from various instruments:
    • Illumina HiSeq/MiSeq sequencers, PacBio sequencers, MinION platforms
  • develop the computationally oriented thinking that is necessary to take on large-scale data analysis projects
  • understand data analysis principles of methodologies such as:
    • short read and long read alignments
    • ChIP-Seq analysis and peak calling
    • interval query and manipulation
    • SNP calling and genomic variation detection
    • genome assembly with open source tools
    • metagenomics analysis
  • filter, extract and combine data with scripting languages
  • automate tasks with shell scripts to create reusable data pipelines
  • plot and visualize results with R and other packages

Access to a Mac or Linux computer is necessary to perform the homework. Only Mac OS X (Tiger/Leopard) and Linux operating systems are supported.

Note: Computers using the Windows operating system must have Linux installed (unfortunately, due to the wide variety of Windows hardware, we are unable to assist with this task).

Grading and Homework

This course will have a total of 30 homework assignments, which are given out at the end of each lecture and are due by the first lecture (Tuesday) of each week.

The final grade will be an average of the grades obtained on the homework. For more details please refer to the information presented during the first lecture.

We want to emphasize that the primary goal of this coursework is to improve students' ability to handle and interpret data sets. Therefore the evaluation process is relative to each student's initial aptitude. We aim to focus on developing permanent skills and talents that are not just immediately useful but also provide the foundation for a further, more in-depth understanding of informatics in general.

All Penn State Policies regarding ethics and honorable behavior apply to this course.