Introduction - Genome Assembly

Generate layout for printing

Lecturer: Alexei Drummond

Acknowledgement

Slides adapted from material originally developed by

Sebastian Schmeier

s.schmeier@gmail.com

http://genomics.sschmeier.com

Overview

Learning objectives
DNA sequence assembly problem
The greedy puzzle
Graph-based genome assembly
Seven bridges of Königsberg
We construct a de Bruijn graph
Underlying assumptions of graph-based genome assemblies
Summary

Learning objectives

Be able to explain the DNA sequence assembly problem
Be able to list the assumptions underlying a genome assembly
Be able discuss how the assumptions relate to reality
Be able to describe the principles of the greedy algorithm
Be able to describe the principles underlying a graph-based genome assembly

DNA sequence assembly problem

The most commonly used DNA sequencing technology uses random fragmentation and sequencing of short reads < 300nt
This results in millions of short sequences
Genome assembly involves joining overlapping short sequence fragments together into long pieces to recover a continuous sequence of the genome
How to do this efficiently?

The greedy puzzle

Assumptions:

If two sequence reads share a common overlapping substring of letters, then it is because they are likely to have originated from the same chromosomal regions in the genome

Objective :

Find a final string “genome” such that:

all reads are substrings of “genome”
“genome” should be as short as possible

Greedy algorithm (find a solution incrementally):

Pick the highest scoring overlap.
Merge the two overlapping fragments and add the resulting new sequence to the pool of sequences.
Repeat until no more merges can be carried out.

Objective :

Find a final string “genome” such that:

all reads are substrings of “genome”
“genome” should be as short as possible

Greedy algorithm (find a solution incrementally):

Pick the highest scoring overlap.
Merge the two overlapping fragments and add the resulting new sequence to the pool of sequences.
Repeat until no more merges can be carried out.

Objective :

Find a final string “genome” such that:

all reads are substrings of “genome”
“genome” should be as short as possible

Greedy algorithm (find a solution incrementally):

Pick the highest scoring overlap.
Merge the two overlapping fragments and add the resulting new sequence to the pool of sequences.
Repeat until no more merges can be carried out.

Greedy algorithms are not a good solution :

Assumptions:

Reads are 100% accurate
Identical reads must come from the same location on the genome
All substrings from the genome are represented in the set of sequenced reads
But none of them hold true

Greedy algorithms are not a good solution :

Greedy algorithm for genome assembly are extremely slow.
It is not feasible to do this iterative process with millions of reads in a reasonable time and expect an “optimal” solution.

Graph theory

Represent relationships between entities

Graph theory

Analyse connections e.g. find well–connected hubs in graphs

Graph theory

Directed graph e.g. information flow

Graph-based genome assembly

Preprocess the sequence reads to determine the pair-wise overlap information and represent these binary relationships

Graph-based genome assembly

Preprocess the sequence reads to determine the pair-wise overlap information and represent these binary relationships

Graph-based genome assembly

Preprocess the sequence reads to determine the pair-wise overlap information and represent these binary relationships
The problem of finding a consistent lay-out (i.e. genome) can then be formulated in terms of searching a paths in the graph satisfying certain criteria

Graph-based genome assembly

Assumption: all substrings are represented

Even modern sequencers that generate 150nt+ reads do not cover all possible 150-mers present in the genome
Thus, people generally use substrings of certain length k (k-mers)

→ Here we use 3-mers by cutting the original reads into reads of length 3

Graph-based genome assembly

Make them unique

Graph-based genome assembly

Each k-mer is a vertex
Draw edge from x to y, where

suffix from x overlaps prefix from y

Graph-based genome assembly

Each k-mer is a vertex
Draw edge from x to y, where

suffix from x overlaps prefix from y

Graph-based genome assembly

Each k-mer is a vertex
Draw edge from x to y, where

suffix from x overlaps prefix from y

Graph-based genome assembly

Each k-mer is a vertex
Draw edge from x to y, where

suffix from x overlaps prefix from y

Graph-based genome assembly

Find Hamiltonian path, that is, a path that visits every vertex exactly once
Record the First letter of each vertex + All letters of last vertex
UNFORTUNATELY:

→ The Hamiltonian path problem is very difficult to solve (NP-complete)

Graph-based genome assembly

Find Hamiltonian path, that is, a path that visits every vertex exactly once
Record the First letter of each vertex + All letters of last vertex
UNFORTUNATELY:

→ The Hamiltonian path problem is very difficult to solve (NP-complete)

Seven bridges of Königsberg

In 1735 Leonhard Euler was presented with the following problem:

Find a walk through the city that would cross each bridge once and only once
He proved that a such a walk does not exist for Königsberg

Computationally finding an Eulerian path is much easier than an Hamiltonian

→ we need to reformulate our assembly problem

de Bruijn graph

We construct a de Bruijn graph :

edges represent k-mers
vertices correspond to (k-1)-mers

Form a node for every distinct prefix or suffix of a k-mer
Connect vertex x to vertex y with a directed edge if some k-mer (e.g., ATG) has prefix x (e.g., AT) and suffix y (e.g., TG), and label the edge with this k-mer.

Original reads: ATGCG GCGTG GTGGC TGGCA

k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

de Bruijn graph

k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

de Bruijn graph

k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

de Bruijn graph

k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

de Bruijn graph

k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

de Bruijn graph

k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

de Bruijn graph

k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

de Bruijn graph

k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

Eulerian path

Can we find a DNA sequence containing all k-mers?

In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once?
Eulerian path

Eulerian path

k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

Eulerian path

k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

Eulerian path

k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

More assumptions of graph-based genome assemblies

Hidden assumptions that do not hold for short read sequencing. We took for granted that:

We can generate all k-mers present in the genome
All k-mers are error free
Each k-mer appears at most once in the genome

More assumptions of graph-based genome assemblies

Hidden assumptions that do not hold for short read sequencing. We took for granted that:

We can generate all k-mers present in the genome
All k-mers are error free
Each k-mer appears at most once in the genome

Due to these reasons we do NOT choose the longest possible k-mer
The smaller the k-mer the higher the probability that we see all k-mers
Also, errors:

More assumptions of graph-based genome assemblies

Hidden assumptions that do not hold for short read sequencing. We took for granted that:

We can generate all k-mers present in the genome
All k-mers are error free
Each k-mer appears at most once in the genome

→ Repeats
This is most often not true
This is known as k-mer multiplicity

More assumptions of graph-based genome assemblies

k-mers: ATG, GCA, TGC, TGC, GTG, GTG, GCG, GCG, CGT, CGT

Distinct (k-1)-mers: ATGCGTGCGTGCA

De Bruijn graph from reads with errors

De Bruijn graph of a genome with repeats

Summary

DNA sequence assembly problem

Given millions of short sequences, join overlapping short sequence fragments into long pieces to recover a continuous sequence of the genome.

Greedy assembly

Find the shortest common superstring by iteratively joining and merging overlapping reads until all reads have been merged.

Graph-based assembly

Pre-process sequence reads to determine the pair-wise overlap information of substrings and represent these binary relationships in a graph.
Search a consistent genome layout by finding an eulerian path (visit each edge once) through the graph satisfying certain criteria.

Summary (2)

Assumptions:

If two sequence reads share a common overlapping substring, they are likely originating from the same chromosomal regions in the genome
Sequenced reads are 100% accurate (not true)
Identical reads must come from the same location on the genome (mostly not true)
All substrings from the genome are represented in the set of sequenced reads (very likely not true)
All k-mers are error free (not true)
We can generate all k-mers present in the genome (very likely untrue)
Each k-mer appears at most once in the genome (likely untrue)

References:
How to apply de Bruijn graphs to genome assembly. Phillip E C Compeau, Pavel A Pevzner & Glenn Tesler (2011) . Nature Biotechnology 29, 987–991
Sequence Assembly. Lecture by Mark Craven (craven@biostat.wisc.edu). BMI/CS 576 (www.biostat.wisc.edu/bmi576/), Fall 2011
Comparing De Novo Genome Assembly: The Long and Short of It. Narzisi G, Mishra B (2011) PLoS ONE6(4): e19175.

Seven bridges of Königsberg II

The case of directed graphs is similar:

A graph in which indegrees are equal to outdegrees for all nodes is called 'balanced'.
Euler's theorem states that a connected directed graph has an Eulerian cycle if and only if it is balanced.

References

How to apply de Bruijn graphs to genome assembly. Phillip E C Compeau, Pavel A Pevzner & Glenn Tesler. Nature Biotechnology 29, 987–991 (2011) doi:10.1038/nbt.2023 Published online 08 November 2011
Sequence Assembly. Lecture by Mark Craven (craven@biostat.wisc.edu). BMI/CS 576 (www.biostat.wisc.edu/bmi576/), Fall 2011
Comparing De Novo Genome Assembly: The Long and Short of It. Narzisi G, Mishra B (2011) PLoS ONE6(4): e19175.