Introduction - Genome Assembly
Lecturer: Alexei Drummond

Acknowledgement

Slides adapted from material originally developed by

Sebastian Schmeier

s.schmeier@gmail.com

http://genomics.sschmeier.com

Overview

  • Learning objectives
  • DNA sequence assembly problem
  • The greedy puzzle
  • Graph-based genome assembly
  • Seven bridges of Königsberg
  • We construct a de Bruijn graph
  • Underlying assumptions of graph-based genome assemblies
  • Summary

Learning objectives

  • Be able to explain the DNA sequence assembly problem
  • Be able to list the assumptions underlying a genome assembly
  • Be able discuss how the assumptions relate to reality
  • Be able to describe the principles of the greedy algorithm
  • Be able to describe the principles underlying a graph-based genome assembly

DNA sequence assembly problem

  1. The most commonly used DNA sequencing technology uses random fragmentation and sequencing of short reads < 300nt
  2. This results in millions of short sequences
  3. Genome assembly involves joining overlapping short sequence fragments together into long pieces to recover a continuous sequence of the genome
  4. How to do this efficiently?

The greedy puzzle

Assumptions:
  • If two sequence reads share a common overlapping substring of letters, then it is because they are likely to have originated from the same chromosomal regions in the genome

Objective :

Find a final string “genome” such that:
  • all reads are substrings of “genome”
  • “genome” should be as short as possible
Greedy algorithm (find a solution incrementally):
  1. Pick the highest scoring overlap.
  2. Merge the two overlapping fragments and add the resulting new sequence to the pool of sequences.
  3. Repeat until no more merges can be carried out.

Objective :

Find a final string “genome” such that:
  • all reads are substrings of “genome”
  • “genome” should be as short as possible
Greedy algorithm (find a solution incrementally):
  1. Pick the highest scoring overlap.
  2. Merge the two overlapping fragments and add the resulting new sequence to the pool of sequences.
  3. Repeat until no more merges can be carried out.

Objective :

Find a final string “genome” such that:
  • all reads are substrings of “genome”
  • “genome” should be as short as possible
Greedy algorithm (find a solution incrementally):
  1. Pick the highest scoring overlap.
  2. Merge the two overlapping fragments and add the resulting new sequence to the pool of sequences.
  3. Repeat until no more merges can be carried out.

Greedy algorithms are not a good solution :

Assumptions:
  • Reads are 100% accurate
  • Identical reads must come from the same location on the genome
  • All substrings from the genome are represented in the set of sequenced reads
  • But none of them hold true

Greedy algorithms are not a good solution :

  • Greedy algorithm for genome assembly are extremely slow.
  • It is not feasible to do this iterative process with millions of reads in a reasonable time and expect an “optimal” solution.

Graph theory

Represent relationships between entities

Graph theory

Analyse connections e.g. find well–connected hubs in graphs

Graph theory

Directed graph e.g. information flow

Graph-based genome assembly

  • Preprocess the sequence reads to determine the pair-wise overlap information and represent these binary relationships

Graph-based genome assembly

  • Preprocess the sequence reads to determine the pair-wise overlap information and represent these binary relationships

Graph-based genome assembly

  • Preprocess the sequence reads to determine the pair-wise overlap information and represent these binary relationships
  • The problem of finding a consistent lay-out (i.e. genome) can then be formulated in terms of searching a paths in the graph satisfying certain criteria

Graph-based genome assembly

  • Assumption: all substrings are represented
    • Even modern sequencers that generate 150nt+ reads do not cover all possible 150-mers present in the genome
    • Thus, people generally use substrings of certain length k (k-mers)
  • → Here we use 3-mers by cutting the original reads into reads of length 3

Graph-based genome assembly

Make them unique

Graph-based genome assembly

  • Each k-mer is a vertex
  • Draw edge from x to y, where
    • suffix from x overlaps prefix from y

Graph-based genome assembly

  • Each k-mer is a vertex
  • Draw edge from x to y, where
    • suffix from x overlaps prefix from y

Graph-based genome assembly

  • Each k-mer is a vertex
  • Draw edge from x to y, where
    • suffix from x overlaps prefix from y

Graph-based genome assembly

  • Each k-mer is a vertex
  • Draw edge from x to y, where
    • suffix from x overlaps prefix from y

Graph-based genome assembly

  • Find Hamiltonian path, that is, a path that visits every vertex exactly once
  • Record the First letter of each vertex + All letters of last vertex
  • UNFORTUNATELY:
    • → The Hamiltonian path problem is very difficult to solve (NP-complete)

Graph-based genome assembly

  • Find Hamiltonian path, that is, a path that visits every vertex exactly once
  • Record the First letter of each vertex + All letters of last vertex
  • UNFORTUNATELY:
    • → The Hamiltonian path problem is very difficult to solve (NP-complete)

Seven bridges of Königsberg

  • In 1735 Leonhard Euler was presented with the following problem:
    • Find a walk through the city that would cross each bridge once and only once
    • He proved that a such a walk does not exist for Königsberg
  • Computationally finding an Eulerian path is much easier than an Hamiltonian
we need to reformulate our assembly problem

de Bruijn graph

We construct a de Bruijn graph :

  • edges represent k-mers
  • vertices correspond to (k-1)-mers
  1. Form a node for every distinct prefix or suffix of a k-mer
  2. Connect vertex x to vertex y with a directed edge if some k-mer (e.g., ATG) has prefix x (e.g., AT) and suffix y (e.g., TG), and label the edge with this k-mer.

Original reads:      ATGCG    GCGTG    GTGGC    TGGCA   

k-mers:    ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

de Bruijn graph

k-mers:    ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

de Bruijn graph

k-mers:    ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

de Bruijn graph

k-mers:    ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

de Bruijn graph

k-mers:    ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

de Bruijn graph

k-mers:    ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

de Bruijn graph

k-mers:    ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

de Bruijn graph

k-mers:    ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

Eulerian path

Can we find a DNA sequence containing all k-mers?

  • In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once?
  • Eulerian path

Eulerian path

k-mers:    ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

Eulerian path

k-mers:    ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

Eulerian path

k-mers:    ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Distinct (k-1)-mers:

More assumptions of graph-based genome assemblies

  • Hidden assumptions that do not hold for short read sequencing. We took for granted that:
    1. We can generate all k-mers present in the genome
    2. All k-mers are error free
    3. Each k-mer appears at most once in the genome

More assumptions of graph-based genome assemblies

  • Hidden assumptions that do not hold for short read sequencing. We took for granted that:
    1. We can generate all k-mers present in the genome
    2. All k-mers are error free
    3. Each k-mer appears at most once in the genome
  • Due to these reasons we do NOT choose the longest possible k-mer
  • The smaller the k-mer the higher the probability that we see all k-mers
  • Also, errors:

More assumptions of graph-based genome assemblies

  • Hidden assumptions that do not hold for short read sequencing. We took for granted that:
    1. We can generate all k-mers present in the genome
    2. All k-mers are error free
    3. Each k-mer appears at most once in the genome
      • → Repeats
      • This is most often not true
      • This is known as k-mer multiplicity

More assumptions of graph-based genome assemblies

k-mers:    ATG, GCA, TGC, TGC, GTG, GTG, GCG, GCG, CGT, CGT

Distinct (k-1)-mers:    ATGCGTGCGTGCA

De Bruijn graph from reads with errors

De Bruijn graph of a genome with repeats

Summary

  • DNA sequence assembly problem
    • Given millions of short sequences, join overlapping short sequence fragments into long pieces to recover a continuous sequence of the genome.
  • Greedy assembly
    • Find the shortest common superstring by iteratively joining and merging overlapping reads until all reads have been merged.
  • Graph-based assembly
    • Pre-process sequence reads to determine the pair-wise overlap information of substrings and represent these binary relationships in a graph.
    • Search a consistent genome layout by finding an eulerian path (visit each edge once) through the graph satisfying certain criteria.

Summary (2)

  • Assumptions:
    • If two sequence reads share a common overlapping substring, they are likely originating from the same chromosomal regions in the genome
    • Sequenced reads are 100% accurate (not true)
    • Identical reads must come from the same location on the genome (mostly not true)
    • All substrings from the genome are represented in the set of sequenced reads (very likely not true)
    • All k-mers are error free (not true)
    • We can generate all k-mers present in the genome (very likely untrue)
    • Each k-mer appears at most once in the genome (likely untrue)

References:
How to apply de Bruijn graphs to genome assembly. Phillip E C Compeau, Pavel A Pevzner & Glenn Tesler (2011) . Nature Biotechnology 29, 987–991
Sequence Assembly. Lecture by Mark Craven (craven@biostat.wisc.edu). BMI/CS 576 (www.biostat.wisc.edu/bmi576/), Fall 2011
Comparing De Novo Genome Assembly: The Long and Short of It. Narzisi G, Mishra B (2011) PLoS ONE6(4): e19175.

Seven bridges of Königsberg II

  • The case of directed graphs is similar:
    • A graph in which indegrees are equal to outdegrees for all nodes is called 'balanced'.
    • Euler's theorem states that a connected directed graph has an Eulerian cycle if and only if it is balanced.

References

  • How to apply de Bruijn graphs to genome assembly. Phillip E C Compeau, Pavel A Pevzner & Glenn Tesler. Nature Biotechnology 29, 987–991 (2011) doi:10.1038/nbt.2023 Published online 08 November 2011
  • Sequence Assembly. Lecture by Mark Craven (craven@biostat.wisc.edu). BMI/CS 576 (www.biostat.wisc.edu/bmi576/), Fall 2011
  • Comparing De Novo Genome Assembly: The Long and Short of It. Narzisi G, Mishra B (2011) PLoS ONE6(4): e19175.