Dynamic Programming – Edit Distance

Dynamic programming is essentially recursion without repetition. Developing a dynamic programming algorithm generally involves two separate steps:

•  Formulate problem recursively. Write down a formula for the whole problem as a simple combination of answers to smaller subproblems.

•  Build solution to recurrence from bottom up. Write an algorithm that starts with base cases and works its way up to the final solution.

Dynamic programming algorithms need to store the results of intermediate subproblems. This is often but not always done with some kind of table. We will now cover a number of examples of problems in which the solution is based on dynamic programming strategy.
Edit Distance

The words “computer” and “commuter” are very similar, and a change of just one letter, p- m, will change the first word into the second. The word “sport” can be changed into “sort” by the deletion of the ‘p’, or equivalently, ‘sort’ can be changed into ‘sport’ by the insertion of ‘p’. The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of:

•  change a letter,

•  insert a letter or

•  delete a letter

For example, the edit distance between FOOD and MONEY is at most four:

FOOD −→ MOOD −→ MON –D

−→ MONED −→ MONEY

Edit Distance: Applications

There are numerous applications of the Edit Distance algorithm. Here are some examples:

Spelling Correction

If a text contains a word that is not in the dictionary, a ‘close’ word, i.e. one with a small edit distance, may be suggested as a correction. Most word processing applications, such as Microsoft Word, have spelling checking and correction facility. When Word, for example, finds an incorrectly spelled word, it makes suggestions of possible replacements.

Plagiarism Detection

If someone copies, say, a C program and makes a few changes here and there, for example, change variable names, add a comment of two, the edit distance between the source and copy may be small. The edit distance provides an indication of similarity that might be too close in some situations.

Computational Molecular Biology DNA is a polymer. The monomer units of DNA are nucleotides, and the polymer is known as a “polynucleotide.” Each nucleotide consists of a 5-carbon sugar (deoxyribose), a nitrogen containing base attached to the sugar, and a phosphate group. There are four different types of nucleotides found in DNA, differing only in the nitrogenous base. The four nucleotides are given one letter abbreviations as shorthand for the four bases.

•  A-adenine

•  G-guanine

•  C-cytosine

•  T-thymine

Double-helix of DNA molecule with nucleotides Figure of Double-helix of DNA molecule with nucleotides goes here. The edit distance like algorithms are used to compute a distance between DNA sequences (strings over A,C,G,T, or protein sequences (over an alphabet of 20 amino acids), for various purposes, e.g.:

•  to find genes or proteins that may have shared functions or properties

•  to infer family relationships and evolutionary trees over different organisms.

Speech Recognition

Algorithms similar to those for the edit-distance problem are used in some speech recognition systems. Find a close match between a new utterance and one in a library of classified utterances.

Edit Distance Algorithm

A better way to display this editing process is to place the words above the other:

S

D

I

M D M
MA A R TT H

S

S

The first word has a gap for every insertion (I) and the second word has a gap for every deletion (D). Columns with two different characters correspond to substitutions (S). Matches (M) do not count. The Edit transcript is defined as a string over the alphabet M, S, I, D that describes a transformation of one string into another. For example

 S      D      I      M     D     M

1+    1+    1+    0+    1+    0+    = 4

In general, it is not easy to determine the optimal edit distance. For example, the distance between ALGORITHM and ALTRUISTIC  is at most 6.

 A    L    G    O    R           I          T     H    M

A         L             T     R    U    I      S   T           I      C

Is this optimal?

Edit Distance: Dynamic Programming Algorithm

Suppose we have an m-character string A and an n-character string B. Define E(i, j) to be the edit distance between the first i characters of A and the first j characters of B. For example,

A L _G      O     R    I    T     H     M

A     L            T     R    U     I    S      T     I    C

The edit distance between entire strings A and B is E(m, n). The gap representation for the edit sequences has a crucial “optimal substructure” property. If we remove the last column, the remaining columns must represent the shortest edit sequence for the remaining substrings. The edit distance is 6 for the following two words.

A    L    G    O      R           I             T     H   M

A    L             T     R    U    I      S   T           I      C

If we remove the last column, the edit distance reduces to 5.

A    L    G    O    R           I          T    H

A       L           T     R    U    I    S    T    I

We can use the optimal substructure property to devise a recursive formulation of the edit distance problem. There are a couple of obvious base cases:

•  The only way to convert an empty string into a string of j characters is by doing j insertions. Thus

E(0, j) = j

•  The only way to convert a string of i characters into the empty string is with i deletions:

E(i, 0) = i

There are four possibilities for the last column in the shortest possible edit sequence:

Deletion: Last entry in bottom row is empty.

i=3

A  _L      G    O     R   I    T     H     M

A    Lj=2 _T     R    U     I    S     T     I    C

In this case

E(i, j) = E(i − 1, j) + 1

Insertion: The last entry in the top row is empty.

i=5

A L    G_ O     R         I    T     H     M

A    L            T     R    U   I    S      T     I    C

j=5

In this case

E(i, j) = E(i, j − 1) + 1

Substitution: Both rows have characters in the last column.

i=4

A L _ G     O      R   I    T     H     M

A    L            T  R   U     I    S     T     I    C

j=3

If the characters are different, then

E(i, j) = E(i − 1, j − 1) + 1

A L _G      O      R   I    T     H     M

i=5

A    L            T     R  U     I    S      T     I    C

j=4
If characters are same, no substitution is needed:
E(i, j) = E(i − 1, j − 1)

Thus the edit distance E(i, j) is the smallest of the four possibilities:

E(i, j) = min

E(i − 1, j) + 1

E(i, j − 1) + 1        E(i − 1, j − 1) + 1  if A[i] = B[j] E(i − 1, j − 1)             if A[i] = B[j]

Consider the example of edit between the words “ARTS” and “MATHS”:

A    R    T    S

M    A    T    H    S

The edit distance would be in E(4, 5). If we recursion to compute, we will have

E(3, 5) + 1

E(4, 5) = min   E(4, 4) + 1

E(3, 4) + 1  if A[4] = B[5]

            E(3, 4)   if A[4] = B[5]

Recursion clearly leads to the same repetitive call pattern that we saw in Fibonnaci sequence. To avoid this, we will use the DP approach. We will build the solution bottom-up. We will use the base case E(0, j) to fill first row and the base case E(i, 0) to fill first column. We will fill the remaining E matrix row by row.

A

R

T

S

M
A
T
H

S

A

R

T

S

3 4→
M ↓1
A ↓2

T

↓3
H ↓4

S

↓5

First row and first column entries using the base cases

We can now fill the second row. The table not only shows the values of the cells E[i, j] but also arrows

that indicate how it was computed using values in E[i − 1, j], E[i, j − 1] and E[i − 1, j − 1]. Thus, if a cell E[i, j] has a down arrow from E[i − 1, j] then the minimum was found using E[i − 1, j]. For a right arrow, the minimum was found using E[i, j − 1]. For a diagonal down right arrow, the minimum was found

using E[i − 1, j − 1]. There are certain cells that have two arrows pointed to it. In such a case, the minimum could be obtained from the diagonal E[i − 1, j − 1] and either of E[i − 1, j] and E[i, j − 1]. We will use these arrows later to determine the edit script.

A

R

T

S

M ↓1 1
A ↓2
T ↓3
H ↓4

S

↓5

A

R

T

S

3 4→
M ↓1 1 2→
A ↓2

T

↓3
H ↓4

S

↓5

Computing E[1, 1] and E[1, 2]

A

R

T

S

M
A ↓2
T ↓3
H ↓4

S

↓5

A

R

T

S

3 4→
M 3 4→
A ↓2

T

↓3
H ↓4

S

↓5

Computing E[1, 3] and E[1, 4]

An edit script can be extracted by following a unique path from E[0, 0] to E[4, 5]. There are three possible paths in the current example. Let us follow these paths and compute the edit script. In an actual implementation of the dynamic programming version of the edit distance algorithm, the arrows would be recorded using an appropriate data structure. For example, each cell in the matrix could be a record with fields for the value (numeric) and flags for the three incoming arrows.

A

R

T

S

1 →2 →3 4→
M ↓1 1 →2 →3 4→
A ↓2 1 →2 →3 4→

T

↓3

2

2 2 3→
H ↓4

3

↓3 ↓3 3

S

↓5

4

↓4 ↓4 3

The final table with all E[i, j] entries computed

Solution path 1:

1+    0+     1+     1+       0     = 3

A

R

T

S

1 →2 →3 4→
M ↓1 1 →2 →3 4→
A ↓2 1 →2 →3 4→

T

↓3

2

2 2 3→
H ↓4

3

↓3 ↓3 3

S

↓5

4

↓4 ↓4 3

Possible edit scripts. The red arrows from E[0, 0] to E[4, 5] show the paths that can be followed to extract edit scripts.

Solution path 2:

1+    1+    0+    1+     0     = 3

Solution path 3:

1+    0+    1+    0+    1+     0     = 3

Analysis of DP Edit Distance

There are Θ(n2) entries in the matrix. Each entry E(i, j) takes Θ(1) time to compute. The total running time is Θ(n2).

Advertisements

About Khuram Ali

Programming... Programming and Programming...!!!

Posted on April 27, 2013, in Algorithms and tagged , , , , , , , , , , , . Bookmark the permalink. 4 Comments.

  1. I think there is some issue with formatting which makes it hard to follow some of the expressions or equations. I am using latest Chrome. Text alignments are not right.

  1. Pingback: Dynamic Programming – Edit Distance | Khuram Ali

  2. Pingback: Khuram Ali

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: