Sequence Matcher
The SequenceMatcher
method will compare two given strings and return data that presents how
similar the two strings are. Let's try this out together using the ratio()
object. This will return the comparison data in decimal format.
>>> import difflib
>>> from difflib import SequenceMatcher
>>> str1 = 'I like pizza'
>>> str2 = 'I like tacos'
>>> seq = SequenceMatcher(a=str1, b=str2)
>>> print(seq.ratio())
0.66666666
We create a new variable that encapsulates the SequenceMatcher
class with two parameters, a
and b
. Although, the method actually accepts three parameters: None
, a
, and b
. In order for the the method to acknowledge our two strings, we need
to assign each of the string values to the method's variables, SequenceMatcher(a=str1, b=str2)
.
Once all of the necessary variables have been defined and the
SequenceMatcher has been given at least two parameters, we can now print
the value using the ratio()
object that we'd mentioned
earlier. This determines the ratio of characters that are similar in the
two strings and the result is then returned as a decimal. The ratio()
object is one of a few that belong to the
Sequence Matcher class.
Differ
The Differ
class is the opposite of SequenceMatcher
; it takes in lines of text and finds the differences between the strings. However, the Differ
class is unique in its usage of deltas, making it even more readable and easier for humans to spot the differences.
For instance, when adding new characters to the second string in a comparison between two strings, a '+ '
will appear before the line that has received the additional characters.
As you have probably guessed, deleting some of the characters that were visible in the first string will cause '- '
to pop up before the second line of text.
If a line is the same in both sequences, ' '
will be returned and if there is a line missing, then you will see '? '
. Additionally, you can also utilize attributes like ratio()
, which we saw in the last example. Let's see the Differ
class in action.
>>> import difflib
>>> from difflib import Differ
>>> str1 = "I would like to order a pepperoni pizza"
>>> str2 = "I would like to order a veggie burger"
>>> str1_lines = str1.splitlines()
>>> str2_lines = str2.splitlines()
>>> d = difflib.Differ()
>>> diff = d.compare(str1_lines, str2_lines)
>>> print('\n'.join(diff))
I would like to order a
'- ' pepperoni pizza
'+ ' veggie burger
In the example above, we begin by importing the module and Differ
class. Once we have defined our two strings that we want to compare, we must invoke the splitlines()
function on the two strings.
>>> str1_lines = str1.splitlines()
>>> str2_lines = str2.splitlines()
This will allow us to compare the strings by each line rather than by each individual character.
Once we have defined a variable that contains the Differ
class, we create another that contains Differ
with the compare()
object, taking in the two strings as parameters.
>>> diff = d.compare(str1_lines, str2_lines)
We call the print function and join the diff
variable with a line enter so that our result is formatted in a way that makes it more readable.
get_close_matches
Another simple yet powerful tool in difflib
is its get_close_matches
method. It's exactly what it sounds like: a tool that will take in
arguments and return the closest matches to the target string. In
pseudocode, the function works like this:
get_close_matches(target_word, list_of_possibilities, n=result_limit, cutoff)
As we can see above, get_close_matches
can take in 4 arguments but only requires the first 2 in order to return results.
The first parameter is the word that we are targeting; what we want
the method to return similarities to. The second parameter can be an
array of terms, or a variable that points to an array of strings. The
third parameter allows the user to define a limit to the number of
results that are returned. The last parameter determines how similar two
words need to be in order to be returned as a result.
With the first two parameters, alone, the method will return results
based on the default cutoff of 0.6 (in the range of 0 - 1) and a default
result limit of 3. Take a look at a couple of examples in order to see
how this function really works.
>>> import difflib
>>> from difflib import get_close_matches
>>> get_close_matches('bat', ['baton', 'chess', 'bat', 'bats', 'fireflies', 'batter'])
['bat', 'bats', 'baton']
Notice how the example above only returns three results even though there is a fourth term that is similar to 'bats': 'batter'.
This is because we did not specify a result limit as our third
parameter. Let's try that again, but this time we will define a
result_limit and a cutoff.
>>> get_close_matches('bat', ['baton', 'chess', 'batter', 'bats', 'fireflies', 'battering'], n=4, cutoff=0.6)
['bat', 'bats', 'baton', 'batter']
This time we get all four results that are at least 60% similar to
the word, 'bat'. The cutoff is equivalent to the original because we
just defined the same value as the default, 0.6. However, this can be
changed to make the results more or less strict. The closer to 1, the
more strict the constraints will be. In the example below, the
constraint has been changed to 0.9. This means that the results will
need to be at least 90% similar to the word 'bat'.
>>> get_close_matches('bat', ['baton', 'chess', 'batter', 'bats', 'fireflies', 'battering'], n=4, cutoff=0.9)
['bat']
unified_diff & context_diff
There are two classes in difflib
which operate in a very similar fashion; the unified_diff and the context_diff. The only major difference between the two is the result.
The unified_diff
takes in two strings of data and then
returns each word that was either added or removed from the first. The
best way to understand this concept is by seeing it in practice:
>>> import sys
>>> import difflib
>>> from difflib import unified_diff
>>> str1 = ['dog\n', 'cat\n', 'frog\n', 'bear\n', 'animals\n']
>>> str2 = ['puppy\n', 'kitten\n', 'tadpole\n', 'cub\n', 'animals\n']
>>> sys.stdout.writelines(unified_diff(str1, str2))
---
+++
@@ -1,5 +1,5 @@
-dog
-cat
-frog
-bear
+puppy
+kitten
+tadpole
+cub
animals
As evidenced by the results, the unified_diff
returns the removed words prefixed with -
and returns the added words prefixed with +
. The final word, 'animals' contains no prefix because it was present in both strings.
The context_diff
works in the same way as the unified_diff
.
However, instead of revealing what was added and removed from the
original string, it simply returns what lines have changed by returning
the changed lines with a prefix of '!'.
>>> from difflib import context_diff
>>> str1 = ['dog\n', 'cat\n', 'frog\n', 'bear\n', 'animals\n']
>>> str2 = ['puppy\n', 'kitten\n', 'tadpole\n', 'cub\n', 'animals\n']
>>> sys.stdout.writelines(context_diff(str1, str2))
***
---
***************
*** 1,5 ****
! dog
! cat
! frog
! bear
animals
--- 1,5 ----
! puppy
! kitten
! tadpole
! cub
animals
Within these examples, we can see that many of the functions and classes of the difflib
module resemble one another. Each have their own set of benefits and
it's important to analyze which will work best for your project.
Comparing sets of data becomes effortless when leveraging the difflib
module, but your results can be even better when your program returns
results in the most readable format possible for your data.
Source: iq.opengenus.org
Python has a built-in package called difflib
with the
function get_close_matches()
get_close_matches(word, possibilities, n, cutoff)
accepts
four parameters:
word
- the word to find close matches for in our listpossibilities
- the list in which to search for close
matches of word
n
(optional) - the maximum number of close matches to
return. Must be > 0
. Default is 3
.cutoff
(optional) - a float in the range [0, 1] that
a possibility
must score in order to be considered
similar to word
. 0
is very lenient,
1
is very strict. Default is 0.6
.
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
Parameters
This function accepts four parameters:
- word: This is the string for which we need the close matches.
- possibilities: This is usually a list of string values with which the word is matched.
- n: This is an optional parameter with a default value of 3. It specifies the maximum number of close matches required.
- cutoff: This is also an optional parameter with a
default value of 0.6. It specifies that the close matches should have a
score greater than the cutoff.
word = "learning"
possibilities = ["love", "learn", "lean", "moving", "hearing"]
n = 3
cutoff = 0.7
close_matches = difflib.get_close_matches(word,
possibilities, n, cutoff)
Or print differences to an html table:
import difflib
from IPython import display
a = open("original.txt", "r").readlines()
b = open("modified.txt", "r").readlines()
difference = difflib.HtmlDiff(tabsize=2)
with open("compare.html", "w") as fp:
html = difference.make_file(fromlines=a, tolines=b, fromdesc="Original", todesc="Modified")
fp.write(html)
display.HTML(open("compare.html", "r").read())