Friday, February 17, 2023

Python Difflib Module - An Algorithm for Fuzzy Matches

 Sequence Matcher

 The SequenceMatcher method will compare two given strings and return data that presents how similar the two strings are. Let's try this out together using the ratio() object. This will return the comparison data in decimal format.

>>> import difflib
>>> from difflib import SequenceMatcher
>>> str1 = 'I like pizza'
>>> str2 = 'I like tacos'
>>> seq = SequenceMatcher(a=str1, b=str2)
>>> print(seq.ratio())
0.66666666

We create a new variable that encapsulates the SequenceMatcher class with two parameters, a and b. Although, the method actually accepts three parameters: None, a, and b. In order for the the method to acknowledge our two strings, we need to assign each of the string values to the method's variables, SequenceMatcher(a=str1, b=str2).

Once all of the necessary variables have been defined and the SequenceMatcher has been given at least two parameters, we can now print the value using the ratio() object that we'd mentioned earlier. This determines the ratio of characters that are similar in the two strings and the result is then returned as a decimal. The ratio() object is one of a few that belong to the Sequence Matcher class.

Differ

The Differ class is the opposite of SequenceMatcher; it takes in lines of text and finds the differences between the strings. However, the Differ class is unique in its usage of deltas, making it even more readable and easier for humans to spot the differences.

For instance, when adding new characters to the second string in a comparison between two strings, a '+ ' will appear before the line that has received the additional characters.

As you have probably guessed, deleting some of the characters that were visible in the first string will cause '- ' to pop up before the second line of text.

If a line is the same in both sequences, ' ' will be returned and if there is a line missing, then you will see '? '. Additionally, you can also utilize attributes like ratio(), which we saw in the last example. Let's see the Differ class in action.

>>> import difflib
>>> from difflib import Differ
>>> str1 = "I would like to order a pepperoni pizza"
>>> str2 = "I would like to order a veggie burger"
>>> str1_lines = str1.splitlines()
>>> str2_lines = str2.splitlines()
>>> d = difflib.Differ()
>>> diff = d.compare(str1_lines, str2_lines)
>>> print('\n'.join(diff))
# output
I would like to order a 
'- ' pepperoni pizza
'+ ' veggie burger

In the example above, we begin by importing the module and Differ class. Once we have defined our two strings that we want to compare, we must invoke the splitlines() function on the two strings.

>>> str1_lines = str1.splitlines()
>>> str2_lines = str2.splitlines()

This will allow us to compare the strings by each line rather than by each individual character.

Once we have defined a variable that contains the Differ class, we create another that contains Differ with the compare() object, taking in the two strings as parameters.

>>> diff = d.compare(str1_lines, str2_lines)

We call the print function and join the diff variable with a line enter so that our result is formatted in a way that makes it more readable.

get_close_matches

Another simple yet powerful tool in difflib is its get_close_matches method. It's exactly what it sounds like: a tool that will take in arguments and return the closest matches to the target string. In pseudocode, the function works like this:

get_close_matches(target_word, list_of_possibilities, n=result_limit, cutoff)

As we can see above, get_close_matches can take in 4 arguments but only requires the first 2 in order to return results.

The first parameter is the word that we are targeting; what we want the method to return similarities to. The second parameter can be an array of terms, or a variable that points to an array of strings. The third parameter allows the user to define a limit to the number of results that are returned. The last parameter determines how similar two words need to be in order to be returned as a result.

With the first two parameters, alone, the method will return results based on the default cutoff of 0.6 (in the range of 0 - 1) and a default result limit of 3. Take a look at a couple of examples in order to see how this function really works.

>>> import difflib
>>> from difflib import get_close_matches
>>> get_close_matches('bat', ['baton', 'chess', 'bat', 'bats', 'fireflies', 'batter'])

['bat', 'bats', 'baton']

Notice how the example above only returns three results even though there is a fourth term that is similar to 'bats': 'batter'. This is because we did not specify a result limit as our third parameter. Let's try that again, but this time we will define a result_limit and a cutoff.

>>> get_close_matches('bat', ['baton', 'chess', 'batter', 'bats', 'fireflies', 'battering'], n=4, cutoff=0.6)

['bat', 'bats', 'baton', 'batter']

This time we get all four results that are at least 60% similar to the word, 'bat'. The cutoff is equivalent to the original because we just defined the same value as the default, 0.6. However, this can be changed to make the results more or less strict. The closer to 1, the more strict the constraints will be. In the example below, the constraint has been changed to 0.9. This means that the results will need to be at least 90% similar to the word 'bat'.

>>> get_close_matches('bat', ['baton', 'chess', 'batter', 'bats', 'fireflies', 'battering'], n=4, cutoff=0.9)

['bat']

unified_diff & context_diff

There are two classes in difflib which operate in a very similar fashion; the unified_diff and the context_diff. The only major difference between the two is the result.

The unified_diff takes in two strings of data and then returns each word that was either added or removed from the first. The best way to understand this concept is by seeing it in practice:

>>> import sys
>>> import difflib
>>> from difflib import unified_diff
>>> str1 = ['dog\n', 'cat\n', 'frog\n', 'bear\n', 'animals\n']
>>> str2 = ['puppy\n', 'kitten\n', 'tadpole\n', 'cub\n', 'animals\n']
>>> sys.stdout.writelines(unified_diff(str1, str2))
---
+++
@@ -1,5 +1,5 @@
-dog
-cat
-frog
-bear
+puppy
+kitten
+tadpole
+cub
 animals

As evidenced by the results, the unified_diff returns the removed words prefixed with - and returns the added words prefixed with +. The final word, 'animals' contains no prefix because it was present in both strings.

The context_diff works in the same way as the unified_diff. However, instead of revealing what was added and removed from the original string, it simply returns what lines have changed by returning the changed lines with a prefix of '!'.

>>> from difflib import context_diff
>>> str1 = ['dog\n', 'cat\n', 'frog\n', 'bear\n', 'animals\n']
>>> str2 = ['puppy\n', 'kitten\n', 'tadpole\n', 'cub\n', 'animals\n']
>>> sys.stdout.writelines(context_diff(str1, str2))
***
---
***************
*** 1,5 ****
! dog
! cat
! frog
! bear
  animals
--- 1,5 ----
! puppy
! kitten
! tadpole
! cub
  animals

Within these examples, we can see that many of the functions and classes of the difflib module resemble one another. Each have their own set of benefits and it's important to analyze which will work best for your project. Comparing sets of data becomes effortless when leveraging the difflib module, but your results can be even better when your program returns results in the most readable format possible for your data.

Source: iq.opengenus.org 

Python has a built-in package called difflib with the function get_close_matches()

get_close_matches(word, possibilities, n, cutoff) accepts four parameters:

  • word - the word to find close matches for in our list
  • possibilities - the list in which to search for close matches of word
  • n (optional) - the maximum number of close matches to return. Must be > 0. Default is 3.
  • cutoff (optional) - a float in the range [0, 1] that a possibility must score in order to be considered similar to word. 0 is very lenient, 1 is very strict. Default is 0.6.

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']

Parameters

This function accepts four parameters:

  • word: This is the string for which we need the close matches.
  • possibilities: This is usually a list of string values with which the word is matched.
  • n: This is an optional parameter with a default value of 3. It specifies the maximum number of close matches required.
  • cutoff: This is also an optional parameter with a default value of 0.6. It specifies that the close matches should have a score greater than the cutoff.

word = "learning"
possibilities = ["love", "learn", "lean", "moving", "hearing"]
n = 3
cutoff = 0.7
close_matches = difflib.get_close_matches(word,
                possibilities, n, cutoff)

Or print differences to an html table:

import difflib
from IPython import display

a = open("original.txt", "r").readlines()
b = open("modified.txt", "r").readlines()

difference = difflib.HtmlDiff(tabsize=2)

with open("compare.html", "w") as fp:
    html = difference.make_file(fromlines=a, tolines=b, fromdesc="Original", todesc="Modified")
    fp.write(html)


display.HTML(open("compare.html", "r").read())