Getting More Granular Diffs From Difflib (or A Way To Post-process A Diff To Achieve The Same Thing)

I downloaded this page and made a minor edit to it, changing the first 65 in this paragraph to 68. I then parse both sources with BeautifulSoup and diff them with difflib. url =

Solution 1:

You can use nltk.sent_tokenize() to split the strings of each soup into sentences, so that the diff is computed sentence by sentence:

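import difflib

from nltk import sent_tokenize  # requires NLTK's Punkt tokenizer data to be installed

# soup and soup2 are the BeautifulSoup objects for the original and edited pages (from the question)
sentences = [sentence for string in soup.stripped_strings for sentence in sent_tokenize(string)]
sentences2 = [sentence for string in soup2.stripped_strings for sentence in sent_tokenize(string)]

d = difflib.Differ()  # assumed to match the Differ instance used in the question
diff = d.compare(sentences, sentences2)
# Keep only removed ('-') and added ('+') sentences; drop unchanged and '?' hint lines
changes = [change for change in diff if change.startswith(('-', '+'))]
for change in changes:
    print(change)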

This prints only the sentence in which the change was detected:

- It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).
+ It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).
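If sentence-level output is still too coarse, the changed pair can be post-processed down to individual words. The following is a minimal sketch, not part of the original answer: it assumes whitespace splitting is an acceptable tokenizer, and word_changes is a hypothetical helper name. It uses difflib.SequenceMatcher on the two word lists and keeps only the opcodes that are not "equal". The sample strings are a shortened fragment of the two sentences above.

```python
import difflib


def word_changes(old_sentence, new_sentence):
    """Return (tag, old_words, new_words) for every word-level difference.

    Hypothetical helper for illustration; tokens are split on whitespace.
    """
    old_words = old_sentence.split()
    new_words = new_sentence.split()
    # autojunk=False disables the "popular element" heuristic for long inputs
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    return [
        (tag, old_words[i1:i2], new_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only 'replace', 'delete' and 'insert'
    ]


# Shortened fragment of the two sentences printed above
old = "ends for disability beneficiaries from age 65 to full retirement age (FRA)."
new = "ends for disability beneficiaries from age 68 to full retirement age (FRA)."

for tag, removed, added in word_changes(old, new):
    print(tag, removed, added)
# prints: replace ['65'] ['68']
```

In practice the '-' and '+' sentences from the sentence-level diff would have to be paired up before being passed to such a helper; how to pair them when several sentences change at once is left open here.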
