NOTE: Before running this notebook, be sure to place your copies of the plays in the same folder as the notebook.

12.6 Similarity Detection with spaCy

Analyzing documents to determine how alike they are
Who wrote the works of William Shakespeare? Sir Francis Bacon? Christopher Marlowe? Others?
- Comparing word frequencies can reveal writing-style similarities
We’ll compare Doc objects for Shakespeare’s Romeo and Juliet and Christopher Marlowe's Edward the Second
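To make the word-frequency idea concrete, here is a minimal sketch that scores two texts by how much their relative word frequencies overlap. It is only an illustration, not the method spaCy uses. The helper names `relative_frequencies` and `frequency_overlap` and the simple regular-expression tokenizer are assumptions made for this example.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Map each lowercase word to its share of all words in the text."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def frequency_overlap(text1, text2):
    """Sum the shared portion of each word's relative frequency.

    Returns 0.0 when no words are shared and about 1.0 when the
    two texts have the same word distribution.
    """
    freq1 = relative_frequencies(text1)
    freq2 = relative_frequencies(text2)
    shared_words = freq1.keys() & freq2.keys()
    return sum(min(freq1[word], freq2[word]) for word in shared_words)


# Illustrative strings only; real use would pass the full text of each play
overlap = frequency_overlap('to be or not to be', 'to be or to do')
```

A measure like this looks only at word choice, so it ignores word order and meaning.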

Loading the Language Model

First, at the command line, execute the following command to download spaCy's medium-sized model (~91 MB) for better accuracy:

python -m spacy download en_core_web_md
For spaCy's best accuracy, you can instead download the large model (~788 MB):
python -m spacy download en_core_web_lg

import spacy

nlp = spacy.load('en_core_web_md')
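If you are not sure which model is installed, one option is to try the models in order of preference. The sketch below relies on `spacy.load` raising `OSError` when a model cannot be found. The helper `load_first_available` and its `loader` parameter are hypothetical names introduced for this example, not part of spaCy.

```python
def load_first_available(model_names, loader=None):
    """Return (name, pipeline) for the first model that loads."""
    if loader is None:
        import spacy  # imported lazily; requires spaCy to be installed
        loader = spacy.load

    for name in model_names:
        try:
            return name, loader(name)
        except OSError:
            continue  # this model is not installed, so try the next one

    raise OSError(f'None of these models are installed: {model_names}')


# Usage (assumes spaCy and at least one of the models are installed):
# model_name, nlp = load_first_available(['en_core_web_lg', 'en_core_web_md'])
```

Listing the large model first prefers accuracy when it is available and falls back to the medium model otherwise.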

Creating the spaCy Docs

Create two Doc objects—one for Romeo and Juliet and one for Edward the Second:

from pathlib import Path

document1 = nlp(Path('RomeoAndJuliet.txt').read_text())

document2 = nlp(Path('EdwardTheSecond.txt').read_text())
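Because the plays must sit in the same folder as the notebook, it can help to fail with a clear message when a file is missing and to read the text with an explicit encoding. The following is a minimal sketch. The helper name `read_play` is hypothetical, and UTF-8 is an assumption about how your copies of the plays were saved.

```python
from pathlib import Path


def read_play(filename, folder='.'):
    """Return the text of a play stored in the given folder."""
    path = Path(folder) / filename
    if not path.is_file():
        raise FileNotFoundError(
            f'{filename} not found in {Path(folder).resolve()}')
    # Assumes the file is UTF-8 encoded; adjust if your copy differs
    return path.read_text(encoding='utf-8')


# Usage (assumes nlp was loaded as shown earlier):
# document1 = nlp(read_play('RomeoAndJuliet.txt'))
# document2 = nlp(read_play('EdwardTheSecond.txt'))
```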

Comparing the Books’ Similarity

The Doc class’s similarity method returns a value from 0.0 (not similar) to 1.0 (identical) indicating how similar the documents are

document1.similarity(document2)

0.9814659724155179

A score this close to 1.0 means spaCy considers the two documents highly similar
Try copying a current news article into a text file, then performing the similarity comparison yourself
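By default, spaCy's `Doc.similarity` estimate is the cosine similarity of the documents' averaged word vectors. The sketch below reproduces that calculation in plain Python on tiny made-up vectors, so every number in it is an assumption for illustration rather than real model output. Cosine similarity in general can range from -1.0 to 1.0.

```python
import math


def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "word vectors" for two tiny documents
doc_a_vectors = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
doc_b_vectors = [[1.0, 1.0, 2.0]]

score = cosine_similarity(average_vector(doc_a_vectors),
                          average_vector(doc_b_vectors))
```

Averaging smooths out the differences between individual words, which may help explain why two long English texts can score close to 1.0.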

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 12 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.

DISCLAIMER: The authors and publisher of this book have used their best efforts in preparing the book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The authors and publisher make no warranty of any kind, expressed or implied, with regard to these programs or to the documentation contained in these books. The authors and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.
