12.6 Similarity Detection with spaCy

Loading the Language Model

  • First, at the command line, execute the following command to download spaCy's medium-sized English model (~91 MB) for better accuracy:

    ipython -m spacy download en_core_web_md

  • For spaCy's best accuracy, you can instead download the large model (~788 MB):

    ipython -m spacy download en_core_web_lg

  • Then load the downloaded model in your Python session:
import spacy
nlp = spacy.load('en_core_web_md')
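
  • spacy.load raises an OSError if the named model has not been downloaded. Because spaCy models are installed as ordinary Python packages, one way to check beforehand is a minimal standard-library sketch like the following. The helper name model_installed is hypothetical and not part of spaCy:

```python
import importlib.util

def model_installed(name):
    """Return True if a top-level package called `name` can be found.

    Hypothetical helper: it only checks that the package is importable,
    not that it is a valid or compatible spaCy model.
    """
    return importlib.util.find_spec(name) is not None

# Report which of the two models discussed above are available
for model in ('en_core_web_md', 'en_core_web_lg'):
    status = 'installed' if model_installed(model) else 'not installed'
    print(f'{model}: {status}')
```

  • If a model is reported as not installed, run the corresponding download command before calling spacy.load.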

Creating the spaCy Docs

  • Create two Doc objects—one for Romeo and Juliet and one for Edward the Second:
from pathlib import Path
document1 = nlp(Path('RomeoAndJuliet.txt').read_text())
document2 = nlp(Path('EdwardTheSecond.txt').read_text())

Comparing the Books’ Similarity

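  • The Doc class's similarity method, called as document1.similarity(document2), returns a value from 0.0 (not similar) to 1.0 (identical) indicating how similar the documents are
  • For these two plays the resulting score is high: spaCy considers them significantly similar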
  • Try copying a current news article into a text file, then performing the similarity comparison yourself
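
  • To give a feel for what the score measures: by default, spaCy's Doc.similarity computes the cosine similarity of the two documents' vectors, and each Doc's vector is the average of its token vectors. The sketch below reproduces that arithmetic in plain Python. The function names and the tiny three-dimensional vectors are made up for illustration; real model vectors have many more dimensions:

```python
import math

def average_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero vector as "not similar"
    return dot / (norm_a * norm_b)

# Made-up "token vectors" for two tiny documents (illustrative only)
doc_a_tokens = [[1.0, 0.0, 1.0], [1.0, 2.0, 1.0]]
doc_b_tokens = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

score = cosine_similarity(average_vector(doc_a_tokens),
                          average_vector(doc_b_tokens))
print(f'{score:.3f}')
```

  • Vectors that point in the same direction score 1.0 regardless of their lengths, which is why two long documents with similar overall vocabulary can score close to 1.0.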

