Getting started with spaCy

Containers:

  • Doc: A sequence of Token objects.
  • Token: An individual token, i.e. a word, punctuation symbol, whitespace, etc.
  • Span: A slice from a Doc object.
  • Lexeme: An entry in the vocabulary. A Lexeme has no string context; it’s a word type, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse, or lemma.
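
A minimal sketch of how the four containers relate. It assumes spaCy is installed and uses a blank English pipeline (tokenizer only), so no trained model is needed. The example sentence and variable names are illustrative.

```python
import spacy

# Blank English pipeline: tokenizer only, so no trained model is needed.
nlp = spacy.blank("en")

doc = nlp("Apple is looking at buying a startup.")  # Doc: sequence of Tokens
token = doc[0]                # Token: the word "Apple" in this sentence
span = doc[4:7]               # Span: a slice of the Doc
lexeme = nlp.vocab["Apple"]   # Lexeme: context-free vocabulary entry

print([t.text for t in doc])
print(span.text)
# The Token in context and the Lexeme share the same underlying word type.
print(token.lex.orth == lexeme.orth)
```

Slicing a Doc gives a Span rather than a list, so the slice keeps its link to the parent document.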


Tokenization:

The raw text is first split on whitespace. The tokenizer then processes the text from left to right. On each substring, it performs two checks:

  • Does the substring match a tokenizer exception rule?
  • Can a prefix, suffix or infix be split off?

If there’s a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings.
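
The loop above can be observed with the tokenizer's explain method, which reports the rule that produced each token. This is a small sketch using a blank English pipeline. The input string is an illustrative choice, and the rule names printed may differ between spaCy versions.

```python
import spacy

nlp = spacy.blank("en")
text = "Let's go!"

# Each pair is (rule_name, substring): exception rules and affix splits
# are labelled separately from plain tokens.
explained = nlp.tokenizer.explain(text)
for rule, substring in explained:
    print(f"{substring!r:8} {rule}")

tokens = [t.text for t in nlp(text)]
print(tokens)
```

Here "Let's" matches a tokenizer exception and is split into two tokens, while the trailing "!" is split off as a suffix.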


Named Entity Recognition:

  • Named entities are available as the ents property of a Doc.
  • The standard way to access entity annotations is doc.ents, which returns a tuple of Span objects.
  • We can also access token-level entity annotations using the token.ent_iob and token.ent_type attributes. token.ent_iob indicates whether an entity starts, continues or ends on the token. Both attributes hold integer IDs; the underscore variants token.ent_iob_ and token.ent_type_ give the string values.
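
A short sketch of reading entities from doc.ents. A blank pipeline has no trained entity recognizer, so the entity is assigned by hand here. The sentence and the GPE label are assumptions for illustration. With a trained pipeline, doc.ents would be filled in automatically.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # no NER component, so entities are set manually
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# Illustrative annotation: tokens 0-1 form one entity labelled GPE.
doc.ents = [Span(doc, 0, 2, label="GPE")]

# Each entity is a Span with text, a label and token offsets.
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start, ent.end)
```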


IOB Scheme:

  • I – Token is inside an entity.
  • O – Token is outside an entity.
  • B – Token is the beginning of an entity.
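
The scheme can be seen per token through token.ent_iob_ and token.ent_type_. This sketch assumes spaCy v3, where Doc.set_ents marks every token outside the given spans as "O" by default. The entity itself is again a hand-made, illustrative annotation.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("San Francisco considers banning robots")

# spaCy v3: tokens not covered by a span are tagged as outside ("O").
doc.set_ents([Span(doc, 0, 2, label="GPE")])

# One (text, IOB tag, entity type) triple per token.
iob_tags = [(t.text, t.ent_iob_, t.ent_type_) for t in doc]
for row in iob_tags:
    print(row)
```

The first token of the entity is tagged B, the following one I, and the remaining tokens O with an empty entity type.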