Getting started with spaCy
Containers:
- Doc: A sequence of Token objects.
- Token: An individual token — i.e. a word, punctuation symbol, whitespace, etc.
- Span: A slice from a Doc object.
- Lexeme: An entry in the vocabulary. A Lexeme has no string context; it is a word type, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse, or lemma.
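The four containers can be inspected with a short sketch. It assumes spaCy is installed and uses a blank English pipeline (tokenizer only), so no trained model has to be downloaded. The sample sentence is made up for illustration.

```python
import spacy
from spacy.lexeme import Lexeme
from spacy.tokens import Doc, Span, Token

# Assumption: a blank English pipeline (tokenizer only) is enough to
# create the container objects; no trained model is loaded.
nlp = spacy.blank("en")

doc = nlp("Give it back! He pleaded.")  # Doc: a sequence of Token objects
token = doc[0]                          # Token: one item of the Doc
span = doc[0:3]                         # Span: a slice of the Doc
lexeme = nlp.vocab["back"]              # Lexeme: context-free vocabulary entry

print(type(doc).__name__, len(doc))
print(type(token).__name__, token.text)
print(type(span).__name__, span.text)
print(type(lexeme).__name__, lexeme.text, lexeme.is_alpha)
```

The Lexeme carries only context-independent attributes such as its text and is_alpha, while each Token also records its position in a particular Doc.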
Tokenization:
The raw text is first split on whitespace. The tokenizer then processes the resulting substrings from left to right. On each substring, it performs two checks:
- Does the substring match a tokenizer exception rule?
- Can a prefix, suffix or infix be split off?
If there’s a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings.
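The loop described above can be observed on a small input. This is a minimal sketch that assumes spaCy is installed and uses the default English tokenizer rules from a blank pipeline. The sample text and the "gimme" special case are illustrative choices, not part of these notes.

```python
import spacy
from spacy.symbols import ORTH

# Assumption: default English tokenizer rules from a blank pipeline.
nlp = spacy.blank("en")

text = "Let's go to N.Y.!"
tokens = [t.text for t in nlp(text)]
print(tokens)

# tokenizer.explain reports which exception rule or affix pattern
# produced each token, as (rule_name, token_text) pairs.
explained = nlp.tokenizer.explain(text)
for rule, token_text in explained:
    print(f"{token_text}\t{rule}")

# A custom tokenizer exception: always split "gimme" into "gim" + "me".
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
custom_tokens = [t.text for t in nlp("gimme that")]
print(custom_tokens)
```

Here "Let's" is split by an exception rule, the trailing "!" is split off as a suffix, and "N.Y." is kept as a single token.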
Named Entity Recognition:
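- Named entities are available through the ents property of a Doc. doc.ents is the standard way to access entity annotations and returns a sequence of Span objects.
- Token-level entity annotations are available through the token.ent_iob and token.ent_type attributes. token.ent_iob indicates whether the token begins an entity, is inside one, or is outside any entity; token.ent_type gives the entity label. Both return integer IDs; the underscore variants token.ent_iob_ and token.ent_type_ return the corresponding strings.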
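A small sketch of reading doc.ents follows. To stay runnable without downloading a trained model, it assumes spaCy v3 and sets one entity by hand with Doc.set_ents. In normal use the entities would come from a pipeline's entity recognizer. The sentence and the GPE label are illustrative.

```python
import spacy
from spacy.tokens import Span

# Assumption: blank pipeline, so no entity recognizer runs; the entity
# below is annotated by hand purely for illustration.
nlp = spacy.blank("en")
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# Mark tokens 0..1 ("San Francisco") as a GPE entity (spaCy v3 API).
doc.set_ents([Span(doc, 0, 2, label="GPE")])

# doc.ents yields Span objects; each exposes text, character offsets and label.
entities = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
print(entities)
```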
IOB scheme:
- I – Token is inside an entity.
- O – Token is outside an entity.
- B – Token is the beginning of an entity.
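The tags can be read per token, as in the sketch below. It assumes spaCy v3 and a hand-set entity (via Doc.set_ents) instead of a trained model. The sentence and label are illustrative.

```python
import spacy
from spacy.tokens import Span

# Assumption: blank pipeline with one hand-annotated entity.
nlp = spacy.blank("en")
doc = nlp("San Francisco considers banning sidewalk delivery robots")
doc.set_ents([Span(doc, 0, 2, label="GPE")])

# ent_iob_ is the IOB tag as a string; ent_type_ is the entity label
# (an empty string when the token is not part of an entity).
tags = [(t.text, t.ent_iob_, t.ent_type_) for t in doc]
for row in tags:
    print(row)

# ent_iob is the same tag as an integer code.
codes = [t.ent_iob for t in doc[:3]]
print(codes)
```

The first token of the entity is tagged B, the second I, and every other token O. The integer codes are 3 for B, 1 for I and 2 for O, with 0 meaning no tag is set.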