Getting started with spaCy

Containers:

Doc: A sequence of Token objects.
Token: An individual token, i.e. a word, punctuation symbol, whitespace, etc.
Span: A slice from a Doc object.
Lexeme: An entry in the vocabulary. A Lexeme has no string context; it is a word type, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse, or lemma.
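
To see how the four containers relate, here is a minimal sketch. It assumes spaCy is installed and uses a blank English pipeline (spacy.blank("en")) so that no trained model has to be downloaded. The sample sentence and variable names are only illustrative.

```python
import spacy
from spacy.lexeme import Lexeme
from spacy.tokens import Doc, Span, Token

# A blank pipeline contains only the tokenizer, which is enough to build containers.
nlp = spacy.blank("en")

doc = nlp("Apple is looking at buying a startup.")  # Doc: sequence of Token objects
token = doc[0]                                      # Token: indexing a Doc
span = doc[1:4]                                     # Span: slicing a Doc
lexeme = nlp.vocab[token.text]                      # Lexeme: context-free vocabulary entry

print(type(doc).__name__, len(doc))
print(type(token).__name__, token.text, token.i)
print(type(span).__name__, span.text, span.start, span.end)
print(type(lexeme).__name__, lexeme.text, lexeme.is_alpha)
```

Note that the Lexeme is looked up in nlp.vocab rather than in the Doc: it describes the word type and knows nothing about where the word appeared.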

Tokenization:

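The raw text is first split on whitespace. The tokenizer then processes the text from left to right. On each substring, it performs two checks:

1. Does the substring match a tokenizer exception rule? For example, "don't" contains no whitespace but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token.
2. Can a prefix, suffix or infix be split off? For example, punctuation such as commas, periods, hyphens or quotes.

If there’s a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings.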
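
The tokenizer loop can be inspected directly. The sketch below assumes a recent spaCy version in which Tokenizer.explain is available, and again uses a blank English pipeline. The sample text is illustrative, and the rule labels that get printed depend on the installed version and language data.

```python
import spacy

# The blank English pipeline ships with the default tokenizer rules and exceptions.
nlp = spacy.blank("en")

text = "Let's go to N.Y.!"
doc = nlp(text)
tokens = [t.text for t in doc]
print(tokens)

# explain() reports, for each resulting piece, which kind of rule produced it
# (for example a special-case exception, a prefix, a suffix or an infix).
for rule, piece in nlp.tokenizer.explain(text):
    print(f"{piece!r:10} {rule}")
```

Tokenization is non-destructive: joining token.text_with_ws over the Doc reproduces the original string.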

Named Entity Recognition:

Named entities are available as the ents property of a Doc. doc.ents is the standard way to access entity annotations, and it returns a sequence of Span objects.
We can also access entity annotations on individual tokens using the token.ent_iob and token.ent_type attributes (token.ent_iob_ and token.ent_type_ give the string versions). token.ent_iob indicates whether an entity starts, continues or ends on the token.
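
Both access paths are shown in the sketch below. Normally a trained pipeline predicts the entities; to stay self-contained, this example assumes a blank English pipeline and sets one entity by hand, so the GPE label on "San Francisco" is supplied by the example and is not a model prediction.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# Assumption for illustration: mark tokens 0-1 as a GPE entity manually.
doc.ents = [Span(doc, 0, 2, label="GPE")]

# Document-level access: each entity is a Span.
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

# Token-level access: integer IOB code, string IOB code, and entity type.
for token in doc:
    print(token.text, token.ent_iob, token.ent_iob_, token.ent_type_)
```

The first token of the entity carries the IOB code "B" and the second carries "I"; both report the entity type "GPE".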

IOB SCHEME