Tokenization
Tokenization is the first step in any NLP pipeline. It converts raw text into tokens (words, subwords, or characters) that models can process.
Definition
Tokenization is the process of breaking down text into smaller units called tokens. These can be words, subwords, or characters depending on the tokenization strategy.
Key Concepts
Word Tokenization
Splitting text on whitespace and punctuation. Simple, but it produces large vocabularies and struggles with rare or unknown words.
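A minimal sketch of this approach using a regular expression (not a production tokenizer) also exposes one of its weaknesses:

import re

def word_tokenize(text):
    # Keep runs of word characters; whitespace and punctuation act as separators
    return re.findall(r"\w+", text.lower())

print(word_tokenize("Don't split me, please!"))
# ['don', 't', 'split', 'me', 'please']  <- the contraction is broken apart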
Subword Tokenization
Breaking words into smaller meaningful pieces. BPE, WordPiece, and SentencePiece are popular methods.
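To make the idea concrete, here is a toy sketch of the core step in BPE: repeatedly merge the most frequent adjacent pair of symbols. It is a simplification (real BPE training weights words by frequency and records the merge order for later tokenization), but it shows how subword pieces such as "low" and "est" emerge from data.

from collections import Counter

def most_frequent_pair(words):
    # words: list of symbol lists, e.g. [['l', 'o', 'w'], ['l', 'o', 'w', 'e', 'r']]
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    # Replace every occurrence of the chosen pair with a single merged symbol
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list(word) for word in ["low", "lower", "lowest", "newest", "widest"]]
for step in range(4):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(f"merge {pair} -> {corpus}")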
Vocabulary
The set of unique tokens the model knows. Out-of-vocabulary (OOV) words are mapped to a special [UNK] token.
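A tiny illustration of the lookup, with an arbitrary toy vocabulary and ids:

vocab = {"[PAD]": 0, "[UNK]": 1, "machine": 2, "learning": 3}
UNK_ID = vocab["[UNK]"]

tokens = ["machine", "learning", "rocks"]        # "rocks" is out of vocabulary
ids = [vocab.get(tok, UNK_ID) for tok in tokens]
print(ids)  # [2, 3, 1] -> the unknown word collapses to [UNK]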
Special Tokens
[CLS], [SEP], [PAD], and [MASK] are tokens with special meaning for models like BERT.
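As a rough sketch of how these tokens are typically used for a BERT-style model (the helper name and the fixed length below are made up for illustration; real tokenizers handle this internally):

def format_for_bert_style_model(tokens, max_len=8):
    # Wrap the sequence with [CLS]/[SEP], then pad with [PAD] up to max_len
    seq = ["[CLS]"] + tokens + ["[SEP]"]
    seq = seq[:max_len]
    seq += ["[PAD]"] * (max_len - len(seq))
    # Attention mask: 1 for real tokens, 0 for padding
    mask = [0 if t == "[PAD]" else 1 for t in seq]
    return seq, mask

seq, mask = format_for_bert_style_model(["tokenization", "is", "fun"])
print(seq)   # ['[CLS]', 'tokenization', 'is', 'fun', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]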
Real-World Applications
Google Search
Query tokenization is crucial for understanding user intent and matching relevant results.
ChatGPT
Uses BPE tokenization to handle any text input, including code, multiple languages, and novel words.
Code Example
import re
from collections import Counter


class SimpleTokenizer:
    """Basic word-level tokenizer with a fixed-size vocabulary."""

    def __init__(self, vocab_size=10000):
        self.vocab_size = vocab_size
        self.word_to_id = {'[PAD]': 0, '[UNK]': 1}
        self.id_to_word = {0: '[PAD]', 1: '[UNK]'}

    def tokenize(self, text):
        """Split text into tokens."""
        # Lowercase, then keep runs of alphanumeric characters
        text = text.lower()
        tokens = re.findall(r'\b\w+\b', text)
        return tokens

    def build_vocab(self, texts):
        """Build the vocabulary from training texts."""
        all_tokens = []
        for text in texts:
            all_tokens.extend(self.tokenize(text))

        # Count tokens and keep the most common,
        # reserving two slots for [PAD] and [UNK]
        counter = Counter(all_tokens)
        most_common = counter.most_common(self.vocab_size - 2)
        for word, count in most_common:
            idx = len(self.word_to_id)
            self.word_to_id[word] = idx
            self.id_to_word[idx] = word

        print(f"Vocabulary size: {len(self.word_to_id)}")

    def encode(self, text):
        """Convert text to token IDs."""
        tokens = self.tokenize(text)
        return [self.word_to_id.get(token, 1) for token in tokens]  # 1 = [UNK]

    def decode(self, ids):
        """Convert token IDs back to text."""
        tokens = [self.id_to_word.get(token_id, '[UNK]') for token_id in ids]
        return ' '.join(tokens)


# Example usage
texts = [
    "Machine learning is transforming the world.",
    "Deep learning models learn from data.",
    "Natural language processing understands text.",
    "AI is machine intelligence demonstrated by machines."
]

tokenizer = SimpleTokenizer(vocab_size=100)
tokenizer.build_vocab(texts)

# Encode and decode
test_text = "Machine learning understands language"
encoded = tokenizer.encode(test_text)
decoded = tokenizer.decode(encoded)

print(f"\nOriginal: {test_text}")
print(f"Tokens: {tokenizer.tokenize(test_text)}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
print(f"\nVocabulary sample: {list(tokenizer.word_to_id.items())[:15]}")

This simple tokenizer demonstrates core concepts: building vocabulary from data, handling unknown words, and converting between text and IDs. Production tokenizers like BPE are more sophisticated.
Practice Problems
1. Implement character-level tokenization
2. Add padding and truncation for fixed-length sequences
3. Build a simple BPE tokenizer from scratch
Summary
Tokenization is foundational to NLP. The choice of tokenization strategy affects model performance; subword tokenization such as BPE is now the standard approach for handling diverse text effectively.