
Tokenization

Tokenization is the first step in any NLP pipeline. It converts raw text into tokens (words, subwords, or characters) that models can process.

Definition

Tokenization is the process of breaking down text into smaller units called tokens. These can be words, subwords, or characters depending on the tokenization strategy.
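
For a quick intuition, here is the same phrase split at three granularities; the subword split shown is hand-written for illustration, not produced by a trained tokenizer:

text = "unbelievable results"

word_tokens = text.split()                            # word level
subword_tokens = ["un", "believ", "able", "results"]  # illustrative subword split
char_tokens = list(text.replace(" ", ""))             # character level

print(word_tokens)     # ['unbelievable', 'results']
print(subword_tokens)  # ['un', 'believ', 'able', 'results']
print(char_tokens)     # ['u', 'n', 'b', 'e', 'l', ...]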

Key Concepts

Word Tokenization

Splitting text on whitespace and punctuation. Simple, but it struggles with words that never appeared in the training data.
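
A minimal sketch of the difference between naive whitespace splitting and regex-based word tokenization (the full example below uses the same regex):

import re

sentence = "Tokenizers aren't magic; they're just string processing!"

# Naive whitespace split keeps punctuation attached to words
print(sentence.split())
# ['Tokenizers', "aren't", 'magic;', "they're", 'just', 'string', 'processing!']

# Regex word tokenization strips punctuation but breaks contractions apart
print(re.findall(r"\b\w+\b", sentence))
# ['Tokenizers', 'aren', 't', 'magic', 'they', 're', 'just', 'string', 'processing']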

Subword Tokenization

Breaking words into smaller meaningful pieces. BPE, WordPiece, and SentencePiece are popular methods.
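
A rough sketch of how BPE applies its merge rules; the merge table here is hand-picked for illustration, whereas real BPE learns its merges from pair frequencies in a training corpus:

# Hypothetical merge table, in priority order, for illustration only
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

def bpe_tokenize(word, merges):
    """Start from characters and apply each merge rule in order."""
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]  # merge the adjacent pair
            else:
                i += 1
    return symbols

print(bpe_tokenize("lower", merges))   # ['lower']           known word, one token
print(bpe_tokenize("lowest", merges))  # ['low', 'e', 's', 't']  known stem + characters
print(bpe_tokenize("flower", merges))  # ['f', 'lower']       novel word, no [UNK] needed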

Vocabulary

The set of unique tokens the model knows. Out-of-vocabulary (OOV) words are mapped to a special [UNK] token.
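
In code, this is just a dictionary lookup with a fallback to the [UNK] id, as the encode method in the full example below also does:

# Toy vocabulary; ids 0 and 1 are reserved for special tokens
vocab = {"[PAD]": 0, "[UNK]": 1, "machine": 2, "learning": 3}

tokens = ["machine", "learning", "rocks"]          # "rocks" is out-of-vocabulary
ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
print(ids)  # [2, 3, 1] -> the unknown word maps to [UNK]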

Special Tokens

Tokens such as [CLS], [SEP], [PAD], and [MASK] carry special meaning for models like BERT.
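
As a simplified illustration of how a BERT-style input is framed and padded (exact conventions vary by model and library):

# Illustrative only: framing a sentence pair the way BERT-style models expect
question = ["what", "is", "tokenization"]
answer = ["splitting", "text", "into", "tokens"]

# [CLS] marks the start; [SEP] separates and terminates the two segments
sequence = ["[CLS]"] + question + ["[SEP]"] + answer + ["[SEP]"]

# [PAD] fills the sequence to a fixed length so examples can be batched
max_len = 16
padded = sequence + ["[PAD]"] * (max_len - len(sequence))
print(padded)

# [MASK] hides a token for masked language modeling
masked = ["[MASK]" if tok == "text" else tok for tok in sequence]
print(masked)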

Real-World Applications

Google Search

Query tokenization is crucial for understanding user intent and matching relevant results.

ChatGPT

Uses BPE tokenization to handle any text input, including code, multiple languages, and novel words.
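
For example, OpenAI's open-source tiktoken library exposes these BPE encodings; assuming the package is installed, a quick round-trip looks like this:

import tiktoken

# cl100k_base is one of the BPE encodings used by recent OpenAI models
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("Tokenizers handle code, emoji 🙂, and novel words like floofiness.")
print(ids)              # a list of integer token ids
print(enc.decode(ids))  # round-trips back to the original string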

Code Example

python
import re
from collections import Counter

class SimpleTokenizer:
    """
    Basic word tokenizer with vocabulary
    """
    def __init__(self, vocab_size=10000):
        self.vocab_size = vocab_size
        self.word_to_id = {'[PAD]': 0, '[UNK]': 1}
        self.id_to_word = {0: '[PAD]', 1: '[UNK]'}

    def tokenize(self, text):
        """Split text into tokens"""
        # Lowercase and split on non-alphanumeric
        text = text.lower()
        tokens = re.findall(r'\b\w+\b', text)
        return tokens

    def build_vocab(self, texts):
        """Build vocabulary from training texts"""
        all_tokens = []
        for text in texts:
            all_tokens.extend(self.tokenize(text))

        # Count and keep most common
        counter = Counter(all_tokens)
        most_common = counter.most_common(self.vocab_size - 2)

        for word, count in most_common:
            idx = len(self.word_to_id)
            self.word_to_id[word] = idx
            self.id_to_word[idx] = word

        print(f"Vocabulary size: {len(self.word_to_id)}")

    def encode(self, text):
        """Convert text to token IDs"""
        tokens = self.tokenize(text)
        return [self.word_to_id.get(token, 1) for token in tokens]  # 1 = [UNK]

    def decode(self, ids):
        """Convert token IDs back to text"""
        tokens = [self.id_to_word.get(token_id, '[UNK]') for token_id in ids]
        return ' '.join(tokens)

# Example usage
texts = [
    "Machine learning is transforming the world.",
    "Deep learning models learn from data.",
    "Natural language processing understands text.",
    "AI is machine intelligence demonstrated by machines."
]

tokenizer = SimpleTokenizer(vocab_size=100)
tokenizer.build_vocab(texts)

# Encode and decode
test_text = "Machine learning understands language"
encoded = tokenizer.encode(test_text)
decoded = tokenizer.decode(encoded)

print(f"\nOriginal: {test_text}")
print(f"Tokens: {tokenizer.tokenize(test_text)}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")

print(f"\nVocabulary sample: {list(tokenizer.word_to_id.items())[:15]}")

This simple tokenizer demonstrates the core concepts: building a vocabulary from data, handling unknown words, and converting between text and IDs. Production tokenizers, which typically rely on subword algorithms such as BPE or WordPiece, are considerably more sophisticated.

Practice Problems

  1. Implement character-level tokenization
  2. Add padding and truncation for fixed-length sequences
  3. Build a simple BPE tokenizer from scratch

Summary

Tokenization is foundational to NLP. The choice of tokenization strategy affects model performance: subword methods such as BPE are now the standard for handling diverse text effectively.