Understanding Tokenization in NLP

Tokenization is the process of breaking text down into smaller units called "tokens," and it is a fundamental step in virtually every natural language processing (NLP) application.

What are Tokens?

Tokens are the basic units resulting from the tokenization process, typically including:

  • Words
  • Subwords
  • Characters
  • Punctuation

Types of Tokenizers

1. Word-based Tokenization

Splits text into words based on spaces or punctuation.

Example:

"Hello, World!" → ["Hello", ",", "World", "!"]

Simple Python Example:

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
# On newer NLTK releases you may also need: nltk.download('punkt_tab')

sentence = "Tokenization helps NLP models understand text."
tokens = word_tokenize(sentence)
print(tokens)
# ['Tokenization', 'helps', 'NLP', 'models', 'understand', 'text', '.']
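For comparison, Python's built-in str.split() only splits on whitespace, so punctuation stays attached to the words. A quick sketch reusing word_tokenize from the example above shows why a dedicated word tokenizer is usually preferred:

print("Hello, World!".split())         # ['Hello,', 'World!']  (punctuation stuck to the words)
print(word_tokenize("Hello, World!"))  # ['Hello', ',', 'World', '!']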

2. Character-based Tokenization

Splits text into individual characters.

Example:

"Hi!" → ['H', 'i', '!']

Python Example:

text = "Hi!"
char_tokens = list(text)
print(char_tokens)

3. Sentence Tokenization

Splits paragraphs into sentences.

Example:

"Hello. How are you?" → ["Hello.", "How are you?"]

Python Example:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')
# On newer NLTK releases you may also need: nltk.download('punkt_tab')

text = "Hello. How are you? I'm fine."
sentences = sent_tokenize(text)
print(sentences)
# ['Hello.', 'How are you?', "I'm fine."]
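A naive split on periods breaks at abbreviations such as "Dr.", while the pretrained punkt model typically handles them correctly. A small check, reusing sent_tokenize from the example above:

text = "Dr. Smith arrived early. The meeting started late."
print(text.split('. '))     # naive split: breaks after "Dr"
print(sent_tokenize(text))  # punkt typically keeps "Dr. Smith arrived early." intact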

4. Subword Tokenization (Byte-Pair Encoding - BPE)

Splits words into smaller frequently occurring units. Widely used in models like GPT.

Example:

"tokenizing" → ["token", "izing"]

How Does BPE Work?

BPE builds its vocabulary from a large training corpus by repeatedly merging the most frequent pairs of adjacent symbols into new subwords. The tokenizer is trained first, producing an ordered list of merge rules, and those rules are then applied to split new sentences into subwords.

  • Tokenizer Training ≠ GPT Model Training: The tokenizer is trained before, and independently of, the GPT model.
  • Tokenizer Vocabulary ≠ GPT Learned Parameters: GPT learns patterns from the subword tokens the tokenizer produces; the vocabulary itself is fixed once the tokenizer is trained.

Implementing a Simple BPE Tokenizer

Here's a simplified implementation of a Byte-Pair Encoding (BPE) tokenizer:

import re
from collections import Counter

class SimpleBPE:
    def __init__(self, num_merges):
        self.num_merges = num_merges
        self.merges = []

    def get_vocab(self, corpus):
        # Represent each word as space-separated characters plus an
        # end-of-word marker '_', counted by how often the word appears.
        vocab = Counter()
        for word in corpus.strip().split():
            word = ' '.join(list(word)) + ' _'
            vocab[word] += 1
        return vocab

    def get_stats(self, vocab):
        # Count adjacent symbol pairs across the vocabulary,
        # weighted by each word's frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols)-1):
                pairs[(symbols[i], symbols[i+1])] += freq
        return pairs

    def merge_vocab(self, pair, vocab):
        # Merge every occurrence of the pair, but only where both symbols are
        # complete tokens (the lookarounds stop 'o k' from matching inside 'to k').
        pattern = r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)'
        replacement = ''.join(pair)
        new_vocab = {}
        for word in vocab:
            new_word = re.sub(pattern, replacement, word)
            new_vocab[new_word] = vocab[word]
        return new_vocab

    def train(self, corpus):
        vocab = self.get_vocab(corpus)
        for i in range(self.num_merges):
            pairs = self.get_stats(vocab)
            if not pairs:
                break
            best = pairs.most_common(1)[0][0]
            self.merges.append(best)
            vocab = self.merge_vocab(best, vocab)
            print(f'Merge {i+1}: {best}')

    def tokenize(self, word):
        # Start from characters plus an end-of-word marker, then greedily apply
        # any learned merge (a simplification: full BPE applies merges in the
        # order they were learned).
        word = list(word) + ['_']
        i = 0
        while i < len(word)-1:
            pair = (word[i], word[i+1])
            if pair in self.merges:
                word[i:i+2] = [''.join(pair)]
                i = max(i-1, 0)  # step back in case a new merge is now possible
            else:
                i += 1
        if word[-1] == '_':
            word = word[:-1]  # drop a dangling end-of-word marker
        return word

# Usage
corpus = "token tokenization tokenizer token tokenizing"
bpe = SimpleBPE(num_merges=10)
bpe.train(corpus)

word = "tokenizing"
tokens = bpe.tokenize(word)
print("Tokenized:", tokens)

Using OpenAI's Tiktoken

tiktoken is OpenAI's open-source BPE tokenizer library; it implements the same encodings that GPT models use.

import tiktoken

# Load the encoding (vocabulary and merge rules) used by GPT-4o
encoder = tiktoken.encoding_for_model('gpt-4o')

text = "spaCy is great for tokenizing complex sentences!"
tokens = encoder.encode(text)                           # integer token IDs
decoded_tokens = [encoder.decode([t]) for t in tokens]  # text of each individual token

print(f"Tokens ({len(tokens)}):", decoded_tokens)

Example Output:

Tokens (11): ['spa', 'C', 'y', ' is', ' great', ' for', ' token', 'izing', ' complex', ' sentences', '!']

A Closer Look at Subword Tokenizers

When we say "subword tokenizer," we mean the tokenizer splits text into smaller units (subwords) based on how frequently those units appeared in the corpus used during the tokenizer's creation, not by the GPT model directly.

Here's precisely how it works, step-by-step:

Step 1: Prepare the Corpus (Dataset)

  • Start with a huge text corpus (e.g., billions of words, scraped from web pages, books, Wikipedia).

Step 2: Train the Tokenizer (Pre-training Stage)

  • Before GPT itself is trained, the tokenizer is built independently on this corpus.
  • Algorithms like Byte-Pair Encoding (BPE) or WordPiece iteratively merge the most frequent character pairs into larger "subwords."

Example:

  • Suppose the corpus has many occurrences of: token, tokens, tokenizing, tokenization.
  • The tokenizer learns to create subwords like token, izing, ization, etc. (a toy demonstration follows below).
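Reusing the SimpleBPE class from the implementation above, you can watch this happen on a toy corpus. This is only an illustration; a real tokenizer is trained on billions of words with many thousands of merges:

toy_corpus = "token tokens tokenizing tokenization token tokens token"
toy_bpe = SimpleBPE(num_merges=6)
toy_bpe.train(toy_corpus)                # the first merges build 't'+'o' -> ... -> 'token'
print(toy_bpe.tokenize("tokenization"))  # 'token' comes out as a single subword here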

Step 3: Generate the Vocabulary

  • After training, you get a fixed vocabulary of frequently occurring words and subwords.
  • Less frequent or rare words are represented by multiple subwords.
  • Highly frequent words remain as full words or larger chunks (the quick check below illustrates this).
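You can observe this directly with tiktoken (assuming it is installed): a very common word maps to a single token, while rarer words are split into several subwords. The exact splits depend on the model's vocabulary, so your output may differ:

import tiktoken

encoder = tiktoken.encoding_for_model('gpt-4o')

for word in ["the", "tokenization", "electroencephalography"]:
    ids = encoder.encode(word)
    pieces = [encoder.decode([i]) for i in ids]
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")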

Step 4: Tokenizing New Sentences (Inference Stage)

When the tokenizer encounters new text (like your input to GPT), it splits the sentence into these learned subwords.

Example:

  • "spaCy is great for tokenizing complex sentences!"
  • GPT's tokenizer splits into:
["spa", "C", "y", " is", " great", " for", " token", "izing", " complex", " sentences", "!"]

Here, the tokenizer produces token and izing as separate pieces because each of these fragments appeared frequently enough on its own during tokenizer training.
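Tokenization at this stage is deterministic and lossless: decoding the token IDs gives back the original text exactly. A quick sketch, reusing the encoder from the tiktoken example above:

text = "spaCy is great for tokenizing complex sentences!"
token_ids = encoder.encode(text)          # the integer IDs the model actually receives
assert encoder.decode(token_ids) == text  # decode(encode(x)) round-trips the text
print(token_ids)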

Step 5: Using Subwords in the GPT Model

  • GPT and other Transformer-based models are trained on sequences of these subword tokens (see the sketch after this list).
  • During training, the GPT model learns context and meaning from these subword representations.
  • At inference (runtime), GPT interprets your input only after subword tokenization, which is crucial for generating predictions or completions.
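Below is a minimal sketch of that last point, assuming PyTorch and tiktoken are installed: each token ID simply indexes a row of a learned embedding matrix before entering the Transformer layers. The embedding size here is made up for illustration; GPT's real embedding dimensions are far larger.

import tiktoken
import torch
import torch.nn as nn

encoder = tiktoken.encoding_for_model('gpt-4o')
token_ids = encoder.encode("Tokenization helps NLP models understand text.")

embedding_dim = 16                                        # illustrative size only
embedding = nn.Embedding(encoder.n_vocab, embedding_dim)  # one row per vocabulary entry

ids = torch.tensor(token_ids)   # shape: (sequence_length,)
vectors = embedding(ids)        # shape: (sequence_length, embedding_dim)
print(ids.shape, vectors.shape)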

Understanding tokenization and its main techniques (word-based, character-based, sentence-based, and subword/BPE) is foundational for working effectively in NLP and language modeling. Hope you learnt something new. See you in the next article!