Issue #23: Understanding AutoTokenizer in Huggingface Transformers

Learn how AutoTokenizer works in the Huggingface Transformers library

Manish Shivanandhan
June 18, 2024

Let’s learn about AutoTokenizer in the Huggingface Transformers library. We'll break it down step by step to make it easy to understand, starting with why we need tokenizers in the first place.

What is a Tokenizer?

Let’s understand what a tokenizer is.

Tokenizers are essential tools in machine learning, especially in natural language processing (NLP). They break down text into smaller units called tokens.

These tokens can be words, subwords, or characters.

For example, let’s take the sentence "I love apples." A simple word-level tokenizer that ignores punctuation breaks it into three tokens: "I," "love," and "apples."
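To make word-level and character-level tokens concrete, here is a minimal sketch in plain Python. It assumes a simple regular-expression split, not any real tokenizer from the Transformers library. Subword tokenization needs a learned vocabulary, so it is not shown here.

```python
import re

sentence = "I love apples."

# Word-level: keep runs of letters/digits and drop punctuation.
word_tokens = re.findall(r"\w+", sentence)

# Character-level: every character, including spaces and the period, is a token.
char_tokens = list(sentence)

print(word_tokens)       # ['I', 'love', 'apples']
print(len(char_tokens))  # 14
```

The same sentence gives 3 tokens at the word level and 14 at the character level. Subword tokenizers, like the one GPT-2 uses, usually land somewhere in between.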

Tokenization makes it easier for computers to understand and process text. It’s the first step in most NLP tasks, such as translation and sentiment analysis.

What is AutoTokenizer?

AutoTokenizer is a special class in the Huggingface Transformers library. It picks the right tokenizer for your model automatically, so you don’t need to know which tokenizer class the model uses.

Think of it as a smart assistant that knows which tool to use for the job.

The AutoTokenizer is easy to use. You don’t have to remember which tokenizer goes with which model. It ensures you use the correct tokenizer for the model, reducing errors and improving consistency.

AutoTokenizer is flexible. It works with many different models, allowing you to switch models without changing much code.
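The idea of "one entry point that picks the right tokenizer by model name" can be imitated with a small lookup table. This is a toy sketch only: the classes, the model names, and the load_tokenizer function are all hypothetical, and the real AutoTokenizer relies on information stored with each model, not a hand-written dictionary.

```python
# Toy sketch only: these classes and the registry are hypothetical,
# not part of the Transformers library.

class WhitespaceTokenizer:
    def tokenize(self, text):
        return text.split()


class CharTokenizer:
    def tokenize(self, text):
        return list(text)


# Maps a made-up model name to the tokenizer class that goes with it.
TOKENIZER_REGISTRY = {
    "toy-word-model": WhitespaceTokenizer,
    "toy-char-model": CharTokenizer,
}


def load_tokenizer(model_name):
    """Return an instance of the tokenizer registered for model_name."""
    try:
        return TOKENIZER_REGISTRY[model_name]()
    except KeyError:
        raise ValueError(f"Unknown model: {model_name}") from None


# Switching models only means changing the name string.
for name in ("toy-word-model", "toy-char-model"):
    tokenizer = load_tokenizer(name)
    print(name, tokenizer.tokenize("hi there"))
```

The calling code never mentions a specific tokenizer class, which is the same convenience AutoTokenizer gives you.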

How to Use AutoTokenizer

Let’s see how to use AutoTokenizer with an example based on the GPT-2 model.

1. Install the Transformers Library

First, make sure you have the Transformers library installed. You can install it using pip (prefix the command with ! if you are running it in a Colab notebook).

pip install transformers

2. Import AutoTokenizer

Next, import AutoTokenizer from the Transformers library.

from transformers import AutoTokenizer

3. Load the Tokenizer

Use AutoTokenizer to load the tokenizer for GPT-2.

tokenizer = AutoTokenizer.from_pretrained("gpt2")

This line of code tells AutoTokenizer to load the tokenizer for the GPT-2 model. If you were using a different model, you would just change “gpt2” to the name of that model.

4. Tokenize Text

Now, let’s tokenize some text. We’ll use the sentence “I love reading books.”

text = "I love reading books"
tokens = tokenizer(text)
print(tokens)

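Calling the tokenizer breaks the text into tokens and converts each token into a numeric ID. You’ll see a dictionary shaped like this (the exact IDs come from the model’s vocabulary, so the numbers you get may differ):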

{'input_ids': [40, 523, 10748, 11950], 'attention_mask': [1, 1, 1, 1]}

- input_ids are the token IDs.

- attention_mask tells the model which tokens to pay attention to (1 means pay attention, 0 means ignore).
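In the single-sentence example every mask value is 1 because nothing was padded. Zeros show up when sentences of different lengths are padded to the same length. Here is a minimal sketch of that idea in plain Python. The pad_batch helper, the pad ID of 0, and the token IDs are all made up for illustration and do not come from the library.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every ID list to the longest length and build matching masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        # Real tokens are followed by padding IDs.
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs for a three-token and a one-token sentence.
batch = pad_batch([[5, 8, 13], [7]])
print(batch["input_ids"])       # [[5, 8, 13], [7, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

The second sentence gets two padding IDs, and its mask marks exactly those two positions with 0.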

5. Decode Tokens

You can also convert the token IDs back into text using the decode function.

decoded_text = tokenizer.decode(tokens['input_ids'])
print(decoded_text)

This will return the original sentence (or something very close to it).
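To look at the pieces behind each ID before decoding, you can map the IDs back to token strings. The sketch below puts the earlier steps into one script and assumes that transformers is installed and that the GPT-2 tokenizer files can be downloaded. The variable names ids, pieces, and decoded are mine.

```python
from transformers import AutoTokenizer

# Downloads the GPT-2 tokenizer files on first use (needs network access).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "I love reading books"
ids = tokenizer(text)["input_ids"]

# Show the token string that each ID stands for.
pieces = tokenizer.convert_ids_to_tokens(ids)
print(pieces)

# Decoding the IDs should give the text back.
decoded = tokenizer.decode(ids)
print(decoded)
```

With GPT-2, pieces that start a new word are typically printed with a leading "Ġ" marker, which stands in for the space before the word.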

Great job! You’ve learned what a tokenizer is and how to use AutoTokenizer in the Huggingface Transformers library.

Keep practising with different models and texts, and soon you’ll be very comfortable using AutoTokenizer. As usual, if you have questions, reply to this email and I’ll be glad to help you. Happy coding!
