
Generative AI for Text Generation — GPT-2 vs BERT

By Mohit Kumar

This article demonstrates the training of language models using the Hugging Face Transformers library. The script covers two main tasks: Causal Language Modeling (CLM) and Masked Language Modeling (MLM).


Code Link: Train a language model — Colaboratory (google.com)

Flow of the process:

Input Text -> Tokenization -> Group Texts -> Model Initialization -> Training -> Evaluation -> Model Upload -> Trained Language Model

Causal Language Modeling (CLM): A core language modeling task in which the model predicts the next token in a sequence. The model can only use the words to its left, meaning it cannot see any future word in the sentence. The input text is tokenized before training, and GPT-2 is the reference architecture for this approach.

GPT-2: It is unidirectional; when generating, it looks only back at the preceding tokens to predict the next one, as the sketch below shows.
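To see unidirectional generation in action, here is a minimal sketch using the Transformers pipeline API with the public gpt2 checkpoint (the prompt text is illustrative):

from transformers import pipeline

# Load GPT-2 in a text-generation pipeline (left-to-right decoding)
generator = pipeline("text-generation", model="gpt2")

# The model continues the prompt using only the tokens to its left
print(generator("Generative AI is", max_new_tokens=20)[0]["generated_text"])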

Masked Language Modeling (MLM): Another core language modeling task in which the model predicts tokens that have been randomly masked in the input. It also relies on tokenization, and BERT is the reference architecture for this approach.

BERT: Bidirectional Encoder Representations from Transformers. It is bidirectional, meaning it can look at context both before and after a masked position and predict the masked word from that full context.
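To see bidirectional prediction in action, here is a minimal sketch using the fill-mask pipeline with the public bert-base-cased checkpoint (the example sentence is illustrative):

from transformers import pipeline

# Load BERT in a fill-mask pipeline; [MASK] is BERT's mask token
unmasker = pipeline("fill-mask", model="bert-base-cased")

# BERT uses context on both sides of the mask to rank candidate words
for prediction in unmasker("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))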

Tokenization: The method of breaking a sentence into smaller units (words, subwords, or characters) called tokens.
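For example, a GPT-2 tokenizer splits a sentence into subword tokens and maps each token to an integer ID (a minimal sketch; the sentence is illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Break a sentence into subword tokens, then map tokens to integer IDs
tokens = tokenizer.tokenize("Tokenization breaks text into pieces.")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)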

Group Texts for Modeling (Chunks of Block Size): The tokenized texts (sentences or paragraphs) form a large body of input data. Grouping concatenates this data and reorganizes it into uniform pieces, which is what prepares the text for training the LLM.

Chunks of Block Size: Chunks are fixed-length slices of the grouped token stream; the block size determines the sequence length seen by the training model. The group_texts function in the code does this, as sketched below.
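A sketch of such a group_texts function, along the lines of the linked notebook (assuming a block_size of 128 and tokenized examples containing lists of input_ids):

block_size = 128

def group_texts(examples):
    # Concatenate all tokenized texts into one long stream per key
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the small remainder so every chunk has exactly block_size tokens
    total_length = (total_length // block_size) * block_size
    # Split the stream into fixed-length chunks
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For language modeling, the labels are the inputs themselves
    result["labels"] = result["input_ids"].copy()
    return result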

· Model Initialization: The initial step of setting up the model's architecture and parameters.

For CLM, use AutoModelForCausalLM:

config = AutoConfig.from_pretrained(model_checkpoint)
model = AutoModelForCausalLM.from_config(config)

For MLM, use AutoModelForMaskedLM:

config = AutoConfig.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_config(config)
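Note that from_config builds the architecture with randomly initialized weights (training from scratch), whereas from_pretrained loads the checkpoint's trained weights. A minimal sketch of the difference, using the public gpt2 checkpoint:

from transformers import AutoConfig, AutoModelForCausalLM

model_checkpoint = "gpt2"

# Architecture only, random weights: for training from scratch
config = AutoConfig.from_pretrained(model_checkpoint)
scratch_model = AutoModelForCausalLM.from_config(config)

# Published weights: for fine-tuning an already trained model
pretrained_model = AutoModelForCausalLM.from_pretrained(model_checkpoint)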

Training:

For CLM

# Imports (lm_datasets comes from the tokenization and grouping steps above)
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

# Instantiate the tokenizer and model for causal language modeling
model_checkpoint = "gpt2"
tokenizer_checkpoint = "sgugger/gpt2-like-tokenizer"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

# Set up training arguments
training_args = TrainingArguments(
    f"{model_checkpoint}-wikitext2",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)

# Instantiate the Trainer for CLM
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

# Train the model
trainer.train()

For MLM

# Imports for MLM (AutoTokenizer, Trainer, TrainingArguments as above)
from transformers import AutoModelForMaskedLM, DataCollatorForLanguageModeling

# Instantiate the tokenizer and model for masked language modeling
model_checkpoint = "bert-base-cased"
tokenizer_checkpoint = "sgugger/bert-like-tokenizer"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

# Set up training arguments
training_args = TrainingArguments(
    "test-clm",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
    push_to_hub_model_id=f"{model_checkpoint}-wikitext2",
)

# Instantiate the data collator for MLM (randomly masks 15% of tokens)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Instantiate the Trainer for MLM
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)

# Train the model
trainer.train()

Key Points:

o evaluation_strategy: A parameter that tells the Trainer when to evaluate model performance. We use "epoch", which means evaluation runs after each complete pass through the entire dataset.

o learning_rate: A hyperparameter that defines the step size the optimizer takes when updating the model's weights.

§ Hyperparameter: An external parameter set by the engineer, based on their understanding of the model, rather than learned during training.

· Evaluation (Perplexity): A metric used to evaluate the performance of the model, based on the probabilities the model assigns to the data.

o Perplexity: It measures how well a model predicts a set of words and is computed as the exponential of the cross-entropy loss; lower values mean the model is performing better.

import math

# Perplexity is the exponential of the evaluation (cross-entropy) loss
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

· Model Upload (Hugging Face Model Hub): Push the trained model to the Hub so it can be shared and reloaded later:

trainer.push_to_hub()
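Once pushed, the model can be loaded back from the Hub like any other checkpoint. The repo id below is a hypothetical placeholder; replace "your-username" with your own Hub account name:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id; push_to_hub reports the actual one after upload
repo_id = "your-username/gpt2-wikitext2"
model = AutoModelForCausalLM.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)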