Generative AI for Text Generation — GPT-2 vs BERT
By Mohit Kumar
We demonstrate the training of language models using the Hugging Face Transformers library. Two main tasks are covered in the script: Causal Language Modeling (CLM) and Masked Language Modeling (MLM).
Code Link: Train a language model — Colaboratory (google.com)
Flow of the process:
Causal Language Modeling (CLM): A core LLM task in which the model predicts the next token in a sentence. The model can only use the words to its left; it cannot see future words. It relies on tokenization to understand and generate text and follows the GPT-2 principle.
GPT-2: A unidirectional model; it looks only back at the preceding context and generates text one token at a time, as in the sketch below.
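To make this left-to-right behaviour concrete, here is a minimal sketch using the pretrained gpt2 checkpoint through the Transformers pipeline API (the prompt text is just an example):

from transformers import pipeline

# GPT-2 generates text left to right, conditioning only on what comes before
generator = pipeline("text-generation", model="gpt2")
result = generator("Machine learning is", max_new_tokens=20)
print(result[0]["generated_text"])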
Masked Language Modeling (MLM): Another LLM task, in which the model predicts words that have been randomly masked in the input. It also works on tokens, but it follows the BERT principle of predicting the masked positions.
BERT: Bidirectional Encoder Representations from Transformers. It is bidirectional, meaning it can look both backward and forward in the sentence and predict the masked word from the surrounding context, as in the sketch below.
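A minimal sketch of this masked-word prediction, using the pretrained bert-base-cased checkpoint and the fill-mask pipeline (the example sentence is illustrative):

from transformers import pipeline

# BERT fills in the [MASK] token using context from both sides
unmasker = pipeline("fill-mask", model="bert-base-cased")
for prediction in unmasker("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))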
Tokenization: The process of breaking a sentence into smaller units (words, subwords, or characters) called tokens.
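For example, with the GPT-2 tokenizer (a small sketch; the input sentence is arbitrary):

from transformers import AutoTokenizer

# Split a sentence into subword tokens and map them to integer IDs
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize("Generative AI writes text")
print(tokens)                                   # subword pieces
print(tokenizer.convert_tokens_to_ids(tokens))  # their vocabulary IDs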
Group Texts for Modeling (Chunks of Block Size): The dataset is a large collection of texts (sentences and paragraphs) used as input for training. Grouping the texts prepares this raw input for the language model: all texts are concatenated and then split into fixed-size pieces.
Chunks of Block Size: Chunks are fixed-length slices of the concatenated text; the block size determines the sequence length used to train the model. The group_texts function in the code performs this step, as sketched below.
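A sketch of what group_texts does, assuming the tokenized dataset is stored in a variable named tokenized_datasets and the block size is 128:

block_size = 128

def group_texts(examples):
    # Concatenate all tokenized texts in the batch into one long sequence
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the small remainder so every chunk has exactly block_size tokens
    total_length = (total_length // block_size) * block_size
    # Split into fixed-length chunks of block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling, the labels are the inputs themselves
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(group_texts, batched=True)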
· Model Initialization: The initial step of setting up the model architecture and its parameters. Creating the model with from_config (as below) initializes fresh, random weights for training from scratch, while from_pretrained (used in the training sections) loads existing pretrained weights for fine-tuning.
For CLM (AutoModelForCausalLM):
from transformers import AutoConfig, AutoModelForCausalLM
# Build a GPT-2-style configuration and a freshly initialized model from it
config = AutoConfig.from_pretrained(model_checkpoint)
model = AutoModelForCausalLM.from_config(config)
For MLM (AutoModelForMaskedLM):
from transformers import AutoConfig, AutoModelForMaskedLM
# Build a BERT-style configuration and a freshly initialized model from it
config = AutoConfig.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_config(config)
Training:
For CLM
# Imports needed for training
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments

# Instantiate the tokenizer and model for causal language modeling
model_checkpoint = "gpt2"
tokenizer_checkpoint = "sgugger/gpt2-like-tokenizer"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

# Set up training arguments
training_args = TrainingArguments(
    f"{model_checkpoint}-wikitext2",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)

# Instantiate the Trainer for CLM
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

# Train the model
trainer.train()
For MLM
# Imports needed for training
from transformers import AutoTokenizer, AutoModelForMaskedLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling

# Instantiate the tokenizer and model for masked language modeling
model_checkpoint = "bert-base-cased"
tokenizer_checkpoint = "sgugger/bert-like-tokenizer"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

# Set up training arguments
training_args = TrainingArguments(
    "test-clm",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
    push_to_hub_model_id=f"{model_checkpoint}-wikitext2",
)

# Instantiate the data collator that randomly masks 15% of the tokens
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Instantiate the Trainer for MLM
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)

# Train the model
trainer.train()
Key Points:
o evaluation_strategy: A parameter that tells the Trainer when to evaluate model performance; we use "epoch".
§ That means evaluation runs after each full pass through the training dataset.
o learning_rate: A hyperparameter that defines the step size the optimizer takes when updating the model's weights.
§ Hyperparameter: An external parameter set by the engineer based on their understanding of the model and task, rather than learned during training.
· Evaluation (Perplexity): A metric used to evaluate the performance of the model, based on the probabilities it assigns to text.
o Perplexity: Measures how well a model predicts a set of words; a lower value means the model is performing better.
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
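For example, an evaluation loss of 3.0 corresponds to a perplexity of exp(3.0) ≈ 20.1, while a loss of 1.5 gives a perplexity of about 4.5 (illustrative numbers, not results from this run), so lower loss translates directly into lower perplexity.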
· Model Upload (Hugging Face Model Hub):
trainer.push_to_hub()
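Once the upload finishes, the model can be loaded back by name. A minimal sketch, assuming the CLM run above was pushed under your Hub username as gpt2-wikitext2 (the repository id is illustrative):

from transformers import pipeline

# Load the fine-tuned model straight from the Hugging Face Model Hub
generator = pipeline("text-generation", model="<your-username>/gpt2-wikitext2")
print(generator("The history of natural language processing", max_new_tokens=30)[0]["generated_text"])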