Generative AI for Text Generation — GPT-2 vs BERT
By Mohit Kumar
We demonstrate the training of language models using the Hugging Face Transformers library. Two main tasks are covered in the script: Causal Language Modeling (CLM) and Masked Language Modeling (MLM).
Code Link: Train a language model — Colaboratory (google.com)
Flow of the process:
Causal Language Modeling (CLM): A core LLM task in which the model predicts the next token in a sentence. The model can only use the words to its left; it cannot see future words. It relies on tokenization to understand and generate text and follows the GPT-2 principle.
GPT-2: A unidirectional model; it looks only back at the preceding context and generates text one token at a time, as in the sketch below.
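To make this left-to-right behaviour concrete, here is a minimal sketch using the pretrained gpt2 checkpoint through the Transformers pipeline API (the prompt text is just an example):

from transformers import pipeline

# GPT-2 generates text left to right, conditioning only on what comes before
generator = pipeline("text-generation", model="gpt2")
result = generator("Machine learning is", max_new_tokens=20)
print(result[0]["generated_text"])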
Masked Language Modeling (MLM): Another LLM task, in which the model predicts words that have been randomly masked in the input. It also works on tokens, but it follows the BERT principle of predicting the masked positions.
BERT: Bidirectional Encoder Representations from Transformers. It is bidirectional, meaning it can look both backward and forward in the sentence and predict the masked word from the surrounding context, as in the sketch below.
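A minimal sketch of this masked-word prediction, using the pretrained bert-base-cased checkpoint and the fill-mask pipeline (the example sentence is illustrative):

from transformers import pipeline

# BERT fills in the [MASK] token using context from both sides
unmasker = pipeline("fill-mask", model="bert-base-cased")
for prediction in unmasker("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))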
Tokenization: The process of breaking a sentence into smaller units (words, subwords, or characters) called tokens.
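For example, with the GPT-2 tokenizer (a small sketch; the input sentence is arbitrary):

from transformers import AutoTokenizer

# Split a sentence into subword tokens and map them to integer IDs
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize("Generative AI writes text")
print(tokens)                                   # subword pieces
print(tokenizer.convert_tokens_to_ids(tokens))  # their vocabulary IDs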
Group Texts for Modeling (Chunks of Block Size): The dataset is a large collection of texts (sentences and paragraphs) used as input for training. Grouping the texts prepares this raw input for the language model: all texts are concatenated and then split into fixed-size pieces.
Chunks of Block Size: Chunks are fixed-length slices of the concatenated text; the block size determines the sequence length used to train the model. The group_texts function in the code performs this step, as sketched below.
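A sketch of what group_texts does, assuming the tokenized dataset is stored in a variable named tokenized_datasets and the block size is 128:

block_size = 128

def group_texts(examples):
    # Concatenate all tokenized texts in the batch into one long sequence
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the small remainder so every chunk has exactly block_size tokens
    total_length = (total_length // block_size) * block_size
    # Split into fixed-length chunks of block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling, the labels are the inputs themselves
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(group_texts, batched=True)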
· Model Initialization: The initial step of setting up the model architecture and its parameters. Creating the model with from_config (as below) initializes fresh, random weights for training from scratch, while from_pretrained (used in the training sections) loads existing pretrained weights for fine-tuning.
For CLM (AutoModelForCausalLM):
from transformers import AutoConfig, AutoModelForCausalLM
# Build a GPT-2-style configuration and a freshly initialized model from it
config = AutoConfig.from_pretrained(model_checkpoint)
model = AutoModelForCausalLM.from_config(config)
For MLM (AutoModelForMaskedLM):
from transformers import AutoConfig, AutoModelForMaskedLM
# Build a BERT-style configuration and a freshly initialized model from it
config = AutoConfig.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_config(config)
Training:
For CLM
# Imports needed for training
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments

# Instantiate the tokenizer and model for causal language modeling
model_checkpoint = "gpt2"
tokenizer_checkpoint = "sgugger/gpt2-like-tokenizer"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

# Set up training arguments
training_args = TrainingArguments(
    f"{model_checkpoint}-wikitext2",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)

# Instantiate the Trainer for CLM
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

# Train the model
trainer.train()
For MLM
# Imports needed for training
from transformers import AutoTokenizer, AutoModelForMaskedLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling

# Instantiate the tokenizer and model for masked language modeling
model_checkpoint = "bert-base-cased"
tokenizer_checkpoint = "sgugger/bert-like-tokenizer"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

# Set up training arguments
training_args = TrainingArguments(
    "test-clm",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
    push_to_hub_model_id=f"{model_checkpoint}-wikitext2",
)

# Instantiate the data collator that randomly masks 15% of the tokens
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Instantiate the Trainer for MLM
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)

# Train the model
trainer.train()
Key Points:
o evaluation_strategy: A parameter that tells the Trainer when to evaluate model performance; we use "epoch".
§ That means evaluation runs after each full pass through the training dataset.
o learning_rate: A hyperparameter that defines the step size the optimizer takes when updating the model's weights.
§ Hyperparameter: An external parameter set by the engineer based on their understanding of the model and task, rather than learned during training.
· Evaluation (Perplexity): A metric used to evaluate the performance of the model, based on the probabilities it assigns to text.
o Perplexity: Measures how well a model predicts a set of words; a lower value means the model is performing better.
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
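For example, an evaluation loss of 3.0 corresponds to a perplexity of exp(3.0) ≈ 20.1, while a loss of 1.5 gives a perplexity of about 4.5 (illustrative numbers, not results from this run), so lower loss translates directly into lower perplexity.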
· Model Upload (Hugging Face Model Hub):
trainer.push_to_hub()
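Once the upload finishes, the model can be loaded back by name. A minimal sketch, assuming the CLM run above was pushed under your Hub username as gpt2-wikitext2 (the repository id is illustrative):

from transformers import pipeline

# Load the fine-tuned model straight from the Hugging Face Model Hub
generator = pipeline("text-generation", model="<your-username>/gpt2-wikitext2")
print(generator("The history of natural language processing", max_new_tokens=30)[0]["generated_text"])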