Introduction to Language and Speech Technology - ReMA (RU)¶
Seminar 10
Last update: 2024/11/18
Aditya Kamlesh Parikh - @aditya.parikh@ru.nl
In this tutorial, we will learn how to fine-tune Wav2Vec2-BERT, a recent self-supervised speech model released by Meta in the wav2vec 2.0 family, for automatic speech recognition (ASR). We will use a small dataset from Hugging Face to fine-tune the model to convert speech to text. By the end, you will understand how to load, prepare, fine-tune, and evaluate a model for speech recognition.
Important note: Change your runtime to GPU. ✌
The first step is to install the required libraries:
%%capture
!pip install transformers datasets torchaudio evaluate jiwer accelerate
!apt install git-lfs ##to upload your model on huggingface
Would you like to upload this model to the Hugging Face Hub 🤗? Then first log in to the Hub.
from huggingface_hub import notebook_login
notebook_login()
Give a repository name:
repo_name = "wav2vec2-bert-speechocean-762-tutorial"
1. Load Dataset¶
In this tutorial we are using a dataset from HuggingFace 🤗 "speechocean762: A non-native English corpus for pronunciation scoring task"
We will load the dataset from 🤗 and use it for the finetuning.
from datasets import load_dataset
speechocean = load_dataset("mispeech/speechocean762")
print(speechocean)
DatasetDict({
train: Dataset({
features: ['accuracy', 'completeness', 'fluency', 'prosodic', 'text', 'total', 'words', 'speaker', 'gender', 'age', 'audio'],
num_rows: 2500
})
test: Dataset({
features: ['accuracy', 'completeness', 'fluency', 'prosodic', 'text', 'total', 'words', 'speaker', 'gender', 'age', 'audio'],
num_rows: 2500
})
})
Now we will look more into the dataset.
speechocean
DatasetDict({
train: Dataset({
features: ['accuracy', 'completeness', 'fluency', 'prosodic', 'text', 'total', 'words', 'speaker', 'gender', 'age', 'audio'],
num_rows: 2500
})
test: Dataset({
features: ['accuracy', 'completeness', 'fluency', 'prosodic', 'text', 'total', 'words', 'speaker', 'gender', 'age', 'audio'],
num_rows: 2500
})
})
speechocean['train']
Dataset({
features: ['accuracy', 'completeness', 'fluency', 'prosodic', 'text', 'total', 'words', 'speaker', 'gender', 'age', 'audio'],
num_rows: 2500
})
A Dataset contains columns of data, and each column can be a different type of data. The index, or axis label, is used to access examples from the dataset. For example, indexing by the row returns a dictionary of an example from the dataset:
speechocean["train"][15]
{'accuracy': 8,
'completeness': 10.0,
'fluency': 9,
'prosodic': 9,
'text': 'DORA IS NOT A CLEANER',
'total': 8,
'words': [{'accuracy': 10,
'phones': ['D', 'AO1', 'R', 'AH0'],
'phones-accuracy': [2.0, 1.8, 2.0, 2.0],
'stress': 10,
'text': 'DORA',
'total': 10,
'mispronunciations': []},
{'accuracy': 10,
'phones': ['IH0', 'Z'],
'phones-accuracy': [2.0, 2.0],
'stress': 10,
'text': 'IS',
'total': 10,
'mispronunciations': []},
{'accuracy': 10,
'phones': ['N', 'AA0', 'T'],
'phones-accuracy': [2.0, 1.6, 2.0],
'stress': 10,
'text': 'NOT',
'total': 10,
'mispronunciations': []},
{'accuracy': 10,
'phones': ['AH0'],
'phones-accuracy': [2.0],
'stress': 10,
'text': 'A',
'total': 10,
'mispronunciations': []},
{'accuracy': 10,
'phones': ['K', 'L', 'IY1', 'N', 'ER0'],
'phones-accuracy': [2.0, 2.0, 2.0, 2.0, 2.0],
'stress': 10,
'text': 'CLEANER',
'total': 10,
'mispronunciations': []}],
'speaker': '0001',
'gender': 'm',
'age': 6,
'audio': {'path': '000010140.wav',
'array': array([ 0.00021362, -0.0005188 , -0.00186157, ..., 0.00164795,
0.00048828, -0.00079346]),
'sampling_rate': 16000}}
I think it is equally important that you understand the 🤗 datasets library and how to use it when you have your own dataset in TSV, CSV, JSON, or Arrow format. You can then convert your dataset into a 🤗 DatasetDict and use it quickly and efficiently. I recommend that you go through this page: https://huggingface.co/docs/datasets/en/load_hub
Sometimes you need to upsample or downsample the audio. For example, in the example above the sampling_rate is 16000, so it is fine. But if you are using the Common Voice or LibriSpeech dataset, the sampling rate can be different. In that case, use the cast_column() function and set the sampling_rate parameter of the Audio feature to resample the audio signal.
One more thing: sometimes you need to prepare your data to make it more usable. 🤗 Datasets gives you the freedom to make any changes with the help of the map() function.
Can you try to add another column to the dataset, namely phonetic_transcription, by joining the phonemes of all the words with a single space between them? Take it as a task. (Optional!)
For example:
Orthographic_transcription: THEN HE WENT TO THEME PARK
Phonetic_transcription: DH EH0 N HH IY0 W EH0 N T T UW0 TH IY0 M P AA0 R K
Add phonetic transcription¶
def add_phonetic_transcription(entry):
    # Extract phonetic transcription for the words in the entry
    phonetic_transcription = " ".join(" ".join(word['phones']) for word in entry['words'])
    # Add the phonetic transcription to the entry
    entry['phonetic_transcription'] = phonetic_transcription
    return entry
speechocean['train'] = speechocean['train'].map(add_phonetic_transcription)
speechocean['test'] = speechocean['test'].map(add_phonetic_transcription)
speechocean['train'][0]
{'accuracy': 8,
'completeness': 10.0,
'fluency': 9,
'prosodic': 9,
'text': 'WE CALL IT BEAR',
'total': 8,
'words': [{'accuracy': 10,
'phones': ['W', 'IY0'],
'phones-accuracy': [2.0, 2.0],
'stress': 10,
'text': 'WE',
'total': 10,
'mispronunciations': []},
{'accuracy': 10,
'phones': ['K', 'AO0', 'L'],
'phones-accuracy': [2.0, 1.8, 1.8],
'stress': 10,
'text': 'CALL',
'total': 10,
'mispronunciations': []},
{'accuracy': 10,
'phones': ['IH0', 'T'],
'phones-accuracy': [2.0, 2.0],
'stress': 10,
'text': 'IT',
'total': 10,
'mispronunciations': []},
{'accuracy': 6,
'phones': ['B', 'EH0', 'R'],
'phones-accuracy': [2.0, 1.0, 1.0],
'stress': 10,
'text': 'BEAR',
'total': 6,
'mispronunciations': []}],
'speaker': '0001',
'gender': 'm',
'age': 6,
'audio': {'path': '000010011.wav',
'array': array([-9.46044922e-04, -2.38037109e-03, -1.31225586e-03, ...,
-9.15527344e-05, 3.05175781e-04, -2.44140625e-04]),
'sampling_rate': 16000},
'phonetic_transcription': 'W IY0 K AO0 L IH0 T B EH0 R'}
2. Prepare Data¶
Here you will perform some simple steps for data preparation.
First we will start with removing columns which are not useful for us for finetuning.
Task 1:
Remove all the columns from the dataset except the text and audio columns with the help of the remove_columns function.
Once you are done with this, we will clean the text and remove any punctuation marks, foreign/special characters present in the text.
Task 2:
Write a function to remove all special characters, if they are present in the text. Also take the language into account: language-specific special characters can be kept so that the text stays faithful to the language.
Hint: You can use regular expressions. You have also written similar functions in previous tutorials.
Finally, we will create a vocabulary. In simple terms, the vocabulary is the set of all distinct letters/characters present in your dataset. For example, if your data is in English, then the 26 letters of the English alphabet can be your vocabulary.
Task 3:
You will write a function to extract all the unique characters present in your dataset.
Once you are done with this, we will create a JSON-formatted vocab file. It will have a key-value structure, where each character (key) has a numerical value. For example: A:1, B:2, C:3 and so on.
Question:
For English, apart from the 26 letters, which punctuation marks need to stay in the vocabulary?
Shall we also include a space " " in the vocabulary? Why?
If you really want to understand fine-tuning of pretrained models like wav2vec2.0 or HuBERT, it is very important that you learn more about the Connectionist Temporal Classification (CTC) framework. It is a framework used in sequential tasks. Some great resources to learn CTC are here: (1),(2)
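To make the role of the CTC blank concrete, here is a minimal sketch of greedy CTC decoding in plain Python. It is not the tokenizer's actual implementation: the blank symbol "_" and the per-frame labels are made up for illustration. The idea is that the model emits one label per audio frame, repeated labels are merged, and blanks are dropped afterwards.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen only for this illustration


def ctc_collapse(frame_labels, blank=BLANK):
    """Collapse a per-frame label sequence the way greedy CTC decoding does."""
    # Step 1: merge runs of identical consecutive labels into a single label.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: drop the blank symbol, which only separates repeated characters.
    return "".join(label for label in merged if label != blank)


# Made-up per-frame predictions for a short utterance.
frames = ["H", "H", "_", "E", "E", "L", "L", "_", "L", "O", "O", "_"]
decoded = ctc_collapse(frames)
print(decoded)  # HELLO
```

Note that the double "L" survives only because a blank sits between the two runs of "L"; without it, the repeats would be merged into one letter.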
I will give you some time for these 3 tasks. If you are unable to finish them in that time, please complete them later. I have also uploaded a vocab.json file with this tutorial, and we can continue with that.
Your code for Task 1 goes in the cell below:
speechocean = speechocean.remove_columns(['accuracy', 'completeness', 'fluency', 'prosodic','total', 'words', 'speaker', 'gender', 'age','phonetic_transcription'])
speechocean
DatasetDict({
train: Dataset({
features: ['text', 'audio'],
num_rows: 2500
})
test: Dataset({
features: ['text', 'audio'],
num_rows: 2500
})
})
Your code for Task 2 goes in the cell below:
# Your code here for Task 2
# Write code to remove any special characters in the train/test dataset
import re
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'  # raw string avoids invalid-escape warnings
def remove_special_characters(batch):
    batch["text"] = re.sub(chars_to_ignore_regex, '', batch["text"])
    return batch
speechocean = speechocean.map(remove_special_characters)
Your code for Task 3 goes in the cell below:
# Your code here for Task 3
# Your output should look like this. For each unique character there should be a number.
def extract_all_chars(batch):
    all_text = " ".join(batch["text"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}
vocabs = speechocean.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=speechocean.column_names["train"])
vocab_list = list(set(vocabs["train"]["vocab"][0]) | set(vocabs["test"]["vocab"][0]))
vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict
{'T': 0,
'K': 1,
'X': 2,
'M': 3,
'L': 4,
'E': 5,
'H': 6,
'B': 7,
'W': 8,
'C': 9,
'J': 10,
'Y': 11,
'V': 12,
'P': 13,
'D': 14,
'G': 15,
'F': 16,
'R': 17,
'N': 18,
"'": 19,
'I': 20,
'S': 21,
'O': 22,
' ': 23,
'Z': 24,
'U': 25,
'A': 26,
'Q': 27}
# If you are not able to complete all 3 tasks, please finish them later. You can load the provided vocab.json file directly.
import json
with open('/content/vocab.json', 'r', encoding='utf-8') as file:
    vocab = json.load(file)
vocab
{'H': 0,
'Q': 1,
'F': 2,
'Z': 3,
'X': 4,
'K': 5,
'Y': 6,
'R': 7,
'N': 8,
'M': 9,
'A': 10,
'C': 11,
'O': 12,
'J': 13,
'T': 14,
"'": 15,
' ': 16,
'P': 17,
'W': 18,
'L': 19,
'S': 20,
'V': 21,
'U': 22,
'I': 23,
'B': 24,
'E': 25,
'D': 26,
'G': 27}
Now we will add two special tokens to the vocabulary: [UNK] and [PAD]. The [PAD] token also serves as the blank token in CTC alignment. If you find this difficult to understand, please refer to the CTC resources mentioned earlier.
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)
# Also, for convenience, replace the " " token with |
# so that it is more visible.
vocab["|"] = vocab[" "]
del vocab[" "]
vocab
{'H': 0,
'Q': 1,
'F': 2,
'Z': 3,
'X': 4,
'K': 5,
'Y': 6,
'R': 7,
'N': 8,
'M': 9,
'A': 10,
'C': 11,
'O': 12,
'J': 13,
'T': 14,
"'": 15,
'P': 17,
'W': 18,
'L': 19,
'S': 20,
'V': 21,
'U': 22,
'I': 23,
'B': 24,
'E': 25,
'D': 26,
'G': 27,
'[UNK]': 28,
'[PAD]': 29,
'|': 16}
len(vocab)
30
Can you tell what this number means? What will be the dimension of our output? How many classes will the output have?
# Save this vocab.json
import json
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab, vocab_file)
3. Create Tokenizer¶
Like Wav2Vec2 and HuBERT, the Wav2Vec2-BERT model is fine-tuned using connectionist temporal classification (CTC), so the model output has to be decoded using Wav2Vec2CTCTokenizer. In the next tutorial session we will explain this in more detail.
from transformers import Wav2Vec2CTCTokenizer
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("./", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
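As a rough picture of what a character-level CTC tokenizer does with this vocabulary, the sketch below maps a transcript to ids by simple dictionary lookup. It is only an approximation written for illustration: the helper name encode is hypothetical, and the real Wav2Vec2CTCTokenizer contains more logic than this.

```python
# Vocabulary as printed earlier in this tutorial ("|" stands in for the space).
vocab = {
    "H": 0, "Q": 1, "F": 2, "Z": 3, "X": 4, "K": 5, "Y": 6, "R": 7,
    "N": 8, "M": 9, "A": 10, "C": 11, "O": 12, "J": 13, "T": 14, "'": 15,
    "|": 16, "P": 17, "W": 18, "L": 19, "S": 20, "V": 21, "U": 22, "I": 23,
    "B": 24, "E": 25, "D": 26, "G": 27, "[UNK]": 28, "[PAD]": 29,
}


def encode(text, vocab):
    """Hypothetical helper: map each character to its id.

    Spaces become the word delimiter "|"; unseen characters become [UNK].
    """
    unk_id = vocab["[UNK]"]
    return [vocab.get(ch, unk_id) for ch in text.replace(" ", "|")]


ids = encode("WE CALL IT BEAR", vocab)
print(ids)
# [18, 25, 16, 11, 10, 19, 19, 16, 23, 14, 16, 24, 25, 10, 7]
```

A character that is not in the vocabulary, such as a lowercase letter, falls back to the [UNK] id (28 here), which is one reason the text is cleaned before the vocabulary is built.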
# Push tokenizer to repository
tokenizer.push_to_hub(repo_name)
CommitInfo(commit_url='https://huggingface.co/Aditya3107/wav2vec2-bert-speechocean-762-tutorial/commit/11612c1aa4fa0b0093e45a73791a35df34b504f9', commit_message='Upload tokenizer', commit_description='', oid='11612c1aa4fa0b0093e45a73791a35df34b504f9', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Aditya3107/wav2vec2-bert-speechocean-762-tutorial', endpoint='https://huggingface.co', repo_type='model', repo_id='Aditya3107/wav2vec2-bert-speechocean-762-tutorial'), pr_revision=None, pr_num=None)
4. Feature Extractor¶
In audio fine-tuning with models like Wav2Vec2 or HuBERT, the feature extractor processes raw waveform audio into input representations that the model can understand.
- It converts audio to a consistent format (e.g., sample rate, duration) to match the model's pretraining setup.
- It extracts low-level features (like spectrogram-like representations) directly from the waveform, so that the model can process speech signals.
from transformers import SeamlessM4TFeatureExtractor
feature_extractor = SeamlessM4TFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
from transformers import Wav2Vec2BertProcessor
processor = Wav2Vec2BertProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)
processor.push_to_hub(repo_name)
CommitInfo(commit_url='https://huggingface.co/Aditya3107/wav2vec2-bert-speechocean-762-tutorial/commit/098be93ab22606d8fba9b33ca3b052a6f175d4d7', commit_message='Upload processor', commit_description='', oid='098be93ab22606d8fba9b33ca3b052a6f175d4d7', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Aditya3107/wav2vec2-bert-speechocean-762-tutorial', endpoint='https://huggingface.co', repo_type='model', repo_id='Aditya3107/wav2vec2-bert-speechocean-762-tutorial'), pr_revision=None, pr_num=None)
# If you check the speechocean dataset, the audio is already at 16 kHz, so we do not need to do anything.
speechocean['train'][20]['audio']
{'path': '000050003.wav',
'array': array([-0.02224731, -0.02105713, -0.0227356 , ..., 0.0010376 ,
-0.00030518, 0.00030518]),
'sampling_rate': 16000}
import numpy as np
print("Target text:", speechocean["train"][50]["text"])
print("Input array shape:", np.asarray(speechocean["train"][50]["audio"]["array"]).shape)
print("Sampling rate:", speechocean["train"][50]["audio"]["sampling_rate"])
Target text: TOM LIKES THE OLD SWEATER
Input array shape: (50880,)
Sampling rate: 16000
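The array length and the sampling rate together give the duration of the clip. A small worked example using the two numbers printed above:

```python
num_samples = 50880    # length of the audio array printed above
sampling_rate = 16000  # samples per second

# Duration in seconds = number of samples / samples per second.
duration_seconds = num_samples / sampling_rate
print(f"Duration: {duration_seconds:.2f} s")  # Duration: 3.18 s
```

So this five-word utterance is a little over three seconds long.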
5. Prepare for training¶
In the code below,
- First we load and resample the audio.
- Extract the input_features from the loaded audio file; in our case this is log-mel feature extraction.
- Encode the transcription/text to labels.
def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["input_length"] = len(batch["input_features"])
    batch["labels"] = processor(text=batch["text"]).input_ids
    return batch
speechocean = speechocean.map(prepare_dataset)
Data Collator¶
A data collator prepares batches of data during training. It ensures the inputs and labels are appropriately padded and formatted for the model.
Basically, your input data (audio features, transcripts) vary in length, so you need to pad them to a common length within each batch.
DataCollatorCTCWithPadding
This class customizes how batches are created for Connectionist Temporal Classification (CTC)-based training tasks like speech recognition. It takes care of:
- Padding Inputs: Handles variable-length input audio features by padding them to the longest sequence in a batch.
- Padding Labels: Pads transcription labels separately.
- Handling Loss Masking: Ensures that padding in the labels is ignored during loss computation by replacing padding tokens with -100.
import torch
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2BertProcessor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )
        labels_batch = self.processor.pad(
            labels=label_features,
            padding=self.padding,
            return_tensors="pt",
        )
        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        batch["labels"] = labels
        return batch

data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
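To see what the label padding and masking amount to, here is a plain-Python sketch of the same idea on two toy label sequences. The helper pad_labels is hypothetical and only mimics the collator's behaviour for labels; the pad id 29 is the [PAD] id from the vocabulary shown earlier.

```python
PAD_ID = 29          # id of "[PAD]" in the vocabulary shown earlier
IGNORE_INDEX = -100  # label value that the loss computation skips


def pad_labels(label_lists, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Hypothetical helper: pad label sequences, then mask the padding."""
    max_len = max(len(labels) for labels in label_lists)
    padded, attention_mask = [], []
    for labels in label_lists:
        n_pad = max_len - len(labels)
        padded.append(labels + [pad_id] * n_pad)
        attention_mask.append([1] * len(labels) + [0] * n_pad)
    # Same idea as masked_fill(attention_mask.ne(1), -100) in the collator:
    # every position that is padding gets the ignore index.
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(row, mask)]
        for row, mask in zip(padded, attention_mask)
    ]


batch_labels = [[18, 25], [23, 14, 16, 24]]  # two toy label sequences
masked = pad_labels(batch_labels)
print(masked)  # [[18, 25, -100, -100], [23, 14, 16, 24]]
```

The shorter sequence is padded to the length of the longest one, and the padded positions carry -100 so that they do not contribute to the loss.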
Now we want to check how our model is performing, so we need to define an evaluation metric. We choose word error rate (WER). We will talk about evaluation metrics in the next tutorial.
from evaluate import load
wer_metric = load("wer")
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
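For intuition about the metric itself, the sketch below computes WER for a single sentence pair as the word-level edit distance divided by the number of reference words. It is an illustrative re-implementation, not the code behind the evaluate metric, and the hypothesis sentence is made up. It assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution (LIKES -> LIKE) and one deletion (OLD) over 5 words.
wer = word_error_rate("TOM LIKES THE OLD SWEATER", "TOM LIKE THE SWEATER")
print(wer)  # 0.4
```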
Load pretrained model¶
Now we will load the main pretrained model and provide the training arguments.
Here we will also define some hyperparameters.
Some important ones are:
- pad_token_id=processor.tokenizer.pad_token_id: You must define pad_token_id, as it works as the blank token in CTC alignment.
- ctc_loss_reduction="mean": It determines how the CTC loss is aggregated across the batch during training.
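The difference between the two common reduction modes can be shown with simple arithmetic. The sketch below follows the convention documented for PyTorch's CTC loss, on which I assume the model relies: "sum" adds up the per-utterance losses, while "mean" first divides each loss by its target length and then averages over the batch. The loss values and lengths are invented for illustration.

```python
# Hypothetical per-utterance CTC losses and label lengths (illustrative only).
per_sample_loss = [12.0, 30.0]
target_lengths = [4, 10]

# reduction="sum": add up the losses of all utterances in the batch.
loss_sum = sum(per_sample_loss)

# reduction="mean" (PyTorch convention): divide each loss by its target
# length, then average over the batch.
normalized = [loss / n for loss, n in zip(per_sample_loss, target_lengths)]
loss_mean = sum(normalized) / len(normalized)

print(loss_sum)   # 42.0
print(loss_mean)  # 3.0
```

With "mean", long and short utterances contribute on a comparable scale, which tends to make the loss less sensitive to batch composition.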
Task
Can you explain the other hyperparameters? Explain as many as possible from the training arguments below.
from transformers import Wav2Vec2BertForCTC
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    attention_dropout=0.0,
    hidden_dropout=0.0,
    feat_proj_dropout=0.0,
    mask_time_prob=0.0,
    layerdrop=0.0,
    ctc_loss_reduction="mean",
    add_adapter=True,
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
Some weights of Wav2Vec2BertForCTC were not initialized from the model checkpoint at facebook/w2v-bert-2.0 and are newly initialized: ['adapter.layers.0.ffn.intermediate_dense.bias', 'adapter.layers.0.ffn.intermediate_dense.weight', 'adapter.layers.0.ffn.output_dense.bias', 'adapter.layers.0.ffn.output_dense.weight', 'adapter.layers.0.ffn_layer_norm.bias', 'adapter.layers.0.ffn_layer_norm.weight', 'adapter.layers.0.residual_conv.bias', 'adapter.layers.0.residual_conv.weight', 'adapter.layers.0.residual_layer_norm.bias', 'adapter.layers.0.residual_layer_norm.weight', 'adapter.layers.0.self_attn.linear_k.bias', 'adapter.layers.0.self_attn.linear_k.weight', 'adapter.layers.0.self_attn.linear_out.bias', 'adapter.layers.0.self_attn.linear_out.weight', 'adapter.layers.0.self_attn.linear_q.bias', 'adapter.layers.0.self_attn.linear_q.weight', 'adapter.layers.0.self_attn.linear_v.bias', 'adapter.layers.0.self_attn.linear_v.weight', 'adapter.layers.0.self_attn_conv.bias', 'adapter.layers.0.self_attn_conv.weight', 'adapter.layers.0.self_attn_layer_norm.bias', 'adapter.layers.0.self_attn_layer_norm.weight', 'lm_head.bias', 'lm_head.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir=repo_name,
    group_by_length=True,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",  # deprecated name; recent Transformers versions use eval_strategy
    num_train_epochs=10,
    gradient_checkpointing=True,
    fp16=True,
    save_steps=600,
    eval_steps=300,
    logging_steps=300,
    learning_rate=5e-5,
    warmup_steps=500,
    save_total_limit=2,
    push_to_hub=True,
    #report_to="wandb" # Uncomment if using a Weights & Biases account
)
/usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1568: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead warnings.warn(
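It helps to translate these arguments into concrete numbers. The sketch below estimates the effective batch size and the number of optimizer updates from the values above. It assumes a single GPU, and the Trainer's own bookkeeping may differ slightly, so treat the figures as approximate.

```python
import math

num_train_examples = 2500        # rows in speechocean["train"]
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
num_train_epochs = 10
eval_steps = 300
num_gpus = 1                     # assumption: a single-GPU runtime

# Examples that contribute to one optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
# Mini-batches per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(
    num_train_examples / (per_device_train_batch_size * num_gpus)
)
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs

print(effective_batch_size)         # 4
print(updates_per_epoch)            # 625
print(total_updates)                # 6250
print(total_updates // eval_steps)  # 20
```

Under these assumptions the 500 warmup steps cover the first 8% of training, and the model is evaluated about 20 times.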
from transformers import Trainer
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=speechocean['train'],
    eval_dataset=speechocean['test'],
    tokenizer=processor.feature_extractor,  # deprecated name; recent Transformers versions use processing_class
)
<ipython-input-25-64f8541731cf>:3: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead. trainer = Trainer(
Introduction to Weights and Biases¶
Weights and Biases (W&B) is a tool that helps you track, visualize, and organize your machine learning experiments. It can:
- Track metrics like loss, accuracy, and learning rate for each training step or epoch.
- Visualize training progress by providing real-time graphs and dashboards to help you see how well your model is learning.
- Store your model configurations and hyperparameters.
I highly recommend that you get familiar with W&B if you want to train or fine-tune AI models in the future. You can log in to W&B from this notebook.
If you do not want to use W&B, remove report_to="wandb" from the training arguments.
Training can take multiple hours, depending on the GPU you are allocated, but it should give you a general idea of how fine-tuning works.
trainer.train()
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter. wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information. wandb: Currently logged in as: aditya3107. Use `wandb login --relogin` to force relogin
/content/wandb/run-20241118_081349-9am62noc
Upload the result of fine-tuning to the 🤗 Hub.
trainer.push_to_hub()
This tutorial is closely adapted from a well-known blog post: https://huggingface.co/blog/fine-tune-wav2vec2-english
Please check it out in case you want more details.
In the next tutorial, we will use the model that we fine-tuned here and evaluate its output. We will calculate the word error rate and character error rate for our predictions.
We will also inspect the output of the fine-tuned model to demonstrate the CTC algorithm.